Canonical App State — Pattern and Scale Review

Date: 2026-04-22
Reviewer: Independent audit (Claude / Opus 4.7)
Scope: Is AppState + AppStateStore an industry best-practice? Will it survive a real wafer inspection machine's data rates and functionality? Are the simulators producing data at anything near real-machine scale?
Verdict in one line: The pattern is right-shaped, but the implementation is sized for a demo. It will not survive a real machine as-is. The simulators are roughly one to five orders of magnitude below real-machine load, depending on the axis (see §4.1).

1. The short answer

| Question | Answer |
| --- | --- |
| Is "canonical central app state" a legitimate industry pattern? | Yes. It is the desktop cousin of Redux / MVU / Elm / Fluxor, and it maps well onto SCADA "tag database" and OPC UA "address space" patterns. |
| Is this specific implementation production-grade? | No. It is a textbook-correct tutorial implementation. It trades performance and scalability for clarity. |
| Will it survive a real-machine workload unchanged? | No. GC pressure, O(n) list rebuilds, whole-state fan-out, and single-lock contention will all hurt once telemetry/frame rates rise, collections grow, or a second view subscribes. |
| Can the simulators emit real-machine data? | Not even close. 2 fps null-payload frames and 5 Hz two-tag telemetry vs. real tools' 30–200 fps multi-megapixel frames and 50–500 tag telemetry at up to kHz rates. |
| Is the architecture evolvable to production scale? | Yes. The boundaries are clean enough that the store can be split / re-implemented behind `IAppStateStore` without touching Domain or Presentation. That's the whole point of ADR-001, and it pays off here. |

2. Is this an industry best practice?

2.1 Where the pattern comes from

The architecture in src/InspectionPrototype.Application/Services/AppStateStore.cs is recognisably one of the "single immutable store" families:

Redux / Flux (web UI): single store, pure reducer, subscribers re-render.
Elm / MVU: model, messages, update function; available on .NET mainly through F# libraries such as Elmish and Fabulous.
Fluxor / Redux.NET / TinyStore: same pattern for Blazor/WPF.
ReactiveUI + WhenAnyValue: push-based, but still one logical model.

In industrial desktop software the nearest siblings are:

OPC UA address spaces — a typed tree of nodes; clients subscribe with sampling interval and deadband. Nodes are the canonical state.
SCADA tag databases (Ignition, WinCC, Wonderware, FactoryTalk) — a flat namespace of typed tags, with scan groups, change-of-value events, and historian bridges.
PLC process image / symbol table — the PLC's own canonical state, mirrored by HMIs via OPC / Modbus.
MTConnect / SECS-GEM equipment models — again, a shared vocabulary that the controller publishes and everyone else consumes.

So: a single, authoritative, machine-shaped state model that the UI and services project from is not only legitimate, it's the dominant pattern in this industry. The repo's instinct is correct.

2.2 Where this implementation sits on that spectrum

| Property | Redux/Fluxor | OPC UA / SCADA | This repo |
| --- | --- | --- | --- |
| Single source of truth | ✅ | ✅ | ✅ |
| Immutable snapshots | ✅ | ❌ (mutable nodes) | ✅ |
| Per-node / per-tag subscription | ❌ (whole-store event) | ✅ (per-node sampling) | ❌ (whole-state event) |
| Typed schema for fields | ⚠️ (string keys) | ✅ | ✅ (records) |
| Event sourcing / time travel | ✅ (middleware) | ⚠️ (via historian) | ❌ |
| Per-subsystem isolation | ❌ | ✅ (namespace-scoped) | ❌ |
| Multiple simultaneous consumers | ✅ | ✅ | ⚠️ (single VM today) |
| Selector / memoization layer | ✅ (Reselect, createSelector) | ✅ (deadband/filter) | ❌ (ad-hoc `ReferenceEquals` in VM) |
| Backpressure between producer and store | ❌ | ✅ (sampling interval) | ⚠️ (channels stop before the store) |

So the implementation is closer to "single-subscriber Redux for WPF" than to an industrial tag system. That is fine for a training prototype and for slices 1–4. It starts failing when you want: multiple windows, rich per-panel selectors, per-node subscription rates, or multi-MB payloads.

2.3 What "best practice at scale" looks like in this vertical

Real wafer inspection desktop code tends to use one of these in combination with a central model:

Reactive streams per subsystem (ReactiveUI + IObservable<T> + DynamicData.SourceCache<T,TKey>). The central model becomes a read-only projection derived from streams. Per-cell .DistinctUntilChanged() suppresses redraws.
Actor model (Akka.NET or Orleans). Each subsystem is an actor with a private state; messages are the public API. Cross-process scaling comes with the framework instead of being bolted on later.
Event sourcing (FrameCaptured, DefectDetected, AlarmRaised, RunCompleted) with an in-memory projection for the UI and a historian-style log for audit and replay.
OPC UA server-in-process. The app is a UA server; its own UI is just another client. Expensive to set up, but it makes the machine reachable by any standard OPC UA-capable SCADA/MES client without custom integration code.
Hybrid: control-plane state + data-plane streams — small canonical state for slow, structured things (connection, recipe, workflow); dedicated channels/buffers for high-rate data (frames, telemetry, defect records).

The last one is the right fit for evolving this repo. The current AppState is already trying to be both a control-plane and a data-plane store; that is what will fail first.

3. Concrete scaling concerns in the current implementation

All line references are at the time of review; search by symbol if they drift.

3.1 GC pressure from whole-record mutation

Every _store.Update(...) returns a new AppState record. AppState has 20 fields, six of which are collections (RecipeCatalog, RunHistory, ActiveAlarms, RecentDiagnostics, SimulatorProfileCatalog, plus LoadedRecipe.ScanPoints). The with expression allocates a new top-level record regardless of which field changed.

At 40 call sites, the update graph already touches AppState ~7 times per second at demo rates (~5 Hz telemetry + ~2 Hz frames + sparse workflow events), and closer to ~27 times per second while a move is streaming its 20 Hz position updates (§3.5).
At a modest real-machine rate (100 Hz telemetry + 30 Hz frames + per-tick position updates + per-frame defect increments + per-scan-point workflow events) you are easily at 500–1,500 allocations/sec of ~200-byte records plus their reducer closures.
The allocations themselves are cheap Gen-0 garbage. The cost is in what each one triggers: StateChanged fires, which calls Dispatcher.Invoke on the UI thread, which allocates, which touches every [ObservableProperty] setter, which raises PropertyChanged events, which allocate PropertyChangedEventArgs. Because Dispatcher.Invoke is synchronous, the producing thread also waits until the UI thread has finished. The work compounds with every update.

This is not a theoretical concern; it is one of the most common reasons Redux-style desktop apps become laggy at scale.

Mitigations that fit the existing architecture:

Keep the record, but split AppState into sub-records and only allocate the sub-record that changed (you already do this partially for PipelineCounters and OperationalCounters). Extract ConnectionSlice, MotionSlice, RunSlice, TelemetrySlice, DiagnosticsSlice, AlarmSlice, RecipeSlice.
Introduce a batched update API (UpdateBatch(Action<Builder>)) that applies multiple mutations and fires StateChanged once.
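Use ImmutableList<T> for the frequently-appended collection fields (it shares structure across versions, so an append is O(log n); ImmutableArray<T> copies the whole array on every add and only suits rarely-changing collections) or, better, a dedicated high-rate buffer outside the record.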
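
A minimal sketch of the slicing idea, under stated assumptions: the slice names come from the list above, but their fields (IsConnected, X, Y, TemperatureC) and the toy AppState are hypothetical placeholders, not the repo's schema. It shows that a nested with replaces only the touched slice while the other slices are shared by reference.

```csharp
using System;

// Hypothetical slice shapes; the real AppState fields will differ.
public sealed record ConnectionSlice(bool IsConnected);
public sealed record MotionSlice(double X, double Y);
public sealed record TelemetrySlice(double TemperatureC);

// Toy top-level state: a composition of slices.
public sealed record AppState(
    ConnectionSlice Connection,
    MotionSlice Motion,
    TelemetrySlice Telemetry);

public static class SliceDemo
{
    // Replaces only the Motion slice; the other slices are reused by reference.
    public static AppState WithPosition(AppState s, double x, double y) =>
        s with { Motion = s.Motion with { X = x, Y = y } };

    public static void Main()
    {
        var before = new AppState(new(true), new(0, 0), new(21.5));
        var after = WithPosition(before, 10, 20);

        Console.WriteLine(ReferenceEquals(before.Telemetry, after.Telemetry)); // True
        Console.WriteLine(ReferenceEquals(before.Motion, after.Motion));       // False
    }
}
```

The top-level record is still reallocated on each update, but it now holds only a handful of references, and a consumer can skip an untouched slice with a single reference check.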

3.2 O(n) collection rebuilds

Every append to RunHistory, ActiveAlarms, and RecentDiagnostics allocates a new List<T>. In AppStateExtensions.WithDiagnosticsEntry this is already mitigated by the 200-entry cap, but RunHistory is unbounded and ActiveAlarms grows with every fault-then-clear cycle.

RunHistory can easily be 1,000s of rows per day in a real fab.
ActiveAlarms can spike to hundreds during a cascading fault storm.
Each _store.Update that touches these creates a fresh List<T> of the full current length.

In MainViewModel.Project the corresponding ObservableCollection.Clear(); foreach Add(...) patterns (lines 337–339 for alarms, 367–369 for history, 379–381 for diagnostics) make this worse: every full rebuild causes WPF to tear down and rebuild item containers, losing selection and scroll position. The ReferenceEquals shortcut guards the history, catalog, and diagnostics projections but not Alarms.

Best practice at scale: DynamicData.SourceCache<T, TKey> or ObservableChangeSet for each append-only/update collection; BindingList<T> or VirtualizingCollection for the UI; never Clear + AddAll.
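
For an append-only collection such as RunHistory, the Clear + AddAll pattern can be avoided even without DynamicData. The following is a sketch only, using the BCL's ObservableCollection<T>; the helper name SyncAppendOnly is hypothetical and is not part of the repo or of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

public static class CollectionSync
{
    // Appends only the new tail when `source` extends `target`;
    // falls back to a full reset only if the existing prefix diverged.
    public static void SyncAppendOnly<T>(ObservableCollection<T> target, IReadOnlyList<T> source)
    {
        var comparer = EqualityComparer<T>.Default;
        bool prefixMatches = source.Count >= target.Count;
        for (int i = 0; prefixMatches && i < target.Count; i++)
            prefixMatches = comparer.Equals(target[i], source[i]);

        if (!prefixMatches)
            target.Clear();

        for (int i = target.Count; i < source.Count; i++)
            target.Add(source[i]);
    }

    public static void Main()
    {
        var ui = new ObservableCollection<string> { "run-1", "run-2" };
        int events = 0;
        ui.CollectionChanged += (_, _) => events++;

        SyncAppendOnly(ui, new[] { "run-1", "run-2", "run-3" });

        Console.WriteLine(ui.Count); // 3
        Console.WriteLine(events);   // 1
    }
}
```

One appended row raises one change notification, so existing item containers, selection, and scroll position are left alone. A capped buffer that drops its oldest entry needs a different diff (remove head, then append).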

3.3 Whole-state fan-out with no selectors

AppStateStore.StateChanged?.Invoke(next) hands every subscriber the whole state, every time. Today there is one subscriber (MainViewModel), so this is cheap. As soon as you add a second window, a dedicated defects view, a live chart, or an alarm banner component, each one receives every frame, every telemetry tick, and every diagnostic entry, and has to re-check whether its slice actually changed.

Best practice at scale:

Publish typed event streams alongside StateChanged (IObservable<FrameEvent>, IObservable<AlarmEvent>, etc.), so a chart subscribes only to telemetry.
Or provide a selector API: IObservable<T> Select<T>(Func<AppState, T>) with .DistinctUntilChanged() built in. Libraries like DynamicData or ReactiveUI do this natively.
Or split into sub-stores (see §3.1) and raise slice-specific events.
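
The selector option can be sketched without taking a dependency on Rx. This is a simplification under stated assumptions: the toy Store and AppState below are hypothetical stand-ins, and the callback-based Select differs from the IObservable<T> signature suggested above. It is not thread-safe and omits the lock discussed in §3.4.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public sealed record AppState(int FrameCount, string WorkflowState);

public sealed class Store
{
    private AppState _current;
    public event Action<AppState>? StateChanged;

    public Store(AppState initial) => _current = initial;

    public void Update(Func<AppState, AppState> reducer)
    {
        _current = reducer(_current);
        StateChanged?.Invoke(_current);
    }

    // Invokes onChange only when the selected value differs from the last one seen.
    public IDisposable Select<T>(Func<AppState, T> selector, Action<T> onChange)
    {
        var comparer = EqualityComparer<T>.Default;
        T last = selector(_current);
        Action<AppState> handler = s =>
        {
            T next = selector(s);
            if (comparer.Equals(last, next)) return; // distinct-until-changed
            last = next;
            onChange(next);
        };
        StateChanged += handler;
        return new Unsubscriber(() => StateChanged -= handler);
    }

    private sealed class Unsubscriber : IDisposable
    {
        private readonly Action _dispose;
        public Unsubscriber(Action dispose) => _dispose = dispose;
        public void Dispose() => _dispose();
    }
}

public static class SelectorDemo
{
    public static void Main()
    {
        var store = new Store(new AppState(0, "Idle"));
        int workflowNotifications = 0;
        using var sub = store.Select(s => s.WorkflowState, _ => workflowNotifications++);

        for (int i = 0; i < 100; i++)
            store.Update(s => s with { FrameCount = s.FrameCount + 1 });
        store.Update(s => s with { WorkflowState = "Running" });

        Console.WriteLine(workflowNotifications); // 1
    }
}
```

A hundred frame-count updates reach the subscriber zero times; only the workflow transition does.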

3.4 Single lock across all writes

AppStateStore holds one Lock for every update. At demo rates, contention is invisible. At real-machine rates the following producers all contend on that one lock:

Workflow commands (low rate).
Motion position updates — currently 20 Hz, would be 100–1,000 Hz with a real encoder.
Telemetry — currently 5 Hz, real 10–500 Hz × dozens of tags.
Frame pipeline — currently 2 Hz, real 30–200 fps.
Fault injector and signal changes — rare but safety-critical.
UI read of Current — unbounded.

The lock serializes all of these. The store also holds the lock while the reducer runs. Reducers are user code, and at some call sites they allocate lists and iterate the alarm collection. A slow reducer stalls every other producer.

Best practice at scale: per-slice locks; a lock-free compare-and-swap loop using Interlocked.CompareExchange on the AppState reference, with retry (acceptable as long as reducers stay cheap and pure, because a lost race simply reruns the reducer); or Channels as the writer input with a single dispatcher thread draining them serially.
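
A lock-free update loop of the kind mentioned above might look like the sketch below. It uses Interlocked.CompareExchange as the compare-and-swap primitive; the CasStore class and the single-field AppState are hypothetical, and the sketch deliberately leaves out StateChanged, whose ordering would need separate care.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed record AppState(long Counter);

public sealed class CasStore
{
    private AppState _current;
    public CasStore(AppState initial) => _current = initial;

    public AppState Current => Volatile.Read(ref _current);

    // Optimistic update: run the pure reducer, then publish only if nobody
    // else swapped the state in the meantime; otherwise retry on the new value.
    public AppState Update(Func<AppState, AppState> reducer)
    {
        while (true)
        {
            AppState seen = Volatile.Read(ref _current);
            AppState next = reducer(seen);
            // Records compare by value with ==, so the swap check must be by reference.
            if (ReferenceEquals(Interlocked.CompareExchange(ref _current, next, seen), seen))
                return next;
        }
    }
}

public static class CasDemo
{
    public static void Main()
    {
        var store = new CasStore(new AppState(0));
        Parallel.For(0, 10_000, _ => store.Update(s => s with { Counter = s.Counter + 1 }));
        Console.WriteLine(store.Current.Counter); // 10000
    }
}
```

Because a lost race reruns the reducer, this only works when reducers are pure and side-effect free; a reducer that logs or raises events would do so more than once under contention.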

3.5 Store mutations inside event handlers

WorkflowService.OnPositionChanged(x, y) fires whenever SimulatedMotionController raises PositionChanged, which happens every 50 ms during a move. Each call allocates a closure, takes the store lock, creates a new AppState, fires StateChanged, marshals to the UI thread, and re-projects. At 50 ms ticks this is fine; at a real encoder's 1 kHz it is untenable.

Best practice: coalesce high-rate position updates (Rx's Observable.Sample(TimeSpan.FromMilliseconds(33)) from System.Reactive.Linq, or a dedicated bounded channel with BoundedChannelFullMode.DropOldest) so the store only sees UI-relevant updates.
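
The channel variant can be sketched with System.Threading.Channels alone. The Position type and the burst size are illustrative assumptions; a capacity of 1 with DropOldest means the channel always holds just the newest sample.

```csharp
using System;
using System.Threading.Channels;

public readonly record struct Position(double X, double Y);

public static class CoalesceDemo
{
    public static Channel<Position> CreateLatestOnlyChannel() =>
        Channel.CreateBounded<Position>(new BoundedChannelOptions(capacity: 1)
        {
            FullMode = BoundedChannelFullMode.DropOldest, // keep newest, discard stale
            SingleReader = true
        });

    public static void Main()
    {
        var channel = CreateLatestOnlyChannel();

        // Simulate a burst of 1,000 encoder ticks arriving before the UI drains.
        for (int i = 0; i < 1000; i++)
            channel.Writer.TryWrite(new Position(i, 0));

        // The consumer (for example a ~30 Hz timer) sees only the most recent sample.
        if (channel.Reader.TryRead(out var latest))
            Console.WriteLine(latest.X); // 999
    }
}
```

The producer never blocks and never allocates a closure or a new AppState per tick; the store is updated only when the consumer reads.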

3.6 `RecentDiagnostics` is a hard cap, not a rotating buffer

AppStateExtensions.WithDiagnosticsEntry removes updated[0] when count exceeds 200. This:

Allocates a new List<DiagnosticsEntry> of size ≈200 on every entry.
Is O(n) because RemoveAt(0) shifts the array.
Silently drops older entries with no spill to disk.

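At real fault-storm rates (hundreds of entries per second) you lose your forensic trail in under 2 seconds, and the app is allocating hundreds of KB per second just for this list: a 200-slot reference array is about 1.6 KB on 64-bit, and it is reallocated for every entry (at least 160 KB/s at 100 entries/s).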

Best practice: a ring buffer (a fixed-size circular array; the BCL has no Deque<T>) outside AppState, exposed via IObservable<DiagnosticsEntry>, with a durable file/database sink.
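
A fixed-capacity ring buffer is small enough to sketch in full. This version is illustrative only: the RingBuffer<T> name is hypothetical, it is not thread-safe, and the IObservable feed and durable sink are left out.

```csharp
using System;
using System.Collections.Generic;

// Fixed-capacity ring buffer: O(1) append, no per-entry list allocation.
public sealed class RingBuffer<T>
{
    private readonly T[] _items;
    private int _start;   // index of the oldest entry
    private int _count;

    public RingBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
    }

    public int Count => _count;

    public void Add(T item)
    {
        int end = (_start + _count) % _items.Length;
        _items[end] = item;
        if (_count == _items.Length)
            _start = (_start + 1) % _items.Length; // the oldest entry was overwritten
        else
            _count++;
    }

    // Enumerates oldest to newest.
    public IEnumerable<T> Snapshot()
    {
        for (int i = 0; i < _count; i++)
            yield return _items[(_start + i) % _items.Length];
    }
}

public static class RingDemo
{
    public static void Main()
    {
        var recent = new RingBuffer<int>(3);
        for (int i = 1; i <= 5; i++) recent.Add(i);
        Console.WriteLine(string.Join(",", recent.Snapshot())); // 3,4,5
    }
}
```

The backing array is allocated once, so a fault storm costs one array write per entry instead of a fresh 200-element list.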

3.7 Mixing control-plane and data-plane in one record

AppState currently holds:

control-plane fields (connection/workflow/motion/camera states, loaded recipe, command guards input) — correct place;
data-plane fields (LatestTelemetry, LatestFrame, PipelineCounters, OperationalCounters, RecentDiagnostics, RunHistory) — wrong place at scale.

Data-plane data changes fast and is usually consumed by specialised UI (charts, grids with virtualization, image viewers). Forcing it through the same reducer/event path as slow control-plane state is exactly what makes Redux apps feel sluggish. Real industrial apps keep these physically separate: a tag store for slow configuration/state, a time-series buffer for telemetry, a frame ring buffer for images, a defect database for results, a rolling log for diagnostics.

The prototype already has the right primitives (bounded Channel<T> for frames and telemetry); it just routes them into the central record once they leave the channel. The next refactor is to have the channels feed dedicated read models that the UI binds to directly, and keep AppState for control-plane only (plus a few "latest-value" snapshots used by guards).

3.8 No schema versioning, no persisted snapshot

If you ever want to reopen the app into its previous state, snapshot and restore, export state to an operator or service engineer, or replay an incident, you need a versioned serialisation of AppState. Today there is none.
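
One common shape for this is a version-stamped envelope around the serialised state. The sketch below uses System.Text.Json; the AppStateSnapshot fields, the SnapshotEnvelope type, and the version number are assumptions for illustration, not the repo's schema.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical persisted subset of the state.
public sealed record AppStateSnapshot(string WorkflowState, string? LoadedRecipeName);

// The envelope carries the schema version so a reader can migrate or refuse.
public sealed record SnapshotEnvelope(int SchemaVersion, DateTimeOffset SavedAt, AppStateSnapshot State);

public static class SnapshotCodec
{
    public const int CurrentSchemaVersion = 1;

    public static string Save(AppStateSnapshot state) =>
        JsonSerializer.Serialize(
            new SnapshotEnvelope(CurrentSchemaVersion, DateTimeOffset.UtcNow, state));

    public static AppStateSnapshot Load(string json)
    {
        var envelope = JsonSerializer.Deserialize<SnapshotEnvelope>(json)
            ?? throw new InvalidOperationException("Empty snapshot.");
        if (envelope.SchemaVersion != CurrentSchemaVersion)
            throw new NotSupportedException(
                $"No migration from schema v{envelope.SchemaVersion}.");
        return envelope.State;
    }
}

public static class SnapshotDemo
{
    public static void Main()
    {
        var original = new AppStateSnapshot("Idle", "recipe-a");
        string json = SnapshotCodec.Save(original);
        Console.WriteLine(SnapshotCodec.Load(json) == original); // True
    }
}
```

A real implementation would replace the NotSupportedException with a chain of migrations from older versions.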

3.9 Single-window assumption

MainViewModel captures Dispatcher.CurrentDispatcher at construction. A second window on a second thread, or an operator/engineer split across two monitors with independent dispatchers, would not receive state correctly.

3.10 No selector cache

MainViewModel.Project re-derives MachineReadyLabel, IsMachineReady, DefectBreakdownText, etc. on every update. For a demo this is fine. For a high-rate system a memoized selector (Reselect-style) would eliminate redundant string allocations and binding updates.
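
A Reselect-style memoized selector can be sketched in a few lines, assuming the inputs are immutable records so that an unchanged reference implies an unchanged value. MemoizedSelector, RunSlice, and DefectCount are hypothetical names used only for this illustration.

```csharp
#nullable enable
using System;

// Caches the last result and recomputes only when the input reference changes.
public sealed class MemoizedSelector<TInput, TResult> where TInput : class
{
    private readonly Func<TInput, TResult> _compute;
    private TInput? _lastInput;
    private TResult? _lastResult;

    public MemoizedSelector(Func<TInput, TResult> compute) => _compute = compute;

    public TResult Select(TInput input)
    {
        if (!ReferenceEquals(input, _lastInput))
        {
            _lastResult = _compute(input);
            _lastInput = input;
        }
        return _lastResult!;
    }
}

public sealed record RunSlice(int DefectCount);

public static class MemoDemo
{
    public static void Main()
    {
        int computations = 0;
        var label = new MemoizedSelector<RunSlice, string>(r =>
        {
            computations++;
            return $"{r.DefectCount} defects";
        });

        var slice = new RunSlice(3);
        label.Select(slice);                           // computes
        label.Select(slice);                           // cache hit
        label.Select(slice with { DefectCount = 4 });  // new reference, recomputes

        Console.WriteLine(computations); // 2
    }
}
```

The derived string is built once per distinct input, so unrelated updates stop producing redundant allocations and binding updates for that label.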

4. Can the simulators produce real-machine data?

Real wafer-inspection tools vary wildly — optical edge inspection, macro, bright-field, dark-field, E-beam, review SEM — but here are conservative, low-end numbers for a single modern optical inspection station:

| Axis | Low-end real tool | Today's simulator |
| --- | --- | --- |
| Camera frame rate | 30–60 fps (often higher) | 2 fps |
| Camera count | 1–4 simultaneous | 1 |
| Frame payload | 2–12 MP × 8–12 bit × 1–3 channels ≈ 2–54 MB/frame packed (up to 72 MB when 12-bit samples are stored in 16-bit words) | `null` (`Frame.PreviewPayload = null`) |
| Frame byte rate | 50–1,000+ MB/s aggregate | ~0 B/s |
| Telemetry tags | 50–500 (temps, pressures, motion encoders, lamp hours, vacuum, RF, chuck, gas flows...) | 2 (temperature, pressure) |
| Telemetry rate | 1–100 Hz per tag, some at kHz for encoder feedback | 5 Hz combined |
| Motion encoder feedback | 500 Hz – 4 kHz | 20 Hz (50 ms tick) |
| Scan points per wafer | 10–50,000 | recipe-limited, sample recipes max 5 |
| Defects per wafer | 10s–100,000s | stochastic, ~1/s at 60% profile × 2 fps |
| Wafer throughput | 60–200 wafers/hour | run-by-run operator action |
| Alarm rate during fault storm | 10–100 /s burst | single manual injection |
| Recipe catalog size | 100–1,000s | 2 sample files |
| Run-history rows | 10,000s/month | unbounded JSON array |
| Per-frame defect record size | bbox + class + confidence + image ref ≈ 100–500 B | string summary |

4.1 Approximate under-scale factor

Frame byte rate: ≥ 10,000–100,000× below a real tool (because payload is null).
Telemetry bandwidth: ~100–500× below a modest SCADA tag scan.
Motion feedback: 25–200× below a real servo.
Alarms under fault storm: 10–100× below.
Defect-record volume and richness: untested entirely — schema today is a single string.
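
The arithmetic behind these factors is simple enough to check in code. The sketch below reproduces the low-end frame size and byte rate and the motion-feedback factors from the table's figures; the helper names are hypothetical, and 1 MP is taken as 1,000,000 pixels.

```csharp
using System;

public static class ScaleMath
{
    // Bytes per uncompressed frame; bytesPerSample is 1 for 8-bit data.
    public static long FrameBytes(long megapixels, int bytesPerSample, int channels) =>
        megapixels * 1_000_000L * bytesPerSample * channels;

    // How many times lower the simulated rate is than the real one.
    public static double UnderScale(double realRate, double simulatedRate) =>
        realRate / simulatedRate;

    public static void Main()
    {
        long lowEndFrame = FrameBytes(2, 1, 1);            // 2 MP, 8-bit, mono
        Console.WriteLine(lowEndFrame);                     // 2000000
        Console.WriteLine(lowEndFrame * 30 / 1_000_000);    // 60  (MB/s at 30 fps)

        Console.WriteLine(UnderScale(500, 20));             // 25   (500 Hz vs 20 Hz)
        Console.WriteLine(UnderScale(4000, 20));            // 200  (4 kHz vs 20 Hz)
    }
}
```

Even the smallest configuration in the table moves about 60 MB/s through the frame path, against a simulator that moves no pixel data at all.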

4.2 What the simulator is good at

The simulator is deliberately coarse, and that is the right call for a training / docs prototype. Where it shines:

Teaching the shape of the system: connection → home → run → stop/abort/fault/recover.
Exercising channel backpressure policies (drop-oldest, coalesce-latest) in a test-verifiable way.
Exercising the UI's threading discipline.
Exercising workflow state transitions end-to-end in tests.

These are the exact things a first-slice simulator should teach, and this one does.

4.3 What it cannot yet exercise

Because the data volumes are small and the payloads are empty, the simulator does not yet put any pressure on the parts of the system that real machines stress hardest:

Memory lifecycle of preview frames. No real byte[] or WriteableBitmap ever flows. GC pauses, LOH allocations, WriteableBitmap.Lock/AddDirtyRect/Unlock discipline, pixel-format conversions — all untested.
Image codec and disk bandwidth. Real tools write PNG/TIFF/BIN frames; no disk path is exercised.
Multi-channel telemetry. There is no abstraction for more than the two hard-coded fields in MachineTelemetry. Adding a 50-tag bag requires schema changes throughout AppState.
Encoder-rate motion. Task.Delay(50ms) is not a 1 kHz encoder. Any UI that tries to live-plot position would be fine at 20 Hz but not at 1 kHz.
Burst / storm profiles. Defect showers (a bad wafer producing 5,000 defects in 10 seconds), alarm cascades (one trip triggering 30 interlock alarms), telemetry glitches (a sensor dropping out for 2 s then returning).
Network jitter. Real SDK calls are P/Invoke / TCP / OPC UA with latency, reordering, timeouts, and disconnects. The simulator always succeeds in 1.5 s ± 0.
Multi-wafer cadence. There is no notion of a load / align / run / unload cycle repeating at 30 s intervals for hours.
Long-soak behavior. Nothing runs for 8 hours. GC generations, file growth, event handler leaks — not probed.

4.4 What a scale-exercising simulator would add

Sketched as future slices, in increasing difficulty:

1. Frame payload generator: synthesise a real byte[] per frame (e.g. WriteableBitmap with Perlin noise or a checkerboard). Expose size, stride, pixel format. Add LOH stress test.
2. Multi-tag telemetry bag: replace MachineTelemetry with a keyed dictionary of TagSample(string Name, DateTimeOffset Ts, double Value, Quality Quality). Configure 50 tags in appsettings.json with per-tag intervals (5 Hz to 500 Hz) and noise models.
3. High-rate encoder: add a 1 kHz position feedback stream with its own bounded channel, separate from the UI-rate position events.
4. Storm profiles: extend SimulatorProfile with DefectShowerRate, AlarmBurstEvery, TelemetryDropoutChance, NetworkLatencyMean/Stddev, plus a new ChaosMonkey profile.
5. Rich defect model: Defect(Guid Id, Guid FrameId, BoundingBox Box, string ClassLabel, double Confidence, string? ImageBlobRef). Emit at realistic per-wafer volumes.
6. Wafer loop: a scenario scheduler that runs N wafers back-to-back with configurable inter-wafer delay, simulating a production cassette.
7. Time-compression soak mode: a flag that runs the whole wafer loop at 100× speed so a day of operation fits in 15 minutes; useful to catch leaks, runaway allocations, file growth.
8. SDK-flakiness injector: wrap each simulator method so it can return timeouts, C++-style exceptions, out-of-band callbacks, and cancellation that doesn't actually cancel.

Items 1–2 alone would put meaningful pressure on the current AppState design and surface the problems in §3 before a real SDK shows up.
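
As a rough illustration of the first item, a checkerboard payload generator needs only a few lines. This is a sketch under assumptions: 8-bit grayscale with stride equal to width, and a hypothetical FrameSynth helper that is not part of the repo. The 85,000-byte figure is the standard .NET Large Object Heap threshold.

```csharp
using System;

public static class FrameSynth
{
    // Allocations of 85,000 bytes or more land on the .NET Large Object Heap.
    public const int LohThresholdBytes = 85_000;

    // Builds an 8-bit grayscale checkerboard with `cell`-pixel squares.
    // Stride equals width because there is one byte per pixel and no padding.
    public static byte[] Checkerboard(int width, int height, int cell)
    {
        var pixels = new byte[width * height];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                pixels[y * width + x] =
                    ((x / cell + y / cell) % 2 == 0) ? (byte)255 : (byte)0;
        return pixels;
    }

    public static void Main()
    {
        byte[] frame = Checkerboard(1920, 1080, 64);
        Console.WriteLine(frame.Length);                      // 2073600
        Console.WriteLine(frame.Length >= LohThresholdBytes); // True
    }
}
```

A single 1920 × 1080 frame is already an LOH allocation, so allocating one per frame at 30 fps would exercise exactly the memory-lifecycle behaviour listed in §4.3; a pooled buffer would be the natural next step.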

5. So… will the canonical app state survive real machines?

Not without changes. It will survive real machines in principle (the shape is right) but not in this implementation (the sizing is not). Specifically:

The control-plane core (connection, workflow, motion, camera, safety, loaded recipe, command guards) is already production-shaped. Keep it.
The data-plane currently embedded in AppState (LatestFrame, LatestTelemetry, RecentDiagnostics, RunHistory, ActiveAlarms, pipeline counters) needs to be lifted out into dedicated stores/streams before a real machine's data rates go through it. That is the single biggest refactor standing between this prototype and production.
The fan-out mechanism (StateChanged) needs to grow selector/observable semantics before a second panel or window subscribes.
The mutation mechanism (whole-record with, single lock) needs either slicing or a lock-free / per-slice strategy before high-rate producers are wired in.
The simulator needs a step-change in payload size and rate before any of the above concerns can be measured, not just theorised.

The good news, and it's a real piece of good news: because IAppStateStore is an interface, because the Application layer never mutates state outside of it, and because Command Guards are pure functions, every one of these refactors can be done without touching Domain, without touching Presentation, and (mostly) without touching test fakes. That is ADR-001 paying real dividends — and it is a better starting point than most shipped industrial desktop code ever gets.

6. Recommended evolution path (ordered)

Each step preserves the IAppStateStore contract and can land as its own slice. No big-bang rewrite.

1. Diff-safe Alarms projection in MainViewModel (match the ReferenceEquals pattern used for history/catalog/diagnostics). Two-line fix, big perceptual win.
2. Slice AppState into typed sub-records (ConnectionSlice, MotionSlice, RunSlice, AlarmSlice, DiagnosticsSlice). Reducers touch one slice each. AppState becomes a composition of slices, updated with a nested with expression on the affected sub-record. Guard functions get correspondingly narrow input types.
3. Switch the frequently-appended collection fields (RunHistory, ActiveAlarms, RecentDiagnostics) to ImmutableList<T>, whose Add and RemoveAt(0) are O(log n) with structural sharing. Reserve ImmutableArray<T> for rarely-changing catalogs, since it copies the whole array on every add. This eliminates the full-list copy and the RemoveAt(0) shift on each append.
4. Per-slice IObservable<T> published alongside StateChanged. ViewModels / panels subscribe only to what they need. Keep StateChanged for backwards compatibility or remove it once callers migrate.
5. Move high-rate data out of AppState: introduce ITelemetryBuffer, IFrameBuffer, IDiagnosticsJournal. Each owns its own ring buffer and IObservable feed. AppState keeps only "latest value" snapshots used for guards.
6. Batched/coalesced updates for position and telemetry so the store sees UI-rate, not sensor-rate.
7. Lock-free or per-slice locks once slice types are stable.
8. Event log / snapshot serialisation for schema-versioned persistence and incident replay.
9. Selector cache / memoization for derived labels.
10. Multi-window / multi-dispatcher support in the StateChanged bridge.

And in the simulator, in parallel:

Real byte[] frame payloads at configurable fps.
Multi-tag telemetry dictionary.
High-rate encoder channel separated from UI-rate position.
Storm / chaos / soak profiles.
Rich defect model with realistic per-wafer volumes.
Wafer loop / time-compression mode.

7. Bottom line

The canonical-app-state pattern here is the right tool for the job and places the project firmly inside the mainstream of both web-style single-store architectures and industrial control-plane modelling. It is correctly bounded by ADR-001 and cleanly testable.

But a canonical app state for one operator's current run is not the same thing as a canonical app state for a real inspection tool's live data-plane, and the current implementation conflates the two. Until that split happens, and until the simulator is fat enough to exercise it, calling this "production-ready" would be premature.

Treat the pattern as a load-bearing foundation. Treat the current data-plane embedding as scaffolding. Scale the simulators first so the right pressure exists; then let the design evolve under that pressure.

— End of review
