Canonical App State — Pattern and Scale Review
- Date: 2026-04-22
- Reviewer: Independent audit (Claude / Opus 4.7)
- Scope: Is AppState + AppStateStore an industry best practice? Will it survive a real wafer inspection machine's data rates and functionality? Are the simulators producing data at anything near real-machine scale?
- Verdict in one line: The pattern is right-shaped, but the implementation is sized for a demo. It will not survive a real machine as-is. The simulators are 2–4 orders of magnitude below real-machine load.
1. The short answer
| Question | Answer |
|---|---|
| Is "canonical central app state" a legitimate industry pattern? | Yes. It is the desktop cousin of Redux / MVU / Elm / Fluxor, and it maps well onto SCADA "tag database" and OPC UA "address space" patterns. |
| Is this specific implementation production-grade? | No. It is a textbook-correct tutorial implementation. It trades performance and scalability for clarity. |
| Will it survive a real-machine workload unchanged? | No. GC pressure, O(n) list rebuilds, whole-state fan-out, and single-lock contention will all hurt once telemetry/frame rates rise, collections grow, or a second view subscribes. |
| Can the simulators emit real-machine data? | Not even close. 2 fps null-payload frames and 5 Hz two-tag telemetry vs. real tools' 30–200 fps multi-megapixel frames and 50–500 tag telemetry at up to kHz rates. |
| Is the architecture evolvable to production scale? | Yes. The boundaries are clean enough that the store can be split / re-implemented behind IAppStateStore without touching Domain or Presentation. That's the whole point of ADR-001, and it pays off here. |
2. Is this an industry best practice?
2.1 Where the pattern comes from
The architecture in src/InspectionPrototype.Application/Services/AppStateStore.cs is recognisably one of the "single immutable store" families:
- Redux / Flux (web UI): single store, pure reducer, subscribers re-render.
- Elm / MVU: model, messages, update function; ported to .NET as Elmish / Fabulous (both F# libraries).
- Fluxor / Redux.NET / TinyStore: same pattern for Blazor/WPF.
- ReactiveUI + WhenAnyValue: push-based, but still one logical model.
In industrial desktop software the nearest siblings are:
- OPC UA address spaces — a typed tree of nodes; clients subscribe with sampling interval and deadband. Nodes are the canonical state.
- SCADA tag databases (Ignition, WinCC, Wonderware, FactoryTalk) — a flat namespace of typed tags, with scan groups, change-of-value events, and historian bridges.
- PLC process image / symbol table — the PLC's own canonical state, mirrored by HMIs via OPC / Modbus.
- MTConnect / SECS-GEM equipment models — again, a shared vocabulary that the controller publishes and everyone else consumes.
So: a single, authoritative, machine-shaped state model that the UI and services project from is not only legitimate, it's the dominant pattern in this industry. The repo's instinct is correct.
2.2 Where this implementation sits on that spectrum
| Property | Redux/Fluxor | OPC UA / SCADA | This repo |
|---|---|---|---|
| Single source of truth | ✅ | ✅ | ✅ |
| Immutable snapshots | ✅ | ❌ (mutable nodes) | ✅ |
| Per-node / per-tag subscription | ❌ (whole-store event) | ✅ (per-node sampling) | ❌ (whole-state event) |
| Typed schema for fields | ⚠️ (string keys) | ✅ | ✅ (records) |
| Event sourcing / time travel | ✅ (middleware) | ⚠️ (via historian) | ❌ |
| Per-subsystem isolation | ❌ | ✅ (namespace-scoped) | ❌ |
| Multiple simultaneous consumers | ✅ | ✅ | ⚠️ (single VM today) |
| Selector / memoization layer | ✅ (Reselect, createSelector) | ✅ (deadband/filter) | ❌ (ad-hoc ReferenceEquals in VM) |
| Backpressure between producer and store | ❌ | ✅ (sampling interval) | ⚠️ (channels stop before the store) |
So the implementation is closer to "single-subscriber Redux for WPF" than to an industrial tag system. That is fine for a training prototype and for slices 1–4. It starts failing when you want: multiple windows, rich per-panel selectors, per-node subscription rates, or multi-MB payloads.
2.3 What "best practice at scale" looks like in this vertical
Real wafer inspection desktop code tends to use one of these in combination with a central model:
- Reactive streams per subsystem (ReactiveUI + IObservable<T> + DynamicData.SourceCache<T,TKey>). The central model becomes a read-only projection derived from streams. Per-cell .DistinctUntilChanged() suppresses redraws.
- Actor model (Akka.NET or Orleans). Each subsystem is an actor with a private state; messages are the public API. Cross-process scaling is free.
- Event sourcing (FrameCaptured, DefectDetected, AlarmRaised, RunCompleted) with an in-memory projection for the UI and a historian-style log for audit and replay.
- OPC UA server-in-process. The app is a UA server; its own UI is just another client. Expensive to set up, but gives you every SCADA/MES tool for free.
- Hybrid: control-plane state + data-plane streams — small canonical state for slow, structured things (connection, recipe, workflow); dedicated channels/buffers for high-rate data (frames, telemetry, defect records).
The last one is the right fit for evolving this repo. The current AppState is already trying to be both a control-plane and a data-plane store; that is what will fail first.
3. Concrete scaling concerns in the current implementation
All line references are at the time of review; search by symbol if they drift.
3.1 GC pressure from whole-record mutation
Every _store.Update(...) returns a new AppState record. AppState has 20 fields, six of which are collections (RecipeCatalog, RunHistory, ActiveAlarms, RecentDiagnostics, SimulatorProfileCatalog, plus LoadedRecipe.ScanPoints). The with expression allocates a new top-level record regardless of which field changed.
- With 40 call sites, the update graph already touches AppState ~7 times per second at demo rates (~5 Hz telemetry + ~2 Hz frames + sparse workflow events).
- At a modest real-machine rate (100 Hz telemetry + 30 Hz frames + per-tick position updates + per-frame defect increments + per-scan-point workflow events) you are easily at 500–1,500 allocations/sec of ~200-byte records plus their reducer closures.
- The allocations themselves are fine for Gen-0, but each one fires StateChanged, which triggers Dispatcher.Invoke on the UI thread, which allocates, which touches every [ObservableProperty] setter, which raises PropertyChanged events, which allocate PropertyChangedEventArgs... This compounds.
This is not a theoretical concern; it is the single most common reason Redux-style desktop apps become laggy at scale.
Mitigations that fit the existing architecture:
- Keep the record, but split AppState into sub-records and only allocate the sub-record that changed (you already do this partially for PipelineCounters and OperationalCounters). Extract ConnectionSlice, MotionSlice, RunSlice, TelemetrySlice, DiagnosticsSlice, AlarmSlice, RecipeSlice.
- Introduce a batched update API (UpdateBatch(Action<Builder>)) that applies multiple mutations and fires StateChanged once.
- Use ImmutableList<T> for the append-heavy collection fields (it shares structure across versions; ImmutableArray<T> copies the whole array on every change, so it only suits small or read-mostly fields) or, better, a dedicated high-rate buffer outside the record.
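The first two mitigations can be sketched together. This is a minimal illustration, not the repo's code: the slice fields, SlicedAppState, and BatchingStore are hypothetical names, and UpdateBatch here takes a list of reducers rather than the Action<Builder> shape suggested above.

```csharp
using System;

// Hypothetical slice records; field names are illustrative only.
public sealed record ConnectionSlice(bool IsConnected = false);
public sealed record MotionSlice(double X = 0, double Y = 0);
public sealed record SlicedAppState(ConnectionSlice Connection, MotionSlice Motion);

public sealed class BatchingStore
{
    private readonly object _gate = new();
    private SlicedAppState _current = new(new ConnectionSlice(), new MotionSlice());

    public event Action<SlicedAppState>? StateChanged;

    public SlicedAppState Current { get { lock (_gate) return _current; } }

    // Applies every reducer under one lock acquisition and raises StateChanged once.
    public void UpdateBatch(params Func<SlicedAppState, SlicedAppState>[] reducers)
    {
        SlicedAppState next;
        lock (_gate)
        {
            next = _current;
            foreach (var reduce in reducers) next = reduce(next);
            if (ReferenceEquals(next, _current)) return; // no reducer produced a new state
            _current = next;
        }
        StateChanged?.Invoke(next); // raised outside the lock so handlers cannot stall writers
    }
}
```

A reducer such as `s => s with { Motion = s.Motion with { X = 1 } }` allocates a new MotionSlice and a new top-level record, while the untouched ConnectionSlice instance is shared between the old and new state.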
3.2 O(n) collection rebuilds
Every append to RunHistory, ActiveAlarms, and RecentDiagnostics allocates a new List<T>. In AppStateExtensions.WithDiagnosticsEntry this is already mitigated by the 200-entry cap, but RunHistory is unbounded and ActiveAlarms grows with every fault-then-clear cycle.
- RunHistory can easily be 1,000s of rows per day in a real fab.
- ActiveAlarms can spike to hundreds during a cascading fault storm.
- Each _store.Update that touches these creates a fresh List<T> of the full current length.
In MainViewModel.Project the corresponding ObservableCollection.Clear(); foreach Add(...) patterns (lines 337–339 for alarms, 367–369 for history, 379–381 for diagnostics) make this worse: every full rebuild causes WPF to tear down and rebuild item containers, losing selection and scroll position. The ReferenceEquals shortcut guards the history, catalog, and diagnostics projections but not Alarms.
Best practice at scale: DynamicData.SourceCache<T, TKey> or ObservableChangeSet for each append-only/update collection; BindingList<T> or VirtualizingCollection for the UI; never Clear + AddAll.
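As a dependency-free illustration of avoiding Clear + AddAll, the sketch below reconciles an ObservableCollection against the latest snapshot by key. CollectionSync and SyncByKey are hypothetical names; the sketch assumes items have a stable key and it does not handle reordering or in-place updates, which DynamicData would cover.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

public static class CollectionSync
{
    // Brings target in line with source without raising a Reset:
    // items whose key vanished are removed, items with new keys are appended.
    public static void SyncByKey<T, TKey>(
        ObservableCollection<T> target,
        IReadOnlyList<T> source,
        Func<T, TKey> key) where TKey : notnull
    {
        var wanted = new HashSet<TKey>();
        foreach (var item in source) wanted.Add(key(item));

        // Walk backwards so RemoveAt does not shift indices still to be visited.
        for (int i = target.Count - 1; i >= 0; i--)
            if (!wanted.Contains(key(target[i]))) target.RemoveAt(i);

        var present = new HashSet<TKey>();
        foreach (var item in target) present.Add(key(item));

        foreach (var item in source)
            if (present.Add(key(item))) target.Add(item); // Add returns false for keys already shown
    }
}
```

Because only individual Add and Remove notifications are raised, WPF keeps the item containers for unchanged rows, so selection and scroll position survive the update.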
3.3 Whole-state fan-out with no selectors
AppStateStore.StateChanged?.Invoke(next) hands every subscriber the whole state, every time. Today there is one subscriber (MainViewModel), so this is cheap. As soon as you add a second window, a dedicated defects view, a live chart, an alarm banner component — each gets every frame, every telemetry tick, every diagnostic entry, and has to re-check whether its slice actually changed.
Best practice at scale:
- Publish typed event streams alongside StateChanged (IObservable<FrameEvent>, IObservable<AlarmEvent>, etc.), so a chart subscribes only to telemetry.
- Or provide a selector API: IObservable<T> Select<T>(Func<AppState, T>) with .DistinctUntilChanged() built in. Libraries like DynamicData or ReactiveUI do this natively.
- Or split into sub-stores (see §3.1) and raise slice-specific events.
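The selector idea can be shown without taking a dependency on System.Reactive. The sketch below wraps a whole-state handler so that it only fires when the selected slice changes; StoreSelectors is a hypothetical helper, and it assumes the store raises StateChanged serially (the wrapper itself is not thread-safe).

```csharp
using System;
using System.Collections.Generic;

public static class StoreSelectors
{
    // Returns a whole-state handler that forwards only distinct consecutive slice values.
    public static Action<TState> Select<TState, TSlice>(
        Func<TState, TSlice> selector,
        Action<TSlice> onChanged,
        IEqualityComparer<TSlice>? comparer = null)
    {
        comparer ??= EqualityComparer<TSlice>.Default;
        bool hasValue = false;
        TSlice last = default!;

        return state =>
        {
            var slice = selector(state);
            if (hasValue && comparer.Equals(last, slice)) return; // slice unchanged: skip
            hasValue = true;
            last = slice;
            onChanged(slice);
        };
    }
}
```

A panel would subscribe with something like `store.StateChanged += StoreSelectors.Select<AppState, TSlice>(s => s.SomeSlice, Render);`, where the slice property and Render are placeholders. With record slices the default comparer uses value equality; passing a reference-equality comparer makes the check cheaper when slices are only replaced on change.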
3.4 Single lock across all writes
AppStateStore holds one Lock for every update. At demo rates, contention is invisible. At real-machine rates the following producers all contend on that one lock:
- Workflow commands (low rate).
- Motion position updates — currently 20 Hz, would be 100–1,000 Hz with a real encoder.
- Telemetry — currently 5 Hz, real 10–500 Hz × dozens of tags.
- Frame pipeline — currently 2 Hz, real 30–200 fps.
- Fault injector and signal changes — rare but safety-critical.
- UI reads of Current (unbounded).
The lock serializes all of these. It also holds the lock while the reducer runs — reducers are user code, and at some call sites they allocate lists and iterate the alarm collection. A slow reducer stalls every other producer.
Best practice at scale: per-slice locks; a lock-free compare-and-swap loop (Interlocked.CompareExchange on the AppState reference, with Volatile.Read for readers) that retries on contention, which is acceptable as long as reducers stay cheap and pure; or Channels as the writer input with a single dispatcher thread draining them serially.
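A minimal sketch of the lock-free option follows. LockFreeStore is a hypothetical stand-in for an IAppStateStore implementation, and it assumes reducers are pure, since a reducer may run more than once when a race is lost.

```csharp
using System;
using System.Threading;

public sealed class LockFreeStore<TState> where TState : class
{
    private TState _current;

    public LockFreeStore(TState initial) => _current = initial;

    public event Action<TState>? StateChanged;

    // Readers never block; they see the most recently published state.
    public TState Current => Volatile.Read(ref _current);

    public TState Update(Func<TState, TState> reducer)
    {
        while (true)
        {
            var seen = Volatile.Read(ref _current);
            var next = reducer(seen);
            if (ReferenceEquals(next, seen)) return seen; // reducer made no change

            // Publish only if nobody else replaced the state since we read it.
            if (ReferenceEquals(Interlocked.CompareExchange(ref _current, next, seen), seen))
            {
                StateChanged?.Invoke(next);
                return next;
            }
            // Another writer won the race: loop and re-run the reducer on the newer state.
        }
    }
}
```

One trade-off to keep in mind: with concurrent writers, StateChanged callbacks can be delivered in a different order than the swaps happened, so subscribers that need ordering should read Current rather than trust the event argument.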
3.5 Store mutations inside event handlers
WorkflowService.OnPositionChanged(x, y) fires whenever SimulatedMotionController raises PositionChanged, which happens every 50 ms during a move. Each call allocates a closure, takes the store lock, creates a new AppState, fires StateChanged, marshals to the UI thread, and re-projects. At 50 ms ticks this is fine; at a real encoder's 1 kHz it is untenable.
Best practice: coalesce high-rate position updates (System.Reactive.Linq.Sample(TimeSpan.FromMilliseconds(33)) or a dedicated channel with DropOldest) so the store only sees UI-relevant updates.
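The channel variant of that coalescing can be sketched with a capacity-one bounded channel in DropOldest mode, so the store only ever sees the newest position. Position and PositionCoalescer are hypothetical names; the sketch assumes System.Threading.Channels is available (it ships in-box with modern .NET) and that a UI-rate timer calls TryTakeLatest.

```csharp
using System.Threading.Channels;

public readonly record struct Position(double X, double Y);

public sealed class PositionCoalescer
{
    // Capacity 1 + DropOldest: a new write replaces any unread position.
    private readonly Channel<Position> _channel = Channel.CreateBounded<Position>(
        new BoundedChannelOptions(1)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true
        });

    // Called at sensor rate; never blocks the producer.
    public void Post(Position p) => _channel.Writer.TryWrite(p);

    // Called at UI rate (for example from a ~30 Hz timer); false means nothing new arrived.
    public bool TryTakeLatest(out Position latest) => _channel.Reader.TryRead(out latest);
}
```

The handler for PositionChanged would call Post, and only the timer callback would call _store.Update, which caps store traffic at the timer rate regardless of encoder rate.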
3.6 RecentDiagnostics is a hard cap, not a rotating buffer
AppStateExtensions.WithDiagnosticsEntry removes updated[0] when count exceeds 200. This:
- Allocates a new List<DiagnosticsEntry> of size ≈200 on every entry.
- Is O(n) because RemoveAt(0) shifts the array.
- Silently drops older entries with no spill to disk.
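At real fault-storm rates (hundreds of entries per second) you lose your forensic trail in under 2 seconds, and every entry copies the full ~200-element list (roughly 1.6 KB of references on a 64-bit runtime, assuming DiagnosticsEntry is a reference type), so this list alone allocates hundreds of KB/sec.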
Best practice: a Deque<T> / ring buffer outside AppState, exposed via IObservable<DiagnosticsEntry>, with a durable file/database sink.
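A fixed-capacity ring buffer of the kind described can be sketched as below. RingBuffer and its onEvicted callback are hypothetical; the callback stands in for the durable sink, and the class is not thread-safe, so it assumes a single writer (or external synchronisation).

```csharp
using System;

public sealed class RingBuffer<T>
{
    private readonly T[] _items;
    private readonly Action<T>? _onEvicted;
    private int _start; // index of the oldest entry
    private int _count;

    public RingBuffer(int capacity, Action<T>? onEvicted = null)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
        _onEvicted = onEvicted;
    }

    public int Count => _count;

    // O(1) and allocation-free: when full, the oldest entry is handed to the
    // eviction callback and its slot is reused for the new entry.
    public void Add(T item)
    {
        if (_count == _items.Length)
        {
            _onEvicted?.Invoke(_items[_start]);
            _items[_start] = item;
            _start = (_start + 1) % _items.Length;
        }
        else
        {
            _items[(_start + _count) % _items.Length] = item;
            _count++;
        }
    }

    // Copies the contents, oldest first, for display or export.
    public T[] ToArray()
    {
        var copy = new T[_count];
        for (int i = 0; i < _count; i++)
            copy[i] = _items[(_start + i) % _items.Length];
        return copy;
    }
}
```

Compared with the copy-then-RemoveAt(0) approach, appending costs no allocation, and entries pushed out of the window can still reach a file or database through the callback instead of being dropped silently.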
3.7 Mixing control-plane and data-plane in one record
AppState currently holds:
- control-plane fields (connection/workflow/motion/camera states, loaded recipe, command-guard inputs): the correct place;
- data-plane fields (LatestTelemetry, LatestFrame, PipelineCounters, OperationalCounters, RecentDiagnostics, RunHistory): the wrong place at scale.
Data-plane data changes fast and is usually consumed by specialised UI (charts, grids with virtualization, image viewers). Forcing it through the same reducer/event path as slow control-plane state is exactly what makes Redux apps feel sluggish. Real industrial apps keep these physically separate: a tag store for slow configuration/state, a time-series buffer for telemetry, a frame ring buffer for images, a defect database for results, a rolling log for diagnostics.
The prototype already has the right primitives (bounded Channel<T> for frames and telemetry); it just routes them into the central record once they leave the channel. The next refactor is to have the channels feed dedicated read models that the UI binds to directly, and keep AppState for control-plane only (plus a few "latest-value" snapshots used by guards).
3.8 No schema versioning, no persisted snapshot
If you ever want to: reopen the app into its previous state, snapshot-and-restore, export to an operator/service engineer, replay an incident — you need a versioned serialisation of AppState. Today: none.
3.9 Single-window assumption
MainViewModel captures Dispatcher.CurrentDispatcher at construction. A second window on a second thread, or an operator/engineer split across two monitors with independent dispatchers, would not receive state correctly.
3.10 No selector cache
MainViewModel.Project re-derives MachineReadyLabel, IsMachineReady, DefectBreakdownText, etc. on every update. For a demo this is fine. For a high-rate system a memoized selector (Reselect-style) would eliminate redundant string allocations and binding updates.
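A single-entry memoized selector is enough to illustrate the idea. MemoizedSelector is a hypothetical helper; it assumes the input is an immutable record (or slice) whose reference changes only when its content changes, which is the same assumption the existing ReferenceEquals shortcuts rely on.

```csharp
using System;

public sealed class MemoizedSelector<TInput, TResult> where TInput : class
{
    private readonly Func<TInput, TResult> _compute;
    private TInput? _lastInput;
    private TResult _lastResult = default!;

    public MemoizedSelector(Func<TInput, TResult> compute) => _compute = compute;

    // Recomputes only when a different input instance is passed in;
    // otherwise returns the cached result without allocating.
    public TResult Get(TInput input)
    {
        if (!ReferenceEquals(input, _lastInput))
        {
            _lastResult = _compute(input);
            _lastInput = input;
        }
        return _lastResult;
    }
}
```

A derived label such as DefectBreakdownText could be held in one selector keyed on the slice it depends on, so a telemetry tick that leaves that slice untouched reuses the cached string and triggers no binding update. The sketch is not thread-safe and assumes it is only called from the UI thread.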
4. Can the simulators produce real-machine data?
Real wafer-inspection tools vary wildly — optical edge inspection, macro, bright-field, dark-field, E-beam, review SEM — but here are conservative, low-end numbers for a single modern optical inspection station:
| Axis | Low-end real tool | Today's simulator |
|---|---|---|
| Camera frame rate | 30–60 fps (often higher) | 2 fps |
| Camera count | 1–4 simultaneous | 1 |
| Frame payload | 2–12 MP × 8–12 bit × 1–3 channels ≈ 2–48 MB/frame | null (Frame.PreviewPayload = null) |
| Frame byte rate | 50–1,000+ MB/s aggregate | ~0 B/s |
| Telemetry tags | 50–500 (temps, pressures, motion encoders, lamp hours, vacuum, RF, chuck, gas flows...) | 2 (temperature, pressure) |
| Telemetry rate | 1–100 Hz per tag, some at kHz for encoder feedback | 5 Hz combined |
| Motion encoder feedback | 500 Hz – 4 kHz | 20 Hz (50 ms tick) |
| Scan points per wafer | 10–50,000 | recipe-limited, sample recipes max 5 |
| Defects per wafer | 10s–100,000s | stochastic, ~1/s at 60% profile × 2 fps |
| Wafer throughput | 60–200 wafers/hour | run-by-run operator action |
| Alarm rate during fault storm | 10–100 /s burst | single manual injection |
| Recipe catalog size | 100–1,000s | 2 sample files |
| Run-history rows | 10,000s/month | unbounded JSON array |
| Per-frame defect record size | bbox + class + confidence + image ref ≈ 100–500 B | string summary |
4.1 Approximate under-scale factor
- Frame byte rate: ≥ 10,000–100,000× below a real tool (because payload is null).
- Telemetry bandwidth: ~100–500× below a modest SCADA tag scan.
- Motion feedback: 25–200× below a real servo.
- Alarms under fault storm: 10–100× below.
- Defect-record volume and richness: untested entirely — schema today is a single string.
4.2 What the simulator is good at
The simulator is deliberately coarse, and that is the right call for a training / docs prototype. Where it shines:
- Teaching the shape of the system: connection → home → run → stop/abort/fault/recover.
- Exercising channel backpressure policies (drop-oldest, coalesce-latest) in a test-verifiable way.
- Exercising the UI's threading discipline.
- Exercising workflow state transitions end-to-end in tests.
These are the exact things a first-slice simulator should teach, and this one does.
4.3 What it cannot yet exercise
Because the data volumes are small and the payloads are empty, the simulator does not yet put any pressure on the parts of the system that real machines stress hardest:
- Memory lifecycle of preview frames. No real byte[] or WriteableBitmap ever flows. GC pauses, LOH allocations, WriteableBitmap.Lock/AddDirtyRect/Unlock discipline, pixel-format conversions: all untested.
- Image codec and disk bandwidth. Real tools write PNG/TIFF/BIN frames; no disk path is exercised.
- Multi-channel telemetry. There is no abstraction for more than the two hard-coded fields in MachineTelemetry. Adding a 50-tag bag requires schema changes throughout AppState.
- Encoder-rate motion. Task.Delay(50ms) is not a 1 kHz encoder. Any UI that tries to live-plot position would be fine at 20 Hz but not at 1 kHz.
- Burst / storm profiles. Defect showers (a bad wafer producing 5,000 defects in 10 seconds), alarm cascades (one trip triggering 30 interlock alarms), telemetry glitches (a sensor dropping out for 2 s then returning).
- Network jitter. Real SDK calls are PInvoke / TCP / OPC UA with latency, reorder, timeout, disconnect. The simulator always succeeds in 1.5 s ± 0.
- Multi-wafer cadence. There is no notion of a load / align / run / unload cycle repeating at 30 s intervals for hours.
- Long-soak behavior. Nothing runs for 8 hours. GC generations, file growth, event handler leaks — not probed.
4.4 What a scale-exercising simulator would add
Sketched as future slices, in increasing difficulty:
1. Frame payload generator: synthesise a real byte[] per frame (e.g. WriteableBitmap with Perlin noise or a checkerboard). Expose size, stride, pixel format. Add LOH stress test.
2. Multi-tag telemetry bag: replace MachineTelemetry with a keyed dictionary of TagSample(string Name, DateTimeOffset Ts, double Value, Quality Quality). Configure 50 tags in appsettings.json with per-tag intervals (5 Hz to 500 Hz) and noise models.
3. High-rate encoder: add a 1 kHz position feedback stream with its own bounded channel, separate from the UI-rate position events.
4. Storm profiles: extend SimulatorProfile with DefectShowerRate, AlarmBurstEvery, TelemetryDropoutChance, NetworkLatencyMean/Stddev. A new ChaosMonkey profile.
5. Rich defect model: Defect(Guid Id, Guid FrameId, BoundingBox Box, string ClassLabel, double Confidence, string? ImageBlobRef). Emit at realistic per-wafer volumes.
6. Wafer loop: a scenario scheduler that runs N wafers back-to-back with configurable inter-wafer delay, simulating a production cassette.
7. Time-compression soak mode: a flag that runs the whole wafer loop at 100× speed so a day of operation fits in 15 minutes; useful to catch leaks, runaway allocations, file growth.
8. SDK-flakiness injector: wrap each simulator method so it can return timeouts, C++-style exceptions, out-of-band callbacks, cancellation that actually doesn't cancel.
Items 1–2 alone would put meaningful pressure on the current AppState design and surface the problems in §3 before a real SDK shows up.
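The multi-tag telemetry bag could start from something like the sketch below. The TagSample shape is taken from the list above; the Quality members and the TagBag class are assumptions made for illustration, as is the policy of ignoring samples that arrive out of order.

```csharp
using System;
using System.Collections.Concurrent;

public enum Quality { Good, Uncertain, Bad } // hypothetical members

public readonly record struct TagSample(string Name, DateTimeOffset Ts, double Value, Quality Quality);

public sealed class TagBag
{
    private readonly ConcurrentDictionary<string, TagSample> _latest = new();

    public int TagCount => _latest.Count;

    // Keeps the newest sample per tag; an older sample arriving late is ignored.
    public void Publish(TagSample sample) =>
        _latest.AddOrUpdate(
            sample.Name,
            sample,
            (_, existing) => sample.Ts >= existing.Ts ? sample : existing);

    public bool TryGetLatest(string name, out TagSample sample) =>
        _latest.TryGetValue(name, out sample);
}
```

Because the bag is keyed by name, going from 2 tags to 50 is a configuration change rather than a schema change, which is exactly the pressure item 2 is meant to put on AppState.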
5. So… will the canonical app state survive real machines?
Not without changes. It will survive real machines in principle (the shape is right) but not in this implementation (the sizing is not). Specifically:
- The control-plane core (connection, workflow, motion, camera, safety, loaded recipe, command guards) is already production-shaped. Keep it.
- The data-plane currently embedded in AppState (LatestFrame, LatestTelemetry, RecentDiagnostics, RunHistory, ActiveAlarms, pipeline counters) needs to be lifted out into dedicated stores/streams before a real machine's data rates go through it. That is the single biggest refactor standing between this prototype and production.
- The fan-out mechanism (StateChanged) needs to grow selector/observable semantics before a second panel or window subscribes.
- The mutation mechanism (whole-record with, single lock) needs either slicing or a lock-free / per-slice strategy before high-rate producers are wired in.
- The simulator needs a step-change in payload size and rate before any of the above concerns can be measured, not just theorised.
The good news, and it's a real piece of good news: because IAppStateStore is an interface, because the Application layer never mutates state outside of it, and because Command Guards are pure functions, every one of these refactors can be done without touching Domain, without touching Presentation, and (mostly) without touching test fakes. That is ADR-001 paying real dividends — and it is a better starting point than most shipped industrial desktop code ever gets.
6. Recommended evolution path (ordered)
Each step preserves the IAppStateStore contract and can land as its own slice. No big-bang rewrite.
1. Diff-safe Alarms projection in MainViewModel (match the ReferenceEquals pattern used for history/catalog/diagnostics). Two-line fix, big perceptual win.
2. Slice AppState into typed sub-records (ConnectionSlice, MotionSlice, RunSlice, AlarmSlice, DiagnosticsSlice). Reducers touch one slice each. AppState becomes a composition of slices, updated with a nested with expression on the affected slice. Guard functions get correspondingly narrow input types.
3. Switch collection fields to ImmutableList<T> (or ImmutableArray<T> for small, read-mostly ones), particularly RunHistory, ActiveAlarms, RecentDiagnostics. Eliminate the full-list copies around RemoveAt(0).
4. Per-slice IObservable<T> published alongside StateChanged. ViewModels / panels subscribe only to what they need. Keep StateChanged for backwards compatibility or remove it once callers migrate.
5. Move high-rate data out of AppState: introduce ITelemetryBuffer, IFrameBuffer, IDiagnosticsJournal. Each owns its own ring buffer and IObservable feed. AppState keeps only "latest value" snapshots used for guards.
6. Batched/coalesced updates for position and telemetry so the store sees UI-rate, not sensor-rate.
7. Lock-free or per-slice locks once slice types are stable.
8. Event log / snapshot serialisation for schema-versioned persistence and incident replay.
9. Selector cache / memoization for derived labels.
10. Multi-window / multi-dispatcher support in the StateChanged bridge.
And in the simulator, in parallel:
- Real byte[] frame payloads at configurable fps.
- Multi-tag telemetry dictionary.
- High-rate encoder channel separated from UI-rate position.
- Storm / chaos / soak profiles.
- Rich defect model with realistic per-wafer volumes.
- Wafer loop / time-compression mode.
7. Bottom line
The canonical-app-state pattern here is the right tool for the job and places the project firmly inside the mainstream of both web-style single-store architectures and industrial control-plane modelling. It is correctly bounded by ADR-001 and cleanly testable.
But a canonical app state for one operator's current run is not the same thing as a canonical app state for a real inspection tool's live data-plane, and the current implementation conflates the two. Until that split happens, and until the simulator is fat enough to exercise it, calling this "production-ready" would be premature.
Treat the pattern as a load-bearing foundation. Treat the current data-plane embedding as scaffolding. Scale the simulators first so the right pressure exists; then let the design evolve under that pressure.
— End of review