Production-Readiness Review

Date: 2026-04-22
Reviewer: Independent audit (Claude / Opus 4.7)
Scope: Whole repository. Is this "production-level" and ready to grow into real-world wafer inspection software?
Verdict: Excellent demo / training artifact. Not production-ready. Strong architectural bones; multiple entire categories of production concern are intentionally out of scope and remain absent.

1. TL;DR

The repo is honestly and carefully positioned as a docs-first, AI-assisted desktop prototype that imitates an industrial wafer inspection workstation. Within that framing it is well above average:

The architecture is layered, disciplined, and enforced. Domain is pure; Application owns orchestration; Infrastructure hides the "vendor SDK"; Presentation is a thin projection.
State is centralized and strictly mutated through one store (ADR-001). Command guards are pure functions derived from canonical state.
Streaming pipelines use bounded System.Threading.Channels with documented, test-verified DropOldest backpressure policies.
Workflow semantics (Stop vs. Abort vs. Fault, acknowledgement vs. recovery) are encoded correctly and tested.
216 xUnit tests across 14 files plus reusable fakes; the suite covers the areas a first-slice review would ask about.
Documentation (requirements, ADRs, specs, tasks, scenarios, architecture diagrams, course material) is unusually thorough and self-consistent.

However, the requirements themselves (§3 Non-Goals; §16 Future Expansion Areas) explicitly exclude almost every axis that separates a demo from a production machine-control application:

real hardware integration
production deployment
MES / SCADA / historian / factory integration
user management, authentication, authorization, auditability
polished industrial UI and accessibility
real image processing / vision algorithms
reporting and analytics
distributed / multi-station orchestration

So "production level" here means production-grade prototype engineering practice, not a product you can ship to a fab. That distinction matters and this review is written around it.

2. What makes it feel production-level (the good)

2.1 Architecture and layering

Five projects with clean dependency direction: App → Presentation/Application/Infrastructure → Domain. No layering violations found.
Domain is pure records (no framework references, no logic).
Application depends only on Domain + Microsoft.Extensions.* abstractions.
Infrastructure hides the "vendor SDK" behind interfaces (IMachineConnection, IMotionController, ICameraController, ILightController, IMachineSignals, IFaultInjector, ITelemetrySource, IFrameSource, IRunHistoryStore, IRecipeCatalog).
Presentation has no Infrastructure reference. ViewModels talk only to IWorkflowService, IAppStateStore, IFaultInjector, ISimulatorProfileService.

This is the single most important thing a real industrial app must get right to survive SDK swaps and hardware changes, and this repo gets it right.

2.2 Central app state

AppStateStore (src/InspectionPrototype.Application/Services/AppStateStore.cs) is the single mutation point. Every update is locked, applied via a reducer function, and broadcast via StateChanged.
AppState (src/InspectionPrototype.Application/State/AppState.cs) is a pure record with an Initial factory — good for predictability, diffing, and testing.
CommandGuards (src/InspectionPrototype.Application/Guards/CommandGuards.cs) are pure functions over AppState. ViewModels re-query them after every state change (MainViewModel.NotifyAllCommandsCanExecuteChanged).
This matches ADR-001 and the requirements' §9.5. In a real machine-control app this pattern makes operator-command behaviour deterministic and auditable.
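
A minimal sketch of the store-plus-guards pattern described above, assuming a hypothetical two-field state. StateStore, DemoState and Guards are illustrative names, not the repo's AppStateStore, AppState or CommandGuards, and raising the event outside the lock is a choice made for this sketch rather than a confirmed detail of the implementation.

```csharp
using System;

// Usage: apply a reducer, then re-query the guard, as a ViewModel would.
var store = new StateStore(new DemoState(IsConnected: false, IsRunning: false));
store.StateChanged += s => Console.WriteLine($"CanStart = {Guards.CanStart(s)}");
store.Update(s => s with { IsConnected = true }); // prints "CanStart = True"

// Hypothetical state shape; the real AppState carries many more fields.
public sealed record DemoState(bool IsConnected, bool IsRunning);

// Pure guard: the answer depends only on the state snapshot passed in.
public static class Guards
{
    public static bool CanStart(DemoState s) => s.IsConnected && !s.IsRunning;
}

// Single mutation point: take the lock, apply the reducer, publish the snapshot.
public sealed class StateStore
{
    private readonly object _gate = new();
    private DemoState _current;

    public StateStore(DemoState initial) => _current = initial;

    public event Action<DemoState>? StateChanged;

    public DemoState Current { get { lock (_gate) return _current; } }

    public void Update(Func<DemoState, DemoState> reducer)
    {
        DemoState next;
        lock (_gate)
        {
            next = reducer(_current);
            _current = next;
        }
        // Raised outside the lock so a slow subscriber cannot block other writers.
        StateChanged?.Invoke(next);
    }
}
```

Because the guard takes the snapshot as its only input, the same function can be called from a ViewModel, a test, or an audit log replay and give the same answer. One trade-off of notifying outside the lock is that two concurrent updates may notify subscribers out of order.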

2.3 Workflow semantics

WorkflowService (src/InspectionPrototype.Application/Services/WorkflowService.cs:9-786) encodes the three-way distinction the requirements demand:

Stop — cooperative; a flag is set and the run exits at the next scan-point boundary.
Abort — immediate; run CTS is cancelled, loop unwinds via OperationCanceledException.
Fault — critical; alarm added to state, workflow forced to Faulted, active run and homing cancelled, explicit RecoverAsync required after the condition clears.

The finally-block in RunLoopAsync maps _terminationReason to a RunTerminalStatus and constructs exactly one RunSummary. Acknowledgement and recovery are separated per §12.6. Concurrency around _runCts/_homeCts is guarded by _ctsLock.
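
The Stop / Abort distinction can be sketched in a few lines. This is a simplified illustration, not the repo's RunLoopAsync: RunLoopSketch, TerminalStatus and SummariesWritten are hypothetical names, Fault handling is omitted, and Task.Delay stands in for the work done at each scan point.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var runner = new RunLoopSketch();
using var cts = new CancellationTokenSource();
var run = runner.RunAsync(1000, cts.Token);
runner.RequestStop();          // cooperative: honoured at the next scan-point boundary
Console.WriteLine(await run);  // prints "Stopped"

public enum TerminalStatus { Completed, Stopped, Aborted }

public sealed class RunLoopSketch
{
    private volatile bool _stopRequested;

    public int SummariesWritten { get; private set; }

    public void RequestStop() => _stopRequested = true;

    public async Task<TerminalStatus> RunAsync(int scanPoints, CancellationToken abort)
    {
        var status = TerminalStatus.Completed;
        try
        {
            for (var i = 0; i < scanPoints; i++)
            {
                // Stop is only observed here, between scan points.
                if (_stopRequested) { status = TerminalStatus.Stopped; break; }

                // Abort interrupts the in-flight work through the token.
                await Task.Delay(10, abort);
            }
        }
        catch (OperationCanceledException)
        {
            status = TerminalStatus.Aborted;
        }
        finally
        {
            // Every exit path passes through here once, so one summary per run.
            SummariesWritten++;
        }
        return status;
    }
}
```

Cancelling cts instead of calling RequestStop makes the same run finish as Aborted, because the pending Task.Delay throws rather than running to the next boundary.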

2.4 Streaming / backpressure

Telemetry: capacity 1, DropOldest (coalesce-to-latest).
Frames: capacity 3, DropOldest (sliding recent-window).
Drop and coalesce counters are incremented via Interlocked and surfaced through AppState.PipelineCounters, with diagnostic entries when events occur.
StreamingPipelineTests asserts both channel policies and the pipeline-to-state bridge.

This is a textbook-correct answer to the requirements §9.6 and is the right shape for a real telemetry/frame pipeline.
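
For readers unfamiliar with the policy, here is a small sketch of a capacity-3 DropOldest channel. It counts drops through the itemDropped callback that Channel.CreateBounded accepts (.NET 6 and later); that is an assumption made for the illustration and is not necessarily how the repo's pipelines count.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;

long dropped = 0;

// Capacity 3 with DropOldest: writing to a full channel evicts the oldest item.
var frames = Channel.CreateBounded<int>(
    new BoundedChannelOptions(3)
    {
        FullMode = BoundedChannelFullMode.DropOldest,
        SingleReader = true,
    },
    // Invoked once per evicted item, so the counter tracks real drops only.
    itemDropped: _ => Interlocked.Increment(ref dropped));

for (var i = 1; i <= 5; i++)
    frames.Writer.TryWrite(i); // always succeeds in DropOldest mode

while (frames.Reader.TryRead(out var frame))
    Console.WriteLine(frame); // prints 3, 4, 5 (one per line)

Console.WriteLine($"dropped = {Interlocked.Read(ref dropped)}"); // prints "dropped = 2"
```

With capacity 1 the same options give the coalesce-to-latest behaviour used for telemetry: a reader only ever sees the most recent sample.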

2.5 Async/threading

Host.CreateDefaultBuilder() + IHostedService + BackgroundService for pipelines.
UI marshalling is centralized: MainViewModel captures Dispatcher.CurrentDispatcher in its constructor and every OnStateChanged goes through _dispatcher.Invoke. This is the correct pattern for WPF + background producers.
Long-running simulator tasks (SimulatedCamera, SimulatedTelemetrySource, SimulatedMotionController.InterpolateAsync) honour cancellation tokens.
No .Result / .Wait() / .GetAwaiter().GetResult() on the UI thread.

2.6 Persistence

JsonRunHistoryStore uses temp-file-then-File.Move for atomic-ish writes; on a parse error it logs the failure and returns an empty history instead of throwing.
JsonRecipeCatalog validates each file, returns per-file Valid/Invalid results with reasons, detects duplicate recipeIds, and is deterministic across runs via sorted filenames.
SampleRecipeProvisioningService seeds starter recipes on first launch without ever overwriting.
HistoryHydrationService / RecipeCatalogHydrationService / SimulatorProfileHydrationService are IHostedServices so the UI sees stable data before the first frame is painted.
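
The write-then-rename idea and the tolerant read can be sketched as follows. AtomicJson is a hypothetical helper using a plain string array as the payload; it illustrates the technique and does not reproduce JsonRunHistoryStore.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

var path = Path.Combine(Path.GetTempPath(), "run-history-sketch.json");
await AtomicJson.SaveAsync(path, new[] { "run-1", "run-2" });
Console.WriteLine(string.Join(",", await AtomicJson.LoadAsync(path))); // prints "run-1,run-2"

public static class AtomicJson
{
    public static async Task SaveAsync(string path, string[] runs)
    {
        var temp = path + ".tmp";
        await File.WriteAllTextAsync(temp, JsonSerializer.Serialize(runs));

        // Readers see either the old file or the new one, never a partial write,
        // as long as temp and target live on the same volume.
        File.Move(temp, path, overwrite: true);
    }

    public static async Task<string[]> LoadAsync(string path)
    {
        if (!File.Exists(path)) return Array.Empty<string>();
        try
        {
            var json = await File.ReadAllTextAsync(path);
            return JsonSerializer.Deserialize<string[]>(json) ?? Array.Empty<string>();
        }
        catch (JsonException ex)
        {
            // Corrupt history is reported and treated as empty rather than fatal.
            Console.Error.WriteLine($"History unreadable: {ex.Message}");
            return Array.Empty<string>();
        }
    }
}
```

This protects against a crash mid-write within one process. It does not protect against two processes doing read-modify-write at the same time, which is the gap noted in §3.3.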

2.7 Tests

216 [Fact]/[Theory] cases across 14 files; WorkflowServiceTestContext plus fakes for every vendor-SDK abstraction.
Covers: command guards, start preconditions, stop-vs-abort, fault transition, acknowledgement, recovery, bounded streaming, recipe JSON validation and duplicate handling, sample provisioning, history round-trip, history hydration, simulator profiles (incl. live switching guard), Slice-004 regression suite, alarm lifecycle, run metrics.
Includes a ThrowingRunHistoryStore stub — the tests actively exercise infrastructure-failure paths.

2.8 Documentation

Requirements are split into seven sections with a hub page.
Four ADRs (one still "Proposed"), each linked to the sections it implements.
Five architecture pages (system context, project/layer map, domain, workflow, runtime sequences).
Four slice specs and matching TASK-### plans, plus scenarios, sample recipes, and a course track.
VitePress site builds from docs/ with a package.json script.

This is a better paper trail than most commercial projects actually maintain.

3. What is missing for real production (the gaps)

These are not criticisms of the prototype — most are explicitly out of scope per §3 / §16. They are listed here so the bar to "real-world production" is visible.

3.1 Safety, certification, and determinism (the biggest gap)

A real wafer inspection workstation is adjacent to capital equipment and human operators. It needs to address:

Safety-function separation: E-Stop, interlocks, and door-closed logic must be implemented (or at minimum asserted) by certified hardware / PLC and never by a WPF process. Today the C# layer owns safety state; on a real machine this code is a monitor, not an authority.
Functional-safety classification (IEC 61508 / SEMI S2 / S8 considerations) — not addressed.
Deterministic timing: Task.Delay + ThreadPool is acceptable for simulation; not acceptable for anything motion-critical. On a real stage, motion is driven by a deterministic motion controller.
Watchdogs, heartbeats, E-Stop latching in software — none present.

3.2 Real vendor SDK integration

Everything behind I*Controller interfaces is simulated. The interface shapes are reasonable but have not survived contact with a real SDK. Expect real SDKs to surface:

Async cancellation semantics that don't match CancellationToken.
Out-of-band error callbacks / C++ exceptions crossing the P/Invoke boundary.
Long-duration blocking calls that are not actually cancellable.
Licensing / hardware dongles / per-machine key provisioning.

None of these are modelled. SimulatedMachineConnection throws no exceptions; it returns bool. Real SDKs rarely cooperate that nicely.
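
As one example of what an adapter may need, the sketch below bounds a blocking, non-cancellable vendor call with a timeout. SdkAdapter and CallWithTimeoutAsync are hypothetical names, and Thread.Sleep stands in for the vendor call; Task.WaitAsync requires .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

try
{
    // Simulated vendor call that blocks for 5 s and ignores cancellation.
    await SdkAdapter.CallWithTimeoutAsync(
        () => Thread.Sleep(5000), TimeSpan.FromMilliseconds(100), CancellationToken.None);
}
catch (TimeoutException)
{
    Console.WriteLine("SDK call timed out; raise a fault instead of hanging.");
}

public static class SdkAdapter
{
    // The caller regains control on timeout or cancellation, but the blocking
    // call itself keeps running on its own thread until the SDK returns.
    public static Task CallWithTimeoutAsync(
        Action blockingSdkCall, TimeSpan timeout, CancellationToken ct) =>
        Task.Factory
            .StartNew(blockingSdkCall, CancellationToken.None,
                      TaskCreationOptions.LongRunning, TaskScheduler.Default)
            .WaitAsync(timeout, ct);
}
```

The important caveat is in the comment: the wrapper only stops waiting, it does not stop the SDK. A real adapter would also have to decide what state the device is in after a timed-out call, which is exactly the kind of semantics the simulated interfaces have not yet had to express.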

3.3 Process resilience and crash handling

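No App.DispatcherUnhandledException, AppDomain.CurrentDomain.UnhandledException, or TaskScheduler.UnobservedTaskException handler. An exception escaping a background Task.Run in WorkflowService.DoConnectAsync / DoHomeAsync / RunLoopAsync would, if the task is never awaited, go unobserved and leave the workflow stalled with no trace; one escaping on the UI thread or a raw thread would terminate the process without a log entry.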
No crash reporting (e.g. Sentry, AppCenter, Watson minidump hookup).
No auto-restart / supervisor policy for hosted services that fault.
Single-instance enforcement is not implemented (two instances could fight over %LocalAppData%\LcnWaferInspection\run-history.json).
JsonRunHistoryStore.SaveAsync does a read-modify-write on the whole history file with no cross-process lock. Two app instances can corrupt it.

3.4 Observability

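Uses Microsoft.Extensions.Logging, but no durable sink is configured in App.xaml.cs / appsettings.json (no file, Seq, Serilog, or OpenTelemetry). At most the providers that Host.CreateDefaultBuilder registers by default (console, debug, EventSource and, on Windows, the Event Log) are active, so most log output vanishes at process exit.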
No structured log correlation IDs (e.g., the run Guid could be a scope; right now it is only included in diagnostic entries).
No metrics (System.Diagnostics.Metrics), no OpenTelemetry tracing, no health-check endpoints.
RecentDiagnostics is capped at 200 entries in memory and never persisted. A real workstation needs a durable audit log.

3.5 Data persistence at scale

JSON file for run history is fine for demos; at >~1,000 runs it becomes slow to read-parse-write on every completion. No pagination, no rollover, no archiving policy.
No schema migration story for RunSummary / Recipe (JsonStringEnumConverter helps; a missing field would silently deserialize to default).
No database abstraction — SQLite at minimum is expected at production scale (defect records, per-frame results, wafer maps, operator IDs).
No backup / export / restore commands.
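
One way to stop missing fields from silently becoming defaults is a versioned envelope with required properties. The sketch assumes C# 11 and System.Text.Json on .NET 7 or later; HistoryFile is a hypothetical type and does not reproduce the repo's RunSummary shape.

```csharp
using System;
using System.Text.Json;

var ok = HistoryFile.Parse("""{"SchemaVersion":1,"Runs":["run-1"]}""");
Console.WriteLine(ok.Runs.Length); // prints "1"

try
{
    HistoryFile.Parse("""{"Runs":["run-1"]}"""); // SchemaVersion is absent
}
catch (JsonException)
{
    Console.WriteLine("rejected: missing required field");
}

public sealed class HistoryFile
{
    public const int CurrentVersion = 1;

    // 'required' makes the serializer throw JsonException when the property is absent.
    public required int SchemaVersion { get; init; }
    public required string[] Runs { get; init; }

    public static HistoryFile Parse(string json)
    {
        var file = JsonSerializer.Deserialize<HistoryFile>(json)
                   ?? throw new JsonException("History file was null.");

        if (file.SchemaVersion > CurrentVersion)
            throw new NotSupportedException(
                $"History schema {file.SchemaVersion} is newer than this build understands.");

        // Files with an older version would be upgraded step by step here.
        return file;
    }
}
```

The version number gives a migration hook; the required members turn a silently defaulted field into a load-time error that can be logged and surfaced.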

3.6 Security and auditability

No authentication or operator identity. RunSummary records what happened, not who did it.
No role separation between Operator, Engineer, Maintenance, Administrator — the diagnostics panel exposes fault injection to anyone with the UI.
No code signing configured (signtool, Authenticode). Installers must be signed for acceptance in most fabs.
appsettings.json has no secret handling story — fine today because there are none, but the shape will need Key Vault / DPAPI eventually.

3.7 Deployment

No installer project (MSIX, WiX, Inno Setup, ClickOnce).
No versioning / update mechanism. AssemblyInfo.cs exists but nothing increments it.
No packaging of appsettings overrides per environment (dev / QA / factory).
No uninstall / clean-up of %LocalAppData%\LcnWaferInspection.

3.8 CI / build / quality gates

No CI configured. There is no .github/workflows/, azure-pipelines.yml, or equivalent. Tests are present but nothing runs them on push.
No Directory.Build.props / Directory.Packages.props for centralized versions. Microsoft.Extensions.Hosting is pinned at 9.0.4 while the target framework is net10.0-windows; this works but drifts easily.
No static analysis (<TreatWarningsAsErrors>, <AnalysisLevel>latest-all</AnalysisLevel>, .editorconfig for diagnostics, Roslynator, StyleCop).
No coverage gate despite coverlet.collector being referenced.
No dependency-vulnerability scanning (Dependabot, dotnet list package --vulnerable).
.NET 10 is a very fresh target — pinning / supported-LTS discussion is missing.

3.9 UI and UX

Single MainWindow.xaml is ~640 lines of inline layout and data triggers. Fine for a prototype; a real operator UI would split into UserControls, use a design system, and include keyboard-first operation, high-contrast / accessibility support, and localization (no .resx, no RESW, all strings are inline).
No virtualization on Alarms or DiagnosticsLog beyond the default ListView behaviour; at high event rates this will get sluggish.
Live Preview area is a placeholder — no actual BitmapSource / WriteableBitmap wiring, which is a major real-world concern (memory churn, GC pauses).
No DPI/scaling checks, and no per-monitor DPI awareness declared in the app manifest.

3.10 Concurrency correctness (minor, worth tightening)

Most of the threading is fine, but four points to watch:

MainViewModel.Project calls Alarms.Clear() + foreach Add(...) on every state change, even when alarms haven't changed. At high event rates this is both wasteful and causes ListView selection churn. The VM already uses ReferenceEquals tracking for other collections (RunHistoryItems, RecipeCatalog, SimulatorProfileCatalog, RecentDiagnostics). Apply the same to Alarms.
SimulatorFaultInjector mutates a plain HashSet<string> without a lock. Fault injection is currently invoked only from the UI thread, but nothing prevents a future caller from doing it elsewhere. Guard it with a lock (System.Threading.Lock on .NET 9+) or replace it with ConcurrentDictionary<string, byte>.
RunLoopAsync checks _terminationReason without the lock at line 553. Reads of a volatile enum are atomic, but the pattern is inconsistent with the rest of the file where every read uses the lock. Harmonize.
SimulatedCamera.ProduceFramesAsync increments _droppedCount when _frameChannel.Reader.Count >= ChannelCapacity, checked before TryWrite. That check-then-act is a race: a consumer can read between the check and the write, so the counter occasionally reports a drop that never happened. Harmless in practice, but worth a comment or a fix.
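
For the fault-injector point, a thread-safe set can be sketched with ConcurrentDictionary. FaultInjectorSketch and its members are hypothetical and do not mirror the IFaultInjector interface.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var injector = new FaultInjectorSketch();

// 1000 concurrent injections over 10 distinct ids.
Parallel.For(0, 1000, i => { injector.Inject($"fault-{i % 10}"); });
Console.WriteLine(injector.ActiveFaults.Count); // prints "10"

public sealed class FaultInjectorSketch
{
    // ConcurrentDictionary used as a set; the byte value is a placeholder.
    private readonly ConcurrentDictionary<string, byte> _active = new();

    public bool Inject(string faultId) => _active.TryAdd(faultId, 0);

    public bool Clear(string faultId) => _active.TryRemove(faultId, out _);

    public bool IsActive(string faultId) => _active.ContainsKey(faultId);

    // Snapshot copy, so callers never enumerate the live collection.
    public IReadOnlyCollection<string> ActiveFaults => _active.Keys.ToArray();
}
```

Returning a snapshot matters as much as the concurrent container: a simulator thread iterating the live HashSet while the UI mutates it is the failure this change is meant to prevent.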

3.11 Domain modelling gaps

Frame.PreviewPayload is byte[]? with no codec / size / stride — real preview frames need stride, pixel format, bit depth, capture timestamp (hardware clock), camera id, ROI.
InspectionResult carries a single string summary; a real detector emits bounding boxes, defect classes, confidence, and a reference to the raw image.
Recipe has only ScanPoints; no focus, exposure, lighting, wafer map, coordinate system, calibration.
SafetySignals is a fixed record — real machines add signals over time. Consider a dictionary/bag with a typed façade.
No WaferId / LotId / OperatorId anywhere — impossible to correlate results to a wafer, which is the single most important identifier for inspection data.
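
To make the gap concrete, here is one possible shape for richer frame and result types. Every name and field below is hypothetical and chosen for illustration; a real schema would follow the camera SDK and the detector's actual output.

```csharp
using System;
using System.Collections.Generic;

var frame = new FrameSketch(
    CameraId: "cam-0", WaferId: "W-0001", CapturedAtUtc: DateTimeOffset.UtcNow,
    Width: 640, Height: 480, StrideBytes: 640, Format: PixelFormatSketch.Mono8,
    Pixels: new byte[640 * 480]);
Console.WriteLine(frame.IsConsistent); // prints "True"

public enum PixelFormatSketch { Mono8, Mono16, Bgr24 }

public sealed record FrameSketch(
    string CameraId, string WaferId, DateTimeOffset CapturedAtUtc,
    int Width, int Height, int StrideBytes, PixelFormatSketch Format, byte[] Pixels)
{
    public int BytesPerPixel => Format switch
    {
        PixelFormatSketch.Mono8 => 1,
        PixelFormatSketch.Mono16 => 2,
        PixelFormatSketch.Bgr24 => 3,
        _ => throw new ArgumentOutOfRangeException(nameof(Format)),
    };

    // Stride may exceed Width * BytesPerPixel when rows are padded.
    public bool IsConsistent =>
        StrideBytes >= Width * BytesPerPixel && Pixels.Length >= StrideBytes * Height;
}

// A defect carries a class, a confidence and a bounding box in pixel coordinates.
public sealed record DefectSketch(
    string DefectClass, double Confidence, int X, int Y, int Width, int Height);

public sealed record InspectionResultSketch(
    string WaferId, IReadOnlyList<DefectSketch> Defects);
```

Carrying WaferId on both the frame and the result is what makes later correlation possible without relying on timestamps.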

3.12 Configuration

appsettings.json contains only simulator profiles. Paths (%LocalAppData%\LcnWaferInspection\*) are hard-coded in InfrastructureServiceCollectionExtensions. A real app wants environment-aware configuration, a development.json override, and an ops-friendly config editor.
No feature flags / kill switches.
No appsettings.*.json per environment.

4. Readiness scorecard

Dimension	Prototype	Real-world production	Gap
Layering & dependency direction	✅	✅	low
Central app state, pure command guards	✅	✅	low
Workflow semantics (Stop/Abort/Fault/Recover)	✅	✅ (model is right)	low
Async / threading / cancellation	✅	✅	low
Bounded streaming pipelines	✅	✅	low
Vendor SDK abstraction shape	✅	⚠️	medium — interfaces will shift on first real SDK
Test coverage of application layer	✅	✅	low
Test coverage of UI	❌	—	medium — no UI tests (WPF UI test harness is hard; document as explicit non-goal or add)
CI / quality gates	❌	❌	high
Structured logging / observability / metrics	⚠️	❌	high
Crash handling, supervisor, single-instance	❌	❌	high
Persistence scale, schema versioning, DB	⚠️	❌	high
Auth, authz, audit, code signing	❌	❌	high
Safety-critical architecture (PLC / interlocks)	❌	❌	critical — by design
Real image/defect pipeline	❌	❌	high
Installer, update, versioning	❌	❌	high
Accessibility / localization / design system	❌	❌	medium
Documentation and ADR discipline	✅	✅	low

5. Recommended next steps

Grouped by what unlocks the most real-world value per unit of effort.

5.1 Cheap, high-leverage (do before calling anything "production-ready")

Add CI. A .github/workflows/ci.yml running dotnet restore / build / test on every PR. Gate merges on green. Add coverage upload. This is an afternoon and pays back forever.
Global exception handlers in App.xaml.cs — DispatcherUnhandledException, AppDomain.UnhandledException, TaskScheduler.UnobservedTaskException. Log and surface via the diagnostics panel.
Configure logging sinks. Wire AddSerilog (or AddOpenTelemetry) in Host.CreateDefaultBuilder. Persist to %LocalAppData%\LcnWaferInspection\logs\app-.log with rolling. Include run correlation via ILogger.BeginScope.
Single-instance guard (named mutex) to prevent two apps corrupting the JSON history file.
Repo hygiene: .editorconfig, Directory.Build.props with <TreatWarningsAsErrors>true</TreatWarningsAsErrors>, Directory.Packages.props to centralize versions.
Alarms collection diffing in MainViewModel.Project — mirror the ReferenceEquals pattern already used for other lists.
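
Two of the items above, global exception handlers and the single-instance guard, fit in a short startup sketch. The mutex name and the SingleInstance helper are hypothetical; the WPF-specific Application.DispatcherUnhandledException hook is left out so the sketch runs without a UI, and Console.Error stands in for the real logger.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Process-wide handlers. In the WPF app these would be registered in App.xaml.cs
// next to Application.DispatcherUnhandledException.
AppDomain.CurrentDomain.UnhandledException += (_, e) =>
    Console.Error.WriteLine($"FATAL: {e.ExceptionObject}");
TaskScheduler.UnobservedTaskException += (_, e) =>
{
    Console.Error.WriteLine($"Unobserved task fault: {e.Exception}");
    e.SetObserved();
};

// Hypothetical mutex name; any stable string works.
using var guard = SingleInstance.TryAcquire("LcnWaferInspection.SingleInstance");
if (guard is null)
{
    Console.WriteLine("Another instance is already running.");
    return;
}
Console.WriteLine("Primary instance; safe to open run-history.json.");

public static class SingleInstance
{
    // Returns the owned mutex, or null when another holder already exists.
    public static Mutex? TryAcquire(string name)
    {
        var mutex = new Mutex(initiallyOwned: true, name, out var createdNew);
        if (createdNew) return mutex;

        mutex.Dispose();
        return null;
    }
}
```

The mutex must stay alive for the lifetime of the process, which is why the sketch holds it in a top-level using declaration instead of a local inside a helper method.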

5.2 Medium — next slice's worth of work

Introduce a real schema-versioned persistence layer (SQLite + EF Core or Dapper). Migrate RunSummary, Alarm, DiagnosticsEntry onto it. Add export to JSON for operators.
Add a minimal WaferId / LotId / OperatorId identity story. Even a prompt-on-start textbox wired through to RunSummary is a huge leap in realism.
Publish a System.Diagnostics.Metrics meter: frames/sec, drops/sec, pipeline latency, run duration histogram. Surface in the diagnostics pane.
MSIX installer + code-signing pipeline. Per-environment config.
Split the monolithic MainWindow.xaml into UserControls per section. Add a theme / resource dictionary. Consider accessibility and localization infrastructure even if you don't translate yet.
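
For the metrics item, a sketch of a System.Diagnostics.Metrics meter with an in-process listener. The meter and instrument names are hypothetical, and the MeterListener stands in for whatever eventually consumes the values (the diagnostics pane or an exporter).

```csharp
using System;
using System.Diagnostics.Metrics;

using var meter = new Meter("InspectionPrototype.Pipelines"); // hypothetical meter name
var framesDropped = meter.CreateCounter<long>("frames.dropped");
var runDuration = meter.CreateHistogram<double>("run.duration", unit: "s");

long droppedTotal = 0;
using var listener = new MeterListener();

// Subscribe only to instruments that belong to this meter.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "InspectionPrototype.Pipelines")
        l.EnableMeasurementEvents(instrument);
};

// Called synchronously on the thread that records the measurement.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    if (instrument.Name == "frames.dropped") droppedTotal += value;
});
listener.Start();

framesDropped.Add(1);
framesDropped.Add(2);
runDuration.Record(12.5); // no double callback registered, so this is ignored here

Console.WriteLine($"frames.dropped total = {droppedTotal}"); // prints "frames.dropped total = 3"
```

Instruments cost very little when nothing is listening, so the counters can be added to the pipelines before any exporter or UI binding exists.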

5.3 Larger — becomes a different product

Swap one I*Controller for a real vendor SDK. Expect interface churn; use that to harden abstractions and add integration tests against a "hardware-in-loop" rig.
Real defect detection (OpenCvSharp / Emgu.CV / ONNX). Rework InspectionResult as a rich type with geometry and classifier output.
Split safety-critical logic into a PLC or dedicated motion controller talking over OPC UA / EtherCAT; keep C# as the operator's viewer of safety state, not the authority.
Factory integration: MES, SECS/GEM, historian (OSIsoft PI / InfluxDB). Likely its own subsystem.
Operator identity + role-based permissions + audit trail suitable for a regulated environment.

6. Closing assessment

Taken on its own terms — "a believable industrial desktop prototype that can be grown incrementally with AI tools" — this repository already delivers. The code is disciplined, the tests are real, and the paper trail is exemplary.

Taken as "a real wafer inspection product", it is the first 15–20% of the journey. The architecture is defensible enough that the remaining 80–85% can be added without rewrites (which is itself a production-grade outcome of a prototype). What it is missing is what separates software from a shipped industrial product: safety architecture, observability, persistence, identity, packaging, CI.

If the goal is to grow this into real-world software, prioritize §5.1 immediately (low-cost, removes silent-failure modes), plan §5.2 as the next umbrella slice, and treat §5.3 as a multi-quarter product roadmap — ideally each item with its own ADR and slice spec using the same docs-first method the repo has already proven out.

— End of review
