Production-Readiness Review
- Date: 2026-04-22
- Reviewer: Independent audit (Claude / Opus 4.7)
- Scope: Whole repository — is this "production-level" and ready to grow into real, real-world wafer inspection software?
- Verdict: Excellent demo / training artifact. Not production-ready. Strong architectural bones; multiple entire categories of production concern are intentionally out of scope and remain absent.
1. TL;DR
The repo is honestly and carefully positioned as a docs-first, AI-assisted desktop prototype that imitates an industrial wafer inspection workstation. Within that framing it is well above average:
- The architecture is layered, disciplined, and enforced. Domain is pure; Application owns orchestration; Infrastructure hides the "vendor SDK"; Presentation is a thin projection.
- State is centralized and strictly mutated through one store (ADR-001). Command guards are pure functions derived from canonical state.
- Streaming pipelines use bounded System.Threading.Channels with documented, test-verified DropOldest backpressure policies.
- Workflow semantics (Stop vs. Abort vs. Fault, acknowledgement vs. recovery) are encoded correctly and tested.
- 216 xUnit tests across 14 files plus reusable fakes; the suite covers the areas a first-slice review would ask about.
- Documentation (requirements, ADRs, specs, tasks, scenarios, architecture diagrams, course material) is unusually thorough and self-consistent.
However, the requirements themselves (§3 Non-Goals; §16 Future Expansion Areas) explicitly exclude almost every axis that separates a demo from a production machine-control application:
- real hardware integration
- production deployment
- MES / SCADA / historian / factory integration
- user management, authentication, authorization, auditability
- polished industrial UI and accessibility
- real image processing / vision algorithms
- reporting and analytics
- distributed / multi-station orchestration
So "production level" here means production-grade prototype engineering practice, not a product you can ship to a fab. That distinction matters and this review is written around it.
2. What makes it feel production-level (the good)
2.1 Architecture and layering
- Five projects with clean dependency direction: App → Presentation/Application/Infrastructure → Domain. No layering violations found.
- Domain is pure records (no framework references, no logic).
- Application depends only on Domain + Microsoft.Extensions.* abstractions.
- Infrastructure hides the "vendor SDK" behind interfaces (IMachineConnection, IMotionController, ICameraController, ILightController, IMachineSignals, IFaultInjector, ITelemetrySource, IFrameSource, IRunHistoryStore, IRecipeCatalog).
- Presentation has no Infrastructure reference. ViewModels talk only to IWorkflowService, IAppStateStore, IFaultInjector, ISimulatorProfileService.
This is the single most important thing a real industrial app must get right to survive SDK swaps and hardware changes, and this repo gets it right.
2.2 Central app state
- AppStateStore (src/InspectionPrototype.Application/Services/AppStateStore.cs) is the single mutation point. Every update is locked, applied via a reducer function, and broadcast via StateChanged.
- AppState (src/InspectionPrototype.Application/State/AppState.cs) is a pure record with an Initial factory, which is good for predictability, diffing, and testing.
- CommandGuards (src/InspectionPrototype.Application/Guards/CommandGuards.cs) are pure functions over AppState. ViewModels re-query them after every state change (MainViewModel.NotifyAllCommandsCanExecuteChanged).
- This matches ADR-001 and the requirements' §9.5. In a real machine-control app this pattern makes operator-command behaviour deterministic and auditable.
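The store-plus-guards pattern described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: DemoState, DemoStateStore, and DemoGuards are hypothetical names, and the real AppState carries many more fields.

```csharp
using System;

// Hypothetical, trimmed-down state record (assumption: the real AppState is richer).
public sealed record DemoState(bool IsConnected, bool IsRunning)
{
    public static DemoState Initial { get; } = new(false, false);
}

public sealed class DemoStateStore
{
    private readonly object _gate = new();
    private DemoState _current = DemoState.Initial;

    public event Action<DemoState>? StateChanged;

    public DemoState Current
    {
        get { lock (_gate) { return _current; } }
    }

    // The only mutation path: apply a pure reducer under the lock,
    // then notify subscribers outside it so a handler cannot deadlock the store.
    public void Update(Func<DemoState, DemoState> reducer)
    {
        DemoState next;
        lock (_gate)
        {
            next = reducer(_current);
            _current = next;
        }
        StateChanged?.Invoke(next);
    }
}

public static class DemoGuards
{
    // Pure function of state: no I/O and no hidden inputs, so it is trivially testable.
    public static bool CanStartRun(DemoState s) => s.IsConnected && !s.IsRunning;
}
```

A caller would write something like store.Update(s => s with { IsConnected = true }) and then re-evaluate DemoGuards.CanStartRun(store.Current). Raising the event outside the lock is one possible design choice; it trades strict notification ordering under contention for freedom from re-entrancy deadlocks.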
2.3 Workflow semantics
WorkflowService (src/InspectionPrototype.Application/Services/WorkflowService.cs:9-786) encodes the three-way distinction the requirements demand:
- Stop: cooperative; a flag is set and the run exits at the next scan-point boundary.
- Abort: immediate; run CTS is cancelled, loop unwinds via OperationCanceledException.
- Fault: critical; alarm added to state, workflow forced to Faulted, active run and homing cancelled, explicit RecoverAsync required after the condition clears.
The finally-block in RunLoopAsync maps _terminationReason to a RunTerminalStatus and constructs exactly one RunSummary. Acknowledgement and recovery are separated per §12.6. Concurrency around _runCts/_homeCts is guarded by _ctsLock.
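The Stop-versus-Abort distinction can be illustrated with a small sketch. It is deliberately simplified: DemoRunLoop and DemoTerminalStatus are hypothetical names, the Fault path and the _ctsLock guarding are omitted, and the status is returned directly rather than mapped in a finally block as the real RunLoopAsync is described as doing.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum DemoTerminalStatus { Completed, Stopped, Aborted }

public sealed class DemoRunLoop
{
    private readonly CancellationTokenSource _runCts = new();
    private volatile bool _stopRequested;

    public int PointsVisited { get; private set; }

    // Stop: cooperative. Only sets a flag; the loop notices it at the next boundary.
    public void Stop() => _stopRequested = true;

    // Abort: immediate. Cancels the token so in-flight work unwinds with an exception.
    public void Abort() => _runCts.Cancel();

    public async Task<DemoTerminalStatus> RunAsync(
        int scanPoints, Func<int, CancellationToken, Task> visitPoint)
    {
        try
        {
            for (var i = 0; i < scanPoints; i++)
            {
                // Scan-point boundary: the only place a Stop request is honoured.
                if (_stopRequested) return DemoTerminalStatus.Stopped;

                await visitPoint(i, _runCts.Token);
                PointsVisited++;
            }
            return DemoTerminalStatus.Completed;
        }
        catch (OperationCanceledException)
        {
            return DemoTerminalStatus.Aborted;
        }
    }
}
```

The point of the sketch is that a Stop request lets the current scan point finish, whereas an Abort interrupts it, so the two paths leave different amounts of completed work behind.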
2.4 Streaming / backpressure
- Telemetry: capacity 1, DropOldest (coalesce-to-latest).
- Frames: capacity 3, DropOldest (sliding recent-window).
- Drop and coalesce counters are incremented via Interlocked and surfaced through AppState.PipelineCounters, with diagnostic entries when events occur.
- StreamingPipelineTests asserts both channel policies and the pipeline-to-state bridge.
This is a textbook-correct answer to the requirements §9.6 and is the right shape for a real telemetry/frame pipeline.
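For readers unfamiliar with the policy, a bounded channel with DropOldest and a drop counter might look like the sketch below. LatestValueChannel is a hypothetical wrapper, not a type from the repository, and it uses the itemDropped callback overload of Channel.CreateBounded (available from .NET 6) as one way to count evictions; the repository may count them differently.

```csharp
using System.Threading;
using System.Threading.Channels;

// Hypothetical wrapper: capacity 1 gives coalesce-to-latest, capacity 3 a sliding window.
public sealed class LatestValueChannel<T>
{
    private readonly Channel<T> _channel;
    private long _dropped;

    public LatestValueChannel(int capacity)
    {
        var options = new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true,
        };

        // The callback runs each time a write evicts the oldest queued item.
        _channel = Channel.CreateBounded<T>(
            options, _ => Interlocked.Increment(ref _dropped));
    }

    public long Dropped => Interlocked.Read(ref _dropped);

    public ChannelReader<T> Reader => _channel.Reader;

    // With DropOldest, TryWrite on an open channel succeeds even when the buffer is full.
    public bool Publish(T item) => _channel.Writer.TryWrite(item);
}
```

Because the producer never blocks, a slow consumer costs stale items rather than producer latency, which is usually the right trade for telemetry and preview frames.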
2.5 Async/threading
- Host.CreateDefaultBuilder() + IHostedService + BackgroundService for pipelines.
- UI marshalling is centralized: MainViewModel captures Dispatcher.CurrentDispatcher in its constructor and every OnStateChanged goes through _dispatcher.Invoke. This is the correct pattern for WPF + background producers.
- Long-running simulator tasks (SimulatedCamera, SimulatedTelemetrySource, SimulatedMotionController.InterpolateAsync) honour cancellation tokens.
- No .Result / .Wait() / .GetAwaiter().GetResult() on the UI thread.
2.6 Persistence
- JsonRunHistoryStore uses temp-file-then-File.Move for atomic-ish writes, swallows parse errors, logs, and returns empty history.
- JsonRecipeCatalog validates each file, returns per-file Valid/Invalid results with reasons, detects duplicate recipeIds, and is deterministic across runs via sorted filenames.
- SampleRecipeProvisioningService seeds starter recipes on first launch without ever overwriting.
- HistoryHydrationService / RecipeCatalogHydrationService / SimulatorProfileHydrationService are IHostedServices so the UI sees stable data before the first frame is painted.
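The temp-file-then-move technique mentioned for JsonRunHistoryStore can be sketched as below. AtomicJsonFile is a hypothetical helper written for illustration; it assumes the temp file sits next to the target (same volume) and does not reproduce the store's actual API.

```csharp
using System;
using System.IO;
using System.Text.Json;

public static class AtomicJsonFile
{
    // Write to a sibling temp file first, then swap it into place, so a crash
    // mid-write leaves the previous file intact rather than a truncated one.
    public static void Write<T>(string path, T value) where T : class
    {
        var tempPath = path + ".tmp";
        File.WriteAllText(tempPath, JsonSerializer.Serialize(value));
        File.Move(tempPath, path, overwrite: true);
    }

    // A corrupt file is logged and treated as "no data" instead of crashing startup.
    public static T ReadOrFallback<T>(string path, T fallback, Action<Exception> log)
        where T : class
    {
        if (!File.Exists(path)) return fallback;
        try
        {
            return JsonSerializer.Deserialize<T>(File.ReadAllText(path)) ?? fallback;
        }
        catch (JsonException ex)
        {
            log(ex);
            return fallback;
        }
    }
}
```

This protects against torn writes within one process only. It does nothing about two processes doing read-modify-write on the same file, which is the cross-process gap raised in §3.3.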
2.7 Tests
- 216 [Fact] / [Theory] cases across 14 files; WorkflowServiceTestContext plus fakes for every vendor-SDK abstraction.
- Covers: command guards, start preconditions, stop-vs-abort, fault transition, acknowledgement, recovery, bounded streaming, recipe JSON validation and duplicate handling, sample provisioning, history round-trip, history hydration, simulator profiles (incl. live switching guard), Slice-004 regression suite, alarm lifecycle, run metrics.
- Includes a ThrowingRunHistoryStore stub, so the tests actively exercise infrastructure-failure paths.
2.8 Documentation
- Requirements are split into seven sections with a hub page.
- Four ADRs (one still "Proposed"), each linked to the sections it implements.
- Five architecture pages (system context, project/layer map, domain, workflow, runtime sequences).
- Four slice specs and matching TASK-### plans, plus scenarios, sample recipes, and a course track.
- VitePress site builds from docs/ with a package.json script.
This is a better paper trail than most commercial projects actually maintain.
3. What is missing for real production (the gaps)
These are not criticisms of the prototype — most are explicitly out of scope per §3 / §16. They are listed here so the bar to "real-world production" is visible.
3.1 Safety, certification, and determinism (the biggest gap)
A real wafer inspection workstation is adjacent to capital equipment and human operators. It needs to address:
- Safety-function separation: E-Stop, interlocks, and door-closed logic must be implemented (or at minimum asserted) by certified hardware / PLC and never by a WPF process. Today the C# layer owns safety state; on a real machine this code is a monitor, not an authority.
- Functional-safety classification (IEC 61508 / SEMI S2 / S8 considerations) — not addressed.
- Deterministic timing: Task.Delay + ThreadPool is acceptable for simulation; not acceptable for anything motion-critical. On a real stage, motion is driven by a deterministic motion controller.
- Watchdogs, heartbeats, E-Stop latching in software: none present.
3.2 Real vendor SDK integration
Everything behind I*Controller interfaces is simulated. The interface shapes are reasonable but have not survived contact with a real SDK. Expect real SDKs to surface:
- Async cancellation semantics that don't match CancellationToken.
- Out-of-band error callbacks / C++ exceptions crossing P/Invoke.
- Long-duration blocking calls that are not actually cancellable.
- Licensing / hardware dongles / per-machine key provisioning.
None of these are modelled. SimulatedMachineConnection throws no exceptions; it returns bool. Real SDKs rarely cooperate that nicely.
3.3 Process resilience and crash handling
- No App.DispatcherUnhandledException, AppDomain.CurrentDomain.UnhandledException, or TaskScheduler.UnobservedTaskException handler. A single unhandled exception on a background Task.Run in WorkflowService.DoConnectAsync / DoHomeAsync / RunLoopAsync could go unnoticed: if the task is never awaited the exception is dropped without a log entry, and if it escapes onto a thread with no handler it terminates the process with no trace.
- No crash reporting (e.g. Sentry, AppCenter, Watson minidump hookup).
- No auto-restart / supervisor policy for hosted services that fault.
- Single-instance enforcement is not implemented (two instances could fight over %LocalAppData%\LcnWaferInspection\run-history.json).
- JsonRunHistoryStore.SaveAsync does a read-modify-write on the whole history file with no cross-process lock. Two app instances can corrupt it.
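One common way to close the single-instance gap is a named mutex taken at startup. The sketch below is illustrative only: SingleInstanceGuard is a hypothetical type, and the mutex name a real app would use (including any Local\ or Global\ prefix on Windows) is a deployment decision this review does not make.

```csharp
using System;
using System.Threading;

// Hypothetical guard: the first process to create the named mutex is the primary.
public sealed class SingleInstanceGuard : IDisposable
{
    private readonly Mutex _mutex;

    public SingleInstanceGuard(string name)
    {
        // createdNew is true only for the caller that actually created the mutex;
        // that caller also owns it because initiallyOwned is true.
        _mutex = new Mutex(true, name, out var createdNew);
        IsPrimary = createdNew;
    }

    public bool IsPrimary { get; }

    // Must run on the thread that constructed the guard, because mutex
    // ownership in .NET is tied to the acquiring thread.
    public void Dispose()
    {
        if (IsPrimary) _mutex.ReleaseMutex();
        _mutex.Dispose();
    }
}
```

At startup the app would construct the guard, check IsPrimary, and exit (or activate the existing window) when it is false. This prevents a second instance from running at all; it is not a substitute for file-level locking if multiple writers are ever intended.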
3.4 Observability
- Uses Microsoft.Extensions.Logging, but nothing is configured in App.xaml.cs / appsettings.json: no sinks (file, Seq, Serilog, OpenTelemetry, Event Log). Logs vanish at process exit.
- No structured log correlation IDs (e.g., the run Guid could be a scope; right now it is only included in diagnostic entries).
- No metrics (System.Diagnostics.Metrics), no OpenTelemetry tracing, no health-check endpoints.
- RecentDiagnostics is capped at 200 entries in memory and never persisted. A real workstation needs a durable audit log.
3.5 Data persistence at scale
- JSON file for run history is fine for demos; at >~1,000 runs it becomes slow to read-parse-write on every completion. No pagination, no rollover, no archiving policy.
- No schema migration story for RunSummary / Recipe (JsonStringEnumConverter helps; a missing field would silently deserialize to default).
- No database abstraction. SQLite at minimum is expected at production scale (defect records, per-frame results, wafer maps, operator IDs).
- No backup / export / restore commands.
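The "missing field silently deserializes to default" risk can be reduced with an explicit schema version and required members. The sketch below assumes C# 11 and System.Text.Json on .NET 7 or later, where a missing required property makes deserialization throw a JsonException. VersionedRunSummary and RunSummaryReader are hypothetical; the real RunSummary has a different shape.

```csharp
using System;
using System.Text.Json;

// Hypothetical persisted shape with an explicit schema version.
public sealed record VersionedRunSummary
{
    public required int SchemaVersion { get; init; }
    public required Guid RunId { get; init; }
    public required string TerminalStatus { get; init; }
}

public static class RunSummaryReader
{
    public const int CurrentSchemaVersion = 2; // illustrative value

    public static VersionedRunSummary Parse(string json)
    {
        // Throws JsonException if any required property is absent from the payload.
        var summary = JsonSerializer.Deserialize<VersionedRunSummary>(json)
            ?? throw new JsonException("Run summary payload was null.");

        // Refuse data written by a newer build instead of guessing at its meaning.
        if (summary.SchemaVersion > CurrentSchemaVersion)
        {
            throw new NotSupportedException(
                $"Schema version {summary.SchemaVersion} is newer than this build supports.");
        }

        return summary;
    }
}
```

Older versions would still need an upgrade step (read the old shape, map it to the current one); the sketch only shows how to make version and field mismatches loud rather than silent.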
3.6 Security and auditability
- No authentication or operator identity. RunSummary records what happened, not who did it.
- No role separation between Operator, Engineer, Maintenance, Administrator: the diagnostics panel exposes fault injection to anyone with the UI.
- No code signing configured (signtool, Authenticode). Installers must be signed for acceptance in most fabs.
- appsettings.json has no secret handling story. That is fine today because there are none, but the shape will need Key Vault / DPAPI eventually.
3.7 Deployment
- No installer project (MSIX, WiX, Inno Setup, ClickOnce).
- No versioning / update mechanism. AssemblyInfo.cs exists but nothing increments it.
- No packaging of appsettings overrides per environment (dev / QA / factory).
- No uninstall / clean-up of %LocalAppData%\LcnWaferInspection.
3.8 CI / build / quality gates
- No CI configured. There is no .github/workflows/, azure-pipelines.yml, or equivalent. Tests are present but nothing runs them on push.
- No Directory.Build.props / Directory.Packages.props for centralized versions. Hosting is 9.0.4 while target framework is net10.0-windows: functional, but drifts easily.
- No static analysis (<TreatWarningsAsErrors>, <AnalysisLevel>latest-all</AnalysisLevel>, .editorconfig for diagnostics, Roslynator, StyleCop).
- No coverage gate despite coverlet.collector being referenced.
- No dependency-vulnerability scanning (Dependabot, dotnet list package --vulnerable).
- .NET 10 is a very fresh target; pinning / supported-LTS discussion is missing.
3.9 UI and UX
- Single MainWindow.xaml is ~640 lines of inline layout and data triggers. Fine for a prototype; a real operator UI would split into UserControls, use a design system, and include keyboard-first operation, high-contrast / accessibility support, and localization (no .resx, no RESW, all strings are inline).
- No virtualization on Alarms or DiagnosticsLog beyond the default ListView behaviour; at high event rates this will get sluggish.
- Live Preview area is a placeholder with no actual BitmapSource / WriteableBitmap wiring, which is a major real-world concern (memory churn, GC pauses).
- No DPI/scaling assertions, no per-monitor DPI configured in the app manifest.
3.10 Concurrency correctness (minor, worth tightening)
Most of the threading is fine, but four points to watch:
- MainViewModel.Project calls Alarms.Clear() + foreach Add(...) on every state change, even when alarms haven't changed. At high event rates this is wasteful and causes ListView selection churn. The VM already uses ReferenceEquals tracking for other collections (RunHistoryItems, RecipeCatalog, SimulatorProfileCatalog, RecentDiagnostics). Apply the same to Alarms.
- SimulatorFaultInjector mutates a plain HashSet<string> without a lock. Fault injection is currently invoked only from the UI thread, but nothing prevents a future caller from doing it elsewhere. Guard it with a Lock, or switch to ConcurrentDictionary<string, byte>.
- RunLoopAsync checks _terminationReason without the lock at line 553. Reads of a volatile enum are atomic, but the pattern is inconsistent with the rest of the file, where every read uses the lock. Harmonize.
- SimulatedCamera.ProduceFramesAsync increments _droppedCount based on _frameChannel.Reader.Count >= ChannelCapacity before TryWrite, which is a race: the count can drop between the check and the write, causing occasional false-positive drop counters. Harmless in practice, but worth noting.
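For the fault-injector point, the smallest fix is to put every access to the set behind one lock. LockedFaultSet below is a hypothetical stand-in for SimulatorFaultInjector's internal state, using a plain object as the gate so it also compiles on runtimes older than the one that introduced System.Threading.Lock.

```csharp
using System.Collections.Generic;

// Hypothetical replacement for an unguarded HashSet<string> of active fault codes.
public sealed class LockedFaultSet
{
    private readonly object _gate = new();
    private readonly HashSet<string> _active = new();

    // Returns true if the fault was not already active.
    public bool Inject(string faultCode)
    {
        lock (_gate) { return _active.Add(faultCode); }
    }

    // Returns true if the fault was active and has been removed.
    public bool Clear(string faultCode)
    {
        lock (_gate) { return _active.Remove(faultCode); }
    }

    public bool IsActive(string faultCode)
    {
        lock (_gate) { return _active.Contains(faultCode); }
    }

    // Copy under the lock so callers can enumerate without racing writers.
    public IReadOnlyList<string> Snapshot()
    {
        lock (_gate) { return new List<string>(_active); }
    }
}
```

A ConcurrentDictionary<string, byte> would work equally well; the lock version is shown because it keeps the existing HashSet semantics and makes the snapshot consistent.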
3.11 Domain modelling gaps
- Frame.PreviewPayload is byte[]? with no codec / size / stride. Real preview frames need stride, pixel format, bit depth, capture timestamp (hardware clock), camera id, ROI.
- InspectionResult carries a single string summary; a real detector emits bounding boxes, defect classes, confidence, and a reference to the raw image.
- Recipe has only ScanPoints; no focus, exposure, lighting, wafer map, coordinate system, calibration.
- SafetySignals is a fixed record, whereas real machines add signals over time. Consider a dictionary/bag with a typed façade.
- No WaferId / LotId / OperatorId anywhere, so it is impossible to correlate results to a wafer, which is the single most important identifier for inspection data.
3.12 Configuration
- appsettings.json contains only simulator profiles. Paths (%LocalAppData%\LcnWaferInspection\*) are hard-coded in InfrastructureServiceCollectionExtensions. A real app wants environment-aware configuration, a development.json override, and an ops-friendly config editor.
- No feature flags / kill switches.
- No appsettings.*.json per environment.
4. Readiness scorecard
| Dimension | Prototype | Real-world production | Gap |
|---|---|---|---|
| Layering & dependency direction | ✅ | ✅ | low |
| Central app state, pure command guards | ✅ | ✅ | low |
| Workflow semantics (Stop/Abort/Fault/Recover) | ✅ | ✅ (model is right) | low |
| Async / threading / cancellation | ✅ | ✅ | low |
| Bounded streaming pipelines | ✅ | ✅ | low |
| Vendor SDK abstraction shape | ✅ | ⚠️ | medium — interfaces will shift on first real SDK |
| Test coverage of application layer | ✅ | ✅ | low |
| Test coverage of UI | ❌ | — | medium — no UI tests (WPF UI test harness is hard; document as explicit non-goal or add) |
| CI / quality gates | ❌ | ❌ | high |
| Structured logging / observability / metrics | ⚠️ | ❌ | high |
| Crash handling, supervisor, single-instance | ❌ | ❌ | high |
| Persistence scale, schema versioning, DB | ⚠️ | ❌ | high |
| Auth, authz, audit, code signing | ❌ | ❌ | high |
| Safety-critical architecture (PLC / interlocks) | ❌ | ❌ | critical — by design |
| Real image/defect pipeline | ❌ | ❌ | high |
| Installer, update, versioning | ❌ | ❌ | high |
| Accessibility / localization / design system | ❌ | ❌ | medium |
| Documentation and ADR discipline | ✅ | ✅ | low |
5. Recommended next steps
Grouped by what unlocks the most real-world value per unit of effort.
5.1 Cheap, high-leverage (do before calling anything "production-ready")
- Add CI. A .github/workflows/ci.yml running dotnet restore / build / test on every PR. Gate merges on green. Add coverage upload. This is an afternoon and pays back forever.
- Global exception handlers in App.xaml.cs: DispatcherUnhandledException, AppDomain.UnhandledException, TaskScheduler.UnobservedTaskException. Log and surface via the diagnostics panel.
- Configure logging sinks. Wire AddSerilog (or AddOpenTelemetry) in Host.CreateDefaultBuilder. Persist to %LocalAppData%\LcnWaferInspection\logs\app-.log with rolling. Include run correlation via ILogger.BeginScope.
- Single-instance guard (named mutex) to prevent two apps corrupting the JSON history file.
- Repo hygiene: .editorconfig, Directory.Build.props with <TreatWarningsAsErrors>true</TreatWarningsAsErrors>, Directory.Packages.props to centralize versions.
- Alarms collection diffing in MainViewModel.Project: mirror the ReferenceEquals pattern already used for other lists.
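As an illustration of the Alarms diffing item, the reference-equality check can be as small as the sketch below. AlarmListProjector is a hypothetical class, alarms are modelled as strings for brevity, and the approach assumes the state holds an immutable list that is replaced (not mutated) whenever alarms change.

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical projector: rebuilds the bound collection only when the source list instance changes.
public sealed class AlarmListProjector
{
    private IReadOnlyList<string>? _lastSource;

    public ObservableCollection<string> Alarms { get; } = new();

    // Exposed only so the effect of the check is observable in a test.
    public int RebuildCount { get; private set; }

    public void Project(IReadOnlyList<string> source)
    {
        // Same instance as last time means nothing changed: skip the Clear/Add churn.
        if (ReferenceEquals(source, _lastSource)) return;

        _lastSource = source;
        Alarms.Clear();
        foreach (var alarm in source) Alarms.Add(alarm);
        RebuildCount++;
    }
}
```

If the state ever mutates the alarm list in place, this check would miss changes, so it depends on the immutable-state discipline the review already credits the store with.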
5.2 Medium — next slice's worth of work
- Introduce a real schema-versioned persistence layer (SQLite + EF Core or Dapper). Migrate RunSummary, Alarm, DiagnosticsEntry onto it. Add export to JSON for operators.
- Add a minimal WaferId / LotId / OperatorId identity story. Even a prompt-on-start textbox wired through to RunSummary is a huge leap in realism.
- Publish a System.Diagnostics.Metrics meter: frames/sec, drops/sec, pipeline latency, run duration histogram. Surface in the diagnostics pane.
- MSIX installer + code-signing pipeline. Per-environment config.
- Split the monolithic MainWindow.xaml into UserControls per section. Add a theme / resource dictionary. Consider accessibility and localization infrastructure even if you don't translate yet.
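For the metrics item, a System.Diagnostics.Metrics meter could start as small as the sketch below. The meter name, instrument names, and the PipelineMetrics type are all placeholders chosen for illustration; only the Meter, Counter, and Histogram APIs themselves come from the framework.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical metrics facade for the streaming pipeline and run loop.
public sealed class PipelineMetrics : IDisposable
{
    public const string MeterName = "InspectionPrototype.Demo"; // placeholder name

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _framesDropped;
    private readonly Histogram<double> _runDurationSeconds;

    public PipelineMetrics()
    {
        _framesDropped = _meter.CreateCounter<long>("frames.dropped");
        _runDurationSeconds = _meter.CreateHistogram<double>("run.duration", unit: "s");
    }

    // Call wherever the pipeline evicts a frame.
    public void FrameDropped() => _framesDropped.Add(1);

    // Call once per run, when its terminal status is decided.
    public void RunCompleted(TimeSpan duration) =>
        _runDurationSeconds.Record(duration.TotalSeconds);

    public void Dispose() => _meter.Dispose();
}
```

Nothing is exported until a listener subscribes (an in-process MeterListener, dotnet-counters, or an OpenTelemetry exporter), so instruments like these can be added before the export story is decided.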
5.3 Larger — becomes a different product
- Swap one I*Controller for a real vendor SDK. Expect interface churn; use that to harden abstractions and add integration tests against a "hardware-in-loop" rig.
- Real defect detection (OpenCvSharp / Emgu.CV / ONNX). Reintroduce InspectionResult as a rich type with geometry and classifier output.
- Split safety-critical logic into a PLC or dedicated motion controller talking over OPC UA / EtherCAT; keep C# as the operator's viewer of safety state, not the authority.
- Factory integration: MES, SECS/GEM, historian (OSIsoft PI / InfluxDB). Likely its own subsystem.
- Operator identity + role-based permissions + audit trail suitable for a regulated environment.
6. Closing assessment
Taken on its own terms — "a believable industrial desktop prototype that can be grown incrementally with AI tools" — this repository already delivers. The code is disciplined, the tests are real, and the paper trail is exemplary.
Taken as "a real wafer inspection product", it is the first 15–20% of the journey. The architecture is defensible enough that the remaining 80–85% can be added without rewrites (which is itself a production-grade outcome of a prototype). The things it's missing are the things that separate software from a shipped industrial product: safety architecture, observability, persistence, identity, packaging, CI.
If the goal is to grow this into real-world software, prioritize §5.1 immediately (low-cost, removes silent-failure modes), plan §5.2 as the next umbrella slice, and treat §5.3 as a multi-quarter product roadmap — ideally each item with its own ADR and slice spec using the same docs-first method the repo has already proven out.
— End of review