SLICE-004: Operational Maturity
- Status: Completed
- Date: 2026-04-16
- Depends on: Requirements, ADR-001: Use Central App State Store, ADR-002: File-Backed Run History Store Before Database Persistence, ADR-004: Use One Operational Maturity Slice Before Specialized Modules
Goal
Consolidate the remaining core realism work into one bounded slice that makes the prototype easier to inspect, recover, configure, and demonstrate, without reopening the architecture across several tiny follow-up slices.
Why This Slice
After persistent run history and JSON recipe management, the next highest-value improvement is operational maturity:
- better diagnostics and runtime visibility
- more explicit alarm acknowledgment and recovery semantics
- configurable simulator behavior for different teaching and demo scenarios
- richer inspection results and live run metrics
Treating these as one umbrella slice keeps the shared context in one place and lets external AI tools work through a medium-sized task pack instead of one very large prompt or several overly fragmented ones.
Requirements Coverage
This slice extends or activates these requirement areas:
- 03. Functional Scope: inspection results, metrics, fault support, and diagnostics behavior
- 04. UI and Technical Requirements: diagnostics surface, fault controls, measurable pipeline behavior, and testability
- 05. Failure Modes and Workflow Requirements: explicit fault handling, blocked commands, and recovery semantics
- 07. AI Delivery Constraints and Roadmap: grouped medium-sized tasks for AI-assisted implementation
In Scope
- extend canonical app state with structured diagnostics timeline and operational counters
- provide a richer operational workspace in the UI for alarms, diagnostics, fault injection, live metrics, and selected simulator profile
- harden alarm acknowledgment, fault clearance, and recovery or reset semantics
- introduce named simulator profiles loaded from configuration
- enrich live inspection results and persisted run summaries with more useful metrics
- keep new behavior testable without launching the full UI where practical
Out of Scope
- historical charts or long-term analytics dashboards
- a new multi-page or multi-window application architecture
- advanced image synthesis or realistic computer vision output
- hot reload or editing UI for simulator profiles
- explicit adoption of a third-party state machine library
- performance instrumentation as a required outcome of this slice
Runtime Behavior
Operational Workspace
The app should expose one richer diagnostics-oriented workspace or pane rather than scattering these features across several unrelated screens.
That workspace should make it possible to see:
- active alarms
- recent diagnostics timeline entries
- injected fault controls
- selected simulator profile
- live run metrics and relevant counters
The goal is not a full production HMI, but a believable operator and developer surface for understanding what the system is doing.
Diagnostics Timeline
The system must maintain a structured diagnostics timeline in canonical app state.
At minimum, timeline entries should capture:
- timestamp
- severity or importance
- source or subsystem
- short message
- optional run correlation data where useful
The timeline must be bounded so it cannot grow without limit during long sessions.
The timeline should record major operational events such as:
- connect and disconnect
- recipe load or refresh events
- homing
- run start, stop, abort, complete, and fault
- alarm acknowledgment
- recovery or reset
- simulator profile changes
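The entry fields and the bounded-capacity rule above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the names `TimelineEntry`, `DiagnosticsTimeline`, `Severity`, `run_id`, and the default capacity of 500 are all assumptions introduced here.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Deque, List, Optional


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    severity: Severity
    source: str                    # subsystem that produced the entry
    message: str
    run_id: Optional[str] = None   # optional run correlation data (hypothetical field)


class DiagnosticsTimeline:
    """Bounded timeline: once capacity is reached, the oldest entry is evicted."""

    def __init__(self, capacity: int = 500) -> None:  # 500 is an assumed default
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._entries: Deque[TimelineEntry] = deque(maxlen=capacity)
        self.evicted_count = 0  # how many old entries were pushed out

    @property
    def capacity(self) -> int:
        return self._entries.maxlen

    def record(self, severity: Severity, source: str, message: str,
               run_id: Optional[str] = None) -> TimelineEntry:
        entry = TimelineEntry(datetime.now(timezone.utc), severity, source, message, run_id)
        if len(self._entries) == self._entries.maxlen:
            self.evicted_count += 1  # deque(maxlen=...) drops the oldest on append
        self._entries.append(entry)
        return entry

    def recent(self, limit: int = 50) -> List[TimelineEntry]:
        """Return up to `limit` newest entries, oldest first."""
        if limit <= 0:
            return []
        return list(self._entries)[-limit:]
```

Keeping an eviction counter next to the buffer is one way to make the bound itself observable, so a long session shows that older entries were discarded rather than never recorded.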
Alarm Lifecycle and Recovery
Alarm handling must become more explicit than it was in the first slices.
For this slice:
- acknowledgment is separate from condition clearance
- acknowledgment marks that the operator has seen the alarm but does not by itself re-enable blocked commands
- critical faults still transition active work to Faulted
- after a critical fault, Start, Home, and motion commands remain blocked until the fault condition is cleared and an explicit recovery or reset action occurs
- recovery must create diagnostics entries and preserve the history of the faulted run
- after successful recovery, the machine may return to Idle or Ready depending on current prerequisites
The slice does not need a large enterprise alarm model, but it must make the distinction between seen, cleared, and recovered behavior explicit.
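The separation between acknowledgment, condition clearance, and recovery can be illustrated with a small guard model. This is a sketch under stated assumptions: the `Workflow` and `Alarm` classes, the command name `Move`, and the `prerequisites_met` flag are hypothetical, and the plain list stands in for the diagnostics timeline. It also assumes recovery does not require prior acknowledgment, which the spec leaves open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class MachineState(Enum):
    IDLE = "Idle"
    READY = "Ready"
    RUNNING = "Running"
    FAULTED = "Faulted"


@dataclass
class Alarm:
    code: str
    acknowledged: bool = False     # the operator has seen the alarm
    condition_active: bool = True  # the underlying fault condition is still present


class Workflow:
    # "Move" stands in for the motion commands; the name is an assumption.
    BLOCKED_WHEN_FAULTED = {"Start", "Home", "Move"}

    def __init__(self) -> None:
        self.state = MachineState.RUNNING
        self.alarm: Optional[Alarm] = None
        self.diagnostics: List[str] = []  # stand-in for the diagnostics timeline

    def raise_critical_fault(self, code: str) -> None:
        self.alarm = Alarm(code)
        self.state = MachineState.FAULTED
        self.diagnostics.append(f"fault raised: {code}")

    def acknowledge(self) -> None:
        # Acknowledgment only marks the alarm as seen; the state stays Faulted.
        if self.alarm is not None:
            self.alarm.acknowledged = True
            self.diagnostics.append(f"alarm acknowledged: {self.alarm.code}")

    def clear_condition(self) -> None:
        # Clearing the condition is necessary for recovery but does not recover.
        if self.alarm is not None:
            self.alarm.condition_active = False
            self.diagnostics.append(f"fault condition cleared: {self.alarm.code}")

    def can_execute(self, command: str) -> bool:
        return not (self.state is MachineState.FAULTED
                    and command in self.BLOCKED_WHEN_FAULTED)

    def recover(self, prerequisites_met: bool) -> bool:
        """Explicit recovery: allowed only from Faulted with the condition cleared."""
        if self.state is not MachineState.FAULTED or self.alarm is None:
            return False
        if self.alarm.condition_active:
            self.diagnostics.append("recovery rejected: fault condition still active")
            return False
        self.alarm = None
        self.state = MachineState.READY if prerequisites_met else MachineState.IDLE
        self.diagnostics.append(f"recovered to {self.state.value}")
        return True
```

In this model, `acknowledge()` followed by `can_execute("Start")` still returns `False`; commands are re-enabled only after both `clear_condition()` and a successful `recover(...)`.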
Simulator Profiles
The simulator should support named profiles loaded from configuration rather than relying only on hard-coded runtime constants.
For the first profile version:
- available profiles are loaded at startup
- one profile is selected as the active profile
- profile selection is visible to the operator
- profile changes apply only to future operations and must not silently mutate an active run
- profile changes create diagnostics entries
Profiles may shape behavior such as:
- motion timing
- telemetry cadence
- preview frame cadence
- defect density or result distribution
- fault sensitivity or other safe scenario parameters
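One way to picture the profile rules above is a loader plus a selector that hands each run an immutable snapshot. The JSON layout, the field names (`move_duration_ms`, `telemetry_interval_ms`, `defect_probability`), the profile names, and the `ProfileSelector` class are all assumptions made for illustration; the spec does not define a configuration schema.

```python
import json
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical configuration content; the real schema is not defined by this spec.
EXAMPLE_CONFIG = """
{
  "profiles": [
    {"name": "teaching", "move_duration_ms": 800,
     "telemetry_interval_ms": 500, "defect_probability": 0.05},
    {"name": "stress", "move_duration_ms": 100,
     "telemetry_interval_ms": 50, "defect_probability": 0.4}
  ]
}
"""


@dataclass(frozen=True)  # frozen, so a run's snapshot cannot be mutated later
class SimulatorProfile:
    name: str
    move_duration_ms: int
    telemetry_interval_ms: int
    defect_probability: float


def load_profiles(config_text: str) -> Dict[str, SimulatorProfile]:
    """Parse all profiles once at startup and index them by name."""
    profiles: Dict[str, SimulatorProfile] = {}
    for item in json.loads(config_text)["profiles"]:
        profile = SimulatorProfile(**item)  # unknown keys raise TypeError
        if not 0.0 <= profile.defect_probability <= 1.0:
            raise ValueError(f"invalid defect_probability in {profile.name!r}")
        if profile.name in profiles:
            raise ValueError(f"duplicate profile name {profile.name!r}")
        profiles[profile.name] = profile
    if not profiles:
        raise ValueError("configuration defines no profiles")
    return profiles


class ProfileSelector:
    def __init__(self, profiles: Dict[str, SimulatorProfile], active_name: str) -> None:
        self._profiles = profiles
        self.active = profiles[active_name]
        self.diagnostics: List[str] = []  # stand-in for the diagnostics timeline

    def select(self, name: str) -> None:
        if name not in self._profiles:
            raise KeyError(name)
        previous = self.active.name
        self.active = self._profiles[name]
        self.diagnostics.append(f"profile changed: {previous} -> {name}")

    def begin_run(self) -> SimulatorProfile:
        # The run keeps this reference; later select() calls do not affect it.
        return self.active
```

Because a run holds the frozen profile it started with, switching profiles mid-run only changes what the next `begin_run()` returns.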
Inspection Results and Run Metrics
Inspection results should become more informative than a minimal defect count.
For this slice, the system should expose richer but still simple results such as:
- scan points completed versus total
- elapsed run duration
- total detected defects
- defect counts grouped by simple severity or category where practical
- selected simulator profile name
- completion reason
These metrics should be visible during active work where appropriate and should also flow into persisted run summaries and history projection.
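The result fields listed above could be carried by a single metrics object that is updated during the run and flattened into the persisted summary at the end. This is a sketch only: `RunMetrics`, its field names, and the severity labels are hypothetical and not taken from the existing run history format.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class RunMetrics:
    total_points: int
    profile_name: str                      # selected simulator profile name
    completed_points: int = 0
    elapsed_seconds: float = 0.0
    defects_by_severity: Dict[str, int] = field(default_factory=dict)
    completion_reason: Optional[str] = None  # set when the run ends

    @property
    def total_defects(self) -> int:
        return sum(self.defects_by_severity.values())

    def record_point(self, defect_severities: Iterable[str], elapsed_seconds: float) -> None:
        """Update live metrics after one scan point has been inspected."""
        self.completed_points += 1
        self.elapsed_seconds = elapsed_seconds
        for severity in defect_severities:
            self.defects_by_severity[severity] = self.defects_by_severity.get(severity, 0) + 1

    def to_summary(self) -> dict:
        """Flatten into a plain dict suitable for a JSON run summary."""
        summary = asdict(self)
        summary["total_defects"] = self.total_defects
        return summary
```

Using the same object for the live view and the persisted summary keeps the two from drifting apart, at the cost of coupling the history format to the runtime model.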
Observability and Counters
Where queues, channels, or coalescing behavior already exist, the system should expose enough counters or diagnostics to understand when data is processed, dropped, or coalesced.
This slice does not require a full telemetry platform, but it should make important backpressure behavior visible through state, diagnostics, logs, or a small diagnostics surface.
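As an illustration of what such counters might look like, the sketch below models a latest-value slot that coalesces updates when the consumer falls behind. It is single-threaded and purely hypothetical: `LatestValueSlot` and `PipelineCounters` are names invented here, and the real pipeline may use different queue or channel primitives.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class PipelineCounters:
    received: int = 0   # values published by the producer
    processed: int = 0  # values actually taken by the consumer
    coalesced: int = 0  # values overwritten before anyone consumed them


class LatestValueSlot(Generic[T]):
    """Single-slot buffer that keeps only the newest value (not thread-safe)."""

    def __init__(self) -> None:
        self._value: Optional[T] = None
        self._has_value = False
        self.counters = PipelineCounters()

    def publish(self, value: T) -> None:
        self.counters.received += 1
        if self._has_value:
            # The previous value was never consumed; count it as coalesced.
            self.counters.coalesced += 1
        self._value = value
        self._has_value = True

    def take(self) -> Optional[T]:
        """Return the pending value, or None if nothing is waiting."""
        if not self._has_value:
            return None
        value = self._value
        self._value = None
        self._has_value = False
        self.counters.processed += 1
        return value
```

With this accounting, `received == processed + coalesced` holds whenever the slot is empty, which gives a simple invariant for tests and for a diagnostics pane.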
Acceptance Criteria
This slice is satisfied only if all of the following are true:
- The system records structured diagnostics timeline entries for major operational events including connection, recipe load, homing, run state changes, faults, acknowledgment, recovery, and profile changes.
- Diagnostics timeline state is exposed through canonical app state with a documented bounded capacity.
- The UI provides an operational workspace or pane showing active alarms, recent diagnostics, fault injection controls, selected simulator profile, and live metrics.
- Injecting a critical fault during active work raises an alarm, transitions the workflow to Faulted, preserves the run summary, and blocks invalid commands until the condition is cleared and an explicit recovery or reset occurs.
- Alarm acknowledgment is tracked separately from clearance and recovery, and acknowledgment alone does not re-enable blocked commands.
- The operator can view and switch between named simulator profiles loaded from configuration, and profile changes apply only to future operations.
- Active runs expose richer metrics and results, and persisted run summaries include the richer fields introduced by this slice.
- Core timeline, recovery, simulator profile, and result-metric behavior are covered by automated tests.
Verification Notes
The implementation task for this spec must include verification for:
- bounded diagnostics timeline behavior
- acknowledgment versus recovery guard behavior
- faulted-run preservation and post-recovery state transition behavior
- simulator profile loading and selection rules
- richer run metrics flowing into persisted history
- visibility of queue, drop, or coalescing counters where such behavior exists