
Scenario 03: Fault Injection and Recovery

Why This Scenario Matters

A training app teaches more when it covers failure handling, not only happy-path success. This repository now has an explicit alarm lifecycle in which acknowledgment, fault clearance, and recovery are separate steps.

This scenario is the best entry point to the operational maturity work from SLICE-004.

By the end of it, the learner should understand:

  • what a critical fault does to the workflow
  • why acknowledgment alone is not enough
  • how explicit recovery differs from merely clearing a fault signal
  • how diagnostics and run history preserve the story of a faulted run

Operator Actions

  1. Connect to the machine.
  2. Load a recipe and home the stage.
  3. Start a run.
  4. In the right-side Fault Injection (Engineer) panel, leave the default code and message or enter your own critical fault code.
  5. Click Inject Fault while the run is active.
  6. Observe the alarms list, workflow state, and diagnostics timeline.
  7. In the alarms list, click Ack for the active alarm.
  8. Check whether Start Run or Home becomes available. Acknowledgment alone should leave both blocked.
  9. In the fault injection panel, click Clear Fault.
  10. Observe diagnostics again.
  11. Click Recover.
  12. Confirm that the machine returns to a usable non-faulted state.

Expected UI And State Changes

On Fault Injection

You should see:

  • an active alarm appear in the Active Alarms list
  • the workflow transition to Faulted
  • homing state cleared because the unsafe condition invalidates the prior readiness
  • diagnostics entries recording the fault event
  • a faulted run summary preserved if the run was active when the fault occurred
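The effects listed above can be read as a single state update. The sketch below is a minimal Python model of that idea, not the repository's C# code. `MachineState`, `on_fault_injected`, every field name, the run id, and the fault code `SIM-001` are hypothetical, and the order of the updates is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class WorkflowState(Enum):
    IDLE = "Idle"
    RUNNING = "Running"
    FAULTED = "Faulted"


@dataclass
class MachineState:
    # Hypothetical stand-in for the application's state store.
    workflow: WorkflowState = WorkflowState.IDLE
    is_homed: bool = False
    active_run_id: Optional[str] = None
    active_alarms: list = field(default_factory=list)
    diagnostics: list = field(default_factory=list)
    run_history: list = field(default_factory=list)


def on_fault_injected(state: MachineState, code: str, message: str) -> None:
    """Apply the fault-injection effects described above to the model."""
    state.active_alarms.append(code)
    state.diagnostics.append(f"Fault injected: {code} ({message})")
    if state.active_run_id is not None:
        # Keep a summary of the in-flight run, then cancel the run.
        state.run_history.append((state.active_run_id, "Faulted"))
        state.active_run_id = None
    state.is_homed = False  # earlier readiness is no longer trusted
    state.workflow = WorkflowState.FAULTED


state = MachineState(WorkflowState.RUNNING, is_homed=True, active_run_id="run-001")
on_fault_injected(state, "SIM-001", "simulated critical fault")
print(state.workflow.value, state.is_homed, state.run_history)
```

In this model, injecting a fault while no run is active still raises the alarm and moves the workflow to Faulted, but leaves `run_history` empty.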

On Acknowledgment

You should see:

  • the alarm status change to acknowledged
  • diagnostics record that the operator has seen the fault
  • blocked commands remain blocked

This is the key teaching point. Acknowledgment is not recovery.
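One way to see why acknowledgment cannot unblock commands is to keep "seen by the operator" and "condition still present" as separate flags, and to have the command guard read neither. The following Python sketch assumes exactly that. `Alarm`, `acknowledge`, `can_start_run`, and the rule that Start Run is only offered from Ready are illustrative assumptions, not the repository's API.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    code: str
    acknowledged: bool = False     # the operator has seen the alarm
    condition_active: bool = True  # the unsafe condition is still present


def acknowledge(alarm: Alarm, diagnostics: list) -> None:
    """Mark the alarm as seen and log it; nothing else changes."""
    alarm.acknowledged = True
    diagnostics.append(f"Alarm {alarm.code} acknowledged")


def can_start_run(workflow_state: str) -> bool:
    """Assumed guard: Start Run is only offered from Ready."""
    return workflow_state == "Ready"


workflow_state = "Faulted"
diagnostics = []
alarm = Alarm(code="SIM-001")  # hypothetical fault code

blocked_before = not can_start_run(workflow_state)
acknowledge(alarm, diagnostics)
blocked_after = not can_start_run(workflow_state)

print(alarm.acknowledged, alarm.condition_active, blocked_before, blocked_after)
```

The acknowledged flag flips, a diagnostics entry is written, and the guard result is unchanged because the workflow is still Faulted.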

On Fault Clearance

You should see:

  • diagnostics indicating that the underlying condition is cleared
  • the workflow remain in Faulted until you explicitly click Recover

On Recovery

You should see:

  • Recover succeed only after no active critical fault remains
  • diagnostics record the recovery event
  • the workflow return to Idle or Ready depending on current prerequisites
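The "Idle or Ready" outcome can be pictured as a re-check of prerequisites at the moment of recovery. The sketch below assumes, purely for illustration, that Ready requires a loaded recipe and a homed stage; the real prerequisite list lives in the workflow service and may differ. `state_after_recovery` is a hypothetical name.

```python
from enum import Enum


class WorkflowState(Enum):
    IDLE = "Idle"
    READY = "Ready"


def state_after_recovery(recipe_loaded: bool, is_homed: bool) -> WorkflowState:
    """Pick the post-recovery state from the prerequisites that still hold."""
    if recipe_loaded and is_homed:
        return WorkflowState.READY
    return WorkflowState.IDLE


# Homing was cleared by the fault, so this model lands in Idle right after recovery.
print(state_after_recovery(recipe_loaded=True, is_homed=False).value)
# Once the stage is homed again, the same check yields Ready.
print(state_after_recovery(recipe_loaded=True, is_homed=True).value)
```

Under these assumptions, a machine that has just recovered from a fault would normally show Idle until the stage is homed again.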

What To Inspect In Code After Running It

Start with:

  • src/InspectionPrototype.Application/Services/WorkflowService.cs
  • src/InspectionPrototype.Application/Guards/CommandGuards.cs
  • src/InspectionPrototype.Infrastructure/Simulator/SimulatorFaultInjector.cs
  • src/InspectionPrototype.Presentation/ViewModels/AlarmViewModel.cs

Pay attention to:

  • OnFaultInjected() and how it appends alarms, transitions to Faulted, and cancels active work
  • AcknowledgeFault() and the difference between acknowledgment and clearance
  • OnFaultCleared() and why it still does not count as recovery
  • RecoverAsync() and the precise guard that requires both fault clearance and WorkflowState.Faulted
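Before reading the guard in the source, it may help to see its two conditions as a truth table. This Python sketch is an assumption about the guard's shape based on the description above, not a copy of the C# code. `can_recover` and its parameters are hypothetical names, and the acknowledged column is included only to show that the modelled guard never reads it.

```python
from enum import Enum
from itertools import product


class WorkflowState(Enum):
    IDLE = "Idle"
    FAULTED = "Faulted"


def can_recover(state: WorkflowState, critical_fault_active: bool) -> bool:
    """Both conditions must hold: the workflow is Faulted and the fault is cleared."""
    return state is WorkflowState.FAULTED and not critical_fault_active


# Enumerate every combination of state, fault condition, and acknowledgment.
rows = []
for state, fault_active, acknowledged in product(WorkflowState, (True, False), (True, False)):
    rows.append((state.value, fault_active, acknowledged, can_recover(state, fault_active)))

for row in rows:
    print(row)
```

Only the rows with a Faulted workflow and no active fault allow recovery, regardless of whether the alarm was acknowledged.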

Troubleshooting Notes

  • If Recover is disabled, check whether the fault condition is actually cleared. Acknowledgment alone does not satisfy the guard.
  • If Start Run remains disabled after recovery, remember that the system may still require homing again because the fault invalidated safe motion state.
  • If you inject a fault while no run is active, you will still learn alarm behavior, but the run history will not contain a faulted-run summary the way it does for an in-flight fault.
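The first two notes amount to asking "which condition is still unmet?". A small helper that lists unmet conditions makes that check explicit. The Python sketch below is illustrative only: the function names, the message wording, and the assumption that Start Run needs a homed stage are not taken from the repository.

```python
def why_recover_disabled(workflow_state: str, critical_fault_active: bool) -> list:
    """List the reasons Recover would be refused in this model."""
    reasons = []
    if workflow_state != "Faulted":
        reasons.append("workflow is not Faulted")
    if critical_fault_active:
        reasons.append("critical fault is still active; click Clear Fault")
    return reasons


def why_start_run_disabled(workflow_state: str, is_homed: bool) -> list:
    """List the reasons Start Run would be refused in this model."""
    reasons = []
    if workflow_state == "Faulted":
        reasons.append("workflow is Faulted; click Recover")
    if not is_homed:
        reasons.append("stage is not homed; home it again")
    return reasons


# Acknowledged but not cleared: Recover stays disabled.
print(why_recover_disabled("Faulted", critical_fault_active=True))
# Recovered but not yet re-homed: Start Run stays disabled.
print(why_start_run_disabled("Idle", is_homed=False))
```

An empty list means the modelled guard would allow the command.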

Diagram Brief

  • Title: Critical fault lifecycle
  • Purpose: Show how fault injection, acknowledgment, clearance, and recovery interact
  • Audience: newcomer developer or automation engineer learning alarm semantics
  • Nodes: Operator, MainWindow, MainViewModel, FaultInjector, WorkflowService, AppStateStore, Alarm List, Run Summary History
  • Edges: inject fault raises active alarm and transitions workflow to faulted; acknowledgment marks alarm seen; clearance removes active unsafe condition; recovery returns workflow to a non-faulted state
  • Groups: Fault occurrence, acknowledgment, clearance, recovery
  • Caption: A fault is not truly resolved until the unsafe condition is cleared and the operator explicitly recovers
  • Destination file path: docs/diagrams/source/scenario-03-fault-injection-and-recovery.drawio
