Scenario 03: Fault Injection and Recovery
Why This Scenario Matters
A training app becomes much more valuable when it teaches failure handling rather than only happy-path success. This repository now has an explicit alarm lifecycle with acknowledgment, fault clearance, and recovery as separate steps.
This scenario is the best entry point to the operational maturity work from SLICE-004.
By the end of it, the learner should understand:
- what a critical fault does to the workflow
- why acknowledgment alone is not enough
- how explicit recovery differs from merely clearing a fault signal
- how diagnostics and run history preserve the story of a faulted run
Operator Actions
- Connect to the machine.
- Load a recipe and home the stage.
- Start a run.
- In the right-side Fault Injection (Engineer) panel, leave the default code and message or enter your own critical fault code.
- Click Inject Fault while the run is active.
- Observe the alarms list, workflow state, and diagnostics timeline.
- In the alarms list, click Ack for the active alarm.
- Without clicking anything else, check whether Start Run or Home becomes available.
- In the fault injection panel, click Clear Fault.
- Observe diagnostics again.
- Click Recover.
- Confirm that the machine returns to a usable non-faulted state.
Expected UI And State Changes
On Fault Injection
You should see:
- an active alarm appear in the Active Alarms list
- the workflow transition to Faulted
- homing state cleared because the unsafe condition invalidates the prior readiness
- diagnostics entries recording the fault event
- a faulted run summary preserved if the run was active when the fault occurred
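The effects listed above can be pictured as a single handler that applies them together. The following is a minimal sketch, not the repository's code: the type WorkflowSketch, its members, and the diagnostic strings are hypothetical, and only the method name OnFaultInjected and the Faulted state come from this scenario.

```csharp
using System;
using System.Collections.Generic;

public enum WorkflowState { Idle, Ready, Running, Faulted }

public sealed record Alarm(string Code, string Message);

// Hypothetical stand-in for the workflow service; the real class may differ.
public sealed class WorkflowSketch
{
    public WorkflowState State { get; private set; } = WorkflowState.Idle;
    public bool IsHomed { get; private set; }
    public List<Alarm> ActiveAlarms { get; } = new();
    public List<string> Diagnostics { get; } = new();
    public List<string> RunHistory { get; } = new();

    public void Home()
    {
        IsHomed = true;
        State = WorkflowState.Ready;
    }

    public void StartRun()
    {
        if (State != WorkflowState.Ready)
            throw new InvalidOperationException("Run requires the Ready state.");
        State = WorkflowState.Running;
    }

    public void OnFaultInjected(string code, string message)
    {
        // Capture this before the transition so the run summary is not lost.
        bool runWasActive = State == WorkflowState.Running;

        ActiveAlarms.Add(new Alarm(code, message));       // alarm appears in the list
        State = WorkflowState.Faulted;                    // workflow transitions
        IsHomed = false;                                  // prior readiness is invalid
        Diagnostics.Add($"Fault injected: {code} ({message})");

        if (runWasActive)
            RunHistory.Add($"Run faulted: {code}");       // preserved only for in-flight faults
    }
}
```

In this sketch, injecting a fault after Home() and StartRun() leaves one entry in RunHistory, while injecting it from Idle leaves none, which mirrors the difference between an in-flight fault and an idle one.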
On Acknowledgment
You should see:
- the alarm status change to acknowledged
- diagnostics record that the operator has seen the fault
- blocked commands remain blocked
This is the key teaching point. Acknowledgment is not recovery.
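One way to see why acknowledgment leaves commands blocked is to write the guards so that they never read the acknowledged flag. This is an illustrative sketch under that assumption; FaultedMachineSketch, CanStartRun, and CanHome are hypothetical names, and the repository's CommandGuards.cs may be structured differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a machine that has already faulted.
public sealed class FaultedMachineSketch
{
    public bool IsFaulted { get; private set; } = true;  // starts in the faulted state
    public bool IsHomed { get; private set; }            // false: the fault cleared homing
    public bool AlarmAcknowledged { get; private set; }
    public List<string> Diagnostics { get; } = new();

    // Acknowledgment records that the operator has seen the alarm, nothing more.
    public void AcknowledgeFault()
    {
        AlarmAcknowledged = true;
        Diagnostics.Add("Alarm acknowledged by operator");
    }

    // Neither guard consults AlarmAcknowledged, so acknowledging cannot unblock them.
    public bool CanStartRun => !IsFaulted && IsHomed;
    public bool CanHome => !IsFaulted;
}
```

Calling AcknowledgeFault() on this sketch changes AlarmAcknowledged and adds a diagnostics entry, while CanStartRun and CanHome stay false.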
On Fault Clearance
You should see:
- diagnostics indicating that the underlying condition is cleared
- the system remain unrecovered until you explicitly click Recover
On Recovery
You should see:
- Recover succeed only after no active critical fault remains
- diagnostics record the recovery event
- the workflow return to Idle or Ready depending on current prerequisites
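The separation between clearance and recovery can be sketched as a guard that checks two conditions at once. The code below is an assumption-laden illustration, not the actual implementation: RecoverySketch, CanRecover, and the single prerequisitesMet flag are hypothetical, and the scenario does not say which prerequisites decide between Idle and Ready.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public enum WorkflowState { Idle, Ready, Running, Faulted }

// Hypothetical model that starts with an active critical fault.
public sealed class RecoverySketch
{
    private readonly bool _prerequisitesMet;  // assumed stand-in for "current prerequisites"

    public RecoverySketch(bool prerequisitesMet) => _prerequisitesMet = prerequisitesMet;

    public WorkflowState State { get; private set; } = WorkflowState.Faulted;
    public bool HasActiveCriticalFault { get; private set; } = true;
    public List<string> Diagnostics { get; } = new();

    // Clearing removes the unsafe condition but leaves the workflow in Faulted.
    public void OnFaultCleared()
    {
        HasActiveCriticalFault = false;
        Diagnostics.Add("Fault condition cleared");
    }

    // Both conditions must hold: the workflow is Faulted and no critical fault is active.
    public bool CanRecover => State == WorkflowState.Faulted && !HasActiveCriticalFault;

    public Task<bool> RecoverAsync()
    {
        if (!CanRecover)
            return Task.FromResult(false);

        State = _prerequisitesMet ? WorkflowState.Ready : WorkflowState.Idle;
        Diagnostics.Add("Recovered from fault");
        return Task.FromResult(true);
    }
}
```

With this shape, RecoverAsync() returns false until OnFaultCleared() has run, and the state remains Faulted between clearance and the explicit recovery call.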
What To Inspect In Code After Running It
Start with:
- src/InspectionPrototype.Application/Services/WorkflowService.cs
- src/InspectionPrototype.Application/Guards/CommandGuards.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorFaultInjector.cs
- src/InspectionPrototype.Presentation/ViewModels/AlarmViewModel.cs
Pay attention to:
- OnFaultInjected() and how it appends alarms, transitions to Faulted, and cancels active work
- AcknowledgeFault() and the difference between acknowledgment and clearance
- OnFaultCleared() and why it still does not count as recovery
- RecoverAsync() and the precise guard that requires both fault clearance and WorkflowState.Faulted
Troubleshooting Notes
- If Recover is disabled, check whether the fault condition is actually cleared. Acknowledgment alone does not satisfy the guard.
- If Start Run remains disabled after recovery, remember that the system may still require homing again because the fault invalidated safe motion state.
- If you inject a fault while no run is active, you will still learn alarm behavior, but you will not get the same faulted-run history story as an in-flight fault.
Related Lessons And Specs
- SLICE-001: First Strong Vertical Slice
- SLICE-004: Operational Maturity
- 09. Error Handling in .NET - Real World
- 24. Advanced error modeling .NET
- 09. Safety Systems in Industry
- 08. Machine Workflow and State Machine
Diagram Brief
- Title: Critical fault lifecycle
- Purpose: Show how fault injection, acknowledgment, clearance, and recovery interact
- Audience: newcomer developer or automation engineer learning alarm semantics
- Nodes: Operator, MainWindow, MainViewModel, FaultInjector, WorkflowService, AppStateStore, Alarm List, Run Summary History
- Edges: inject fault raises active alarm and transitions workflow to faulted; acknowledgment marks alarm seen; clearance removes active unsafe condition; recovery returns workflow to a non-faulted state
- Groups: Fault occurrence, acknowledgment, clearance, recovery
- Caption: A fault is not truly resolved until the unsafe condition is cleared and the operator explicitly recovers
- Destination file path: docs/diagrams/source/scenario-03-fault-injection-and-recovery.drawio