System State Persistence & Recovery
In industrial machine software, recovery is not simply “load the last state from the database.”
The core rule is:
Persisted state is not truth. Physical validation after restart is truth.
This topic belongs directly inside reliability and recovery. Your roadmap says machines fail through hardware disconnects, motion faults, timeout chains, power issues, incomplete sequences, sensor disagreement, corrupted states, and operator interruptions; the system must detect failures, fail safely, and recover without making things worse.
PART 1 — Why State Recovery Is Hard in Machine Software
In business software, persisted state is usually the source of truth.
Example:
Order.Status = Paid
After restart, you can usually trust it.
In machine software, this is dangerous.
Example:
WaferClamp.State = Clamped
AxisX.LastPosition = 123.45 mm
Workflow.State = Inspecting
After power loss, these may only mean:
The software last believed this was true.
They do not guarantee that the machine is physically still in that condition.
A wafer may still be clamped. A robot may be holding a part. An axis may have lost reference. A vacuum may have dropped. A camera may have captured an image, but the result may not have been stored.
That is why recovery must answer four questions:
1. What was happening before failure?
2. What is physically true now?
3. What can be safely resumed?
4. What must be revalidated or aborted?
This matches the bigger machine-control mindset: industrial software interacts with physical reality, operations are long-running and asynchronous, and wrong logic can cause real-world damage.
PART 2 — Types of State in Industrial Systems
A good recovery design separates state into categories.
+------------------------------------------------------+
|               Industrial System State                |
+------------------------------------------------------+
| 1. Production Context                                |
|    Lot ID, Job ID, Wafer ID, Recipe Version          |
|                                                      |
| 2. Workflow State                                    |
|    Current operation, step, completed checkpoints    |
|                                                      |
| 3. Machine Physical State                            |
|    Axis position, clamp state, vacuum, part presence |
|                                                      |
| 4. Device State                                      |
|    Connected, ready, initialized, faulted            |
|                                                      |
| 5. Transient Runtime State                           |
|    Queues, callbacks, pending commands, subscriptions|
+------------------------------------------------------+
1. Persistent production context
Usually safe and important to persist:
LotId
JobId
RunId
WaferId / PartId
RecipeId
RecipeVersion
OperatorId
MachineId
This preserves traceability.
Without it, after restart the machine may not know which wafer was being processed, which recipe was active, or whether results were already reported.
2. Workflow state
Partially safe to persist, but must be modeled carefully.
You should not persist only:
CurrentStep = "InspectWafer"
Better:
WorkflowInstanceId
CurrentOperation
LastCompletedCheckpoint
StepStatus
RecoveryPolicy
The key is checkpoint-based recovery.
A workflow step may have started but not completed. So you need to know the last safe recovery point, not just the last method being executed.
3. Machine physical state
Dangerous to trust blindly.
Examples:
AxisX.LastKnownPosition = 120.5 mm
Clamp.LastKnownState = Clamped
Vacuum.LastKnownPressure = -80 kPa
PartPresence.LastKnown = Present
After restart, these should become:
LastKnownPosition
LastKnownClampState
LastKnownVacuumState
LastKnownPartPresence
Not:
CurrentPosition
CurrentClampState
CurrentVacuumState
CurrentPartPresence
Physical state must be revalidated using sensors, controller feedback, homing, probing, or service confirmation.
4. Device state
Usually reconstructed, not blindly persisted.
Do not restore:
Camera.State = Ready
MotionController.State = Initialized
Robot.State = Connected
Instead, rebuild it:
Connect
Identify device
Check firmware/config
Initialize
Read status
Verify ready/fault state
A device that was ready yesterday may be disconnected today.
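The rebuild sequence above can be sketched as a small bring-up routine. This is a minimal sketch, not a real device SDK: the DeviceStage names and the four delegates (connect, identityMatches, configure, isReady) are hypothetical stand-ins for whatever calls a given driver actually exposes.

```csharp
using System;

// Hypothetical bring-up stages for this sketch; real devices may need more.
public enum DeviceStage { Disconnected, Connected, Identified, Configured, Ready }

public static class DeviceBringUp
{
    // Rebuilds device state from scratch instead of restoring a persisted "Ready".
    // Each delegate is an assumed wrapper around a real driver call.
    // Returns the last stage that was actually verified in this session.
    public static DeviceStage Run(
        Func<bool> connect,
        Func<bool> identityMatches,
        Func<bool> configure,
        Func<bool> isReady)
    {
        if (!connect()) return DeviceStage.Disconnected;
        if (!identityMatches()) return DeviceStage.Connected;
        if (!configure()) return DeviceStage.Identified;
        if (!isReady()) return DeviceStage.Configured;
        return DeviceStage.Ready;
    }
}
```

Later steps are never attempted when an earlier one fails, so the returned stage is only ever as good as what was checked after restart.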
5. Transient runtime state
Usually should not be persisted.
Examples:
In-memory queues
Pending async commands
Callbacks
Subscriptions
Temporary buffers
Cancellation tokens
Thread state
Device SDK handles
These belong to the running process. After restart, they are dead.
PART 3 — Persisted State vs Trusted State
This distinction is one of the most important ideas.
+-------------------+
|  Persisted State  |
|   "Last known"    |
+---------+---------+
          |
          v
+-------------------+
|    Validation     |
| sensors, devices, |
|  homing, checks   |
+---------+---------+
          |
          v
+---------------------------+
|   Trusted Current State   |
|        safe to use        |
+---------------------------+
            OR
+---------------------------+
|  Unknown / Unsafe State   |
| requires recovery action  |
+---------------------------+
Persisted state says:
What did software last record?
Trusted state says:
What has the system verified now?
Example:
Persisted:
AxisX.LastKnownPosition = 150.0 mm
After restart:
AxisX.TrustLevel = InvalidUntilHomed
Another example:
Persisted:
Vacuum.LastKnownState = On
After restart:
Vacuum.CurrentState = Unknown
RecoveryAction = ReadVacuumSensorAndPressure
A mature system has explicit language for this:
Verified
Unverified
LastKnown
Unknown
RequiresHoming
RequiresOperatorConfirmation
UnsafeToResume
Bad systems only have:
Running
Stopped
Ready
Error
That is not enough.
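The step from persisted state to trusted state can be sketched for the vacuum example. This is an illustration under stated assumptions: the type names are hypothetical, the trust vocabulary is a subset of the terms listed above, and the -60 kPa threshold is an invented example value.

```csharp
// Hypothetical trust vocabulary for this sketch, a subset of the terms above.
public enum TrustLevel { Unknown, LastKnown, Verified, UnsafeToResume }

// IsOn is nullable: "we do not know" is a legitimate value after restart.
public sealed record VacuumState(bool? IsOn, TrustLevel Trust);

public static class VacuumRecovery
{
    // A value loaded from storage is never better than LastKnown.
    public static VacuumState FromPersisted(bool lastKnownOn) =>
        new VacuumState(lastKnownOn, TrustLevel.LastKnown);

    // Only a fresh sensor reading can promote the state to Verified.
    // The -60 kPa threshold is an assumed example value, not a recommendation.
    public static VacuumState Validate(
        VacuumState persisted, double? pressureKPa, double onThresholdKPa = -60.0)
    {
        // No reading (sensor offline): the current state is simply unknown.
        if (pressureKPa is null) return new VacuumState(null, TrustLevel.Unknown);

        bool onNow = pressureKPa.Value <= onThresholdKPa;

        // Vacuum was recorded as on but is now lost: the part may have moved.
        if (persisted.IsOn == true && !onNow)
            return new VacuumState(false, TrustLevel.UnsafeToResume);

        return new VacuumState(onNow, TrustLevel.Verified);
    }
}
```

The persisted value only influences the outcome when it disagrees with the sensor. It never becomes the current state on its own.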
PART 4 — Recovery After Crash or Power Loss
A safe recovery flow looks like this:
+----------------------+
| Application Restart  |
+----------+-----------+
           |
           v
+----------------------+
|    Load Persisted    |
|   Recovery Context   |
+----------+-----------+
           |
           v
+----------------------+
|  Reconnect Devices   |
+----------+-----------+
           |
           v
+----------------------+
|   Validate Device    |
|  Identity / Config   |
+----------+-----------+
           |
           v
+----------------------+
|     Re-establish     |
|    Physical State    |
+----------+-----------+
           |
           v
+----------------------+
|  Determine Workflow  |
|    Recovery Point    |
+----------+-----------+
           |
           v
+----------------------+
|  Operator / Service  |
| Confirmation Needed? |
+----+-------------+---+
     |             |
     v             v
+----------+  +----------+
|  Resume  |  | Recover  |
|  Safely  |  | / Abort  |
+----------+  +----------+
Automatic resume is often unsafe because the machine may have changed while software was down.
During power loss:
Axis may coast or lose reference
Vacuum may drop
Part may move
Robot may stop mid-transfer
Controller may keep some state while PC lost state
Operator may manually intervene
So restart should usually enter a recovery mode, not normal running mode.
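One way to make "enter recovery mode" the default is to decide the startup mode from a record left by the previous session. The sketch below assumes a hypothetical LastSessionRecord whose CleanShutdown flag is written only at the very end of an orderly shutdown; the names are illustrative, not taken from any framework.

```csharp
#nullable enable

// Neither mode is "Running": production starts only through normal start logic.
public enum StartupMode { Idle, RecoveryRequired }

// Hypothetical record written by the previous session.
// CleanShutdown is set to true only at the very end of an orderly shutdown.
public sealed record LastSessionRecord(bool CleanShutdown, bool WorkflowActive);

public static class StartupPolicy
{
    // A missing record is treated like an abnormal shutdown: nothing is known.
    public static StartupMode Decide(LastSessionRecord? last) =>
        last is { CleanShutdown: true, WorkflowActive: false }
            ? StartupMode.Idle
            : StartupMode.RecoveryRequired;
}
```

Recovery is the fallback for every case that is not provably clean, including a missing or unreadable record.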
Example recovery screen:
Machine restarted after abnormal shutdown.
Last known context:
- Lot: LOT-2026-0412
- Wafer: W25
- Recipe: RCP-A v14
- Last checkpoint: ImageCaptured
- Result status: NotReported
Current validation:
- Motion controller: Connected
- X/Y axes: Not homed
- Vacuum: Pressure not detected
- Wafer presence: Sensor indicates present
Recommended action:
- Home axes using safe recovery path
- Verify wafer clamped
- Repeat inspection step or abort wafer
This is much safer than showing:
Machine Ready
PART 5 — Workflow Recovery and Partial Completion
Industrial workflows fail in the middle all the time.
Example:
Load wafer
Clamp wafer
Move to inspection position
Capture image
Process image
Store result
Report result
Unload wafer
A crash may happen here:
Load wafer          DONE
Clamp wafer         DONE
Move to position    DONE
Capture image       DONE
Process image       IN PROGRESS
Store result        NOT DONE
Report result       NOT DONE
Unload wafer        NOT DONE
The system must know the difference between:
Step started
Step physically completed
Step verified
Step persisted
Step reported
A good workflow model:
+-------------+
| Step Started|
+------+------+
       |
       v
+-------------+
| Action Sent |
+------+------+
       |
       v
+-------------+
| Device Done |
+------+------+
       |
       v
+-------------+
|  Verified   |
+------+------+
       |
       v
+-------------+
|  Persisted  |
+------+------+
       |
       v
+-------------+
| Checkpoint  |
|  Completed  |
+-------------+
Recovery should resume only from completed checkpoints.
Bad design:
CurrentStep = CaptureImage
Good design:
Step = CaptureImage
CommandSent = true
ImageReceived = true
ImageSaved = true
ResultComputed = false
CheckpointCompleted = false
RecoveryPolicy = RepeatFromImageProcessing
Recovery options include:
Resume from known safe checkpoint
Repeat step
Rollback
Move to recovery workflow
Require operator intervention
Abort current item
Scrap / quarantine material
The important point:
Recovery checkpoints must be designed before failure happens. They cannot be guessed reliably after a crash.
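To make "designed before failure" concrete, the CaptureImage example from this part can be turned into an explicit decision function. This is a sketch under assumptions: the record mirrors the flags shown in the good design, while the RecoveryAction names and the rule for the ambiguous case are hypothetical choices.

```csharp
// Hypothetical recovery outcomes for the CaptureImage step.
public enum RecoveryAction
{
    ResumeFromNextStep,
    RepeatFromImageProcessing,
    RepeatCapture,
    RequireOperatorDecision
}

// Progress flags persisted while the step runs (mirrors the good design above).
public sealed record CaptureStepProgress(
    bool CommandSent,
    bool ImageReceived,
    bool ImageSaved,
    bool ResultComputed,
    bool CheckpointCompleted);

public static class CaptureStepRecovery
{
    public static RecoveryAction Decide(CaptureStepProgress p)
    {
        // Only a completed checkpoint allows moving on.
        if (p.CheckpointCompleted) return RecoveryAction.ResumeFromNextStep;

        // The image is safely on disk, so processing can be repeated from it.
        if (p.ImageSaved) return RecoveryAction.RepeatFromImageProcessing;

        // Nothing was sent to the device, so repeating the step is harmless.
        if (!p.CommandSent) return RecoveryAction.RepeatCapture;

        // Command sent but no saved image: assumed ambiguous in this sketch.
        return RecoveryAction.RequireOperatorDecision;
    }
}
```

For the flags shown in the good design (image saved, result not computed, checkpoint not completed) this returns RepeatFromImageProcessing, matching the persisted RecoveryPolicy.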
PART 6 — Production Context Recovery
Production context is different from physical machine state.
Production context answers:
What are we processing?
Under which recipe?
For which job?
What has already been recorded?
What has already been reported?
Typical persisted production state:
RunId
LotId
WaferId
SlotId
RecipeId
RecipeVersion
InspectionId
ImageSetId
ResultStatus
ReportStatus
ExportStatus
The dangerous cases are usually around duplicates and gaps.
Example:
Image saved: YES
Database result record: NO
MES report sent: UNKNOWN
After restart, you must not blindly send the result again unless the reporting operation is idempotent.
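A minimal sketch of what idempotent reporting means here: the key is derived only from stable persisted IDs, and the receiver ignores a key it has already stored. DeduplicatingInbox is a hypothetical stand-in for the receiving side; whether a real MES interface deduplicates this way is an assumption that has to be checked against that interface.

```csharp
using System.Collections.Generic;

public static class ReportKey
{
    // Built only from stable, persisted IDs so a restart produces the same key.
    public static string For(string lotId, string waferId, string inspectionId) =>
        $"{lotId}/{waferId}/{inspectionId}";
}

// Stand-in for a receiver (for example an MES gateway) that deduplicates by key.
public sealed class DeduplicatingInbox
{
    private readonly Dictionary<string, string> _resultsByKey =
        new Dictionary<string, string>();

    public int Count => _resultsByKey.Count;

    // Returns true when the result was stored, false when the key was already seen.
    // Either way the sender may treat the report as delivered.
    public bool Accept(string idempotencyKey, string payload)
    {
        if (_resultsByKey.ContainsKey(idempotencyKey)) return false;
        _resultsByKey.Add(idempotencyKey, payload);
        return true;
    }
}
```

With this property, a report whose status is UNKNOWN after a crash can be resent: the second delivery changes nothing on the receiving side.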
Useful status markers:
Created
Started
ImageCaptured
ResultComputed
ResultPersisted
ReportPending
ReportSent
ReportAcknowledged
Failed
Aborted
RequiresReview
For production records, atomicity matters.
Bad:
Save image file
Crash
Insert DB row later
After restart, you have an orphan image.
Better:
Create inspection record: ImageCapturePending
Capture image
Save image with inspection ID
Update record: ImageCaptured
Process result
Update record: ResultComputed
Report result with idempotency key
Update record: ReportAcknowledged
This lets recovery scan for incomplete records and decide what to do.
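The startup scan can be sketched as a mapping from each record's last status to a recovery action. The statuses come from the sequence above; the StartupAction names and the specific mapping are assumptions for illustration, and a real policy would depend on the process.

```csharp
using System.Collections.Generic;
using System.Linq;

// Statuses taken from the record-first sequence above.
public enum InspectionStatus { ImageCapturePending, ImageCaptured, ResultComputed, ReportAcknowledged }

// Hypothetical recovery actions for incomplete records.
public enum StartupAction { None, CheckImageFileThenDecide, ReprocessImage, ResendReport }

public sealed record InspectionRecord(string InspectionId, InspectionStatus Status);

public static class RecoveryScan
{
    public static StartupAction ActionFor(InspectionStatus status) => status switch
    {
        // Capture may or may not have happened: look for a file with this ID first.
        InspectionStatus.ImageCapturePending => StartupAction.CheckImageFileThenDecide,
        InspectionStatus.ImageCaptured => StartupAction.ReprocessImage,
        // Safe only because the report carries an idempotency key.
        InspectionStatus.ResultComputed => StartupAction.ResendReport,
        _ => StartupAction.None
    };

    // Returns only the records that still need work, in their original order.
    public static List<(string InspectionId, StartupAction Action)> Plan(
        IEnumerable<InspectionRecord> records) =>
        records
            .Select(r => (InspectionId: r.InspectionId, Action: ActionFor(r.Status)))
            .Where(x => x.Action != StartupAction.None)
            .ToList();
}
```

Because the record is created before the image is captured, every file on disk has a row that the scan can find.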
PART 7 — Real-World Failure Scenarios
Scenario 1 — Software restores “Running” after restart
What it looks like:
Application starts.
UI shows Running.
Machine is physically stopped.
Operator assumes system is processing.
Nothing moves.
Production time is lost.
Why it happens:
MachineState = Running was persisted and restored directly.
Prevention:
Persist LastKnownMachineState = Running
Start as RecoveryRequired
Validate devices and physical state
Only transition to Running through normal start logic
Scenario 2 — Last known axis position is trusted after reference is lost
What it looks like:
System thinks X = 120 mm.
Axis actually lost encoder reference.
Software commands movement based on wrong coordinate.
Machine hits limit or risks collision.
Why it happens:
LastKnownPosition was treated as current verified position.
Prevention:
After power loss:
AxisPositionTrust = InvalidUntilHomed
Motion commands disabled except safe recovery/homing
This connects directly to motion safety: homing, reference positions, hard limits, soft limits, and safe travel zones are core machine-control topics.
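The prevention rule for Scenario 2 can be expressed as a gate in front of the motion layer. This is a minimal sketch: AxisGate, MotionCommand, and the event method names are hypothetical, and a real system would enforce the same rule in the motion controller or safety layer as well.

```csharp
public enum PositionTrust { InvalidUntilHomed, Verified }

// Hypothetical command set for this sketch.
public enum MotionCommand { Home, JogSlow, MoveAbsolute }

public sealed class AxisGate
{
    // Every new session starts untrusted, whatever position was persisted.
    public PositionTrust Trust { get; private set; } = PositionTrust.InvalidUntilHomed;

    // Until the axis is homed, only the homing command is accepted.
    public bool IsAllowed(MotionCommand command) =>
        Trust == PositionTrust.Verified || command == MotionCommand.Home;

    public void OnHomingCompleted() => Trust = PositionTrust.Verified;

    // Any event that can invalidate the reference drops trust again.
    public void OnReferenceLost() => Trust = PositionTrust.InvalidUntilHomed;
}
```

The gate has no way to become Verified from persisted data. Only a completed homing run in the current session can do that.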
Scenario 3 — Workflow resumes after partial completion
What it looks like:
Robot picked wafer.
Crash occurs before placement confirmation.
After restart, workflow resumes from "Place wafer complete".
Robot/wafer state is wrong.
Why it happens:
Workflow step was marked complete too early.
Prevention:
Mark checkpoint complete only after:
- command completed
- sensor confirmed
- state persisted
- recovery-safe condition reached
Scenario 4 — Product processed twice
What it looks like:
Wafer W25 inspected.
Crash happens before completion marker.
After restart, system inspects W25 again.
MES receives duplicate or conflicting result.
Why it happens:
Physical completion and production record completion were not updated atomically.
Prevention:
Use item-level processing status.
Use idempotency keys for reporting.
Detect existing inspection result before repeat.
Require operator decision for ambiguous cases.
Scenario 5 — Image saved but database failed
What it looks like:
Image file exists on disk.
Database has no result record.
Review UI cannot find the image.
Engineer later finds orphan files.
Why it happens:
File storage and metadata persistence were not coordinated.
Prevention:
Create record before image capture.
Use stable IDs in file path.
Recover orphan/pending records on startup.
Expose "incomplete inspection data" diagnostics.
Scenario 6 — UI shows ready while device initialization is incomplete
What it looks like:
Operator clicks Start.
Camera is connected but not configured.
First acquisition fails.
Machine enters alarm.
Why it happens:
Connected was treated as Ready.
Prevention:
Device states should be explicit:
Disconnected
Connected
Identified
Configured
Ready
Faulted
Degraded
Scenario 7 — Stale recipe restored after hardware change
What it looks like:
System restores Recipe A.
But camera/lens/stage configuration changed.
Recipe parameters no longer match machine capability.
Inspection quality becomes wrong.
Why it happens:
Recipe was restored without compatibility validation.
Prevention:
Validate:
- recipe version
- machine configuration version
- device identity
- calibration version
- firmware/driver compatibility
Recipe/configuration safety is important because industrial systems are heavily parameterized, and recipe mistakes can damage throughput, quality, or hardware.
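The compatibility check for Scenario 7 can be sketched as a comparison between the machine fingerprint a recipe was validated against and the fingerprint read from the machine after restart. All type and field names here are hypothetical, and comparing versions by exact string equality is a simplifying assumption.

```csharp
using System.Collections.Generic;

// Hypothetical snapshot of the machine properties a recipe depends on.
public sealed record MachineFingerprint(
    string ConfigVersion,
    string CameraSerial,
    string CalibrationVersion,
    string FirmwareVersion);

// The recipe stores the fingerprint it was last validated against.
public sealed record RecipeHeader(string RecipeId, int Version, MachineFingerprint ValidatedAgainst);

public static class RecipeCompatibility
{
    // Returns a human-readable list of mismatches; empty means compatible.
    public static IReadOnlyList<string> FindMismatches(RecipeHeader recipe, MachineFingerprint current)
    {
        var expected = recipe.ValidatedAgainst;
        var issues = new List<string>();

        if (expected.ConfigVersion != current.ConfigVersion) issues.Add("machine configuration version");
        if (expected.CameraSerial != current.CameraSerial) issues.Add("device identity");
        if (expected.CalibrationVersion != current.CalibrationVersion) issues.Add("calibration version");
        if (expected.FirmwareVersion != current.FirmwareVersion) issues.Add("firmware/driver compatibility");

        return issues;
    }
}
```

A non-empty list should block automatic recipe restore and surface the mismatches to the operator or service engineer.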
PART 8 — Software Design Implications
The architecture should look like this:
+---------------------+
|  Persistence Store  |
| - recovery context  |
| - production state  |
| - checkpoints       |
+----------+----------+
           |
           v
+---------------------+
|  Recovery Manager   |
| - load last known   |
| - classify state    |
| - plan validation   |
+----------+----------+
           |
           v
+-------------------------------+
|       Device Validation       |
|    + Physical State Checks    |
| - reconnect                   |
| - identify                    |
| - read sensors                |
| - home/probe if needed        |
+----------+--------------------+
           |
           v
+-------------------------------+
|  Workflow Recovery Decision   |
| - resume                      |
| - repeat                      |
| - rollback                    |
| - abort                       |
| - operator confirmation       |
+----------+--------------------+
           |
           v
+-------------------------------+
|       Operator Guidance       |
| Safe Resume / Recovery / Abort|
+-------------------------------+
Bad approach
Serialize whole machine object graph.
Restart application.
Deserialize object graph.
Continue execution.
This is dangerous because it restores software memory, not physical truth.
Better approach
Persist minimal recovery context:
MachineSessionId
ProductionContext
LastCompletedCheckpoint
IncompleteOperation
DeviceConfigurationSnapshot
LastKnownPhysicalState
RecoveryRequiredReason
Then validate everything that matters.
Good state model
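using System;
// How far a recovered value can be trusted after restart.
public enum TrustLevel
{
    Unknown,
    LastKnownOnly,
    Verified,
    InvalidUntilHomed,
    RequiresOperatorConfirmation
}
// Recovery view of one axis. The position is nullable because it may be unknown.
public sealed record AxisRecoveryState(
    string AxisName,
    double? LastKnownPosition,
    TrustLevel PositionTrust,
    bool HomingRequired,
    DateTimeOffset LastUpdatedAt);
The key idea is not the C# syntax.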
The key idea is that state has trust level.
PART 9 — Interview / Real-World Talking Points
A strong answer:
In machine software, I would not restore runtime state blindly after restart. I would persist minimal recovery context: production identity, recipe version, workflow checkpoint, incomplete operation, and last-known physical state. On startup, I would treat that state as untrusted until devices reconnect, hardware identity is validated, sensors are checked, axes are homed if needed, and the workflow recovery point is determined. Only then would the system allow safe resume, repeat, rollback, abort, or operator-guided recovery.
Common mistakes software engineers make:
They persist too much runtime state.
They trust last-known physical state.
They confuse workflow step with safe checkpoint.
They restore Running directly after restart.
They ignore partial completion.
They do not model Unknown state.
They assume device connected means device ready.
They forget duplicate reporting and traceability risks.
Strong engineers understand:
Persisted != trusted
Last-known != current
Started != completed
Completed != verified
Verified != reported
Connected != ready
Recovery must be explicit
Unknown is a valid and important state
Final mental model:
Business software recovery:
Load state -> continue
Industrial machine recovery:
Load last-known context
-> validate physical reality
-> classify uncertainty
-> choose safe recovery path
-> guide operator if needed
-> resume only from safe checkpoint
That is the heart of system state persistence and recovery in industrial machine software.