
System State Persistence & Recovery

In industrial machine software, recovery is not simply “load the last state from the database.”

The core rule is:

Persisted state is not truth. Physical validation after restart is truth.

This topic belongs directly inside reliability and recovery. Your roadmap says machines fail through hardware disconnects, motion faults, timeout chains, power issues, incomplete sequences, sensor disagreement, corrupted states, and operator interruptions; the system must detect failures, fail safely, and recover without making things worse.


PART 1 — Why State Recovery Is Hard in Machine Software

In business software, persisted state is usually the source of truth.

Example:

text
Order.Status = Paid

After restart, you can usually trust it.

In machine software, this is dangerous.

Example:

text
WaferClamp.State = Clamped
AxisX.LastPosition = 123.45 mm
Workflow.State = Inspecting

After power loss, these may only mean:

text
The software last believed this was true.

They do not guarantee that the machine is physically still in that condition.

A wafer may still be clamped. A robot may be holding a part. An axis may have lost reference. A vacuum may have dropped. A camera may have captured an image, but the result may not have been stored.

That is why recovery must answer four questions:

text
1. What was happening before failure?
2. What is physically true now?
3. What can be safely resumed?
4. What must be revalidated or aborted?

This matches the bigger machine-control mindset: industrial software interacts with physical reality, operations are long-running and asynchronous, and wrong logic can cause real-world damage.


PART 2 — Types of State in Industrial Systems

A good recovery design separates state into categories.

text
+-------------------------------------------------------+
|                Industrial System State                |
+-------------------------------------------------------+
| 1. Production Context                                 |
|    Lot ID, Job ID, Wafer ID, Recipe Version           |
|                                                       |
| 2. Workflow State                                     |
|    Current operation, step, completed checkpoints     |
|                                                       |
| 3. Machine Physical State                             |
|    Axis position, clamp state, vacuum, part presence  |
|                                                       |
| 4. Device State                                       |
|    Connected, ready, initialized, faulted             |
|                                                       |
| 5. Transient Runtime State                            |
|    Queues, callbacks, pending commands, subscriptions |
+-------------------------------------------------------+

1. Persistent production context

Usually safe and important to persist:

text
LotId
JobId
RunId
WaferId / PartId
RecipeId
RecipeVersion
OperatorId
MachineId

This preserves traceability.

Without it, after restart the machine may not know which wafer was being processed, which recipe was active, or whether results were already reported.

2. Workflow state

Partially safe to persist, but must be modeled carefully.

You should not persist only:

text
CurrentStep = "InspectWafer"

Better:

text
WorkflowInstanceId
CurrentOperation
LastCompletedCheckpoint
StepStatus
RecoveryPolicy

The key is checkpoint-based recovery.

A workflow step may have started but not completed. So you need to know the last safe recovery point, not just the last method being executed.

3. Machine physical state

Dangerous to trust blindly.

Examples:

text
AxisX.LastKnownPosition = 120.5 mm
Clamp.LastKnownState = Clamped
Vacuum.LastKnownPressure = -80 kPa
PartPresence.LastKnown = Present

After restart, these should become:

text
LastKnownPosition
LastKnownClampState
LastKnownVacuumState
LastKnownPartPresence

Not:

text
CurrentPosition
CurrentClampState
CurrentVacuumState
CurrentPartPresence

Physical state must be revalidated using sensors, controller feedback, homing, probing, or service confirmation.
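The last-known versus current distinction can be carried in the data itself. The following is a minimal sketch, not a definitive implementation: `StateTrust`, `ClampState`, `ClampRecovery`, and the `readClampSensor` delegate are hypothetical names introduced for illustration, and the sensor is assumed to return a simple boolean.

```csharp
using System;

public enum StateTrust { LastKnownOnly, Verified, Mismatch }

// Clamp state together with how it was obtained.
public sealed record ClampState(bool Clamped, StateTrust Trust);

public static class ClampRecovery
{
    // A persisted value re-enters the system as last-known, never as current.
    public static ClampState Restore(bool persistedClamped)
        => new ClampState(persistedClamped, StateTrust.LastKnownOnly);

    // Reads the sensor and reports the measured value. Agreement with the
    // persisted value yields Verified; disagreement yields Mismatch so the
    // workflow can react instead of silently continuing.
    public static ClampState Revalidate(ClampState restored, Func<bool> readClampSensor)
    {
        bool measured = readClampSensor();
        return new ClampState(
            measured,
            measured == restored.Clamped ? StateTrust.Verified : StateTrust.Mismatch);
    }
}
```

Code that needs the clamp state can then refuse anything that is not `Verified`, which turns "trusted blindly" into a condition that is easy to spot in review.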

4. Device state

Usually reconstructed, not blindly persisted.

Do not restore:

text
Camera.State = Ready
MotionController.State = Initialized
Robot.State = Connected

Instead, rebuild it:

text
Connect
Identify device
Check firmware/config
Initialize
Read status
Verify ready/fault state

A device that was ready yesterday may be disconnected today.
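The rebuild sequence above can be sketched as a routine that walks the steps in order and reports how far it actually got. This is an illustration under assumptions: `IDevice`, `SimulatedDevice`, and `DeviceBringUp` are hypothetical, and a real device SDK would sit behind the interface methods.

```csharp
public enum DeviceState { Disconnected, Connected, Identified, Configured, Ready, Faulted }

// Hypothetical device abstraction; real SDK calls would sit behind these methods.
public interface IDevice
{
    bool Connect();
    string ReadModel();
    bool ApplyConfiguration();
    bool HasFault();
    bool ReportsReady();
}

public static class DeviceBringUp
{
    // Rebuilds device state step by step and stops at the first step that
    // fails, returning the furthest state that was actually reached.
    public static DeviceState Run(IDevice device, string expectedModel)
    {
        if (!device.Connect()) return DeviceState.Disconnected;
        if (device.ReadModel() != expectedModel) return DeviceState.Connected;
        if (!device.ApplyConfiguration()) return DeviceState.Identified;
        if (device.HasFault()) return DeviceState.Faulted;
        return device.ReportsReady() ? DeviceState.Ready : DeviceState.Configured;
    }
}

// Simulated device used only to exercise the sketch.
public sealed class SimulatedDevice : IDevice
{
    public bool Reachable { get; set; } = true;
    public string Model { get; set; } = "CAM-1";
    public bool ConfigAccepted { get; set; } = true;
    public bool Fault { get; set; }
    public bool Ready { get; set; } = true;

    public bool Connect() => Reachable;
    public string ReadModel() => Model;
    public bool ApplyConfiguration() => ConfigAccepted;
    public bool HasFault() => Fault;
    public bool ReportsReady() => Ready;
}
```

Nothing in this routine reads a persisted device state. The result depends only on what the device answers now.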

5. Transient runtime state

Usually should not be persisted.

Examples:

text
In-memory queues
Pending async commands
Callbacks
Subscriptions
Temporary buffers
Cancellation tokens
Thread state
Device SDK handles

These belong to the running process. After restart, they are dead.


PART 3 — Persisted State vs Trusted State

This distinction is one of the most important ideas.

text
+-------------------+
| Persisted State   |
| "Last known"      |
+---------+---------+
          |
          v
+-------------------+
| Validation        |
| sensors, devices, |
| homing, checks    |
+---------+---------+
          |
          v
+---------------------------+
| Trusted Current State     |
| safe to use               |
+---------------------------+

          OR

+---------------------------+
| Unknown / Unsafe State    |
| requires recovery action  |
+---------------------------+

Persisted state says:

text
What did software last record?

Trusted state says:

text
What has the system verified now?

Example:

text
Persisted:
AxisX.LastKnownPosition = 150.0 mm

After restart:
AxisX.TrustLevel = UnknownUntilHomed

Another example:

text
Persisted:
Vacuum.LastKnownState = On

After restart:
Vacuum.CurrentState = Unknown
RecoveryAction = ReadVacuumSensorAndPressure

A mature system has explicit language for this:

text
Verified
Unverified
LastKnown
Unknown
RequiresHoming
RequiresOperatorConfirmation
UnsafeToResume

Bad systems only have:

text
Running
Stopped
Ready
Error

That is not enough.


PART 4 — Recovery After Crash or Power Loss

A safe recovery flow looks like this:

text
+----------------------+
| Application Restart  |
+----------+-----------+
           |
           v
+----------------------+
| Load Persisted       |
| Recovery Context     |
+----------+-----------+
           |
           v
+----------------------+
| Reconnect Devices    |
+----------+-----------+
           |
           v
+----------------------+
| Validate Device      |
| Identity / Config    |
+----------+-----------+
           |
           v
+----------------------+
| Re-establish         |
| Physical State       |
+----------+-----------+
           |
           v
+----------------------+
| Determine Workflow   |
| Recovery Point       |
+----------+-----------+
           |
           v
+----------------------+
| Operator / Service   |
| Confirmation Needed? |
+----+-------------+---+
     |             |
     v             v
+----------+   +----------+
| Resume   |   | Recover  |
| Safely   |   | / Abort  |
+----------+   +----------+

Automatic resume is often unsafe because the machine may have changed while the software was down.

During power loss:

text
Axis may coast or lose reference
Vacuum may drop
Part may move
Robot may stop mid-transfer
Controller may keep some state while the PC loses its own
Operator may manually intervene

So restart should usually enter a recovery mode, not normal running mode.
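The flow in the diagram can be sketched as an ordered list of stages where the first failure ends the run. This is a simplified illustration: `RecoveryStage`, `RecoveryReport`, and `RecoveryFlow` are hypothetical names, and each stage is reduced to a pass/fail delegate.

```csharp
using System;
using System.Collections.Generic;

public enum RecoveryOutcome { ResumeSafely, RecoverOrAbort }

// One stage of the restart flow; Check returns true when the stage passed.
public sealed record RecoveryStage(string Name, Func<bool> Check);

// Outcome of the whole flow; FailedStage is empty when every stage passed.
public sealed record RecoveryReport(RecoveryOutcome Outcome, string FailedStage);

public static class RecoveryFlow
{
    // Runs stages in order and stops at the first failure, so later stages
    // never run against a machine that an earlier stage could not validate.
    public static RecoveryReport Run(IEnumerable<RecoveryStage> stages)
    {
        foreach (var stage in stages)
        {
            if (!stage.Check())
                return new RecoveryReport(RecoveryOutcome.RecoverOrAbort, stage.Name);
        }
        return new RecoveryReport(RecoveryOutcome.ResumeSafely, "");
    }
}
```

The name of the failed stage is what a recovery screen would show the operator, instead of a generic error.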

Example recovery screen:

text
Machine restarted after abnormal shutdown.

Last known context:
- Lot: LOT-2026-0412
- Wafer: W25
- Recipe: RCP-A v14
- Last checkpoint: ImageCaptured
- Result status: NotReported

Current validation:
- Motion controller: Connected
- X/Y axes: Not homed
- Vacuum: Pressure not detected
- Wafer presence: Sensor indicates present

Recommended action:
- Home axes using safe recovery path
- Verify wafer clamped
- Repeat inspection step or abort wafer

This is much safer than showing:

text
Machine Ready


PART 5 — Workflow Recovery and Partial Completion

Industrial workflows fail in the middle all the time.

Example:

text
Load wafer
Clamp wafer
Move to inspection position
Capture image
Process image
Store result
Report result
Unload wafer

A crash may happen here:

text
Load wafer              DONE
Clamp wafer             DONE
Move to position        DONE
Capture image           DONE
Process image           IN PROGRESS
Store result            NOT DONE
Report result           NOT DONE
Unload wafer            NOT DONE

The system must know the difference between:

text
Step started
Step physically completed
Step verified
Step persisted
Step reported

A good workflow model:

text
+-------------+
| Step Started|
+------+------+
       |
       v
+-------------+
| Action Sent |
+------+------+
       |
       v
+-------------+
| Device Done |
+------+------+
       |
       v
+-------------+
| Verified    |
+------+------+
       |
       v
+-------------+
| Persisted   |
+------+------+
       |
       v
+-------------+
| Checkpoint  |
| Completed   |
+-------------+

Recovery should resume only from completed checkpoints.

Bad design:

text
CurrentStep = CaptureImage

Good design:

text
Step = CaptureImage
CommandSent = true
ImageReceived = true
ImageSaved = true
ResultComputed = false
CheckpointCompleted = false
RecoveryPolicy = RepeatFromImageProcessing

Recovery options include:

text
Resume from known safe checkpoint
Repeat step
Rollback
Move to recovery workflow
Require operator intervention
Abort current item
Scrap / quarantine material
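The progress flags from the "good design" example can drive the choice among these options. The sketch below is one possible policy, not a rule from any standard: `StepProgress`, `RecoveryAction`, and `RecoveryPlanner` are hypothetical, and the decision order is an assumption that fits the image-capture example.

```csharp
public enum RecoveryAction
{
    ResumeFromCheckpoint,
    RepeatStep,
    RepeatFromImageProcessing,
    RequireOperator
}

// Fine-grained progress of one step, as persisted before the crash.
public sealed record StepProgress(
    string Step,
    bool CommandSent,
    bool ImageReceived,
    bool ImageSaved,
    bool ResultComputed,
    bool CheckpointCompleted);

public static class RecoveryPlanner
{
    public static RecoveryAction Decide(StepProgress p)
    {
        // The checkpoint was reached: resuming from it is safe by design.
        if (p.CheckpointCompleted) return RecoveryAction.ResumeFromCheckpoint;

        // The image is on disk: only the processing has to be repeated.
        if (p.ImageSaved) return RecoveryAction.RepeatFromImageProcessing;

        // Nothing was sent to the device: the step can start over.
        if (!p.CommandSent) return RecoveryAction.RepeatStep;

        // A command was sent but its outcome is unknown.
        return RecoveryAction.RequireOperator;
    }
}
```

With only `CurrentStep = CaptureImage` persisted, none of these branches could be told apart.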

The important point:

Recovery checkpoints must be designed before failure happens. They cannot be guessed reliably after a crash.


PART 6 — Production Context Recovery

Production context is different from physical machine state.

Production context answers:

text
What are we processing?
Under which recipe?
For which job?
What has already been recorded?
What has already been reported?

Typical persisted production state:

text
RunId
LotId
WaferId
SlotId
RecipeId
RecipeVersion
InspectionId
ImageSetId
ResultStatus
ReportStatus
ExportStatus

The dangerous cases are usually around duplicates and gaps.

Example:

text
Image saved: YES
Database result record: NO
MES report sent: UNKNOWN

After restart, you must not blindly send the result again unless the reporting operation is idempotent.

Useful status markers:

text
Created
Started
ImageCaptured
ResultComputed
ResultPersisted
ReportPending
ReportSent
ReportAcknowledged
Failed
Aborted
RequiresReview

For production records, atomicity matters.

Bad:

text
Save image file
Crash
Insert DB row later

After restart, you have an orphan image.

Better:

text
Create inspection record: ImageCapturePending
Capture image
Save image with inspection ID
Update record: ImageCaptured
Process result
Update record: ResultComputed
Report result with idempotency key
Update record: ReportAcknowledged

This lets recovery scan for incomplete records and decide what to do.
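A startup scan over those records might look like the sketch below. The status names come from the sequence above; `NextAction`, `RecoveryTask`, `StartupScan`, and the mapping from status to action are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum InspectionStatus { ImageCapturePending, ImageCaptured, ResultComputed, ReportAcknowledged }

public enum NextAction { CheckForImageThenRecaptureOrAbort, ReprocessImage, SendReportWithSameKey }

public sealed record InspectionRecord(string InspectionId, string WaferId, InspectionStatus Status);

public sealed record RecoveryTask(string InspectionId, NextAction Action);

public static class StartupScan
{
    // Lists every record that stopped short of ReportAcknowledged,
    // paired with the action this sketch assumes for that status.
    public static IReadOnlyList<RecoveryTask> FindIncomplete(IEnumerable<InspectionRecord> records)
        => records
            .Where(r => r.Status != InspectionStatus.ReportAcknowledged)
            .Select(r => new RecoveryTask(r.InspectionId, ActionFor(r.Status)))
            .ToList();

    private static NextAction ActionFor(InspectionStatus status) => status switch
    {
        InspectionStatus.ImageCapturePending => NextAction.CheckForImageThenRecaptureOrAbort,
        InspectionStatus.ImageCaptured => NextAction.ReprocessImage,
        InspectionStatus.ResultComputed => NextAction.SendReportWithSameKey,
        _ => throw new ArgumentOutOfRangeException(nameof(status))
    };
}
```

Because the record is created before the image is captured, every interrupted inspection leaves a row this scan can find.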


PART 7 — Real-World Failure Scenarios

Scenario 1 — Software restores “Running” after restart

What it looks like:

text
Application starts.
UI shows Running.
Machine is physically stopped.
Operator assumes system is processing.
Nothing moves.
Production time is lost.

Why it happens:

text
MachineState = Running

was persisted and restored directly.

Prevention:

text
Persist LastKnownMachineState = Running
Start as RecoveryRequired
Validate devices and physical state
Only transition to Running through normal start logic
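The prevention steps can be expressed as a startup policy that never returns Running. This is a minimal sketch: `PersistedSession`, `StartupPolicy`, and the `CleanShutdown` flag are hypothetical, and treating a clean shutdown from Idle as safe to start in Idle is an assumption of this example.

```csharp
public enum MachineMode { Idle, Running, RecoveryRequired }

// What was written to the store before the process stopped.
public sealed record PersistedSession(MachineMode LastKnownMode, bool CleanShutdown);

public static class StartupPolicy
{
    // The persisted mode is only evidence about the past. Startup never
    // returns Running: that mode is reachable only through the normal
    // start sequence, after devices and physical state are validated.
    public static MachineMode InitialMode(PersistedSession persisted)
    {
        bool wasIdleAndClean =
            persisted.CleanShutdown && persisted.LastKnownMode == MachineMode.Idle;
        return wasIdleAndClean ? MachineMode.Idle : MachineMode.RecoveryRequired;
    }
}
```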

Scenario 2 — Last known axis position is trusted after reference is lost

What it looks like:

text
System thinks X = 120 mm.
Axis actually lost encoder reference.
Software commands movement based on wrong coordinate.
Machine hits limit or risks collision.

Why it happens:

text
LastKnownPosition was treated as current verified position.

Prevention:

text
After power loss:
AxisPositionTrust = InvalidUntilHomed
Motion commands disabled except safe recovery/homing

This connects directly to motion safety: homing, reference positions, hard limits, soft limits, and safe travel zones are core machine-control topics.
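The rule "motion commands disabled except homing" can be enforced inside the axis object. The sketch below is illustrative only: `Axis`, `PositionTrust`, and the idea that homing leaves the axis at 0.0 mm are assumptions, and a real implementation would call the motion controller's homing routine.

```csharp
using System;

public enum PositionTrust { InvalidUntilHomed, Verified }

public sealed class Axis
{
    public string Name { get; }

    // Every axis starts untrusted after restart, whatever was persisted.
    public PositionTrust Trust { get; private set; } = PositionTrust.InvalidUntilHomed;

    // Null until homing establishes a reference.
    public double? Position { get; private set; }

    public Axis(string name) => Name = name;

    // Homing is the only motion allowed while the position is untrusted.
    public void Home()
    {
        // Placeholder for the controller's homing routine.
        Position = 0.0;
        Trust = PositionTrust.Verified;
    }

    // Rejects any move that would be computed from an untrusted coordinate.
    public void MoveTo(double targetMm)
    {
        if (Trust != PositionTrust.Verified)
            throw new InvalidOperationException(
                $"{Name}: position is not trusted; home the axis first.");
        Position = targetMm;
    }
}
```

The last-known position can still be shown to the operator as a hint, but it never reaches `MoveTo`.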


Scenario 3 — Workflow resumes after partial completion

What it looks like:

text
Robot picked wafer.
Crash occurs before placement confirmation.
After restart, workflow resumes from "Place wafer complete".
Robot/wafer state is wrong.

Why it happens:

text
Workflow step was marked complete too early.

Prevention:

text
Mark checkpoint complete only after:
- command completed
- sensor confirmed
- state persisted
- recovery-safe condition reached

Scenario 4 — Product processed twice

What it looks like:

text
Wafer W25 inspected.
Crash happens before completion marker.
After restart, system inspects W25 again.
MES receives duplicate or conflicting result.

Why it happens:

text
Physical completion was not recorded before the crash, so the production record still showed the wafer as unprocessed.

Prevention:

text
Use item-level processing status.
Use idempotency keys for reporting.
Detect existing inspection result before repeat.
Require operator decision for ambiguous cases.
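Idempotency keys only help if the key is derived from stable identifiers and the receiver deduplicates on it. The sketch below shows the idea with a stand-in receiver. `FakeMes` and `Reporting.KeyFor` are hypothetical, and whether a given MES supports this kind of deduplication is an assumption that has to be checked against the actual interface.

```csharp
using System.Collections.Generic;

// Stand-in for an MES endpoint that deduplicates on an idempotency key.
public sealed class FakeMes
{
    private readonly Dictionary<string, string> _accepted = new Dictionary<string, string>();

    public int AcceptedCount => _accepted.Count;

    // Returns true if the report was newly accepted, false if the key was
    // already known and the duplicate was ignored.
    public bool Submit(string idempotencyKey, string result)
    {
        if (_accepted.ContainsKey(idempotencyKey)) return false;
        _accepted[idempotencyKey] = result;
        return true;
    }
}

public static class Reporting
{
    // The key is built from persisted identifiers, so a retry after restart
    // produces the same key as the attempt before the crash.
    public static string KeyFor(string lotId, string waferId, string inspectionId)
        => $"{lotId}/{waferId}/{inspectionId}";
}
```

With this in place, the "MES report sent: UNKNOWN" case can be resolved by sending again with the same key.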

Scenario 5 — Image saved but database failed

What it looks like:

text
Image file exists on disk.
Database has no result record.
Review UI cannot find the image.
Engineer later finds orphan files.

Why it happens:

text
File storage and metadata persistence were not coordinated.

Prevention:

text
Create record before image capture.
Use stable IDs in file path.
Recover orphan/pending records on startup.
Expose "incomplete inspection data" diagnostics.
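When the inspection ID is part of the file path, startup can compare what is on disk with what the database expects. This is a hedged sketch: `ImageReconciler` and `Reconciliation` are hypothetical names, and both inputs are assumed to be inspection IDs that were already extracted from file names and records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record Reconciliation(
    IReadOnlyList<string> OrphanImages,    // file exists, no record
    IReadOnlyList<string> MissingImages);  // record expects an image, no file

public static class ImageReconciler
{
    // Compares the two ID sets in both directions and returns sorted lists
    // suitable for an "incomplete inspection data" diagnostic.
    public static Reconciliation Compare(
        IEnumerable<string> idsOnDisk,
        IEnumerable<string> idsInDatabase)
    {
        var disk = new HashSet<string>(idsOnDisk);
        var db = new HashSet<string>(idsInDatabase);

        return new Reconciliation(
            disk.Except(db).OrderBy(id => id, StringComparer.Ordinal).ToList(),
            db.Except(disk).OrderBy(id => id, StringComparer.Ordinal).ToList());
    }
}
```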

Scenario 6 — UI shows ready while device initialization is incomplete

What it looks like:

text
Operator clicks Start.
Camera is connected but not configured.
First acquisition fails.
Machine enters alarm.

Why it happens:

text
Connected was treated as Ready.

Prevention:

text
Device states should be explicit:

Disconnected
Connected
Identified
Configured
Ready
Faulted
Degraded

Scenario 7 — Stale recipe restored after hardware change

What it looks like:

text
System restores Recipe A.
But camera/lens/stage configuration changed.
Recipe parameters no longer match machine capability.
Inspection quality becomes wrong.

Why it happens:

text
Recipe was restored without compatibility validation.

Prevention:

text
Validate:
- recipe version
- machine configuration version
- device identity
- calibration version
- firmware/driver compatibility

Recipe/configuration safety is important because industrial systems are heavily parameterized, and recipe mistakes can damage throughput, quality, or hardware.
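The validation list for Scenario 7 can be sketched as a comparison between a snapshot stored with the recipe and the machine as it is now. `MachineSnapshot` and `RecipeCompatibility` are hypothetical, and plain string equality stands in for whatever compatibility rules a real system would apply.

```csharp
using System.Collections.Generic;

public sealed record MachineSnapshot(
    string MachineConfigVersion,
    string CameraModel,
    string CalibrationVersion,
    string FirmwareVersion);

public static class RecipeCompatibility
{
    // Lists every field that differs between the snapshot stored with the
    // recipe and the current machine. An empty list means no mismatch found.
    public static IReadOnlyList<string> Mismatches(
        MachineSnapshot storedWithRecipe,
        MachineSnapshot current)
    {
        var issues = new List<string>();
        if (storedWithRecipe.MachineConfigVersion != current.MachineConfigVersion)
            issues.Add("machine configuration version");
        if (storedWithRecipe.CameraModel != current.CameraModel)
            issues.Add("device identity");
        if (storedWithRecipe.CalibrationVersion != current.CalibrationVersion)
            issues.Add("calibration version");
        if (storedWithRecipe.FirmwareVersion != current.FirmwareVersion)
            issues.Add("firmware/driver compatibility");
        return issues;
    }
}
```

A non-empty list should block the recipe from being restored as active and send the decision to an engineer.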


PART 8 — Software Design Implications

The architecture should look like this:

text
+---------------------+
| Persistence Store   |
| - recovery context  |
| - production state  |
| - checkpoints       |
+----------+----------+
           |
           v
+---------------------+
| Recovery Manager    |
| - load last known   |
| - classify state    |
| - plan validation   |
+----------+----------+
           |
           v
+-------------------------------+
| Device Validation             |
| + Physical State Checks       |
| - reconnect                   |
| - identify                    |
| - read sensors                |
| - home/probe if needed        |
+----------+--------------------+
           |
           v
+-------------------------------+
| Workflow Recovery Decision    |
| - resume                      |
| - repeat                      |
| - rollback                    |
| - abort                       |
| - operator confirmation       |
+----------+--------------------+
           |
           v
+-------------------------------+
| Operator Guidance             |
| Safe Resume / Recovery / Abort|
+-------------------------------+

Bad approach

text
Serialize whole machine object graph.
Restart application.
Deserialize object graph.
Continue execution.

This is dangerous because it restores software memory, not physical truth.

Better approach

Persist minimal recovery context:

text
MachineSessionId
ProductionContext
LastCompletedCheckpoint
IncompleteOperation
DeviceConfigurationSnapshot
LastKnownPhysicalState
RecoveryRequiredReason

Then validate everything that matters.

Good state model

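csharp
using System;

// How far a restored value can be relied on after restart.
public enum TrustLevel
{
    Unknown,                       // nothing recorded and nothing measured
    LastKnownOnly,                 // loaded from persistence, not yet checked
    Verified,                      // confirmed by sensors or controller feedback
    InvalidUntilHomed,             // reference lost; homing required first
    RequiresOperatorConfirmation   // cannot be verified automatically
}

// Recovery view of one axis: the persisted position plus how much it can be trusted.
public sealed record AxisRecoveryState(
    string AxisName,
    double? LastKnownPosition,     // mm; null if nothing was persisted
    TrustLevel PositionTrust,
    bool HomingRequired,
    DateTimeOffset LastUpdatedAt);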

The key idea is not the C# syntax.

The key idea is that state has a trust level.


PART 9 — Interview / Real-World Talking Points

A strong answer:

In machine software, I would not restore runtime state blindly after restart. I would persist minimal recovery context: production identity, recipe version, workflow checkpoint, incomplete operation, and last-known physical state. On startup, I would treat that state as untrusted until devices reconnect, hardware identity is validated, sensors are checked, axes are homed if needed, and the workflow recovery point is determined. Only then would the system allow safe resume, repeat, rollback, abort, or operator-guided recovery.

Common mistakes software engineers make:

text
They persist too much runtime state.
They trust last-known physical state.
They confuse workflow step with safe checkpoint.
They restore Running directly after restart.
They ignore partial completion.
They do not model Unknown state.
They assume device connected means device ready.
They forget duplicate reporting and traceability risks.

Strong engineers understand:

text
Persisted != trusted
Last-known != current
Started != completed
Completed != verified
Verified != reported
Connected != ready
Recovery must be explicit
Unknown is a valid and important state

Final mental model:

text
Business software recovery:
Load state -> continue

Industrial machine recovery:
Load last-known context
-> validate physical reality
-> classify uncertainty
-> choose safe recovery path
-> guide operator if needed
-> resume only from safe checkpoint

That is the heart of system state persistence and recovery in industrial machine software.
