System State Persistence & Recovery

In industrial machine software, recovery is not simply “load the last state from the database.”

The core rule is:

Persisted state is not truth. Physical validation after restart is truth.

This topic belongs directly inside reliability and recovery. Your roadmap says machines fail through hardware disconnects, motion faults, timeout chains, power issues, incomplete sequences, sensor disagreement, corrupted states, and operator interruptions; the system must detect failures, fail safely, and recover without making things worse.

PART 1 — Why State Recovery Is Hard in Machine Software

In business software, persisted state is usually the source of truth.

Example:

text

Order.Status = Paid

After restart, you can usually trust it.

In machine software, this is dangerous.

Example:

text

WaferClamp.State = Clamped
AxisX.LastPosition = 123.45 mm
Workflow.State = Inspecting

After power loss, these may only mean:

text

The software last believed this was true.

They do not guarantee that the machine is physically still in that condition.

A wafer may still be clamped. A robot may be holding a part. An axis may have lost reference. A vacuum may have dropped. A camera may have captured an image, but the result may not have been stored.

That is why recovery must answer four questions:

text

1. What was happening before failure?
2. What is physically true now?
3. What can be safely resumed?
4. What must be revalidated or aborted?

This matches the bigger machine-control mindset: industrial software interacts with physical reality, operations are long-running and asynchronous, and wrong logic can cause real-world damage.

PART 2 — Types of State in Industrial Systems

A good recovery design separates state into categories.

text

+------------------------------------------------------+
|                Industrial System State               |
+------------------------------------------------------+
| 1. Production Context                                |
|    Lot ID, Job ID, Wafer ID, Recipe Version           |
|                                                      |
| 2. Workflow State                                    |
|    Current operation, step, completed checkpoints     |
|                                                      |
| 3. Machine Physical State                            |
|    Axis position, clamp state, vacuum, part presence  |
|                                                      |
| 4. Device State                                      |
|    Connected, ready, initialized, faulted             |
|                                                      |
| 5. Transient Runtime State                           |
|    Queues, callbacks, pending commands, subscriptions |
+------------------------------------------------------+

1. Persistent production context

Usually safe and important to persist:

text

LotId
JobId
RunId
WaferId / PartId
RecipeId
RecipeVersion
OperatorId
MachineId

This preserves traceability.

Without it, after restart the machine may not know which wafer was being processed, which recipe was active, or whether results were already reported.

2. Workflow state

Partially safe to persist, but must be modeled carefully.

You should not persist only:

text

CurrentStep = "InspectWafer"

Better:

text

WorkflowInstanceId
CurrentOperation
LastCompletedCheckpoint
StepStatus
RecoveryPolicy

The key is checkpoint-based recovery.

A workflow step may have started but not completed. So you need to know the last safe recovery point, not just the last method being executed.
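The fields listed above could be persisted as one small record per workflow instance. This is a minimal sketch: the enum members and the `StepWasInterrupted` helper are assumptions added for illustration, since the text names the fields but not their values.

```csharp
using System;

// Hypothetical values: the article names these fields but not their contents.
public enum StepStatus { NotStarted, InProgress, Completed, Failed }
public enum RecoveryPolicy { ResumeFromCheckpoint, RepeatStep, RequireOperator, AbortItem }

// Persisted once per workflow instance and updated at every checkpoint.
public sealed record WorkflowRecoveryRecord(
    Guid WorkflowInstanceId,
    string CurrentOperation,
    string LastCompletedCheckpoint,
    StepStatus StepStatus,
    RecoveryPolicy RecoveryPolicy)
{
    // True when shutdown happened mid-step: CurrentOperation cannot be assumed
    // finished, so recovery has to start from LastCompletedCheckpoint.
    public bool StepWasInterrupted => StepStatus == StepStatus.InProgress;
}
```

On restart, a record with `StepWasInterrupted == true` tells the recovery logic to go back to `LastCompletedCheckpoint` instead of continuing `CurrentOperation`.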

3. Machine physical state

Dangerous to trust blindly.

Examples:

text

AxisX.LastKnownPosition = 120.5 mm
Clamp.LastKnownState = Clamped
Vacuum.LastKnownPressure = -80 kPa
PartPresence.LastKnown = Present

After restart, these should become:

text

LastKnownPosition
LastKnownClampState
LastKnownVacuumState
LastKnownPartPresence

Not:

text

CurrentPosition
CurrentClampState
CurrentVacuumState
CurrentPartPresence

Physical state must be revalidated using sensors, controller feedback, homing, probing, or service confirmation.
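One way to keep last-known values from leaking into current state is to give them their own type. The sketch below is illustrative only; `LastKnown<T>`, `Reconciled<T>`, and `Reconcile` are hypothetical names, and a real system would reconcile against sensor or controller reads, not plain values.

```csharp
using System;
using System.Collections.Generic;

// Result of comparing a persisted value with a fresh measurement.
public sealed record Reconciled<T>(T Current, T Persisted, bool ChangedWhileDown);

// Wraps a persisted value so that it reads as a hint, not as the current state.
public sealed record LastKnown<T>(T Value, DateTimeOffset RecordedAt)
{
    // The measurement always wins; the persisted value only tells us whether
    // the machine changed while the software was not watching.
    public Reconciled<T> Reconcile(T measured) =>
        new Reconciled<T>(
            measured,
            Value,
            !EqualityComparer<T>.Default.Equals(Value, measured));
}
```

For example, `new LastKnown<string>("Clamped", savedAt).Reconcile("Released")` yields `Current = "Released"` with `ChangedWhileDown = true`, which is a signal for the recovery flow, not something to ignore.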

4. Device state

Usually reconstructed, not blindly persisted.

Do not restore:

text

Camera.State = Ready
MotionController.State = Initialized
Robot.State = Connected

Instead, rebuild it:

text

Connect
Identify device
Check firmware/config
Initialize
Read status
Verify ready/fault state

A device that was ready yesterday may be disconnected today.
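The rebuild sequence above can be sketched as an ordered bring-up that stops at the first failing step. The `runStep` delegate is a hypothetical hook standing in for real device SDK calls, and the type names are assumptions for illustration.

```csharp
using System;

// The rebuild steps from the list above, in execution order.
public enum BringUpStep
{
    Connect,
    Identify,
    CheckFirmwareAndConfig,
    Initialize,
    ReadStatus,
    VerifyReady
}

// FailedAt is null when every step succeeded.
public sealed record BringUpResult(bool Ready, BringUpStep? FailedAt);

public static class DeviceBringUp
{
    // Runs the steps in declaration order and stops at the first failure,
    // so a device is never reported ready on the strength of a persisted flag.
    public static BringUpResult Run(Func<BringUpStep, bool> runStep)
    {
        foreach (BringUpStep step in Enum.GetValues(typeof(BringUpStep)))
        {
            if (!runStep(step))
                return new BringUpResult(false, step);
        }
        return new BringUpResult(true, null);
    }
}
```

Reporting `FailedAt` gives the operator screen something more useful than a generic "device error".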

5. Transient runtime state

Usually should not be persisted.

Examples:

text

In-memory queues
Pending async commands
Callbacks
Subscriptions
Temporary buffers
Cancellation tokens
Thread state
Device SDK handles

These belong to the running process. After restart, they are dead.

PART 3 — Persisted State vs Trusted State

This distinction is one of the most important ideas.

text

+-------------------+
| Persisted State   |
| "Last known"      |
+---------+---------+
          |
          v
+-------------------+
| Validation        |
| sensors, devices, |
| homing, checks    |
+---------+---------+
          |
          v
+---------------------------+
| Trusted Current State     |
| safe to use               |
+---------------------------+

          OR

+---------------------------+
| Unknown / Unsafe State    |
| requires recovery action  |
+---------------------------+

Persisted state says:

text

What did software last record?

Trusted state says:

text

What has the system verified now?

Example:

text

Persisted:
AxisX.LastKnownPosition = 150.0 mm

After restart:
AxisX.TrustLevel = UnknownUntilHomed

Another example:

text

Persisted:
Vacuum.LastKnownState = On

After restart:
Vacuum.CurrentState = Unknown
RecoveryAction = ReadVacuumSensorAndPressure

A mature system has explicit language for this:

text

Verified
Unverified
LastKnown
Unknown
RequiresHoming
RequiresOperatorConfirmation
UnsafeToResume

Bad systems only have:

text

Running
Stopped
Ready
Error

That is not enough.

PART 4 — Recovery After Crash or Power Loss

A safe recovery flow looks like this:

text

+----------------------+
| Application Restart  |
+----------+-----------+
           |
           v
+----------------------+
| Load Persisted       |
| Recovery Context     |
+----------+-----------+
           |
           v
+----------------------+
| Reconnect Devices    |
+----------+-----------+
           |
           v
+----------------------+
| Validate Device      |
| Identity / Config    |
+----------+-----------+
           |
           v
+----------------------+
| Re-establish         |
| Physical State       |
+----------+-----------+
           |
           v
+----------------------+
| Determine Workflow   |
| Recovery Point       |
+----------+-----------+
           |
           v
+----------------------+
| Operator / Service   |
| Confirmation Needed? |
+----+-------------+---+
     |             |
     v             v
+----------+   +----------+
| Resume   |   | Recover  |
| Safely   |   | / Abort  |
+----------+   +----------+

Automatic resume is often unsafe because the machine may have changed while the software was down.

During power loss:

text

Axis may coast or lose reference
Vacuum may drop
Part may move
Robot may stop mid-transfer
Controller may keep some state while the PC lost its own
Operator may manually intervene

So restart should usually enter a recovery mode, not normal running mode.

Example recovery screen:

text

Machine restarted after abnormal shutdown.

Last known context:
- Lot: LOT-2026-0412
- Wafer: W25
- Recipe: RCP-A v14
- Last checkpoint: ImageCaptured
- Result status: NotReported

Current validation:
- Motion controller: Connected
- X/Y axes: Not homed
- Vacuum: Pressure not detected
- Wafer presence: Sensor indicates present

Recommended action:
- Home axes using safe recovery path
- Verify wafer clamped
- Repeat inspection step or abort wafer

This is much safer than showing:

text

Machine Ready
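The final branch of the recovery flow can be sketched as a function over the validation results. This is a hedged illustration: `CheckResult`, `ValidationCheck`, and `RecoveryDecision` are hypothetical names, and treating an empty check list as unsafe is an assumption of this sketch.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum CheckResult { Verified, NeedsOperatorConfirmation, Failed }
public enum RecoveryOutcome { ResumeSafely, AwaitOperatorConfirmation, RecoverOrAbort }

public sealed record ValidationCheck(string Name, CheckResult Result);

public static class RecoveryDecision
{
    // Any failed check blocks resume. Otherwise a single open confirmation is
    // enough to hold the machine until an operator or service engineer signs off.
    public static RecoveryOutcome Decide(IReadOnlyCollection<ValidationCheck> checks)
    {
        // Assumption: no evidence at all is treated as unsafe, not as "all clear".
        if (checks.Count == 0 || checks.Any(c => c.Result == CheckResult.Failed))
            return RecoveryOutcome.RecoverOrAbort;

        return checks.Any(c => c.Result == CheckResult.NeedsOperatorConfirmation)
            ? RecoveryOutcome.AwaitOperatorConfirmation
            : RecoveryOutcome.ResumeSafely;
    }
}
```

Fed with the values from the example recovery screen (axes not homed, vacuum pressure not detected), the function returns `RecoverOrAbort`, never a plain "ready".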

PART 5 — Workflow Recovery and Partial Completion

Industrial workflows fail in the middle all the time.

Example:

text

Load wafer
Clamp wafer
Move to inspection position
Capture image
Process image
Store result
Report result
Unload wafer

A crash may happen here:

text

Load wafer                   DONE
Clamp wafer                  DONE
Move to inspection position  DONE
Capture image                DONE
Process image                IN PROGRESS
Store result                 NOT DONE
Report result                NOT DONE
Unload wafer                 NOT DONE

The system must know the difference between:

text

Step started
Step physically completed
Step verified
Step persisted
Step reported

A good workflow model:

text

+-------------+
| Step Started|
+------+------+
       |
       v
+-------------+
| Action Sent |
+------+------+
       |
       v
+-------------+
| Device Done |
+------+------+
       |
       v
+-------------+
| Verified    |
+------+------+
       |
       v
+-------------+
| Persisted   |
+------+------+
       |
       v
+-------------+
| Checkpoint  |
| Completed   |
+-------------+

Recovery should resume only from completed checkpoints.
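The step model above maps naturally onto an ordered enum and a restart rule. The sketch below rests on one stated assumption: each stage is persisted before the next action is taken (so `ActionSent` is written before the command actually goes out). Without that write-ahead ordering, a recorded `Started` could not be trusted to mean "nothing reached the device". `RestartAction` and `StepRecovery` are hypothetical names.

```csharp
// Stages from the step model, in the order a step passes through them.
public enum StepProgress
{
    Started,
    ActionSent,
    DeviceDone,
    Verified,
    Persisted,
    CheckpointCompleted
}

public enum RestartAction { ResumeFromNextStep, RepeatStep, RevalidateThenDecide }

public static class StepRecovery
{
    // Assumes write-ahead recording: a stage is persisted before acting on it.
    public static RestartAction OnRestart(StepProgress lastRecorded)
    {
        if (lastRecorded == StepProgress.CheckpointCompleted)
            return RestartAction.ResumeFromNextStep;

        // Nothing reached the device, so repeating the step cannot double-act.
        if (lastRecorded == StepProgress.Started)
            return RestartAction.RepeatStep;

        // A command may have gone out but no checkpoint closed: the physical
        // effect is uncertain, and earlier verification is stale after restart.
        return RestartAction.RevalidateThenDecide;
    }
}
```

Everything between `ActionSent` and `Persisted` lands in the same bucket on purpose: after a restart, none of those stages proves what the hardware looks like now.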

Bad design:

text

CurrentStep = CaptureImage

Good design:

text

Step = CaptureImage
CommandSent = true
ImageReceived = true
ImageSaved = true
ResultComputed = false
CheckpointCompleted = false
RecoveryPolicy = RepeatFromImageProcessing

Recovery options include:

text

Resume from known safe checkpoint
Repeat step
Rollback
Move to recovery workflow
Require operator intervention
Abort current item
Scrap / quarantine material

The important point:

Recovery checkpoints must be designed before failure happens. They cannot be guessed reliably after a crash.

PART 6 — Production Context Recovery

Production context is different from physical machine state.

Production context answers:

text

What are we processing?
Under which recipe?
For which job?
What has already been recorded?
What has already been reported?

Typical persisted production state:

text

RunId
LotId
WaferId
SlotId
RecipeId
RecipeVersion
InspectionId
ImageSetId
ResultStatus
ReportStatus
ExportStatus

The dangerous cases are usually around duplicates and gaps.

Example:

text

Image saved: YES
Database result record: NO
MES report sent: UNKNOWN

After restart, you must not blindly send the result again unless the reporting operation is idempotent.
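A common way to make reporting safe to retry is an idempotency key derived only from stable identity. The sketch below is an illustration under assumptions: the key format is invented, and `InMemoryReportReceiver` is a stand-in for the receiving side (a real MES interface would define its own deduplication and would keep keys in durable storage).

```csharp
using System.Collections.Generic;

public sealed record InspectionReport(string InspectionId, string WaferId, string Result)
{
    // Derived only from stable identity, so a resend after restart
    // produces exactly the same key as the original attempt.
    public string IdempotencyKey => "inspection-report:" + InspectionId;
}

// Hypothetical receiver used to show the deduplication behaviour.
public sealed class InMemoryReportReceiver
{
    private readonly Dictionary<string, InspectionReport> _accepted =
        new Dictionary<string, InspectionReport>();

    public int AcceptedCount => _accepted.Count;

    // Returns true when the report is newly accepted,
    // false when the same key was already seen.
    public bool Submit(InspectionReport report)
    {
        if (_accepted.ContainsKey(report.IdempotencyKey))
            return false;

        _accepted.Add(report.IdempotencyKey, report);
        return true;
    }
}
```

With this in place, a report whose status is unknown after a crash can be sent again: the second submission is recognised and does not create a duplicate result.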

Useful status markers:

text

Created
Started
ImageCaptured
ResultComputed
ResultPersisted
ReportPending
ReportSent
ReportAcknowledged
Failed
Aborted
RequiresReview

For production records, atomicity matters.

Bad:

text

Save image file
Crash
Insert DB row (never happens)

After restart, you have an orphan image.

Better:

text

Create inspection record: ImageCapturePending
Capture image
Save image with inspection ID
Update record: ImageCaptured
Process result
Update record: ResultComputed
Report result with idempotency key
Update record: ReportAcknowledged

This lets recovery scan for incomplete records and decide what to do.
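A startup scan over those records might look like the sketch below. The status names come from the sequence above; the `PlannedAction` type and the action texts are assumptions chosen for illustration, and a real system would likely return typed actions instead of strings.

```csharp
using System.Collections.Generic;

public enum InspectionStatus
{
    ImageCapturePending,
    ImageCaptured,
    ResultComputed,
    ReportAcknowledged
}

public sealed record InspectionRecord(string InspectionId, InspectionStatus Status);
public sealed record PlannedAction(string InspectionId, string Action);

public static class IncompleteInspectionScan
{
    // Maps every unfinished record to the next recovery action.
    // Fully acknowledged records produce no action.
    public static List<PlannedAction> Plan(IEnumerable<InspectionRecord> records)
    {
        var plan = new List<PlannedAction>();
        foreach (var record in records)
        {
            switch (record.Status)
            {
                case InspectionStatus.ImageCapturePending:
                    plan.Add(new PlannedAction(record.InspectionId,
                        "Look for an image with this ID, then recapture or abort"));
                    break;
                case InspectionStatus.ImageCaptured:
                    plan.Add(new PlannedAction(record.InspectionId,
                        "Reprocess the saved image"));
                    break;
                case InspectionStatus.ResultComputed:
                    plan.Add(new PlannedAction(record.InspectionId,
                        "Report again with the same idempotency key"));
                    break;
                case InspectionStatus.ReportAcknowledged:
                    break; // complete, nothing to recover
            }
        }
        return plan;
    }
}
```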

PART 7 — Real-World Failure Scenarios

Scenario 1 — Software restores “Running” after restart

What it looks like:

text

Application starts.
UI shows Running.
Machine is physically stopped.
Operator assumes system is processing.
Nothing moves.
Production time is lost.

Why it happens:

text

MachineState = Running

was persisted and restored directly.

Prevention:

text

Persist LastKnownMachineState = Running
Start as RecoveryRequired
Validate devices and physical state
Only transition to Running through normal start logic
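A minimal sketch of that prevention rule follows. `MachineStateController` and its members are hypothetical; the point is only that the persisted value is stored as information and never assigned to the live state.

```csharp
public enum MachineState { RecoveryRequired, Idle, Running }

public sealed class MachineStateController
{
    // What was persisted before shutdown: shown to the operator, never applied.
    public MachineState? LastKnownState { get; }

    // Every process start begins here, whatever was persisted.
    public MachineState State { get; private set; } = MachineState.RecoveryRequired;

    public MachineStateController(MachineState? lastKnownState)
    {
        LastKnownState = lastKnownState;
    }

    // Leaves recovery only when both validations have passed.
    public void CompleteRecovery(bool devicesValidated, bool physicalStateValidated)
    {
        if (State == MachineState.RecoveryRequired && devicesValidated && physicalStateValidated)
            State = MachineState.Idle;
    }

    // The normal start path: the only way to reach Running.
    public bool TryStart()
    {
        if (State != MachineState.Idle)
            return false;

        State = MachineState.Running;
        return true;
    }
}
```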

Scenario 2 — Last known axis position is trusted after reference is lost

What it looks like:

text

System thinks X = 120 mm.
Axis actually lost encoder reference.
Software commands movement based on wrong coordinate.
Machine hits limit or risks collision.

Why it happens:

text

LastKnownPosition was treated as current verified position.

Prevention:

text

After power loss:
AxisPositionTrust = InvalidUntilHomed
Motion commands disabled except safe recovery/homing

This connects directly to motion safety: homing, reference positions, hard limits, soft limits, and safe travel zones are core machine-control topics.
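The gating rule from this scenario can be sketched as a small guard in front of the motion API. The names below (`AxisGuard`, `MotionCommand`, the event methods) are assumptions for illustration; a real guard would also cover soft limits and safe travel zones.

```csharp
public enum AxisPositionTrust { InvalidUntilHomed, Verified }
public enum MotionCommand { Home, MoveAbsolute, MoveRelative, Jog }

public sealed class AxisGuard
{
    // A fresh process never trusts the persisted position.
    public AxisPositionTrust Trust { get; private set; } = AxisPositionTrust.InvalidUntilHomed;

    // Until the axis is homed, homing is the only motion allowed.
    public bool IsAllowed(MotionCommand command) =>
        Trust == AxisPositionTrust.Verified || command == MotionCommand.Home;

    public void OnHomingCompleted() => Trust = AxisPositionTrust.Verified;

    // Called on power loss, drive fault, or any event that drops the reference.
    public void OnReferenceLost() => Trust = AxisPositionTrust.InvalidUntilHomed;
}
```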

Scenario 3 — Workflow resumes after partial completion

What it looks like:

text

Robot picked wafer.
Crash occurs before placement confirmation.
After restart, workflow resumes from "Place wafer complete".
Robot/wafer state is wrong.

Why it happens:

text

Workflow step was marked complete too early.

Prevention:

text

Mark checkpoint complete only after:
- command completed
- sensor confirmed
- state persisted
- recovery-safe condition reached

Scenario 4 — Product processed twice

What it looks like:

text

Wafer W25 inspected.
Crash happens before completion marker.
After restart, system inspects W25 again.
MES receives duplicate or conflicting result.

Why it happens:

text

The physical inspection finished, but the production record was not durably marked complete before the crash.

Prevention:

text

Use item-level processing status.
Use idempotency keys for reporting.
Detect existing inspection result before repeat.
Require operator decision for ambiguous cases.

Scenario 5 — Image saved but database failed

What it looks like:

text

Image file exists on disk.
Database has no result record.
Review UI cannot find the image.
Engineer later finds orphan files.

Why it happens:

text

File storage and metadata persistence were not coordinated.

Prevention:

text

Create record before image capture.
Use stable IDs in file path.
Recover orphan/pending records on startup.
Expose "incomplete inspection data" diagnostics.
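A startup check for this scenario can compare what is on disk with what the database knows. The sketch assumes image files are named `<InspectionId>.<extension>`, which is one possible reading of "stable IDs in file path"; `OrphanScan` and `OrphanReport` are hypothetical names, and the inputs are plain lists so the logic stays independent of any particular storage API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public sealed record OrphanReport(
    IReadOnlyList<string> OrphanFiles,
    IReadOnlyList<string> RecordsWithoutImage);

public static class OrphanScan
{
    // Assumption: each image file is named "<InspectionId>.<extension>".
    public static OrphanReport Compare(
        IEnumerable<string> imageFileNames,
        IEnumerable<string> recordedInspectionIds)
    {
        var known = new HashSet<string>(recordedInspectionIds);
        var seen = new HashSet<string>();
        var orphans = new List<string>();

        foreach (var file in imageFileNames)
        {
            var id = Path.GetFileNameWithoutExtension(file);
            seen.Add(id);
            if (!known.Contains(id))
                orphans.Add(file); // image on disk with no database record
        }

        // Records whose image never reached the disk, sorted for stable output.
        var missing = known
            .Where(id => !seen.Contains(id))
            .OrderBy(id => id, StringComparer.Ordinal)
            .ToList();

        return new OrphanReport(orphans, missing);
    }
}
```

Both lists belong in the "incomplete inspection data" diagnostics: one shows files nobody can find from the UI, the other shows records that promise an image that does not exist.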

Scenario 6 — UI shows ready while device initialization is incomplete

What it looks like:

text

Operator clicks Start.
Camera is connected but not configured.
First acquisition fails.
Machine enters alarm.

Why it happens:

text

Connected was treated as Ready.

Prevention:

text

Device states should be explicit:

Disconnected
Connected
Identified
Configured
Ready
Faulted
Degraded

Scenario 7 — Stale recipe restored after hardware change

What it looks like:

text

System restores Recipe A.
But camera/lens/stage configuration changed.
Recipe parameters no longer match machine capability.
Inspection quality becomes wrong.

Why it happens:

text

Recipe was restored without compatibility validation.

Prevention:

text

Validate:
- recipe version
- machine configuration version
- device identity
- calibration version
- firmware/driver compatibility

Recipe/configuration safety is important because industrial systems are heavily parameterized, and recipe mistakes can damage throughput, quality, or hardware.
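The validation list in this scenario can be sketched as a comparison between the configuration a recipe was validated against and what the machine reports now. All type and property names below are assumptions for illustration, and exact string equality is a simplification; real compatibility rules are often ranges or allow-lists.

```csharp
using System;
using System.Collections.Generic;

// Snapshot of the machine properties a recipe depends on.
public sealed record MachineFingerprint(
    string MachineConfigVersion,
    string CameraModel,
    string CalibrationVersion,
    string FirmwareVersion);

public sealed record Recipe(string RecipeId, int Version, MachineFingerprint ValidatedAgainst);

public static class RecipeCompatibility
{
    // Returns one message per mismatch; an empty list means the recipe
    // still matches the machine it is about to run on.
    public static IReadOnlyList<string> FindMismatches(Recipe recipe, MachineFingerprint current)
    {
        var mismatches = new List<string>();

        void Check(string name, string expected, string actual)
        {
            if (!string.Equals(expected, actual, StringComparison.Ordinal))
                mismatches.Add(name + ": recipe validated against '" + expected +
                               "', machine reports '" + actual + "'");
        }

        var expectedConfig = recipe.ValidatedAgainst;
        Check("MachineConfigVersion", expectedConfig.MachineConfigVersion, current.MachineConfigVersion);
        Check("CameraModel", expectedConfig.CameraModel, current.CameraModel);
        Check("CalibrationVersion", expectedConfig.CalibrationVersion, current.CalibrationVersion);
        Check("FirmwareVersion", expectedConfig.FirmwareVersion, current.FirmwareVersion);

        return mismatches;
    }
}
```

A non-empty result would keep the recipe from loading and give the operator the exact reason, instead of letting inspection quality degrade silently.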

PART 8 — Software Design Implications

The architecture should look like this:

text

+---------------------+
| Persistence Store   |
| - recovery context  |
| - production state  |
| - checkpoints       |
+----------+----------+
           |
           v
+---------------------+
| Recovery Manager    |
| - load last known   |
| - classify state    |
| - plan validation   |
+----------+----------+
           |
           v
+-------------------------------+
| Device Validation             |
| + Physical State Checks       |
| - reconnect                   |
| - identify                    |
| - read sensors                |
| - home/probe if needed         |
+----------+--------------------+
           |
           v
+-------------------------------+
| Workflow Recovery Decision    |
| - resume                      |
| - repeat                      |
| - rollback                    |
| - abort                       |
| - operator confirmation        |
+----------+--------------------+
           |
           v
+-------------------------------+
| Operator Guidance             |
| Safe Resume / Recovery / Abort|
+-------------------------------+

Bad approach

text

Serialize whole machine object graph.
Restart application.
Deserialize object graph.
Continue execution.

This is dangerous because it restores software memory, not physical truth.

Better approach

Persist minimal recovery context:

text

MachineSessionId
ProductionContext
LastCompletedCheckpoint
IncompleteOperation
DeviceConfigurationSnapshot
LastKnownPhysicalState
RecoveryRequiredReason

Then validate everything that matters.

Good state model

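csharp

using System;

// How far a piece of persisted state can be trusted after restart.
public enum TrustLevel
{
    Unknown,                      // nothing recorded, or the record is unusable
    LastKnownOnly,                // persisted value exists but is not yet checked
    Verified,                     // confirmed against sensors or controller feedback
    InvalidUntilHomed,            // position reference lost; homing required
    RequiresOperatorConfirmation  // software alone cannot establish the truth
}

// Recovery view of one axis. LastKnownPosition is in millimetres and is
// null when no position was ever persisted.
public sealed record AxisRecoveryState(
    string AxisName,
    double? LastKnownPosition,
    TrustLevel PositionTrust,
    bool HomingRequired,
    DateTimeOffset LastUpdatedAt);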

The key idea is not the C# syntax.

The key idea is that every piece of state carries a trust level.

PART 9 — Interview / Real-World Talking Points

A strong answer:

In machine software, I would not restore runtime state blindly after restart. I would persist minimal recovery context: production identity, recipe version, workflow checkpoint, incomplete operation, and last-known physical state. On startup, I would treat that state as untrusted until devices reconnect, hardware identity is validated, sensors are checked, axes are homed if needed, and the workflow recovery point is determined. Only then would the system allow safe resume, repeat, rollback, abort, or operator-guided recovery.

Common mistakes software engineers make:

text

They persist too much runtime state.
They trust last-known physical state.
They confuse workflow step with safe checkpoint.
They restore Running directly after restart.
They ignore partial completion.
They do not model Unknown state.
They assume device connected means device ready.
They forget duplicate reporting and traceability risks.

Strong engineers understand:

text

Persisted != trusted
Last-known != current
Started != completed
Completed != verified
Verified != reported
Connected != ready
Recovery must be explicit
Unknown is a valid and important state

Final mental model:

text

Business software recovery:
Load state -> continue

Industrial machine recovery:
Load last-known context
-> validate physical reality
-> classify uncertainty
-> choose safe recovery path
-> guide operator if needed
-> resume only from safe checkpoint

That is the heart of system state persistence and recovery in industrial machine software.
