Error Handling, Fault Propagation & Recovery
This topic fits your roadmap’s Reliability, Fault Handling & Recovery domain, which emphasizes detecting failures, failing safely, reporting clearly, and recovering without making the situation worse.
PART 1 — Why Error Handling Is Not Just Try/Catch
In normal business software, an error often means:
request failed → return error response → user tries again
In industrial machine software, an error may affect:
- physical motion
- camera acquisition
- robot handling
- vacuum state
- wafer position
- machine mode
- operator action
- production material
So the real question is not:
“Did we catch the exception?”
The real question is:
“What state is the machine in now, and what must happen next to keep it safe and recoverable?”
A try/catch handles a code-level problem.
A fault-handling strategy handles a system-level abnormal condition.
Example:
A vision pipeline throws an exception while processing an image.
In a normal app, you might log it and skip the image.
In a wafer inspection machine, you must ask:
- Was the wafer moving?
- Was this image tied to a specific stage position?
- Is the inspection result now invalid?
- Should the workflow pause?
- Can the machine continue?
- Does the operator need intervention?
- Is the current lot still trustworthy?
The exception is only the symptom.
The machine fault is the real problem.
PART 2 — Error vs Fault vs Failure
A useful distinction:
Error
Something went wrong in code, data, communication, or execution.
Examples:
- null value
- invalid parameter
- SDK call failed
- timeout occurred
- image processing exception
- unexpected device response
Fault
The system is now in an abnormal condition.
Examples:
- camera disconnected
- axis not ready
- vacuum not reached
- wafer position unknown
- recipe invalid
- subsystem unavailable
Failure
The system cannot perform its required function.
Examples:
- machine cannot inspect wafers
- robot cannot load material
- stage cannot move safely
- inspection result cannot be trusted
- recovery cannot continue automatically
The architecture must not confuse these three.
Bad design treats every error as just another exception to catch.
Good design converts low-level errors into meaningful machine faults.
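A minimal Python sketch of that conversion step. The `MachineFault` type, the fault codes, and the mapping rules are all illustrative assumptions, not a fixed standard; a real machine would define its own fault catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineFault:
    code: str        # stable fault identifier, e.g. "CameraUnavailable"
    subsystem: str   # subsystem the fault belongs to
    retryable: bool  # True if a bounded local retry is reasonable
    detail: str      # original low-level message, kept for diagnostics


def classify_camera_error(exc: Exception) -> MachineFault:
    """Map a low-level camera error to a machine fault (illustrative mapping)."""
    if isinstance(exc, TimeoutError):
        return MachineFault("CameraAcquisitionTimeout", "camera", True, str(exc))
    if isinstance(exc, ConnectionError):
        return MachineFault("CameraUnavailable", "camera", False, str(exc))
    # Unknown errors are not retried blindly; they surface as their own fault.
    return MachineFault("CameraUnclassifiedError", "camera", False, str(exc))


fault = classify_camera_error(TimeoutError("no frame within 500 ms"))
print(fault.code, fault.retryable)
```

The point of the sketch is that upper layers see a fault code and a recovery hint, never the raw exception type.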
PART 3 — Fault Propagation Across Layers
Faults become dangerous when they travel upward without structure.
Simple propagation diagram:
+----------+      +-----------+      +------------+      +------+
|  Device  | ---> |  Control  | ---> |  Workflow  | ---> |  UI  |
|  Layer   |      |   Layer   |      |   Layer    |      |      |
+----------+      +-----------+      +------------+      +------+
     |                  |                  |                |
camera timeout  exception thrown    sequence stuck  operator confused
Example:
Camera SDK timeout
↓
Camera adapter throws exception
↓
Inspection controller does not classify it
↓
Workflow waits forever for image result
↓
UI shows "Running" even though inspection is dead
↓
Operator presses Stop/Start repeatedly
↓
Machine state becomes harder to recover
The original problem was small: a camera timeout.
The real failure became large because the system did not contain it.
Industrial software must control propagation.
A fault should move upward only after it has been classified.
PART 4 — Containment Strategy
The principle:
Handle as low as possible. Escalate only when needed.
But “low as possible” does not mean “hide the problem.”
It means:
- retry local transient problems locally
- reset only the affected subsystem if safe
- escalate when the current layer cannot guarantee correctness
- never silently continue if physical state or product quality is uncertain
Containment diagram:
+------------------------------------------------------------+
|                          UI Layer                          |
|   Inform operator, restrict commands, show recovery path   |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                   Application / Workflow                   |
|     Pause step, abort run, mark product/result invalid     |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                       Control Layer                        |
|  Isolate subsystem, stop motion, block dependent actions   |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                        Device Layer                        |
|   Retry, reconnect, reset, report device-specific fault    |
+------------------------------------------------------------+
Layer responsibility:
| Layer | Should do |
|---|---|
| Device layer | Detect communication/device errors, retry safe transient operations, expose clear device fault |
| Control layer | Protect physical subsystem, stop unsafe actions, isolate unavailable devices |
| Workflow layer | Decide whether current operation can continue, pause, abort, or require operator recovery |
| UI layer | Inform operator and prevent unsafe/manual conflicting actions |
A common mistake is letting the UI decide machine recovery.
The UI should request recovery.
The workflow/control layers should decide whether recovery is valid.
PART 5 — Error Handling Strategies
1. Fail-fast
Use when continuing is unsafe or correctness is impossible.
Examples:
- unexpected motion state
- axis position unknown
- safety-related interlock lost
- wafer presence inconsistent
- recipe parameter violates physical limit
Fail-fast means:
stop the affected operation immediately and move toward a safe state.
It does not always mean crash the process.
In industrial systems, fail-fast usually means controlled stop, not application death.
2. Retry
Use for transient faults where retry is safe and bounded.
Examples:
- temporary TCP communication loss
- short device busy response
- camera frame not ready
- database write transient failure
Retry must have:
- maximum attempt count
- timeout
- delay/backoff
- cancellation
- fault classification
- no hidden infinite loops
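The requirements above can be sketched as a small Python helper. The names `retry_bounded` and `is_cancelled` are hypothetical, `CameraAcquisitionFault` is modelled here as an exception class, and treating only `TimeoutError` as transient is an assumption made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CameraAcquisitionFault(Exception):
    """Structured fault raised once the retry budget is exhausted."""


def retry_bounded(
    operation: Callable[[], T],
    max_attempts: int = 3,
    total_timeout_s: float = 2.0,
    delay_s: float = 0.1,
    is_cancelled: Callable[[], bool] = lambda: False,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = time.monotonic() + total_timeout_s
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        if is_cancelled():
            raise CameraAcquisitionFault("retry cancelled by caller")
        try:
            return operation()
        except TimeoutError as exc:
            # Only this transient error is retried; anything else propagates.
            last_error = exc
        # Give up when attempts are used up or the time budget would be exceeded.
        if attempt == max_attempts or time.monotonic() + delay_s > deadline:
            break
        time.sleep(delay_s)
        delay_s *= 2  # simple exponential backoff
    raise CameraAcquisitionFault(
        f"acquisition failed after {attempt} attempt(s): {last_error}"
    ) from last_error
```

The caller, not the helper, decides what happens after `CameraAcquisitionFault`: typically the workflow pauses inspection instead of retrying again.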
Bad retry:
while (true)
    TryReadCamera();
Good retry:
Retry 3 times within 2 seconds.
If still failed, raise CameraAcquisitionFault.
Pause inspection workflow.
3. Fallback
Use when an alternative path exists and correctness remains acceptable.
Examples:
- use cached calibration only if still valid
- use secondary sensor if primary sensor fails
- use offline result buffering if host connection is down
Fallback must be explicit.
The system should know it is operating on fallback behavior.
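One way to keep fallback explicit is to carry the source in the returned value. This Python sketch assumes a hypothetical `Reading` type and treats `OSError` as the primary-sensor failure; both are illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Reading:
    value: float
    source: str        # "primary" or "secondary"
    is_fallback: bool  # True when the alternative path produced the value


def read_with_fallback(
    primary: Callable[[], float], secondary: Callable[[], float]
) -> Reading:
    try:
        return Reading(primary(), "primary", False)
    except OSError:
        # The fallback is visible to the caller, so it can be logged,
        # shown to the operator, and attached to the inspection result.
        # If the secondary also fails, its error propagates to the caller.
        return Reading(secondary(), "secondary", True)
```

Because `is_fallback` travels with the value, downstream code cannot mistake a fallback reading for a normal one.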
4. Degrade
Use when the machine can continue with reduced capability.
Examples:
- continue handling wafers but disable inspection
- continue production without optional image preview
- run slower because one performance optimization is unavailable
- disable automatic review while still saving raw results
Degraded mode is dangerous if operators do not understand it.
The system must define:
- what is disabled
- what remains valid
- what quality risk exists
- how to return to normal
5. Isolate
Use when one subsystem fails but the whole system should not collapse.
Examples:
- disable camera subsystem
- keep motion controller alive
- stop only one station in a multi-station machine
- prevent recipe activation while allowing diagnostics
Isolation is one of the strongest anti-cascade strategies.
PART 6 — Recovery Models
Recovery is not:
“Run the same code again and hope it works.”
Recovery means:
bring the machine from an abnormal state to a safe, known, consistent state, then resume only if valid.
Recovery flow:
+----------+      +------------+      +----------------+      +--------+
| Failure  | ---> | Safe State | ---> | Recovery Action| ---> | Resume |
+----------+      +------------+      +----------------+      +--------+
     |                  |                     |                   |
device fault      motion stopped      reset/rehome/retry      continue,
timeout           outputs safe        operator confirm        restart step,
bad state         workflow paused     reload context          or abort
Important recovery concepts:
Safe state
The machine is not doing anything dangerous.
Examples:
- motion stopped
- actuator outputs disabled
- robot not moving
- laser off
- workflow paused
- no new material introduced
Known state
The software understands the physical condition.
Examples:
- axis position known
- wafer presence confirmed
- camera connected
- recipe loaded
- workflow step known
- subsystem readiness confirmed
A state can be safe but not known.
For example:
Motion stopped, but stage position is uncertain.
That is safe, but the machine cannot resume until the position is re-established, for example by rehoming.
Consistent state
Software state, hardware state, workflow state, and operator-visible state agree.
If recovery resets a device but the workflow still thinks the old operation is running, the system is inconsistent.
That is where dangerous bugs appear.
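The three conditions can be checked together before any resume. This is a Python sketch only; the `MachineSnapshot` fields are hypothetical stand-ins for whatever a real machine reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineSnapshot:
    motion_stopped: bool
    outputs_safe: bool
    axis_position_known: bool
    wafer_presence_confirmed: bool
    workflow_step: str         # step the workflow believes is active
    device_reported_step: str  # step the hardware reports


def resume_blockers(s: MachineSnapshot) -> List[str]:
    """Return the reasons resume is not allowed; an empty list means it is."""
    blockers = []
    if not (s.motion_stopped and s.outputs_safe):
        blockers.append("state not safe")
    if not (s.axis_position_known and s.wafer_presence_confirmed):
        blockers.append("state not known")
    if s.workflow_step != s.device_reported_step:
        blockers.append("state not consistent")
    return blockers


# Motion stopped, but stage position uncertain: safe, yet not known.
snapshot = MachineSnapshot(True, True, False, True, "inspect", "inspect")
print(resume_blockers(snapshot))
```

Returning the reasons, instead of a single boolean, lets the workflow pick a matching recovery action such as rehoming.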
PART 7 — Avoiding Cascading Failures
A cascading failure happens when one fault triggers more faults.
Example:
Camera timeout
↓
Processing queue grows
↓
Memory pressure increases
↓
UI becomes slow
↓
Operator presses commands repeatedly
↓
Workflow receives conflicting requests
↓
Machine enters unclear state
Strategies to prevent this:
Isolation boundaries
Each subsystem should fail independently where possible.
Camera failure should not crash motion control.
Storage failure should not freeze emergency stop handling.
UI rendering failure should not corrupt workflow state.
Timeouts
Every external wait needs a timeout.
Dangerous waits:
- wait for image forever
- wait for PLC bit forever
- wait for motion complete forever
- wait for operator response forever
- wait for device reconnect forever
Industrial workflows should not have unbounded waits.
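A bounded wait can be sketched in Python with `threading.Event`. The function name `wait_for_signal` and the fault class `DeviceTimeoutFault` are assumptions for illustration.

```python
import threading


class DeviceTimeoutFault(Exception):
    """Raised when an expected device signal does not arrive in time."""


def wait_for_signal(event: threading.Event, name: str, timeout_s: float) -> None:
    # Event.wait returns False when the timeout expires before the flag is set.
    if not event.wait(timeout=timeout_s):
        raise DeviceTimeoutFault(f"{name} not signalled within {timeout_s} s")


motion_complete = threading.Event()
try:
    wait_for_signal(motion_complete, "motion complete", timeout_s=0.05)
except DeviceTimeoutFault as fault:
    # The workflow now has a structured fault to act on instead of a hung step.
    print("fault:", fault)
```

The timeout turns "stuck" into a fault the workflow can transition on.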
Circuit breaker concept
If a subsystem repeatedly fails, stop calling it temporarily.
Example:
Camera failed 5 times in 30 seconds
↓
Mark camera subsystem unavailable
↓
Stop acquisition attempts
↓
Require reset/reconnect/recovery
This prevents failure storms.
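A minimal Python sketch of the idea, using the 5-failures-in-30-seconds rule from the example. The class and method names are hypothetical, and the breaker here reopens only through an explicit `reset()`, which is one possible policy.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, max_failures=5, window_s=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._window_s = window_s
        self._clock = clock        # injectable so the logic can be tested
        self._failures = deque()   # timestamps of recent failures
        self.open = False          # True = subsystem marked unavailable

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Forget failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self.open = True

    def allow_call(self) -> bool:
        return not self.open

    def reset(self) -> None:
        """Called only by an explicit recovery action, never automatically."""
        self._failures.clear()
        self.open = False
```

The acquisition loop would check `allow_call()` before each attempt and stop calling the camera once the breaker is open.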
Queue limits
Unbounded queues are hidden failure amplifiers.
Examples:
- image queue
- result queue
- log/event queue
- UI update queue
- device command queue
If the system cannot keep up, it must apply backpressure, drop non-critical data, or pause upstream work.
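The three options can be sketched in Python with a bounded `queue.Queue`. The helper name `offer` and the split into critical and non-critical items are assumptions made for the example.

```python
import queue


def offer(q: queue.Queue, item, critical: bool, timeout_s: float = 0.5) -> bool:
    """Return True if enqueued, False if a non-critical item was dropped."""
    try:
        if critical:
            # Blocking put: the producer is slowed down (backpressure).
            q.put(item, timeout=timeout_s)
        else:
            q.put_nowait(item)
        return True
    except queue.Full:
        if critical:
            # The caller must pause upstream work or raise a fault.
            raise
        return False  # e.g. a preview frame is dropped, not queued


images = queue.Queue(maxsize=2)  # the bound is the important part
offer(images, "image-1", critical=True)
offer(images, "image-2", critical=True)
print(offer(images, "preview-3", critical=False))
```

With a bound in place, a slow consumer shows up as a visible event at the producer instead of as silent memory growth.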
Subsystem independence
A machine should not be one giant synchronous call chain.
Bad:
UI button click
→ workflow
→ motion
→ camera
→ processing
→ storage
→ UI update
Better:
UI sends command
Workflow owns sequence
Subsystems report state/events
Failures are classified and escalated
Recovery is state-driven
PART 8 — Real-World Failure Scenarios
Scenario 1: Camera failure causes infinite retry loop
What it looks like:
- machine appears stuck
- CPU usage increases
- UI becomes slow
- camera keeps reconnecting
- workflow never completes
Why it happens:
- retry is hidden inside device adapter
- no retry limit
- no escalation
- workflow never receives a real fault
How engineers fix it:
- bounded retry
- classify as CameraUnavailable
- stop acquisition
- pause workflow
- require reconnect/reinitialize recovery path
Scenario 2: Processing error propagates to UI thread
What it looks like:
- inspection crashes the UI
- operator loses current screen
- machine state may still be active underneath
- recovery is unclear
Why it happens:
- background processing exception crosses thread boundary badly
- UI directly depends on processing task success
- no fault boundary between pipeline and presentation
How engineers fix it:
- isolate processing pipeline
- convert processing exception into inspection fault
- mark result invalid
- keep UI alive
- let workflow decide whether to retry, skip, pause, or abort
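A fault boundary around the processing step might look like this in Python. `InspectionResult`, `run_inspection`, and the fault-code format are hypothetical names for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InspectionResult:
    image_id: str
    valid: bool
    defects: Optional[int]     # None when the result is invalid
    fault_code: Optional[str]  # set only when processing failed


def run_inspection(image_id: str, process: Callable[[str], int]) -> InspectionResult:
    try:
        return InspectionResult(image_id, True, process(image_id), None)
    except Exception as exc:
        # Boundary: the exception is converted, not swallowed. The result is
        # marked invalid and carries a fault code for the workflow to act on.
        code = f"InspectionProcessingFault:{type(exc).__name__}"
        return InspectionResult(image_id, False, None, code)


def failing_pipeline(image_id: str) -> int:
    raise ValueError("corrupt image buffer")


result = run_inspection("wafer-07/img-12", failing_pipeline)
print(result.valid, result.fault_code)
```

The UI only ever receives an `InspectionResult`, so a processing failure cannot take the presentation layer down with it.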
Scenario 3: Device timeout not handled
What it looks like:
- workflow stays “Running”
- no visible progress
- operator cannot tell if machine is busy or stuck
- stop command may not work cleanly
Why it happens:
- command waits forever
- timeout not modeled as a fault
- no cancellation path
- workflow step has no failure transition
How engineers fix it:
- every device operation has timeout
- timeout becomes structured fault
- workflow transitions to Paused, Faulted, or Recovering
- operator commands are gated based on real state
Scenario 4: Recovery resets subsystem but state is not synchronized
What it looks like:
- device reconnects successfully
- UI says ready
- workflow still fails
- next operation behaves unexpectedly
Why it happens:
- recovery only reset hardware
- software state was not rebuilt
- cached state was stale
- workflow context was not reconciled
How engineers fix it:
- recovery includes state reconciliation
- reread device status
- verify position/sensor/product state
- rebuild subsystem readiness
- resume only from valid workflow checkpoint
Scenario 5: Operator retries manually and worsens state inconsistency
What it looks like:
- operator presses Reset, Start, Stop, Retry repeatedly
- machine enters confusing partial state
- support team cannot reconstruct what happened
- material may need to be scrapped
Why it happens:
- system exposes too many commands during fault state
- recovery path is not guided
- UI command enablement is not tied to machine state
- workflow accepts commands while recovery is incomplete
How engineers fix it:
- restrict commands during fault/recovery
- define explicit recovery states
- allow only valid next actions
- require confirmation when product/material state is uncertain
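Command gating tied to machine state can be sketched as a lookup table. The state names follow the ones used above; the command names and the exact table contents are assumptions for illustration.

```python
from enum import Enum, auto


class MachineState(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()
    FAULTED = auto()
    RECOVERING = auto()


# Only the listed commands are accepted in each state; everything else is rejected.
ALLOWED_COMMANDS = {
    MachineState.IDLE: {"start"},
    MachineState.RUNNING: {"pause", "stop"},
    MachineState.PAUSED: {"resume", "stop"},
    MachineState.FAULTED: {"start_recovery"},
    MachineState.RECOVERING: {"abort_recovery"},
}


def is_command_allowed(state: MachineState, command: str) -> bool:
    return command in ALLOWED_COMMANDS.get(state, set())


# Pressing Start repeatedly during a fault has no effect on the workflow.
print(is_command_allowed(MachineState.FAULTED, "start"))
print(is_command_allowed(MachineState.FAULTED, "start_recovery"))
```

The same table can drive UI button enablement and workflow-side validation, so the two cannot drift apart.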
PART 9 — Software Design Implications
Industrial error handling needs architecture, not scattered catch blocks.
Component diagram:
+-------------+      +---------------+      +-------------------+
| Subsystem   | ---> | Error Handler | ---> | Recovery Strategy |
| Camera      |      | Classify      |      | Retry / Reset     |
| Motion      |      | Normalize     |      | Pause / Abort     |
| Robot       |      | Escalate      |      | Rehome / Manual   |
+-------------+      +---------------+      +-------------------+
                             |
                             v
                     +----------------+
                     | Escalation     |
                     | Workflow Fault |
                     | Machine Fault  |
                     +----------------+
                             |
                             v
                     +----------------+
                     | UI / Alarm     |
                     | Operator Path  |
                     +----------------+
A strong design usually has:
- clear fault model
- subsystem-specific fault classification
- bounded retries
- explicit recovery states
- workflow-level fault transitions
- no silent failure
- no hidden infinite retry
- no UI-owned recovery logic
- no raw device exception leaking into workflow logic
- no direct hardware control from UI
- consistent escalation rules
Bad design:
catch (Exception)
{
    // ignore
}
Also bad:
catch (Exception ex)
{
    Log(ex);
    RetryForever();
}
Also bad:
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
Good design:
Device error
↓
Classify fault
↓
Contain locally if safe
↓
Escalate structured fault if not recoverable locally
↓
Workflow transitions to known fault state
↓
Machine moves to safe state
↓
Recovery path is selected
↓
Operator is guided only when needed
PART 10 — Interview / Real-World Talking Points
A strong answer in an interview:
In industrial systems, error handling is not just catching exceptions. The important part is controlling machine behavior when something abnormal happens. A low-level error must be classified into a meaningful fault, contained at the right layer, and escalated only when local recovery is not safe or sufficient. Recovery must bring the machine to a safe, known, and consistent state before resuming. The goal is to prevent one subsystem failure from cascading into workflow deadlock, UI confusion, unsafe motion, or corrupted production state.
Key distinction:
Exception handling = code-level control flow
Fault handling = system-level safety, state, and recovery behavior
Common mistakes engineers make:
- catch everywhere but define no fault model
- retry blindly
- hide device failures
- let UI own recovery logic
- allow workflows to wait forever
- continue when physical state is unknown
- reset hardware without rebuilding software state
- treat all errors as equal
- fail to isolate subsystems
- allow one queue or device failure to freeze the whole system
What strong engineers understand:
- failures are normal, not exceptional
- recovery is a state machine problem
- safe state is not the same as known state
- bounded retry is useful; infinite retry is dangerous
- escalation must be intentional
- subsystem isolation protects the whole machine
- operator actions must be constrained during fault states
- the machine must never pretend it is healthy when correctness is uncertain
The core mindset:
In industrial software, good error handling does not merely keep the application alive. It keeps the machine safe, the workflow understandable, and recovery controlled.