Error Handling, Fault Propagation & Recovery

This topic fits your roadmap’s Reliability, Fault Handling & Recovery domain, which emphasizes detecting failures, failing safely, reporting clearly, and recovering without making the situation worse.

PART 1 — Why Error Handling Is Not Just Try/Catch

In normal business software, an error often means:

request failed → return error response → user tries again

In industrial machine software, an error may affect:

physical motion
camera acquisition
robot handling
vacuum state
wafer position
machine mode
operator action
production material

So the real question is not:

“Did we catch the exception?”

The real question is:

“What state is the machine in now, and what must happen next to keep it safe and recoverable?”

A try/catch handles a code-level problem.

A fault-handling strategy handles a system-level abnormal condition.

Example:

A vision pipeline throws an exception while processing an image.

In a normal app, you might log it and skip the image.

In a wafer inspection machine, you must ask:

Was the wafer moving?
Was this image tied to a specific stage position?
Is the inspection result now invalid?
Should the workflow pause?
Can the machine continue?
Does the operator need intervention?
Is the current lot still trustworthy?

The exception is only the symptom.

The machine fault is the real problem.

PART 2 — Error vs Fault vs Failure

A useful distinction:

Error

Something went wrong in code, data, communication, or execution.

Examples:

null value
invalid parameter
SDK call failed
timeout occurred
image processing exception
unexpected device response

Fault

The system is now in an abnormal condition.

Examples:

camera disconnected
axis not ready
vacuum not reached
wafer position unknown
recipe invalid
subsystem unavailable

Failure

The system cannot perform its required function.

Examples:

machine cannot inspect wafers
robot cannot load material
stage cannot move safely
inspection result cannot be trusted
recovery cannot continue automatically

The architecture must not confuse these three.

Bad design treats every error as just another exception to catch.

Good design converts low-level errors into meaningful machine faults.
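
One way to picture that conversion is a small classification step at the device boundary. The sketch below is illustrative only: `MachineFault`, `Severity`, the fault codes, and the mapping rules are hypothetical names chosen for this example, not part of any real SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"          # operation may continue
    RECOVERABLE = "recoverable"  # pause and attempt recovery
    CRITICAL = "critical"        # stop and require intervention


@dataclass(frozen=True)
class MachineFault:
    code: str           # stable identifier used by workflow and UI
    subsystem: str      # which part of the machine is affected
    severity: Severity
    detail: str         # original low-level message, kept for diagnostics


def classify(subsystem: str, error: Exception) -> MachineFault:
    """Translate a low-level error into a structured machine fault."""
    if isinstance(error, TimeoutError):
        return MachineFault(f"{subsystem}.timeout", subsystem,
                            Severity.RECOVERABLE, str(error))
    if isinstance(error, ConnectionError):
        return MachineFault(f"{subsystem}.disconnected", subsystem,
                            Severity.RECOVERABLE, str(error))
    if isinstance(error, ValueError):
        return MachineFault(f"{subsystem}.invalid_parameter", subsystem,
                            Severity.CRITICAL, str(error))
    # Unknown errors are treated as critical rather than silently ignored.
    return MachineFault(f"{subsystem}.unclassified", subsystem,
                        Severity.CRITICAL, repr(error))
```

The point of the sketch is the shape, not the specific rules: upper layers see a fault code, a subsystem, and a severity, never a raw exception type.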

PART 3 — Fault Propagation Across Layers

Faults become dangerous when they travel upward without structure.

Simple propagation diagram:

text

+----------+      +-----------+      +------------+      +------+
| Device   | ---> | Control   | ---> | Workflow   | ---> | UI   |
| Layer    |      | Layer     |      | Layer      |      |      |
+----------+      +-----------+      +------------+      +------+
     |                 |                  |                  |
 camera timeout   exception thrown   sequence stuck     operator confused

Example:

text

Camera SDK timeout
    ↓
Camera adapter throws exception
    ↓
Inspection controller does not classify it
    ↓
Workflow waits forever for image result
    ↓
UI shows "Running" even though inspection is dead
    ↓
Operator presses Stop/Start repeatedly
    ↓
Machine state becomes harder to recover

The original problem was small: a camera timeout.

The real failure became large because the system did not contain it.

Industrial software must control propagation.

A fault should move upward only after it has been classified.

PART 4 — Containment Strategy

The principle:

Handle as low as possible. Escalate only when needed.

But “as low as possible” does not mean “hide the problem.”

It means:

retry local transient problems locally
reset only the affected subsystem if safe
escalate when the current layer cannot guarantee correctness
never silently continue if physical state or product quality is uncertain

Containment diagram:

text

+------------------------------------------------------------+
|                        UI Layer                            |
|   Inform operator, restrict commands, show recovery path   |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                    Application / Workflow                  |
|   Pause step, abort run, mark product/result invalid       |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                     Control Layer                          |
|   Isolate subsystem, stop motion, block dependent actions  |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                      Device Layer                          |
|   Retry, reconnect, reset, report device-specific fault    |
+------------------------------------------------------------+

Layer responsibility:

| Layer | Should do |
| --- | --- |
| Device layer | Detect communication/device errors, retry safe transient operations, expose clear device fault |
| Control layer | Protect physical subsystem, stop unsafe actions, isolate unavailable devices |
| Workflow layer | Decide whether current operation can continue, pause, abort, or require operator recovery |
| UI layer | Inform operator and prevent unsafe/manual conflicting actions |

A common mistake is letting the UI decide machine recovery.

The UI should request recovery.

The workflow/control layers should decide whether recovery is valid.
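
A minimal sketch of that split, assuming a hypothetical `Workflow` class and `MachineStatus` record (both invented for this example): the UI handler only forwards the request and displays the answer, while the validation lives in the workflow.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MachineStatus:
    motion_stopped: bool
    position_known: bool
    fault_active: bool


class Workflow:
    """Owns the recovery decision; the UI only submits requests."""

    def __init__(self, status: MachineStatus) -> None:
        self.status = status

    def request_recovery(self) -> Tuple[bool, str]:
        # Validation lives here, not in the UI event handler.
        if not self.status.fault_active:
            return False, "no active fault to recover from"
        if not self.status.motion_stopped:
            return False, "motion must be stopped before recovery"
        if not self.status.position_known:
            return False, "stage position unknown: rehome required first"
        self.status.fault_active = False
        return True, "recovery accepted"


def on_recover_button_clicked(workflow: Workflow) -> str:
    """UI-side handler: forwards the request and formats the answer."""
    accepted, reason = workflow.request_recovery()
    prefix = "Recovery started: " if accepted else "Recovery refused: "
    return prefix + reason
```

Because the UI never inspects machine state itself, a second UI or a remote host could send the same request and get the same decision.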

PART 5 — Error Handling Strategies

1. Fail-fast

Use when continuing is unsafe or correctness is impossible.

Examples:

unexpected motion state
axis position unknown
safety-related interlock lost
wafer presence inconsistent
recipe parameter violates physical limit

Fail-fast means:

stop the affected operation immediately and move toward a safe state.

It does not necessarily mean crashing the process.

In industrial systems, fail-fast usually means controlled stop, not application death.
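
As a rough illustration, the sketch below rejects a move that violates a travel limit before any motion starts, then lets the caller carry on with the step marked as faulted. `Stage`, `MachineFaultError`, and the limits are hypothetical, and motion is simulated.

```python
class MachineFaultError(Exception):
    """Raised when an operation must stop instead of continuing."""


class Stage:
    def __init__(self, min_mm: float, max_mm: float) -> None:
        self.min_mm, self.max_mm = min_mm, max_mm
        self.moving = False
        self.position_mm = 0.0

    def controlled_stop(self) -> None:
        # Placeholder for a real deceleration/stop command.
        self.moving = False

    def move_to(self, target_mm: float) -> None:
        # Fail fast: reject the command before any motion starts.
        if not (self.min_mm <= target_mm <= self.max_mm):
            self.controlled_stop()
            raise MachineFaultError(
                f"target {target_mm} mm outside travel "
                f"[{self.min_mm}, {self.max_mm}] mm"
            )
        self.moving = True
        self.position_mm = target_mm  # simulation: motion completes instantly
        self.moving = False


def run_step(stage: Stage, target_mm: float) -> str:
    try:
        stage.move_to(target_mm)
        return "done"
    except MachineFaultError as fault:
        # The step is stopped, but the application stays alive.
        return f"faulted: {fault}"
```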

2. Retry

Use for transient faults where retry is safe and bounded.

Examples:

temporary TCP communication loss
short device busy response
camera frame not ready
database write transient failure

Retry must have:

maximum attempt count
timeout
delay/backoff
cancellation
fault classification
no hidden infinite loops

Bad retry:

text

while(true)
    TryReadCamera();

Good retry:

text

Retry 3 times within 2 seconds.
If it still fails, raise CameraAcquisitionFault.
Pause inspection workflow.
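
The requirements listed above (attempt limit, overall time budget, backoff, cancellation, classification) can be sketched as a single helper. This is one possible shape, not a reference implementation: `retry_bounded` and its parameter names are invented here, and it assumes `operation` has its own per-call timeout, since the deadline is only checked between attempts.

```python
import threading
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CameraAcquisitionFault(Exception):
    """Structured fault raised once local retries are exhausted."""


def retry_bounded(
    operation: Callable[[], T],
    *,
    max_attempts: int = 3,
    deadline_s: float = 2.0,
    base_delay_s: float = 0.1,
    cancel: Optional[threading.Event] = None,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = time.monotonic()
    attempts_made = 0
    last_error: Optional[Exception] = None
    while attempts_made < max_attempts:
        if cancel is not None and cancel.is_set():
            raise CameraAcquisitionFault("acquisition cancelled")
        attempts_made += 1
        try:
            return operation()
        except (TimeoutError, ConnectionError) as error:
            # Only transient error types are retried; anything else propagates.
            last_error = error
        remaining = deadline_s - (time.monotonic() - start)
        if attempts_made == max_attempts or remaining <= 0:
            break
        # Exponential backoff, capped by the remaining time budget.
        delay = min(base_delay_s * 2 ** (attempts_made - 1), remaining)
        if cancel is not None:
            cancel.wait(delay)  # cancellable sleep
        else:
            time.sleep(delay)
    raise CameraAcquisitionFault(
        f"gave up after {attempts_made} attempt(s): {last_error}"
    ) from last_error
```

The caller, typically the workflow, catches `CameraAcquisitionFault` and decides whether to pause. The retry helper itself never decides that.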

3. Fallback

Use when an alternative path exists and correctness remains acceptable.

Examples:

use cached calibration only if still valid
use secondary sensor if primary sensor fails
use offline result buffering if host connection is down

Fallback must be explicit.

The system should know it is operating on fallback behavior.
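
One simple way to make fallback explicit is to carry it in the returned value. In this sketch the `Reading` type and the `read_vacuum` function are hypothetical; the sensor callables stand in for real device reads.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Reading:
    value: float
    source: str      # "primary" or "secondary"
    fallback: bool   # True when the normal path was not used


def read_vacuum(primary: Callable[[], float],
                secondary: Callable[[], float]) -> Reading:
    try:
        return Reading(primary(), "primary", fallback=False)
    except (TimeoutError, ConnectionError):
        # The fallback is recorded in the result so callers can react to it.
        # If the secondary sensor also fails, its error propagates unchanged.
        return Reading(secondary(), "secondary", fallback=True)
```

A caller can then log, alarm, or restrict operation whenever `fallback` is true, instead of discovering the substitution later.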

4. Degrade

Use when the machine can continue with reduced capability.

Examples:

continue handling wafers but disable inspection
continue production without optional image preview
run slower because one performance optimization is unavailable
disable automatic review while still saving raw results

Degraded mode is dangerous if operators do not understand it.

The system must define:

what is disabled
what remains valid
what quality risk exists
how to return to normal
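
Those four definitions can live in one record per degraded mode, so the operator message is generated from the same data the software uses. The `DegradedMode` type and the example values below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DegradedMode:
    name: str
    disabled: Tuple[str, ...]      # what is disabled
    still_valid: Tuple[str, ...]   # what remains valid
    quality_risk: str              # what quality risk exists
    exit_condition: str            # how to return to normal

    def operator_summary(self) -> str:
        return (
            f"DEGRADED: {self.name}\n"
            f"  Disabled: {', '.join(self.disabled)}\n"
            f"  Still valid: {', '.join(self.still_valid)}\n"
            f"  Risk: {self.quality_risk}\n"
            f"  Return to normal: {self.exit_condition}"
        )


# Hypothetical example entry.
NO_AUTO_REVIEW = DegradedMode(
    name="automatic review unavailable",
    disabled=("automatic review",),
    still_valid=("wafer handling", "raw result storage"),
    quality_risk="defects are recorded but not classified automatically",
    exit_condition="review service restored and verified",
)
```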

5. Isolate

Use when one subsystem fails but the whole system should not collapse.

Examples:

disable camera subsystem
keep motion controller alive
stop only one station in a multi-station machine
prevent recipe activation while allowing diagnostics

Isolation is one of the strongest anti-cascade strategies.

PART 6 — Recovery Models

Recovery is not:

“Run the same code again and hope it works.”

Recovery means:

bring the machine from an abnormal state to a safe, known, consistent state, then resume only if valid.

Recovery flow:

text

+----------+      +------------+      +----------------+      +--------+
| Failure  | ---> | Safe State | ---> | Recovery Action| ---> | Resume |
+----------+      +------------+      +----------------+      +--------+
      |                 |                    |                    |
 device fault      motion stopped       reset/rehome/retry     continue,
 timeout           outputs safe          operator confirm       restart step,
 bad state         workflow paused       reload context         or abort

Important recovery concepts:

Safe state

The machine is not doing anything dangerous.

Examples:

motion stopped
actuator outputs disabled
robot not moving
laser off
workflow paused
no new material introduced

Known state

The software understands the physical condition.

Examples:

axis position known
wafer presence confirmed
camera connected
recipe loaded
workflow step known
subsystem readiness confirmed

A state can be safe but not known.

For example:

Motion stopped, but stage position is uncertain.

That is safe, but not recoverable until position is re-established.

Consistent state

Software state, hardware state, workflow state, and operator-visible state agree.

If recovery resets a device but the workflow still thinks the old operation is running, the system is inconsistent.

That is where dangerous bugs appear.
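
The three conditions can be checked separately before any resume is allowed. The sketch below is a simplified model: the `Snapshot` fields are hypothetical stand-ins for whatever a real machine would verify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    motion_stopped: bool            # safe
    outputs_safe: bool              # safe
    position_known: bool            # known
    wafer_presence_confirmed: bool  # known
    workflow_step: str              # step the software believes is active
    device_step: str                # step the hardware reports


def resume_blockers(s: Snapshot) -> List[str]:
    """Return the reasons resume is not allowed; an empty list means OK."""
    blockers = []
    if not (s.motion_stopped and s.outputs_safe):
        blockers.append("not safe")
    if not (s.position_known and s.wafer_presence_confirmed):
        blockers.append("not known")
    if s.workflow_step != s.device_step:
        blockers.append("not consistent")
    return blockers
```

Returning the reasons, rather than a single boolean, keeps the "safe but not known" case visible to the workflow and the operator.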

PART 7 — Avoiding Cascading Failures

A cascading failure happens when one fault triggers more faults.

Example:

text

Camera timeout
   ↓
Processing queue grows
   ↓
Memory pressure increases
   ↓
UI becomes slow
   ↓
Operator presses commands repeatedly
   ↓
Workflow receives conflicting requests
   ↓
Machine enters unclear state

Strategies to prevent this:

Isolation boundaries

Each subsystem should fail independently where possible.

Camera failure should not crash motion control.

Storage failure should not freeze emergency stop handling.

UI rendering failure should not corrupt workflow state.

Timeouts

Every external wait needs a timeout.

Dangerous waits:

wait for image forever
wait for PLC bit forever
wait for motion complete forever
wait for operator response forever
wait for device reconnect forever

Industrial workflows should not have unbounded waits.
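
A bounded wait can be as small as this. The sketch uses Python's standard `threading.Event`; `MotionTimeoutFault` and `wait_motion_complete` are names invented for the example.

```python
import threading


class MotionTimeoutFault(Exception):
    """Raised when motion does not complete within the allowed time."""


def wait_motion_complete(done: threading.Event, timeout_s: float) -> None:
    # Event.wait returns False if the timeout elapsed before the event was set.
    if not done.wait(timeout=timeout_s):
        raise MotionTimeoutFault(f"motion not complete after {timeout_s} s")
```

The timeout turns "still waiting" into a fault the workflow can act on, which is the difference between a paused machine and a stuck one.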

Circuit breaker concept

If a subsystem repeatedly fails, stop calling it temporarily.

Example:

text

Camera failed 5 times in 30 seconds
    ↓
Mark camera subsystem unavailable
    ↓
Stop acquisition attempts
    ↓
Require reset/reconnect/recovery

This prevents failure storms.
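
The flow above can be sketched as a sliding-window counter. This is a simplified variant: unlike the classic circuit breaker pattern, it has no automatic half-open probe and stays open until an explicit `reset()`, matching the "require reset/reconnect/recovery" step. The class and its parameters are illustrative.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Opens after `max_failures` failures inside `window_s` seconds."""

    def __init__(self, max_failures: int = 5, window_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock  # injectable so the logic can be tested
        self.failures: Deque[float] = deque()
        self.open = False   # open = subsystem marked unavailable

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.open = True

    def allow_call(self) -> bool:
        return not self.open

    def reset(self) -> None:
        """Called only by an explicit recovery action, never automatically."""
        self.failures.clear()
        self.open = False
```

The acquisition code would check `allow_call()` before touching the camera and call `record_failure()` whenever an attempt fails.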

Queue limits

Unbounded queues are hidden failure amplifiers.

Examples:

image queue
result queue
log/event queue
UI update queue
device command queue

If the system cannot keep up, it must apply backpressure, drop non-critical data, or pause upstream work.
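
A small sketch of a queue that refuses to grow, built on the standard `queue.Queue`. The wrapper class and its `offer`/`take` names are assumptions made for this example; whether a rejected frame is dropped or causes upstream work to pause is a decision for the caller.

```python
import queue


class BoundedImageQueue:
    """Fixed-capacity queue that reports overload instead of growing."""

    def __init__(self, capacity: int) -> None:
        self._q: "queue.Queue[bytes]" = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, frame: bytes) -> bool:
        """Try to enqueue without blocking; False signals backpressure."""
        try:
            self._q.put_nowait(frame)
            return True
        except queue.Full:
            self.rejected += 1  # counted so the overload is visible
            return False

    def take(self, timeout_s: float) -> bytes:
        # Raises queue.Empty on timeout, so consumers never wait forever.
        return self._q.get(timeout=timeout_s)
```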

Subsystem independence

A machine should not be one giant synchronous call chain.

Bad:

text

UI button click
  → workflow
    → motion
      → camera
        → processing
          → storage
            → UI update

Better:

text

UI sends command
Workflow owns sequence
Subsystems report state/events
Failures are classified and escalated
Recovery is state-driven

PART 8 — Real-World Failure Scenarios

Scenario 1: Camera failure causes infinite retry loop

What it looks like:

machine appears stuck
CPU usage increases
UI becomes slow
camera keeps reconnecting
workflow never completes

Why it happens:

retry is hidden inside device adapter
no retry limit
no escalation
workflow never receives a real fault

How engineers fix it:

bounded retry
classify as CameraUnavailable
stop acquisition
pause workflow
require reconnect/reinitialize recovery path

Scenario 2: Processing error propagates to UI thread

What it looks like:

inspection crashes the UI
operator loses current screen
machine state may still be active underneath
recovery is unclear

Why it happens:

an exception from background processing reaches the UI thread unhandled
UI directly depends on processing task success
no fault boundary between pipeline and presentation

How engineers fix it:

isolate processing pipeline
convert processing exception into inspection fault
mark result invalid
keep UI alive
let workflow decide whether to retry, skip, pause, or abort
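
A fault boundary of this kind might look like the following sketch. `InspectionResult` and `process_with_boundary` are hypothetical; the `analyze` callable stands in for the real pipeline. The broad `except` is deliberate here because the error is converted into an invalid result, not discarded.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InspectionResult:
    image_id: str
    valid: bool
    defects: Optional[int]   # None when the result is invalid
    fault: Optional[str]     # fault description when invalid


def process_with_boundary(image_id: str, image: bytes,
                          analyze: Callable[[bytes], int]) -> InspectionResult:
    try:
        return InspectionResult(image_id, True, analyze(image), None)
    except Exception as error:
        # Boundary: nothing escapes toward the UI; the result is marked invalid.
        return InspectionResult(
            image_id, False, None,
            f"processing_fault: {type(error).__name__}: {error}",
        )
```

The UI only ever renders an `InspectionResult`, and the workflow reads `valid` to choose between retry, skip, pause, or abort.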

Scenario 3: Device timeout not handled

What it looks like:

workflow stays “Running”
no visible progress
operator cannot tell if machine is busy or stuck
stop command may not work cleanly

Why it happens:

command waits forever
timeout not modeled as a fault
no cancellation path
workflow step has no failure transition

How engineers fix it:

every device operation has timeout
timeout becomes structured fault
workflow transitions to Paused, Faulted, or Recovering
operator commands are gated based on real state
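
These fixes amount to giving every workflow state an explicit failure transition. A minimal sketch, with state and event names chosen for illustration only:

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()
    FAULTED = auto()
    RECOVERING = auto()


# Every (state, event) pair that is allowed, including failure events.
TRANSITIONS = {
    (State.IDLE, "start"): State.RUNNING,
    (State.RUNNING, "step_done"): State.IDLE,
    (State.RUNNING, "pause"): State.PAUSED,
    (State.RUNNING, "device_timeout"): State.FAULTED,
    (State.PAUSED, "resume"): State.RUNNING,
    (State.FAULTED, "recover"): State.RECOVERING,
    (State.RECOVERING, "recovered"): State.PAUSED,
    (State.RECOVERING, "device_timeout"): State.FAULTED,
}


def next_state(state: State, event: str) -> State:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event '{event}' not allowed in state {state.name}"
        ) from None


def allowed_events(state: State) -> List[str]:
    """Events valid in this state; the UI can enable commands from this."""
    return sorted(e for (s, e) in TRANSITIONS if s is state)
```

With the table in place, a timeout during `RUNNING` has somewhere to go, and command gating follows from `allowed_events` instead of ad hoc UI checks.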

Scenario 4: Recovery resets subsystem but state is not synchronized

What it looks like:

device reconnects successfully
UI says ready
workflow still fails
next operation behaves unexpectedly

Why it happens:

recovery only reset hardware
software state was not rebuilt
cached state was stale
workflow context was not reconciled

How engineers fix it:

recovery includes state reconciliation
reread device status
verify position/sensor/product state
rebuild subsystem readiness
resume only from valid workflow checkpoint
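
The reconciliation step can be sketched as "reread, compare, overwrite, report". The field names and the dictionary-shaped device status below are assumptions for this example, not a real device API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CachedState:
    connected: bool
    position_mm: float
    wafer_present: bool


def reconcile(cached: CachedState, device_status: Dict[str, object]) -> List[str]:
    """Overwrite cached values with freshly read ones; return what changed."""
    changes = []
    for name in ("connected", "position_mm", "wafer_present"):
        fresh = device_status[name]  # KeyError means the status is incomplete
        old = getattr(cached, name)
        if old != fresh:
            changes.append(f"{name}: {old} -> {fresh}")
            setattr(cached, name, fresh)
    return changes
```

The returned change list is worth logging: it shows exactly how far the software's picture had drifted from the hardware before recovery.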

Scenario 5: Operator retries manually and worsens state inconsistency

What it looks like:

operator presses Reset, Start, Stop, Retry repeatedly
machine enters confusing partial state
support team cannot reconstruct what happened
material may need to be scrapped

Why it happens:

system exposes too many commands during fault state
recovery path is not guided
UI command enablement is not tied to machine state
workflow accepts commands while recovery is incomplete

How engineers fix it:

restrict commands during fault/recovery
define explicit recovery states
allow only valid next actions
require confirmation when product/material state is uncertain

PART 9 — Software Design Implications

Industrial error handling needs architecture, not scattered catch blocks.

Component diagram:

text

+-------------+      +---------------+      +-------------------+
| Subsystem   | ---> | Error Handler | ---> | Recovery Strategy |
| Camera      |      | Classify      |      | Retry / Reset     |
| Motion      |      | Normalize     |      | Pause / Abort     |
| Robot       |      | Escalate      |      | Rehome / Manual   |
+-------------+      +---------------+      +-------------------+
                              |
                              v
                      +----------------+
                      | Escalation     |
                      | Workflow Fault |
                      | Machine Fault  |
                      +----------------+
                              |
                              v
                      +----------------+
                      | UI / Alarm     |
                      | Operator Path  |
                      +----------------+

A strong design usually has:

clear fault model
subsystem-specific fault classification
bounded retries
explicit recovery states
workflow-level fault transitions
no silent failure
no hidden infinite retry
no UI-owned recovery logic
no raw device exception leaking into workflow logic
no direct hardware control from UI
consistent escalation rules

Bad design:

text

catch (Exception)
{
    // ignore: the error is swallowed and no layer ever learns about it
}

Also bad:

text

catch (Exception ex)
{
    Log(ex);
    RetryForever(); // unbounded: the workflow never receives a fault
}

Also bad:

text

catch (Exception ex)
{
    // raw exception text shown to the operator; machine state is left undefined
    MessageBox.Show(ex.Message);
}

Good design:

text

Device error
    ↓
Classify fault
    ↓
Contain locally if safe
    ↓
Escalate structured fault if not recoverable locally
    ↓
Workflow transitions to known fault state
    ↓
Machine moves to safe state
    ↓
Recovery path is selected
    ↓
Operator is guided only when needed

PART 10 — Interview / Real-World Talking Points

A strong answer in an interview:

In industrial systems, error handling is not just catching exceptions. The important part is controlling machine behavior when something abnormal happens. A low-level error must be classified into a meaningful fault, contained at the right layer, and escalated only when local recovery is not safe or sufficient. Recovery must bring the machine to a safe, known, and consistent state before resuming. The goal is to prevent one subsystem failure from cascading into workflow deadlock, UI confusion, unsafe motion, or corrupted production state.

Key distinction:

text

Exception handling = code-level control flow

Fault handling = system-level safety, state, and recovery behavior

Common mistakes engineers make:

catch everywhere but define no fault model
retry blindly
hide device failures
let UI own recovery logic
allow workflows to wait forever
continue when physical state is unknown
reset hardware without rebuilding software state
treat all errors as equal
fail to isolate subsystems
allow one queue or device failure to freeze the whole system

What strong engineers understand:

failures are normal, not exceptional
recovery is a state machine problem
safe state is not the same as known state
bounded retry is useful; infinite retry is dangerous
escalation must be intentional
subsystem isolation protects the whole machine
operator actions must be constrained during fault states
the machine must never pretend it is healthy when correctness is uncertain

The core mindset:

In industrial software, good error handling does not merely keep the application alive. It keeps the machine safe, the workflow understandable, and recovery controlled.
