
Error Handling, Fault Propagation & Recovery

This topic fits your roadmap’s Reliability, Fault Handling & Recovery domain, which emphasizes detecting failures, failing safely, reporting clearly, and recovering without making the situation worse.


PART 1 — Why Error Handling Is Not Just Try/Catch

In normal business software, an error often means:

request failed → return error response → user tries again

In industrial machine software, an error may affect:

  • physical motion
  • camera acquisition
  • robot handling
  • vacuum state
  • wafer position
  • machine mode
  • operator action
  • production material

So the real question is not:

“Did we catch the exception?”

The real question is:

“What state is the machine in now, and what must happen next to keep it safe and recoverable?”

A try/catch handles a code-level problem.

A fault-handling strategy handles a system-level abnormal condition.

Example:

A vision pipeline throws an exception while processing an image.

In a normal app, you might log it and skip the image.

In a wafer inspection machine, you must ask:

  • Was the wafer moving?
  • Was this image tied to a specific stage position?
  • Is the inspection result now invalid?
  • Should the workflow pause?
  • Can the machine continue?
  • Does the operator need intervention?
  • Is the current lot still trustworthy?

The exception is only the symptom.

The machine fault is the real problem.


PART 2 — Error vs Fault vs Failure

A useful distinction:

Error

Something went wrong in code, data, communication, or execution.

Examples:

  • null value
  • invalid parameter
  • SDK call failed
  • timeout occurred
  • image processing exception
  • unexpected device response

Fault

The system is now in an abnormal condition.

Examples:

  • camera disconnected
  • axis not ready
  • vacuum not reached
  • wafer position unknown
  • recipe invalid
  • subsystem unavailable

Failure

The system cannot perform its required function.

Examples:

  • machine cannot inspect wafers
  • robot cannot load material
  • stage cannot move safely
  • inspection result cannot be trusted
  • recovery cannot continue automatically

The architecture must not confuse these three.

Bad design treats every error as just another exception to catch.

Good design converts low-level errors into meaningful machine faults.
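
A minimal sketch of that conversion, assuming a hypothetical camera adapter that raises Python's built-in `TimeoutError` and `ConnectionError`. The `MachineFault` type, the `Severity` levels, and the fault codes are illustrative names, not part of any real SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    RECOVERABLE = "recoverable"  # a local retry or reset may clear it
    OPERATOR = "operator"        # needs a guided operator recovery
    FATAL = "fatal"              # the required function is lost


@dataclass(frozen=True)
class MachineFault:
    code: str        # stable identifier used by workflow and UI
    subsystem: str
    severity: Severity
    detail: str      # original low-level message, kept for diagnostics


def classify_camera_error(exc: Exception) -> MachineFault:
    """Translate a raw error from the (hypothetical) camera adapter into a fault."""
    if isinstance(exc, TimeoutError):
        return MachineFault("CameraAcquisitionTimeout", "camera",
                            Severity.RECOVERABLE, str(exc))
    if isinstance(exc, ConnectionError):
        return MachineFault("CameraDisconnected", "camera",
                            Severity.OPERATOR, str(exc))
    # Anything unrecognized is escalated rather than silently ignored.
    return MachineFault("CameraUnknownFault", "camera",
                        Severity.FATAL, repr(exc))
```

The workflow then reasons about `code` and `severity`, never about the raw exception type.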


PART 3 — Fault Propagation Across Layers

Faults become dangerous when they travel upward without structure.

Simple propagation diagram:

+----------+      +-----------+      +------------+      +------+
| Device   | ---> | Control   | ---> | Workflow   | ---> | UI   |
| Layer    |      | Layer     |      | Layer      |      |      |
+----------+      +-----------+      +------------+      +------+
     |                 |                  |                  |
 camera timeout   exception thrown   sequence stuck     operator confused

Example:

Camera SDK timeout

Camera adapter throws exception

Inspection controller does not classify it

Workflow waits forever for image result

UI shows "Running" even though inspection is dead

Operator presses Stop/Start repeatedly

Machine state becomes harder to recover

The original problem was small: a camera timeout.

The real failure became large because the system did not contain it.

Industrial software must control propagation.

A fault should move upward only after it has been classified.


PART 4 — Containment Strategy

The principle:

Handle as low as possible. Escalate only when needed.

But “low as possible” does not mean “hide the problem.”

It means:

  • retry local transient problems locally
  • reset only the affected subsystem if safe
  • escalate when the current layer cannot guarantee correctness
  • never silently continue if physical state or product quality is uncertain

Containment diagram:

+------------------------------------------------------------+
|                        UI Layer                            |
|   Inform operator, restrict commands, show recovery path    |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                    Application / Workflow                  |
|   Pause step, abort run, mark product/result invalid        |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                     Control Layer                          |
|   Isolate subsystem, stop motion, block dependent actions   |
+-----------------------------↑------------------------------+
                              |
+-----------------------------|------------------------------+
|                      Device Layer                          |
|   Retry, reconnect, reset, report device-specific fault     |
+------------------------------------------------------------+

Layer responsibility:

Layer            Should do
---------------  ------------------------------------------------------------
Device layer     Detect communication/device errors, retry safe transient operations, expose a clear device fault
Control layer    Protect the physical subsystem, stop unsafe actions, isolate unavailable devices
Workflow layer   Decide whether the current operation can continue, pause, abort, or require operator recovery
UI layer         Inform the operator and prevent unsafe or conflicting manual actions

A common mistake is letting the UI decide machine recovery.

The UI should request recovery.

The workflow/control layers should decide whether recovery is valid.
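
One way to picture this split, as a sketch only: the UI calls a single `request_recovery` method and displays the answer, while the workflow checks its own state before accepting. The `Workflow` class, its states, and the `position_known` flag are hypothetical names chosen for illustration.

```python
from enum import Enum, auto
from typing import Tuple


class WorkflowState(Enum):
    RUNNING = auto()
    FAULTED = auto()
    RECOVERING = auto()


class Workflow:
    def __init__(self) -> None:
        self.state = WorkflowState.RUNNING
        self.position_known = True

    def raise_fault(self, position_known: bool) -> None:
        self.state = WorkflowState.FAULTED
        self.position_known = position_known

    def request_recovery(self) -> Tuple[bool, str]:
        """Called by the UI. The workflow, not the UI, decides validity."""
        if self.state is not WorkflowState.FAULTED:
            return False, "no active fault to recover from"
        if not self.position_known:
            return False, "stage position unknown: rehome required first"
        self.state = WorkflowState.RECOVERING
        return True, "recovery started"
```

The UI only renders the returned reason. It never flips the state itself.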


PART 5 — Error Handling Strategies

1. Fail-fast

Use when continuing is unsafe or correctness is impossible.

Examples:

  • unexpected motion state
  • axis position unknown
  • safety-related interlock lost
  • wafer presence inconsistent
  • recipe parameter violates physical limit

Fail-fast means:

stop the affected operation immediately and move toward a safe state.

It does not always mean crash the process.

In industrial systems, fail-fast usually means a controlled stop, not application death.


2. Retry

Use for transient faults where retry is safe and bounded.

Examples:

  • temporary TCP communication loss
  • short device busy response
  • camera frame not ready
  • database write transient failure

Retry must have:

  • maximum attempt count
  • timeout
  • delay/backoff
  • cancellation
  • fault classification
  • no hidden infinite loops

Bad retry:

while(true)
    TryReadCamera();

Good retry:

Retry 3 times within 2 seconds.
If still failed, raise CameraAcquisitionFault.
Pause inspection workflow.
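
The good retry above can be sketched as follows. This is an illustration under assumptions: `read_frame` stands in for whatever call the camera adapter exposes, `CameraAcquisitionFault` is modeled as an exception, and only `TimeoutError` is treated as transient. The `sleep` and `clock` parameters exist so the function can be tested without real delays.

```python
import time
from typing import Callable, Optional


class CameraAcquisitionFault(Exception):
    """Structured fault raised once bounded retry is exhausted."""


def acquire_with_retry(
    read_frame: Callable[[], bytes],          # hypothetical adapter call
    max_attempts: int = 3,
    budget_s: float = 2.0,
    delay_s: float = 0.1,
    cancelled: Callable[[], bool] = lambda: False,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    deadline = clock() + budget_s
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        if cancelled():
            raise CameraAcquisitionFault("acquisition cancelled")
        if clock() >= deadline:
            break  # time budget used up before all attempts were made
        try:
            return read_frame()
        except TimeoutError as exc:  # only transient errors are retried
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_s * attempt)  # linear backoff between attempts
    raise CameraAcquisitionFault(
        f"no frame within limits ({max_attempts} attempts, {budget_s}s)"
    ) from last_error
```

The caller catches `CameraAcquisitionFault` and pauses the inspection workflow. The retry loop itself never decides what the workflow does next.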

3. Fallback

Use when an alternative path exists and correctness remains acceptable.

Examples:

  • use cached calibration only if still valid
  • use secondary sensor if primary sensor fails
  • use offline result buffering if host connection is down

Fallback must be explicit.

The system should know it is operating on fallback behavior.
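
A small sketch of an explicit fallback, using the cached-calibration example. All names here (`Calibration`, `CalibrationResult`, `get_calibration`, the one-hour validity window, and `RuntimeError` as the measurement failure) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Calibration:
    scale_um_per_px: float
    measured_at_s: float


@dataclass(frozen=True)
class CalibrationResult:
    calibration: Calibration
    source: str  # "live" or "cached-fallback", visible to workflow and UI


def get_calibration(
    measure: Callable[[], Calibration],   # hypothetical live measurement
    cached: Optional[Calibration],
    now_s: float,
    max_age_s: float = 3600.0,
) -> CalibrationResult:
    try:
        return CalibrationResult(measure(), "live")
    except RuntimeError:
        # Fall back only if the cached value is still inside its validity window.
        if cached is not None and now_s - cached.measured_at_s <= max_age_s:
            return CalibrationResult(cached, "cached-fallback")
        raise  # no valid alternative: let the caller classify and escalate
```

Because the result carries its `source`, downstream code can record that the run used fallback data instead of discovering it later.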


4. Degrade

Use when the machine can continue with reduced capability.

Examples:

  • continue handling wafers but disable inspection
  • continue production without optional image preview
  • run slower because one performance optimization is unavailable
  • disable automatic review while still saving raw results

Degraded mode is dangerous if operators do not understand it.

The system must define:

  • what is disabled
  • what remains valid
  • what quality risk exists
  • how to return to normal

5. Isolate

Use when one subsystem fails but the whole system should not collapse.

Examples:

  • disable camera subsystem
  • keep motion controller alive
  • stop only one station in a multi-station machine
  • prevent recipe activation while allowing diagnostics

Isolation is one of the strongest anti-cascade strategies.


PART 6 — Recovery Models

Recovery is not:

“Run the same code again and hope it works.”

Recovery means:

bring the machine from an abnormal state to a safe, known, consistent state, then resume only if valid.

Recovery flow:

+----------+      +------------+      +----------------+      +--------+
| Failure  | ---> | Safe State | ---> | Recovery Action| ---> | Resume |
+----------+      +------------+      +----------------+      +--------+
      |                 |                    |                    |
 device fault      motion stopped       reset/rehome/retry     continue,
 timeout           outputs safe          operator confirm       restart step,
 bad state         workflow paused       reload context         or abort

Important recovery concepts:

Safe state

The machine is not doing anything dangerous.

Examples:

  • motion stopped
  • actuator outputs disabled
  • robot not moving
  • laser off
  • workflow paused
  • no new material introduced

Known state

The software understands the physical condition.

Examples:

  • axis position known
  • wafer presence confirmed
  • camera connected
  • recipe loaded
  • workflow step known
  • subsystem readiness confirmed

A state can be safe but not known.

For example:

Motion stopped, but stage position is uncertain.

That is safe, but not recoverable until the position is re-established.

Consistent state

Software state, hardware state, workflow state, and operator-visible state agree.

If recovery resets a device but the workflow still thinks the old operation is running, the system is inconsistent.

That is where dangerous bugs appear.
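
The three conditions can be checked as one resume gate. The sketch below assumes a hypothetical `MachineSnapshot` with a handful of flags; a real machine would have many more signals, but the structure is the same.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineSnapshot:
    motion_stopped: bool
    outputs_disabled: bool
    axis_position_known: bool
    wafer_presence_confirmed: bool
    device_step: str     # step the controller reports
    workflow_step: str   # step the workflow believes is active


def resume_blockers(s: MachineSnapshot) -> List[str]:
    """Return the reasons resume is not yet valid; an empty list allows resume."""
    blockers = []
    if not (s.motion_stopped and s.outputs_disabled):
        blockers.append("not safe: stop motion and disable outputs")
    if not (s.axis_position_known and s.wafer_presence_confirmed):
        blockers.append("not known: re-establish position and wafer presence")
    if s.device_step != s.workflow_step:
        blockers.append("not consistent: reconcile workflow with device state")
    return blockers
```

For the "motion stopped, position uncertain" case, this returns only the "not known" blocker, which tells recovery what to do next.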


PART 7 — Avoiding Cascading Failures

A cascading failure happens when one fault triggers more faults.

Example:

Camera timeout

Processing queue grows

Memory pressure increases

UI becomes slow

Operator presses commands repeatedly

Workflow receives conflicting requests

Machine enters unclear state

Strategies to prevent this:

Isolation boundaries

Each subsystem should fail independently where possible.

Camera failure should not crash motion control.

Storage failure should not freeze emergency stop handling.

UI rendering failure should not corrupt workflow state.

Timeouts

Every external wait needs a timeout.

Dangerous waits:

  • wait for image forever
  • wait for PLC bit forever
  • wait for motion complete forever
  • wait for operator response forever
  • wait for device reconnect forever

Industrial workflows should not have unbounded waits.
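
A bounded wait can be as small as this sketch. It assumes the motion subsystem signals completion through a `threading.Event`; `DeviceTimeoutFault` and `wait_for_motion_complete` are hypothetical names.

```python
import threading


class DeviceTimeoutFault(Exception):
    """Structured fault for a bounded wait that expired."""


def wait_for_motion_complete(done: threading.Event, timeout_s: float) -> None:
    # Event.wait returns False when the timeout expires before the flag is set.
    if not done.wait(timeout=timeout_s):
        raise DeviceTimeoutFault(f"motion not complete after {timeout_s}s")
```

The timeout turns "stuck" into an explicit fault the workflow can act on.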

Circuit breaker concept

If a subsystem repeatedly fails, stop calling it temporarily.

Example:

Camera failed 5 times in 30 seconds

Mark camera subsystem unavailable

Stop acquisition attempts

Require reset/reconnect/recovery

This prevents failure storms.
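
A sketch of the idea, using the 5-failures-in-30-seconds example. The `CircuitBreaker` class is illustrative, and the injectable `clock` is there only so the behavior can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, window_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_failures
        self._window = window_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self.open = False  # open means the subsystem is marked unavailable

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Forget failures that fell outside the sliding window.
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()
        if len(self._failures) >= self._max:
            self.open = True

    def allow_call(self) -> bool:
        return not self.open

    def reset(self) -> None:
        """Call only after an explicit reset/reconnect/recovery has succeeded."""
        self._failures.clear()
        self.open = False
```

The breaker does not close on its own here. It stays open until recovery calls `reset`, which matches the "require reset/reconnect/recovery" step.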

Queue limits

Unbounded queues are hidden failure amplifiers.

Examples:

  • image queue
  • result queue
  • log/event queue
  • UI update queue
  • device command queue

If the system cannot keep up, it must apply backpressure, drop non-critical data, or pause upstream work.
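
As a sketch, Python's standard `queue.Queue` with a `maxsize` already gives a hard limit. The `offer` helper and its three return values are hypothetical; they just make the policy visible to the producer.

```python
import queue


def offer(q: queue.Queue, item: object, critical: bool) -> str:
    """Try to enqueue without blocking and report what happened."""
    try:
        q.put_nowait(item)
        return "queued"
    except queue.Full:
        if critical:
            return "backpressure"  # caller must pause upstream work
        return "dropped"           # non-critical data (e.g. preview) is discarded
```

Example: with `queue.Queue(maxsize=2)` already full, a preview image is dropped, while an inspection image reports backpressure so acquisition can pause.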

Subsystem independence

A machine should not be one giant synchronous call chain.

Bad:

UI button click
  → workflow
    → motion
      → camera
        → processing
          → storage
            → UI update

Better:

UI sends command
Workflow owns sequence
Subsystems report state/events
Failures are classified and escalated
Recovery is state-driven

PART 8 — Real-World Failure Scenarios

Scenario 1: Camera failure causes infinite retry loop

What it looks like:

  • machine appears stuck
  • CPU usage increases
  • UI becomes slow
  • camera keeps reconnecting
  • workflow never completes

Why it happens:

  • retry is hidden inside device adapter
  • no retry limit
  • no escalation
  • workflow never receives a real fault

How engineers fix it:

  • bounded retry
  • classify as CameraUnavailable
  • stop acquisition
  • pause workflow
  • require reconnect/reinitialize recovery path

Scenario 2: Processing error propagates to UI thread

What it looks like:

  • inspection crashes the UI
  • operator loses current screen
  • machine state may still be active underneath
  • recovery is unclear

Why it happens:

  • background processing exception crosses thread boundary badly
  • UI directly depends on processing task success
  • no fault boundary between pipeline and presentation

How engineers fix it:

  • isolate processing pipeline
  • convert processing exception into inspection fault
  • mark result invalid
  • keep UI alive
  • let workflow decide whether to retry, skip, pause, or abort

Scenario 3: Device timeout not handled

What it looks like:

  • workflow stays “Running”
  • no visible progress
  • operator cannot tell if machine is busy or stuck
  • stop command may not work cleanly

Why it happens:

  • command waits forever
  • timeout not modeled as a fault
  • no cancellation path
  • workflow step has no failure transition

How engineers fix it:

  • every device operation has timeout
  • timeout becomes structured fault
  • workflow transitions to Paused, Faulted, or Recovering
  • operator commands are gated based on real state

Scenario 4: Recovery resets subsystem but state is not synchronized

What it looks like:

  • device reconnects successfully
  • UI says ready
  • workflow still fails
  • next operation behaves unexpectedly

Why it happens:

  • recovery only reset hardware
  • software state was not rebuilt
  • cached state was stale
  • workflow context was not reconciled

How engineers fix it:

  • recovery includes state reconciliation
  • reread device status
  • verify position/sensor/product state
  • rebuild subsystem readiness
  • resume only from valid workflow checkpoint

Scenario 5: Operator retries manually and worsens state inconsistency

What it looks like:

  • operator presses Reset, Start, Stop, Retry repeatedly
  • machine enters confusing partial state
  • support team cannot reconstruct what happened
  • material may need to be scrapped

Why it happens:

  • system exposes too many commands during fault state
  • recovery path is not guided
  • UI command enablement is not tied to machine state
  • workflow accepts commands while recovery is incomplete

How engineers fix it:

  • restrict commands during fault/recovery
  • define explicit recovery states
  • allow only valid next actions
  • require confirmation when product/material state is uncertain
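
Command gating of this kind can be sketched as a table from machine state to accepted commands. The states and command names below are hypothetical; the point is that both the UI (for button enablement) and the workflow (for validation) read the same table.

```python
from enum import Enum, auto


class MachineState(Enum):
    IDLE = auto()
    RUNNING = auto()
    FAULTED = auto()
    RECOVERING = auto()


# Hypothetical command table: each state lists the only commands it accepts.
ALLOWED_COMMANDS = {
    MachineState.IDLE: {"Start"},
    MachineState.RUNNING: {"Pause", "Stop"},
    MachineState.FAULTED: {"Stop", "BeginRecovery"},
    MachineState.RECOVERING: {"Stop", "ConfirmMaterialState"},
}


def is_command_allowed(state: MachineState, command: str) -> bool:
    return command in ALLOWED_COMMANDS[state]
```

Pressing Start while the machine is `FAULTED` is simply rejected, so repeated button presses cannot push the workflow into a partial state.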

PART 9 — Software Design Implications

Industrial error handling needs architecture, not scattered catch blocks.

Component diagram:

+-------------+      +---------------+      +-------------------+
| Subsystem   | ---> | Error Handler | ---> | Recovery Strategy |
| Camera      |      | Classify      |      | Retry / Reset     |
| Motion      |      | Normalize     |      | Pause / Abort     |
| Robot       |      | Escalate      |      | Rehome / Manual   |
+-------------+      +---------------+      +-------------------+
                              |
                              v
                      +----------------+
                      | Escalation     |
                      | Workflow Fault |
                      | Machine Fault  |
                      +----------------+
                              |
                              v
                      +----------------+
                      | UI / Alarm     |
                      | Operator Path  |
                      +----------------+

A strong design usually has:

  • clear fault model
  • subsystem-specific fault classification
  • bounded retries
  • explicit recovery states
  • workflow-level fault transitions
  • no silent failure
  • no hidden infinite retry
  • no UI-owned recovery logic
  • no raw device exception leaking into workflow logic
  • no direct hardware control from UI
  • consistent escalation rules

Bad design:

catch(Exception)
{
    // ignore
}

Also bad:

catch(Exception ex)
{
    Log(ex);
    RetryForever();
}

Also bad:

catch(Exception ex)
{
    MessageBox.Show(ex.Message);
}

Good design:

Device error

Classify fault

Contain locally if safe

Escalate structured fault if not recoverable locally

Workflow transitions to known fault state

Machine moves to safe state

Recovery path is selected

Operator is guided only when needed
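
In contrast to the catch blocks above, a fault boundary might look like this sketch. `Fault`, `WorkflowContext`, and `run_step` are hypothetical names; the catch is broad on purpose, because this is the one place where a raw exception is turned into a structured fault and a state transition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Fault:
    code: str
    detail: str


@dataclass
class WorkflowContext:
    state: str = "Running"
    faults: List[Fault] = field(default_factory=list)

    def on_fault(self, fault: Fault) -> None:
        self.faults.append(fault)
        self.state = "Faulted"  # a known fault state, not a silent continue


def run_step(ctx: WorkflowContext, step: Callable[[], object],
             fault_code: str) -> Optional[object]:
    """Run one workflow step; convert any exception into a structured fault."""
    try:
        return step()
    except Exception as exc:
        # Nothing is swallowed: the fault is recorded and the state changes.
        ctx.on_fault(Fault(fault_code, repr(exc)))
        return None
```

Selecting the safe-state action and the recovery path would then key off `ctx.state` and the recorded fault code.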

PART 10 — Interview / Real-World Talking Points

A strong answer in an interview:

In industrial systems, error handling is not just catching exceptions. The important part is controlling machine behavior when something abnormal happens. A low-level error must be classified into a meaningful fault, contained at the right layer, and escalated only when local recovery is not safe or sufficient. Recovery must bring the machine to a safe, known, and consistent state before resuming. The goal is to prevent one subsystem failure from cascading into workflow deadlock, UI confusion, unsafe motion, or corrupted production state.

Key distinction:

Exception handling = code-level control flow

Fault handling = system-level safety, state, and recovery behavior

Common mistakes engineers make:

  • catch everywhere but define no fault model
  • retry blindly
  • hide device failures
  • let UI own recovery logic
  • allow workflows to wait forever
  • continue when physical state is unknown
  • reset hardware without rebuilding software state
  • treat all errors as equal
  • fail to isolate subsystems
  • allow one queue or device failure to freeze the whole system

What strong engineers understand:

  • failures are normal, not exceptional
  • recovery is a state machine problem
  • safe state is not the same as known state
  • bounded retry is useful; infinite retry is dangerous
  • escalation must be intentional
  • subsystem isolation protects the whole machine
  • operator actions must be constrained during fault states
  • the machine must never pretend it is healthy when correctness is uncertain

The core mindset:

In industrial software, good error handling does not merely keep the application alive. It keeps the machine safe, the workflow understandable, and recovery controlled.
