Failure Modes & System Reliability Model

This topic belongs to Reliability, Fault Handling & Recovery, where machines must detect failures, fail safely, report clearly, and recover without making things worse. The roadmap explicitly frames this as a major mindset shift from enterprise software because industrial failures can stop production, damage equipment, scrap material, or create safety risk.


PART 1 — BIG PICTURE: WHY FAILURE MODELING COMES FIRST

Industrial machine software is not designed only for the happy path.

A business web system might ask:

“What should happen when the user clicks Submit?”

An industrial machine system must also ask:

“What if the camera disconnects while the stage is moving?”

“What if the axis reaches the target but the encoder reports an impossible value?”

“What if the image pipeline becomes overloaded after running for six hours?”

“What if the operator presses Stop during a half-completed sequence?”

This is why strong industrial engineers think about failure before they think about implementation.

In industrial systems, failure is not exceptional. It is part of normal reality.

Machines deal with:

  • vibration
  • dust
  • heat
  • electrical noise
  • cable wear
  • bad sensors
  • device firmware bugs
  • timing drift
  • operator mistakes
  • long-running memory pressure
  • partial startup/shutdown
  • unstable factory environments

So the architecture must support:

  • partial failure
  • degraded operation
  • safe stop
  • clear fault reporting
  • controlled recovery
  • diagnosability after the fact

A good industrial system does not assume:

“Everything works unless there is an exception.”

It assumes:

“Anything can fail, some failures will be delayed, some will be misleading, and some will only appear under production conditions.”

Example:

text
Camera disconnects during inspection
↓
No new image arrives
↓
Image processing waits
↓
Inspection workflow does not complete
↓
Stage remains in inspection position
↓
Operator sees machine stuck
↓
Wrong manual recovery may cause more damage

The camera failure is only the origin. The real problem is how the whole system responds.


PART 2 — FAILURE CATEGORIES: LAYERED MODEL

A useful way to model industrial failure is by system layer.

text
+--------------------------------------------------+
| UI / HMI                                         |
| Operator screens, commands, alarms, status       |
+--------------------------------------------------+
| Application / Workflow                           |
| Recipes, sequences, inspection jobs, run logic   |
+--------------------------------------------------+
| Control Layer                                    |
| State machines, command gating, interlocks       |
+--------------------------------------------------+
| Communication Layer                              |
| Ethernet, serial, fieldbus, SDK calls, messages  |
+--------------------------------------------------+
| Device Layer                                     |
| Cameras, motion controllers, IO cards, PLCs      |
+--------------------------------------------------+
| Physical Layer                                   |
| Motors, stages, sensors, cables, mechanics       |
+--------------------------------------------------+

The same visible symptom can come from different layers.

For example, “axis did not move” could mean:

  • motor power is off
  • servo drive faulted
  • cable is loose
  • motion controller rejected command
  • communication timed out
  • interlock blocked motion
  • workflow sent command in wrong state
  • UI enabled a button when it should not have

That is why failure modeling must be layered.


1. Physical / Mechanical Failures

These happen in the real machine body.

Examples:

  • belt slips
  • stage jams
  • actuator stalls
  • vacuum cup loses grip
  • guide rail contamination increases friction
  • wafer is misaligned
  • mechanical backlash affects position repeatability

Software may not directly “see” the mechanical problem. It may only see symptoms:

  • motion timeout
  • position error
  • sensor mismatch
  • unexpected vibration
  • repeatability drift

Architectural implication:

Software should not assume that an accepted command means the physical action succeeded.
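
One way to act on this is to confirm a move against position feedback instead of trusting the acknowledgement. The sketch below is a minimal illustration under stated assumptions: `verify_move`, `read_position_mm`, the injected `now` clock, and the 0.01 mm tolerance are hypothetical names and values, not part of any real motion SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MoveResult:
    ok: bool
    reason: str


def verify_move(target_mm: float,
                read_position_mm: Callable[[], float],
                now: Callable[[], float],
                timeout_s: float,
                tolerance_mm: float = 0.01) -> MoveResult:
    """Poll feedback until the axis is within tolerance or the deadline passes.

    This sketch polls without sleeping; a real loop would wait between reads.
    """
    deadline = now() + timeout_s
    while True:
        error = abs(read_position_mm() - target_mm)
        if error <= tolerance_mm:
            return MoveResult(True, "in position")
        if now() >= deadline:
            # The command may have been accepted, but the machine did not get there.
            return MoveResult(False, f"position error {error:.3f} mm after timeout")
```

The clock is passed in so the check can be exercised without real hardware or real waiting.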


2. Electrical / IO Failures

These involve signals, wiring, voltage, and electrical noise.

Examples:

  • sensor signal flickers
  • digital input stuck ON
  • cable intermittently disconnects
  • emergency stop circuit changes state
  • electrical noise causes false trigger
  • analog signal drifts

Architectural implication:

Raw IO should usually be interpreted through validation, debounce, state context, and health checks.
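
As an illustration of the debounce part, the following sketch accepts a new input level only after it has been read the same way several times in a row. `DebouncedInput` and the sample count of 3 are assumptions made for the example; real values depend on the sensor and scan rate.

```python
class DebouncedInput:
    """Accept a new level only after `stable_samples` consecutive matching reads."""

    def __init__(self, initial: bool, stable_samples: int = 3):
        self.state = initial
        self._needed = stable_samples
        self._count = 0  # consecutive reads that disagree with the accepted state

    def update(self, raw: bool) -> bool:
        if raw == self.state:
            self._count = 0              # glitch ended, keep the accepted state
        else:
            self._count += 1
            if self._count >= self._needed:
                self.state = raw         # level has been stable long enough
                self._count = 0
        return self.state
```

A single flicker to ON followed by OFF never changes the accepted state; three consecutive ON reads do.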


3. Device / Hardware Failures

These happen inside devices controlled by software.

Examples:

  • camera disconnects
  • frame grabber stops delivering frames
  • motion controller enters fault state
  • light controller ignores command
  • robot controller reports alarm
  • IO module becomes unreachable

Architectural implication:

Each device should have an explicit health/state model, not just a wrapper around SDK calls.

This connects strongly to the roadmap’s hardware integration domain, where many real failures come from unstable drivers, communication drops, timeouts, partial initialization, and device contention.
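
A minimal sketch of what an explicit health/state model could look like is shown below. The state names and the transition table are assumptions chosen for illustration; a real device would need its own set.

```python
from enum import Enum, auto


class DeviceState(Enum):
    DISCONNECTED = auto()
    INITIALIZING = auto()
    READY = auto()
    BUSY = auto()
    FAULTED = auto()


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    DeviceState.DISCONNECTED: {DeviceState.INITIALIZING},
    DeviceState.INITIALIZING: {DeviceState.READY, DeviceState.FAULTED, DeviceState.DISCONNECTED},
    DeviceState.READY: {DeviceState.BUSY, DeviceState.FAULTED, DeviceState.DISCONNECTED},
    DeviceState.BUSY: {DeviceState.READY, DeviceState.FAULTED, DeviceState.DISCONNECTED},
    DeviceState.FAULTED: {DeviceState.INITIALIZING, DeviceState.DISCONNECTED},
}


class DeviceHealth:
    def __init__(self, name: str):
        self.name = name
        self.state = DeviceState.DISCONNECTED
        self.last_fault = None

    def transition(self, new_state: DeviceState, fault: str = None) -> None:
        """Apply a transition, rejecting any that the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: illegal transition {self.state.name} -> {new_state.name}")
        if new_state is DeviceState.FAULTED:
            self.last_fault = fault or "unspecified"
        self.state = new_state
```

With this table a faulted device cannot jump straight back to BUSY; it has to go through initialization again.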


4. Communication Failures

These happen between software and devices/controllers.

Examples:

  • timeout
  • dropped connection
  • delayed response
  • corrupted frame
  • duplicate message
  • out-of-order event
  • stale cached status
  • request succeeds but response is lost

Architectural implication:

Communication is not just transport. It determines what the software can actually know about the machine.

A motion command may have reached the controller even if the PC did not receive the response. That creates uncertainty.
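
One way to make that uncertainty explicit is to give commands a third outcome besides success and failure. The sketch below assumes a hypothetical `transport_send` callable that returns a reply string or raises `TimeoutError`; the `"ACK"` reply and the function names are illustrative only.

```python
from enum import Enum


class Outcome(Enum):
    CONFIRMED = "confirmed"
    REJECTED = "rejected"
    UNKNOWN = "unknown"   # the controller may or may not have acted


def send_move(transport_send, command: str) -> Outcome:
    """Send a command and classify the result, keeping 'no reply' distinct."""
    try:
        reply = transport_send(command)
    except TimeoutError:
        return Outcome.UNKNOWN
    return Outcome.CONFIRMED if reply == "ACK" else Outcome.REJECTED


def next_action(outcome: Outcome) -> str:
    if outcome is Outcome.CONFIRMED:
        return "wait for completion"
    if outcome is Outcome.REJECTED:
        return "report rejection"
    # Blind retry could execute the move twice, so resolve the state first.
    return "query controller state before retrying"
```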


5. Timing / Synchronization Failures

These are very common in inspection, motion, and automation systems.

Examples:

  • camera trigger arrives too early
  • image timestamp does not match stage position
  • light turns on after exposure starts
  • encoder position sample is delayed
  • UI shows old state as current
  • processing result belongs to previous wafer

Architectural implication:

Correctness depends not only on data, but on when the data was valid.
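
For example, an image can be paired with a stage position only if a position sample exists close enough in time. The following sketch assumes both timestamps come from the same clock; `PositionSample`, `position_for_image`, and `max_skew_s` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PositionSample:
    t_s: float    # time the position was sampled
    x_mm: float


def position_for_image(image_t_s: float,
                       samples: Sequence[PositionSample],
                       max_skew_s: float) -> Optional[PositionSample]:
    """Return the sample nearest in time to the image, or None if none is close enough."""
    if not samples:
        return None
    nearest = min(samples, key=lambda s: abs(s.t_s - image_t_s))
    if abs(nearest.t_s - image_t_s) > max_skew_s:
        return None   # refuse to pair the image with a position from another moment
    return nearest
```

Returning None forces the caller to treat the image as unusable instead of silently attaching it to the wrong position.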


6. Data / State Inconsistency

The system’s internal model no longer matches machine reality.

Examples:

  • software thinks wafer is loaded, but sensor says empty
  • workflow thinks machine is Running, but motion controller is Faulted
  • recipe version changed during run
  • UI shows Ready while one subsystem is still initializing
  • inspection result is attached to wrong product ID

Architectural implication:

State must be modeled explicitly and validated across subsystem boundaries.

The roadmap also emphasizes state machines, deterministic workflow execution, interlocks, and fault handling as core machine-control concerns.
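
Cross-boundary validation can be expressed as a small set of rules over a snapshot of what each subsystem believes. This is a sketch only: the snapshot keys and the three rules are assumptions that mirror the examples above.

```python
from typing import List

# Each rule: (description, predicate that is True when the snapshot is inconsistent).
RULES = [
    ("wafer belief contradicts presence sensor",
     lambda m: m["workflow_wafer_loaded"] != m["sensor_wafer_present"]),
    ("workflow Running while motion controller Faulted",
     lambda m: m["workflow_state"] == "Running" and m["motion_state"] == "Faulted"),
    ("UI Ready while a subsystem is initializing",
     lambda m: m["ui_status"] == "Ready" and "Initializing" in m["subsystem_states"]),
]


def find_inconsistencies(snapshot: dict) -> List[str]:
    """Return the description of every rule the snapshot violates."""
    return [name for name, violated in RULES if violated(snapshot)]
```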


7. Software Logic Errors

These are normal software bugs, but with physical consequences.

Examples:

  • wrong state transition
  • race condition
  • missing interlock check
  • incorrect coordinate transform
  • unit conversion error
  • wrong recipe parameter applied
  • command allowed in unsafe mode

Architectural implication:

Business software bugs may corrupt data. Machine software bugs can move hardware incorrectly.


8. Resource Exhaustion

These appear after time or under load.

Examples:

  • memory leak after three days
  • image buffers not released
  • disk fills with inspection images
  • CPU spike delays control updates
  • queue grows faster than processing
  • UI becomes sluggish during high-throughput inspection

Architectural implication:

Long-running behavior is part of reliability, not just performance.


9. Human / Operator Errors

Operators are part of the system.

Examples:

  • wrong recipe selected
  • manual command issued in wrong context
  • alarm ignored
  • recovery step skipped
  • service mode left enabled
  • part loaded incorrectly

Architectural implication:

The system should make correct actions easy and unsafe actions difficult or impossible.

But this topic is not mainly about UI design. At system level, the key point is: operator actions must be modeled as inputs that can fail, arrive at bad times, or conflict with machine state.


PART 3 — FAILURE MODES: HOW THINGS FAIL

A common beginner mistake is to ask only:

“Which component failed?”

A stronger engineer asks:

“How did it fail?”

The failure mode often matters more than the component.


1. Fail-Stop

The component stops working clearly.

Examples:

  • camera disconnects
  • motion controller stops responding
  • PLC connection drops
  • service crashes

This is often the easiest failure to detect.

text
Command → No response → Timeout → Fault

Architectural response:

  • mark subsystem unavailable
  • stop dependent workflows
  • enter safe state
  • require recovery or reconnect

2. Fail-Slow

The component still works, but too slowly.

Examples:

  • camera frames arrive late
  • image processing latency increases
  • database writes become slow
  • device SDK call blocks longer than usual

This is dangerous because the system appears alive.

text
Normal response time: 20 ms
Current response time: 2,000 ms
System status: technically alive, operationally unsafe

Architectural response:

  • define timing expectations
  • detect latency degradation
  • apply timeouts
  • prevent backlog from becoming cascading failure
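
Detecting latency degradation can be as simple as comparing a rolling average with the expected response time. In the sketch below, `LatencyMonitor`, the factor of 3, and the window size are illustrative assumptions, not recommended values.

```python
from collections import deque


class LatencyMonitor:
    """Flag a component as degraded when its recent average latency is too high."""

    def __init__(self, expected_ms: float, degraded_factor: float = 3.0, window: int = 5):
        self.limit_ms = expected_ms * degraded_factor
        self.samples = deque(maxlen=window)   # only the most recent measurements

    def record(self, latency_ms: float) -> str:
        self.samples.append(latency_ms)
        return self.status()

    def status(self) -> str:
        if len(self.samples) < self.samples.maxlen:
            return "unknown"                   # not enough evidence yet
        avg = sum(self.samples) / len(self.samples)
        return "degraded" if avg > self.limit_ms else "healthy"
```

With the numbers from the example above, a component expected to answer in 20 ms is reported as degraded once a 2,000 ms response pulls the window average over the limit.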

3. Fail-Incorrect

The component gives a response, but the response is wrong.

Examples:

  • camera returns stale image
  • sensor reports false ON
  • encoder gives invalid position
  • recipe parameter is loaded from wrong version
  • inspection result belongs to previous frame

This is one of the most dangerous modes.

Why?

Because many systems are better at detecting “no data” than “wrong data.”

Architectural response:

  • validate freshness
  • validate correlation IDs
  • validate timestamps
  • cross-check sensors
  • reject impossible state combinations
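
The freshness and correlation checks above might look like the following sketch. The `InspectionResult` fields and the `accept_result` function are hypothetical; the point is that a result is rejected unless it provably belongs to the expected frame and wafer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InspectionResult:
    frame_id: int
    wafer_id: str
    captured_at_s: float
    defect_count: int


def accept_result(result: InspectionResult,
                  expected_frame_id: int,
                  expected_wafer_id: str,
                  now_s: float,
                  max_age_s: float) -> Tuple[bool, str]:
    """Return (accepted, reason); a response is not trusted just because it arrived."""
    if result.frame_id != expected_frame_id:
        return False, "frame id mismatch"
    if result.wafer_id != expected_wafer_id:
        return False, "wafer id mismatch"
    if now_s - result.captured_at_s > max_age_s:
        return False, "stale result"
    return True, "ok"
```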

4. Intermittent Failure

The problem appears and disappears.

Examples:

  • loose cable disconnects only during vibration
  • camera fails only when CPU load is high
  • sensor flickers near threshold
  • race condition appears once per thousand runs

These are hard because the system may pass tests and fail in production.

Architectural response:

  • design for evidence capture
  • preserve fault history
  • expose unstable health states
  • avoid clearing faults too aggressively
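
One possible way to expose an unstable health state is to keep fault timestamps after the fault clears and report the device as unstable when too many fall inside a time window. `FlapDetector` and its thresholds are assumptions made for this sketch.

```python
from collections import deque


class FlapDetector:
    """Report 'unstable' when a signal has faulted repeatedly within a time window."""

    def __init__(self, max_faults: int, window_s: float, max_history: int = 100):
        self.max_faults = max_faults
        self.window_s = window_s
        # History survives fault clearing, but is bounded so it cannot grow forever.
        self.history = deque(maxlen=max_history)

    def record_fault(self, t_s: float) -> None:
        self.history.append(t_s)

    def health(self, now_s: float, currently_faulted: bool) -> str:
        if currently_faulted:
            return "faulted"
        recent = [t for t in self.history if now_s - t <= self.window_s]
        if len(recent) >= self.max_faults:
            return "unstable"   # working right now, but not trustworthy
        return "healthy"
```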

5. Partial System Failure

One subsystem fails while others still work.

Examples:

  • vision is unavailable, but motion works
  • MES connection is down, but local production can continue
  • one camera fails in a multi-camera machine
  • one axis is faulted, but IO is healthy

Architectural response:

  • define subsystem boundaries
  • define degraded modes
  • prevent healthy subsystems from making unsafe assumptions about failed ones
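
Degraded modes can be made explicit by deriving the allowed operations from subsystem health instead of from a single machine-wide flag. The subsystem names, operation names, and rules below are hypothetical and exist only to show the shape of the idea.

```python
def allowed_operations(health: dict) -> set:
    """Map subsystem health ('ok' or anything else) to the operations still permitted."""
    ops = set()
    motion_ok = health.get("motion") == "ok" and health.get("io") == "ok"
    if motion_ok:
        ops.add("manual_jog")
        if health.get("vision") == "ok":
            ops.add("inspection")          # inspection needs both motion and vision
    if health.get("mes") != "ok":
        ops.add("buffer_results_locally")  # keep producing, upload later
    return ops
```

A missing health entry is treated the same as a failed subsystem, so an unknown state never unlocks an operation.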

6. Cascading Failure

One failure causes other failures.

Example:

text
Image processing slows down
↓
Frame queue grows
↓
Memory usage increases
↓
GC pauses increase
↓
UI updates become delayed
↓
Operator sees stale machine state
↓
Wrong recovery action is taken

Architectural response:

  • use backpressure
  • isolate subsystems
  • fail fast at boundaries
  • define containment zones
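
A bounded queue is one simple form of backpressure: when the consumer falls behind, the producer is told immediately instead of memory growing without limit. The sketch below uses Python's standard `queue.Queue`; `FrameBuffer` and its `offer`/`take` methods are hypothetical names.

```python
import queue


class FrameBuffer:
    """Bounded frame queue that refuses new frames when full and counts refusals."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)
        self.refused = 0   # makes overload visible instead of silent

    def offer(self, frame) -> bool:
        """Try to enqueue; False tells the producer to slow down or drop."""
        try:
            self._q.put_nowait(frame)
            return True
        except queue.Full:
            self.refused += 1
            return False

    def take(self):
        """Remove the oldest frame; raises queue.Empty if there is none."""
        return self._q.get_nowait()
```

What the producer does on refusal (skip the frame, slow the stage, raise a fault) is a machine-level decision; the buffer only guarantees that the overload cannot spread through memory.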

PART 4 — FAILURE PROPAGATION

Failures rarely stay local.

A local device issue can become a full machine incident if the system has poor containment.

text
+-------------+      +----------------+      +------------------+
| Camera      | ---> | Acquisition    | ---> | Processing       |
| disconnects |      | receives none  |      | waits forever    |
+-------------+      +----------------+      +------------------+
                                                        |
                                                        v
+-------------+      +----------------+      +------------------+
| Operator    | <--- | UI freezes or  | <--- | Workflow stuck   |
| confused    |      | shows stale    |      | mid-inspection   |
+-------------+      +----------------+      +------------------+

The root failure is camera disconnect.

But the system failure is larger:

  • acquisition did not classify the failure clearly
  • processing had no timeout boundary
  • workflow had no recovery state
  • UI did not show reliable status
  • operator received no safe guidance

This is why industrial reliability is architectural.

Not:

“Add try/catch around the camera call.”

But:

“Define how camera failure is contained, propagated, reported, and recovered.”


PART 5 — FAILURE DETECTION VS FAILURE ASSUMPTION

A weak design says:

“We will handle the error when we receive it.”

A stronger design says:

“What if we never receive the error?”

In industrial systems, silence can be a failure signal.

Examples:

  • no heartbeat from PLC
  • no image from camera
  • no position update from controller
  • no completion event from motion command
  • no response from robot
  • no sensor transition after actuator command

No event does not mean success.

It may mean:

  • device died
  • cable disconnected
  • event handler failed
  • communication dropped
  • controller is overloaded
  • software missed the message

That is why reliability design uses assumptions like:

text
Expected signal did not arrive within allowed time
↓
Treat as abnormal
↓
Stop dependent operation
↓
Move to known safe/recovery state

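Detection is reactive: it waits for a failure to announce itself.

Failure assumption is proactive: it treats a missing or late signal as a failure in its own right.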

A reliable system asks:

  • What signal proves the action completed?
  • How long is it reasonable to wait?
  • What if the signal never comes?
  • What if the signal comes late?
  • What if the signal contradicts another signal?
  • What state should the machine enter?
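
The "silence is a failure signal" rule is commonly implemented as a watchdog. A minimal sketch, assuming timestamps are supplied by the caller from a monotonic clock (`SignalWatchdog` is a hypothetical name):

```python
class SignalWatchdog:
    """Treat the absence of an expected signal as abnormal after `timeout_s`."""

    def __init__(self, timeout_s: float, now_s: float):
        self.timeout_s = timeout_s
        self.last_seen_s = now_s

    def feed(self, now_s: float) -> None:
        """Call whenever the expected signal (heartbeat, frame, update) arrives."""
        self.last_seen_s = now_s

    def expired(self, now_s: float) -> bool:
        """True when the signal has been silent for longer than allowed."""
        return now_s - self.last_seen_s > self.timeout_s
```

The watchdog does not need an error event from the device. It only needs the expected signal to stop arriving.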

PART 6 — RELIABILITY MODELING

In industrial systems, reliability is not only uptime.

A machine that stays running while producing bad results is not reliable.

A machine that keeps moving after losing position is not reliable.

A machine that hides failures from operators is not reliable.

A useful reliability model includes four dimensions.


1. Availability

Can the machine continue operating when expected?

Questions:

  • Can production continue if one non-critical subsystem fails?
  • Can the system restart cleanly?
  • Can devices reconnect without full machine reboot?
  • Can degraded operation be allowed safely?

2. Correctness

Is the machine doing the right thing?

Questions:

  • Is the image matched to the correct wafer position?
  • Is the recipe version correct?
  • Is the axis actually at the expected position?
  • Is the result associated with the correct run?
  • Is the sensor data fresh?

Correctness is often more important than uptime.


3. Recoverability

Can the system return to a known good state?

Questions:

  • After failure, do we know what step was active?
  • Do we know which commands completed?
  • Can the operator safely resume?
  • Must the machine abort the whole run?
  • Is manual intervention required?
  • Can partial material be saved?

4. Safety

Can the system prevent harm or damage?

Questions:

  • What is the safe state?
  • Should motion stop?
  • Should vacuum remain on or release?
  • Should light/laser turn off?
  • Should robot movement be inhibited?
  • Should operator intervention be blocked?

Reliability modeling should answer this pattern:

text
When X fails:
    How do we detect it?
    How long can detection take?
    What state is unsafe?
    What state is safe?
    What must stop?
    What may continue?
    Who needs to know?
    Can we recover automatically?
    When is operator/service intervention required?
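
The pattern above can be captured as a record per failure so that unanswered questions stay visible during design. This is a sketch: the `FailureResponse` field names and the example values for a camera disconnect are assumptions, not measured or specified figures.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FailureResponse:
    failure: str
    detection: str
    max_detection_s: float
    unsafe_state: str
    safe_state: str
    must_stop: List[str]
    may_continue: List[str]
    notify: List[str]
    auto_recover: bool
    intervention_when: str


def unanswered(entry: FailureResponse) -> List[str]:
    """Names of text or list fields left empty, i.e. questions nobody has answered."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in ("", [])]


# Hypothetical entry with one question deliberately left open.
camera_disconnect = FailureResponse(
    failure="camera disconnect during inspection",
    detection="no frame within the acquisition timeout",
    max_detection_s=0.5,
    unsafe_state="stage keeps scanning without images",
    safe_state="stage stopped, inspection paused",
    must_stop=["inspection workflow", "stage scan"],
    may_continue=["UI", "logging"],
    notify=["operator"],
    auto_recover=False,
    intervention_when="",
)
```

Here `unanswered(camera_disconnect)` reports `intervention_when`, which is exactly the kind of gap that otherwise surfaces for the first time on the factory floor.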

PART 7 — REAL-WORLD FAILURE SCENARIOS

Scenario 1: Intermittent Camera Disconnect Under Load

What it looks like:

  • system works during lab testing
  • camera disconnects during full-speed production
  • issue appears only after high frame rate + motion + processing load
  • restart temporarily fixes it

Why it is hard:

  • no single code path always fails
  • SDK may report generic timeout
  • logs may show processing delay, not camera root cause
  • hardware, driver, USB/Ethernet bandwidth, CPU load, and cable quality may all be involved

Actual layer:

text
Physical / Device / Communication / Resource Load

Architecture lesson:

Camera health, acquisition timing, buffer pressure, and processing throughput must be modeled together.


Scenario 2: Works in Lab, Fails in Factory Noise

What it looks like:

  • sensor behaves correctly in engineering lab
  • in factory, sensor flickers randomly
  • machine occasionally enters wrong branch
  • operators report “sometimes it just stops”

Why it is hard:

  • lab environment is cleaner
  • electrical noise is not reproduced
  • sensor signal may flicker faster than logs capture
  • software may treat a single input transition as truth

Actual layer:

text
Electrical / IO / Control Interpretation

Architecture lesson:

Raw signals are not always reliable facts. They need interpretation, filtering, and validation against machine state.


Scenario 3: Memory Leak Causes Failure After Three Days

What it looks like:

  • machine runs fine after startup
  • after days of production, UI slows down
  • inspection latency increases
  • eventually acquisition fails or process crashes

Why it is hard:

  • short tests pass
  • leak may be in image buffers, native SDK handles, event subscriptions, or unmanaged memory
  • failure symptom appears far from the cause

Actual layer:

text
Software Resource Management / Native Device Integration

Architecture lesson:

Long-running stability is a core reliability requirement. Industrial apps must be designed as long-lived processes.


Scenario 4: Race Condition Causes Rare Incorrect Motion

What it looks like:

  • once in a while, axis moves at the wrong time
  • logs show two valid commands
  • system state changed between validation and execution
  • issue cannot be reproduced easily

Why it is hard:

  • each individual command looks legal
  • timing window is small
  • concurrency hides the real cause
  • UI, workflow, and device events may interleave

Actual layer:

text
Application / Control / Concurrency

Architecture lesson:

Command validation and command execution must be tied to a consistent state model.
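
One way to tie the two together is to perform the state check and the state change inside the same critical section. The sketch below uses a plain `threading.Lock`; `AxisCommander`, its state names, and the `executed` list standing in for the real device call are hypothetical.

```python
import threading


class AxisCommander:
    def __init__(self):
        self._lock = threading.Lock()
        self.state = "Idle"
        self.executed = []   # stands in for commands actually sent to the device

    def set_state(self, new_state: str) -> None:
        with self._lock:
            self.state = new_state

    def try_move(self, target_mm: float) -> bool:
        # Validation and execution share one critical section, so the state
        # cannot change between the check and the action.
        with self._lock:
            if self.state != "Idle":
                return False
            self.state = "Moving"
            self.executed.append(target_mm)
            return True
```

Two move requests that arrive together can no longer both pass validation: the second one sees `Moving` and is rejected.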


Scenario 5: Wrong State Allows Unsafe Command Acceptance

What it looks like:

  • UI enables Start
  • operator clicks Start
  • machine accepts command
  • one subsystem is not actually ready
  • sequence fails or moves into unsafe condition

Why it is hard:

  • UI may show aggregated Ready incorrectly
  • subsystem state may be stale
  • state model may be too simple
  • “Ready” may mean different things for different subsystems

Actual layer:

text
State Modeling / Application / Control

Architecture lesson:

Readiness must be explicit, scoped, and validated at the command boundary, not just displayed in the UI.
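
A sketch of readiness checked at the command boundary follows. It assumes each subsystem reports a state string together with the time it was reported; `machine_ready` and the `"Ready"` literal are illustrative.

```python
from typing import Dict, List, Tuple


def machine_ready(subsystems: Dict[str, Tuple[str, float]],
                  now_s: float,
                  max_age_s: float) -> Tuple[bool, List[str]]:
    """subsystems maps name -> (state, reported_at_s). Returns (ready, reasons)."""
    reasons = []
    for name, (state, reported_at_s) in sorted(subsystems.items()):
        if now_s - reported_at_s > max_age_s:
            reasons.append(f"{name}: status is stale")   # old 'Ready' is not Ready
        elif state != "Ready":
            reasons.append(f"{name}: {state}")
    return (not reasons), reasons
```

The Start handler would call this itself at the moment the command arrives, regardless of what the UI button showed.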


PART 8 — SOFTWARE DESIGN IMPLICATIONS

Reliability is not something you add at the end.

It affects architecture from the beginning.


1. Failure Boundaries

A failure boundary defines where an error is contained and translated.

text
+------------------+
| Camera SDK       |
| raw errors       |
+--------+---------+
         |
         v
+------------------+
| Camera Adapter   |
| classifies fault |
+--------+---------+
         |
         v
+------------------+
| Vision Service   |
| decides impact   |
+--------+---------+
         |
         v
+------------------+
| Workflow         |
| stop/recover     |
+------------------+

Bad design:

text
Camera SDK exception leaks everywhere.

Good design:

text
Camera adapter converts SDK chaos into meaningful machine-level faults.
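
As a rough sketch of such an adapter: the SDK call is represented by a hypothetical `sdk_grab` callable, and the mapping from Python exception types to fault kinds is an assumption, since real SDKs report errors in their own ways.

```python
from dataclasses import dataclass
from enum import Enum


class FaultKind(Enum):
    DISCONNECTED = "disconnected"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CameraFault:
    kind: FaultKind
    detail: str


class CameraAdapter:
    def __init__(self, sdk_grab):
        self._grab = sdk_grab   # hypothetical SDK call that returns a frame or raises

    def grab(self):
        """Return (frame, None) or (None, CameraFault); SDK exceptions never escape."""
        try:
            return self._grab(), None
        except ConnectionError as e:
            return None, CameraFault(FaultKind.DISCONNECTED, str(e))
        except TimeoutError as e:
            return None, CameraFault(FaultKind.TIMEOUT, str(e))
        except Exception as e:
            # Anything unrecognized is still contained and classified.
            return None, CameraFault(FaultKind.UNKNOWN, repr(e))
```

Layers above the adapter then decide on impact using `FaultKind`, without knowing anything about the SDK.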

2. Subsystem Isolation

Subsystems should not collapse together unnecessarily.

Examples:

  • vision failure should not freeze the UI
  • MES failure should not necessarily stop local machine operation
  • logging failure should not crash motion control
  • one camera failure should not always kill the whole machine if degraded operation is allowed

Isolation requires:

  • clear ownership
  • explicit health states
  • bounded queues
  • timeout boundaries
  • independent lifecycle management

3. Timeout Strategies

Timeouts are not just technical parameters.

They define the boundary between:

text
Still waiting

and

text
This is now abnormal

A timeout should be based on machine behavior, not arbitrary numbers.

Ask:

  • How long should this physical action take?
  • What is the worst expected time?
  • What if the machine is cold, loaded, or under stress?
  • What happens when the timeout fires?
  • Is it safe to retry?
  • Is operator intervention required?
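
For a motion command, a timeout can be derived from the move itself. The sketch below assumes an ideal trapezoidal velocity profile; `move_timeout_s`, the settling time, and the 1.5 safety margin are illustrative assumptions that a real machine would replace with measured values.

```python
def move_timeout_s(distance_mm: float,
                   max_velocity_mm_s: float,
                   accel_mm_s2: float,
                   settle_s: float = 0.2,
                   margin: float = 1.5) -> float:
    """Expected duration of a trapezoidal move plus settling, times a safety margin."""
    d = abs(distance_mm)
    # Distance consumed by the acceleration and deceleration ramps together.
    d_ramps = max_velocity_mm_s ** 2 / accel_mm_s2
    if d >= d_ramps:
        # Ramp up, cruise at max velocity, ramp down.
        t = 2 * max_velocity_mm_s / accel_mm_s2 + (d - d_ramps) / max_velocity_mm_s
    else:
        # Short move: triangular profile that never reaches max velocity.
        t = 2 * (d / accel_mm_s2) ** 0.5
    return (t + settle_s) * margin
```

For a 100 mm move at 50 mm/s with 500 mm/s² acceleration, the ideal move takes 2.1 s, so the timeout comes out at (2.1 + 0.2) × 1.5 = 3.45 s.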

4. State Validation

Every important command should be validated against current state.

Not just:

text
Can I call MoveAsync()?

But:

text
Is the machine in a mode where motion is allowed?
Is the axis homed?
Are limits valid?
Are interlocks satisfied?
Is the recipe active?
Is the material clamped?
Is another command already in progress?
Is the position data fresh?

The roadmap’s Domain 1 principle is very relevant here: machine software must be state-driven, not call-driven.
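
The checklist above can be turned into a function that returns every reason a move is blocked, not just a yes or no. `MachineSnapshot`, its fields, the mode names, and the 0.1 s freshness limit are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineSnapshot:
    mode: str
    axis_homed: bool
    interlocks_ok: bool
    recipe_active: bool
    material_clamped: bool
    command_in_progress: bool
    position_age_s: float


def move_blockers(s: MachineSnapshot, max_position_age_s: float = 0.1) -> List[str]:
    """Return the reasons motion is not allowed; an empty list means it may proceed."""
    checks = [
        (s.mode in ("Auto", "Manual"), "mode does not allow motion"),
        (s.axis_homed, "axis not homed"),
        (s.interlocks_ok, "interlock not satisfied"),
        (s.recipe_active, "no active recipe"),
        (s.material_clamped, "material not clamped"),
        (not s.command_in_progress, "another command in progress"),
        (s.position_age_s <= max_position_age_s, "position data is stale"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Returning the reasons gives the operator and the logs something more useful than a disabled button.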


5. Defensive Design

Defensive design means the system expects bad inputs, bad timing, and bad states.

Examples:

  • reject stale data
  • reject impossible transitions
  • treat missing heartbeat as failure
  • validate command preconditions
  • isolate device SDK failures
  • prevent unbounded queue growth
  • distinguish warning, fault, and fatal conditions
  • require explicit recovery after serious faults
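
The last two points can be sketched as a latch that remembers the most severe active condition and refuses to clear serious faults implicitly. The three severity levels follow the list above; `FaultLatch` and its clearing policy are assumptions for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    WARNING = 1
    FAULT = 2
    FATAL = 3


class FaultLatch:
    def __init__(self):
        self.active = None   # highest severity not yet cleared

    def raise_condition(self, severity: Severity) -> None:
        if self.active is None or severity > self.active:
            self.active = severity

    def can_run(self) -> bool:
        """Warnings allow operation; faults and fatal conditions do not."""
        return self.active is None or self.active == Severity.WARNING

    def acknowledge(self, operator_confirmed: bool) -> bool:
        """Try to clear the latch; returns True if nothing remains active."""
        if self.active == Severity.WARNING:
            self.active = None
        elif self.active == Severity.FAULT and operator_confirmed:
            self.active = None
        # FATAL is never cleared here; this sketch assumes it needs service or restart.
        return self.active is None
```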

6. Observability Hooks

This topic is not about logging details, but the architecture must leave hooks for diagnosis.

For every important failure, the system should preserve enough evidence to answer:

  • What command was running?
  • What device state was observed?
  • What workflow step was active?
  • What recipe/version was active?
  • What changed just before failure?
  • Was the system overloaded?
  • Was this the first failure or a consequence?

Without this, production debugging becomes guessing.
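
One possible hook is a structured evidence record captured at the moment a fault is raised. The field names below map to the questions above and are hypothetical, as are the example values in the usage note.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FaultEvidence:
    timestamp_s: float
    fault_code: str
    active_command: str
    workflow_step: str
    recipe_version: str
    device_states: dict
    queue_depth: int        # rough indicator of overload
    caused_by: str = ""     # earlier fault code if this one is a consequence


def to_log_line(evidence: FaultEvidence) -> str:
    """Serialize the record as one JSON line with stable key order."""
    return json.dumps(asdict(evidence), sort_keys=True)
```

A call such as `to_log_line(FaultEvidence(12.5, "CAM_TIMEOUT", "AcquireFrame", "Inspect site 3", "recipe-v7", {"camera": "Faulted"}, 42))` produces a single line that can be searched and correlated later.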


Bad vs Good Reliability Design

Bad:

text
UI Button
↓
Direct SDK Call
↓
Try/Catch
↓
MessageBox("Error")

Good:

text
UI Command
↓
Application Command Handler
↓
State / Mode / Interlock Validation
↓
Workflow Orchestrator
↓
Subsystem Boundary
↓
Device Adapter
↓
Classified Result / Fault
↓
State Transition / Recovery Decision

Good systems do not merely catch failures.

They classify, contain, propagate, and recover from them.


PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

How to Explain Failure Modeling Clearly

You can say:

In industrial software, I would not start by modeling only the normal workflow. I would first identify what can fail at each layer: physical hardware, IO, devices, communication, timing, application state, resources, and operator actions. Then I would define how each failure is detected, how it propagates, what safe state is required, and whether the system can recover automatically or needs operator/service intervention.

That is a strong answer because it shows system thinking.


Why Failure Modes Matter

You can say:

The component name is less important than the failure mode. A camera can fail-stop, fail-slow, return stale images, disconnect intermittently, or overload the processing pipeline. Each mode requires different containment and recovery behavior. Treating all of them as just “camera error” is too shallow for production machine software.


Common Mistakes Engineers Make

Common mistakes include:

  • assuming device calls either succeed or throw
  • treating timeout as generic exception
  • trusting stale status
  • letting SDK errors leak into workflow logic
  • allowing UI state to become the source of truth
  • missing partial failure scenarios
  • retrying commands that should not be retried
  • not defining safe states
  • not modeling recovery states
  • ignoring long-running resource exhaustion
  • designing only for lab conditions

What Strong Engineers Understand

Strong industrial engineers understand:

Failures are not isolated technical events. They are system behavior events.

A camera timeout is not just a camera problem.

It may affect:

  • acquisition
  • processing
  • workflow
  • motion synchronization
  • result correctness
  • operator trust
  • production throughput
  • recovery safety

Strong engineers design boundaries so one failure does not silently corrupt the whole system.


Final Mental Model

Think of industrial reliability like this:

text
Failure Origin
↓
Failure Mode
↓
Detection Mechanism
↓
Containment Boundary
↓
State Transition
↓
Safe Action
↓
Recovery Path
↓
Diagnostic Evidence

A production-grade machine system is reliable not because nothing fails.

It is reliable because when things fail, the system:

  • detects the problem early enough
  • avoids unsafe action
  • protects machine/material/operator
  • keeps state understandable
  • prevents cascading failure
  • supports recovery
  • leaves enough evidence to diagnose root cause

That is the core mindset behind the Failure Modes & System Reliability Model.
