Failure Modes & System Reliability Model
This topic belongs to Reliability, Fault Handling & Recovery, where machines must detect failures, fail safely, report clearly, and recover without making things worse. The roadmap explicitly frames this as a major mindset shift from enterprise software because industrial failures can stop production, damage equipment, scrap material, or create safety risk.
PART 1 — BIG PICTURE: WHY FAILURE MODELING COMES FIRST
Industrial machine software is not designed only for the happy path.
A business web system might ask:
“What should happen when the user clicks Submit?”
An industrial machine system must also ask:
“What if the camera disconnects while the stage is moving?”
“What if the axis reaches the target but the encoder reports an impossible value?”
“What if the image pipeline becomes overloaded after running for six hours?”
“What if the operator presses Stop during a half-completed sequence?”
This is why strong industrial engineers think about failure before they think about implementation.
In industrial systems, failure is not exceptional. It is part of normal reality.
Machines deal with:
- vibration
- dust
- heat
- electrical noise
- cable wear
- bad sensors
- device firmware bugs
- timing drift
- operator mistakes
- long-running memory pressure
- partial startup/shutdown
- unstable factory environments
So the architecture must support:
- partial failure
- degraded operation
- safe stop
- clear fault reporting
- controlled recovery
- diagnosability after the fact
A good industrial system does not assume:
“Everything works unless there is an exception.”
It assumes:
“Anything can fail, some failures will be delayed, some will be misleading, and some will only appear under production conditions.”
Example:
Camera disconnects during inspection
↓
No new image arrives
↓
Image processing waits
↓
Inspection workflow does not complete
↓
Stage remains in inspection position
↓
Operator sees machine stuck
↓
Wrong manual recovery may cause more damage
The camera failure is only the origin. The real problem is how the whole system responds.
PART 2 — FAILURE CATEGORIES: LAYERED MODEL
A useful way to model industrial failure is by system layer.
+--------------------------------------------------+
| UI / HMI                                         |
| Operator screens, commands, alarms, status       |
+--------------------------------------------------+
| Application / Workflow                           |
| Recipes, sequences, inspection jobs, run logic   |
+--------------------------------------------------+
| Control Layer                                    |
| State machines, command gating, interlocks       |
+--------------------------------------------------+
| Communication Layer                              |
| Ethernet, serial, fieldbus, SDK calls, messages  |
+--------------------------------------------------+
| Device Layer                                     |
| Cameras, motion controllers, IO cards, PLCs      |
+--------------------------------------------------+
| Physical Layer                                   |
| Motors, stages, sensors, cables, mechanics       |
+--------------------------------------------------+
The same visible symptom can come from different layers.
For example, “axis did not move” could mean:
- motor power is off
- servo drive faulted
- cable is loose
- motion controller rejected command
- communication timed out
- interlock blocked motion
- workflow sent command in wrong state
- UI enabled a button when it should not have
That is why failure modeling must be layered.
1. Physical / Mechanical Failures
These happen in the real machine body.
Examples:
- belt slips
- stage jams
- actuator stalls
- vacuum cup loses grip
- guide rail contamination increases friction
- wafer is misaligned
- mechanical backlash affects position repeatability
Software may not directly “see” the mechanical problem. It may only see symptoms:
- motion timeout
- position error
- sensor mismatch
- unexpected vibration
- repeatability drift
Architectural implication:
Software should not assume that command accepted means physical action succeeded.
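One way to act on this is to treat a move as complete only when position feedback confirms it within a time budget. The sketch below is illustrative: `verify_move`, `read_position`, and the injected `clock` and `sleep` callables are hypothetical names, not part of any particular motion SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MoveResult:
    ok: bool
    reason: str


def verify_move(
    read_position: Callable[[], float],  # hypothetical feedback source, e.g. an encoder read
    target: float,
    tolerance: float,
    timeout_s: float,
    clock: Callable[[], float],
    sleep: Callable[[float], None],
    poll_s: float = 0.01,
) -> MoveResult:
    """Poll position feedback until the axis is within tolerance or time runs out."""
    deadline = clock() + timeout_s
    while True:
        error = abs(read_position() - target)
        if error <= tolerance:
            return MoveResult(True, "in position")
        if clock() >= deadline:
            return MoveResult(False, f"motion timeout, position error {error:.3f}")
        sleep(poll_s)
```

In a real process the clock and sleep would typically be `time.monotonic` and `time.sleep`; injecting them keeps the check testable without hardware.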
2. Electrical / IO Failures
These involve signals, wiring, voltage, and electrical noise.
Examples:
- sensor signal flickers
- digital input stuck ON
- cable intermittently disconnects
- emergency stop circuit changes state
- electrical noise causes false trigger
- analog signal drifts
Architectural implication:
Raw IO should usually be interpreted through validation, debounce, state context, and health checks.
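As a minimal sketch of the debounce part only, the class below accepts a new input level after it has stayed stable for a configured time. The class name and the 20 ms figure in the usage note are assumptions for illustration; real values depend on the sensor and the machine.

```python
class DebouncedInput:
    """Accept a new input level only after it has been stable for `stable_s` seconds."""

    def __init__(self, stable_s: float, initial: bool = False):
        self.stable_s = stable_s
        self.value = initial        # debounced value the rest of the system sees
        self._candidate = initial   # most recent raw level
        self._since = None          # time the candidate level was first seen

    def update(self, raw: bool, now: float) -> bool:
        if raw != self._candidate:
            # Raw level changed: restart the stability timer.
            self._candidate = raw
            self._since = now
        if self._candidate != self.value and now - self._since >= self.stable_s:
            self.value = self._candidate
        return self.value
```

With `stable_s=0.02`, a sensor that flickers ON and OFF within a few milliseconds never changes the debounced value; only a level held for at least 20 ms does.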
3. Device / Hardware Failures
These happen inside devices controlled by software.
Examples:
- camera disconnects
- frame grabber stops delivering frames
- motion controller enters fault state
- light controller ignores command
- robot controller reports alarm
- IO module becomes unreachable
Architectural implication:
Each device should have an explicit health/state model, not just a wrapper around SDK calls.
This connects strongly to the roadmap’s hardware integration domain, where many real failures come from unstable drivers, communication drops, timeouts, partial initialization, and device contention.
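An explicit health model can be as small as an enum plus a table of allowed transitions. The states and transitions below are assumptions chosen for illustration; a real device would define its own.

```python
from enum import Enum, auto


class Health(Enum):
    DISCONNECTED = auto()
    INITIALIZING = auto()
    READY = auto()
    DEGRADED = auto()
    FAULTED = auto()


# Allowed transitions; anything else is rejected instead of silently applied.
ALLOWED = {
    Health.DISCONNECTED: {Health.INITIALIZING},
    Health.INITIALIZING: {Health.READY, Health.FAULTED, Health.DISCONNECTED},
    Health.READY: {Health.DEGRADED, Health.FAULTED, Health.DISCONNECTED},
    Health.DEGRADED: {Health.READY, Health.FAULTED, Health.DISCONNECTED},
    Health.FAULTED: {Health.INITIALIZING, Health.DISCONNECTED},
}


class DeviceHealth:
    def __init__(self, name: str):
        self.name = name
        self.state = Health.DISCONNECTED
        self.history = [Health.DISCONNECTED]

    def transition(self, new: Health) -> None:
        if new not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: illegal transition {self.state.name} -> {new.name}"
            )
        self.state = new
        self.history.append(new)

    @property
    def usable(self) -> bool:
        return self.state in (Health.READY, Health.DEGRADED)
```

Note that in this sketch a faulted device cannot jump straight back to READY; it has to re-initialize first, which makes recovery an explicit step.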
4. Communication Failures
These happen between software and devices/controllers.
Examples:
- timeout
- dropped connection
- delayed response
- corrupted frame
- duplicate message
- out-of-order event
- stale cached status
- request succeeds but response is lost
Architectural implication:
Communication is not just transport. It affects system truth.
A motion command may have reached the controller even if the PC did not receive the response. That creates uncertainty.
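One way to represent that uncertainty is to give a command three possible outcomes instead of two. In this sketch, `transport_send` and `query_target` are hypothetical callables standing in for whatever the controller interface provides.

```python
import math
from enum import Enum, auto


class Outcome(Enum):
    ACCEPTED = auto()      # controller acknowledged the command
    NOT_ACCEPTED = auto()  # controller refused or never received the command
    UNKNOWN = auto()       # no response: the command may or may not have been received


def send_move(transport_send, target: float) -> Outcome:
    """`transport_send` returns True/False, or raises TimeoutError if no reply arrives."""
    try:
        accepted = transport_send(target)
    except TimeoutError:
        return Outcome.UNKNOWN
    return Outcome.ACCEPTED if accepted else Outcome.NOT_ACCEPTED


def resolve_unknown(query_target, target: float) -> Outcome:
    """Resolve uncertainty by asking the controller which target it holds,
    rather than blindly resending the command."""
    held = query_target()
    if math.isclose(held, target, abs_tol=1e-6):
        return Outcome.ACCEPTED
    return Outcome.NOT_ACCEPTED
```

The design point is that a timeout maps to UNKNOWN, not to failure, so the caller is forced to re-synchronize with the controller before deciding whether to retry.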
5. Timing / Synchronization Failures
These are very common in inspection, motion, and automation systems.
Examples:
- camera trigger arrives too early
- image timestamp does not match stage position
- light turns on after exposure starts
- encoder position sample is delayed
- UI shows old state as current
- processing result belongs to previous wafer
Architectural implication:
Correctness depends not only on data, but on when the data was valid.
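For example, an image can be paired with a stage position only if a position sample exists close enough to the image timestamp. The function below is a sketch; the sample format and the `max_skew_s` parameter are assumptions, and it presumes both timestamps come from the same clock.

```python
from bisect import bisect_left


def position_at(samples, t_image, max_skew_s):
    """samples: list of (timestamp_s, position) tuples sorted by timestamp.

    Return the position sampled closest to t_image, or None if no sample
    lies within max_skew_s of the image timestamp.
    """
    if not samples:
        return None
    times = [t for t, _ in samples]
    i = bisect_left(times, t_image)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = samples[max(i - 1, 0): i + 1]
    t_best, pos = min(candidates, key=lambda s: abs(s[0] - t_image))
    return pos if abs(t_best - t_image) <= max_skew_s else None
```

Returning None instead of the nearest available sample makes a timing gap visible to the caller rather than silently attaching the wrong position.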
6. Data / State Inconsistency
The system’s internal model no longer matches machine reality.
Examples:
- software thinks wafer is loaded, but sensor says empty
- workflow thinks machine is Running, but motion controller is Faulted
- recipe version changed during run
- UI shows Ready while one subsystem is still initializing
- inspection result is attached to wrong product ID
Architectural implication:
State must be modeled explicitly and validated across subsystem boundaries.
The roadmap also emphasizes state machines, deterministic workflow execution, interlocks, and fault handling as core machine-control concerns.
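A cross-boundary check can be sketched as a function that compares the software's model with what subsystems report and returns every disagreement. The dictionary keys below are hypothetical names chosen for this example.

```python
def find_inconsistencies(model: dict, observed: dict) -> list:
    """Compare what the software believes with what subsystems report."""
    problems = []
    if model["wafer_loaded"] != observed["wafer_sensor"]:
        problems.append("wafer presence: model and sensor disagree")
    if model["workflow"] == "Running" and observed["motion"] == "Faulted":
        problems.append("workflow Running while motion controller is Faulted")
    if model["recipe_version"] != observed["recipe_version_in_use"]:
        problems.append("recipe version changed during run")
    return problems
```

A non-empty result would normally block further commands until the mismatch is resolved, since neither side can be assumed correct.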
7. Software Logic Errors
These are normal software bugs, but with physical consequences.
Examples:
- wrong state transition
- race condition
- missing interlock check
- incorrect coordinate transform
- unit conversion error
- wrong recipe parameter applied
- command allowed in unsafe mode
Architectural implication:
Business software bugs may corrupt data. Machine software bugs can move hardware incorrectly.
8. Resource Exhaustion
These appear after time or under load.
Examples:
- memory leak after three days
- image buffers not released
- disk fills with inspection images
- CPU spike delays control updates
- queue grows faster than processing
- UI becomes sluggish during high-throughput inspection
Architectural implication:
Long-running behavior is part of reliability, not just performance.
9. Human / Operator Errors
Operators are part of the system.
Examples:
- wrong recipe selected
- manual command issued in wrong context
- alarm ignored
- recovery step skipped
- service mode left enabled
- part loaded incorrectly
Architectural implication:
The system should make correct actions easy and unsafe actions difficult or impossible.
But this topic is not mainly about UI design. At the system level, the key point is: operator actions must be modeled as inputs that can fail, arrive at bad times, or conflict with machine state.
PART 3 — FAILURE MODES: HOW THINGS FAIL
A common beginner mistake is to ask only:
“Which component failed?”
A stronger engineer asks:
“How did it fail?”
The failure mode often matters more than the component.
1. Fail-Stop
The component stops working clearly.
Examples:
- camera disconnects
- motion controller stops responding
- PLC connection drops
- service crashes
This is often the easiest failure to detect.
Command → No response → Timeout → Fault
Architectural response:
- mark subsystem unavailable
- stop dependent workflows
- enter safe state
- require recovery or reconnect
2. Fail-Slow
The component still works, but too slowly.
Examples:
- camera frames arrive late
- image processing latency increases
- database writes become slow
- device SDK call blocks longer than usual
This is dangerous because the system appears alive.
Normal response time: 20 ms
Current response time: 2,000 ms
System status: technically alive, operationally unsafe
Architectural response:
- define timing expectations
- detect latency degradation
- apply timeouts
- prevent backlog from becoming cascading failure
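Latency degradation can be detected with a small rolling window over recent response times. This is a sketch only: the `slow_factor` of 3 and the window of 5 samples are illustrative assumptions, not recommended values.

```python
from collections import deque


class LatencyMonitor:
    """Report DEGRADED when most of the recent responses exceed the latency limit."""

    def __init__(self, expected_ms: float, slow_factor: float = 3.0, window: int = 5):
        self.limit_ms = expected_ms * slow_factor
        self.samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> str:
        self.samples.append(latency_ms)
        slow = sum(1 for s in self.samples if s > self.limit_ms)
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and slow > len(self.samples) // 2:
            return "DEGRADED"
        return "OK"
```

Using a majority over a window, instead of a single slow sample, avoids raising a fault on one outlier while still catching a sustained slowdown such as 20 ms drifting to 2,000 ms.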
3. Fail-Incorrect
The component gives a response, but the response is wrong.
Examples:
- camera returns stale image
- sensor reports false ON
- encoder gives invalid position
- recipe parameter is loaded from wrong version
- inspection result belongs to previous frame
This is one of the most dangerous modes.
Why?
Because many systems are better at detecting “no data” than “wrong data.”
Architectural response:
- validate freshness
- validate correlation IDs
- validate timestamps
- cross-check sensors
- reject impossible state combinations
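The freshness and correlation checks from this list can be sketched as a single validation step applied before a frame is used. `Frame`, `frame_id`, and `max_age_s` are hypothetical names; the sketch assumes each trigger is assigned an ID that the frame carries back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    frame_id: int        # correlation ID assigned at trigger time (assumed)
    captured_at: float   # capture timestamp in seconds


def validate_frame(frame: Frame, expected_id: int, now: float, max_age_s: float):
    """Return None if the frame is acceptable, otherwise the reason for rejecting it."""
    if frame.frame_id != expected_id:
        return f"correlation mismatch: expected {expected_id}, got {frame.frame_id}"
    age = now - frame.captured_at
    if age < 0:
        return "timestamp is in the future"
    if age > max_age_s:
        return f"stale frame: {age:.3f}s old"
    return None
```

A frame that arrives on time but carries the previous trigger's ID is rejected here, which is exactly the "wrong data" case that a plain timeout cannot catch.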
4. Intermittent Failure
The problem appears and disappears.
Examples:
- loose cable disconnects only during vibration
- camera fails only when CPU load is high
- sensor flickers near threshold
- race condition appears once per thousand runs
These are hard because the system may pass tests and fail in production.
Architectural response:
- design evidence capture
- preserve fault history
- expose unstable health states
- avoid clearing faults too aggressively
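One way to expose an unstable health state is to count faults over a time window even when each individual fault clears itself. The thresholds in this sketch are assumptions for illustration.

```python
from collections import deque


class FlapDetector:
    """Flag a signal as UNSTABLE when it faults too often within a time window."""

    def __init__(self, max_faults: int, window_s: float):
        self.max_faults = max_faults
        self.window_s = window_s
        self.fault_times = deque()  # recent fault timestamps, kept after faults clear

    def record_fault(self, now: float) -> None:
        self.fault_times.append(now)

    def status(self, now: float) -> str:
        # Forget faults that are older than the observation window.
        while self.fault_times and now - self.fault_times[0] > self.window_s:
            self.fault_times.popleft()
        return "UNSTABLE" if len(self.fault_times) >= self.max_faults else "STABLE"
```

A camera that reconnects successfully three times in a minute would look healthy after each reconnect, yet this detector reports it as UNSTABLE, which is the more useful signal for operators and service.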
5. Partial System Failure
One subsystem fails while others still work.
Examples:
- vision is unavailable, but motion works
- MES connection is down, but local production can continue
- one camera fails in a multi-camera machine
- one axis is faulted, but IO is healthy
Architectural response:
- define subsystem boundaries
- define degraded modes
- prevent healthy subsystems from making unsafe assumptions about failed ones
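Degraded modes can be made explicit by declaring which subsystems each operation needs and deriving the allowed operations from current health. The operation and subsystem names below are hypothetical.

```python
# Operation -> subsystems it requires (hypothetical names for illustration).
REQUIRES = {
    "inspect": {"motion", "vision", "io"},
    "manual_jog": {"motion", "io"},
    "upload_results": {"mes"},
    "unload_material": {"motion", "io"},
}


def allowed_operations(healthy: set) -> set:
    """Return the operations whose required subsystems are all healthy."""
    return {op for op, needs in REQUIRES.items() if needs <= healthy}
```

With vision down, `allowed_operations({"motion", "io", "mes"})` still permits jogging, unloading, and result upload, but not inspection, so the machine can be cleared safely without pretending it is fully operational.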
6. Cascading Failure
One failure causes other failures.
Example:
Image processing slows down
↓
Frame queue grows
↓
Memory usage increases
↓
GC pauses increase
↓
UI updates become delayed
↓
Operator sees stale machine state
↓
Wrong recovery action is taken
Architectural response:
- use backpressure
- isolate subsystems
- fail fast at boundaries
- define containment zones
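A bounded buffer between acquisition and processing is one simple form of backpressure: when the consumer falls behind, the producer is told immediately instead of memory growing without limit. This sketch rejects new frames when full; whether a real machine should drop frames, slow acquisition, or raise a fault is a machine-specific decision.

```python
import queue


class FrameBuffer:
    """Fixed-capacity buffer that refuses new frames instead of growing."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, frame) -> bool:
        """Non-blocking put. Returns False if the buffer is full."""
        try:
            self._q.put_nowait(frame)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def take(self):
        """Non-blocking get. Raises queue.Empty if there is nothing to process."""
        return self._q.get_nowait()

    @property
    def overloaded(self) -> bool:
        return self.dropped > 0
```

The `dropped` counter turns a silent backlog into an observable condition that the workflow can act on before it reaches memory, GC, and UI.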
PART 4 — FAILURE PROPAGATION
Failures rarely stay local.
A local device issue can become a full machine incident if the system has poor containment.
+-------------+      +----------------+      +------------------+
| Camera      | ---> | Acquisition    | ---> | Processing       |
| disconnects |      | receives none  |      | waits forever    |
+-------------+      +----------------+      +------------------+
                                                       |
                                                       v
+-------------+      +----------------+      +------------------+
| Operator    | <--- | UI freezes or  | <--- | Workflow stuck   |
| confused    |      | shows stale    |      | mid-inspection   |
+-------------+      +----------------+      +------------------+
The root failure is camera disconnect.
But the system failure is larger:
- acquisition did not classify the failure clearly
- processing had no timeout boundary
- workflow had no recovery state
- UI did not show reliable status
- operator received no safe guidance
This is why industrial reliability is architectural.
Not:
“Add try/catch around the camera call.”
But:
“Define how camera failure is contained, propagated, reported, and recovered.”
PART 5 — FAILURE DETECTION VS FAILURE ASSUMPTION
A weak design says:
“We will handle the error when we receive it.”
A stronger design says:
“What if we never receive the error?”
In industrial systems, silence can be a failure signal.
Examples:
- no heartbeat from PLC
- no image from camera
- no position update from controller
- no completion event from motion command
- no response from robot
- no sensor transition after actuator command
No event does not mean success.
It may mean:
- device died
- cable disconnected
- event handler failed
- communication dropped
- controller is overloaded
- software missed the message
That is why reliability design uses assumptions like:
Expected signal did not arrive within allowed time
↓
Treat as abnormal
↓
Stop dependent operation
↓
Move to known safe/recovery state
Detection is reactive.
Failure assumption is proactive.
A reliable system asks:
- What signal proves the action completed?
- How long is it reasonable to wait?
- What if the signal never comes?
- What if the signal comes late?
- What if the signal contradicts another signal?
- What state should the machine enter?
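Treating silence as a failure signal usually comes down to a watchdog: a timer that is reset by the expected signal and checked by a supervisor. The sketch below uses hypothetical callbacks `stop_dependents` and `enter_safe_state` to stand in for machine-specific actions.

```python
class Watchdog:
    """Tracks when an expected signal (heartbeat, frame, position update) was last seen."""

    def __init__(self, timeout_s: float, now: float):
        self.timeout_s = timeout_s
        self.last_seen = now

    def feed(self, now: float) -> None:
        self.last_seen = now

    def expired(self, now: float) -> bool:
        return now - self.last_seen > self.timeout_s


def supervise(watchdog: Watchdog, now: float, stop_dependents, enter_safe_state) -> bool:
    """Return True if the missing signal was treated as a failure."""
    if not watchdog.expired(now):
        return False
    stop_dependents()    # dependent operations stop first
    enter_safe_state()   # then the machine moves to a known safe/recovery state
    return True
```

The supervisor does not need an error event from the device; the absence of the expected signal within the allowed time is enough to trigger the abnormal path.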
PART 6 — RELIABILITY MODELING
In industrial systems, reliability is not only uptime.
A machine that stays running while producing bad results is not reliable.
A machine that keeps moving after losing position is not reliable.
A machine that hides failures from operators is not reliable.
A useful reliability model includes four dimensions.
1. Availability
Can the machine continue operating when expected?
Questions:
- Can production continue if one non-critical subsystem fails?
- Can the system restart cleanly?
- Can devices reconnect without full machine reboot?
- Can degraded operation be allowed safely?
2. Correctness
Is the machine doing the right thing?
Questions:
- Is the image matched to the correct wafer position?
- Is the recipe version correct?
- Is the axis actually at the expected position?
- Is the result associated with the correct run?
- Is the sensor data fresh?
Correctness is often more important than uptime.
3. Recoverability
Can the system return to a known good state?
Questions:
- After failure, do we know what step was active?
- Do we know which commands completed?
- Can the operator safely resume?
- Must the machine abort the whole run?
- Is manual intervention required?
- Can partial material be saved?
4. Safety
Can the system prevent harm or damage?
Questions:
- What is the safe state?
- Should motion stop?
- Should vacuum remain on or release?
- Should light/laser turn off?
- Should robot movement be inhibited?
- Should operator intervention be blocked?
Reliability modeling should answer this pattern:
When X fails:
How do we detect it?
How long can detection take?
What state is unsafe?
What state is safe?
What must stop?
What may continue?
Who needs to know?
Can we recover automatically?
When is operator/service intervention required?
PART 7 — REAL-WORLD FAILURE SCENARIOS
Scenario 1: Intermittent Camera Disconnect Under Load
What it looks like:
- system works during lab testing
- camera disconnects during full-speed production
- issue appears only after high frame rate + motion + processing load
- restart temporarily fixes it
Why it is hard:
- no single code path always fails
- SDK may report generic timeout
- logs may show processing delay, not camera root cause
- hardware, driver, USB/Ethernet bandwidth, CPU load, and cable quality may all be involved
Actual layer:
Physical / Device / Communication / Resource Load
Architecture lesson:
Camera health, acquisition timing, buffer pressure, and processing throughput must be modeled together.
Scenario 2: Works in Lab, Fails in Factory Noise
What it looks like:
- sensor behaves correctly in engineering lab
- in factory, sensor flickers randomly
- machine occasionally enters wrong branch
- operators report “sometimes it just stops”
Why it is hard:
- lab environment is cleaner
- electrical noise is not reproduced
- sensor signal may flicker faster than logs capture
- software may treat a single input transition as truth
Actual layer:
Electrical / IO / Control Interpretation
Architecture lesson:
Raw signals are not always reliable facts. They need interpretation, filtering, and validation against machine state.
Scenario 3: Memory Leak Causes Failure After Three Days
What it looks like:
- machine runs fine after startup
- after days of production, UI slows down
- inspection latency increases
- eventually acquisition fails or process crashes
Why it is hard:
- short tests pass
- leak may be in image buffers, native SDK handles, event subscriptions, or unmanaged memory
- failure symptom appears far from the cause
Actual layer:
Software Resource Management / Native Device Integration
Architecture lesson:
Long-running stability is a core reliability requirement. Industrial apps must be designed as long-lived processes.
Scenario 4: Race Condition Causes Rare Incorrect Motion
What it looks like:
- once in a while, axis moves at the wrong time
- logs show two valid commands
- system state changed between validation and execution
- issue cannot be reproduced easily
Why it is hard:
- each individual command looks legal
- timing window is small
- concurrency hides the real cause
- UI, workflow, and device events may interleave
Actual layer:
Application / Control / Concurrency
Architecture lesson:
Command validation and command execution must be tied to a consistent state model.
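A minimal way to close the gap between validation and execution is to perform both under the same lock, so the state cannot change in between. `AxisGate` and its state names are assumptions for this sketch; a real system would likely route this through its state machine.

```python
import threading


class AxisGate:
    """Validates and starts a move as one atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "Idle"
        self.executed = []

    def set_state(self, new_state: str) -> None:
        with self._lock:
            self.state = new_state

    def try_move(self, target: float) -> bool:
        # Check and act under one lock so no other command or event
        # can change the state between validation and execution.
        with self._lock:
            if self.state != "Idle":
                return False
            self.state = "Moving"
            self.executed.append(target)
            return True

    def move_done(self) -> None:
        self.set_state("Idle")
```

If several threads (UI, workflow, device events) call `try_move` at the same moment, only one is accepted; the others are refused with a clear result instead of producing two individually valid but conflicting commands.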
Scenario 5: Wrong State Allows Unsafe Command Acceptance
What it looks like:
- UI enables Start
- operator clicks Start
- machine accepts command
- one subsystem is not actually ready
- sequence fails or moves into unsafe condition
Why it is hard:
- UI may show aggregated Ready incorrectly
- subsystem state may be stale
- state model may be too simple
- “Ready” may mean different things for different subsystems
Actual layer:
State Modeling / Application / Control
Architecture lesson:
Readiness must be explicit, scoped, and validated at the command boundary, not just displayed in the UI.
PART 8 — SOFTWARE DESIGN IMPLICATIONS
Reliability is not something you add at the end.
It affects architecture from the beginning.
1. Failure Boundaries
A failure boundary defines where an error is contained and translated.
+------------------+
| Camera SDK       |
| raw errors       |
+--------+---------+
         |
         v
+------------------+
| Camera Adapter   |
| classifies fault |
+--------+---------+
         |
         v
+------------------+
| Vision Service   |
| decides impact   |
+--------+---------+
         |
         v
+------------------+
| Workflow         |
| stop/recover     |
+------------------+
Bad design:
Camera SDK exception leaks everywhere.
Good design:
Camera adapter converts SDK chaos into meaningful machine-level faults.
2. Subsystem Isolation
Subsystems should not collapse together unnecessarily.
Examples:
- vision failure should not freeze the UI
- MES failure should not necessarily stop local machine operation
- logging failure should not crash motion control
- one camera failure should not always kill the whole machine if degraded operation is allowed
Isolation requires:
- clear ownership
- explicit health states
- bounded queues
- timeout boundaries
- independent lifecycle management
3. Timeout Strategies
Timeouts are not just technical parameters.
They define the boundary between:
Still waiting
and
This is now abnormal
A timeout should be based on machine behavior, not arbitrary numbers.
Ask:
- How long should this physical action take?
- What is the worst expected time?
- What if the machine is cold, loaded, or under stress?
- What happens when the timeout fires?
- Is it safe to retry?
- Is operator intervention required?
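As an illustration of deriving a timeout from machine behavior, the sketch below estimates the ideal duration of a move from a symmetric trapezoidal velocity profile and then adds a margin and a settle time. The profile model, the `margin` of 1.5, and the `settle_s` of 0.2 are assumptions; a real axis would be characterized by measurement.

```python
import math


def expected_move_time_s(distance_mm: float, v_max_mm_s: float, accel_mm_s2: float) -> float:
    """Ideal time for a symmetric trapezoidal (or triangular) velocity profile."""
    d = abs(distance_mm)
    if d >= v_max_mm_s ** 2 / accel_mm_s2:
        # Long move: accelerate, cruise at v_max, decelerate.
        return d / v_max_mm_s + v_max_mm_s / accel_mm_s2
    # Short move: the axis never reaches v_max (triangular profile).
    return 2.0 * math.sqrt(d / accel_mm_s2)


def move_timeout_s(distance_mm, v_max_mm_s, accel_mm_s2, margin=1.5, settle_s=0.2):
    """Timeout = ideal move time scaled by a safety margin, plus settling time."""
    return expected_move_time_s(distance_mm, v_max_mm_s, accel_mm_s2) * margin + settle_s
```

For a 100 mm move at 50 mm/s with 500 mm/s² acceleration, the ideal time is 2.1 s, so the timeout under these assumptions is 3.35 s. A 2 mm move gets a much shorter timeout, which is the point: one fixed number cannot fit both.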
4. State Validation
Every important command should be validated against current state.
Not just:
Can I call MoveAsync()?
But:
Is the machine in a mode where motion is allowed?
Is the axis homed?
Are limits valid?
Are interlocks satisfied?
Is the recipe active?
Is the material clamped?
Is another command already in progress?
Is the position data fresh?
The roadmap’s Domain 1 principle is very relevant here: machine software must be state-driven, not call-driven.
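The precondition questions above can be sketched as a single validation function that returns every reason a move must be refused. `MachineState` and its fields are hypothetical and cover only part of the list; the point is that the check runs at the command boundary and produces explicit reasons.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    mode: str = "Auto"
    axis_homed: bool = True
    interlocks_ok: bool = True
    material_clamped: bool = True
    command_in_progress: bool = False
    position_age_s: float = 0.0


def move_blockers(s: MachineState, max_position_age_s: float = 0.1) -> list:
    """Return every reason the move must be refused; an empty list means allowed."""
    checks = [
        (s.mode in ("Auto", "Manual"), "mode does not allow motion"),
        (s.axis_homed, "axis not homed"),
        (s.interlocks_ok, "interlock not satisfied"),
        (s.material_clamped, "material not clamped"),
        (not s.command_in_progress, "another command is in progress"),
        (s.position_age_s <= max_position_age_s, "position data is stale"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Returning all blockers, rather than stopping at the first, gives the operator and the logs a complete picture of why a command was rejected.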
5. Defensive Design
Defensive design means the system expects bad inputs, bad timing, and bad states.
Examples:
- reject stale data
- reject impossible transitions
- treat missing heartbeat as failure
- validate command preconditions
- isolate device SDK failures
- prevent unbounded queue growth
- distinguish warning, fault, and fatal conditions
- require explicit recovery after serious faults
6. Observability Hooks
This topic is not about logging details, but architecture must leave hooks for diagnosis.
For every important failure, the system should preserve enough evidence to answer:
- What command was running?
- What device state was observed?
- What workflow step was active?
- What recipe/version was active?
- What changed just before failure?
- Was the system overloaded?
- Was this the first failure or a consequence?
Without this, production debugging becomes guessing.
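One possible shape for such a hook is a small immutable record captured at the moment a fault is raised. The field names below are assumptions that mirror the questions above; how the record is stored or shipped is outside the scope of this sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultEvidence:
    fault_code: str
    command: str            # command that was running
    workflow_step: str      # workflow step that was active
    recipe_version: str
    device_states: dict     # device name -> observed state
    queue_depth: int        # rough indicator of overload
    caused_by: Optional[str] = None  # earlier fault code if this is a consequence


def to_log_line(evidence: FaultEvidence) -> str:
    """Serialize the evidence as one JSON line for later diagnosis."""
    return json.dumps(asdict(evidence), sort_keys=True)
```

The `caused_by` field is what lets a later reader separate the first failure from its consequences.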
Bad vs Good Reliability Design
Bad:
UI Button
↓
Direct SDK Call
↓
Try/Catch
↓
MessageBox("Error")
Good:
UI Command
↓
Application Command Handler
↓
State / Mode / Interlock Validation
↓
Workflow Orchestrator
↓
Subsystem Boundary
↓
Device Adapter
↓
Classified Result / Fault
↓
State Transition / Recovery Decision
Good systems do not merely catch failures.
They classify, contain, propagate, and recover from them.
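The "Device Adapter → Classified Result / Fault" step can be sketched as an adapter that catches raw SDK exceptions at the boundary and returns a machine-level result. `sdk_grab` is a hypothetical stand-in for a vendor call, and the mapping from exception types to fault kinds is an assumption; real SDKs report errors in their own ways.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FaultKind(Enum):
    DEVICE_UNAVAILABLE = auto()
    TIMEOUT = auto()
    UNCLASSIFIED = auto()


@dataclass
class GrabResult:
    image: object = None
    fault: Optional[FaultKind] = None
    detail: str = ""


class CameraAdapter:
    def __init__(self, sdk_grab):
        self._sdk_grab = sdk_grab  # hypothetical SDK call that may raise

    def grab(self) -> GrabResult:
        try:
            return GrabResult(image=self._sdk_grab())
        except ConnectionError as e:
            return GrabResult(fault=FaultKind.DEVICE_UNAVAILABLE, detail=str(e))
        except TimeoutError as e:
            return GrabResult(fault=FaultKind.TIMEOUT, detail=str(e))
        except Exception as e:
            # Unknown SDK errors are still contained at this boundary.
            return GrabResult(fault=FaultKind.UNCLASSIFIED, detail=repr(e))
```

Layers above the adapter see only `FaultKind` values, so the workflow can decide between stop, retry, and recovery without knowing anything about SDK exception types.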
PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS
How to Explain Failure Modeling Clearly
You can say:
In industrial software, I would not start by modeling only the normal workflow. I would first identify what can fail at each layer: physical hardware, IO, devices, communication, timing, application state, resources, and operator actions. Then I would define how each failure is detected, how it propagates, what safe state is required, and whether the system can recover automatically or needs operator/service intervention.
That is a strong answer because it shows system thinking.
Why Failure Modes Matter
You can say:
The component name is less important than the failure mode. A camera can fail-stop, fail-slow, return stale images, disconnect intermittently, or overload the processing pipeline. Each mode requires different containment and recovery behavior. Treating all of them as just “camera error” is too shallow for production machine software.
Common Mistakes Engineers Make
Common mistakes include:
- assuming device calls either succeed or throw
- treating timeout as generic exception
- trusting stale status
- letting SDK errors leak into workflow logic
- allowing UI state to become the source of truth
- missing partial failure scenarios
- retrying commands that should not be retried
- not defining safe states
- not modeling recovery states
- ignoring long-running resource exhaustion
- designing only for lab conditions
What Strong Engineers Understand
Strong industrial engineers understand:
Failures are not isolated technical events. They are system behavior events.
A camera timeout is not just a camera problem.
It may affect:
- acquisition
- processing
- workflow
- motion synchronization
- result correctness
- operator trust
- production throughput
- recovery safety
Strong engineers design boundaries so one failure does not silently corrupt the whole system.
Final Mental Model
Think of industrial reliability like this:
Failure Origin
↓
Failure Mode
↓
Detection Mechanism
↓
Containment Boundary
↓
State Transition
↓
Safe Action
↓
Recovery Path
↓
Diagnostic Evidence
A production-grade machine system is reliable not because nothing fails.
It is reliable because when things fail, the system:
- detects the problem early enough
- avoids unsafe action
- protects machine/material/operator
- keeps state understandable
- prevents cascading failure
- supports recovery
- leaves enough evidence to diagnose root cause
That is the core mindset behind the Failure Modes & System Reliability Model.