Failure Modes & System Reliability Model

This topic belongs to the Reliability, Fault Handling & Recovery domain, where machines must detect failures, fail safely, report clearly, and recover without making things worse. The roadmap frames this as a major mindset shift from enterprise software, because industrial failures can stop production, damage equipment, scrap material, or create safety risk.

PART 1 — BIG PICTURE: WHY FAILURE MODELING COMES FIRST

Industrial machine software is not designed only for the happy path.

A business web system might ask:

“What should happen when the user clicks Submit?”

An industrial machine system must also ask:

“What if the camera disconnects while the stage is moving?”
“What if the axis reaches the target but the encoder reports an impossible value?”
“What if the image pipeline becomes overloaded after running for six hours?”
“What if the operator presses Stop during a half-completed sequence?”

This is why strong industrial engineers think about failure before they think about implementation.

In industrial systems, failure is not exceptional. It is part of normal reality.

Machines deal with:

vibration
dust
heat
electrical noise
cable wear
bad sensors
device firmware bugs
timing drift
operator mistakes
long-running memory pressure
partial startup/shutdown
unstable factory environments

So the architecture must support:

partial failure
degraded operation
safe stop
clear fault reporting
controlled recovery
diagnosability after the fact

A good industrial system does not assume:

“Everything works unless there is an exception.”

It assumes:

“Anything can fail, some failures will be delayed, some will be misleading, and some will only appear under production conditions.”

Example:

Camera disconnects during inspection
        ↓
No new image arrives
        ↓
Image processing waits
        ↓
Inspection workflow does not complete
        ↓
Stage remains in inspection position
        ↓
Operator sees machine stuck
        ↓
Wrong manual recovery may cause more damage

The camera failure is only the origin. The real problem is how the whole system responds.

PART 2 — FAILURE CATEGORIES: LAYERED MODEL

A useful way to model industrial failure is by system layer.

+--------------------------------------------------+
| UI / HMI                                         |
| Operator screens, commands, alarms, status       |
+--------------------------------------------------+
| Application / Workflow                           |
| Recipes, sequences, inspection jobs, run logic   |
+--------------------------------------------------+
| Control Layer                                    |
| State machines, command gating, interlocks       |
+--------------------------------------------------+
| Communication Layer                              |
| Ethernet, serial, fieldbus, SDK calls, messages  |
+--------------------------------------------------+
| Device Layer                                     |
| Cameras, motion controllers, IO cards, PLCs      |
+--------------------------------------------------+
| Physical Layer                                   |
| Motors, stages, sensors, cables, mechanics       |
+--------------------------------------------------+

The same visible symptom can come from different layers.

For example, “axis did not move” could mean:

motor power is off
servo drive faulted
cable is loose
motion controller rejected command
communication timed out
interlock blocked motion
workflow sent command in wrong state
UI enabled a button when it should not have

That is why failure modeling must be layered.

1. Physical / Mechanical Failures

These happen in the real machine body.

Examples:

belt slips
stage jams
actuator stalls
vacuum cup loses grip
guide rail contamination increases friction
wafer is misaligned
mechanical backlash affects position repeatability

Software may not directly “see” the mechanical problem. It may only see symptoms:

motion timeout
position error
sensor mismatch
unexpected vibration
repeatability drift

Architectural implication:

Software should not assume that command accepted means physical action succeeded.
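A minimal sketch of that rule, assuming a hypothetical axis object with `command_move` and `read_position` methods (both names are invented for illustration and do not come from any real motion SDK). Success is decided by measured position within a tolerance and a deadline, not by the command being accepted. A simulated clock keeps the example deterministic.

```python
from enum import Enum


class MoveOutcome(Enum):
    REACHED = "reached"
    REJECTED = "rejected"
    TIMED_OUT = "timed_out"


class FakeClock:
    """Simulated time so the sketch runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class FakeAxis:
    """Hypothetical axis. When jammed, it accepts commands but never moves."""

    def __init__(self, jammed=False):
        self.jammed = jammed
        self.position = 0.0

    def command_move(self, target):
        if not self.jammed:
            self.position = target
        return True  # "accepted" only means the controller took the command

    def read_position(self):
        return self.position


def move_and_verify(axis, target, tolerance, timeout_s, clock):
    if not axis.command_move(target):
        return MoveOutcome.REJECTED
    deadline = clock.time() + timeout_s
    while clock.time() < deadline:
        # Success is decided by measured position, not by command acceptance.
        if abs(axis.read_position() - target) <= tolerance:
            return MoveOutcome.REACHED
        clock.sleep(0.01)
    return MoveOutcome.TIMED_OUT
```

With a jammed axis, `move_and_verify` returns `TIMED_OUT` even though the command was accepted, which is the case a "command accepted means done" design would miss.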

2. Electrical / IO Failures

These involve signals, wiring, voltage, and electrical noise.

Examples:

sensor signal flickers
digital input stuck ON
cable intermittently disconnects
emergency stop circuit changes state
electrical noise causes false trigger
analog signal drifts

Architectural implication:

Raw IO should usually be interpreted through validation, debounce, state context, and health checks.
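As one illustration of the debounce part, here is a small sample-count filter. The class name and the choice of counting consecutive samples (instead of measuring elapsed time) are assumptions made for this sketch.

```python
class DebouncedInput:
    """Accepts a new level only after it was read `stable_samples` times in a row."""

    def __init__(self, initial, stable_samples):
        self.state = initial
        self._stable_samples = stable_samples
        self._count = 0

    def update(self, raw):
        if raw == self.state:
            # Raw input agrees with the accepted state: discard any pending change.
            self._count = 0
        else:
            self._count += 1
            if self._count >= self._stable_samples:
                self.state = raw
                self._count = 0
        return self.state
```

A flickering input such as `True, False, True, True, False` never changes the accepted state with `stable_samples=3`; only three consecutive `True` readings do. Debounce alone does not detect a stuck input, so it complements state-context and health checks and does not replace them.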

3. Device / Hardware Failures

These happen inside devices controlled by software.

Examples:

camera disconnects
frame grabber stops delivering frames
motion controller enters fault state
light controller ignores command
robot controller reports alarm
IO module becomes unreachable

Architectural implication:

Each device should have an explicit health/state model, not just a wrapper around SDK calls.

This connects strongly to the roadmap’s hardware integration domain, where many real failures come from unstable drivers, communication drops, timeouts, partial initialization, and device contention.
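A rough sketch of an explicit device health model follows. The state names and the allowed-transition table are illustrative assumptions, not taken from any particular SDK. Real devices usually need more states.

```python
from enum import Enum, auto


class DeviceState(Enum):
    DISCONNECTED = auto()
    INITIALIZING = auto()
    READY = auto()
    DEGRADED = auto()
    FAULTED = auto()


# Which transitions are legal. Anything else is treated as a logic error.
ALLOWED = {
    DeviceState.DISCONNECTED: {DeviceState.INITIALIZING},
    DeviceState.INITIALIZING: {DeviceState.READY, DeviceState.FAULTED, DeviceState.DISCONNECTED},
    DeviceState.READY: {DeviceState.DEGRADED, DeviceState.FAULTED, DeviceState.DISCONNECTED},
    DeviceState.DEGRADED: {DeviceState.READY, DeviceState.FAULTED, DeviceState.DISCONNECTED},
    DeviceState.FAULTED: {DeviceState.INITIALIZING, DeviceState.DISCONNECTED},
}


class DeviceHealth:
    def __init__(self, name):
        self.name = name
        self.state = DeviceState.DISCONNECTED
        self.history = [self.state]

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.name}: {self.state.name} -> {new_state.name} not allowed")
        self.state = new_state
        self.history.append(new_state)

    @property
    def accepts_commands(self):
        # Policy choice for this sketch: only a fully READY device takes commands.
        return self.state is DeviceState.READY
```

The point of the table is that a faulted camera cannot silently jump back to `READY`; it has to pass through re-initialization, which gives the workflow a place to hook recovery logic.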

4. Communication Failures

These happen between software and devices/controllers.

Examples:

timeout
dropped connection
delayed response
corrupted frame
duplicate message
out-of-order event
stale cached status
request succeeds but response is lost

Architectural implication:

Communication is not just transport. It affects system truth.

A motion command may have reached the controller even if the PC did not receive the response. That creates uncertainty.
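One way to make that uncertainty explicit is to give a lost response its own outcome instead of folding it into "failed". The sketch below assumes the controller can report the ID of the last command it accepted; `query_last_command_id` is a hypothetical callable, and many real controllers offer no such query.

```python
from enum import Enum


class CommandOutcome(Enum):
    DELIVERED = "delivered"
    NOT_DELIVERED = "not_delivered"
    UNKNOWN = "unknown"


def resolve_after_timeout(command_id, query_last_command_id):
    """Decide what a lost response means, without resending the command.

    query_last_command_id is a hypothetical callable that asks the controller
    for the ID of the last command it accepted.
    """
    try:
        last_seen = query_last_command_id()
    except TimeoutError:
        # The link is still down: the command may or may not have executed.
        return CommandOutcome.UNKNOWN
    if last_seen == command_id:
        return CommandOutcome.DELIVERED
    return CommandOutcome.NOT_DELIVERED
```

Only `NOT_DELIVERED` makes a resend obviously safe. `UNKNOWN` should normally stop dependent operations until the real machine state has been re-read.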

5. Timing / Synchronization Failures

These are very common in inspection, motion, and automation systems.

Examples:

camera trigger arrives too early
image timestamp does not match stage position
light turns on after exposure starts
encoder position sample is delayed
UI shows old state as current
processing result belongs to previous wafer

Architectural implication:

Correctness depends not only on data, but on when the data was valid.
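To illustrate, the helper below pairs an image with a stage position only when a position sample exists close enough in time. The function name, the sample format, and the skew limit are assumptions for this sketch, and it presumes both timestamps come from the same clock.

```python
def position_at(image_ts, samples, max_skew_s):
    """Return the stage position sampled closest to image_ts.

    samples is a list of (timestamp_s, position_mm) pairs. Returns None when
    no sample lies within max_skew_s, so the caller must treat the image as
    unpositioned instead of pairing it with an old position.
    """
    if not samples:
        return None
    ts, position = min(samples, key=lambda s: abs(s[0] - image_ts))
    if abs(ts - image_ts) > max_skew_s:
        return None
    return position
```

Returning `None` forces the caller to handle "no valid position for this image" as its own case, instead of quietly using the most recent value.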

6. Data / State Inconsistency

The system’s internal model no longer matches machine reality.

Examples:

software thinks wafer is loaded, but sensor says empty
workflow thinks machine is Running, but motion controller is Faulted
recipe version changed during run
UI shows Ready while one subsystem is still initializing
inspection result is attached to wrong product ID

Architectural implication:

State must be modeled explicitly and validated across subsystem boundaries.

The roadmap also emphasizes state machines, deterministic workflow execution, interlocks, and fault handling as core machine-control concerns.

7. Software Logic Errors

These are normal software bugs, but with physical consequences.

Examples:

wrong state transition
race condition
missing interlock check
incorrect coordinate transform
unit conversion error
wrong recipe parameter applied
command allowed in unsafe mode

Architectural implication:

Business software bugs may corrupt data. Machine software bugs can move hardware incorrectly.

8. Resource Exhaustion

These appear after time or under load.

Examples:

memory leak after three days
image buffers not released
disk fills with inspection images
CPU spike delays control updates
queue grows faster than processing
UI becomes sluggish during high-throughput inspection

Architectural implication:

Long-running behavior is part of reliability, not just performance.

9. Human / Operator Errors

Operators are part of the system.

Examples:

wrong recipe selected
manual command issued in wrong context
alarm ignored
recovery step skipped
service mode left enabled
part loaded incorrectly

Architectural implication:

The system should make correct actions easy and unsafe actions difficult or impossible.

But this topic is not mainly about UI design. At system level, the key point is: operator actions must be modeled as inputs that can fail, arrive at bad times, or conflict with machine state.

PART 3 — FAILURE MODES: HOW THINGS FAIL

A common beginner mistake is to ask only:

“Which component failed?”

A stronger engineer asks:

“How did it fail?”

The failure mode often matters more than the component.

1. Fail-Stop

The component stops working clearly.

Examples:

camera disconnects
motion controller stops responding
PLC connection drops
service crashes

This is often the easiest failure to detect.

Command → No response → Timeout → Fault

Architectural response:

mark subsystem unavailable
stop dependent workflows
enter safe state
require recovery or reconnect

2. Fail-Slow

The component still works, but too slowly.

Examples:

camera frames arrive late
image processing latency increases
database writes become slow
device SDK call blocks longer than usual

This is dangerous because the system appears alive.

Normal response time: 20 ms
Current response time: 2,000 ms
System status: technically alive, operationally unsafe

Architectural response:

define timing expectations
detect latency degradation
apply timeouts
prevent backlog from becoming cascading failure
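A small sketch of latency-degradation detection over a sliding window. The class name and thresholds are illustrative assumptions; in practice the limit would come from the machine's timing budget.

```python
from collections import deque


class LatencyMonitor:
    """Flags degradation when too many recent samples exceed the latency limit."""

    def __init__(self, limit_ms, window, max_violations):
        self._limit_ms = limit_ms
        self._max_violations = max_violations
        self._recent = deque(maxlen=window)  # True marks a sample over the limit

    def record(self, latency_ms):
        self._recent.append(latency_ms > self._limit_ms)
        return self.degraded

    @property
    def degraded(self):
        return sum(self._recent) >= self._max_violations
```

Counting violations in a window, instead of reacting to one slow sample, separates an isolated hiccup from a component that has become persistently slow while still answering.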

3. Fail-Incorrect

The component gives a response, but the response is wrong.

Examples:

camera returns stale image
sensor reports false ON
encoder gives invalid position
recipe parameter is loaded from wrong version
inspection result belongs to previous frame

This is one of the most dangerous modes.

Why?

Because many systems are better at detecting “no data” than “wrong data.”

Architectural response:

validate freshness
validate correlation IDs
validate timestamps
cross-check sensors
reject impossible state combinations
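The freshness and correlation checks can be sketched as below. The `Frame` fields and the idea that each frame carries the ID of the trigger that produced it are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    trigger_id: int
    captured_at: float  # seconds, same clock as `now`


def frame_problems(frame, expected_trigger_id, now, max_age_s) -> List[str]:
    """Return the reasons a frame must not be used; an empty list means it is valid."""
    problems = []
    if frame.trigger_id != expected_trigger_id:
        problems.append("frame belongs to a different trigger")
    age = now - frame.captured_at
    if age < 0:
        problems.append("timestamp is in the future")
    elif age > max_age_s:
        problems.append("frame is stale")
    return problems
```

A frame that arrives on time but carries the previous trigger ID is rejected here, which is the "response exists but is wrong" case that a plain timeout would never catch.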

4. Intermittent Failure

The problem appears and disappears.

Examples:

loose cable disconnects only during vibration
camera fails only when CPU load is high
sensor flickers near threshold
race condition appears once per thousand runs

These are hard because the system may pass tests and fail in production.

Architectural response:

design evidence capture
preserve fault history
expose unstable health states
avoid clearing faults too aggressively
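A minimal sketch of preserving fault history separately from the active fault flag. The names and the "N occurrences within a window" rule are assumptions chosen for illustration.

```python
class FaultHistory:
    """Keeps every occurrence of a fault, even after the active flag is cleared."""

    def __init__(self, window_s, unstable_count):
        self._window_s = window_s
        self._unstable_count = unstable_count
        self._occurrences = []
        self.active = False

    def fault_raised(self, timestamp):
        self._occurrences.append(timestamp)
        self.active = True

    def fault_cleared(self):
        # Clearing the active flag deliberately keeps the occurrence history.
        self.active = False

    def is_unstable(self, now):
        recent = [t for t in self._occurrences if now - t <= self._window_s]
        return len(recent) >= self._unstable_count
```

A device that faults and recovers three times in a minute reports `active == False` but `is_unstable(...) == True`, so the health model can expose an "unstable" state instead of looking healthy between glitches.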

5. Partial System Failure

One subsystem fails while others still work.

Examples:

vision is unavailable, but motion works
MES connection is down, but local production can continue
one camera fails in a multi-camera machine
one axis is faulted, but IO is healthy

Architectural response:

define subsystem boundaries
define degraded modes
prevent healthy subsystems from making unsafe assumptions about failed ones

6. Cascading Failure

One failure causes other failures.

Example:

Image processing slows down
        ↓
Frame queue grows
        ↓
Memory usage increases
        ↓
GC pauses increase
        ↓
UI updates become delayed
        ↓
Operator sees stale machine state
        ↓
Wrong recovery action is taken

Architectural response:

use backpressure
isolate subsystems
fail fast at boundaries
define containment zones
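As a sketch of backpressure at a subsystem boundary, the queue below has a fixed capacity and counts what it drops. Dropping the oldest frame is only one possible policy, and the class is single-threaded for brevity; both are simplifying assumptions.

```python
from collections import deque


class BoundedFrameQueue:
    """Fixed-capacity queue that drops the oldest item when full and counts drops."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._items = deque()
        self.dropped = 0

    def put(self, item):
        if len(self._items) >= self._capacity:
            self._items.popleft()  # policy choice: sacrifice the oldest frame
            self.dropped += 1
        self._items.append(item)

    def get(self):
        return self._items.popleft() if self._items else None
```

In an inspection machine a dropped frame can mean an uninspected part, so a rising `dropped` count would normally raise a fault or slow the producer. What matters is that memory use stays bounded and the overload becomes visible.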

PART 4 — FAILURE PROPAGATION

Failures rarely stay local.

A local device issue can become a full machine incident if the system has poor containment.

+-------------+      +----------------+      +------------------+
| Camera      | ---> | Acquisition    | ---> | Processing       |
| disconnects |      | receives none  |      | waits forever    |
+-------------+      +----------------+      +------------------+
                                                        |
                                                        v
+-------------+      +----------------+      +------------------+
| Operator    | <--- | UI freezes or  | <--- | Workflow stuck   |
| confused    |      | shows stale    |      | mid-inspection   |
+-------------+      +----------------+      +------------------+

The root failure is camera disconnect.

But the system failure is larger:

acquisition did not classify the failure clearly
processing had no timeout boundary
workflow had no recovery state
UI did not show reliable status
operator received no safe guidance

This is why industrial reliability is architectural.

Not:

“Add try/catch around the camera call.”

But:

“Define how camera failure is contained, propagated, reported, and recovered.”

PART 5 — FAILURE DETECTION VS FAILURE ASSUMPTION

A weak design says:

“We will handle the error when we receive it.”

A stronger design says:

“What if we never receive the error?”

In industrial systems, silence can be a failure signal.

Examples:

no heartbeat from PLC
no image from camera
no position update from controller
no completion event from motion command
no response from robot
no sensor transition after actuator command

No event does not mean success.

It may mean:

device died
cable disconnected
event handler failed
communication dropped
controller is overloaded
software missed the message

That is why reliability design uses assumptions like:

Expected signal did not arrive within allowed time
        ↓
Treat as abnormal
        ↓
Stop dependent operation
        ↓
Move to known safe/recovery state
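The "expected signal did not arrive" rule is commonly implemented as a watchdog. The sketch below takes an injected clock function so it can be exercised without real delays; the class and parameter names are illustrative.

```python
class Watchdog:
    """Reports expiry when no heartbeat has been seen for longer than timeout_s."""

    def __init__(self, timeout_s, clock):
        self._timeout_s = timeout_s
        self._clock = clock  # callable returning the current time in seconds
        self._last_seen = clock()

    def feed(self):
        # Call this whenever the expected signal (heartbeat, frame, update) arrives.
        self._last_seen = self._clock()

    def expired(self):
        return self._clock() - self._last_seen > self._timeout_s
```

In production, `time.monotonic` is a reasonable clock to pass in, since it is not affected by wall-clock adjustments. When `expired()` becomes true, the owner of the watchdog is responsible for stopping dependent operations and moving to a safe or recovery state.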

Detection is reactive.

Failure assumption is proactive.

A reliable system asks:

What signal proves the action completed?
How long is it reasonable to wait?
What if the signal never comes?
What if the signal comes late?
What if the signal contradicts another signal?
What state should the machine enter?

PART 6 — RELIABILITY MODELING

In industrial systems, reliability is not only uptime.

A machine that stays running while producing bad results is not reliable.

A machine that keeps moving after losing position is not reliable.

A machine that hides failures from operators is not reliable.

A useful reliability model includes four dimensions.

1. Availability

Can the machine continue operating when expected?

Questions:

Can production continue if one non-critical subsystem fails?
Can the system restart cleanly?
Can devices reconnect without full machine reboot?
Can degraded operation be allowed safely?

2. Correctness

Is the machine doing the right thing?

Questions:

Is the image matched to the correct wafer position?
Is the recipe version correct?
Is the axis actually at the expected position?
Is the result associated with the correct run?
Is the sensor data fresh?

Correctness is often more important than uptime.

3. Recoverability

Can the system return to a known good state?

Questions:

After failure, do we know what step was active?
Do we know which commands completed?
Can the operator safely resume?
Must the machine abort the whole run?
Is manual intervention required?
Can partial material be saved?

4. Safety

Can the system prevent harm or damage?

Questions:

What is the safe state?
Should motion stop?
Should vacuum remain on or release?
Should light/laser turn off?
Should robot movement be inhibited?
Should operator intervention be blocked?

Reliability modeling should answer this pattern:

When X fails:
    How do we detect it?
    How long can detection take?
    What state is unsafe?
    What state is safe?
    What must stop?
    What may continue?
    Who needs to know?
    Can we recover automatically?
    When is operator/service intervention required?

PART 7 — REAL-WORLD FAILURE SCENARIOS

Scenario 1: Intermittent Camera Disconnect Under Load

What it looks like:

system works during lab testing
camera disconnects during full-speed production
issue appears only after high frame rate + motion + processing load
restart temporarily fixes it

Why it is hard:

no single code path always fails
SDK may report generic timeout
logs may show processing delay, not camera root cause
hardware, driver, USB/Ethernet bandwidth, CPU load, and cable quality may all be involved

Actual layer:

Physical / Device / Communication / Resource Load

Architecture lesson:

Camera health, acquisition timing, buffer pressure, and processing throughput must be modeled together.

Scenario 2: Works in Lab, Fails in Factory Noise

What it looks like:

sensor behaves correctly in engineering lab
in factory, sensor flickers randomly
machine occasionally enters wrong branch
operators report “sometimes it just stops”

Why it is hard:

lab environment is cleaner
electrical noise is not reproduced
sensor signal may flicker faster than logs capture
software may treat a single input transition as truth

Actual layer:

Electrical / IO / Control Interpretation

Architecture lesson:

Raw signals are not always reliable facts. They need interpretation, filtering, and validation against machine state.

Scenario 3: Memory Leak Causes Failure After Three Days

What it looks like:

machine runs fine after startup
after days of production, UI slows down
inspection latency increases
eventually acquisition fails or process crashes

Why it is hard:

short tests pass
leak may be in image buffers, native SDK handles, event subscriptions, or unmanaged memory
failure symptom appears far from the cause

Actual layer:

Software Resource Management / Native Device Integration

Architecture lesson:

Long-running stability is a core reliability requirement. Industrial apps must be designed as long-lived processes.

Scenario 4: Race Condition Causes Rare Incorrect Motion

What it looks like:

once in a while, axis moves at the wrong time
logs show two valid commands
system state changed between validation and execution
issue cannot be reproduced easily

Why it is hard:

each individual command looks legal
timing window is small
concurrency hides the real cause
UI, workflow, and device events may interleave

Actual layer:

Application / Control / Concurrency

Architecture lesson:

Command validation and command execution must be tied to a consistent state model.
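One simple way to tie the two together is to validate and claim the state under a single lock, so nothing can change between the check and the transition. The class and state names below are assumptions for this sketch.

```python
import threading


class AxisCommandGate:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = "Idle"

    def try_begin_move(self):
        # Check and claim happen under one lock, so no other caller can
        # change the state between validation and the transition to Moving.
        with self._lock:
            if self._state != "Idle":
                return False
            self._state = "Moving"
            return True

    def finish_move(self):
        with self._lock:
            self._state = "Idle"
```

If eight threads call `try_begin_move()` at once, exactly one gets `True`. The race in Scenario 4 appears when the check and the state change are separate steps, each of which looks valid on its own.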

Scenario 5: Wrong State Allows Unsafe Command Acceptance

What it looks like:

UI enables Start
operator clicks Start
machine accepts command
one subsystem is not actually ready
sequence fails or moves into unsafe condition

Why it is hard:

UI may show aggregated Ready incorrectly
subsystem state may be stale
state model may be too simple
“Ready” may mean different things for different subsystems

Actual layer:

State Modeling / Application / Control

Architecture lesson:

Readiness must be explicit, scoped, and validated at the command boundary, not just displayed in the UI.
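A sketch of readiness checked at the command boundary, per subsystem and with a freshness limit. The `SubsystemStatus` fields and the age threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsystemStatus:
    name: str
    ready: bool
    reported_at: float  # seconds, same clock as `now`


def blocking_reasons(statuses, now, max_age_s):
    """Return why Start must be refused; an empty list means it may proceed."""
    reasons = []
    for status in statuses:
        if now - status.reported_at > max_age_s:
            # A stale "ready" is treated as unknown, not as ready.
            reasons.append(f"{status.name}: status is stale")
        elif not status.ready:
            reasons.append(f"{status.name}: not ready")
    return reasons
```

The command handler calls this when Start is received, regardless of what the UI button showed. The returned reasons can also be shown to the operator, which is more useful than a single aggregated Ready flag.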

PART 8 — SOFTWARE DESIGN IMPLICATIONS

Reliability is not something you add at the end.

It affects architecture from the beginning.

1. Failure Boundaries

A failure boundary defines where an error is contained and translated.

+------------------+
| Camera SDK       |
| raw errors       |
+--------+---------+
         |
         v
+------------------+
| Camera Adapter   |
| classifies fault |
+--------+---------+
         |
         v
+------------------+
| Vision Service   |
| decides impact   |
+--------+---------+
         |
         v
+------------------+
| Workflow         |
| stop/recover     |
+------------------+

Bad design:

Camera SDK exception leaks everywhere.

Good design:

Camera adapter converts SDK chaos into meaningful machine-level faults.
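A minimal sketch of such an adapter follows. `FakeSdkError`, its numeric codes, and the `sdk_grab` callable are all invented for illustration; a real mapping table would come from the vendor's documentation.

```python
from dataclasses import dataclass
from enum import Enum


class FaultKind(Enum):
    TIMEOUT = "timeout"
    CONNECTION_LOST = "connection_lost"
    UNCLASSIFIED = "unclassified"


class FakeSdkError(Exception):
    """Hypothetical vendor exception carrying a numeric error code."""

    def __init__(self, code):
        super().__init__(f"sdk error {code}")
        self.code = code


# Invented codes for illustration; a real table comes from the vendor manual.
_CODE_TO_KIND = {-1: FaultKind.TIMEOUT, -2: FaultKind.CONNECTION_LOST}


@dataclass(frozen=True)
class MachineFault:
    source: str
    kind: FaultKind
    detail: str


class CameraAdapter:
    def __init__(self, sdk_grab):
        self._sdk_grab = sdk_grab  # hypothetical SDK call returning a frame

    def grab(self):
        """Return (frame, None) on success or (None, MachineFault) on failure."""
        try:
            return self._sdk_grab(), None
        except FakeSdkError as exc:
            kind = _CODE_TO_KIND.get(exc.code, FaultKind.UNCLASSIFIED)
            return None, MachineFault("camera", kind, str(exc))
```

Callers above the adapter never see the vendor exception type. They receive a `MachineFault` whose `kind` the vision service and workflow can act on, and unknown codes are kept visible as `UNCLASSIFIED`.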

2. Subsystem Isolation

Subsystems should not collapse together unnecessarily.

Examples:

vision failure should not freeze the UI
MES failure should not necessarily stop local machine operation
logging failure should not crash motion control
one camera failure should not always kill the whole machine if degraded operation is allowed

Isolation requires:

clear ownership
explicit health states
bounded queues
timeout boundaries
independent lifecycle management

3. Timeout Strategies

Timeouts are not just technical parameters.

They define the boundary between “still waiting” and “this is now abnormal.”

A timeout should be based on machine behavior, not on arbitrary numbers.

Ask:

How long should this physical action take?
What is the worst expected time?
What if the machine is cold, loaded, or under stress?
What happens when the timeout fires?
Is it safe to retry?
Is operator intervention required?
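As a rough illustration of deriving a timeout from physical behavior, the helper below estimates move time from distance and speed, adds settling time, and applies a safety margin. It ignores acceleration and deceleration, and the default margin of 1.5 is an arbitrary assumption for the sketch.

```python
def move_timeout_s(distance_mm, velocity_mm_s, settle_s, margin=1.5):
    """Estimate a move timeout from expected travel time plus settling time.

    Simplification: constant velocity, no acceleration or deceleration ramps.
    """
    if velocity_mm_s <= 0:
        raise ValueError("velocity must be positive")
    expected_s = distance_mm / velocity_mm_s + settle_s
    return expected_s * margin
```

For a 100 mm move at 50 mm/s with 0.2 s of settling, the expected time is 2.2 s and the timeout is about 3.3 s. A timeout derived this way scales with the commanded move, unlike a single hard-coded constant.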

4. State Validation

Every important command should be validated against current state.

Not just:

Can I call MoveAsync()?

But:

Is the machine in a mode where motion is allowed?
Is the axis homed?
Are limits valid?
Are interlocks satisfied?
Is the recipe active?
Is the material clamped?
Is another command already in progress?
Is the position data fresh?

The roadmap’s Domain 1 principle is very relevant here: machine software must be state-driven, not call-driven.
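The checklist above can be expressed as named preconditions evaluated against one consistent snapshot of machine state. The snapshot fields, mode names, and the 0.1 s freshness limit are assumptions for this sketch and cover only part of the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineSnapshot:
    mode: str
    axis_homed: bool
    interlocks_ok: bool
    command_in_progress: bool
    position_age_s: float


# Each precondition is a (description, check) pair evaluated on one snapshot.
MOVE_PRECONDITIONS = [
    ("mode allows motion", lambda m: m.mode in ("Auto", "Manual")),
    ("axis is homed", lambda m: m.axis_homed),
    ("interlocks satisfied", lambda m: m.interlocks_ok),
    ("no command in progress", lambda m: not m.command_in_progress),
    ("position data is fresh", lambda m: m.position_age_s <= 0.1),
]


def unmet_preconditions(snapshot):
    """Return the descriptions of all preconditions the snapshot fails."""
    return [name for name, check in MOVE_PRECONDITIONS if not check(snapshot)]
```

Returning every unmet precondition, not only the first, gives the operator and the diagnostics a complete picture of why a move was refused.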

5. Defensive Design

Defensive design means the system expects bad inputs, bad timing, and bad states.

Examples:

reject stale data
reject impossible transitions
treat missing heartbeat as failure
validate command preconditions
isolate device SDK failures
prevent unbounded queue growth
distinguish warning, fault, and fatal conditions
require explicit recovery after serious faults
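The last two items can be sketched as a latch that treats severities differently. The three severity levels mirror the list above, while the clearing rules (warnings auto-clear, faults need acknowledgement, fatals cannot be acknowledged) are policy assumptions for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    WARNING = 1
    FAULT = 2
    FATAL = 3


class FaultLatch:
    """Tracks active conditions; serious ones stay latched until acknowledged."""

    def __init__(self):
        self._active = {}

    def report(self, code, severity):
        # Keep the highest severity seen for this code.
        self._active[code] = max(severity, self._active.get(code, severity))

    def condition_gone(self, code):
        # Only warnings clear by themselves when the condition disappears.
        if self._active.get(code) == Severity.WARNING:
            del self._active[code]

    def acknowledge(self, code):
        # An operator can clear a FAULT; FATAL needs a restart or service action.
        if self._active.get(code) == Severity.FAULT:
            del self._active[code]
            return True
        return False

    @property
    def blocks_operation(self):
        return any(s >= Severity.FAULT for s in self._active.values())
```

A following-error fault therefore keeps blocking operation after the axis stops misbehaving, until someone explicitly acknowledges it.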

6. Observability Hooks

This topic is not about logging details, but architecture must leave hooks for diagnosis.

For every important failure, the system should preserve enough evidence to answer:

What command was running?
What device state was observed?
What workflow step was active?
What recipe/version was active?
What changed just before failure?
Was the system overloaded?
Was this the first failure or a consequence?

Without this, production debugging becomes guessing.
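One lightweight hook is a ring buffer of recent events that is frozen into a snapshot when a fault is raised. The recorder below is a sketch under that assumption; the event strings and context keys are made up.

```python
from collections import deque


class EvidenceRecorder:
    """Keeps the last few events so a fault can be stored with what preceded it."""

    def __init__(self, capacity):
        self._recent = deque(maxlen=capacity)

    def note(self, event):
        self._recent.append(event)

    def snapshot(self, fault, **context):
        return {
            "fault": fault,
            "context": context,
            "preceding_events": list(self._recent),
        }
```

A call such as `recorder.snapshot("camera timeout", workflow_step="Inspect", recipe="R7")` captures the active step and recipe along with the events leading up to the fault. How the snapshot is persisted is a separate logging concern.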

Bad vs Good Reliability Design

Bad:

UI Button
   ↓
Direct SDK Call
   ↓
Try/Catch
   ↓
MessageBox("Error")

Good:

UI Command
   ↓
Application Command Handler
   ↓
State / Mode / Interlock Validation
   ↓
Workflow Orchestrator
   ↓
Subsystem Boundary
   ↓
Device Adapter
   ↓
Classified Result / Fault
   ↓
State Transition / Recovery Decision

Good systems do not merely catch failures.

They classify, contain, propagate, and recover from them.

PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

How to Explain Failure Modeling Clearly

You can say:

In industrial software, I would not start by modeling only the normal workflow. I would first identify what can fail at each layer: physical hardware, IO, devices, communication, timing, application state, resources, and operator actions. Then I would define how each failure is detected, how it propagates, what safe state is required, and whether the system can recover automatically or needs operator/service intervention.

That is a strong answer because it shows system thinking.

Why Failure Modes Matter

You can say:

The component name is less important than the failure mode. A camera can fail-stop, fail-slow, return stale images, disconnect intermittently, or overload the processing pipeline. Each mode requires different containment and recovery behavior. Treating all of them as just “camera error” is too shallow for production machine software.

Common Mistakes Engineers Make

Common mistakes include:

assuming device calls either succeed or throw
treating timeout as generic exception
trusting stale status
letting SDK errors leak into workflow logic
allowing UI state to become the source of truth
missing partial failure scenarios
retrying commands that should not be retried
not defining safe states
not modeling recovery states
ignoring long-running resource exhaustion
designing only for lab conditions

What Strong Engineers Understand

Strong industrial engineers understand:

Failures are not isolated technical events. They are system behavior events.

A camera timeout is not just a camera problem.

It may affect:

acquisition
processing
workflow
motion synchronization
result correctness
operator trust
production throughput
recovery safety

Strong engineers design boundaries so one failure does not silently corrupt the whole system.

Final Mental Model

Think of industrial reliability like this:

Failure Origin
     ↓
Failure Mode
     ↓
Detection Mechanism
     ↓
Containment Boundary
     ↓
State Transition
     ↓
Safe Action
     ↓
Recovery Path
     ↓
Diagnostic Evidence

A production-grade machine system is reliable not because nothing fails.

It is reliable because when things fail, the system:

detects the problem early enough
avoids unsafe action
protects machine/material/operator
keeps state understandable
prevents cascading failure
supports recovery
leaves enough evidence to diagnose root cause

That is the core mindset behind the Failure Modes & System Reliability Model.
