Below is a structured deep dive on Interlocks & Fault Handling, aligned to Domain 1’s source of truth, where this topic is explicitly defined as “interlocks and permissives,” “motion-level errors,” and “alarm handling and recovery.” This sits inside the broader machine-control domain, where software must enforce safe and deterministic execution around real physical motion and machine behavior.

PART 1 — WHY INTERLOCKS & FAULT HANDLING ARE CENTRAL

In business software, a bad command usually means a failed request, a rollback, or a user-visible error. In machine software, a bad command can mean a crashed stage, a broken end effector, a lost wafer, a damaged camera, or an operator put into an unsafe situation. That is why experienced machine engineers do not start from the assumption that commands are valid. They start from the assumption that unsafe conditions, invalid states, timing problems, and hardware failures will eventually happen.

That mindset is built into this domain. Machine control software must not only issue commands, but also validate whether a command is currently safe, monitor execution while it is happening, detect abnormal conditions, and push the machine into a controlled state when something goes wrong. This is consistent with the domain’s design principles: motion must be validated before execution, systems must be state-driven, and failures must be handled explicitly.

A few concrete examples make this obvious:

Guard open → stage move request: The correct behavior is not “try and see what happens.” The command must be blocked before motion begins.
Scan start requested, but stage not homed: The machine may have no trustworthy coordinate reference yet. Starting anyway is not a normal functional bug; it is a physical integrity problem.
Axis following error during motion: Motion is already in progress. The system must detect the abnormality, stop safely, latch the fault, and prevent blind continuation.
Camera not ready when capture point is reached: This may not be a motion safety problem, but it is still a machine integrity problem. If the process depends on synchronized capture, proceeding without readiness can corrupt results, waste time, or break process assumptions.

The key idea is this:

Safe machine behavior has two layers

Prevent bad actions before they start
Contain and recover from failures if prevention was not enough

That is the core of interlocks and fault handling.

PART 2 — WHAT INTERLOCKS & PERMISSIVES MEAN

These terms are often used loosely in real teams, and that causes confusion. Good teams define them precisely.

Interlock

An interlock is a condition that blocks an action because allowing that action would be unsafe, invalid, or operationally forbidden under the current conditions.

Examples:

Door open blocks motion
Axis not homed blocks absolute move
Chuck vacuum lost blocks wafer transfer
Maintenance access active blocks auto cycle

An interlock is usually thought of as a hard “do not proceed” rule.

Permissive

A permissive is a condition that must be true before an action is allowed to start.

Examples:

Vacuum present before lift
Air pressure healthy before actuator command
Camera armed before scan start
Motion controller enabled before move command

A permissive is more like a precondition for execution.

In practice, the line between interlock and permissive can blur. Many teams use:

permissive for “must be ready”
interlock for “must not violate safety/operating restrictions”

That distinction is useful.

Inhibit

An inhibit is a rule that intentionally suppresses or disables an action, often due to mode, configuration, service state, or supervisory control.

Examples:

Maintenance mode inhibits auto-start
Service session inhibits recipe execution
Diagnostic override inhibits material loading
Faulted subsystem inhibits dependent workflow transitions

An inhibit is often less about immediate physical danger and more about policy/control logic.

Motion-level fault

A motion-level fault is an abnormal condition detected within the motion subsystem or axis behavior itself.

Examples:

Following error
Motion timeout
Homing failure
Limit hit
Servo fault
Feedback mismatch

This is not just “command rejected.” It is usually a signal that motion behavior itself went outside acceptable bounds.

Machine-level alarm

A machine-level alarm is the structured reporting of an abnormal condition to the broader system and the operator. It may be triggered by motion faults, device failures, invalid process states, missing permissives, or repeated recoverable failures.

Examples:

Axis X following error
Guard open during auto operation
Camera trigger timeout
Vacuum loss at wafer chuck
Recipe validation failed before start

How they work together

They are related, but not the same:

Permissive: can I start?
Interlock: am I forbidden from doing this?
Inhibit: is this function intentionally disabled right now?
Motion fault: something went wrong in motion behavior
Alarm: structured machine-visible reporting and handling of the abnormal condition

A common failure in inexperienced teams is treating all of these as just booleans or popup messages. In real systems they are part of a larger control model: command gating, runtime monitoring, fault latching, and controlled recovery.
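
To make the distinction concrete, here is a minimal sketch in which each gating condition carries its kind instead of being a bare boolean. The names GateKind, Gate, blocking_gates, and the signal keys are hypothetical and chosen only for illustration; a real machine would read these signals from its I/O and state model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class GateKind(Enum):
    PERMISSIVE = "permissive"  # must be ready before starting
    INTERLOCK = "interlock"    # forbidden under current conditions
    INHIBIT = "inhibit"        # intentionally disabled by mode or policy


@dataclass(frozen=True)
class Gate:
    kind: GateKind
    name: str
    blocks: Callable[[Dict[str, bool]], bool]  # True when this gate blocks the action


def blocking_gates(gates: List[Gate], signals: Dict[str, bool]) -> List[Gate]:
    """Return every gate that currently blocks the action, keeping its kind."""
    return [g for g in gates if g.blocks(signals)]


# Illustrative gates for a wafer transfer (assumed signal names).
TRANSFER_GATES = [
    Gate(GateKind.PERMISSIVE, "vacuum present", lambda s: not s["vacuum_ok"]),
    Gate(GateKind.INTERLOCK, "door open", lambda s: s["door_open"]),
    Gate(GateKind.INHIBIT, "maintenance mode", lambda s: s["maintenance_mode"]),
]
```

Because the result keeps the kind and name of each blocking gate, a rejection can say whether the machine is not ready, forbidden from acting, or intentionally disabled, rather than reporting a generic failure.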

PART 3 — HOW SOFTWARE EVALUATES INTERLOCKS

A mature machine does not evaluate safety and readiness in only one place.

It evaluates interlocks at two different times:

Before accepting a command
During operation while the action is in progress

That distinction matters a lot.

Before accepting a command

Before a command is accepted, the software should validate:

machine mode
machine state
subsystem ownership
required permissives
active inhibits
blocking alarms/faults
safety-related external conditions
whether dependent subsystems are ready

Example: absolute stage move

The validator may require:

axis initialized
axis homed
no active servo fault
no guard-open interlock
machine not in E-stop state
requested target within soft limits
no maintenance inhibit on motion
no higher-priority motion owner currently holding control

If any of these fail, the command should be rejected explicitly.
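
The checklist above can be sketched as a single validation function. This is an illustration under stated assumptions, not a definitive implementation: AxisSnapshot, its fields, and validate_absolute_move are hypothetical names, and the ownership rule is simplified so that any owner other than the requester blocks the move.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AxisSnapshot:
    """Hypothetical read-only view of machine and axis state at validation time."""
    initialized: bool
    homed: bool
    servo_fault: bool
    guard_open: bool
    estop_active: bool
    maintenance_inhibit: bool
    motion_owner: Optional[str]  # None: nobody currently holds motion control
    soft_min_mm: float
    soft_max_mm: float


def validate_absolute_move(state: AxisSnapshot, target_mm: float,
                           requester: str) -> List[str]:
    """Return every reason to reject the move; an empty list means accept."""
    reasons = []
    if not state.initialized:
        reasons.append("axis not initialized")
    if not state.homed:
        reasons.append("axis not homed")
    if state.servo_fault:
        reasons.append("servo fault active")
    if state.guard_open:
        reasons.append("guard open")
    if state.estop_active:
        reasons.append("machine in E-stop state")
    if not state.soft_min_mm <= target_mm <= state.soft_max_mm:
        reasons.append("target outside soft limits")
    if state.maintenance_inhibit:
        reasons.append("maintenance inhibit on motion")
    if state.motion_owner not in (None, requester):
        # Simplification: any other owner blocks, regardless of priority.
        reasons.append("motion owned by " + state.motion_owner)
    return reasons
```

Collecting all reasons, rather than stopping at the first failure, is a design choice that lets the rejection message tell the operator everything that must be fixed before retrying.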

During operation

Pre-checks are not enough because physical reality changes while the machine is running.

During motion or process execution, the system still has to monitor:

loss of vacuum
door opened mid-run
following error
unexpected stop
timeout
device ready lost
sensor disagreement
motion completion failure

A command that was safe at time T0 may become unsafe at time T1.

That is why good systems have runtime guard logic, not only start-time validation.
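
A runtime guard can be sketched as a function evaluated on every monitoring cycle while the action is in progress. RuntimeSample, check_runtime, and the fault code strings are hypothetical; the ordering of checks (safety-related conditions first) is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeSample:
    """Hypothetical status sample taken on each monitoring cycle."""
    guard_closed: bool
    vacuum_ok: bool
    device_ready: bool
    following_error_mm: float
    elapsed_s: float


def check_runtime(sample: RuntimeSample, max_following_error_mm: float,
                  timeout_s: float) -> Optional[str]:
    """Return a fault code for the first violated condition, or None if healthy."""
    if not sample.guard_closed:
        return "GUARD_OPENED_MID_RUN"
    if abs(sample.following_error_mm) > max_following_error_mm:
        return "FOLLOWING_ERROR"
    if not sample.vacuum_ok:
        return "VACUUM_LOST"
    if not sample.device_ready:
        return "DEVICE_READY_LOST"
    if sample.elapsed_s > timeout_s:
        return "TIMEOUT"
    return None
```

A non-None result would hand control to fault handling (safe stop, latch, alarm) rather than simply failing the current task.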

Why interlock logic should be explicit and centralized

Scattered checks are one of the most dangerous anti-patterns in machine software.

Bad pattern:

UI button disabled if door open
workflow step also checks door sometimes
motion service checks homed but not guard state
maintenance tool bypasses the workflow layer entirely

That creates holes.

A strong architecture has a central command validation/interlock layer that every command path must pass through. The UI may still disable buttons for usability, but the actual authority must live below the UI.

ASCII logic flow diagram

+----------------+
| Command Request|
| (Move / Start) |
+--------+-------+
         |
         v
+------------------------+
| Command Validator      |
| - mode checks          |
| - state checks         |
| - ownership checks     |
| - active fault checks  |
+-----------+------------+
            |
            v
+------------------------+
| Interlock/Permissive   |
| Evaluation             |
| - safety signals       |
| - homed?               |
| - limits valid?        |
| - device ready?        |
| - inhibit active?      |
+-----+-------------+----+
      |             |
   pass           fail
      |             |
      v             v
+-----------+   +------------------+
| Execute   |   | Reject Command   |
| Command   |   | + reason/alarm   |
+-----+-----+   +------------------+
      |
      v
+------------------------+
| Runtime Monitoring     |
| - timeout              |
| - following error      |
| - guard change         |
| - feedback mismatch    |
+-----+-------------+----+
      |             |
   ok               fault
      |             |
      v             v
+-----------+   +------------------+
| Complete  |   | Safe Stop /      |
| Normally  |   | Fault Handling   |
+-----------+   +------------------+

What this diagram means

This is the practical control shape used in real machines:

first decide whether a command is allowed
then execute
then continue checking during execution
if a fault appears, do not “just fail the task”; move into fault handling

That is the difference between machine control software and ordinary application workflows.

PART 4 — MOTION-LEVEL ERRORS

Motion faults deserve special treatment because the machine may already be moving when the fault is detected. That means the software is dealing with a live physical process, not a cleanly bounded method call.

1. Motion timeout

What it means physically

The axis did not reach the expected completion state within the allowed time. This may mean obstruction, poor tuning, controller issue, stalled motion, or wrong target assumptions.

What it looks like in software

move command issued
completion event never arrives, or in-position condition not reached in time
axis status remains busy too long
watchdog timer expires

What machine behavior should follow

stop or abort motion in a controlled way
latch a fault
block dependent operations
require investigation or recovery before continuing

Why it matters

A timeout is not just “slow software.” It often indicates physical non-completion.
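
A watchdog for move completion can be sketched as a bounded polling loop. wait_for_in_position is a hypothetical helper; the clock and sleep parameters are injectable only so the logic can be exercised without real hardware or real delays.

```python
import time
from typing import Callable


def wait_for_in_position(is_in_position: Callable[[], bool],
                         timeout_s: float,
                         poll_s: float = 0.01,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the axis reports in-position or the watchdog expires.

    Returns True on completion, False on timeout. The caller is expected to
    stop motion and latch a fault when this returns False.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_in_position():
            return True
        sleep(poll_s)
    # One final check so a completion exactly at the deadline is not missed.
    return is_in_position()
```

Using a monotonic clock avoids false timeouts or missed timeouts when the wall-clock time is adjusted while the move is running.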

2. Following error / position error

What it means physically

The commanded trajectory and actual motion diverged too far. The motor is not following the requested motion profile correctly.

Possible causes:

obstruction
overload
servo tuning problem
mechanical binding
slip or coupling issue
aggressive acceleration

What it looks like in software

controller reports following error
actual position deviates beyond threshold
servo alarm appears during move

What machine behavior should follow

immediate controlled stop or servo fault response
motion subsystem enters faulted state
machine must not assume final position is valid
downstream process steps must be blocked

Why it matters

This is one of the most important motion faults because the axis may have gone somewhere unsafe or unknown.
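
The latching behavior can be sketched as follows. In many systems the drive or motion controller performs the threshold check itself; this sketch only illustrates how the software side might latch the fault and withdraw position trust. FollowingErrorMonitor and its attributes are hypothetical names.

```python
class FollowingErrorMonitor:
    """Latches a fault when commanded and actual position diverge too far."""

    def __init__(self, threshold_mm: float) -> None:
        self.threshold_mm = threshold_mm
        self.faulted = False
        self.position_trusted = True

    def update(self, commanded_mm: float, actual_mm: float) -> bool:
        """Feed one sample; return True if the fault is (still) latched."""
        if abs(commanded_mm - actual_mm) > self.threshold_mm:
            self.faulted = True
            self.position_trusted = False
        # No automatic un-latching: a later in-tolerance sample does not
        # restore trust. That requires an explicit recovery step.
        return self.faulted
```

The key property is that the fault stays latched even if the error later falls back inside the threshold, so downstream steps remain blocked until recovery.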

3. Unexpected stop

What it means physically

Motion stopped before normal completion, but not through the planned completion path.

Possible reasons:

drive disabled
safety chain event
operator stop
controller trip
external interlock change

What it looks like in software

motion-done condition arrives without the expected completion reason
axis status becomes stopped without in-position success
operation sequence loses its expected progression

What machine behavior should follow

determine stop cause
do not blindly resume unless semantics explicitly support it
mark position as untrusted if needed
block dependent process continuation

4. Homing failure

What it means physically

The machine failed to establish a reliable reference point.

Possible causes:

sensor not found
switch failure
mechanical blockage
timeout
axis moved but reference edge not detected

What it looks like in software

home routine times out
home completed flag never set
home sensor sequence invalid
axis remains “unreferenced”

What machine behavior should follow

block absolute moves
raise a clear alarm that the reference is not established
require retry/service intervention as appropriate

Why it matters

Without valid reference, the coordinate system may be meaningless.

5. Limit hit

What it means physically

The axis reached or crossed a hard or soft travel boundary.

What it looks like in software

hard-limit input active
soft-limit violation detected before or during move
move command rejected or interrupted

What machine behavior should follow

stop motion
enter controlled fault or interlock state depending on severity
require recovery path that respects mechanical constraints

Why it matters

Limit conditions usually indicate either bad command generation, lost position trust, or abnormal machine state.

6. Feedback mismatch

What it means physically

The software’s expectation of motion or position does not align with the feedback source.

Examples:

encoder says one thing, controller reports another
axis marked complete but in-position bit absent
commanded movement not reflected in measured displacement

What it looks like in software

inconsistent status sources
impossible state combinations
position delta too small or too large relative to command

What machine behavior should follow

treat axis state as untrustworthy
block further precision-dependent operations
often require re-home or service validation
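
Two of the checks above can be sketched as a plausibility function. feedback_inconsistencies and its parameters are hypothetical, and a real system would compare whatever status sources its controller actually exposes.

```python
from typing import List


def feedback_inconsistencies(move_complete: bool,
                             in_position: bool,
                             commanded_delta_mm: float,
                             measured_delta_mm: float,
                             tolerance_mm: float) -> List[str]:
    """Return the list of detected inconsistencies; empty means plausible."""
    problems = []
    # Impossible state combination: complete without in-position.
    if move_complete and not in_position:
        problems.append("move marked complete without in-position")
    # Measured displacement too small or too large relative to the command.
    if abs(commanded_delta_mm - measured_delta_mm) > tolerance_mm:
        problems.append("measured displacement does not match command")
    return problems
```

Any non-empty result would be grounds to mark the axis state as untrusted and block precision-dependent operations.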

Why motion faults are special

With many device faults, the machine can simply stop using the device and refuse the next step. Motion faults are different because the axis may already have moved partway, may be under load, may be in a dangerous location, or may have invalidated the machine’s spatial assumptions.

That is why experienced engineers ask:

Is the machine stopped?
Is it in a safe mechanical state?
Do we still trust position?
Can we back out automatically?
Must an operator inspect first?

Those are machine questions, not just software questions.

PART 5 — ALARM HANDLING & OPERATOR FEEDBACK

Alarm handling is not just showing a message box.

A real alarm system is how the machine:

classifies abnormal conditions
communicates them clearly
records them for diagnosis
determines what the machine is allowed to do next
guides recovery

What an alarm should communicate

A useful alarm should tell the operator and the system:

what happened
where it happened
how severe it is
what the machine did in response
what action is expected next

A good alarm usually includes:

fault/alarm code
concise operator-facing message
subsystem ownership
timestamp
machine state/context
optional technical detail for service logs
recovery expectation

Example of good structure:

ALM-MOT-X-0042
Axis X following error exceeded threshold during scan move.
Subsystem: Motion/XAxis
Time: 2026-04-22 08:14:17
Machine Action: Motion stopped, scan aborted.
Operator Action: Inspect stage path. Re-home axis before restart.
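
The alarm above could be represented as a structured record rather than a formatted string. This is a minimal sketch: the Alarm class, its field names, and operator_text are hypothetical, and technical_detail stands in for whatever service-level data a given system collects.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict


@dataclass(frozen=True)
class Alarm:
    code: str
    message: str
    subsystem: str
    timestamp: datetime
    machine_action: str
    operator_action: str
    # Service-level detail; kept out of the operator-facing text.
    technical_detail: Dict[str, str] = field(default_factory=dict)

    def operator_text(self) -> str:
        """Render the operator-facing view in the layout shown above."""
        return "\n".join([
            self.code,
            self.message,
            "Subsystem: " + self.subsystem,
            "Time: " + self.timestamp.strftime("%Y-%m-%d %H:%M:%S"),
            "Machine Action: " + self.machine_action,
            "Operator Action: " + self.operator_action,
        ])
```

Keeping the fields separate means logs, the HMI, and recovery logic can each consume the same alarm without parsing text.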

Warning vs recoverable alarm vs blocking fault

Warning

A condition worth surfacing, but not necessarily stopping current operation.

Example:

camera temperature slightly high
retry count increasing
air pressure fluctuating but still in range

Recoverable alarm

The machine cannot continue the current operation normally, but the issue may clear with a defined recovery action.

Example:

transient camera timeout
device initialization retry needed
temporary vacuum not reached on first attempt

Blocking fault

The machine must not continue until explicit recovery conditions are satisfied.

Example:

axis following error
guard open during auto move
lost home reference
hard limit triggered

Why alarm handling is more than a message

Because the alarm has control consequences.

An alarm system in a real machine often determines:

whether commands are blocked
whether current motion must stop
whether auto mode is exited
whether reset is allowed
whether a technician must inspect hardware
whether the fault should escalate after repeated occurrence

That is why vague alarms are so expensive. “Motion error” is weak. It does not help operations, service, or developers.
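
One way to tie classification to control consequences is a policy table keyed by severity. The sketch below is illustrative only: Severity, Consequence, and the specific policy values are assumptions, and a real machine would define them per alarm or per subsystem.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    RECOVERABLE = "recoverable"
    BLOCKING = "blocking"


@dataclass(frozen=True)
class Consequence:
    stop_motion: bool
    block_new_commands: bool
    exit_auto_mode: bool
    reset_needs_recovery: bool


# Assumed policy values, chosen only to illustrate the idea.
POLICY = {
    Severity.WARNING: Consequence(False, False, False, False),
    Severity.RECOVERABLE: Consequence(False, True, False, False),
    Severity.BLOCKING: Consequence(True, True, True, True),
}


def consequence_for(severity: Severity) -> Consequence:
    """Look up what the machine must do when an alarm of this severity is raised."""
    return POLICY[severity]
```

With a table like this, raising an alarm is not only a display event; it directly drives command gating and reset rules.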

Operator guidance vs engineering diagnostics

Strong systems separate these concerns cleanly:

Operator-facing text should be clear and action-oriented
Engineering/service diagnostics should include deeper context: raw codes, axis status, signal states, controller details, retry history

Do not force operators to decode controller internals. Do not deprive service engineers of detail either.

PART 6 — RECOVERY & RESET

Recovery is one of the hardest parts of machine software because “fault cleared” and “safe to continue” are not the same thing.

What recovery means

Recovery means returning the machine from an abnormal condition to a controlled state where further action is safe and consistent.

That may involve:

stopping motion
re-establishing references
clearing transient device states
unloading material safely
requiring operator inspection
re-running initialization or homing
forcing workflow rollback or restart

When reset is allowed

Reset should only be allowed when:

the active fault condition is no longer present
the machine is in a stable state
any required recovery actions have been completed
continuing would not violate interlocks or trust assumptions

Bad systems let operators mash Reset until the red alarm disappears.

Good systems define explicit reset preconditions.
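
The reset preconditions listed above can be sketched as an explicit check that returns what still blocks a reset. ResetContext, its fields, and reset_blockers are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResetContext:
    """Hypothetical summary of real subsystem state at the time Reset is pressed."""
    fault_condition_present: bool
    machine_stable: bool
    interlocks_clear: bool
    position_trusted: bool
    pending_recovery_actions: List[str] = field(default_factory=list)


def reset_blockers(ctx: ResetContext) -> List[str]:
    """Return every reason reset must be refused; empty means reset is allowed."""
    blockers = []
    if ctx.fault_condition_present:
        blockers.append("fault condition still present")
    if not ctx.machine_stable:
        blockers.append("machine not in a stable state")
    for action in ctx.pending_recovery_actions:
        blockers.append("recovery action pending: " + action)
    if not ctx.interlocks_clear:
        blockers.append("continuing would violate an interlock")
    if not ctx.position_trusted:
        blockers.append("position reference not trusted")
    return blockers
```

Returning the blockers instead of a bare boolean lets the HMI explain why Reset is refused, which discourages repeated blind reset attempts.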

Why some faults can auto-recover and others cannot

Auto-recover candidates

transient communication glitch with safe retry semantics
non-critical device busy timeout with bounded retry count
temporary readiness loss before process commitment

Manual/operator/service recovery candidates

lost position reference
following error
physical obstruction
hard limit condition
safety guard opened during prohibited operation
repeated transient failure suggesting deeper instability

Reset does not mean safe continuation

This is one of the biggest beginner mistakes.

Example:

Axis following error occurs during scan
Operator clears alarm
Software allows Resume
But actual position is no longer trustworthy

That is a design failure.

The correct flow may be:

stop run
mark axis as reference-invalid or motion-invalid
require re-home
possibly require product disposition or workflow restart

Examples

Re-home after reference loss

If homing/reference trust is lost, reset alone is meaningless. The coordinate system must be re-established.

Retry after transient device timeout

If a camera arm command timed out before the scan even began, a bounded retry may be safe.

Operator intervention after obstruction

If a stage hit unexpected resistance or travel was blocked, the operator or technician may need to inspect the mechanism before reset is enabled.

ASCII state diagram

+-------+      Start       +---------+
| Ready | ---------------> | Running |
+---+---+                  +----+----+
    ^                           |
    |                           |
    | Reset allowed             | fault detected
    | after recovery            v
+---+---------------------+  +--------+
| Safe / Resettable State |<-| Faulted|
+-----------+-------------+  +----+---+
            ^                    |
            | recovery action    |
            | (stop, inspect,    |
            | re-home, retry,    |
            | clear cause)       |
            +--------------------+

What this diagram means

The machine should not jump directly from Faulted to Ready just because someone pressed Reset. There is usually an intermediate condition where the cause is cleared, recovery actions are done, and the system verifies that reset is now legitimate.

That intermediate state is often where mature systems differ from fragile ones.
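
The diagram's transitions can be sketched as a small table-driven state machine. MachineState, the event names, and next_state are hypothetical; only the transitions drawn in the diagram are modeled, and any other event leaves the state unchanged.

```python
from enum import Enum


class MachineState(Enum):
    READY = "ready"
    RUNNING = "running"
    FAULTED = "faulted"
    RESETTABLE = "safe_resettable"


# Only the transitions shown in the diagram are allowed.
TRANSITIONS = {
    (MachineState.READY, "start"): MachineState.RUNNING,
    (MachineState.RUNNING, "fault_detected"): MachineState.FAULTED,
    (MachineState.FAULTED, "recovery_done"): MachineState.RESETTABLE,
    (MachineState.RESETTABLE, "reset"): MachineState.READY,
}


def next_state(state: MachineState, event: str) -> MachineState:
    """Apply an event; events that are not valid in this state are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Because there is no (FAULTED, "reset") entry, pressing Reset while faulted does nothing until the recovery actions have moved the machine into the resettable state.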

PART 7 — REAL-WORLD FAILURE SCENARIOS

1. Interlock logic missing in one code path

What it looks like

Manual jog path checks door status, but a background recovery move path does not.

Why it happens

The team implemented safety gating in multiple layers informally instead of enforcing one common command gate.

Production consequence

Most of the time the machine behaves correctly, but a rare path violates safety expectations.

How experienced engineers handle it

They make all motion-producing paths go through the same validator/interlock authority, including service tools and recovery routines.

2. Same action checked in UI but not in service layer

What it looks like

Start button disabled when not homed, but an automation script or workflow service can still issue Start.

Why it happens

Developers treat UI enable/disable as business logic enforcement.

Production consequence

Unsafe or invalid commands slip through non-UI paths.

How experienced engineers handle it

They treat UI checks as convenience only. Actual enforcement belongs in the machine command layer.

3. Alarm cleared but subsystem still unhealthy

What it looks like

Servo fault alarm cleared on screen, but drive still not enabled or axis not re-referenced.

Why it happens

Alarm state and subsystem health are modeled separately but not tied properly.

Production consequence

Machine appears healthy, but next command fails or behaves inconsistently.

How experienced engineers handle it

They define reset semantics in terms of real subsystem state, not just alarm acknowledgment.

4. Operator retries without real recovery

What it looks like

Repeated reset-start-reset-start loops after a wafer chuck vacuum failure.

Why it happens

The system lets the workflow restart without proving that vacuum is stable.

Production consequence

Intermittent damage, lost product, hidden root cause, operator frustration.

How experienced engineers handle it

They add bounded retries, escalation rules, and recovery preconditions. Repeated faults are treated as stronger evidence of real instability.

5. Repeated transient faults hide deeper root cause

What it looks like

Camera timeouts recover on retry for hours, until throughput collapses or a critical sequence fails.

Why it happens

The software classifies each occurrence as transient without looking at frequency trends.

Production consequence

Slow degradation is ignored.

How experienced engineers handle it

They track fault history and escalation counts. Five “recoverable” alarms in 20 minutes may become a blocking maintenance condition.
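
Escalation on repeated occurrences can be sketched with a sliding time window. FaultEscalator is a hypothetical name, and the defaults simply mirror the "five in 20 minutes" example; real thresholds would be tuned per fault type.

```python
from collections import deque


class FaultEscalator:
    """Escalates when too many recoverable faults occur within a time window."""

    def __init__(self, max_count: int = 5, window_s: float = 20 * 60) -> None:
        self.max_count = max_count
        self.window_s = window_s
        self._times = deque()  # timestamps of recent occurrences, oldest first

    def record(self, now_s: float) -> bool:
        """Record one occurrence; return True if it should escalate to blocking."""
        self._times.append(now_s)
        # Drop occurrences that have fallen out of the window.
        while now_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.max_count
```

Each individual fault can still be handled as recoverable; the escalator only adds the frequency view that single-occurrence handling lacks.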

6. Race condition allows move before inhibit becomes active

What it looks like

Mode switch to maintenance starts, but a queued auto move sneaks through before inhibit is fully latched.

Why it happens

State change propagation is asynchronous and command acceptance is not serialized against it.

Production consequence

The machine executes a command in the wrong operating mode.

How experienced engineers handle it

They define atomic command gating boundaries, serialize state transitions that affect permissions, and log exact timing around acceptance decisions.
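
Serializing acceptance decisions against mode changes can be sketched with a single lock. CommandGate and its methods are hypothetical; the sketch covers only the accept/reject decision, and a real system would also need to ensure that an accepted command cannot begin executing after a later mode change without re-checking.

```python
import threading
from typing import List, Tuple


class CommandGate:
    """Mode changes and command acceptance share one lock, so they cannot interleave."""

    def __init__(self, mode: str = "auto") -> None:
        self._lock = threading.Lock()
        self._mode = mode
        self.log: List[Tuple[str, object]] = []  # ordered record of decisions

    def set_mode(self, mode: str) -> None:
        with self._lock:
            self._mode = mode
            self.log.append(("mode", mode))

    def try_accept(self, command: str, required_mode: str) -> bool:
        # The mode is read and the decision is logged under the same lock
        # that guards mode changes, so the decision reflects a single
        # consistent mode.
        with self._lock:
            accepted = self._mode == required_mode
            self.log.append((command, accepted))
            return accepted
```

The ordered log gives the exact sequence of mode changes and acceptance decisions, which is the evidence needed when investigating a suspected race.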

This kind of problem is especially common in machine software because control is asynchronous and stateful, which the source-of-truth domain explicitly highlights as a core characteristic of motion systems.

PART 8 — SOFTWARE DESIGN IMPLICATIONS

Interlocks and faults are not side concerns. They require architecture.

The Domain 1 source emphasizes that these systems must be state-driven, must validate motion before execution, and must handle failures explicitly. That directly implies a design with explicit command gating and explicit fault state modeling.

Why this needs explicit architecture

Because the machine has:

long-running operations
multiple command sources
asynchronous state changes
dependency between subsystems
physical consequences when assumptions fail

Ad hoc checks cannot scale safely.

Good architecture traits

1. Centralized validation

Every significant command passes through a common validator.

2. Explicit fault model

Faults are structured, typed, latched appropriately, and connected to recovery rules.

3. Consistent command gating

The same action is judged the same way regardless of whether it came from UI, workflow engine, script, or service tool.

4. Traceable recovery rules

Each significant fault has known reset conditions and documented consequences.

Bad approach

boolean flags scattered across classes
exceptions used as the main interlock mechanism
UI button state treated as authority
vague alarm text like “operation failed”
reset just clears flags
no distinction between rejectable command and runtime fault

Good approach

centralized motion guard/interlock service
structured fault manager
machine/subsystem state model
explicit alarm classification
reset/recovery policies tied to actual health/state
strong logs and event history around fault transitions

ASCII component diagram

+------------------+       +----------------------+
| UI / HMI         | ----> | Command Application  |
| Workflow / Script|       | Layer                |
+---------+--------+       +----------+-----------+
          |                           |
          |                           v
          |                +----------------------+
          |                | Command Validator    |
          |                | + Interlock Engine   |
          |                +----------+-----------+
          |                           |
          |                     allow | reject
          |                           |
          |                           v
          |                +----------------------+
          |                | Subsystem Controllers|
          |                | Motion / Camera / IO |
          |                +----------+-----------+
          |                           |
          |                           v
          |                +----------------------+
          +--------------> | Fault / Alarm Manager|
                           | Recovery Rules       |
                           +----------------------+

What this diagram means

The UI and workflow should not directly “decide safety.” They request actions.

The command validator/interlock engine is the enforcement point.

Subsystem controllers execute commands and emit status/faults.

The fault/alarm manager classifies abnormalities, latches them, exposes them to operators, and enforces recovery behavior.

That separation is one of the biggest markers of a mature machine software codebase.

PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

How to explain interlocks and fault handling clearly

A strong explanation is:

Interlocks are the rules that prevent unsafe or invalid actions before they happen. Fault handling is what the machine does when abnormal conditions are detected during or after execution. Good machine software needs both: command gating up front, and controlled containment plus recovery when reality diverges from plan.

Why interlocks are different from alarms

A crisp way to explain it:

An interlock blocks an action. An alarm reports and manages an abnormal condition. Some interlocks raise alarms, and some alarms create blocking conditions, but they are not the same thing. Interlocks are about permission to act; alarms are about abnormal state and recovery.

Common mistakes software engineers make when entering machine software

putting critical checks in the UI only
treating reset as “clear the error and continue”
assuming command acceptance is enough without runtime monitoring
mixing warnings, recoverable issues, and blocking faults together
using vague exceptions instead of structured machine faults
scattering interlock checks across code paths
failing to distinguish “command rejected” from “action started but faulted mid-flight”

What strong engineers understand

Strong engineers understand that:

physical motion changes the meaning of software failure
state trust is part of recovery, especially for position and reference
all command sources must be gated consistently
repeated recoverable alarms often indicate a deeper system issue
recovery rules are as important as detection rules
safe behavior comes from architecture, not from a few careful if-statements

Interview-ready summary

If you want a concise principal-level answer:

In industrial machine software, interlocks and permissives are the control boundary that decides whether an action is allowed under the current machine, mode, and hardware conditions. Fault handling is the structured response when those conditions break down during execution. The hard part is not just detecting problems; it is preserving deterministic behavior, moving the machine into a safe state, preventing unsafe retries, and defining recovery rules that reflect actual subsystem health rather than just cleared alarms. That is why mature systems use centralized command gating, explicit fault models, and clear reset semantics.

This matches the source-of-truth emphasis that machine software must validate motion before execution, behave deterministically, and handle failures explicitly.

