Below is a structured deep dive on Interlocks & Fault Handling, aligned to Domain 1’s source of truth, where this topic is explicitly defined as “interlocks and permissives,” “motion-level errors,” and “alarm handling and recovery.” This sits inside the broader machine-control domain, where software must enforce safe and deterministic execution around real physical motion and machine behavior.
PART 1 — WHY INTERLOCKS & FAULT HANDLING ARE CENTRAL
In business software, a bad command usually means a failed request, a rollback, or a user-visible error. In machine software, a bad command can mean a crashed stage, a broken end effector, a lost wafer, a damaged camera, or an operator put into an unsafe situation. That is why experienced machine engineers do not start from the assumption that commands are valid. They start from the assumption that unsafe conditions, invalid states, timing problems, and hardware failures will eventually happen.
That mindset is built into this domain. Machine control software must not only issue commands, but also validate whether a command is currently safe, monitor execution while it is happening, detect abnormal conditions, and push the machine into a controlled state when something goes wrong. This is consistent with the domain’s design principles: motion must be validated before execution, systems must be state-driven, and failures must be handled explicitly.
A few concrete examples make this obvious:
- Guard open → stage move request: The correct behavior is not “try and see what happens.” The command must be blocked before motion begins.
- Scan start requested, but stage not homed: The machine may have no trustworthy coordinate reference yet. Starting anyway is not a normal functional bug; it is a physical integrity problem.
- Axis following error during motion: Motion is already in progress. The system must detect the abnormality, stop safely, latch the fault, and prevent blind continuation.
- Camera not ready when capture point is reached: This may not be a motion safety problem, but it is still a machine integrity problem. If the process depends on synchronized capture, proceeding without readiness can corrupt results, waste time, or break process assumptions.
The key idea is this:
Safe machine behavior has two layers
- Prevent bad actions before they start
- Contain and recover from failures if prevention was not enough
That is the core of interlocks and fault handling.
PART 2 — WHAT INTERLOCKS & PERMISSIVES MEAN
These terms are often used loosely in real teams, and that causes confusion. Good teams define them precisely.
Interlock
An interlock is a condition that blocks an action because allowing that action would be unsafe, invalid, or operationally forbidden under the current conditions.
Examples:
- Door open blocks motion
- Axis not homed blocks absolute move
- Chuck vacuum lost blocks wafer transfer
- Maintenance access active blocks auto cycle
An interlock is usually thought of as a hard “do not proceed” rule.
Permissive
A permissive is a condition that must be true before an action is allowed to start.
Examples:
- Vacuum present before lift
- Air pressure healthy before actuator command
- Camera armed before scan start
- Motion controller enabled before move command
A permissive is more like a precondition for execution.
In practice, the line between interlock and permissive can blur. Many teams use:
- permissive for “must be ready”
- interlock for “must not violate safety/operating restrictions”
That distinction is useful.
Inhibit
An inhibit is a rule that intentionally suppresses or disables an action, often due to mode, configuration, service state, or supervisory control.
Examples:
- Maintenance mode inhibits auto-start
- Service session inhibits recipe execution
- Diagnostic override inhibits material loading
- Faulted subsystem inhibits dependent workflow transitions
An inhibit is often less about immediate physical danger and more about policy/control logic.
Motion-level fault
A motion-level fault is an abnormal condition detected within the motion subsystem or axis behavior itself.
Examples:
- Following error
- Motion timeout
- Homing failure
- Limit hit
- Servo fault
- Feedback mismatch
This is not just “command rejected.” It is usually a signal that motion behavior itself went outside acceptable bounds.
Machine-level alarm
A machine-level alarm is the structured reporting of an abnormal condition to the broader system and the operator. It may be triggered by motion faults, device failures, invalid process states, missing permissives, or repeated recoverable failures.
Examples:
- Axis X following error
- Guard open during auto operation
- Camera trigger timeout
- Vacuum loss at wafer chuck
- Recipe validation failed before start
How they work together
They are related, but not the same:
- Permissive: can I start?
- Interlock: am I forbidden from doing this?
- Inhibit: is this function intentionally disabled right now?
- Motion fault: something went wrong in motion behavior
- Alarm: structured machine-visible reporting and handling of the abnormal condition
A common failure in inexperienced teams is treating all of these as just booleans or popup messages. In real systems they are part of a larger control model: command gating, runtime monitoring, fault latching, and controlled recovery.
PART 3 — HOW SOFTWARE EVALUATES INTERLOCKS
A mature machine does not evaluate safety and readiness in only one place.
It evaluates interlocks at two different times:
- Before accepting a command
- During operation while the action is in progress
That distinction matters a lot.
Before accepting a command
Before a command is accepted, the software should validate:
- machine mode
- machine state
- subsystem ownership
- required permissives
- active inhibits
- blocking alarms/faults
- safety-related external conditions
- whether dependent subsystems are ready
Example: absolute stage move
The validator may require:
- axis initialized
- axis homed
- no active servo fault
- no guard-open interlock
- machine not in E-stop state
- requested target within soft limits
- no maintenance inhibit on motion
- no higher-priority motion owner currently holding control
If any of these fail, the command should be rejected explicitly.
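The checklist above can be sketched as a single validation function. This is a minimal illustration, not a definitive implementation: the `AxisSnapshot` type, its field names, and the reason strings are all hypothetical and chosen only to mirror the bullets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AxisSnapshot:
    """Hypothetical read-only view of one axis at validation time."""
    initialized: bool
    homed: bool
    servo_fault: bool
    guard_open: bool
    estop_active: bool
    maintenance_inhibit: bool
    owner: Optional[str]  # current motion owner, None when the axis is free
    soft_min: float
    soft_max: float


def validate_absolute_move(axis: AxisSnapshot, target: float, requester: str) -> List[str]:
    """Return every rejection reason; an empty list means the move may start."""
    checks = [
        (axis.initialized, "AXIS_NOT_INITIALIZED"),
        (axis.homed, "AXIS_NOT_HOMED"),
        (not axis.servo_fault, "SERVO_FAULT_ACTIVE"),
        (not axis.guard_open, "GUARD_OPEN"),
        (not axis.estop_active, "ESTOP_ACTIVE"),
        (axis.soft_min <= target <= axis.soft_max, "TARGET_OUTSIDE_SOFT_LIMITS"),
        (not axis.maintenance_inhibit, "MAINTENANCE_INHIBIT"),
        (axis.owner in (None, requester), "AXIS_OWNED_BY_OTHER"),
    ]
    # Collect all failures instead of stopping at the first one, so the
    # rejection can explain everything that must be fixed.
    return [reason for ok, reason in checks if not ok]


healthy = AxisSnapshot(True, True, False, False, False, False, None, 0.0, 100.0)
print(validate_absolute_move(healthy, 50.0, "workflow"))  # prints []
```

Returning the full list of reasons, rather than a bare True/False, is what makes the rejection explicit: the caller can log it, raise an alarm, or show it to the operator.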
During operation
Pre-checks are not enough because physical reality changes while the machine is running.
During motion or process execution, the system still has to monitor:
- loss of vacuum
- door opened mid-run
- following error
- unexpected stop
- timeout
- device ready lost
- sensor disagreement
- motion completion failure
A command that was safe at time T0 may become unsafe at time T1.
That is why good systems have runtime guard logic, not only start-time validation.
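A rough sketch of runtime guard logic follows. It assumes the machine delivers periodic status samples as dictionaries; the signal names, the fault codes, and the `done` flag are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Hypothetical guard table: signal name -> fault code raised when the signal is not healthy.
RUNTIME_GUARDS = {
    "vacuum_ok": "VACUUM_LOST",
    "guard_closed": "GUARD_OPENED_MID_RUN",
    "device_ready": "DEVICE_READY_LOST",
}


def first_violation(sample: Dict[str, bool]) -> Optional[str]:
    """Return the first violated guard in this sample, or None if all are healthy."""
    for signal, fault in RUNTIME_GUARDS.items():
        if not sample.get(signal, False):  # a missing signal is treated as unhealthy
            return fault
    return None


def run_guarded(samples: Iterable[Dict[str, bool]]) -> str:
    """Consume periodic samples; return 'COMPLETED' or the first fault code seen."""
    for sample in samples:
        fault = first_violation(sample)
        if fault is not None:
            return fault  # caller moves into fault handling, not a plain task failure
        if sample.get("done", False):
            return "COMPLETED"
    return "MOTION_INCOMPLETE"  # samples ended without completion


healthy = {"vacuum_ok": True, "guard_closed": True, "device_ready": True}
print(run_guarded([healthy, {**healthy, "guard_closed": False}]))  # prints GUARD_OPENED_MID_RUN
```

Guards are checked before the completion flag in every sample, so a condition that turns unsafe at T1 wins over a completion reported in the same cycle.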
Why interlock logic should be explicit and centralized
Scattered checks are one of the most dangerous anti-patterns in machine software.
Bad pattern:
- UI button disabled if door open
- workflow step also checks door sometimes
- motion service checks homed but not guard state
- maintenance tool bypasses the workflow layer entirely
That creates holes.
A strong architecture has a central command validation/interlock layer that every command path must pass through. The UI may still disable buttons for usability, but the actual authority must live below the UI.
ASCII logic flow diagram
   +----------------+
   | Command Request|
   | (Move / Start) |
   +--------+-------+
            |
            v
+------------------------+
| Command Validator      |
| - mode checks          |
| - state checks         |
| - ownership checks     |
| - active fault checks  |
+-----------+------------+
            |
            v
+------------------------+
| Interlock/Permissive   |
| Evaluation             |
| - safety signals       |
| - homed?               |
| - limits valid?        |
| - device ready?        |
| - inhibit active?      |
+-----+-------------+----+
      |             |
     pass          fail
      |             |
      v             v
+-----------+   +------------------+
| Execute   |   | Reject Command   |
| Command   |   | + reason/alarm   |
+-----+-----+   +------------------+
      |
      v
+------------------------+
| Runtime Monitoring     |
| - timeout              |
| - following error      |
| - guard change         |
| - feedback mismatch    |
+-----+-------------+----+
      |             |
      ok          fault
      |             |
      v             v
+-----------+   +------------------+
| Complete  |   | Safe Stop /      |
| Normally  |   | Fault Handling   |
+-----------+   +------------------+
What this diagram means
This is the practical control shape used in real machines:
- first decide whether a command is allowed
- then execute
- then continue checking during execution
- if a fault appears, do not “just fail the task”; move into fault handling
That is the difference between machine control software and ordinary application workflows.
PART 4 — MOTION-LEVEL ERRORS
Motion faults deserve special treatment because the machine may already be moving when the fault is detected. That means the software is dealing with a live physical process, not a cleanly bounded method call.
1. Motion timeout
What it means physically
The axis did not reach the expected completion state within the allowed time. This may mean obstruction, poor tuning, controller issue, stalled motion, or wrong target assumptions.
What it looks like in software
- move command issued
- completion event never arrives, or in-position condition not reached in time
- axis status remains busy too long
- watchdog timer expires
What machine behavior should follow
- stop or abort motion in a controlled way
- latch a fault
- block dependent operations
- require investigation or recovery before continuing
Why it matters
A timeout is not just “slow software.” It often indicates physical non-completion.
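A motion watchdog along these lines could be sketched as follows. The function name, the polling approach, and the injectable `clock` and `sleep` parameters are assumptions for illustration; a real controller API may report completion by event instead of polling.

```python
import time
from typing import Callable


def wait_in_position(
    is_in_position: Callable[[], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    poll_s: float = 0.01,
) -> bool:
    """Poll until the axis reports in-position or the watchdog deadline passes."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_in_position():
            return True
        sleep(poll_s)
    # Deadline passed: the caller should abort motion in a controlled way
    # and latch a timeout fault rather than retry blindly.
    return False
```

Using a monotonic clock avoids false timeouts when the wall clock is adjusted, and passing the clock in makes the watchdog testable without real delays.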
2. Following error / position error
What it means physically
The commanded trajectory and actual motion diverged too far. The motor is not following the requested motion profile correctly.
Possible causes:
- obstruction
- overload
- servo tuning problem
- mechanical binding
- slip or coupling issue
- aggressive acceleration
What it looks like in software
- controller reports following error
- actual position deviates beyond threshold
- servo alarm appears during move
What machine behavior should follow
- immediate controlled stop or servo fault response
- motion subsystem enters faulted state
- machine must not assume final position is valid
- downstream process steps must be blocked
Why it matters
This is one of the most important motion faults because the axis may have gone somewhere unsafe or unknown.
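The consequence for software state can be sketched like this. In many systems the drive or controller detects the following error itself and the software only reacts to the reported fault; the explicit threshold comparison here, and the `AxisHealth` type, are hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AxisHealth:
    """Hypothetical software view of one axis."""
    faulted: bool = False
    position_trusted: bool = True
    fault_code: Optional[str] = None


def check_following_error(axis: AxisHealth, commanded: float, actual: float,
                          threshold: float) -> bool:
    """Latch a following-error fault when the deviation exceeds the threshold."""
    if abs(commanded - actual) <= threshold:
        return False
    axis.faulted = True
    axis.position_trusted = False  # final position must not be assumed valid
    axis.fault_code = "FOLLOWING_ERROR"
    return True


axis = AxisHealth()
check_following_error(axis, commanded=10.0, actual=10.2, threshold=0.05)
print(axis.fault_code, axis.position_trusted)  # prints FOLLOWING_ERROR False
```

The important part is the second flag: the fault does not only stop the move, it also withdraws trust in the position until a recovery step restores it.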
3. Unexpected stop
What it means physically
Motion stopped before normal completion, but not through the planned completion path.
Possible reasons:
- drive disabled
- safety chain event
- operator stop
- controller trip
- external interlock change
What it looks like in software
- motion-done condition reported without a matching expected reason
- axis status becomes stopped without in-position success
- operation sequence loses its expected progression
What machine behavior should follow
- determine stop cause
- do not blindly resume unless semantics explicitly support it
- mark the position as untrusted if the stop may have invalidated it
- block dependent process continuation
4. Homing failure
What it means physically
The machine failed to establish a reliable reference point.
Possible causes:
- sensor not found
- switch failure
- mechanical blockage
- timeout
- axis moved but reference edge not detected
What it looks like in software
- home routine times out
- home completed flag never set
- home sensor sequence invalid
- axis remains “unreferenced”
What machine behavior should follow
- block absolute moves
- raise a clear alarm stating that the reference is not established
- require retry/service intervention as appropriate
Why it matters
Without valid reference, the coordinate system may be meaningless.
5. Limit hit
What it means physically
The axis reached or crossed a hard or soft travel boundary.
What it looks like in software
- hard-limit input active
- soft-limit violation detected before or during move
- move command rejected or interrupted
What machine behavior should follow
- stop motion
- enter controlled fault or interlock state depending on severity
- require recovery path that respects mechanical constraints
Why it matters
Limit conditions usually indicate either bad command generation, lost position trust, or abnormal machine state.
6. Feedback mismatch
What it means physically
The software’s expectation of motion or position does not align with the feedback source.
Examples:
- encoder says one thing, controller reports another
- axis marked complete but in-position bit absent
- commanded movement not reflected in measured displacement
What it looks like in software
- inconsistent status sources
- impossible state combinations
- position delta too small or too large relative to command
What machine behavior should follow
- treat axis state as untrustworthy
- block further precision-dependent operations
- often require re-home or service validation
Why motion faults are special
With many device faults, the machine can simply stop using the device and refuse the next step. Motion faults are different because the axis may already have moved partway, may be under load, may be in a dangerous location, or may have invalidated the machine’s spatial assumptions.
That is why experienced engineers ask:
- Is the machine stopped?
- Is it in a safe mechanical state?
- Do we still trust position?
- Can we back out automatically?
- Must an operator inspect first?
Those are machine questions, not just software questions.
PART 5 — ALARM HANDLING & OPERATOR FEEDBACK
Alarm handling is not just showing a message box.
A real alarm system is how the machine:
- classifies abnormal conditions
- communicates them clearly
- records them for diagnosis
- determines what the machine is allowed to do next
- guides recovery
What an alarm should communicate
A useful alarm should tell the operator and the system:
- what happened
- where it happened
- how severe it is
- what the machine did in response
- what action is expected next
A good alarm usually includes:
- fault/alarm code
- concise operator-facing message
- subsystem ownership
- timestamp
- machine state/context
- optional technical detail for service logs
- recovery expectation
Example of good structure:
ALM-MOT-X-0042
Axis X following error exceeded threshold during scan move.
Subsystem: Motion/XAxis
Time: 2026-04-22 08:14:17
Machine Action: Motion stopped, scan aborted.
Operator Action: Inspect stage path. Re-home axis before restart.
Warning vs recoverable alarm vs blocking fault
Warning
A condition worth surfacing, but not necessarily stopping current operation.
Example:
- camera temperature slightly high
- retry count increasing
- air pressure fluctuating but still in range
Recoverable alarm
The machine cannot continue the current operation normally, but the issue may clear with a defined recovery action.
Example:
- transient camera timeout
- device initialization retry needed
- temporary vacuum not reached on first attempt
Blocking fault
The machine must not continue until explicit recovery conditions are satisfied.
Example:
- axis following error
- guard open during auto move
- lost home reference
- hard limit triggered
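The alarm fields and the three severity classes can be captured in one structured record. This is a sketch only: the `Alarm` and `Severity` names and the two helper methods are assumptions, and the sample values are taken from the example alarm shown earlier.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Severity(Enum):
    WARNING = 1      # surfaced, operation continues
    RECOVERABLE = 2  # current operation cannot continue normally
    BLOCKING = 3     # nothing continues until recovery conditions are met


@dataclass(frozen=True)
class Alarm:
    code: str
    message: str
    subsystem: str
    timestamp: datetime
    severity: Severity
    machine_action: str
    operator_action: str

    def stops_current_operation(self) -> bool:
        return self.severity in (Severity.RECOVERABLE, Severity.BLOCKING)

    def blocks_new_commands(self) -> bool:
        return self.severity is Severity.BLOCKING


alarm = Alarm(
    code="ALM-MOT-X-0042",
    message="Axis X following error exceeded threshold during scan move.",
    subsystem="Motion/XAxis",
    timestamp=datetime(2026, 4, 22, 8, 14, 17),
    severity=Severity.BLOCKING,
    machine_action="Motion stopped, scan aborted.",
    operator_action="Inspect stage path. Re-home axis before restart.",
)
print(alarm.blocks_new_commands())  # prints True
```

Keeping severity as a typed field, instead of encoding it in the message text, lets the rest of the control software ask the alarm what it is allowed to do next.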
Why alarm handling is more than a message
Because the alarm has control consequences.
An alarm system in a real machine often determines:
- whether commands are blocked
- whether current motion must stop
- whether auto mode is exited
- whether reset is allowed
- whether a technician must inspect hardware
- whether the fault should escalate after repeated occurrence
That is why vague alarms are so expensive. An alarm that says only “Motion error” is weak: it names no axis, no cause, and no expected action, so it does not help operations, service, or developers.
Operator guidance vs engineering diagnostics
Strong systems separate these concerns cleanly:
- Operator-facing text should be clear and action-oriented
- Engineering/service diagnostics should include deeper context: raw codes, axis status, signal states, controller details, retry history
Do not force operators to decode controller internals. Do not deprive service engineers of detail either.
PART 6 — RECOVERY & RESET
Recovery is one of the hardest parts of machine software because “fault cleared” and “safe to continue” are not the same thing.
What recovery means
Recovery means returning the machine from an abnormal condition to a controlled state where further action is safe and consistent.
That may involve:
- stopping motion
- re-establishing references
- clearing transient device states
- unloading material safely
- requiring operator inspection
- re-running initialization or homing
- forcing workflow rollback or restart
When reset is allowed
Reset should only be allowed when:
- the active fault condition is no longer present
- the machine is in a stable state
- any required recovery actions have been completed
- continuing would not violate interlocks or trust assumptions
Bad systems let operators mash Reset until the red alarm disappears.
Good systems define explicit reset preconditions.
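Explicit reset preconditions might look like the following sketch. The `ResetContext` fields are hypothetical stand-ins for real subsystem state; the point is that Reset is answered with a list of blockers, not with an unconditional flag clear.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResetContext:
    """Hypothetical summary of machine state at the moment Reset is requested."""
    fault_condition_present: bool
    machine_stable: bool
    pending_recovery_actions: List[str] = field(default_factory=list)
    interlocks_satisfied: bool = True
    position_trusted: bool = True


def reset_blockers(ctx: ResetContext) -> List[str]:
    """Return every reason reset must be refused; an empty list means reset is allowed."""
    blockers = []
    if ctx.fault_condition_present:
        blockers.append("fault condition still present")
    if not ctx.machine_stable:
        blockers.append("machine not in a stable state")
    blockers.extend(f"recovery action pending: {a}" for a in ctx.pending_recovery_actions)
    if not ctx.interlocks_satisfied:
        blockers.append("continuing would violate an interlock")
    if not ctx.position_trusted:
        blockers.append("position reference not trusted")
    return blockers


ctx = ResetContext(fault_condition_present=False, machine_stable=True,
                   pending_recovery_actions=["re-home X"], position_trusted=False)
print(reset_blockers(ctx))
# prints ['recovery action pending: re-home X', 'position reference not trusted']
```

With this shape, an operator who presses Reset too early gets told what is still missing, which is more useful than an alarm that silently reappears.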
Why some faults can auto-recover and others cannot
Auto-recover candidates
- transient communication glitch with safe retry semantics
- non-critical device busy timeout with bounded retry count
- temporary readiness loss before process commitment
Manual/operator/service recovery candidates
- lost position reference
- following error
- physical obstruction
- hard limit condition
- safety guard opened during prohibited operation
- repeated transient failure suggesting deeper instability
Reset does not mean safe continuation
This is one of the biggest beginner mistakes.
Example:
- Axis following error occurs during scan
- Operator clears alarm
- Software allows Resume
- But actual position is no longer trustworthy
That is a design failure.
The correct flow may be:
- stop run
- mark axis as reference-invalid or motion-invalid
- require re-home
- possibly require product disposition or workflow restart
Examples
Re-home after reference loss
If homing/reference trust is lost, reset alone is meaningless. The coordinate system must be re-established.
Retry after transient device timeout
If a camera arm command timed out before the scan even began, a bounded retry may be safe.
Operator intervention after obstruction
If a stage hit unexpected resistance or travel was blocked, the operator or technician may need to inspect the mechanism before reset is enabled.
ASCII state diagram
+-------+      Start       +---------+
| Ready | ---------------> | Running |
+---+---+                  +----+----+
    ^                           |
    |                           |
    | Reset allowed             | fault detected
    | after recovery            v
+---+---------------------+  +--------+
| Safe / Resettable State |<-| Faulted|
+-----------+-------------+  +----+---+
            ^                     |
            | recovery action     |
            | (stop, inspect,     |
            |  re-home, retry,    |
            |  clear cause)       |
            +---------------------+
What this diagram means
The machine should not jump directly from Faulted to Ready just because someone pressed Reset. There is usually an intermediate condition where the cause is cleared, recovery actions are done, and the system verifies that reset is now legitimate.
That intermediate state is often where mature systems differ from fragile ones.
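The state diagram can be expressed as a small transition table. This is a sketch under assumptions: the event names are invented, and the `complete` transition from Running back to Ready is not drawn in the diagram and is added here only so a normal run can finish.

```python
from enum import Enum, auto


class State(Enum):
    READY = auto()
    RUNNING = auto()
    FAULTED = auto()
    RESETTABLE = auto()  # the "Safe / Resettable State" box


# (current state, event) -> next state. Anything not listed is ignored.
TRANSITIONS = {
    (State.READY, "start"): State.RUNNING,
    (State.RUNNING, "complete"): State.READY,  # assumption, not in the diagram
    (State.RUNNING, "fault"): State.FAULTED,
    (State.FAULTED, "recovery_done"): State.RESETTABLE,
    (State.RESETTABLE, "reset"): State.READY,
}


def next_state(state: State, event: str) -> State:
    """Apply an event; illegal events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


print(next_state(State.FAULTED, "reset"))  # prints State.FAULTED
```

Because there is no entry for a reset in the Faulted state, pressing Reset there has no effect. The only way back to Ready runs through the recovery step.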
PART 7 — REAL-WORLD FAILURE SCENARIOS
1. Interlock logic missing in one code path
What it looks like
Manual jog path checks door status, but a background recovery move path does not.
Why it happens
The team implemented safety gating in multiple layers informally instead of enforcing one common command gate.
Production consequence
Most of the time the machine behaves correctly, but a rare path violates safety expectations.
How experienced engineers handle it
They make all motion-producing paths go through the same validator/interlock authority, including service tools and recovery routines.
2. Same action checked in UI but not in service layer
What it looks like
Start button disabled when not homed, but an automation script or workflow service can still issue Start.
Why it happens
Developers treat UI enable/disable as business logic enforcement.
Production consequence
Unsafe or invalid commands slip through non-UI paths.
How experienced engineers handle it
They treat UI checks as convenience only. Actual enforcement belongs in the machine command layer.
3. Alarm cleared but subsystem still unhealthy
What it looks like
Servo fault alarm cleared on screen, but drive still not enabled or axis not re-referenced.
Why it happens
Alarm state and subsystem health are modeled separately but not tied properly.
Production consequence
Machine appears healthy, but next command fails or behaves inconsistently.
How experienced engineers handle it
They define reset semantics in terms of real subsystem state, not just alarm acknowledgment.
4. Operator retries without real recovery
What it looks like
Repeated reset-start-reset-start loops after a wafer chuck vacuum failure.
Why it happens
The system lets the workflow restart without proving that vacuum is stable.
Production consequence
Intermittent damage, lost product, hidden root cause, operator frustration.
How experienced engineers handle it
They add bounded retries, escalation rules, and recovery preconditions. Repeated faults are treated as stronger evidence of real instability.
5. Repeated transient faults hide deeper root cause
What it looks like
Camera timeouts recover on retry for hours, until throughput collapses or a critical sequence fails.
Why it happens
The software classifies each occurrence as transient without looking at frequency trends.
Production consequence
Slow degradation is ignored.
How experienced engineers handle it
They track fault history and escalation counts. Five “recoverable” alarms in 20 minutes may become a blocking maintenance condition.
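Escalation by frequency can be sketched with a sliding time window. The class name and interface are hypothetical; the defaults mirror the "five in 20 minutes" figure above, and timestamps are plain seconds supplied by the caller.

```python
from collections import deque


class EscalationTracker:
    """Counts occurrences of one recoverable alarm inside a sliding time window."""

    def __init__(self, max_count: int = 5, window_s: float = 20 * 60) -> None:
        self.max_count = max_count
        self.window_s = window_s
        self._times = deque()

    def record(self, t: float) -> bool:
        """Record an occurrence at time t; return True when it should escalate."""
        self._times.append(t)
        # Drop occurrences that have fallen out of the window.
        while t - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.max_count


tracker = EscalationTracker()
results = [tracker.record(t) for t in (0, 60, 120, 180, 240)]
print(results)  # prints [False, False, False, False, True]
```

The tracker does not change how each individual timeout is handled. It only adds a second decision, based on history, about whether "recoverable" is still an honest classification.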
6. Race condition allows move before inhibit becomes active
What it looks like
Mode switch to maintenance starts, but a queued auto move sneaks through before inhibit is fully latched.
Why it happens
State change propagation is asynchronous and command acceptance is not serialized against it.
Production consequence
The machine executes a command in the wrong operating mode.
How experienced engineers handle it
They define atomic command gating boundaries, serialize state transitions that affect permissions, and log exact timing around acceptance decisions.
This kind of problem is especially common in machine software because control is asynchronous and stateful, which the source-of-truth domain explicitly highlights as a core characteristic of motion systems.
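One way to sketch the serialization described for the race-condition scenario is a gate that guards both mode changes and acceptance decisions with the same lock, and stamps queued commands with an epoch counter. The `CommandGate` class, the mode strings, and the epoch idea are assumptions for illustration, not a prescribed design.

```python
import threading


class CommandGate:
    """Serializes mode changes against command acceptance decisions."""

    def __init__(self, mode: str = "AUTO") -> None:
        self._lock = threading.Lock()
        self._mode = mode
        self._epoch = 0  # incremented on every mode change

    def set_mode(self, mode: str) -> None:
        with self._lock:
            self._mode = mode
            self._epoch += 1

    def stamp(self) -> int:
        """Called when a command is queued; records the epoch it was queued under."""
        with self._lock:
            return self._epoch

    def accept(self, required_mode: str, stamped_epoch: int) -> bool:
        """Accept only if the mode matches and no mode change happened since queuing."""
        with self._lock:
            return self._mode == required_mode and stamped_epoch == self._epoch


gate = CommandGate()
queued = gate.stamp()          # auto move queued while in AUTO
gate.set_mode("MAINTENANCE")   # mode switch lands before the move is dispatched
print(gate.accept("AUTO", queued))  # prints False
```

The epoch check also rejects a command that was queued before a switch to maintenance and dispatched after a switch back to auto, since the conditions it was queued under no longer hold.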
PART 8 — SOFTWARE DESIGN IMPLICATIONS
Interlocks and faults are not side concerns. They require architecture.
The Domain 1 source emphasizes that these systems must be state-driven, must validate motion before execution, and must handle failures explicitly. That directly implies a design with explicit command gating and explicit fault state modeling.
Why this needs explicit architecture
Because the machine has:
- long-running operations
- multiple command sources
- asynchronous state changes
- dependency between subsystems
- physical consequences when assumptions fail
Ad hoc checks cannot scale safely.
Good architecture traits
1. Centralized validation
Every significant command passes through a common validator.
2. Explicit fault model
Faults are structured, typed, latched appropriately, and connected to recovery rules.
3. Consistent command gating
The same action is judged the same way regardless of whether it came from UI, workflow engine, script, or service tool.
4. Traceable recovery rules
Each significant fault has known reset conditions and documented consequences.
Bad approach
- boolean flags scattered across classes
- exceptions used as the main interlock mechanism
- UI button state treated as authority
- vague alarm text like “operation failed”
- reset just clears flags
- no distinction between rejectable command and runtime fault
Good approach
- centralized motion guard/interlock service
- structured fault manager
- machine/subsystem state model
- explicit alarm classification
- reset/recovery policies tied to actual health/state
- strong logs and event history around fault transitions
ASCII component diagram
+------------------+       +----------------------+
| UI / HMI         | ----> | Command Application  |
| Workflow / Script|       | Layer                |
+---------+--------+       +----------+-----------+
          |                           |
          |                           v
          |                +----------------------+
          |                | Command Validator    |
          |                | + Interlock Engine   |
          |                +----------+-----------+
          |                           |
          |                     allow | reject
          |                           |
          |                           v
          |                +----------------------+
          |                | Subsystem Controllers|
          |                | Motion / Camera / IO |
          |                +----------+-----------+
          |                           |
          |                           v
          |                +----------------------+
          +--------------> | Fault / Alarm Manager|
                           | Recovery Rules       |
                           +----------------------+
What this diagram means
The UI and workflow should not directly “decide safety.” They request actions.
The command validator/interlock engine is the enforcement point.
Subsystem controllers execute commands and emit status/faults.
The fault/alarm manager classifies abnormalities, latches them, exposes them to operators, and enforces recovery behavior.
That separation is one of the biggest markers of a mature machine software codebase.
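The separation can be sketched in a few lines. Everything here is hypothetical: the class names, the convention that an action returns a fault code or None, and the `FAULT:` prefix are illustrative choices, not an established API.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FaultManager:
    """Latches faults reported by subsystem controllers."""

    def __init__(self) -> None:
        self._latched: Dict[str, bool] = {}  # fault code -> blocking?

    def latch(self, code: str, blocking: bool) -> None:
        self._latched[code] = blocking

    def blocking_codes(self) -> List[str]:
        return [code for code, blocking in self._latched.items() if blocking]


class CommandLayer:
    """Single enforcement point used by every command source."""

    def __init__(self, faults: FaultManager, interlocks: Dict[str, Callable[[], bool]]) -> None:
        self._faults = faults
        self._interlocks = interlocks  # name -> callable returning True when clear
        self.log: List[Tuple[str, bool]] = []

    def submit(self, source: str, action: Callable[[], Optional[str]]) -> Tuple[bool, List[str]]:
        reasons = [f"FAULT:{c}" for c in self._faults.blocking_codes()]
        reasons += [name for name, is_clear in self._interlocks.items() if not is_clear()]
        if not reasons:
            fault = action()  # controller returns a fault code, or None on success
            if fault is not None:
                self._faults.latch(fault, blocking=True)
                reasons = [f"FAULT:{fault}"]
        self.log.append((source, not reasons))
        return (not reasons, reasons)


door = {"closed": False}
layer = CommandLayer(FaultManager(), {"GUARD_OPEN": lambda: door["closed"]})
print(layer.submit("ui", lambda: None))      # prints (False, ['GUARD_OPEN'])
print(layer.submit("script", lambda: None))  # prints (False, ['GUARD_OPEN'])
```

The `source` argument is only logged. It never influences the decision, which is the property the diagram is meant to show.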
PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS
How to explain interlocks and fault handling clearly
A strong explanation is:
Interlocks are the rules that prevent unsafe or invalid actions before they happen. Fault handling is what the machine does when abnormal conditions are detected during or after execution. Good machine software needs both: command gating up front, and controlled containment plus recovery when reality diverges from plan.
Why interlocks are different from alarms
A crisp way to explain it:
An interlock blocks an action. An alarm reports and manages an abnormal condition. Some interlocks raise alarms, and some alarms create blocking conditions, but they are not the same thing. Interlocks are about permission to act; alarms are about abnormal state and recovery.
Common mistakes software engineers make when entering machine software
- putting critical checks in the UI only
- treating reset as “clear the error and continue”
- assuming command acceptance is enough without runtime monitoring
- mixing warnings, recoverable issues, and blocking faults together
- using vague exceptions instead of structured machine faults
- scattering interlock checks across code paths
- failing to distinguish “command rejected” from “action started but faulted mid-flight”
What strong engineers understand
Strong engineers understand that:
- physical motion changes the meaning of software failure
- state trust is part of recovery, especially for position and reference
- all command sources must be gated consistently
- repeated recoverable alarms often indicate a deeper system issue
- recovery rules are as important as detection rules
- safe behavior comes from architecture, not from a few careful if-statements
Interview-ready summary
If you want a concise principal-level answer:
In industrial machine software, interlocks and permissives are the control boundary that decides whether an action is allowed under the current machine, mode, and hardware conditions. Fault handling is the structured response when those conditions break down during execution. The hard part is not just detecting problems; it is preserving deterministic behavior, moving the machine into a safe state, preventing unsafe retries, and defining recovery rules that reflect actual subsystem health rather than just cleared alarms. That is why mature systems use centralized command gating, explicit fault models, and clear reset semantics.
This matches the source-of-truth emphasis that machine software must validate motion before execution, behave deterministically, and handle failures explicitly.