
Below is a structured deep dive on Interlocks & Fault Handling, aligned to Domain 1’s source of truth, where this topic is explicitly defined as “interlocks and permissives,” “motion-level errors,” and “alarm handling and recovery.” This sits inside the broader machine-control domain, where software must enforce safe and deterministic execution around real physical motion and machine behavior.

PART 1 — WHY INTERLOCKS & FAULT HANDLING ARE CENTRAL

In business software, a bad command usually means a failed request, a rollback, or a user-visible error. In machine software, a bad command can mean a crashed stage, a broken end effector, a lost wafer, a damaged camera, or an operator put into an unsafe situation. That is why experienced machine engineers do not start from the assumption that commands are valid. They start from the assumption that unsafe conditions, invalid states, timing problems, and hardware failures will eventually happen.

That mindset is built into this domain. Machine control software must not only issue commands, but also validate whether a command is currently safe, monitor execution while it is happening, detect abnormal conditions, and push the machine into a controlled state when something goes wrong. This is consistent with the domain’s design principles: motion must be validated before execution, systems must be state-driven, and failures must be handled explicitly.

A few concrete examples make this obvious:

  • Guard open → stage move request The correct behavior is not “try and see what happens.” The command must be blocked before motion begins.

  • Scan start requested, but stage not homed The machine may have no trustworthy coordinate reference yet. Starting anyway is not a normal functional bug; it is a physical integrity problem.

  • Axis following error during motion Motion is already in progress. The system must detect the abnormality, stop safely, latch the fault, and prevent blind continuation.

  • Camera not ready when capture point is reached This may not be a motion safety problem, but it is still a machine integrity problem. If the process depends on synchronized capture, proceeding without readiness can corrupt results, waste time, or break process assumptions.

The key idea is this:

Safe machine behavior has two layers

  1. Prevent bad actions before they start
  2. Contain and recover from failures if prevention was not enough

That is the core of interlocks and fault handling.


PART 2 — WHAT INTERLOCKS & PERMISSIVES MEAN

These terms are often used loosely in real teams, and that causes confusion. Good teams define them precisely.

Interlock

An interlock is a condition that blocks an action because allowing that action would be unsafe, invalid, or operationally forbidden under the current conditions.

Examples:

  • Door open blocks motion
  • Axis not homed blocks absolute move
  • Chuck vacuum lost blocks wafer transfer
  • Maintenance access active blocks auto cycle

An interlock is usually thought of as a hard “do not proceed” rule.

Permissive

A permissive is a condition that must be true before an action is allowed to start.

Examples:

  • Vacuum present before lift
  • Air pressure healthy before actuator command
  • Camera armed before scan start
  • Motion controller enabled before move command

A permissive is more like a precondition for execution.

In practice, the line between interlock and permissive can blur. Many teams use:

  • permissive for “must be ready”
  • interlock for “must not violate safety/operating restrictions”

That distinction is useful.

Inhibit

An inhibit is a rule that intentionally suppresses or disables an action, often due to mode, configuration, service state, or supervisory control.

Examples:

  • Maintenance mode inhibits auto-start
  • Service session inhibits recipe execution
  • Diagnostic override inhibits material loading
  • Faulted subsystem inhibits dependent workflow transitions

An inhibit is often less about immediate physical danger and more about policy/control logic.

Motion-level fault

A motion-level fault is an abnormal condition detected within the motion subsystem or axis behavior itself.

Examples:

  • Following error
  • Motion timeout
  • Homing failure
  • Limit hit
  • Servo fault
  • Feedback mismatch

This is not just “command rejected.” It is usually a signal that motion behavior itself went outside acceptable bounds.

Machine-level alarm

A machine-level alarm is the structured reporting of an abnormal condition to the broader system and the operator. It may be triggered by motion faults, device failures, invalid process states, missing permissives, or repeated recoverable failures.

Examples:

  • Axis X following error
  • Guard open during auto operation
  • Camera trigger timeout
  • Vacuum loss at wafer chuck
  • Recipe validation failed before start

How they work together

They are related, but not the same:

  • Permissive: can I start?
  • Interlock: am I forbidden from doing this?
  • Inhibit: is this function intentionally disabled right now?
  • Motion fault: something went wrong in motion behavior
  • Alarm: structured machine-visible reporting and handling of the abnormal condition

A common failure in inexperienced teams is treating all of these as just booleans or popup messages. In real systems they are part of a larger control model: command gating, runtime monitoring, fault latching, and controlled recovery.


PART 3 — HOW SOFTWARE EVALUATES INTERLOCKS

A mature machine does not evaluate safety and readiness in only one place.

It evaluates interlocks at two different times:

  1. Before accepting a command
  2. During operation while the action is in progress

That distinction matters a lot.

Before accepting a command

Before a command is accepted, the software should validate:

  • machine mode
  • machine state
  • subsystem ownership
  • required permissives
  • active inhibits
  • blocking alarms/faults
  • safety-related external conditions
  • whether dependent subsystems are ready

Example: absolute stage move

The validator may require:

  • axis initialized
  • axis homed
  • no active servo fault
  • no guard-open interlock
  • machine not in E-stop state
  • requested target within soft limits
  • no maintenance inhibit on motion
  • no higher-priority motion owner currently holding control

If any of these fail, the command should be rejected explicitly.
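
The checklist above can be sketched as a single validation function. This is a minimal illustration, not a definitive implementation: `AxisSnapshot`, its field names, and `validate_absolute_move` are hypothetical names invented for this example, and ownership is simplified to "someone else holds control" rather than a real priority scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AxisSnapshot:
    """Hypothetical snapshot of the conditions in the checklist above."""
    initialized: bool
    homed: bool
    servo_fault: bool
    guard_open: bool
    estop_active: bool
    maintenance_inhibit: bool
    owner: Optional[str]   # current motion owner, None if nobody holds control
    soft_min: float
    soft_max: float


def validate_absolute_move(axis: AxisSnapshot, target: float, requester: str) -> List[str]:
    """Return every reason the move must be rejected (empty list means accepted)."""
    reasons = []
    if not axis.initialized:
        reasons.append("axis not initialized")
    if not axis.homed:
        reasons.append("axis not homed")
    if axis.servo_fault:
        reasons.append("servo fault active")
    if axis.guard_open:
        reasons.append("guard-open interlock")
    if axis.estop_active:
        reasons.append("machine in E-stop state")
    if not (axis.soft_min <= target <= axis.soft_max):
        reasons.append(f"target {target} outside soft limits")
    if axis.maintenance_inhibit:
        reasons.append("maintenance inhibit on motion")
    # Simplification: any other owner blocks the move, regardless of priority.
    if axis.owner is not None and axis.owner != requester:
        reasons.append(f"motion owned by {axis.owner}")
    return reasons
```

Collecting all reasons instead of stopping at the first one gives the operator a complete picture in a single rejection, which supports the "rejected explicitly" requirement.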

During operation

Pre-checks are not enough because physical reality changes while the machine is running.

During motion or process execution, the system still has to monitor:

  • loss of vacuum
  • door opened mid-run
  • following error
  • unexpected stop
  • timeout
  • device ready lost
  • sensor disagreement
  • motion completion failure

A command that was safe at time T0 may become unsafe at time T1.

That is why good systems have runtime guard logic, not only start-time validation.
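
A rough sketch of runtime guard logic follows. It assumes the hardware status arrives as a sequence of samples (plain dictionaries here) and that each guard is a predicate over one sample; `run_with_guards`, `Outcome`, and the signal names are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Outcome(Enum):
    COMPLETED = "completed"
    FAULTED = "faulted"


def run_with_guards(
    samples: Iterable[Dict[str, bool]],
    guards: Dict[str, Callable[[Dict[str, bool]], bool]],
    on_fault: Callable[[str], None],
) -> Outcome:
    """Re-check every guard on each status sample until the action is done.

    `samples` stands in for periodic status reads from hardware. Each sample
    maps a signal name to its value and must include a "done" flag.
    """
    for sample in samples:
        # Guards are evaluated before the done flag, so a violation on the
        # final sample still counts as a fault rather than a success.
        for name, is_ok in guards.items():
            if not is_ok(sample):
                on_fault(name)   # e.g. command a safe stop and latch the fault
                return Outcome.FAULTED
        if sample["done"]:
            return Outcome.COMPLETED
    on_fault("status stream ended before completion")
    return Outcome.FAULTED
```

The point of the sketch is the ordering: the same conditions that gated the start are checked again on every cycle, so a command accepted at T0 can still be stopped at T1.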

Why interlock logic should be explicit and centralized

Scattered checks are one of the most dangerous anti-patterns in machine software.

Bad pattern:

  • UI button disabled if door open
  • workflow step also checks door sometimes
  • motion service checks homed but not guard state
  • maintenance tool bypasses the workflow layer entirely

That creates holes.

A strong architecture has a central command validation/interlock layer that every command path must pass through. The UI may still disable buttons for usability, but the actual authority must live below the UI.

ASCII logic flow diagram

+----------------+
| Command Request|
| (Move / Start) |
+--------+-------+
         |
         v
+------------------------+
| Command Validator      |
| - mode checks          |
| - state checks         |
| - ownership checks     |
| - active fault checks  |
+-----------+------------+
            |
            v
+------------------------+
| Interlock/Permissive   |
| Evaluation             |
| - safety signals       |
| - homed?               |
| - limits valid?        |
| - device ready?        |
| - inhibit active?      |
+-----+-------------+----+
      |             |
   pass           fail
      |             |
      v             v
+-----------+   +------------------+
| Execute   |   | Reject Command   |
| Command   |   | + reason/alarm   |
+-----+-----+   +------------------+
      |
      v
+------------------------+
| Runtime Monitoring     |
| - timeout              |
| - following error      |
| - guard change         |
| - feedback mismatch    |
+-----+-------------+----+
      |             |
    ok           fault
      |             |
      v             v
+-----------+   +------------------+
| Complete  |   | Safe Stop /      |
| Normally  |   | Fault Handling   |
+-----------+   +------------------+

What this diagram means

This is the practical control shape used in real machines:

  • first decide whether a command is allowed
  • then execute
  • then continue checking during execution
  • if a fault appears, do not “just fail the task”; move into fault handling

That is the difference between machine control software and ordinary application workflows.


PART 4 — MOTION-LEVEL ERRORS

Motion faults deserve special treatment because the machine may already be moving when the fault is detected. That means the software is dealing with a live physical process, not a cleanly bounded method call.

1. Motion timeout

What it means physically

The axis did not reach the expected completion state within the allowed time. This may mean obstruction, poor tuning, controller issue, stalled motion, or wrong target assumptions.

What it looks like in software

  • move command issued
  • completion event never arrives, or in-position condition not reached in time
  • axis status remains busy too long
  • watchdog timer expires

What machine behavior should follow

  • stop or abort motion in a controlled way
  • latch a fault
  • block dependent operations
  • require investigation or recovery before continuing

Why it matters

A timeout is not just “slow software.” It often indicates physical non-completion.
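
The watchdog behavior described above can be sketched as a polling wait with a deadline. This is an illustration under assumptions: `wait_for_in_position` and `WaitResult` are hypothetical names, the in-position check is passed in as a callable, and the clock and sleep functions are injectable so the logic can be exercised without real hardware. The function returns a result instead of raising, leaving the caller to stop motion and latch the fault.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WaitResult:
    reached: bool      # True if in-position was seen before the deadline
    elapsed_s: float   # time spent waiting, in seconds


def wait_for_in_position(
    in_position: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WaitResult:
    """Poll `in_position` until it is true or `timeout_s` has elapsed."""
    start = clock()
    while True:
        if in_position():
            return WaitResult(True, clock() - start)
        elapsed = clock() - start
        if elapsed >= timeout_s:
            # Caller is expected to stop motion, latch a fault, and block
            # dependent operations; this function only reports the timeout.
            return WaitResult(False, elapsed)
        sleep(poll_s)
```

Using a monotonic clock avoids false timeouts when the wall clock is adjusted during a move.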


2. Following error / position error

What it means physically

The commanded trajectory and actual motion diverged too far. The motor is not following the requested motion profile correctly.

Possible causes:

  • obstruction
  • overload
  • servo tuning problem
  • mechanical binding
  • slip or coupling issue
  • aggressive acceleration

What it looks like in software

  • controller reports following error
  • actual position deviates beyond threshold
  • servo alarm appears during move

What machine behavior should follow

  • immediate controlled stop or servo fault response
  • motion subsystem enters faulted state
  • machine must not assume final position is valid
  • downstream process steps must be blocked

Why it matters

This is one of the most important motion faults because the axis may have gone somewhere unsafe or unknown.
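
A minimal sketch of the supervisory reaction to a following error is shown below. In many systems the drive or motion controller detects the error itself and software only reacts to the reported fault; the comparison here is included to make the idea concrete. `AxisState` and `check_following_error` are hypothetical names, and the threshold value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AxisState:
    faulted: bool = False
    position_trusted: bool = True
    fault_reason: str = ""


def check_following_error(
    state: AxisState, commanded: float, actual: float, threshold: float
) -> bool:
    """Return True if motion may continue; latch a fault otherwise."""
    if state.faulted:
        # Latched: the fault stays active even if the error shrinks again.
        return False
    error = abs(commanded - actual)
    if error > threshold:
        state.faulted = True
        state.position_trusted = False   # final position must not be assumed
        state.fault_reason = f"following error {error:.3f} > {threshold:.3f}"
        return False
    return True
```

Clearing `faulted` is deliberately not part of this function. Under the recovery rules in this document, that belongs to an explicit recovery path such as a re-home.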


3. Unexpected stop

What it means physically

Motion stopped before normal completion, but not through the planned completion path.

Possible reasons:

  • drive disabled
  • safety chain event
  • operator stop
  • controller trip
  • external interlock change

What it looks like in software

  • motion-done condition reported without the expected completion reason
  • axis status becomes stopped without in-position success
  • operation sequence loses its expected progression

What machine behavior should follow

  • determine stop cause
  • do not blindly resume unless semantics explicitly support it
  • mark the position as untrusted if the stop may have invalidated it
  • block dependent process continuation

4. Homing failure

What it means physically

The machine failed to establish a reliable reference point.

Possible causes:

  • sensor not found
  • switch failure
  • mechanical blockage
  • timeout
  • axis moved but reference edge not detected

What it looks like in software

  • home routine times out
  • home completed flag never set
  • home sensor sequence invalid
  • axis remains “unreferenced”

What machine behavior should follow

  • block absolute moves
  • raise a clear alarm stating that the reference is not established
  • require retry/service intervention as appropriate

Why it matters

Without valid reference, the coordinate system may be meaningless.


5. Limit hit

What it means physically

The axis reached or crossed a hard or soft travel boundary.

What it looks like in software

  • hard-limit input active
  • soft-limit violation detected before or during move
  • move command rejected or interrupted

What machine behavior should follow

  • stop motion
  • enter controlled fault or interlock state depending on severity
  • require recovery path that respects mechanical constraints

Why it matters

Limit conditions usually indicate either bad command generation, lost position trust, or abnormal machine state.


6. Feedback mismatch

What it means physically

The software’s expectation of motion or position does not align with the feedback source.

Examples:

  • encoder says one thing, controller reports another
  • axis marked complete but in-position bit absent
  • commanded movement not reflected in measured displacement

What it looks like in software

  • inconsistent status sources
  • impossible state combinations
  • position delta too small or too large relative to command

What machine behavior should follow

  • treat axis state as untrustworthy
  • block further precision-dependent operations
  • often require re-home or service validation

Why motion faults are special

With many device faults, the machine can simply stop using the device and refuse the next step. Motion faults are different because the axis may already have moved partway, may be under load, may be in a dangerous location, or may have invalidated the machine’s spatial assumptions.

That is why experienced engineers ask:

  • Is the machine stopped?
  • Is it in a safe mechanical state?
  • Do we still trust position?
  • Can we back out automatically?
  • Must an operator inspect first?

Those are machine questions, not just software questions.


PART 5 — ALARM HANDLING & OPERATOR FEEDBACK

Alarm handling is not just showing a message box.

A real alarm system is how the machine:

  • classifies abnormal conditions
  • communicates them clearly
  • records them for diagnosis
  • determines what the machine is allowed to do next
  • guides recovery

What an alarm should communicate

A useful alarm should tell the operator and the system:

  • what happened
  • where it happened
  • how severe it is
  • what the machine did in response
  • what action is expected next

A good alarm usually includes:

  • fault/alarm code
  • concise operator-facing message
  • subsystem ownership
  • timestamp
  • machine state/context
  • optional technical detail for service logs
  • recovery expectation

Example of good structure:

ALM-MOT-X-0042
Axis X following error exceeded threshold during scan move.
Subsystem: Motion/XAxis
Time: 2026-04-22 08:14:17
Machine Action: Motion stopped, scan aborted.
Operator Action: Inspect stage path. Re-home axis before restart.
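
One way to keep alarms structured rather than free-form strings is a small record type that renders the operator text. The sketch below mirrors the fields of the example above; the `Alarm` class, its field names, and the `service_detail` field are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alarm:
    code: str
    message: str
    subsystem: str
    time: datetime
    machine_action: str
    operator_action: str
    service_detail: str = ""   # raw codes, signal states; not shown to operators

    def render(self) -> str:
        """Build the operator-facing text; service_detail is left for logs."""
        return "\n".join([
            self.code,
            self.message,
            f"Subsystem: {self.subsystem}",
            f"Time: {self.time:%Y-%m-%d %H:%M:%S}",
            f"Machine Action: {self.machine_action}",
            f"Operator Action: {self.operator_action}",
        ])
```

Because the fields are typed, the same record can drive the display, the event log, and any gating logic without parsing message text.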

Warning vs recoverable alarm vs blocking fault

Warning

A condition worth surfacing, but not necessarily stopping current operation.

Examples:

  • camera temperature slightly high
  • retry count increasing
  • air pressure fluctuating but still in range

Recoverable alarm

The machine cannot continue the current operation normally, but the issue may clear with a defined recovery action.

Examples:

  • transient camera timeout
  • device initialization retry needed
  • temporary vacuum not reached on first attempt

Blocking fault

The machine must not continue until explicit recovery conditions are satisfied.

Examples:

  • axis following error
  • guard open during auto move
  • lost home reference
  • hard limit triggered
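
The three classes above can be sketched as an ordered severity type with a fixed response per class. This is an illustrative mapping, not a standard: `Severity`, `Response`, and the specific flags are hypothetical, and a real machine would likely carry more consequences per class.

```python
from enum import IntEnum
from typing import Dict, Iterable, NamedTuple


class Severity(IntEnum):
    WARNING = 1
    RECOVERABLE = 2
    BLOCKING = 3


class Response(NamedTuple):
    stop_current_operation: bool
    allow_bounded_retry: bool
    require_explicit_recovery: bool


RESPONSES: Dict[Severity, Response] = {
    Severity.WARNING: Response(False, False, False),
    Severity.RECOVERABLE: Response(True, True, False),
    Severity.BLOCKING: Response(True, False, True),
}


def machine_response(active: Iterable[Severity]) -> Response:
    """The most severe active alarm decides what the machine may do."""
    worst = max(active, default=None)
    if worst is None:
        return Response(False, False, False)   # nothing active
    return RESPONSES[worst]
```

Letting the worst active alarm win means a warning can never mask a blocking fault that is active at the same time.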

Why alarm handling is more than a message

Because the alarm has control consequences.

An alarm system in a real machine often determines:

  • whether commands are blocked
  • whether current motion must stop
  • whether auto mode is exited
  • whether reset is allowed
  • whether a technician must inspect hardware
  • whether the fault should escalate after repeated occurrence

That is why vague alarms are so expensive. “Motion error” is weak: it names no axis, no failure type, and no expected action, so it does not help operations, service, or developers.

Operator guidance vs engineering diagnostics

Strong systems separate these concerns cleanly:

  • Operator-facing text should be clear and action-oriented
  • Engineering/service diagnostics should include deeper context: raw codes, axis status, signal states, controller details, retry history

Do not force operators to decode controller internals. Do not deprive service engineers of detail either.


PART 6 — RECOVERY & RESET

Recovery is one of the hardest parts of machine software because “fault cleared” and “safe to continue” are not the same thing.

What recovery means

Recovery means returning the machine from an abnormal condition to a controlled state where further action is safe and consistent.

That may involve:

  • stopping motion
  • re-establishing references
  • clearing transient device states
  • unloading material safely
  • requiring operator inspection
  • re-running initialization or homing
  • forcing workflow rollback or restart

When reset is allowed

Reset should only be allowed when:

  • the active fault condition is no longer present
  • the machine is in a stable state
  • any required recovery actions have been completed
  • continuing would not violate interlocks or trust assumptions

Bad systems let operators mash Reset until the red alarm disappears.

Good systems define explicit reset preconditions.
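
The four reset preconditions listed above can be sketched as a function that reports what still blocks a reset. The names `FaultRecord` and `reset_blockers`, and the idea of tracking recovery actions as string sets, are assumptions for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FaultRecord:
    code: str
    condition_present: bool                       # is the cause still active?
    required_actions: Set[str] = field(default_factory=set)
    completed_actions: Set[str] = field(default_factory=set)


def reset_blockers(
    fault: FaultRecord, machine_stable: bool, interlocks_clear: bool
) -> List[str]:
    """Return the reasons reset is not yet allowed (empty list means allowed)."""
    blockers = []
    if fault.condition_present:
        blockers.append("fault condition still present")
    if not machine_stable:
        blockers.append("machine not in a stable state")
    missing = fault.required_actions - fault.completed_actions
    if missing:
        blockers.append("recovery actions pending: " + ", ".join(sorted(missing)))
    if not interlocks_clear:
        blockers.append("continuing would violate an interlock")
    return blockers
```

With this shape, pressing Reset repeatedly changes nothing: the button only succeeds once the list is empty, and until then the operator sees exactly what is still missing.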

Why some faults can auto-recover and others cannot

Auto-recover candidates

  • transient communication glitch with safe retry semantics
  • non-critical device busy timeout with bounded retry count
  • temporary readiness loss before process commitment

Manual/operator/service recovery candidates

  • lost position reference
  • following error
  • physical obstruction
  • hard limit condition
  • safety guard opened during prohibited operation
  • repeated transient failure suggesting deeper instability

Reset does not mean safe continuation

This is one of the biggest beginner mistakes.

Example:

  • Axis following error occurs during scan
  • Operator clears alarm
  • Software allows Resume
  • But actual position is no longer trustworthy

That is a design failure.

The correct flow may be:

  • stop run
  • mark axis as reference-invalid or motion-invalid
  • require re-home
  • possibly require product disposition or workflow restart

Examples

Re-home after reference loss

If homing/reference trust is lost, reset alone is meaningless. The coordinate system must be re-established.

Retry after transient device timeout

If a camera arm command timed out before the scan even began, a bounded retry may be safe.

Operator intervention after obstruction

If a stage hit unexpected resistance or travel was blocked, the operator or technician may need to inspect the mechanism before reset is enabled.

ASCII state diagram

+-------+      Start       +---------+
| Ready | ---------------> | Running |
+---+---+                  +----+----+
    ^                           |
    |                           |
    | Reset allowed             | fault detected
    | after recovery            v
+---+---------------------+  +--------+
| Safe / Resettable State |<-| Faulted|
+-----------+-------------+  +----+---+
            ^                    |
            | recovery action    |
            | (stop, inspect,    |
            | re-home, retry,    |
            | clear cause)       |
            +--------------------+

What this diagram means

The machine should not jump directly from Faulted to Ready just because someone pressed Reset. There is usually an intermediate condition where the cause is cleared, recovery actions are done, and the system verifies that reset is now legitimate.

That intermediate state is often where mature systems differ from fragile ones.
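
The state diagram can be sketched as a transition table. The event names are hypothetical, and the `complete` transition from Running back to Ready is an assumption that is not drawn in the diagram.

```python
from enum import Enum, auto
from typing import Dict, Tuple


class MachineState(Enum):
    READY = auto()
    RUNNING = auto()
    FAULTED = auto()
    RESETTABLE = auto()   # the "Safe / Resettable State" box


TRANSITIONS: Dict[Tuple[MachineState, str], MachineState] = {
    (MachineState.READY, "start"): MachineState.RUNNING,
    (MachineState.RUNNING, "complete"): MachineState.READY,   # assumed, not in diagram
    (MachineState.RUNNING, "fault"): MachineState.FAULTED,
    (MachineState.FAULTED, "recovery_done"): MachineState.RESETTABLE,
    (MachineState.RESETTABLE, "reset"): MachineState.READY,
}


def next_state(state: MachineState, event: str) -> MachineState:
    """Return the new state, or the unchanged state if the event is not allowed."""
    return TRANSITIONS.get((state, event), state)
```

Because there is no `(FAULTED, "reset")` entry, a reset request in the Faulted state is simply ignored. The only way back to Ready is through the recovery step.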


PART 7 — REAL-WORLD FAILURE SCENARIOS

1. Interlock logic missing in one code path

What it looks like

Manual jog path checks door status, but a background recovery move path does not.

Why it happens

The team implemented safety gating in multiple layers informally instead of enforcing one common command gate.

Production consequence

Most of the time the machine behaves correctly, but a rare path violates safety expectations.

How experienced engineers handle it

They make all motion-producing paths go through the same validator/interlock authority, including service tools and recovery routines.


2. Same action checked in UI but not in service layer

What it looks like

Start button disabled when not homed, but an automation script or workflow service can still issue Start.

Why it happens

Developers treat UI enable/disable as business logic enforcement.

Production consequence

Unsafe or invalid commands slip through non-UI paths.

How experienced engineers handle it

They treat UI checks as convenience only. Actual enforcement belongs in the machine command layer.


3. Alarm cleared but subsystem still unhealthy

What it looks like

Servo fault alarm cleared on screen, but drive still not enabled or axis not re-referenced.

Why it happens

Alarm state and subsystem health are modeled separately, and nothing ties alarm acknowledgment to the real health of the subsystem.

Production consequence

Machine appears healthy, but next command fails or behaves inconsistently.

How experienced engineers handle it

They define reset semantics in terms of real subsystem state, not just alarm acknowledgment.


4. Operator retries without real recovery

What it looks like

Repeated reset-start-reset-start loops after a wafer chuck vacuum failure.

Why it happens

The system lets the workflow restart without proving that vacuum is stable.

Production consequence

Intermittent damage, lost product, hidden root cause, operator frustration.

How experienced engineers handle it

They add bounded retries, escalation rules, and recovery preconditions. Repeated faults are treated as stronger evidence of real instability.


5. Repeated transient faults hide deeper root cause

What it looks like

Camera timeouts recover on retry for hours, until throughput collapses or a critical sequence fails.

Why it happens

The software classifies each occurrence as transient without looking at frequency trends.

Production consequence

Slow degradation is ignored.

How experienced engineers handle it

They track fault history and escalation counts. Five “recoverable” alarms in 20 minutes may become a blocking maintenance condition.
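
Frequency-based escalation can be sketched with a sliding time window. The defaults below reuse the "five in 20 minutes" figure from the text as an example threshold; `FaultEscalator` is a hypothetical name, and timestamps are passed in as seconds so the logic does not depend on a particular clock.

```python
from collections import deque
from typing import Deque


class FaultEscalator:
    """Counts recoverable alarms inside a sliding window."""

    def __init__(self, max_count: int = 5, window_s: float = 20 * 60) -> None:
        self.max_count = max_count
        self.window_s = window_s
        self._times: Deque[float] = deque()

    def record(self, now_s: float) -> bool:
        """Record one occurrence; return True if it should escalate to blocking."""
        self._times.append(now_s)
        # Drop occurrences that have aged out of the window.
        while self._times and now_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.max_count
```

One escalator per alarm code keeps unrelated transient faults from adding up, while a repeating fault on the same device crosses the threshold quickly.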


6. Race condition allows move before inhibit becomes active

What it looks like

Mode switch to maintenance starts, but a queued auto move sneaks through before inhibit is fully latched.

Why it happens

State change propagation is asynchronous and command acceptance is not serialized against it.

Production consequence

The machine executes a command in the wrong operating mode.

How experienced engineers handle it

They define atomic command gating boundaries, serialize state transitions that affect permissions, and log exact timing around acceptance decisions.

This kind of problem is especially common in machine software because control is asynchronous and stateful, which the source-of-truth domain explicitly highlights as a core characteristic of motion systems.
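
The serialization described in scenario 6 can be sketched with one lock that covers both the mode change and the acceptance decision. This is a simplified illustration: `CommandGate`, the mode strings, and the queue of move names are all hypothetical, and a real system would also have to handle a move that is already executing.

```python
import threading
from collections import deque
from typing import Deque, Optional


class CommandGate:
    """Mode changes and command acceptance share one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._mode = "auto"
        self._queue: Deque[str] = deque()

    def set_mode(self, mode: str) -> None:
        with self._lock:
            self._mode = mode
            if mode != "auto":
                self._queue.clear()   # queued auto moves must not survive the switch

    def submit_auto_move(self, name: str) -> bool:
        with self._lock:
            if self._mode != "auto":
                return False          # rejected: inhibit is already in effect
            self._queue.append(name)
            return True

    def take_next(self) -> Optional[str]:
        """Re-check the mode at dispatch time, not only at acceptance time."""
        with self._lock:
            if self._mode != "auto" or not self._queue:
                return None
            return self._queue.popleft()
```

The check in `take_next` matters as much as the one in `submit_auto_move`: a command accepted in auto mode is validated again at the moment it would actually start.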


PART 8 — SOFTWARE DESIGN IMPLICATIONS

Interlocks and faults are not side concerns. They require architecture.

The Domain 1 source emphasizes that these systems must be state-driven, must validate motion before execution, and must handle failures explicitly. That directly implies a design with explicit command gating and explicit fault state modeling.

Why this needs explicit architecture

Because the machine has:

  • long-running operations
  • multiple command sources
  • asynchronous state changes
  • dependency between subsystems
  • physical consequences when assumptions fail

Ad hoc checks cannot scale safely.

Good architecture traits

1. Centralized validation

Every significant command passes through a common validator.

2. Explicit fault model

Faults are structured, typed, latched appropriately, and connected to recovery rules.

3. Consistent command gating

The same action is judged the same way regardless of whether it came from UI, workflow engine, script, or service tool.

4. Traceable recovery rules

Each significant fault has known reset conditions and documented consequences.

Bad approach

  • boolean flags scattered across classes
  • exceptions used as the main interlock mechanism
  • UI button state treated as authority
  • vague alarm text like “operation failed”
  • reset just clears flags
  • no distinction between rejectable command and runtime fault

Good approach

  • centralized motion guard/interlock service
  • structured fault manager
  • machine/subsystem state model
  • explicit alarm classification
  • reset/recovery policies tied to actual health/state
  • strong logs and event history around fault transitions

ASCII component diagram

+------------------+       +----------------------+
| UI / HMI         | ----> | Command Application  |
| Workflow / Script|       | Layer                |
+---------+--------+       +----------+-----------+
          |                           |
          |                           v
          |                +----------------------+
          |                | Command Validator    |
          |                | + Interlock Engine   |
          |                +----------+-----------+
          |                           |
          |                     allow | reject
          |                           |
          |                           v
          |                +----------------------+
          |                | Subsystem Controllers|
          |                | Motion / Camera / IO |
          |                +----------+-----------+
          |                           |
          |                           v
          |                +----------------------+
          +--------------> | Fault / Alarm Manager|
                           | Recovery Rules       |
                           +----------------------+

What this diagram means

The UI and workflow should not directly “decide safety.” They request actions.

The command validator/interlock engine is the enforcement point.

Subsystem controllers execute commands and emit status/faults.

The fault/alarm manager classifies abnormalities, latches them, exposes them to operators, and enforces recovery behavior.

That separation is one of the biggest markers of a mature machine software codebase.


PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

How to explain interlocks and fault handling clearly

A strong explanation is:

Interlocks are the rules that prevent unsafe or invalid actions before they happen. Fault handling is what the machine does when abnormal conditions are detected during or after execution. Good machine software needs both: command gating up front, and controlled containment plus recovery when reality diverges from plan.

Why interlocks are different from alarms

A crisp way to explain it:

An interlock blocks an action. An alarm reports and manages an abnormal condition. Some interlocks raise alarms, and some alarms create blocking conditions, but they are not the same thing. Interlocks are about permission to act; alarms are about abnormal state and recovery.

Common mistakes software engineers make when entering machine software

  • putting critical checks in the UI only
  • treating reset as “clear the error and continue”
  • assuming command acceptance is enough without runtime monitoring
  • mixing warnings, recoverable issues, and blocking faults together
  • using vague exceptions instead of structured machine faults
  • scattering interlock checks across code paths
  • failing to distinguish “command rejected” from “action started but faulted mid-flight”

What strong engineers understand

Strong engineers understand that:

  • physical motion changes the meaning of software failure
  • state trust is part of recovery, especially for position and reference
  • all command sources must be gated consistently
  • repeated recoverable alarms often indicate a deeper system issue
  • recovery rules are as important as detection rules
  • safe behavior comes from architecture, not from a few careful if-statements

Interview-ready summary

If you want a concise principal-level answer:

In industrial machine software, interlocks and permissives are the control boundary that decides whether an action is allowed under the current machine, mode, and hardware conditions. Fault handling is the structured response when those conditions break down during execution. The hard part is not just detecting problems; it is preserving deterministic behavior, moving the machine into a safe state, preventing unsafe retries, and defining recovery rules that reflect actual subsystem health rather than just cleared alarms. That is why mature systems use centralized command gating, explicit fault models, and clear reset semantics.

This matches the source-of-truth emphasis that machine software must validate motion before execution, behave deterministically, and handle failures explicitly.

