
PART 1 — WHY MACHINE SOFTWARE NEEDS EXPLICIT STATES

Industrial machine software needs an explicit current state because the machine is never just “doing a method call.” It is always in some real operational condition that affects what is safe, valid, and possible next. In Domain 1, “State Machines for Machine Control” is explicitly called out as its own topic, focused on machine states vs workflow steps, state transitions, and hierarchical state design.

In normal business software, a request comes in, code runs, and the transaction ends. In machine software, the system is long-running, asynchronous, and coupled to physical reality. That means software has to answer questions like:

  • Is the machine idle and ready?
  • Is it starting but not yet operational?
  • Is it running normally?
  • Is it paused in a controlled way?
  • Is it faulted and unsafe to continue?
  • Is it recovering and therefore not ready for a new start?

That is why machine control is fundamentally stateful. The machine’s current state is not decoration. It is the primary context for deciding whether commands are allowed, whether subsystem actions should continue, and what recovery path is valid.

A weak team often starts with booleans:

  • IsRunning
  • IsPaused
  • HasFault
  • IsRecovering
  • IsStarting
  • StopRequested

At first this feels flexible. In practice it becomes dangerous, because nothing stops the flags from combining in ways like this:

  • IsRunning = true
  • IsPaused = true
  • HasFault = true

Now what is the machine actually doing?

That is the core reason flags break down. They describe fragments of truth, not the operational truth the machine must obey.
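
To make the problem concrete, here is a minimal sketch that counts what six independent flags can represent versus one enumerated state. The MachineState enum and its member names are illustrative, taken from the states discussed in this part, not from any particular codebase.

```python
from enum import Enum, auto
from itertools import product

FLAGS = ["IsRunning", "IsPaused", "HasFault", "IsRecovering", "IsStarting", "StopRequested"]

# Each flag is independently True or False, so nothing rules out nonsense.
flag_combinations = list(product([False, True], repeat=len(FLAGS)))


class MachineState(Enum):
    """One authoritative value: the machine is in exactly one of these."""
    IDLE = auto()
    STARTING = auto()
    RUNNING = auto()
    PAUSED = auto()
    FAULTED = auto()
    RECOVERING = auto()


print(len(flag_combinations))  # 64 representable flag combinations
print(len(MachineState))       # 6 representable states
```

Most of the 64 combinations have no physical meaning, yet every one of them can occur at runtime. The enum cannot express a contradiction in the first place.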

Here is the real production problem: when state is ambiguous, behavior becomes ambiguous. And in a machine, ambiguous behavior is not just messy code. It can mean:

  • motion starts while recovery is incomplete
  • UI enables the wrong command
  • workflow resumes from the wrong point
  • subsystems drift out of sync
  • hardware is put into unsafe conditions

Example: wafer inspection start readiness

A wafer inspection system may appear “ready” from the operator’s perspective, but in reality safe scanning cannot start until all of these are true:

  • stage is homed
  • vacuum/chuck is stable
  • recipe is loaded
  • camera is initialized
  • illumination is valid
  • no active interlock blocks motion
  • previous recovery sequence is complete

If you do not model state explicitly, this logic leaks everywhere. One service checks some flags, the UI checks another set, the workflow checks a third set. Very quickly the system stops having a single truth.
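
One way to keep a single truth for start readiness is to evaluate the conditions in exactly one place and let every caller ask the same question. The sketch below assumes a hypothetical StartPermissives snapshot whose field names mirror the list above; how the values are actually sampled from hardware is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartPermissives:
    """Snapshot of the start conditions (field names are hypothetical)."""
    stage_homed: bool
    vacuum_stable: bool
    recipe_loaded: bool
    camera_initialized: bool
    illumination_valid: bool
    interlock_clear: bool
    recovery_complete: bool

    def blocking_reasons(self):
        """Names of every unmet condition; an empty list means start is allowed."""
        return [name for name, ok in vars(self).items() if not ok]

    @property
    def ready(self):
        return not self.blocking_reasons()


# Everything is satisfied except the camera (fourth field).
snapshot = StartPermissives(True, True, True, False, True, True, True)
print(snapshot.ready)               # False
print(snapshot.blocking_reasons())  # ['camera_initialized']
```

The UI, the workflow, and the services can all call the same check, and the list of blocking reasons doubles as an operator-facing explanation of why Start is disabled.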

A stronger design makes the machine state explicit and authoritative.

text
+--------+   Start    +----------+   Ready OK   +---------+   Pause    +--------+
|  Idle  | ---------> | Starting | -----------> | Running | ---------> | Paused |
+--------+            +----------+              +---------+ <--------- +--------+
    ^                      |                         |        Resume       |
    |                      | Fault                   | Fault               | Fault
    |                      v                         v                     |
    |                 +-----------------------------------+                |
    |                 |              Faulted              | <--------------+
    |                 +-----------------------------------+
    |                                   |
    |                                   | Recover
    |                                   v
    |   Recovery Complete         +------------+
    +---------------------------- | Recovering |
                                  +------------+

What this diagram means

This is not a workflow step chart. It is the machine’s operational state model. It tells you what the machine is, not what a specific sequence step is doing.

An explicit state model pays off in concrete ways:

  • command validity becomes clear
  • UI can reflect real condition
  • logs become understandable
  • recovery paths become explicit
  • engineers can reason about behavior under interruption

Experienced engineers treat the state model as part of the machine’s control contract, not as a UI convenience.


PART 2 — MACHINE STATE VS WORKFLOW STEP

This distinction is one of the most important in industrial software.

Machine state

Machine state describes the overall operational condition of the machine or subsystem.

Examples:

  • Idle
  • Starting
  • Running
  • Paused
  • Faulted
  • Recovering

This answers: What condition is the machine in right now?

Workflow step

A workflow step describes the current action inside a process sequence.

Examples:

  • Load wafer
  • Move stage to scan start
  • Autofocus
  • Acquire image strip
  • Advance to next scan line
  • Unload wafer

This answers: What operation is the process currently executing?

These are not the same thing.

A machine can be in the Running state while the workflow step is “Move stage to scan start.”

Later, the machine is still in Running, but the workflow step is Acquire image strip.

Then the operator hits pause. The workflow step may still logically be “Acquire image strip,” but the machine state becomes Paused.

That distinction matters because workflow describes process progression, while state describes operational condition and command validity.

Why teams confuse them

Teams new to machine software often use workflow steps as if they were system states:

  • “The machine is in ScanStartMove”
  • “The machine is in AutoFocus”
  • “The machine is in Unload”

That looks fine at first, but it creates fragile logic because you lose operational meaning.

For example:

  • Is AutoFocus a running state or a paused state?
  • Can start be issued from Unload?
  • Can recovery happen from ScanStartMove?
  • Is Unload a safe condition or an interrupted condition?

These questions are awkward because workflow steps are not meant to define machine-wide operational semantics.

Better mental model

text
+-----------------------------------------------------------+
| MACHINE STATE LAYER                                       |
|-----------------------------------------------------------|
| Idle | Starting | Running | Paused | Faulted | Recovering |
+-----------------------------------------------------------+

+-----------------------------------------------------------+
| WORKFLOW STEP LAYER                                       |
|-----------------------------------------------------------|
| LoadWafer -> Align -> MoveToScanStart -> Scan -> Unload   |
+-----------------------------------------------------------+

What this diagram means

The top layer tells you the machine’s operational condition.

The bottom layer tells you which process step is active.

They coexist, but they serve different purposes.

A more realistic example:

text
Machine State : Running
Workflow Step : MoveToScanStart

Machine State : Running
Workflow Step : Scan

Machine State : Paused
Workflow Step : Scan

Machine State : Faulted
Workflow Step : Scan

Notice how the workflow step may remain associated with the interrupted operation, while the machine state changes based on operational condition.
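
The same idea can be sketched in code by keeping the two values in separate fields with separate types. The MachineStatus class and its pause method are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class MachineState(Enum):
    IDLE = "Idle"
    RUNNING = "Running"
    PAUSED = "Paused"
    FAULTED = "Faulted"


class WorkflowStep(Enum):
    NO_JOB = "NoJob"
    MOVE_TO_SCAN_START = "MoveToScanStart"
    SCAN = "Scan"


@dataclass
class MachineStatus:
    state: MachineState = MachineState.IDLE
    step: WorkflowStep = WorkflowStep.NO_JOB

    def pause(self):
        # Pausing changes the operational condition only. The step is kept
        # as interrupted context so a later resume knows where it was.
        if self.state is MachineState.RUNNING:
            self.state = MachineState.PAUSED


status = MachineStatus(MachineState.RUNNING, WorkflowStep.SCAN)
status.pause()
print(status.state.value, status.step.value)  # Paused Scan
```

Because the types differ, code cannot accidentally assign a workflow step where a machine state is expected, which is exactly the confusion described above.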

What goes wrong when they are mixed

When teams mix state and step:

  • command enablement becomes inconsistent
  • pause/resume semantics become messy
  • recovery becomes step-specific spaghetti
  • fault handling spreads across workflow code
  • UI shows process detail instead of real operational truth

Experienced engineers separate them clearly:

  • state model controls allowed behavior
  • workflow model controls process progression

That separation is one of the foundations of robust machine software.


PART 3 — STATE TRANSITIONS

A transition is the controlled movement from one state to another.

Examples:

  • Idle -> Starting
  • Starting -> Running
  • Running -> Paused
  • Running -> Faulted
  • Faulted -> Recovering
  • Recovering -> Idle

A transition should never be casual. In industrial software, it should happen only because a defined trigger occurred and the guard conditions were satisfied.

What should trigger transitions

Typical triggers include:

1. Operator commands

Examples:

  • Start
  • Pause
  • Resume
  • Stop
  • Abort
  • Reset

These are intent signals. They do not automatically mean the transition is valid.

For example, Start should not force Idle -> Running. Usually it requests Idle -> Starting, and only when startup conditions succeed does the system move to Running.

2. Hardware or subsystem events

Examples:

  • stage homed
  • camera initialized
  • chuck vacuum achieved
  • guard door closed
  • interlock cleared

These often complete or unblock a transition.

3. Internal completion events

Examples:

  • startup sequence finished
  • stop sequence completed
  • pause deceleration finished
  • recovery routine completed

These are especially important because physical operations take time.

4. Fault events

Examples:

  • motion controller alarm
  • camera timeout
  • sensor disagreement
  • axis following error
  • interlock violation

These often force transitions into Faulted or another protected state.

Allowed vs invalid transitions

Not all state changes are legal.

For example:

  • Idle -> Running might be invalid if startup checks are mandatory
  • Faulted -> Running is usually invalid
  • Recovering -> Starting may be invalid until recovery finishes
  • Paused -> Starting is usually nonsensical

A good machine state model makes these explicit.
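
A simple way to make legality explicit is a table of allowed targets per state, with everything else rejected by default. The sketch below is an assumption about how such a table might look; the exact edges depend on the machine, and the short class name S is just for brevity.

```python
from enum import Enum


class S(Enum):
    IDLE = "Idle"
    STARTING = "Starting"
    RUNNING = "Running"
    PAUSED = "Paused"
    STOPPING = "Stopping"
    FAULTED = "Faulted"
    RECOVERING = "Recovering"


# Anything not listed here is illegal by default.
ALLOWED = {
    S.IDLE: {S.STARTING},
    S.STARTING: {S.RUNNING, S.STOPPING, S.FAULTED},
    S.RUNNING: {S.PAUSED, S.STOPPING, S.FAULTED},
    S.PAUSED: {S.RUNNING, S.FAULTED},
    S.STOPPING: {S.IDLE},
    S.FAULTED: {S.RECOVERING},
    S.RECOVERING: {S.IDLE},
}


def is_allowed(current, target):
    """True only if the table explicitly permits current -> target."""
    return target in ALLOWED.get(current, set())


print(is_allowed(S.IDLE, S.STARTING))        # True
print(is_allowed(S.IDLE, S.RUNNING))         # False
print(is_allowed(S.FAULTED, S.RUNNING))      # False
print(is_allowed(S.RECOVERING, S.STARTING))  # False
```

A table like this is easy to review with controls and process engineers, and it can be unit-tested exhaustively because the state set is small.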

text
Idle        --[ Start command accepted ]-->  Starting

Starting    --[ startup ok ]-------------->  Running
Starting    --[ stop req ]---------------->  Stopping
Starting    --[ fault ]------------------->  Faulted

Running     --[ pause req ]--------------->  Paused
Running     --[ stop req ]---------------->  Stopping
Running     --[ fault ]------------------->  Faulted

Paused      --[ resume ]------------------>  Running
Paused      --[ fault ]------------------->  Faulted

Stopping    --[ stop complete ]----------->  Idle

Faulted     --[ Recover command ]--------->  Recovering
Recovering  --[ recovery complete ]------->  Idle

What this diagram means

This is closer to how real machine software thinks. You can see:

  • commands request transitions
  • completion events finish transitions
  • faults can interrupt multiple states
  • recovery is its own state, not a hidden implementation detail

Why transition rules must be explicit

If transition rules are not centralized, they end up scattered:

  • UI directly changes state
  • workflow code changes state
  • device event handler changes state
  • alarm handler changes state

Now the machine has multiple writers of truth.

That leads to race conditions and contradictory state changes.

Experienced engineers usually enforce a rule like this:

State changes happen only through a controlled transition mechanism.

That mechanism validates the first three items below and carries out the last two:

  • current state
  • requested trigger
  • guards/permissives
  • transition side effects
  • notification/logging
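
A minimal sketch of such a mechanism is shown below. It is an illustration under stated assumptions, not a reference design: the MachineController class, the fire method, the string trigger names, and the permissives dictionary are all hypothetical.

```python
import logging
from enum import Enum

log = logging.getLogger("machine")


class State(Enum):
    IDLE = "Idle"
    STARTING = "Starting"
    RUNNING = "Running"


class MachineController:
    """Single writer of machine state; all other code calls fire()."""

    def __init__(self, rules):
        # rules maps (current state, trigger) -> (target state, guard callable)
        self._rules = rules
        self._state = State.IDLE
        self._listeners = []

    @property
    def state(self):
        return self._state  # readable by anyone, written only inside fire()

    def subscribe(self, listener):
        self._listeners.append(listener)

    def fire(self, trigger):
        """Attempt a transition; return True if it happened."""
        rule = self._rules.get((self._state, trigger))
        if rule is None:
            log.warning("%s rejected in %s: not a legal trigger", trigger, self._state.value)
            return False
        target, guard = rule
        if not guard():
            log.warning("%s rejected in %s: guard not satisfied", trigger, self._state.value)
            return False
        previous, self._state = self._state, target
        log.info("%s -> %s on %s", previous.value, target.value, trigger)
        for listener in self._listeners:
            listener(previous, target, trigger)  # e.g. UI refresh, orchestration
        return True


permissives = {"stage_homed": False}
controller = MachineController({
    (State.IDLE, "Start"): (State.STARTING, lambda: permissives["stage_homed"]),
    (State.STARTING, "StartupCompleted"): (State.RUNNING, lambda: True),
})

print(controller.fire("Start"))  # False: the guard blocks the request
permissives["stage_homed"] = True
print(controller.fire("Start"))  # True
print(controller.state.value)    # Starting
```

The point of the shape is that the state has no public setter. The UI, workflow code, and device handlers can only submit triggers, so there is exactly one writer and every accepted or rejected request leaves a log line.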

This is one of the biggest differences between toy machine code and production machine code.


PART 4 — HIERARCHICAL STATE DESIGN

Real machines usually need more than one state layer. Domain 1 explicitly calls out hierarchical state design as part of this topic.

In a real machine, no single flat state captures everything cleanly.

You may need:

  • machine-level state
  • subsystem/module-level state
  • device-level state

Example

  • machine = Running
  • motion subsystem = Busy
  • camera subsystem = Waiting
  • wafer handler = Idle
  • one axis drive = Faulted

This is normal. The machine is a composition of coordinated parts, not one monolithic actor.

Why multiple state layers are needed

A flat model breaks down because:

  • the machine may be Running overall while a subsystem waits for another subsystem
  • one device may be Faulted before the machine-level state has fully transitioned
  • a recovery routine may target only one subsystem
  • some devices have their own internal lifecycle independent of current workflow step

A useful hierarchy

text
Machine
|
+-- Machine State
|    |
|    +-- Idle
|    +-- Starting
|    +-- Running
|    +-- Paused
|    +-- Faulted
|    +-- Recovering
|
+-- Workflow State
|    |
|    +-- NoJob
|    +-- LoadWafer
|    +-- Align
|    +-- MoveToScanStart
|    +-- Scan
|    +-- Unload
|
+-- Subsystems
     |
     +-- Motion Subsystem
     |    |
     |    +-- NotReady
     |    +-- Ready
     |    +-- Busy
     |    +-- Stopping
     |    +-- Faulted
     |
     +-- Camera Subsystem
     |    |
     |    +-- Offline
     |    +-- Initializing
     |    +-- Ready
     |    +-- Acquiring
     |    +-- Faulted
     |
     +-- Wafer Handler
          |
          +-- Homing
          +-- Ready
          +-- Loading
          +-- Unloading
          +-- Faulted

What this diagram means

This is not three competing truths. It is a structured decomposition.

  • machine state describes top-level operational condition
  • workflow state describes process position
  • subsystem states describe local behavior and readiness

This lets you manage complexity without pretending everything belongs in one enum.

Important design idea: state ownership

Each level should have a clear owner.

For example:

  • machine controller owns machine state
  • workflow engine owns workflow step
  • motion manager owns motion subsystem state
  • camera manager owns camera state

Then the machine-level controller derives or reacts to subsystem conditions, rather than directly faking them.

Example of hierarchy in real behavior

Imagine the machine is in Running.

The motion subsystem is Busy moving to scan start.

The camera is Ready.

During motion, the motion controller reports a servo alarm.

Now:

  • motion subsystem transitions to Faulted
  • machine controller observes that fault
  • machine transitions from Running to Faulted
  • workflow remains associated with MoveToScanStart as interrupted context

That is hierarchical design working correctly. The local failure occurs at the correct layer, then propagates upward in a controlled way.
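
The servo alarm scenario can be sketched as two owners connected by a fault notification. The class names, the on_fault subscription, and the last_fault field are assumptions made for this example; the workflow engine is left out to keep it short.

```python
from enum import Enum


class SubsystemState(Enum):
    READY = "Ready"
    BUSY = "Busy"
    FAULTED = "Faulted"


class MachineState(Enum):
    RUNNING = "Running"
    FAULTED = "Faulted"


class MotionManager:
    """Owns the motion subsystem state; nobody else writes it."""

    def __init__(self):
        self.state = SubsystemState.READY
        self._fault_listeners = []

    def on_fault(self, listener):
        self._fault_listeners.append(listener)

    def start_move(self):
        self.state = SubsystemState.BUSY

    def report_servo_alarm(self):
        self.state = SubsystemState.FAULTED      # local transition first
        for listener in self._fault_listeners:
            listener("motion", "servo alarm")    # then propagate upward


class MachineController:
    """Owns machine state and reacts to subsystem faults."""

    def __init__(self, motion):
        self.state = MachineState.RUNNING
        self.last_fault = None
        motion.on_fault(self._handle_subsystem_fault)

    def _handle_subsystem_fault(self, source, reason):
        self.state = MachineState.FAULTED
        self.last_fault = (source, reason)       # keep the origin for diagnosis


motion = MotionManager()
machine = MachineController(motion)
motion.start_move()
motion.report_servo_alarm()
print(motion.state.value, machine.state.value, machine.last_fault)
# Faulted Faulted ('motion', 'servo alarm')
```

Note that the machine controller never writes the motion state, and the motion manager never writes the machine state. Each layer changes only what it owns, and the order of the two changes is visible.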

What goes wrong without hierarchy

If there is only one flat machine state:

  • subsystem detail gets lost
  • recovery becomes opaque
  • local faults become global chaos
  • debugging becomes harder because you do not know which layer changed first

Experienced engineers use hierarchy to reduce ambiguity, not to make the design fancy.


PART 5 — EVENTS, COMMANDS, AND STATE CHANGES

A common mistake is to think commands directly change state.

In strong machine software, commands usually request change. Events and completion conditions usually confirm change.

That distinction matters because the machine is interacting with physical reality.

Different kinds of triggers

Operator command

Examples:

  • Start button pressed
  • Pause requested
  • Reset fault
  • Abort cycle

This expresses intent.

Hardware event

Examples:

  • home sensor detected
  • vacuum stable
  • axis in position
  • guard door opened
  • camera disconnected

This expresses something observed from the physical system.

Internal completion event

Examples:

  • startup sequence completed
  • pause deceleration completed
  • stop routine completed
  • recovery cleanup finished

This expresses that software-driven actions have actually finished.

Fault event

Examples:

  • motion timeout
  • drive alarm
  • image acquisition failure
  • inconsistent sensor state

This expresses abnormal condition.

Why state must not change silently

Suppose an operator presses Start.

Weak design:

  • UI handler sets MachineState = Running

Strong design:

  • UI publishes StartRequested
  • machine controller checks permissives
  • machine state becomes Starting
  • startup actions execute
  • when startup completes successfully, StartupCompleted is raised
  • machine state becomes Running

That sequence matters because the machine is not “running” just because someone clicked a button.

Sequence diagram

text
Operator        UI/HMI        Machine Controller     Subsystems
   |              |                  |                  |
   | Press Start  |                  |                  |
   |------------->|                  |                  |
   |              | StartRequested   |                  |
   |              |----------------->|                  |
   |              |                  | Check guards     |
   |              |                  |----------------->|
   |              |                  |<-----------------|
   |              |                  | Transition:      |
   |              |                  | Idle->Starting   |
   |              |                  |                  |
   |              |                  | Execute startup  |
   |              |                  |----------------->|
   |              |                  |<-----------------|
   |              |                  | StartupCompleted |
   |              |                  | Transition:      |
   |              |                  | Starting->Running|
   |              |<-----------------| StateChanged     |
   | UI updates   |                  |                  |

What this diagram means

The state change is not arbitrary. It is driven by explicit signals and validated progress.

This is why event-driven transitions are common in machine software:

  • physical actions are asynchronous
  • subsystems report completion later
  • interruptions can happen mid-transition
  • the system needs observable causality

Practical rule

A very good rule is:

  • commands express intent
  • events report facts
  • state transitions consume those signals under explicit rules

That keeps the design understandable.
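
One way to express this rule is to give commands and events their own types and route both through a single transition function. The sketch below is illustrative only; StartRequested and StartupCompleted follow the names used in this part, while next_state and the permissives_ok parameter are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    STARTING = "Starting"
    RUNNING = "Running"


@dataclass(frozen=True)
class StartRequested:
    """Command: operator intent, which may be refused."""
    requested_by: str


@dataclass(frozen=True)
class StartupCompleted:
    """Event: a fact reported after the startup work has finished."""


def next_state(state, signal, permissives_ok):
    """Return the new state, or the unchanged state if the signal is not accepted."""
    if isinstance(signal, StartRequested) and state is State.IDLE and permissives_ok:
        return State.STARTING
    if isinstance(signal, StartupCompleted) and state is State.STARTING:
        return State.RUNNING
    return state


state = State.IDLE
state = next_state(state, StartRequested("operator"), permissives_ok=True)
print(state.value)  # Starting
state = next_state(state, StartupCompleted(), permissives_ok=True)
print(state.value)  # Running
```

The command alone never reaches Running. Only the completion event does, and only from Starting, which mirrors the sequence diagram above.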


PART 6 — REAL-WORLD FAILURE SCENARIOS

Here are the kinds of failures that happen when state modeling is weak.

1. UI shows machine as Running but subsystem is actually Faulted

What it looks like in production

The HMI still shows green “Running,” but image acquisition stopped, or motion no longer responds. Operators think the machine is hung.

Why it happens

The subsystem fault is stored locally but never propagated properly to machine state. Or the UI reads cached machine state but not subsystem health.

How experienced engineers handle it

They make subsystem faults explicit events and define machine-level fault propagation rules. They also log state transitions and root events so the sequence is visible.


2. Machine accepts Start while still recovering

What it looks like in production

Operator clears an alarm and quickly presses Start. The machine begins a new cycle before cleanup is complete. Axes may not be re-referenced, outputs may still be latched, or leftover product context may remain.

Why it happens

Recovery was treated as a hidden internal action, not as an explicit state. So the system looks idle before it is truly ready.

How experienced engineers handle it

They model Recovering explicitly and block Start until recovery completion criteria are satisfied. Recovery is treated as a first-class operational condition, not background housekeeping.


3. State transition occurs too early before physical completion

What it looks like in production

Software changes from Starting to Running as soon as a motion command is sent, not when homing or initialization is actually complete. Then subsequent workflow actions begin too early.

Why it happens

The code assumes command issuance equals action completion.

How experienced engineers handle it

They distinguish request, in-progress, and completion. They transition on observed completion conditions, not on command dispatch.
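
The difference between dispatch and completion can be sketched with asyncio. FakeStage is a hypothetical stand-in for a motion axis, and the timings are arbitrary; a real system would wait on a completion signal from the motion controller instead of a simulated delay.

```python
import asyncio
from enum import Enum


class State(Enum):
    STARTING = "Starting"
    RUNNING = "Running"
    FAULTED = "Faulted"


class FakeStage:
    """Stand-in for a motion axis (hypothetical); homing takes real time."""

    def __init__(self, homing_time_s):
        self.homing_time_s = homing_time_s
        self.homed = asyncio.Event()

    async def home(self):
        await asyncio.sleep(self.homing_time_s)  # simulated physical motion
        self.homed.set()                         # completion is now observable


async def start_machine(homing_time_s, timeout_s):
    stage = FakeStage(homing_time_s)
    state = State.STARTING
    homing = asyncio.create_task(stage.home())   # command dispatched
    # Dispatch alone changes nothing: the machine is still Starting here.
    try:
        await asyncio.wait_for(stage.homed.wait(), timeout_s)
        state = State.RUNNING                    # completion was observed
    except asyncio.TimeoutError:
        homing.cancel()
        state = State.FAULTED                    # completion never arrived
    return state


print(asyncio.run(start_machine(homing_time_s=0.01, timeout_s=1.0)).value)  # Running
```

The timeout branch matters as much as the success branch: if completion is never observed, the machine leaves Starting through a fault path instead of silently pretending it is running.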


4. Multiple flags imply contradictory states

What it looks like in production

Different screens or services disagree:

  • one component thinks paused
  • another thinks running
  • another thinks faulted but recoverable

Engineers spend hours reading code to infer the real condition.

Why it happens

State was represented as distributed booleans with no authoritative model.

How experienced engineers handle it

They replace boolean explosion with explicit state models and transition rules. They allow local detail where needed, but operational state remains authoritative and normalized.
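
When existing screens or services still expect booleans, one migration approach is to derive them from the authoritative state instead of storing them. The MachineStatusView class below is a hypothetical sketch of that idea.

```python
from enum import Enum


class MachineState(Enum):
    IDLE = "Idle"
    STARTING = "Starting"
    RUNNING = "Running"
    PAUSED = "Paused"
    FAULTED = "Faulted"
    RECOVERING = "Recovering"


class MachineStatusView:
    """Read-only flags computed from one state value; there are no setters."""

    def __init__(self, state):
        self._state = state

    @property
    def is_running(self):
        return self._state is MachineState.RUNNING

    @property
    def is_paused(self):
        return self._state is MachineState.PAUSED

    @property
    def has_fault(self):
        return self._state is MachineState.FAULTED


view = MachineStatusView(MachineState.PAUSED)
print(view.is_running, view.is_paused, view.has_fault)  # False True False
```

Since every flag is computed from the same value, two of them can never be true at once, and no component can flip one independently of the others.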


5. Subsystem state and machine state drift apart

What it looks like in production

Motion subsystem says Stopping, machine says Idle, workflow still thinks Scan. Restart behavior becomes unpredictable.

Why it happens

No clear ownership. Multiple parts of the system mutate state independently. Some transitions are event-driven, some are direct assignments.

How experienced engineers handle it

They define clear state owners and propagation paths. They also use event logs or timeline views to reconstruct who changed what and why.
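
A timeline of this kind can be as simple as an append-only list of records that name the owner and the cause of each change. The TransitionRecord and Timeline classes, and the trigger names in the usage lines, are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TransitionRecord:
    owner: str      # which state owner made the change
    previous: str
    current: str
    cause: str      # the command or event that triggered it
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Timeline:
    """Append-only log of state changes across all owners."""

    def __init__(self):
        self._records = []

    def record(self, owner, previous, current, cause):
        self._records.append(TransitionRecord(owner, previous, current, cause))

    def history(self, owner):
        """Transitions for one owner, oldest first."""
        return [r for r in self._records if r.owner == owner]


timeline = Timeline()
timeline.record("motion", "Busy", "Stopping", "StopRequested")
timeline.record("machine", "Running", "Stopping", "StopRequested")
timeline.record("motion", "Stopping", "Ready", "StopCompleted")

for r in timeline.history("motion"):
    print(f"{r.owner}: {r.previous} -> {r.current} ({r.cause})")
# motion: Busy -> Stopping (StopRequested)
# motion: Stopping -> Ready (StopCompleted)
```

With the owner and cause recorded on every change, a drift like the one described above shows up as a missing or out-of-order entry instead of something to infer from code.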


PART 7 — SOFTWARE DESIGN IMPLICATIONS

The state model affects architecture directly.

Domain 1 emphasizes that these systems must be state-driven, deterministic, and safe. That is exactly why explicit state modeling matters here.

1. State ownership must be clear

Every state should have an owner.

Bad example:

  • UI sets machine state
  • workflow sets machine state
  • device manager sets machine state
  • alarm service sets machine state

Good example:

  • machine controller owns machine state transitions
  • subsystem managers own subsystem state transitions
  • workflow engine owns workflow steps
  • others submit commands/events, not direct mutations

2. Explicit state machines are usually better than scattered flags

You do not always need a heavy framework. But you do need explicitness.

At minimum:

  • defined state enum/model
  • defined triggers/events
  • defined transition rules
  • guard conditions
  • observable state change notifications
  • transition logging

3. Centralized transition rules matter

The transition logic should live in one place per state owner.

That gives you:

  • consistent validation
  • easier testing
  • cleaner debugging
  • safer evolution

4. State and action should be separated

This is subtle and important.

  • State = what condition the machine is in
  • Action = what software is doing because of that condition or trigger

For example:

  • transition to Stopping
  • then execute stop actions
  • later transition to Idle when stop completion is confirmed

If you mix state and action, transitions become side-effect soup.
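
The stop example can be sketched so that the state change and the work it causes stay visibly separate. The Machine and StopSequence classes and the three action names are hypothetical; real stop actions would be issued to subsystems and confirmed by their completion events.

```python
from enum import Enum


class State(Enum):
    RUNNING = "Running"
    STOPPING = "Stopping"
    IDLE = "Idle"


class StopSequence:
    """Tracks hypothetical stop actions until each one is confirmed."""

    def __init__(self):
        self.pending = {"decelerate axes", "close shutter", "park stage"}

    def confirm(self, action):
        self.pending.discard(action)

    @property
    def complete(self):
        return not self.pending


class Machine:
    def __init__(self):
        self.state = State.RUNNING
        self.stop_sequence = None

    def request_stop(self):
        if self.state is State.RUNNING:
            self.state = State.STOPPING          # state: the condition changes now
            self.stop_sequence = StopSequence()  # action: work caused by that change

    def on_action_confirmed(self, action):
        if self.state is not State.STOPPING:
            return
        self.stop_sequence.confirm(action)
        if self.stop_sequence.complete:
            self.state = State.IDLE              # only after confirmed completion


machine = Machine()
machine.request_stop()
print(machine.state.value)  # Stopping
for action in ["decelerate axes", "close shutter", "park stage"]:
    machine.on_action_confirmed(action)
print(machine.state.value)  # Idle
```

The machine reports Stopping for the whole time the actions are in flight, so the UI and logs reflect the real condition rather than jumping straight to Idle.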

5. Recovery-aware state modeling is essential

Recovery is not an exception. In industrial machines, recovery is normal system behavior.

You need states and transitions that acknowledge:

  • cleanup after interrupted work
  • re-homing or re-initialization
  • subsystem reconciliation
  • operator-guided reset
  • safe re-entry to ready condition

Good vs bad architecture

text
BAD APPROACH
------------

UI Button Handler
   |
   +--> sets IsRunning = true
   +--> starts workflow task
   +--> clears some flags
   +--> motion service updates other flags
   +--> alarm service may set HasFault later

Result:
- hidden state changes
- contradictory flags
- difficult debugging
- weak safety semantics


GOOD APPROACH
-------------

Operator Command / Hardware Event / Internal Event
                    |
                    v
           +---------------------+
           | State Owner         |
           | (Machine Controller)|
           +---------------------+
                    |
                    +--> validate trigger
                    +--> check guards
                    +--> perform transition
                    +--> publish StateChanged
                    +--> invoke actions/orchestration
                    |
                    v
           +---------------------+
           | Observability       |
           | logs / timeline /   |
           | diagnostics         |
           +---------------------+

What this diagram means

The good design creates one authoritative decision point for machine state, with clear trigger handling and observable outcomes.

That is how strong engineers keep large machine systems understandable.


PART 8 — INTERVIEW / REAL-WORLD TALKING POINTS

Here is how I would explain this in an interview.

How to explain state machines clearly

A strong answer is:

In industrial machine software, state machines are used to model the machine’s operational condition explicitly, because the system is long-running, asynchronous, and coupled to physical hardware. The state model defines what the machine is allowed to do next, how it reacts to commands and faults, and how it recovers safely. Without an explicit state model, teams end up with scattered flags, contradictory behavior, and unsafe transitions.

That is clear, practical, and senior-level.

Why machine state is different from workflow step

Another strong interview line:

Machine state and workflow step solve different problems. Machine state describes the operational condition of the machine, such as Idle, Running, Paused, Faulted, or Recovering. Workflow step describes where the current process is, such as Load, Align, Scan, or Unload. Mixing them creates fragile logic because workflow progression and operational control have different semantics.

That is one of the most important distinctions for this topic, and it aligns directly with the Domain 1 source of truth.

Common mistakes software engineers make when entering machine software

The common mistakes are:

  • treating state like a UI label instead of a control contract
  • assuming command issued means action completed
  • mixing machine state and workflow step
  • allowing multiple components to mutate state directly
  • hiding recovery inside miscellaneous code instead of modeling it explicitly
  • using boolean explosion instead of authoritative state models

What strong engineers understand about hierarchical state design

Strong engineers understand that real machines are layered systems.

They know:

  • machine-wide state is not enough by itself
  • subsystems need local states with clear ownership
  • faults usually originate at lower layers and propagate upward
  • workflow context, machine state, and subsystem state must stay distinct but coordinated
  • observable transitions are essential for debugging and safe recovery

A concise senior-level summary

If you need a final concise explanation for work or interviews, I would say this:

In industrial machine software, explicit state modeling is critical because machine behavior is asynchronous, physical, interruptible, and safety-sensitive. A good design separates machine operational state from workflow step, defines legal transitions explicitly, uses hierarchical state ownership across machine and subsystems, and ensures all meaningful state changes are observable and traceable. That is what keeps the system deterministic, diagnosable, and safe.


Closing summary

State machines in machine control are not academic formalism. They are one of the core tools for making real industrial software reliable.

They help you answer, at all times:

  • What condition is the machine in?
  • What is allowed now?
  • What triggered this change?
  • Which layer owns this state?
  • How do faults and recovery behave?
  • Can operators and engineers trust what the system says?

When those answers are explicit, the machine becomes much easier to reason about, debug, and evolve.

When they are implicit, the codebase usually becomes fragile very quickly.

This topic is directly aligned with Domain 1’s definition of “State Machines for Machine Control,” including machine states vs workflow steps, state transitions, and hierarchical state design.
