
State Management at System Level

This topic sits right in the middle of industrial software architecture. A real machine is never in just one state. It is in many states at once.

The motion subsystem has a state. The vision subsystem has a state. The safety subsystem has a state. The current workflow has a state. The operator mode has a state. The machine as a whole has a state that people see on the HMI.

The hard part is not creating those states. The hard part is keeping them consistent enough that the software makes correct decisions and the operator sees something trustworthy.

In architectural terms, this is where stateful components, orchestrator patterns, separation of UI/workflow/device logic, and long-running behavior all meet. The source-of-truth roadmap makes this explicit through topics such as stateful vs. stateless components, orchestrator patterns, device manager patterns, session/run models, and error propagation strategy.


PART 1 — WHY SYSTEM-LEVEL STATE IS HARD

In business software, state inconsistency is often annoying. In machine software, state inconsistency can be dangerous.

A machine is composed of multiple subsystems that operate semi-independently:

  • motion
  • vision
  • IO
  • safety
  • recipe/configuration
  • workflow/orchestration

Each subsystem has its own local truth. But the machine must still present a coherent overall truth.

For example:

  • motion says: “axes are stopped”
  • vision says: “camera not ready”
  • safety says: “door open”
  • workflow says: “inspection step active”
  • HMI says: “Running”

That is already a broken system.

The reason this is hard is that the machine is not a single-threaded object graph where everything updates instantly. It is a long-lived, asynchronous, multi-source system. Industrial architecture is explicitly stateful, event-driven, long-running, and highly concurrent, which is why stale status, race conditions, and inconsistent behavior are recurring architectural risks.

A few classic examples:

Example 1: Machine shows “Running” but one subsystem is faulted

The workflow engine may still believe the current run is active, while the motion controller has already faulted and stopped. If the top-level machine state is derived badly, the system reports “Running” because the workflow has not yet transitioned.

Example 2: UI shows “Ready” but a device is not actually ready

The device reconnect process may still be in progress, but the UI is reading cached readiness from a previous cycle.

Example 3: System allows a command based on mixed-time data

The command gate checks “no active alarm” from one store and “axes homed” from another store, but one of them is 300 ms stale. The command becomes logically valid in software and physically invalid in reality.
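One way to guard against mixed-time data is to carry a publication time with every value and let the command gate reject inputs that are too old. The sketch below is a minimal illustration, not a reference implementation: `Reading`, `can_move`, and the 100 ms budget in `MAX_AGE_S` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """A state value plus the monotonic time (seconds) at which its owner published it."""
    value: bool
    published_at: float


MAX_AGE_S = 0.1  # assumed freshness budget; a real limit depends on the machine


def can_move(no_active_alarm: Reading, axes_homed: Reading, now: float) -> tuple[bool, str]:
    """Allow the command only if both inputs are true and recent enough."""
    inputs = (("no_active_alarm", no_active_alarm), ("axes_homed", axes_homed))
    for name, reading in inputs:
        age = now - reading.published_at
        if age > MAX_AGE_S:
            return False, f"{name} is stale"
        if not reading.value:
            return False, f"{name} is false"
    return True, "ok"


# "axes homed" was published 300 ms ago, so the gate refuses the command
# even though both values are True.
allowed, reason = can_move(Reading(True, 10.0), Reading(True, 9.7), now=10.0)
```

The point of the design is that "true but old" is treated as a separate failure from "false", so the rejection reason tells the engineer which input lagged.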

That is why strong industrial systems do not treat system state as a cosmetic status label. They treat it as a control surface.


PART 2 — TYPES OF STATE IN MACHINE SYSTEMS

There are several layers of state, and they are not interchangeable.

1. Machine-level state

This is the overall state presented to the rest of the system and to the operator.

Examples:

  • Initializing
  • Ready
  • Running
  • Paused
  • Faulted
  • Stopping
  • Maintenance

This is usually derived or aggregated, not directly observed.

2. Subsystem-level state

Each subsystem has its own operational state.

Examples:

  • Motion: NotHomed / Homing / Ready / Moving / Error
  • Vision: Offline / Initializing / Armed / Acquiring / Error
  • Safety: Safe / NotSafe / EStop / GuardOpen
  • IO: Connected / Degraded / Faulted

3. Device-level state

This is the state of actual hardware endpoints or adapters.

Examples:

  • camera connected/disconnected
  • PLC heartbeat alive/missed
  • light controller initialized/not initialized
  • axis drive enabled/disabled
  • encoder feedback valid/invalid

4. Workflow state

This describes process execution, not hardware readiness.

Examples:

  • Idle
  • LoadingRecipe
  • Aligning
  • Inspecting
  • Unloading
  • Recovering
  • Aborting

5. Transient state

Short-lived state that exists during transitions.

Examples:

  • Starting
  • Stopping
  • Reconnecting
  • ClearingAlarm
  • ApplyingRecipe
  • WaitingForTrigger

6. Persistent state

State that survives restart or must be restored.

Examples:

  • current recipe
  • calibration data
  • last known lot/run context
  • latched alarms
  • service/maintenance counters

These categories map directly to the broader architecture domains in the roadmap: session/run/lot execution models, device health monitoring, alarm handling, long-lived process architecture, configuration architecture, and machine history.

State relationship diagram

+------------------------------------------------------+
|                  Machine-Level State                 |
|         Ready / Running / Paused / Faulted           |
+--------------------------+---------------------------+
                           |
         +-----------------+-----------------+
         |                 |                 |
         v                 v                 v
+----------------+ +----------------+ +----------------+
| Motion State   | | Vision State   | | Safety State   |
| Ready/Moving   | | Armed/Error    | | Safe/EStop     |
+--------+-------+ +--------+-------+ +--------+-------+
         |                  |                  |
         v                  v                  v
+----------------+ +----------------+ +----------------+
| Axis/Drive     | | Camera/Light   | | Safety IO/PLC  |
| Device State   | | Device State   | | Device State   |
+----------------+ +----------------+ +----------------+

                           +
                           |
                           v

+------------------------------------------------------+
|                  Workflow State                      |
| Idle / Aligning / Inspecting / Recovering / Aborting |
+------------------------------------------------------+

How to read this diagram

The key point is that machine state is not the same thing as subsystem state, and subsystem state is not the same thing as workflow state.

A machine may legitimately report:

  • workflow = Inspecting
  • motion = Stopped
  • vision = Error
  • safety = Safe
  • machine = Faulted

That is a perfectly valid combination.

A weak design collapses all of these into one enum or one “current status” string. A strong design keeps them distinct, then defines clear rules for how they relate.
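Keeping the layers distinct can be as simple as giving each one its own type. The sketch below assumes Python enums whose members follow the example states listed earlier; `SystemSnapshot` is a hypothetical container name, and the combination shown is similar to the one above (with motion at Ready, since the example Motion states have no "Stopped").

```python
from dataclasses import dataclass
from enum import Enum

# One enum per layer, so a workflow value can never be assigned to a motion field.
MotionState = Enum("MotionState", "NOT_HOMED HOMING READY MOVING ERROR")
VisionState = Enum("VisionState", "OFFLINE INITIALIZING ARMED ACQUIRING ERROR")
SafetyState = Enum("SafetyState", "SAFE NOT_SAFE ESTOP GUARD_OPEN")
WorkflowState = Enum(
    "WorkflowState",
    "IDLE LOADING_RECIPE ALIGNING INSPECTING UNLOADING RECOVERING ABORTING",
)
MachineState = Enum(
    "MachineState",
    "INITIALIZING READY RUNNING PAUSED FAULTED STOPPING MAINTENANCE",
)


@dataclass(frozen=True)
class SystemSnapshot:
    """All layers side by side; none of them is collapsed into another."""
    motion: MotionState
    vision: VisionState
    safety: SafetyState
    workflow: WorkflowState
    machine: MachineState


snapshot = SystemSnapshot(
    motion=MotionState.READY,
    vision=VisionState.ERROR,
    safety=SafetyState.SAFE,
    workflow=WorkflowState.INSPECTING,
    machine=MachineState.FAULTED,
)
```

Because the snapshot is frozen, an observer holding it cannot patch one layer in place; a new snapshot has to be produced by whoever owns the change.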


PART 3 — STATE OWNERSHIP

A piece of state is safe only when it has a clear owner.

Ownership answers three questions:

  • who updates it
  • who reads it
  • who is authoritative

Without ownership, state becomes gossip.

Good ownership examples

Device health state: owned by the device adapter or device manager.

Examples:

  • camera online/offline
  • controller initialized/not initialized
  • heartbeat good/bad

The UI may display it. The orchestrator may react to it. But neither should invent or overwrite it.

Workflow state: owned by the orchestration layer.

Examples:

  • current sequence step
  • inspection run phase
  • pause requested
  • abort in progress

A device should not decide that the workflow is “Inspecting.” It can only report facts that the orchestrator uses.

Machine-level state: usually owned by a machine state aggregator or machine controller layer.

This layer consumes authoritative subsystem/workflow state and publishes the coherent top-level machine state.

Why multiple writers are dangerous

Suppose both the UI layer and the orchestration layer can set MachineState = Ready.

Now imagine:

  • workflow sets Ready because sequence ended
  • safety later detects guard open
  • UI has not yet processed safety event
  • operator still sees Ready
  • command button stays enabled

This is how accidental multiple writers create unsafe or confusing behavior.

A practical rule

For every important state field, ask:

  • Who owns this?
  • Who is allowed to change it?
  • Who only observes it?
  • What other state is derived from it?

If you cannot answer quickly, the design is already weak.
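The single-writer rule can be made mechanical rather than left as a convention. The following is a deliberately small sketch: `OwnedState` is a hypothetical class, and identifying writers by a string name is a simplification. In a real system the setter would more likely be handed only to the owning component.

```python
class OwnedState:
    """Holds one state value that only its registered owner may change."""

    def __init__(self, name: str, owner: str, initial):
        self.name = name
        self._owner = owner
        self._value = initial

    @property
    def value(self):
        # Anyone may observe the value.
        return self._value

    def set(self, writer: str, new_value) -> None:
        # Only the authoritative owner may update it.
        if writer != self._owner:
            raise PermissionError(
                f"{writer} may not write {self.name}; owner is {self._owner}"
            )
        self._value = new_value


camera_online = OwnedState("camera_online", owner="vision_adapter", initial=True)
camera_online.set("vision_adapter", False)   # the owner reports a disconnect

try:
    camera_online.set("ui", True)            # the UI tries to overwrite it
except PermissionError as error:
    rejected = str(error)
```

An accidental second writer now fails loudly at the point of the write instead of surfacing later as a confusing status.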


PART 4 — STATE PROPAGATION

State does not just exist. It moves.

In most machine systems, the propagation path is something like:

  • device changes
  • subsystem updates local state
  • application/orchestrator reacts
  • machine-level state is recomputed
  • UI receives updated projection

Flow diagram

+-------------+      +----------------+      +------------------+
| Device/SDK  | ---> | Subsystem      | ---> | Application /    |
| or PLC      |      | State Owner    |      | Orchestrator     |
+-------------+      +----------------+      +------------------+
        |                     |                        |
        | raw fact            | authoritative update   | derived state
        v                     v                        v
  "camera lost"        VisionState=Error        MachineState=Faulted
                                                     |
                                                     v
                                             +------------------+
                                             | UI / HMI / Logs  |
                                             +------------------+

Why ordering matters

Now consider two events:

  1. motion stopped
  2. motion faulted

If the system processes them out of order, the machine might temporarily compute:

  • axes stopped
  • no fault
  • workflow still active

and report “Paused” or “Ready” before moving to “Faulted.”

That temporary inconsistency may only last 100 ms, but in industrial software that is long enough for:

  • UI flicker
  • incorrect command enablement
  • wrong log sequence
  • a bad automatic recovery decision

Why delay matters

State propagation delay is not just a UI issue.

If an orchestrator consumes delayed state, it can execute a command using yesterday’s truth in today’s situation.

Common delay sources:

  • async event queues
  • polling intervals
  • cross-thread marshaling
  • vendor SDK callbacks arriving late
  • batched update mechanisms
  • lock contention

This is why event-driven design and concurrency design are tightly coupled with state design in industrial systems. The roadmap explicitly groups event-driven models, queues, polling, producer-consumer pipelines, synchronization, and race conditions as architectural concerns because they directly shape state propagation correctness.


PART 5 — CONSISTENCY PROBLEMS

This is where real systems get ugly.

1. Stale state

A component is reading a state snapshot that is no longer current.

Example

The UI still shows the camera as ready because the disconnect event has not been applied yet.

Why it happens

  • polling delay
  • queue lag
  • cached projections
  • delayed marshaling to UI thread

2. Conflicting state

Two state sources claim incompatible truths.

Example

  • workflow says “Running”
  • safety says “EStop active”
  • machine summary says “Running”

The summary logic is wrong, or one state source was not included.
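A summary rule that cannot report “Running” during an emergency stop has to look at every source and apply a fixed precedence. The sketch below is one possible shape, with assumptions labeled: `summarize` is a hypothetical function, mapping any unsafe safety state to "Faulted" is a design choice for this example, and the final "Ready" branch is simplified (a full rule would also check homing, recipe, and alarms).

```python
from typing import Dict


def summarize(safety: str, subsystems: Dict[str, str], workflow: str) -> str:
    """Derive the machine summary; rules are checked in precedence order."""
    # 1. Safety overrides everything else (assumed mapping to "Faulted").
    if safety != "Safe":
        return "Faulted"
    # 2. Any subsystem in an error-like state blocks "Running".
    if any(state in ("Error", "Faulted") for state in subsystems.values()):
        return "Faulted"
    # 3. Only then does workflow progress decide the summary.
    if workflow == "Aborting":
        return "Stopping"
    if workflow != "Idle":
        return "Running"
    return "Ready"  # simplified; see the lead-in


# Workflow still believes it is running, but safety reports an emergency stop.
summary = summarize("EStop", {"motion": "Ready", "vision": "Armed"}, "Inspecting")
```

Because safety is evaluated first, the conflicting combination in the example above resolves to a fault no matter what the workflow layer still believes.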

3. Partial updates

A broader state transition requires multiple fields to change, but only some are updated before observers react.

Example

During abort:

  • WorkflowState = Aborting
  • CanStart is accidentally still true
  • CurrentRun not yet cleared

A command gate reads halfway through the update and enables Start too early.
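A common remedy for partial updates is to group related fields into one immutable snapshot and swap the whole snapshot at once. The sketch below assumes Python dataclasses and a lock; `RunState` and `RunStateStore` are hypothetical names, and the field set is reduced to the three fields from the example.

```python
import threading
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunState:
    workflow: str
    can_start: bool
    current_run: Optional[str]


class RunStateStore:
    """Readers always see a complete snapshot, never a half-applied transition."""

    def __init__(self, initial: RunState) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def read(self) -> RunState:
        with self._lock:
            return self._current

    def update(self, **changes) -> RunState:
        # Build the new snapshot from the old one and publish it in one step.
        with self._lock:
            self._current = replace(self._current, **changes)
            return self._current


store = RunStateStore(RunState(workflow="Inspecting", can_start=False, current_run="run-42"))
before = store.read()
after = store.update(workflow="Aborting", can_start=False)
```

A command gate that calls `read()` gets either the state before the abort began or the state after it, but never a mixture of the two.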

4. Out-of-order updates

Events arrive or are applied in the wrong order.

Example

  • device emits Recovered
  • delayed Faulted arrives later
  • state ends in Faulted even though device is healthy now
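One way to defend against late events is to stamp each event with a sequence number at its source and ignore anything older than what has already been applied. This sketch assumes the device adapter can provide such a monotonically increasing number; `DeviceHealth` and `apply` are hypothetical names.

```python
class DeviceHealth:
    """Applies health events only if they are newer than the last one applied."""

    def __init__(self) -> None:
        self.state = "Unknown"
        self.last_seq = -1

    def apply(self, seq: int, state: str) -> bool:
        if seq <= self.last_seq:
            return False  # late or duplicate event: do not overwrite newer state
        self.last_seq = seq
        self.state = state
        return True


health = DeviceHealth()
health.apply(2, "Recovered")              # arrives first
applied_late = health.apply(1, "Faulted") # the older event arrives afterwards
```

The rejected event should normally still be written to the diagnostic log, because the fault did happen; it just must not become the current state.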

5. Race conditions

Two concurrent operations both update related state without a defined sequencing model.

Example

  • operator presses Pause
  • watchdog triggers Fault
  • sequence completion event arrives
  • final state depends on timing, not rules
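Making the outcome depend on rules rather than timing usually means defining transition precedence explicitly. The sketch below is a minimal illustration: the ranking in `PRECEDENCE` and the mapping in `RESULTING_STATE` are assumptions made for the example, and choosing them is a design decision for the actual machine.

```python
from typing import List

# Assumed precedence: a higher number wins when events compete.
PRECEDENCE = {"Fault": 3, "Pause": 2, "SequenceCompleted": 1}
RESULTING_STATE = {"Fault": "Faulted", "Pause": "Paused", "SequenceCompleted": "Ready"}


def resolve(pending_events: List[str]) -> str:
    """Pick the resulting state by rule, independent of arrival order."""
    winner = max(pending_events, key=lambda event: PRECEDENCE[event])
    return RESULTING_STATE[winner]


# The same three events in two different arrival orders give the same result.
first = resolve(["Pause", "Fault", "SequenceCompleted"])
second = resolve(["SequenceCompleted", "Pause", "Fault"])
```

This only works if competing events are collected and resolved in one place, for example by a single orchestrator loop, instead of each handler writing state directly.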

Consistency diagram

Time --->

Event Stream:
  E1: WorkflowStarted
  E2: VisionFault
  E3: UIRefresh
  E4: MachineStateRecomputed

Bad ordering:
  E1 ------> E3 ------> E4 ------> E2

Observed result:
  UI shows "Running"
  even though vision is already faulted in reality

Why asynchronous systems make this hard

Because there is no single “now.”

Different parts of the system observe different versions of reality at slightly different times. A strong architecture accepts that and creates explicit rules for:

  • event ordering
  • update atomicity
  • state recomputation
  • transition precedence
  • observer consistency expectations

PART 6 — DERIVED STATE VS SOURCE STATE

This distinction is one of the most important in real machine software.

Source state

This is state that comes directly from an authoritative owner.

Examples:

  • axis homed = true
  • safety guard open = false
  • camera connected = true
  • workflow current step = Inspecting
  • alarm list contains MotionFault

Derived state

This is state computed from source state.

Examples:

  • machine ready
  • machine can start
  • machine is faulted
  • UI command enablement
  • alarm banner severity
  • production availability

Example: “Machine Ready”

“Ready” is almost never a source state. It is derived.

It may be computed from rules like:

  • safety is safe
  • no active blocking alarms
  • all required devices initialized
  • motion homed
  • no workflow running
  • recipe valid
  • no pending recovery

In pseudocode, a simplified version of the rule might look like this:

MachineReady =
    SafetyState == Safe
AND NoBlockingAlarms
AND MotionState == Ready
AND VisionState == Ready
AND WorkflowState == Idle
AND RecipeState == Valid

The exact formula varies, but the principle does not.
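The rule above can be sketched as a single function that also explains its answer. This is an illustration under assumptions: `ReadinessInputs` and `machine_ready` are hypothetical names, states are plain strings, and the conditions mirror the simplified formula rather than any particular machine.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReadinessInputs:
    safety_state: str
    blocking_alarms: Tuple[str, ...]
    motion_state: str
    vision_state: str
    workflow_state: str
    recipe_state: str


def machine_ready(inputs: ReadinessInputs) -> Tuple[bool, List[str]]:
    """Return (ready, blockers) so callers can see why Ready is false."""
    checks = [
        (inputs.safety_state == "Safe", "safety is not Safe"),
        (not inputs.blocking_alarms, "blocking alarms are active"),
        (inputs.motion_state == "Ready", "motion is not Ready"),
        (inputs.vision_state == "Ready", "vision is not Ready"),
        (inputs.workflow_state == "Idle", "a workflow is active"),
        (inputs.recipe_state == "Valid", "recipe is not Valid"),
    ]
    blockers = [reason for ok, reason in checks if not ok]
    return not blockers, blockers


ready, blockers = machine_ready(
    ReadinessInputs("Safe", ("MotionFault",), "Ready", "Error", "Idle", "Valid")
)
```

Returning the blockers alongside the boolean gives the HMI and the logs a direct answer to "Why is Start disabled?" from the same place that computes the rule.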

Risks of derived state

Risk 1: hidden derivation rules. If “Ready” is computed in five places, each place drifts.

Risk 2: missing dependencies. If Ready ignores safety or active alarms, it becomes dangerous.

Risk 3: invalid timing assumptions. If derived state is recomputed from stale source state, the result is technically correct but operationally false.

Strong practice

Keep source state explicit. Keep derived state rules centralized. Treat derived state as a projection, not as an independent truth.


PART 7 — REAL-WORLD FAILURE SCENARIOS

Scenario 1 — System shows incorrect overall state

What it looks like

The HMI banner says “Ready,” but the machine refuses to start, or a subsystem panel shows an error.

Why it happens

  • machine summary omitted one subsystem
  • summary derived from stale cache
  • “ready” rule duplicated in multiple places

How engineers debug it

  • inspect raw subsystem states at the same timestamp
  • compare machine-state derivation inputs
  • trace when summary was last recomputed
  • verify which component owns the displayed status

Scenario 2 — Subsystem failure not reflected in machine state

What it looks like

The camera disconnects, but the machine remains “Running” or “Ready.”

Why it happens

  • failure event did not propagate
  • subsystem state updated, machine aggregator did not subscribe
  • fault classified as non-blocking when it should be blocking

How engineers debug it

  • reconstruct event timeline
  • inspect subscription path from device to machine aggregator
  • verify fault severity and aggregation rules
  • check whether event was dropped, delayed, or filtered

Scenario 3 — Race condition causes temporary invalid state

What it looks like

During Stop or Abort, the UI briefly re-enables Start or Manual Move.

Why it happens

  • partial update sequence
  • independent observers reacting to different fields
  • transition flags and command guards not updated atomically

How engineers debug it

  • add timestamped state-transition logs
  • capture full state snapshot before and after each transition
  • look for non-atomic updates across related fields
  • reproduce under load or with artificial delays
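Timestamped transition logs with full snapshots are easier to add when there is one helper for recording them. The sketch below is a minimal in-memory version: `TransitionLog` and its field names are hypothetical, and `time.monotonic()` only orders events within one process, so a wall-clock timestamp would be needed to correlate with other systems.

```python
import time
from typing import Any, Dict, List


class TransitionLog:
    """Records who changed what, why, and the full state afterwards."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def record(self, owner: str, cause: str,
               before: Dict[str, Any], after: Dict[str, Any]) -> None:
        # Keep only the fields that actually changed, as (old, new) pairs.
        changed = {
            key: (before.get(key), after.get(key))
            for key in sorted(set(before) | set(after))
            if before.get(key) != after.get(key)
        }
        self.entries.append({
            "t_monotonic": time.monotonic(),
            "owner": owner,
            "cause": cause,
            "changed": changed,
            "snapshot": dict(after),  # copy, so later mutation cannot rewrite history
        })


log = TransitionLog()
log.record(
    owner="orchestrator",
    cause="operator abort",
    before={"workflow": "Inspecting", "can_start": False},
    after={"workflow": "Aborting", "can_start": False},
)
```

With the changed fields and the full snapshot in one entry, a non-atomic update shows up as two consecutive entries where one would be expected.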

Scenario 4 — UI reacts to outdated state

What it looks like

The operator clicks Start because the button remained enabled for half a second after a fault.

Why it happens

  • UI projection lagged behind core state
  • UI bound to cached view-model field rather than authoritative projection
  • command gate implemented locally in UI, not centrally in core logic

How engineers debug it

  • compare UI projection timestamp to core state timestamp
  • inspect where enablement logic lives
  • verify that core command handler re-validates state, not just UI button status
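The last check in the list, re-validation in the core command handler, can be sketched as follows. The names `MachineCore`, `handle_start`, and `StartRejected` are hypothetical, and the single "Ready" check stands in for whatever central readiness rule the system defines.

```python
class StartRejected(Exception):
    """Raised when Start is requested while core state does not allow it."""


class MachineCore:
    def __init__(self) -> None:
        # Written only by core logic, never by the UI.
        self.machine_state = "Ready"

    def handle_start(self) -> None:
        # Re-check at execution time: an enabled button is not evidence of readiness.
        if self.machine_state != "Ready":
            raise StartRejected(f"cannot start while machine is {self.machine_state}")
        self.machine_state = "Running"


core = MachineCore()
core.machine_state = "Faulted"   # a fault lands before the UI has refreshed

try:
    core.handle_start()          # the operator clicks the still-enabled button
    outcome = "started"
except StartRejected as error:
    outcome = str(error)
```

The UI may still lag for a moment, but the lag can no longer turn into an executed command.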

Scenario 5 — Inconsistent state leads to wrong command execution

What it looks like

The system accepts a transition into inspection even though one device is reconnecting.

Why it happens

  • readiness derived too loosely
  • one subsystem reports “last known good” instead of current unavailable state
  • command handler trusted derived state without validating critical source conditions

How engineers debug it

  • replay exact pre-command state snapshot
  • inspect authoritative owners for each readiness input
  • verify precedence of reconnecting/error/not-ready conditions
  • test with deliberate event delay injection

PART 8 — SOFTWARE DESIGN IMPLICATIONS

This is where architecture either saves you or kills you.

Why state must be modeled explicitly

In industrial systems, state is not a side effect. It is part of the design.

If you do not model it explicitly, it still exists, but now it is fragmented across:

  • booleans
  • view models
  • service fields
  • SDK callbacks
  • ad hoc caches
  • command guards
  • log messages

That is when nobody can answer simple questions like:

  • Is the machine actually ready?
  • Who decided it was faulted?
  • Why is Start disabled?
  • Which state change happened first?

Bad approach

UI ViewModel sets:
  IsReady
  IsRunning
  CanStart

Workflow service sets:
  CurrentMode
  IsBusy

Device manager sets:
  IsConnected
  ErrorText

Some command handler checks all of them directly.

Problems:

  • many writers
  • no authoritative ownership
  • mixed source and derived state
  • impossible to reason about ordering
  • UI accidentally becomes part of control logic

Good approach

+------------------------+
| Device State Owners    |
| Motion / Vision / IO   |
+-----------+------------+
            |
            v
+------------------------+
| Workflow State Owner   |
| Orchestrator           |
+-----------+------------+
            |
            v
+------------------------+
| Machine State Model    |
| Aggregation + Rules    |
+-----------+------------+
            |
            +-------> UI Projections
            |
            +-------> Command Guards
            |
            +-------> Logs / Diagnostics

Component diagram explanation

A stronger design typically has:

  • authoritative local state owners for each subsystem
  • an orchestrator that owns process/workflow state
  • a machine state model that derives top-level state from authoritative inputs
  • read-only projections for UI and diagnostics
  • command guards that validate against core state, not UI copies

Design principles that matter

1. Clear ownership. Every state field has one authoritative writer.

2. Controlled updates. Do not let random services patch machine status.

3. Consistent propagation. Define how state moves through the system and what ordering guarantees exist.

4. Separation of source and derived state. Do not let derived summaries behave like independent truths.

5. Centralized derivation rules. Compute things like Ready, Faulted, CanStart, CanHome in one place.

6. Snapshot-oriented diagnostics. Log state transitions with enough context to reconstruct what happened.

These principles line up with the roadmap’s emphasis on layered architecture, stateful components, orchestrator patterns, error propagation strategy, device health monitoring, fault handling, and root-cause-friendly observability.


PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

Here is how I would explain it in an interview or architecture discussion.

How to explain system-level state clearly

“System-level state in industrial machine software is not just one enum for the whole machine. It is the coordinated result of multiple state layers: device state, subsystem state, workflow state, and machine-level derived state. The challenge is maintaining consistency across asynchronous components so the system makes correct decisions and the operator sees trustworthy status.”

Why consistency is difficult

“Consistency is difficult because state comes from multiple sources at different times. Devices report asynchronously, workflows transition independently, and UI projections lag behind core logic. The real problem is not storing state; it is ownership, propagation, ordering, and derivation.”

Common mistakes engineers make

  • mixing source state and derived state
  • allowing multiple writers for the same state
  • duplicating readiness logic in UI, workflow, and services
  • using cached status for control decisions
  • not modeling transient states like Starting, Stopping, Recovering
  • treating machine state as presentation only instead of control logic

What strong engineers understand

Strong engineers understand that:

  • every important state needs a clear owner
  • top-level machine state is usually derived, not directly observed
  • asynchronous propagation means temporary inconsistency must be expected and controlled
  • command decisions should validate against authoritative core state
  • diagnostics must capture state transitions well enough to reconstruct failures
  • correctness matters more than convenience when state influences motion, alarms, or recovery

A concise interview-quality summary

“A good industrial state model separates local truth from global truth. Devices own device facts. Orchestration owns workflow progress. The machine layer derives overall status from authoritative inputs. UI observes projections rather than inventing state. The hard part is managing propagation, ordering, and consistency so the software never acts on an invalid picture of the machine.”


Final mental model

Think of system-level state like this:

  • local state tells you what each part believes about itself
  • system state tells you what the machine is allowed to do
  • derived state tells operators and higher-level logic what the overall condition means
  • good architecture makes those relationships explicit
  • bad architecture lets state leak everywhere until nobody trusts it

In industrial software, once nobody trusts the state model, everything gets worse:

  • operators stop trusting the UI
  • engineers add more ad hoc checks
  • workflows become defensive and tangled
  • recovery becomes unpredictable
  • debugging becomes timeline archaeology

That is why system-level state is a first-class architectural problem, not a detail.

This topic fits naturally under Industrial Software Architecture in the source-of-truth roadmap, and it connects closely to the concurrency, reliability, and observability concerns elsewhere in that roadmap.
