State Management at System Level

This topic sits right in the middle of industrial software architecture. A real machine is never in just one state. It is in many states at once.

The motion subsystem has a state. The vision subsystem has a state. The safety subsystem has a state. The current workflow has a state. The operator mode has a state. The machine as a whole has a state that people see on the HMI.

The hard part is not creating those states. The hard part is keeping them consistent enough that the software makes correct decisions and the operator sees something trustworthy.

At this layer, stateful components, orchestrator patterns, separation of UI/workflow/device logic, and long-running behavior all matter. The source-of-truth roadmap makes this explicit through topics such as stateful vs stateless components, orchestrator patterns, device manager patterns, session/run models, and error propagation strategy.

PART 1 — WHY SYSTEM-LEVEL STATE IS HARD

In business software, state inconsistency is often annoying. In machine software, state inconsistency can be dangerous.

A machine is composed of multiple subsystems that operate semi-independently:

motion
vision
IO
safety
recipe/configuration
workflow/orchestration

Each subsystem has its own local truth. But the machine must still present a coherent overall truth.

For example:

motion says: “axes are stopped”
vision says: “camera not ready”
safety says: “door open”
workflow says: “inspection step active”
HMI says: “Running”

That is already a broken system.

The reason this is hard is that the machine is not a single-threaded object graph where everything updates instantly. It is a long-lived, asynchronous, multi-source system. Industrial architecture is explicitly stateful, event-driven, long-running, and highly concurrent, which is why stale status, race conditions, and inconsistent behavior are recurring architectural risks.

A few classic examples:

Example 1: Machine shows “Running” but one subsystem is faulted

The workflow engine may still believe the current run is active, while the motion controller has already faulted and stopped. If the top-level machine state is derived badly, the system reports “Running” because the workflow has not yet transitioned.

Example 2: UI shows “Ready” but a device is not actually ready

The device reconnect process may still be in progress, but the UI is reading cached readiness from a previous cycle.

Example 3: System allows a command based on mixed-time data

The command gate checks “no active alarm” from one store and “axes homed” from another store, but one of them is 300 ms stale. The command becomes logically valid in software and physically invalid in reality.

That is why strong industrial systems do not treat system state as a cosmetic status label. They treat it as a control surface.
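One way to guard against the mixed-time problem in Example 3 is to carry an observation timestamp with every input and refuse to decide when any input is too old. The sketch below is only an illustration: the `Reading` type, the `can_move` gate, and the 100 ms freshness budget are assumptions, not something the roadmap prescribes.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Reading:
    """A source-state value plus the monotonic time it was observed."""
    value: bool
    observed_at: float  # seconds, from time.monotonic()


MAX_AGE_S = 0.1  # hypothetical freshness budget; tune per machine


def is_fresh(reading: Reading, now: float, max_age_s: float = MAX_AGE_S) -> bool:
    return (now - reading.observed_at) <= max_age_s


def can_move(no_active_alarm: Reading, axes_homed: Reading, now: float) -> bool:
    """Allow the command only if both inputs are true and recent enough."""
    inputs = (no_active_alarm, axes_homed)
    if not all(is_fresh(r, now) for r in inputs):
        return False  # refuse to decide on mixed-time data
    return all(r.value for r in inputs)


now = time.monotonic()
fresh_alarm = Reading(value=True, observed_at=now - 0.01)
stale_homed = Reading(value=True, observed_at=now - 0.3)  # 300 ms old
fresh_homed = Reading(value=True, observed_at=now - 0.02)

print(can_move(fresh_alarm, stale_homed, now))  # stale input blocks the command
print(can_move(fresh_alarm, fresh_homed, now))  # both inputs fresh and true
```

Rejecting on staleness is the conservative choice here; a real gate might instead request a fresh read before deciding.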

PART 2 — TYPES OF STATE IN MACHINE SYSTEMS

There are several layers of state, and they are not interchangeable.

1. Machine-level state

This is the overall state presented to the rest of the system and to the operator.

Examples:

Initializing
Ready
Running
Paused
Faulted
Stopping
Maintenance

This is usually derived or aggregated, not directly observed.

2. Subsystem-level state

Each subsystem has its own operational state.

Examples:

Motion: NotHomed / Homing / Ready / Moving / Error
Vision: Offline / Initializing / Armed / Acquiring / Error
Safety: Safe / NotSafe / EStop / GuardOpen
IO: Connected / Degraded / Faulted

3. Device-level state

This is the state of actual hardware endpoints or adapters.

Examples:

camera connected/disconnected
PLC heartbeat alive/missed
light controller initialized/not initialized
axis drive enabled/disabled
encoder feedback valid/invalid

4. Workflow state

This describes process execution, not hardware readiness.

Examples:

Idle
LoadingRecipe
Aligning
Inspecting
Unloading
Recovering
Aborting

5. Transient state

Short-lived state that exists during transitions.

Examples:

Starting
Stopping
Reconnecting
ClearingAlarm
ApplyingRecipe
WaitingForTrigger

6. Persistent state

State that survives restart or must be restored.

Examples:

current recipe
calibration data
last known lot/run context
latched alarms
service/maintenance counters

These categories map directly to the broader architecture domains in the roadmap: session/run/lot execution models, device health monitoring, alarm handling, long-lived process architecture, configuration architecture, and machine history.

State relationship diagram

+------------------------------------------------------+
|                  Machine-Level State                 |
|         Ready / Running / Paused / Faulted           |
+--------------------------+---------------------------+
                           |
         +-----------------+-----------------+
         |                 |                 |
         v                 v                 v
+----------------+ +----------------+ +----------------+
| Motion State   | | Vision State   | | Safety State   |
| Ready/Moving   | | Armed/Error    | | Safe/EStop     |
+--------+-------+ +--------+-------+ +--------+-------+
         |                  |                  |
         v                  v                  v
+----------------+ +----------------+ +----------------+
| Axis/Drive     | | Camera/Light   | | Safety IO/PLC  |
| Device State   | | Device State   | | Device State   |
+----------------+ +----------------+ +----------------+

                           +
                           |
                           v

+------------------------------------------------------+
|                  Workflow State                      |
| Idle / Aligning / Inspecting / Recovering / Abort    |
+------------------------------------------------------+

How to read this diagram

The key point is that machine state is not the same thing as subsystem state, and subsystem state is not the same thing as workflow state.

A machine may be:

workflow = Inspecting
motion = Stopped
vision = Error
safety = Safe
machine = Faulted

That is a perfectly valid combination.

A weak design collapses all of these into one enum or one “current status” string. A strong design keeps them distinct, then defines clear rules for how they relate.
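A minimal sketch of keeping the layers distinct and relating them by an explicit rule might look like the following. The enum members are a reduced, assumed subset of the states listed above (including a `STOPPED` motion state taken from the example combination), and the precedence inside `derive_machine_state` is one possible rule set, not the only valid one.

```python
from dataclasses import dataclass
from enum import Enum, auto


class WorkflowState(Enum):
    IDLE = auto()
    INSPECTING = auto()


class MotionState(Enum):
    READY = auto()
    MOVING = auto()
    STOPPED = auto()
    ERROR = auto()


class VisionState(Enum):
    ARMED = auto()
    ERROR = auto()


class SafetyState(Enum):
    SAFE = auto()
    ESTOP = auto()


class MachineState(Enum):
    READY = auto()
    RUNNING = auto()
    FAULTED = auto()


@dataclass(frozen=True)
class SystemSnapshot:
    """One consistent view of every layer, taken at a single point in time."""
    workflow: WorkflowState
    motion: MotionState
    vision: VisionState
    safety: SafetyState


def derive_machine_state(s: SystemSnapshot) -> MachineState:
    # Assumed rule: unsafe or faulted subsystems outrank workflow progress.
    if s.safety is not SafetyState.SAFE:
        return MachineState.FAULTED
    if s.motion is MotionState.ERROR or s.vision is VisionState.ERROR:
        return MachineState.FAULTED
    if s.workflow is not WorkflowState.IDLE:
        return MachineState.RUNNING
    return MachineState.READY


# The combination from the text: workflow still Inspecting, vision in Error.
snapshot = SystemSnapshot(
    workflow=WorkflowState.INSPECTING,
    motion=MotionState.STOPPED,
    vision=VisionState.ERROR,
    safety=SafetyState.SAFE,
)
print(derive_machine_state(snapshot))
```

Because each layer has its own type, the rule that turns "Inspecting plus vision Error" into a faulted machine lives in one visible place instead of being implied by a shared status string.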

PART 3 — STATE OWNERSHIP

A piece of state is safe only when it has a clear owner.

Ownership answers three questions:

who updates it
who reads it
who is authoritative

Without ownership, state becomes gossip.

Good ownership examples

Device health state: owned by the device adapter or device manager.

Examples:

camera online/offline
controller initialized/not initialized
heartbeat good/bad

The UI may display it. The orchestrator may react to it. But neither should invent or overwrite it.

Workflow state: owned by the orchestration layer.

Examples:

current sequence step
inspection run phase
pause requested
abort in progress

A device should not decide that the workflow is “Inspecting.” It can only report facts that the orchestrator uses.

Machine-level state: usually owned by a machine state aggregator or machine controller layer.

This layer consumes authoritative subsystem/workflow state and publishes the coherent top-level machine state.

Why multiple writers are dangerous

Suppose both the UI layer and the orchestration layer can set MachineState = Ready.

Now imagine:

workflow sets Ready because sequence ended
safety later detects guard open
UI has not yet processed safety event
operator still sees Ready
command button stays enabled

This is how accidental multiple writers create unsafe or confusing behavior.

A practical rule

For every important state field, ask:

Who owns this?
Who is allowed to change it?
Who only observes it?
What other state is derived from it?

If you cannot answer quickly, the design is already weak.
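The single-writer idea can be made concrete with a small state cell that records its owner and rejects writes from anyone else. This is a sketch under assumptions: `OwnedState` and the owner names are hypothetical, and a string check like this is a runtime convention for catching accidental writers, not an enforcement mechanism.

```python
from typing import Any


class OwnedState:
    """A state cell that only its declared owner may write; anyone may read."""

    def __init__(self, owner: str, initial: Any) -> None:
        self._owner = owner
        self._value = initial

    @property
    def value(self) -> Any:
        return self._value

    def set(self, writer: str, new_value: Any) -> None:
        if writer != self._owner:
            raise PermissionError(
                f"{writer!r} may not write this state; owner is {self._owner!r}"
            )
        self._value = new_value


# The vision adapter owns camera connectivity; the UI only observes it.
camera_online = OwnedState(owner="vision_adapter", initial=False)
camera_online.set("vision_adapter", True)

try:
    camera_online.set("ui", False)  # an accidental second writer
    ui_write_rejected = False
except PermissionError:
    ui_write_rejected = True

print(camera_online.value, ui_write_rejected)
```

In a statically typed codebase the same rule is usually expressed by handing observers a read-only interface, so that a second writer fails at compile time rather than at run time.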

PART 4 — STATE PROPAGATION

State does not just exist. It moves.

In most machine systems, the propagation path is something like:

device changes
subsystem updates local state
application/orchestrator reacts
machine-level state is recomputed
UI receives updated projection

Flow diagram

+-------------+      +----------------+      +------------------+
| Device/SDK  | ---> | Subsystem      | ---> | Application /    |
| or PLC      |      | State Owner    |      | Orchestrator     |
+-------------+      +----------------+      +------------------+
        |                     |                        |
        | raw fact            | authoritative update   | derived state
        v                     v                        v
  "camera lost"        VisionState=Error        MachineState=Faulted
                                                     |
                                                     v
                                             +------------------+
                                             | UI / HMI / Logs  |
                                             +------------------+
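The flow in the diagram can be sketched as a chain of subscriptions: the subsystem owner turns a raw fact into authoritative state, the aggregator derives machine state from it, and the HMI only receives the result. The class names and the synchronous callback style are assumptions made for brevity; a real system would typically add queues and thread marshaling between the stages.

```python
from typing import Callable, List


class VisionSubsystem:
    """Owns VisionState; turns raw device facts into authoritative state."""

    def __init__(self) -> None:
        self.state = "Armed"
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def on_device_fact(self, fact: str) -> None:
        if fact == "camera lost":
            self.state = "Error"
        for listener in self._listeners:
            listener(self.state)


class MachineAggregator:
    """Derives MachineState from subsystem state and publishes it."""

    def __init__(self, vision: VisionSubsystem) -> None:
        self.machine_state = "Ready"
        self._listeners: List[Callable[[str], None]] = []
        vision.subscribe(self._on_vision_state)

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def _on_vision_state(self, vision_state: str) -> None:
        self.machine_state = "Faulted" if vision_state == "Error" else "Ready"
        for listener in self._listeners:
            listener(self.machine_state)


hmi_banner: List[str] = []  # stands in for the UI projection
vision = VisionSubsystem()
machine = MachineAggregator(vision)
machine.subscribe(hmi_banner.append)

vision.on_device_fact("camera lost")
print(vision.state, machine.machine_state, hmi_banner)
```

Note that the HMI never reads the device directly: it sees only what the aggregator publishes, which keeps the derivation in one place.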

Why ordering matters

Now consider two events:

motion stopped
motion faulted

If the system processes them out of order, the machine might temporarily compute:

axes stopped
no fault
workflow still active

and report “Paused” or “Ready” before moving to “Faulted.”

That temporary inconsistency may only last 100 ms, but in industrial software that is long enough for:

UI flicker
incorrect command enablement
wrong log sequence
a bad automatic recovery decision

Why delay matters

State propagation delay is not just a UI issue.

If an orchestrator consumes delayed state, it can execute a command using yesterday’s truth in today’s situation.

Common delay sources:

async event queues
polling intervals
cross-thread marshaling
vendor SDK callbacks arriving late
batched update mechanisms
lock contention

This is why event-driven design and concurrency design are tightly coupled with state design in industrial systems. The roadmap explicitly groups event-driven models, queues, polling, producer-consumer pipelines, synchronization, and race conditions as architectural concerns because they directly shape state propagation correctness.

PART 5 — CONSISTENCY PROBLEMS

This is where real systems get ugly.

1. Stale state

A component is reading a state snapshot that is no longer current.

Example: the UI still shows the camera as ready because the disconnect event has not been applied yet.

Why it happens

polling delay
queue lag
cached projections
delayed marshaling to UI thread

2. Conflicting state

Two state sources claim incompatible truths.

Example

workflow says “Running”
safety says “EStop active”
machine summary says “Running”

The summary logic is wrong, or one state source was not included.

3. Partial updates

A broader state transition requires multiple fields to change, but only some are updated before observers react.

Example: during abort:

WorkflowState = Aborting
CanStart = true (accidentally still true)
CurrentRun not yet cleared

A command gate reads halfway through the update and enables Start too early.
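A common way to avoid half-applied transitions is to keep related fields in one immutable snapshot and publish a whole new snapshot per transition. The sketch below assumes hypothetical names (`WorkflowSnapshot`, `SnapshotStore`, the run id `"run-42"`) and uses a plain lock; it illustrates the idea rather than a specific framework.

```python
import threading
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class WorkflowSnapshot:
    """Related fields that must always change together."""
    workflow_state: str
    can_start: bool
    current_run: Optional[str]


class SnapshotStore:
    """Publishes whole snapshots, so readers never see a half-applied update."""

    def __init__(self, initial: WorkflowSnapshot) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def read(self) -> WorkflowSnapshot:
        with self._lock:
            return self._current

    def update(self, **changes) -> WorkflowSnapshot:
        with self._lock:
            # Build the complete new snapshot first, then swap the reference.
            self._current = replace(self._current, **changes)
            return self._current


store = SnapshotStore(WorkflowSnapshot("Inspecting", False, "run-42"))
before = store.read()

# All three abort-related fields change in a single step.
after = store.update(workflow_state="Aborting", can_start=False, current_run=None)

print(before)
print(after)
```

A reader holding `before` keeps a coherent (if older) view, and a reader calling `store.read()` afterwards gets the complete abort state, never a mixture of the two.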

4. Out-of-order updates

Events arrive or are applied in the wrong order.

Example

device emits Recovered
delayed Faulted arrives later
state ends in Faulted even though device is healthy now
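One simple defense against the late `Faulted` event is a sequence number assigned by the device owner at emission time, with the consumer discarding anything older than what it has already applied. The `DeviceEvent` and `DeviceStateTracker` names are assumptions for this sketch, and it presumes a single emitter per device so that sequence numbers are comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceEvent:
    seq: int    # assigned by the device owner when the event is emitted
    state: str


class DeviceStateTracker:
    """Applies events by emission order, not by arrival order."""

    def __init__(self) -> None:
        self.state = "Unknown"
        self.last_seq = -1

    def apply(self, event: DeviceEvent) -> bool:
        if event.seq <= self.last_seq:
            return False  # older than what we already applied: discard
        self.last_seq = event.seq
        self.state = event.state
        return True


tracker = DeviceStateTracker()
faulted = DeviceEvent(seq=7, state="Faulted")
recovered = DeviceEvent(seq=8, state="Recovered")

# Arrival order is reversed: Recovered first, the delayed Faulted afterwards.
applied_first = tracker.apply(recovered)
applied_late = tracker.apply(faulted)

print(applied_first, applied_late, tracker.state)
```

Discarded events are still worth logging, because a stream of late arrivals is itself a diagnostic signal.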

5. Race conditions

Two concurrent operations both update related state without a defined sequencing model.

Example

operator presses Pause
watchdog triggers Fault
sequence completion event arrives
final state depends on timing, not rules
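To make the outcome depend on rules rather than timing, concurrent requests can be collected and resolved by an explicit precedence table. The ranking below (fault over pause over sequence completion) and the resulting state names are assumptions for illustration; the point is only that every arrival order yields the same answer.

```python
from itertools import permutations
from typing import Iterable

# Hypothetical precedence: a higher number wins when requests collide.
PRECEDENCE = {"Fault": 3, "Pause": 2, "SequenceCompleted": 1}

# State the machine ends in when a given request wins.
RESULT = {"Fault": "Faulted", "Pause": "Paused", "SequenceCompleted": "Ready"}


def resolve(pending: Iterable[str]) -> str:
    """Pick the winning request by rule, independent of arrival order."""
    winner = max(pending, key=PRECEDENCE.__getitem__)
    return RESULT[winner]


requests = ["Pause", "Fault", "SequenceCompleted"]
outcomes = {resolve(order) for order in permutations(requests)}
print(outcomes)  # the same single outcome for every arrival order
```

The sketch does not handle an empty request list; a real arbiter would also need to define how long it waits to collect colliding requests.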

Consistency diagram

Time --->

Event Stream:
  E1: WorkflowStarted
  E2: VisionFault
  E3: UIRefresh
  E4: MachineStateRecomputed

Bad ordering:
  E1 ------> E3 ------> E4 ------> E2

Observed result:
  UI shows "Running"
  even though vision is already faulted in reality

Why asynchronous systems make this hard

Because there is no single “now.”

Different parts of the system observe different versions of reality at slightly different times. A strong architecture accepts that and creates explicit rules for:

event ordering
update atomicity
state recomputation
transition precedence
observer consistency expectations

PART 6 — DERIVED STATE VS SOURCE STATE

This distinction is one of the most important in real machine software.

Source state

This is state that comes directly from an authoritative owner.

Examples:

axis homed = true
safety guard open = false
camera connected = true
workflow current step = Inspecting
alarm list contains MotionFault

Derived state

This is state computed from source state.

Examples:

machine ready
machine can start
machine is faulted
UI command enablement
alarm banner severity
production availability

Example: “Machine Ready”

“Ready” is almost never a source state. It is derived.

It may be computed from rules like:

safety is safe
no active blocking alarms
all required devices initialized
motion homed
no workflow running
recipe valid
no pending recovery

MachineReady =
    SafetyState == Safe
AND NoBlockingAlarms
AND MotionState == Ready
AND VisionState == Ready
AND WorkflowState == Idle
AND RecipeState == Valid

The exact formula varies, but the principle does not.
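As a sketch of keeping that rule in one place, the function below evaluates the same six conditions and also reports which ones failed, which helps answer "why is Start disabled?". The `ReadinessInputs` fields are assumed boolean summaries of the source states, not names from any particular codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReadinessInputs:
    safety_safe: bool
    blocking_alarms: int
    motion_ready: bool
    vision_ready: bool
    workflow_idle: bool
    recipe_valid: bool


def readiness_blockers(i: ReadinessInputs) -> List[str]:
    """Return every reason the machine is not ready (empty list = ready)."""
    checks = [
        (i.safety_safe, "safety not safe"),
        (i.blocking_alarms == 0, "blocking alarm active"),
        (i.motion_ready, "motion not ready"),
        (i.vision_ready, "vision not ready"),
        (i.workflow_idle, "workflow not idle"),
        (i.recipe_valid, "recipe not valid"),
    ]
    return [reason for ok, reason in checks if not ok]


def machine_ready(i: ReadinessInputs) -> bool:
    # The single place where "Ready" is derived.
    return not readiness_blockers(i)


inputs = ReadinessInputs(
    safety_safe=True,
    blocking_alarms=0,
    motion_ready=True,
    vision_ready=False,
    workflow_idle=True,
    recipe_valid=True,
)
print(machine_ready(inputs), readiness_blockers(inputs))
```

The UI, the command guards, and the diagnostics can all call the same function, so the definition of "Ready" cannot drift between them.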

Risks of derived state

Risk 1: hidden derivation rules. If “Ready” is computed in five places, the five copies drift apart.

Risk 2: missing dependencies. If Ready ignores safety or active alarms, it becomes dangerous.

Risk 3: invalid timing assumptions. If derived state is recomputed from stale source state, the result is consistent with its inputs but operationally false.

Strong practice

Keep source state explicit. Keep derived state rules centralized. Treat derived state as a projection, not as an independent truth.

PART 7 — REAL-WORLD FAILURE SCENARIOS

Scenario 1 — System shows incorrect overall state

What it looks like: the HMI banner says “Ready,” but the machine refuses to start, or a subsystem panel shows an error.

Why it happens

machine summary omitted one subsystem
summary derived from stale cache
“ready” rule duplicated in multiple places

How engineers debug it

inspect raw subsystem states at the same timestamp
compare machine-state derivation inputs
trace when summary was last recomputed
verify which component owns the displayed status

Scenario 2 — Subsystem failure not reflected in machine state

What it looks like: the camera disconnects, but the machine remains “Running” or “Ready.”

Why it happens

failure event did not propagate
subsystem state updated, machine aggregator did not subscribe
fault classified as non-blocking when it should be blocking

How engineers debug it

reconstruct event timeline
inspect subscription path from device to machine aggregator
verify fault severity and aggregation rules
check whether event was dropped, delayed, or filtered

Scenario 3 — Race condition causes temporary invalid state

What it looks like: during Stop or Abort, the UI briefly re-enables Start or Manual Move.

Why it happens

partial update sequence
independent observers reacting to different fields
transition flags and command guards not updated atomically

How engineers debug it

add timestamped state-transition logs
capture full state snapshot before and after each transition
look for non-atomic updates across related fields
reproduce under load or with artificial delays
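The first two debugging steps can be sketched as a small transition log that stores a timestamp, the writer, and full before/after snapshots for each change. The `TransitionLog` class and the dictionary-based state are assumptions chosen to keep the example short.

```python
import time
from typing import Any, Dict, List


class TransitionLog:
    """Records each transition with a timestamp and full before/after state."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def record(self, source: str, before: Dict[str, Any], after: Dict[str, Any]) -> None:
        self.entries.append({
            "t": time.monotonic(),      # monotonic: safe for ordering and deltas
            "source": source,           # which component made the change
            "before": dict(before),     # copies, so later mutation cannot alter history
            "after": dict(after),
            "changed": sorted(k for k in after if before.get(k) != after[k]),
        })


log = TransitionLog()
state = {"workflow": "Inspecting", "can_start": False, "manual_move": False}

stopping = {**state, "workflow": "Stopping"}
log.record("orchestrator", state, stopping)

idle = {**stopping, "workflow": "Idle", "can_start": True}
log.record("orchestrator", stopping, idle)

for entry in log.entries:
    print(entry["source"], entry["changed"])
```

With the `changed` field in the log, a transition that flips a command guard without the matching workflow change stands out immediately.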

Scenario 4 — UI reacts to outdated state

What it looks like: the operator clicks Start because the button remained enabled for half a second after a fault.

Why it happens

UI projection lagged behind core state
UI bound to cached view-model field rather than authoritative projection
command gate implemented locally in UI, not centrally in core logic

How engineers debug it

compare UI projection timestamp to core state timestamp
inspect where enablement logic lives
verify that core command handler re-validates state, not just UI button status
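The last point is the one that actually prevents harm: whatever the button showed, the core handler checks current state again before acting. A minimal sketch, assuming a hypothetical `MachineCore` with two flags:

```python
class CommandRejected(Exception):
    """Raised when the core refuses a command in the current state."""


class MachineCore:
    def __init__(self) -> None:
        self.faulted = False
        self.running = False

    def can_start(self) -> bool:
        return not self.faulted and not self.running

    def start(self) -> None:
        # Re-validate against current core state; never trust the caller's view.
        if not self.can_start():
            raise CommandRejected("Start rejected: machine is faulted or already running")
        self.running = True


core = MachineCore()
ui_button_enabled = core.can_start()  # the UI caches this for its button

core.faulted = True  # a fault arrives; the UI has not refreshed yet

try:
    core.start()  # operator clicks the still-enabled button
    rejected = False
except CommandRejected:
    rejected = True

print(ui_button_enabled, rejected, core.running)
```

The stale button is still a usability problem, but it can no longer start the machine.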

Scenario 5 — Inconsistent state leads to wrong command execution

What it looks like: the system accepts a transition into inspection even though one device is reconnecting.

Why it happens

readiness derived too loosely
one subsystem reports “last known good” instead of current unavailable state
command handler trusted derived state without validating critical source conditions

How engineers debug it

replay exact pre-command state snapshot
inspect authoritative owners for each readiness input
verify precedence of reconnecting/error/not-ready conditions
test with deliberate event delay injection

PART 8 — SOFTWARE DESIGN IMPLICATIONS

This is where architecture either saves you or kills you.

Why state must be modeled explicitly

In industrial systems, state is not a side effect. It is part of the design.

If you do not model it explicitly, it still exists, but now it is fragmented across:

booleans
view models
service fields
SDK callbacks
ad hoc caches
command guards
log messages

That is when nobody can answer simple questions like:

Is the machine actually ready?
Who decided it was faulted?
Why is Start disabled?
Which state change happened first?

Bad approach

UI ViewModel sets:
  IsReady
  IsRunning
  CanStart

Workflow service sets:
  CurrentMode
  IsBusy

Device manager sets:
  IsConnected
  ErrorText

Some command handler checks all of them directly.

Problems:

many writers
no authoritative ownership
mixed source and derived state
impossible to reason about ordering
UI accidentally becomes part of control logic

Good approach

+------------------------+
| Device State Owners    |
| Motion / Vision / IO   |
+-----------+------------+
            |
            v
+------------------------+
| Workflow State Owner   |
| Orchestrator           |
+-----------+------------+
            |
            v
+------------------------+
| Machine State Model    |
| Aggregation + Rules    |
+-----------+------------+
            |
            +-------> UI Projections
            |
            +-------> Command Guards
            |
            +-------> Logs / Diagnostics

Component diagram explanation

A stronger design typically has:

authoritative local state owners for each subsystem
an orchestrator that owns process/workflow state
a machine state model that derives top-level state from authoritative inputs
read-only projections for UI and diagnostics
command guards that validate against core state, not UI copies

Design principles that matter

1. Clear ownership: every state field has one authoritative writer.

2. Controlled updates: do not let random services patch machine status.

3. Consistent propagation: define how state moves through the system and what ordering guarantees exist.

4. Separation of source and derived state: do not let derived summaries behave like independent truths.

5. Centralized derivation rules: compute things like Ready, Faulted, CanStart, CanHome in one place.

6. Snapshot-oriented diagnostics: log state transitions with enough context to reconstruct what happened.

These principles line up with the roadmap’s emphasis on layered architecture, stateful components, orchestrator patterns, error propagation strategy, device health monitoring, fault handling, and root-cause-friendly observability.

PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

Here is how I would explain it in an interview or architecture discussion.

How to explain system-level state clearly

“System-level state in industrial machine software is not just one enum for the whole machine. It is the coordinated result of multiple state layers: device state, subsystem state, workflow state, and machine-level derived state. The challenge is maintaining consistency across asynchronous components so the system makes correct decisions and the operator sees trustworthy status.”

Why consistency is difficult

“Consistency is difficult because state comes from multiple sources at different times. Devices report asynchronously, workflows transition independently, and UI projections lag behind core logic. The real problem is not storing state; it is ownership, propagation, ordering, and derivation.”

Common mistakes engineers make

mixing source state and derived state
allowing multiple writers for the same state
duplicating readiness logic in UI, workflow, and services
using cached status for control decisions
not modeling transient states like Starting, Stopping, Recovering
treating machine state as presentation only instead of control logic

What strong engineers understand

Strong engineers understand that:

every important state needs a clear owner
top-level machine state is usually derived, not directly observed
asynchronous propagation means temporary inconsistency must be expected and controlled
command decisions should validate against authoritative core state
diagnostics must capture state transitions well enough to reconstruct failures
correctness matters more than convenience when state influences motion, alarms, or recovery

A concise interview-quality summary

“A good industrial state model separates local truth from global truth. Devices own device facts. Orchestration owns workflow progress. The machine layer derives overall status from authoritative inputs. UI observes projections rather than inventing state. The hard part is managing propagation, ordering, and consistency so the software never acts on an invalid picture of the machine.”

Final mental model

Think of system-level state like this:

local state tells you what each part believes about itself
system state tells you what the machine is allowed to do
derived state tells operators and higher-level logic what the overall condition means
good architecture makes those relationships explicit
bad architecture lets state leak everywhere until nobody trusts it

In industrial software, once nobody trusts the state model, everything gets worse:

operators stop trusting the UI
engineers add more ad hoc checks
workflows become defensive and tangled
recovery becomes unpredictable
debugging becomes timeline archaeology

That is why system-level state is a first-class architectural problem, not a detail.

This topic fits naturally under Industrial Software Architecture, which in the source-of-truth roadmap already emphasizes stateful components, orchestrator patterns, separation of UI/workflow/device logic, and error propagation, and it connects closely to concurrency, reliability, and observability concerns elsewhere in the roadmap.
