Below is a principal-level explanation of Workflow & Process Coordination, aligned to your source of truth: Domain 1 explicitly includes Machine Workflow & Sequencing, with emphasis on step-by-step sequencing, synchronization between subsystems, deterministic workflow execution, operational control semantics, and fault handling. The roadmap also ties this to long-running workflows, stateful components, error propagation, concurrency, and recovery.

PART 1 — WHAT A WORKFLOW IS IN MACHINE SOFTWARE

In industrial machine software, a workflow is the explicit model of a real machine process.

It is not just “some code that runs in order.” It is a representation of a physical operation such as:

  • inspection cycle
  • pick-and-place cycle
  • wafer alignment procedure
  • calibration routine
  • unload / load sequence
  • recovery procedure

A workflow answers questions like:

  • What step are we in right now?
  • What must complete before the next step can begin?
  • What conditions must be true to continue?
  • What happens if we pause, stop, timeout, or fail?
  • What has already been done, and what remains?

That last question is critical. In business software, if a method fails, you often retry or roll back a transaction. In machine software, the machine may already have moved, clamped a part, energized a vacuum, captured an image, or opened a valve. The physical world does not roll back automatically.

Workflow vs orchestration vs state machine

These three are related, but they are not the same.

Workflow: The process definition itself. It describes the machine's process sequence: load wafer, align, autofocus, scan, review, unload.

Orchestration: The coordination logic that drives subsystems during that workflow. It decides when to command motion, when to wait for vision readiness, when to validate interlocks, when to branch, and when to raise alarms.

State machine: The execution-control model. It governs the allowed execution states, such as Idle, Starting, Running, Paused, Stopping, Faulted, and Recovering, and the transitions permitted between them.

A useful mental model is:

  • workflow = what process the machine is performing
  • orchestration = how the system coordinates components to perform it
  • state machine = how execution is controlled safely and predictably

A machine can have one workflow model, an orchestration layer that executes it, and a state model that constrains what execution states are valid. That separation is usually healthier than collapsing everything into one giant state enum.
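
As a minimal sketch of the state-machine layer, the permitted transitions can be held in a table so that an invalid request is rejected instead of silently applied. The state names come from the text above; the exact transition set and the names `ExecutionState`, `ALLOWED`, and `transition` are illustrative assumptions, not a prescribed design.

```python
from enum import Enum, auto


class ExecutionState(Enum):
    IDLE = auto()
    STARTING = auto()
    RUNNING = auto()
    PAUSED = auto()
    STOPPING = auto()
    FAULTED = auto()
    RECOVERING = auto()


S = ExecutionState

# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    S.IDLE: {S.STARTING},
    S.STARTING: {S.RUNNING, S.FAULTED},
    S.RUNNING: {S.PAUSED, S.STOPPING, S.FAULTED},
    S.PAUSED: {S.RUNNING, S.STOPPING, S.FAULTED},
    S.STOPPING: {S.IDLE, S.FAULTED},
    S.FAULTED: {S.RECOVERING},
    S.RECOVERING: {S.IDLE, S.FAULTED},
}


def transition(current: ExecutionState, target: ExecutionState) -> ExecutionState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping this table separate from the workflow steps is what lets the same execution-control rules apply to every workflow the machine runs.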


PART 2 — STRUCTURING WORKFLOW STEPS

A workflow is built from explicit steps.

Typical machine workflow steps include:

  • move to position
  • home axis
  • wait for sensor
  • acquire image
  • validate result
  • actuate clamp / vacuum / IO
  • compute next target
  • confirm subsystem ready
  • branch based on outcome
  • finalize / cleanup

These steps are not all equal. Some are:

  • action steps: command something
  • wait steps: wait for completion or condition
  • decision steps: choose next branch
  • validation steps: verify safety / readiness / quality
  • recovery steps: clear partial state or bring machine to a safe point

Step dependencies

In real systems, a step depends on more than “previous step finished.”

A step may require:

  • motion complete
  • position within tolerance
  • no active interlock
  • device initialized
  • sensor stable for N ms
  • image acquisition buffer ready
  • recipe parameter validated
  • operator acknowledgment received

So good workflow design treats dependencies explicitly, not implicitly.

Sequencing rules

A robust workflow usually follows this pattern:

  1. validate prerequisites
  2. issue command
  3. observe progress
  4. detect completion or timeout
  5. verify postconditions
  6. transition to next step

That sounds simple, but many bad systems skip steps 1, 4, or 5.
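
The six-part pattern above can be sketched as a single step runner. This is a simplified polling version under stated assumptions: `Step`, `run_step`, and the outcome strings are hypothetical names, and the clock and sleep functions are injectable only so the sketch can be exercised without hardware.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    precondition: Callable[[], bool]   # 1. validate prerequisites
    command: Callable[[], None]        # 2. issue command
    is_complete: Callable[[], bool]    # 3-4. observed completion signal
    postcondition: Callable[[], bool]  # 5. verify the effect
    timeout_s: float = 5.0


def run_step(step: Step, poll_s: float = 0.01,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Run one step and report an explicit outcome instead of assuming success."""
    if not step.precondition():
        return "precondition_failed"
    step.command()
    deadline = clock() + step.timeout_s
    while not step.is_complete():
        if clock() >= deadline:
            return "timeout"
        sleep(poll_s)
    if not step.postcondition():
        return "postcondition_failed"
    return "ok"  # 6. the caller may now transition to the next step


# Simulated axis: the move reports completion on the third poll.
polls = {"n": 0}


def fake_complete() -> bool:
    polls["n"] += 1
    return polls["n"] >= 3


move = Step("move_to_start", lambda: True, lambda: None, fake_complete, lambda: True)
print(run_step(move, sleep=lambda s: None))  # prints "ok"
```

Returning a distinct outcome for each failure mode keeps steps 1, 4, and 5 from being skipped quietly.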

Conditional branching

Machine workflows often branch on:

  • recipe options
  • product type
  • inspection outcome
  • sensor results
  • subsystem capability
  • fault condition
  • operator choice during recovery

ASCII workflow diagram

+------------------+
| Start Cycle      |
+------------------+
         |
         v
+------------------+
| Validate Ready   |
| - recipe loaded  |
| - no alarms      |
| - interlocks ok  |
+------------------+
         |
         v
+------------------+
| Move to Start    |
+------------------+
         |
         v
+------------------+
| Wait Motion Done |
+------------------+
         |
         v
+------------------+
| Acquire Data     |
+------------------+
         |
         v
+------------------+
| Validate Result  |
+------------------+
      /       \
     /pass     \fail
    v           v
+------------------+    +----------------------+
| Next Process Step|    | Recovery / Retry     |
+------------------+    +----------------------+
         |                         |
         v                         v
+------------------+    +----------------------+
| Complete Cycle   |    | Operator Decision    |
+------------------+    +----------------------+

What this diagram means

This is not just business flow. Each box usually maps to:

  • a command to one or more subsystems
  • a wait for asynchronous completion
  • timeout and fault logic
  • state tracking
  • interruption handling points

That is why machine workflows need explicit modeling.


PART 3 — LONG-RUNNING WORKFLOWS

Machine workflows are often long-running.

They may last:

  • a few seconds for a simple transfer
  • minutes for calibration
  • tens of minutes for a batch operation
  • hours for full inspection lots or maintenance procedures

That changes the design completely.

A long-running workflow is not just a method call that takes longer. It has to survive:

  • asynchronous device completions
  • delays and timeouts
  • operator intervention
  • pause / stop / abort requests
  • device reconnects
  • transient bad measurements
  • partial success
  • power cycles in some architectures
  • stale or reordered events
  • subsystem availability changes

Why it is different from a simple function call

A normal function call assumes:

  • one call stack
  • immediate control
  • one thread of execution
  • predictable return path

A machine workflow usually involves:

  • multiple asynchronous subsystems
  • external events arriving later
  • long waits
  • state that must outlive one method frame
  • interruption requests from outside
  • progress tracking visible to operators and logs

So a workflow engine in a machine is usually closer to a persistent execution model than to a normal procedural method.

What must be tracked

For long-running execution, you typically track:

  • workflow instance id
  • current step
  • step status
  • workflow status
  • start time / duration
  • active command ids or correlation ids
  • last known subsystem statuses
  • retry count
  • pause / stop / abort requested flags
  • partial completion markers
  • fault context
  • operator action requirements

If you do not track these explicitly, debugging becomes miserable.
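
One way to make the tracked items above explicit is a single record per workflow instance. The sketch below is an assumption about shape only: the class and field names (`WorkflowInstance`, `StepStatus`, and so on) are hypothetical, and persistence is left out.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class StepStatus(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    WAITING = "waiting"
    DONE = "done"
    FAILED = "failed"


@dataclass
class WorkflowInstance:
    instance_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    workflow_status: str = "idle"
    current_step: Optional[str] = None
    step_status: StepStatus = StepStatus.PENDING
    started_at: float = field(default_factory=time.time)
    active_command_ids: set = field(default_factory=set)
    subsystem_status: dict = field(default_factory=dict)   # last known status per subsystem
    retry_count: int = 0
    pause_requested: bool = False
    stop_requested: bool = False
    abort_requested: bool = False
    completed_markers: list = field(default_factory=list)  # partial completion evidence
    fault_context: Optional[dict] = None
    operator_action_required: Optional[str] = None

    def begin_step(self, name: str) -> None:
        self.current_step = name
        self.step_status = StepStatus.ACTIVE
        self.retry_count = 0

    def complete_step(self) -> None:
        self.step_status = StepStatus.DONE
        self.completed_markers.append(self.current_step)


run = WorkflowInstance()
run.begin_step("move_to_start")
run.complete_step()
```

Because everything lives in one record, it can be logged or snapshotted as a unit when a fault occurs.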


PART 4 — COORDINATING SUBSYSTEMS WITHIN WORKFLOW

A workflow coordinates subsystems such as:

  • motion
  • vision
  • sensors
  • IO
  • vacuum / pneumatics
  • robot handlers
  • measurement devices

The workflow itself should not become a dumping ground for device-specific details. It should coordinate at the right level.

For example, a wafer inspection step may conceptually say:

  • move stage to scan position
  • wait until stage settled
  • trigger camera acquisition
  • validate frame received
  • evaluate focus metric
  • decide continue or refocus

The workflow is about process intent and cross-subsystem coordination, not low-level driver mechanics.

ASCII sequence diagram

Workflow        MotionCtrl        SensorSvc        VisionSvc        IO/Actuator
   |                |                |                |                |
   |--MoveTo(X,Y)-->|                |                |                |
   |<--Moving-------|                |                |                |
   |<--InPosition---|                |                |                |
   |<--MotionDone---|                |                |                |
   |--CheckReady-------------------->|                |                |
   |<-------------SensorOK-----------|                |                |
   |--TriggerAcquire--------------------------------->|                |
   |<-------------------------------FrameReady--------|                |
   |--SetOutput------------------------------------------------------->|
   |<----------------------------------------------------OutputDone----|
   |                |                |                |                |
   | (advance to next step)

What this diagram means

Notice the workflow does not assume a command is complete because the method returned.

Instead it follows a realistic pattern:

  • command subsystem
  • wait for actual completion signal
  • validate readiness
  • trigger next subsystem
  • continue only when postconditions are true

That is the essence of process coordination.

Common dependency patterns

Real workflows usually depend on one of these:

  • completion dependency: do not continue until prior action completed
  • condition dependency: do not continue until condition becomes true
  • stability dependency: do not continue until value is stable for some interval
  • mutual exclusion dependency: do not start because another subsystem owns the resource
  • safe-state dependency: do not start until machine is in a known safe condition

Strong engineers make these dependencies visible in the design.
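
The stability dependency is the one most often approximated with a fixed delay. A minimal sketch of an explicit check follows, assuming the caller supplies a monotonic timestamp with each sample; `StabilityCheck` is a hypothetical name.

```python
from typing import Optional


class StabilityCheck:
    """Reports True only after the condition has held continuously for hold_s seconds."""

    def __init__(self, hold_s: float) -> None:
        self.hold_s = hold_s
        self._since: Optional[float] = None  # time the condition last became true

    def update(self, condition_ok: bool, now: float) -> bool:
        if not condition_ok:
            self._since = None               # any glitch restarts the hold window
            return False
        if self._since is None:
            self._since = now
        return (now - self._since) >= self.hold_s


check = StabilityCheck(hold_s=0.050)
samples = [(0.000, True), (0.020, True), (0.030, False), (0.040, True), (0.100, True)]
results = [check.update(ok, t) for t, ok in samples]
print(results)  # prints [False, False, False, False, True]
```

The dropout at 30 ms resets the window, so the condition is accepted only once it has held from 40 ms to 100 ms.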


PART 5 — HANDLING INTERRUPTIONS

Industrial workflows must handle interruption as a first-class concern.

Typical interruption types:

  • pause
  • resume
  • stop
  • abort

These are not synonyms.

Pause

Pause usually means:

  • finish to a safe pause boundary if possible
  • hold resources in a consistent state
  • remember current progress
  • allow later continuation

Example: finish current image acquisition, then stop advancing.

Resume

Resume means:

  • confirm prerequisites still hold
  • restore execution context
  • continue from a valid re-entry point
  • revalidate any stale assumptions

Resume is often harder than pause.

Stop

Stop usually means:

  • request orderly termination
  • finish current safe unit of work
  • perform cleanup
  • bring machine to a controlled state

Abort

Abort means:

  • terminate as fast as safely possible
  • may cut short normal sequencing
  • may leave work incomplete
  • often transitions to faulted / recovery-needed state

What happens at different timing points

At a step boundary

This is the easiest case.

You can often:

  • record step complete
  • check interruption request
  • transition to Paused or Stopped
  • avoid starting the next step

Mid-step

This is much harder.

Example:

  • stage is moving
  • vacuum is engaging
  • image acquisition is underway
  • robot arm is between positions

You cannot always just “stop now.” You need step-specific interruption semantics.

A good question for every step is:

What does pause/stop/abort mean while this step is active?
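
One way to force an answer to that question is a policy table that must contain an entry for every step kind and interruption type. The sketch below is illustrative: the step kinds, the reactions, and the rule that abort is handled identically everywhere are all assumptions.

```python
from enum import Enum


class Interrupt(Enum):
    PAUSE = "pause"
    STOP = "stop"
    ABORT = "abort"


class Reaction(Enum):
    FINISH_THEN_HOLD = "finish the current action, then hold at the step boundary"
    CANCEL_AND_HOLD = "cancel the action, wait for confirmation, then hold"
    CANCEL_AND_CLEAN_UP = "cancel the action, then run cleanup"
    SAFE_STOP_NOW = "command an immediate safe stop and enter Faulted"


# Hypothetical policy table keyed by (step kind, interrupt).
POLICY = {
    ("acquire_image", Interrupt.PAUSE): Reaction.FINISH_THEN_HOLD,
    ("acquire_image", Interrupt.STOP): Reaction.FINISH_THEN_HOLD,
    ("move_stage", Interrupt.PAUSE): Reaction.CANCEL_AND_HOLD,
    ("move_stage", Interrupt.STOP): Reaction.CANCEL_AND_CLEAN_UP,
}


def reaction_for(step_kind: str, interrupt: Interrupt) -> Reaction:
    if interrupt is Interrupt.ABORT:
        return Reaction.SAFE_STOP_NOW  # assumed to be the same for every step
    try:
        return POLICY[(step_kind, interrupt)]
    except KeyError:
        # Refuse to guess: an undefined combination is a design gap, not a default.
        raise LookupError(
            f"no {interrupt.value} policy defined for step '{step_kind}'"
        ) from None
```

Raising on a missing entry turns an undefined mid-step behavior into a visible design error rather than a surprise on the machine.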

During waiting

This is where many workflows get stuck.

The workflow may be waiting for:

  • motion done
  • sensor ready
  • acquisition complete
  • PLC acknowledgment
  • timeout window

If pause/stop arrives during waiting, the engine must decide:

  • keep waiting until safe completion?
  • cancel the underlying action?
  • transition to a different wait state, such as waiting for the cancellation to be confirmed?
  • ignore completion events that arrive after cancellation?

That last one is a major source of bugs.
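
A common guard against late completions is to accept an event only if it matches a command that is still active. The sketch below assumes each command carries a correlation id that the subsystem echoes back; `CommandTracker` and its methods are hypothetical names.

```python
import itertools


class CommandTracker:
    """Accepts a completion only if it matches a command that is still active."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._active = set()

    def issue(self) -> int:
        command_id = next(self._ids)
        self._active.add(command_id)
        return command_id

    def cancel(self, command_id: int) -> None:
        self._active.discard(command_id)

    def on_completion(self, command_id: int) -> bool:
        """Return True if this event should advance the workflow."""
        if command_id not in self._active:
            return False  # stale: cancelled, duplicated, or from an earlier run
        self._active.remove(command_id)
        return True


tracker = CommandTracker()
first = tracker.issue()
tracker.cancel(first)                        # pause or stop arrived while waiting
second = tracker.issue()
late_event = tracker.on_completion(first)    # False: must not advance the workflow
fresh_event = tracker.on_completion(second)  # True
```

Stale events are usually worth logging as well, since they often point to a subsystem whose cancellation is slower than the workflow assumes.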

ASCII workflow-state view

              +---------+
              |  Idle   |
              +---------+
                   |
                   v
              +---------+
              | Starting|
              +---------+
                   |
                   v
              +---------+
              | Running |
              +---------+
                   |
       +-----------+-----------+
       | pause     | stop      | abort
       v           v           v
  +---------+ +---------+ +---------+
  | Pausing | |Stopping | |Aborting |
  +---------+ +---------+ +---------+
       |           |           |
       v           v           v
  +---------+ +---------+ +---------+
  | Paused  | | Stopped | | Faulted |
  +---------+ +---------+ +---------+
       |
     resume
       |
       v
  +---------+
  | Running |
  +---------+

Why interruption handling is complex

Because the workflow is coordinating real operations that may already be in progress, and each subsystem may have different cancellation behavior.

Motion may decelerate. Camera capture may already be triggered. PLC may already have latched a command. A valve may already be open. A part may already be clamped.

So interruption is not just a control flag. It is a coordination problem across real subsystems.


PART 6 — PARTIAL COMPLETION & RECOVERY

This is one of the most important ideas in machine workflows.

When failure happens, you need to know:

  • what has already completed
  • what is in progress
  • what definitely did not happen
  • what physical state the machine is now in
  • what the safe next action is

That is why strong workflow systems track completion markers, not just current step.

Typical recovery choices

When a workflow fails mid-process, the system may:

  • retry the current step
  • repeat the whole sub-sequence
  • perform compensating cleanup
  • continue from the next safe checkpoint
  • move to a recovery workflow
  • require operator intervention

Retry step

Good when:

  • action is idempotent or safely repeatable
  • failure is transient
  • physical state remains valid

Bad when:

  • repeating may duplicate actuation
  • side effects already happened
  • the environment changed

Rollback

In machine software, rollback is limited.

You can sometimes:

  • move back to safe position
  • release clamp
  • clear output
  • discard partial data
  • mark part for reject

But you often cannot “undo” the physical world in the same clean way as database rollback.

Continue safely

Possible if:

  • completed steps are trusted
  • next step does not require redoing previous action
  • machine state is still consistent

Require operator action

Often necessary when:

  • material position is uncertain
  • a gripper may still hold a part
  • a sensor disagrees with expected state
  • a human must inspect or reset hardware

Checkpoint mentality

Good workflow design often uses checkpoints like:

  • recipe validated
  • hardware initialized
  • part clamped
  • stage homed
  • scan region 1 complete
  • lot step N complete

These checkpoints make recovery tractable.
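
A minimal sketch of how checkpoints can drive recovery: given the set of checkpoints known to be reached, find the first one still outstanding, and refuse to continue if the record is inconsistent. The checkpoint names and their order are illustrative assumptions, and a real system would need to persist the reached set durably, which is omitted here.

```python
from typing import Iterable, Optional

# Assumed order for one cycle; the names are illustrative only.
ORDER = [
    "recipe_validated",
    "hardware_initialized",
    "stage_homed",
    "part_clamped",
    "scan_region_1_complete",
]


def next_checkpoint(reached: Iterable[str]) -> Optional[str]:
    """Return the first checkpoint not yet reached, or None if all are done."""
    reached = set(reached)
    unknown = reached - set(ORDER)
    if unknown:
        raise ValueError(f"unknown checkpoints: {sorted(unknown)}")
    for index, name in enumerate(ORDER):
        if name not in reached:
            skipped_over = [n for n in ORDER[index + 1:] if n in reached]
            if skipped_over:
                # A later checkpoint without an earlier one means the record
                # cannot be trusted; this needs a recovery path or an operator.
                raise ValueError(f"inconsistent record: {name} missing before {skipped_over}")
            return name
    return None


resume_at = next_checkpoint({"recipe_validated", "hardware_initialized"})
print(resume_at)  # prints "stage_homed"
```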


PART 7 — REAL-WORLD FAILURE SCENARIOS

1. Workflow stuck waiting for event

What it looks like

Machine shows “Running,” but progress never advances. No obvious fault. Operator says it hangs randomly.

Why it happens

  • expected completion event never arrived
  • event arrived before wait subscription was active
  • timeout missing or too large
  • event correlation id mismatch
  • subsystem completed physically, but status propagation failed

How engineers debug it

  • inspect workflow timeline and current wait condition
  • verify whether command was actually issued
  • check raw device communication logs
  • confirm event/callback path fired
  • compare command id vs completion id
  • look for race between command issue and event subscription

This is a classic asynchronous coordination bug.
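
Two of the causes above, the event arriving before the subscription is active and the missing timeout, can be addressed together by registering the waiter first and bounding the wait. The sketch uses Python's `asyncio`; `FakeMotion` is a hypothetical stand-in for a motion service and does not represent a real device API.

```python
import asyncio


class FakeMotion:
    """Stand-in for a motion service that reports completion by command id."""

    def __init__(self) -> None:
        self._waiters = {}

    def expect(self, command_id: int) -> asyncio.Future:
        # Register interest BEFORE the command is sent, so a fast completion
        # cannot arrive while nobody is listening.
        future = asyncio.get_running_loop().create_future()
        self._waiters[command_id] = future
        return future

    def forget(self, command_id: int) -> None:
        self._waiters.pop(command_id, None)

    def send_move(self, command_id: int, completes: bool = True) -> None:
        if completes:
            self._complete(command_id)  # simulate a device that finishes immediately

    def _complete(self, command_id: int) -> None:
        future = self._waiters.pop(command_id, None)
        if future is not None and not future.done():
            future.set_result("done")


async def move_and_wait(motion: FakeMotion, command_id: int,
                        timeout_s: float, completes: bool = True) -> str:
    done = motion.expect(command_id)                    # 1. subscribe
    motion.send_move(command_id, completes)             # 2. then command
    try:
        return await asyncio.wait_for(done, timeout_s)  # 3. bounded wait
    except asyncio.TimeoutError:
        motion.forget(command_id)
        return "timeout"


print(asyncio.run(move_and_wait(FakeMotion(), 1, timeout_s=0.05)))
```

With the order reversed (command first, subscribe second), the immediate completion in `send_move` would be lost and the wait would end only at the timeout.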


2. Step completes but next step starts too early

What it looks like

System starts acquisition before motion has truly settled, or starts clamp before positioning fully finished.

Why it happens

  • using “command accepted” as “command completed”
  • completion signal means “in position” but not “stable”
  • stale cached state read as current
  • no postcondition validation
  • subsystem reports ready earlier than physically safe

How engineers debug it

  • compare timestamps of command, completion, and next-step start
  • inspect whether completion semantics are misunderstood
  • add separate “settled” or “postcondition verified” state
  • instrument actual device values around transition time

Strong engineers learn to distrust naive “done” signals.


3. Interruption leaves system inconsistent

What it looks like

Pause requested during cycle. UI says paused, but one actuator is still active, or workflow resumes from the wrong place.

Why it happens

  • interruption only updated workflow flag, not subsystem behavior
  • no defined mid-step interruption policy
  • transition to Paused happened before underlying step quiesced
  • cleanup action not modeled

How engineers debug it

  • reconstruct exact step at interruption time
  • inspect whether pause was handled at boundary or mid-step
  • verify command cancellation / safe-stop path
  • check whether completion event from old step was consumed after pause

This is why interruption semantics must be defined per step type.


4. Retry causes duplicate actions

What it looks like

Part gets clamped twice, image stored twice, output signal sent twice, item counted twice.

Why it happens

  • step retried without idempotency design
  • workflow did not know action had already succeeded
  • acknowledgement was delayed, causing false timeout
  • completion state not persisted before retry

How engineers debug it

  • inspect retry reason and timing
  • determine whether original action actually completed
  • check if step had idempotency token or duplicate suppression
  • separate “command issued,” “command acknowledged,” and “effect confirmed”

Retry is dangerous when physical side effects exist.
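
A minimal sketch of two complementary defenses: reusing one idempotency token across attempts so the device side can suppress duplicates, and checking whether the effect already happened before retrying. `ClampActuator` is a fake device written for this illustration; whether real hardware or a PLC can honor such a token is an assumption that must be checked per device.

```python
from typing import Callable


class ClampActuator:
    """Fake device that remembers handled tokens so a repeat is not re-executed."""

    def __init__(self) -> None:
        self.actuations = 0
        self._handled_tokens = set()

    def clamp(self, token: str) -> None:
        if token in self._handled_tokens:
            return  # duplicate suppression: same token, no second actuation
        self._handled_tokens.add(token)
        self.actuations += 1

    def is_clamped(self) -> bool:
        return self.actuations > 0


def clamp_with_retry(device: ClampActuator, token: str,
                     ack_received: Callable[[int], bool],
                     max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        device.clamp(token)        # command issued, same token on every attempt
        if ack_received(attempt):  # command acknowledged
            return "acknowledged"
        if device.is_clamped():    # effect confirmed even though the ack was lost
            return "effect_confirmed"
    return "failed"


device = ClampActuator()
outcome = clamp_with_retry(device, "cycle-17/clamp", ack_received=lambda attempt: False)
print(outcome, device.actuations)  # prints "effect_confirmed 1"
```

Keeping "issued", "acknowledged", and "effect confirmed" as separate observations is what lets a lost acknowledgement be told apart from a failed actuation.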


5. Condition check incorrect due to stale data

What it looks like

Workflow proceeds because sensor says safe, but that value was from an earlier cycle.

Why it happens

  • polling cache not refreshed
  • async status propagation lag
  • data timestamp ignored
  • condition check subscribed to wrong source
  • race between state update and decision

How engineers debug it

  • inspect timestamp and source of condition data
  • distinguish latest observed value from last published value
  • validate freshness window
  • trace the path from device read to workflow decision

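In machine software, stale data is often more dangerous than missing data: a missing value usually triggers a timeout or fault, while a stale value looks valid and lets the workflow proceed on a false premise.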
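
A freshness window can be sketched as follows, assuming each reading carries the time it was sampled at the source and that the caller's clock is comparable to it. `Reading` and `is_safe` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    value: bool
    timestamp: float  # time the value was sampled at the source, in seconds


def is_safe(reading: Reading, now: float, max_age_s: float) -> bool:
    """Use a reading only if it is recent; a stale value means unknown, not safe."""
    age = now - reading.timestamp
    if age < 0 or age > max_age_s:
        return False  # future-dated or too old: do not trust it
    return reading.value


door_closed = Reading(value=True, timestamp=100.0)
print(is_safe(door_closed, now=100.2, max_age_s=0.5))  # prints True
print(is_safe(door_closed, now=103.0, max_age_s=0.5))  # prints False
```

The second call returns False even though the stored value is True, which is exactly the case where a cached value from an earlier cycle would otherwise let the workflow proceed.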


PART 8 — SOFTWARE DESIGN IMPLICATIONS

Workflow logic should be explicitly modeled, not hidden across random service methods.

Your source-of-truth already points in this direction: machine workflow/sequencing, deterministic execution, operational control semantics, interlocks/fault handling, and state-driven design all belong here.

Why workflow must be explicit

Because you need to reason about:

  • current step
  • allowed next step
  • wait conditions
  • interruption points
  • recovery points
  • fault ownership
  • step completion evidence

If these are scattered through event handlers, timers, callbacks, and service classes, you no longer have a workflow. You have an accident waiting to happen.

Good vs bad approaches

Bad: implicit workflows in code

UI button click
  -> serviceA.DoThing()
      -> if ok call serviceB.Start()
          -> callback somewhere sets flag
              -> timer somewhere checks flag
                  -> maybe call serviceC()

Why this fails:

  • execution flow is invisible
  • step boundaries are unclear
  • state lives in booleans everywhere
  • pause/stop/recovery become chaotic
  • race conditions become normal
  • debugging requires reading half the codebase

Good: structured workflow model

Workflow Definition
   -> Step definitions
   -> Transition rules
   -> Preconditions / postconditions
   -> Interruption policy
   -> Retry / recovery policy

Workflow Executor
   -> Runs current step
   -> Tracks state and progress
   -> Waits for completion
   -> Applies transition rules
   -> Handles pause/stop/abort/fault

Subsystem Services
   -> Motion
   -> Vision
   -> IO
   -> Sensors
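
The split between definition and executor can be sketched in a few lines. This is a deliberately small, synchronous illustration: `StepDef`, `WorkflowDef`, and `Executor` are hypothetical names, the step actions are stubs standing in for calls to subsystem services, and timeouts, pause, and recovery policy are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StepDef:
    action: Callable[[], str]              # returns an outcome such as "ok" or "fail"
    transitions: Dict[str, Optional[str]]  # outcome -> next step name (None ends the run)


@dataclass
class WorkflowDef:
    start: str
    steps: Dict[str, StepDef]


@dataclass
class Executor:
    definition: WorkflowDef
    history: List[tuple] = field(default_factory=list)  # (step, outcome) for diagnostics
    stop_requested: bool = False

    def run(self, max_steps: int = 100) -> str:
        current: Optional[str] = self.definition.start
        for _ in range(max_steps):
            if current is None:
                return "completed"
            if self.stop_requested:  # checked only at step boundaries
                return "stopped"
            step = self.definition.steps[current]
            outcome = step.action()
            self.history.append((current, outcome))
            if outcome not in step.transitions:
                return "faulted"     # no rule for this outcome: do not guess
            current = step.transitions[outcome]
        # Step budget exhausted: only a finished workflow counts as completed.
        return "completed" if current is None else "faulted"


inspection = WorkflowDef(
    start="validate_ready",
    steps={
        "validate_ready": StepDef(lambda: "ok", {"ok": "move_to_start"}),
        "move_to_start": StepDef(lambda: "ok", {"ok": "acquire"}),
        "acquire": StepDef(lambda: "ok", {"ok": "validate_result"}),
        "validate_result": StepDef(lambda: "pass", {"pass": None, "fail": "recovery"}),
        "recovery": StepDef(lambda: "ok", {"ok": None}),
    },
)
executor = Executor(inspection)
status = executor.run()
print(status, [name for name, _ in executor.history])
```

Because the transitions are data, the full set of possible paths can be reviewed and tested without running the machine.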

ASCII component view

+--------------------------+
| Workflow Definition      |
| - steps                  |
| - transitions            |
| - conditions             |
| - recovery rules         |
+------------+-------------+
             |
             v
+--------------------------+
| Workflow Executor        |
| - current step           |
| - progress tracking      |
| - interruption handling  |
| - timeout handling       |
| - retry / recovery       |
+------------+-------------+
             |
     +-------+-----+-------------+-------------+
     |             |             |             |
     v             v             v             v
+----------+  +----------+  +----------+  +----------+
| Motion   |  | Vision   |  | Sensors  |  | IO/PLC   |
| Service  |  | Service  |  | Service  |  | Service  |
+----------+  +----------+  +----------+  +----------+

Design principles that matter

Clear step definitions: Each step should have a single clear purpose.

Explicit transitions: Do not infer next steps from scattered flags.

State tracking: Track workflow state, step state, and completion markers explicitly.

Separation from device logic: The workflow says what process is happening. Device services say how to talk to hardware.

Time-aware logic: Conditions should know about timeout, freshness, stability, and cancellation.

Observability built in: Every step transition should be logged with timestamps, identifiers, and reason.

Recovery modeled, not improvised: Recovery code written only during incidents is almost always bad.
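
For the observability principle, a minimal sketch of one structured record per step transition is shown below. The field names are assumptions, and the JSON line would normally be handed to whatever logging facility the system already uses.

```python
import json
import time
from typing import Optional


def transition_record(instance_id: str, from_step: Optional[str], to_step: Optional[str],
                      reason: str, command_id: Optional[int] = None,
                      now: Optional[float] = None) -> str:
    """Build one JSON line describing a step transition."""
    record = {
        "ts": time.time() if now is None else now,
        "instance_id": instance_id,
        "from_step": from_step,
        "to_step": to_step,
        "reason": reason,
        "command_id": command_id,
    }
    return json.dumps(record, sort_keys=True)


line = transition_record("run-0042", "wait_motion_done", "acquire_data",
                         reason="motion settled", command_id=17, now=1700000000.0)
print(line)
```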


PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS

How to explain workflows clearly

You can say:

In industrial machine software, a workflow is the explicit model of a real machine process. It is not just sequential code. It defines ordered steps, dependencies, conditions, interruption behavior, and recovery paths across multiple subsystems.

That is a strong answer.

Difference between workflow and orchestration

A clean explanation:

Workflow is the process definition: the sequence and rules of the machine operation. Orchestration is the runtime coordination logic that drives motion, sensors, vision, and IO to execute that process. The workflow describes what should happen; orchestration ensures the subsystems do it in the right order and under the right conditions.

Common mistakes engineers make

  • treating physical actions like synchronous method calls
  • assuming command accepted means operation complete
  • hiding workflow in event handlers and flags
  • not defining pause/stop/abort semantics per step
  • retrying non-idempotent actions blindly
  • not tracking partial completion
  • using stale status for decisions
  • mixing device logic and workflow logic
  • having no timeout or no meaningful timeout reason
  • making recovery manual because software does not know what happened

What strong engineers understand about long-running processes

Strong engineers understand that:

  • the machine process is stateful and long-lived
  • every step has physical meaning
  • step completion must be proven, not assumed
  • interruption is part of the design, not an edge case
  • recovery requires explicit knowledge of partial completion
  • correctness matters more than elegant-looking abstraction
  • diagnosability is part of architecture

Interview-ready closing statement

The key architectural move is to model workflow explicitly. Once the workflow, step boundaries, completion conditions, and interruption/recovery semantics are clear, the system becomes understandable, testable, and safer. When those things are implicit, long-running machine behavior becomes fragile very quickly.


Final mental model

Think of workflow coordination in industrial software like this:

  • workflow gives the machine a structured process
  • executor/orchestrator drives that process across subsystems
  • state tracking tells you where you are
  • interruption handling tells you how to stop or pause safely
  • recovery logic tells you what to do when reality diverges from expectation

The core mindset is not “run steps in order.” It is:

maintain correct, observable, recoverable progress through a physical process over time.

That is the real meaning of workflow & process coordination in machine software.
