Below is a principal-level explanation of Workflow & Process Coordination, aligned to your source of truth: Domain 1 explicitly includes Machine Workflow & Sequencing, with emphasis on step-by-step sequencing, synchronization between subsystems, deterministic workflow execution, operational control semantics, and fault handling. The roadmap also ties this to long-running workflows, stateful components, error propagation, concurrency, and recovery.
PART 1 — WHAT A WORKFLOW IS IN MACHINE SOFTWARE
In industrial machine software, a workflow is the explicit model of a real machine process.
It is not just “some code that runs in order.” It is a representation of a physical operation such as:
- inspection cycle
- pick-and-place cycle
- wafer alignment procedure
- calibration routine
- unload / load sequence
- recovery procedure
A workflow answers questions like:
- What step are we in right now?
- What must complete before the next step can begin?
- What conditions must be true to continue?
- What happens if we pause, stop, timeout, or fail?
- What has already been done, and what remains?
That last question is critical. In business software, if a method fails, you often retry or roll back a transaction. In machine software, the machine may already have moved, clamped a part, energized a vacuum, captured an image, or opened a valve. The physical world does not roll back automatically.
Workflow vs orchestration vs state machine
These three are related, but they are not the same.
Workflow: The process definition itself. It describes the machine's process sequence: load wafer, align, autofocus, scan, review, unload.
Orchestration: The coordination logic that drives subsystems during that workflow. It decides when to command motion, when to wait for vision readiness, when to validate interlocks, when to branch, when to raise alarms.
State machine: The execution-control model. It governs the allowed execution states (for example Idle, Starting, Running, Paused, Stopping, Faulted, Recovering) and the transitions permitted between them.
A useful mental model is:
- workflow = what process the machine is performing
- orchestration = how the system coordinates components to perform it
- state machine = how execution is controlled safely and predictably
A machine can have one workflow model, an orchestration layer that executes it, and a state model that constrains what execution states are valid. That separation is usually healthier than collapsing everything into one giant state enum.
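To make the state-model part of that separation concrete, here is a minimal sketch of an execution-state table. The state names come from the list above; the specific transitions in `ALLOWED` are illustrative assumptions, not a prescribed model.

```python
from enum import Enum, auto


class ExecState(Enum):
    IDLE = auto()
    STARTING = auto()
    RUNNING = auto()
    PAUSED = auto()
    STOPPING = auto()
    FAULTED = auto()
    RECOVERING = auto()


# Hypothetical transition table: each state maps to the states it may enter next.
ALLOWED = {
    ExecState.IDLE: {ExecState.STARTING},
    ExecState.STARTING: {ExecState.RUNNING, ExecState.FAULTED},
    ExecState.RUNNING: {ExecState.PAUSED, ExecState.STOPPING, ExecState.FAULTED},
    ExecState.PAUSED: {ExecState.RUNNING, ExecState.STOPPING, ExecState.FAULTED},
    ExecState.STOPPING: {ExecState.IDLE, ExecState.FAULTED},
    ExecState.FAULTED: {ExecState.RECOVERING},
    ExecState.RECOVERING: {ExecState.IDLE, ExecState.FAULTED},
}


def transition(current: ExecState, target: ExecState) -> ExecState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

The point of the table is that the workflow steps never appear in it. The process can change without touching the rules about which execution states are valid.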
PART 2 — STRUCTURING WORKFLOW STEPS
A workflow is built from explicit steps.
Typical machine workflow steps include:
- move to position
- home axis
- wait for sensor
- acquire image
- validate result
- actuate clamp / vacuum / IO
- compute next target
- confirm subsystem ready
- branch based on outcome
- finalize / cleanup
These steps are not all equal. Some are:
- action steps: command something
- wait steps: wait for completion or condition
- decision steps: choose next branch
- validation steps: verify safety / readiness / quality
- recovery steps: clear partial state or bring machine to a safe point
Step dependencies
In real systems, a step depends on more than “previous step finished.”
A step may require:
- motion complete
- position within tolerance
- no active interlock
- device initialized
- sensor stable for N ms
- image acquisition buffer ready
- recipe parameter validated
- operator acknowledgment received
So good workflow design treats dependencies explicitly, not implicitly.
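One way to treat dependencies explicitly is to give each one a name and a check, so a blocked step can report exactly what it is waiting for. The sketch below assumes a hypothetical status snapshot dictionary; the keys (`motion_done`, `position_error_um`, `interlocks`) and the 2.0 um tolerance are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MachineStatus = Dict[str, object]  # hypothetical snapshot of subsystem status


@dataclass(frozen=True)
class Precondition:
    name: str
    check: Callable[[MachineStatus], bool]


def unmet(preconditions: List[Precondition], status: MachineStatus) -> List[str]:
    """Return the names of preconditions that do not hold for this snapshot."""
    return [p.name for p in preconditions if not p.check(status)]


# Missing keys are treated as "not satisfied" rather than silently passing.
ACQUIRE_IMAGE_PRECONDITIONS = [
    Precondition("motion complete", lambda s: s.get("motion_done") is True),
    Precondition("position within tolerance",
                 lambda s: abs(s.get("position_error_um", 1e9)) <= 2.0),
    Precondition("no active interlock",
                 lambda s: not s.get("interlocks", ["unknown"])),
]
```

Calling `unmet(ACQUIRE_IMAGE_PRECONDITIONS, status)` returns an empty list when the step may start, and otherwise the names to show in the log or on the operator screen.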
Sequencing rules
A robust workflow usually follows this pattern:
- validate prerequisites
- issue command
- observe progress
- detect completion or timeout
- verify postconditions
- transition to next step
That sounds simple, but many bad systems skip prerequisite validation, timeout detection, or postcondition verification.
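The six-part pattern above can be sketched as a single function. This is a simplified polling version under stated assumptions: the five callables are hypothetical hooks supplied by the step, and a real executor would more likely wait on events than sleep in a loop.

```python
import time
from typing import Callable


class StepFailed(Exception):
    pass


def run_step(
    name: str,
    prerequisites_ok: Callable[[], bool],
    issue_command: Callable[[], None],
    is_complete: Callable[[], bool],
    postconditions_ok: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 0.01,
) -> None:
    """Run one step: validate, command, observe, time out, verify."""
    if not prerequisites_ok():
        raise StepFailed(f"{name}: prerequisites not met")
    issue_command()
    deadline = time.monotonic() + timeout_s
    while not is_complete():
        if time.monotonic() >= deadline:
            raise StepFailed(f"{name}: timed out after {timeout_s}s")
        time.sleep(poll_s)
    # Completion alone is not enough: confirm the expected end state.
    if not postconditions_ok():
        raise StepFailed(f"{name}: postconditions not met")
```

Returning normally is the only signal that the caller may transition to the next step. Every other outcome carries a reason in the exception.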
Conditional branching
Machine workflows often branch on:
- recipe options
- product type
- inspection outcome
- sensor results
- subsystem capability
- fault condition
- operator choice during recovery
ASCII workflow diagram
+------------------+
| Start Cycle      |
+------------------+
         |
         v
+------------------+
| Validate Ready   |
| - recipe loaded  |
| - no alarms      |
| - interlocks ok  |
+------------------+
         |
         v
+------------------+
| Move to Start    |
+------------------+
         |
         v
+------------------+
| Wait Motion Done |
+------------------+
         |
         v
+------------------+
| Acquire Data     |
+------------------+
         |
         v
+------------------+
| Validate Result  |
+------------------+
       /                \
 pass /                  \ fail
     v                    v
+------------------+  +----------------------+
| Next Process Step|  | Recovery / Retry     |
+------------------+  +----------------------+
         |                       |
         v                       v
+------------------+  +----------------------+
| Complete Cycle   |  | Operator Decision    |
+------------------+  +----------------------+
What this diagram means
This is not just business flow. Each box usually maps to:
- a command to one or more subsystems
- a wait for asynchronous completion
- timeout and fault logic
- state tracking
- interruption handling points
That is why machine workflows need explicit modeling.
PART 3 — LONG-RUNNING WORKFLOWS
Machine workflows are often long-running.
They may last:
- a few seconds for a simple transfer
- minutes for calibration
- tens of minutes for a batch operation
- hours for full inspection lots or maintenance procedures
That changes the design completely.
A long-running workflow is not just a method call that takes longer. It has to survive:
- asynchronous device completions
- delays and timeouts
- operator intervention
- pause / stop / abort requests
- device reconnects
- transient bad measurements
- partial success
- power cycles in some architectures
- stale or reordered events
- subsystem availability changes
Why it is different from a simple function call
A normal function call assumes:
- one call stack
- immediate control
- one thread of execution
- predictable return path
A machine workflow usually involves:
- multiple asynchronous subsystems
- external events arriving later
- long waits
- state that must outlive one method frame
- interruption requests from outside
- progress tracking visible to operators and logs
So a workflow engine in a machine is usually closer to a persistent execution model than to a normal procedural method.
What must be tracked
For long-running execution, you typically track:
- workflow instance id
- current step
- step status
- workflow status
- start time / duration
- active command ids or correlation ids
- last known subsystem statuses
- retry count
- pause / stop / abort requested flags
- partial completion markers
- fault context
- operator action requirements
If you do not track these explicitly, debugging becomes miserable.
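As a rough illustration, several of the tracked items above fit in one record per workflow instance. The field names below are assumptions chosen to mirror the list. A real system would also persist this record if the workflow has to survive restarts.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional
import time
import uuid


class StepStatus(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    FAILED = "failed"


@dataclass
class WorkflowInstance:
    workflow_name: str
    instance_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    current_step: Optional[str] = None
    step_status: StepStatus = StepStatus.PENDING
    active_command_ids: List[str] = field(default_factory=list)
    retry_count: int = 0
    pause_requested: bool = False
    stop_requested: bool = False
    abort_requested: bool = False
    completed_markers: List[str] = field(default_factory=list)
    fault_context: Dict[str, str] = field(default_factory=dict)

    def begin_step(self, step: str) -> None:
        """Mark a step as active and reset its retry counter."""
        self.current_step = step
        self.step_status = StepStatus.ACTIVE
        self.retry_count = 0

    def complete_step(self) -> None:
        """Record the active step as a partial-completion marker."""
        if self.current_step is None:
            raise RuntimeError("no active step")
        self.completed_markers.append(self.current_step)
        self.step_status = StepStatus.DONE
```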
PART 4 — COORDINATING SUBSYSTEMS WITHIN WORKFLOW
A workflow coordinates subsystems such as:
- motion
- vision
- sensors
- IO
- vacuum / pneumatics
- robot handlers
- measurement devices
The workflow itself should not become a dumping ground for device-specific details. It should coordinate at the right level.
For example, a wafer inspection step may conceptually say:
- move stage to scan position
- wait until stage settled
- trigger camera acquisition
- validate frame received
- evaluate focus metric
- decide continue or refocus
The workflow is about process intent and cross-subsystem coordination, not low-level driver mechanics.
ASCII sequence diagram
Workflow         MotionCtrl       SensorSvc        VisionSvc        IO/Actuator
|                |                |                |                |
|--StartStep---->|                |                |                |
|                |--MoveTo(X,Y)-->|                |                |
|                |<--Moving-------|                |                |
|                |<--InPosition---|                |                |
|<--MotionDone---|                |                |                |
|--CheckReady-------------------->|                |                |
|<-------------SensorOK-----------|                |                |
|--TriggerAcquire--------------------------------->|                |
|<-------------------------------FrameReady--------|                |
|--SetOutput------------------------------------------------------->|
|<----------------------------------------------------OutputDone----|
|--AdvanceToNextStep-->|
What this diagram means
Notice the workflow does not assume a command is complete because the method returned.
Instead it follows a realistic pattern:
- command subsystem
- wait for actual completion signal
- validate readiness
- trigger next subsystem
- continue only when postconditions are true
That is the essence of process coordination.
Common dependency patterns
Real workflows usually depend on one of these:
- completion dependency: do not continue until prior action completed
- condition dependency: do not continue until condition becomes true
- stability dependency: do not continue until value is stable for some interval
- mutual exclusion dependency: do not start while another subsystem owns the resource
- safe-state dependency: do not start until machine is in a known safe condition
Strong engineers make these dependencies visible in the design.
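The stability dependency is the one most often implemented incorrectly, so here is a small sketch of it. It assumes timestamped samples in integer milliseconds; the function name and parameters are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def stable_for(
    samples: Iterable[Tuple[int, float]],
    target: float,
    tolerance: float,
    hold_ms: int,
) -> bool:
    """True if the latest samples stayed within tolerance for at least hold_ms.

    samples: (timestamp_ms, value) pairs in increasing time order.
    """
    window_start: Optional[int] = None  # start of the current in-tolerance run
    last_t: Optional[int] = None
    for t, value in samples:
        if abs(value - target) <= tolerance:
            if window_start is None:
                window_start = t
        else:
            window_start = None  # any excursion restarts the window
        last_t = t
    if window_start is None or last_t is None:
        return False
    return (last_t - window_start) >= hold_ms
```

A single in-tolerance reading satisfies a condition dependency, but not this one. The value has to stay inside the band for the whole interval.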
PART 5 — HANDLING INTERRUPTIONS
Industrial workflows must handle interruption as a first-class concern.
Typical interruption types:
- pause
- resume
- stop
- abort
These are not synonyms.
Pause
Pause usually means:
- finish to a safe pause boundary if possible
- hold resources in a consistent state
- remember current progress
- allow later continuation
Example: finish current image acquisition, then stop advancing.
Resume
Resume means:
- confirm prerequisites still hold
- restore execution context
- continue from a valid re-entry point
- revalidate any stale assumptions
Resume is often harder than pause.
Stop
Stop usually means:
- request orderly termination
- finish current safe unit of work
- perform cleanup
- bring machine to a controlled state
Abort
Abort means:
- terminate as fast as safely possible
- may cut short normal sequencing
- may leave work incomplete
- often transitions to faulted / recovery-needed state
What happens at different timing points
At a step boundary
This is the easiest case.
You can often:
- record step complete
- check interruption request
- transition to Paused or Stopped
- avoid starting the next step
Mid-step
This is much harder.
Example:
- stage is moving
- vacuum is engaging
- image acquisition is underway
- robot arm is between positions
You cannot always just “stop now.” You need step-specific interruption semantics.
A good question for every step is:
What does pause/stop/abort mean while this step is active?
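One hedged way to force that question to be answered is a policy table keyed by step and interruption type, where a missing entry is an error rather than a silent default. The step names and reactions below are invented examples; real entries depend on the hardware.

```python
from enum import Enum


class Interrupt(Enum):
    PAUSE = "pause"
    STOP = "stop"
    ABORT = "abort"


class Reaction(Enum):
    FINISH_STEP_THEN_HOLD = "finish step, then hold"
    CANCEL_AND_SAFE_STOP = "cancel action, bring to safe stop"
    HALT_IMMEDIATELY = "halt as fast as safely possible"


# Hypothetical policy table for two example steps.
POLICY = {
    ("acquire_image", Interrupt.PAUSE): Reaction.FINISH_STEP_THEN_HOLD,
    ("acquire_image", Interrupt.STOP): Reaction.FINISH_STEP_THEN_HOLD,
    ("move_stage", Interrupt.PAUSE): Reaction.CANCEL_AND_SAFE_STOP,
    ("move_stage", Interrupt.STOP): Reaction.CANCEL_AND_SAFE_STOP,
}


def reaction_for(step: str, interrupt: Interrupt) -> Reaction:
    """Look up the reaction; abort always halts, and gaps fail loudly."""
    if interrupt is Interrupt.ABORT:
        return Reaction.HALT_IMMEDIATELY
    try:
        return POLICY[(step, interrupt)]
    except KeyError:
        raise LookupError(
            f"no {interrupt.value} policy defined for step {step!r}"
        ) from None
```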
During waiting
This is where many workflows get stuck.
The workflow may be waiting for:
- motion done
- sensor ready
- acquisition complete
- PLC acknowledgment
- timeout window
If pause/stop arrives during waiting, the engine must decide:
- keep waiting until safe completion?
- cancel the underlying action?
- move the workflow out of its wait state?
- ignore completion events that arrive after cancellation?
That last one is a major source of bugs.
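A common defence against late completions is to tag each command with a correlation id and accept only the completion that matches the active wait. The class below is a minimal single-wait sketch; `WaitTracker` and its methods are hypothetical names.

```python
import itertools
from typing import Optional


class WaitTracker:
    """Tracks the single command the workflow is currently waiting on."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._active_id: Optional[int] = None

    def issue(self) -> int:
        """Register a new command and return its correlation id."""
        self._active_id = next(self._ids)
        return self._active_id

    def cancel(self) -> None:
        """Stop waiting; any later completion for the old id is stale."""
        self._active_id = None

    def on_completion(self, command_id: int) -> bool:
        """Return True only if this completion matches the active wait."""
        if command_id != self._active_id:
            return False  # late or foreign event: log it, do not advance
        self._active_id = None
        return True
```

With this in place, a completion from a cancelled move cannot advance a step that was started after resume.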
ASCII workflow-state view
             +---------+
             | Idle    |
             +---------+
                  |
                  v
             +---------+
             | Starting|
             +---------+
                  |
                  v
             +---------+
             | Running |
             +---------+
           /      |      \
  pause /     stop|         \ abort
     v            v            v
+---------+  +---------+  +---------+
| Pausing |  |Stopping |  |Aborting |
+---------+  +---------+  +---------+
     |            |            |
     v            v            v
+---------+  +---------+  +---------+
| Paused  |  | Stopped |  | Faulted |
+---------+  +---------+  +---------+
     |
   resume
     |
     v
+---------+
| Running |
+---------+
Why interruption handling is complex
Because the workflow is coordinating real operations that may already be in progress, and each subsystem may have different cancellation behavior.
Motion may decelerate. Camera capture may already be triggered. PLC may already have latched a command. A valve may already be open. A part may already be clamped.
So interruption is not just a control flag. It is a coordination problem across real subsystems.
PART 6 — PARTIAL COMPLETION & RECOVERY
This is one of the most important ideas in machine workflows.
When failure happens, you need to know:
- what has already completed
- what is in progress
- what definitely did not happen
- what physical state the machine is now in
- what the safe next action is
That is why strong workflow systems track completion markers, not just current step.
Typical recovery choices
When a workflow fails mid-process, the system may:
- retry the current step
- repeat the whole sub-sequence
- perform compensating cleanup
- continue from the next safe checkpoint
- move to a recovery workflow
- require operator intervention
Retry step
Good when:
- action is idempotent or safely repeatable
- failure is transient
- physical state remains valid
Bad when:
- repeating may duplicate actuation
- side effects already happened
- the environment changed
Rollback
In machine software, rollback is limited.
You can sometimes:
- move back to safe position
- release clamp
- clear output
- discard partial data
- mark part for reject
But you often cannot “undo” the physical world in the same clean way as database rollback.
Continue safely
Possible if:
- completed steps are trusted
- next step does not require redoing previous action
- machine state is still consistent
Require operator action
Often necessary when:
- material position is uncertain
- a gripper may still hold a part
- a sensor disagrees with expected state
- a human must inspect or reset hardware
Checkpoint mentality
Good workflow design often uses checkpoints like:
- recipe validated
- hardware initialized
- part clamped
- stage homed
- scan region 1 complete
- lot step N complete
These checkpoints make recovery tractable.
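A small sketch of the checkpoint idea follows, assuming a fixed, ordered list for one cycle. The class is hypothetical and keeps its state in memory; recovering across a restart would additionally require writing each checkpoint to durable storage.

```python
from typing import List, Optional

# Hypothetical ordered checkpoints for one cycle, taken from the list above.
CHECKPOINTS = [
    "recipe validated",
    "hardware initialized",
    "part clamped",
    "stage homed",
    "scan region 1 complete",
]


class CheckpointLog:
    def __init__(self) -> None:
        self._reached: List[str] = []

    def reach(self, name: str) -> None:
        """Record a checkpoint; they must be reached in the defined order."""
        if len(self._reached) >= len(CHECKPOINTS):
            raise ValueError("all checkpoints already reached")
        expected = CHECKPOINTS[len(self._reached)]
        if name != expected:
            raise ValueError(f"expected {expected!r}, got {name!r}")
        self._reached.append(name)

    def last_reached(self) -> Optional[str]:
        """The most recent checkpoint, or None if nothing is recorded yet."""
        return self._reached[-1] if self._reached else None

    def remaining(self) -> List[str]:
        """Checkpoints that recovery still has to get through."""
        return CHECKPOINTS[len(self._reached):]
```

After a fault, `last_reached()` answers "what has already completed" and `remaining()` answers "what remains", which are the two questions recovery logic needs first.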
PART 7 — REAL-WORLD FAILURE SCENARIOS
1. Workflow stuck waiting for event
What it looks like: Machine shows “Running,” but progress never advances. No obvious fault. Operator says it hangs randomly.
Why it happens
- expected completion event never arrived
- event arrived before wait subscription was active
- timeout missing or too large
- event correlation id mismatch
- subsystem completed physically, but status propagation failed
How engineers debug it
- inspect workflow timeline and current wait condition
- verify whether command was actually issued
- check raw device communication logs
- confirm event/callback path fired
- compare command id vs completion id
- look for race between command issue and event subscription
This is a classic asynchronous coordination bug.
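Two of the causes above, the event arriving before the subscription is active and the missing timeout, can be avoided with a fixed ordering: subscribe, then command, then wait with a bound. The sketch uses `threading.Event` from the standard library; `subscribe` and `send_command` are hypothetical hooks onto the device service.

```python
import threading
from typing import Callable


def command_and_wait(
    subscribe: Callable[[Callable[[], None]], None],
    send_command: Callable[[], None],
    timeout_s: float,
) -> bool:
    """Register the completion handler first, then command, then wait.

    Returns True on completion and False on timeout.
    """
    done = threading.Event()
    subscribe(done.set)   # 1. listener is active before anything can complete
    send_command()        # 2. only now can the completion event be produced
    return done.wait(timeout_s)  # 3. bounded wait; False means timeout
```

Because the event latches once set, a device that completes before `wait` is even called is still observed. A `False` result should become a timeout fault with a reason, not a silent hang.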
2. Step completes but next step starts too early
What it looks like: System starts acquisition before motion has truly settled, or starts clamp before positioning fully finished.
Why it happens
- using “command accepted” as “command completed”
- completion signal means “in position” but not “stable”
- stale cached state read as current
- no postcondition validation
- subsystem reports ready earlier than physically safe
How engineers debug it
- compare timestamps of command, completion, and next-step start
- inspect whether completion semantics are misunderstood
- add separate “settled” or “postcondition verified” state
- instrument actual device values around transition time
Strong engineers learn to distrust naive “done” signals.
3. Interruption leaves system inconsistent
What it looks like: Pause requested during cycle. UI says paused, but one actuator is still active, or workflow resumes from the wrong place.
Why it happens
- interruption only updated workflow flag, not subsystem behavior
- no defined mid-step interruption policy
- transition to Paused happened before underlying step quiesced
- cleanup action not modeled
How engineers debug it
- reconstruct exact step at interruption time
- inspect whether pause was handled at boundary or mid-step
- verify command cancellation / safe-stop path
- check whether completion event from old step was consumed after pause
This is why interruption semantics must be defined per step type.
4. Retry causes duplicate actions
What it looks like: Part gets clamped twice, image stored twice, output signal sent twice, item counted twice.
Why it happens
- step retried without idempotency design
- workflow did not know action had already succeeded
- acknowledgement was delayed, causing false timeout
- completion state not persisted before retry
How engineers debug it
- inspect retry reason and timing
- determine whether original action actually completed
- check if step had idempotency token or duplicate suppression
- separate “command issued,” “command acknowledged,” and “effect confirmed”
Retry is dangerous when physical side effects exist.
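One way to make a retry safe is to give each intended actuation a token and to check for the physical effect before commanding again. This is a minimal sketch: `ActuationLedger` is a hypothetical name, and the token format is an assumption.

```python
from typing import Callable, Set


class ActuationLedger:
    """Remembers which actuation tokens have a confirmed effect."""

    def __init__(self) -> None:
        self._confirmed: Set[str] = set()

    def run_once(
        self,
        token: str,
        actuate: Callable[[], None],
        effect_confirmed: Callable[[], bool],
    ) -> str:
        """Actuate at most once per token; return what happened."""
        if token in self._confirmed:
            return "skipped: already confirmed"
        if effect_confirmed():
            # An earlier attempt worked but its acknowledgement was lost or late.
            self._confirmed.add(token)
            return "skipped: effect observed"
        actuate()
        if effect_confirmed():
            self._confirmed.add(token)
            return "actuated"
        return "unconfirmed"
```

The "unconfirmed" outcome is deliberate. The command was issued but the effect was not observed, so the caller must decide between another attempt and operator intervention rather than assuming either success or failure.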
5. Condition check incorrect due to stale data
What it looks like: Workflow proceeds because sensor says safe, but that value was from an earlier cycle.
Why it happens
- polling cache not refreshed
- async status propagation lag
- data timestamp ignored
- condition check subscribed to wrong source
- race between state update and decision
How engineers debug it
- inspect timestamp and source of condition data
- distinguish latest observed value from last published value
- validate freshness window
- trace the path from device read to workflow decision
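In machine software, stale data is often more dangerous than missing data: a missing value is visibly absent, while a stale value looks valid and lets the workflow proceed on a false premise.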
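Validating a freshness window can be as simple as carrying the read timestamp with every value and refusing to decide on old data. The types below are hypothetical, and the timestamps are assumed to come from one monotonic clock shared by reader and workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    value: bool
    timestamp_s: float  # when the device was actually read
    source: str


class StaleReading(Exception):
    pass


def require_fresh(reading: Reading, now_s: float, max_age_s: float) -> bool:
    """Return the value only if the reading is recent enough to trust."""
    age = now_s - reading.timestamp_s
    if age < 0 or age > max_age_s:
        raise StaleReading(
            f"{reading.source}: age {age:.3f}s outside 0..{max_age_s}s"
        )
    return reading.value
```

Raising instead of returning `False` keeps "sensor says unsafe" and "we do not know" as two different outcomes, which is usually what the fault handling needs.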
PART 8 — SOFTWARE DESIGN IMPLICATIONS
Workflow logic should be explicitly modeled, not hidden across random service methods.
Your source-of-truth already points in this direction: machine workflow/sequencing, deterministic execution, operational control semantics, interlocks/fault handling, and state-driven design all belong here.
Why workflow must be explicit
Because you need to reason about:
- current step
- allowed next step
- wait conditions
- interruption points
- recovery points
- fault ownership
- step completion evidence
If these are scattered through event handlers, timers, callbacks, and service classes, you no longer have a workflow. You have an accident waiting to happen.
Good vs bad approaches
Bad: implicit workflows in code
UI button click
  -> serviceA.DoThing()
  -> if ok call serviceB.Start()
  -> callback somewhere sets flag
  -> timer somewhere checks flag
  -> maybe call serviceC()
Why this fails:
- execution flow is invisible
- step boundaries are unclear
- state lives in booleans everywhere
- pause/stop/recovery become chaotic
- race conditions become normal
- debugging requires reading half the codebase
Good: structured workflow model
Workflow Definition
  -> Step definitions
  -> Transition rules
  -> Preconditions / postconditions
  -> Interruption policy
  -> Retry / recovery policy
Workflow Executor
  -> Runs current step
  -> Tracks state and progress
  -> Waits for completion
  -> Applies transition rules
  -> Handles pause/stop/abort/fault
Subsystem Services
  -> Motion
  -> Vision
  -> IO
  -> Sensors
ASCII component view
+--------------------------+
| Workflow Definition      |
| - steps                  |
| - transitions            |
| - conditions             |
| - recovery rules         |
+------------+-------------+
             |
             v
+--------------------------+
| Workflow Executor        |
| - current step           |
| - progress tracking      |
| - interruption handling  |
| - timeout handling       |
| - retry / recovery       |
+------+-------+-----------+
       |       |
       v       v
+----------+ +----------+ +----------+ +----------+
| Motion   | | Vision   | | Sensors  | | IO/PLC   |
| Service  | | Service  | | Service  | | Service  |
+----------+ +----------+ +----------+ +----------+
Design principles that matter
Clear step definitions: Each step should have a single clear purpose.
Explicit transitions: Do not infer next steps from scattered flags.
State tracking: Track workflow state, step state, and completion markers explicitly.
Separation from device logic: The workflow says what process is happening. Device services say how to talk to hardware.
Time-aware logic: Conditions should know about timeout, freshness, stability, and cancellation.
Observability built in: Every step transition should be logged with timestamps, identifiers, and reason.
Recovery modeled, not improvised: Recovery code written only during incidents is almost always bad.
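To show how a definition and an executor can stay separate, here is a deliberately small sketch. `Step` and `execute` are hypothetical names; each step's `run` callable stands in for a call into a subsystem service, and pause, stop, and timeout handling are omitted to keep the shape visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[], bool]            # returns True on verified success
    on_success: Optional[str] = None   # next step name, None ends the workflow
    on_failure: Optional[str] = None   # recovery step name, None means fault


def execute(steps: Dict[str, Step], start: str, max_transitions: int = 100) -> List[str]:
    """Walk the definition from start and return the transition log."""
    log: List[str] = []
    current: Optional[str] = start
    transitions = 0
    while current is not None:
        if transitions >= max_transitions:
            raise RuntimeError("transition limit reached; possible loop in definition")
        transitions += 1
        step = steps[current]
        ok = step.run()
        log.append(f"{step.name}: {'ok' if ok else 'failed'}")
        if not ok and step.on_failure is None:
            log.append("faulted")  # no recovery path was modeled for this step
            return log
        current = step.on_success if ok else step.on_failure
    return log
```

The definition is plain data: the next step and the recovery step are named per step instead of being inferred from flags, and the executor produces a transition log as a side effect of running.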
PART 9 — INTERVIEW / REAL-WORLD TALKING POINTS
How to explain workflows clearly
You can say:
In industrial machine software, a workflow is the explicit model of a real machine process. It is not just sequential code. It defines ordered steps, dependencies, conditions, interruption behavior, and recovery paths across multiple subsystems.
That is a strong answer.
Difference between workflow and orchestration
A clean explanation:
Workflow is the process definition: the sequence and rules of the machine operation. Orchestration is the runtime coordination logic that drives motion, sensors, vision, and IO to execute that process. The workflow describes what should happen; orchestration ensures the subsystems do it in the right order and under the right conditions.
Common mistakes engineers make
- treating physical actions like synchronous method calls
- assuming command accepted means operation complete
- hiding workflow in event handlers and flags
- not defining pause/stop/abort semantics per step
- retrying non-idempotent actions blindly
- not tracking partial completion
- using stale status for decisions
- mixing device logic and workflow logic
- having no timeout or no meaningful timeout reason
- making recovery manual because software does not know what happened
What strong engineers understand about long-running processes
Strong engineers understand that:
- the machine process is stateful and long-lived
- every step has physical meaning
- step completion must be proven, not assumed
- interruption is part of the design, not an edge case
- recovery requires explicit knowledge of partial completion
- correctness matters more than elegant-looking abstraction
- diagnosability is part of architecture
Interview-ready closing statement
The key architectural move is to model workflow explicitly. Once the workflow, step boundaries, completion conditions, and interruption/recovery semantics are clear, the system becomes understandable, testable, and safer. When those things are implicit, long-running machine behavior becomes fragile very quickly.
Final mental model
Think of workflow coordination in industrial software like this:
- workflow gives the machine a structured process
- executor/orchestrator drives that process across subsystems
- state tracking tells you where you are
- interruption handling tells you how to stop or pause safely
- recovery logic tells you what to do when reality diverges from expectation
The core mindset is not “run steps in order.” It is:
maintain correct, observable, recoverable progress through a physical process over time.
That is the real meaning of workflow & process coordination in machine software.