Observability: Logging, Metrics & Diagnostics in Industrial Machine Software
This topic maps directly to your roadmap’s Domain 15 — Observability, Diagnostics & Serviceability, which highlights structured logging, workflow correlation, device communication logs, alarm/event journaling, metrics, diagnostic snapshots, exportable logs, crash dumps, replay-friendly telemetry, and root-cause-oriented observability design.
Part 1 — Why Observability Is Critical in Machine Software
In industrial machine software, failures are rarely simple.
A web app failure may look like:
“API returned 500.”
A machine software failure may look like:
“Wafer inspection stopped during autofocus after the stage moved, camera trigger timed out, interlock changed briefly, image buffer filled, and the operator pressed retry.”
That is a very different diagnostic problem.
Industrial failures are often:
- intermittent
- timing-sensitive
- cross-layer
- dependent on hardware condition
- dependent on operator action
- dependent on machine state
- hard to reproduce outside the customer site
The most important point is this:
The visible symptom is often not the root cause.
Example:
UI symptom:
Motion timeout
Possible root cause:
Door interlock flickered
Motion controller paused
Workflow kept waiting for completion
Timeout fired 5 seconds later
Another example:
UI symptom:
Inspection failed
Possible root cause:
Illumination intensity drifted
Image contrast dropped
Detection threshold became invalid
Vision algorithm reported low confidence
Another:
UI symptom:
Device reconnected successfully
Possible root cause:
Physical connection recovered,
but command session state was not rebuilt,
so the workflow continued with stale command assumptions.
Good observability helps engineers answer:
What happened?
When did it happen?
What happened first?
Which subsystem originated the problem?
What machine state was active?
Which recipe/config/version was running?
Which command was in progress?
What changed shortly before failure?
Was this a one-time event or degradation over time?
Without this, debugging becomes guessing.
Part 2 — Logging Is Not Enough
A common beginner mistake is thinking:
“We need better diagnostics, so let’s add more logs.”
That is not enough.
Logs are one kind of evidence. They are not the whole observability system.
Industrial diagnostics need several kinds of evidence:
Logs
State transitions
Command traces
Device communication traces
Metrics and counters
Diagnostic snapshots
Alarm/fault history
Crash dumps
Image/result references
Configuration/version records
Operator action history
Weak log:
Operation failed.
Better log:
Timestamp: 2026-04-27T10:15:32.124Z
Subsystem: Motion.StageX
OperationId: InspectWafer-Run-1842
WorkflowStep: MoveToDiePosition
MachineState: AutoRunning
DeviceId: ACS-MotionController-01
CommandId: MoveAbs-X-88291
TargetPositionMm: 142.350
TimeoutMs: 5000
Result: Timeout
FaultCode: MOTION_TIMEOUT
InterlockState: DoorClosed=True, VacuumOk=True, ServoReady=False
The second log is not just text. It is evidence.
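A record like this is easier to search and correlate when it is emitted as structured data rather than formatted text. The following is a minimal sketch, assuming one JSON object per log line; the class name `MachineLogEvent` and its field names are illustrative choices, not an established schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MachineLogEvent:
    """One structured log record. Field names are illustrative, not a standard."""
    subsystem: str
    operation_id: str
    workflow_step: str
    machine_state: str
    device_id: str
    command_id: str
    result: str
    fault_code: Optional[str] = None
    details: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    )

    def to_json(self) -> str:
        # One JSON object per line keeps the log searchable and machine-parseable.
        return json.dumps(asdict(self), sort_keys=True)


event = MachineLogEvent(
    subsystem="Motion.StageX",
    operation_id="InspectWafer-Run-1842",
    workflow_step="MoveToDiePosition",
    machine_state="AutoRunning",
    device_id="ACS-MotionController-01",
    command_id="MoveAbs-X-88291",
    result="Timeout",
    fault_code="MOTION_TIMEOUT",
    details={
        "target_position_mm": 142.350,
        "timeout_ms": 5000,
        "interlocks": {"DoorClosed": True, "VacuumOk": True, "ServoReady": False},
    },
)
print(event.to_json())
```

Because every field is a key rather than part of a sentence, a later query such as "all timeouts on Motion.StageX where ServoReady was false" becomes a filter instead of a text search.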
Part 3 — Structured Logging Across Layers
In machine software, a useful log entry should carry context.
Important fields include:
Timestamp
Sequence number
Subsystem
Machine state
Workflow step
Operation / correlation ID
Device ID
Command ID
Recipe version
Lot / wafer / part ID
Result / status
Fault code
Error details
Thread / task context when useful
The key idea is:
Every important event should be explainable in relation to the machine state, workflow state, command, and device involved.
Layer-Aware Logging
A machine failure often crosses several layers:
+-------------------------------+
| UI / HMI                      |
| Operator clicked Start        |
+---------------+---------------+
                |
                v
+-------------------------------+
| Workflow / Orchestrator       |
| Entered InspectWafer step     |
+---------------+---------------+
                |
                v
+-------------------------------+
| Device Abstraction Layer      |
| Sent MoveAbs command          |
+---------------+---------------+
                |
                v
+-------------------------------+
| Motion Controller / Camera    |
| Command accepted / timed out  |
+---------------+---------------+
                |
                v
+-------------------------------+
| Physical Machine              |
| Stage, sensor, interlock      |
+-------------------------------+
A good trace connects all of these.
Example:
CorrelationId: Run-1842
[10:15:30.010] UI Operator pressed Start
[10:15:30.080] Workflow Transition Idle -> AutoRunning
[10:15:30.240] Workflow Step: MoveToInspectionStart
[10:15:30.260] Motion Command MoveAbs X=142.350 Y=88.120
[10:15:30.275] Device Controller accepted command
[10:15:31.020] IO ServoReady changed True -> False
[10:15:35.280] Motion MoveAbs timed out
[10:15:35.300] Alarm MOTION_TIMEOUT raised
This is powerful because it shows order and causality.
The timeout is a symptom, not the root cause: ServoReady dropped at 10:15:31.020, more than four seconds before the timeout fired at 10:15:35.280.
Part 4 — Metrics, Counters, and Health Indicators
Logs tell stories. Metrics show patterns.
In industrial systems, metrics help you see degradation before failure.
Useful metrics include:
Command latency
Command timeout count
Retry count
Device reconnect count
Queue depth
Dropped frames/messages
Image processing duration
Workflow cycle time
Alarm frequency
Memory usage
CPU usage
Disk usage
Buffer pool usage
Camera frame rate
Motion command completion time
Database write latency
Example:
Camera dropped frame count:
Monday: 0
Tuesday: 3
Wednesday: 19
Thursday: 74
Root cause may not be a sudden bug.
It may be degradation:
cable issue
overloaded image pipeline
disk too slow
memory pressure
frame grabber instability
Logs may show only individual failures.
Metrics show the trend.
Another example:
Motion command average latency:
Normal: 80 ms
Current: 430 ms
Possible meaning:
controller overloaded
network issue
motion queue congestion
machine operating near mechanical limit
A strong machine system should expose health indicators such as:
Healthy
Degraded
Recovering
Disconnected
Faulted
Unknown
This is better than simple “OK / Not OK”.
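These states can be derived from metrics rather than set by hand. The sketch below is one possible classification, assuming the 80 ms normal latency from the example above as a baseline; the `degraded_ratio` threshold of 2.0 and the function name `classify_device_health` are assumptions for illustration, and a real system would tune them per device.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    RECOVERING = "Recovering"
    DISCONNECTED = "Disconnected"
    FAULTED = "Faulted"
    UNKNOWN = "Unknown"


def classify_device_health(
    connected: Optional[bool],
    faulted: bool,
    recovering: bool,
    avg_latency_ms: Optional[float],
    baseline_ms: float = 80.0,     # assumed normal latency
    degraded_ratio: float = 2.0,   # assumed threshold, tune per device
) -> Health:
    if connected is None:
        return Health.UNKNOWN          # no information yet, do not claim OK
    if not connected:
        return Health.DISCONNECTED
    if faulted:
        return Health.FAULTED
    if recovering:
        return Health.RECOVERING
    if avg_latency_ms is None:
        return Health.UNKNOWN
    if avg_latency_ms > baseline_ms * degraded_ratio:
        return Health.DEGRADED         # still working, but trending the wrong way
    return Health.HEALTHY


print(classify_device_health(True, False, False, 430.0).value)
print(classify_device_health(True, False, False, 85.0).value)
```

The order of the checks matters: a disconnected or faulted device is reported as such even if its last latency reading looked normal.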
Part 5 — Diagnostic Snapshots and Evidence Packages
A diagnostic snapshot captures system context at an important moment.
Important moments include:
Fault raised
Alarm raised
Workflow aborted
Device timeout
Crash
Emergency stop observed
Recovery started
Operator pressed Reset
The key rule:
Capture evidence before reset/recovery destroys the context.
When operators clear alarms, reconnect devices, or restart software, valuable evidence disappears.
Evidence Package Diagram
+------------------------------------------------+
| Diagnostic Evidence Package                    |
+------------------------------------------------+
| Timestamp / sequence range                     |
| Software version / build                       |
| Machine ID / station ID                        |
| Recipe / config version                        |
| Current machine state                          |
| Active workflow step                           |
| Active alarms                                  |
| Device health states                           |
| Recent command history                         |
| Recent event history                           |
| Queue / backlog state                          |
| Metrics snapshot                               |
| Relevant image / frame / result references     |
| Exception / crash details                      |
| Operator actions before failure                |
+------------------------------------------------+
A good evidence package lets a field engineer say:
“At the time of failure, the machine was in AutoRunning, recipe R17.3 was active, the stage was moving to die position 42, the camera trigger queue had 18 pending items, ServoReady dropped 4 seconds before the motion timeout, and the operator pressed Retry twice.”
That is diagnosis.
Not guessing.
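The evidence package can be modeled as a plain data record that serializes to a file. This is a minimal sketch covering a subset of the fields in the diagram; the class name, field names, and the sample values for `fault_id`, `machine_id`, and `software_version` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EvidencePackage:
    fault_id: str
    captured_at: str
    software_version: str
    machine_id: str
    recipe_version: str
    machine_state: str
    workflow_step: str
    active_alarms: List[str] = field(default_factory=list)
    device_health: Dict[str, str] = field(default_factory=dict)
    recent_events: List[dict] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)
    operator_actions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def summary(self) -> str:
        # One-line digest a field engineer can read before opening the full package.
        alarms = ",".join(self.active_alarms) or "none"
        return (f"{self.fault_id}: state={self.machine_state}, "
                f"step={self.workflow_step}, recipe={self.recipe_version}, alarms={alarms}")


package = EvidencePackage(
    fault_id="F-0001",                      # hypothetical identifier
    captured_at="2026-04-27T10:15:35.300Z",
    software_version="2.4.1",               # hypothetical
    machine_id="M-001",                     # hypothetical
    recipe_version="R17.3",
    machine_state="AutoRunning",
    workflow_step="MoveToDiePosition",
    active_alarms=["MOTION_TIMEOUT"],
    device_health={"StageX": "Faulted", "Camera": "Healthy"},
    recent_events=[{"seq": 8823, "text": "ServoReady True -> False"}],
    metrics={"camera_trigger_queue_depth": 18},
    operator_actions=["Retry", "Retry"],
)
print(package.summary())
```

Keeping the package as one serializable record means it can be written to disk at the moment of the fault and attached unchanged to a service ticket later.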
Part 6 — Timeline and Correlation
Root cause analysis depends on reconstructing event order.
A single failure may involve:
Operator action
Workflow transition
Command validation
Device command
Device response
Timeout
Alarm
Recovery attempt
Timeline Diagram
Time ---->
10:00:00.000 UI Operator clicks Start
10:00:00.050 Workflow Idle -> AutoRunning
10:00:00.120 Recipe Recipe R17.3 activated
10:00:00.300 Motion MoveAbs command sent
10:00:00.320 Device Command accepted
10:00:01.100 IO VacuumOk True -> False
10:00:01.150 Workflow Waiting for motion complete
10:00:05.320 Motion Timeout waiting for completion
10:00:05.340 Alarm MOTION_TIMEOUT raised
10:00:06.000 UI Operator presses Reset
10:00:06.020 Snapshot Evidence package captured
Notice the real clue:
VacuumOk changed before the timeout.
Without a timeline, engineers may blame motion code.
With a timeline, they investigate vacuum/interlock behavior.
Important design mechanisms:
Correlation IDs
Command IDs
Workflow instance IDs
Monotonic sequence numbers
Consistent timestamps
Command/result pairing
Operator action IDs
Device event sequence numbers
Timestamps alone are not always enough because clocks may drift or events may be buffered. A monotonic sequence number is often extremely useful inside a single process.
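To see why a sequence number helps, consider an event whose timestamp is written late by a buffered logger. The sketch below assigns a process-wide sequence number at the moment each event is recorded and compares the two orderings; the `SequenceSource` class and the late timestamp on the IO event are invented for illustration.

```python
import itertools
import threading


class SequenceSource:
    """Process-wide monotonic counter; safe to call from several threads."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next(self) -> int:
        with self._lock:
            return next(self._counter)


seq = SequenceSource()

# Events are appended in the order they were actually observed.
# The IO event carries a late timestamp, as if a buffered writer stamped it on flush.
journal = [
    {"seq": seq.next(), "ts": "10:00:00.300", "source": "Motion", "text": "MoveAbs command sent"},
    {"seq": seq.next(), "ts": "10:00:00.320", "source": "Device", "text": "Command accepted"},
    {"seq": seq.next(), "ts": "10:00:05.900", "source": "IO", "text": "VacuumOk True -> False"},
    {"seq": seq.next(), "ts": "10:00:05.320", "source": "Motion", "text": "Timeout waiting for completion"},
]

by_timestamp = sorted(journal, key=lambda e: e["ts"])
by_sequence = sorted(journal, key=lambda e: e["seq"])

print([e["source"] for e in by_timestamp])  # IO appears after the timeout: misleading
print([e["source"] for e in by_sequence])   # IO appears before the timeout: true order
```

Sorting by timestamp makes the vacuum change look like a consequence of the timeout, while the sequence order shows it came first. Across processes or devices a single counter is not available, which is why the list above also includes device event sequence numbers and command/result pairing.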
Part 7 — Operator-Visible vs Engineer Diagnostics
Operators and engineers need different information.
Operators need:
What is wrong?
Can I continue?
What action should I take?
Is the machine safe?
What condition blocks operation?
Example operator message:
Motion cannot continue because Stage X is not servo-ready.
Check machine status and call service if the problem remains.
Engineers need:
Raw device status
Command trace
Fault code
Controller response
Timing information
State transition history
Configuration version
Recent interlock changes
Example engineer detail:
FaultCode: MOTION_TIMEOUT
Device: ACS-MotionController-01
CommandId: MoveAbs-X-88291
Axis: X
Target: 142.350 mm
ServoReady changed True -> False at T-4.18s
ControllerState: Disabled
WorkflowStep: MoveToDiePosition
RecipeVersion: R17.3
Do not mix these views carelessly.
If the operator sees too much raw detail, they get confused.
If engineers only see simplified operator messages, they cannot debug.
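One way to keep the views separate without duplicating data is to render both from the same fault record. The following is a rough sketch under that assumption; the `BlockingCondition` field, the `OPERATOR_TEXT` lookup table, and the fallback wording are hypothetical, and the operator text for the servo case is taken from the example message above.

```python
# Operator wording is looked up by blocking condition, so raw device
# detail never leaks into the operator view by accident.
OPERATOR_TEXT = {
    "SERVO_NOT_READY": (
        "Motion cannot continue because Stage X is not servo-ready.",
        "Check machine status and call service if the problem remains.",
    ),
}

FAULT = {
    "FaultCode": "MOTION_TIMEOUT",
    "BlockingCondition": "SERVO_NOT_READY",   # hypothetical field
    "Device": "ACS-MotionController-01",
    "CommandId": "MoveAbs-X-88291",
    "Axis": "X",
    "TargetMm": 142.350,
    "ControllerState": "Disabled",
    "WorkflowStep": "MoveToDiePosition",
    "RecipeVersion": "R17.3",
}


def render_fault(fault: dict, role: str) -> str:
    if role == "operator":
        problem, action = OPERATOR_TEXT.get(
            fault.get("BlockingCondition"),
            ("The machine has stopped.", "Call service."),  # assumed fallback text
        )
        return f"{problem}\n{action}"
    if role == "engineer":
        # Engineers get every field, in a stable order.
        return "\n".join(f"{key}: {value}" for key, value in fault.items())
    raise ValueError(f"unknown role: {role}")


print(render_fault(FAULT, "operator"))
print(render_fault(FAULT, "engineer"))
```

Both views come from one record, so the operator message and the engineer detail cannot drift apart or describe different faults.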
Good systems provide layered diagnostics:
Operator View:
Clear action and blocking condition
Engineer View:
Full evidence and traces
Developer View:
Deep internal state, stack traces, debug-level data
Part 8 — Real-World Failure Scenarios
Scenario 1 — Log says “operation failed”
What it looks like
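[Error] Operation failed.
The machine stops. Nobody knows whether the issue came from motion, vision, IO, recipe validation, or operator action.
Why it happens
The failure is logged at too high an abstraction level: by the time the error is caught, the subsystem, command, and fault code that identify its origin have been discarded.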
Better design
Log with subsystem, command, state, and fault code:
Subsystem: Vision.Acquisition
WorkflowStep: CaptureDieImage
CommandId: TriggerCamera-5591
FaultCode: CAMERA_TRIGGER_TIMEOUT
CameraState: Armed
MotionState: InPosition
RecipeExposureMs: 12.5
Scenario 2 — No correlation between operator action and device command
What it looks like
Operator says:
“I clicked Retry and the machine moved unexpectedly.”
Logs show motion commands, but not which UI action caused them.
Why it happens
UI, workflow, and device layers log separately with no shared operation ID.
Better design
Propagate correlation:
OperatorActionId: UI-Retry-188
CorrelationId: Recovery-Run-1842
CommandId: MoveToSafePosition-991
Now the system can connect:
Operator click -> recovery workflow -> motion command -> device response
Scenario 3 — Timestamp mismatch makes event order unclear
What it looks like
Device log says timeout happened before command was sent.
Why it happens
Each system stamps events with its own local clock, and some logs are buffered before they are written, so the recorded timestamps do not reflect the true order.
Better design
Use:
Central timestamp at ingestion
Device timestamp when available
Monotonic local sequence number
Command/result pairing
Example:
Seq=8821 CommandSent
Seq=8822 DeviceAccepted
Seq=8823 InterlockChanged
Seq=8824 TimeoutRaised
Scenario 4 — Fault cleared before evidence captured
What it looks like
Operator presses Reset. Alarm disappears. Service engineer arrives later and sees nothing useful.
Why it happens
The system treats reset as cleanup, not as a diagnostic moment.
Better design
Before clearing:
Capture snapshot
Persist fault record
Link recent logs/events
Store active device states
Record operator action
Reset should not erase history.
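The steps above amount to an ordering rule inside the reset handler: capture and persist first, clear last. A minimal sketch, assuming a hypothetical `FaultManager` class with an in-memory `fault_history` list standing in for persistent storage:

```python
import copy


class FaultManager:
    def __init__(self, read_device_states):
        # read_device_states is a callable that returns current device states.
        self._read_device_states = read_device_states
        self.active_faults = []
        self.fault_history = []  # stands in for persistent storage

    def raise_fault(self, code):
        self.active_faults.append(code)

    def reset(self, operator_action_id):
        # 1. Capture evidence while the fault context still exists.
        record = {
            "operator_action_id": operator_action_id,
            "faults_at_reset": list(self.active_faults),
            "device_states": copy.deepcopy(self._read_device_states()),
        }
        # 2. Persist it before anything is cleared.
        self.fault_history.append(record)
        # 3. Only now clear the active faults.
        self.active_faults.clear()
        return record


states = {"StageX": {"ServoReady": False, "ControllerState": "Disabled"}}
manager = FaultManager(lambda: states)
manager.raise_fault("MOTION_TIMEOUT")
manager.reset("UI-Reset-12")          # hypothetical operator action ID

# The device recovers afterwards; the stored record must not change with it.
states["StageX"]["ServoReady"] = True
print(manager.active_faults)
print(manager.fault_history[0])
```

The deep copy matters: without it the stored record would keep pointing at live device state and would silently show the recovered values instead of the values at reset time.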
Scenario 5 — Intermittent failure cannot be reproduced
What it looks like
Machine fails once per shift. Developers cannot reproduce it in the lab.
Why it happens
No snapshot, no recent event buffer, no counters, no environment/config capture.
Better design
Keep a rolling diagnostic buffer:
Last 5 minutes of key events
Last N device commands
Last N state transitions
Recent metric samples
Recent image/result references
When a fault occurs, freeze the relevant window into an evidence package.
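A rolling buffer of this kind can be sketched with a bounded `collections.deque`. The version below takes timestamps as plain seconds supplied by the caller so the behavior is easy to follow; the class name `RollingDiagnosticBuffer` and the default limits are assumptions.

```python
from collections import deque


class RollingDiagnosticBuffer:
    """Keeps recent events in memory, bounded by both age and count."""

    def __init__(self, window_s=300.0, max_events=10_000):
        self._window_s = window_s
        self._events = deque(maxlen=max_events)  # oldest entries fall off automatically

    def record(self, t_s, kind, text):
        self._events.append((t_s, kind, text))
        # Drop anything older than the time window relative to the newest event.
        while self._events and t_s - self._events[0][0] > self._window_s:
            self._events.popleft()

    def freeze(self):
        # Return a copy so later events cannot alter captured evidence.
        return list(self._events)


buf = RollingDiagnosticBuffer(window_s=300.0)
buf.record(0.0, "state", "Idle -> AutoRunning")
buf.record(100.0, "command", "MoveAbs sent")
buf.record(400.0, "alarm", "MOTION_TIMEOUT raised")

frozen = buf.freeze()                    # snapshot taken at fault time
buf.record(1000.0, "state", "AutoRunning -> Idle")

for t_s, kind, text in frozen:
    print(t_s, kind, text)
```

The event at 0 s is outside the 5-minute window by the time the alarm arrives, so the frozen copy contains only the command and the alarm, and it stays unchanged when the buffer keeps rolling afterwards.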
Scenario 6 — Field machine has different config/version
What it looks like
Developer says:
“This cannot happen in version 2.4.”
Field engineer says:
“But it happened yesterday.”
Later they discover the machine had:
Software 2.4.1
Motion firmware 7.8
Camera driver 3.2
Recipe R15 modified locally
Calibration file from previous month
Better design
Every diagnostic bundle should include:
Software version
Build commit/hash
Device firmware versions
Driver versions
Recipe/config version
Calibration version
Machine ID
Station ID
Scenario 7 — Performance degrades slowly
What it looks like
The machine works after startup but becomes unstable after 8 hours.
Why it happens
Possible causes:
Memory leak
Handle leak
Queue growth
Disk backlog
Image buffer not released
Slow database writes
Increasing GC pressure
Logs may not show this clearly.
Better design
Track long-running metrics:
Memory usage
Handle count
Queue depth
Image buffer pool usage
Processing duration
Dropped frames
Disk write latency
Cycle time
Industrial software must be observable over hours, days, and weeks, not just during a demo.
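Slow degradation shows up as a trend rather than a single bad value. One simple way to detect it is to fit a least-squares slope to periodic samples and compare it with an allowed growth rate. This is a sketch under that assumption; the sample values and the thresholds are invented for illustration.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours_since_start, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return num / den


def is_growing(samples, max_slope_per_hour):
    return slope_per_hour(samples) > max_slope_per_hour


# Hypothetical hourly samples over an 8-hour run.
memory_mb = [(0, 512), (1, 530), (2, 549), (3, 566), (4, 585),
             (5, 603), (6, 620), (7, 639), (8, 657)]
queue_depth = [(h, 4) for h in range(9)]

print(is_growing(memory_mb, max_slope_per_hour=5.0))    # steady climb: flagged
print(is_growing(queue_depth, max_slope_per_hour=0.5))  # flat: not flagged
```

No single memory sample here would trip a fixed limit, yet the slope of roughly 18 MB per hour makes the leak visible well before the machine becomes unstable.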
Scenario 8 — No useful diagnostic export
What it looks like
Field engineer sends screenshots and a vague description:
“Machine stopped during inspection.”
Developers ask for logs. Field engineer manually zips random files.
Better design
Provide a service-friendly export:
Export Diagnostic Bundle
Time range
Fault ID
Logs
Metrics
Fault history
Snapshot
Config/version info
Recent command traces
Relevant images/results
The field engineer should not need to know where every file is stored.
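An export of this kind can be as simple as a function that knows where each artifact lives and zips it together with a manifest. The following is a minimal sketch using the standard `zipfile` module; the function name `export_bundle`, the archive layout, and the file names in the demo are assumptions.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def export_bundle(fault_id, sources, manifest, out_dir):
    """Zip the listed files plus a manifest and return the bundle path.

    sources maps the name inside the archive to the file's location on disk.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    bundle_path = out_dir / f"diag_{fault_id}.zip"
    missing = []
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for arcname, path in sources.items():
            if Path(path).is_file():
                zf.write(path, arcname=arcname)
            else:
                missing.append(arcname)  # record the gap instead of failing the export
        full_manifest = dict(manifest, fault_id=fault_id, missing=missing)
        zf.writestr("manifest.json", json.dumps(full_manifest, indent=2))
    return bundle_path


# Demo with temporary files; real paths would come from the application's configuration.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "machine.log"
    log_file.write_text("Seq=8824 TimeoutRaised\n", encoding="utf-8")
    bundle = export_bundle(
        "F-0001",
        {"logs/machine.log": log_file, "metrics/metrics.csv": Path(tmp) / "metrics.csv"},
        {"software_version": "2.4.1", "recipe_version": "R15"},
        Path(tmp) / "export",
    )
    with zipfile.ZipFile(bundle) as zf:
        names = sorted(zf.namelist())
        bundle_manifest = json.loads(zf.read("manifest.json"))

print(names)
print(bundle_manifest["missing"])
```

Listing missing files in the manifest, rather than aborting, means the field engineer still gets a usable bundle and the developer can see immediately which evidence was unavailable.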
Part 9 — Software Design Implications
Observability must be designed into the architecture from the start.
It cannot be patched in at the end with random logs.
Important Architectural Components
Structured logging contract
Correlation context propagation
Diagnostic snapshot service
Metrics collection
Event/fault journal
Device communication trace
Crash dump collection
Exportable diagnostic bundle
Retention policy
Field-service tooling
Good vs Bad Approach
Bad:
String logs everywhere
No correlation ID
Generic exceptions
No workflow state in logs
No command/result pairing
No metrics
No snapshots
No export tool
Logs overwritten too quickly
Good:
Structured events
Cross-layer correlation
State-aware logs
Device command traces
Metrics and counters
Fault snapshots
Persistent fault history
Exportable diagnostic bundle
Role-aware diagnostics
Evidence preserved before recovery
Component Diagram
+-----------+      +-------------+      +-----------+
| UI / HMI  |      |  Workflow   |      |  Device   |
|  Actions  |      |    State    |      | Commands  |
+-----+-----+      +------+------+      +-----+-----+
      |                   |                   |
      | structured events | metrics           |
      | correlation IDs   | snapshots         |
      v                   v                   v
+------------------------------------------------+
| Observability Pipeline                         |
| - enrich context                               |
| - correlate events                             |
| - persist logs                                 |
| - collect metrics                              |
| - create snapshots                             |
| - build diagnostic bundle                      |
+-------------------------+----------------------+
                          |
                          v
+------------------------------------------------+
| Diagnostic Stores                              |
| - Logs                                         |
| - Metrics                                      |
| - Fault history                                |
| - Event journal                                |
| - Crash dumps                                  |
| - Image/result references                      |
| - Diagnostic bundles                           |
+-------------------------+----------------------+
                          |
                          v
+------------------------------------------------+
| Consumers                                      |
| - Operator                                     |
| - Engineer                                     |
| - Field service                                |
| - Root cause analysis                          |
+------------------------------------------------+
This is not cloud observability copied into a machine.
This is machine-aware observability.
The design must understand:
machine state
workflow step
device command
operator action
physical condition
recipe/config version
failure evidence
Part 10 — Interview / Real-World Talking Points
How to Explain Observability in Industrial Systems
A strong answer:
In industrial machine software, observability is not just logging. It is the ability to reconstruct what the machine was doing, what the software believed, what the hardware reported, and what changed before a failure. Because failures are often intermittent and cross-layer, we need structured logs, metrics, command traces, state transitions, diagnostic snapshots, and exportable evidence packages. The goal is to support root cause analysis, especially when the original developer is not present and the issue happens on a field machine.
Why “Add More Logs” Is Not Enough
Because more text does not mean more diagnosis.
You need:
Context
Correlation
State
Sequence
Metrics
Snapshots
Fault history
Exportability
Bad logging creates noise.
Good observability creates evidence.
Common Mistakes Engineers Make
Logging generic messages
Logging only exceptions
Ignoring normal state transitions
Not logging command/result pairs
Not including machine state
Not including recipe/config version
Not preserving evidence before reset
Not collecting metrics
Not designing exportable diagnostic bundles
Mixing operator messages with engineer diagnostics
What Strong Engineers Understand
Strong engineers know that production support is part of architecture.
They design systems so that someone can answer:
What was the operator doing?
What was the workflow doing?
What command was active?
What did the device report?
What changed before the fault?
Was this a software bug, hardware issue, configuration problem, timing issue, or operator sequence issue?
Can we prove the sequence from evidence?
That is the real goal.
Core Mental Model
Think of observability in machine software like this:
Logs tell what happened.
Metrics show how behavior changes over time.
Traces connect actions across layers.
Snapshots preserve context at failure time.
Fault history shows repeated patterns.
Diagnostic bundles make field support practical.
The highest-level principle:
Industrial observability is not about collecting data. It is about preserving enough evidence to explain machine behavior after the fact.
That is what reduces downtime, avoids speculation, and makes field support possible.