
Observability: Logging, Metrics & Diagnostics in Industrial Machine Software

This topic maps directly to your roadmap’s Domain 15 — Observability, Diagnostics & Serviceability, which highlights structured logging, workflow correlation, device communication logs, alarm/event journaling, metrics, diagnostic snapshots, exportable logs, crash dumps, replay-friendly telemetry, and root-cause-oriented observability design.


Part 1 — Why Observability Is Critical in Machine Software

In industrial machine software, failures are rarely simple.

A web app failure may look like:

“API returned 500.”

A machine software failure may look like:

“Wafer inspection stopped during autofocus after the stage moved, camera trigger timed out, interlock changed briefly, image buffer filled, and the operator pressed retry.”

That is a very different diagnostic problem.

Industrial failures are often:

  • intermittent
  • timing-sensitive
  • cross-layer
  • dependent on hardware condition
  • dependent on operator action
  • dependent on machine state
  • hard to reproduce outside the customer site

The most important point is this:

The visible symptom is often not the root cause.

Example:

text
UI symptom:
  Motion timeout

Possible root cause:
  Door interlock flickered
  Motion controller paused
  Workflow kept waiting for completion
  Timeout fired 5 seconds later

Another example:

text
UI symptom:
  Inspection failed

Possible root cause:
  Illumination intensity drifted
  Image contrast dropped
  Detection threshold became invalid
  Vision algorithm reported low confidence

Another:

text
UI symptom:
  Device reconnected successfully

Possible root cause:
  Physical connection recovered,
  but command session state was not rebuilt,
  so the workflow continued with stale command assumptions.

Good observability helps engineers answer:

text
What happened?
When did it happen?
What happened first?
Which subsystem originated the problem?
What machine state was active?
Which recipe/config/version was running?
Which command was in progress?
What changed shortly before failure?
Was this a one-time event or degradation over time?

Without this, debugging becomes guessing.


Part 2 — Logging Is Not Enough

A common beginner mistake is thinking:

“We need better diagnostics, so let’s add more logs.”

That is not enough.

Logs are one kind of evidence. They are not the whole observability system.

Industrial diagnostics need several kinds of evidence:

text
Logs
State transitions
Command traces
Device communication traces
Metrics and counters
Diagnostic snapshots
Alarm/fault history
Crash dumps
Image/result references
Configuration/version records
Operator action history

Weak log:

text
Operation failed.

Better log:

text
Timestamp: 2026-04-27T10:15:32.124Z
Subsystem: Motion.StageX
OperationId: InspectWafer-Run-1842
WorkflowStep: MoveToDiePosition
MachineState: AutoRunning
DeviceId: ACS-MotionController-01
CommandId: MoveAbs-X-88291
TargetPositionMm: 142.350
TimeoutMs: 5000
Result: Timeout
FaultCode: MOTION_TIMEOUT
InterlockState: DoorClosed=True, VacuumOk=True, ServoReady=False

The second log is not just text. It is evidence.
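
One way to make such an entry machine-readable is to emit it as a single JSON object per line. The sketch below is a minimal illustration, not a prescribed schema: the class name `MachineEvent`, its field names, and the `details` dictionary are assumptions chosen to mirror the example entry.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    # ISO 8601 UTC timestamp with millisecond precision.
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


@dataclass(frozen=True)
class MachineEvent:
    """One structured log entry. Field names are illustrative, not a standard."""
    subsystem: str
    operation_id: str
    workflow_step: str
    machine_state: str
    result: str
    fault_code: Optional[str] = None
    details: dict = field(default_factory=dict)
    timestamp: str = field(default_factory=_utc_now)

    def to_json_line(self) -> str:
        # One JSON object per line stays greppable and machine-parseable.
        return json.dumps(asdict(self), sort_keys=True)


event = MachineEvent(
    subsystem="Motion.StageX",
    operation_id="InspectWafer-Run-1842",
    workflow_step="MoveToDiePosition",
    machine_state="AutoRunning",
    result="Timeout",
    fault_code="MOTION_TIMEOUT",
    details={
        "device_id": "ACS-MotionController-01",
        "command_id": "MoveAbs-X-88291",
        "target_position_mm": 142.350,
        "timeout_ms": 5000,
        "interlocks": {"DoorClosed": True, "VacuumOk": True, "ServoReady": False},
    },
)
line = event.to_json_line()
print(line)
```

Because every entry carries the same named fields, a service tool can filter by `fault_code` or `operation_id` instead of searching free text.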


Part 3 — Structured Logging Across Layers

In machine software, a useful log entry should carry context.

Important fields include:

text
Timestamp
Sequence number
Subsystem
Machine state
Workflow step
Operation / correlation ID
Device ID
Command ID
Recipe version
Lot / wafer / part ID
Result / status
Fault code
Error details
Thread / task context when useful

The key idea is:

Every important event should be explainable in relation to the machine state, workflow state, command, and device involved.

Layer-Aware Logging

A machine failure often crosses several layers:

text
+-------------------------------+
| UI / HMI                      |
| Operator clicked Start        |
+---------------+---------------+
                |
                v
+-------------------------------+
| Workflow / Orchestrator       |
| Entered InspectWafer step     |
+---------------+---------------+
                |
                v
+-------------------------------+
| Device Abstraction Layer      |
| Sent MoveAbs command          |
+---------------+---------------+
                |
                v
+-------------------------------+
| Motion Controller / Camera    |
| Command accepted / timed out  |
+---------------+---------------+
                |
                v
+-------------------------------+
| Physical Machine              |
| Stage, sensor, interlock      |
+-------------------------------+

A good trace connects all of these.

Example:

text
CorrelationId: Run-1842

[10:15:30.010] UI        Operator pressed Start
[10:15:30.080] Workflow  Transition Idle -> AutoRunning
[10:15:30.240] Workflow  Step: MoveToInspectionStart
[10:15:30.260] Motion    Command MoveAbs X=142.350 Y=88.120
[10:15:30.275] Device    Controller accepted command
[10:15:31.020] IO        ServoReady changed True -> False
[10:15:35.280] Motion    MoveAbs timed out
[10:15:35.300] Alarm     MOTION_TIMEOUT raised

This is powerful because it shows order and causality.

The timeout is the symptom, not the root cause: ServoReady dropped at 10:15:31.020, more than four seconds before the timeout fired.
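
Once events share a correlation ID and a common time base, the question "what changed before the fault?" becomes a simple query. The sketch below encodes the trace above as offsets in milliseconds from 10:15:30.000 and looks for IO changes inside a window before the timeout. The `TraceEvent` type, the `io_changes_before` helper, and the 6-second window are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    t_ms: int      # milliseconds after 10:15:30.000
    source: str
    message: str


TRACE = [
    TraceEvent(10, "UI", "Operator pressed Start"),
    TraceEvent(80, "Workflow", "Transition Idle -> AutoRunning"),
    TraceEvent(240, "Workflow", "Step: MoveToInspectionStart"),
    TraceEvent(260, "Motion", "Command MoveAbs X=142.350 Y=88.120"),
    TraceEvent(275, "Device", "Controller accepted command"),
    TraceEvent(1020, "IO", "ServoReady changed True -> False"),
    TraceEvent(5280, "Motion", "MoveAbs timed out"),
    TraceEvent(5300, "Alarm", "MOTION_TIMEOUT raised"),
]


def io_changes_before(trace, fault_text, window_ms):
    """Return IO events within window_ms before the first event matching fault_text."""
    fault = next(e for e in trace if fault_text in e.message)
    return [
        e for e in trace
        if e.source == "IO" and fault.t_ms - window_ms <= e.t_ms < fault.t_ms
    ]


suspects = io_changes_before(TRACE, "timed out", window_ms=6000)
for e in suspects:
    print(f"{e.message} ({5280 - e.t_ms} ms before the timeout)")
```

The query only works because every layer wrote into one ordered trace; with separate, uncorrelated logs the same question requires manual cross-referencing.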


Part 4 — Metrics, Counters, and Health Indicators

Logs tell stories. Metrics show patterns.

In industrial systems, metrics help you see degradation before failure.

Useful metrics include:

text
Command latency
Command timeout count
Retry count
Device reconnect count
Queue depth
Dropped frames/messages
Image processing duration
Workflow cycle time
Alarm frequency
Memory usage
CPU usage
Disk usage
Buffer pool usage
Camera frame rate
Motion command completion time
Database/write latency

Example:

text
Camera dropped frame count:
  Monday: 0
  Tuesday: 3
  Wednesday: 19
  Thursday: 74

Root cause may not be a sudden bug.
It may be degradation:
  cable issue
  overloaded image pipeline
  disk too slow
  memory pressure
  frame grabber instability

Logs may show only individual failures.

Metrics show the trend.

Another example:

text
Motion command average latency:
  Normal: 80 ms
  Current: 430 ms

Possible meaning:
  controller overloaded
  network issue
  motion queue congestion
  machine operating near mechanical limit

A strong machine system should expose health indicators such as:

text
Healthy
Degraded
Recovering
Disconnected
Faulted
Unknown

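This is better than a simple “OK / Not OK” flag, because states such as Degraded and Recovering expose problems before they become hard faults.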
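
A health indicator like this can be derived from a few raw observations. The sketch below is one possible classification, under stated assumptions: the inputs (`connected`, `reconnecting`, `active_fault`, `latency_ms`), the rule order, and the 2x latency ratio that marks a device as degraded are all hypothetical and would need tuning per device.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    RECOVERING = "Recovering"
    DISCONNECTED = "Disconnected"
    FAULTED = "Faulted"
    UNKNOWN = "Unknown"


def classify(connected, reconnecting, active_fault, latency_ms, baseline_ms,
             degraded_ratio=2.0):
    """Map raw device observations to a health state (rule order is an assumption)."""
    if connected is None:
        return Health.UNKNOWN          # no information yet
    if reconnecting:
        return Health.RECOVERING
    if not connected:
        return Health.DISCONNECTED
    if active_fault:
        return Health.FAULTED
    if latency_ms is None:
        return Health.UNKNOWN          # connected, but no latency sample yet
    if latency_ms > baseline_ms * degraded_ratio:
        return Health.DEGRADED
    return Health.HEALTHY


# The latency example above: 430 ms against an 80 ms baseline.
state = classify(connected=True, reconnecting=False, active_fault=False,
                 latency_ms=430, baseline_ms=80)
print(state.value)
```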


Part 5 — Diagnostic Snapshots and Evidence Packages

A diagnostic snapshot captures system context at an important moment.

Important moments include:

text
Fault raised
Alarm raised
Workflow aborted
Device timeout
Crash
Emergency stop observed
Recovery started
Operator pressed Reset

The key rule:

Capture evidence before reset/recovery destroys the context.

When operators clear alarms, reconnect devices, or restart software, valuable evidence disappears.

Evidence Package Diagram

text
+------------------------------------------------+
| Diagnostic Evidence Package                    |
+------------------------------------------------+
| Timestamp / sequence range                     |
| Software version / build                       |
| Machine ID / station ID                        |
| Recipe / config version                        |
| Current machine state                          |
| Active workflow step                           |
| Active alarms                                  |
| Device health states                           |
| Recent command history                         |
| Recent event history                           |
| Queue / backlog state                          |
| Metrics snapshot                               |
| Relevant image / frame / result references     |
| Exception / crash details                      |
| Operator actions before failure                |
+------------------------------------------------+

A good evidence package lets a field engineer say:

“At the time of failure, the machine was in AutoRunning, recipe R17.3 was active, the stage was moving to die position 42, the camera trigger queue had 18 pending items, ServoReady dropped 4 seconds before the motion timeout, and the operator pressed Retry twice.”

That is diagnosis.

Not guessing.


Part 6 — Timeline and Correlation

Root cause analysis depends on reconstructing event order.

A single failure may involve:

text
Operator action
Workflow transition
Command validation
Device command
Device response
Timeout
Alarm
Recovery attempt

Timeline Diagram

text
Time ---->

10:00:00.000  UI        Operator clicks Start
10:00:00.050  Workflow  Idle -> AutoRunning
10:00:00.120  Recipe    Recipe R17.3 activated
10:00:00.300  Motion    MoveAbs command sent
10:00:00.320  Device    Command accepted
10:00:01.100  IO        VacuumOk True -> False
10:00:01.150  Workflow  Waiting for motion complete
10:00:05.320  Motion    Timeout waiting for completion
10:00:05.340  Alarm     MOTION_TIMEOUT raised
10:00:06.000  UI        Operator presses Reset
10:00:06.020  Snapshot  Evidence package captured before alarm is cleared

Notice the real clue:

text
VacuumOk changed before the timeout.

Without a timeline, engineers may blame motion code.

With a timeline, they investigate vacuum/interlock behavior.

Important design mechanisms:

text
Correlation IDs
Command IDs
Workflow instance IDs
Monotonic sequence numbers
Consistent timestamps
Command/result pairing
Operator action IDs
Device event sequence numbers

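Timestamps alone are not always enough, because clocks drift between machines and buffered events may be stamped late. A monotonic sequence number gives an unambiguous order for events inside a single process; across processes or devices, order still has to be inferred from command/result pairing and device-side sequence numbers.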
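
A process-wide sequence number can be sketched in a few lines. The `EventSequencer` class and the dictionary layout below are assumptions for illustration; the point is that the sequence number and the monotonic clock reading are taken inside the same lock, so sorting by `seq` reproduces the order in which events were stamped even if the wall clock is adjusted.

```python
import itertools
import threading
import time


class EventSequencer:
    """Assigns a process-wide, strictly increasing sequence number to each event."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def stamp(self, message):
        # Take seq and the monotonic reading in one critical section so that
        # a higher seq never carries an earlier monotonic time.
        with self._lock:
            seq = next(self._counter)
            mono_ns = time.monotonic_ns()
            wall = time.time()  # kept for humans; may jump if the clock is adjusted
        return {"seq": seq, "mono_ns": mono_ns, "wall": wall, "message": message}


sequencer = EventSequencer()
first = sequencer.stamp("MoveAbs command sent")
second = sequencer.stamp("Command accepted")
print(first["seq"], second["seq"])
```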


Part 7 — Operator-Visible vs Engineer Diagnostics

Operators and engineers need different information.

Operators need:

text
What is wrong?
Can I continue?
What action should I take?
Is the machine safe?
What condition blocks operation?

Example operator message:

text
Motion cannot continue because Stage X is not servo-ready.
Check machine status and call service if the problem remains.

Engineers need:

text
Raw device status
Command trace
Fault code
Controller response
Timing information
State transition history
Configuration version
Recent interlock changes

Example engineer detail:

text
FaultCode: MOTION_TIMEOUT
Device: ACS-MotionController-01
CommandId: MoveAbs-X-88291
Axis: X
Target: 142.350 mm
ServoReady changed True -> False at T-4.18s
ControllerState: Disabled
WorkflowStep: MoveToDiePosition
RecipeVersion: R17.3

Do not mix these views carelessly.

If the operator sees too much raw detail, they get confused.

If engineers only see simplified operator messages, they cannot debug.

Good systems provide layered diagnostics:

text
Operator View:
  Clear action and blocking condition

Engineer View:
  Full evidence and traces

Developer View:
  Deep internal state, stack traces, debug-level data
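
The operator and engineer views can be rendered from the same fault record, so the two never drift apart. The sketch below assumes a hypothetical `blocking_condition` field and a hypothetical lookup table from that condition to operator wording; neither is part of any real framework.

```python
FAULT = {
    "fault_code": "MOTION_TIMEOUT",
    "blocking_condition": "SERVO_NOT_READY",   # hypothetical field
    "device": "ACS-MotionController-01",
    "command_id": "MoveAbs-X-88291",
    "axis": "X",
    "target_mm": 142.350,
    "controller_state": "Disabled",
    "workflow_step": "MoveToDiePosition",
    "recipe_version": "R17.3",
}

# Hypothetical lookup from blocking condition to operator wording.
OPERATOR_MESSAGES = {
    "SERVO_NOT_READY": (
        "Motion cannot continue because Stage {axis} is not servo-ready.\n"
        "Check machine status and call service if the problem remains."
    ),
}
FALLBACK_MESSAGE = "The machine has stopped. Call service."


def operator_view(fault):
    """Short, actionable text with no raw identifiers."""
    template = OPERATOR_MESSAGES.get(fault.get("blocking_condition"), FALLBACK_MESSAGE)
    return template.format(**fault)


def engineer_view(fault):
    """Every field of the record, one per line."""
    return "\n".join(f"{key}: {value}" for key, value in fault.items())


print(operator_view(FAULT))
print(engineer_view(FAULT))
```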

Part 8 — Real-World Failure Scenarios

Scenario 1 — Log says “operation failed”

What it looks like

text
[Error] Operation failed.

The machine stops. Nobody knows whether the issue came from motion, vision, IO, recipe validation, or operator action.

Why it happens

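The failure is logged at too high a level of abstraction: by the time the message is written, the subsystem, command, and fault code that identify the cause have been discarded.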

Better design

Log with subsystem, command, state, and fault code:

text
Subsystem: Vision.Acquisition
WorkflowStep: CaptureDieImage
CommandId: TriggerCamera-5591
FaultCode: CAMERA_TRIGGER_TIMEOUT
CameraState: Armed
MotionState: InPosition
RecipeExposureMs: 12.5

Scenario 2 — No correlation between operator action and device command

What it looks like

Operator says:

“I clicked Retry and the machine moved unexpectedly.”

Logs show motion commands, but not which UI action caused them.

Why it happens

UI, workflow, and device layers log separately with no shared operation ID.

Better design

Propagate correlation:

text
OperatorActionId: UI-Retry-188
CorrelationId: Recovery-Run-1842
CommandId: MoveToSafePosition-991

Now the system can connect:

text
Operator click -> recovery workflow -> motion command -> device response
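
In Python, this kind of propagation can be sketched with the standard `contextvars` and `logging` modules: the UI handler sets the IDs once, and a logging filter attaches them to every record written further down the call chain. The logger names (`machine.ui`, `machine.workflow`, `machine.device`) and the function names are assumptions for illustration.

```python
import contextvars
import io
import logging

correlation_id = contextvars.ContextVar("correlation_id", default="-")
operator_action_id = contextvars.ContextVar("operator_action_id", default="-")


class ContextFilter(logging.Filter):
    """Copies the current context IDs onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        record.operator_action_id = operator_action_id.get()
        return True


stream = io.StringIO()  # stands in for a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(name)s corr=%(correlation_id)s action=%(operator_action_id)s %(message)s"))
handler.addFilter(ContextFilter())

root = logging.getLogger("machine")
root.setLevel(logging.INFO)
root.addHandler(handler)
root.propagate = False


def send_motion_command(command_id):
    logging.getLogger("machine.device").info("send %s", command_id)


def run_recovery():
    logging.getLogger("machine.workflow").info("recovery started")
    send_motion_command("MoveToSafePosition-991")


def on_retry_clicked():
    operator_action_id.set("UI-Retry-188")
    correlation_id.set("Recovery-Run-1842")
    logging.getLogger("machine.ui").info("Retry clicked")
    run_recovery()


# Run in a copied context so the IDs do not leak into unrelated work.
contextvars.copy_context().run(on_retry_clicked)
output = stream.getvalue()
print(output, end="")
```

The workflow and device functions never receive the IDs as parameters, yet all three log lines carry them.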

Scenario 3 — Timestamp mismatch makes event order unclear

What it looks like

The device log says the timeout happened before the command was sent.

Why it happens

Each system stamps events with its own clock, and buffered logging can write an event long after it occurred, so timestamps from different sources do not line up.

Better design

Use:

text
Central timestamp at ingestion
Device timestamp when available
Monotonic local sequence number
Command/result pairing

Example:

text
Seq=8821 CommandSent
Seq=8822 DeviceAccepted
Seq=8823 InterlockChanged
Seq=8824 TimeoutRaised
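
Command/result pairing on top of sequence numbers can be sketched as follows. The event dictionaries, the set of terminal result kinds, and the second command `MoveAbs-Y-88292` are hypothetical; the `ts` values are deliberately skewed to mimic a device clock that disagrees with the host.

```python
TERMINAL_KINDS = {"Completed", "Failed", "TimeoutRaised"}  # assumed result kinds

EVENTS = [
    # Arrival order and "ts" are unreliable here; "seq" is the local ordering.
    {"seq": 8824, "ts": "10:00:05.320", "kind": "TimeoutRaised",
     "command_id": "MoveAbs-X-88291"},
    {"seq": 8821, "ts": "10:00:05.900", "kind": "CommandSent",
     "command_id": "MoveAbs-X-88291"},
    {"seq": 8822, "ts": "10:00:05.910", "kind": "DeviceAccepted",
     "command_id": "MoveAbs-X-88291"},
    {"seq": 8823, "ts": "10:00:05.100", "kind": "InterlockChanged"},
    {"seq": 8825, "ts": "10:00:06.000", "kind": "CommandSent",
     "command_id": "MoveAbs-Y-88292"},
]


def pair_commands(events):
    """Map each sent command to its terminal result, or None if still open."""
    outcome = {}
    for e in sorted(events, key=lambda e: e["seq"]):
        command_id = e.get("command_id")
        if command_id is None:
            continue
        if e["kind"] == "CommandSent":
            outcome[command_id] = None
        elif e["kind"] in TERMINAL_KINDS and command_id in outcome:
            outcome[command_id] = e["kind"]
    return outcome


outcomes = pair_commands(EVENTS)
print(outcomes)
```

Sorting the same events by `ts` would place the timeout before the send, which is exactly the confusion described in this scenario.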

Scenario 4 — Fault cleared before evidence captured

What it looks like

Operator presses Reset. Alarm disappears. Service engineer arrives later and sees nothing useful.

Why it happens

The system treats reset as cleanup, not as a diagnostic moment.

Better design

Before clearing:

text
Capture snapshot
Persist fault record
Link recent logs/events
Store active device states
Record operator action

Reset should not erase history.
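
A reset handler that follows this rule persists the fault record first and clears alarms last. The sketch below is a minimal illustration: the `AlarmManager` class, its method names, the record layout, and the IDs `F-0001` and `UI-Reset-12` are assumptions, and a temporary directory stands in for the real evidence store.

```python
import json
import tempfile
from pathlib import Path


class AlarmManager:
    """Hypothetical alarm manager that persists evidence before clearing."""

    def __init__(self, evidence_dir):
        self.evidence_dir = Path(evidence_dir)
        self.active_alarms = {}

    def raise_alarm(self, fault_id, fault_code):
        self.active_alarms[fault_id] = {"fault_code": fault_code}

    def reset(self, operator_action_id, device_states, recent_events):
        written = []
        for fault_id, alarm in self.active_alarms.items():
            record = {
                "fault_id": fault_id,
                "fault_code": alarm["fault_code"],
                "reset_by": operator_action_id,
                "device_states": device_states,
                "recent_events": recent_events,
            }
            path = self.evidence_dir / f"{fault_id}.json"
            path.write_text(json.dumps(record, indent=2), encoding="utf-8")
            written.append(path)
        # Clear only after every record is on disk; if a write raises,
        # the alarms stay active and nothing is lost.
        self.active_alarms.clear()
        return written


manager = AlarmManager(tempfile.mkdtemp(prefix="evidence-"))
manager.raise_alarm("F-0001", "MOTION_TIMEOUT")
files = manager.reset(
    operator_action_id="UI-Reset-12",
    device_states={"StageX": "Disabled"},
    recent_events=["ServoReady True -> False", "MoveAbs timed out"],
)
print(files[0].name)
```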


Scenario 5 — Intermittent failure cannot be reproduced

What it looks like

Machine fails once per shift. Developers cannot reproduce it in the lab.

Why it happens

The system keeps no snapshot, no recent-event buffer, no counters, and no record of the environment or configuration at the time of failure.

Better design

Keep a rolling diagnostic buffer:

text
Last 5 minutes of key events
Last N device commands
Last N state transitions
Recent metric samples
Recent image/result references

When a fault occurs, freeze the relevant window into an evidence package.
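
A rolling buffer of this kind can be sketched with `collections.deque`. The class name `RollingDiagnosticBuffer`, the 300-second window, the event cap, and the 60-second freeze window are assumptions; timestamps are plain seconds supplied by the caller so the example stays deterministic.

```python
from collections import deque


class RollingDiagnosticBuffer:
    """Keeps recent events in memory and freezes a window when a fault occurs."""

    def __init__(self, window_s=300.0, max_events=10_000):
        self.window_s = window_s
        # maxlen bounds memory even if events arrive faster than expected.
        self._events = deque(maxlen=max_events)

    def record(self, t_s, kind, detail):
        self._events.append((t_s, kind, detail))
        self._evict(t_s)

    def _evict(self, now_s):
        # Drop events older than the retention window.
        while self._events and self._events[0][0] < now_s - self.window_s:
            self._events.popleft()

    def freeze(self, fault_t_s, before_s=60.0):
        """Return a copy of the events in the before_s seconds up to the fault."""
        return [e for e in self._events if fault_t_s - before_s <= e[0] <= fault_t_s]


buffer = RollingDiagnosticBuffer(window_s=300.0)
buffer.record(0.0, "state", "Idle -> AutoRunning")
buffer.record(100.0, "command", "MoveAbs sent")
buffer.record(350.0, "io", "VacuumOk True -> False")
buffer.record(400.0, "alarm", "MOTION_TIMEOUT raised")

evidence = buffer.freeze(fault_t_s=400.0, before_s=60.0)
print(evidence)
```

The frozen list is a copy, so later events that roll through the buffer do not alter the evidence.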


Scenario 6 — Field machine has different config/version

What it looks like

Developer says:

“This cannot happen in version 2.4.”

Field engineer says:

“But it happened yesterday.”

Later they discover the machine had:

text
Software 2.4.1
Motion firmware 7.8
Camera driver 3.2
Recipe R15 modified locally
Calibration file from previous month

Better design

Every diagnostic bundle should include:

text
Software version
Build commit/hash
Device firmware versions
Driver versions
Recipe/config version
Calibration version
Machine ID
Station ID
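
With such a manifest in every bundle, comparing a field machine against the lab reference becomes mechanical. The sketch below assumes hypothetical key names, made-up recipe contents, and a lab software version of 2.4.0; it also hashes the recipe content, since a locally modified recipe can keep the same name.

```python
import hashlib


def content_hash(data: bytes) -> str:
    # Short SHA-256 prefix; enough to show that the content differs.
    return hashlib.sha256(data).hexdigest()[:12]


def manifest_diff(expected, actual):
    """Return {key: (expected_value, actual_value)} for every key that differs."""
    keys = sorted(set(expected) | set(actual))
    return {
        key: (expected.get(key), actual.get(key))
        for key in keys
        if expected.get(key) != actual.get(key)
    }


released_recipe = b"exposure_ms=12.5\nthreshold=0.80\n"   # made-up content
field_recipe = b"exposure_ms=12.5\nthreshold=0.65\n"      # edited on the machine

lab = {
    "software": "2.4.0",
    "motion_firmware": "7.8",
    "camera_driver": "3.2",
    "recipe": "R15",
    "recipe_hash": content_hash(released_recipe),
}
field = {
    "software": "2.4.1",
    "motion_firmware": "7.8",
    "camera_driver": "3.2",
    "recipe": "R15",
    "recipe_hash": content_hash(field_recipe),
}

differences = manifest_diff(lab, field)
for key, (want, got) in differences.items():
    print(f"{key}: lab={want} field={got}")
```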

Scenario 7 — Performance degrades slowly

What it looks like

The machine works after startup but becomes unstable after 8 hours.

Why it happens

Possible causes:

text
Memory leak
Handle leak
Queue growth
Disk backlog
Image buffer not released
Slow database writes
Increasing GC pressure

Logs may not show this clearly, because each individual entry looks normal; the problem only appears as a trend over hours.

Better design

Track long-running metrics:

text
Memory usage
Handle count
Queue depth
Image buffer pool usage
Processing duration
Dropped frames
Disk write latency
Cycle time

Industrial software must be observable over hours, days, and weeks, not just during a demo.
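
Slow growth of this kind can be flagged by fitting a line through periodic samples. The sketch below computes a plain least-squares slope; the synthetic memory figures and the 5 MB/hour threshold are assumptions for illustration, not recommended values.

```python
def slope_per_hour(samples):
    """Least-squares slope for (elapsed_hours, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return num / den


# Synthetic data: memory grows 25 MB per hour from a 500 MB start.
memory_mb = [(hour, 500 + 25 * hour) for hour in range(9)]
growth = slope_per_hour(memory_mb)

LEAK_THRESHOLD_MB_PER_HOUR = 5.0  # assumed threshold
leak_suspected = growth > LEAK_THRESHOLD_MB_PER_HOUR
print(f"{growth:.1f} MB/hour, leak suspected: {leak_suspected}")
```

The same function applies to handle counts, queue depth, or buffer pool usage; only the threshold changes.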


Scenario 8 — No useful diagnostic export

What it looks like

Field engineer sends screenshots and a vague description:

“Machine stopped during inspection.”

Developers ask for logs. Field engineer manually zips random files.

Better design

Provide a service-friendly export:

text
Export Diagnostic Bundle
  Time range
  Fault ID
  Logs
  Metrics
  Fault history
  Snapshot
  Config/version info
  Recent command traces
  Relevant images/results

The field engineer should not need to know where every file is stored.
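
A one-click export can be sketched with the standard `zipfile` module: the tool gathers each section, writes a manifest, and returns a single archive. The function name `build_bundle`, the section names, and the sample payloads are hypothetical; a real exporter would read from the diagnostic stores rather than take in-memory dictionaries.

```python
import io
import json
import zipfile


def build_bundle(fault_id, time_range, sections):
    """Pack diagnostic sections plus a manifest into one zip archive (bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        manifest = {
            "fault_id": fault_id,
            "time_range": time_range,
            "contents": sorted(sections),
        }
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for name, payload in sections.items():
            archive.writestr(f"{name}.json", json.dumps(payload, indent=2))
    return buffer.getvalue()


bundle = build_bundle(
    fault_id="F-0001",
    time_range={"from": "2026-04-27T10:10:00Z", "to": "2026-04-27T10:16:00Z"},
    sections={
        "logs": [{"seq": 8824, "kind": "TimeoutRaised"}],
        "metrics": {"motion_latency_ms": 430},
        "versions": {"software": "2.4.1", "motion_firmware": "7.8"},
    },
)
print(len(bundle), "bytes")
```

The manifest lists what the archive contains, so a developer can tell at a glance whether a section is missing before starting the analysis.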


Part 9 — Software Design Implications

Observability must be designed into the architecture from the start.

It cannot be patched in at the end with random logs.

Important Architectural Components

text
Structured logging contract
Correlation context propagation
Diagnostic snapshot service
Metrics collection
Event/fault journal
Device communication trace
Crash dump collection
Exportable diagnostic bundle
Retention policy
Field-service tooling

Good vs Bad Approach

Bad:

text
String logs everywhere
No correlation ID
Generic exceptions
No workflow state in logs
No command/result pairing
No metrics
No snapshots
No export tool
Logs overwritten too quickly

Good:

text
Structured events
Cross-layer correlation
State-aware logs
Device command traces
Metrics and counters
Fault snapshots
Persistent fault history
Exportable diagnostic bundle
Role-aware diagnostics
Evidence preserved before recovery

Component Diagram

text
+-----------+     +-------------+     +-----------+
| UI / HMI  |     | Workflow    |     | Device    |
| Actions   |     | State       |     | Commands  |
+-----+-----+     +------+------+     +-----+-----+
      |                  |                  |
      | structured events| metrics          |
      | correlation IDs  | snapshots        |
      v                  v                  v

+------------------------------------------------+
| Observability Pipeline                         |
| - enrich context                               |
| - correlate events                             |
| - persist logs                                 |
| - collect metrics                              |
| - create snapshots                             |
| - build diagnostic bundle                      |
+-------------------------+----------------------+
                          |
                          v

+------------------------------------------------+
| Diagnostic Stores                              |
| - Logs                                         |
| - Metrics                                      |
| - Fault history                                |
| - Event journal                                |
| - Crash dumps                                  |
| - Image/result references                      |
| - Diagnostic bundles                           |
+-------------------------+----------------------+
                          |
                          v

+------------------------------------------------+
| Consumers                                      |
| - Operator                                     |
| - Engineer                                     |
| - Field service                                |
| - Root cause analysis                          |
+------------------------------------------------+

This is not cloud observability copied into a machine.

This is machine-aware observability.

The design must understand:

text
machine state
workflow step
device command
operator action
physical condition
recipe/config version
failure evidence

Part 10 — Interview / Real-World Talking Points

How to Explain Observability in Industrial Systems

A strong answer:

In industrial machine software, observability is not just logging. It is the ability to reconstruct what the machine was doing, what the software believed, what the hardware reported, and what changed before a failure. Because failures are often intermittent and cross-layer, we need structured logs, metrics, command traces, state transitions, diagnostic snapshots, and exportable evidence packages. The goal is to support root cause analysis, especially when the original developer is not present and the issue happens on a field machine.

Why “Add More Logs” Is Not Enough

Because more text does not mean more diagnosis.

You need:

text
Context
Correlation
State
Sequence
Metrics
Snapshots
Fault history
Exportability

Bad logging creates noise.

Good observability creates evidence.

Common Mistakes Engineers Make

text
Logging generic messages
Logging only exceptions
Ignoring normal state transitions
Not logging command/result pairs
Not including machine state
Not including recipe/config version
Not preserving evidence before reset
Not collecting metrics
Not designing exportable diagnostic bundles
Mixing operator messages with engineer diagnostics

What Strong Engineers Understand

Strong engineers know that production support is part of architecture.

They design systems so that someone can answer:

text
What was the operator doing?
What was the workflow doing?
What command was active?
What did the device report?
What changed before the fault?
Was this a software bug, hardware issue, configuration problem, timing issue, or operator sequence issue?
Can we prove the sequence from evidence?

That is the real goal.


Core Mental Model

Think of observability in machine software like this:

text
Logs tell what happened.
Metrics show how behavior changes over time.
Traces connect actions across layers.
Snapshots preserve context at failure time.
Fault history shows repeated patterns.
Diagnostic bundles make field support practical.

The highest-level principle:

Industrial observability is not about collecting data. It is about preserving enough evidence to explain machine behavior after the fact.

That is what reduces downtime, avoids speculation, and makes field support possible.
