Watchdogs, Heartbeats & Health Monitoring
This topic sits inside the reliability/fault-handling area of your roadmap, especially “Timeout design,” “Watchdogs and heartbeat monitoring,” and “Degraded mode operation.”
Part 1 — Why Health Monitoring Is Critical
In industrial machine software, many failures do not throw an exception.
Sometimes the failure is simply:
something expected did not happen.
A camera does not raise an error. It just stops producing frames.
A motion controller does not crash. A command just never reaches Completed.
A PLC connection still exists, but the status value stops changing.
A processing pipeline still has threads, but the queue no longer drains.
That is why this statement is dangerous:
“There is no error, so the system is healthy.”
In machine software, silence can be failure.
Examples:
Camera acquisition:
Expected: frame arrives every ~50 ms
Failure: no frame arrives for 2 seconds
Motion:
Expected: MoveTo(position) completes in 5 seconds
Failure: axis remains Busy forever
PLC:
Expected: heartbeat counter increments every 500 ms
Failure: counter freezes
Image processing:
Expected: queue depth stays below 100
Failure: queue grows to 20,000 while system still says "Running"
A good industrial system does not only wait for exceptions. It watches for:
- freshness
- progress
- completion
- throughput
- responsiveness
- consistency
- resource pressure
The core question is:
“Is the machine still making valid progress?”
Part 2 — Heartbeats vs Watchdogs vs Health Checks
These three are related, but they are not the same.
Heartbeat
A heartbeat is a periodic signal that says:
“I am still alive.”
Example:
PLC heartbeat counter increments every 500 ms.
Camera acquisition service updates LastFrameTimestamp.
Background worker updates LastLoopTimestamp.
A heartbeat is useful, but weak.
A component can be “alive” but still useless.
Example:
Camera SDK thread is alive.
Heartbeat still updates.
But no valid image frames are produced.
So a heartbeat answers:
“Is something still running?”
It does not fully answer:
“Is it doing the correct job?”
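As a minimal sketch of the PLC counter case above, the class below treats a heartbeat counter as alive only if its value changed within a configured window. The class name `HeartbeatCounterMonitor` and its members are assumptions for illustration and do not come from any PLC SDK. Time is passed in explicitly so the logic can be exercised without a real clock.

```csharp
using System;

// Hypothetical helper: tracks a PLC-style heartbeat counter and reports
// whether it changed recently enough. All names are illustrative.
public sealed class HeartbeatCounterMonitor
{
    private readonly TimeSpan _maxSilence;
    private long? _lastCounter;
    private DateTimeOffset _lastChange;

    public HeartbeatCounterMonitor(TimeSpan maxSilence) => _maxSilence = maxSilence;

    // Call on every poll with the counter value just read from the device.
    public void Observe(long counter, DateTimeOffset now)
    {
        if (_lastCounter != counter)
        {
            _lastCounter = counter;
            _lastChange = now; // the counter moved, so the sender is alive
        }
    }

    // True only if a counter was seen and it changed within the allowed window.
    public bool IsAlive(DateTimeOffset now) =>
        _lastCounter.HasValue && now - _lastChange <= _maxSilence;
}
```

With a 500 ms heartbeat, a window of roughly three periods tolerates an occasional late update. Even then, a positive result only says that something is still running, not that it is doing the correct job.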
Watchdog
A watchdog is an observer that expects something to happen within a time window.
It asks:
“Did expected progress happen in time?”
Examples:
Motion command must complete within 30 seconds.
Camera must produce a valid frame within 500 ms.
Workflow step must advance within 10 seconds.
Queue must drain below threshold within 5 seconds.
A watchdog is stronger than a heartbeat because it watches progress, not only existence.
Health Check
A health check explicitly evaluates whether a component is usable.
Example:
Camera health:
- connected
- acquisition started
- last frame is fresh
- frame rate acceptable
- no SDK error
- buffer queue not full
Motion axis health:
- controller connected
- axis enabled
- not faulted
- position feedback fresh
- command state valid
- no limit violation
A health check answers:
“Can this component safely and correctly perform its responsibility now?”
Concept Diagram
+----------------+
|   Component    |
|  Camera / PLC  |
| Axis / Worker  |
+-------+--------+
        |
        | heartbeat / status / progress event
        v
+----------------+
| Health Monitor |
|   freshness    |
|   progress     |
|   thresholds   |
|  state model   |
+-------+--------+
        |
        | health state decision
        v
+----------------+
|    Watchdog    |
|  warn / fault  |
|    escalate    |
| request action |
+----------------+
The important idea:
Heartbeat is evidence. Health monitor interprets evidence. Watchdog decides whether lack of progress is unacceptable.
Part 3 — What Should Be Monitored
1. Device Connectivity
Healthy means:
Device is reachable, initialized, configured, and in expected mode.
Unhealthy looks like:
Connection lost
SDK handle invalid
Device responds slowly
Device reconnects but loses state
Useful evidence:
Last successful command time
Last communication error
Connection state
Device mode
Firmware/version info
Reconnect count
Bad design:
bool IsConnected;
Better design:
Connected
Initialized
Configured
Ready
Busy
Faulted
Recovering
Offline
2. Command Completion
Healthy means:
Commands complete within expected timing and reach a valid final state.
Unhealthy looks like:
Command stuck in Busy
Command result never received
Command completed but physical state does not match
Command timeout repeatedly occurs
Useful evidence:
CommandId
StartTime
ExpectedTimeout
CurrentDeviceState
LastStatusUpdate
FinalResult
Example:
MoveAxis(X, 120.0)
Expected:
- Axis enters Moving
- Position changes
- Axis reaches target
- Done signal arrives
Unhealthy:
- Axis stays Busy
- Position does not change
- Done signal never arrives
3. Workflow Progress
Healthy means:
The workflow continues moving through valid states.
Unhealthy looks like:
Workflow stuck in "WaitingForFrame"
Workflow waiting forever for vacuum ready
Workflow step never exits
Workflow task silently stopped
Useful evidence:
Current step
Step enter timestamp
Expected max duration
Blocking condition
Last event received
Current machine mode
Important:
Workflow watchdogs should monitor step progress, not just thread life.
4. Queue Depth / Backlog
Healthy means:
Queues remain bounded and drain at an acceptable rate.
Unhealthy looks like:
Frame queue grows continuously
Result writer queue never drains
UI update queue becomes delayed
Inspection results accumulate in memory
Useful evidence:
Queue depth
Enqueue rate
Dequeue rate
Oldest item age
Dropped item count
Backpressure state
A system can be “green” while slowly dying because queues are growing.
That is a classic long-running machine problem.
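A minimal sketch of a backlog check built from two pieces of the evidence listed above, queue depth and oldest item age. The type `BacklogEvaluator`, the `BacklogState` enum, and every threshold parameter are hypothetical names chosen for illustration; real limits would come from measured production behavior.

```csharp
using System;

public enum BacklogState { Ok, Warning, Fault }

// Hypothetical evaluator; names and thresholds are illustrative assumptions.
public static class BacklogEvaluator
{
    public static BacklogState Evaluate(
        int depth, TimeSpan oldestItemAge,
        int warnDepth, int faultDepth,
        TimeSpan warnAge, TimeSpan faultAge)
    {
        // Oldest-item age catches a stalled consumer even when the queue is short;
        // depth catches a consumer that runs but is slower than the producer.
        if (depth >= faultDepth || oldestItemAge >= faultAge) return BacklogState.Fault;
        if (depth >= warnDepth || oldestItemAge >= warnAge) return BacklogState.Warning;
        return BacklogState.Ok;
    }
}
```

Checking age as well as depth matters because a short queue whose head item is very old usually means the consumer has stopped, which depth alone would report as fine.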
5. Frame Arrival Rate
Healthy means:
Frames arrive at expected rate and timestamps are fresh.
Unhealthy looks like:
No frames
Irregular frames
Duplicate frames
Old frames reused
Frames arriving but invalid
Frame rate lower than recipe expectation
Useful evidence:
Last frame timestamp
Frame counter
Frame interval statistics
Dropped frame count
Camera trigger count
Acquisition state
Example:
Expected frame interval: 50 ms
Warning threshold: no frame for 250 ms
Fault threshold: no frame for 1000 ms
6. Sensor Freshness
Healthy means:
Sensor values are recent enough to trust.
Unhealthy looks like:
Temperature value unchanged for too long
Vacuum value stale
Door state not updated
Encoder feedback not refreshed
Useful evidence:
Value
Timestamp
Source
Update interval
Quality flag
Last successful read
Bad:
if (sensor.VacuumOk)
{
    Continue();
}
Better:
if (sensor.VacuumOk && sensor.Age < MaxAllowedAge)
{
    Continue();
}
Industrial systems must treat old data as suspicious.
7. Background Worker Activity
Healthy means:
Worker loop is running, not blocked, and making progress.
Unhealthy looks like:
Worker task died
Worker loop blocked on lock
Worker stuck waiting for device callback
Worker alive but not processing messages
Useful evidence:
Last loop timestamp
Last processed item timestamp
Unhandled exception
Current operation
Cancellation state
Thread/task status
A worker being “started” is not enough.
The real question is:
“Is the worker still doing useful work?”
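One way to make that question answerable is to have the worker record its own activity and capture its own failure. The sketch below is an assumption-laden illustration: `SupervisedWorker`, `processOneItem`, and `IsMakingProgress` are hypothetical names, and a real design would also need a restart and fault-propagation strategy.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical supervised worker: records activity and captures failures
// instead of letting the task die unobserved.
public sealed class SupervisedWorker
{
    private long _lastActivityTicks;          // UTC ticks of the last completed item
    private volatile Exception? _failure;

    public Exception? Failure => _failure;
    public Task Completion { get; private set; } = Task.CompletedTask;

    public DateTimeOffset LastActivity =>
        new DateTimeOffset(Interlocked.Read(ref _lastActivityTicks), TimeSpan.Zero);

    public void Start(Func<CancellationToken, Task> processOneItem, CancellationToken token)
    {
        Completion = Task.Run(async () =>
        {
            try
            {
                while (!token.IsCancellationRequested)
                {
                    await processOneItem(token);
                    Interlocked.Exchange(ref _lastActivityTicks, DateTimeOffset.UtcNow.UtcTicks);
                }
            }
            catch (OperationCanceledException) when (token.IsCancellationRequested)
            {
                // Normal shutdown, not a failure.
            }
            catch (Exception ex)
            {
                _failure = ex; // kept for the health monitor instead of being lost
            }
        });
    }

    // Healthy only if the loop has not failed and finished an item recently.
    public bool IsMakingProgress(DateTimeOffset now, TimeSpan maxIdle) =>
        _failure is null && now - LastActivity <= maxIdle;
}
```

A worker that has never completed an item reports no progress here, which is deliberate: being started is treated as no evidence at all.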
8. UI Responsiveness
Healthy means:
UI thread can process messages within acceptable latency.
Unhealthy looks like:
UI freezes
Operator clicks do nothing
Live status stops updating
Alarm screen becomes stale
Useful evidence:
Dispatcher heartbeat
UI update latency
Render delay
Last UI tick timestamp
Pending UI queue size
In WPF industrial systems, UI responsiveness is not just comfort.
It affects operator trust and recovery speed.
9. Storage Availability
Healthy means:
The system can write required data fast enough and safely.
Unhealthy looks like:
Disk full
Database unavailable
Result writer queue backing up
Image save latency too high
File lock issues
Useful evidence:
Free disk space
Write latency
Failed write count
Queue depth
Oldest unsaved result age
For inspection machines, storage health can directly affect production.
10. CPU / Memory / Disk Pressure
Healthy means:
Resources stay within safe operating range.
Unhealthy looks like:
Memory grows over 8 hours
CPU spikes cause missed frames
Disk IO delays result saving
GC pauses affect UI or pipeline timing
Useful evidence:
Working set
Private bytes
GC heap size
CPU usage
Disk queue length
Handle count
Thread count
This is especially important in long-running vision systems.
Part 4 — Health States and Escalation
Binary health is too weak.
This is not enough:
Healthy / Unhealthy
Real systems need intermediate states because many failures are gradual.
A practical model:
Healthy
Suspect
Degraded
Faulted
Recovering
Offline
State Meaning
Healthy
- Component is usable.
- Evidence is fresh.
- Progress is normal.
Suspect
- One or more signals look abnormal.
- Not enough evidence to stop the machine yet.
Degraded
- Component still works, but below expected quality or speed.
- Production may continue with reduced capability.
Faulted
- Component cannot be trusted.
- Machine action must be stopped, blocked, or escalated.
Recovering
- System is attempting controlled recovery.
- Commands may be restricted.
Offline
- Component is intentionally unavailable or disconnected.
State Diagram
+---------+
| Healthy |
+----+----+
     |
     | missed heartbeat / slow progress / stale data
     v
+---------+
| Suspect |
+----+----+
     |
     | repeated issue / threshold exceeded
     v
+----------+
| Degraded |
+----+-----+
     |
     | unsafe / unusable / timeout exceeded
     v
+---------+
| Faulted |
+----+----+
     |
     | recovery requested
     v
+------------+
| Recovering |
+-----+------+
      |
      | recovery success
      v
+---------+
| Healthy |
+---------+
Faulted ---> Offline
Offline ---> Recovering
Why this matters:
Not every missed heartbeat should stop production. But repeated missed heartbeats should not be ignored.
Good systems escalate based on evidence over time.
Example:
1 missed camera frame:
Suspect
10 missed frames:
Degraded
No frame for 2 seconds during active inspection:
Faulted
Part 5 — Watchdog Time Windows and False Positives
Watchdog timing is one of the hardest parts.
If the timeout is too short, the machine becomes noisy and unstable.
If the timeout is too long, the machine detects real failures too late.
Example 1 — Camera Frame Watchdog
Normal frame interval: 50 ms
Warning threshold: 250 ms
Fault threshold: 1000 ms
Timeline:
Expected frame stream:
0ms     50ms    100ms   150ms   200ms
|-------|-------|-------|-------|
Frame   Frame   Frame   Frame   Frame
Abnormal stream (gaps measured from the last frame at 50 ms):
0ms     50ms    100ms              300ms               1050ms
|-------|-------|------------------|-------------------|
Frame   Frame   no frame           Suspect/Warning     Fault
Explanation:
- One missing frame may be normal jitter.
- A 250 ms gap may indicate degraded acquisition.
- A 1000 ms gap during active inspection is likely a real fault.
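The two-threshold idea can be sketched as a small pure function. `FrameGapClassifier` and `GapLevel` are hypothetical names; the 250 ms and 1000 ms values in the usage note are the example thresholds from this section, not recommendations.

```csharp
using System;

public enum GapLevel { Normal, Warning, Fault }

// Hypothetical classifier for the time elapsed since the last frame.
public static class FrameGapClassifier
{
    public static GapLevel Classify(TimeSpan sinceLastFrame, TimeSpan warning, TimeSpan fault)
    {
        if (warning >= fault)
            throw new ArgumentException("Warning threshold must be below the fault threshold.");

        if (sinceLastFrame >= fault) return GapLevel.Fault;
        if (sinceLastFrame >= warning) return GapLevel.Warning;
        return GapLevel.Normal; // includes ordinary jitter around the nominal interval
    }
}
```

With warning at 250 ms and fault at 1000 ms, a 120 ms gap classifies as Normal, a 420 ms gap as Warning, and a 1000 ms gap as Fault. Whether a Fault result should stop the machine is a separate decision that depends on whether inspection is active.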
Example 2 — Workflow Step Watchdog
Step: WaitForVacuumReady
Expected: 10 seconds
Warning: 15 seconds
Fault: 30 seconds
Timeline:
Step entered
|
v
0s -------- 10s -------- 15s ---------------- 30s
|           |            |                    |
Start       Expected     Warning              Fault
            completion   operator awareness   stop workflow
The timeout should reflect physical behavior.
Vacuum may take time. Motion may take time. Camera exposure may vary. A real system must understand the process.
Common Timing Mistakes
Bad:
Set every device timeout to 5 seconds.
Why bad?
Because different operations have different physical meanings.
Better:
Camera grab timeout: 500 ms
Axis move timeout: based on distance, velocity, acceleration, margin
Vacuum timeout: based on chamber size and expected pressure curve
Database write warning: based on queue age and production rate
Strong engineers do not choose watchdog windows randomly.
They ask:
What is the normal duration?
What is the worst acceptable duration?
What jitter is expected?
What happens if we stop too early?
What happens if we wait too long?
Part 6 — Active vs Passive Health Monitoring
Active Monitoring
Active monitoring asks the component to prove health.
Example:
PC software sends ping to PLC every 500 ms.
PLC replies with counter/status.
Diagram:
+-------------+       ping/read status       +-----+
| PC Software | ---------------------------> | PLC |
|   Monitor   | <--------------------------- |     |
+-------------+       response/counter       +-----+
Good for:
- communication checks
- explicit status reads
- detecting disconnected devices
- verifying command channel availability
Weakness:
A device can respond to ping but still be functionally broken.
Example:
PLC responds to ping,
but the conveyor command is ignored.
Passive Monitoring
Passive monitoring observes normal operational events.
Example:
Camera frames arrive during acquisition.
Each frame updates LastFrameTimestamp.
Monitor checks frame freshness.
Diagram:
+--------+        frame events        +----------------+
| Camera | -------------------------> | Frame Pipeline |
+--------+                            +-------+--------+
                                              |
                                              | last frame timestamp
                                              v
                                      +----------------+
                                      | Health Monitor |
                                      +----------------+
Good for:
- real operational health
- throughput monitoring
- progress detection
- detecting stuck pipelines
Weakness:
If the component is idle, lack of events may be normal.
So passive monitoring must understand context:
No frame while camera is Idle:
Healthy
No frame while camera is Acquiring:
Fault
Context matters.
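A sketch of a context-aware freshness check, assuming the monitor knows the acquisition state and when acquisition started. `FrameFreshnessCheck`, `AcquisitionState`, and `FrameHealth` are hypothetical names used only for illustration.

```csharp
using System;

public enum AcquisitionState { Idle, Acquiring }
public enum FrameHealth { Healthy, Fault }

// Hypothetical check: missing frames only count while frames are expected.
public static class FrameFreshnessCheck
{
    public static FrameHealth Evaluate(
        AcquisitionState state,
        DateTimeOffset acquisitionStarted,
        DateTimeOffset? lastFrame,
        DateTimeOffset now,
        TimeSpan faultThreshold)
    {
        // No frames while idle is normal, so silence carries no information.
        if (state == AcquisitionState.Idle) return FrameHealth.Healthy;

        // Measure silence from the later of "acquisition started" and "last frame",
        // so an old frame from a previous run does not fault a fresh start.
        DateTimeOffset reference =
            lastFrame.HasValue && lastFrame.Value > acquisitionStarted
                ? lastFrame.Value
                : acquisitionStarted;

        return now - reference >= faultThreshold ? FrameHealth.Fault : FrameHealth.Healthy;
    }
}
```

Anchoring the silence window to the acquisition start is a design choice made here to avoid a false fault in the first moments after leaving Idle.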
Part 7 — Real-World Failure Scenarios
Scenario 1 — Heartbeat Updates but Device Is Functionally Stuck
Production symptom:
PLC heartbeat is green.
Machine says "Connected."
But conveyor does not move.
Why it happens:
The communication link is alive, but the functional part of the controller is stuck, inhibited, faulted, or ignoring commands.
Bad diagnosis:
PLC is connected, so PLC is fine.
Better diagnosis:
Connection health: OK
Heartbeat health: OK
Command execution health: Failed
Physical progress: No encoder/sensor change
How experienced engineers handle it:
They separate:
Connectivity health
Controller health
Command health
Physical progress health
Scenario 2 — Watchdog Timeout Too Short Causes False Stops
Production symptom:
Machine randomly stops during heavy load.
Operators report false alarms.
Logs show camera timeout.
Why it happens:
The watchdog threshold was set based on ideal lab timing, not real production timing.
Example:
Normal lab frame interval: 50 ms
Production occasionally: 120–180 ms
Watchdog timeout: 100 ms
Result:
The watchdog becomes noise.
How experienced engineers handle it:
They measure real timing distribution and set thresholds with margin.
They also separate warning and fault thresholds.
Scenario 3 — Watchdog Timeout Too Long Delays Safe Recovery
Production symptom:
Motion axis gets stuck.
Machine waits 5 minutes before faulting.
Operator loses time.
Material may be at risk.
Why it happens:
Timeout was set too generously to avoid false positives.
The machine detects failure too late.
Better approach:
Expected motion duration calculated from distance/speed.
Warning after expected duration + margin.
Fault after maximum physically reasonable duration.
A 10 mm move and a 500 mm move should not have the same timeout.
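As a worked sketch of "calculated from distance/speed", the code below estimates move time for a symmetric trapezoidal velocity profile (equal acceleration and deceleration) and derives a fault timeout from it. The profile shape, the class `MoveTimeout`, and the margin parameters are assumptions for illustration; a real axis may use S-curve profiles, settling time, or controller-specific behavior that this ignores.

```csharp
using System;

// Hypothetical timeout helper, assuming a symmetric trapezoidal motion profile.
public static class MoveTimeout
{
    public static TimeSpan ExpectedDuration(
        double distanceMm, double maxVelocityMmPerS, double accelMmPerS2)
    {
        if (distanceMm < 0 || maxVelocityMmPerS <= 0 || accelMmPerS2 <= 0)
            throw new ArgumentOutOfRangeException(nameof(distanceMm), "Distance must be >= 0; velocity and acceleration must be > 0.");

        // Distance consumed by accelerating to max velocity and braking back to zero.
        double rampDistance = maxVelocityMmPerS * maxVelocityMmPerS / accelMmPerS2;

        double seconds = distanceMm >= rampDistance
            ? distanceMm / maxVelocityMmPerS + maxVelocityMmPerS / accelMmPerS2 // reaches cruise speed
            : 2.0 * Math.Sqrt(distanceMm / accelMmPerS2);                        // short move, triangular profile

        return TimeSpan.FromSeconds(seconds);
    }

    // Fault timeout = expected time scaled by a margin, plus fixed overhead
    // for communication and settling.
    public static TimeSpan FaultTimeout(TimeSpan expected, double marginFactor, TimeSpan fixedOverhead) =>
        TimeSpan.FromSeconds(expected.TotalSeconds * marginFactor) + fixedOverhead;
}
```

Assuming 100 mm/s and 500 mm/s², a 10 mm move is expected to take about 0.28 s and a 500 mm move 5.2 s. With a margin factor of 1.5 and 0.5 s overhead, the fault timeouts come out near 0.9 s and 8.3 s, so the short move is no longer allowed to hang for as long as the long one.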
Scenario 4 — Queue Backlog Grows but Health Remains Green
Production symptom:
Inspection continues.
Camera keeps acquiring.
Processing queue grows.
Memory increases.
Eventually app freezes or crashes.
Why it happens:
Health model only checks whether services are running.
It does not monitor throughput or backlog.
Bad health model:
Camera connected: true
Processor running: true
Storage connected: true
System healthy: true
Better health model:
Frame arrival rate: OK
Processing rate: too slow
Queue depth: rising
Oldest frame age: 12 seconds
System state: Degraded
Scenario 5 — Background Worker Dies Silently
Production symptom:
Machine appears idle.
No new results are saved.
No alarm appears.
Why it happens:
A background task threw an exception and exited.
Nobody observed the task.
Bad design:
Task.Run(() => ProcessResults());
No supervision. No heartbeat. No restart strategy. No fault propagation.
Better design:
Worker has:
- lifecycle state
- last activity timestamp
- unhandled exception capture
- health contribution
- supervisor/monitor ownership
Scenario 6 — Stale Sensor Value Treated as Current
Production symptom:
Software thinks vacuum is OK.
Actually vacuum was lost seconds ago.
Machine continues incorrectly.
Why it happens:
The system stores the last known value but does not check its age.
Bad:
VacuumOk = true
Better:
VacuumOk = true
LastUpdated = 2026-04-27 10:12:01.230
Age = 8.5 seconds
Quality = Stale
Strong rule:
Every important sensor value should have a timestamp and freshness policy.
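A minimal sketch of that rule: the value travels together with its timestamp, and the consumer applies a freshness policy before trusting it. `TimestampedValue<T>`, `Quality`, and `VacuumGate` are hypothetical names; the maximum age is an assumption that would be set per sensor.

```csharp
using System;

public enum Quality { Good, Stale }

// Hypothetical wrapper: a sensor value is never stored without its timestamp.
public sealed record TimestampedValue<T>(T Value, DateTimeOffset LastUpdated)
{
    public TimeSpan Age(DateTimeOffset now) => now - LastUpdated;

    public Quality QualityAt(DateTimeOffset now, TimeSpan maxAge) =>
        Age(now) <= maxAge ? Quality.Good : Quality.Stale;
}

public static class VacuumGate
{
    // The workflow may continue only if vacuum is OK and the reading is fresh.
    public static bool MayContinue(TimestampedValue<bool> vacuumOk, DateTimeOffset now, TimeSpan maxAge) =>
        vacuumOk.Value && vacuumOk.QualityAt(now, maxAge) == Quality.Good;
}
```

With the example above, a reading of true that is 8.5 seconds old is rejected under a 1 second maximum age, even though the stored value still says the vacuum is OK.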
Scenario 7 — Health Monitor Itself Becomes Unreliable
Production symptom:
Everything is green.
But later logs show health monitor stopped updating.
Why it happens:
The monitor was implemented as just another background worker with no supervision.
How experienced engineers handle it:
They design health monitoring as a first-class subsystem.
At minimum:
Health monitor has own heartbeat.
Health snapshot includes timestamp.
Consumers reject stale health snapshots.
Critical monitors are simple and robust.
Important principle:
Health monitoring must not become a hidden single point of failure.
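The "consumers reject stale health snapshots" point can be sketched as a guard on the consumer side. `MonitorSnapshot` and `SnapshotGuard` are hypothetical, simplified types for this illustration; the maximum snapshot age is an assumption tied to how often the monitor is expected to publish.

```csharp
using System;

// Hypothetical, simplified snapshot published by a health monitor.
public sealed record MonitorSnapshot(string ComponentName, bool IsHealthy, DateTimeOffset Timestamp);

public static class SnapshotGuard
{
    // A missing or old snapshot is treated as "unknown", never as healthy.
    public static bool IsTrustedHealthy(MonitorSnapshot? snapshot, DateTimeOffset now, TimeSpan maxAge)
    {
        if (snapshot is null) return false;                      // monitor never reported
        if (now - snapshot.Timestamp > maxAge) return false;     // monitor stopped updating
        return snapshot.IsHealthy;
    }
}
```

The effect is that a monitor that silently stops publishing turns consumers away from green on its own, without needing a second monitor to notice.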
Scenario 8 — Reconnect Resets Heartbeat but Device State Remains Invalid
Production symptom:
Device disconnects and reconnects.
UI becomes green.
Next command fails or behaves incorrectly.
Why it happens:
Reconnect restored communication, but not machine readiness.
The device may need:
re-initialization
configuration reload
homing
state verification
mode synchronization
buffer clearing
Bad model:
Connected = true
Therefore Ready = true
Better model:
Connected
Initialized
Configured
StateVerified
Ready
Reconnect is not recovery by itself.
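One way to encode the better model is to track each readiness step separately and reset everything except connectivity on reconnect. `ReadinessStep` and `DeviceReadiness` are hypothetical names; which steps a given device really needs (homing, buffer clearing, and so on) is device-specific and not modeled here.

```csharp
using System;

[Flags]
public enum ReadinessStep
{
    None = 0,
    Connected = 1,
    Initialized = 2,
    Configured = 4,
    StateVerified = 8,
    All = Connected | Initialized | Configured | StateVerified
}

// Hypothetical readiness tracker: Ready is derived, never set directly.
public sealed class DeviceReadiness
{
    public ReadinessStep Completed { get; private set; } = ReadinessStep.None;

    public void OnDisconnected() => Completed = ReadinessStep.None;

    // A reconnect proves only that the link is back; every other step must be redone.
    public void OnReconnected() => Completed = ReadinessStep.Connected;

    public void MarkDone(ReadinessStep step) => Completed |= step;

    public bool IsReady => Completed == ReadinessStep.All;
}
```

Because `IsReady` is computed from the completed steps, a UI bound to it cannot turn green merely because the connection came back.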
Part 8 — Software Design Implications
Health monitoring must be designed into the architecture. It cannot be added only at the end as a few timers.
Bad Approach
Each device exposes IsConnected.
UI checks IsConnected.
If true, show green.
Problems:
Connected does not mean ready.
Ready does not mean progressing.
Progressing does not mean producing valid output.
Valid once does not mean fresh now.
Good Approach
Use layered health signals:
Connectivity
Initialization
Configuration
Operational state
Progress
Freshness
Performance
Resource pressure
Fault history
Component Diagram
+------------------+
| Device / Worker  |
|  Camera / Axis   |
|  PLC / Pipeline  |
+---------+--------+
          |
          | heartbeat
          | progress events
          | status snapshots
          | timestamps
          v
+------------------+
|  Health Monitor  |
| freshness checks |
| progress checks  |
|    thresholds    |
|   state model    |
+---------+--------+
          |
          | health state
          | diagnostic evidence
          v
+------------------+
|  Fault Manager   |
|     severity     |
|    escalation    |
|    ownership     |
+---------+--------+
          |
          | action request
          v
+------------------+
|  Machine State   |
|      alarm       |
| inhibit command  |
| controlled stop  |
|   diagnostics    |
+------------------+
Key design principle:
Detection and action should be separated.
The health monitor should decide:
Camera frame stream is stale.
The fault/recovery layer should decide:
Warn operator.
Stop inspection.
Block next wafer.
Attempt controlled recovery.
Require service intervention.
This separation prevents health monitoring from becoming a messy place full of machine-control decisions.
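A small sketch of that separation: the monitor emits a finding that carries no machine-control decision, and a policy owned by the fault/recovery layer maps it to an action. `HealthFinding`, `FaultPolicy`, and the two-value `MachineAction` enum are hypothetical and far smaller than a real escalation policy.

```csharp
using System;

public enum Severity { Warning, Fault }
public enum MachineAction { WarnOperator, StopInspection }

// Produced by the health monitor: what was observed, with no decision attached.
public sealed record HealthFinding(string Component, Severity Severity, string Reason);

// Owned by the fault/recovery layer: turns findings into machine actions.
public static class FaultPolicy
{
    public static MachineAction Decide(HealthFinding finding, bool inspectionActive) =>
        finding.Severity == Severity.Fault && inspectionActive
            ? MachineAction.StopInspection
            : MachineAction.WarnOperator;
}
```

The same finding can lead to different actions depending on machine context, and changing that policy does not require touching the detection code.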
Health Snapshot Example
A good health snapshot contains evidence, not only status.
using System;
using System.Collections.Generic;

public enum HealthState
{
    Healthy,
    Suspect,
    Degraded,
    Faulted,
    Recovering,
    Offline
}

// Immutable snapshot: the state plus the evidence that explains it.
public sealed record HealthSnapshot(
    string ComponentName,
    HealthState State,
    DateTimeOffset Timestamp,
    string Reason,
    IReadOnlyDictionary<string, object> Evidence);
Example output:
Component: CameraAcquisition
State: Degraded
Reason: Frame interval above warning threshold
Evidence:
- LastFrameAgeMs = 420
- ExpectedFrameIntervalMs = 50
- WarningThresholdMs = 250
- FaultThresholdMs = 1000
- FrameQueueDepth = 320
- AcquisitionState = Running
This is much better than:
CameraHealthy = false
Because support engineers need to know why.
Part 9 — Interview / Real-World Talking Points
How to Explain Watchdogs and Heartbeats Clearly
A strong answer:
A heartbeat tells me a component is still alive. A watchdog tells me whether expected progress happened within an acceptable time window. In industrial systems, I do not rely only on connectivity or heartbeat. I monitor freshness, command completion, workflow progress, queue backlog, and functional readiness. The goal is to detect stuck or degraded behavior before it becomes unsafe or causes production loss.
Why “Connected” Is Not Equal to “Healthy”
Because connection is only one layer.
A device may be:
Connected but not initialized.
Initialized but not configured.
Configured but not ready.
Ready but not progressing.
Progressing but producing stale or invalid data.
So the correct model is layered health, not a single boolean.
Common Mistakes Software Engineers Make
Common mistakes:
Using only IsConnected
Using heartbeat as complete health
No timestamp on sensor values
No watchdog for workflow steps
No command timeout per operation type
Same timeout for all operations
No queue/backlog monitoring
No distinction between warning and fault
No health state history
No diagnostic evidence
Health monitor tightly coupled to recovery actions
The most dangerous mistake:
assuming “nothing happened” means “nothing is wrong.”
What Strong Engineers Understand
Strong industrial software engineers understand that health is about:
freshness
progress
context
timing
trend
evidence
escalation
They ask:
Is the data fresh?
Is the command progressing?
Is the workflow moving?
Is the queue draining?
Is the component functionally usable?
Is the problem transient or repeated?
Should we warn, degrade, stop, or recover?
They also understand that health monitoring must be context-aware.
Example:
No camera frames while Idle:
Healthy
No camera frames while Inspecting:
Fault
Same signal. Different meaning.
Final Mental Model
In business software, health often means:
Can I reach the service?
In industrial machine software, health means:
Can this subsystem safely and correctly perform its machine responsibility right now,
with fresh data, valid state, and observable progress?
That is the mindset shift.
A production-grade machine does not only ask:
Are components alive?
It asks:
Are they making the right progress at the right time?
That is the real purpose of watchdogs, heartbeats, and health monitoring.