Watchdogs, Heartbeats & Health Monitoring

This topic sits inside the reliability/fault-handling area of your roadmap, especially “Timeout design,” “Watchdogs and heartbeat monitoring,” and “Degraded mode operation.”


Part 1 — Why Health Monitoring Is Critical

In industrial machine software, many failures do not throw an exception.

Sometimes the failure is simply:

something expected did not happen.

A camera does not raise an error. It just stops producing frames.

A motion controller does not crash. A command just never reaches Completed.

A PLC connection still exists, but the status value stops changing.

A processing pipeline still has threads, but the queue no longer drains.

That is why this statement is dangerous:

“There is no error, so the system is healthy.”

In machine software, silence can be failure.

Examples:

text
Camera acquisition:
Expected: frame arrives every ~50 ms
Failure: no frame arrives for 2 seconds

Motion:
Expected: MoveTo(position) completes in 5 seconds
Failure: axis remains Busy forever

PLC:
Expected: heartbeat counter increments every 500 ms
Failure: counter freezes

Image processing:
Expected: queue depth stays below 100
Failure: queue grows to 20,000 while system still says "Running"

A good industrial system does not only wait for exceptions. It watches for:

  • freshness
  • progress
  • completion
  • throughput
  • responsiveness
  • consistency
  • resource pressure

The core question is:

“Is the machine still making valid progress?”


Part 2 — Heartbeats vs Watchdogs vs Health Checks

These three are related, but they are not the same.

Heartbeat

A heartbeat is a periodic signal that says:

“I am still alive.”

Example:

text
PLC heartbeat counter increments every 500 ms.
Camera acquisition service updates LastFrameTimestamp.
Background worker updates LastLoopTimestamp.
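
A frozen counter is the usual way a PLC heartbeat fails. The sketch below assumes a hypothetical `CounterHeartbeat` class (an illustration, not a PLC vendor API) that is fed every counter read together with a monotonic time such as `Stopwatch.Elapsed`. It treats the heartbeat as alive only while the counter keeps changing.

```csharp
using System;

// Hypothetical helper: tracks when a heartbeat counter last changed.
public sealed class CounterHeartbeat
{
    private long? _lastCounter;      // null until the first read
    private TimeSpan _lastChangeAt;  // monotonic time of the last change

    // Feed every counter read, with the time at which it was read.
    public void Observe(long counter, TimeSpan now)
    {
        if (_lastCounter != counter)
        {
            _lastCounter = counter;
            _lastChangeAt = now;
        }
    }

    // Alive only if the counter changed within maxSilence.
    public bool IsAlive(TimeSpan now, TimeSpan maxSilence)
        => _lastCounter.HasValue && now - _lastChangeAt <= maxSilence;
}
```

With a 500 ms increment period, a `maxSilence` of a few periods tolerates one late read without hiding a frozen counter.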

A heartbeat is useful, but it is weak evidence.

A component can be “alive” but still useless.

Example:

text
Camera SDK thread is alive.
Heartbeat still updates.
But no valid image frames are produced.

So a heartbeat answers:

“Is something still running?”

It does not fully answer:

“Is it doing the correct job?”


Watchdog

A watchdog is an observer that expects something to happen within a time window.

It asks:

“Did expected progress happen in time?”

Examples:

text
Motion command must complete within 30 seconds.
Camera must produce a valid frame within 500 ms.
Workflow step must advance within 10 seconds.
Queue must drain below threshold within 5 seconds.

A watchdog is stronger than a heartbeat because it watches progress, not only existence.
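
A minimal sketch of that idea, assuming a hypothetical `ProgressWatchdog` class whose `Kick` is called only when real progress is observed. The names are illustrative. Times are passed in as values from a monotonic source (for example `Stopwatch.Elapsed`) so the logic stays deterministic and testable.

```csharp
using System;

// Hypothetical watchdog: expects progress at least once per time window.
public sealed class ProgressWatchdog
{
    private readonly TimeSpan _window;
    private TimeSpan _lastProgressAt;

    public ProgressWatchdog(TimeSpan window, TimeSpan startedAt)
    {
        if (window <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(window));
        _window = window;
        _lastProgressAt = startedAt;
    }

    // Call when real progress is observed (valid frame, step change, ...),
    // not merely when a loop iterates.
    public void Kick(TimeSpan now) => _lastProgressAt = now;

    // True when no progress was observed within the window.
    public bool IsExpired(TimeSpan now) => now - _lastProgressAt > _window;
}
```

The difference from a heartbeat is in what calls `Kick`: a valid frame or a completed step, not a timer.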


Health Check

A health check explicitly evaluates whether a component is usable.

Example:

text
Camera health:
- connected
- acquisition started
- last frame is fresh
- frame rate acceptable
- no SDK error
- buffer queue not full

Motion axis health:
- controller connected
- axis enabled
- not faulted
- position feedback fresh
- command state valid
- no limit violation

A health check answers:

“Can this component safely and correctly perform its responsibility now?”


Concept Diagram

text
+----------------+
|   Component    |
|  Camera / PLC  |
|  Axis / Worker |
+-------+--------+
        |
        | heartbeat / status / progress event
        v
+----------------+
| Health Monitor |
| freshness      |
| progress       |
| thresholds     |
| state model    |
+-------+--------+
        |
        | health state decision
        v
+----------------+
|   Watchdog     |
| warn / fault   |
| escalate       |
| request action |
+----------------+

The important idea:

Heartbeat is evidence. Health monitor interprets evidence. Watchdog decides whether lack of progress is unacceptable.


Part 3 — What Should Be Monitored

1. Device Connectivity

Healthy means:

text
Device is reachable, initialized, configured, and in expected mode.

Unhealthy looks like:

text
Connection lost
SDK handle invalid
Device responds slowly
Device reconnects but loses state

Useful evidence:

text
Last successful command time
Last communication error
Connection state
Device mode
Firmware/version info
Reconnect count

Bad design:

csharp
bool IsConnected;

Better design:

text
Connected
Initialized
Configured
Ready
Busy
Faulted
Recovering
Offline

2. Command Completion

Healthy means:

text
Commands complete within expected timing and reach a valid final state.

Unhealthy looks like:

text
Command stuck in Busy
Command result never received
Command completed but physical state does not match
Command timeout repeatedly occurs

Useful evidence:

text
CommandId
StartTime
ExpectedTimeout
CurrentDeviceState
LastStatusUpdate
FinalResult

Example:

text
MoveAxis(X, 120.0)

Expected:
- Axis enters Moving
- Position changes
- Axis reaches target
- Done signal arrives

Unhealthy:
- Axis stays Busy
- Position does not change
- Done signal never arrives
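
The move example can be sketched as a single evaluation over the evidence. `MoveCommandCheck`, its parameters, and the position tolerance are assumptions made for illustration. The point is that a Done signal is checked against the physical position, and a missing Done signal is checked against the timeout.

```csharp
using System;

public enum CommandVerdict { InProgress, Succeeded, TimedOut, CompletedButMismatch }

// Hypothetical check for one move command.
public static class MoveCommandCheck
{
    public static CommandVerdict Evaluate(
        bool doneSignal, double actualPosition, double targetPosition,
        double tolerance, TimeSpan elapsed, TimeSpan timeout)
    {
        if (doneSignal)
        {
            // "Done" is only trusted if the axis is physically at the target.
            bool atTarget = Math.Abs(actualPosition - targetPosition) <= tolerance;
            return atTarget ? CommandVerdict.Succeeded
                            : CommandVerdict.CompletedButMismatch;
        }

        // No Done signal yet: still fine until the timeout passes.
        return elapsed > timeout ? CommandVerdict.TimedOut
                                 : CommandVerdict.InProgress;
    }
}
```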

3. Workflow Progress

Healthy means:

text
The workflow continues moving through valid states.

Unhealthy looks like:

text
Workflow stuck in "WaitingForFrame"
Workflow waiting forever for vacuum ready
Workflow step never exits
Workflow task silently stopped

Useful evidence:

text
Current step
Step enter timestamp
Expected max duration
Blocking condition
Last event received
Current machine mode

Important:

Workflow watchdogs should monitor step progress, not just thread life.
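
A possible shape for a step watchdog, assuming a hypothetical `StepWatchdog` class in which each step supplies its own warning and fault limits when it is entered. Times are monotonic values passed in by the caller.

```csharp
using System;

public enum StepStatus { OnTime, Warning, Fault }

// Hypothetical step watchdog: each step brings its own limits.
public sealed class StepWatchdog
{
    private TimeSpan _enteredAt, _warnAfter, _faultAfter;

    public string CurrentStep { get; private set; } = "";

    // Entering a step restarts the timer with that step's limits.
    public void EnterStep(string step, TimeSpan now,
                          TimeSpan warnAfter, TimeSpan faultAfter)
    {
        if (faultAfter < warnAfter)
            throw new ArgumentException("faultAfter must be >= warnAfter.");
        CurrentStep = step;
        _enteredAt = now;
        _warnAfter = warnAfter;
        _faultAfter = faultAfter;
    }

    public StepStatus Evaluate(TimeSpan now)
    {
        if (CurrentStep.Length == 0) return StepStatus.OnTime; // nothing watched yet

        TimeSpan inStep = now - _enteredAt;
        if (inStep >= _faultAfter) return StepStatus.Fault;
        if (inStep >= _warnAfter) return StepStatus.Warning;
        return StepStatus.OnTime;
    }
}
```

Because the timer restarts only in `EnterStep`, a workflow thread that is alive but never leaves its step still ends in `Fault`.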


4. Queue Depth / Backlog

Healthy means:

text
Queues remain bounded and drain at an acceptable rate.

Unhealthy looks like:

text
Frame queue grows continuously
Result writer queue never drains
UI update queue becomes delayed
Inspection results accumulate in memory

Useful evidence:

text
Queue depth
Enqueue rate
Dequeue rate
Oldest item age
Dropped item count
Backpressure state

A system can be “green” while slowly dying because queues are growing.

That is a classic long-running machine problem.
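
One way to turn the evidence above into a verdict is sketched here. `QueueStats`, `BacklogStatus`, and the limits are hypothetical names and values. The check treats the age of the oldest item as seriously as the depth, and reports a rising backlog before any limit is hit.

```csharp
using System;

// Hypothetical evidence record for one queue.
public sealed record QueueStats(
    int Depth, double EnqueuePerSecond, double DequeuePerSecond,
    TimeSpan OldestItemAge);

public enum BacklogStatus { Ok, Rising, Overloaded }

public static class BacklogCheck
{
    public static BacklogStatus Evaluate(
        QueueStats stats, int maxDepth, TimeSpan maxItemAge)
    {
        // Hard limits: too many items, or items waiting too long.
        if (stats.Depth > maxDepth || stats.OldestItemAge > maxItemAge)
            return BacklogStatus.Overloaded;

        // Within limits, but producers outpace consumers: early warning.
        if (stats.Depth > 0 && stats.EnqueuePerSecond > stats.DequeuePerSecond)
            return BacklogStatus.Rising;

        return BacklogStatus.Ok;
    }
}
```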


5. Frame Arrival Rate

Healthy means:

text
Frames arrive at expected rate and timestamps are fresh.

Unhealthy looks like:

text
No frames
Irregular frames
Duplicate frames
Old frames reused
Frames arriving but invalid
Frame rate lower than recipe expectation

Useful evidence:

text
Last frame timestamp
Frame counter
Frame interval statistics
Dropped frame count
Camera trigger count
Acquisition state

Example:

text
Expected frame interval: 50 ms
Warning threshold: no frame for 250 ms
Fault threshold: no frame for 1000 ms

6. Sensor Freshness

Healthy means:

text
Sensor values are recent enough to trust.

Unhealthy looks like:

text
Temperature value unchanged for too long
Vacuum value stale
Door state not updated
Encoder feedback not refreshed

Useful evidence:

text
Value
Timestamp
Source
Update interval
Quality flag
Last successful read

Bad:

csharp
if (sensor.VacuumOk)
{
    Continue();
}

Better:

csharp
if (sensor.VacuumOk && sensor.Age < MaxAllowedAge)
{
    Continue();
}

Industrial systems must treat old data as suspicious.
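
A sketch of how a value and its timestamp can travel together. `SensorReading<T>` and `VacuumGate` are hypothetical names. A reading with a timestamp in the future is treated as not fresh, on the assumption that it indicates a clock problem.

```csharp
using System;

// Hypothetical reading type: a value is only meaningful with its timestamp.
public readonly record struct SensorReading<T>(T Value, DateTimeOffset Timestamp)
{
    public TimeSpan AgeAt(DateTimeOffset now) => now - Timestamp;

    // Fresh means: not from the future and not older than maxAge.
    public bool IsFreshAt(DateTimeOffset now, TimeSpan maxAge)
    {
        TimeSpan age = AgeAt(now);
        return age >= TimeSpan.Zero && age <= maxAge;
    }
}

public static class VacuumGate
{
    // Continue only when vacuum is reported OK and the report is recent.
    public static bool CanContinue(
        SensorReading<bool> vacuumOk, DateTimeOffset now, TimeSpan maxAge)
        => vacuumOk.Value && vacuumOk.IsFreshAt(now, maxAge);
}
```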


7. Background Worker Activity

Healthy means:

text
Worker loop is running, not blocked, and making progress.

Unhealthy looks like:

text
Worker task died
Worker loop blocked on lock
Worker stuck waiting for device callback
Worker alive but not processing messages

Useful evidence:

text
Last loop timestamp
Last processed item timestamp
Unhandled exception
Current operation
Cancellation state
Thread/task status

A worker being “started” is not enough.

The real question is:

“Is the worker still doing useful work?”


8. UI Responsiveness

Healthy means:

text
UI thread can process messages within acceptable latency.

Unhealthy looks like:

text
UI freezes
Operator clicks do nothing
Live status stops updating
Alarm screen becomes stale

Useful evidence:

text
Dispatcher heartbeat
UI update latency
Render delay
Last UI tick timestamp
Pending UI queue size

In WPF industrial systems, UI responsiveness is not just comfort.

It affects operator trust and recovery speed.


9. Storage Availability

Healthy means:

text
The system can write required data fast enough and safely.

Unhealthy looks like:

text
Disk full
Database unavailable
Result writer queue backing up
Image save latency too high
File lock issues

Useful evidence:

text
Free disk space
Write latency
Failed write count
Queue depth
Oldest unsaved result age

For inspection machines, storage health can directly affect production.


10. CPU / Memory / Disk Pressure

Healthy means:

text
Resources stay within safe operating range.

Unhealthy looks like:

text
Memory grows over 8 hours
CPU spikes cause missed frames
Disk IO delays result saving
GC pauses affect UI or pipeline timing

Useful evidence:

text
Working set
Private bytes
GC heap size
CPU usage
Disk queue length
Handle count
Thread count

This is especially important in long-running vision systems.


Part 4 — Health States and Escalation

Binary health is too weak.

This is not enough:

text
Healthy / Unhealthy

Real systems need intermediate states because many failures are gradual.

A practical model:

text
Healthy
Suspect
Degraded
Faulted
Recovering
Offline

State Meaning

text
Healthy
- Component is usable.
- Evidence is fresh.
- Progress is normal.

Suspect
- One or more signals look abnormal.
- Not enough evidence to stop the machine yet.

Degraded
- Component still works, but below expected quality or speed.
- Production may continue with reduced capability.

Faulted
- Component cannot be trusted.
- Machine action must be stopped, blocked, or escalated.

Recovering
- System is attempting controlled recovery.
- Commands may be restricted.

Offline
- Component is intentionally unavailable or disconnected.

State Diagram

text
+---------+
| Healthy |
+----+----+
     |
     | missed heartbeat / slow progress / stale data
     v
+---------+
| Suspect |
+----+----+
     |
     | repeated issue / threshold exceeded
     v
+----------+
| Degraded |
+----+-----+
     |
     | unsafe / unusable / timeout exceeded
     v
+---------+
| Faulted |
+----+----+
     |
     | recovery requested
     v
+------------+
| Recovering |
+-----+------+
      |
      | recovery success
      v
+---------+
| Healthy |
+---------+

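Other transitions:

Suspect    ---> Healthy     (signal returns to normal)
Degraded   ---> Healthy     (performance restored)
Recovering ---> Faulted     (recovery failed)
Faulted    ---> Offline
Offline    ---> Recovering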

Why this matters:

Not every missed heartbeat should stop production. But repeated missed heartbeats should not be ignored.

Good systems escalate based on evidence over time.

Example:

text
1 missed camera frame:
Suspect

10 missed frames:
Degraded

No frame for 2 seconds during active inspection:
Faulted
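
The escalation example can be written as a small pure function. The thresholds below simply mirror the example; in a real machine they would come from measured timing. `CameraEscalation` is a hypothetical name.

```csharp
using System;

public enum CameraHealth { Healthy, Suspect, Degraded, Faulted }

// Hypothetical escalation rule: evidence over time decides the state.
public static class CameraEscalation
{
    public static CameraHealth Evaluate(
        int consecutiveMissedFrames, TimeSpan sinceLastFrame, bool inspecting)
    {
        // Most severe condition first.
        if (inspecting && sinceLastFrame >= TimeSpan.FromSeconds(2))
            return CameraHealth.Faulted;
        if (consecutiveMissedFrames >= 10)
            return CameraHealth.Degraded;
        if (consecutiveMissedFrames >= 1)
            return CameraHealth.Suspect;
        return CameraHealth.Healthy;
    }
}
```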

Part 5 — Watchdog Time Windows and False Positives

Watchdog timing is one of the hardest parts.

If the timeout is too short, the machine becomes noisy and unstable.

If the timeout is too long, the machine detects real failures too late.

Example 1 — Camera Frame Watchdog

text
Normal frame interval: 50 ms
Warning threshold: 250 ms
Fault threshold: 1000 ms

Timeline:

text
Expected frame stream:

0ms     50ms    100ms   150ms   200ms
 |-------|-------|-------|-------|
 Frame   Frame   Frame   Frame   Frame


Abnormal stream:

0ms     50ms    100ms          300ms               1050ms
 |-------|-------|--------------|-------------------|
 Frame   Frame   no frame       Warning             Fault
                                250 ms gap          1000 ms gap

Explanation:

  • One missing frame may be normal jitter.
  • A 250 ms gap may indicate degraded acquisition.
  • A 1000 ms gap during active inspection is likely a real fault.

Example 2 — Workflow Step Watchdog

text
Step: WaitForVacuumReady
Expected: 10 seconds
Warning: 15 seconds
Fault: 30 seconds

Timeline:

text
Step entered
    |
    v
0s -------- 10s -------- 15s ---------------- 30s
|           |            |                    |
Start       Expected     Warning              Fault
            completion   operator awareness   stop workflow

The timeout should reflect physical behavior.

Vacuum may take time. Motion may take time. Camera exposure may vary. A real system must understand the process.


Common Timing Mistakes

Bad:

text
Set every device timeout to 5 seconds.

Why bad?

Because different operations have different physical meanings.

Better:

text
Camera grab timeout: 500 ms
Axis move timeout: based on distance, velocity, acceleration, margin
Vacuum timeout: based on chamber size and expected pressure curve
Database write warning: based on queue age and production rate

Strong engineers do not choose watchdog windows randomly.

They ask:

text
What is the normal duration?
What is the worst acceptable duration?
What jitter is expected?
What happens if we stop too early?
What happens if we wait too long?

Part 6 — Active vs Passive Health Monitoring

Active Monitoring

Active monitoring asks the component to prove health.

Example:

text
PC software sends ping to PLC every 500 ms.
PLC replies with counter/status.

Diagram:

text
+-------------+       ping/read status       +-----+
| PC Software | ---------------------------> | PLC |
| Monitor     | <--------------------------- |     |
+-------------+       response/counter       +-----+

Good for:

  • communication checks
  • explicit status reads
  • detecting disconnected devices
  • verifying command channel availability

Weakness:

A device can respond to ping but still be functionally broken.

Example:

text
PLC responds to ping,
but the conveyor command is ignored.

Passive Monitoring

Passive monitoring observes normal operational events.

Example:

text
Camera frames arrive during acquisition.
Each frame updates LastFrameTimestamp.
Monitor checks frame freshness.

Diagram:

text
+--------+        frame events        +----------------+
| Camera | -------------------------> | Frame Pipeline |
+--------+                            +-------+--------+
                                              |
                                              | last frame timestamp
                                              v
                                      +----------------+
                                      | Health Monitor |
                                      +----------------+

Good for:

  • real operational health
  • throughput monitoring
  • progress detection
  • detecting stuck pipelines

Weakness:

If the component is idle, lack of events may be normal.

So passive monitoring must understand context:

text
No frame while camera is Idle:
Healthy

No frame while camera is Acquiring:
Fault

Context matters.
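
In code, context is just one more input to the check. The sketch below uses hypothetical `AcquisitionState` and `FrameVerdict` enums. The same frame silence gives a different verdict depending on whether frames are expected.

```csharp
using System;

public enum AcquisitionState { Idle, Acquiring }
public enum FrameVerdict { Healthy, Fault }

// Hypothetical passive check that takes the camera's context into account.
public static class ContextAwareFrameCheck
{
    public static FrameVerdict Evaluate(
        AcquisitionState state, TimeSpan sinceLastFrame, TimeSpan faultAfter)
    {
        // Silence is only meaningful while frames are expected.
        if (state == AcquisitionState.Idle)
            return FrameVerdict.Healthy;

        return sinceLastFrame >= faultAfter
            ? FrameVerdict.Fault
            : FrameVerdict.Healthy;
    }
}
```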


Part 7 — Real-World Failure Scenarios

Scenario 1 — Heartbeat Updates but Device Is Functionally Stuck

Production symptom:

text
PLC heartbeat is green.
Machine says "Connected."
But conveyor does not move.

Why it happens:

The communication link is alive, but the functional part of the controller is stuck, inhibited, faulted, or ignoring commands.

Bad diagnosis:

text
PLC is connected, so PLC is fine.

Better diagnosis:

text
Connection health: OK
Heartbeat health: OK
Command execution health: Failed
Physical progress: No encoder/sensor change

How experienced engineers handle it:

They separate:

text
Connectivity health
Controller health
Command health
Physical progress health
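
A sketch of how the separated layers might be combined, assuming a hypothetical `LayeredHealth.Combine` helper. The overall status is the worst layer, and the names of the failing layers are kept so the diagnosis does not collapse back into a single boolean.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ordered so that a larger value means worse.
public enum LayerStatus { Ok = 0, Degraded = 1, Failed = 2 }

public static class LayeredHealth
{
    // Returns the worst status plus the names of all layers that are not Ok.
    public static (LayerStatus Overall, IReadOnlyList<string> Problems) Combine(
        IReadOnlyDictionary<string, LayerStatus> layers)
    {
        if (layers.Count == 0)
            throw new ArgumentException("At least one layer is required.");

        LayerStatus overall = layers.Values.Max();
        List<string> problems = layers
            .Where(kv => kv.Value != LayerStatus.Ok)
            .Select(kv => kv.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();

        return (overall, problems);
    }
}
```

For the conveyor case this reports `Failed` and names the command and physical-progress layers, even though connectivity and heartbeat are `Ok`.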

Scenario 2 — Watchdog Timeout Too Short Causes False Stops

Production symptom:

text
Machine randomly stops during heavy load.
Operators report false alarms.
Logs show camera timeout.

Why it happens:

The watchdog threshold was set based on ideal lab timing, not real production timing.

Example:

text
Normal lab frame interval: 50 ms
Production occasionally: 120–180 ms
Watchdog timeout: 100 ms

Result:

The watchdog becomes noise.

How experienced engineers handle it:

They measure real timing distribution and set thresholds with margin.

They also separate warning and fault thresholds.
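
A rough sketch of deriving a threshold from measured intervals rather than from lab expectations. It uses a nearest-rank percentile, which is one simple choice among several. The sample values, the 99th percentile, and the margin factor are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ThresholdFromSamples
{
    // Nearest-rank percentile of measured intervals, p in (0, 100].
    public static double Percentile(IReadOnlyList<double> samplesMs, double p)
    {
        if (samplesMs.Count == 0)
            throw new ArgumentException("No samples.");
        if (p <= 0 || p > 100)
            throw new ArgumentOutOfRangeException(nameof(p));

        double[] sorted = samplesMs.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[rank - 1];
    }

    // Warning threshold = high percentile of real timing, scaled by a margin.
    public static double WarningThresholdMs(
        IReadOnlyList<double> samplesMs, double marginFactor)
        => Percentile(samplesMs, 99) * marginFactor;
}
```

If production samples include occasional 120 to 180 ms intervals, a threshold derived this way lands well above the 100 ms timeout from the example.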


Scenario 3 — Watchdog Timeout Too Long Delays Safe Recovery

Production symptom:

text
Motion axis gets stuck.
Machine waits 5 minutes before faulting.
Operator loses time.
Material may be at risk.

Why it happens:

Timeout was set too generously to avoid false positives.

The machine detects failure too late.

Better approach:

text
Expected motion duration calculated from distance/speed.
Warning after expected duration + margin.
Fault after maximum physically reasonable duration.

A 10 mm move and a 500 mm move should not have the same timeout.
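
As a worked sketch, the expected duration can be estimated from an ideal symmetric trapezoidal velocity profile (start and end at rest, equal acceleration and deceleration). That profile is an assumption; a real axis may use S-curves or other limits. `MoveTimeout`, the margin factor, and the fixed allowance are hypothetical.

```csharp
using System;

public static class MoveTimeout
{
    // Ideal move time for a symmetric trapezoidal profile.
    public static TimeSpan ExpectedDuration(
        double distanceMm, double maxVelocityMmPerS, double accelMmPerS2)
    {
        if (distanceMm < 0 || maxVelocityMmPerS <= 0 || accelMmPerS2 <= 0)
            throw new ArgumentOutOfRangeException();

        // Distance used by ramp-up plus ramp-down at full velocity: v^2 / a.
        double rampDistance = maxVelocityMmPerS * maxVelocityMmPerS / accelMmPerS2;

        double seconds = distanceMm >= rampDistance
            // Trapezoid: cruise phase exists, t = d/v + v/a.
            ? distanceMm / maxVelocityMmPerS + maxVelocityMmPerS / accelMmPerS2
            // Triangle: max velocity never reached, t = 2 * sqrt(d/a).
            : 2.0 * Math.Sqrt(distanceMm / accelMmPerS2);

        return TimeSpan.FromSeconds(seconds);
    }

    // Fault timeout = expected duration scaled by a margin, plus a fixed allowance.
    public static TimeSpan FaultTimeout(
        TimeSpan expected, double marginFactor, TimeSpan fixedAllowance)
        => TimeSpan.FromSeconds(expected.TotalSeconds * marginFactor) + fixedAllowance;
}
```

With a maximum velocity of 100 mm/s and an acceleration of 500 mm/s², the 500 mm move is expected to take about 5.2 s and the 10 mm move about 0.28 s, so their timeouts differ by more than an order of magnitude.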


Scenario 4 — Queue Backlog Grows but Health Remains Green

Production symptom:

text
Inspection continues.
Camera keeps acquiring.
Processing queue grows.
Memory increases.
Eventually app freezes or crashes.

Why it happens:

Health model only checks whether services are running.

It does not monitor throughput or backlog.

Bad health model:

text
Camera connected: true
Processor running: true
Storage connected: true
System healthy: true

Better health model:

text
Frame arrival rate: OK
Processing rate: too slow
Queue depth: rising
Oldest frame age: 12 seconds
System state: Degraded

Scenario 5 — Background Worker Dies Silently

Production symptom:

text
Machine appears idle.
No new results are saved.
No alarm appears.

Why it happens:

A background task threw an exception and exited.

Nobody observed the task.

Bad design:

csharp
Task.Run(() => ProcessResults());

No supervision. No heartbeat. No restart strategy. No fault propagation.

Better design:

text
Worker has:
- lifecycle state
- last activity timestamp
- unhandled exception capture
- health contribution
- supervisor/monitor ownership
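
A minimal sketch of the exception-capture part only, assuming a hypothetical `SupervisedWorker` wrapper. Lifecycle state, activity timestamps, and restart policy are left out to keep it short.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper: a failure in the worker body becomes visible state.
public sealed class SupervisedWorker
{
    private volatile Exception? _lastError; // null while no failure was seen

    public bool HasFailed => _lastError != null;
    public string FailureReason => _lastError?.Message ?? "";

    public Task RunAsync(Action body)
    {
        return Task.Run(() =>
        {
            try
            {
                body();
            }
            catch (Exception ex)
            {
                // Capture the failure so a monitor can report it,
                // instead of letting the task fault unobserved.
                _lastError = ex;
            }
        });
    }
}
```

A health monitor can then poll `HasFailed` and include `FailureReason` in its evidence.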

Scenario 6 — Stale Sensor Value Treated as Current

Production symptom:

text
Software thinks vacuum is OK.
Actually vacuum was lost seconds ago.
Machine continues incorrectly.

Why it happens:

The system stores the last known value but does not check its age.

Bad:

text
VacuumOk = true

Better:

text
VacuumOk = true
LastUpdated = 2026-04-27 10:12:01.230
Age = 8.5 seconds
Quality = Stale

Strong rule:

Every important sensor value should have a timestamp and freshness policy.


Scenario 7 — Health Monitor Itself Becomes Unreliable

Production symptom:

text
Everything is green.
But later logs show health monitor stopped updating.

Why it happens:

The monitor was implemented as just another background worker with no supervision.

How experienced engineers handle it:

They design health monitoring as a first-class subsystem.

At minimum:

text
Health monitor has own heartbeat.
Health snapshot includes timestamp.
Consumers reject stale health snapshots.
Critical monitors are simple and robust.

Important principle:

Health monitoring must not become a hidden single point of failure.
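
On the consumer side, rejecting stale snapshots can be as small as the sketch below. `TimedHealth` and `SnapshotConsumer` are hypothetical. An old snapshot is treated as unknown, never as healthy.

```csharp
using System;

// Hypothetical minimal snapshot: a verdict plus the time it was produced.
public sealed record TimedHealth(
    string Component, bool IsHealthy, DateTimeOffset Timestamp);

public static class SnapshotConsumer
{
    // "Healthy" is only trusted if the snapshot itself is recent.
    public static bool CanTrustAsHealthy(
        TimedHealth snapshot, DateTimeOffset now, TimeSpan maxAge)
    {
        TimeSpan age = now - snapshot.Timestamp;
        bool fresh = age >= TimeSpan.Zero && age <= maxAge;
        return fresh && snapshot.IsHealthy;
    }
}
```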


Scenario 8 — Reconnect Resets Heartbeat but Device State Remains Invalid

Production symptom:

text
Device disconnects and reconnects.
UI becomes green.
Next command fails or behaves incorrectly.

Why it happens:

Reconnect restored communication, but not machine readiness.

The device may need:

text
re-initialization
configuration reload
homing
state verification
mode synchronization
buffer clearing

Bad model:

text
Connected = true
Therefore Ready = true

Better model:

text
Connected
Initialized
Configured
StateVerified
Ready

Reconnect is not recovery by itself.
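
One way to encode this is sketched below, with hypothetical `ReadinessSteps` and `DeviceReadiness` types. A reconnect resets every step after `Connected`, so `IsReady` stays false until the device has been initialized, configured, and verified again.

```csharp
using System;

[Flags]
public enum ReadinessSteps
{
    None = 0,
    Connected = 1,
    Initialized = 2,
    Configured = 4,
    StateVerified = 8,
    All = Connected | Initialized | Configured | StateVerified
}

// Hypothetical readiness tracker for one device.
public sealed class DeviceReadiness
{
    public ReadinessSteps Completed { get; private set; } = ReadinessSteps.None;

    // Ready only when every step has been completed.
    public bool IsReady => Completed == ReadinessSteps.All;

    public void Complete(ReadinessSteps step) => Completed |= step;

    // A reconnect proves only the link; all later steps must be redone.
    public void OnReconnected() => Completed = ReadinessSteps.Connected;

    public void OnDisconnected() => Completed = ReadinessSteps.None;
}
```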


Part 8 — Software Design Implications

Health monitoring must be designed into the architecture. It cannot be added only at the end as a few timers.

Bad Approach

text
Each device exposes IsConnected.
UI checks IsConnected.
If true, show green.

Problems:

text
Connected does not mean ready.
Ready does not mean progressing.
Progressing does not mean producing valid output.
Valid once does not mean fresh now.

Good Approach

Use layered health signals:

text
Connectivity
Initialization
Configuration
Operational state
Progress
Freshness
Performance
Resource pressure
Fault history

Component Diagram

text
+------------------+
| Device / Worker  |
| Camera / Axis    |
| PLC / Pipeline   |
+---------+--------+
          |
          | heartbeat
          | progress events
          | status snapshots
          | timestamps
          v
+------------------+
|  Health Monitor  |
| freshness checks |
| progress checks  |
| thresholds       |
| state model      |
+---------+--------+
          |
          | health state
          | diagnostic evidence
          v
+------------------+
|  Fault Manager   |
| severity         |
| escalation       |
| ownership        |
+---------+--------+
          |
          | action request
          v
+------------------+
| Machine State    |
| alarm            |
| inhibit command  |
| controlled stop  |
| diagnostics      |
+------------------+

Key design principle:

Detection and action should be separated.

The health monitor should report:

text
Camera frame stream is stale.

The fault/recovery layer should decide:

text
Warn operator.
Stop inspection.
Block next wafer.
Attempt controlled recovery.
Require service intervention.

This separation prevents health monitoring from becoming a messy place full of machine-control decisions.
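
The separation can be sketched as two small types that share only a finding. `FrameStreamMonitor`, `FaultPolicy`, and the enums are hypothetical names. The monitor knows thresholds but no machine actions; the policy knows machine context but no thresholds.

```csharp
using System;

public enum FrameFinding { Fresh, Slow, Stale }
public enum MachineMode { Idle, Inspecting }
public enum MachineAction { None, WarnOperator, StopInspection }

// Detection only: turns evidence into a finding.
public static class FrameStreamMonitor
{
    public static FrameFinding Observe(
        TimeSpan lastFrameAge, TimeSpan slowAfter, TimeSpan staleAfter)
    {
        if (lastFrameAge >= staleAfter) return FrameFinding.Stale;
        if (lastFrameAge >= slowAfter) return FrameFinding.Slow;
        return FrameFinding.Fresh;
    }
}

// Action only: maps a finding plus machine context to a requested action.
public static class FaultPolicy
{
    public static MachineAction Decide(FrameFinding finding, MachineMode mode)
    {
        // Frames are not expected outside inspection.
        if (mode != MachineMode.Inspecting) return MachineAction.None;

        switch (finding)
        {
            case FrameFinding.Stale: return MachineAction.StopInspection;
            case FrameFinding.Slow: return MachineAction.WarnOperator;
            default: return MachineAction.None;
        }
    }
}
```

Changing what the machine does about a stale stream then means editing only `FaultPolicy`, and the monitor can be tested without any machine state.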


Health Snapshot Example

A good health snapshot contains evidence, not only status.

csharp
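using System;
using System.Collections.Generic;

// Coarse health state shared by all components.
public enum HealthState
{
    Healthy,
    Suspect,
    Degraded,
    Faulted,
    Recovering,
    Offline
}

// Immutable snapshot: the state plus the evidence that justifies it.
public sealed record HealthSnapshot(
    string ComponentName,
    HealthState State,
    DateTimeOffset Timestamp,   // when the evidence was evaluated
    string Reason,              // short human-readable cause
    IReadOnlyDictionary<string, object> Evidence);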

Example output:

text
Component: CameraAcquisition
State: Degraded
Reason: Frame interval above warning threshold
Evidence:
- LastFrameAgeMs = 420
- ExpectedFrameIntervalMs = 50
- WarningThresholdMs = 250
- FaultThresholdMs = 1000
- FrameQueueDepth = 320
- AcquisitionState = Running

This is much better than:

text
CameraHealthy = false

Because support engineers need to know why.


Part 9 — Interview / Real-World Talking Points

How to Explain Watchdogs and Heartbeats Clearly

A strong answer:

A heartbeat tells me a component is still alive. A watchdog tells me whether expected progress happened within an acceptable time window. In industrial systems, I do not rely only on connectivity or heartbeat. I monitor freshness, command completion, workflow progress, queue backlog, and functional readiness. The goal is to detect stuck or degraded behavior before it becomes unsafe or causes production loss.


Why “Connected” Is Not Equal to “Healthy”

Because connection is only one layer.

A device may be:

text
Connected but not initialized.
Initialized but not configured.
Configured but not ready.
Ready but not progressing.
Progressing but producing stale or invalid data.

So the correct model is layered health, not a single boolean.


Common Mistakes Software Engineers Make

Common mistakes:

text
Using only IsConnected
Using heartbeat as complete health
No timestamp on sensor values
No watchdog for workflow steps
No command timeout per operation type
Same timeout for all operations
No queue/backlog monitoring
No distinction between warning and fault
No health state history
No diagnostic evidence
Health monitor tightly coupled to recovery actions

The most dangerous mistake:

assuming “nothing happened” means “nothing is wrong.”


What Strong Engineers Understand

Strong industrial software engineers understand that health is about:

text
freshness
progress
context
timing
trend
evidence
escalation

They ask:

text
Is the data fresh?
Is the command progressing?
Is the workflow moving?
Is the queue draining?
Is the component functionally usable?
Is the problem transient or repeated?
Should we warn, degrade, stop, or recover?

They also understand that health monitoring must be context-aware.

Example:

text
No camera frames while Idle:
Healthy

No camera frames while Inspecting:
Fault

Same signal. Different meaning.


Final Mental Model

In business software, health often means:

text
Can I reach the service?

In industrial machine software, health means:

text
Can this subsystem safely and correctly perform its machine responsibility right now,
with fresh data, valid state, and observable progress?

That is the mindset shift.

A production-grade machine does not only ask:

text
Are components alive?

It asks:

text
Are they making the right progress at the right time?

That is the real purpose of watchdogs, heartbeats, and health monitoring.
