Watchdogs, Heartbeats & Health Monitoring

This topic sits inside the reliability/fault-handling area of your roadmap, especially “Timeout design,” “Watchdogs and heartbeat monitoring,” and “Degraded mode operation.”

Part 1 — Why Health Monitoring Is Critical

In industrial machine software, many failures do not throw an exception.

Sometimes the failure is simply:

Something expected did not happen.

A camera does not raise an error. It just stops producing frames.

A motion controller does not crash. A command just never reaches Completed.

A PLC connection still exists, but the status value stops changing.

A processing pipeline still has threads, but the queue no longer drains.

That is why this statement is dangerous:

“There is no error, so the system is healthy.”

In machine software, silence can be failure.

Examples:

text

Camera acquisition:
Expected: frame arrives every ~50 ms
Failure: no frame arrives for 2 seconds

Motion:
Expected: MoveTo(position) completes in 5 seconds
Failure: axis remains Busy forever

PLC:
Expected: heartbeat counter increments every 500 ms
Failure: counter freezes

Image processing:
Expected: queue depth stays below 100
Failure: queue grows to 20,000 while system still says "Running"

A good industrial system does not only wait for exceptions. It watches for:

freshness
progress
completion
throughput
responsiveness
consistency
resource pressure

The core question is:

“Is the machine still making valid progress?”

Part 2 — Heartbeats vs Watchdogs vs Health Checks

These three are related, but they are not the same.

Heartbeat

A heartbeat is a periodic signal that says:

“I am still alive.”

Example:

text

PLC heartbeat counter increments every 500 ms.
Camera acquisition service updates LastFrameTimestamp.
Background worker updates LastLoopTimestamp.

A heartbeat is useful, but it is weak evidence.

A component can be “alive” but still useless.

Example:

text

Camera SDK thread is alive.
Heartbeat still updates.
But no valid image frames are produced.

So a heartbeat answers:

“Is something still running?”

It does not fully answer:

“Is it doing the correct job?”

Watchdog

A watchdog is an observer that expects something to happen within a time window.

It asks:

“Did expected progress happen in time?”

Examples:

text

Motion command must complete within 30 seconds.
Camera must produce a valid frame within 500 ms.
Workflow step must advance within 10 seconds.
Queue must drain below threshold within 5 seconds.

A watchdog is stronger than a heartbeat because it watches progress, not only existence.
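
A minimal sketch of such a watchdog, assuming the monitored code reports each piece of real progress. The name `ProgressWatchdog`, its members, and the injected clock are illustrative and not taken from any library.

```csharp
using System;

// Hypothetical progress watchdog: the monitored code reports progress,
// and an observer asks whether the last report is older than the window.
public sealed class ProgressWatchdog
{
    private readonly TimeSpan _window;
    private readonly Func<DateTimeOffset> _clock; // injected so tests can control time
    private readonly object _gate = new();
    private DateTimeOffset _lastProgress;

    public ProgressWatchdog(TimeSpan window, Func<DateTimeOffset> clock)
    {
        _window = window;
        _clock = clock;
        _lastProgress = clock();
    }

    // Call this only when real progress happened (a valid frame, a finished step),
    // not merely because a loop iterated.
    public void ReportProgress()
    {
        lock (_gate) { _lastProgress = _clock(); }
    }

    // True when no progress was reported within the allowed window.
    public bool IsExpired()
    {
        lock (_gate) { return _clock() - _lastProgress > _window; }
    }
}
```

In production code a monotonic time source such as `Stopwatch` is usually a safer clock than wall-clock time, because wall-clock time can jump when the system clock is adjusted.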

Health Check

A health check explicitly evaluates whether a component is usable.

Example:

text

Camera health:
- connected
- acquisition started
- last frame is fresh
- frame rate acceptable
- no SDK error
- buffer queue not full

Motion axis health:
- controller connected
- axis enabled
- not faulted
- position feedback fresh
- command state valid
- no limit violation

A health check answers:

“Can this component safely and correctly perform its responsibility now?”

Concept Diagram

text

+----------------+
|   Component    |
|  Camera / PLC  |
|  Axis / Worker |
+-------+--------+
        |
        | heartbeat / status / progress event
        v
+----------------+
| Health Monitor |
| freshness      |
| progress       |
| thresholds     |
| state model    |
+-------+--------+
        |
        | health state decision
        v
+----------------+
|   Watchdog     |
| warn / fault   |
| escalate       |
| request action |
+----------------+

The important idea:

Heartbeat is evidence. Health monitor interprets evidence. Watchdog decides whether lack of progress is unacceptable.

Part 3 — What Should Be Monitored

1. Device Connectivity

Healthy means:

text

Device is reachable, initialized, configured, and in expected mode.

Unhealthy looks like:

text

Connection lost
SDK handle invalid
Device responds slowly
Device reconnects but loses state

Useful evidence:

text

Last successful command time
Last communication error
Connection state
Device mode
Firmware/version info
Reconnect count

Bad design:

csharp

bool IsConnected;

Better design:

text

Connected
Initialized
Configured
Ready
Busy
Faulted
Recovering
Offline

2. Command Completion

Healthy means:

text

Commands complete within expected timing and reach a valid final state.

Unhealthy looks like:

text

Command stuck in Busy
Command result never received
Command completed but physical state does not match
Command timeout repeatedly occurs

Useful evidence:

text

CommandId
StartTime
ExpectedTimeout
CurrentDeviceState
LastStatusUpdate
FinalResult

Example:

text

MoveAxis(X, 120.0)

Expected:
- Axis enters Moving
- Position changes
- Axis reaches target
- Done signal arrives

Unhealthy:
- Axis stays Busy
- Position does not change
- Done signal never arrives

3. Workflow Progress

Healthy means:

text

The workflow continues moving through valid states.

Unhealthy looks like:

text

Workflow stuck in "WaitingForFrame"
Workflow waiting forever for vacuum ready
Workflow step never exits
Workflow task silently stopped

Useful evidence:

text

Current step
Step enter timestamp
Expected max duration
Blocking condition
Last event received
Current machine mode

Important:

Workflow watchdogs should monitor step progress, not just thread life.

4. Queue Depth / Backlog

Healthy means:

text

Queues remain bounded and drain at an acceptable rate.

Unhealthy looks like:

text

Frame queue grows continuously
Result writer queue never drains
UI update queue becomes delayed
Inspection results accumulate in memory

Useful evidence:

text

Queue depth
Enqueue rate
Dequeue rate
Oldest item age
Dropped item count
Backpressure state

A system can be “green” while slowly dying because queues are growing.

That is a classic long-running machine problem.
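
One way to make backlog visible is to record when each item was enqueued, so a monitor can read both depth and oldest item age. The sketch below assumes a simple lock-based wrapper; `MonitoredQueue<T>` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical queue wrapper that stores the enqueue time with each item,
// so depth and oldest-item age are available as health evidence.
public sealed class MonitoredQueue<T>
{
    private readonly Queue<(T Item, DateTimeOffset EnqueuedAt)> _items = new();
    private readonly object _gate = new();

    public void Enqueue(T item, DateTimeOffset now)
    {
        lock (_gate) { _items.Enqueue((item, now)); }
    }

    public bool TryDequeue(out T item)
    {
        lock (_gate)
        {
            if (_items.Count == 0) { item = default!; return false; }
            item = _items.Dequeue().Item;
            return true;
        }
    }

    // Depth and age of the oldest waiting item; age is zero when the queue is empty.
    public (int Depth, TimeSpan OldestAge) Sample(DateTimeOffset now)
    {
        lock (_gate)
        {
            if (_items.Count == 0) return (0, TimeSpan.Zero);
            return (_items.Count, now - _items.Peek().EnqueuedAt);
        }
    }
}
```

Oldest item age is often the more useful signal: a queue of 50 items is harmless if the oldest is 100 ms old, and a problem if the oldest is 12 seconds old.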

5. Frame Arrival Rate

Healthy means:

text

Frames arrive at expected rate and timestamps are fresh.

Unhealthy looks like:

text

No frames
Irregular frames
Duplicate frames
Old frames reused
Frames arriving but invalid
Frame rate lower than recipe expectation

Useful evidence:

text

Last frame timestamp
Frame counter
Frame interval statistics
Dropped frame count
Camera trigger count
Acquisition state

Example:

text

Expected frame interval: 50 ms
Warning threshold: no frame for 250 ms
Fault threshold: no frame for 1000 ms

6. Sensor Freshness

Healthy means:

text

Sensor values are recent enough to trust.

Unhealthy looks like:

text

Temperature value unchanged for too long
Vacuum value stale
Door state not updated
Encoder feedback not refreshed

Useful evidence:

text

Value
Timestamp
Source
Update interval
Quality flag
Last successful read

Bad:

csharp

if (sensor.VacuumOk)
{
    Continue();
}

Better:

csharp

if (sensor.VacuumOk && sensor.Age < MaxAllowedAge)
{
    Continue();
}

Industrial systems must treat old data as suspicious.
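
The `sensor.Age` check above only works if every reading carries the time it was taken. A minimal sketch, assuming the read time is captured by whatever code polls the sensor; `TimestampedValue<T>` and `VacuumGate` are illustrative names.

```csharp
using System;

// Hypothetical wrapper: a sensor value is never stored without its read time.
public sealed record TimestampedValue<T>(T Value, DateTimeOffset ReadAt)
{
    public TimeSpan AgeAt(DateTimeOffset now) => now - ReadAt;

    public bool IsFreshAt(DateTimeOffset now, TimeSpan maxAge) => AgeAt(now) < maxAge;
}

public static class VacuumGate
{
    // The step may continue only when vacuum is OK and the reading is recent enough.
    public static bool CanContinue(
        TimestampedValue<bool> vacuumOk, DateTimeOffset now, TimeSpan maxAge)
        => vacuumOk.Value && vacuumOk.IsFreshAt(now, maxAge);
}
```

Passing `now` in explicitly keeps the freshness rule testable without waiting for real time to pass.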

7. Background Worker Activity

Healthy means:

text

Worker loop is running, not blocked, and making progress.

Unhealthy looks like:

text

Worker task died
Worker loop blocked on lock
Worker stuck waiting for device callback
Worker alive but not processing messages

Useful evidence:

text

Last loop timestamp
Last processed item timestamp
Unhandled exception
Current operation
Cancellation state
Thread/task status

A worker being “started” is not enough.

The real question is:

“Is the worker still doing useful work?”

8. UI Responsiveness

Healthy means:

text

UI thread can process messages within acceptable latency.

Unhealthy looks like:

text

UI freezes
Operator clicks do nothing
Live status stops updating
Alarm screen becomes stale

Useful evidence:

text

Dispatcher heartbeat
UI update latency
Render delay
Last UI tick timestamp
Pending UI queue size

In WPF industrial systems, UI responsiveness is not just comfort.

It affects operator trust and recovery speed.

9. Storage Availability

Healthy means:

text

The system can write required data fast enough and safely.

Unhealthy looks like:

text

Disk full
Database unavailable
Result writer queue backing up
Image save latency too high
File lock issues

Useful evidence:

text

Free disk space
Write latency
Failed write count
Queue depth
Oldest unsaved result age

For inspection machines, storage health can directly affect production.

10. CPU / Memory / Disk Pressure

Healthy means:

text

Resources stay within safe operating range.

Unhealthy looks like:

text

Memory grows steadily over an 8-hour run
CPU spikes cause missed frames
Disk IO delays result saving
GC pauses affect UI or pipeline timing

Useful evidence:

text

Working set
Private bytes
GC heap size
CPU usage
Disk queue length
Handle count
Thread count

This is especially important in long-running vision systems.
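
Several of these values can be sampled from inside the process with standard .NET APIs. The sketch below is one possible shape; `ResourceSample` and `ResourceProbe` are hypothetical names, and the growth calculation is a deliberately simple two-point trend.

```csharp
using System;
using System.Diagnostics;

public sealed record ResourceSample(
    long WorkingSetBytes, long PrivateBytes, long GcHeapBytes,
    int HandleCount, int ThreadCount);

public static class ResourceProbe
{
    // Captures a point-in-time view of the current process.
    public static ResourceSample Capture()
    {
        using var process = Process.GetCurrentProcess();
        process.Refresh(); // discard any cached values before reading
        return new ResourceSample(
            process.WorkingSet64,
            process.PrivateMemorySize64,
            GC.GetTotalMemory(forceFullCollection: false),
            process.HandleCount,
            process.Threads.Count);
    }

    // Private-bytes growth per hour between two samples.
    // A single sample says little; the trend is the health signal.
    public static double PrivateBytesGrowthPerHour(
        ResourceSample earlier, ResourceSample later, TimeSpan elapsed)
        => (later.PrivateBytes - earlier.PrivateBytes) / elapsed.TotalHours;
}
```

A monitor might capture a sample every few minutes and raise a warning when the growth rate stays positive over several hours. The right limits depend on the machine and have to be measured.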

Part 4 — Health States and Escalation

Binary health is too weak.

This is not enough:

text

Healthy / Unhealthy

Real systems need intermediate states because many failures are gradual.

A practical model:

text

Healthy
Suspect
Degraded
Faulted
Recovering
Offline

State Meaning

text

Healthy
- Component is usable.
- Evidence is fresh.
- Progress is normal.

Suspect
- One or more signals look abnormal.
- Not enough evidence to stop the machine yet.

Degraded
- Component still works, but below expected quality or speed.
- Production may continue with reduced capability.

Faulted
- Component cannot be trusted.
- Machine action must be stopped, blocked, or escalated.

Recovering
- System is attempting controlled recovery.
- Commands may be restricted.

Offline
- Component is intentionally unavailable or disconnected.

State Diagram

text

+---------+
| Healthy |
+----+----+
     |
     | missed heartbeat / slow progress / stale data
     v
+---------+
| Suspect |
+----+----+
     |
     | repeated issue / threshold exceeded
     v
+----------+
| Degraded |
+----+-----+
     |
     | unsafe / unusable / timeout exceeded
     v
+---------+
| Faulted |
+----+----+
     |
     | recovery requested
     v
+------------+
| Recovering |
+-----+------+
      |
      | recovery success
      v
+---------+
| Healthy |
+---------+

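Suspect    ---> Healthy     (signals return to normal)
Degraded   ---> Healthy     (performance restored)
Recovering ---> Faulted     (recovery failed)
Faulted    ---> Offline
Offline    ---> Recovering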

Why this matters:

Not every missed heartbeat should stop production. But repeated missed heartbeats should not be ignored.

Good systems escalate based on evidence over time.

Example:

text

1 missed camera frame:
Suspect

10 missed frames:
Degraded

No frame for 2 seconds during active inspection:
Faulted
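
The escalation example above can be sketched as a pure function. The thresholds are copied from the example and are not recommendations; `FrameEscalation` and its parameter names are assumptions made for illustration.

```csharp
using System;

public enum HealthState { Healthy, Suspect, Degraded, Faulted, Recovering, Offline }

public static class FrameEscalation
{
    // Maps accumulated evidence to a health state, most severe rule first.
    public static HealthState Evaluate(
        int consecutiveMissedFrames, TimeSpan sinceLastFrame, bool inspecting)
    {
        if (inspecting && sinceLastFrame >= TimeSpan.FromSeconds(2))
            return HealthState.Faulted;
        if (consecutiveMissedFrames >= 10)
            return HealthState.Degraded;
        if (consecutiveMissedFrames >= 1)
            return HealthState.Suspect;
        return HealthState.Healthy;
    }
}
```

Checking the most severe rule first means a long outage is never reported as merely Degraded.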

Part 5 — Watchdog Time Windows and False Positives

Watchdog timing is one of the hardest parts.

If the timeout is too short, the machine becomes noisy and unstable.

If the timeout is too long, the machine detects real failures too late.

Example 1 — Camera Frame Watchdog

text

Normal frame interval: 50 ms
Warning threshold: 250 ms
Fault threshold: 1000 ms

Timeline:

text

Expected frame stream:

0ms     50ms    100ms   150ms   200ms
 |-------|-------|-------|-------|
 Frame   Frame   Frame   Frame   Frame


Abnormal stream:

0ms     50ms    100ms              300ms              1050ms
 |-------|-------|------------------|-------------------|
 Frame   Frame   no frame           Suspect/Warning     Fault
                                    250 ms gap          1000 ms gap

Explanation:

One missing frame may be normal jitter.
A 250 ms gap may indicate degraded acquisition.
A 1000 ms gap during active inspection is likely a real fault.

Example 2 — Workflow Step Watchdog

text

Step: WaitForVacuumReady
Expected: 10 seconds
Warning: 15 seconds
Fault: 30 seconds

Timeline:

text

Step entered
    |
    v
0s -------- 10s -------- 15s ---------------- 30s
|           |            |                    |
Start       Expected     Warning              Fault
            completion   operator awareness   stop workflow

The timeout should reflect physical behavior.

Vacuum may take time. Motion may take time. Camera exposure may vary. A real system must understand the process.

Common Timing Mistakes

Bad:

text

Set every device timeout to 5 seconds.

Why bad?

Because different operations have different physical meanings.

Better:

text

Camera grab timeout: 500 ms
Axis move timeout: based on distance, velocity, acceleration, margin
Vacuum timeout: based on chamber size and expected pressure curve
Database write warning: based on queue age and production rate

Strong engineers do not choose watchdog windows randomly.

They ask:

text

What is the normal duration?
What is the worst acceptable duration?
What jitter is expected?
What happens if we stop too early?
What happens if we wait too long?

Part 6 — Active vs Passive Health Monitoring

Active Monitoring

Active monitoring asks the component to prove health.

Example:

text

PC software sends ping to PLC every 500 ms.
PLC replies with counter/status.

Diagram:

text

+-------------+       ping/read status       +-----+
| PC Software | ---------------------------> | PLC |
| Monitor     | <--------------------------- |     |
+-------------+       response/counter       +-----+

Good for:

communication checks
explicit status reads
detecting disconnected devices
verifying command channel availability

Weakness:

A device can respond to ping but still be functionally broken.

Example:

text

PLC responds to ping,
but the conveyor command is ignored.
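
The counter check itself can be sketched as follows, assuming the PLC counter is read elsewhere and passed in on every poll. `HeartbeatCounterMonitor` is a hypothetical name, and no PLC library is involved.

```csharp
using System;

// Hypothetical detector for a frozen heartbeat counter.
public sealed class HeartbeatCounterMonitor
{
    private readonly TimeSpan _maxUnchanged;
    private int? _lastValue;
    private DateTimeOffset _lastChange;

    public HeartbeatCounterMonitor(TimeSpan maxUnchanged) => _maxUnchanged = maxUnchanged;

    // Feed every polled counter value. Returns true while the counter counts as alive.
    public bool Observe(int counter, DateTimeOffset now)
    {
        if (_lastValue != counter)
        {
            // Any change counts, so counter wrap-around is not mistaken for a freeze.
            _lastValue = counter;
            _lastChange = now;
        }
        return now - _lastChange <= _maxUnchanged;
    }
}
```

As the weakness above notes, a live counter proves only that the communication path and the PLC scan are running. It says nothing about whether commands are being executed.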

Passive Monitoring

Passive monitoring observes normal operational events.

Example:

text

Camera frames arrive during acquisition.
Each frame updates LastFrameTimestamp.
Monitor checks frame freshness.

Diagram:

text

+--------+        frame events        +----------------+
| Camera | -------------------------> | Frame Pipeline |
+--------+                            +-------+--------+
                                              |
                                              | last frame timestamp
                                              v
                                      +----------------+
                                      | Health Monitor |
                                      +----------------+

Good for:

real operational health
throughput monitoring
progress detection
detecting stuck pipelines

Weakness:

If the component is idle, lack of events may be normal.

So passive monitoring must understand context:

text

No frame while camera is Idle:
Healthy

No frame while camera is Acquiring:
Fault

Context matters.
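
A minimal sketch of a context-aware freshness check, assuming the camera's acquisition mode is known to the monitor. The enum and method names are illustrative, and the thresholds are supplied by the caller.

```csharp
using System;

public enum CameraMode { Idle, Acquiring }
public enum FrameHealth { Healthy, Warning, Fault }

public static class FrameFreshness
{
    public static FrameHealth Evaluate(
        CameraMode mode, TimeSpan sinceLastFrame, TimeSpan warnAfter, TimeSpan faultAfter)
    {
        // While idle, silence is expected and says nothing about health.
        if (mode == CameraMode.Idle) return FrameHealth.Healthy;

        if (sinceLastFrame >= faultAfter) return FrameHealth.Fault;
        if (sinceLastFrame >= warnAfter) return FrameHealth.Warning;
        return FrameHealth.Healthy;
    }
}
```

With the 250 ms and 1000 ms thresholds used elsewhere in this document, a 420 ms gap while acquiring yields `Warning`, and the same gap while idle yields `Healthy`.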

Part 7 — Real-World Failure Scenarios

Scenario 1 — Heartbeat Updates but Device Is Functionally Stuck

Production symptom:

text

PLC heartbeat is green.
Machine says "Connected."
But conveyor does not move.

Why it happens:

The communication link is alive, but the functional part of the controller is stuck, inhibited, faulted, or ignoring commands.

Bad diagnosis:

text

PLC is connected, so PLC is fine.

Better diagnosis:

text

Connection health: OK
Heartbeat health: OK
Command execution health: Failed
Physical progress: No encoder/sensor change

How experienced engineers handle it:

They separate:

text

Connectivity health
Controller health
Command health
Physical progress health

Scenario 2 — Watchdog Timeout Too Short Causes False Stops

Production symptom:

text

Machine randomly stops during heavy load.
Operators report false alarms.
Logs show camera timeout.

Why it happens:

The watchdog threshold was set based on ideal lab timing, not real production timing.

Example:

text

Normal lab frame interval: 50 ms
Production occasionally: 120–180 ms
Watchdog timeout: 100 ms

Result:

The watchdog becomes noise.

How experienced engineers handle it:

They measure real timing distribution and set thresholds with margin.

They also separate warning and fault thresholds.
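
One way to turn measured timing into a threshold is to take a high percentile of production intervals and multiply it by a margin. This is a sketch under stated assumptions: the nearest-rank percentile method, the chosen percentile, and the margin factor are illustrative choices, and `ThresholdEstimator` is a hypothetical name.

```csharp
using System;
using System.Linq;

public static class ThresholdEstimator
{
    // Nearest-rank percentile of measured intervals, in milliseconds.
    public static double Percentile(double[] samplesMs, double percentile)
    {
        if (samplesMs.Length == 0)
            throw new ArgumentException("No samples.", nameof(samplesMs));

        var sorted = samplesMs.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank - 1, 0, sorted.Length - 1)];
    }

    // Warning threshold = high percentile of real production timing times a margin.
    public static double WarningThresholdMs(
        double[] samplesMs, double percentile, double marginFactor)
        => Percentile(samplesMs, percentile) * marginFactor;
}
```

For the timings in this scenario, intervals that occasionally reach 180 ms with a margin factor of 1.5 give a warning threshold of 270 ms, well above the 100 ms timeout that caused the false stops.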

Scenario 3 — Watchdog Timeout Too Long Delays Safe Recovery

Production symptom:

text

Motion axis gets stuck.
Machine waits 5 minutes before faulting.
Operator loses time.
Material may be at risk.

Why it happens:

Timeout was set too generously to avoid false positives.

The machine detects failure too late.

Better approach:

text

Expected motion duration calculated from distance/speed.
Warning after expected duration + margin.
Fault after maximum physically reasonable duration.

A 10 mm move and a 500 mm move should not have the same timeout.
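
A sketch of a distance-based timeout, assuming a symmetric trapezoidal velocity profile (same acceleration and deceleration, no jerk limiting). Real controllers may use other profiles, so treat the formula as an estimate. `MoveTimeout` and the factor parameters are hypothetical.

```csharp
using System;

public static class MoveTimeout
{
    // Ideal duration of a symmetric trapezoidal move.
    public static TimeSpan ExpectedDuration(
        double distanceMm, double maxVelocityMmPerS, double accelMmPerS2)
    {
        if (maxVelocityMmPerS <= 0) throw new ArgumentOutOfRangeException(nameof(maxVelocityMmPerS));
        if (accelMmPerS2 <= 0) throw new ArgumentOutOfRangeException(nameof(accelMmPerS2));

        double d = Math.Abs(distanceMm);
        // Distance consumed by accelerating to max velocity and braking again.
        double rampDistance = maxVelocityMmPerS * maxVelocityMmPerS / accelMmPerS2;

        double seconds = d >= rampDistance
            ? d / maxVelocityMmPerS + maxVelocityMmPerS / accelMmPerS2 // reaches cruise speed
            : 2.0 * Math.Sqrt(d / accelMmPerS2);                       // triangular profile

        return TimeSpan.FromSeconds(seconds);
    }

    // Warning and fault thresholds as multiples of the expected duration.
    public static (TimeSpan Warning, TimeSpan Fault) Thresholds(
        TimeSpan expected, double warnFactor, double faultFactor)
        => (TimeSpan.FromSeconds(expected.TotalSeconds * warnFactor),
            TimeSpan.FromSeconds(expected.TotalSeconds * faultFactor));
}
```

With 100 mm/s and 500 mm/s², a 500 mm move is expected to take about 5.2 s and a 10 mm move about 0.28 s, so a shared fixed timeout would be either far too long for one or too short for the other.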

Scenario 4 — Queue Backlog Grows but Health Remains Green

Production symptom:

text

Inspection continues.
Camera keeps acquiring.
Processing queue grows.
Memory increases.
Eventually app freezes or crashes.

Why it happens:

Health model only checks whether services are running.

It does not monitor throughput or backlog.

Bad health model:

text

Camera connected: true
Processor running: true
Storage connected: true
System healthy: true

Better health model:

text

Frame arrival rate: OK
Processing rate: too slow
Queue depth: rising
Oldest frame age: 12 seconds
System state: Degraded

Scenario 5 — Background Worker Dies Silently

Production symptom:

text

Machine appears idle.
No new results are saved.
No alarm appears.

Why it happens:

A background task threw an exception and exited.

Nobody observed the task.

Bad design:

csharp

Task.Run(() => ProcessResults());

No supervision. No heartbeat. No restart strategy. No fault propagation.

Better design:

text

Worker has:
- lifecycle state
- last activity timestamp
- unhandled exception capture
- health contribution
- supervisor/monitor ownership
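
A minimal sketch of a worker with those properties, assuming the unit of work is supplied as a delegate. `SupervisedWorker` is a hypothetical name, and restart policy is deliberately left to whoever owns the worker.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum WorkerState { Running, Stopped, Faulted }

// Hypothetical supervised worker: it records activity and captures failures
// instead of letting the task die unobserved.
public sealed class SupervisedWorker
{
    private long _lastActivityTicks = DateTimeOffset.UtcNow.UtcTicks;
    private volatile Exception? _failure;
    private readonly Task _task;

    public SupervisedWorker(Func<CancellationToken, Task> processOne, CancellationToken token)
    {
        _task = Task.Run(async () =>
        {
            try
            {
                while (!token.IsCancellationRequested)
                {
                    await processOne(token);
                    // Updated only after a unit of work finished, so it reflects progress.
                    Interlocked.Exchange(ref _lastActivityTicks, DateTimeOffset.UtcNow.UtcTicks);
                }
            }
            catch (OperationCanceledException) when (token.IsCancellationRequested)
            {
                // Normal shutdown, not a fault.
            }
            catch (Exception ex)
            {
                _failure = ex; // kept so a monitor can report why the worker stopped
            }
        });
    }

    public Task Completion => _task;
    public Exception? Failure => _failure;

    public DateTimeOffset LastActivity =>
        new(Interlocked.Read(ref _lastActivityTicks), TimeSpan.Zero);

    public WorkerState State =>
        _failure is not null ? WorkerState.Faulted :
        _task.IsCompleted ? WorkerState.Stopped : WorkerState.Running;
}
```

A health monitor can poll `State` and `LastActivity`: `Faulted` maps directly to a fault, while `Running` with an old `LastActivity` indicates a worker that is alive but stuck.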

Scenario 6 — Stale Sensor Value Treated as Current

Production symptom:

text

Software thinks vacuum is OK.
Actually vacuum was lost seconds ago.
Machine continues incorrectly.

Why it happens:

The system stores the last known value but does not check its age.

Bad:

text

VacuumOk = true

Better:

text

VacuumOk = true
LastUpdated = 2026-04-27 10:12:01.230
Age = 8.5 seconds
Quality = Stale

Strong rule:

Every important sensor value should have a timestamp and freshness policy.

Scenario 7 — Health Monitor Itself Becomes Unreliable

Production symptom:

text

Everything is green.
But later logs show health monitor stopped updating.

Why it happens:

The monitor was implemented as just another background worker with no supervision.

How experienced engineers handle it:

They design health monitoring as a first-class subsystem.

At minimum:

text

Health monitor has own heartbeat.
Health snapshot includes timestamp.
Consumers reject stale health snapshots.
Critical monitors are simple and robust.

Important principle:

Health monitoring must not become a hidden single point of failure.
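
The "consumers reject stale health snapshots" rule can be sketched as below. Treating a stale report as `Faulted` is an assumption made for the example; some systems may prefer `Suspect`. `HealthReport` and `HealthConsumer` are illustrative names.

```csharp
using System;

public enum HealthState { Healthy, Suspect, Degraded, Faulted, Recovering, Offline }

public sealed record HealthReport(HealthState State, DateTimeOffset Timestamp);

public static class HealthConsumer
{
    // A report older than maxAge is not trusted, whatever state it claims.
    public static HealthState EffectiveState(
        HealthReport report, DateTimeOffset now, TimeSpan maxAge)
        => now - report.Timestamp > maxAge ? HealthState.Faulted : report.State;
}
```

This way a dead monitor cannot keep the machine green: once its last report ages out, every consumer sees a fault.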

Scenario 8 — Reconnect Resets Heartbeat but Device State Remains Invalid

Production symptom:

text

Device disconnects and reconnects.
UI becomes green.
Next command fails or behaves incorrectly.

Why it happens:

Reconnect restored communication, but not machine readiness.

The device may need:

text

re-initialization
configuration reload
homing
state verification
mode synchronization
buffer clearing

Bad model:

text

Connected = true
Therefore Ready = true

Better model:

text

Connected
Initialized
Configured
StateVerified
Ready

Reconnect is not recovery by itself.
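
The layered model can be sketched with a flags enum in which `Ready` is derived instead of being set directly. The type and member names are hypothetical, and the stage list mirrors the model above.

```csharp
using System;

[Flags]
public enum ReadinessStage
{
    None = 0,
    Connected = 1,
    Initialized = 2,
    Configured = 4,
    StateVerified = 8
}

public sealed class DeviceReadiness
{
    private const ReadinessStage AllStages =
        ReadinessStage.Connected | ReadinessStage.Initialized |
        ReadinessStage.Configured | ReadinessStage.StateVerified;

    public ReadinessStage Completed { get; private set; }

    public void Complete(ReadinessStage stage) => Completed |= stage;

    public void OnDisconnected() => Completed = ReadinessStage.None;

    // A reconnect proves only the link; every later stage must be redone.
    public void OnReconnected() => Completed = ReadinessStage.Connected;

    // Ready is never assigned directly; it holds only when every stage is complete.
    public bool IsReady => Completed == AllStages;
}
```

Because `IsReady` is computed, a reconnect cannot turn the device green by accident.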

Part 8 — Software Design Implications

Health monitoring must be designed into the architecture. It cannot be added only at the end as a few timers.

Bad Approach

text

Each device exposes IsConnected.
UI checks IsConnected.
If true, show green.

Problems:

text

Connected does not mean ready.
Ready does not mean progressing.
Progressing does not mean producing valid output.
Valid once does not mean fresh now.

Good Approach

Use layered health signals:

text

Connectivity
Initialization
Configuration
Operational state
Progress
Freshness
Performance
Resource pressure
Fault history

Component Diagram

text

+------------------+
| Device / Worker  |
| Camera / Axis    |
| PLC / Pipeline   |
+---------+--------+
          |
          | heartbeat
          | progress events
          | status snapshots
          | timestamps
          v
+------------------+
|  Health Monitor  |
| freshness checks |
| progress checks  |
| thresholds       |
| state model      |
+---------+--------+
          |
          | health state
          | diagnostic evidence
          v
+------------------+
|  Fault Manager   |
| severity         |
| escalation       |
| ownership        |
+---------+--------+
          |
          | action request
          v
+------------------+
| Machine State    |
| alarm            |
| inhibit command  |
| controlled stop  |
| diagnostics      |
+------------------+

Key design principle:

Detection and action should be separated.

The health monitor should report what it observed:

text

Camera frame stream is stale.

The fault/recovery layer should decide:

text

Warn operator.
Stop inspection.
Block next wafer.
Attempt controlled recovery.
Require service intervention.

This separation prevents health monitoring from becoming a messy place full of machine-control decisions.
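
A sketch of that separation, assuming the monitor emits findings and a separate policy maps them to actions. The action names and the mapping are illustrative only; a real policy would depend on the machine.

```csharp
using System;

public enum HealthState { Healthy, Suspect, Degraded, Faulted, Recovering, Offline }
public enum MachineAction { None, WarnOperator, StopInspection, BlockNextWafer }

// Detection side: describes what was observed and never commands the machine.
public sealed record HealthFinding(string Component, HealthState State, string Reason);

// Action side: owns the policy that maps findings to machine actions.
public static class FaultPolicy
{
    public static MachineAction Decide(HealthFinding finding, bool inspectionActive)
        => finding.State switch
        {
            HealthState.Faulted when inspectionActive => MachineAction.StopInspection,
            HealthState.Faulted => MachineAction.BlockNextWafer,
            HealthState.Degraded or HealthState.Suspect => MachineAction.WarnOperator,
            _ => MachineAction.None
        };
}
```

The same finding can lead to different actions depending on machine context, which is why the policy lives outside the monitor.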

Health Snapshot Example

A good health snapshot contains evidence, not only status.

csharp

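using System;
using System.Collections.Generic;

// Coarse health classification shared by all components.
public enum HealthState
{
    Healthy,
    Suspect,
    Degraded,
    Faulted,
    Recovering,
    Offline
}

// Immutable health report: the state plus the evidence that justifies it.
public sealed record HealthSnapshot(
    string ComponentName,
    HealthState State,
    DateTimeOffset Timestamp,                      // when the snapshot was taken
    string Reason,                                 // human-readable cause
    IReadOnlyDictionary<string, object> Evidence); // measured values behind the state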

Example output:

text

Component: CameraAcquisition
State: Degraded
Reason: Frame interval above warning threshold
Evidence:
- LastFrameAgeMs = 420
- ExpectedFrameIntervalMs = 50
- WarningThresholdMs = 250
- FaultThresholdMs = 1000
- FrameQueueDepth = 320
- AcquisitionState = Running

This is much better than:

text

CameraHealthy = false

Because support engineers need to know why.

Part 9 — Interview / Real-World Talking Points

How to Explain Watchdogs and Heartbeats Clearly

A strong answer:

A heartbeat tells me a component is still alive. A watchdog tells me whether expected progress happened within an acceptable time window. In industrial systems, I do not rely only on connectivity or heartbeat. I monitor freshness, command completion, workflow progress, queue backlog, and functional readiness. The goal is to detect stuck or degraded behavior before it becomes unsafe or causes production loss.

Why “Connected” Is Not Equal to “Healthy”

Because connection is only one layer.

A device may be:

text

Connected but not initialized.
Initialized but not configured.
Configured but not ready.
Ready but not progressing.
Progressing but producing stale or invalid data.

So the correct model is layered health, not a single boolean.

Common Mistakes Software Engineers Make

Common mistakes:

text

Using only IsConnected
Using heartbeat as complete health
No timestamp on sensor values
No watchdog for workflow steps
No command timeout per operation type
Same timeout for all operations
No queue/backlog monitoring
No distinction between warning and fault
No health state history
No diagnostic evidence
Health monitor tightly coupled to recovery actions

The most dangerous mistake:

Assuming that “nothing happened” means “nothing is wrong.”

What Strong Engineers Understand

Strong industrial software engineers understand that health is about:

text

freshness
progress
context
timing
trend
evidence
escalation

They ask:

text

Is the data fresh?
Is the command progressing?
Is the workflow moving?
Is the queue draining?
Is the component functionally usable?
Is the problem transient or repeated?
Should we warn, degrade, stop, or recover?

They also understand that health monitoring must be context-aware.

Example:

text

No camera frames while Idle:
Healthy

No camera frames while Inspecting:
Fault

Same signal. Different meaning.

Final Mental Model

In business software, health often means:

text

Can I reach the service?

In industrial machine software, health means:

text

Can this subsystem safely and correctly perform its machine responsibility right now,
with fresh data, valid state, and observable progress?

That is the mindset shift.

A production-grade machine does not only ask:

text

Are components alive?

It asks:

text

Are they making the right progress at the right time?

That is the real purpose of watchdogs, heartbeats, and health monitoring.
