
Production Monitoring & Alerting in Industrial Machine Software

Production monitoring is the system that tells you:

“Is the machine still producing correctly, safely, and efficiently — and do the right people need to act?”

This topic sits squarely within reliability, long-running behavior, and observability/serviceability. The roadmap explicitly treats industrial systems as long-running machines where health metrics, performance counters, diagnostics, serviceability, and health dashboards directly affect downtime and support cost.


PART 1 — Why Production Monitoring Is Different From Logging

Big picture

Logging answers:

“What happened?”

Production monitoring answers:

“What is happening now, is it getting worse, and should someone act?”

A machine can be technically “running” but operationally unhealthy.

Example:

text
Process: running
UI: responsive
Machine state: Auto
Current alarm: none

But:
- cycle time increased from 4.2s to 5.8s
- camera reconnects increased
- image queue depth is growing
- operator interventions increased
- disk free space is falling

From a business-software mindset, this may look fine.

From an industrial-machine mindset, this may mean:

“The machine is slowly moving toward downtime or quality loss.”

Logging vs monitoring

text
+-------------------+-------------------------------+
| Logging           | Monitoring                    |
+-------------------+-------------------------------+
| Event detail      | Operational signal            |
| After-the-fact    | Real-time / near-real-time    |
| Debugging         | Detection and decision        |
| Developer-focused | Operator / maintenance / ops  |
| "What happened?"  | "Is action needed?"           |
+-------------------+-------------------------------+

A log entry may say:

text
Camera reconnect succeeded after timeout.

A monitoring signal says:

text
Camera reconnect count increased from 1/hour to 18/hour.
This camera is becoming unstable.

That second one is much more useful for production.


PART 2 — What Should Be Monitored in Production

A production machine should not only expose whether the software process is alive. It should expose whether the machine is healthy, productive, stable, and supportable.

1. Availability / uptime

This tells you whether the machine is available for production.

But uptime alone is dangerous.

Bad metric:

text
MachineApp.exe is running = healthy

Better metrics:

text
Machine app running
Controller connected
Devices initialized
Machine ready for auto mode
Machine producing
Machine blocked
Machine faulted
Machine waiting for operator

A machine can have 99% process uptime but only 70% useful production time.


2. Machine state distribution

You want to know how much time the machine spends in each state.

Example:

text
Auto Running:       68%
Idle:               12%
Waiting Material:    8%
Alarmed:             5%
Manual Mode:         4%
Recovering:          3%

This tells supervisors and engineers where production time is going.

If “Recovering” grows from 1% to 8%, something is degrading even before hard failures appear.
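
As a rough illustration, the state distribution can be derived from a time-ordered list of state transitions. This is a minimal Python sketch; the function name `state_distribution` and the `(timestamp, state)` tuple format are assumptions made for the example, not part of any specific machine framework.

```python
from collections import defaultdict


def state_distribution(transitions, end_time):
    """Return the share of time spent in each state, in percent.

    transitions: list of (timestamp_seconds, state_name), sorted by time.
    end_time: timestamp that closes the last state interval.
    """
    durations = defaultdict(float)
    # Pair each transition with the next one to get the interval length.
    boundaries = transitions[1:] + [(end_time, None)]
    for (start, state), (stop, _) in zip(transitions, boundaries):
        durations[state] += stop - start
    total = sum(durations.values())
    if total <= 0:
        return {}
    return {state: 100.0 * secs / total for state, secs in durations.items()}


# Hypothetical transition log covering 1000 seconds.
transitions = [
    (0, "Auto Running"),
    (600, "Recovering"),
    (660, "Auto Running"),
    (900, "Idle"),
]
shares = state_distribution(transitions, end_time=1000)
```

With this sample data, Auto Running accounts for 84% of the window, Idle for 10%, and Recovering for 6%. Comparing the same calculation across shifts or days is what makes a growing "Recovering" share visible.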


3. Alarms and fault frequency

You monitor alarm patterns, not just current alarms.

Important signals:

text
Alarm count per hour
Top recurring alarms
Mean time between faults
Faults by subsystem
Faults by recipe
Faults by shift
Faults after maintenance
Faults after software version change

A single alarm may be normal.

The same alarm 30 times per shift is a production problem.
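
One possible way to surface recurring alarms is to count alarm codes per shift and keep only those above a minimum count. The sketch below assumes alarm events arrive as `(shift, alarm_code)` pairs; that format, the alarm codes, and the `min_count` default are illustrative choices only.

```python
from collections import Counter


def recurring_alarms(alarm_events, min_count=3):
    """Group alarms by shift and keep codes seen at least min_count times.

    alarm_events: iterable of (shift, alarm_code) pairs (hypothetical format).
    Returns {shift: [(alarm_code, count), ...]}, highest count first.
    """
    per_shift = {}
    for shift, code in alarm_events:
        per_shift.setdefault(shift, Counter())[code] += 1
    return {
        shift: [(code, n) for code, n in counts.most_common() if n >= min_count]
        for shift, counts in per_shift.items()
    }


# Hypothetical sample: one alarm repeats 30 times on the day shift.
events = (
    [("day", "VACUUM_NOT_REACHED")] * 30
    + [("day", "DOOR_OPEN")] * 2
    + [("night", "WAFER_NOT_DETECTED")] * 4
)
report = recurring_alarms(events)
```

Here the two door-open events drop out of the report, while the alarm that repeated 30 times stands out for the day shift.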


4. Cycle time and throughput

This is one of the most important production signals.

Example:

text
Expected wafer inspection cycle: 45 seconds
Current average:                 53 seconds
P95 cycle time:                  71 seconds

The machine is still running, but output is dropping.
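
A short sketch of how the average and P95 figures might be computed from recorded cycle times. It uses the nearest-rank percentile definition, which is one of several common conventions; the function names and sample values are assumptions for illustration.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cycle_time_summary(cycle_times_s):
    """Mean and P95 of a list of cycle times in seconds."""
    return {
        "mean_s": sum(cycle_times_s) / len(cycle_times_s),
        "p95_s": percentile_nearest_rank(cycle_times_s, 95),
    }


# Hypothetical cycle times in seconds; one slow cycle dominates the tail.
samples = [44, 45, 46, 45, 47, 52, 49, 71, 45, 46]
summary = cycle_time_summary(samples)
```

For these samples the mean is 49.0 seconds while P95 is 71 seconds. The gap between the two is the reason to track a tail percentile alongside the average: a few slow cycles barely move the mean but show up clearly in P95.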

For inspection machines, throughput can degrade because of:

text
slower motion
camera exposure changes
image processing backlog
increased retries
operator pauses
storage write latency
network upload delay
recipe-specific workload

5. Device health and reconnect counts

Industrial devices often fail gradually.

Monitor:

text
camera reconnect count
PLC communication drops
motion controller errors
light controller timeout count
IO module reconnects
robot handshake failures
sensor invalid-read count

Example:

text
Camera reconnects:
08:00 - 09:00: 0
09:00 - 10:00: 1
10:00 - 11:00: 3
11:00 - 12:00: 9

That trend matters more than any single reconnect.


6. Retry and timeout rates

Retries are useful, but they can hide real problems.

If software silently retries everything, the machine may appear stable while performance and reliability degrade.

Monitor:

text
retry count per subsystem
timeout count per operation
retry success rate
retry latency impact
repeated transient faults

A retry is not always “problem solved.”

Sometimes it means:

“The system is compensating for a developing failure.”
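
To keep retries visible, the retry helper itself can count what it does. The following is a minimal sketch, assuming a plain in-process `Counter` as the metrics store and `TimeoutError` as the transient failure; the metric names and the `call_with_retry` helper are hypothetical.

```python
import time
from collections import Counter

retry_metrics = Counter()  # hypothetical in-process metric store


def call_with_retry(operation, name, attempts=3, delay_s=0.0):
    """Run operation(), retrying on TimeoutError and counting every retry."""
    for attempt in range(1, attempts + 1):
        try:
            result = operation()
        except TimeoutError:
            retry_metrics[f"{name}.timeouts"] += 1
            if attempt == attempts:
                retry_metrics[f"{name}.exhausted"] += 1
                raise
            retry_metrics[f"{name}.retries"] += 1
            time.sleep(delay_s)
        else:
            if attempt > 1:
                # Success, but only because a retry compensated.
                retry_metrics[f"{name}.recovered_by_retry"] += 1
            return result


# Simulated device call that times out twice before succeeding.
calls = {"n": 0}


def flaky_grab():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("camera timeout")
    return "frame"


frame = call_with_retry(flaky_grab, "camera.grab")
```

The caller still receives a frame, but the counters record two timeouts and one recovery by retry, so the compensation is no longer invisible to monitoring.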


7. Queue depth and backlog

Very important in vision, imaging, streaming, and inspection systems.

Example pipeline:

text
Camera Acquisition -> Image Queue -> Processing -> Result Queue -> Storage -> Upload

Monitor:

text
image queue depth
processing queue depth
result queue depth
storage backlog
upload backlog
dropped frame count
oldest item age

A queue depth of 5 may be fine.

A queue depth that grows continuously means the system cannot keep up.
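
The difference between "deep" and "growing" can be expressed as a simple check over recent depth samples. This sketch assumes periodic sampling of queue depth; the window size and growth threshold are illustrative defaults, not recommended values.

```python
def backlog_is_growing(depth_samples, min_samples=5, min_growth=1):
    """True if depth never decreased across the last min_samples samples
    and rose by at least min_growth over that window."""
    if len(depth_samples) < min_samples:
        return False
    window = depth_samples[-min_samples:]
    never_decreased = all(b >= a for a, b in zip(window, window[1:]))
    return never_decreased and (window[-1] - window[0]) >= min_growth


# Depth fluctuates around 4: the pipeline is keeping up.
steady = backlog_is_growing([3, 5, 4, 5, 3, 4])
# Depth rises on every sample: the consumer is falling behind.
growing = backlog_is_growing([2, 4, 7, 9, 12, 15])
```

A production rule would likely tolerate small dips and combine this with the age of the oldest queued item, but the idea is the same: alert on the direction of the backlog, not only on its size.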


8. Resource usage

Monitor:

text
CPU
memory
GPU, if used
disk capacity
disk write latency
network latency
handle count
thread count
native memory
database size
log folder size

In industrial systems, this matters because the app may run for days or weeks.

A slow memory leak may not appear during a demo, but it can kill production after 36 hours.


9. Image / inspection quality metrics

For inspection machines, monitoring quality signals is critical.

Examples:

text
focus score
brightness mean / variance
exposure time drift
defect count distribution
false defect rate
classification confidence
alignment failure rate
image acquisition failure rate

If false defects rise suddenly, production may still continue, but quality trust is damaged.


10. Storage capacity and write failures

This is a classic production failure.

The machine runs fine until:

text
disk full
image save fails
result database cannot write
logs cannot rotate
export queue blocks

Monitor:

text
free disk percentage
days until disk full
write failure count
storage latency
old data cleanup status
archive job status
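
"Days until disk full" can be estimated by fitting a straight line to recent usage samples and extrapolating. This is a minimal sketch under the assumption that growth is roughly linear over the sampled period; the function name and sample values are hypothetical.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours from the last sample until the disk reaches capacity.

    samples: list of (hours_since_start, used_gb).
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: growth in GB per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    last_used = samples[-1][1]
    return max(0.0, (capacity_gb - last_used) / slope)


# Hypothetical hourly samples on a 1000 GB volume, growing 10 GB per hour.
usage = [(0, 800), (1, 810), (2, 820), (3, 830)]
remaining_h = hours_until_full(usage, capacity_gb=1000)
```

With these samples the estimate is about 17 hours. The estimate is only as good as the linear assumption, so it is worth recomputing it regularly and treating it as a warning signal rather than a precise forecast.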

11. Recipe / config / version context

Metrics without context are weak.

Always attach:

text
machine ID
software version
firmware version
recipe name/version
lot/job/run ID
operator mode
shift
product type
camera profile
calibration version

Without this, support may know “throughput dropped” but not why.

Maybe the issue only happens on one recipe.

Maybe it started after a firmware update.

Maybe it only happens on night shift.


PART 3 — Local HMI Alarms vs Production Alerting

Local HMI alarms and production alerts are related, but they are not the same thing.

Local HMI alarm

Purpose:

“Tell the operator what needs immediate attention at this machine.”

Examples:

text
Door open
Vacuum not reached
Motion axis fault
Wafer not detected
Camera disconnected
Emergency stop active

Local alarms are machine-specific and immediate.

They usually answer:

text
What happened?
Can the operator continue?
What recovery action is needed?

Production / factory alert

Purpose:

“Tell the right production, maintenance, engineering, or support role that a broader operational issue needs attention.”

Examples:

text
Machine throughput dropped 15% for 2 hours
Camera reconnect count exceeds normal baseline
Disk will be full in 12 hours
Same fault occurred 20 times this shift
Three machines running same recipe show increased false defects
Processing backlog growing continuously

Production alerts are often trend-based, aggregated, and role-specific.

Layer diagram

text
                +----------------------+
                |   Physical Machine   |
                | devices, motion, IO  |
                +----------+-----------+
                           |
             immediate fault/alarm
                           |
                           v
                +----------------------+
                |      Local HMI       |
                | operator alarm/action|
                +----------------------+

                           |
                           | metrics / events / health
                           v

                +----------------------+
                |  Monitoring System   |
                | trends / aggregation |
                +----------+-----------+
                           |
          +----------------+----------------+
          |                |                |
          v                v                v
   +-------------+  +---------------+  +-------------+
   | Alerting    |  | Reports       |  | Dashboards  |
   | maintenance |  | shift/OEE     |  | engineering |
   +-------------+  +---------------+  +-------------+

Important rule:

Not every local alarm should become a remote alert.

If every operator alarm becomes a remote alert, engineers and supervisors will ignore them.

Example:

text
Operator opened door during manual maintenance.

This may be a local alarm, but not a remote production alert.

However:

text
Door-open interruption happened 40 times this shift and caused 90 minutes lost time.

That may become a production monitoring issue.


PART 4 — Alert Conditions and Thresholds

A good alert should be based on a meaningful condition, not raw noise.

Bad alert:

text
Send alert every time a timeout occurs.

Better alert:

text
Send warning if camera timeout rate exceeds 5 per hour
while machine is in Auto mode
for more than 10 minutes.
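
The "better alert" above combines three parts: a rate threshold, a machine-mode filter, and a persistence requirement. A minimal sketch of such a rule follows, assuming the caller periodically supplies the current time, the machine mode, and the timeout count over the last hour; the class and parameter names are hypothetical.

```python
class SustainedRule:
    """Fires once a mode-qualified rate breach has held for longer than hold_s."""

    def __init__(self, max_per_hour=5, required_mode="Auto", hold_s=600):
        self.max_per_hour = max_per_hour
        self.required_mode = required_mode
        self.hold_s = hold_s
        self._breach_started = None

    def evaluate(self, now_s, mode, timeouts_last_hour):
        breached = (
            mode == self.required_mode
            and timeouts_last_hour > self.max_per_hour
        )
        if not breached:
            self._breach_started = None  # any interruption resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = now_s
        return now_s - self._breach_started > self.hold_s


rule = SustainedRule()
results = [
    rule.evaluate(0, "Auto", 8),      # breach starts, not yet sustained
    rule.evaluate(300, "Auto", 9),    # 5 minutes in
    rule.evaluate(601, "Auto", 9),    # more than 10 minutes: fire
    rule.evaluate(700, "Manual", 9),  # not in Auto: rule does not apply
]
```

The rule stays quiet for a single burst and for anything that happens outside Auto mode, which removes most of the noise that the "alert on every timeout" version produces.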

Common alert conditions

text
error rate exceeds threshold
retry count increasing
queue depth above safe range
oldest queued item too old
disk below capacity threshold
cycle time drift exceeds baseline
device reconnect count exceeds normal range
same transient fault repeats
machine stuck in recovering state
inspection false defect rate rises

Static threshold vs trend-based alert

Static threshold:

text
Disk free < 10%

Trend-based alert:

text
Disk will be full within 16 hours based on current growth rate.

Static threshold:

text
Cycle time > 60 seconds

Trend-based alert:

text
Cycle time increased 20% compared with the same recipe baseline.

For machines, trend-based alerts are often more valuable because different recipes, products, or modes may have different normal behavior.
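
A baseline comparison can be as small as a lookup plus a relative difference. In this sketch the recipe names, baseline values, and the 20% limit are illustrative assumptions; in practice the baselines would come from recorded production history.

```python
# Hypothetical per-recipe baseline cycle times in seconds.
BASELINE_CYCLE_S = {"RECIPE_A": 45.0, "RECIPE_B": 85.0}


def cycle_time_drift(recipe, current_avg_s, baselines=BASELINE_CYCLE_S, limit=0.20):
    """Compare the current average cycle time with the recipe's own baseline.

    Returns None when no baseline exists for the recipe.
    """
    baseline = baselines.get(recipe)
    if baseline is None:
        return None
    drift = (current_avg_s - baseline) / baseline
    return {"recipe": recipe, "drift": drift, "alert": drift > limit}


# The same 80-second average means different things for different recipes.
slow_a = cycle_time_drift("RECIPE_A", 80.0)
normal_b = cycle_time_drift("RECIPE_B", 80.0)
```

An 80-second cycle is far above the baseline for the first recipe and slightly below it for the second, so a single static threshold would either miss the first case or raise a false alert on the second.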

Alert lifecycle diagram

text
+-------------+
| Raw Signal  |
| metric/event|
+------+------+
       |
       v
+-------------+
| Condition   |
| threshold / |
| trend rule  |
+------+------+
       |
       v
+-------------+
| Alert Raised|
| severity +  |
| owner       |
+------+------+
       |
       v
+-------------+
| Acknowledged|
| by operator |
| or support  |
+------+------+
       |
       v
+-------------+
| Action Taken|
| inspect /   |
| repair /    |
| tune / fix  |
+------+------+
       |
       v
+-------------+
| Resolved    |
| condition   |
| cleared     |
+-------------+

A professional alert system tracks the lifecycle.

It does not just “send messages.”
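
The lifecycle in the diagram can be modeled as a small state machine that rejects out-of-order transitions and keeps a history. This is a sketch that follows the diagram strictly; a real system might allow additional transitions (for example, a condition clearing before acknowledgement), and all names here are hypothetical.

```python
from enum import Enum


class AlertState(Enum):
    RAISED = "raised"
    ACKNOWLEDGED = "acknowledged"
    ACTION_TAKEN = "action_taken"
    RESOLVED = "resolved"


# Allowed next states, mirroring the linear lifecycle diagram.
ALLOWED = {
    AlertState.RAISED: {AlertState.ACKNOWLEDGED},
    AlertState.ACKNOWLEDGED: {AlertState.ACTION_TAKEN},
    AlertState.ACTION_TAKEN: {AlertState.RESOLVED},
    AlertState.RESOLVED: set(),
}


class Alert:
    def __init__(self, alert_id, severity, owner):
        self.alert_id = alert_id
        self.severity = severity
        self.owner = owner
        self.state = AlertState.RAISED
        self.history = [(AlertState.RAISED, None)]

    def advance(self, new_state, actor):
        """Move to new_state, recording who did it; reject invalid jumps."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
        self.history.append((new_state, actor))


alert = Alert("A-1", severity="Warning", owner="Maintenance")
alert.advance(AlertState.ACKNOWLEDGED, actor="operator")
alert.advance(AlertState.ACTION_TAKEN, actor="maintenance")
alert.advance(AlertState.RESOLVED, actor="maintenance")
```

Keeping the history makes it possible to report how long alerts wait for acknowledgement and which alerts are resolved without any recorded action.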


Severity levels

A simple model:

text
Info:
  Interesting, no action required now.

Warning:
  Degradation or risk. Action soon.

Alarm / Critical:
  Production is affected or machine may stop soon.

Emergency / Safety:
  Safety-related condition. Usually handled locally and by safety systems.

Do not overload severity.

If everything is critical, nothing is critical.


Hysteresis and suppression

Without hysteresis, alerts flap.

Bad:

text
Temperature > 70°C => alert
Temperature < 70°C => clear

If the value moves between 69.8 and 70.2, the alert keeps opening and closing.

Better:

text
Raise alert when temperature > 70°C for 5 minutes.
Clear alert when temperature < 65°C for 10 minutes.

This reduces noise and improves trust.
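
The raise/clear rule above can be sketched as a small class with separate thresholds and hold times. The defaults mirror the example values (70°C for 5 minutes, 65°C for 10 minutes); the class name and the sampling interface are assumptions for illustration.

```python
class HysteresisAlert:
    """Raise after a sustained breach, clear only after a sustained recovery."""

    def __init__(self, raise_above=70.0, raise_hold_s=300,
                 clear_below=65.0, clear_hold_s=600):
        self.raise_above = raise_above
        self.raise_hold_s = raise_hold_s
        self.clear_below = clear_below
        self.clear_hold_s = clear_hold_s
        self.active = False
        self._since = None  # when the pending raise/clear condition started

    def update(self, now_s, value):
        """Feed one sample; returns True while the alert is active."""
        if self.active:
            condition, hold = value < self.clear_below, self.clear_hold_s
        else:
            condition, hold = value > self.raise_above, self.raise_hold_s
        if not condition:
            self._since = None  # condition broken, restart the timer
        else:
            if self._since is None:
                self._since = now_s
            if now_s - self._since >= hold:
                self.active = not self.active
                self._since = None
        return self.active


alert = HysteresisAlert()
# One sample per minute at 71 degrees: raised once 5 minutes have passed.
raised = [alert.update(t, 71.0) for t in range(0, 301, 60)]
# Hovering just below 70 afterwards does not clear the alert.
still_active = alert.update(360, 69.8)
```

Because the clear threshold is lower than the raise threshold and both require persistence, a value oscillating between 69.8 and 70.2 produces at most one alert instead of a stream of open/close events.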


PART 5 — Degradation Detection

Many serious failures are preceded by weak signals.

A machine rarely goes from perfect to dead instantly.

More often:

text
Healthy -> slightly unstable -> degraded -> faulted

Trend diagram

text
Health
  ^
  |
  |  Healthy
  |  ************
  |              *********
  |                       *******
  |                              *****
  |                                   ***
  |                                      **
  +------------------------------------------------> Time
       Healthy        Suspect       Degraded    Faulted

Another way to model it:

text
+---------+      +---------+      +----------+      +---------+
| Healthy | ---> | Suspect | ---> | Degraded | ---> | Faulted |
+---------+      +---------+      +----------+      +---------+
     |                |                |                 |
     |                |                |                 |
 normal          weak signal       production        hard stop /
 behavior        appears           impact visible    alarm

Examples of degradation

text
slower device response
more retries
more reconnects
more operator interventions
longer vacuum recovery time
rising temperature
higher CPU or memory
increased false defect rate
lower throughput
larger queue depth
more alignment failures

Why degradation detection is valuable

Detecting total failure is late.

Detecting degradation gives the factory time to act.

Example:

text
Vacuum recovery time:
Normal: 1.2 seconds
After 2 hours: 1.8 seconds
After 4 hours: 2.6 seconds
After 6 hours: 4.5 seconds
Eventually: vacuum timeout alarm

If you only alert on the timeout, production already stopped.

If you monitor the trend, maintenance can inspect the vacuum line, valve, filter, seal, or pump before downtime.
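
One simple way to turn such a measurement into the Healthy / Suspect / Degraded / Faulted states is to compare it with its normal baseline. In this sketch the 1.2-second baseline comes from the example above, while the 5.0-second timeout and the ratio thresholds are assumptions chosen only for illustration.

```python
def classify_recovery_time(value_s, baseline_s=1.2, timeout_s=5.0,
                           suspect_ratio=1.25, degraded_ratio=2.0):
    """Map a vacuum recovery time to a health state relative to its baseline.

    timeout_s and both ratios are hypothetical tuning values.
    """
    if value_s >= timeout_s:
        return "Faulted"
    ratio = value_s / baseline_s
    if ratio >= degraded_ratio:
        return "Degraded"
    if ratio >= suspect_ratio:
        return "Suspect"
    return "Healthy"


# Recovery times from the example, plus a reading at the assumed timeout.
readings = [1.2, 1.8, 2.6, 4.5, 5.0]
states = [classify_recovery_time(r) for r in readings]
```

With these thresholds the 1.8-second reading is already flagged as Suspect, hours before the hard timeout, which is the window in which maintenance can act without stopping production.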


PART 6 — Actionable Alerting

A good alert should answer five questions.

text
1. What is wrong?
2. How serious is it?
3. Who should act?
4. What action is expected?
5. What context is needed?

Bad alert

text
Error rate high.

This is almost useless.

Better alert

text
Machine WFI-03
Subsystem: Camera Acquisition
Severity: Warning
Condition: Camera reconnect count = 14 in 1 hour
Normal baseline: 0-2/hour
Current mode: Auto production
Recipe: WAFER_TOP_INSPECTION_V7
Impact: Acquisition delay increased average cycle time by 12%
Recommended action: Maintenance checks camera cable, power, and frame grabber.
Support context: Started after 10:42, no software restart since shift start.

That is actionable.
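
An alert payload like the one above can be given a fixed shape so that incomplete alerts are caught before they are sent. This is a sketch using a dataclass; the field names and the `missing_fields` helper are hypothetical, and the sample values are taken from the example alert.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class AlertPayload:
    machine_id: str
    subsystem: str
    severity: str
    condition: str
    owner: str
    recommended_action: str
    context: dict = field(default_factory=dict)  # recipe, mode, baseline, ...

    def missing_fields(self):
        """Names of required text fields that are empty."""
        return [
            name
            for name, value in asdict(self).items()
            if name != "context" and not str(value).strip()
        ]


payload = AlertPayload(
    machine_id="WFI-03",
    subsystem="Camera Acquisition",
    severity="Warning",
    condition="Camera reconnect count = 14 in 1 hour",
    owner="Maintenance",
    recommended_action="Check camera cable, power, and frame grabber.",
    context={
        "baseline": "0-2/hour",
        "mode": "Auto production",
        "recipe": "WAFER_TOP_INSPECTION_V7",
    },
)
problems = payload.missing_fields()
```

An alert engine could refuse to route any payload for which `missing_fields()` is not empty, which forces rule authors to name an owner and an expected action up front.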

Alert owner matters

Different alerts belong to different people.

text
+----------------------+------------------------------+
| Alert Type           | Likely Owner                 |
+----------------------+------------------------------+
| Door open            | Operator                     |
| Low air pressure     | Maintenance                  |
| Disk almost full     | IT / service / support       |
| Camera reconnects    | Maintenance + engineering    |
| False defect spike   | Process / vision engineer    |
| Software memory leak | Software engineering/support |
| Throughput drop      | Supervisor + engineering     |
+----------------------+------------------------------+

Alerts without ownership become background noise.


PART 7 — Real-World Failure Scenarios

Scenario 1 — Alert flood

What it looks like

During one device failure, the system sends:

text
Camera timeout
Camera reconnect failed
Image acquisition failed
Inspection step failed
Workflow failed
Result missing
Machine stopped
Upload failed

Everyone receives dozens of alerts.

Why it happens

The system alerts on every symptom instead of grouping around the root condition.

Better design

Use correlation and suppression.

text
Root alert:
Camera acquisition subsystem unavailable.

Suppressed child symptoms:
- image acquisition failed
- inspection step failed
- result missing

Alert on the root operational problem, not every downstream consequence.
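
A basic form of this correlation is a mapping from symptom to root condition: a symptom is suppressed only while its root is also active. The mapping and condition names below are hypothetical; real systems usually derive them from the subsystem dependency structure.

```python
# Hypothetical symptom -> root-condition mapping.
ROOT_CAUSE_OF = {
    "image_acquisition_failed": "camera_unavailable",
    "inspection_step_failed": "camera_unavailable",
    "result_missing": "camera_unavailable",
}


def group_alerts(active_conditions):
    """Split active conditions into alerts to send and suppressed symptoms.

    Returns (roots, suppressed) where suppressed maps root -> [symptoms].
    """
    active = set(active_conditions)
    roots, suppressed = [], {}
    for condition in active_conditions:
        root = ROOT_CAUSE_OF.get(condition)
        if root is not None and root in active:
            suppressed.setdefault(root, []).append(condition)
        elif condition not in roots:
            roots.append(condition)
    return roots, suppressed


roots, suppressed = group_alerts([
    "camera_unavailable",
    "image_acquisition_failed",
    "inspection_step_failed",
    "disk_low",
])
```

Only the camera root alert and the unrelated disk alert are sent. If a symptom appears without its root being active, it is still reported on its own, so suppression never hides an unexplained failure.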


Scenario 2 — Machine gradually slows down

What it looks like

The machine still runs, but daily output drops.

Operators feel it is “slower than usual,” but no alarm appears.

Why it happens

Only hard faults are monitored.

Cycle time, queue depth, and retry rate are not monitored.

Better design

Track:

text
average cycle time
P95 cycle time
throughput per hour
time in waiting/recovering states
recipe-specific baseline

Alert when performance drifts from expected behavior.


Scenario 3 — Disk fills up

What it looks like

Image saving suddenly fails.

Inspection results cannot be stored.

The machine may need to stop production.

Why it happens

The software monitored process uptime but not storage capacity, retention, or cleanup jobs.

Better design

Monitor:

text
free disk
growth rate
estimated time to full
cleanup job success
write latency
write failure count

Good alert:

text
Disk will be full in approximately 14 hours at current image storage rate.

Scenario 4 — Retry spike hides hardware issue

What it looks like

The machine keeps running because retry logic succeeds.

But cycle time increases and failures become more frequent.

Why it happens

Retries are treated as invisible implementation details.

Better design

Retries should be metrics.

text
retry_count
retry_success_rate
retry_added_latency
retry_by_device
retry_by_operation

A successful retry is still a signal.


Scenario 5 — Remote alert lacks machine context

What it looks like

Support receives:

text
Inspection failed.

They do not know:

text
which machine
which recipe
which camera
which lot
which software version
which alarm came first
whether this is repeated

Why it happens

Alert payloads are designed like error messages, not support tools.

Better design

Include machine context automatically.

text
machine ID
subsystem
recipe
lot/run
machine state
software/firmware version
recent alarm history
metric trend
recommended owner/action

Scenario 6 — Alert clears automatically but root cause remains

What it looks like

A timeout alert appears and clears.

Everyone assumes the machine recovered.

Two hours later, the machine stops.

Why it happens

The alert condition cleared, but the underlying pattern remained.

Better design

Separate:

text
current condition
recurring pattern
degradation trend

Example:

text
Current timeout cleared.
But timeout count exceeded baseline for this shift.
Keep degradation warning active.
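
Keeping the current condition and the recurring pattern as two separate signals could look like the following sketch. The 8-hour window stands in for a shift and the baseline of 3 timeouts is an arbitrary example value; the class and method names are hypothetical.

```python
from collections import deque


class TimeoutPattern:
    """Tracks the live timeout condition separately from its recurrence."""

    def __init__(self, window_s=8 * 3600, baseline_count=3):
        self.window_s = window_s
        self.baseline_count = baseline_count
        self.events = deque()       # timestamps of recent timeouts
        self.current_active = False

    def timeout_raised(self, now_s):
        self.events.append(now_s)
        self.current_active = True

    def timeout_cleared(self):
        # Clearing the live condition does not erase the history.
        self.current_active = False

    def status(self, now_s):
        while self.events and now_s - self.events[0] > self.window_s:
            self.events.popleft()
        return {
            "current_timeout": self.current_active,
            "degradation_warning": len(self.events) > self.baseline_count,
        }


pattern = TimeoutPattern()
for t in (100, 900, 1800, 2700, 3600):
    pattern.timeout_raised(t)
    pattern.timeout_cleared()
status = pattern.status(4000)
```

After five timeouts that each cleared on their own, the current condition is inactive but the degradation warning remains set until the events age out of the window.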

Scenario 7 — Monitoring says “healthy” because process is up

What it looks like

Dashboard shows green.

But the machine is not producing.

Why it happens

Health check only checks:

text
application process alive
database reachable
service endpoint responds

Better design

Machine health must include production readiness.

text
process alive
devices connected
controller healthy
machine initialized
recipe loaded
not blocked
can enter auto
is producing
cycle time normal
queues stable
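
The readiness list above can be evaluated as a set of named checks rather than a single liveness probe. In this sketch the check names mirror the list, and the `snapshot` dictionary of booleans is an assumed input that the machine runtime would have to provide.

```python
READINESS_CHECKS = [
    "process_alive",
    "devices_connected",
    "controller_healthy",
    "machine_initialized",
    "recipe_loaded",
    "not_blocked",
    "can_enter_auto",
    "is_producing",
    "cycle_time_normal",
    "queues_stable",
]


def machine_health(snapshot):
    """Healthy only if every readiness check passes; missing checks count as failed."""
    failed = [name for name in READINESS_CHECKS if not snapshot.get(name, False)]
    return {"healthy": not failed, "failed_checks": failed}


# Process and devices are fine, but the machine is blocked and not producing.
snapshot = {name: True for name in READINESS_CHECKS}
snapshot["not_blocked"] = False
snapshot["is_producing"] = False
result = machine_health(snapshot)
```

A process-only health check would report this machine as green. Returning the list of failed checks, instead of a bare boolean, also tells the dashboard why the machine is not healthy.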

Scenario 8 — False alerts reduce trust

What it looks like

Operators and engineers ignore alerts because many are false.

Why it happens

Thresholds are too sensitive, not mode-aware, or not recipe-aware.

Example bad rule:

text
Alert if cycle time > 50 seconds.

But some recipes normally take 65 seconds.

Better design

Use context-aware thresholds.

text
cycle time threshold by recipe
only alert in Auto mode
ignore during maintenance mode
require condition to persist
use baseline comparison

PART 8 — Software Design Implications

Production monitoring should be designed as an operational feedback system.

Not as an afterthought.

Component diagram

text
+----------------------+
|   Machine Runtime    |
| workflow / devices / |
| motion / inspection  |
+----------+-----------+
           |
           | metrics / events / health
           v
+----------------------+
| Monitoring Aggregator|
| normalize / enrich / |
| aggregate / correlate|
+----------+-----------+
           |
           | conditions / trends
           v
+----------------------+
|     Alert Engine     |
| thresholds / rules / |
| severity / routing   |
+----------+-----------+
           |
           v
+------------------------------------------------+
| Operator | Maintenance | Engineering | Support |
+------------------------------------------------+

What the Machine Runtime should expose

The runtime should expose meaningful signals.

Examples:

text
MachineStateChanged
CycleCompleted
AlarmRaised
AlarmCleared
DeviceReconnected
RetryOccurred
QueueDepthChanged
InspectionCompleted
StorageWriteFailed
RecipeActivated
RunStarted
RunCompleted
HealthStateChanged

Avoid exposing only raw logs.

Monitoring should consume structured operational signals.


Health state aggregation

Subsystem health should roll up into machine health.

text
+--------------------+
| Machine Health     |
+---------+----------+
          |
          +-- Motion Health
          +-- Camera Health
          +-- Vision Pipeline Health
          +-- Storage Health
          +-- PLC/Controller Health
          +-- Recipe/Config Health
          +-- Host/MES Connectivity Health

But be careful.

Do not reduce everything to one green/red light.

A useful health model shows:

text
overall state
affected subsystem
reason
impact
recommended action
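
A roll-up that keeps this detail might take the worst subsystem state as the overall state while still returning the affected subsystems with their reasons. This is a minimal sketch; the state names, their ordering, and the sample subsystem data are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed ordering from best to worst.
SEVERITY_ORDER = {"Healthy": 0, "Suspect": 1, "Degraded": 2, "Faulted": 3}


@dataclass
class SubsystemHealth:
    name: str
    state: str
    reason: str = ""
    impact: str = ""
    recommended_action: str = ""


def roll_up(subsystems):
    """Overall state is the worst subsystem state; details are preserved."""
    worst = max(subsystems, key=lambda s: SEVERITY_ORDER[s.state])
    affected = [s for s in subsystems if s.state != "Healthy"]
    affected.sort(key=lambda s: SEVERITY_ORDER[s.state], reverse=True)
    return {"overall_state": worst.state, "affected": affected}


machine = roll_up([
    SubsystemHealth("Motion", "Healthy"),
    SubsystemHealth("Storage", "Suspect", reason="free space falling",
                    recommended_action="check cleanup job"),
    SubsystemHealth("Camera", "Degraded", reason="14 reconnects in 1 hour",
                    impact="cycle time +12%",
                    recommended_action="check cable and frame grabber"),
])
```

The overall state here is Degraded, but the result still says which subsystems are involved, why, and what to do, which a single green/red light cannot convey.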

Local vs remote routing

Some conditions are local only.

Some are remote only.

Some are both.

text
+----------------------------+-------------+---------------+
| Condition                  | Local HMI   | Remote Alert  |
+----------------------------+-------------+---------------+
| Door open                  | Yes         | Usually no    |
| Emergency stop             | Yes         | Maybe yes     |
| Disk full soon             | Maybe       | Yes           |
| Camera reconnect trend     | Maybe       | Yes           |
| One transient timeout      | Maybe log   | No            |
| Repeated timeout pattern   | Yes/Maybe   | Yes           |
| Throughput down 20%        | Maybe       | Yes           |
+----------------------------+-------------+---------------+

Routing should consider:

text
machine mode
severity
duration
recurrence
production impact
owner
time of day / shift
customer support model

Correlation with machine state, recipe, and run

A metric without context can mislead.

Example:

text
Cycle time = 80 seconds

Is that bad?

It depends.

text
Recipe A expected: 45 seconds -> bad
Recipe B expected: 85 seconds -> normal
Maintenance mode -> maybe irrelevant
Auto production -> important
First wafer after recipe load -> maybe expected

So monitoring should include:

text
machine state
recipe
run/lot/job
product type
operator mode
software version
calibration version
subsystem state

Retention and export

Factories need historical analysis.

Keep enough data to answer:

text
When did degradation start?
Did it correlate with recipe change?
Did it start after maintenance?
Does it happen only on one machine?
Is this issue getting worse across the fleet?
Which alarms happened before downtime?

For support, export matters.

A good service package may include:

text
metric trends
alarm history
machine state timeline
recent configuration
recipe/version info
diagnostic snapshot
selected logs

Bad vs good approaches

Bad approach

text
- alert on every error log
- process uptime equals healthy
- no trend detection
- no machine-state context
- no recipe context
- no owner
- same alert sent to everyone
- thresholds copied from another machine
- no suppression or hysteresis
- no historical view

Good approach

text
- monitored health model
- subsystem-level metrics
- trend-aware alerts
- actionable alert payloads
- clear owner and expected action
- local vs remote routing
- mode-aware thresholds
- recipe-aware baselines
- suppression and hysteresis
- correlation with alarms, state, and production run

PART 9 — Interview / Real-World Talking Points

How to explain production monitoring clearly

You can say:

Production monitoring is the operational feedback loop of a machine. Logging helps us diagnose what happened, but monitoring tells us whether the machine is currently healthy, productive, degrading, or at risk of downtime. In industrial systems, the machine can be running but still unhealthy, so we monitor trends like cycle time, retry rate, reconnect count, queue depth, device health, storage capacity, and inspection quality signals.

Why monitoring is different from logging and alarms

text
Logging:
  detailed evidence for diagnosis

Local alarms:
  immediate operator-facing machine condition

Production monitoring:
  health, trend, performance, degradation, and escalation

A strong answer:

I would not alert on every log or every local alarm. I would design monitoring around operational conditions: production impact, degradation, recurrence, severity, owner, and expected action.

Common mistakes software engineers make

text
1. Treating process uptime as machine health.
2. Alerting on raw exceptions instead of operational conditions.
3. Ignoring trends and only detecting hard failures.
4. Creating alerts without owners or actions.
5. Sending every local alarm to remote support.
6. Missing recipe/machine-state context.
7. Ignoring retry counts because retries “succeeded.”
8. Using thresholds that are not mode-aware or recipe-aware.
9. Building dashboards that look nice but do not guide action.
10. Forgetting long-running degradation: memory, disk, queues, latency.

What strong engineers understand

Strong industrial software engineers understand that:

text
- production monitoring is not just dashboarding
- degradation matters as much as failure
- alerts must be actionable
- noisy alerts destroy trust
- context is essential
- machine health is not one boolean
- local HMI alarms and factory alerts serve different purposes
- monitoring should guide operators, maintenance, engineering, and support differently

The key mindset:

A machine monitoring system should help the organization act earlier, with better context, before production loss, quality loss, or downtime becomes expensive.
