Production Monitoring & Alerting in Industrial Machine Software
Production monitoring is the system that tells you:
“Is the machine still producing correctly, safely, and efficiently — and do the right people need to act?”
This topic belongs strongly to reliability, long-running behavior, and observability/serviceability. The roadmap explicitly treats industrial systems as long-running machines where health metrics, performance counters, diagnostics, serviceability, and health dashboards directly affect downtime and support cost.
PART 1 — Why Production Monitoring Is Different From Logging
Big picture
Logging answers:
“What happened?”
Production monitoring answers:
“What is happening now, is it getting worse, and should someone act?”
A machine can be technically “running” but operationally unhealthy.
Example:
Process: running
UI: responsive
Machine state: Auto
Current alarm: none
But:
- cycle time increased from 4.2s to 5.8s
- camera reconnects increased
- image queue depth is growing
- operator interventions increased
- disk free space is falling
From a business-software mindset, this may look fine.
From an industrial-machine mindset, this may mean:
“The machine is slowly moving toward downtime or quality loss.”
Logging vs monitoring
+-------------------+-------------------------------+
| Logging           | Monitoring                    |
+-------------------+-------------------------------+
| Event detail      | Operational signal            |
| After-the-fact    | Real-time / near-real-time    |
| Debugging         | Detection and decision        |
| Developer-focused | Operator / maintenance / ops  |
| "What happened?"  | "Is action needed?"           |
+-------------------+-------------------------------+
A log entry may say:
Camera reconnect succeeded after timeout.
A monitoring signal says:
Camera reconnect count increased from 1/hour to 18/hour.
This camera is becoming unstable.
That second one is much more useful for production.
PART 2 — What Should Be Monitored in Production
A production machine should not only expose whether the software process is alive. It should expose whether the machine is healthy, productive, stable, and supportable.
1. Availability / uptime
This tells you whether the machine is available for production.
But uptime alone is dangerous.
Bad metric:
MachineApp.exe is running = healthy
Better metrics:
Machine app running
Controller connected
Devices initialized
Machine ready for auto mode
Machine producing
Machine blocked
Machine faulted
Machine waiting for operator
A machine can have 99% process uptime but only 70% useful production time.
2. Machine state distribution
You want to know how much time the machine spends in each state.
Example:
Auto Running: 68%
Idle: 12%
Waiting Material: 8%
Alarmed: 5%
Manual Mode: 4%
Recovering: 3%
This tells supervisors and engineers where production time is going.
If “Recovering” grows from 1% to 8%, something is degrading even before hard failures appear.
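A state distribution like the one above can be derived from a timeline of state changes. The following is a minimal sketch, assuming the runtime records each transition as a (timestamp in seconds, state name) pair; the function name `state_distribution` and the sample timeline are hypothetical.

```python
from collections import defaultdict

def state_distribution(transitions, end_time):
    """Return the share of time spent in each state.

    transitions: list of (timestamp_seconds, state_name), sorted by time.
    end_time: timestamp that closes the last interval.
    """
    durations = defaultdict(float)
    # Pair each transition with the start of the next one to get interval lengths.
    closing = transitions[1:] + [(end_time, None)]
    for (start, state), (next_start, _) in zip(transitions, closing):
        durations[state] += next_start - start
    total = sum(durations.values())
    if total == 0:
        return {}
    return {state: seconds / total for state, seconds in durations.items()}

# Hypothetical one-hour timeline (seconds since shift start).
timeline = [
    (0, "Auto Running"),
    (2400, "Alarmed"),
    (2580, "Recovering"),
    (2700, "Auto Running"),
]
shares = state_distribution(timeline, end_time=3600)
for state, share in shares.items():
    print(f"{state}: {share:.1%}")
```

Comparing these shares shift over shift is what makes a growing "Recovering" share visible before hard failures appear.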
3. Alarms and fault frequency
You monitor alarm patterns, not just current alarms.
Important signals:
Alarm count per hour
Top recurring alarms
Mean time between faults
Faults by subsystem
Faults by recipe
Faults by shift
Faults after maintenance
Faults after software version change
A single alarm may be normal.
The same alarm 30 times per shift is a production problem.
4. Cycle time and throughput
This is one of the most important production signals.
Example:
Expected wafer inspection cycle: 45 seconds
Current average: 53 seconds
P95 cycle time: 71 seconds
The machine is still running, but output is dropping.
For inspection machines, throughput can degrade because of:
slower motion
camera exposure changes
image processing backlog
increased retries
operator pauses
storage write latency
network upload delay
recipe-specific workload
5. Device health and reconnect counts
Industrial devices often fail gradually.
Monitor:
camera reconnect count
PLC communication drops
motion controller errors
light controller timeout count
IO module reconnects
robot handshake failures
sensor invalid-read count
Example:
Camera reconnects:
08:00 - 09:00: 0
09:00 - 10:00: 1
10:00 - 11:00: 3
11:00 - 12:00: 9
That trend matters more than any single reconnect.
6. Retry and timeout rates
Retries are useful, but they can hide real problems.
If software silently retries everything, the machine may appear stable while performance and reliability degrade.
Monitor:
retry count per subsystem
timeout count per operation
retry success rate
retry latency impact
repeated transient faults
A retry is not always “problem solved.”
Sometimes it means:
“The system is compensating for a developing failure.”
7. Queue depth and backlog
Very important in vision, imaging, streaming, and inspection systems.
Example pipeline:
Camera Acquisition -> Image Queue -> Processing -> Result Queue -> Storage -> Upload
Monitor:
image queue depth
processing queue depth
result queue depth
storage backlog
upload backlog
dropped frame count
oldest item age
A queue depth of 5 may be fine.
A queue depth that grows continuously means the system cannot keep up.
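One way to tell "a depth of 5" apart from "a depth that keeps growing" is to look at the slope of recent samples rather than the latest value. The sketch below fits a least-squares line through (minute, depth) samples; the function names and the `min_slope` threshold of 0.5 items per minute are assumptions for illustration, not recommended values.

```python
def queue_growth_per_minute(samples):
    """Least-squares slope of queue depth over time.

    samples: list of (minutes, depth) pairs from a recent window.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    num = sum((t - mean_t) * (d - mean_d) for t, d in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0

def is_backlog_growing(samples, min_slope=0.5):
    """True when depth rises by at least min_slope items per minute (assumed threshold)."""
    return queue_growth_per_minute(samples) >= min_slope

stable = [(0, 5), (1, 4), (2, 6), (3, 5), (4, 5)]     # fluctuates around 5
growing = [(0, 5), (1, 7), (2, 9), (3, 12), (4, 14)]  # consumer cannot keep up

print(is_backlog_growing(stable))
print(is_backlog_growing(growing))
```

A slope-based check is usually paired with the oldest-item age, since a short queue of very old items is also a backlog.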
8. Resource usage
Monitor:
CPU
memory
GPU, if used
disk capacity
disk write latency
network latency
handle count
thread count
native memory
database size
log folder size
In industrial systems, this matters because the app may run for days or weeks.
A slow memory leak may not appear during a demo, but it can kill production after 36 hours.
9. Image / inspection quality metrics
For inspection machines, monitoring quality signals is critical.
Examples:
focus score
brightness mean / variance
exposure time drift
defect count distribution
false defect rate
classification confidence
alignment failure rate
image acquisition failure rate
If false defects rise suddenly, production may still continue, but quality trust is damaged.
10. Storage capacity and write failures
This is a classic production failure.
The machine runs fine until:
disk full
image save fails
result database cannot write
logs cannot rotate
export queue blocks
Monitor:
free disk percentage
days until disk full
write failure count
storage latency
old data cleanup status
archive job status
11. Recipe / config / version context
Metrics without context are weak.
Always attach:
machine ID
software version
firmware version
recipe name/version
lot/job/run ID
operator mode
shift
product type
camera profile
calibration version
Without this, support may know “throughput dropped” but not why.
Maybe the issue only happens on one recipe.
Maybe it started after a firmware update.
Maybe it only happens on night shift.
PART 3 — Local HMI Alarms vs Production Alerting
Local HMI alarms and production alerts are related, but they are not the same thing.
Local HMI alarm
Purpose:
“Tell the operator what needs immediate attention at this machine.”
Examples:
Door open
Vacuum not reached
Motion axis fault
Wafer not detected
Camera disconnected
Emergency stop active
Local alarms are machine-specific and immediate.
They usually answer:
What happened?
Can the operator continue?
What recovery action is needed?
Production / factory alert
Purpose:
“Tell the right production, maintenance, engineering, or support role that a broader operational issue needs attention.”
Examples:
Machine throughput dropped 15% for 2 hours
Camera reconnect count exceeds normal baseline
Disk will be full in 12 hours
Same fault occurred 20 times this shift
Three machines running same recipe show increased false defects
Processing backlog growing continuously
Production alerts are often trend-based, aggregated, and role-specific.
Layer diagram
          +----------------------+
          |   Physical Machine   |
          | devices / motion / IO|
          +----------+-----------+
                     |
           immediate fault/alarm
                     |
                     v
          +----------------------+
          |      Local HMI       |
          | operator alarm/action|
          +----------+-----------+
                     |
                     | metrics / events / health
                     v
          +----------------------+
          |  Monitoring System   |
          | trends / aggregation |
          +----------+-----------+
                     |
    +----------------+----------------+
    |                |                |
    v                v                v
+-------------+ +---------------+ +-------------+
| Alerting    | | Reports       | | Dashboards  |
| maintenance | | shift/OEE     | | engineering |
+-------------+ +---------------+ +-------------+
Important rule:
Not every local alarm should become a remote alert.
If every operator alarm becomes a remote alert, engineers and supervisors will ignore them.
Example:
Operator opened door during manual maintenance.
This may be a local alarm, but not a remote production alert.
However:
Door-open interruption happened 40 times this shift and caused 90 minutes lost time.
That may become a production monitoring issue.
PART 4 — Alert Conditions and Thresholds
A good alert should be based on a meaningful condition, not raw noise.
Bad alert:
Send alert every time a timeout occurs.
Better alert:
Send warning if camera timeout rate exceeds 5 per hour
while machine is in Auto mode
for more than 10 minutes.
Common alert conditions
error rate exceeds threshold
retry count increasing
queue depth above safe range
oldest queued item too old
disk below capacity threshold
cycle time drift exceeds baseline
device reconnect count exceeds normal range
same transient fault repeats
machine stuck in recovering state
inspection false defect rate rises
Static threshold vs trend-based alert
Static threshold:
Disk free < 10%
Trend-based alert:
Disk will be full within 16 hours based on current growth rate.
Static threshold:
Cycle time > 60 seconds
Trend-based alert:
Cycle time increased 20% compared with the same recipe baseline.
For machines, trend-based alerts are often more valuable because different recipes, products, or modes may have different normal behavior.
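The "disk will be full within N hours" style of alert can be approximated with a simple linear projection. This is a sketch under the assumption that free space is sampled periodically as (hours, free GB) pairs and that consumption over the window is roughly constant; `hours_until_full` is a hypothetical helper, and real storage usage is often burstier than this.

```python
def hours_until_full(samples):
    """Project when free space reaches zero.

    samples: list of (hours, free_gb) pairs, oldest first.
    Returns hours remaining after the last sample, or None if the
    disk is not shrinking over the window.
    """
    (t0, free0), (t1, free1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    consumed_per_hour = (free0 - free1) / (t1 - t0)
    if consumed_per_hour <= 0:
        return None
    return free1 / consumed_per_hour

# Free space fell from 200 GB to 160 GB over 4 hours: 10 GB per hour.
remaining = hours_until_full([(0, 200.0), (4, 160.0)])
print(f"Disk will be full in about {remaining:.0f} hours")
```

The same free-space value (160 GB) would be harmless on a machine that consumes 1 GB per day, which is why the projection carries more information than the static threshold.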
Alert lifecycle diagram
+-------------+
| Raw Signal |
| metric/event|
+------+------+
|
v
+-------------+
| Condition |
| threshold / |
| trend rule |
+------+------+
|
v
+-------------+
| Alert Raised|
| severity + |
| owner |
+------+------+
|
v
+-------------+
| Acknowledged|
| by operator |
| or support |
+------+------+
|
v
+-------------+
| Action Taken|
| inspect / |
| repair / |
| tune / fix |
+------+------+
|
v
+-------------+
| Resolved |
| condition |
| cleared |
+-------------+
A professional alert system tracks the lifecycle.
It does not just “send messages.”
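Tracking the lifecycle can be as simple as a small state machine that records who moved the alert and when. The sketch below follows the stages in the diagram (raised, acknowledged, action taken, resolved); the class and field names are hypothetical, and a real system would likely also allow a condition to clear on its own and persist the history.

```python
from dataclasses import dataclass, field
from enum import Enum

class AlertState(Enum):
    RAISED = "raised"
    ACKNOWLEDGED = "acknowledged"
    ACTION_TAKEN = "action_taken"
    RESOLVED = "resolved"

# Linear lifecycle, matching the diagram above.
_NEXT = {
    AlertState.RAISED: AlertState.ACKNOWLEDGED,
    AlertState.ACKNOWLEDGED: AlertState.ACTION_TAKEN,
    AlertState.ACTION_TAKEN: AlertState.RESOLVED,
}

@dataclass
class Alert:
    name: str
    severity: str
    owner: str
    state: AlertState = AlertState.RAISED
    history: list = field(default_factory=list)

    def advance(self, actor, note=""):
        """Move to the next lifecycle stage and record who did it."""
        next_state = _NEXT.get(self.state)
        if next_state is None:
            raise ValueError("alert is already resolved")
        self.history.append((self.state, next_state, actor, note))
        self.state = next_state
        return self.state

alert = Alert("Camera reconnect trend", severity="Warning", owner="Maintenance")
alert.advance("operator", "seen on HMI")
alert.advance("maintenance", "reseated camera cable")
alert.advance("monitoring", "reconnect rate back to baseline")
print(alert.state, len(alert.history))
```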
Severity levels
A simple model:
Info:
Interesting, no action required now.
Warning:
Degradation or risk. Action soon.
Alarm / Critical:
Production is affected or machine may stop soon.
Emergency / Safety:
Safety-related condition. Usually handled locally and by safety systems.
Do not overload severity.
If everything is critical, nothing is critical.
Hysteresis and suppression
Without hysteresis, alerts flap.
Bad:
Temperature > 70°C => alert
Temperature < 70°C => clear
If the value moves between 69.8 and 70.2, the alert keeps opening and closing.
Better:
Raise alert when temperature > 70°C for 5 minutes.
Clear alert when temperature < 65°C for 10 minutes.
This reduces noise and improves trust.
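The raise/clear rule above combines two ideas: separate thresholds for raising and clearing, and a hold time before either transition. A minimal sketch, assuming the caller supplies a timestamp in seconds with each reading; the class name `HysteresisAlert` is hypothetical and the defaults mirror the 70°C / 65°C example.

```python
class HysteresisAlert:
    """Raise above one threshold, clear below a lower one, both with hold times."""

    def __init__(self, raise_above=70.0, clear_below=65.0,
                 raise_after_s=300, clear_after_s=600):
        self.raise_above = raise_above
        self.clear_below = clear_below
        self.raise_after_s = raise_after_s
        self.clear_after_s = clear_after_s
        self.active = False
        self._since = None  # time at which the pending condition first held

    def update(self, now_s, value):
        """Feed one reading; return True while the alert is active."""
        if self.active:
            pending, hold = value < self.clear_below, self.clear_after_s
        else:
            pending, hold = value > self.raise_above, self.raise_after_s
        if not pending:
            self._since = None  # condition interrupted, restart the hold timer
        else:
            if self._since is None:
                self._since = now_s
            if now_s - self._since >= hold:
                self.active = not self.active
                self._since = None
        return self.active

alert = HysteresisAlert()
# A value hovering around 70 never raises the alert.
for t, temp in [(0, 70.2), (60, 69.8), (120, 70.2), (180, 69.8)]:
    print(t, temp, alert.update(t, temp))
```

With a single threshold and no hold time, the same four readings would open and close the alert twice.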
PART 5 — Degradation Detection
Many serious failures are preceded by weak signals.
A machine rarely goes from perfect to dead instantly.
More often:
Healthy -> slightly unstable -> degraded -> faulted
Trend diagram
Health
  ^
  |
  |  Healthy
  |  ************
  |              *********
  |                       *******
  |                              *****
  |                                   ***
  |                                      **
  +------------------------------------------------> Time
     Healthy        Suspect       Degraded    Faulted
Another way to model it:
+---------+      +---------+      +----------+      +---------+
| Healthy | ---> | Suspect | ---> | Degraded | ---> | Faulted |
+---------+      +---------+      +----------+      +---------+
     |                |                 |                |
     |                |                 |                |
   normal        weak signal       production       hard stop /
  behavior         appears       impact visible        alarm
Examples of degradation
slower device response
more retries
more reconnects
more operator interventions
longer vacuum recovery time
rising temperature
higher CPU or memory
increased false defect rate
lower throughput
larger queue depth
more alignment failures
Why degradation detection is valuable
Detecting total failure is late.
Detecting degradation gives the factory time to act.
Example:
Vacuum recovery time:
Normal: 1.2 seconds
After 2 hours: 1.8 seconds
After 4 hours: 2.6 seconds
After 6 hours: 4.5 seconds
Eventually: vacuum timeout alarm
If you only alert on the timeout, production has already stopped by the time anyone is notified.
If you monitor the trend, maintenance can inspect the vacuum line, valve, filter, seal, or pump before downtime.
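A trend like this can be mapped onto the Healthy / Suspect / Degraded model by comparing recent readings with a known-good baseline. The sketch below uses the median of a recent window so that one slow cycle does not change the state; the ratio thresholds (1.25 and 2.0) are illustrative assumptions and would need tuning per device.

```python
from statistics import median

def classify_drift(baseline_s, recent_samples_s,
                   suspect_ratio=1.25, degraded_ratio=2.0):
    """Classify health from how far recent readings drift above a baseline.

    baseline_s: normal value in seconds (for example 1.2 s vacuum recovery).
    recent_samples_s: recent readings; the median damps single outliers.
    """
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    ratio = median(recent_samples_s) / baseline_s
    if ratio >= degraded_ratio:
        return "Degraded"
    if ratio >= suspect_ratio:
        return "Suspect"
    return "Healthy"

# Vacuum recovery times in seconds, using the figures from the example above.
print(classify_drift(1.2, [1.2, 1.3, 1.1]))  # start of shift
print(classify_drift(1.2, [1.7, 1.8, 1.9]))  # after about 2 hours
print(classify_drift(1.2, [2.5, 2.6, 4.5]))  # after 4 to 6 hours
```

Raising a warning at "Suspect" is what buys maintenance the time to inspect the vacuum line before the timeout alarm ever fires.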
PART 6 — Actionable Alerting
A good alert should answer five questions.
1. What is wrong?
2. How serious is it?
3. Who should act?
4. What action is expected?
5. What context is needed?
Bad alert
Error rate high.
This is almost useless.
Better alert
Machine WFI-03
Subsystem: Camera Acquisition
Severity: Warning
Condition: Camera reconnect count = 14 in 1 hour
Normal baseline: 0-2/hour
Current mode: Auto production
Recipe: WAFER_TOP_INSPECTION_V7
Impact: Acquisition delay increased average cycle time by 12%
Recommended action: Maintenance checks camera cable, power, and frame grabber.
Support context: Started after 10:42, no software restart since shift start.
That is actionable.
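One way to keep alerts from degrading into "Error rate high." is to make the payload a typed structure whose fields cannot be silently left blank. This is a sketch only: the `AlertPayload` class and its field names are hypothetical, and the values are copied from the example above.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AlertPayload:
    machine_id: str
    subsystem: str
    severity: str
    condition: str
    baseline: str
    mode: str
    recipe: str
    impact: str
    recommended_action: str
    support_context: str

    def missing_fields(self):
        """Names of fields left blank; a non-empty result means 'not actionable yet'."""
        return [name for name, value in asdict(self).items() if not value.strip()]

payload = AlertPayload(
    machine_id="WFI-03",
    subsystem="Camera Acquisition",
    severity="Warning",
    condition="Camera reconnect count = 14 in 1 hour",
    baseline="0-2/hour",
    mode="Auto production",
    recipe="WAFER_TOP_INSPECTION_V7",
    impact="Acquisition delay increased average cycle time by 12%",
    recommended_action="Maintenance checks camera cable, power, and frame grabber.",
    support_context="Started after 10:42, no software restart since shift start.",
)
print(payload.missing_fields())
print(json.dumps(asdict(payload), indent=2))
```

An alert engine could refuse to route any payload for which `missing_fields()` is non-empty, which forces rule authors to answer the five questions up front.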
Alert owner matters
Different alerts belong to different people.
+----------------------+------------------------------+
| Alert Type           | Likely Owner                 |
+----------------------+------------------------------+
| Door open            | Operator                     |
| Low air pressure     | Maintenance                  |
| Disk almost full     | IT / service / support       |
| Camera reconnects    | Maintenance + engineering    |
| False defect spike   | Process / vision engineer    |
| Software memory leak | Software engineering/support |
| Throughput drop      | Supervisor + engineering     |
+----------------------+------------------------------+
Alerts without ownership become background noise.
PART 7 — Real-World Failure Scenarios
Scenario 1 — Alert flood
What it looks like
During one device failure, the system sends:
Camera timeout
Camera reconnect failed
Image acquisition failed
Inspection step failed
Workflow failed
Result missing
Machine stopped
Upload failed
Everyone receives dozens of alerts.
Why it happens
The system alerts on every symptom instead of grouping around the root condition.
Better design
Use correlation and suppression.
Root alert:
Camera acquisition subsystem unavailable.
Suppressed child symptoms:
- image acquisition failed
- inspection step failed
- result missing
Alert on the root operational problem, not every downstream consequence.
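Grouping symptoms under a root condition can start from a plain lookup table. The sketch below assumes a hand-maintained mapping from symptom to root (`ROOT_OF`, hypothetical) and suppresses a symptom only while its root is actually active; real correlation engines typically also use time windows and subsystem topology.

```python
# Hypothetical mapping from a downstream symptom to the root condition that explains it.
ROOT_OF = {
    "image acquisition failed": "camera acquisition subsystem unavailable",
    "inspection step failed": "camera acquisition subsystem unavailable",
    "result missing": "camera acquisition subsystem unavailable",
}

def group_alerts(events):
    """Return {alert to send: [suppressed child symptoms]}.

    A symptom is suppressed only if its root condition is among the events;
    otherwise it is reported as its own alert.
    """
    known_roots = set(ROOT_OF.values())
    active_roots = {event for event in events if event in known_roots}
    grouped = {}
    for event in events:
        root = ROOT_OF.get(event)
        if root in active_roots:
            grouped.setdefault(root, []).append(event)
        else:
            grouped.setdefault(event, [])
    return grouped

events = [
    "camera acquisition subsystem unavailable",
    "image acquisition failed",
    "inspection step failed",
    "result missing",
    "upload failed",
]
for alert, children in group_alerts(events).items():
    print(alert, "| suppressed:", children)
```

Five raw events become two alerts, and the suppressed symptoms stay attached to the root alert as supporting context.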
Scenario 2 — Machine gradually slows down
What it looks like
The machine still runs, but daily output drops.
Operators feel it is “slower than usual,” but no alarm appears.
Why it happens
Only hard faults are monitored.
Cycle time, queue depth, and retry rate are not monitored.
Better design
Track:
average cycle time
P95 cycle time
throughput per hour
time in waiting/recovering states
recipe-specific baseline
Alert when performance drifts from expected behavior.
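A drift check against a recipe-specific baseline might look like the sketch below. The baseline table, recipe names, and the 15% drift limit are assumptions for illustration; `statistics.quantiles` with `n=20` returns 19 cut points, the last of which is the 95th percentile.

```python
from statistics import mean, quantiles

# Hypothetical expected mean cycle time per recipe, in seconds.
BASELINE_S = {"RECIPE_A": 45.0, "RECIPE_B": 85.0}

def cycle_time_report(recipe, cycle_times_s, drift_limit=0.15):
    """Summarise recent cycle times and flag drift from the recipe baseline."""
    baseline = BASELINE_S[recipe]
    avg = mean(cycle_times_s)
    # Last of the 19 cut points is the P95; "inclusive" stays within the observed range.
    p95 = quantiles(cycle_times_s, n=20, method="inclusive")[-1]
    drift = (avg - baseline) / baseline
    return {"mean_s": avg, "p95_s": p95, "drift": drift, "alert": drift > drift_limit}

recent = [50, 52, 53, 54, 56]
report = cycle_time_report("RECIPE_A", recent)
print(report)

# The same cycle times on a slower recipe are not a problem.
print(cycle_time_report("RECIPE_B", recent)["alert"])
```

Keying the baseline by recipe is what keeps a 53-second average from being alarming on one product and unremarkable on another.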
Scenario 3 — Disk fills up
What it looks like
Image saving suddenly fails.
Inspection results cannot be stored.
The machine may need to stop production.
Why it happens
The software monitored process uptime but not storage capacity, retention, or cleanup jobs.
Better design
Monitor:
free disk
growth rate
estimated time to full
cleanup job success
write latency
write failure count
Good alert:
Disk will be full in approximately 14 hours at current image storage rate.
Scenario 4 — Retry spike hides hardware issue
What it looks like
The machine keeps running because retry logic succeeds.
But cycle time increases and failures become more frequent.
Why it happens
Retries are treated as invisible implementation details.
Better design
Retries should be metrics.
retry_count
retry_success_rate
retry_added_latency
retry_by_device
retry_by_operation
A successful retry is still a signal.
Scenario 5 — Remote alert lacks machine context
What it looks like
Support receives:
Inspection failed.
They do not know:
which machine
which recipe
which camera
which lot
which software version
which alarm came first
whether this is repeated
Why it happens
Alert payloads are designed like error messages, not support tools.
Better design
Include machine context automatically.
machine ID
subsystem
recipe
lot/run
machine state
software/firmware version
recent alarm history
metric trend
recommended owner/action
Scenario 6 — Alert clears automatically but root cause remains
What it looks like
A timeout alert appears and clears.
Everyone assumes the machine recovered.
Two hours later, the machine stops.
Why it happens
The alert condition cleared, but the underlying pattern remained.
Better design
Separate:
current condition
recurring pattern
degradation trend
Example:
Current timeout cleared.
But timeout count exceeded baseline for this shift.
Keep degradation warning active.
Scenario 7 — Monitoring says “healthy” because process is up
What it looks like
Dashboard shows green.
But the machine is not producing.
Why it happens
Health check only checks:
application process alive
database reachable
service endpoint responds
Better design
Machine health must include production readiness.
process alive
devices connected
controller healthy
machine initialized
recipe loaded
not blocked
can enter auto
is producing
cycle time normal
queues stable
Scenario 8 — False alerts reduce trust
What it looks like
Operators and engineers ignore alerts because many are false.
Why it happens
Thresholds are too sensitive, not mode-aware, or not recipe-aware.
Example bad rule:
Alert if cycle time > 50 seconds.
But some recipes normally take 65 seconds.
Better design
Use context-aware thresholds.
cycle time threshold by recipe
only alert in Auto mode
ignore during maintenance mode
require condition to persist
use baseline comparison
PART 8 — Software Design Implications
Production monitoring should be designed as an operational feedback system, not as an afterthought.
Component diagram
+----------------------+
| Machine Runtime |
| workflow / devices / |
| motion / inspection |
+----------+-----------+
|
| metrics / events / health
v
+----------------------+
| Monitoring Aggregator|
| normalize / enrich / |
| aggregate / correlate|
+----------+-----------+
|
| conditions / trends
v
+----------------------+
| Alert Engine |
| thresholds / rules / |
| severity / routing |
+----------+-----------+
|
v
+------------------------------------------------+
| Operator | Maintenance | Engineering | Support |
+------------------------------------------------+
What the Machine Runtime should expose
The runtime should expose meaningful signals.
Examples:
MachineStateChanged
CycleCompleted
AlarmRaised
AlarmCleared
DeviceReconnected
RetryOccurred
QueueDepthChanged
InspectionCompleted
StorageWriteFailed
RecipeActivated
RunStarted
RunCompleted
HealthStateChanged
Avoid exposing only raw logs.
Monitoring should consume structured operational signals.
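A structured signal differs from a log line in that its fields are named and typed, so the aggregator does not have to parse text. The sketch below shows one possible shape; `MachineEvent`, its fields, and the sample values are hypothetical, and the context fields mirror the ones this document recommends attaching to every metric.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class MachineEvent:
    name: str               # for example "RetryOccurred" or "CycleCompleted"
    machine_id: str
    machine_state: str
    recipe: str
    run_id: str
    software_version: str
    data: dict = field(default_factory=dict)        # event-specific values
    timestamp: float = field(default_factory=time.time)

def to_record(event):
    """Serialise an event to one JSON line for the monitoring aggregator."""
    return json.dumps(asdict(event), sort_keys=True)

event = MachineEvent(
    name="RetryOccurred",
    machine_id="WFI-03",
    machine_state="Auto",
    recipe="WAFER_TOP_INSPECTION_V7",
    run_id="RUN-0001",          # placeholder identifier
    software_version="1.0.0",   # placeholder version
    data={"device": "camera", "operation": "acquire", "attempt": 2, "added_latency_ms": 180},
)
print(to_record(event))
```

Because every event carries machine state, recipe, and run, downstream rules can be mode-aware and recipe-aware without joining against logs.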
Health state aggregation
Subsystem health should roll up into machine health.
+--------------------+
| Machine Health |
+---------+----------+
|
+-- Motion Health
+-- Camera Health
+-- Vision Pipeline Health
+-- Storage Health
+-- PLC/Controller Health
+-- Recipe/Config Health
+-- Host/MES Connectivity Health
But be careful.
Do not reduce everything to one green/red light.
A useful health model shows:
overall state
affected subsystem
reason
impact
recommended action
Local vs remote routing
Some conditions are local only.
Some are remote only.
Some are both.
+----------------------------+-------------+---------------+
| Condition                  | Local HMI   | Remote Alert  |
+----------------------------+-------------+---------------+
| Door open                  | Yes         | Usually no    |
| Emergency stop             | Yes         | Maybe yes     |
| Disk full soon             | Maybe       | Yes           |
| Camera reconnect trend     | Maybe       | Yes           |
| One transient timeout      | Maybe log   | No            |
| Repeated timeout pattern   | Yes/Maybe   | Yes           |
| Throughput down 20%        | Maybe       | Yes           |
+----------------------------+-------------+---------------+
Routing should consider:
machine mode
severity
duration
recurrence
production impact
owner
time of day / shift
customer support model
Correlation with machine state, recipe, and run
A metric without context can mislead.
Example:
Cycle time = 80 seconds
Is that bad?
It depends.
Recipe A expected: 45 seconds -> bad
Recipe B expected: 85 seconds -> normal
Maintenance mode -> maybe irrelevant
Auto production -> important
First wafer after recipe load -> maybe expected
So monitoring should include:
machine state
recipe
run/lot/job
product type
operator mode
software version
calibration version
subsystem state
Retention and export
Factories need historical analysis.
Keep enough data to answer:
When did degradation start?
Did it correlate with recipe change?
Did it start after maintenance?
Does it happen only on one machine?
Is this issue getting worse across the fleet?
Which alarms happened before downtime?
For support, export matters.
A good service package may include:
metric trends
alarm history
machine state timeline
recent configuration
recipe/version info
diagnostic snapshot
selected logs
Bad vs good approaches
Bad approach
- alert on every error log
- process uptime equals healthy
- no trend detection
- no machine-state context
- no recipe context
- no owner
- same alert sent to everyone
- thresholds copied from another machine
- no suppression or hysteresis
- no historical view
Good approach
- monitored health model
- subsystem-level metrics
- trend-aware alerts
- actionable alert payloads
- clear owner and expected action
- local vs remote routing
- mode-aware thresholds
- recipe-aware baselines
- suppression and hysteresis
- correlation with alarms, state, and production run
PART 9 — Interview / Real-World Talking Points
How to explain production monitoring clearly
You can say:
Production monitoring is the operational feedback loop of a machine. Logging helps us diagnose what happened, but monitoring tells us whether the machine is currently healthy, productive, degrading, or at risk of downtime. In industrial systems, the machine can be running but still unhealthy, so we monitor trends like cycle time, retry rate, reconnect count, queue depth, device health, storage capacity, and inspection quality signals.
Why monitoring is different from logging and alarms
Logging:
detailed evidence for diagnosis
Local alarms:
immediate operator-facing machine condition
Production monitoring:
health, trend, performance, degradation, and escalation
A strong answer:
I would not alert on every log or every local alarm. I would design monitoring around operational conditions: production impact, degradation, recurrence, severity, owner, and expected action.
Common mistakes software engineers make
1. Treating process uptime as machine health.
2. Alerting on raw exceptions instead of operational conditions.
3. Ignoring trends and only detecting hard failures.
4. Creating alerts without owners or actions.
5. Sending every local alarm to remote support.
6. Missing recipe/machine-state context.
7. Ignoring retry counts because retries “succeeded.”
8. Using thresholds that are not mode-aware or recipe-aware.
9. Building dashboards that look nice but do not guide action.
10. Forgetting long-running degradation: memory, disk, queues, latency.
What strong engineers understand
Strong industrial software engineers understand that:
- production monitoring is not just dashboarding
- degradation matters as much as failure
- alerts must be actionable
- noisy alerts destroy trust
- context is essential
- machine health is not one boolean
- local HMI alarms and factory alerts serve different purposes
- monitoring should guide operators, maintenance, engineering, and support differently
The key mindset:
A machine monitoring system should help the organization act earlier, with better context, before production loss, quality loss, or downtime becomes expensive.