Crash Handling & Graceful Shutdown in Industrial Machine Software
This topic fits the roadmap’s reliability/fault-handling area, especially “startup and shutdown robustness,” “safe stop,” “crash dump collection,” and “hardware resource ownership.”
PART 1 — Why shutdown is safety-critical in machine software
In normal desktop software, shutdown often means:
save settings, close windows, release memory, exit process.
In industrial machine software, shutdown means:
stop controlling physical reality in a safe, explainable, recoverable way.
A machine application may be controlling:
- motion axes
- cameras
- frame grabbers
- IO outputs
- vacuum
- clamps
- lights
- lasers
- robots
- conveyors
- storage pipelines
- active inspection workflows
So shutdown is not only a software lifecycle event. It is a machine-control event.
If shutdown is poorly handled, the process may disappear, but the machine may still be left in a dangerous or inconsistent condition.
For example:
A camera SDK handle is not released correctly. The next startup fails because the camera is still locked by the previous process or driver session.
A motion command is active when the application exits. The software UI is gone, but the controller may still be executing the last command.
Vacuum or clamp output remains active. Material may stay held inside the machine, but the next startup may not know that.
An inspection result image is written to disk, but the database record is not committed. Later, production traceability says the result does not exist, but the image file exists.
This is the core mindset:
In industrial software, shutdown must leave both software and the physical machine in a known, safe, diagnosable state.
PART 2 — Normal shutdown vs abnormal termination
There are two very different worlds.
Normal shutdown
Normal shutdown happens when the operator, system, or maintenance procedure requests a controlled stop.
Typical examples:
- operator clicks Exit
- operator stops production
- service engineer shuts down the tool
- system update requires shutdown
- machine transitions to offline mode
In normal shutdown, the software still has control. It can coordinate subsystems.
A good normal shutdown should:
- reject new commands
- stop workflows safely
- stop or park motion if appropriate
- disarm cameras and acquisition
- turn off outputs safely
- flush logs and diagnostics
- release device handles
- write a clean shutdown marker
Abnormal termination
Abnormal termination happens when the system loses normal control.
Examples:
- unhandled exception
- process crash
- OS kill
- power loss
- watchdog terminates the process
- native SDK crashes the process
- machine PC freezes
- someone kills the app from Task Manager
In abnormal termination, you cannot assume cleanup code will run.
That is why industrial systems must design for both:
Normal path:
+---------+ +----------+ +---------+
| Running | ----> | Stopping | ----> | Stopped |
+---------+ +----------+ +---------+
Abnormal path:
+---------+ +---------+ +-------------------+
| Running | ----> | Crashed | ----> | Recovery Required |
+---------+ +---------+ +-------------------+The most important difference:
| Situation | What you can guarantee |
|---|---|
| Normal shutdown | Ordered shutdown may complete |
| Crash | Only pre-existing safety design can protect you |
| Power loss | Software cleanup may not happen at all |
| OS kill | finally blocks, Dispose calls, and async cleanup may not run |
| Native SDK crash | The process may die before managed code reacts |
So experienced engineers do not build safety around “my shutdown handler will always run.”
They build safety around:
What happens if it does not run?
PART 3 — Graceful shutdown sequence
A realistic graceful shutdown should be coordinated, not random.
A common sequence looks like this:
Operator/UI
|
| Request Shutdown
v
+----------------------+
| Shutdown Coordinator |
+----------------------+
|
| 1. Reject new commands
v
+----------------------+
| Command Gateway |
+----------------------+
|
| 2. Request workflow stop
v
+----------------------+
| Workflow Engine |
+----------------------+
|
| 3. Stop/park motion
v
+----------------------+
| Motion Subsystem |
+----------------------+
|
| 4. Stop acquisition
v
+----------------------+
| Camera / Acquisition |
+----------------------+
|
| 5. Deactivate outputs
v
+----------------------+
| IO / Vacuum / Clamp |
+----------------------+
|
| 6. Flush data
v
+----------------------+
| Storage / Logs |
+----------------------+
|
| 7. Release resources
v
+----------------------+
| Device Managers |
+----------------------+
|
| 8. Mark clean shutdown
v
+----------------------+
| Shutdown Marker |
+----------------------+
The order matters.
You usually do not release devices before stopping workflows. You usually do not turn off vacuum blindly before understanding whether material is held. You usually do not stop logging before capturing the shutdown reason. You usually do not dispose the camera while acquisition callbacks are still running.
A better mental model is:
Shutdown is a controlled workflow with dependencies, timeouts, and fallback behavior.
Not:
The user closed the window, so call Dispose everywhere.
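The idea of an ordered sequence with timeouts and fallback can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: ShutdownStep, StepOutcome, and OrderedShutdown are hypothetical names, and the choice to continue with later steps after a timeout or failure is an assumption that a real machine may need to revisit per step. It relies on Task.WaitAsync, which requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical types for illustration; not part of any library.
public sealed record ShutdownStep(
    string Name,
    TimeSpan Timeout,
    Func<CancellationToken, Task> Action);

public enum StepOutcome { Completed, TimedOut, Failed }

public static class OrderedShutdown
{
    // Runs the steps strictly in the given order. A step that times out or
    // throws is recorded and the sequence continues, so later steps (for
    // example flushing logs) still get a chance to run.
    public static async Task<IReadOnlyList<(string Step, StepOutcome Outcome)>> RunAsync(
        IEnumerable<ShutdownStep> steps)
    {
        var results = new List<(string Step, StepOutcome Outcome)>();
        foreach (var step in steps)
        {
            // The token asks a cooperative step to give up after the timeout.
            using var cts = new CancellationTokenSource(step.Timeout);
            try
            {
                // WaitAsync bounds the wait even if the step ignores the token.
                await step.Action(cts.Token).WaitAsync(step.Timeout);
                results.Add((step.Name, StepOutcome.Completed));
            }
            catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
            {
                results.Add((step.Name, StepOutcome.TimedOut));
            }
            catch (Exception)
            {
                results.Add((step.Name, StepOutcome.Failed));
            }
        }
        return results;
    }
}
```

A caller would pass the steps in dependency order (reject commands, stop workflow, stop motion, and so on) and treat any outcome other than Completed as a reason to write a recovery-required marker instead of a clean shutdown marker. The returned list also tells the log which subsystem blocked shutdown.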
PART 4 — Safe stopping of active operations
Shutdown often happens while the machine is doing something.
Active operations may include:
- motion in progress
- camera acquisition
- image processing
- device command waiting for response
- storage write
- operator command executing
- robot transfer
- vacuum pickup
- alignment flow
- inspection cycle
A strong design distinguishes several stop types.
Cancel
Cancel means:
finish safely as soon as possible, cooperatively.
Example:
- stop queueing more image processing
- stop the recipe after the current safe step
- cancel pending non-critical tasks
Cancel is usually software-level and cooperative.
Stop at safe boundary
This means:
do not interrupt the current physical action halfway; stop after a known safe point.
Example:
- finish current wafer scan line
- wait until axis reaches a stable position
- finish current camera frame acquisition
- complete current database transaction
- stop before loading the next part
This is common in production workflows.
Abort immediately
Abort means:
stop the current operation now, even if production context becomes incomplete.
Example:
- abort inspection
- stop motion command
- terminate acquisition
- discard current pipeline batch
Abort may require recovery afterward.
Emergency stop
Emergency stop is different.
It should be handled by the safety system, safety PLC, drives, relays, or hardware circuit — not by normal application logic.
Software may observe and react to E-stop, but it should not be the only thing responsible for achieving a safe emergency stop.
Important distinction:
Graceful Stop:
Software-controlled, orderly, diagnostic-friendly.
Abort:
Software-controlled, urgent, may require recovery.
Emergency Stop:
Safety-system-controlled, hardware/safety priority.
A common mistake is treating all stop requests the same.
In machine software, “stop” is not one thing.
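One way to keep the software-controlled stop types apart in code is sketched below. It is a simplified illustration under stated assumptions: StepWorkflow, StopKind, and WorkflowResult are hypothetical names, the gap between two steps stands in for a safe boundary, and Cancel and Stop at safe boundary are folded into a single flag. Emergency stop is deliberately absent because it belongs to the safety system, not to application code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical types for illustration.
public enum StopKind { StopAtSafeBoundary, Abort }
public enum WorkflowResult { Completed, StoppedAtBoundary, Aborted }

public sealed class StepWorkflow
{
    private readonly CancellationTokenSource _abort = new();
    private volatile bool _stopAtBoundary;

    public void RequestStop(StopKind kind)
    {
        if (kind == StopKind.Abort)
            _abort.Cancel();          // interrupts the step that is running now
        else
            _stopAtBoundary = true;   // honored only between steps
    }

    public async Task<WorkflowResult> RunAsync(
        IEnumerable<Func<CancellationToken, Task>> steps)
    {
        foreach (var step in steps)
        {
            // Safe boundary: no physical action is in flight between steps.
            if (_abort.IsCancellationRequested) return WorkflowResult.Aborted;
            if (_stopAtBoundary) return WorkflowResult.StoppedAtBoundary;

            try
            {
                // Only an abort request cancels this token mid-step.
                await step(_abort.Token);
            }
            catch (OperationCanceledException)
            {
                // Production context may now be incomplete; recovery may be needed.
                return WorkflowResult.Aborted;
            }
        }
        return WorkflowResult.Completed;
    }
}
```

The design point is that a graceful stop never touches the token handed to the running step, so the current scan line or transaction finishes, while an abort cancels it immediately and reports a different result so the caller can trigger recovery.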
PART 5 — Resource cleanup and release
Industrial software often uses resources that outlive normal C# objects.
Examples:
- native SDK handles
- camera handles
- frame grabber buffers
- unmanaged image buffers
- serial ports
- TCP sockets
- PLC connections
- device ownership locks
- file handles
- database sessions
- event subscriptions
- native callbacks
- timers
- background workers
- acquisition threads
A resource lifecycle should be explicit:
+-------------+
| Unallocated |
+-------------+
|
| Open / Initialize
v
+-------------+
| Allocated |
+-------------+
|
| Start / Arm / Subscribe
v
+-------------+
| Active |
+-------------+
|
| Stop / Disarm / Unsubscribe
v
+-------------+
| Inactive |
+-------------+
|
| Release / Dispose / Close
v
+-------------+
| Released |
+-------------+
The dangerous shortcut is this:
Active ---> Dispose
That often fails.
For example:
- camera still streaming while handle is released
- callback fires into disposed object
- unmanaged buffer is freed while processing thread still uses it
- TCP connection is closed while protocol parser still expects response
- timer continues firing after subsystem is “disposed”
- UI closes but background worker still sends device commands
A good subsystem usually has separate methods or states:
InitializeAsync()
StartAsync()
StopAsync()
ShutdownAsync()
Dispose()
Dispose() should not be where the real machine stop logic lives.
Dispose is a final cleanup tool. Shutdown is a machine behavior.
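The lifecycle above can be enforced with a small state guard, sketched here. ResourceLifecycle and its method names are hypothetical. Two transitions go beyond the diagram and are assumptions: releasing a resource that was opened but never started, and restarting from Inactive.

```csharp
using System;

public enum ResourceState { Unallocated, Allocated, Active, Inactive, Released }

// Hypothetical guard that rejects the Active -> Released shortcut.
public sealed class ResourceLifecycle
{
    private readonly object _gate = new();

    public ResourceState State { get; private set; } = ResourceState.Unallocated;

    public void Open() =>
        Transition(ResourceState.Allocated, ResourceState.Unallocated);

    // Assumption: a stopped resource may be started again.
    public void Start() =>
        Transition(ResourceState.Active, ResourceState.Allocated, ResourceState.Inactive);

    public void Stop() =>
        Transition(ResourceState.Inactive, ResourceState.Active);

    // Release is allowed only when nothing is running. Active is not in the
    // list, so releasing a streaming camera fails loudly instead of crashing later.
    public void Release() =>
        Transition(ResourceState.Released, ResourceState.Allocated, ResourceState.Inactive);

    private void Transition(ResourceState target, params ResourceState[] allowedFrom)
    {
        lock (_gate)
        {
            if (Array.IndexOf(allowedFrom, State) < 0)
                throw new InvalidOperationException(
                    $"Cannot go from {State} to {target}.");
            State = target;
        }
    }
}
```

A device wrapper could call these methods around its real SDK calls, so that a premature release surfaces as an InvalidOperationException in managed code rather than as an access violation inside a native driver.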
PART 6 — Crash handling and evidence preservation
During a crash, the system may have very little control.
The priority is not:
recover everything immediately.
The priority is:
- preserve evidence
- avoid making physical state worse
- mark the state as uncertain
- require controlled restart/recovery
Useful crash evidence includes:
- exception details
- crash dump
- current workflow step
- active command
- current recipe/job/lot/wafer
- machine state snapshot
- device health/status
- last alarms
- last operator action
- last device communication
- pending storage operations
- recent logs/events
A good crash flow looks like this:
+------------------+
| Unhandled Fault |
+------------------+
|
v
+------------------+
| Capture Evidence |
+------------------+
|
v
+----------------------+
| Mark State Uncertain |
+----------------------+
|
v
+--------------------------+
| Avoid Further Commands |
+--------------------------+
|
v
+--------------------------+
| Require Controlled Start |
+--------------------------+
One of the worst mistakes is cleaning up too aggressively before preserving evidence.
For example:
- clear current workflow state
- reset alarms
- delete temporary files
- retry device initialization
- overwrite last-known state
- rotate logs immediately
- hide crash details from operator/service engineer
That makes root cause analysis much harder.
In production, the question after a crash is not only:
Can we restart?
It is also:
Can we prove what happened?
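For managed crashes, a last-chance handler can write a small evidence file before the process dies. The sketch below assumes a hypothetical MachineSnapshot record and CrashEvidence helper; AppDomain.CurrentDomain.UnhandledException and System.Text.Json are real .NET APIs. It does not cover native SDK crashes, power loss, or an OS kill, where this handler may never run, and it does not replace a real crash dump.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical snapshot of what the machine was doing; fields are illustrative.
public sealed record MachineSnapshot(string WorkflowStep, string ActiveCommand, string JobId);

public static class CrashEvidence
{
    // Writes one evidence file and returns its path. Nothing is cleaned up,
    // reset, or retried here: the only job is to preserve what is known.
    public static string Write(string directory, object error, MachineSnapshot snapshot)
    {
        Directory.CreateDirectory(directory);
        var path = Path.Combine(
            directory, $"crash-{DateTime.UtcNow:yyyyMMdd-HHmmss-fff}.json");
        var payload = new
        {
            TimestampUtc = DateTime.UtcNow,
            Error = error?.ToString(),
            Snapshot = snapshot,
            StateUncertain = true   // next startup must not assume Ready
        };
        File.WriteAllText(path, JsonSerializer.Serialize(payload));
        return path;
    }

    // Registers the last-chance handler for unhandled managed exceptions.
    public static void Install(string directory, Func<MachineSnapshot> currentSnapshot)
    {
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            try
            {
                Write(directory, e.ExceptionObject, currentSnapshot());
            }
            catch
            {
                // Never throw from the crash handler; the original fault matters more.
            }
        };
    }
}
```

Install would be called once at startup with a delegate that returns the latest snapshot kept by the workflow engine. The handler is kept short on purpose, since the process is already in an unknown state when it runs.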
PART 7 — Restart readiness after shutdown or crash
Startup and shutdown are connected.
A machine should not start with this assumption:
The previous process ended cleanly, so everything is fine.
It should check:
- Was the previous shutdown clean?
- Is there a crash marker?
- Was a workflow active?
- Was material inside the machine?
- Were devices released correctly?
- Are device handles available?
- Is the motion controller in a known state?
- Are outputs in expected state?
- Are there incomplete storage operations?
- Does the operator need a recovery procedure?
A safe startup model:
+---------+
| Startup |
+---------+
|
v
+--------------------------+
| Check Previous Shutdown |
+--------------------------+
|
+------------------+
| Clean |
v v
+---------+ +-------------------+
| Ready | | Recovery Required |
+---------+ +-------------------+
|
v
+-------------------+
| Operator/Service |
| Recovery Flow |
+-------------------+
The key principle:
After a crash, the UI should not simply show Ready.
It should show something like:
- Recovery Required
- Previous shutdown abnormal
- Machine state uncertain
- Verify material position
- Re-home required
- Clear device fault
- Resume/reject incomplete job
This prevents stale software assumptions from becoming dangerous physical actions.
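A simple way to detect an abnormal previous shutdown is a marker file that says "running" for the whole session and is changed to "clean" only at the end of a successful shutdown. The sketch below is illustrative: ShutdownMarker, the file format, and the decision to treat a missing marker (first run) as clean are all assumptions. A real system would also need durable, atomic writes and would likely persist recovery status separately.

```csharp
using System;
using System.IO;

public enum StartupState { Ready, RecoveryRequired }

// Hypothetical marker-file helper; the file format is illustrative.
public sealed class ShutdownMarker
{
    private const string Clean = "clean";
    private const string Running = "running";
    private readonly string _path;

    public ShutdownMarker(string path) => _path = path;

    // Call first thing at startup. Reads what the previous session left
    // behind, then marks this session as running.
    public StartupState BeginSession()
    {
        // Assumption: no marker at all means a first run, treated as clean.
        var previous = File.Exists(_path) ? File.ReadAllText(_path).Trim() : Clean;
        File.WriteAllText(_path, Running);
        return previous == Clean ? StartupState.Ready : StartupState.RecoveryRequired;
    }

    // Call as the very last step of a successful graceful shutdown.
    public void MarkCleanShutdown() => File.WriteAllText(_path, Clean);

    // Call when shutdown finished but some step timed out or failed.
    public void MarkRecoveryRequired(string reason) =>
        File.WriteAllText(_path, "recovery-required: " + reason);
}
```

If the process crashes, is killed, or loses power, the marker still says "running", so the next BeginSession returns RecoveryRequired without any cleanup code having run. The UI can then gate the Ready state on that result.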
PART 8 — Real-world failure scenarios
1. App exits while motion controller is still executing
What it looks like:
- UI disappears
- axis continues moving
- next startup sees unexpected position
- operator loses trust
- motion controller reports active or faulted state
Why it happens:
- software sent a move command
- app closed without canceling/stopping motion
- controller owns execution after command is accepted
Prevention:
- motion subsystem has explicit shutdown behavior
- shutdown coordinator asks motion to stop or park
- startup checks actual controller state
- UI does not assume software state equals physical state
2. Acquisition is not stopped before camera handle is released
What it looks like:
- crash during shutdown
- access violation in native SDK
- next startup cannot open camera
- random callback into disposed object
Why it happens:
- camera streaming thread still active
- callback subscription not removed
- buffer still owned by native SDK
- managed object disposed before native acquisition stops
Prevention:
- stop acquisition first
- wait for acquisition stopped confirmation
- unsubscribe callbacks
- release buffers
- close camera handle last
3. Native SDK crash prevents normal cleanup
What it looks like:
- process disappears without managed exception
- no normal shutdown logs
- dump may show native DLL failure
- device may remain locked
Why it happens:
- unsafe native driver
- bad pointer
- SDK internal thread crash
- incompatible driver/firmware version
Prevention:
- isolate risky SDK calls where possible
- capture dumps
- use watchdog/startup recovery
- mark abnormal shutdown
- verify device state on next startup
4. UI closes but background worker continues using device
What it looks like:
- window closes slowly or hangs
- device commands continue after operator requested exit
- logs appear after UI is gone
- shutdown race conditions occur
Why it happens:
- UI owns lifecycle incorrectly
- background worker not cancellation-aware
- device service outlives UI state
- no central shutdown coordinator
Prevention:
- application-level lifecycle owner
- cancellation tokens propagated through workers
- command gateway rejects new work during shutdown
- background workers must acknowledge stop
5. Storage queue loses inspection results during shutdown
What it looks like:
- image exists but database row missing
- database row exists but image missing
- report incomplete
- traceability gap
Why it happens:
- async storage queue still had pending work
- process exited before flush
- no bounded drain strategy
- no incomplete-operation marker
Prevention:
- storage pipeline supports drain/finalize
- shutdown waits with timeout
- pending items are recorded
- incomplete result state is explicit
- restart can reconcile image/database mismatch
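A bounded drain for this scenario might look like the sketch below. StorageQueue is a hypothetical class built on System.Threading.Channels, and the persist delegate stands in for whatever writes the image and database record. Items are identified by a string id, which is an assumption. The key behavior is that DrainAsync waits only for a limited time and then reports every accepted item that was not confirmed as persisted.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical storage queue with a bounded drain.
public sealed class StorageQueue
{
    private readonly Channel<string> _channel = Channel.CreateUnbounded<string>();
    private readonly ConcurrentDictionary<string, byte> _pending = new();
    private readonly Func<string, Task> _persist;
    private readonly Task _worker;

    public StorageQueue(Func<string, Task> persist)
    {
        _persist = persist;
        _worker = Task.Run(() => ConsumeAsync());
    }

    // Returns false for duplicate ids or once draining has started.
    public bool TryEnqueue(string resultId)
    {
        if (!_pending.TryAdd(resultId, 0)) return false;
        if (_channel.Writer.TryWrite(resultId)) return true;
        _pending.TryRemove(resultId, out _);
        return false;
    }

    private async Task ConsumeAsync()
    {
        await foreach (var id in _channel.Reader.ReadAllAsync())
        {
            try
            {
                await _persist(id);
                _pending.TryRemove(id, out _);   // confirmed persisted
            }
            catch (Exception)
            {
                // Leave the id in _pending so the drain reports it.
            }
        }
    }

    // Stops intake, waits up to the timeout, and returns the ids that were
    // accepted but not confirmed as persisted (queued or still in flight).
    public async Task<IReadOnlyCollection<string>> DrainAsync(TimeSpan timeout)
    {
        _channel.Writer.TryComplete();
        try { await _worker.WaitAsync(timeout); }
        catch (TimeoutException) { /* report what is left instead of hanging */ }
        return new List<string>(_pending.Keys);
    }
}
```

The shutdown path would write the returned ids to an incomplete-operations record before exiting, so that the next startup can reconcile image files against database rows instead of discovering the gap later.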
6. Shutdown hangs forever waiting for device response
What it looks like:
- operator clicks Exit
- app freezes on “Stopping...”
- service engineer kills process
- evidence is incomplete
Why it happens:
- shutdown waits indefinitely
- device does not respond
- no timeout or fallback path
- cleanup assumes happy path
Prevention:
- every shutdown step has timeout
- distinguish graceful stop from forced stop
- log which subsystem blocked shutdown
- escalate to recovery-required marker
7. Previous crash leaves machine unknown, but UI starts as Ready
What it looks like:
- app starts normally
- operator presses Start
- machine behaves incorrectly
- material is in unexpected position
- workflow context is stale
Why it happens:
- startup does not check previous shutdown marker
- software reconstructs state too optimistically
- physical state is not revalidated
Prevention:
- abnormal shutdown detection
- startup recovery checks
- require homing/revalidation
- show Recovery Required instead of Ready
8. Operator kills app to recover, destroying evidence
What it looks like:
- operator says “the machine froze”
- logs stop suddenly
- no clear fault reason
- engineering cannot reproduce
Why it happens:
- shutdown/recovery UX is poor
- app appears stuck
- operator has no safe recovery option
- diagnostics are not preserved quickly enough
Prevention:
- visible “Stopping / Recovery / Collecting diagnostics” states
- watchdog health monitoring
- fast diagnostic snapshot
- operator procedure for abnormal stop
- crash dumps and last-event buffers
PART 9 — Software design implications
Graceful shutdown must be an explicit architecture path.
It should not be hidden inside:
- WPF window close event
- random Dispose() methods
- finalizers
- destructors
- process exit events
- scattered try/finally blocks
A strong design has a shutdown coordinator.
+-------------------+ +----------------------+
| Shutdown Request | ----> | Shutdown Coordinator |
+-------------------+ +----------------------+
|
+-------------------+ |
| Crash Detector | ----------------+
+-------------------+
|
v
+----------------+----------------+----------------+
| | | |
v v v v
+--------------+ +---------------+ +--------------+ +----------------+
| Workflow Stop| | Device Disarm | | Storage Flush| | Diagnostics |
+--------------+ +---------------+ +--------------+ +----------------+
| | | |
+----------------+----------------+----------------+
|
v
+--------------------------------+
| Clean Shutdown Marker OR |
| Recovery Required Marker |
+--------------------------------+
Good shutdown design includes:
- central shutdown coordinator
- ordered subsystem stop
- explicit subsystem lifecycle contracts
- cancellation-aware workflows
- timeout-aware cleanup
- command rejection during shutdown
- safe output deactivation
- device ownership tracking
- diagnostic capture
- abnormal shutdown marker
- startup recovery gate
Bad approaches:
Bad:
- Window_Closing does everything
- Dispose randomly stops hardware
- no shutdown ordering
- no timeout
- no crash marker
- startup always shows Ready
- logs are flushed after devices are already killed
- cleanup hides the original failure
Good approaches:
Good:
- shutdown is a first-class workflow
- each subsystem has Stop/Shutdown semantics
- shutdown is ordered by dependency
- physical state is treated as uncertain after crash
- evidence is captured before cleanup
- restart checks previous shutdown result
A useful subsystem contract might look conceptually like this:
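using System.Threading;
using System.Threading.Tasks;
// ShutdownContext is not defined in this document; assume it carries the
// shutdown reason and any shared deadline.
public interface IMachineSubsystem
{
    string Name { get; }
    // Stop active behavior (workflows, motion, acquisition).
    Task StopOperationsAsync(
        ShutdownContext context,
        CancellationToken cancellationToken);
    // Put devices and outputs into a safe state.
    Task DisarmAsync(
        ShutdownContext context,
        CancellationToken cancellationToken);
    // Release software resources (handles, buffers, connections).
    Task ReleaseResourcesAsync(
        ShutdownContext context,
        CancellationToken cancellationToken);
}
The important idea is not the exact interface.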
The important idea is separation:
- stop active behavior
- put device/output into safe state
- release software resources
Those are not the same thing.
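To show why the three phases are kept separate, here is a rough sketch of a coordinator that finishes one phase across all subsystems before starting the next. The interface is repeated so the block stands alone. ShutdownContext (as a record with a Reason), PhasedShutdown, and RecordingSubsystem are hypothetical, and releasing in reverse order assumes the list is given in startup order. Timeouts and error handling are omitted to keep the phase ordering visible.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical: the document does not define ShutdownContext.
public sealed record ShutdownContext(string Reason);

public interface IMachineSubsystem
{
    string Name { get; }
    Task StopOperationsAsync(ShutdownContext context, CancellationToken cancellationToken);
    Task DisarmAsync(ShutdownContext context, CancellationToken cancellationToken);
    Task ReleaseResourcesAsync(ShutdownContext context, CancellationToken cancellationToken);
}

public static class PhasedShutdown
{
    // Runs each phase across every subsystem before moving to the next phase,
    // so no handle is released while another subsystem is still active.
    public static async Task RunAsync(
        IReadOnlyList<IMachineSubsystem> subsystems,
        ShutdownContext context,
        CancellationToken cancellationToken)
    {
        foreach (var s in subsystems)
            await s.StopOperationsAsync(context, cancellationToken);

        foreach (var s in subsystems)
            await s.DisarmAsync(context, cancellationToken);

        // Assumption: the list is in startup order, so release in reverse.
        for (var i = subsystems.Count - 1; i >= 0; i--)
            await subsystems[i].ReleaseResourcesAsync(context, cancellationToken);
    }
}

// Stand-in subsystem that only records the calls it receives.
public sealed class RecordingSubsystem : IMachineSubsystem
{
    private readonly List<string> _log;

    public RecordingSubsystem(string name, List<string> log)
    {
        Name = name;
        _log = log;
    }

    public string Name { get; }

    public Task StopOperationsAsync(ShutdownContext context, CancellationToken cancellationToken)
    { _log.Add(Name + ".stop"); return Task.CompletedTask; }

    public Task DisarmAsync(ShutdownContext context, CancellationToken cancellationToken)
    { _log.Add(Name + ".disarm"); return Task.CompletedTask; }

    public Task ReleaseResourcesAsync(ShutdownContext context, CancellationToken cancellationToken)
    { _log.Add(Name + ".release"); return Task.CompletedTask; }
}
```

With a motion and a camera subsystem in that order, the recorded calls come out as stop, stop, disarm, disarm, then camera release followed by motion release. Calling all three methods on one subsystem before touching the next would instead release motion while the camera is still acquiring.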
PART 10 — Interview / real-world talking points
A strong interview explanation could be:
In industrial software, graceful shutdown is not just process cleanup. The application may be controlling motion, cameras, IO, vacuum, clamps, storage, and active workflows. A safe shutdown must stop accepting new commands, stop workflows at safe boundaries, disarm devices, deactivate outputs safely, flush diagnostics and storage, release hardware resources, and mark whether shutdown was clean. For crashes, we cannot assume cleanup runs, so we preserve evidence, mark the machine state as uncertain, and force startup through recovery checks instead of showing Ready immediately.
Common mistakes engineers make when entering industrial systems:
- treating shutdown like a web app or desktop app lifecycle
- assuming Dispose() means the machine is safe
- assuming process exit stops hardware
- ignoring native SDK/resource ownership
- not designing startup checks for abnormal shutdown
- letting UI close while workflows still run
- waiting forever for devices during shutdown
- clearing evidence too early
- showing Ready after a crash without revalidation
What strong engineers understand:
- shutdown is part of safety and reliability
- physical state may outlive software state
- ordered shutdown matters
- every subsystem needs lifecycle ownership
- crash handling is mostly about evidence and containment
- startup must verify whether the previous shutdown was clean
- recovery-required is safer than pretending everything is normal
The core sentence to remember:
In machine software, shutdown is successful only when the process exits, the hardware is safe, resources are released, evidence is preserved, and the next startup knows whether recovery is required.