
Crash Handling & Graceful Shutdown in Industrial Machine Software

This topic fits the roadmap’s reliability/fault-handling area, especially “startup and shutdown robustness,” “safe stop,” “crash dump collection,” and “hardware resource ownership.”


PART 1 — Why shutdown is safety-critical in machine software

In normal desktop software, shutdown often means:

save settings, close windows, release memory, exit process.

In industrial machine software, shutdown means:

stop controlling physical reality in a safe, explainable, recoverable way.

A machine application may be controlling:

  • motion axes
  • cameras
  • frame grabbers
  • IO outputs
  • vacuum
  • clamps
  • lights
  • lasers
  • robots
  • conveyors
  • storage pipelines
  • active inspection workflows

So shutdown is not only a software lifecycle event. It is a machine-control event.

If shutdown is poorly handled, the process may disappear, but the machine may still be left in a dangerous or inconsistent condition.

For example:

A camera SDK handle is not released correctly. The next startup fails because the camera is still locked by the previous process or driver session.

A motion command is active when the application exits. The software UI is gone, but the controller may still be executing the last command.

Vacuum or clamp output remains active. Material may stay held inside the machine, but the next startup may not know that.

An inspection result image is written to disk, but the database record is not committed. Later, production traceability says the result does not exist, but the image file exists.

This is the core mindset:

In industrial software, shutdown must leave both software and the physical machine in a known, safe, diagnosable state.


PART 2 — Normal shutdown vs abnormal termination

There are two very different worlds.

Normal shutdown

Normal shutdown happens when the operator, system, or maintenance procedure requests a controlled stop.

Typical examples:

  • operator clicks Exit
  • operator stops production
  • service engineer shuts down the tool
  • system update requires shutdown
  • machine transitions to offline mode

In normal shutdown, the software still has control. It can coordinate subsystems.

A good normal shutdown should:

  • reject new commands
  • stop workflows safely
  • stop or park motion if appropriate
  • disarm cameras and acquisition
  • turn off outputs safely
  • flush logs and diagnostics
  • release device handles
  • write a clean shutdown marker

Abnormal termination

Abnormal termination happens when the system loses normal control.

Examples:

  • unhandled exception
  • process crash
  • OS kill
  • power loss
  • watchdog terminates the process
  • native SDK crashes the process
  • machine PC freezes
  • someone kills the app from Task Manager

In abnormal termination, you cannot assume cleanup code will run.

That is why industrial systems must design for both:

text
Normal path:
+---------+       +----------+       +---------+
| Running | ----> | Stopping | ----> | Stopped |
+---------+       +----------+       +---------+


Abnormal path:
+---------+       +---------+       +-------------------+
| Running | ----> | Crashed | ----> | Recovery Required |
+---------+       +---------+       +-------------------+

The most important difference:

Situation         | What you can guarantee
Normal shutdown   | Ordered shutdown may complete
Crash             | Only pre-existing safety design can protect you
Power loss        | Software cleanup may not happen at all
OS kill           | finally, Dispose, async cleanup may not run
Native SDK crash  | The process may die before managed code reacts

So experienced engineers do not build safety around “my shutdown handler will always run.”

They build safety around:

What happens if it does not run?
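
One way to answer that question is to assume an unclean exit from the start of every session. The sketch below writes a "Running" marker at startup and changes it to "Clean" only as the very last shutdown step. The class name SessionMarker and the file-based storage are assumptions made for illustration, not part of any particular framework.

```csharp
using System;
using System.IO;

// Hypothetical helper: the marker says "Running" for the whole session and
// becomes "Clean" only at the end of an ordered shutdown.
public sealed class SessionMarker
{
    private readonly string _path;

    public SessionMarker(string path) => _path = path;

    // True when there is no previous session, or it ended with MarkClean().
    public bool PreviousShutdownWasClean()
    {
        if (!File.Exists(_path)) return true; // first run: nothing to recover
        return File.ReadAllText(_path).Trim() == "Clean";
    }

    // Call at startup, after reading the previous state.
    public void MarkRunning() => File.WriteAllText(_path, "Running");

    // Call only after every shutdown step has completed.
    public void MarkClean() => File.WriteAllText(_path, "Clean");
}
```

A crash, kill, or power loss leaves "Running" on disk, so the next startup can detect the abnormal exit without any crash-time code having run. A production version would probably also need an atomic write (temporary file plus rename) and an explicit flush to disk, which this sketch omits.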


PART 3 — Graceful shutdown sequence

A realistic graceful shutdown should be coordinated, not random.

A common sequence looks like this:

text
Operator/UI
   |
   | Request Shutdown
   v
+----------------------+
| Shutdown Coordinator |
+----------------------+
   |
   | 1. Reject new commands
   v
+----------------------+
| Command Gateway      |
+----------------------+

   |
   | 2. Request workflow stop
   v
+----------------------+
| Workflow Engine      |
+----------------------+

   |
   | 3. Stop/park motion
   v
+----------------------+
| Motion Subsystem     |
+----------------------+

   |
   | 4. Stop acquisition
   v
+----------------------+
| Camera / Acquisition |
+----------------------+

   |
   | 5. Deactivate outputs
   v
+----------------------+
| IO / Vacuum / Clamp  |
+----------------------+

   |
   | 6. Flush data
   v
+----------------------+
| Storage / Logs       |
+----------------------+

   |
   | 7. Release resources
   v
+----------------------+
| Device Managers      |
+----------------------+

   |
   | 8. Mark clean shutdown
   v
+----------------------+
| Shutdown Marker      |
+----------------------+

The order matters.

You usually do not release devices before stopping workflows. You usually do not turn off vacuum blindly before understanding whether material is held. You usually do not stop logging before capturing the shutdown reason. You usually do not dispose the camera while acquisition callbacks are still running.

A better mental model is:

Shutdown is a controlled workflow with dependencies, timeouts, and fallback behavior.

Not:

The user closed the window, so call Dispose everywhere.
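
As a rough illustration of "a controlled workflow with dependencies, timeouts, and fallback behavior", the sketch below runs named steps in a fixed order and gives each one its own time budget. ShutdownCoordinator, ShutdownStep, and ShutdownResult are hypothetical names, and the code assumes .NET 6 or later for Task.WaitAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// One named step of the shutdown workflow with its own time budget.
public sealed record ShutdownStep(string Name, TimeSpan Timeout, Func<CancellationToken, Task> Action);

public sealed record ShutdownResult(bool Clean, IReadOnlyList<string> FailedSteps);

public sealed class ShutdownCoordinator
{
    private readonly List<ShutdownStep> _steps = new();

    // Steps run in the order they are added, so add them in dependency order.
    public ShutdownCoordinator Add(string name, TimeSpan timeout, Func<CancellationToken, Task> action)
    {
        _steps.Add(new ShutdownStep(name, timeout, action));
        return this;
    }

    public async Task<ShutdownResult> RunAsync()
    {
        var failed = new List<string>();
        foreach (var step in _steps)
        {
            using var cts = new CancellationTokenSource(step.Timeout);
            try
            {
                // WaitAsync bounds the wait even if the step ignores the token.
                await step.Action(cts.Token).WaitAsync(step.Timeout);
            }
            catch (Exception)
            {
                // Timeout or failure: record the step and keep going so that
                // later steps (outputs, logs, marker) are still attempted.
                failed.Add(step.Name);
            }
        }
        return new ShutdownResult(failed.Count == 0, failed);
    }
}
```

Continuing after a failed step is a policy choice made for this sketch. A real machine may need to skip dependent steps instead. Either way, a result with Clean == false should lead to a recovery-required marker rather than a clean one.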


PART 4 — Safe stopping of active operations

Shutdown often happens while the machine is doing something.

Active operations may include:

  • motion in progress
  • camera acquisition
  • image processing
  • device command waiting for response
  • storage write
  • operator command executing
  • robot transfer
  • vacuum pickup
  • alignment flow
  • inspection cycle

A strong design distinguishes several stop types.

Cancel

Cancel means:

finish safely as soon as possible, cooperatively.

Example:

  • stop queueing more image processing
  • stop the recipe after the current safe step
  • cancel pending non-critical tasks

Cancel is usually software-level and cooperative.

Stop at safe boundary

This means:

do not interrupt the current physical action halfway; stop after a known safe point.

Example:

  • finish current wafer scan line
  • wait until axis reaches a stable position
  • finish current camera frame acquisition
  • complete current database transaction
  • stop before loading the next part

This is common in production workflows.

Abort immediately

Abort means:

stop the current operation now, even if production context becomes incomplete.

Example:

  • abort inspection
  • stop motion command
  • terminate acquisition
  • discard current pipeline batch

Abort may require recovery afterward.

Emergency stop

Emergency stop is different.

It should be handled by the safety system, safety PLC, drives, relays, or hardware circuit — not by normal application logic.

Software may observe and react to E-stop, but it should not be the only thing responsible for achieving a safe emergency stop.

Important distinction:

text
Graceful Stop:
Software-controlled, orderly, diagnostic-friendly.

Abort:
Software-controlled, urgent, may require recovery.

Emergency Stop:
Safety-system-controlled, hardware/safety priority.

A common mistake is treating all stop requests the same.

In machine software, “stop” is not one thing.
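
The difference between stopping at a safe boundary and aborting can be shown in a few lines. In this sketch, ScanWorkflow is a hypothetical workflow that treats "between scan lines" as its safe boundary, and a CancellationToken stands in for an abort request. Emergency stop is deliberately not modeled, because it belongs to the safety system.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical scan workflow: the point between two scan lines is the safe boundary.
public sealed class ScanWorkflow
{
    private volatile bool _stopAtBoundary;

    // Graceful: honoured only between lines, never in the middle of one.
    public void RequestStopAtSafeBoundary() => _stopAtBoundary = true;

    // abortToken models Abort: it may interrupt the line that is in progress.
    public async Task<(int CompletedLines, bool Aborted)> RunAsync(
        int lineCount,
        Func<int, CancellationToken, Task> scanLine,
        CancellationToken abortToken)
    {
        int completed = 0;
        try
        {
            for (int line = 0; line < lineCount; line++)
            {
                if (_stopAtBoundary) break;          // safe boundary reached
                await scanLine(line, abortToken);    // may throw on abort
                completed++;
            }
            return (completed, false);
        }
        catch (OperationCanceledException)
        {
            // Production context is incomplete; the caller should plan for recovery.
            return (completed, true);
        }
    }
}
```

The two requests travel on different paths on purpose: the boundary flag is only read between lines, while the abort token is passed into the line itself.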


PART 5 — Resource cleanup and release

Industrial software often uses resources that outlive normal C# objects.

Examples:

  • native SDK handles
  • camera handles
  • frame grabber buffers
  • unmanaged image buffers
  • serial ports
  • TCP sockets
  • PLC connections
  • device ownership locks
  • file handles
  • database sessions
  • event subscriptions
  • native callbacks
  • timers
  • background workers
  • acquisition threads

A resource lifecycle should be explicit:

text
+-------------+
| Unallocated |
+-------------+
       |
       | Open / Initialize
       v
+-------------+
| Allocated   |
+-------------+
       |
       | Start / Arm / Subscribe
       v
+-------------+
| Active      |
+-------------+
       |
       | Stop / Disarm / Unsubscribe
       v
+-------------+
| Inactive    |
+-------------+
       |
       | Release / Dispose / Close
       v
+-------------+
| Released    |
+-------------+

The dangerous shortcut is this:

text
Active  --->  Dispose

That often fails.

For example:

  • camera still streaming while handle is released
  • callback fires into disposed object
  • unmanaged buffer is freed while processing thread still uses it
  • TCP connection is closed while protocol parser still expects response
  • timer continues firing after subsystem is “disposed”
  • UI closes but background worker still sends device commands

A good subsystem usually has separate methods or states:

csharp
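using System;
using System.Threading.Tasks;
public interface ISubsystemLifecycle : IDisposable  // hypothetical name
{
    Task InitializeAsync();  // open handles and connections
    Task StartAsync();       // arm, subscribe, begin active work
    Task StopAsync();        // stop active work
    Task ShutdownAsync();    // bring the machine side to a safe state
    // Dispose() comes from IDisposable: final cleanup only
}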

Dispose() should not be where the real machine stop logic lives.

Dispose is a final cleanup tool. Shutdown is a machine behavior.
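
One way to make the Active-to-Dispose shortcut impossible is to track the lifecycle state explicitly and refuse illegal transitions. The following is a minimal sketch. ResourceLifecycle and ResourceState are hypothetical names, and a real device wrapper would call the SDK inside each transition.

```csharp
using System;

public enum ResourceState { Unallocated, Allocated, Active, Inactive, Released }

public sealed class ResourceLifecycle
{
    public ResourceState State { get; private set; } = ResourceState.Unallocated;

    public void Open()  => Move(ResourceState.Allocated, ResourceState.Unallocated);
    public void Start() => Move(ResourceState.Active, ResourceState.Allocated);
    public void Stop()  => Move(ResourceState.Inactive, ResourceState.Active);

    // Release is refused while Active: the caller must Stop first.
    // Assumption: a resource that was opened but never started may be released too.
    public void Release() =>
        Move(ResourceState.Released, ResourceState.Inactive, ResourceState.Allocated);

    // Performs the transition only if the current state is one of allowedFrom.
    private void Move(ResourceState to, params ResourceState[] allowedFrom)
    {
        if (Array.IndexOf(allowedFrom, State) < 0)
            throw new InvalidOperationException($"Cannot go from {State} to {to}.");
        State = to;
    }
}
```

Throwing on an illegal transition is one option. During shutdown it may be more practical to log the violation and run the missing Stop step, but the state check itself stays the same.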


PART 6 — Crash handling and evidence preservation

During a crash, the system may have very little control.

The priority is not:

recover everything immediately.

The priority is:

  1. preserve evidence
  2. avoid making physical state worse
  3. mark the state as uncertain
  4. require controlled restart/recovery

Useful crash evidence includes:

  • exception details
  • crash dump
  • current workflow step
  • active command
  • current recipe/job/lot/wafer
  • machine state snapshot
  • device health/status
  • last alarms
  • last operator action
  • last device communication
  • pending storage operations
  • recent logs/events

A good crash flow looks like this:

text
+------------------+
| Unhandled Fault  |
+------------------+
          |
          v
+------------------+
| Capture Evidence |
+------------------+
          |
          v
+----------------------+
| Mark State Uncertain |
+----------------------+
          |
          v
+--------------------------+
| Avoid Further Commands   |
+--------------------------+
          |
          v
+--------------------------+
| Require Controlled Start |
+--------------------------+

One of the worst mistakes is cleaning up too aggressively before preserving evidence.

For example:

  • clear current workflow state
  • reset alarms
  • delete temporary files
  • retry device initialization
  • overwrite last-known state
  • rotate logs immediately
  • hide crash details from operator/service engineer

That makes root cause analysis much harder.

In production, the question after a crash is not only:

Can we restart?

It is also:

Can we prove what happened?
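
A small last-chance handler can preserve part of that evidence for managed faults. The sketch below uses the real AppDomain.CurrentDomain.UnhandledException event and System.Text.Json. CrashSnapshot, CrashEvidence, and the ReadContext delegate are hypothetical names, and the fields are only a subset of the evidence listed above.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical snapshot of what the machine was doing when the fault occurred.
public sealed record CrashSnapshot(
    DateTime UtcTime, string Exception, string WorkflowStep, string ActiveCommand, string Job);

public static class CrashEvidence
{
    // Supplied by the application. It should be cheap and must not touch devices.
    public static Func<(string Step, string Command, string Job)> ReadContext =
        () => ("unknown", "unknown", "unknown");

    // Writes one JSON file per crash and returns its path.
    public static string Write(string directory, Exception ex)
    {
        var ctx = ReadContext();
        var snapshot = new CrashSnapshot(DateTime.UtcNow, ex.ToString(), ctx.Step, ctx.Command, ctx.Job);
        Directory.CreateDirectory(directory);
        // Timestamped name: earlier evidence is not overwritten.
        var path = Path.Combine(directory, $"crash-{snapshot.UtcTime:yyyyMMdd-HHmmssfff}.json");
        File.WriteAllText(path, JsonSerializer.Serialize(snapshot));
        return path;
    }

    // Registers the handler. It is not guaranteed to run: a native crash,
    // a kill, or a power loss can end the process before it fires.
    public static void Install(string directory) =>
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            if (e.ExceptionObject is Exception ex) Write(directory, ex);
        };
}
```

The handler only writes. It does not reset alarms, retry devices, or clear workflow state, which keeps it in line with "preserve evidence first".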


PART 7 — Restart readiness after shutdown or crash

Startup and shutdown are connected.

A machine should not start with this assumption:

The previous process ended cleanly, so everything is fine.

It should check:

  • Was the previous shutdown clean?
  • Is there a crash marker?
  • Was a workflow active?
  • Was material inside the machine?
  • Were devices released correctly?
  • Are device handles available?
  • Is the motion controller in a known state?
  • Are outputs in expected state?
  • Are there incomplete storage operations?
  • Does the operator need a recovery procedure?

A safe startup model:

text
+---------+
| Startup |
+---------+
     |
     v
+--------------------------+
| Check Previous Shutdown  |
+--------------------------+
     |
     +------------------+
     | Clean            | Abnormal
     v                  v
+---------+      +-------------------+
| Ready   |      | Recovery Required |
+---------+      +-------------------+
                         |
                         v
                +-------------------+
                | Operator/Service  |
                | Recovery Flow     |
                +-------------------+

The key principle:

After a crash, the UI should not simply show Ready.

It should show something like:

  • Recovery Required
  • Previous shutdown abnormal
  • Machine state uncertain
  • Verify material position
  • Re-home required
  • Clear device fault
  • Resume/reject incomplete job

This prevents stale software assumptions from becoming dangerous physical actions.
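
The startup gate can be sketched as a pure function from observed facts to a state plus the reasons shown to the operator. StartupFacts and StartupGate are hypothetical names, and the four inputs are only a subset of the checks listed above.

```csharp
using System;
using System.Collections.Generic;

public enum StartupState { Ready, RecoveryRequired }

// Hypothetical facts gathered at startup; each maps to one of the checks above.
public sealed record StartupFacts(
    bool PreviousShutdownClean,
    bool WorkflowWasActive,
    bool MaterialMayBeInside,
    bool MotionControllerStateKnown);

public static class StartupGate
{
    // Ready only when no check raises a concern; otherwise list every reason.
    public static (StartupState State, IReadOnlyList<string> Reasons) Evaluate(StartupFacts facts)
    {
        var reasons = new List<string>();
        if (!facts.PreviousShutdownClean)      reasons.Add("Previous shutdown abnormal");
        if (facts.WorkflowWasActive)           reasons.Add("Resume/reject incomplete job");
        if (facts.MaterialMayBeInside)         reasons.Add("Verify material position");
        if (!facts.MotionControllerStateKnown) reasons.Add("Re-home required");
        var state = reasons.Count == 0 ? StartupState.Ready : StartupState.RecoveryRequired;
        return (state, reasons);
    }
}
```

Keeping the decision separate from the code that queries markers and devices makes the gate easy to test and easy to extend with further checks.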


PART 8 — Real-world failure scenarios

1. App exits while motion controller is still executing

What it looks like:

  • UI disappears
  • axis continues moving
  • next startup sees unexpected position
  • operator loses trust
  • motion controller reports active or faulted state

Why it happens:

  • software sent a move command
  • app closed without canceling/stopping motion
  • controller owns execution after command is accepted

Prevention:

  • motion subsystem has explicit shutdown behavior
  • shutdown coordinator asks motion to stop or park
  • startup checks actual controller state
  • UI does not assume software state equals physical state

2. Acquisition is not stopped before camera handle is released

What it looks like:

  • crash during shutdown
  • access violation in native SDK
  • next startup cannot open camera
  • random callback into disposed object

Why it happens:

  • camera streaming thread still active
  • callback subscription not removed
  • buffer still owned by native SDK
  • managed object disposed before native acquisition stops

Prevention:

  • stop acquisition first
  • wait for acquisition stopped confirmation
  • unsubscribe callbacks
  • release buffers
  • close camera handle last
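
The prevention order above can be written down as one small routine. ICameraDevice is a hypothetical wrapper, not a real SDK interface, and RecordingCamera is an in-memory stand-in that only records the call order. Task.WaitAsync assumes .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical camera wrapper; a real SDK will use its own names for these calls.
public interface ICameraDevice
{
    event Action<byte[]> FrameReady;
    Task StopAcquisitionAsync();   // completes when streaming has stopped
    void ReleaseBuffers();
    void CloseHandle();
}

public static class CameraTeardown
{
    public static async Task ShutdownAsync(ICameraDevice camera, Action<byte[]> handler, TimeSpan stopTimeout)
    {
        // Stop streaming and wait, with a bound, for confirmation.
        await camera.StopAcquisitionAsync().WaitAsync(stopTimeout);
        // No more callbacks into objects that are about to go away.
        camera.FrameReady -= handler;
        // Buffers first, handle last.
        camera.ReleaseBuffers();
        camera.CloseHandle();
    }
}

// In-memory stand-in that records the order of calls.
public sealed class RecordingCamera : ICameraDevice
{
    public List<string> Calls { get; } = new();
    public event Action<byte[]> FrameReady
    {
        add { Calls.Add("subscribe"); }
        remove { Calls.Add("unsubscribe"); }
    }
    public Task StopAcquisitionAsync() { Calls.Add("stop"); return Task.CompletedTask; }
    public void ReleaseBuffers() => Calls.Add("release-buffers");
    public void CloseHandle() => Calls.Add("close");
}
```

If the stop confirmation times out, this sketch throws before releasing anything and leaves the decision to the caller. Whether to force-close at that point depends on the SDK and is not assumed here.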

3. Native SDK crash prevents normal cleanup

What it looks like:

  • process disappears without managed exception
  • no normal shutdown logs
  • dump may show native DLL failure
  • device may remain locked

Why it happens:

  • unsafe native driver
  • bad pointer
  • SDK internal thread crash
  • incompatible driver/firmware version

Prevention:

  • isolate risky SDK calls where possible
  • capture dumps
  • use watchdog/startup recovery
  • mark abnormal shutdown
  • verify device state on next startup

4. UI closes but background worker continues using device

What it looks like:

  • window closes slowly or hangs
  • device commands continue after operator requested exit
  • logs appear after UI is gone
  • shutdown race conditions occur

Why it happens:

  • UI owns lifecycle incorrectly
  • background worker not cancellation-aware
  • device service outlives UI state
  • no central shutdown coordinator

Prevention:

  • application-level lifecycle owner
  • cancellation tokens propagated through workers
  • command gateway rejects new work during shutdown
  • background workers must acknowledge stop
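
Two of these measures fit in a short sketch: a gateway that rejects new work once shutdown begins, and a worker that only reports "stopped" after its loop has really exited. CommandGateway and DeviceWorker are hypothetical names, and the counter increment stands in for a device command.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class CommandGateway
{
    private int _shuttingDown; // 0 = accepting, 1 = rejecting

    public void BeginShutdown() => Interlocked.Exchange(ref _shuttingDown, 1);

    // Returns false instead of running the command once shutdown has started.
    public bool TryAccept(Action command)
    {
        if (Volatile.Read(ref _shuttingDown) == 1) return false;
        command();
        return true;
    }
}

public sealed class DeviceWorker
{
    private readonly CancellationTokenSource _cts = new();
    private Task _loop = Task.CompletedTask;
    public int Polls;

    public void Start(TimeSpan interval) => _loop = RunAsync(interval, _cts.Token);

    private async Task RunAsync(TimeSpan interval, CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                Interlocked.Increment(ref Polls);   // stands in for a device command
                await Task.Delay(interval, ct);
            }
        }
        catch (OperationCanceledException) { /* normal stop path */ }
    }

    // True only when the loop has actually exited within the timeout.
    public async Task<bool> StopAsync(TimeSpan timeout)
    {
        _cts.Cancel();
        try { await _loop.WaitAsync(timeout); return true; }
        catch (TimeoutException) { return false; }
    }
}
```

The gateway check is not atomic with the command itself, so a command that was accepted just before BeginShutdown can still be running. The coordinator would cover that case by stopping workflows and workers after closing the gateway.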

5. Storage queue loses inspection results during shutdown

What it looks like:

  • image exists but database row missing
  • database row exists but image missing
  • report incomplete
  • traceability gap

Why it happens:

  • async storage queue still had pending work
  • process exited before flush
  • no bounded drain strategy
  • no incomplete-operation marker

Prevention:

  • storage pipeline supports drain/finalize
  • shutdown waits with timeout
  • pending items are recorded
  • incomplete result state is explicit
  • restart can reconcile image/database mismatch
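
A bounded drain might look like the sketch below: write what fits into the time budget, then hand back whatever is still pending so it can be recorded as incomplete. StorageQueue is a hypothetical class, the write delegate stands in for the real image and database write, and a single consumer is assumed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public sealed class StorageQueue
{
    private readonly ConcurrentQueue<string> _pending = new();
    private readonly Func<string, Task> _write;

    public StorageQueue(Func<string, Task> write) => _write = write;

    public void Enqueue(string resultId) => _pending.Enqueue(resultId);

    // Writes queued items until the queue is empty or the budget is spent.
    // Returns the ids that were NOT written, in their original order.
    public async Task<IReadOnlyList<string>> DrainAsync(TimeSpan budget)
    {
        var clock = Stopwatch.StartNew();
        while (_pending.TryPeek(out var id))
        {
            var remaining = budget - clock.Elapsed;
            if (remaining <= TimeSpan.Zero) break;
            try { await _write(id).WaitAsync(remaining); }
            catch (Exception) { break; }     // timeout or failure: leave it pending
            _pending.TryDequeue(out _);      // remove only after a successful write
        }
        return _pending.ToArray();
    }
}
```

The returned ids are what the shutdown path would persist as an incomplete-operation record, so that the next startup can reconcile images and database rows.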

6. Shutdown hangs forever waiting for device response

What it looks like:

  • operator clicks Exit
  • app freezes on “Stopping...”
  • service engineer kills process
  • evidence is incomplete

Why it happens:

  • shutdown waits indefinitely
  • device does not respond
  • no timeout or fallback path
  • cleanup assumes happy path

Prevention:

  • every shutdown step has timeout
  • distinguish graceful stop from forced stop
  • log which subsystem blocked shutdown
  • escalate to recovery-required marker

7. Previous crash leaves machine unknown, but UI starts as Ready

What it looks like:

  • app starts normally
  • operator presses Start
  • machine behaves incorrectly
  • material is in unexpected position
  • workflow context is stale

Why it happens:

  • startup does not check previous shutdown marker
  • software reconstructs state too optimistically
  • physical state is not revalidated

Prevention:

  • abnormal shutdown detection
  • startup recovery checks
  • require homing/revalidation
  • show Recovery Required instead of Ready

8. Operator kills app to recover, destroying evidence

What it looks like:

  • operator says “the machine froze”
  • logs stop suddenly
  • no clear fault reason
  • engineering cannot reproduce

Why it happens:

  • shutdown/recovery UX is poor
  • app appears stuck
  • operator has no safe recovery option
  • diagnostics are not preserved quickly enough

Prevention:

  • visible “Stopping / Recovery / Collecting diagnostics” states
  • watchdog health monitoring
  • fast diagnostic snapshot
  • operator procedure for abnormal stop
  • crash dumps and last-event buffers
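
A last-event buffer can be as simple as a fixed-size ring kept in memory, so that a diagnostic snapshot takes a copy rather than a log search. LastEventBuffer is a hypothetical name. Events are plain strings here to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;

// Fixed-size buffer that keeps only the most recent events.
public sealed class LastEventBuffer
{
    private readonly string[] _items;
    private readonly object _gate = new();
    private long _count;                 // total number of events ever added

    public LastEventBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new string[capacity];
    }

    public void Add(string evt)
    {
        lock (_gate)
        {
            _items[_count % _items.Length] = evt;  // overwrites the oldest slot when full
            _count++;
        }
    }

    // Oldest-to-newest copy of the events currently retained.
    public IReadOnlyList<string> Snapshot()
    {
        lock (_gate)
        {
            int n = (int)Math.Min(_count, _items.Length);
            var result = new string[n];
            for (int i = 0; i < n; i++)
                result[i] = _items[(_count - n + i) % _items.Length];
            return result;
        }
    }
}
```

Because the buffer never grows, adding an event stays cheap, and the snapshot can be written out by a crash handler or a "Collecting diagnostics" step.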

PART 9 — Software design implications

Graceful shutdown must be an explicit architecture path.

It should not be hidden inside:

  • WPF window close event
  • random Dispose() methods
  • finalizers
  • destructors
  • process exit events
  • scattered try/finally blocks

A strong design has a shutdown coordinator.

text
+-------------------+       +----------------------+
| Shutdown Request  | ----> | Shutdown Coordinator |
+-------------------+       +----------------------+
                                      |
+-------------------+                 |
| Crash Detector    | ----------------+
+-------------------+
                                      |
                                      v
       +----------------+----------------+----------------+
       |                |                |                |
       v                v                v                v
+--------------+ +---------------+ +--------------+ +----------------+
| Workflow Stop| | Device Disarm | | Storage Flush| | Diagnostics    |
+--------------+ +---------------+ +--------------+ +----------------+
       |                |                |                |
       +----------------+----------------+----------------+
                                      |
                                      v
                    +--------------------------------+
                    | Clean Shutdown Marker OR       |
                    | Recovery Required Marker       |
                    +--------------------------------+

Good shutdown design includes:

  • central shutdown coordinator
  • ordered subsystem stop
  • explicit subsystem lifecycle contracts
  • cancellation-aware workflows
  • timeout-aware cleanup
  • command rejection during shutdown
  • safe output deactivation
  • device ownership tracking
  • diagnostic capture
  • abnormal shutdown marker
  • startup recovery gate

Bad approaches:

text
Bad:
- Window_Closing does everything
- Dispose randomly stops hardware
- no shutdown ordering
- no timeout
- no crash marker
- startup always shows Ready
- logs are flushed after devices are already killed
- cleanup hides the original failure

Good approaches:

text
Good:
- shutdown is a first-class workflow
- each subsystem has Stop/Shutdown semantics
- shutdown is ordered by dependency
- physical state is treated as uncertain after crash
- evidence is captured before cleanup
- restart checks previous shutdown result

A useful subsystem contract might look conceptually like this:

csharp
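using System.Threading;
using System.Threading.Tasks;

// ShutdownContext is assumed to be defined elsewhere, for example to carry
// the shutdown reason and deadline.
public interface IMachineSubsystem
{
    string Name { get; }

    // Phase 1: stop active behavior (workflows, motion, acquisition).
    Task StopOperationsAsync(
        ShutdownContext context,
        CancellationToken cancellationToken);

    // Phase 2: put devices and outputs into a safe state.
    Task DisarmAsync(
        ShutdownContext context,
        CancellationToken cancellationToken);

    // Phase 3: release handles and other software resources.
    Task ReleaseResourcesAsync(
        ShutdownContext context,
        CancellationToken cancellationToken);
}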

The important idea is not the exact interface.

The important idea is separation:

  • stop active behavior
  • put device/output into safe state
  • release software resources

Those are not the same thing.
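
To show why the separation matters, the sketch below runs each phase across all subsystems before starting the next one. The interface is repeated so the example is self-contained. The ShutdownContext shape, PhasedShutdown, and RecordingSubsystem are assumptions made for illustration, as is the choice to release in reverse order.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape: only a reason string is carried here.
public sealed record ShutdownContext(string Reason);

public interface IMachineSubsystem
{
    string Name { get; }
    Task StopOperationsAsync(ShutdownContext context, CancellationToken cancellationToken);
    Task DisarmAsync(ShutdownContext context, CancellationToken cancellationToken);
    Task ReleaseResourcesAsync(ShutdownContext context, CancellationToken cancellationToken);
}

public static class PhasedShutdown
{
    // Each phase completes for every subsystem before the next phase starts,
    // so nothing is released while another subsystem is still active.
    public static async Task RunAsync(
        IReadOnlyList<IMachineSubsystem> subsystems, ShutdownContext context, CancellationToken ct)
    {
        foreach (var s in subsystems) await s.StopOperationsAsync(context, ct);
        foreach (var s in subsystems) await s.DisarmAsync(context, ct);
        // Assumption: later subsystems depend on earlier ones, so release in reverse.
        for (int i = subsystems.Count - 1; i >= 0; i--)
            await subsystems[i].ReleaseResourcesAsync(context, ct);
    }
}

// In-memory stand-in that records the resulting call order.
public sealed class RecordingSubsystem : IMachineSubsystem
{
    private readonly List<string> _log;
    public RecordingSubsystem(string name, List<string> log) { Name = name; _log = log; }
    public string Name { get; }
    public Task StopOperationsAsync(ShutdownContext c, CancellationToken ct) { _log.Add($"{Name}:stop"); return Task.CompletedTask; }
    public Task DisarmAsync(ShutdownContext c, CancellationToken ct) { _log.Add($"{Name}:disarm"); return Task.CompletedTask; }
    public Task ReleaseResourcesAsync(ShutdownContext c, CancellationToken ct) { _log.Add($"{Name}:release"); return Task.CompletedTask; }
}
```

With a single Dispose-style method per subsystem, this ordering could not be expressed: the first subsystem would be fully released before the second had even stopped.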


PART 10 — Interview / real-world talking points

A strong interview explanation could be:

In industrial software, graceful shutdown is not just process cleanup. The application may be controlling motion, cameras, IO, vacuum, clamps, storage, and active workflows. A safe shutdown must stop accepting new commands, stop workflows at safe boundaries, disarm devices, deactivate outputs safely, flush diagnostics and storage, release hardware resources, and mark whether shutdown was clean. For crashes, we cannot assume cleanup runs, so we preserve evidence, mark the machine state as uncertain, and force startup through recovery checks instead of showing Ready immediately.

Common mistakes engineers make when entering industrial systems:

  • treating shutdown like a web app or desktop app lifecycle
  • assuming Dispose() means the machine is safe
  • assuming process exit stops hardware
  • ignoring native SDK/resource ownership
  • not designing startup checks for abnormal shutdown
  • letting UI close while workflows still run
  • waiting forever for devices during shutdown
  • clearing evidence too early
  • showing Ready after a crash without revalidation

What strong engineers understand:

  • shutdown is part of safety and reliability
  • physical state may outlive software state
  • ordered shutdown matters
  • every subsystem needs lifecycle ownership
  • crash handling is mostly about evidence and containment
  • startup must verify whether the previous shutdown was clean
  • recovery-required is safer than pretending everything is normal

The core sentence to remember:

In machine software, shutdown is successful only when the process exits, the hardware is safe, resources are released, evidence is preserved, and the next startup knows whether recovery is required.
