Observability Runbook

This runbook covers the structured logging, crash reporting, metrics, and single-instance guard introduced by TASK-006 (Observability Baseline).

1. Structured logs

Where logs live

Logs are written to rolling daily files under:

%LOCALAPPDATA%\InspectionPrototype\logs\app-<yyyyMMdd>.log

Each file rolls at midnight or at 10 MB (whichever comes first).

Retention policy

The seven most recent log files are kept; older files are deleted automatically by Serilog at roll time.

Log levels

Environment	Minimum level
Debugger attached	Debug
Normal launch	Information

Every log entry is enriched with ThreadId and ProcessId.

Reading the log

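With the default Serilog text formatter the file is plain text, one line per event (an exception's stack trace follows on extra lines). jq applies only if the sink is configured with a JSON formatter. Use any text editor, or filter by level in PowerShell: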

powershell

Get-Content "$env:LOCALAPPDATA\InspectionPrototype\logs\app-$(Get-Date -f yyyyMMdd).log" |
    Select-String "Error|Warning|Fatal"
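Because a file also rolls at 10 MB, a busy day can produce more than one file, and a date-based path then names only one of them (assuming size-based rolls create additional suffixed files, as Serilog's file sink does). A minimal sketch that picks the most recently written app-*.log instead; the function name Get-LatestAppLog is hypothetical and the 50-line tail is an arbitrary choice:

```powershell
function Get-LatestAppLog {
    param(
        # Log folder from "Where logs live"; pass another folder to override.
        [string] $LogDir = (Join-Path $env:LOCALAPPDATA 'InspectionPrototype\logs')
    )
    # The app-* filter skips bootstrap.log, which shares the folder.
    Get-ChildItem -Path $LogDir -Filter 'app-*.log' -File |
        Sort-Object -Property LastWriteTime -Descending |
        Select-Object -First 1
}

# Usage: show the last 50 lines of the newest file, then follow it until Ctrl+C.
# $log = Get-LatestAppLog
# if ($log) { Get-Content -Path $log.FullName -Tail 50 -Wait }
```

The function returns nothing when the folder holds no matching files, which is why the usage lines check $log before reading.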

2. Crash files

Where crash files live

%LOCALAPPDATA%\InspectionPrototype\crashes\crash-<timestamp>.txt

One file is created per unhandled exception (UI thread, AppDomain, or unobserved task).

What a crash file contains

Section	Description
Exception	Full type name, message, and stack trace
Inner exception	Recursively included when present
Source	Which handler caught it: `UI`, `AppDomain`, or `UnobservedTask`
WorkflowState	Snapshot of `AppState.WorkflowState` at time of crash
ActiveRun	RunId, recipe name, scan-point progress (if run was active)
RecentDiagnostics	Last 50 diagnostics timeline entries
ProcessInfo	PID, start time, OS, runtime version

How to read one

powershell

Get-ChildItem "$env:LOCALAPPDATA\InspectionPrototype\crashes\crash-*.txt" |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 1 |
    Get-Content -TotalCount 80

After a crash the UI shows a non-modal yellow banner naming the crash file path. The operator can copy the path to the clipboard directly from the banner.

3. Live metrics with dotnet-counters

The application publishes seven counters under the InspectionPrototype meter using System.Diagnostics.Metrics (the modern .NET counters API).

Attaching a counters session

powershell

dotnet-counters monitor `
    --name InspectionPrototype.App `
    --counters InspectionPrototype,System.Runtime

dotnet-counters is part of the .NET diagnostic tools. Install with: dotnet tool install -g dotnet-counters

Counter reference

Counter name	Unit	Description
`frames.ingested`	frames	Frames successfully propagated to AppState
`frames.dropped`	frames	Frames dropped by the bounded channel (consumer lagging)
`telemetry.ingested`	samples	Telemetry snapshots successfully propagated to AppState
`telemetry.coalesced`	samples	Telemetry snapshots dropped (consumer lagging)
`runs.started`	runs	Workflow transitions into Running state
`runs.completed`	runs	Runs that reached terminal status Completed
`runs.faulted`	runs	Runs that reached terminal status Faulted

All counters are additive and never reset. They accumulate for the lifetime of the process and survive workflow transitions (e.g., multiple sequential runs increment the same counters).

Typical healthy output (after one completed run)

[InspectionPrototype]
    frames.ingested (Count / 1 sec)           2
    frames.dropped  (Count / 1 sec)           0
    telemetry.ingested (Count / 1 sec)        5
    telemetry.coalesced (Count / 1 sec)       0
    runs.started    (Count / 1 sec)           0
    runs.completed  (Count / 1 sec)           0
    runs.faulted    (Count / 1 sec)           0

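dotnet-counters shows each counter as its increase over the refresh interval (1 second by default), so a value returns to 0 once activity stops. The lifetime total of a counter is the sum of its per-interval values.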
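To make the relationship between per-interval values and lifetime totals concrete, the sketch below sums a series of deltas per counter. The sample objects and their Name and Delta properties are made up for illustration; they are not a dotnet-counters output format, and Get-CounterTotals is a hypothetical helper.

```powershell
function Get-CounterTotals {
    param([Parameter(Mandatory)] [object[]] $Samples)
    # Ordered dictionary keeps counters in first-seen order.
    $totals = [ordered]@{}
    foreach ($s in $Samples) {
        if (-not $totals.Contains($s.Name)) { $totals[$s.Name] = 0 }
        $totals[$s.Name] += $s.Delta
    }
    $totals
}

# Three one-second intervals of illustrative data.
$samples = @(
    [pscustomobject]@{ Name = 'frames.ingested'; Delta = 2 },
    [pscustomobject]@{ Name = 'runs.started';    Delta = 1 },
    [pscustomobject]@{ Name = 'frames.ingested'; Delta = 3 },
    [pscustomobject]@{ Name = 'runs.started';    Delta = 0 },
    [pscustomobject]@{ Name = 'frames.ingested'; Delta = 0 },
    [pscustomobject]@{ Name = 'runs.started';    Delta = 0 }
)

$totals = Get-CounterTotals -Samples $samples
$totals['frames.ingested']   # 5
$totals['runs.started']      # 1
```

The last interval shows 0 for both counters even though the totals are non-zero, which is the situation in the healthy output above for the runs.* counters.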

4. Single-instance guard

How it works

On startup — before Serilog configuration and before the host is built — the application acquires a named system mutex:

Global\InspectionPrototype-<first 16 hex chars of SHA-256("%LOCALAPPDATA%\InspectionPrototype")>

If the mutex is already owned by another process:

A line is appended to %LOCALAPPDATA%\InspectionPrototype\logs\bootstrap.log.
A MessageBox is shown: "Another instance is already running."
The process exits with code 1.

The mutex is released and disposed in OnExit, which the application runs on a clean shutdown. OnExit does not run when the process is killed; that case is covered under recovery below.

Recovery after a hard crash (mutex stuck)

If the application is hard-killed (for example with taskkill /F), Windows closes every handle the terminated process held, so the mutex is released and, once no other process holds a handle to it, destroyed. After a power loss or BSOD the machine restarts and no kernel objects survive. No manual cleanup is needed. If a second launch is still rejected after a hard crash, verify the first instance is truly gone:

powershell

Get-Process -Name InspectionPrototype.App -ErrorAction SilentlyContinue

If no process is found and the guard is still firing, a reboot will always clear any lingering kernel objects.
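Before rebooting, it can help to check whether the named mutex still exists. The sketch below derives the name as described under "How it works" and probes for it. Several details are assumptions the runbook does not confirm: that the hashed string is the expanded folder path with no trailing backslash, that it is encoded as UTF-8, and that the hex digits are upper-case. Kernel object names are case-sensitive, so if any of these differs from the application, the probe reports a false negative. Both function names are hypothetical.

```powershell
function Get-GuardMutexName {
    param(
        # ASSUMPTION: the application hashes this expanded path, UTF-8 encoded.
        [string] $DataDir = (Join-Path $env:LOCALAPPDATA 'InspectionPrototype')
    )
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        $hash = $sha.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($DataDir))
    } finally {
        $sha.Dispose()
    }
    # ASSUMPTION: upper-case hex; keep the first 16 hex characters (8 bytes).
    $hex = -join ($hash | ForEach-Object { $_.ToString('X2') })
    'Global\InspectionPrototype-' + $hex.Substring(0, 16)
}

function Test-GuardMutexExists {
    param([string] $Name = (Get-GuardMutexName))
    $mutex = $null
    # TryOpenExisting only opens a handle; it never acquires the mutex.
    if ([System.Threading.Mutex]::TryOpenExisting($Name, [ref] $mutex)) {
        $mutex.Dispose()
        return $true
    }
    return $false
}

# Usage:
# Test-GuardMutexExists
```

A result of True while Get-Process finds nothing suggests some other process still holds a handle to the mutex. A result of False while the guard still fires more likely means one of the naming assumptions above does not match the application.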

5. Note on phase-1 measurements

This observability surface (logs + crash files + metrics) exists to populate the phase-1-measurements table that tracks baseline system health before introducing specialized pipeline modules. The counters listed above are the minimum required by the acceptance criteria in SLICE-006; future phases will extend the InspectionPrototype meter with additional instruments as new subsystems are introduced.

For the procedure to capture a measurement row (tooling, scenario scripts, CSV extraction, where results get committed) see the capturing measurements runbook. The populated measurements live in docs/reviews/phase-1-measurements.md.

Refer to docs/implementation/ROADMAP.md for the planned expansion schedule.
