Observability Runbook
This runbook covers the structured logging, crash reporting, metrics, and single-instance guard introduced by TASK-006 (Observability Baseline).
1. Structured logs
Where logs live
Logs are written to rolling daily files under:
%LOCALAPPDATA%\InspectionPrototype\logs\app-<yyyyMMdd>.log
Each file rolls at midnight or at 10 MB (whichever comes first).
Retention policy
The seven most recent log files are kept; Serilog deletes older files automatically at roll time. Because a file also rolls at 10 MB, seven files can span fewer than seven days on a busy system.
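
To confirm the retention policy on a given machine, the log directory can be listed newest first. The following is a minimal sketch and not part of the application: the function name Get-AppLogFiles and its parameter are hypothetical, and the app-*.log filter assumes the file naming described above.

```powershell
# Hypothetical helper: list the application's rolling log files, newest first.
function Get-AppLogFiles {
    param(
        # Assumed default location, taken from the path documented in this runbook.
        [string]$LogDir = (Join-Path $env:LOCALAPPDATA 'InspectionPrototype\logs')
    )
    # The filter matches only files named app-*.log; other files in the
    # directory are ignored.
    Get-ChildItem -Path $LogDir -Filter 'app-*.log' -File -ErrorAction SilentlyContinue |
        Sort-Object -Property LastWriteTime -Descending |
        Select-Object -Property Name, Length, LastWriteTime
}
```

If retention works as described, `(Get-AppLogFiles | Measure-Object).Count` should report at most 7.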
Log levels
| Environment | Minimum level |
|---|---|
| Debugger attached | Debug |
| Normal launch | Information |
Every log entry is enriched with ThreadId and ProcessId.
Reading the log
The file is plain text, one structured event per line (when using the default Serilog text formatter). Use any text editor, or pipe through jq if the events are written as JSON. To show only warnings and errors from today's file in PowerShell:
Get-Content "$env:LOCALAPPDATA\InspectionPrototype\logs\app-$(Get-Date -f yyyyMMdd).log" |
    Select-String "Error|Warning|Critical|Fatal"
2. Crash files
Where crash files live
%LOCALAPPDATA%\InspectionPrototype\crashes\crash-<timestamp>.txt
One file is created per unhandled exception (UI thread, AppDomain, or unobserved task).
What a crash file contains
| Section | Description |
|---|---|
| Exception | Full type name, message, and stack trace |
| Inner exception | Recursively included when present |
| Source | Which handler caught it: UI, AppDomain, or UnobservedTask |
| WorkflowState | Snapshot of AppState.WorkflowState at time of crash |
| ActiveRun | RunId, recipe name, scan-point progress (if run was active) |
| RecentDiagnostics | Last 50 diagnostics timeline entries |
| ProcessInfo | PID, start time, OS, runtime version |
How to read one
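Get-Content "$env:LOCALAPPDATA\InspectionPrototype\crashes\crash-*.txt" |
    Select-Object -First 80
The wildcard matches every crash file, so the 80 lines are taken from the concatenation of all of them, not from the newest file alone.
After a crash the UI shows a non-modal yellow banner naming the crash file path. The operator can copy the path to the clipboard directly from the banner.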
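
When several crash files have accumulated, it is usually the newest one that matters. The sketch below picks it by last write time; the function name Get-LatestCrashFile and its parameter are hypothetical, and the crash-*.txt filter assumes the naming described in this section.

```powershell
# Hypothetical helper: return the most recently written crash file, if any.
function Get-LatestCrashFile {
    param(
        # Assumed default location, taken from the path documented in this runbook.
        [string]$CrashDir = (Join-Path $env:LOCALAPPDATA 'InspectionPrototype\crashes')
    )
    # Sort by write time rather than by name, because the exact
    # <timestamp> format in the file name is not specified here.
    Get-ChildItem -Path $CrashDir -Filter 'crash-*.txt' -File -ErrorAction SilentlyContinue |
        Sort-Object -Property LastWriteTime -Descending |
        Select-Object -First 1
}
```

To print the first 80 lines of that file, run `Get-LatestCrashFile | Get-Content -TotalCount 80`; the pipeline produces no output when no crash file exists.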
3. Live metrics with dotnet-counters
The application publishes seven counters under the InspectionPrototype meter using System.Diagnostics.Metrics (the modern .NET counters API).
Attaching a counters session
dotnet-counters monitor `
--name InspectionPrototype.App `
--counters InspectionPrototype,System.Runtime
dotnet-counters is part of the .NET diagnostic tools. Install with:
dotnet tool install -g dotnet-counters
Counter reference
| Counter name | Unit | Description |
|---|---|---|
| frames.ingested | frames | Frames successfully propagated to AppState |
| frames.dropped | frames | Frames dropped by the bounded channel (consumer lagging) |
| telemetry.ingested | samples | Telemetry snapshots successfully propagated to AppState |
| telemetry.coalesced | samples | Telemetry snapshots dropped (consumer lagging) |
| runs.started | runs | Workflow transitions into Running state |
| runs.completed | runs | Runs that reached terminal status Completed |
| runs.faulted | runs | Runs that reached terminal status Faulted |
All counters are additive and never reset. They accumulate for the lifetime of the process and survive workflow transitions (e.g., multiple sequential runs increment the same counters).
Typical healthy output (after one completed run)
[InspectionPrototype]
frames.ingested (Count / 1 sec) 2
frames.dropped (Count / 1 sec) 0
telemetry.ingested (Count / 1 sec) 5
telemetry.coalesced (Count / 1 sec) 0
runs.started (Count / 1 sec) 0
runs.completed (Count / 1 sec) 0
runs.faulted (Count / 1 sec) 0
Counters show the delta per interval; totals are the cumulative sum over time.
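
The relationship between the per-interval values and the cumulative total can be illustrated with a few lines of arithmetic. The readings below are invented for illustration and do not come from a real session.

```powershell
# Hypothetical per-interval readings for frames.ingested, one per refresh.
$deltas = 2, 3, 0, 4

# Accumulate the deltas to recover the running total after each interval.
$total = 0
$runningTotals = foreach ($delta in $deltas) {
    $total += $delta
    $total
}

$runningTotals -join ', '   # 2, 5, 5, 9
```

An interval that reads 0 therefore means nothing happened in that interval, not that the counter was reset. For longer sessions, `dotnet-counters collect` can write the readings to a file instead of the console.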
4. Single-instance guard
How it works
On startup, before Serilog configuration and before the host is built, the application acquires a named system mutex:
Global\InspectionPrototype-<first 16 hex chars of SHA-256("%LOCALAPPDATA%\InspectionPrototype")>
If the mutex is already owned by another process:
- A line is appended to %LOCALAPPDATA%\InspectionPrototype\logs\bootstrap.log.
- A MessageBox is shown: "Another instance is already running."
- The process exits with code 1.
The mutex is released and disposed in OnExit, which the application framework (not the OS) calls on a clean shutdown.
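
The mutex name described above can be approximated from PowerShell, which helps when looking for the object in a handle viewer. This is a sketch under stated assumptions, not the application's code: it assumes %LOCALAPPDATA% is expanded before hashing, that the path is encoded as UTF-8, and that the hex digits are lowercase. If any of these differ, the result will not match the real name. The function name Get-GuardMutexName is hypothetical.

```powershell
# Hypothetical helper: derive a guard mutex name from a data-root path.
function Get-GuardMutexName {
    param(
        # Assumption: the application hashes the expanded path, not the literal "%LOCALAPPDATA%" text.
        [string]$DataRoot = (Join-Path $env:LOCALAPPDATA 'InspectionPrototype')
    )
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        # Assumption: UTF-8 encoding of the path string.
        $bytes = [System.Text.Encoding]::UTF8.GetBytes($DataRoot)
        $hash = $sha.ComputeHash($bytes)
    }
    finally {
        $sha.Dispose()
    }
    # Assumption: lowercase hex; keep the first 16 characters (8 bytes).
    $hex = -join ($hash | ForEach-Object { $_.ToString('x2') })
    'Global\InspectionPrototype-' + $hex.Substring(0, 16)
}
```

Calling `Get-GuardMutexName` with no arguments uses the current user's local application data folder, so two different Windows users get two different names.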
Recovery after a hard crash (mutex stuck)
If the application is hard-killed (power loss, taskkill /F, BSOD), no manual cleanup is needed: Windows closes every handle owned by a terminated process, and a named mutex is destroyed once its last handle is closed. If a second launch is still rejected after a hard crash, verify that the first instance is truly gone:
Get-Process -Name InspectionPrototype.App -ErrorAction SilentlyContinue
If no process is found and the guard is still firing, a reboot will always clear any lingering kernel objects.
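
Before rebooting, it can be worth checking whether a mutex with the expected name still exists at all. The sketch below uses the .NET method Mutex.TryOpenExisting; the function name Test-NamedMutexExists is hypothetical, and the caller has to supply the exact mutex name, which this runbook specifies only as a formula.

```powershell
# Hypothetical helper: report whether a named mutex currently exists.
function Test-NamedMutexExists {
    param(
        [Parameter(Mandatory)]
        [string]$Name
    )
    $handle = $null
    # TryOpenExisting returns $false when no mutex with this name exists.
    # It can throw UnauthorizedAccessException if the mutex exists but
    # the caller lacks the required access rights.
    $exists = [System.Threading.Mutex]::TryOpenExisting($Name, [ref]$handle)
    if ($exists) {
        # Close our own handle only; this does not release the owner's lock.
        $handle.Dispose()
    }
    return $exists
}
```

A result of `$true` with no matching process in Get-Process suggests that some other process still holds a handle to the mutex; `$false` means the guard should let the next launch through.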
5. Note on phase-1 measurements
This observability surface (logs + crash files + metrics) exists to populate the phase-1-measurements table that tracks baseline system health before introducing specialized pipeline modules. The counters listed above are the minimum required by the acceptance criteria in SLICE-006; future phases will extend the InspectionPrototype meter with additional instruments as new subsystems are introduced.
For the procedure to capture a measurement row (tooling, scenario scripts, CSV extraction, where results get committed) see the capturing measurements runbook. The populated measurements live in docs/reviews/phase-1-measurements.md.
Refer to docs/implementation/ROADMAP.md for the planned expansion schedule.