
Observability Runbook

This runbook covers the structured logging, crash reporting, metrics, and single-instance guard introduced by TASK-006 (Observability Baseline).


1. Structured logs

Where logs live

Logs are written to rolling daily files under:

%LOCALAPPDATA%\InspectionPrototype\logs\app-<yyyyMMdd>.log

Each file rolls at midnight or at 10 MB (whichever comes first).

Retention policy

The seven most recent log files are kept; older files are deleted automatically by Serilog at roll time.
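To make the retention rule concrete, here is a minimal sketch of which files a seven-file limit would remove. It is not Serilog's implementation: the function name `files_to_prune` is hypothetical, and the sketch assumes the `app-<yyyyMMdd>.log` naming shown above so that a plain text sort is also a chronological sort.

```python
def files_to_prune(file_names, keep=7):
    """Return the log files that fall outside the `keep` most recent."""
    # The yyyyMMdd stamp in the name sorts chronologically as plain text.
    ordered = sorted(file_names)
    return ordered[:max(len(ordered) - keep, 0)]


# Ten consecutive daily files (hypothetical dates).
logs = [f"app-202401{day:02d}.log" for day in range(1, 11)]
print(files_to_prune(logs))
# ['app-20240101.log', 'app-20240102.log', 'app-20240103.log']
```

With seven or fewer files present the function returns an empty list, which matches the expectation that nothing is deleted until an eighth file appears.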

Log levels

| Environment | Minimum level |
| --- | --- |
| Debugger attached | Debug |
| Normal launch | Information |

Every log entry is enriched with ThreadId and ProcessId.

Reading the log

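With the default Serilog text formatter the file is plain text, one event per line (exception stack traces span additional lines). Open it in any text editor, or filter it with Select-String. The pattern below matches spelled-out level names; if the output template emits three-letter level tokens (Serilog's default file template writes WRN, ERR, and FTL), adjust the pattern accordingly: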

powershell
# The trailing wildcard also picks up files that rolled on size during the same day.
Get-Content "$env:LOCALAPPDATA\InspectionPrototype\logs\app-$(Get-Date -Format yyyyMMdd)*.log" |
    Select-String "Error|Warning|Critical"
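For machines where a scripted check is preferred, the same filter can be sketched in Python. The helper names `log_files_for` and `matching_lines` are hypothetical, and the sketch assumes that files rolled on size share the day's `app-<yyyyMMdd>` prefix; it makes no assumption about the line format beyond plain text.

```python
import re
from datetime import date
from pathlib import Path


def log_files_for(day, log_dir):
    """Return the log files for `day`, including any size-rolled siblings."""
    return sorted(Path(log_dir).glob(f"app-{day:%Y%m%d}*.log"))


def matching_lines(paths, pattern=r"Error|Warning|Critical"):
    """Yield lines from `paths` that match `pattern`, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)  # Select-String also ignores case by default
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if regex.search(line):
                    yield line.rstrip("\n")
```

A typical call would pass `date.today()` and the logs directory under `%LOCALAPPDATA%\InspectionPrototype`, then print each yielded line.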

2. Crash files

Where crash files live

%LOCALAPPDATA%\InspectionPrototype\crashes\crash-<timestamp>.txt

One file is created per unhandled exception (UI thread, AppDomain, or unobserved task).

What a crash file contains

| Section | Description |
| --- | --- |
| Exception | Full type name, message, and stack trace |
| Inner exception | Recursively included when present |
| Source | Which handler caught it: UI, AppDomain, or UnobservedTask |
| WorkflowState | Snapshot of AppState.WorkflowState at time of crash |
| ActiveRun | RunId, recipe name, scan-point progress (if run was active) |
| RecentDiagnostics | Last 50 diagnostics timeline entries |
| ProcessInfo | PID, start time, OS, runtime version |

How to read one

powershell
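# Show the first 80 lines of the most recent crash file only.
Get-ChildItem "$env:LOCALAPPDATA\InspectionPrototype\crashes\crash-*.txt" |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 1 | Get-Content -TotalCount 80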

After a crash the UI shows a non-modal yellow banner naming the crash file path. The operator can copy the path to the clipboard directly from the banner.
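When the banner is no longer visible, the most recent crash file can also be located by script. The sketch below is one way to do that; `newest_crash_file` and `head` are hypothetical helper names, and the only assumption about the files is the `crash-*.txt` naming given above.

```python
from pathlib import Path


def newest_crash_file(crash_dir):
    """Return the most recently modified crash-*.txt in `crash_dir`, or None."""
    candidates = list(Path(crash_dir).glob("crash-*.txt"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def head(path, line_count=80):
    """Return the first `line_count` lines of a text file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for _, line in zip(range(line_count), handle)]
```

Selecting by modification time avoids depending on the exact `<timestamp>` format in the file name, which this runbook does not specify.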


3. Live metrics with dotnet-counters

The application publishes seven counters under the InspectionPrototype meter using System.Diagnostics.Metrics (the modern .NET counters API).

Attaching a counters session

powershell
dotnet-counters monitor `
    --name InspectionPrototype.App `
    --counters InspectionPrototype,System.Runtime

dotnet-counters is part of the .NET diagnostic tools. Install with: dotnet tool install -g dotnet-counters

Counter reference

| Counter name | Unit | Description |
| --- | --- | --- |
| frames.ingested | frames | Frames successfully propagated to AppState |
| frames.dropped | frames | Frames dropped by the bounded channel (consumer lagging) |
| telemetry.ingested | samples | Telemetry snapshots successfully propagated to AppState |
| telemetry.coalesced | samples | Telemetry snapshots dropped (consumer lagging) |
| runs.started | runs | Workflow transitions into Running state |
| runs.completed | runs | Runs that reached terminal status Completed |
| runs.faulted | runs | Runs that reached terminal status Faulted |

All counters are additive and never reset. They accumulate for the lifetime of the process and survive workflow transitions (e.g., multiple sequential runs increment the same counters).
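To illustrate how the ingested and dropped counters relate, here is a small Python model of a bounded channel with a lagging consumer. It is a sketch, not the application's code (which is .NET): the class name `BoundedChannel` is hypothetical, and the drop policy shown (reject the incoming item when full) is an assumption, since this runbook does not state which item is discarded.

```python
from collections import deque


class BoundedChannel:
    """Fixed-capacity buffer that counts ingested and dropped items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.ingested = 0  # analogous to frames.ingested
        self.dropped = 0   # analogous to frames.dropped

    def try_write(self, item):
        """Producer side: reject the item when the buffer is full."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def read(self):
        """Consumer side: take the oldest item and count it as ingested."""
        item = self.items.popleft()
        self.ingested += 1
        return item


channel = BoundedChannel(capacity=2)
for frame in range(5):   # producer bursts while the consumer is stalled
    channel.try_write(frame)
while channel.items:     # consumer catches up
    channel.read()
print(channel.ingested, channel.dropped)  # 2 3
```

In this model a rising dropped count with a flat ingested count points at the consumer, not the producer, which is the reading the counter descriptions suggest.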

Typical healthy output (after one completed run)

[InspectionPrototype]
    frames.ingested (Count / 1 sec)           2
    frames.dropped  (Count / 1 sec)           0
    telemetry.ingested (Count / 1 sec)        5
    telemetry.coalesced (Count / 1 sec)       0
    runs.started    (Count / 1 sec)           0
    runs.completed  (Count / 1 sec)           0
    runs.faulted    (Count / 1 sec)           0

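dotnet-counters shows each counter as the change over the refresh interval (1 second by default), so the runs.* rows read 0 here even though a run has already completed. The lifetime total of a counter is the sum of its per-interval deltas.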
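The relationship between per-interval deltas and lifetime totals can be worked through in a few lines. The sample values below are hypothetical, not captured from the application, and `running_totals` and `drop_ratio` are illustrative helper names.

```python
from itertools import accumulate


def running_totals(deltas):
    """Turn per-interval deltas into cumulative totals."""
    return list(accumulate(deltas))


def drop_ratio(ingested_total, dropped_total):
    """Fraction of items dropped; 0.0 when nothing has been produced yet."""
    produced = ingested_total + dropped_total
    return dropped_total / produced if produced else 0.0


# Hypothetical one-second samples noted from the dotnet-counters display.
ingested_deltas = [0, 3, 4, 2, 0]
dropped_deltas = [0, 0, 1, 0, 0]

ingested = running_totals(ingested_deltas)
dropped = running_totals(dropped_deltas)
print(ingested)                               # [0, 3, 7, 9, 9]
print(drop_ratio(ingested[-1], dropped[-1]))  # 0.1
```

A delta of 0 in the last interval therefore says nothing about the total, which is why an idle application after a completed run still shows zeros.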


4. Single-instance guard

How it works

On startup — before Serilog configuration and before the host is built — the application acquires a named system mutex:

Global\InspectionPrototype-<first 16 hex chars of SHA-256("%LOCALAPPDATA%\InspectionPrototype")>

If the mutex is already owned by another process:

  1. A line is appended to %LOCALAPPDATA%\InspectionPrototype\logs\bootstrap.log.
  2. A MessageBox is shown: "Another instance is already running."
  3. The process exits with code 1.

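The mutex is released and disposed in OnExit, which the application framework (not the OS) invokes on a clean shutdown. OnExit does not run when the process is killed; that case is covered in the recovery notes that follow.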
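The naming scheme described above can be sketched as follows. This is an illustration of the scheme, not a way to reproduce the exact name: the text encoding (UTF-8 here), the use of the expanded path string, and the lowercase hex form are all assumptions that this runbook does not confirm, and `mutex_name` is a hypothetical helper.

```python
import hashlib


def mutex_name(data_dir):
    """Derive a mutex name from the data directory path.

    Assumes UTF-8 encoding of the expanded path and lowercase hex output.
    """
    digest = hashlib.sha256(data_dir.encode("utf-8")).hexdigest()
    return f"Global\\InspectionPrototype-{digest[:16]}"


# Hypothetical expanded %LOCALAPPDATA% paths for two different users.
first = mutex_name(r"C:\Users\operator\AppData\Local\InspectionPrototype")
second = mutex_name(r"C:\Users\service\AppData\Local\InspectionPrototype")
print(first != second)  # True
```

Because the hash input is the per-user data directory, two Windows accounts get different mutex names, so under this scheme the guard limits instances per data directory and not per machine.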

Recovery after a hard crash (mutex stuck)

If the application is hard-killed (for example with taskkill /F), Windows closes every handle the process held; the mutex is marked abandoned and is destroyed once its last handle closes. After a power loss or BSOD, no kernel objects survive the restart. In both cases no manual cleanup is needed. If a second launch is still rejected after a hard crash, verify the first instance is truly gone:

powershell
Get-Process -Name InspectionPrototype.App -ErrorAction SilentlyContinue

If no process is found and the guard is still firing, a reboot will always clear any lingering kernel objects.


5. Note on phase-1 measurements

This observability surface (logs + crash files + metrics) exists to populate the phase-1-measurements table that tracks baseline system health before introducing specialized pipeline modules. The counters listed above are the minimum required by the acceptance criteria in SLICE-006; future phases will extend the InspectionPrototype meter with additional instruments as new subsystems are introduced.

For the procedure to capture a measurement row (tooling, scenario scripts, CSV extraction, where results get committed) see the capturing measurements runbook. The populated measurements live in docs/reviews/phase-1-measurements.md.

Refer to docs/implementation/ROADMAP.md for the planned expansion schedule.
