TASK-006: Implement Observability Baseline
- Status: Proposed
- Date: 2026-04-22
- Spec: SLICE-006: Observability Baseline
- Depends on: TASK-005: Implement CI and Quality Gates
Objective
Add the minimum observability surface the application needs to be honest under load: structured logging with a rolling file sink, three unhandled-exception handlers, a metrics meter with starter counters, and a single-instance guard.
Scope
- add Serilog (or equivalent) wiring under the existing Microsoft.Extensions.Logging host; file sink, console sink, and a diagnostics-timeline bridge for Warning and above
- register DispatcherUnhandledException, AppDomain.UnhandledException, and TaskScheduler.UnobservedTaskException handlers during host startup
- define a shared crash-log writer that captures exception details plus current workflow and active-run state
- add a System.Diagnostics.Metrics.Meter named InspectionPrototype with the starter counter set defined in the spec
- wire starter counters into the existing pipeline services (frame, telemetry, workflow) so they increment under real runs
- add a single-instance mutex keyed on the data directory; second-launch attempts exit with a non-zero exit code and a log entry
- add a short runbook entry under docs/runbook/ describing log paths, crash-log location, and dotnet-counters usage
Non-Scope
- OpenTelemetry exporters or external collectors
- log forwarding to any external service
- alerting, dashboards, paging
- crash-reporter UX beyond a log entry and non-blocking UI banner
- distributed tracing
- changes to the DiagnosticsEntries cap or schema beyond what the bridge requires
- additional operational counters beyond the starter set; later phases add their own
AI Tool Guidance
This task is larger than TASK-005 and is best handled in three focused passes rather than one prompt:
- Logging and diagnostics bridge: introduce Serilog wiring, file sink, console sink, and the DiagnosticsEntries bridge. Verify existing tests pass; add new tests for the bridge.
- Unhandled exception handling and crash log: register the three handlers, implement the crash-log writer, and verify each handler fires via a debug-only test hook. Include a test that the workflow transitions to Faulted on UI-thread crash.
- Metrics, single-instance guard, and runbook: add the meter and starter counters, wire them into pipeline services, add the mutex, and write the runbook. Verify counters increment under an end-to-end scripted run.
Keep each pass as its own commit so regressions are bisectable. Do not combine unhandled-exception work with metrics work in the same change.
Acceptance Criteria Mapping
The implementation must satisfy all acceptance criteria from SLICE-006.
Copilot Agent Prompts
This task is larger than TASK-005 and is best split across three separate Copilot sessions, one per pass, so each pass gets crisp context and a distinct review gate. Do not paste all three prompts into a single session.
- Pass 1: logging + diagnostics bridge
- Pass 2: unhandled exception handlers + crash log
- Pass 3: metrics + single-instance mutex + runbook
After each pass, review the commit, run dotnet test, and only then kick off the next session.
Pass 1: Logging and diagnostics bridge
You are implementing Pass 1 of TASK-006 in this repository: wire up structured
logging via Serilog, with a file sink, a console sink, and a bridge that routes
Warning-and-above log events into the existing DiagnosticsEntries timeline.
## Authoritative references
Read these before making changes:
- docs/specs/SLICE-006-observability-baseline.md (the requirements)
- docs/tasks/TASK-006-implement-observability-baseline.md (the objective and AI guidance)
- src/InspectionPrototype.App/App.xaml.cs (current host bootstrap — no logging yet)
- src/InspectionPrototype.Application/State/AppState.cs (DiagnosticsEntries lives here)
- src/InspectionPrototype.Application/Services/AppStateExtensions.cs (WithDiagnosticsEntry)
The spec's Acceptance Criteria items 1 and 2 are the definition of done for this pass.
## Scope of this pass
Logging only. Do NOT touch unhandled exception handlers, metrics, or the single-instance
mutex — those are Passes 2 and 3.
## Deliverables
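1. Add Serilog packages to Directory.Packages.props and reference them from
src/InspectionPrototype.App/InspectionPrototype.App.csproj (the App project owns the
wiring; the Application project needs only the core Serilog package, for the sink in
item 4; everything else keeps using Microsoft.Extensions.Logging.Abstractions):
- Serilog.Extensions.Hosting (latest stable)
- Serilog.Sinks.File (latest stable)
- Serilog.Sinks.Console (latest stable)
- Serilog.Settings.Configuration (latest stable)
- Serilog.Enrichers.Thread and Serilog.Enrichers.Process (latest stable; they provide
the ThreadId and ProcessId enrichers required in item 2)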
2. Serilog configuration in App.xaml.cs OnStartup, as part of building the host (the
diagnostics sink in item 4 takes a service from the container, so the final
configuration cannot be completed before the host builder exists):
- file sink rolling daily, size-limited to 10 MB per file, 7 files retained
- path: Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "InspectionPrototype", "logs", "app-.log")
- console sink active only when Debugger.IsAttached
- minimum level: Information (Debug when Debugger.IsAttached)
- enrich with ThreadId and ProcessId
3. Bridge Serilog into the Microsoft.Extensions.Logging host via UseSerilog() on the
HostBuilder, using the overload whose callback receives the IServiceProvider.
Existing ILogger<T> call sites must continue to work unchanged.
4. Create a new custom Serilog sink that bridges to AppState:
- file: src/InspectionPrototype.Application/Services/DiagnosticsTimelineSink.cs
- depends on IAppStateStore (passed to the constructor)
- filters at LogEventLevel.Warning or higher
- appends a DiagnosticsEntry via the existing WithDiagnosticsEntry extension
- uses the existing 200-cap behavior in DiagnosticsEntries; do not change the cap
- register the sink instance via .WriteTo.Sink(...) in the Serilog config, passing a
DiagnosticsTimelineSink built with the IAppStateStore resolved from the container
(the generic .WriteTo.Sink<T>() overload requires a parameterless constructor, so it
cannot supply the store)
5. Ensure the App project's .csproj references Application for the sink type. The sink
lives in Application because IAppStateStore lives there; the Serilog wiring lives in App.
6. Add tests under tests/InspectionPrototype.Tests/DiagnosticsTimelineSinkTests.cs:
- log at Warning → DiagnosticsEntries grows by 1
- log at Information → DiagnosticsEntries unchanged
- log at Error → DiagnosticsEntries grows by 1 with Severity mapped appropriately
- do not test Serilog itself — only the bridge behavior
## Constraints
- Do NOT replace or suppress the existing ILogger<T> call sites. Serilog plugs into
Microsoft.Extensions.Logging; existing code is unaware of the swap.
- Do NOT modify the DiagnosticsEntries cap or the DiagnosticsEntry record shape.
The sink must conform to what exists today.
- Do NOT add log calls to business code for the sake of this pass. The goal is
wiring, not retroactively instrumenting every service.
- Do NOT touch src/InspectionPrototype.Infrastructure. The Application project already
has the abstractions we need.
## Verification before you report done
dotnet build --configuration Release (zero warnings, zero errors)
dotnet test --configuration Release (all tests pass, including new sink tests)
Then run the app manually and confirm:
- a log file appears under %LOCALAPPDATA%\InspectionPrototype\logs\app-<date>.log
- connecting, loading a recipe, and starting a run each produce log lines in the file
- triggering a fault (via existing fault injection UI) produces a DiagnosticsEntry
visible in the diagnostics pane AND a line in the log file
## Report format when finished
- files created and files modified
- confirmation that all existing tests still pass
- the path of a log file produced during manual verification
- a single commit hash
- commit message: "feat(obs): add structured logging with Serilog and diagnostics-timeline bridge (pass 1/3 of TASK-006)"
Pass 2: Unhandled exception handlers and crash log
You are implementing Pass 2 of TASK-006. Pass 1 (Serilog logging) is already merged;
this pass adds the three unhandled-exception handlers and the crash-log writer.
## Authoritative references
Read these before making changes:
- docs/specs/SLICE-006-observability-baseline.md (the requirements — items 3, 4, 5)
- docs/tasks/TASK-006-implement-observability-baseline.md
- src/InspectionPrototype.App/App.xaml.cs (where the handlers get registered)
- src/InspectionPrototype.Application/Services/WorkflowService.cs (how Faulted transitions work today)
- src/InspectionPrototype.Application/State/AppState.cs (what state to snapshot in crash logs)
Confirm that Pass 1's Serilog wiring is already in place before starting. If it is
not, stop and alert the operator.
## Scope of this pass
Unhandled exception handling and crash logs only. Do NOT touch metrics or the
single-instance mutex — that is Pass 3.
## Deliverables
1. A new ICrashReporter abstraction and implementation:
- interface: src/InspectionPrototype.Application/Diagnostics/ICrashReporter.cs
Task ReportAsync(Exception exception, string source, CancellationToken ct = default);
- implementation: src/InspectionPrototype.Application/Diagnostics/CrashReporter.cs
depends on IAppStateStore and ILogger<CrashReporter>
writes a crash file at:
%LOCALAPPDATA%\InspectionPrototype\crashes\crash-{yyyy-MM-ddTHH-mm-ssZ}.txt
crash file contents (plain text, human-readable):
- timestamp (UTC)
- source tag ("UI", "AppDomain", "UnobservedTask")
- full exception, including inner exceptions and stack traces
- current WorkflowState, ActiveRun id, and recipe name from AppState snapshot
- last 50 DiagnosticsEntries from AppState snapshot
- process info: PID, uptime, working set
also logs at LogLevel.Critical with the same context so the log file captures it
- register both via AddApplicationServices() in
src/InspectionPrototype.Application/ApplicationServiceCollectionExtensions.cs
2. Register the three handlers in App.xaml.cs OnStartup, AFTER Host.StartAsync()
completes (so the DI container is available):
- this.DispatcherUnhandledException: mark e.Handled = true, call
crashReporter.ReportAsync(e.Exception, "UI"), append a diagnostics entry,
transition any active run to Faulted via IWorkflowService (use whatever
method exists today — do not invent a new one), surface a non-blocking
banner (see item 3 below). Do NOT terminate the process.
- AppDomain.CurrentDomain.UnhandledException: call crashReporter.ReportAsync
synchronously with a short timeout (3 seconds). Process is terminating;
best-effort is fine. Let it terminate after reporting.
- TaskScheduler.UnobservedTaskException: call crashReporter.ReportAsync
with source "UnobservedTask", call e.SetObserved(), append a diagnostics
entry. Do NOT terminate the process.
3. A new "crash banner" field in AppState and a corresponding UI surface:
- add a nullable field to AppState: CrashBannerState? CrashBanner
- record: record CrashBannerState(string Message, string CrashFilePath, DateTimeOffset OccurredAt)
- add a corresponding AppState extension: WithCrashBanner / WithClearedCrashBanner
- in MainWindow.xaml (or whatever the current main layout is), add a non-blocking
banner row at the top that binds to the new AppState field via MainViewModel
and is visible when CrashBanner is not null; includes a "Copy path" button
and a "Dismiss" button. Do not make it modal.
- the banner text format: "A background error occurred. Crash log: {path}"
4. Preserve run-history integrity on AppDomain exit:
- confirm the existing JsonRunHistoryStore writes via temp-file-then-move
(it already does per the summary) — no change needed, but add a comment
in the crash handler acknowledging this and do NOT attempt to flush
anything in-flight from the handler itself.
5. Tests under tests/InspectionPrototype.Tests:
- CrashReporterTests.cs: given a test IAppStateStore with a seeded state,
ReportAsync writes a file with the expected sections. Use a tempdir.
- Do NOT test DispatcherUnhandledException wiring directly (requires a UI
thread); instead, test that CrashReporter handles each source tag
correctly.
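For the AppDomain.UnhandledException path in deliverable 2, one way to bound a blocking report call is sketched below. This is an illustration under assumptions, not the required implementation: TryReport is a hypothetical helper, and the delegates stand in for crashReporter.ReportAsync. It applies only to the terminating-process path; the dispatcher handler must not block.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs an async report and blocks for at most `timeout`.
// Returns true only if the report finished inside the window.
bool TryReport(Func<CancellationToken, Task> report, TimeSpan timeout)
{
    // The token lets a cooperative reporter stop its own work at the deadline.
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        // Task.Wait(TimeSpan) returns false when the wait times out.
        return report(cts.Token).Wait(timeout);
    }
    catch (AggregateException)
    {
        // Cancellation or a reporter failure; best-effort means we give up quietly.
        return false;
    }
}

// Stand-ins for ReportAsync: one that finishes quickly, one that never finishes.
bool fast = TryReport(ct => Task.Delay(10, ct), TimeSpan.FromSeconds(3));
bool slow = TryReport(ct => Task.Delay(Timeout.Infinite, ct), TimeSpan.FromMilliseconds(100));

Console.WriteLine($"fast completed: {fast}, slow completed: {slow}");
```

Whether the wait times out or the token cancels first, the slow case reports failure and the caller proceeds, which matches the best-effort intent stated for the AppDomain handler.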
## Constraints
- Do NOT swallow exceptions without writing them to both the log and a crash
file. Every handler must produce both artifacts.
- Do NOT use MessageBox for the UI surface — the spec requires a non-blocking
banner. Modal dialogs are forbidden.
- Do NOT add a crash-uploader, opt-in dialog, or issue-filer. Out of scope.
- Do NOT call crashReporter.ReportAsync(...).Wait() from the dispatcher handler;
use fire-and-forget with a 3-second timeout.
- Do NOT modify the existing JsonRunHistoryStore atomic-write logic.
## Verification before you report done
dotnet build --configuration Release
dotnet test --configuration Release
Manual verification steps documented for the operator (write these into the
runbook file created in Pass 3 — for now, describe them in the report):
- force a UI-thread exception via a debug-only "crash me" button (add one
under #if DEBUG in MainWindow for testing; ok to commit if behind #if DEBUG)
- confirm a crash file appears under %LOCALAPPDATA%\InspectionPrototype\crashes\
- confirm the banner appears in the UI with the crash-log path
- confirm the process does not terminate
## Report format when finished
- files created and files modified
- confirmation that all existing tests still pass plus new CrashReporter tests
- the path of a crash file produced during manual verification
- a single commit hash
- commit message: "feat(obs): add unhandled exception handlers and crash reporter (pass 2/3 of TASK-006)"
Pass 3: Metrics, single-instance mutex, runbook
You are implementing Pass 3 of TASK-006, the final pass. Passes 1 and 2 are
already merged; this pass adds the metrics meter, wires counters into pipeline
services, adds the single-instance mutex, and writes the runbook.
## Authoritative references
Read these before making changes:
- docs/specs/SLICE-006-observability-baseline.md (items 6, 7, 9 of acceptance criteria)
- docs/tasks/TASK-006-implement-observability-baseline.md
- src/InspectionPrototype.Application/Services/FramePipelineService.cs
- src/InspectionPrototype.Application/Services/TelemetryPipelineService.cs
- src/InspectionPrototype.Application/Services/WorkflowService.cs
- src/InspectionPrototype.App/App.xaml.cs (mutex goes here, before host build)
## Scope of this pass
Metrics meter + starter counters + single-instance mutex + runbook. Nothing
else. Do NOT add log calls or crash behavior — Passes 1 and 2 own those.
## Deliverables
1. A new metrics abstraction exposed from Application:
- file: src/InspectionPrototype.Application/Diagnostics/AppMetrics.cs
- public class AppMetrics : IDisposable (counters are assigned in the constructor; a C# property initializer cannot reference the instance field _meter)
private readonly Meter _meter = new("InspectionPrototype");
public AppMetrics()
{
FramesIngested = _meter.CreateCounter<long>("frames.ingested");
FramesDropped = _meter.CreateCounter<long>("frames.dropped");
TelemetryIngested = _meter.CreateCounter<long>("telemetry.ingested");
TelemetryCoalesced = _meter.CreateCounter<long>("telemetry.coalesced");
RunsStarted = _meter.CreateCounter<long>("runs.started");
RunsCompleted = _meter.CreateCounter<long>("runs.completed");
RunsFaulted = _meter.CreateCounter<long>("runs.faulted");
}
public Counter<long> FramesIngested { get; }
public Counter<long> FramesDropped { get; }
public Counter<long> TelemetryIngested { get; }
public Counter<long> TelemetryCoalesced { get; }
public Counter<long> RunsStarted { get; }
public Counter<long> RunsCompleted { get; }
public Counter<long> RunsFaulted { get; }
public void Dispose() => _meter.Dispose();
- register as Singleton in AddApplicationServices()
2. Wire counters into the three services (constructor-inject AppMetrics):
- FramePipelineService: .FramesIngested.Add(1) on each frame successfully
propagated to AppState; .FramesDropped.Add(n) when the bounded-channel
drop counter advances (n is the delta since the last reading of that drop counter).
- TelemetryPipelineService: .TelemetryIngested.Add(1) on each snapshot
propagated; .TelemetryCoalesced.Add(n) when the coalesce counter advances.
- WorkflowService: .RunsStarted.Add(1) on transition into Running;
.RunsCompleted.Add(1) on successful completion; .RunsFaulted.Add(1) on
fault transition.
Counters only go up. Do not decrement. Do not reset on workflow transitions.
3. Single-instance mutex in App.xaml.cs OnStartup, BEFORE anything else
(before Serilog config, before Host building):
- compute a data-directory-scoped mutex name:
$"Global\\InspectionPrototype-{Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(dataDir)))[..16]}"
where dataDir is %LOCALAPPDATA%\InspectionPrototype
- attempt new Mutex(initiallyOwned: true, name, out bool createdNew);
if !createdNew, log a message (to Serilog, if Pass 1 config already ran;
otherwise to a bootstrap log file under %LOCALAPPDATA%\InspectionPrototype\logs\bootstrap.log),
show a one-line MessageBox ("Another instance is already running."),
and Environment.Exit(1).
- hold the mutex for the process lifetime; release on OnExit.
4. Runbook at docs/runbook/observability.md:
- where logs live, rolling and retention policy
- where crash files live, what is in them, how to read one
- how to attach a counters session:
dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype,System.Runtime
- how the single-instance guard works, how to recover if the mutex is stuck
after a hard crash (it auto-releases on process exit; no manual cleanup needed)
- one-paragraph note on the phase-1-measurements.md table this observability
surface exists to feed
5. Tests under tests/InspectionPrototype.Tests:
- AppMetricsTests.cs: creating an AppMetrics instance exposes all seven
counters; counters are reachable via the System.Diagnostics.Metrics API
by name; disposing disposes the underlying Meter.
- Do NOT test mutex behavior — integration-only, documented as manual.
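The "reachable by name" check in the AppMetrics tests can be approached with a MeterListener, as in the minimal sketch below. It is an assumption-laden illustration: a local Meter and a single counter stand in for AppMetrics, and the dictionary is a hypothetical test accumulator.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Stand-in for AppMetrics: same meter name and one of the seven counter names.
using var meter = new Meter("InspectionPrototype");
Counter<long> framesIngested = meter.CreateCounter<long>("frames.ingested");

// Hypothetical accumulator: running total per instrument name.
var observed = new Dictionary<string, long>();

using var listener = new MeterListener();

// Subscribe only to instruments published by the meter under test.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "InspectionPrototype")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Invoked synchronously on every Counter<long>.Add call for enabled instruments.
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    observed.TryGetValue(instrument.Name, out long total);
    observed[instrument.Name] = total + value;
});

// Start() also publishes instruments that were created before the listener.
listener.Start();

framesIngested.Add(1);
framesIngested.Add(2);

Console.WriteLine($"frames.ingested total: {observed["frames.ingested"]}");
```

A real test would construct AppMetrics instead of the local meter and assert that all seven names show up in the accumulator after one Add per counter.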
## Constraints
- Do NOT add new counters beyond the seven listed above. Future phases will
extend them as needed.
- Do NOT use System.Diagnostics.PerformanceCounter (the old Win32 API). Use
System.Diagnostics.Metrics (the modern .NET counters API).
- Do NOT reset counters on workflow transitions. Additive only.
- Do NOT expose AppMetrics through a public static — inject via DI only.
- Do NOT use the mutex to coordinate anything other than single-instance
launch. It is a guard, not a lock.
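The single-instance guard from deliverable 3 can be sketched as follows. This is a rough illustration, not the final OnStartup code: MutexNameFor is a hypothetical helper, and the second Mutex in the same process only simulates a second launch. The real guard would log, show the message, and call Environment.Exit(1) where the comment indicates.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

// Hypothetical helper: derives the mutex name from the data directory,
// using the first 16 hex characters of its SHA-256 hash.
string MutexNameFor(string dir) =>
    $"Global\\InspectionPrototype-{Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(dir)))[..16]}";

string dataDir = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    "InspectionPrototype");

string name = MutexNameFor(dataDir);

// First "launch": creates the named mutex, so createdFirst is true
// as long as no other process already holds this name.
using var first = new Mutex(initiallyOwned: true, name, out bool createdFirst);

// Simulated second "launch": the name already exists, so createdSecond is false.
using var second = new Mutex(initiallyOwned: true, name, out bool createdSecond);

if (!createdSecond)
{
    // In the app: write the log entry, show the one-line message, Environment.Exit(1).
    Console.WriteLine("Another instance is already running.");
}

Console.WriteLine($"name: {name}");
```

Hashing the directory keeps the name short and free of path separators, which named mutexes do not accept after the Global\ prefix.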
## Verification before you report done
dotnet build --configuration Release
dotnet test --configuration Release
Manual verification:
- launch the app
- open a second terminal and run:
dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype,System.Runtime
confirm all seven counters appear
- connect, start a run; confirm frames.ingested, telemetry.ingested, and
runs.started all increment live
- attempt to launch a second copy of the app; confirm it exits with code 1
and a log entry appears
## Report format when finished
- files created and files modified
- confirmation that all existing tests still pass plus new AppMetrics tests
- a screenshot or copy-paste of the dotnet-counters output showing counters > 0
- confirmation that second-launch is blocked
- a single commit hash
- commit message: "feat(obs): add metrics meter, starter counters, and single-instance guard (pass 3/3 of TASK-006)"
Operator notes
- One pass per Copilot session. Start a fresh chat per pass. Do not feed all three prompts into a single agent session — the context bloat will degrade pass 3.
- Review and commit between passes. Each pass ends with a single commit message template. Run dotnet test locally and confirm the counters/logs/crash behavior manually before kicking off the next session.
- Pass 2 is the riskiest. Unhandled-exception wiring interacts with the dispatcher, DI lifetimes, and the workflow state machine in non-obvious ways. If Pass 2 feels off after one round with the agent, bail and write the handlers yourself; the prompts in Passes 1 and 3 will still be usable.
- After Pass 3, capture the demo baseline. Run the app for 10 minutes with a short scripted scenario, save the dotnet-counters output to docs/captures/demo-baseline-<date>.csv, and add row 0 to docs/reviews/phase-1-measurements.md. That is the reference Phase 1 gets measured against.