TASK-006: Implement Observability Baseline

Objective

Add the minimum observability surface the application needs to be honest under load: structured logging with a rolling file sink, three unhandled-exception handlers, a metrics meter with starter counters, and a single-instance guard.

Scope

  • add Serilog (or equivalent) wiring under the existing Microsoft.Extensions.Logging host; file sink, console sink, and a diagnostics-timeline bridge for Warning and above
  • register DispatcherUnhandledException, AppDomain.UnhandledException, and TaskScheduler.UnobservedTaskException handlers during host startup
  • define a shared crash-log writer that captures exception details plus current workflow and active-run state
  • add a System.Diagnostics.Metrics.Meter named InspectionPrototype with the starter counter set defined in the spec
  • wire starter counters into the existing pipeline services (frame, telemetry, workflow) so they increment under real runs
  • add a single-instance mutex keyed on the data directory; second-launch attempts exit with a non-zero exit code and a log entry
  • add a short runbook entry under docs/runbook/ describing log paths, crash-log location, and dotnet-counters usage
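
As an illustration of the single-instance guard listed above, here is a minimal sketch. It assumes the data directory is a plain path string; the `SingleInstanceGuard` class and its method names are hypothetical and do not come from the spec.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

// Hypothetical helper; the class and method names are illustrative only.
public static class SingleInstanceGuard
{
    // Derives a stable mutex name from the data directory so that two
    // launches against the same directory contend for the same mutex.
    public static string MutexNameFor(string dataDir)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(dataDir));
        return $"Global\\InspectionPrototype-{Convert.ToHexString(hash)[..16]}";
    }

    // Returns the owned mutex, or null when another holder already exists.
    public static Mutex? TryAcquire(string dataDir)
    {
        var mutex = new Mutex(initiallyOwned: true, MutexNameFor(dataDir), out bool createdNew);
        if (createdNew)
        {
            return mutex; // caller keeps it alive for the process lifetime
        }

        mutex.Dispose(); // we only opened a handle; we never owned it
        return null;
    }
}
```

A caller would keep the returned mutex in a field for the process lifetime and exit with a non-zero code when `TryAcquire` returns null.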

Non-Scope

  • OpenTelemetry exporters or external collectors
  • log forwarding to any external service
  • alerting, dashboards, paging
  • crash-reporter UX beyond a log entry and non-blocking UI banner
  • distributed tracing
  • changes to DiagnosticsEntries cap or schema beyond what the bridge requires
  • additional operational counters beyond the starter set — later phases add their own

AI Tool Guidance

This task is larger than TASK-005 and is best handled in three focused passes rather than one prompt:

  1. Logging and diagnostics bridge — introduce Serilog wiring, file sink, console sink, and the DiagnosticsEntries bridge. Verify existing tests pass; add new tests for the bridge.
  2. Unhandled exception handling and crash log — register the three handlers, implement the crash-log writer, and verify each handler fires via a debug-only test hook. Include a test that the workflow transitions to Faulted on UI-thread crash.
  3. Metrics, single-instance guard, and runbook — add the meter and starter counters, wire them into pipeline services, add the mutex, and write the runbook. Verify counters increment under an end-to-end scripted run.

Keep each pass as its own commit so regressions are bisectable. Do not combine unhandled-exception work with metrics work in the same change.

Acceptance Criteria Mapping

The implementation must satisfy all acceptance criteria from SLICE-006.

Copilot Agent Prompts

Run each pass in its own Copilot session so that each one gets clean context and a distinct review gate. Do not paste all three prompts into a single session.

  • Pass 1: logging + diagnostics bridge
  • Pass 2: unhandled exception handlers + crash log
  • Pass 3: metrics + single-instance mutex + runbook

After each pass, review the commit, run dotnet test, and only then kick off the next session.

Pass 1 — Logging and diagnostics bridge

You are implementing Pass 1 of TASK-006 in this repository: wire up structured
logging via Serilog, with a file sink, a console sink, and a bridge that routes
Warning-and-above log events into the existing DiagnosticsEntries timeline.

## Authoritative references

Read these before making changes:
- docs/specs/SLICE-006-observability-baseline.md   (the requirements)
- docs/tasks/TASK-006-implement-observability-baseline.md   (the objective and AI guidance)
- src/InspectionPrototype.App/App.xaml.cs   (current host bootstrap — no logging yet)
- src/InspectionPrototype.Application/State/AppState.cs   (DiagnosticsEntries lives here)
- src/InspectionPrototype.Application/Services/AppStateExtensions.cs   (WithDiagnosticsEntry)

The spec's Acceptance Criteria items 1 and 2 are the definition of done for this pass.

## Scope of this pass

Logging only. Do NOT touch unhandled exception handlers, metrics, or the single-instance
mutex — those are Passes 2 and 3.

## Deliverables

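1. Add Serilog packages to Directory.Packages.props and reference them from
   src/InspectionPrototype.App/InspectionPrototype.App.csproj (the App project owns the
   Serilog wiring; existing call sites keep using Microsoft.Extensions.Logging.Abstractions):
     - Serilog.Extensions.Hosting           (latest stable)
     - Serilog.Sinks.File                   (latest stable)
     - Serilog.Sinks.Console                (latest stable)
     - Serilog.Settings.Configuration       (latest stable)
     - Serilog.Enrichers.Thread             (latest stable; provides the ThreadId enricher)
     - Serilog.Enrichers.Process            (latest stable; provides the ProcessId enricher)
   The Application project additionally needs a reference to the core Serilog package,
   because DiagnosticsTimelineSink (deliverable 4) implements Serilog's ILogEventSink.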

2. Serilog configuration in App.xaml.cs OnStartup, BEFORE Host.CreateDefaultBuilder():
     - file sink rolling daily and on size (10 MB per file, with rollOnFileSizeLimit enabled), 7 files retained
     - path: Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + "\InspectionPrototype\logs\app-.log"
     - console sink active only when Debugger.IsAttached
     - minimum level: Information (Debug when Debugger.IsAttached)
     - enrich with ThreadId and ProcessId

3. Bridge Serilog into the Microsoft.Extensions.Logging host via UseSerilog() on the
   HostBuilder. Existing ILogger<T> call sites must continue to work unchanged.

4. Create a new custom Serilog sink that bridges to AppState:
     - file: src/InspectionPrototype.Application/Services/DiagnosticsTimelineSink.cs
     - depends on IAppStateStore (resolved at construction)
     - filters at LogEventLevel.Warning or higher
     - appends a DiagnosticsEntry via the existing WithDiagnosticsEntry extension
     - uses the existing 200-cap behavior in DiagnosticsEntries — do not change the cap
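     - register the sink instance in the Serilog config via .WriteTo.Sink(sink, LogEventLevel.Warning),
       resolving it from DI (for example in the UseSerilog overload that receives IServiceProvider);
       the generic .WriteTo.Sink<T>() overload requires a parameterless constructor and so
       cannot supply IAppStateStore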

5. Ensure the App project's .csproj references Application for the sink type. The sink
   lives in Application because IAppStateStore lives there; the Serilog wiring lives in App.

6. Add tests under tests/InspectionPrototype.Tests/DiagnosticsTimelineSinkTests.cs:
     - log at Warning → DiagnosticsEntries grows by 1
     - log at Information → DiagnosticsEntries unchanged
     - log at Error → DiagnosticsEntries grows by 1 with Severity mapped appropriately
     - do not test Serilog itself — only the bridge behavior

## Constraints

- Do NOT replace or suppress the existing ILogger<T> call sites. Serilog plugs into
  Microsoft.Extensions.Logging; existing code is unaware of the swap.
- Do NOT modify the DiagnosticsEntries cap or the DiagnosticsEntry record shape.
  The sink must conform to what exists today.
- Do NOT add log calls to business code for the sake of this pass. The goal is
  wiring, not retroactively instrumenting every service.
- Do NOT touch src/InspectionPrototype.Infrastructure. The Application project already
  has the abstractions we need.

## Verification before you report done

  dotnet build --configuration Release          (zero warnings, zero errors)
  dotnet test --configuration Release           (all tests pass, including new sink tests)

Then run the app manually and confirm:
  - a log file appears under %LOCALAPPDATA%\InspectionPrototype\logs\app-<date>.log
  - connecting, loading a recipe, and starting a run each produce log lines in the file
  - triggering a fault (via existing fault injection UI) produces a DiagnosticsEntry
    visible in the diagnostics pane AND a line in the log file

## Report format when finished

- files created and files modified
- confirmation that all existing tests still pass
- the path of a log file produced during manual verification
- a single commit hash
- commit message: "feat(obs): add structured logging with Serilog and diagnostics-timeline bridge (pass 1/3 of TASK-006)"
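
For reviewers of Pass 1, the logging configuration the prompt asks for might look roughly like the sketch below. It is not the spec's implementation. It assumes the Serilog, Serilog.Sinks.File, Serilog.Sinks.Console, Serilog.Enrichers.Thread, and Serilog.Enrichers.Process packages are referenced. The `LoggingSetup` class, its `Configure` method, and the `timelineSink` parameter are hypothetical names. The sink is passed as an instance because it has a constructor dependency.

```csharp
using System;
using System.IO;
using Serilog;
using Serilog.Core;
using Serilog.Events;

// Hypothetical helper; names are illustrative and not taken from the spec.
public static class LoggingSetup
{
    public static LoggerConfiguration Configure(
        LoggerConfiguration configuration,
        ILogEventSink timelineSink,   // e.g. a DiagnosticsTimelineSink resolved from DI
        bool debuggerAttached)
    {
        string logPath = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "InspectionPrototype", "logs", "app-.log");

        configuration
            .MinimumLevel.Is(debuggerAttached ? LogEventLevel.Debug : LogEventLevel.Information)
            .Enrich.WithThreadId()
            .Enrich.WithProcessId()
            .WriteTo.File(
                logPath,
                rollingInterval: RollingInterval.Day,
                fileSizeLimitBytes: 10L * 1024 * 1024,
                rollOnFileSizeLimit: true,     // start a new file at the limit instead of going silent
                retainedFileCountLimit: 7)
            // Only Warning and above reach the diagnostics timeline.
            .WriteTo.Sink(timelineSink, LogEventLevel.Warning);

        if (debuggerAttached)
        {
            configuration.WriteTo.Console();
        }

        return configuration;
    }
}
```

One way to call this is from the `UseSerilog((context, services, configuration) => ...)` overload, which provides an `IServiceProvider` from which the sink can be resolved.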

Pass 2 — Unhandled exception handlers and crash log

You are implementing Pass 2 of TASK-006. Pass 1 (Serilog logging) is already merged;
this pass adds the three unhandled-exception handlers and the crash-log writer.

## Authoritative references

Read these before making changes:
- docs/specs/SLICE-006-observability-baseline.md   (the requirements — items 3, 4, 5)
- docs/tasks/TASK-006-implement-observability-baseline.md
- src/InspectionPrototype.App/App.xaml.cs   (where the handlers get registered)
- src/InspectionPrototype.Application/Services/WorkflowService.cs   (how Faulted transitions work today)
- src/InspectionPrototype.Application/State/AppState.cs   (what state to snapshot in crash dumps)

Confirm that Pass 1's Serilog wiring is already in place before starting. If it is
not, stop and alert the operator.

## Scope of this pass

Unhandled exception handling and crash logs only. Do NOT touch metrics or the
single-instance mutex — that is Pass 3.

## Deliverables

1. A new ICrashReporter abstraction and implementation:
     - interface: src/InspectionPrototype.Application/Diagnostics/ICrashReporter.cs
         Task ReportAsync(Exception exception, string source, CancellationToken ct = default);
     - implementation: src/InspectionPrototype.Application/Diagnostics/CrashReporter.cs
         depends on IAppStateStore and ILogger<CrashReporter>
         writes a crash file at:
           %LOCALAPPDATA%\InspectionPrototype\crashes\crash-{yyyy-MM-ddTHH-mm-ssZ}.txt
         crash file contents (plain text, human-readable):
           - timestamp (UTC)
           - source tag ("UI", "AppDomain", "UnobservedTask")
           - full exception, including inner exceptions and stack traces
           - current WorkflowState, ActiveRun id, and recipe name from AppState snapshot
           - last 50 DiagnosticsEntries from AppState snapshot
           - process info: PID, uptime, working set
         also logs at LogLevel.Critical with the same context so the log file captures it
     - register both via AddApplicationServices() in
       src/InspectionPrototype.Application/ApplicationServiceCollectionExtensions.cs

2. Register the three handlers in App.xaml.cs OnStartup, AFTER Host.StartAsync()
   completes (so the DI container is available):
     - this.DispatcherUnhandledException: mark e.Handled = true, call
       crashReporter.ReportAsync(e.Exception, "UI"), append a diagnostics entry,
       transition any active run to Faulted via IWorkflowService (use whatever
       method exists today — do not invent a new one), surface a non-blocking
       banner (see item 3 below). Do NOT terminate the process.
     - AppDomain.CurrentDomain.UnhandledException: call crashReporter.ReportAsync
       with source "AppDomain" and block on it with a short timeout (3 seconds).
       The process is terminating; best-effort is fine. Let it terminate after reporting.
     - TaskScheduler.UnobservedTaskException: call crashReporter.ReportAsync
       with source "UnobservedTask", call e.SetObserved(), append a diagnostics
       entry. Do NOT terminate the process.

3. A new "crash banner" field in AppState and a corresponding UI surface:
     - add a nullable field to AppState: CrashBannerState? CrashBanner
     - record: record CrashBannerState(string Message, string CrashFilePath, DateTimeOffset OccurredAt)
     - add a corresponding AppState extension: WithCrashBanner / WithClearedCrashBanner
     - in MainWindow.xaml (or whatever the current main layout is), add a non-blocking
       banner row at the top that binds to the new AppState field via MainViewModel
       and is visible when CrashBanner is not null; includes a "Copy path" button
       and a "Dismiss" button. Do not make it modal.
     - the banner text format: "A background error occurred. Crash log: {path}"

4. Preserve run-history integrity on AppDomain exit:
     - confirm the existing JsonRunHistoryStore writes via temp-file-then-move
       (it already does per the summary) — no change needed, but add a comment
       in the crash handler acknowledging this and do NOT attempt to flush
       anything in-flight from the handler itself.

5. Tests under tests/InspectionPrototype.Tests:
     - CrashReporterTests.cs: given a test IAppStateStore with a seeded state,
       ReportAsync writes a file with the expected sections. Use a tempdir.
     - Do NOT test DispatcherUnhandledException wiring directly (requires a UI
       thread); instead, test that CrashReporter handles each source tag
       correctly.

## Constraints

- Do NOT swallow exceptions without writing them to both the log and a crash
  file. Every handler must produce both artifacts.
- Do NOT use MessageBox for the UI surface — the spec requires a non-blocking
  banner. Modal dialogs are forbidden.
- Do NOT add a crash-uploader, opt-in dialog, or issue-filer. Out of scope.
- Do NOT call crashReporter.ReportAsync(...).Wait() from the dispatcher handler;
  use fire-and-forget with a 3-second timeout.
- Do NOT modify the existing JsonRunHistoryStore atomic-write logic.

## Verification before you report done

  dotnet build --configuration Release
  dotnet test --configuration Release

Manual verification steps documented for the operator (write these into the
runbook file created in Pass 3 — for now, describe them in the report):
  - force a UI-thread exception via a debug-only "crash me" button (add one
    under #if DEBUG in MainWindow for testing; ok to commit if behind #if DEBUG)
  - confirm a crash file appears under %LOCALAPPDATA%\InspectionPrototype\crashes\
  - confirm the banner appears in the UI with the crash-log path
  - confirm the process does not terminate

## Report format when finished

- files created and files modified
- confirmation that all existing tests still pass plus new CrashReporter tests
- the path of a crash file produced during manual verification
- a single commit hash
- commit message: "feat(obs): add unhandled exception handlers and crash reporter (pass 2/3 of TASK-006)"
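
For reviewers of Pass 2, the two non-UI handlers can be sketched as below. This is a minimal illustration under stated assumptions, not the spec's implementation. `ICrashReporter` mirrors the signature in deliverable 1, with the token made optional. The `CrashHandlers` class and the `RunWithTimeout` helper are hypothetical. Appending the diagnostics entry is omitted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Same shape as the interface in deliverable 1; the default on ct is an
// assumption that lets handlers call ReportAsync(exception, source).
public interface ICrashReporter
{
    Task ReportAsync(Exception exception, string source, CancellationToken ct = default);
}

// Hypothetical helper class; its name and members are illustrative only.
public static class CrashHandlers
{
    public static void Register(ICrashReporter reporter)
    {
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            // ExceptionObject is typed as object, so guard the cast.
            var exception = e.ExceptionObject as Exception
                ?? new InvalidOperationException($"Non-exception object thrown: {e.ExceptionObject}");

            // The process is going down: block briefly, then let it terminate.
            RunWithTimeout(ct => reporter.ReportAsync(exception, "AppDomain", ct), TimeSpan.FromSeconds(3));
        };

        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            e.SetObserved();                                          // mark as observed, per the deliverable
            _ = reporter.ReportAsync(e.Exception, "UnobservedTask");  // fire-and-forget
        };
    }

    // Runs the work on the thread pool and waits at most `timeout`.
    // Returns true only if the work completed successfully in time.
    public static bool RunWithTimeout(Func<CancellationToken, Task> work, TimeSpan timeout)
    {
        // Not disposed on purpose: the work may still hold the token after we stop waiting.
        var cts = new CancellationTokenSource(timeout);
        CancellationToken token = cts.Token;
        try
        {
            return Task.Run(() => work(token)).Wait(timeout);
        }
        catch (Exception)
        {
            return false; // best-effort: a failed or cancelled report must not throw from a crash handler
        }
    }
}
```

Running the report through `Task.Run` keeps the blocking wait off the reporter's own continuation path. That is why the same helper should not be used from the dispatcher handler, where blocking the UI thread is explicitly ruled out.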

Pass 3 — Metrics, single-instance mutex, runbook

You are implementing Pass 3 of TASK-006, the final pass. Passes 1 and 2 are
already merged; this pass adds the metrics meter, wires counters into pipeline
services, adds the single-instance mutex, and writes the runbook.

## Authoritative references

Read these before making changes:
- docs/specs/SLICE-006-observability-baseline.md   (items 6, 7, 9 of acceptance criteria)
- docs/tasks/TASK-006-implement-observability-baseline.md
- src/InspectionPrototype.Application/Services/FramePipelineService.cs
- src/InspectionPrototype.Application/Services/TelemetryPipelineService.cs
- src/InspectionPrototype.Application/Services/WorkflowService.cs
- src/InspectionPrototype.App/App.xaml.cs   (mutex goes here, before host build)

## Scope of this pass

Metrics meter + starter counters + single-instance mutex + runbook. Nothing
else. Do NOT add log calls or crash behavior — Passes 1 and 2 own those.

## Deliverables

1. A new metrics abstraction exposed from Application:
     - file: src/InspectionPrototype.Application/Diagnostics/AppMetrics.cs
     - public class AppMetrics : IDisposable
         private readonly Meter _meter = new("InspectionPrototype");
         public Counter<long> FramesIngested { get; }
         public Counter<long> FramesDropped { get; }
         public Counter<long> TelemetryIngested { get; }
         public Counter<long> TelemetryCoalesced { get; }
         public Counter<long> RunsStarted { get; }
         public Counter<long> RunsCompleted { get; }
         public Counter<long> RunsFaulted { get; }
         public AppMetrics()   // assign here: property initializers cannot reference _meter (CS0236)
         {
             FramesIngested     = _meter.CreateCounter<long>("frames.ingested");
             FramesDropped      = _meter.CreateCounter<long>("frames.dropped");
             TelemetryIngested  = _meter.CreateCounter<long>("telemetry.ingested");
             TelemetryCoalesced = _meter.CreateCounter<long>("telemetry.coalesced");
             RunsStarted        = _meter.CreateCounter<long>("runs.started");
             RunsCompleted      = _meter.CreateCounter<long>("runs.completed");
             RunsFaulted        = _meter.CreateCounter<long>("runs.faulted");
         }
         public void Dispose() => _meter.Dispose();
     - register as Singleton in AddApplicationServices()

2. Wire counters into the three services (constructor-inject AppMetrics):
     - FramePipelineService: .FramesIngested.Add(1) on each frame successfully
       propagated to AppState; .FramesDropped.Add(n) when the bounded-channel
       drop counter advances (n is the delta of that existing drop counter since
       the last read; Counter<long> itself cannot be read back).
     - TelemetryPipelineService: .TelemetryIngested.Add(1) on each snapshot
       propagated; .TelemetryCoalesced.Add(n) when the coalesce counter advances.
     - WorkflowService: .RunsStarted.Add(1) on transition into Running;
       .RunsCompleted.Add(1) on successful completion; .RunsFaulted.Add(1) on
       fault transition.

   Counters only go up. Do not decrement. Do not reset on workflow transitions.

3. Single-instance mutex in App.xaml.cs OnStartup, BEFORE anything else
   (before Serilog config, before Host building):
     - compute a data-directory-scoped mutex name:
         $"Global\\InspectionPrototype-{Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(dataDir)))[..16]}"
       where dataDir is %LOCALAPPDATA%\InspectionPrototype
     - attempt new Mutex(initiallyOwned: true, name, out bool createdNew);
        if !createdNew, log a message to a bootstrap log file under
        %LOCALAPPDATA%\InspectionPrototype\logs\bootstrap.log (Serilog is not yet
        configured this early in OnStartup),
       show a one-line MessageBox ("Another instance is already running."),
       and Environment.Exit(1).
     - hold the mutex for the process lifetime; release on OnExit.

4. Runbook at docs/runbook/observability.md:
     - where logs live, rolling and retention policy
     - where crash files live, what is in them, how to read one
     - how to attach a counters session:
         dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype,System.Runtime
     - how the single-instance guard works, how to recover if the mutex is stuck
       after a hard crash (it auto-releases on process exit; no manual cleanup needed)
     - one-paragraph note on the phase-1-measurements.md table this observability
       surface exists to feed

5. Tests under tests/InspectionPrototype.Tests:
     - AppMetricsTests.cs: creating an AppMetrics instance exposes all seven
       counters; counters are reachable via the System.Diagnostics.Metrics API
       by name; disposing disposes the underlying Meter.
     - Do NOT test mutex behavior — integration-only, documented as manual.

## Constraints

- Do NOT add new counters beyond the seven listed above. Future phases will
  extend them as needed.
- Do NOT use System.Diagnostics.PerformanceCounter (the old Win32 API). Use
  System.Diagnostics.Metrics (the modern .NET counters API).
- Do NOT reset counters on workflow transitions. Additive only.
- Do NOT expose AppMetrics through a public static — inject via DI only.
- Do NOT use the mutex to coordinate anything other than single-instance
  launch. It is a guard, not a lock.

## Verification before you report done

  dotnet build --configuration Release
  dotnet test --configuration Release

Manual verification:
  - launch the app
  - open a second terminal and run:
      dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype,System.Runtime
    confirm all seven counters appear
  - connect, start a run; confirm frames.ingested, telemetry.ingested, and
    runs.started all increment live
  - attempt to launch a second copy of the app; confirm it exits with code 1
    and a log entry appears

## Report format when finished

- files created and files modified
- confirmation that all existing tests still pass plus new AppMetrics tests
- a screenshot or copy-paste of the dotnet-counters output showing counters > 0
- confirmation that second-launch is blocked
- a single commit hash
- commit message: "feat(obs): add metrics meter, starter counters, and single-instance guard (pass 3/3 of TASK-006)"
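
For reviewers of Pass 3, one way to check that counters are reachable by name through the System.Diagnostics.Metrics API is a small `MeterListener` probe. The sketch below is an illustration, not the spec's test code. `MeterProbe` and `Capture` are hypothetical names, and the probe handles only `long` measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical test helper; the name and shape are illustrative only.
public static class MeterProbe
{
    // Sums every long measurement recorded by instruments of the named
    // meter while `exercise` runs, keyed by instrument name.
    public static Dictionary<string, long> Capture(string meterName, Action exercise)
    {
        var totals = new Dictionary<string, long>();

        using var listener = new MeterListener();

        // Subscribe only to instruments that belong to the requested meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
            {
                l.EnableMeasurementEvents(instrument);
            }
        };

        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            lock (totals)
            {
                totals.TryGetValue(instrument.Name, out long current);
                totals[instrument.Name] = current + value;
            }
        });

        listener.Start();   // also publishes instruments that already exist
        exercise();
        return totals;
    }
}
```

A test along the lines of AppMetricsTests could create the metrics object, call `Add` on each counter inside `exercise`, and assert that all seven instrument names show up in the returned dictionary.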

Operator notes

  • One pass per Copilot session. Start a fresh chat per pass. Do not feed all three prompts into a single agent session — the context bloat will degrade pass 3.
  • Review and commit between passes. Each pass ends with a single commit message template. Run dotnet test locally and confirm the counters/logs/crash behavior manually before kicking off the next session.
  • Pass 2 is the riskiest. Unhandled-exception wiring interacts with the dispatcher, DI lifetimes, and the workflow state machine in non-obvious ways. If Pass 2 feels off after one round with the agent, bail and write the handlers yourself — the prompts in Passes 1 and 3 will still be usable.
  • After Pass 3, capture the demo baseline. Run the app for 10 minutes with a short scripted scenario, save the dotnet-counters output to docs/captures/demo-baseline-<date>.csv, and add row 0 to docs/reviews/phase-1-measurements.md. That is the reference Phase 1 gets measured against.
