SLICE-006: Observability Baseline

Goal

Make the application observable enough that every later phase can answer "what actually happened?" and "what are the current rates?" with numbers rather than guesses.

Why This Slice

Every Phase 1 scaling goal (more telemetry tags, real frame payloads, encoder-rate motion, storm and soak profiles) and every Phase 2 store-refactor goal (slice AppState, lift out the data-plane, cut fan-out) depends on being able to see the pipeline under load. The prototype currently has:

  • a DiagnosticsEntries list in canonical state, bounded to 200
  • basic pipeline counters (TelemetryCoalesced, FramesDropped) visible in the UI
  • no structured log file
  • no unhandled-exception handlers for the dispatcher, the app domain, or unobserved tasks
  • no System.Diagnostics.Metrics surface, so dotnet-counters sees nothing
  • no single-instance guard, so two copies of the app can fight over the same recipe and run-history files

In practical terms that means a crash mid-run disappears silently, long soak runs leave no trace, and "did the new simulator profile actually change the drop rate?" is not answerable outside the UI.

This slice adds the minimum surface that makes the next phases honest.

Requirements Coverage

In Scope

  • structured logging via Serilog (or an equivalently configured Microsoft.Extensions.Logging backend) with at minimum:
    • a rolling file sink under %LOCALAPPDATA%\InspectionPrototype\logs\ (or equivalent project convention)
    • a console sink active when running under a debugger
    • a sink (or bridge) that routes log events into the existing DiagnosticsEntries timeline at Warning or higher
  • three unhandled-exception handlers registered during host startup:
    • App.DispatcherUnhandledException
    • AppDomain.CurrentDomain.UnhandledException
    • TaskScheduler.UnobservedTaskException
  • a crash-log path: every unhandled exception produces a dedicated crash file containing the full exception, its inner exceptions, and the workflow/run state at the moment of failure
  • a System.Diagnostics.Metrics.Meter (named InspectionPrototype) exposing starter counters that later phases will populate:
    • frames.ingested (counter)
    • frames.dropped (counter)
    • telemetry.ingested (counter)
    • telemetry.coalesced (counter)
    • runs.started (counter)
    • runs.completed (counter)
    • runs.faulted (counter)
  • a single-instance mutex that prevents a second copy of the app from launching against the same data directory; second-launch attempts must fail loudly, not silently
  • documentation under docs/runbook/ (or equivalent) describing where logs go, how to read a crash file, and how to attach a counters session (dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype)

Out of Scope

  • OpenTelemetry exporters, OTLP endpoints, external collectors
  • log forwarding to Elastic / Seq / Loki / cloud services
  • alerting, paging, or dashboards
  • distributed tracing, Activity propagation across process boundaries (there is only one process)
  • a full "crash reporter" UX (dialog, opt-in upload, issue creation)
  • changes to the existing DiagnosticsEntries 200-cap or its schema beyond what routing log events requires
  • adding new operational counters beyond the starter set above — Phase 1 slices extend them as they need

Runtime Behavior

Logging

  • every existing ILogger<T> call site continues to work unchanged
  • events at Information and above are persisted to the rolling file
  • Warning and above also become a DiagnosticsEntry in canonical state so the operator sees them in the UI
  • files roll by day and by size (a sensible default such as 10 MB per file, 7 days retained)
  • no secrets or recipe contents are logged at Information or above; full recipe objects are logged only at Debug, which is not persisted to the rolling file
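
The logging rules above could be wired up roughly as follows. This is a minimal sketch, assuming Serilog with the Serilog.Sinks.File and Serilog.Sinks.Console packages. DiagnosticsTimelineSink, LoggingSetup, the appendDiagnostic callback, and the "app-.log" file name are hypothetical names introduced for illustration; the real bridge into DiagnosticsEntries depends on how canonical state is updated in the prototype.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Serilog;
using Serilog.Core;
using Serilog.Events;

// Hypothetical bridge: forwards rendered log events to a callback that is
// assumed to append an entry to the DiagnosticsEntries timeline.
public sealed class DiagnosticsTimelineSink : ILogEventSink
{
    private readonly Action<LogEventLevel, string> _append;

    public DiagnosticsTimelineSink(Action<LogEventLevel, string> append) =>
        _append = append ?? throw new ArgumentNullException(nameof(append));

    public void Emit(LogEvent logEvent) =>
        _append(logEvent.Level, logEvent.RenderMessage());
}

public static class LoggingSetup
{
    public static Logger Create(Action<LogEventLevel, string> appendDiagnostic)
    {
        // %LOCALAPPDATA%\InspectionPrototype\logs\ as named in the spec.
        var logDir = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "InspectionPrototype", "logs");
        Directory.CreateDirectory(logDir);

        var config = new LoggerConfiguration()
            .MinimumLevel.Debug()
            // File sink: Information and above, rolling by day and at 10 MB.
            .WriteTo.File(
                Path.Combine(logDir, "app-.log"),
                restrictedToMinimumLevel: LogEventLevel.Information,
                rollingInterval: RollingInterval.Day,
                fileSizeLimitBytes: 10 * 1024 * 1024,
                rollOnFileSizeLimit: true,
                retainedFileTimeLimit: TimeSpan.FromDays(7))
            // Timeline bridge: Warning and above only.
            .WriteTo.Sink(
                new DiagnosticsTimelineSink(appendDiagnostic),
                restrictedToMinimumLevel: LogEventLevel.Warning);

        // Console sink only when a debugger is attached.
        if (Debugger.IsAttached)
            config = config.WriteTo.Console();

        return config.CreateLogger();
    }
}
```

Existing ILogger<T> call sites would typically reach this logger through a Serilog provider for Microsoft.Extensions.Logging; that wiring is omitted here. The retainedFileTimeLimit parameter is assumed to be available in the installed Serilog.Sinks.File version; retainedFileCountLimit is an alternative if it is not.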

Unhandled Exceptions

  • a UI-thread exception does not tear down the process silently; it is logged, a diagnostics entry is appended, the active run (if any) is marked Faulted with a correlation id, and the UI surfaces a non-blocking banner directing the operator to the crash log
  • a background-task exception (unobserved task, BackgroundService failure) is logged with the owning service name and either recovers the service or transitions the workflow to Faulted
  • an AppDomain.UnhandledException produces a final crash file before the process exits; no partially-written run history is left behind (existing atomic-write behavior must be preserved)
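
One possible shape for the crash-file path and handler registration is sketched below, using only base class library types so that it compiles outside a WPF project. CrashHandling, FormatCrash, WriteCrashFile, the snapshot delegate, and the crash file naming scheme are all hypothetical; marking the active run Faulted and raising the UI banner are left out.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class CrashHandling
{
    // Builds the crash report text. Exception.ToString() already includes
    // the inner-exception chain and stack traces.
    public static string FormatCrash(string source, Exception ex, string workflowState, string? runId)
    {
        var sb = new StringBuilder();
        sb.AppendLine($"Timestamp (UTC): {DateTime.UtcNow:O}");
        sb.AppendLine($"Source: {source}");
        sb.AppendLine($"Workflow state: {workflowState}");
        sb.AppendLine($"Active run id: {runId ?? "(none)"}");
        sb.AppendLine();
        sb.AppendLine(ex.ToString());
        return sb.ToString();
    }

    // Writes one dedicated file per crash and returns its path.
    public static string WriteCrashFile(string crashDir, string report)
    {
        Directory.CreateDirectory(crashDir);
        var name = $"crash-{DateTime.UtcNow:yyyyMMdd-HHmmss}-{Guid.NewGuid():N}.txt";
        var path = Path.Combine(crashDir, name);
        File.WriteAllText(path, report);
        return path;
    }

    // snapshot is a hypothetical accessor for the current workflow/run state.
    public static void Register(string crashDir, Func<(string Workflow, string? RunId)> snapshot)
    {
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            var ex = e.ExceptionObject as Exception
                     ?? new Exception(e.ExceptionObject?.ToString() ?? "unknown");
            var s = snapshot();
            WriteCrashFile(crashDir, FormatCrash("AppDomain", ex, s.Workflow, s.RunId));
        };

        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            var s = snapshot();
            WriteCrashFile(crashDir, FormatCrash("UnobservedTask", e.Exception, s.Workflow, s.RunId));
            e.SetObserved(); // the exception has been recorded
        };

        // In the WPF host, the third handler would be attached to the
        // Application instance, for example:
        //   app.DispatcherUnhandledException += (_, e) => { /* log + crash file */ e.Handled = true; };
    }
}
```

Keeping the formatting and file writing separate from the event subscriptions means the report contents can be unit-tested without forcing a real crash.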

Metrics

  • starting the app exposes the InspectionPrototype meter
  • dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype shows all starter counters updating live as runs execute
  • the counters are monotonic (values only increase); rates are computed by the consumer
  • counters survive workflow transitions: restarting a run does not reset them
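
The starter counters could be declared once in a static holder so that they live for the whole process and are unaffected by workflow transitions. This is a minimal sketch: the AppMetrics class and its field names are hypothetical, and the choice of long as the measurement type is an assumption. The meter and instrument names come from the In Scope list.

```csharp
using System.Diagnostics.Metrics;

public static class AppMetrics
{
    public const string MeterName = "InspectionPrototype";

    // One process-wide meter; declared before the counters so that static
    // initialization runs in the right order.
    private static readonly Meter AppMeter = new Meter(MeterName);

    public static readonly Counter<long> FramesIngested = AppMeter.CreateCounter<long>("frames.ingested");
    public static readonly Counter<long> FramesDropped = AppMeter.CreateCounter<long>("frames.dropped");
    public static readonly Counter<long> TelemetryIngested = AppMeter.CreateCounter<long>("telemetry.ingested");
    public static readonly Counter<long> TelemetryCoalesced = AppMeter.CreateCounter<long>("telemetry.coalesced");
    public static readonly Counter<long> RunsStarted = AppMeter.CreateCounter<long>("runs.started");
    public static readonly Counter<long> RunsCompleted = AppMeter.CreateCounter<long>("runs.completed");
    public static readonly Counter<long> RunsFaulted = AppMeter.CreateCounter<long>("runs.faulted");
}
```

A call site would then record an event with, for example, AppMetrics.FramesIngested.Add(1). Counter<T> reports increments only, which matches the rule that consumers compute rates themselves.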

Single-Instance Guard

  • launching a second copy of the app detects the existing instance and exits with a non-zero exit code and a log entry
  • the guard is per data directory, not per machine — a future multi-tenant scenario remains possible
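
A per-directory guard can be sketched with a named mutex whose name is derived from the data directory path. SingleInstanceGuard, MutexNameFor, and the "InspectionPrototype-" name prefix are hypothetical, and upper-casing the path assumes a case-insensitive file system such as the Windows default.

```csharp
#nullable enable
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

public sealed class SingleInstanceGuard : IDisposable
{
    private readonly Mutex _mutex;

    private SingleInstanceGuard(Mutex mutex) => _mutex = mutex;

    // Hashes the normalized directory so the name contains no path
    // separators and two spellings of the same directory map to one mutex.
    public static string MutexNameFor(string dataDirectory)
    {
        var normalized = Path.GetFullPath(dataDirectory)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            .ToUpperInvariant(); // assumption: case-insensitive file system
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
        return "InspectionPrototype-" + Convert.ToHexString(hash);
    }

    // Returns a guard if this is the first holder for the directory,
    // or null if another holder already exists.
    public static SingleInstanceGuard? TryAcquire(string dataDirectory)
    {
        var mutex = new Mutex(true, MutexNameFor(dataDirectory), out bool createdNew);
        if (createdNew)
            return new SingleInstanceGuard(mutex);

        mutex.Dispose();
        return null;
    }

    // Must run on the thread that acquired the guard, because a Mutex
    // can only be released by its owning thread.
    public void Dispose()
    {
        _mutex.ReleaseMutex();
        _mutex.Dispose();
    }
}
```

At startup, a null result from TryAcquire would be the point to write the log entry and exit with a non-zero code. If the owning process crashes, the operating system closes its handle, so the guard does not need its own cleanup path for that case.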

Acceptance Criteria

This slice is satisfied only if all of the following are true:

  1. A structured log file is produced under the documented path during normal operation, rolls by day and size, and contains at minimum startup, recipe-load, run start/stop/abort/fault, and shutdown events.
  2. Warning and above log events appear in the in-app diagnostics timeline in addition to the file.
  3. A forced exception on the UI thread produces a diagnostics entry, a log line, a crash file, and a visible non-blocking indication to the operator; the process does not terminate silently.
  4. A forced unobserved-task exception produces a log line naming the owning service and does not kill the process unless the service itself is unrecoverable.
  5. A forced AppDomain.UnhandledException produces a crash file before exit, and any in-flight run-history write completes or is cleanly abandoned without corrupting the store.
  6. dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype shows all starter counters, and running a short scripted scenario produces non-zero values for frames.ingested, telemetry.ingested, and runs.started.
  7. Launching a second instance of the application against the same data directory fails with a clear error and a log entry, and does not corrupt or double-write recipe or run-history files.
  8. The behaviors above are covered by automated tests where practical (log sink configuration, unhandled-exception handler wiring, counter increments, single-instance guard); UI-surface behavior may be verified manually and documented.
  9. A runbook entry documents where logs and crash files live, how long they are retained, and how to attach a counters session.
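
For the counter-increment tests mentioned in criterion 8, one option is an in-process MeterListener that accumulates totals per instrument name, so a test can assert on values without launching dotnet-counters. This is a minimal sketch: CounterProbe is a hypothetical helper and it assumes the counters use long as the measurement type.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Accumulates long-valued measurements from one meter, keyed by instrument name.
public sealed class CounterProbe : IDisposable
{
    private readonly MeterListener _listener = new MeterListener();
    private readonly Dictionary<string, long> _totals = new Dictionary<string, long>();
    private readonly object _gate = new object();

    public CounterProbe(string meterName)
    {
        // Subscribe only to instruments that belong to the named meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == meterName)
                listener.EnableMeasurementEvents(instrument);
        };

        _listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            lock (_gate)
            {
                _totals.TryGetValue(instrument.Name, out var current);
                _totals[instrument.Name] = current + value;
            }
        });

        _listener.Start();
    }

    // Returns the accumulated total, or 0 if nothing was recorded.
    public long Total(string counterName)
    {
        lock (_gate)
        {
            return _totals.TryGetValue(counterName, out var value) ? value : 0;
        }
    }

    public void Dispose() => _listener.Dispose();
}
```

A test would create the probe for the InspectionPrototype meter, drive a short scenario, and then check that Total("frames.ingested"), Total("telemetry.ingested"), and Total("runs.started") are non-zero.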

Verification Notes

The implementation task for this spec must include verification for:

  • unhandled-exception handlers actually fire for each of the three paths (verified by a debug-only "crash me" test control or an automated harness)
  • the log file path is writable on a fresh machine without elevation
  • the crash file contains enough state (workflow, active run id, last diagnostics entries) to diagnose a failure without a debugger
  • metrics counters increment under a real simulator run, not just a synthetic test
  • the single-instance guard does not leak the mutex handle on crash
  • the warnings-as-errors policy from SLICE-005 does not lead to any of the new log or metrics code being silently compiled out behind conditional compilation
