SLICE-006: Observability Baseline
- Status: Proposed
- Date: 2026-04-22
- Depends on: Requirements, Evolution Roadmap, SLICE-005: CI and Quality Gates
Goal
Make the application observable enough that every later phase can answer "what actually happened?" and "what are the current rates?" with numbers rather than guesses.
Why This Slice
Every Phase 1 scaling goal (more telemetry tags, real frame payloads, encoder-rate motion, storm and soak profiles) and every Phase 2 store-refactor goal (slice AppState, lift out the data-plane, cut fan-out) depends on being able to see the pipeline under load. The prototype currently has:
- a `DiagnosticsEntries` list in canonical state, bounded to 200
- basic pipeline counters (`TelemetryCoalesced`, `FramesDropped`) visible in the UI
- no structured log file
- no unhandled-exception handlers for the dispatcher, the app domain, or unobserved tasks
- no `System.Diagnostics.Metrics` surface, so `dotnet-counters` sees nothing
- no single-instance guard, so two copies of the app can fight over the same recipe and run-history files
In practical terms that means a crash mid-run disappears silently, long soak runs leave no trace, and "did the new simulator profile actually change the drop rate?" is not answerable outside the UI.
This slice adds the minimum surface that makes the next phases honest.
Requirements Coverage
- 04. UI and Technical Requirements: diagnostics surface, measurable pipeline behavior
- 05. Failure Modes and Workflow Requirements: unhandled failures must not lose the run history or silently crash the UI
- 07. AI Delivery Constraints and Roadmap: downstream phases need reproducible measurements
In Scope
- structured logging via Serilog (or an equivalently configured `Microsoft.Extensions.Logging` backend) with at minimum:
  - a rolling file sink under `%LOCALAPPDATA%\InspectionPrototype\logs\` (or equivalent project convention)
  - a console sink active when running under a debugger
  - a sink (or bridge) that routes log events into the existing `DiagnosticsEntries` timeline at `Warning` or higher
- three unhandled-exception handlers registered during host startup:
  - `App.DispatcherUnhandledException`
  - `AppDomain.CurrentDomain.UnhandledException`
  - `TaskScheduler.UnobservedTaskException`
- a crash-log path: every unhandled exception produces a dedicated crash file with full exception, inner exceptions, and the current workflow/run state at the moment of failure
- a `System.Diagnostics.Metrics.Meter` (named `InspectionPrototype`) exposing starter counters that later phases will populate:
  - `frames.ingested` (counter)
  - `frames.dropped` (counter)
  - `telemetry.ingested` (counter)
  - `telemetry.coalesced` (counter)
  - `runs.started` (counter)
  - `runs.completed` (counter)
  - `runs.faulted` (counter)
- a single-instance mutex that prevents a second copy of the app from launching against the same data directory; second-launch attempts must fail loudly, not silently
- documentation under `docs/runbook/` (or equivalent) describing where logs go, how to read a crash file, and how to attach `dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype`
Out of Scope
- OpenTelemetry exporters, OTLP endpoints, external collectors
- log forwarding to Elastic / Seq / Loki / cloud services
- alerting, paging, or dashboards
- distributed tracing, `Activity` propagation across process boundaries (there is only one process)
- a full "crash reporter" UX (dialog, opt-in upload, issue creation)
- changes to the existing `DiagnosticsEntries` 200-cap or its schema beyond what routing log events requires
- adding new operational counters beyond the starter set above; Phase 1 slices extend them as they need
Runtime Behavior
Logging
- every existing `ILogger<T>` call site continues to work unchanged
- `Information` and above are persisted to the rolling file
- `Warning` and above also become a `DiagnosticsEntry` in canonical state so the operator sees them in the UI
- files roll by day and by size (a sensible default such as 10 MB per file, 7 days retained)
- no secrets or recipe contents are logged at `Information` or above; full recipe objects are logged only at `Debug`
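The logging rules above could be wired roughly as follows. This is a configuration sketch, not the project's actual setup: it assumes the Serilog route is chosen, with the `Serilog`, `Serilog.Sinks.File` (5.0 or later, for `retainedFileTimeLimit`), and `Serilog.Sinks.Console` packages. `DiagnosticsTimelineSink`, `LoggingSetup`, the `app-.log` file name, and the `appendDiagnostics` callback are hypothetical names; the spec does not describe how entries are appended to `DiagnosticsEntries`, and bridging existing `ILogger<T>` call sites to Serilog is not shown.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Serilog;
using Serilog.Core;
using Serilog.Events;

// Hypothetical bridge: hands each event it receives to a callback that is
// assumed to append a DiagnosticsEntry to canonical state.
public sealed class DiagnosticsTimelineSink : ILogEventSink
{
    private readonly Action<DateTimeOffset, LogEventLevel, string> _append;

    public DiagnosticsTimelineSink(Action<DateTimeOffset, LogEventLevel, string> append)
        => _append = append;

    public void Emit(LogEvent logEvent)
        => _append(logEvent.Timestamp, logEvent.Level, logEvent.RenderMessage());
}

public static class LoggingSetup
{
    public static ILogger Create(Action<DateTimeOffset, LogEventLevel, string> appendDiagnostics)
    {
        var logDir = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "InspectionPrototype", "logs");

        var config = new LoggerConfiguration()
            .MinimumLevel.Information()                      // Information and above reach the file
            .WriteTo.File(
                path: Path.Combine(logDir, "app-.log"),      // the date is inserted before the extension
                rollingInterval: RollingInterval.Day,        // roll by day...
                fileSizeLimitBytes: 10 * 1024 * 1024,        // ...and by size (10 MB)
                rollOnFileSizeLimit: true,
                retainedFileTimeLimit: TimeSpan.FromDays(7)) // keep 7 days
            .WriteTo.Sink(
                new DiagnosticsTimelineSink(appendDiagnostics),
                restrictedToMinimumLevel: LogEventLevel.Warning); // only Warning+ reaches the UI timeline

        // Console sink only when a debugger is attached.
        if (Debugger.IsAttached)
            config = config.WriteTo.Console();

        return config.CreateLogger();
    }
}
```

Restricting the timeline sink at the sink level, rather than filtering inside `Emit`, keeps the `Warning` threshold visible in one place next to the file sink's settings.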
Unhandled Exceptions
- a UI-thread exception does not tear down the process silently; it is logged, a diagnostics entry is appended, the active run (if any) is marked `Faulted` with a correlation id, and the UI surfaces a non-blocking banner directing the operator to the crash log
- a background-task exception (unobserved task, `BackgroundService` failure) is logged with the owning service name and either recovers the service or transitions the workflow to `Faulted`
- an `AppDomain.UnhandledException` produces a final crash file before the process exits; no partially-written run history is left behind (existing atomic-write behavior must be preserved)
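A minimal sketch of how the handler wiring and crash file might look, using only the base class library. `CrashHandlers`, `WriteCrashFile`, the crash-file naming scheme, and the `getWorkflowState` callback are hypothetical; the spec does not fix a crash-file format. Marking the run `Faulted`, appending the diagnostics entry, and showing the banner are left out.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class CrashHandlers
{
    // Writes one dedicated file per unhandled exception and returns its path.
    public static string WriteCrashFile(
        string crashDir, string source, Exception ex, string workflowState)
    {
        Directory.CreateDirectory(crashDir);
        var name = $"crash-{DateTime.UtcNow:yyyyMMdd-HHmmss}-{Guid.NewGuid():N}.txt";
        var path = Path.Combine(crashDir, name);

        // Exception.ToString() includes inner exceptions and stack traces.
        File.WriteAllText(path, $"source: {source}\nworkflow: {workflowState}\n\n{ex}");
        return path;
    }

    // Call once during host startup.
    public static void Register(string crashDir, Func<string> getWorkflowState)
    {
        // Last-chance handler: the process is normally about to terminate.
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            if (e.ExceptionObject is Exception ex)
                WriteCrashFile(crashDir, "AppDomain", ex, getWorkflowState());
        };

        // Raised when a faulted task is finalized without its exception being observed.
        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            WriteCrashFile(crashDir, "TaskScheduler", e.Exception, getWorkflowState());
            e.SetObserved();
        };

        // The third handler needs WPF and would be attached in App, e.g.:
        //   DispatcherUnhandledException += (_, e) =>
        //   {
        //       WriteCrashFile(crashDir, "Dispatcher", e.Exception, getWorkflowState());
        //       e.Handled = true; // keep the UI alive so the banner can be shown
        //   };
    }
}
```

Setting `e.Handled = true` in the dispatcher handler is one way to meet the "does not tear down the process silently" rule; whether the app should keep running after every UI-thread exception is a policy decision this sketch does not settle.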
Metrics
- starting the app exposes the `InspectionPrototype` meter
- `dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype` shows all starter counters updating live as runs execute
- the counters are additive (counters only go up); rates are computed by the consumer
- counters survive workflow transitions: restarting a run does not reset them
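The starter counters could be declared along these lines. The meter name and instrument names come from this spec; the `PipelineMetrics` class and its field names are hypothetical. Holding the instruments in static fields is one simple way to make them outlive workflow transitions, since nothing recreates them when a run restarts.

```csharp
using System.Diagnostics.Metrics;

public static class PipelineMetrics
{
    public const string MeterName = "InspectionPrototype";

    // One Meter for the process; instruments are created once and reused.
    private static readonly Meter AppMeter = new(MeterName);

    public static readonly Counter<long> FramesIngested =
        AppMeter.CreateCounter<long>("frames.ingested");
    public static readonly Counter<long> FramesDropped =
        AppMeter.CreateCounter<long>("frames.dropped");
    public static readonly Counter<long> TelemetryIngested =
        AppMeter.CreateCounter<long>("telemetry.ingested");
    public static readonly Counter<long> TelemetryCoalesced =
        AppMeter.CreateCounter<long>("telemetry.coalesced");
    public static readonly Counter<long> RunsStarted =
        AppMeter.CreateCounter<long>("runs.started");
    public static readonly Counter<long> RunsCompleted =
        AppMeter.CreateCounter<long>("runs.completed");
    public static readonly Counter<long> RunsFaulted =
        AppMeter.CreateCounter<long>("runs.faulted");
}
```

A call site would then record an event with `PipelineMetrics.FramesIngested.Add(1);`. `Counter<T>` only reports increments, which matches the additive rule above; consumers such as `dotnet-counters` derive rates from successive values.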
Single-Instance Guard
- launching a second copy of the app detects the existing instance and exits with a non-zero exit code and a log entry
- the guard is per data directory, not per machine — a future multi-tenant scenario remains possible
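One way to scope the guard to a data directory is to derive the mutex name from the directory path, sketched below under stated assumptions. `SingleInstanceGuard`, `MutexNameFor`, and the `InspectionPrototype-` name prefix are hypothetical. Upper-casing the path assumes a case-insensitive file system (Windows), and an unprefixed mutex name is session-local on Windows, so a guard that must span user sessions would need a different name scope.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

public static class SingleInstanceGuard
{
    // Derives a stable mutex name from the data directory, so two instances
    // conflict only when they point at the same directory.
    public static string MutexNameFor(string dataDirectory)
    {
        var normalized = Path.GetFullPath(dataDirectory)
            .TrimEnd(Path.DirectorySeparatorChar)
            .ToUpperInvariant(); // assumption: case-insensitive file system
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
        return "InspectionPrototype-" + Convert.ToHexString(hash, 0, 8);
    }

    // Returns the owned mutex, or null if another instance already holds it.
    // The caller must keep the returned mutex referenced for the process lifetime.
    public static Mutex? TryAcquire(string dataDirectory)
    {
        var mutex = new Mutex(true, MutexNameFor(dataDirectory), out bool createdNew);
        if (createdNew)
            return mutex;

        mutex.Dispose();
        return null;
    }
}
```

At startup the host would call `TryAcquire`, and on `null` write a log entry and exit with a non-zero code. If the owning process crashes, the operating system closes its handle, so the next launch can create the mutex again.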
Acceptance Criteria
This slice is satisfied only if all of the following are true:
- A structured log file is produced under the documented path during normal operation, rolls by day and size, and contains at minimum startup, recipe-load, run start/stop/abort/fault, and shutdown events.
- `Warning` and above log events appear in the in-app diagnostics timeline in addition to the file.
- A forced exception on the UI thread produces a diagnostics entry, a log line, a crash file, and a visible non-blocking indication to the operator; the process does not terminate silently.
- A forced unobserved-task exception produces a log line naming the owning service and does not kill the process unless the service itself is unrecoverable.
- A forced `AppDomain.UnhandledException` produces a crash file before exit, and any in-flight run-history write completes or is cleanly abandoned without corrupting the store.
- `dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype` shows all starter counters, and running a short scripted scenario produces non-zero values for `frames.ingested`, `telemetry.ingested`, and `runs.started`.
- Launching a second instance of the application against the same data directory fails with a clear error and a log entry, and does not corrupt or double-write recipe or run-history files.
- The behaviors above are covered by automated tests where practical (log sink configuration, unhandled-exception handler wiring, counter increments, single-instance guard); UI-surface behavior may be verified manually and documented.
- A runbook entry documents where logs and crash files live, how long they are retained, and how to attach a counters session.
Verification Notes
The implementation task for this spec must include verification for:
- unhandled-exception handlers actually fire for each of the three paths (verified by a debug-only "crash me" test control or an automated harness)
- the log file path is writable on a fresh machine without elevation
- the crash file contains enough state (workflow, active run id, last diagnostics entries) to diagnose a failure without a debugger
- metrics counters increment under a real simulator run, not just a synthetic test
- the single-instance guard does not leak the mutex handle on crash
- the warnings-as-errors gate from SLICE-005 is satisfied without moving any of the new logging or metrics code behind conditional compilation