SLICE-006: Observability Baseline

Status: Proposed
Date: 2026-04-22
Depends on: Requirements, Evolution Roadmap, SLICE-005: CI and Quality Gates

Goal

Make the application observable enough that every later phase can answer "what actually happened?" and "what are the current rates?" with numbers rather than guesses.

Why This Slice

Every Phase 1 scaling goal (more telemetry tags, real frame payloads, encoder-rate motion, storm and soak profiles) and every Phase 2 store-refactor goal (slice AppState, lift out the data-plane, cut fan-out) depends on being able to see the pipeline under load. The prototype currently has:

a DiagnosticsEntries list in canonical state, bounded to 200
basic pipeline counters (TelemetryCoalesced, FramesDropped) visible in the UI
no structured log file
no unhandled-exception handlers for the dispatcher, the app domain, or unobserved tasks
no System.Diagnostics.Metrics surface, so dotnet-counters sees no application-level counters
no single-instance guard, so two copies of the app can fight over the same recipe and run-history files

In practical terms that means a crash mid-run disappears silently, long soak runs leave no trace, and "did the new simulator profile actually change the drop rate?" is not answerable outside the UI.

This slice adds the minimum surface that makes the next phases honest.

Requirements Coverage

04. UI and Technical Requirements: diagnostics surface, measurable pipeline behavior
05. Failure Modes and Workflow Requirements: unhandled failures must not lose the run history or silently crash the UI
07. AI Delivery Constraints and Roadmap: downstream phases need reproducible measurements

In Scope

structured logging via Serilog (or an equivalently configured Microsoft.Extensions.Logging backend) with at minimum:
- a rolling file sink under %LOCALAPPDATA%\InspectionPrototype\logs\ (or equivalent project convention)
- a console sink active when running under a debugger
- a sink (or bridge) that routes log events into the existing DiagnosticsEntries timeline at Warning or higher
three unhandled-exception handlers registered during host startup:
- App.DispatcherUnhandledException
- AppDomain.CurrentDomain.UnhandledException
- TaskScheduler.UnobservedTaskException
a crash-log path: every unhandled exception produces a dedicated crash file containing the full exception, its inner exceptions, and the workflow/run state at the moment of failure
a System.Diagnostics.Metrics.Meter (named InspectionPrototype) exposing starter counters that later phases will populate:
- frames.ingested (counter)
- frames.dropped (counter)
- telemetry.ingested (counter)
- telemetry.coalesced (counter)
- runs.started (counter)
- runs.completed (counter)
- runs.faulted (counter)
a single-instance mutex that prevents a second copy of the app from launching against the same data directory; second-launch attempts must fail loudly, not silently
documentation under docs/runbook/ (or equivalent) describing where logs go, how to read a crash file, and how to monitor the InspectionPrototype meter with dotnet-counters

Out of Scope

OpenTelemetry exporters, OTLP endpoints, external collectors
log forwarding to Elastic / Seq / Loki / cloud services
alerting, paging, or dashboards
distributed tracing, Activity propagation across process boundaries (there is only one process)
a full "crash reporter" UX (dialog, opt-in upload, issue creation)
changes to the existing DiagnosticsEntries 200-cap or its schema beyond what routing log events requires
adding new operational counters beyond the starter set above — Phase 1 slices extend them as they need

Runtime Behavior

Logging

every existing ILogger<T> call site continues to work unchanged
Information and above are persisted to the rolling file
Warning and above also become a DiagnosticsEntry in canonical state so the operator sees them in the UI
files roll by day and by size (a sensible default such as 10 MB per file, 7 days retained)
no secrets or recipe contents are logged at Information or above; full recipe objects are logged only at Debug
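
The logging behavior above could be configured roughly as follows. This is a minimal sketch, assuming the Serilog, Serilog.Sinks.File (5.0 or later, for retainedFileTimeLimit), and Serilog.Sinks.Console packages. LoggingSetup, DiagnosticsBridgeSink, the appendDiagnostics callback, and the app-.log file name are hypothetical names introduced for illustration; how a message actually becomes a DiagnosticsEntry in canonical state is left to the caller.

```csharp
// Assumes NuGet packages: Serilog, Serilog.Sinks.File (>= 5.0), Serilog.Sinks.Console.
using System;
using System.Diagnostics;
using System.IO;
using Serilog;
using Serilog.Core;
using Serilog.Events;

// Hypothetical bridge: hands each rendered event to a callback that is
// expected to append a DiagnosticsEntry to canonical state.
public sealed class DiagnosticsBridgeSink : ILogEventSink
{
    private readonly Action<LogEventLevel, string> _append;

    public DiagnosticsBridgeSink(Action<LogEventLevel, string> append) => _append = append;

    public void Emit(LogEvent logEvent) =>
        _append(logEvent.Level, logEvent.RenderMessage());
}

public static class LoggingSetup
{
    public static Logger Create(string logDirectory, Action<LogEventLevel, string> appendDiagnostics)
    {
        Directory.CreateDirectory(logDirectory);

        var config = new LoggerConfiguration()
            .MinimumLevel.Debug()
            // Information and above go to the rolling file: one file per day,
            // rolling again within a day once a file reaches 10 MB.
            // The sink's default file-count limit still applies alongside the time limit.
            .WriteTo.File(
                path: Path.Combine(logDirectory, "app-.log"),
                restrictedToMinimumLevel: LogEventLevel.Information,
                rollingInterval: RollingInterval.Day,
                rollOnFileSizeLimit: true,
                fileSizeLimitBytes: 10 * 1024 * 1024,
                retainedFileTimeLimit: TimeSpan.FromDays(7))
            // Warning and above are also forwarded to the diagnostics timeline.
            .WriteTo.Sink(
                new DiagnosticsBridgeSink(appendDiagnostics),
                restrictedToMinimumLevel: LogEventLevel.Warning);

        // Console output only when a debugger is attached.
        if (Debugger.IsAttached)
        {
            config = config.WriteTo.Console();
        }

        return config.CreateLogger();
    }
}
```

Existing ILogger<T> call sites would reach this logger through the usual Serilog provider for Microsoft.Extensions.Logging, which is not shown here.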

Unhandled Exceptions

a UI-thread exception does not tear down the process silently; it is logged, a diagnostics entry is appended, the active run (if any) is marked Faulted with a correlation id, and the UI surfaces a non-blocking banner directing the operator to the crash log
a background-task exception (unobserved task, BackgroundService failure) is logged with the owning service name and either recovers the service or transitions the workflow to Faulted
an AppDomain.UnhandledException produces a final crash file before the process exits; no partially-written run history is left behind (existing atomic-write behavior must be preserved)
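
A crash-file writer and the two framework-independent handlers might look like the sketch below. CrashReporter, its parameters, and the crash file naming scheme are hypothetical; the workflow state is passed in as a pre-formatted string because the canonical state types are not described here. The UI-thread handler (App.DispatcherUnhandledException) is omitted so the example compiles without WPF references; it would presumably call the same WriteCrashFile method and mark the event as handled.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class CrashReporter
{
    // Writes one file per failure and returns its path so the caller can log or display it.
    public static string WriteCrashFile(
        string crashDirectory, string source, Exception exception, string workflowState)
    {
        Directory.CreateDirectory(crashDirectory);
        var name = $"crash-{DateTime.UtcNow:yyyyMMdd-HHmmss}-{Guid.NewGuid():N}.txt";
        var path = Path.Combine(crashDirectory, name);

        // Exception.ToString() includes inner exceptions and stack traces.
        var text =
            $"Source: {source}{Environment.NewLine}" +
            $"Workflow state: {workflowState}{Environment.NewLine}" +
            exception;

        File.WriteAllText(path, text);
        return path;
    }

    // Registers the AppDomain and TaskScheduler handlers during host startup.
    public static void Register(string crashDirectory, Func<string> describeWorkflowState)
    {
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            // ExceptionObject is typed as object, so it may not be an Exception.
            var ex = e.ExceptionObject as Exception
                     ?? new Exception(e.ExceptionObject?.ToString());
            WriteCrashFile(crashDirectory, "AppDomain", ex, describeWorkflowState());
        };

        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            WriteCrashFile(crashDirectory, "UnobservedTask", e.Exception, describeWorkflowState());
            e.SetObserved(); // mark the exception as observed
        };
    }
}
```

Logging, appending the diagnostics entry, and marking the active run Faulted would hang off the same handlers; they are left out to keep the sketch short.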

Metrics

starting the app exposes the InspectionPrototype meter
dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype shows all starter counters updating live as runs execute (without --counters, only the default runtime providers are shown)
the counters are monotonic (values only increase); rates are computed by the consumer
counters survive workflow transitions: restarting a run does not reset them
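
One way to expose the starter set is a single long-lived object that owns the meter and its counters. This is a sketch only: PipelineMetrics and its property names are hypothetical, while the meter and instrument names come from the In Scope list. Registering it as a process-lifetime singleton is what would keep the counters from resetting across workflow transitions.

```csharp
using System;
using System.Diagnostics.Metrics;

public sealed class PipelineMetrics : IDisposable
{
    public const string MeterName = "InspectionPrototype";

    private readonly Meter _meter = new(MeterName);

    public Counter<long> FramesIngested { get; }
    public Counter<long> FramesDropped { get; }
    public Counter<long> TelemetryIngested { get; }
    public Counter<long> TelemetryCoalesced { get; }
    public Counter<long> RunsStarted { get; }
    public Counter<long> RunsCompleted { get; }
    public Counter<long> RunsFaulted { get; }

    public PipelineMetrics()
    {
        // Counter<T> is add-only; consumers such as dotnet-counters derive rates.
        FramesIngested = _meter.CreateCounter<long>("frames.ingested");
        FramesDropped = _meter.CreateCounter<long>("frames.dropped");
        TelemetryIngested = _meter.CreateCounter<long>("telemetry.ingested");
        TelemetryCoalesced = _meter.CreateCounter<long>("telemetry.coalesced");
        RunsStarted = _meter.CreateCounter<long>("runs.started");
        RunsCompleted = _meter.CreateCounter<long>("runs.completed");
        RunsFaulted = _meter.CreateCounter<long>("runs.faulted");
    }

    public void Dispose() => _meter.Dispose();
}
```

A call site would then record an event with, for example, metrics.FramesIngested.Add(1).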

Single-Instance Guard

a second copy of the app launched against the same data directory detects the existing instance, writes a log entry, and exits with a non-zero exit code
the guard is per data directory, not per machine — a future multi-tenant scenario remains possible
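
A per-data-directory guard can be sketched with a named mutex whose name is derived from the directory path. SingleInstanceGuard, TryAcquire, and the name prefix are hypothetical. The sketch treats the existence of the named mutex as the lock and never takes ownership, which avoids thread-affinity issues on release. It also assumes a case-insensitive file system when normalizing the path, and it uses the default session-local mutex scope.

```csharp
#nullable enable
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

public sealed class SingleInstanceGuard : IDisposable
{
    private readonly Mutex _mutex;

    private SingleInstanceGuard(Mutex mutex) => _mutex = mutex;

    // Returns a guard if this is the first instance for the directory, otherwise null.
    public static SingleInstanceGuard? TryAcquire(string dataDirectory)
    {
        var mutex = new Mutex(initiallyOwned: false, BuildName(dataDirectory), out var createdNew);
        if (createdNew)
        {
            return new SingleInstanceGuard(mutex);
        }

        // Another holder already created the named mutex.
        mutex.Dispose();
        return null;
    }

    private static string BuildName(string dataDirectory)
    {
        // Assumption: paths differing only by case or a trailing separator are the same directory.
        var normalized = Path.GetFullPath(dataDirectory)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            .ToUpperInvariant();

        // Hashing keeps path separators and length limits out of the mutex name.
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(normalized));
        return "InspectionPrototype-" + Convert.ToHexString(hash);
    }

    // Closing the handle lets the next launch create the mutex again.
    public void Dispose() => _mutex.Dispose();
}
```

At startup, a null result would be logged and turned into a non-zero exit code. Because the operating system closes the handle when the process dies, a crashed instance should not block the next launch.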

Acceptance Criteria

This slice is satisfied only if all of the following are true:

A structured log file is produced under the documented path during normal operation, rolls by day and size, and contains at minimum startup, recipe-load, run start/stop/abort/fault, and shutdown events.
Warning and above log events appear in the in-app diagnostics timeline in addition to the file.
A forced exception on the UI thread produces a diagnostics entry, a log line, a crash file, and a visible non-blocking indication to the operator; the process does not terminate silently.
A forced unobserved-task exception produces a log line naming the owning service and does not kill the process unless the service itself is unrecoverable.
A forced AppDomain.UnhandledException produces a crash file before exit, and any in-flight run-history write completes or is cleanly abandoned without corrupting the store.
dotnet-counters monitor --name InspectionPrototype.App --counters InspectionPrototype shows all starter counters, and running a short scripted scenario produces non-zero values for frames.ingested, telemetry.ingested, and runs.started.
Launching a second instance of the application against the same data directory fails with a clear error and a log entry, and does not corrupt or double-write recipe or run-history files.
The behaviors above are covered by automated tests where practical (log sink configuration, unhandled-exception handler wiring, counter increments, single-instance guard); UI-surface behavior may be verified manually and documented.
A runbook entry documents where logs and crash files live, how long they are retained, and how to attach a counters session.
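
For the automated counter-increment checks mentioned above, an in-process MeterListener can observe the meter without an external dotnet-counters session. The helper below is a hypothetical test utility (CounterProbe is not an existing type) that sums long-valued measurements per instrument name for one meter.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

public sealed class CounterProbe : IDisposable
{
    private readonly MeterListener _listener = new();
    private readonly ConcurrentDictionary<string, long> _totals = new();

    public CounterProbe(string meterName)
    {
        // Subscribe only to instruments published by the named meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == meterName)
            {
                listener.EnableMeasurementEvents(instrument);
            }
        };

        // Accumulate every long measurement under its instrument name.
        _listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
            _totals.AddOrUpdate(instrument.Name, value, (_, total) => total + value));

        _listener.Start();
    }

    // Total recorded so far for the counter, or 0 if nothing was recorded.
    public long Total(string counterName) =>
        _totals.TryGetValue(counterName, out var total) ? total : 0;

    public void Dispose() => _listener.Dispose();
}
```

A test would create the probe for InspectionPrototype, drive a short simulator scenario, and assert that Total("frames.ingested"), Total("telemetry.ingested"), and Total("runs.started") are greater than zero.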

Verification Notes

The implementation task for this spec must include verification for:

unhandled-exception handlers actually fire for each of the three paths (verified by a debug-only "crash me" test control or an automated harness)
the log file path is writable on a fresh machine without elevation
the crash file contains enough state (workflow, active run id, last diagnostics entries) to diagnose a failure without a debugger
metrics counters increment under a real simulator run, not just a synthetic test
the single-instance guard does not leak the mutex handle on crash
the warnings-as-errors gate from SLICE-005 is not satisfied by silently moving any of the new logging or metrics code behind conditional compilation
