
TASK-1.4: Implement Storm & Soak Profiles

Objective

Add storm-and-soak load-shaping knobs to SimulatorProfile, ship the two new profiles ChaosMonkey and Soak8h, and wire them through the simulator (TelemetryDropoutChance, TimeCompressionFactor, network-latency injection, defect-shower service, alarm-burster service, flaky-SDK decorator). Capture two row blocks: a 30-minute ChaosMonkey capture proving every WorkflowService fault branch is exercised, and an 8-hour Soak8h capture proving working-set growth stays under 50 MB. Together those two rows close the Phase 1 exit gate.

Scope

  • new SimulatorProfile fields: DefectShowerEveryMs, DefectShowerDurationMs, AlarmBurstEveryMs, TelemetryDropoutChance, NetworkLatencyMeanMs, NetworkLatencyStddevMs, TimeCompressionFactor
  • new FlakySdkOptions config block (Simulator:FlakySdk) and FlakySdkOptionsValidator
  • extension of SimulatorProfilesValidator to enforce the new fields' ranges and consistency rules
  • new IDefectShowerSchedule abstraction + DefectShowerService (IHostedService)
  • new AlarmBursterService (IHostedService)
  • SimulatedTagSource honors TelemetryDropoutChance per emitter cycle
  • SimulatedMachineConnection honors TimeCompressionFactor and NetworkLatencyMeanMs/Stddev
  • SimulatedMotionController.InterpolateAsync honors TimeCompressionFactor for per-tick wait only
  • FlakySdkDecorator<IMachineConnection> (timeout-hang, ignore-cancellation, out-of-band-throw)
  • conditional DI registration: when Simulator:FlakySdk:Enabled == true, IMachineConnection resolves through the decorator
  • new seed profiles ChaosMonkey and Soak8h in appsettings.json, plus a top-level Simulator:FlakySdk block
  • new MeasurementExtraction.psm1 helpers: Get-WorkingSetGrowthMb, Get-FaultCyclesCount; ConvertTo-MeasurementRow adds two new rows
  • new runbook §4.5 (ChaosMonkey, 30 min) and §4.6 (Soak8h, 8 h)
  • two row blocks in phase-1-measurements.md: slice-1-4-chaos-monkey, slice-1-4-soak-8h
  • tests: validator rejection cases (profile + flaky-sdk); DefectShowerService activation cycle; AlarmBursterService inject-clear-recover cycle; FlakySdkDecorator per-branch behavior; SimulatedTagSource dropout honor; SimulatedMachineConnection time-compression and Gaussian latency honor; reproducibility of prior rows (no regression)

Non-Scope

  • decorating IMotionController with FlakySdkDecorator — connection-only is enough to satisfy criterion 11's connect-failure / out-of-band-throw branches; motion-decorator is a documented follow-up
  • per-profile FlakySdkOptions — the Simulator:FlakySdk block is global; per-profile is a follow-up
  • introducing a TimeProvider-style global clock — TimeCompressionFactor is a localized scalar applied at two sites only
  • TagQuality.Stale transitions on dropouts — staleness handling is Phase 2 (SLICE-2.3 data-plane lift-out)
  • engineering-panel UI for runtime chaos tuning — Phase 3
  • changing IFaultInjector's interface — AlarmBursterService calls the existing API as-is
  • a new FlaUI scenario class — MultiTagSoakFlaUi with -Profile ChaosMonkey / -Profile Soak8h is the capture path
  • automated CI runs of the Soak8h profile — manual capture per release; CI guards stay green for sub-second tests
  • modifying SimulatedMotionController.InterpolateAsync's motion arithmetic — only the per-tick wait is scaled
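
To make the "localized scalar" idea concrete, the sketch below shows the kind of delay scaling the two sites would perform. `DelayScaling.ScaleDelay` is a hypothetical helper (the task names no such method; the real code may simply inline the division), and the 1500 ms connect delay is an assumption inferred from the Pass 1 test ranges. The 20 ms per-tick wait is the value quoted in the Pass 1 deliverables.

```csharp
using System;

// The two sites scaled by TimeCompressionFactor, at factor 5.0.
Console.WriteLine(DelayScaling.ScaleDelay(TimeSpan.FromMilliseconds(1500), 5.0).TotalMilliseconds); // prints 300
Console.WriteLine(DelayScaling.ScaleDelay(TimeSpan.FromMilliseconds(20), 5.0).TotalMilliseconds);   // prints 4

// Hypothetical helper, not part of the task's stated API.
public static class DelayScaling
{
    public static TimeSpan ScaleDelay(TimeSpan baseDelay, double timeCompressionFactor)
    {
        // The validator limits the factor to [0.1, 100.0]; guard here too so a
        // zero or negative factor can never reach the division.
        if (timeCompressionFactor < 0.1 || timeCompressionFactor > 100.0)
            throw new ArgumentOutOfRangeException(nameof(timeCompressionFactor));

        return TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds / timeCompressionFactor);
    }
}
```

Only the wait is divided; the number of loop iterations and the motion arithmetic stay as they are, which is why no global clock abstraction is needed.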

Touched Projects

  • src/InspectionPrototype.Application: IDefectShowerSchedule (Abstractions), DefectShowerService + AlarmBursterService (Services); SimulatorProfile.cs field additions (State); FramePipelineService.cs (consult IDefectShowerSchedule)
  • src/InspectionPrototype.Infrastructure: SimulatorProfileOptions field additions, FlakySdkOptions + validator, SimulatorProfilesValidator extension, SimulatedTagSource (dropout), SimulatedMachineConnection (time-compression + latency), SimulatedMotionController (per-tick wait scale only), FlakySdkDecorator, InfrastructureServiceCollectionExtensions (DI wiring + conditional decorator)
  • src/InspectionPrototype.App: appsettings.json (ChaosMonkey profile, Soak8h profile, Simulator:FlakySdk block, EncoderIntervalMs: 5 on the two new profiles)
  • tests/InspectionPrototype.Tests: SimulatorProfileFieldsTests, SimulatorProfilesValidatorChaosTests, FlakySdkOptionsValidatorTests, DefectShowerServiceTests, AlarmBursterServiceTests, FlakySdkDecoratorTests, SimulatedTagSourceDropoutTests, SimulatedMachineConnectionTimeCompressionTests, SimulatedMachineConnectionNetworkLatencyTests; recording stubs for IFaultInjector and IWorkflowService (under Stubs/) if not already present
  • tools/MeasurementExtraction.psm1: Get-WorkingSetGrowthMb, Get-FaultCyclesCount, ConvertTo-MeasurementRow extension
  • tests/Tools/MeasurementExtraction.Tests.ps1 — Pester tests for the new helpers
  • docs/runbook/capturing-measurements.md — new §4.5 and §4.6 (replace the §4.5+ placeholder)
  • docs/reviews/phase-1-measurements.md — two new row blocks
  • docs/captures/ — two new CSV files (slice-1-4-chaos-monkey-<date>.csv, slice-1-4-soak-8h-<date>.csv)
  • (no changes to) IMotionController interface, IFaultInjector interface, IMachineConnection interface, IAppStateStore, MainViewModel, AppState record, AppMetrics (no new counters)

AI Tool Guidance

Three Copilot passes; one-pass-per-session protocol as in TASK-1.2 / TASK-1.3.

  1. Profile fields + validators + telemetry dropout + time compression + network latency. Add the seven new SimulatorProfile fields, extend SimulatorProfilesValidator, add FlakySdkOptions + validator (the config block only; decorator is Pass 2). Wire TelemetryDropoutChance into SimulatedTagSource. Wire TimeCompressionFactor into SimulatedMachineConnection.ConnectAsync and SimulatedMotionController.InterpolateAsync per-tick wait. Wire NetworkLatencyMeanMs/Stddev into SimulatedMachineConnection.ConnectAsync (Gaussian latency). Add ChaosMonkey and Soak8h profiles + Simulator:FlakySdk block to appsettings.json. Tests for each. NO DefectShowerService, NO AlarmBursterService, NO FlakySdkDecorator, NO measurement-extraction work.
  2. Defect-shower + alarm-burster services + flaky-SDK decorator. Implement IDefectShowerSchedule + DefectShowerService, wire FramePipelineService.ProcessDefectsForFrame to consult it. Implement AlarmBursterService. Implement FlakySdkDecorator<IMachineConnection> and conditional DI wiring (decorator only when Simulator:FlakySdk:Enabled == true). Add Get-WorkingSetGrowthMb and Get-FaultCyclesCount to MeasurementExtraction.psm1; extend ConvertTo-MeasurementRow. Tests for each. NO captures.
  3. 30-min ChaosMonkey + 8-hour Soak8h captures + row blocks + runbook §4.5 + §4.6 + session-handoff updates. Run both captures (30 min, then 8 h on a clean session — same machine, both wall-clock real-time). Append two row blocks. Write runbook §4.5 (ChaosMonkey) and §4.6 (Soak8h). Inspect logs for fault-branch coverage evidence. Update CLAUDE.md / roadmap-progress. NO code changes.

Acceptance Criteria Mapping

The implementation must satisfy all acceptance criteria from SLICE-1.4:

  • Pass 1 covers criteria 1, 2, 3, 4 (config-only portions), 7, 8, 9, and the validator/simulator portions of 15
  • Pass 2 covers criteria 5, 6, 10, 13, and the service / decorator / extraction portions of 15
  • Pass 3 covers criteria 4 (capture-driven verification), 11, 12, 14, 16

Copilot Agent Prompts

Pass 1 — Profile fields + validators + telemetry dropout + time compression + network latency

You are implementing Pass 1 of TASK-1.4 in this repository: add the storm-and-
soak profile fields, extend SimulatorProfilesValidator, add FlakySdkOptions
(config-only — the decorator is Pass 2), and wire the three "low-friction"
load-shaping knobs (TelemetryDropoutChance into SimulatedTagSource;
TimeCompressionFactor and NetworkLatencyMeanMs/Stddev into
SimulatedMachineConnection; TimeCompressionFactor into
SimulatedMotionController per-tick wait). Add ChaosMonkey and Soak8h profiles
plus the Simulator:FlakySdk block to appsettings.json.

NO DefectShowerService, NO AlarmBursterService, NO FlakySdkDecorator, NO
MeasurementExtraction.psm1 changes, NO captures.

## Authoritative references

Read these before making changes:
- docs/specs/SLICE-1.4-storm-and-soak-profiles.md       (the requirements)
- docs/tasks/TASK-1.4-implement-storm-and-soak-profiles.md (this task)
- src/InspectionPrototype.Application/State/SimulatorProfile.cs
- src/InspectionPrototype.Application/State/NoiseModelEvaluator.cs (Box-Muller helper to reuse)
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesOptions.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesValidator.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorEncoderOptionsValidator.cs (parallel pattern)
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedTagSource.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedMachineConnection.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedMotionController.cs
- src/InspectionPrototype.Infrastructure/InfrastructureServiceCollectionExtensions.cs
- src/InspectionPrototype.App/appsettings.json

Spec acceptance criteria 1, 2, 3, 4 (config portions), 7, 8, 9, and the
validator/simulator portions of 15 are the definition of done for this pass.

## Scope of this pass

Profile-record + options-class field additions, validator extensions,
SimulatedTagSource dropout, SimulatedMachineConnection time-compression and
Gaussian latency, SimulatedMotionController per-tick wait scale, appsettings.json
updates, tests for each. NO DefectShowerService, NO AlarmBursterService,
NO FlakySdkDecorator, NO MeasurementExtraction.psm1 changes.

## Deliverables

1. SimulatorProfile (src/InspectionPrototype.Application/State/SimulatorProfile.cs):
   Add seven properties (with XML doc) at the end of the record:
       int DefectShowerEveryMs       = 0   // 0 = disabled
       int DefectShowerDurationMs    = 0
       int AlarmBurstEveryMs         = 0
       double TelemetryDropoutChance = 0.0
       double NetworkLatencyMeanMs   = 0.0
       double NetworkLatencyStddevMs = 0.0
       double TimeCompressionFactor  = 1.0
   Update SimulatorProfile.Default to keep its current values (all new fields
   default-zero / 1.0).

2. SimulatorProfileOptions
   (src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesOptions.cs):
   Mirror the seven property additions on the JSON-binding shape with the same
   defaults. Keep the property setters (this is the binder's mutable shape).

3. SimulatorProfilesValidator
   (src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesValidator.cs):
   Extend `Validate(...)` with rules per spec "Configuration validation":
   - DefectShowerEveryMs in [0, 3_600_000]
   - DefectShowerDurationMs in [0, 60_000]
   - if DefectShowerEveryMs > 0 then DefectShowerDurationMs > 0 and DefectShowerDurationMs ≤ DefectShowerEveryMs
   - AlarmBurstEveryMs in [0, 3_600_000]
   - TelemetryDropoutChance in [0.0, 1.0]
   - NetworkLatencyMeanMs in [0.0, 30_000.0]
   - NetworkLatencyStddevMs in [0.0, 30_000.0]
   - TimeCompressionFactor in [0.1, 100.0]
   Each failure names the offending profile and field, mirroring the existing
   EncoderIntervalMs failure-message style.

4. FlakySdkOptions (src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptions.cs):
   New sealed class:
       public const string SectionName = "Simulator:FlakySdk";
       public bool Enabled { get; set; } = false;
       public double TimeoutChance { get; set; } = 0.0;
       public double IgnoreCancellationChance { get; set; } = 0.0;
       public double OutOfBandThrowChance { get; set; } = 0.0;
       /// <summary>For tests: how long the timeout-hang branch waits. Default 30 s in production.</summary>
       public int TimeoutHangMs { get; set; } = 30_000;

5. FlakySdkOptionsValidator
   (src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptionsValidator.cs):
   IValidateOptions<FlakySdkOptions> — reject any of the three Chance fields
   outside [0.0, 1.0]; reject TimeoutHangMs < 1 or > 600_000.

6. SimulatedTagSource (dropout wiring):
   In the per-emitter loop, before computing/publishing the sample, draw
   `Random.Shared.NextDouble()`. If the draw is less than the active
   profile's TelemetryDropoutChance, skip the publish but still advance the
   noise ref state (so the random-walk is visibly continuous after dropout
   ends). The active profile is read from ISimulatorProfileProvider.CurrentProfile.

   IMPORTANT: do NOT call _metrics.SamplesIngested.Add when dropping; do NOT
   call _metrics.SamplesCoalesced.Add (a dropout is not a coalesce).

7. SimulatedMachineConnection (time-compression + Gaussian latency):
   Replace the current `await Task.Delay(_connectDelay, cancellationToken)`
   block with:
       var profile = _profileProvider.CurrentProfile;
       var compressedDelay = TimeSpan.FromMilliseconds(
           _connectDelay.TotalMilliseconds / profile.TimeCompressionFactor);
       await Task.Delay(compressedDelay, cancellationToken);
       if (profile.NetworkLatencyMeanMs > 0) {
           var jitter = SampleGaussianClampedMs(
               profile.NetworkLatencyMeanMs, profile.NetworkLatencyStddevMs);
           if (jitter > 0)
               await Task.Delay(TimeSpan.FromMilliseconds(jitter), cancellationToken);
       }
   `SampleGaussianClampedMs(mean, stddev)` is a private static helper that
   uses Box-Muller (mirror NoiseModelEvaluator's existing implementation; do
   not depend on or import from Application). Negative draws clamp to 0.

8. SimulatedMotionController (per-tick wait scale):
   In `InterpolateAsync`, scale only the per-iteration `Task.Delay(20ms, ct)`
   call. The motion arithmetic, _currentX/_currentY, the PositionChanged event,
   and the loop count are unchanged. Read TimeCompressionFactor from
   ISimulatorProfileProvider.CurrentProfile each iteration (so a profile change
   mid-move takes effect on the next tick — matches existing convention).

9. DI wiring (InfrastructureServiceCollectionExtensions):
   - services.AddSingleton<IValidateOptions<FlakySdkOptions>, FlakySdkOptionsValidator>();
   - services.AddOptions<FlakySdkOptions>()
       .BindConfiguration(FlakySdkOptions.SectionName)
       .ValidateOnStart();
   The decorator wiring is Pass 2; in Pass 1, IMachineConnection still resolves
   directly to SimulatedMachineConnection.

10. appsettings.json (src/InspectionPrototype.App/appsettings.json):
    Add the ChaosMonkey profile (after EncoderRate):
        Name: "ChaosMonkey"
        MotionSpeedUnitsPerSecond: 50.0
        TelemetryIntervalMs: 50
        FrameIntervalMs: 100
        FrameWidth: 1024
        FrameHeight: 768
        BytesPerPixel: 1
        EncoderIntervalMs: 5
        DefectProbabilityPerFrame: 0.05
        ConnectionFailureProbability: 0.30
        DefectShowerEveryMs: 30000
        DefectShowerDurationMs: 3000
        AlarmBurstEveryMs: 45000
        TelemetryDropoutChance: 0.05
        NetworkLatencyMeanMs: 250
        NetworkLatencyStddevMs: 150
        TimeCompressionFactor: 1.0
    Add the Soak8h profile (after ChaosMonkey):
        Name: "Soak8h"
        MotionSpeedUnitsPerSecond: 30.0
        TelemetryIntervalMs: 100
        FrameIntervalMs: 250
        FrameWidth: 1024
        FrameHeight: 768
        BytesPerPixel: 1
        EncoderIntervalMs: 5
        DefectProbabilityPerFrame: 0.05
        ConnectionFailureProbability: 0.05
        DefectShowerEveryMs: 600000
        DefectShowerDurationMs: 5000
        AlarmBurstEveryMs: 0
        TelemetryDropoutChance: 0.01
        NetworkLatencyMeanMs: 50
        NetworkLatencyStddevMs: 20
        TimeCompressionFactor: 1.0
    Add a new "FlakySdk" block nested inside "Simulator" (so it binds at Simulator:FlakySdk):
        Enabled: true
        TimeoutChance: 0.05
        IgnoreCancellationChance: 0.05
        OutOfBandThrowChance: 0.05
        TimeoutHangMs: 30000

11. Tests under tests/InspectionPrototype.Tests/:
    - SimulatorProfileFieldsTests:
        record round-trip for the seven new fields; SimulatorProfile.Default
        carries all defaults.
    - SimulatorProfilesValidatorChaosTests (or extend the existing
      SimulatorProfilesValidatorTests): one [Theory] per new rule covering
      the boundary cases (e.g., DefectShowerEveryMs = -1, 0, 1, 3_600_000,
      3_600_001).
    - FlakySdkOptionsValidatorTests: reject each Chance field outside [0.0, 1.0];
      reject TimeoutHangMs of 0 and 600_001.
    - SimulatedTagSourceDropoutTests:
        Construct SimulatedTagSource with a single tag at 100 Hz, a fake
        ISimulatorProfileProvider that yields a profile with
        TelemetryDropoutChance = 0.5, run for 10 seconds (about 1000 emitter
        cycles at 100 Hz), count AppMetrics.SamplesIngested entries against a
        MeterListener (or count cell-write events via a recording cell-store).
        Assert count is in [400, 600] (50% ± 10% of 1000 cycles). Run a second
        test with TelemetryDropoutChance = 1.0 and assert SamplesIngested == 0
        and SamplesCoalesced == 0.
    - SimulatedMachineConnectionTimeCompressionTests:
        With TimeCompressionFactor = 1.0, ConnectAsync wall-clock is in
        [1400ms, 1700ms]. With TimeCompressionFactor = 5.0, in [280ms, 360ms]
        (ignoring jitter; both tests set NetworkLatencyMeanMs = 0).
    - SimulatedMachineConnectionNetworkLatencyTests:
        With TimeCompressionFactor = 5.0 (to keep total wall-clock low),
        NetworkLatencyMeanMs = 100, NetworkLatencyStddevMs = 20:
        run 100 ConnectAsync calls; measure each beyond the compressed
        connect-delay floor; assert mean across the 100 samples is in
        [80, 120] ms.
        Set ConnectionFailureProbability = 0 to remove the false-return path.
    - SimulatedMotionControllerTimeCompressionTests:
        Construct with FakeSimulatorProfileProvider yielding TimeCompressionFactor
        = 5.0; call MoveToAsync(distance such that real-time would take ~500ms);
        assert wall-clock duration in [80ms, 200ms] and the final
        _currentX/_currentY match the destination.
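
As an illustration of the Box-Muller helper described in Deliverable 7, here is a minimal sketch. It is not the existing NoiseModelEvaluator code. Passing the `Random` in as a parameter is an assumption made for repeatability (the task's helper takes only mean and stddev), and the class name `GaussianLatency` is hypothetical.

```csharp
using System;

// Demo: average many draws at the latency-test settings (mean 100 ms, stddev 20 ms).
var rng = new Random(12345); // fixed seed so the demo is repeatable
const int draws = 10_000;
double sum = 0;
for (int i = 0; i < draws; i++)
    sum += GaussianLatency.SampleGaussianClampedMs(rng, 100.0, 20.0);
Console.WriteLine($"mean over {draws} draws: {sum / draws:F1} ms");

// Hypothetical container class; the task makes this a private static helper.
public static class GaussianLatency
{
    public static double SampleGaussianClampedMs(Random rng, double meanMs, double stddevMs)
    {
        // Box-Muller transform: two uniform draws become one standard-normal draw.
        double u1 = 1.0 - rng.NextDouble(); // shift to (0, 1] so Math.Log never sees 0
        double u2 = rng.NextDouble();
        double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);

        // Negative draws clamp to 0, as the deliverable requires.
        return Math.Max(0.0, meanMs + stddevMs * z);
    }
}
```

With mean 100 and stddev 20 the clamp at zero is five standard deviations away, so it barely shifts the sample mean. That is consistent with the [80, 120] ms window in the latency test.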

## Constraints

- Do NOT implement DefectShowerService or AlarmBursterService — Pass 2.
- Do NOT implement FlakySdkDecorator — Pass 2.
- Do NOT change AppMetrics or AppState — no new counters or fields outside
  SimulatorProfile.
- Do NOT introduce a TimeProvider abstraction. TimeCompressionFactor is a
  scalar applied at exactly two sites.
- Do NOT modify SimulatedMotionController's motion arithmetic, _currentX,
  _currentY, or the PositionChanged event. Only the per-tick wait is scaled.
- Do NOT remove or rename any existing SimulatorProfile field. Existing
  profiles must compile and run unchanged at runtime when no chaos field is
  set.
- Do NOT change MainViewModel, IFaultInjector, IWorkflowService, IAppStateStore.
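
The range and consistency rules from Deliverable 3 can be sketched as a pure function, which also shows why existing profiles keep validating when no chaos field is set. Everything named here is hypothetical: `ChaosFields` stands in for the seven new SimulatorProfile fields, `ChaosRules.Validate` is not the real validator, and the failure-message wording is an assumption (the real messages should mirror the existing EncoderIntervalMs style).

```csharp
using System;
using System.Collections.Generic;

// Defaults (all zero, factor 1.0) must validate cleanly.
Console.WriteLine(ChaosRules.Validate("Normal", new ChaosFields()).Count); // prints 0

// A shower interval without a duration violates the consistency rule.
var broken = new ChaosFields { DefectShowerEveryMs = 1000, DefectShowerDurationMs = 0 };
foreach (var failure in ChaosRules.Validate("Broken", broken))
    Console.WriteLine(failure);

// Hypothetical stand-in for the seven new profile fields and their defaults.
public sealed record ChaosFields
{
    public int DefectShowerEveryMs { get; init; }
    public int DefectShowerDurationMs { get; init; }
    public int AlarmBurstEveryMs { get; init; }
    public double TelemetryDropoutChance { get; init; }
    public double NetworkLatencyMeanMs { get; init; }
    public double NetworkLatencyStddevMs { get; init; }
    public double TimeCompressionFactor { get; init; } = 1.0;
}

public static class ChaosRules
{
    public static List<string> Validate(string profile, ChaosFields f)
    {
        var failures = new List<string>();

        // Each failure names the offending profile and field.
        void Range(string field, double value, double min, double max)
        {
            if (double.IsNaN(value) || value < min || value > max)
                failures.Add($"Profile '{profile}': {field} = {value} is outside [{min}, {max}].");
        }

        Range(nameof(f.DefectShowerEveryMs), f.DefectShowerEveryMs, 0, 3_600_000);
        Range(nameof(f.DefectShowerDurationMs), f.DefectShowerDurationMs, 0, 60_000);
        Range(nameof(f.AlarmBurstEveryMs), f.AlarmBurstEveryMs, 0, 3_600_000);
        Range(nameof(f.TelemetryDropoutChance), f.TelemetryDropoutChance, 0.0, 1.0);
        Range(nameof(f.NetworkLatencyMeanMs), f.NetworkLatencyMeanMs, 0.0, 30_000.0);
        Range(nameof(f.NetworkLatencyStddevMs), f.NetworkLatencyStddevMs, 0.0, 30_000.0);
        Range(nameof(f.TimeCompressionFactor), f.TimeCompressionFactor, 0.1, 100.0);

        // Consistency: an enabled shower needs 0 < duration <= interval.
        if (f.DefectShowerEveryMs > 0 &&
            (f.DefectShowerDurationMs <= 0 || f.DefectShowerDurationMs > f.DefectShowerEveryMs))
            failures.Add($"Profile '{profile}': DefectShowerDurationMs must be in (0, DefectShowerEveryMs] when DefectShowerEveryMs > 0.");

        return failures;
    }
}
```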

## Verification before you report done

  dotnet build --configuration Release
  dotnet test --configuration Release

Manual smoke test:
  - Launch interactively (no --scenario flag); app starts with Normal profile
    (all chaos fields default-zero); no warnings or crashes about missing
    Simulator:FlakySdk config; no diagnostics-warning log spam.
  - Switch to ChaosMonkey via the profile combo-box; observe the simulator
    log warnings about ConnectionFailureProbability=0.3 and the latency
    injection paths firing. The DefectShowerEvery/AlarmBurstEvery fields are
    set in config but no service consumes them yet (Pass 2's job) — the
    interactive smoke test should NOT see defect storms or alarm bursts. That
    is correct.
  - Switch to Soak8h; behavior similarly muted (most chaos fields set low).
  - Switch back to Normal; everything returns to current behavior.

## Report format when finished

- files created and modified
- confirmation that all existing tests pass plus new tests
- a single commit hash
- commit message: "feat(sim): add storm/soak profile fields, validators, and three load-shaping knobs (pass 1/3 of TASK-1.4)"

Pass 2 — Defect-shower + alarm-burster services + flaky-SDK decorator

You are implementing Pass 2 of TASK-1.4. Pass 1 (profile fields, validators,
TelemetryDropoutChance / TimeCompressionFactor / NetworkLatency wiring,
ChaosMonkey + Soak8h profiles, Simulator:FlakySdk config block) is already
merged. This pass adds the three new background services / decorators that
consume the chaos fields, plus the measurement-extraction helpers.

## Authoritative references

Read these before making changes:
- docs/specs/SLICE-1.4-storm-and-soak-profiles.md  (criteria 5, 6, 10, 13)
- src/InspectionPrototype.Application/State/SimulatorProfile.cs (Pass 1 — chaos fields are present)
- src/InspectionPrototype.Application/Services/FramePipelineService.cs
- src/InspectionPrototype.Application/Abstractions/IFaultInjector.cs
- src/InspectionPrototype.Application/Abstractions/IWorkflowService.cs
- src/InspectionPrototype.Application/Services/EncoderStreamPipelineService.cs (parallel BackgroundService pattern)
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedMachineConnection.cs
- src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptions.cs (Pass 1)
- src/InspectionPrototype.Infrastructure/InfrastructureServiceCollectionExtensions.cs
- tools/MeasurementExtraction.psm1
- tests/Tools/MeasurementExtraction.Tests.ps1

Pass 1 must be merged. Confirm by inspecting that SimulatorProfile carries
the seven new fields and that appsettings.json has ChaosMonkey, Soak8h, and
the Simulator:FlakySdk block.

## Scope of this pass

DefectShowerService, AlarmBursterService, FlakySdkDecorator<IMachineConnection>
+ conditional DI, MeasurementExtraction.psm1 helpers (Get-WorkingSetGrowthMb,
Get-FaultCyclesCount), ConvertTo-MeasurementRow extension. Tests for each.
NO captures, NO simulator-side changes beyond the FramePipelineService
consult of IDefectShowerSchedule, NO new FlaUI tests.

## Deliverables

1. IDefectShowerSchedule
   (src/InspectionPrototype.Application/Abstractions/IDefectShowerSchedule.cs):
   public interface IDefectShowerSchedule {
       /// <summary>
       /// True iff the active simulator profile has DefectShowerEveryMs > 0
       /// AND we are currently inside an active shower window.
       /// FramePipelineService consults this on every frame.
       /// </summary>
       bool IsShowerActive { get; }
   }

2. DefectShowerService
   (src/InspectionPrototype.Application/Services/DefectShowerService.cs):
   - Implements IDefectShowerSchedule and IHostedService
   - Constructor takes ISimulatorProfileProvider, IAppStateStore (for
     WithDiagnosticsEntry on transitions), ILogger<DefectShowerService>
   - Field: private volatile bool _isActive = false
   - StartAsync: subscribe to ISimulatorProfileProvider.ProfileChanged;
     start background task that, while not stopping:
       * read profile.DefectShowerEveryMs; if 0, await ProfileChanged event
         (or a 1-second poll), continue
       * else: await Task.Delay(EveryMs - DurationMs); set _isActive = true,
         log Info "Defect shower active"; await Task.Delay(DurationMs); set
         _isActive = false, log Info "Defect shower ended". Loop.
     Also write a one-line diagnostics-timeline entry on each transition
     via _store.Update(s => s.WithDiagnosticsEntry(DiagnosticsSource.Pipeline, ...))
   - StopAsync: cancel the background task; _isActive = false.

   FramePipelineService.ProcessDefectsForFrame: inject IDefectShowerSchedule;
   when IsShowerActive == true, skip the per-frame probability check (always
   produce a defect). The per-defect severity distribution and
   AppState.ActiveRun.DefectsCritical/Major/Minor accumulation flow as today.

3. AlarmBursterService
   (src/InspectionPrototype.Application/Services/AlarmBursterService.cs):
   - Implements IHostedService
   - Constructor takes IFaultInjector, IWorkflowService, ISimulatorProfileProvider,
     ILogger<AlarmBursterService>
   - Field: private static readonly string[] _pool = ["CHAOS-001",...,"CHAOS-005"]
   - Field: private int _index = 0
   - StartAsync: subscribe to ProfileChanged; start background task:
       * read profile.AlarmBurstEveryMs; if 0, await ProfileChanged event
         (or a 1-second poll), continue
       * else: while not stopping:
            await Task.Delay(EveryMs, ct);
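             // Interlocked.Increment returns the post-increment value; subtract 1 so the first burst uses CHAOS-001.
             var code = _pool[(Interlocked.Increment(ref _index) - 1) % _pool.Length];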
            try {
                _faultInjector.InjectCriticalFault(code, $"ChaosMonkey burst at {DateTimeOffset.UtcNow:HH:mm:ss.fff}");
                await Task.Delay(500, ct);
                _faultInjector.ClearFault(code);
                await Task.Delay(500, ct);
                await _workflow.RecoverAsync();
            } catch (Exception ex) when (ex is not OperationCanceledException) {
                _logger.LogWarning(ex, "Alarm burst cycle failed; continuing.");
            }
   - StopAsync: cancel; do not re-throw.

4. FlakySdkDecorator<IMachineConnection>
   (src/InspectionPrototype.Infrastructure/Simulator/FlakySdkDecorator.cs):
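   public sealed class FlakySdkDecorator : IMachineConnection {
       private readonly IMachineConnection _inner;
       private readonly IOptionsMonitor<FlakySdkOptions> _options;
       private readonly ILogger<FlakySdkDecorator> _logger;

       // Argument order matches the `new FlakySdkDecorator(inner, options, logger)` call in Deliverable 5.
       public FlakySdkDecorator(
           IMachineConnection inner,
           IOptionsMonitor<FlakySdkOptions> options,
           ILogger<FlakySdkDecorator> logger) {
           _inner = inner;
           _options = options;
           _logger = logger;
       }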

       public async Task<bool> ConnectAsync(CancellationToken ct) {
           var opts = _options.CurrentValue;
           if (!opts.Enabled) return await _inner.ConnectAsync(ct);

           if (Random.Shared.NextDouble() < opts.TimeoutChance) {
               _logger.LogInformation("FlakySdk: timeout-hang branch fired.");
               // Cancellation during the hang propagates as OperationCanceledException.
               await Task.Delay(opts.TimeoutHangMs, ct);
               // If not cancelled (long enough wait), fall through to inner.
           }

           if (Random.Shared.NextDouble() < opts.IgnoreCancellationChance) {
               _logger.LogInformation("FlakySdk: ignore-cancellation branch fired.");
               // Ignore caller's CancellationToken.
               return await _inner.ConnectAsync(CancellationToken.None);
           }

           if (Random.Shared.NextDouble() < opts.OutOfBandThrowChance) {
               _logger.LogWarning("FlakySdk: out-of-band-throw branch fired.");
               throw new InvalidOperationException(
                   "FlakySdk: simulated out-of-band SDK exception.");
           }

           return await _inner.ConnectAsync(ct);
       }

       public Task DisconnectAsync() => _inner.DisconnectAsync();
   }

5. Conditional DI wiring (InfrastructureServiceCollectionExtensions):
   Replace the current line
       services.AddSingleton<IMachineConnection, SimulatedMachineConnection>();
   with:
       services.AddSingleton<SimulatedMachineConnection>();
       services.AddSingleton<IMachineConnection>(sp => {
           var opts = sp.GetRequiredService<IOptionsMonitor<FlakySdkOptions>>().CurrentValue;
           var inner = sp.GetRequiredService<SimulatedMachineConnection>();
           return opts.Enabled
               ? new FlakySdkDecorator(
                     inner,
                     sp.GetRequiredService<IOptionsMonitor<FlakySdkOptions>>(),
                     sp.GetRequiredService<ILogger<FlakySdkDecorator>>())
               : (IMachineConnection)inner;
       });
   Also register:
       services.AddSingleton<DefectShowerService>();
       services.AddSingleton<IDefectShowerSchedule>(sp => sp.GetRequiredService<DefectShowerService>());
       services.AddHostedService(sp => sp.GetRequiredService<DefectShowerService>());
       services.AddSingleton<AlarmBursterService>();
       services.AddHostedService(sp => sp.GetRequiredService<AlarmBursterService>());

6. tools/MeasurementExtraction.psm1:
   Add and export Get-WorkingSetGrowthMb:
       function Get-WorkingSetGrowthMb {
           [CmdletBinding()]
           # AllowEmptyCollection lets an empty CSV bind; a Mandatory array parameter rejects @() otherwise.
           param([Parameter(Mandatory)][AllowEmptyCollection()][object[]] $Csv)
           # @(...) keeps $rows an array even when zero or one row matches.
           $rows = @($Csv | Where-Object {
               $_.'Counter Name' -match 'dotnet\.process\.memory\.working_set'
           } | Sort-Object Timestamp)
           if ($rows.Count -lt 2) { return $null }
           $first = [double]$rows[0].'Mean/Increment'
           $last = [double]$rows[-1].'Mean/Increment'
           return [math]::Round(($last - $first) / 1MB, 1)
       }

   Add and export Get-FaultCyclesCount:
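       function Get-FaultCyclesCount {
           [CmdletBinding()]
           # AllowEmptyCollection lets an empty CSV bind and return 0 instead of failing parameter binding.
           param([Parameter(Mandatory)][AllowEmptyCollection()][object[]] $Csv)
           # @(...) keeps $rows an array even when zero or one row matches.
           $rows = @($Csv | Where-Object {
               $_.'Counter Name' -match 'runs\.faulted'
           })
           if ($rows.Count -eq 0) { return 0 }
           return [int](($rows | Measure-Object -Property 'Mean/Increment' -Sum).Sum)
       }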

   Update ConvertTo-MeasurementRow to call both helpers and append two rows:
       | working-set growth (MB) | <Get-WorkingSetGrowthMb output, or "—" if null> |
       | fault-cycles (count)    | <Get-FaultCyclesCount output> |
   Use the same "—" sentinel pattern from SLICE-1.2 / SLICE-1.3 for missing data.

7. tests/Tools/MeasurementExtraction.Tests.ps1:
   Four new Pester tests:
   - "WorkingSetGrowthMb_OnFixture_ComputesLastMinusFirst": synthetic CSV
     with two working_set rows, assert correct (last - first) / 1MB.
   - "WorkingSetGrowthMb_OnEmptyCsv_ReturnsNull": empty CSV → $null.
   - "FaultCyclesCount_OnFixture_SumsRunsFaulted": fixture with three rows of
     runs.faulted, Mean/Increment = 1, 2, 3 → returns 6.
   - "ConvertTo-MeasurementRow_AppendsTwoNewRows_WhenCsvHasData": fixture
     with working_set + runs.faulted; assert markdown contains both rows.

8. Tests under tests/InspectionPrototype.Tests/:
   - DefectShowerServiceTests: with DefectShowerEveryMs = 200, DurationMs = 100,
     run for 1 second; assert IsShowerActive transitions ≥ 3 times.
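   - AlarmBursterServiceTests: with AlarmBurstEveryMs = 100 and recording
     IFaultInjector + IWorkflowService stubs, run long enough for six full
     cycles; assert InjectCriticalFault was called ≥ 3 times AND ClearFault
     was called for each injection AND RecoverAsync was called the same number
     of times AND the alarm codes cycled through CHAOS-001 → 005 → 001 in order.
     Timing: with the two 500 ms holds in Deliverable 3, one cycle lasts
     AlarmBurstEveryMs + 1000 ms (1100 ms here), so six cycles need about
     6.6 s; a 500 ms run sees at most one injection and no ClearFault.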
     Second test: stub IWorkflowService.RecoverAsync to throw
     InvalidOperationException; assert the service logs a Warning and continues
     ticking (no host fault).
   - FlakySdkDecoratorTests: three [Fact]s — one per branch — by setting one
     Chance to 1.0 and the others to 0.0.
       * TimeoutHang: with TimeoutHangMs = 100 and caller CTS that cancels
         at 50ms, assert OperationCanceledException is thrown.
       * IgnoreCancellation: caller cancels CTS immediately; assert
         ConnectAsync still returns successfully (the wrapped inner sees no
         cancellation).
       * OutOfBandThrow: assert InvalidOperationException is thrown.
     Plus a passthrough [Fact] with Enabled=false: assert no extra delay,
     no extra throws — bypasses to inner directly.
   - FramePipelineServiceShowerTests (extend the existing suite): with a
     fake IDefectShowerSchedule that returns IsShowerActive=true and a
     profile DefectProbabilityPerFrame=0.0, push 10 frames; assert
     ActiveRun.DefectCount == 10.
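
The cycle-order assertion in AlarmBursterServiceTests depends on how the pool index is derived. `Interlocked.Increment` returns the value after incrementing, so an index taken from it directly would start at the second pool entry. The sketch below subtracts one so the sequence begins at CHAOS-001. `AlarmCodePool` is a hypothetical name, and the three middle codes are assumed, since the task elides them.

```csharp
using System;
using System.Threading;

// Six draws: five distinct codes, then the wrap back to the first.
int counter = 0;
for (int i = 0; i < 6; i++)
    Console.WriteLine(AlarmCodePool.Next(ref counter));
// prints CHAOS-001, CHAOS-002, CHAOS-003, CHAOS-004, CHAOS-005, CHAOS-001 (one per line)

// Hypothetical helper; the task keeps the pool as a private field of AlarmBursterService.
public static class AlarmCodePool
{
    // CHAOS-002..004 are assumed; the task only spells out the first and last codes.
    private static readonly string[] Pool =
        { "CHAOS-001", "CHAOS-002", "CHAOS-003", "CHAOS-004", "CHAOS-005" };

    public static string Next(ref int counter)
    {
        // Increment returns the new value (1 on the first call), so subtract 1
        // to obtain a zero-based ticket before taking the modulus.
        int ticket = Interlocked.Increment(ref counter) - 1;
        return Pool[ticket % Pool.Length];
    }
}
```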

## Constraints

- Do NOT change AppMetrics, AppState, IAppStateStore, IFaultInjector, or
  IWorkflowService.
- Do NOT add new metric counters. The two new measurement-extraction rows
  are derived from existing counters (working_set, runs.faulted).
- Do NOT decorate IMotionController. Connection-only is the slice's scope.
- Do NOT make AlarmBursterService crash the host on any inner exception.
  The "swallow and log" pattern is intentional — the service must outlive
  individual cycle failures.
- Do NOT make DefectShowerService produce defects directly. It is a *schedule*;
  FramePipelineService is what produces the defects when IsShowerActive == true.
- The FlakySdk decorator must not retry, log per-call, or accumulate state.
  The three branches are independent and stateless beyond the options snapshot.
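
Because the three draws are independent but evaluated in sequence, the per-call outcome rates differ slightly from the raw Chance values. The sketch below works through the arithmetic for the decorator logic in Deliverable 4, assuming the timeout-hang is not cancelled (a cancelled hang throws before the later draws happen). `FlakyOutcomes` and `OutcomeProbabilities` are hypothetical names used only for this illustration.

```csharp
using System;

// The appsettings.json values from Pass 1: each Chance is 0.05.
var p = FlakyOutcomes.Compute(0.05, 0.05, 0.05);
Console.WriteLine($"timeout-hang fires:        {p.Hang:F4}");
Console.WriteLine($"ignore-cancellation fires: {p.IgnoreCancellation:F4}");
Console.WriteLine($"out-of-band throw fires:   {p.Throw:F4}");
Console.WriteLine($"no branch fires:           {p.NoBranch:F4}");

public sealed record OutcomeProbabilities(
    double Hang, double IgnoreCancellation, double Throw, double NoBranch);

public static class FlakyOutcomes
{
    public static OutcomeProbabilities Compute(
        double timeoutChance, double ignoreCancellationChance, double outOfBandThrowChance)
    {
        // The hang is drawn first and falls through, so it does not gate the later draws.
        double hang = timeoutChance;

        // Ignore-cancellation returns early, so the throw draw only happens when it did not fire.
        double ignore = ignoreCancellationChance;
        double thrown = (1.0 - ignoreCancellationChance) * outOfBandThrowChance;

        // A fully clean pass-through needs all three draws to miss.
        double none = (1.0 - timeoutChance)
                    * (1.0 - ignoreCancellationChance)
                    * (1.0 - outOfBandThrowChance);

        return new OutcomeProbabilities(hang, ignore, thrown, none);
    }
}
```

At 0.05 each, roughly 86% of ConnectAsync calls pass through untouched, and the throw branch fires slightly less often than its configured chance because the ignore-cancellation branch returns first.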

## Verification before you report done

  dotnet build --configuration Release
  dotnet test --configuration Release

Plus:
  - Pester: Invoke-Pester tests/Tools/MeasurementExtraction.Tests.ps1
    All four new tests pass plus the existing tests.
  - Manual smoke capture (60 seconds, ChaosMonkey profile):
      tools/Capture-Measurements.ps1 -Scenario MultiTagSoak `
        -DurationSeconds 60 -Profile ChaosMonkey `
        -OutputCsv docs/captures/_smoke.csv `
        -CommitHash $(git rev-parse --short HEAD) -AllowDirty
    Verify:
      * exit code 0 OR a non-zero exit due to a fault-induced run termination
        (this is acceptable for ChaosMonkey — log the exit reason)
      * the printed row block has working-set growth (MB) and
        fault-cycles (count) rows present
      * fault-cycles ≥ 1 (at least one alarm-burster cycle landed in 60 s
        given AlarmBurstEveryMs = 45_000)
      * the diagnostics-timeline log shows "ChaosMonkey burst at ..." entries
    Delete the smoke CSV before commit.
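
The fault-cycles ≥ 1 expectation for the 60-second smoke capture follows from cadence arithmetic. A minimal sketch, assuming the alarm burster fires its first cycle one full AlarmBurstEveryMs interval after the capture starts (an assumption; the real first-fire offset is whatever AlarmBursterService implements). The helper name is hypothetical and is not part of MeasurementExtraction.psm1.

```powershell
# Hypothetical helper for sanity-checking expected burst counts.
# Assumes the first burst fires one full interval after the capture starts.
function Get-ExpectedBurstCount {
    param(
        [int]$DurationSeconds,
        [int]$AlarmBurstEveryMs
    )
    # An interval of 0 means no bursts are scheduled (as under Soak8h).
    if ($AlarmBurstEveryMs -le 0) { return 0 }
    # Floor of (capture duration in ms) / (burst interval in ms).
    [int][math]::Floor(($DurationSeconds * 1000) / $AlarmBurstEveryMs)
}

Get-ExpectedBurstCount -DurationSeconds 60   -AlarmBurstEveryMs 45000   # 1
Get-ExpectedBurstCount -DurationSeconds 1800 -AlarmBurstEveryMs 45000   # 40
```

With a 45_000 ms cadence the smoke window only just fits one burst, so a fault-cycles value of 0 in a run that was cut slightly short is more likely a timing artifact than a wiring bug; re-run before investigating.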

## Report format when finished

- files created and modified
- confirmation all C# tests + Pester tests pass
- the smoke-capture stdout (the row block) included as evidence
- a single commit hash
- commit message: "feat(app,sim,tools): add chaos services + flaky-SDK decorator + measurement helpers (pass 2/3 of TASK-1.4)"

Pass 3 — Captures + row blocks + runbook §4.5 + §4.6

You are implementing Pass 3 of TASK-1.4, the final pass. Passes 1 and 2 are
merged. This pass runs the 30-minute ChaosMonkey capture and the 8-hour Soak8h
capture, appends two row blocks, writes runbook §4.5 and §4.6, and updates
session-handoff documents. NO code changes — Passes 1 and 2 own those.

## Authoritative references

Read these before making changes:
- docs/specs/SLICE-1.4-storm-and-soak-profiles.md   (criteria 11, 12, 14, 16)
- docs/runbook/capturing-measurements.md            (existing §3a, §4.1–§4.4)
- docs/reviews/phase-1-measurements.md              (slice-1-2-real-frame-payloads,
                                                     slice-1-3-encoder-rate-motion
                                                     rows to mirror)
- CLAUDE.md, docs/reviews/roadmap-progress.md
- tools/Capture-Measurements.ps1

## Scope of this pass

Two captures, two table edits, two runbook sections (§4.5 + §4.6), and
session-handoff updates. No code or test changes.

## Deliverables

1. Disable system sleep AND hibernate AND screen-saver before the 8-hour
   soak; for the 30-min ChaosMonkey, sleep-disable is sufficient.
       powercfg /change standby-timeout-ac 0
       powercfg /change monitor-timeout-ac 0
       powercfg /hibernate off  # for the soak only
   Note the previous values in the session-log entry so they can be restored.

2. Run the 30-minute ChaosMonkey capture FIRST (it is shorter; if it surfaces
   a regression, do not waste 8 hours on the soak):
       $date = Get-Date -Format 'yyyy-MM-dd'
       tools/Capture-Measurements.ps1 -Scenario MultiTagSoak `
         -DurationSeconds 1800 -Profile ChaosMonkey `
         -OutputCsv "docs/captures/slice-1-4-chaos-monkey-$date.csv" `
         -CommitHash $(git rev-parse --short HEAD) `
         -SliceTag slice-1-4-chaos-monkey

   Verify:
       * The capture completed (no host crash; CSV span ≥ 1700 s).
       * runs.started ≥ 5 (criterion 11).
       * runs.faulted ≥ 5 (criterion 11).
       * The Logs/inspection-prototype-*.log file from the capture window
         contains entries showing every fault branch landed:
            (a) connect-failure: "Connection failed (simulated failure)" OR
                "FlakySdk: out-of-band-throw branch fired" OR
                "Connection error:" from DoConnectAsync
            (b) fault-during-home: "CRITICAL FAULT: [CHAOS-..." entries with
                surrounding "Homing started" / "Homing aborted"
            (c) fault-during-run: "CRITICAL FAULT: [CHAOS-..." entries with
                surrounding "Run running" / "Run loop interrupted"
            (d) fault-clear-and-recover: "Fault condition cleared: [CHAOS-..."
                followed by "Recovery completed."
       * The printed row block has 22 metrics, including the two new
         working-set growth (MB) and fault-cycles (count) rows.

   If criterion 11 fails (any branch is missing from the log), STOP — file
   the gap as a follow-up, do not proceed to §4.5 / §4.6 edits or to the
   soak. The most likely cause for a missing branch is a profile-config
   typo (e.g., AlarmBurstEveryMs accidentally 0); inspect the active
   profile snapshot in the diagnostics timeline.

3. Run the 8-hour Soak8h capture on a sleep-disabled session that is NOT
   shared with other workloads:
       $date = Get-Date -Format 'yyyy-MM-dd'  # re-set in case this is a new shell session
       tools/Capture-Measurements.ps1 -Scenario MultiTagSoak `
         -DurationSeconds 28800 -Profile Soak8h `
         -OutputCsv "docs/captures/slice-1-4-soak-8h-$date.csv" `
         -CommitHash $(git rev-parse --short HEAD) `
         -SliceTag slice-1-4-soak-8h

   Verify:
       * Capture span ≥ 28_500 s (about 1% short of the nominal 28_800 s).
       * working-set growth (MB) ≤ 50 (criterion 12).
       * gen-2-gc-count, normalized to a per-hour rate, is in the same order of
         magnitude as the slice-1-2-real-frame-payloads rate (no Gen-2
         runaway). If it is more than 4× the slice-1-2 rate, criterion 12 fails.
       * runs.faulted is expected to be 0. Under Soak8h AlarmBurstEveryMs = 0,
         so there is no ChaosMonkey-style alarm-burst activity; the
         ConnectionFailureProbability = 0.05 misconnects log warnings but are
         not critical-fault paths, so they should not raise runs.faulted.
       * No unhandled-exception entries in the log.

   If any of these fails, STOP. The 50 MB ceiling is the slice's primary
   exit gate — failing it is not a documentation problem.

4. Append TWO row blocks to docs/reviews/phase-1-measurements.md:

   "### Row — slice-1-4-chaos-monkey" (mirror slice-1-3 format):
   - 22 metrics: existing 18 + gc-pause-p95 + LOH-alloc-rate avg +
     working-set growth (MB) + fault-cycles (count)
   - Baseline = slice-1-3-encoder-rate-motion values for the 20 metrics that
     overlap; "—" for working-set growth and fault-cycles since SLICE-1.3
     predates them
   - Notes section with at least:
     (a) Why slice-1-3 is the baseline.
     (b) Per-fault-branch evidence: one bullet per fault branch (a)/(b)/(c)/(d)
         from deliverable 2, listing the log-line counts confirming the branch
         was hit.
     (c) Whether anything surprised — e.g., did the ignore-cancellation
         branch surface a new race? did the diagnostics timeline survive the
         alarm-burst cadence?

   "### Row — slice-1-4-soak-8h":
   - 22 metrics: same set as above
   - Baseline = slice-1-2-real-frame-payloads values for 18 overlapping
     metrics; "—" for the 4 new ones
   - Notes section with:
     (a) Why slice-1-2 is the baseline (continuous-load, FlaUI-captured row).
     (b) Working-set first-second / last-second numbers and the (last - first)
         growth math, evidencing criterion 12.
     (c) Gen-2 GC count rate (per hour) compared to slice-1-2's rate.
     (d) Per-tag samples.ingested distribution at a coarse level — note any
         tag whose rate dropped by more than the 1% TelemetryDropoutChance
         predicts.
     (e) Anything else that surprised: working-set sawtooth shape vs
         monotonic, alloc-rate trend, etc.

5. Add §4.5 to docs/runbook/capturing-measurements.md:
   - title: "### 4.5 Chaos-monkey scenario — SLICE-1.4, `ChaosMonkey` profile"
   - placement: after §4.4 (encoder-rate motion) and before §4.6 (Soak8h)
   - content:
       * one-paragraph rationale (links back to SLICE-1.4 spec)
       * 30-minute step list mirroring §4.4 but with profile = ChaosMonkey
       * sanity checks: runs.started ≥ 5, runs.faulted ≥ 5, fault-cycles
         (count) ≥ 5, frames.dropped recorded, the four log-line branches
         (a)/(b)/(c)/(d) all present
       * the row block is 22-metric; name working-set growth (MB) and
         fault-cycles (count) and where they come from
       * a PowerShell `Select-String` recipe over the inspection-prototype
         log files to count each fault-branch landing — copy-pasteable;
         this is the criterion-11 verification recipe
       * Implemented by: MultiTagSoakFlaUi with `--profile ChaosMonkey`

6. Add §4.6 to docs/runbook/capturing-measurements.md:
   - title: "### 4.6 Soak scenario — SLICE-1.4, `Soak8h` profile"
   - placement: after §4.5
   - content:
       * one-paragraph rationale: this is the slice's leak-detection bar;
         8 hours real-time on a dedicated session
       * a strong "do not run on a host you also intend to use" warning
       * prerequisites stronger than §4.5: hibernate disabled, screen-saver
         disabled, no other interactive use of the host during the run
       * 8-hour step list — same Capture-Measurements.ps1 invocation with
         -DurationSeconds 28800 -Profile Soak8h
       * sanity checks: working-set growth (MB) ≤ 50, gen-2-gc-count rate
         within 4× of slice-1-2-real-frame-payloads's rate, no
         unhandled-exception entries in the log, capture span ≥ 28_500 s
       * what to do if the capture is interrupted: discard the partial CSV
         and restart; partial captures dilute the leak math
       * Implemented by: MultiTagSoakFlaUi with `--profile Soak8h`

7. Replace the "### 4.5+ — pending Phase 1 scenarios" placeholder section
   with: "### 4.7+ — pending Phase 2 scenarios" listing only "Reserved for
   Phase 2 slices once they open." Phase 1 is complete after this slice.

8. Update CLAUDE.md "Current position" block:
   - Phase: 1 (Simulator to scale) — complete
   - Last completed action: TASK-1.4 Pass 3 — captured 30-min ChaosMonkey
     (fault-cycles=<N>, runs.faulted=<N>) and 8-hour Soak8h
     (working-set growth=<X> MB), 22-metric row blocks appended, runbook
     §4.5 + §4.6 added; commit <hash>
   - Next action: open Phase 2 — review Phase-1 measurement evidence to
     prioritize SLICE-2.1 / 2.2 / 2.3 / 2.4 ordering
   - Blocked on: nothing
   - Last updated: <today's date>

9. Append a session-log entry to docs/reviews/roadmap-progress.md under
   today's date covering: both CSV paths, both row-block headline numbers,
   the criterion-11 log-evidence recipe output (per-branch counts), the
   commit hash, and a one-line "Phase 1 exit gate met on YYYY-MM-DD"
   declaration. Mark SLICE-1.4 as Completed in the progress table. Add a
   one-line note under the Phase 1 section heading: "**Phase 1 exit gate:**
   met on YYYY-MM-DD, see rows slice-1-4-chaos-monkey and slice-1-4-soak-8h."

10. Restore powercfg settings after both captures complete:
       powercfg /change standby-timeout-ac <previous_value>
       powercfg /change monitor-timeout-ac <previous_value>
       powercfg /hibernate on  # if previously on
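
Deliverables 1 and 10 both depend on knowing the previous timeout values. One way to read them before changing anything is to parse `powercfg /query` output. The sketch below assumes English-locale output containing a "Current AC Power Setting Index: 0x..." line, and assumes the index is in seconds while `powercfg /change` takes minutes. The helper name is hypothetical.

```powershell
# Hypothetical helper: extracts the AC timeout, converted to minutes, from the
# text printed by `powercfg /query`. Assumes English-locale output.
function ConvertFrom-PowercfgQuery {
    param([string[]]$QueryOutput)
    foreach ($line in $QueryOutput) {
        if ($line -match 'Current AC Power Setting Index:\s*0x([0-9a-fA-F]+)') {
            $seconds = [Convert]::ToInt64($Matches[1], 16)   # hex index, in seconds
            return [int]($seconds / 60)                      # /change expects minutes
        }
    }
    throw 'No "Current AC Power Setting Index" line found in powercfg output.'
}

# Windows only. Record the values before running deliverable 1:
# $standbyMin = ConvertFrom-PowercfgQuery (powercfg /query SCHEME_CURRENT SUB_SLEEP STANDBYIDLE)
# $monitorMin = ConvertFrom-PowercfgQuery (powercfg /query SCHEME_CURRENT SUB_VIDEO VIDEOIDLE)
# After both captures complete:
# powercfg /change standby-timeout-ac $standbyMin
# powercfg /change monitor-timeout-ac $monitorMin
```

The hibernate state is not covered by this sketch; note it by hand in the session-log entry as deliverable 1 describes.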

## Constraints

- Do NOT make any code or test changes in this pass.
- Do NOT modify the SLICE-1.4 spec — the row blocks are the slice's exit-gate
  evidence, not an opportunity to amend the spec.
- Do NOT skip the 8-hour soak. A shorter soak does not satisfy criterion 12;
  leak signal needs the longer wall-clock window to separate from
  steady-state fluctuation.
- Do NOT capture without disabling system sleep first (deliverable 1).
- Do NOT capture the soak with another high-CPU workload running on the host.
- Do NOT skip the 30-min ChaosMonkey capture in favor of the soak alone:
  the two captures provide evidence for different exit-gate criteria, and
  both rows are required.
- Do NOT proceed to §4.5/§4.6 edits if criterion 11 (every fault branch hit)
  fails. File the gap as a follow-up first.
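
Criterion 11 turns on where each fault landed, not only on how many occurred. The following is a rough sketch of that attribution over the log lines, using only the marker strings quoted in deliverable 2. The function name, the phase-tracking approach, and the assumption that each marker appears verbatim on a single log line are illustrative. This is not the `Select-String` recipe that §4.5 calls for, only one possible way to cross-check its per-branch counts.

```powershell
# Hypothetical cross-check for criterion 11. Marker strings come from
# deliverable 2; everything else here is an assumption for illustration.
function Get-FaultBranchCounts {
    param([string[]]$Lines)

    $counts = [ordered]@{
        ConnectFailure  = 0   # branch (a)
        FaultDuringHome = 0   # branch (b)
        FaultDuringRun  = 0   # branch (c)
        ClearAndRecover = 0   # branch (d)
    }
    $phase = 'none'           # last workflow phase seen: none | home | run
    $clearPending = $false    # a "cleared" line awaiting "Recovery completed."

    foreach ($line in $Lines) {
        if ($line.Contains('Connection failed (simulated failure)') -or
            $line.Contains('FlakySdk: out-of-band-throw branch fired') -or
            $line.Contains('Connection error:')) {
            $counts['ConnectFailure']++
        }
        elseif ($line.Contains('Homing started')) { $phase = 'home' }
        elseif ($line.Contains('Run running'))    { $phase = 'run' }
        elseif ($line.Contains('CRITICAL FAULT: [CHAOS-')) {
            # Attribute the fault to whichever phase started most recently.
            if ($phase -eq 'home')    { $counts['FaultDuringHome']++ }
            elseif ($phase -eq 'run') { $counts['FaultDuringRun']++ }
            $phase = 'none'
        }
        elseif ($line.Contains('Fault condition cleared: [CHAOS-')) {
            $clearPending = $true
        }
        elseif ($clearPending -and $line.Contains('Recovery completed.')) {
            $counts['ClearAndRecover']++
            $clearPending = $false
        }
    }
    [pscustomobject]$counts
}

# Usage against the capture-window logs (path pattern from deliverable 2):
# Get-FaultBranchCounts -Lines (Get-Content Logs/inspection-prototype-*.log)
```

Criterion 11 is met only if all four counts are non-zero. A zero in any column is the "missing branch" case that the constraint above treats as a stop condition.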

## Verification before you report done

  dotnet build --configuration Release
  dotnet test --configuration Release

Plus:
  - both docs/captures/slice-1-4-chaos-monkey-<date>.csv and
    docs/captures/slice-1-4-soak-8h-<date>.csv exist and are committed
  - both row blocks are in docs/reviews/phase-1-measurements.md with all 22
    metrics filled, criterion 11 + 12 satisfied
  - §4.5 and §4.6 render correctly (no broken markdown tables or links)
  - CLAUDE.md current-position block reflects SLICE-1.4 closure and that the
    Phase 1 exit gate is met
  - The Phase 1 exit-gate banner line is present in roadmap-progress.md
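
The criterion-12 numbers can be cross-checked by hand with a few lines of arithmetic. This is a sketch only: the helper name and parameter names are hypothetical, the inputs are assumed to be read off the capture (first-second and last-second working set, capture span, Gen-2 count) and the slice-1-2 row, and the thresholds are the ones stated in deliverable 3.

```powershell
# Hypothetical helper; not part of MeasurementExtraction.psm1.
function Test-SoakCriterion12 {
    param(
        [double]$FirstWorkingSetMb,     # working set in the first second
        [double]$LastWorkingSetMb,      # working set in the last second
        [double]$SpanSeconds,           # capture span from the CSV
        [double]$Gen2Count,             # gen-2-gc-count over the capture
        [double]$BaselineGen2PerHour    # slice-1-2-real-frame-payloads rate
    )
    $growthMb    = $LastWorkingSetMb - $FirstWorkingSetMb   # (last - first)
    $gen2PerHour = $Gen2Count / ($SpanSeconds / 3600)       # normalize to per-hour
    [pscustomobject]@{
        GrowthMb    = $growthMb
        Gen2PerHour = $gen2PerHour
        SpanOk      = $SpanSeconds -ge 28500
        GrowthOk    = $growthMb -le 50
        Gen2Ok      = $gen2PerHour -le (4 * $BaselineGen2PerHour)
    }
}

# Made-up numbers, purely to show the call shape (not measurements):
# Test-SoakCriterion12 -FirstWorkingSetMb 410 -LastWorkingSetMb 442 `
#     -SpanSeconds 28800 -Gen2Count 96 -BaselineGen2PerHour 10
```

Comparing Gen-2 activity as a per-hour rate rather than a raw count keeps the check meaningful when the baseline capture had a different duration than the soak.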

## Report format when finished

- files created and modified (note: there is no source code change in this pass)
- both captured row blocks (the 22-metric markdown tables) included in the report
- working-set growth (MB) value for Soak8h, fault-cycles (count) value and
  per-branch log-evidence counts for ChaosMonkey, with one-sentence
  interpretation of each
- a single commit hash
- commit message: "feat(measurements): close SLICE-1.4 and Phase 1; chaos-monkey + 8h soak rows + runbook §4.5/§4.6 (pass 3/3 of TASK-1.4)"

Operator notes

  • One pass per Copilot session. Same protocol as TASK-1.2 / TASK-1.3.
  • Pass 1 keeps the seven new fields at their defaults (zero, or 1.0 for TimeCompressionFactor). Existing profiles must compile and run unchanged. Any test regression in the existing slice-1-1 / slice-1-2 / slice-1-3 measurement reproducibility is a Pass 1 bug; do not paper over it in Pass 2.
  • Pass 2's load-bearing detail is the conditional decorator wiring. When Simulator:FlakySdk:Enabled == false, the decorator must not be in the call path — IMachineConnection resolves directly to SimulatedMachineConnection. A test asserts this. The decorator's no-op-when-disabled branch is not the same as not registering the decorator at all; we want the latter so an existing Normal/Demo/MultiTag capture is bit-for-bit reproducible.
  • Pass 3's 8-hour soak is the slice's one true gate. Working-set growth ≤ 50 MB is non-negotiable. If the soak fails, the slice is not done; do not paper over it by adjusting the criterion. Phase 2 may end up motivated by exactly the leak that the soak surfaces, in which case the row stays in the table as evidence and Phase 2 opens.
  • AlarmBurstEveryMs = 0 under Soak8h is intentional. The alarm-burst path interrupts runs and dominates run throughput; under the soak, we want continuous runs accumulating wall-clock hours. ChaosMonkey gets AlarmBurstEveryMs = 45_000; Soak8h gets 0. Do not make the two profiles share the same alarm cadence "for symmetry" — they have different jobs.
  • TimeCompressionFactor: 1.0 on both new profiles. Time compression's main use is for tightening developer feedback loops on the chaos paths; for the slice's exit-gate captures, real-time keeps the data plane representative. Keep the field plumbed (Pass 1) and validated (Pass 1) but use it at 1.0 for the captures (Pass 3).
  • The 30-minute ChaosMonkey is verified by log inspection, not counters. Counters say "X faults occurred"; logs say "the fault occurred during Homing" / "during Running" / etc. Pass 3's runbook §4.5 includes the recipe so future captures reproduce the verification without bespoke per-capture code.
  • The flaky-SDK decorator is connection-only. Wrapping IMotionController is a documented follow-up. The slice's spec does not gate on motion-side flakiness; the connect-side coverage of DoConnectAsync is enough to claim the criterion-A coverage of the connect-fail / out-of-band-throw branches of WorkflowService.
  • Update the index files only at the end of the phase, not per-slice. Same rationale as earlier tasks. Phase 1's full retrospective banner goes into roadmap-progress.md under the Phase 1 heading after Pass 3 lands.
