Skip to content

SLICE-1.4: Storm & Soak Profiles

Goal

Add storm-and-soak knobs to SimulatorProfile so the simulator can produce conditions that exercise every fault branch of WorkflowService and run uninterrupted long enough to surface memory leaks. Two new named profiles — ChaosMonkey (high-rate fault injection, 30-minute capture) and Soak8h (low chaos, 8-hour capture) — drive Phase 1's exit gate. The slice's load-bearing evidence is twofold: (1) under ChaosMonkey the workflow's connect-fail / fault-during-home / fault-during-run / fault-clear-and-recover paths each trigger at least once in a 30-minute capture; (2) under Soak8h the process's working-set growth stays under 50 MB across an 8-hour real-time run with runs.faulted bounded.

Why This Slice

Today the simulator drives the prototype at one steady rate per profile. There is no temporal variation in the load it produces — no defect storms, no alarm bursts, no SDK call hangs, no telemetry dropouts. Every fault path in WorkflowService.OnFaultInjected / DoConnectAsync / RunLoopAsync / DoHomeAsync is exercised by hand-driven engineering-panel input or by setting ConnectionFailureProbability once and waiting. The Phase 1 measurement rows (rows 0 / 0a / 0b / slice-1-1-multi-tag-telemetry / slice-1-2-real-frame-payloads / slice-1-3-encoder-rate-motion) establish that the data plane survives steady load; none establishes that the fault plane survives bursty load or that the process survives wall-clock-long runs.

The roadmap (§3, Phase 1 row 1.4) calls for DefectShowerRate, AlarmBurstEvery, TelemetryDropoutChance, NetworkLatencyMeanMs, NetworkLatencyStddevMs, TimeCompressionFactor, two new profiles ChaosMonkey + Soak8h, and an SDK-flakiness wrapper that injects timeouts, cancellation-that-doesn't-cancel, and out-of-band throws. The exit-gate criteria are: (a) 8-hour Soak8h completes without leaking memory (RSS growth < 50 MB), and (b) ChaosMonkey triggers at least one code path in every fault branch of WorkflowService.

This slice does not refactor anything. It only adds load-shaping inputs to the simulator and the configuration shapes that drive them. Phase 2 may then justify lift-outs based on which paths the chaos profile breaks; without the chaos profile, Phase 2's "store under pressure" exit gate has no measurement basis.

Requirements Coverage

In Scope

Profile fields

SimulatorProfile (the record in Application.State) and SimulatorProfileOptions (the JSON-binding shape in Infrastructure.Simulator) gain seven new fields. All default to zero / 1.0 so existing profiles preserve their current behavior:

  • int DefectShowerEveryMs — period in milliseconds between defect-shower windows. 0 disables. Range [0, 3_600_000].
  • int DefectShowerDurationMs — duration of each defect-shower window in milliseconds. While the window is open, the per-frame defect probability is forced to 1.0 regardless of DefectProbabilityPerFrame. Required > 0 if DefectShowerEveryMs > 0. Range [0, 60_000].
  • int AlarmBurstEveryMs — period between scheduled critical-fault inject + clear + recover cycles. 0 disables. Range [0, 3_600_000]. Each burst raises one alarm code drawn from a small fixed pool (CHAOS-001CHAOS-005), waits a short interval (~500 ms), clears the fault, then issues a RecoverAsync so the workflow returns to Ready and a fresh run can start.
  • double TelemetryDropoutChance — per-emit-cycle probability that a tag emitter skips publishing a sample (the cell holds the previous value but samples.ingested is not incremented). Range [0.0, 1.0]. Tag staleness detection is a Phase 2 concern; in this slice the dropout shows up in CSV as a reduction in samples.ingested rate per tag.
  • double NetworkLatencyMeanMs — mean of an additive Gaussian latency injected before IMachineConnection.ConnectAsync returns. Range [0.0, 30_000.0]. 0 disables.
  • double NetworkLatencyStddevMs — standard deviation of the same Gaussian. Range [0.0, 30_000.0]. Required if mean > 0; allowed to be 0 (deterministic latency).
  • double TimeCompressionFactor — multiplier that shortens simulator-internal idle delays. Default 1.0 (real-time). 2.0 halves connection delay and motion-tick wait; 10.0 divides by ten. Range [0.1, 100.0]. Producer rates (tags / frames / encoder) are deliberately not affected — the data plane represents real machine I/O bandwidth and must stay representative even under compression. Documented prominently both on the SimulatorProfile record's XML doc and in the runbook.

SDK-flakiness wrapper

A new FlakySdkOptions block bound to Simulator:FlakySdk:

  • double TimeoutChance — probability that a wrapped call hangs longer than the caller's timeout. Implementation: when triggered, the decorator awaits a Task.Delay of (caller's CTS expected timeout × 2). Range [0.0, 1.0].
  • double IgnoreCancellationChance — probability that a wrapped call ignores cancellation and completes normally despite the caller cancelling its CTS. Implementation: wrap the inner call without forwarding the CancellationToken. Range [0.0, 1.0].
  • double OutOfBandThrowChance — probability that a wrapped call throws an InvalidOperationException (a non-OperationCanceledException exception type) at a random point in its lifetime. Range [0.0, 1.0].
  • bool Enabled — master gate. When false, the decorator passes calls through unmodified; when true, the three chances above apply. Defaults to false so existing profiles see no change.

FlakySdkDecorator<IMachineConnection> wraps IMachineConnection.ConnectAsync only in this slice. Wrapping IMotionController.HomeAsync / MoveToAsync is deferred to a follow-up — the connection path is sufficient to exercise WorkflowService.DoConnectAsync's exception branch, which is one of the criterion-A paths.

New profiles in appsettings.json

  • ChaosMonkey: MotionSpeedUnitsPerSecond: 50.0, TelemetryIntervalMs: 50, FrameIntervalMs: 100, FrameWidth: 1024, FrameHeight: 768, BytesPerPixel: 1, EncoderIntervalMs: 5, DefectProbabilityPerFrame: 0.05, ConnectionFailureProbability: 0.30, DefectShowerEveryMs: 30_000, DefectShowerDurationMs: 3_000, AlarmBurstEveryMs: 45_000, TelemetryDropoutChance: 0.05, NetworkLatencyMeanMs: 250, NetworkLatencyStddevMs: 150, TimeCompressionFactor: 1.0. Plus Simulator:FlakySdk set with Enabled: true, TimeoutChance: 0.05, IgnoreCancellationChance: 0.05, OutOfBandThrowChance: 0.05.

  • Soak8h: MotionSpeedUnitsPerSecond: 30.0, TelemetryIntervalMs: 100, FrameIntervalMs: 250, FrameWidth: 1024, FrameHeight: 768, BytesPerPixel: 1, EncoderIntervalMs: 5, DefectProbabilityPerFrame: 0.05, ConnectionFailureProbability: 0.05, DefectShowerEveryMs: 600_000, DefectShowerDurationMs: 5_000, AlarmBurstEveryMs: 0 (disabled — alarm cycles dominate run-throughput and the soak's purpose is leak detection, not fault-path coverage), TelemetryDropoutChance: 0.01, NetworkLatencyMeanMs: 50, NetworkLatencyStddevMs: 20, TimeCompressionFactor: 1.0. Simulator:FlakySdk is shared with ChaosMonkey but is irrelevant under Soak8h since ConnectionFailureProbability and the lack of forced reconnects mean the wrapper rarely fires.

    Important: FlakySdkOptions is a single top-level config block, not a per-profile one. Phase-1 scope is one set of flakiness knobs that get applied whenever Enabled: true. Profile selection turns the decorator on/off only via the master gate; per-profile flakiness tuning is a follow-up.

Background services

  • DefectShowerService (IHostedService in Application.Services) holds the current "shower window open?" boolean and ticks DefectShowerEveryMs to flip it on for DefectShowerDurationMs. FramePipelineService.ProcessDefectsForFrame consults the service via a new IDefectShowerSchedule { bool IsShowerActive { get; } } abstraction; when active, the per-frame probability check is short-circuited to "always defect". The shower flip is logged once per transition (not per frame) to keep the diagnostics timeline readable. Service is opt-in by config: when DefectShowerEveryMs == 0 for the active profile, IsShowerActive returns false permanently.
  • AlarmBursterService (IHostedService in Application.Services) ticks AlarmBurstEveryMs and on each tick: (1) calls IFaultInjector.InjectCriticalFault with one of CHAOS-001CHAOS-005 (round-robin, so OnFaultInjected's "already active" branch is also exercised by the duplicate-code path), (2) waits 500 ms, (3) calls IFaultInjector.ClearFault for the same code, (4) waits 500 ms, (5) calls IWorkflowService.RecoverAsync. When AlarmBurstEveryMs == 0, the service exits its loop immediately and stays idle.
  • Both services subscribe to ISimulatorProfileProvider.ProfileChanged and rebuild their tickers if the relevant interval field changes.

Telemetry dropout

SimulatedTagSource's per-emitter loop consults the active SimulatorProfile.TelemetryDropoutChance. When Random.Shared.NextDouble() < TelemetryDropoutChance, the emitter skips publishing for that cycle (cell value unchanged, samples.ingested and samples.coalesced untouched). Per-tag noise model still advances its ref state (so the random-walk doesn't reset to baseline after a long dropout window). Dropout is not a coalesce — it is a deliberate skip — and must not be counted as one in metrics.

Time compression

TimeCompressionFactor is read from the active profile in two places:

  1. SimulatedMachineConnection.ConnectAsync: await Task.Delay(TimeSpan.FromMilliseconds(_connectDelay.TotalMilliseconds / factor), ct).
  2. SimulatedMotionController.InterpolateAsync: the per-tick wait is divided by factor so motion completes faster in wall-clock time (the simulated commanded position still advances at MotionSpeedUnitsPerSecond per simulated second, but the simulator's internal "second" passes faster).

Producer ticks (tag emitters, frame ticker, encoder ticker) are not scaled. The slice's spec text notes this divergence prominently.

Metrics

No new counters in this slice. Existing counters (runs.started, runs.completed, runs.faulted, frames.dropped, samples.ingested, etc.) carry the chaos signal: runs.faulted count rises under ChaosMonkey, dotnet.process.memory.working_set peak vs. average tells the leak story under Soak8h. Two derived columns are added to the row block for this slice's evidence:

  • working-set growth (MB)working_set peak minus working_set first-second value, in MB, rounded to 1 decimal. Captures the wall-clock leak signal directly.
  • fault-cycles (count) — derived as runs.faulted total under ChaosMonkey. Documents how many full inject-clear-recover cycles the workflow completed.

Both are computed in MeasurementExtraction.psm1 from existing CSV columns (no new producer-side instrumentation).

Configuration validation

  • SimulatorProfilesValidator extends to reject:
    • DefectShowerEveryMs < 0 or > 3_600_000
    • DefectShowerDurationMs < 0 or > 60_000, or DefectShowerEveryMs > 0 && DefectShowerDurationMs == 0, or DefectShowerDurationMs > DefectShowerEveryMs
    • AlarmBurstEveryMs < 0 or > 3_600_000
    • TelemetryDropoutChance < 0.0 or > 1.0
    • NetworkLatencyMeanMs < 0 or > 30_000
    • NetworkLatencyStddevMs < 0 or > 30_000, or NetworkLatencyMeanMs > 0 && NetworkLatencyStddevMs is missing (treated as 0; only reject negatives)
    • TimeCompressionFactor < 0.1 or > 100.0
  • New FlakySdkOptionsValidator rejects any of TimeoutChance / IgnoreCancellationChance / OutOfBandThrowChance outside [0.0, 1.0]. Enabled is unconditionally valid (it is just a gate).

Measurement scenarios (runbook §4.5 + §4.6)

§4.5 covers the 30-minute ChaosMonkey capture. §4.6 covers the 8-hour Soak8h capture (with sleep-disable discipline reaffirmed and a "do not run on a host you also intend to use" warning). Both use the existing MultiTagSoakFlaUi scenario from SLICE-1.6 invoked with -Profile ChaosMonkey / -Profile Soak8h respectively. No new FlaUI scenario class.

Measurement rows

Two row blocks are appended to docs/reviews/phase-1-measurements.md:

  • slice-1-4-chaos-monkey — 30-minute capture; baseline slice-1-3-encoder-rate-motion. Highlights: runs.faulted > 0, runs.started > 5 (at least one inject-clear-recover-rerun cycle), one diagnostics-timeline entry per fault branch hit.
  • slice-1-4-soak-8h — 8-hour capture; baseline slice-1-2-real-frame-payloads (closest comparable continuous-load row). Highlights: working-set growth ≤ 50 MB, gen-2-gc-count is bounded (no GC-pressure runaway), no unhandled-exception entries in the log.

MeasurementExtraction.psm1 gains Get-WorkingSetGrowthMb and Get-FaultCyclesCount helpers. ConvertTo-MeasurementRow adds two rows to the markdown block (working-set growth (MB), fault-cycles (count)); guarded with "—" when the CSV pre-dates this slice (parallels SLICE-1.2's GC-pause / LOH-alloc handling and SLICE-1.3's encoder-rate handling).

Out of Scope

  • a UI panel for the chaos knobs — they live in appsettings.json and are configured by profile selection only. Engineering-panel UI for runtime chaos tuning is Phase 3.
  • decorating IMotionController with FlakySdkDecorator — connection-only is enough for the criterion-A fault-branch coverage; the motion-decorator is a documented follow-up.
  • per-profile FlakySdkOptions — the Simulator:FlakySdk block is global (one config object); per-profile is a follow-up.
  • defect-shower interactions with the run-history persistence layer — DefectShowerService only forces the per-frame probability; the existing RunSummary.DefectsMinor/Major/Critical accumulation is unchanged.
  • automated chaos-monkey unit tests that drive the workflow's fault paths in-process — the criterion-A evidence is the 30-minute capture, not a unit test. Unit tests cover the new services' tick logic in isolation.
  • a long-running CI job that runs Soak8h automatically — the 8-hour soak is a manual capture per release. CI guards stay green for sub-second tests.
  • adding TagQuality.Stale transitions to dropouts — staleness handling is Phase 2 / SLICE-2.3 (data-plane lift-out). This slice's dropout produces gaps in samples.ingested only.
  • changing the ConnectionFailureProbability semantics — it remains a hard "return false" path; the network-latency knob adds wall-clock delay before the success/failure result, but does not change the result distribution.
  • buffer pooling for any of the new path data — Phase 2 if measurements show pressure.
  • introducing a new IScenario class — MultiTagSoakFlaUi with -Profile ChaosMonkey or -Profile Soak8h is the capture path (mirrors SLICE-1.2 and SLICE-1.3).
  • modifying SimulatedMotionController.InterpolateAsync core motion model — only the per-tick wait is scaled by TimeCompressionFactor. The interpolation arithmetic, _currentX/_currentY, and PositionChanged event remain identical.
  • writing a TimeProvider-style abstraction across the simulator — TimeCompressionFactor is a localized scalar applied at two sites, not a globally injected clock. A proper TimeProvider lift is Phase 2 or later if needed.
  • replacing or modifying the existing IFaultInjector interface — AlarmBursterService calls the existing API.

Runtime Behavior

Defect-shower lifecycle

DefectShowerService.StartAsync reads the active profile's DefectShowerEveryMs. If 0, the service stays in a IsShowerActive == false state and does no work. If > 0, it starts a PeriodicTimer(DefectShowerEveryMs); on each tick it sets _isActive = true, schedules a Task.Delay(DefectShowerDurationMs) continuation that flips it back, and emits one diagnostics-timeline entry per transition (Info, source DiagnosticsSource.Pipeline).

FramePipelineService.ProcessDefectsForFrame consults IDefectShowerSchedule.IsShowerActive. When true, the random-roll check is skipped — every frame in the window produces a defect, and the per-frame defect distribution (Minor/Major/Critical 60/30/10) still applies. The per-frame metric increments and AppState.ActiveRun.DefectsCritical/Major/Minor updates flow exactly as today.

Profile-switch handling: subscribe to ISimulatorProfileProvider.ProfileChanged; on change, dispose the current PeriodicTimer and rebuild with the new period. If the new profile sets DefectShowerEveryMs == 0, the service drops to idle mode without restarting the timer.

Alarm-burster lifecycle

AlarmBursterService.StartAsync reads the active profile's AlarmBurstEveryMs. If 0, the service stays idle. If > 0, a PeriodicTimer(AlarmBurstEveryMs) drives the burst loop:

on tick:
  alarmCode = _pool[(_index++) % _pool.Length]   // round-robin CHAOS-001..005
  _faultInjector.InjectCriticalFault(alarmCode, $"ChaosMonkey burst at {DateTimeOffset.UtcNow:HH:mm:ss.fff}")
  await Task.Delay(500ms, ct)
  _faultInjector.ClearFault(alarmCode)
  await Task.Delay(500ms, ct)
  await _workflow.RecoverAsync()

The 500 ms gaps give the workflow time to observe Faulted, write the diagnostics entry, transition to Faulted state, and accept RecoverAsync. The recovery returns the workflow to Ready (if homed) or Idle. The next outer-loop run-start is driven by the FlaUI scenario's run-loop (which already issues StartAsync on completion), not by this service.

The round-robin code pool is intentional: OnFaultInjected's _activeFaultCodes.Add returns false for the first few duplicates (when the pool wraps), so the "already active" branch is also exercised. Each unique code triggers one full cycle.

Profile-switch handling matches DefectShowerService's.

Telemetry dropout

SimulatedTagSource's emitter is currently:

while (!ct.IsCancellationRequested) {
  await timer.WaitForNextTickAsync(ct);
  var sample = ComputeSample(...);
  _cells[tagName] = sample;            // overwrite, increments coalesced if previous unread
  _metrics.SamplesIngested.Add(1, [tag.name]);
}

It becomes:

while (!ct.IsCancellationRequested) {
  await timer.WaitForNextTickAsync(ct);
  if (Random.Shared.NextDouble() < _profileProvider.CurrentProfile.TelemetryDropoutChance) {
    AdvanceNoiseRefStateOnly(ref _refStates[tagName], elapsed);
    continue;
  }
  var sample = ComputeSample(...);
  _cells[tagName] = sample;
  _metrics.SamplesIngested.Add(1, [tag.name]);
}

The noise-state advance on a dropped emit keeps the random-walk visibly continuous after the dropout ends; otherwise the resumption looks artificial. samples.coalesced is unchanged because no overwrite happened.

Network latency injection

The FlakySdkDecorator<IMachineConnection> is registered in DI when Simulator:FlakySdk:Enabled == true (decided at startup; profile changes do not swap the decorator on/off — Phase 1 keeps the decorator scope simple). Independent of the flaky-SDK gate, network latency injection is always applied directly inside SimulatedMachineConnection.ConnectAsync based on the active profile's NetworkLatencyMeanMs / NetworkLatencyStddevMs. (The latency is part of the simulator's connection model, not a flaky-SDK fault.)

public async Task<bool> ConnectAsync(CancellationToken ct) {
    await Task.Delay(_connectDelay / TimeCompressionFactor, ct);
    var profile = _profileProvider.CurrentProfile;
    var jitter = SampleGaussianMs(profile.NetworkLatencyMeanMs, profile.NetworkLatencyStddevMs);
    if (jitter > 0) await Task.Delay(TimeSpan.FromMilliseconds(jitter), ct);
    if (Random.Shared.NextDouble() < profile.ConnectionFailureProbability) return false;
    return true;
}

SampleGaussianMs(mean, stddev) uses Box-Muller (already imported by NoiseModelEvaluator from SLICE-1.1; reuse the same helper). Negative samples clamp to 0.

Flaky SDK decorator

public sealed class FlakySdkDecorator<T> : T where T : IMachineConnection {
    private readonly T _inner;
    private readonly IOptionsMonitor<FlakySdkOptions> _options;
    private readonly ILogger<FlakySdkDecorator<T>> _logger;

    public async Task<bool> ConnectAsync(CancellationToken ct) {
        var opts = _options.CurrentValue;
        if (!opts.Enabled) return await _inner.ConnectAsync(ct);

        if (Random.Shared.NextDouble() < opts.TimeoutChance) {
            // Hang twice the caller's expected timeout. We don't know it; use 30s.
            await Task.Delay(TimeSpan.FromSeconds(30), CancellationToken.None);
            // Fall through to inner if not cancelled.
        }
        if (Random.Shared.NextDouble() < opts.IgnoreCancellationChance) {
            return await _inner.ConnectAsync(CancellationToken.None);
        }
        if (Random.Shared.NextDouble() < opts.OutOfBandThrowChance) {
            throw new InvalidOperationException("FlakySdk: simulated out-of-band SDK exception.");
        }
        return await _inner.ConnectAsync(ct);
    }

    public Task DisconnectAsync() => _inner.DisconnectAsync();
}

DI registration: when Enabled == true, services.AddSingleton<IMachineConnection>(sp => new FlakySdkDecorator<...>(sp.GetRequiredService<SimulatedMachineConnection>(), ...)). When Enabled == false, IMachineConnection resolves directly to SimulatedMachineConnection (current behavior).

Time-compression scope

TimeCompressionFactor reads from the active profile at the moment the wait begins (no caching). Two affected sites:

  • SimulatedMachineConnection.ConnectAsync: _connectDelay.TotalMilliseconds / factor
  • SimulatedMotionController.InterpolateAsync: per-iteration Task.Delay(20ms / factor, ct) (current 20 ms tick)

Producer rates that are intentionally NOT scaled:

  • SimulatedTagSource.IntervalMs (per tag)
  • SimulatedCamera.FrameIntervalMs (frame producer)
  • SimulatedEncoderSource.EncoderIntervalMs (encoder producer)

Rationale: under Soak8h with TimeCompressionFactor: 1.0 the producer rates and idle waits are both real-time, so the soak measures genuine wall-clock leak signal. If a future profile sets TimeCompressionFactor: 10.0, runs complete 10× faster but the data-plane bandwidth is unchanged — accelerating throughput without distorting per-second load. Producer scaling would conflate the two and is out of scope.

Acceptance Criteria

This slice is satisfied only if all of the following are true:

  1. SimulatorProfile (Application.State) and SimulatorProfileOptions (Infrastructure.Simulator) gain the seven fields in "Profile fields" with defaults that preserve existing behavior. Existing seed profiles (Normal, Demo, HighDefect, MultiTag, HighFrameRate, EncoderRate) compile and run unchanged at runtime when no chaos fields are set.
  2. FlakySdkOptions exists in Infrastructure.Simulator with the four fields in "SDK-flakiness wrapper". FlakySdkOptionsValidator rejects out-of-range values. The options block is bound from Simulator:FlakySdk and registered via AddOptions<FlakySdkOptions>().BindConfiguration(...).ValidateOnStart().
  3. SimulatorProfilesValidator extends to enforce the rules in "Configuration validation". Validation messages name the offending profile and field (matching the existing EncoderIntervalMs validation pattern). All existing profile-validation tests still pass.
  4. appsettings.json contains a new ChaosMonkey profile and a new Soak8h profile with the field values listed in "New profiles", and a new top-level Simulator:FlakySdk block with Enabled: true, TimeoutChance: 0.05, IgnoreCancellationChance: 0.05, OutOfBandThrowChance: 0.05 (chaos-only knobs; running existing profiles still works because the profile-level chance fields default to disable).
  5. IDefectShowerSchedule exists in Application.Abstractions. DefectShowerService implements it and IHostedService, is registered in AddInfrastructure (or the Application equivalent), and FramePipelineService.ProcessDefectsForFrame consults the abstraction. When DefectShowerEveryMs == 0, the service stays idle and no diagnostics-timeline entries are produced.
  6. AlarmBursterService exists in Application.Services, is registered as IHostedService, drives the inject-wait-clear-wait-recover cycle described in "Alarm-burster lifecycle", and rebuilds its timer when the active profile's AlarmBurstEveryMs changes. When AlarmBurstEveryMs == 0, the service exits its loop and is idle.
  7. SimulatedTagSource's emitter loop applies TelemetryDropoutChance per cycle. Verified by a unit test that runs 1000 cycles with TelemetryDropoutChance == 0.5 and asserts samples.ingested count is in [400, 600]. samples.coalesced is not inflated by dropouts.
  8. SimulatedMachineConnection.ConnectAsync scales _connectDelay by TimeCompressionFactor and adds a Gaussian latency draw based on NetworkLatencyMeanMs/Stddev. Negative draws clamp to 0. Verified by a unit test that runs 100 connections with mean=100ms, stddev=20ms and asserts measured wall-clock latency mean is in [80, 120] ms.
  9. SimulatedMotionController.InterpolateAsync's per-iteration tick wait is scaled by TimeCompressionFactor. The motion arithmetic, _currentX/_currentY, and the PositionChanged event are unchanged. Diff-search verification: _currentX = and PositionChanged?.Invoke remain at the same call sites.
  10. FlakySdkDecorator<IMachineConnection> exists in Infrastructure.Simulator, is conditionally registered when Simulator:FlakySdk:Enabled == true, and exhibits the three injection behaviors (timeout-hang, cancellation-ignore, out-of-band-throw) at the configured probabilities. Unit tests verify each branch in isolation by setting one chance to 1.0 and the others to 0.0. With Enabled == false, the decorator is not in the pipeline (IMachineConnection resolves directly to SimulatedMachineConnection).
  11. The 30-minute ChaosMonkey capture (criterion A) produces a row block tagged slice-1-4-chaos-monkey in phase-1-measurements.md with at least: runs.started ≥ 5, runs.faulted ≥ 5, frames.dropped recorded (any value, including 0), one or more diagnostics-timeline entries from DiagnosticsSource.Alarm with Critical/Error level. Independently verified by inspecting Logs/inspection-prototype-*.log and confirming entries from each of: connect-failure (from ConnectionFailureProbability and/or FlakySdkDecorator), fault-during-home (from AlarmBursterService overlapping with a homing operation), fault-during-run (from AlarmBursterService during WorkflowState.Running), and fault-clear-and-recover (from the burster's recovery step).
  12. The 8-hour Soak8h capture (criterion B) produces a row block tagged slice-1-4-soak-8h in phase-1-measurements.md. The working-set steady-state drift, defined as avg(working_set during the last 60 minutes) − avg(working_set during minutes 5-60), is ≤ 50 MB. gen-2-gc-count ≤ 4× the equivalent rate from slice-1-2-real-frame-payloads (no Gen-2 runaway). The capture's CSV span is at least 28_800 seconds (8 hours real-time, no system-sleep gaps as in TASK-1.1's 63-min mid-capture incident). Per-tag samples.ingested distributions are in the same order-of-magnitude as slice-1-1-multi-tag-telemetry (low-rate dropout from TelemetryDropoutChance: 0.01 is acknowledged in the row notes).

Criterion 12 — measurement amendment (2026-05-03). The original criterion was working-set growth = last − first ≤ 50 MB, where the values are sampled from the working-set CSV time series. The 2026-05-02 capture observed last − first = 186.5 MB (47.5 MB at t=0 → 234.0 MB at t=8h). Direct CSV inspection showed the entire delta is the process startup ramp, which completed in under 30 seconds (47.5 MB → 230.9 MB at t=29 s); the working-set then held a stable sawtooth between 224 MB and 240 MB for the remaining 7.5 hours, with avg(minutes 5-60) = 235.4 MB and avg(last 60 min) = 232.7 MB — a steady-state drift of −2.7 MB over 8 hours of real time. The original metric conflates startup cost (a one-time WPF + 50-tag-emitter + first-run allocation event) with in-flight allocation growth (the leak signal). The amended metric excludes the first 5 minutes of the capture, isolating the steady-state behavior that the slice was designed to measure. Against the amended metric, the 2026-05-02 capture passes the 50 MB ceiling by a wide margin. Same pattern as SLICE-1.1's criterion 7 amendment (per-tag-rate accuracy — Windows scheduling reality forced a methodology change) and SLICE-1.3's criterion 7 amendment (encoder receiver rate — Windows timer-resolution ceiling). The 50 MB ceiling itself is unchanged; only the measurement window is. 13. tools/Capture-Measurements.ps1 -Scenario MultiTagSoak -Profile ChaosMonkey -DurationSeconds 1800 and … -Profile Soak8h -DurationSeconds 28800 end-to-end produce row blocks with the new working-set growth (MB) and fault-cycles (count) rows present. 14. docs/runbook/capturing-measurements.md gains §4.5 (ChaosMonkey, 30-minute) and §4.6 (Soak8h, 8-hour) with the procedure, sleep-disable discipline reaffirmed, the criterion-A and criterion-B bars stated explicitly, and the "do not run on a host you also intend to use" warning for the soak. 15. The full existing test suite still passes, plus new tests covering: SimulatorProfile round-trip with the new fields; SimulatorProfilesValidator rejects each new invalid case; FlakySdkOptionsValidator rejects each invalid case; DefectShowerService activates and deactivates per its schedule (use FakeTimeProvider or Task.Delay-driven wall-clock with generous tolerance); AlarmBursterService issues exactly one inject + clear + recover cycle per tick (use a recording IFaultInjector and IWorkflowService fake); FlakySdkDecorator exhibits each fault branch when its probability is 1.0; SimulatedTagSource honors TelemetryDropoutChance (criterion 7); SimulatedMachineConnection honors TimeCompressionFactor and NetworkLatency* (criterion 8). dotnet test runtime stays under 90 seconds (the FlakySdk timeout-hang test must use a short configurable hang duration via DI, not the 30 s production default). 16. No regressions: rows 0 / 0a / 0b / slice-1-1-multi-tag-telemetry / slice-1-2-real-frame-payloads / slice-1-3-encoder-rate-motion are reproducible against the merged commit (i.e., Capture-Measurements.ps1 -Profile MultiTag still produces a row equivalent to slice-1-1-multi-tag-telemetry within the existing per-tag accuracy bounds noted in SLICE-1.1).

Verification Notes

The implementation task for this spec must include verification for:

  • the TelemetryDropoutChance mechanism does not introduce a samples.coalesced count where there should be none. Verified by the criterion-7 test plus a separate assertion that with TelemetryDropoutChance == 1.0 (full dropout), samples.coalesced stays at 0 for the entire test window.
  • AlarmBursterService survives the case where IWorkflowService.RecoverAsync is rejected (e.g., the workflow is mid-Stopping when the recovery call lands). The service must log a Warning and continue ticking, not fault the host. Verified by a test that intercepts the recovery call and returns rejection.
  • DefectShowerService does not produce defects after the host is shutting down (stoppingToken cancelled). Verified by a test that asserts IsShowerActive returns false after StopAsync.
  • FlakySdkDecorator's IgnoreCancellationChance branch does not propagate the original cancellation — when the caller cancels and the decorator ignores, the inner call must complete normally and the result must reach the caller. The caller is then responsible for noticing the late completion (in our case, WorkflowService.DoConnectAsync already wraps in a CancellationTokenSource and observes the token; the test asserts that the late completion writes to _logger but does not corrupt AppState).
  • the Soak8h capture's working-set growth math is computed from the actual CSV's first and last dotnet.process.memory.working_set rows, NOT from working-set peak − working-set min. Peak is a high-water mark dominated by transient allocations; growth is the trend that matters for leak detection. The runbook §4.6 shows the formula and the Get-WorkingSetGrowthMb Pester test asserts it is computed correctly on synthetic CSV input.
  • the slice-1-4-soak-8h capture is committed against a single commit hash with the working tree clean. A capture interrupted by a system sleep, hibernate, or laptop-lid-close event must be discarded — the criterion-B 50 MB bar is meaningful only against an uninterrupted 8-hour wall-clock baseline. The powercfg /change standby-timeout-ac 0 discipline (TASK-1.5.1, reaffirmed in SLICE-1.6 and SLICE-1.3) applies; for the soak, also disable hibernate (powercfg /hibernate off) and screen-saver-induced sleep on the capture machine.
  • the ChaosMonkey capture's fault-branch coverage is verified by log inspection, not by counter values. Counters say "X faults occurred"; logs say "fault occurred during Homing", "fault occurred during Running", etc. The runbook §4.5 includes a Select-String recipe (or PowerShell equivalent) over Logs/inspection-prototype-*.log that filters and counts each branch. The recipe is part of the runbook so future captures can reproduce the verification.
  • the MultiTagSoakFlaUi scenario runs cleanly under the new ChaosMonkey and Soak8h profiles via the SIMULATOR_PROFILE env-var path established in SLICE-1.6 Pass 3. No new FlaUI scenario class is added; if profile selection through the combo-box is intrinsically flaky, the SLICE-1.6 fallback --start-with-profile <name> CLI flag is the resolution.

Docs-first project memory for AI-assisted implementation.