
SLICE-1.4 Design Notes — Storm & Soak Profiles

  • Slice: SLICE-1.4
  • Implementation status: Completed (2026-05-03, criterion 12 amended)
  • Audience: anyone modifying the chaos services, the flaky-SDK decorator, the simulator's load-shaping knobs, or interpreting the Phase 1 exit-gate evidence

This doc explains how the storm-and-soak chaos pipeline actually works in code: the two new background services, the flaky-SDK decorator's branch dispatch, the seven new SimulatorProfile fields and where each is consumed, the conditional DI wiring that lets the decorator be bypassed bit-for-bit when disabled, and the four FlaUI hardening fixes that the ChaosMonkey capture surfaced. This is the most complex Phase 1 slice; read this if you're tuning chaos parameters, adding a new fault-injection mode, or porting the rig to non-Windows hosts.

1. Quick reference

Three new services (all in InspectionPrototype.Application.Services except the decorator):

| Service | Type | Role |
|---|---|---|
| DefectShowerService | IHostedService + IDefectShowerSchedule | Toggle-on every DefectShowerEveryMs; consulted by FramePipelineService |
| AlarmBursterService | IHostedService | Periodic inject → wait → clear → wait → recover cycle |
| FlakySdkDecorator | IMachineConnection decorator | Three injected fault modes around ConnectAsync |

Seven new SimulatorProfile fields (record properties, all default-zero / 1.0):

| Field | Range | Consumed by |
|---|---|---|
| DefectShowerEveryMs | [0, 3_600_000] | DefectShowerService |
| DefectShowerDurationMs | [0, 60_000] | DefectShowerService |
| AlarmBurstEveryMs | [0, 3_600_000] | AlarmBursterService |
| TelemetryDropoutChance | [0.0, 1.0] | SimulatedTagSource (per-emitter cycle) |
| NetworkLatencyMeanMs | [0, 30_000] | SimulatedMachineConnection.ConnectAsync |
| NetworkLatencyStddevMs | [0, 30_000] | SimulatedMachineConnection.ConnectAsync |
| TimeCompressionFactor | [0.1, 100.0] | SimulatedMachineConnection + SimulatedMotionController |

One global config block (Simulator:FlakySdk):

| Option | Range | Default | Effect |
|---|---|---|---|
| Enabled | bool | false | Master gate; false bypasses decorator entirely |
| TimeoutChance | [0.0, 1.0] | 0.0 | Probability of timeout-hang branch |
| IgnoreCancellationChance | [0.0, 1.0] | 0.0 | Probability of ignore-cancellation branch |
| OutOfBandThrowChance | [0.0, 1.0] | 0.0 | Probability of out-of-band-throw branch |
| TimeoutHangMs | [1, 600_000] | 30_000 | Hang duration on timeout branch |

Two new profiles (appsettings.json):

  • ChaosMonkey (aggressive): DefectShowerEveryMs=30_000 (30 s), DefectShowerDurationMs=3_000 (3 s), AlarmBurstEveryMs=45_000 (45 s), TelemetryDropoutChance=0.05, NetworkLatencyMeanMs=250, ConnectionFailureProbability=0.30, TimeCompressionFactor=1.0. Used with FlakySdk:Enabled=true for the criterion-11 capture.
  • Soak8h (gentle): DefectShowerEveryMs=600_000 (10 min), DefectShowerDurationMs=5_000 (5 s), AlarmBurstEveryMs=0 (disabled), TelemetryDropoutChance=0.01, NetworkLatencyMeanMs=50. Used with FlakySdk:Enabled=false for the criterion-12 capture.

Key files:

src/InspectionPrototype.Application/State/SimulatorProfile.cs               // 7 new fields
src/InspectionPrototype.Application/Abstractions/IDefectShowerSchedule.cs   // single-property abstraction
src/InspectionPrototype.Application/Services/DefectShowerService.cs
src/InspectionPrototype.Application/Services/AlarmBursterService.cs
src/InspectionPrototype.Application/Services/FramePipelineService.cs        // consults IDefectShowerSchedule
src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptions.cs
src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptionsValidator.cs
src/InspectionPrototype.Infrastructure/Simulator/FlakySdkDecorator.cs
src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesOptions.cs       // 7 new binder fields
src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesValidator.cs     // 7 new validation rules
src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfileHydrationService.cs  // maps 7 new fields
src/InspectionPrototype.Infrastructure/Simulator/SimulatedTagSource.cs             // dropout
src/InspectionPrototype.Infrastructure/Simulator/SimulatedMachineConnection.cs     // time + latency
src/InspectionPrototype.Infrastructure/Simulator/SimulatedMotionController.cs      // tick scaling
src/InspectionPrototype.Infrastructure/InfrastructureServiceCollectionExtensions.cs // conditional decorator wiring
src/InspectionPrototype.App/appsettings.json                                       // 2 new profiles + FlakySdk block

Key tests (selected):

| Test | Asserts |
|---|---|
| DefectShowerServiceTests | Activates and deactivates per schedule |
| AlarmBursterServiceTests (4 tests) | ≥ 3 cycles in 5s; round-robin code distribution; survives RecoverAsync throw; disabled when EveryMs=0 |
| FlakySdkDecoratorTests (4 tests) | Each branch isolated by setting one chance to 1.0; bypass when Enabled=false |
| FlakySdk_TimeoutBranch_WhenNotCancelled_FallsThroughToInner | Pre-flight fix: timeout falls through to inner when CT not cancelled |
| SimulatedTagSourceDropoutTests | samples.ingested ≈ 50% under TelemetryDropoutChance=0.5 |
| SimulatedMachineConnectionTimeCompressionTests | Wall-clock connect delay scales with TimeCompressionFactor |
| SimulatedMachineConnectionNetworkLatencyTests | Gaussian latency mean within ±20% of configured |
| SimulatorProfilesValidatorChaosTests | Each new validation rule rejects out-of-range values |
| FramePipelineServiceShowerTests | Shower-active forces defect on every frame |

2. Class shape

The chaos pipeline is several independent subsystems that all read from ISimulatorProfileProvider but otherwise don't share state:

                          ┌────────────────────────────────────┐
                          │ ISimulatorProfileProvider          │
                          │   .CurrentProfile                  │
                          │     - DefectShowerEveryMs/Duration │
                          │     - AlarmBurstEveryMs            │
                          │     - TelemetryDropoutChance       │
                          │     - NetworkLatencyMean/Stddev    │
                          │     - TimeCompressionFactor        │
                          └────────────┬───────────────────────┘

        ┌──────────────┬───────────────┼───────────────┬──────────────────┐
        │              │               │               │                  │
        ▼              ▼               ▼               ▼                  ▼
   ┌────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
   │Defect- │  │AlarmBurster- │  │SimulatedTag- │  │SimulatedMach-│  │SimulatedMo-  │
   │Shower- │  │Service       │  │Source        │  │ineConnection │  │tionController│
   │Service │  │              │  │ (per-emitter │  │ (per Connect-│  │ (per tick    │
   │ (back- │  │ (background  │  │  loop reads  │  │  Async call: │  │  inside      │
   │ ground │  │  inject loop │  │  Dropout-    │  │  scale delay,│  │  Interpolate-│
   │ flag   │  │  + retry-    │  │  Chance per  │  │  add Gaussian│  │  Async       │
   │ flip   │  │  resilient   │  │  cycle)      │  │  latency)    │  │  scales tick │
   │ cycle) │  │  cycle)      │  │              │  │              │  │  by TimeComp)│
   └───┬────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────────────┘
       │              │                  │                 │
       │ exposes      │ calls            │ skips           │
       │ IDefect-     │ IFault-          │ samples.        │ delays /
       │ Shower-      │ Injector +       │ ingested when   │ ConnectAsync
       │ Schedule     │ IWorkflow-       │ rng < dropout   │ result
       │              │ Service          │                 │
       ▼              ▼                  ▼                 ▼
   ┌────────┐  ┌──────────────┐    (existing)        (existing)
   │Frame-  │  │SimulatorFault│
   │Pipeline│  │Injector +    │
   │Service │  │WorkflowService
   │.Process│  │ → AppState   │
   │Defects │  │   .Active-   │
   │ForFrame│  │   Alarms     │
   │ checks │  │   .Workflow- │
   │ Is-    │  │   State =    │
   │ Shower-│  │   Faulted    │
   │ Active │  │   etc.       │
   └────────┘  └──────────────┘

FlakySdkDecorator wraps IMachineConnection conditionally:

   Configuration:                                         DI registration:
   ────────────────                                       ────────────────
   "Simulator:FlakySdk":                                  IServiceCollection:
     Enabled: true   ─────┐                                 services.AddSingleton<SimulatedMachineConnection>()
                          │                                 services.AddSingleton<IMachineConnection>(sp =>
                          │                                   var opts = sp.GetRequiredService<
                          ▼                                       IOptionsMonitor<FlakySdkOptions>>().CurrentValue;
                                                                var inner = sp.GetRequiredService<SimulatedMachineConnection>();
   ┌────────────────────────────┐                              if (!opts.Enabled) return inner;
   │ FlakySdkDecorator           │                              return new FlakySdkDecorator(inner, ...);
   │   wraps SimulatedMachine-   │                          })
   │   Connection                │
   │                             │
   │   ConnectAsync:             │
   │     roll = Random.Shared    │
   │     if roll < Timeout:      │
   │       Task.Delay(HangMs)    │
   │       ct.ThrowIfCancelled() │   Pre-flight fix (commit 018bf29):
   │       return inner(ct)   ◀──┼── falls through to inner if not cancelled
   │     elif roll < T+Ignore:   │
   │       return inner(None)  ◀─┼── ignores caller's CT
   │     elif roll < T+I+Throw:  │
   │       throw InvalidOp     ◀─┼── out-of-band throw
   │     else: return inner(ct)  │
   └────────────────────────────┘

   Configuration when                                    DI registration:
   "Enabled": false:                                     ────────────────
                                                         IMachineConnection resolves DIRECTLY to
                                                         SimulatedMachineConnection — decorator NOT in
                                                         the call path. Bit-for-bit reproducibility for
                                                         pre-SLICE-1.4 captures (criterion 16).
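
The bypass in the diagram above can be illustrated without a DI container. This is a minimal sketch, not the project's registration code: IConnectionSketch, InnerConnection, FlakyDecoratorSketch, and ConnectionFactory are hypothetical stand-ins, and the real factory reads Enabled from IOptionsMonitor<FlakySdkOptions> inside services.AddSingleton.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Demo: with the gate off, the caller receives the inner instance itself.
var inner = new InnerConnection();
Console.WriteLine(object.ReferenceEquals(ConnectionFactory.Create(false, inner), inner));
Console.WriteLine(ConnectionFactory.Create(true, inner) is FlakyDecoratorSketch);

// Hypothetical stand-in for IMachineConnection (only ConnectAsync is modelled).
public interface IConnectionSketch
{
    Task ConnectAsync(CancellationToken ct);
}

public sealed class InnerConnection : IConnectionSketch
{
    public Task ConnectAsync(CancellationToken ct) => Task.CompletedTask;
}

// Hypothetical decorator: fault injection is elided, it only forwards here.
public sealed class FlakyDecoratorSketch : IConnectionSketch
{
    private readonly IConnectionSketch _inner;
    public FlakyDecoratorSketch(IConnectionSketch inner) => _inner = inner;
    public Task ConnectAsync(CancellationToken ct) => _inner.ConnectAsync(ct);
}

public static class ConnectionFactory
{
    // The enabled check happens once, when the connection is created.
    // The disabled path never constructs the decorator at all.
    public static IConnectionSketch Create(bool enabled, IConnectionSketch inner)
        => enabled ? new FlakyDecoratorSketch(inner) : inner;
}
```

Returning the inner object itself, not a decorator that short-circuits, is what keeps the disabled path free of any extra call or random draw.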

3. Lifecycle — the three new services

DefectShowerService is IHostedService + IDefectShowerSchedule. State machine:

  host start                                                         host stop
      │                                                                  │
      ▼                                                                  ▼
   ┌──────────────────┐                                          ┌──────────────┐
   │  StartAsync      │                                          │  StopAsync   │
   │  starts back-    │                                          │  cancels CTS │
   │  ground task     │                                          │  awaits task │
   └────────┬─────────┘                                          └──────────────┘

            │  RunAsync loop:


   ┌──────────────────────────────────────────────────────────┐
   │  Read profile.DefectShowerEveryMs each iteration         │
   │                                                          │
   │  if everyMs <= 0:                                        │
   │    Task.Delay(1000); continue   (idle poll)              │
   │                                                          │
   │  quietMs = max(0, everyMs - durationMs)                  │
   │  Task.Delay(quietMs)                  ◀── window closed  │
   │                                                          │
   │  _isActive = true        ◀── window opens                │
   │  Log Info "Defect shower active"                         │
   │  Update(s => s.WithDiagnosticsEntry(Pipeline, Info, …))  │
   │                                                          │
   │  Task.Delay(durationMs)                                  │
   │                                                          │
   │  _isActive = false       ◀── window closes               │
   │  Log Info "Defect shower ended"                          │
   │  Update(s => s.WithDiagnosticsEntry(Pipeline, Info, …))  │
   │                                                          │
   │  loop ─────────────────────────────▶                     │
   └──────────────────────────────────────────────────────────┘
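
The loop in the diagram can be sketched as follows. This is an illustration under assumptions, not the service's source: ShowerLoopSketch and its members are hypothetical names, logging and the AppState update are omitted, and resetting the flag in a finally block on cancellation is a choice made for the sketch.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Demo: run a short schedule for about half a second, then stop.
using var cts = new CancellationTokenSource();
var shower = new ShowerLoopSketch(() => (EveryMs: 100, DurationMs: 40));
var loop = shower.RunAsync(cts.Token);
await Task.Delay(500);
cts.Cancel();
await loop;
Console.WriteLine($"windows opened: {shower.WindowsOpened}");

public sealed class ShowerLoopSketch
{
    private readonly Func<(int EveryMs, int DurationMs)> _readProfile;
    private volatile bool _isActive;   // written by the loop, read by other threads
    private int _windowsOpened;

    public ShowerLoopSketch(Func<(int EveryMs, int DurationMs)> readProfile)
        => _readProfile = readProfile;

    public bool IsShowerActive => _isActive;
    public int WindowsOpened => Volatile.Read(ref _windowsOpened);

    // Quiet time before each window, as in the diagram: max(0, everyMs - durationMs).
    public static int QuietMs(int everyMs, int durationMs)
        => Math.Max(0, everyMs - durationMs);

    public async Task RunAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                var (everyMs, durationMs) = _readProfile();   // polled once per iteration
                if (everyMs <= 0)
                {
                    await Task.Delay(1000, ct);               // idle poll while disabled
                    continue;
                }
                await Task.Delay(QuietMs(everyMs, durationMs), ct);
                _isActive = true;                             // window opens
                Interlocked.Increment(ref _windowsOpened);
                try { await Task.Delay(durationMs, ct); }
                finally { _isActive = false; }                // window closes, also on cancel
            }
        }
        catch (OperationCanceledException)
        {
            // Graceful stop requested through the token.
        }
    }
}
```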

AlarmBursterService is IHostedService. The loop wraps each cycle in try/catch so a single failure (e.g., RecoverAsync rejected because workflow is mid-Stopping) doesn't terminate the host:

   RunAsync loop (per iteration):

   profile.AlarmBurstEveryMs == 0?  ──── yes ───▶  Task.Delay(1000)  ──┐

                                                                  loop ┘
              │ no

   Task.Delay(everyMs)


   code = _pool[Interlocked.Increment(ref _index) % 5]

              │  pool: [CHAOS-001, CHAOS-002, CHAOS-003, CHAOS-004, CHAOS-005]
              │  Round-robin so OnFaultInjected's "already active" duplicate-
              │  code branch is also exercised by the wrap-around.


   try {
     _faultInjector.InjectCriticalFault(code, "ChaosMonkey burst at HH:mm:ss.fff")
     Task.Delay(500)        ◀── workflow observes Faulted state
     _faultInjector.ClearFault(code)
     Task.Delay(500)        ◀── workflow's OnFaultCleared marks alarm inactive
     await _workflow.RecoverAsync()  ◀── workflow → Ready (if homed) or Idle
   }
   catch (OperationCanceledException) → break (graceful stop)
   catch (Exception ex) → Log Warning + continue   ◀── load-bearing resilience

FlakySdkDecorator is stateless — no lifecycle, just wraps each ConnectAsync call. Branch dispatch uses cumulative probability bands, not independent rolls:

   roll = Random.Shared.NextDouble()      ── one draw per ConnectAsync call

   roll ∈ [0, T):           Timeout branch
   roll ∈ [T, T+I):         IgnoreCancellation branch
   roll ∈ [T+I, T+I+O):     OutOfBandThrow branch
   roll ∈ [T+I+O, 1):       Pass-through to inner

   where T = TimeoutChance, I = IgnoreCancellationChance, O = OutOfBandThrowChance

This means at most one branch fires per call. The spec sketched independent draws (each branch could fire on the same call); the implementation chose mutual exclusion so the probability budget is easy to reason about, at the cost of being slightly less chaotic than independent draws. The bands only partition [0, 1) as shown when T + I + O ≤ 1; if the sum exceeds 1, the pass-through band is empty and the later bands are truncated. Both designs satisfy criterion 10's per-branch test (set one chance to 1.0, others to 0.0).
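
The band lookup can be written as a pure function, which makes the mutual exclusion easy to see. A minimal sketch: FlakyBranch and FlakyBands are hypothetical names, and the roll is passed in instead of being drawn from Random.Shared so the function stays deterministic.

```csharp
using System;

// Demo: a roll of 0.25 with T=0.1, I=0.2, O=0.3 falls in the second band.
Console.WriteLine(FlakyBands.Select(0.25, 0.1, 0.2, 0.3));

public enum FlakyBranch { Timeout, IgnoreCancellation, OutOfBandThrow, PassThrough }

public static class FlakyBands
{
    // One roll in [0, 1) is compared against cumulative thresholds, so exactly
    // one branch is selected per call.
    public static FlakyBranch Select(double roll, double timeout, double ignore, double outOfBand)
    {
        if (roll < timeout) return FlakyBranch.Timeout;
        if (roll < timeout + ignore) return FlakyBranch.IgnoreCancellation;
        if (roll < timeout + ignore + outOfBand) return FlakyBranch.OutOfBandThrow;
        return FlakyBranch.PassThrough;
    }
}
```

Setting one chance to 1.0 and the others to 0.0 sends every roll to that branch, which is how the per-branch tests isolate each mode.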

4. Runtime flow — fault burst lifecycle

The headline flow is the inject-clear-recover cycle. Everything else is single-actor.

  AlarmBurster      FaultInjector       WorkflowService.OnFaultInjected      AppState
  ────────────      ─────────────       ────────────────────────────         ────────

   ┌────┴────┐
   │ Tick @  │
   │ AlarmB- │
   │ urstEv- │
   │ eryMs   │
   └────┬────┘

        │ InjectCriticalFault("CHAOS-N", msg)
        │ ─────────────────────────▶
        │                                      │
        │                 ┌────────────────────┴────────────────────┐
        │                 │ if !_activeFaultCodes.Add(code):        │
        │                 │   log Info "ignored (already active)"   │
        │                 │   return    ◀── round-robin wrap hits   │
        │                 │ FaultInjected?.Invoke(args)             │
        │                 └────────────────────┬────────────────────┘
        │                                      │
        │                                      │ event subscriber
        │                                      ▼
        │                            OnFaultInjected(args):
        │                              alarm = new Alarm(Critical, …, IsActive=true)
        │                              Update(s => s with {
        │                                ActiveAlarms = … + alarm,
        │                                WorkflowState = Faulted,
        │                                MotionState = NotReady,
        │                                IsMotionHomed = false
        │                              })
        │                              + Critical/Error diagnostics entry
        │                              + WithAlarmRaised()
        │                              ─────────────────────────────────▶  Faulted
        │                              _runCts?.Cancel()    ─── cancels in-flight run
        │                              _homeCts?.Cancel()   ─── cancels in-flight home

   ┌────┴────┐
   │ Task.   │
   │ Delay   │
   │ (500)   │
   └────┬────┘

        │ ClearFault("CHAOS-N")
        │ ─────────────────────────▶
        │                                      │
        │                                      │ FaultCleared?.Invoke(code)
        │                                      ▼
        │                            OnFaultCleared(code):
        │                              Update(s => s with {
        │                                ActiveAlarms = … with IsActive=false on this code
        │                              })
        │                              + Info diagnostics entry
        │                              ─────────────────────────────────▶  Faulted (unchanged)
        │                                                                  alarm now inactive
   ┌────┴────┐
   │ Task.   │
   │ Delay   │
   │ (500)   │
   └────┬────┘

        │ await _workflow.RecoverAsync()
        │ ─────────────────────────────────────────────────────▶
        │                                                                  RecoverAsync:
        │                                                                    if !CommandGuards.CanRecover(state):
        │                                                                      log Warning + diagnostics rejection
        │                                                                      return  ◀── busy → caught by burster
        │                                                                                  catch + continue
        │                                                                    nextWorkflow = IsMotionHomed
        │                                                                                   ? Ready : Idle
        │                                                                    Update(s => s with {
        │                                                                      WorkflowState = nextWorkflow,
        │                                                                      ActiveAlarms = filtered to active only
        │                                                                    })
        │                                                                    ─────────────────▶  Ready or Idle

        │ next outer scenario tick: StartRun → ...

   ┌────┴────┐
   │ Tick @  │
   │ AlarmB- │
   │ urstEv- │
   │ eryMs   │
   └─────────┘

The 500 ms gaps are tuned empirically. If they are too short, the workflow doesn't observe the Faulted state before the clear arrives (some events get coalesced); if they are too long, the recovery cycle dominates the run-throughput budget. 500 ms × 2 = 1 second of fault overhead per cycle; with AlarmBurstEveryMs=45 000 that's ~2.2% of wall-clock time spent in fault states under ChaosMonkey.
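
Written out, the overhead estimate looks like this. It assumes the two 500 ms gaps are added on top of the AlarmBurstEveryMs wait, as the loop in §3 suggests, and ignores the time spent inside RecoverAsync itself.

```latex
\frac{2 \times 500\ \text{ms}}{45\,000\ \text{ms} + 2 \times 500\ \text{ms}}
  = \frac{1\,000}{46\,000} \approx 2.17\%
```

Dividing by the 45 000 ms interval alone gives about 2.22%; either way the figure rounds to ~2.2%.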

5. Decisions made during implementation

(a) [CallerMemberName] was not used here. The chaos services don't instrument AppStateStore.Update themselves; they call existing methods (InjectCriticalFault, RecoverAsync, etc.) that go through the normal WorkflowService path. SLICE-2.0 is the slice that adds caller-attribution to AppStateStore.Update. SLICE-1.4 just exercises the workflow's existing behavior repeatedly.

(b) IDefectShowerSchedule is a single-property abstraction. Just bool IsShowerActive { get; }. Could have been a richer interface (next shower start time, total shower count, etc.) but the consumer (FramePipelineService.ProcessDefectsForFrame) only needs the boolean. Keeping the interface narrow makes the dependency in FramePipelineService cheap and easy to fake in tests (FakeDefectShowerSchedule is 8 lines).

(c) AlarmBursterService survives RecoverAsync rejection. The try/catch (Exception ex) => Log Warning + continue block is load-bearing. WorkflowService.RecoverAsync rejects when the workflow isn't Faulted — possible if a previous chaos cycle's fault was absorbed by a different code path (e.g., an Abort racing with the inject). The burster needs to continue ticking, not crash the host. AlarmBursterServiceTests.AlarmBurster_WhenRecoverThrows_LogsWarningAndContinues is the regression test.

(d) Round-robin alarm-code pool of 5. Using only CHAOS-001 would let SimulatorFaultInjector._activeFaultCodes.Add reject the second injection (returning false) before the workflow could enter Faulted. Five codes mean each cycle has a fresh code, but the round-robin wraps every 5 cycles — and the wrap-around hits the "already active" branch deliberately, exercising that path too.

(e) Cumulative probability bands in the decorator (not independent draws). See §3 explanation. The mutual-exclusion property makes the probability-budget reasoning trivial: T + I + O is the total chaos rate; 1 - (T + I + O) is the pass-through rate. With independent draws, multiple branches could fire on the same call and the order of evaluation would matter. The cumulative form is simpler.

(f) Pre-flight commit pattern (commit 018bf29). Two changes had to land before the Soak8h capture started:

  1. Flip Simulator:FlakySdk:Enabled from true to false in appsettings.json so that re-captures of pre-SLICE-1.4 rows (slice-1-1, slice-1-2, slice-1-3) reproduce within the existing accuracy bounds (criterion 16).
  2. Fix FlakySdkDecorator.ConnectAsync's timeout branch to fall through to inner when the caller's CT was not cancelled during the hang. The original implementation always threw OperationCanceledException, masking the spec's intent that "the SDK eventually completes after a long delay" should be a survivable case (workflow's DoConnectAsync would catch and treat as Disconnected).

Doing these in a single commit before the soak meant the soak's evidence was clean. Doing them after would have required re-running the 8-hour capture.
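
The fixed timeout branch from change 2 can be sketched as below. This is an illustration, not the decorator's source: TimeoutBranchSketch is a hypothetical name, the inner connection is modelled as a delegate, and passing the token to Task.Delay is an assumption (the §2 diagram only shows the delay followed by a cancellation check).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Demo: an uncancelled hang ends by calling the inner connection once.
var calls = 0;
await TimeoutBranchSketch.RunAsync(_ => { calls++; return Task.CompletedTask; }, 50, CancellationToken.None);
Console.WriteLine($"inner calls after an uncancelled hang: {calls}");

public static class TimeoutBranchSketch
{
    public static async Task RunAsync(Func<CancellationToken, Task> inner, int hangMs, CancellationToken ct)
    {
        // Hang first. Task.Delay throws an OperationCanceledException
        // if the token is cancelled before or during the wait.
        await Task.Delay(hangMs, ct);
        ct.ThrowIfCancellationRequested();

        // Not cancelled: the "SDK" eventually completes, so fall through
        // to the inner connection instead of throwing unconditionally.
        await inner(ct);
    }
}
```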

(g) The SimulatorProfileHydrationService field-mapping fix (commit bf32566). Pass 1 added 7 new fields to SimulatorProfile and SimulatorProfileOptions, but the hydration service's Select() projection silently dropped them — runtime saw all-zero chaos knobs even when appsettings.json had non-zero values. The FlaUI rig with ChaosMonkey was the first thing to exercise the wiring end-to-end, surfacing the bug. The existing test suite did not catch this because no test asserts "options-bound SimulatorProfileOptions carries new field through to the runtime SimulatorProfile catalog entry." Add a binding-roundtrip test if a future slice introduces more SimulatorProfile fields — the gap remains.
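
One possible shape for such a binding-roundtrip guard is a reflection check that maps an options object and compares every same-named property. This is a sketch under assumptions: ProfileOptionsSketch, ProfileSketch, and RoundTripCheck are hypothetical two-field stand-ins, not the project's SimulatorProfile types, and the mapping lambda stands in for the hydration service's projection.

```csharp
using System;
using System.Collections.Generic;

// Demo: a mapper that forgets TelemetryDropoutChance is reported by name.
var options = new ProfileOptionsSketch { DefectShowerEveryMs = 30_000, TelemetryDropoutChance = 0.05 };
var dropped = RoundTripCheck.FindDroppedFields(options, o => new ProfileSketch(o.DefectShowerEveryMs, 0.0));
Console.WriteLine(string.Join(", ", dropped));

// Hypothetical stand-ins for the bound options type and the runtime record.
public sealed class ProfileOptionsSketch
{
    public int DefectShowerEveryMs { get; set; }
    public double TelemetryDropoutChance { get; set; }
}

public sealed record ProfileSketch(int DefectShowerEveryMs, double TelemetryDropoutChance);

public static class RoundTripCheck
{
    // Maps the options, then compares each public property with the
    // same-named property on the result. Missing or unequal ones are returned.
    public static IReadOnlyList<string> FindDroppedFields<TOptions, TProfile>(
        TOptions options, Func<TOptions, TProfile> map)
    {
        var mapped = map(options);
        var missing = new List<string>();
        foreach (var source in typeof(TOptions).GetProperties())
        {
            var target = typeof(TProfile).GetProperty(source.Name);
            var expected = source.GetValue(options);
            if (target is null || !object.Equals(expected, target.GetValue(mapped)))
                missing.Add(source.Name);
        }
        return missing;
    }
}
```

The check only detects a drop when the options carry non-default values, so a test built on it should set every field to something other than its default.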

(h) FlaUI scenario-rig hardening (4 commits — bf32566, 0f1596a, 5462d42, 2108272). ChaosMonkey's fault rate exposed gaps in the MultiTagSoakFlaUi scenario:

  • RecoverButton had no AutomationProperties.AutomationId — FlaUI couldn't click it after a fault.
  • The post-run loop didn't wait for Faulted → Idle transition before issuing the next Home click.
  • Home was a single-attempt operation that threw if a fault fired during the homing window (which it does ~every 45 s).
  • Connect was a single-attempt operation that threw on the first failed connect under ConnectionFailureProbability = 0.30.

Each fix is a retry loop in the scenario, not a code change in the application. Application-side behavior was correct throughout — the scenario rig was the surface that needed hardening to capture under chaos.
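
A generic retry wrapper of the kind these fixes describe might look like the following. It is a sketch only: ScenarioRetry is a hypothetical helper, the step is modelled as a plain delegate with no FlaUI calls, and the attempt count and delay are illustrative values.

```csharp
using System;
using System.Threading.Tasks;

// Demo: a step that fails twice (as if a fault fired mid-click) and then succeeds.
var attempts = 0;
await ScenarioRetry.RunAsync(() =>
{
    attempts++;
    if (attempts < 3) throw new InvalidOperationException("fault fired during step");
    return Task.CompletedTask;
}, 5, TimeSpan.FromMilliseconds(10));
Console.WriteLine($"succeeded after {attempts} attempts");

public static class ScenarioRetry
{
    // Retries the step until it succeeds or maxAttempts is reached.
    // The final failure is rethrown so the scenario still fails loudly.
    public static async Task RunAsync(Func<Task> step, int maxAttempts, TimeSpan delayBetween)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await step();
                return;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                await Task.Delay(delayBetween);   // give the workflow time to recover
            }
        }
    }
}
```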

6. Invariants and traps

Simulator:FlakySdk:Enabled defaults to false for criterion-16 reproducibility. The merged appsettings.json ships Enabled = false. To re-run the ChaosMonkey capture, manually flip to true before building (and back to false afterward — runbook §4.5 documents this). Don't change the default to true without re-evaluating criterion 16 — it would silently break reproducibility of every pre-SLICE-1.4 row.

The conditional DI wiring is bit-for-bit reproducible when Enabled = false. The factory in InfrastructureServiceCollectionExtensions returns the inner SimulatedMachineConnection directly when Enabled=false — the decorator is not in the call path, no Random.Shared draw happens, no per-call overhead. Don't "simplify" by always returning the decorator and letting it short-circuit on Enabled=false — that would inject a (cheap but observable) Random.Shared.NextDouble() call on every connect, breaking the bit-for-bit-reproducibility guarantee.

AlarmBursterService keeps ticking even when no run is active. The cycle still fires inject → clear → recover even if the workflow is Idle between runs. The RecoverAsync rejection (workflow is not Faulted) gets caught by the swallow-and-log handler. This means under ChaosMonkey, idle time also generates fault entries in the diagnostics timeline — that's expected, not a defect.

DefectShowerService.IsShowerActive is backed by a volatile bool field (_isActive) and is read from another thread. FramePipelineService.ProcessDefectsForFrame reads it on each frame from the consumer thread; the shower service writes it from its own background task. The volatile keyword ensures the read sees the latest write without explicit memory-barrier code. Don't remove the volatile if you change the implementation.

TimeCompressionFactor does NOT scale producer rates. The spec is explicit: tag emitters, frame producer, encoder source all run at real time regardless of TimeCompressionFactor. Only SimulatedMachineConnection.ConnectAsync (delay) and SimulatedMotionController.InterpolateAsync (per-tick wait) are scaled. If a future change applies the factor to producer rates, the data plane bandwidth becomes uncalibrated and Phase 1 measurements would no longer be reproducible with TimeCompressionFactor != 1.0.

TelemetryDropoutChance does NOT increment samples.coalesced. Dropout is a deliberate skip — neither samples.ingested nor samples.coalesced is incremented; the cell value is unchanged; the noise refstate still advances (so the random-walk is visibly continuous after dropout ends). If you ever see samples.coalesced rising under a dropout-heavy profile, something is wrong — the counter's only legitimate increment is "emitter overwrote a cell before the snapshot publisher consumed it," which has nothing to do with deliberate dropout.
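
The counter behavior described above can be modelled in a few lines. This is a toy model under assumptions, not SimulatedTagSource: DropoutSketch and DropoutCounters are hypothetical, the noise state is reduced to a step counter, and a seeded Random replaces whatever source the emitter actually uses.

```csharp
using System;

// Demo: 10 000 emitter cycles with a 50% dropout chance.
var result = DropoutSketch.Run(10_000, 0.5, 42);
Console.WriteLine($"ingested={result.Ingested} coalesced={result.Coalesced} noiseSteps={result.NoiseSteps}");

public readonly record struct DropoutCounters(int Ingested, int Coalesced, int NoiseSteps);

public static class DropoutSketch
{
    public static DropoutCounters Run(int cycles, double dropoutChance, int seed)
    {
        var rng = new Random(seed);
        int ingested = 0, coalesced = 0, noiseSteps = 0;
        for (var i = 0; i < cycles; i++)
        {
            noiseSteps++;                        // noise state advances on every cycle
            if (rng.NextDouble() < dropoutChance)
                continue;                        // deliberate skip: no counter moves
            ingested++;                          // sample written to the cell
        }
        // coalesced is never touched here: dropout is not an overwrite.
        return new DropoutCounters(ingested, coalesced, noiseSteps);
    }
}
```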

FlakySdkOptions.TimeoutHangMs defaults to 30_000 (30 s). Test code that exercises the timeout branch must override this to a small value — otherwise dotnet test runs for 30 seconds per test. FlakySdkDecoratorTests.FlakySdk_TimeoutBranch_* use TimeoutHangMs = 50 or 100. If you write a new timeout-branch test, configure a small hang.

FlakySdkDecorator does NOT wrap IMotionController. Motion-side flakiness is deferred and out of scope for this slice. The criterion-11 fault-branch evidence comes from IMachineConnection-only wrapping. If a future slice wraps IMotionController, expect to surface new WorkflowService fault paths (DoHomeAsync exception branch, RunLoopAsync exception branch) that aren't currently exercised.

The DefectShowerService polls profile state at the start of each loop iteration. It does NOT subscribe to ISimulatorProfileProvider.ProfileChanged. A profile change to DefectShowerEveryMs = 0 is picked up at the start of the next iteration, which can be up to a full cycle (the quiet wait plus the shower duration, roughly everyMs) later. For SLICE-1.4's purposes this is fine; for runtime tuning UIs it would feel laggy.

7. Test surface

Covered by unit tests:

  • All 7 chaos profile fields validator rules: each rejection case as a [Theory] with boundary values.
  • FlakySdkOptionsValidator rejects each chance outside [0,1] and TimeoutHangMs outside [1, 600_000].
  • DefectShowerService activates/deactivates per schedule with short timing (EveryMs=200, DurationMs=100).
  • AlarmBursterService: ≥ 3 cycles in 5s; round-robin produces 5 distinct codes; survives RecoverAsync throwing; disabled when EveryMs=0.
  • FlakySdkDecorator: each branch isolated; bypass when Enabled=false; pre-flight regression test for fall-through.
  • SimulatedTagSource honors TelemetryDropoutChance (counts consistent with dropout rate).
  • SimulatedMachineConnection honors TimeCompressionFactor (wall-clock delay scales) and NetworkLatencyMean/Stddev (Gaussian distribution mean within ±20%).
  • SimulatedMotionController honors TimeCompressionFactor (motion completes faster in wall-clock time).
  • FramePipelineServiceShowerTests: shower-active forces defect on every frame.

Covered by capture (slice-1-4-chaos-monkey + slice-1-4-soak-8h rows):

  • ChaosMonkey: 491 runs.started, 453 completes, 37 fault cycles with all four fault branches verified by log inspection (39 injected, 39 cleared, 37 recovered, 120 defect-shower transitions).
  • Soak8h: 0 faults (AlarmBurstEveryMs=0), 5 109 runs (100% completion), working-set steady-state drift = −2.7 MB across 8 hours. No leak.
  • Both: criterion-16 reproducibility check (pre-existing rows still match within bounds).

Not covered (intentional gaps):

  • Profile-roundtrip binding test for the 7 new fields. The bf32566 regression (hydration service Select() projection dropped fields) had no automated guard. A SimulatorProfileOptionsBindingTests-style test for the chaos fields specifically is filed as a follow-up but not yet implemented.
  • FlakySdk motion-side decoration. Documented non-scope. If Phase 2 motivates it, write the spec and the tests then.
  • Combined chaos × soak. No test runs ChaosMonkey for 8 hours. The slice-1-4-chaos-monkey row was 30 min; a longer under-chaos-load capture would be a future Phase 2 follow-up if needed.
  • Race between AlarmBursterService.InjectCriticalFault and an in-flight Abort. The burster doesn't synchronize with WorkflowService — under heavy concurrency the inject and the abort can race, with whichever wins cancelling the run. Behavior is well-defined (whoever wins cancels; the loser's effect is ignored) but no test exercises both paths simultaneously.

Notably absent test: there is no test for "Simulator:FlakySdk:Enabled toggled at runtime via IOptionsMonitor reload." The DI factory reads Enabled once, at service-resolution time, so a later reload cannot add or remove the decorator. If a future change wants runtime-toggleable chaos, it needs to either rebuild the decorator on IOptionsMonitor.OnChange or move the gate inside ConnectAsync (the implementation already reads _options.CurrentValue.Enabled per-call, but the registration is captured once). The spec-time decision was that runtime toggle is unnecessary; revisit if an engineering panel ever exposes the chaos knobs as live controls.
