SLICE-1.4 Design Notes — Storm & Soak Profiles
- Slice: SLICE-1.4
- Implementation status: Completed (2026-05-03, criterion 12 amended)
- Audience: anyone modifying the chaos services, the flaky-SDK decorator, the simulator's load-shaping knobs, or interpreting the Phase 1 exit-gate evidence
This doc explains how the storm-and-soak chaos pipeline actually works in code: the three new services (two hosted background services plus the flaky-SDK decorator and its branch dispatch), the seven new SimulatorProfile fields and where each is consumed, the conditional DI wiring that lets the decorator be bypassed bit-for-bit when disabled, and the four FlaUI hardening fixes that the ChaosMonkey capture surfaced. This is the most complex Phase 1 slice; read it if you're tuning chaos parameters, adding a new fault-injection mode, or porting the rig to non-Windows hosts.
1. Quick reference
Three new services (all in InspectionPrototype.Application.Services except the decorator):
| Service | Type | Role |
|---|---|---|
| DefectShowerService | IHostedService + IDefectShowerSchedule | Toggle-on every DefectShowerEveryMs; consulted by FramePipelineService |
| AlarmBursterService | IHostedService | Periodic inject → wait → clear → wait → recover cycle |
| FlakySdkDecorator | IMachineConnection decorator | Three injected fault modes around ConnectAsync |
Seven new SimulatorProfile fields (record properties; all default to 0 except TimeCompressionFactor, which defaults to 1.0):
| Field | Range | Consumed by |
|---|---|---|
| DefectShowerEveryMs | [0, 3_600_000] | DefectShowerService |
| DefectShowerDurationMs | [0, 60_000] | DefectShowerService |
| AlarmBurstEveryMs | [0, 3_600_000] | AlarmBursterService |
| TelemetryDropoutChance | [0.0, 1.0] | SimulatedTagSource (per-emitter cycle) |
| NetworkLatencyMeanMs | [0, 30_000] | SimulatedMachineConnection.ConnectAsync |
| NetworkLatencyStddevMs | [0, 30_000] | SimulatedMachineConnection.ConnectAsync |
| TimeCompressionFactor | [0.1, 100.0] | SimulatedMachineConnection + SimulatedMotionController |
One global config block (Simulator:FlakySdk):
| Option | Range | Default | Effect |
|---|---|---|---|
| Enabled | bool | false | Master gate; false bypasses decorator entirely |
| TimeoutChance | [0.0, 1.0] | 0.0 | Probability of timeout-hang branch |
| IgnoreCancellationChance | [0.0, 1.0] | 0.0 | Probability of ignore-cancellation branch |
| OutOfBandThrowChance | [0.0, 1.0] | 0.0 | Probability of out-of-band-throw branch |
| TimeoutHangMs | [1, 600_000] | 30_000 | Hang duration on timeout branch |
Two new profiles (appsettings.json):
- ChaosMonkey (aggressive): DefectShowerEveryMs=30s, DurationMs=3s, AlarmBurstEveryMs=45s, TelemetryDropoutChance=0.05, NetworkLatencyMeanMs=250, ConnectionFailureProbability=0.30, TimeCompressionFactor=1.0. Used with FlakySdk:Enabled=true for the criterion-11 capture.
- Soak8h (gentle): DefectShowerEveryMs=10min, DurationMs=5s, AlarmBurstEveryMs=0 (disabled), TelemetryDropoutChance=0.01, NetworkLatencyMeanMs=50. Used with FlakySdk:Enabled=false for the criterion-12 capture.
Key files:
src/InspectionPrototype.Application/State/SimulatorProfile.cs // 7 new fields
src/InspectionPrototype.Application/Abstractions/IDefectShowerSchedule.cs // single-property abstraction
src/InspectionPrototype.Application/Services/DefectShowerService.cs
src/InspectionPrototype.Application/Services/AlarmBursterService.cs
src/InspectionPrototype.Application/Services/FramePipelineService.cs // consults IDefectShowerSchedule
src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptions.cs
src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptionsValidator.cs
src/InspectionPrototype.Infrastructure/Simulator/FlakySdkDecorator.cs
src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesOptions.cs // 7 new binder fields
src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesValidator.cs // 7 new validation rules
src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfileHydrationService.cs // maps 7 new fields
src/InspectionPrototype.Infrastructure/Simulator/SimulatedTagSource.cs // dropout
src/InspectionPrototype.Infrastructure/Simulator/SimulatedMachineConnection.cs // time + latency
src/InspectionPrototype.Infrastructure/Simulator/SimulatedMotionController.cs // tick scaling
src/InspectionPrototype.Infrastructure/InfrastructureServiceCollectionExtensions.cs // conditional decorator wiring
src/InspectionPrototype.App/appsettings.json // 2 new profiles + FlakySdk block
Key tests (selected):
| Test | Asserts |
|---|---|
| DefectShowerServiceTests | Activates and deactivates per schedule |
| AlarmBursterServiceTests (4 tests) | ≥ 3 cycles in 5s; round-robin code distribution; survives RecoverAsync throw; disabled when EveryMs=0 |
| FlakySdkDecoratorTests (4 tests) | Each branch isolated by setting one chance to 1.0; bypass when Enabled=false |
| FlakySdk_TimeoutBranch_WhenNotCancelled_FallsThroughToInner | Pre-flight fix: timeout falls through to inner when CT not cancelled |
| SimulatedTagSourceDropoutTests | samples.ingested ≈ 50% under TelemetryDropoutChance=0.5 |
| SimulatedMachineConnectionTimeCompressionTests | Wall-clock connect delay scales with TimeCompressionFactor |
| SimulatedMachineConnectionNetworkLatencyTests | Gaussian latency mean within ±20% of configured |
| SimulatorProfilesValidatorChaosTests | Each new validation rule rejects out-of-range values |
| FramePipelineServiceShowerTests | Shower-active forces defect on every frame |
2. Class shape
The chaos pipeline is several independent subsystems that all read from ISimulatorProfileProvider but otherwise don't share state:
┌────────────────────────────────────┐
│ ISimulatorProfileProvider │
│ .CurrentProfile │
│ - DefectShowerEveryMs/Duration │
│ - AlarmBurstEveryMs │
│ - TelemetryDropoutChance │
│ - NetworkLatencyMean/Stddev │
│ - TimeCompressionFactor │
└────────────┬───────────────────────┘
│
┌──────────────┬───────────────┼───────────────┬──────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Defect- │ │AlarmBurster- │ │SimulatedTag- │ │SimulatedMach-│ │SimulatedMo- │
│Shower- │ │Service │ │Source │ │ineConnection │ │tionController│
│Service │ │ │ │ (per-emitter │ │ (per Connect-│ │ (per tick │
│ (back- │ │ (background │ │ loop reads │ │ Async call: │ │ inside │
│ ground │ │ inject loop │ │ Dropout- │ │ scale delay,│ │ Interpolate-│
│ flag │ │ + retry- │ │ Chance per │ │ add Gaussian│ │ Async │
│ flip │ │ resilient │ │ cycle) │ │ latency) │ │ scales tick │
│ cycle) │ │ cycle) │ │ │ │ │ │ by TimeComp)│
└───┬────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────────────┘
│ │ │ │
│ exposes │ calls │ skips │
│ IDefect- │ IFault- │ samples. │ delays /
│ Shower- │ Injector + │ ingested when │ ConnectAsync
│ Schedule │ IWorkflow- │ rng < dropout │ result
│ │ Service │ │
▼ ▼ ▼ ▼
┌────────┐ ┌──────────────┐ (existing) (existing)
│Frame- │ │SimulatorFault│
│Pipeline│ │Injector + │
│Service │ │WorkflowService
│.Process│ │ → AppState │
│Defects │ │ .Active- │
│ForFrame│ │ Alarms │
│ checks │ │ .Workflow- │
│ Is- │ │ State = │
│ Shower-│ │ Faulted │
│ Active │ │ etc. │
└────────┘ └──────────────┘
FlakySdkDecorator wraps IMachineConnection conditionally:
Configuration ("Simulator:FlakySdk") with Enabled: true. DI registration on IServiceCollection:

  services.AddSingleton<SimulatedMachineConnection>();
  services.AddSingleton<IMachineConnection>(sp =>
  {
      var opts = sp.GetRequiredService<
          IOptionsMonitor<FlakySdkOptions>>().CurrentValue;
      var inner = sp.GetRequiredService<SimulatedMachineConnection>();
      if (!opts.Enabled) return inner;
      return new FlakySdkDecorator(inner, ...);
  });

FlakySdkDecorator wraps SimulatedMachineConnection:

  ConnectAsync:
    roll = Random.Shared
    if roll < Timeout:
      Task.Delay(HangMs)
      ct.ThrowIfCancelled()
      return inner(ct)        ◀── pre-flight fix (commit 018bf29): falls through to inner if not cancelled
    elif roll < T+Ignore:
      return inner(None)      ◀── ignores caller's CT
    elif roll < T+I+Throw:
      throw InvalidOp         ◀── out-of-band throw
    else: return inner(ct)

Configuration with "Enabled": false. DI registration:

  IMachineConnection resolves DIRECTLY to SimulatedMachineConnection; the decorator is NOT in
  the call path. Bit-for-bit reproducibility for pre-SLICE-1.4 captures (criterion 16).

3. Lifecycle: the three new services
DefectShowerService is IHostedService + IDefectShowerSchedule. State machine:
host start host stop
│ │
▼ ▼
┌──────────────────┐ ┌──────────────┐
│ StartAsync │ │ StopAsync │
│ starts back- │ │ cancels CTS │
│ ground task │ │ awaits task │
└────────┬─────────┘ └──────────────┘
│
│ RunAsync loop:
│
▼
┌──────────────────────────────────────────────────────────┐
│ Read profile.DefectShowerEveryMs each iteration │
│ │
│ if everyMs <= 0: │
│ Task.Delay(1000); continue (idle poll) │
│ │
│ quietMs = max(0, everyMs - durationMs) │
│ Task.Delay(quietMs) ◀── window closed │
│ │
│ _isActive = true ◀── window opens │
│ Log Info "Defect shower active" │
│ Update(s => s.WithDiagnosticsEntry(Pipeline, Info, …)) │
│ │
│ Task.Delay(durationMs) │
│ │
│ _isActive = false ◀── window closes │
│ Log Info "Defect shower ended" │
│ Update(s => s.WithDiagnosticsEntry(Pipeline, Info, …)) │
│ │
│ loop ─────────────────────────────▶ │
└──────────────────────────────────────────────────────────┘
AlarmBursterService is IHostedService. The loop wraps each cycle in try/catch so a single failure (e.g., RecoverAsync rejected because workflow is mid-Stopping) doesn't terminate the host:
RunAsync loop (per iteration):
profile.AlarmBurstEveryMs == 0? ──── yes ───▶ Task.Delay(1000) ──┐
│
loop ┘
│ no
▼
Task.Delay(everyMs)
│
▼
code = _pool[Interlocked.Increment(ref _index) % 5]
│
│ pool: [CHAOS-001, CHAOS-002, CHAOS-003, CHAOS-004, CHAOS-005]
│ Round-robin so OnFaultInjected's "already active" duplicate-
│ code branch is also exercised by the wrap-around.
│
▼
try {
_faultInjector.InjectCriticalFault(code, "ChaosMonkey burst at HH:mm:ss.fff")
Task.Delay(500) ◀── workflow observes Faulted state
_faultInjector.ClearFault(code)
Task.Delay(500) ◀── workflow's OnFaultCleared marks alarm inactive
await _workflow.RecoverAsync() ◀── workflow → Ready (if homed) or Idle
}
catch (OperationCanceledException) → break (graceful stop)
catch (Exception ex) → Log Warning + continue ◀── load-bearing resilience
FlakySdkDecorator is stateless: it has no lifecycle and just wraps each ConnectAsync call. Branch dispatch uses cumulative probability bands, not independent rolls:
roll = Random.Shared.NextDouble() ── one draw per ConnectAsync call
roll ∈ [0, T): Timeout branch
roll ∈ [T, T+I): IgnoreCancellation branch
roll ∈ [T+I, T+I+O): OutOfBandThrow branch
roll ∈ [T+I+O, 1): Pass-through to inner
where T = TimeoutChance, I = IgnoreCancellationChance, O = OutOfBandThrowChance
This means at most one branch fires per call. The spec sketched independent draws, where several branches could fire on the same call; the implementation chose mutual exclusion because the probability budget is simpler to reason about, at the cost of slightly less chaotic behavior. Both approaches satisfy criterion 10's per-branch test (set one chance to 1.0, others to 0.0).
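The band layout above can be sketched as a pure function of a single roll. This is an illustrative sketch rather than the decorator's actual code: FlakyBranch and BandDispatch.SelectBranch are hypothetical names, and the real ConnectAsync presumably performs the branch actions inline.

```csharp
using System;

// Hypothetical names for illustration; not taken from the codebase.
public enum FlakyBranch { Timeout, IgnoreCancellation, OutOfBandThrow, PassThrough }

public static class BandDispatch
{
    // Maps one roll in [0, 1) onto the cumulative bands
    // [0, T), [T, T+I), [T+I, T+I+O), [T+I+O, 1).
    public static FlakyBranch SelectBranch(double roll, double t, double i, double o)
    {
        if (roll < t) return FlakyBranch.Timeout;
        if (roll < t + i) return FlakyBranch.IgnoreCancellation;
        if (roll < t + i + o) return FlakyBranch.OutOfBandThrow;
        return FlakyBranch.PassThrough;
    }
}
```

These notes only mention per-option range checks. If T + I + O were ever to exceed 1.0, this sketch would truncate the later bands and leave the pass-through band empty.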
4. Runtime flow — fault burst lifecycle
The headline flow is the inject-clear-recover cycle. Everything else is single-actor.
AlarmBurster FaultInjector WorkflowService.OnFaultInjected AppState
──────────── ───────────── ──────────────────────────── ────────
│
┌────┴────┐
│ Tick @ │
│ AlarmB- │
│ urstEv- │
│ eryMs │
└────┬────┘
│
│ InjectCriticalFault("CHAOS-N", msg)
│ ─────────────────────────▶
│ │
│ ┌────────────────────┴────────────────────┐
│ │ if !_activeFaultCodes.Add(code): │
│ │ log Info "ignored (already active)" │
│ │ return ◀── round-robin wrap hits │
│ │ FaultInjected?.Invoke(args) │
│ └────────────────────┬────────────────────┘
│ │
│ │ event subscriber
│ ▼
│ OnFaultInjected(args):
│ alarm = new Alarm(Critical, …, IsActive=true)
│ Update(s => s with {
│ ActiveAlarms = … + alarm,
│ WorkflowState = Faulted,
│ MotionState = NotReady,
│ IsMotionHomed = false
│ })
│ + Critical/Error diagnostics entry
│ + WithAlarmRaised()
│ ─────────────────────────────────▶ Faulted
│ _runCts?.Cancel() ─── cancels in-flight run
│ _homeCts?.Cancel() ─── cancels in-flight home
│
┌────┴────┐
│ Task. │
│ Delay │
│ (500) │
└────┬────┘
│
│ ClearFault("CHAOS-N")
│ ─────────────────────────▶
│ │
│ │ FaultCleared?.Invoke(code)
│ ▼
│ OnFaultCleared(code):
│ Update(s => s with {
│ ActiveAlarms = … with IsActive=false on this code
│ })
│ + Info diagnostics entry
│ ─────────────────────────────────▶ Faulted (unchanged)
│ alarm now inactive
┌────┴────┐
│ Task. │
│ Delay │
│ (500) │
└────┬────┘
│
│ await _workflow.RecoverAsync()
│ ─────────────────────────────────────────────────────▶
│ RecoverAsync:
│ if !CommandGuards.CanRecover(state):
│ log Warning + diagnostics rejection
│ return ◀── busy → caught by burster
│ catch + continue
│ nextWorkflow = IsMotionHomed
│ ? Ready : Idle
│ Update(s => s with {
│ WorkflowState = nextWorkflow,
│ ActiveAlarms = filtered to active only
│ })
│ ─────────────────▶ Ready or Idle
│
│ next outer scenario tick: StartRun → ...
│
┌────┴────┐
│ Tick @ │
│ AlarmB- │
│ urstEv- │
│ eryMs │
└─────────┘
The 500 ms gaps are tuned empirically. Too short and the workflow doesn't observe the Faulted state before the clear arrives (some events get coalesced); too long and the recovery cycle dominates the run-throughput budget. The two 500 ms gaps add 1 second of fault overhead per cycle; with AlarmBurstEveryMs=45 000 that is ~2.2% of wall-clock time spent in fault states under ChaosMonkey.
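The overhead figure can be reproduced with a few lines of arithmetic. FaultOverhead.Fraction is a hypothetical helper written for this note, not part of the codebase.

```csharp
using System;

// Hypothetical helper: fraction of wall-clock time spent between inject and recover.
public static class FaultOverhead
{
    public static double Fraction(int burstEveryMs, int gapMs, int gapsPerCycle)
    {
        double faultedMs = (double)gapMs * gapsPerCycle; // 2 gaps of 500 ms = 1000 ms
        return faultedMs / burstEveryMs;
    }
}
```

FaultOverhead.Fraction(45_000, 500, 2) evaluates 1000 / 45000, roughly 0.022. If the cycle period is instead taken as the wait plus the two gaps (46 000 ms), the fraction is 1000 / 46000, which also rounds to about 2.2%.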
5. Decisions made during implementation
(a) [CallerMemberName] was not used here. The chaos services don't instrument AppStateStore.Update themselves; they call existing methods (InjectCriticalFault, RecoverAsync, etc.) that go through the normal WorkflowService path. SLICE-2.0 is the slice that adds caller-attribution to AppStateStore.Update. SLICE-1.4 just exercises the workflow's existing behavior repeatedly.
(b) IDefectShowerSchedule is a single-property abstraction. Just bool IsShowerActive { get; }. Could have been a richer interface (next shower start time, total shower count, etc.) but the consumer (FramePipelineService.ProcessDefectsForFrame) only needs the boolean. Keeping the interface narrow makes the dependency in FramePipelineService cheap and easy to fake in tests (FakeDefectShowerSchedule is 8 lines).
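A minimal sketch of how narrow the abstraction is. The interface shape comes from the text above; the body of the fake and the ShowerGate helper are assumptions for illustration, since the real FakeDefectShowerSchedule and the check inside ProcessDefectsForFrame are not reproduced in these notes.

```csharp
// From the notes: the interface is a single read-only boolean.
public interface IDefectShowerSchedule
{
    bool IsShowerActive { get; }
}

// Hypothetical test double; tests flip the flag directly.
public sealed class FakeDefectShowerSchedule : IDefectShowerSchedule
{
    public bool IsShowerActive { get; set; }
}

// Hypothetical stand-in for the consumer's check: an active shower forces a defect.
public static class ShowerGate
{
    public static bool ShouldEmitDefect(IDefectShowerSchedule schedule, bool baselineDefect)
        => schedule.IsShowerActive || baselineDefect;
}
```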
(c) AlarmBursterService survives RecoverAsync rejection. The try/catch (Exception ex) => Log Warning + continue block is load-bearing. WorkflowService.RecoverAsync rejects when the workflow isn't Faulted — possible if a previous chaos cycle's fault was absorbed by a different code path (e.g., an Abort racing with the inject). The burster needs to continue ticking, not crash the host. AlarmBursterServiceTests.AlarmBurster_WhenRecoverThrows_LogsWarningAndContinues is the regression test.
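The catch structure described in (c) can be sketched in isolation. ResilientLoop is a hypothetical name, the maxCycles bound exists only so the sketch terminates (the real loop runs until the host stops), and the warnings list stands in for the logger.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ResilientLoop
{
    // Runs `cycle` up to maxCycles times. Cancellation ends the loop quietly;
    // any other exception is recorded as a warning and the loop keeps going.
    public static async Task<List<string>> RunAsync(
        Func<int, CancellationToken, Task> cycle, int maxCycles, CancellationToken ct)
    {
        var warnings = new List<string>();
        for (int i = 0; i < maxCycles && !ct.IsCancellationRequested; i++)
        {
            try
            {
                await cycle(i, ct);
            }
            catch (OperationCanceledException)
            {
                break; // graceful stop
            }
            catch (Exception ex)
            {
                warnings.Add($"cycle {i} failed: {ex.Message}"); // stand-in for Log Warning
            }
        }
        return warnings;
    }
}
```

The ordering of the two catch clauses matters: OperationCanceledException must be matched first, otherwise a host shutdown would be logged as a warning and the loop would keep spinning.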
(d) Round-robin alarm-code pool of 5. Using only CHAOS-001 would let SimulatorFaultInjector._activeFaultCodes.Add reject the second injection (returning false) before the workflow could enter Faulted. Five codes mean each cycle has a fresh code, but the round-robin wraps every 5 cycles — and the wrap-around hits the "already active" branch deliberately, exercising that path too.
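A sketch of the round-robin selection. The pool contents and the Interlocked.Increment pattern come from the lifecycle diagram in §3; the class name, the starting index, and the unsigned cast are assumptions added here.

```csharp
using System.Threading;

public sealed class ChaosCodePool
{
    private static readonly string[] Pool =
        { "CHAOS-001", "CHAOS-002", "CHAOS-003", "CHAOS-004", "CHAOS-005" };

    private int _index = -1; // assumption: start here so the first code is CHAOS-001

    public string Next()
    {
        // The unsigned cast keeps the index non-negative if the counter ever
        // wraps past int.MaxValue (an addition in this sketch, not in the notes).
        uint n = unchecked((uint)Interlocked.Increment(ref _index));
        return Pool[n % (uint)Pool.Length];
    }
}
```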
(e) Cumulative probability bands in the decorator (not independent draws). See §3 explanation. The mutual-exclusion property makes the probability-budget reasoning trivial: T + I + O is the total chaos rate; 1 - (T + I + O) is the pass-through rate. With independent draws, multiple branches could fire on the same call and the order of evaluation would matter. The cumulative form is simpler.
(f) Pre-flight commit pattern (commit 018bf29). Two changes had to land before the Soak8h capture started:
- Flip Simulator:FlakySdk:Enabled from true to false in appsettings.json so that re-captures of pre-SLICE-1.4 rows (slice-1-1, slice-1-2, slice-1-3) reproduce within the existing accuracy bounds (criterion 16).
- Fix FlakySdkDecorator.ConnectAsync's timeout branch to fall through to inner when the caller's CT was not cancelled during the hang. The original implementation always threw OperationCanceledException, masking the spec's intent that "the SDK eventually completes after a long delay" should be a survivable case (workflow's DoConnectAsync would catch and treat as Disconnected).
Doing these in a single commit before the soak meant the soak's evidence was clean. Doing them after would have required re-running the 8-hour capture.
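The post-fix timeout behavior can be sketched as follows. This is a sketch under assumptions: TimeoutBranchSketch is a hypothetical name, the bool result type is invented, and whether the real hang passes the caller's token to Task.Delay is not stated in these notes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutBranchSketch
{
    // Hang for hangMs, then either surface cancellation or fall through to inner.
    public static async Task<bool> ConnectWithHangAsync(
        Func<CancellationToken, Task<bool>> inner, int hangMs, CancellationToken ct)
    {
        // Assumption: the hang itself observes the caller's token.
        await Task.Delay(hangMs, ct);
        ct.ThrowIfCancellationRequested();
        // Post-fix behavior: the slow SDK call eventually completes.
        return await inner(ct);
    }
}
```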
(g) The SimulatorProfileHydrationService field-mapping fix (commit bf32566). Pass 1 added 7 new fields to SimulatorProfile and SimulatorProfileOptions, but the hydration service's Select() projection silently dropped them — runtime saw all-zero chaos knobs even when appsettings.json had non-zero values. The FlaUI rig with ChaosMonkey was the first thing to exercise the wiring end-to-end, surfacing the bug. The existing test suite did not catch this because no test asserts "options-bound SimulatorProfileOptions carries new field through to the runtime SimulatorProfile catalog entry." Add a binding-roundtrip test if a future slice introduces more SimulatorProfile fields — the gap remains.
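One way to guard against silently dropped fields is a reflection check that pushes a non-default sentinel through every options property and verifies it arrives on the mapped profile. The sketch below uses hypothetical stand-in types with three of the seven fields; it does not exercise real configuration binding, which a proper roundtrip test would also need to cover.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the options and runtime profile types.
public sealed class ProfileOptionsSketch
{
    public int DefectShowerEveryMs { get; set; }
    public int AlarmBurstEveryMs { get; set; }
    public double TelemetryDropoutChance { get; set; }
}

public sealed record ProfileSketch(
    int DefectShowerEveryMs, int AlarmBurstEveryMs, double TelemetryDropoutChance);

public static class RoundTripCheck
{
    // Returns the names of option properties whose value did not reach the profile.
    public static string[] DroppedFields(Func<ProfileOptionsSketch, ProfileSketch> map)
    {
        var dropped = new List<string>();
        foreach (var prop in typeof(ProfileOptionsSketch).GetProperties())
        {
            var options = new ProfileOptionsSketch();
            // Box each sentinel with the property's own type so SetValue accepts it.
            object sentinel = prop.PropertyType == typeof(double) ? (object)0.5 : 7;
            prop.SetValue(options, sentinel);

            var profile = map(options);
            var target = typeof(ProfileSketch).GetProperty(prop.Name);
            if (target is null || !Equals(target.GetValue(profile), sentinel))
                dropped.Add(prop.Name);
        }
        return dropped.ToArray();
    }
}
```

Because the check enumerates the options type by reflection, a field added later is covered without editing the test.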
(h) FlaUI scenario-rig hardening (4 commits — bf32566, 0f1596a, 5462d42, 2108272). ChaosMonkey's fault rate exposed gaps in the MultiTagSoakFlaUi scenario:
- RecoverButton had no AutomationProperties.AutomationId, so FlaUI couldn't click it after a fault.
- The post-run loop didn't wait for the Faulted → Idle transition before issuing the next Home click.
- Home was a single-attempt operation that threw if a fault fired during the homing window (which it does ~every 45 s).
- Connect was a single-attempt operation that threw on the first failed connect under ConnectionFailureProbability = 0.30.
Each fix is a retry loop in the scenario, not a code change in the application. Application-side behavior was correct throughout — the scenario rig was the surface that needed hardening to capture under chaos.
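A generic shape for such a scenario-side retry loop, as a sketch only: ScenarioRetry is a hypothetical helper, and the actual MultiTagSoakFlaUi loops are not reproduced in these notes.

```csharp
using System;
using System.Threading;

public static class ScenarioRetry
{
    // Retries `attempt` until it returns true or maxAttempts is exhausted.
    // Returns the number of attempts used, or -1 if every attempt failed.
    public static int UntilSuccess(Func<bool> attempt, int maxAttempts, TimeSpan pause)
    {
        for (int n = 1; n <= maxAttempts; n++)
        {
            try
            {
                if (attempt()) return n;
            }
            catch (Exception)
            {
                // A fault fired mid-operation; fall through and try again.
            }
            if (n < maxAttempts) Thread.Sleep(pause);
        }
        return -1;
    }
}
```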
6. Invariants and traps
Simulator:FlakySdk:Enabled defaults to false for criterion-16 reproducibility. The merged appsettings.json ships Enabled = false. To re-run the ChaosMonkey capture, manually flip to true before building (and back to false afterward — runbook §4.5 documents this). Don't change the default to true without re-evaluating criterion 16 — it would silently break reproducibility of every pre-SLICE-1.4 row.
The conditional DI wiring is bit-for-bit reproducible when Enabled = false. The factory in InfrastructureServiceCollectionExtensions returns the inner SimulatedMachineConnection directly when Enabled=false — the decorator is not in the call path, no Random.Shared draw happens, no per-call overhead. Don't "simplify" by always returning the decorator and letting it short-circuit on Enabled=false — that would inject a (cheap but observable) Random.Shared.NextDouble() call on every connect, breaking the bit-for-bit-reproducibility guarantee.
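The bypass property can be illustrated without a DI container. The types below are hypothetical minimal stand-ins (the real IMachineConnection has more members); the point is that with the gate off, the caller receives the inner instance itself rather than a pass-through wrapper.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical minimal shapes for illustration.
public interface IConnectionSketch
{
    Task ConnectAsync(CancellationToken ct);
}

public sealed class InnerConnection : IConnectionSketch
{
    public Task ConnectAsync(CancellationToken ct) => Task.CompletedTask;
}

public sealed class DecoratorSketch : IConnectionSketch
{
    private readonly IConnectionSketch _inner;
    public DecoratorSketch(IConnectionSketch inner) => _inner = inner;

    // Fault branches omitted; only the wrapping relationship matters here.
    public Task ConnectAsync(CancellationToken ct) => _inner.ConnectAsync(ct);
}

public static class ConnectionFactory
{
    // Mirrors the registration's gate: disabled means no decorator object exists at all.
    public static IConnectionSketch Resolve(bool enabled, IConnectionSketch inner)
        => enabled ? new DecoratorSketch(inner) : inner;
}
```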
AlarmBursterService keeps ticking even when no run is active. The cycle still fires inject → clear → recover even if the workflow is Idle between runs. The RecoverAsync rejection (workflow is not Faulted) gets caught by the swallow-and-log handler. This means under ChaosMonkey, idle time also generates fault entries in the diagnostics timeline — that's expected, not a defect.
DefectShowerService.IsShowerActive is a volatile bool, read from another thread. FramePipelineService.ProcessDefectsForFrame reads it on each frame from the consumer thread; the shower service writes it from its own background task. The volatile keyword ensures the read sees the latest write without explicit memory-barrier code. Don't remove the volatile if you change the implementation.
TimeCompressionFactor does NOT scale producer rates. The spec is explicit: tag emitters, frame producer, encoder source all run at real time regardless of TimeCompressionFactor. Only SimulatedMachineConnection.ConnectAsync (delay) and SimulatedMotionController.InterpolateAsync (per-tick wait) are scaled. If a future change applies the factor to producer rates, the data plane bandwidth becomes uncalibrated and Phase 1 measurements would no longer be reproducible with TimeCompressionFactor != 1.0.
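A sketch of the scaling applied to the two affected waits. The division direction is an assumption, inferred from the statement that motion completes faster in wall-clock time with a larger factor; TimeCompression.Scale is a hypothetical helper, and the range check reuses the [0.1, 100.0] bounds from the quick reference.

```csharp
using System;

public static class TimeCompression
{
    // Assumption: a factor above 1.0 shortens wall-clock waits (delay / factor).
    public static TimeSpan Scale(TimeSpan nominalDelay, double timeCompressionFactor)
    {
        if (timeCompressionFactor < 0.1 || timeCompressionFactor > 100.0)
            throw new ArgumentOutOfRangeException(nameof(timeCompressionFactor));
        return nominalDelay / timeCompressionFactor;
    }
}
```

Producer intervals would deliberately not pass through a helper like this, which is what keeps data-plane rates calibrated at any factor.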
TelemetryDropoutChance does NOT increment samples.coalesced. Dropout is a deliberate skip — neither samples.ingested nor samples.coalesced is incremented; the cell value is unchanged; the noise refstate still advances (so the random-walk is visibly continuous after dropout ends). If you ever see samples.coalesced rising under a dropout-heavy profile, something is wrong — the counter's only legitimate increment is "emitter overwrote a cell before the snapshot publisher consumed it," which has nothing to do with deliberate dropout.
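The counter semantics can be sketched as a small simulation. DropoutSketch is a hypothetical stand-in for the emitter loop: a dropped cycle touches neither counter, while the noise state advances on every cycle.

```csharp
using System;

public static class DropoutSketch
{
    // Simulates `cycles` emitter cycles with a seeded generator.
    public static (int Ingested, int Coalesced, int NoiseSteps) Run(
        int cycles, double dropoutChance, int seed)
    {
        var rng = new Random(seed);
        int ingested = 0, coalesced = 0, noiseSteps = 0;
        for (int i = 0; i < cycles; i++)
        {
            noiseSteps++; // random-walk state advances every cycle
            if (rng.NextDouble() < dropoutChance)
                continue; // deliberate skip: no counter changes
            ingested++;
        }
        return (ingested, coalesced, noiseSteps);
    }
}
```

With dropoutChance = 0.5 the ingested count lands near half the cycle count, and the coalesced count stays at zero because this sketch has no snapshot publisher to race against.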
FlakySdkOptions.TimeoutHangMs defaults to 30_000 (30 s). Test code that exercises the timeout branch must override this to a small value — otherwise dotnet test runs for 30 seconds per test. FlakySdkDecoratorTests.FlakySdk_TimeoutBranch_* use TimeoutHangMs = 50 or 100. If you write a new timeout-branch test, configure a small hang.
FlakySdkDecorator does NOT wrap IMotionController. Motion-side flakiness is a deferred non-scope. The criterion-11 fault-branch evidence comes from IMachineConnection-only wrapping. If a future slice wraps IMotionController, expect to surface new WorkflowService fault paths (DoHomeAsync exception branch, RunLoopAsync exception branch) that aren't currently exercised.
The DefectShowerService polls profile state at the start of each loop iteration. It does NOT subscribe to ISimulatorProfileProvider.ProfileChanged. A profile change to DefectShowerEveryMs = 0 is picked up only when the current iteration finishes, which can be up to a full cycle (the quiet delay plus the shower duration) later. For SLICE-1.4's purposes this is fine; for runtime tuning UIs it would feel laggy.
7. Test surface
Covered by unit tests:
- All 7 chaos profile fields validator rules: each rejection case as a [Theory] with boundary values.
- FlakySdkOptionsValidator rejects each chance outside [0,1] and TimeoutHangMs outside [1, 600_000].
- DefectShowerService activates/deactivates per schedule with short timing (EveryMs=200, DurationMs=100).
- AlarmBursterService: ≥ 3 cycles in 5s; round-robin produces 5 distinct codes; survives RecoverAsync throwing; disabled when EveryMs=0.
- FlakySdkDecorator: each branch isolated; bypass when Enabled=false; pre-flight regression test for fall-through.
- SimulatedTagSource honors TelemetryDropoutChance (counts consistent with dropout rate).
- SimulatedMachineConnection honors TimeCompressionFactor (wall-clock delay scales) and NetworkLatencyMean/Stddev (Gaussian distribution mean within ±20%).
- SimulatedMotionController honors TimeCompressionFactor (motion completes faster in wall-clock time).
- FramePipelineServiceShowerTests: shower-active forces defect on every frame.
Covered by capture (slice-1-4-chaos-monkey + slice-1-4-soak-8h rows):
- ChaosMonkey: 491 runs.started, 453 completes, 37 fault cycles with all four fault branches verified by log inspection (39 injected, 39 cleared, 37 recovered, 120 defect-shower transitions).
- Soak8h: 0 faults (AlarmBurstEveryMs=0), 5 109 runs (100% completion), working-set steady-state drift = −2.7 MB across 8 hours. No leak.
- Both: criterion-16 reproducibility check; pre-existing rows still match within bounds.
Not covered (intentional gaps):
- Profile-roundtrip binding test for the 7 new fields. The bf32566 regression (hydration service Select() projection dropped fields) had no automated guard. A SimulatorProfileOptionsBindingTests-style test for the chaos fields specifically is filed as a follow-up but not yet implemented.
- FlakySdk motion-side decoration. Documented non-scope. If Phase 2 motivates it, write the spec and the tests then.
- Combined chaos × soak. No test runs ChaosMonkey for 8 hours. The slice-1-4-chaos-monkey row was 30 min; a longer under-chaos-load capture would be a Phase 2 follow-up if needed.
- Race between AlarmBursterService.InjectCriticalFault and an in-flight Abort. The burster doesn't synchronize with WorkflowService; under heavy concurrency the inject and the abort can race, with whichever wins cancelling the run. Behavior is well-defined (whoever wins cancels; the loser's effect is ignored) but no test exercises both paths simultaneously.
Notably absent test: there is no test for "FlakySdkOptions:Enabled toggled at runtime via IOptionsMonitor reload." The DI factory captures Enabled once at service-resolution time. If a future change wants runtime-toggleable chaos, it needs to either rebuild the decorator on IOptionsMonitor.OnChange or move the gate inside ConnectAsync (the implementation already reads _options.CurrentValue.Enabled per-call, but the registration is captured once). The spec-time decision was that runtime toggle is unnecessary; revisit if an engineering panel ever exposes the chaos knobs as live controls.