SLICE-1.4: Storm & Soak Profiles
- Status: Completed (2026-05-03, criterion 12 measurement amended)
- Date: 2026-04-30
- Depends on: Requirements, Evolution Roadmap, SLICE-006: Observability Baseline, SLICE-1.1: Multi-Tag Telemetry, SLICE-1.2: Real Frame Payloads, SLICE-1.3: Encoder-Rate Motion, SLICE-1.6: FlaUI Capture
Goal
Add storm-and-soak knobs to SimulatorProfile so the simulator can produce conditions that exercise every fault branch of WorkflowService and run uninterrupted long enough to surface memory leaks. Two new named profiles — ChaosMonkey (high-rate fault injection, 30-minute capture) and Soak8h (low chaos, 8-hour capture) — drive Phase 1's exit gate. The slice's load-bearing evidence is twofold: (1) under ChaosMonkey the workflow's connect-fail / fault-during-home / fault-during-run / fault-clear-and-recover paths each trigger at least once in a 30-minute capture; (2) under Soak8h the process's working-set growth stays under 50 MB across an 8-hour real-time run with runs.faulted bounded.
Why This Slice
Today the simulator drives the prototype at one steady rate per profile. There is no temporal variation in the load it produces — no defect storms, no alarm bursts, no SDK call hangs, no telemetry dropouts. Every fault path in WorkflowService.OnFaultInjected / DoConnectAsync / RunLoopAsync / DoHomeAsync is exercised by hand-driven engineering-panel input or by setting ConnectionFailureProbability once and waiting. The Phase 1 measurement rows (rows 0 / 0a / 0b / slice-1-1-multi-tag-telemetry / slice-1-2-real-frame-payloads / slice-1-3-encoder-rate-motion) establish that the data plane survives steady load; none establishes that the fault plane survives bursty load or that the process survives wall-clock-long runs.
The roadmap (§3, Phase 1 row 1.4) calls for DefectShowerRate, AlarmBurstEvery, TelemetryDropoutChance, NetworkLatencyMeanMs, NetworkLatencyStddevMs, TimeCompressionFactor, two new profiles ChaosMonkey + Soak8h, and an SDK-flakiness wrapper that injects timeouts, cancellation-that-doesn't-cancel, and out-of-band throws. The exit-gate criteria are: (a) 8-hour Soak8h completes without leaking memory (RSS growth < 50 MB), and (b) ChaosMonkey triggers at least one code path in every fault branch of WorkflowService.
This slice does not refactor anything. It only adds load-shaping inputs to the simulator and the configuration shapes that drive them. Phase 2 may then justify lift-outs based on which paths the chaos profile breaks; without the chaos profile, Phase 2's "store under pressure" exit gate has no measurement basis.
Requirements Coverage
- 04. UI and Technical Requirements: bounded streaming with measurable behavior under load variation; long-running processes must remain stable
- 05. Failure Modes and Workflow Requirements: the workflow state machine must survive fault bursts and SDK flakiness without destabilization
- 07. AI Delivery Constraints and Roadmap: each phase ships a measurable before-and-after; this is rows slice-1-4-chaos-monkey and slice-1-4-soak-8h in the measurements table
In Scope
Profile fields
SimulatorProfile (the record in Application.State) and SimulatorProfileOptions (the JSON-binding shape in Infrastructure.Simulator) gain seven new fields. All default to zero / 1.0 so existing profiles preserve their current behavior:
- int DefectShowerEveryMs — period in milliseconds between defect-shower windows. 0 disables. Range [0, 3_600_000].
- int DefectShowerDurationMs — duration of each defect-shower window in milliseconds. While the window is open, the per-frame defect probability is forced to 1.0 regardless of DefectProbabilityPerFrame. Required > 0 if DefectShowerEveryMs > 0. Range [0, 60_000].
- int AlarmBurstEveryMs — period between scheduled critical-fault inject + clear + recover cycles. 0 disables. Range [0, 3_600_000]. Each burst raises one alarm code drawn from a small fixed pool (CHAOS-001…CHAOS-005), waits a short interval (~500 ms), clears the fault, then issues a RecoverAsync so the workflow returns to Ready and a fresh run can start.
- double TelemetryDropoutChance — per-emit-cycle probability that a tag emitter skips publishing a sample (the cell holds the previous value but samples.ingested is not incremented). Range [0.0, 1.0]. Tag staleness detection is a Phase 2 concern; in this slice the dropout shows up in CSV as a reduction in samples.ingested rate per tag.
- double NetworkLatencyMeanMs — mean of an additive Gaussian latency injected before IMachineConnection.ConnectAsync returns. Range [0.0, 30_000.0]. 0 disables.
- double NetworkLatencyStddevMs — standard deviation of the same Gaussian. Range [0.0, 30_000.0]. Required if mean > 0; allowed to be 0 (deterministic latency).
- double TimeCompressionFactor — multiplier that shortens simulator-internal idle delays. Default 1.0 (real-time). 2.0 halves connection delay and motion-tick wait; 10.0 divides by ten. Range [0.1, 100.0]. Producer rates (tags / frames / encoder) are deliberately not affected — the data plane represents real machine I/O bandwidth and must stay representative even under compression. Documented prominently both on the SimulatorProfile record's XML doc and in the runbook.
SDK-flakiness wrapper
A new FlakySdkOptions block bound to Simulator:FlakySdk:
- double TimeoutChance — probability that a wrapped call hangs longer than the caller's timeout. Implementation: when triggered, the decorator awaits a Task.Delay of (caller's CTS expected timeout × 2). Range [0.0, 1.0].
- double IgnoreCancellationChance — probability that a wrapped call ignores cancellation and completes normally despite the caller cancelling its CTS. Implementation: wrap the inner call without forwarding the CancellationToken. Range [0.0, 1.0].
- double OutOfBandThrowChance — probability that a wrapped call throws an InvalidOperationException (a non-OperationCanceledException exception type) at a random point in its lifetime. Range [0.0, 1.0].
- bool Enabled — master gate. When false, the decorator passes calls through unmodified; when true, the three chances above apply. Defaults to false so existing profiles see no change.
FlakySdkDecorator<IMachineConnection> wraps IMachineConnection.ConnectAsync only in this slice. Wrapping IMotionController.HomeAsync / MoveToAsync is deferred to a follow-up — the connection path is sufficient to exercise WorkflowService.DoConnectAsync's exception branch, which is one of the criterion-A paths.
New profiles in appsettings.json
- ChaosMonkey: MotionSpeedUnitsPerSecond: 50.0, TelemetryIntervalMs: 50, FrameIntervalMs: 100, FrameWidth: 1024, FrameHeight: 768, BytesPerPixel: 1, EncoderIntervalMs: 5, DefectProbabilityPerFrame: 0.05, ConnectionFailureProbability: 0.30, DefectShowerEveryMs: 30_000, DefectShowerDurationMs: 3_000, AlarmBurstEveryMs: 45_000, TelemetryDropoutChance: 0.05, NetworkLatencyMeanMs: 250, NetworkLatencyStddevMs: 150, TimeCompressionFactor: 1.0. Plus Simulator:FlakySdk set with Enabled: true, TimeoutChance: 0.05, IgnoreCancellationChance: 0.05, OutOfBandThrowChance: 0.05.
- Soak8h: MotionSpeedUnitsPerSecond: 30.0, TelemetryIntervalMs: 100, FrameIntervalMs: 250, FrameWidth: 1024, FrameHeight: 768, BytesPerPixel: 1, EncoderIntervalMs: 5, DefectProbabilityPerFrame: 0.05, ConnectionFailureProbability: 0.05, DefectShowerEveryMs: 600_000, DefectShowerDurationMs: 5_000, AlarmBurstEveryMs: 0 (disabled — alarm cycles dominate run-throughput and the soak's purpose is leak detection, not fault-path coverage), TelemetryDropoutChance: 0.01, NetworkLatencyMeanMs: 50, NetworkLatencyStddevMs: 20, TimeCompressionFactor: 1.0. Simulator:FlakySdk is shared with ChaosMonkey but matters little under Soak8h: the low ConnectionFailureProbability and the lack of forced reconnects mean the wrapper rarely fires.

Important: FlakySdkOptions is a single top-level config block, not a per-profile one. Phase-1 scope is one set of flakiness knobs that get applied whenever Enabled: true. Profile selection turns the decorator on/off only via the master gate; per-profile flakiness tuning is a follow-up.
Background services
- DefectShowerService (IHostedService in Application.Services) holds the current "shower window open?" boolean and ticks DefectShowerEveryMs to flip it on for DefectShowerDurationMs. FramePipelineService.ProcessDefectsForFrame consults the service via a new IDefectShowerSchedule { bool IsShowerActive { get; } } abstraction; when active, the per-frame probability check is short-circuited to "always defect". The shower flip is logged once per transition (not per frame) to keep the diagnostics timeline readable. Service is opt-in by config: when DefectShowerEveryMs == 0 for the active profile, IsShowerActive returns false permanently.
- AlarmBursterService (IHostedService in Application.Services) ticks AlarmBurstEveryMs and on each tick: (1) calls IFaultInjector.InjectCriticalFault with one of CHAOS-001…CHAOS-005 (round-robin, so OnFaultInjected's "already active" branch is also exercised by the duplicate-code path), (2) waits 500 ms, (3) calls IFaultInjector.ClearFault for the same code, (4) waits 500 ms, (5) calls IWorkflowService.RecoverAsync. When AlarmBurstEveryMs == 0, the service exits its loop immediately and stays idle.
- Both services subscribe to ISimulatorProfileProvider.ProfileChanged and rebuild their tickers if the relevant interval field changes.
Telemetry dropout
SimulatedTagSource's per-emitter loop consults the active SimulatorProfile.TelemetryDropoutChance. When Random.Shared.NextDouble() < TelemetryDropoutChance, the emitter skips publishing for that cycle (cell value unchanged, samples.ingested and samples.coalesced untouched). Per-tag noise model still advances its ref state (so the random-walk doesn't reset to baseline after a long dropout window). Dropout is not a coalesce — it is a deliberate skip — and must not be counted as one in metrics.
Time compression
TimeCompressionFactor is read from the active profile in two places:
- SimulatedMachineConnection.ConnectAsync: await Task.Delay(TimeSpan.FromMilliseconds(_connectDelay.TotalMilliseconds / factor), ct).
- SimulatedMotionController.InterpolateAsync: the per-tick wait is divided by factor so motion completes faster in wall-clock time (the simulated commanded position still advances at MotionSpeedUnitsPerSecond per simulated second, but the simulator's internal "second" passes faster).
Producer ticks (tag emitters, frame ticker, encoder ticker) are not scaled. The slice's spec text notes this divergence prominently.
Metrics
No new counters in this slice. Existing counters (runs.started, runs.completed, runs.faulted, frames.dropped, samples.ingested, etc.) carry the chaos signal: the runs.faulted count rises under ChaosMonkey, and dotnet.process.memory.working_set peak vs. average tells the leak story under Soak8h. Two derived columns are added to the row block for this slice's evidence:
- working-set growth (MB) — working_set peak minus working_set first-second value, in MB, rounded to 1 decimal. Captures the wall-clock leak signal directly.
- fault-cycles (count) — derived as runs.faulted total under ChaosMonkey. Documents how many full inject-clear-recover cycles the workflow completed.
Both are computed in MeasurementExtraction.psm1 from existing CSV columns (no new producer-side instrumentation).
Configuration validation
- SimulatorProfilesValidator extends to reject:
  - DefectShowerEveryMs < 0 or > 3_600_000
  - DefectShowerDurationMs < 0 or > 60_000, or DefectShowerEveryMs > 0 && DefectShowerDurationMs == 0, or DefectShowerDurationMs > DefectShowerEveryMs
  - AlarmBurstEveryMs < 0 or > 3_600_000
  - TelemetryDropoutChance < 0.0 or > 1.0
  - NetworkLatencyMeanMs < 0 or > 30_000
  - NetworkLatencyStddevMs < 0 or > 30_000 (a missing NetworkLatencyStddevMs when NetworkLatencyMeanMs > 0 is treated as 0 rather than rejected)
  - TimeCompressionFactor < 0.1 or > 100.0
- New FlakySdkOptionsValidator rejects any of TimeoutChance / IgnoreCancellationChance / OutOfBandThrowChance outside [0.0, 1.0]. Enabled is unconditionally valid (it is just a gate).
Measurement scenarios (runbook §4.5 + §4.6)
§4.5 covers the 30-minute ChaosMonkey capture. §4.6 covers the 8-hour Soak8h capture (with sleep-disable discipline reaffirmed and a "do not run on a host you also intend to use" warning). Both use the existing MultiTagSoakFlaUi scenario from SLICE-1.6 invoked with -Profile ChaosMonkey / -Profile Soak8h respectively. No new FlaUI scenario class.
Measurement rows
Two row blocks are appended to docs/reviews/phase-1-measurements.md:
- slice-1-4-chaos-monkey — 30-minute capture; baseline slice-1-3-encoder-rate-motion. Highlights: runs.faulted > 0, runs.started > 5 (at least one inject-clear-recover-rerun cycle), one diagnostics-timeline entry per fault branch hit.
- slice-1-4-soak-8h — 8-hour capture; baseline slice-1-2-real-frame-payloads (closest comparable continuous-load row). Highlights: working-set growth ≤ 50 MB, gen-2-gc-count is bounded (no GC-pressure runaway), no unhandled-exception entries in the log.
MeasurementExtraction.psm1 gains Get-WorkingSetGrowthMb and Get-FaultCyclesCount helpers. ConvertTo-MeasurementRow adds two rows to the markdown block (working-set growth (MB), fault-cycles (count)); guarded with "—" when the CSV pre-dates this slice (parallels SLICE-1.2's GC-pause / LOH-alloc handling and SLICE-1.3's encoder-rate handling).
Out of Scope
- a UI panel for the chaos knobs — they live in appsettings.json and are configured by profile selection only. Engineering-panel UI for runtime chaos tuning is Phase 3.
- decorating IMotionController with FlakySdkDecorator — connection-only is enough for the criterion-A fault-branch coverage; the motion-decorator is a documented follow-up.
- per-profile FlakySdkOptions — the Simulator:FlakySdk block is global (one config object); per-profile is a follow-up.
- defect-shower interactions with the run-history persistence layer — DefectShowerService only forces the per-frame probability; the existing RunSummary.DefectsMinor/Major/Critical accumulation is unchanged.
- automated chaos-monkey unit tests that drive the workflow's fault paths in-process — the criterion-A evidence is the 30-minute capture, not a unit test. Unit tests cover the new services' tick logic in isolation.
- a long-running CI job that runs Soak8h automatically — the 8-hour soak is a manual capture per release. CI guards stay green for sub-second tests.
- adding TagQuality.Stale transitions to dropouts — staleness handling is Phase 2 / SLICE-2.3 (data-plane lift-out). This slice's dropout produces gaps in samples.ingested only.
- changing the ConnectionFailureProbability semantics — it remains a hard "return false" path; the network-latency knob adds wall-clock delay before the success/failure result, but does not change the result distribution.
- buffer pooling for any of the new path data — Phase 2 if measurements show pressure.
- introducing a new IScenario class — MultiTagSoakFlaUi with -Profile ChaosMonkey or -Profile Soak8h is the capture path (mirrors SLICE-1.2 and SLICE-1.3).
- modifying SimulatedMotionController.InterpolateAsync core motion model — only the per-tick wait is scaled by TimeCompressionFactor. The interpolation arithmetic, _currentX/_currentY, and PositionChanged event remain identical.
- writing a TimeProvider-style abstraction across the simulator — TimeCompressionFactor is a localized scalar applied at two sites, not a globally injected clock. A proper TimeProvider lift is Phase 2 or later if needed.
- replacing or modifying the existing IFaultInjector interface — AlarmBursterService calls the existing API.
Runtime Behavior
Defect-shower lifecycle
DefectShowerService.StartAsync reads the active profile's DefectShowerEveryMs. If 0, the service stays in an IsShowerActive == false state and does no work. If > 0, it starts a PeriodicTimer(DefectShowerEveryMs); on each tick it sets _isActive = true, schedules a Task.Delay(DefectShowerDurationMs) continuation that flips it back, and emits one diagnostics-timeline entry per transition (Info, source DiagnosticsSource.Pipeline).
FramePipelineService.ProcessDefectsForFrame consults IDefectShowerSchedule.IsShowerActive. When true, the random-roll check is skipped — every frame in the window produces a defect, and the per-frame defect distribution (Minor/Major/Critical 60/30/10) still applies. The per-frame metric increments and AppState.ActiveRun.DefectsCritical/Major/Minor updates flow exactly as today.
Profile-switch handling: subscribe to ISimulatorProfileProvider.ProfileChanged; on change, dispose the current PeriodicTimer and rebuild with the new period. If the new profile sets DefectShowerEveryMs == 0, the service drops to idle mode without restarting the timer.
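The schedule described above can also be modeled as a pure function of elapsed time, which helps when estimating how much of a capture falls inside a shower window. This is a sketch under stated assumptions, not the service itself: it replaces the PeriodicTimer with modular arithmetic, ignores timer drift and profile switches, and assumes the first window opens one full period after start (a PeriodicTimer completes its first tick only after the period has elapsed). ShowerWindowSketch and IsActiveAt are hypothetical names.

```csharp
// Hypothetical timeline model of the defect-shower window (not the real service).
public static class ShowerWindowSketch
{
    public static bool IsActiveAt(long elapsedMs, int everyMs, int durationMs)
    {
        if (everyMs <= 0) return false;         // 0 disables the shower entirely
        if (elapsedMs < everyMs) return false;  // no window before the first tick
        // Each tick opens a window that stays open for durationMs.
        return elapsedMs % everyMs < durationMs;
    }
}
```

With the ChaosMonkey values (30_000 and 3_000), the window is open for 3 s out of every 30 s once the first tick has fired, so about 10% of frames after that point are forced defects under this model.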
Alarm-burster lifecycle
AlarmBursterService.StartAsync reads the active profile's AlarmBurstEveryMs. If 0, the service stays idle. If > 0, a PeriodicTimer(AlarmBurstEveryMs) drives the burst loop:

    // One iteration per PeriodicTimer tick.
    while (await timer.WaitForNextTickAsync(ct))
    {
        var alarmCode = _pool[_index++ % _pool.Length]; // round-robin CHAOS-001..005
        _faultInjector.InjectCriticalFault(alarmCode, $"ChaosMonkey burst at {DateTimeOffset.UtcNow:HH:mm:ss.fff}");
        await Task.Delay(500, ct); // let the workflow observe the fault
        _faultInjector.ClearFault(alarmCode);
        await Task.Delay(500, ct); // let the clear settle before recovering
        await _workflow.RecoverAsync();
    }

The 500 ms gaps give the workflow time to observe the fault, write the diagnostics entry, transition to the Faulted state, and accept RecoverAsync. The recovery returns the workflow to Ready (if homed) or Idle. The next outer-loop run-start is driven by the FlaUI scenario's run-loop (which already issues StartAsync on completion), not by this service.
The round-robin code pool is intentional: OnFaultInjected's _activeFaultCodes.Add returns false for the first few duplicates (when the pool wraps), so the "already active" branch is also exercised. Each unique code triggers one full cycle.
Profile-switch handling matches DefectShowerService's.
Telemetry dropout
SimulatedTagSource's emitter is currently:

    while (!ct.IsCancellationRequested) {
        await timer.WaitForNextTickAsync(ct);
        var sample = ComputeSample(...);
        _cells[tagName] = sample; // overwrite, increments coalesced if previous unread
        _metrics.SamplesIngested.Add(1, [tag.name]);
    }

It becomes:

    while (!ct.IsCancellationRequested) {
        await timer.WaitForNextTickAsync(ct);
        if (Random.Shared.NextDouble() < _profileProvider.CurrentProfile.TelemetryDropoutChance) {
            AdvanceNoiseRefStateOnly(ref _refStates[tagName], elapsed);
            continue;
        }
        var sample = ComputeSample(...);
        _cells[tagName] = sample;
        _metrics.SamplesIngested.Add(1, [tag.name]);
    }

The noise-state advance on a dropped emit keeps the random-walk visibly continuous after the dropout ends; otherwise the resumption looks artificial. samples.coalesced is unchanged because no overwrite happened.
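To make the dropout arithmetic concrete, the skip decision can be isolated from the emitter and counted over a fixed number of cycles. This is a hedged sketch, not the SimulatedTagSource code: DropoutSketch and CountIngested are hypothetical names, and a caller-supplied Random stands in for Random.Shared so a run can be seeded.

```csharp
using System;

// Hypothetical stand-in for the emitter loop: it only counts published samples.
public static class DropoutSketch
{
    public static int CountIngested(int cycles, double dropoutChance, Random rng)
    {
        int ingested = 0;
        for (int i = 0; i < cycles; i++)
        {
            // Same comparison as the emitter: a roll below the chance skips the publish.
            if (rng.NextDouble() < dropoutChance)
                continue;
            ingested++;
        }
        return ingested;
    }
}
```

Because NextDouble returns values in [0, 1), a chance of 0.0 never skips and a chance of 1.0 always skips. At 0.5 over 1000 cycles the count is binomial with a standard deviation of roughly 16, which is why a [400, 600] acceptance band leaves a wide margin.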
Network latency injection
The FlakySdkDecorator<IMachineConnection> is registered in DI when Simulator:FlakySdk:Enabled == true (decided at startup; profile changes do not swap the decorator on/off — Phase 1 keeps the decorator scope simple). Independent of the flaky-SDK gate, network latency injection is always applied directly inside SimulatedMachineConnection.ConnectAsync based on the active profile's NetworkLatencyMeanMs / NetworkLatencyStddevMs. (The latency is part of the simulator's connection model, not a flaky-SDK fault.)

    public async Task<bool> ConnectAsync(CancellationToken ct) {
        // Read the profile once so every knob below comes from the same snapshot.
        var profile = _profileProvider.CurrentProfile;
        // Idle connect delay, shortened by the time-compression factor.
        await Task.Delay(TimeSpan.FromMilliseconds(_connectDelay.TotalMilliseconds / profile.TimeCompressionFactor), ct);
        // Additive Gaussian latency; non-positive draws add no delay.
        var jitter = SampleGaussianMs(profile.NetworkLatencyMeanMs, profile.NetworkLatencyStddevMs);
        if (jitter > 0) await Task.Delay(TimeSpan.FromMilliseconds(jitter), ct);
        if (Random.Shared.NextDouble() < profile.ConnectionFailureProbability) return false;
        return true;
    }

SampleGaussianMs(mean, stddev) uses Box-Muller (already imported by NoiseModelEvaluator from SLICE-1.1; reuse the same helper). Negative samples clamp to 0.
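For readers unfamiliar with the transform, a clamped Box-Muller draw can be sketched as follows. This is an illustration only and is not the NoiseModelEvaluator helper the text says to reuse: LatencySketch is a hypothetical class, and the extra Random parameter is an assumption added so the draw can be seeded.

```csharp
using System;

// Hypothetical illustration of a clamped Gaussian latency draw.
public static class LatencySketch
{
    public static double SampleGaussianMs(double meanMs, double stddevMs, Random rng)
    {
        // Box-Muller: two uniform draws produce one standard-normal draw.
        double u1 = 1.0 - rng.NextDouble(); // in (0, 1], so Math.Log never sees 0
        double u2 = rng.NextDouble();
        double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);

        // Scale and shift, then clamp negative latencies to 0.
        return Math.Max(0.0, meanMs + stddevMs * z);
    }
}
```

With a stddev of 0 the function returns the mean unchanged, which is the deterministic-latency case. Under the ChaosMonkey values (mean 250, stddev 150) a draw falls below zero when z < −1.67, which happens for roughly 5% of draws, so the clamp slightly raises the observed mean above 250 ms.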
Flaky SDK decorator

    public sealed class FlakySdkDecorator<T> : IMachineConnection where T : IMachineConnection {
        private readonly T _inner;
        private readonly IOptionsMonitor<FlakySdkOptions> _options;
        private readonly ILogger<FlakySdkDecorator<T>> _logger;

        public FlakySdkDecorator(T inner, IOptionsMonitor<FlakySdkOptions> options, ILogger<FlakySdkDecorator<T>> logger) {
            _inner = inner;
            _options = options;
            _logger = logger;
        }

        public async Task<bool> ConnectAsync(CancellationToken ct) {
            var opts = _options.CurrentValue;
            if (!opts.Enabled) return await _inner.ConnectAsync(ct);
            if (Random.Shared.NextDouble() < opts.TimeoutChance) {
                // Hang twice the caller's expected timeout. We don't know it; use 30s.
                await Task.Delay(TimeSpan.FromSeconds(30), CancellationToken.None);
                // Fall through to inner if not cancelled.
            }
            if (Random.Shared.NextDouble() < opts.IgnoreCancellationChance) {
                // Deliberately drop the caller's token so cancellation is ignored.
                return await _inner.ConnectAsync(CancellationToken.None);
            }
            if (Random.Shared.NextDouble() < opts.OutOfBandThrowChance) {
                throw new InvalidOperationException("FlakySdk: simulated out-of-band SDK exception.");
            }
            return await _inner.ConnectAsync(ct);
        }

        public Task DisconnectAsync() => _inner.DisconnectAsync();
    }

DI registration: when Enabled == true, services.AddSingleton<IMachineConnection>(sp => new FlakySdkDecorator<...>(sp.GetRequiredService<SimulatedMachineConnection>(), ...)). When Enabled == false, IMachineConnection resolves directly to SimulatedMachineConnection (current behavior).
Time-compression scope
TimeCompressionFactor reads from the active profile at the moment the wait begins (no caching). Two affected sites:
- SimulatedMachineConnection.ConnectAsync: _connectDelay.TotalMilliseconds / factor
- SimulatedMotionController.InterpolateAsync: per-iteration Task.Delay(20ms / factor, ct) (current 20 ms tick)
Producer rates that are intentionally NOT scaled:
- SimulatedTagSource.IntervalMs (per tag)
- SimulatedCamera.FrameIntervalMs (frame producer)
- SimulatedEncoderSource.EncoderIntervalMs (encoder producer)
Rationale: under Soak8h with TimeCompressionFactor: 1.0 the producer rates and idle waits are both real-time, so the soak measures genuine wall-clock leak signal. If a future profile sets TimeCompressionFactor: 10.0, runs complete 10× faster but the data-plane bandwidth is unchanged — accelerating throughput without distorting per-second load. Producer scaling would conflate the two and is out of scope.
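The scaling applied at the two affected sites amounts to one division. The helper below is a minimal sketch of that arithmetic, assuming the documented [0.1, 100.0] range is enforced at the call site as well; TimeCompressionSketch and Scale are hypothetical names, and the spec itself applies the division inline rather than through a shared helper.

```csharp
using System;

// Hypothetical helper showing how an idle delay shrinks under compression.
public static class TimeCompressionSketch
{
    public static TimeSpan Scale(TimeSpan realTimeDelay, double factor)
    {
        // Mirrors the validated range for TimeCompressionFactor.
        if (factor < 0.1 || factor > 100.0)
            throw new ArgumentOutOfRangeException(nameof(factor));

        return TimeSpan.FromMilliseconds(realTimeDelay.TotalMilliseconds / factor);
    }
}
```

For the 20 ms motion tick, a factor of 10.0 yields a 2 ms wait and a factor of 1.0 leaves it at 20 ms. Producer intervals never pass through this function, which is what keeps the data-plane load unchanged.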
Acceptance Criteria
This slice is satisfied only if all of the following are true:
1. SimulatorProfile (Application.State) and SimulatorProfileOptions (Infrastructure.Simulator) gain the seven fields in "Profile fields" with defaults that preserve existing behavior. Existing seed profiles (Normal, Demo, HighDefect, MultiTag, HighFrameRate, EncoderRate) compile and run unchanged at runtime when no chaos fields are set.
2. FlakySdkOptions exists in Infrastructure.Simulator with the four fields in "SDK-flakiness wrapper". FlakySdkOptionsValidator rejects out-of-range values. The options block is bound from Simulator:FlakySdk and registered via AddOptions<FlakySdkOptions>().BindConfiguration(...).ValidateOnStart().
3. SimulatorProfilesValidator extends to enforce the rules in "Configuration validation". Validation messages name the offending profile and field (matching the existing EncoderIntervalMs validation pattern). All existing profile-validation tests still pass.
4. appsettings.json contains a new ChaosMonkey profile and a new Soak8h profile with the field values listed in "New profiles", and a new top-level Simulator:FlakySdk block with Enabled: true, TimeoutChance: 0.05, IgnoreCancellationChance: 0.05, OutOfBandThrowChance: 0.05 (chaos-only knobs; running existing profiles still works because the profile-level chance fields default to disable).
5. IDefectShowerSchedule exists in Application.Abstractions. DefectShowerService implements it and IHostedService, is registered in AddInfrastructure (or the Application equivalent), and FramePipelineService.ProcessDefectsForFrame consults the abstraction. When DefectShowerEveryMs == 0, the service stays idle and no diagnostics-timeline entries are produced.
6. AlarmBursterService exists in Application.Services, is registered as IHostedService, drives the inject-wait-clear-wait-recover cycle described in "Alarm-burster lifecycle", and rebuilds its timer when the active profile's AlarmBurstEveryMs changes. When AlarmBurstEveryMs == 0, the service exits its loop and is idle.
7. SimulatedTagSource's emitter loop applies TelemetryDropoutChance per cycle. Verified by a unit test that runs 1000 cycles with TelemetryDropoutChance == 0.5 and asserts samples.ingested count is in [400, 600]. samples.coalesced is not inflated by dropouts.
8. SimulatedMachineConnection.ConnectAsync scales _connectDelay by TimeCompressionFactor and adds a Gaussian latency draw based on NetworkLatencyMeanMs/Stddev. Negative draws clamp to 0. Verified by a unit test that runs 100 connections with mean=100ms, stddev=20ms and asserts measured wall-clock latency mean is in [80, 120] ms.
9. SimulatedMotionController.InterpolateAsync's per-iteration tick wait is scaled by TimeCompressionFactor. The motion arithmetic, _currentX/_currentY, and the PositionChanged event are unchanged. Diff-search verification: _currentX = and PositionChanged?.Invoke remain at the same call sites.
10. FlakySdkDecorator<IMachineConnection> exists in Infrastructure.Simulator, is conditionally registered when Simulator:FlakySdk:Enabled == true, and exhibits the three injection behaviors (timeout-hang, cancellation-ignore, out-of-band-throw) at the configured probabilities. Unit tests verify each branch in isolation by setting one chance to 1.0 and the others to 0.0. With Enabled == false, the decorator is not in the pipeline (IMachineConnection resolves directly to SimulatedMachineConnection).
11. The 30-minute ChaosMonkey capture (criterion A) produces a row block tagged slice-1-4-chaos-monkey in phase-1-measurements.md with at least: runs.started ≥ 5, runs.faulted ≥ 5, frames.dropped recorded (any value, including 0), one or more diagnostics-timeline entries from DiagnosticsSource.Alarm with Critical/Error level. Independently verified by inspecting Logs/inspection-prototype-*.log and confirming entries from each of: connect-failure (from ConnectionFailureProbability and/or FlakySdkDecorator), fault-during-home (from AlarmBursterService overlapping with a homing operation), fault-during-run (from AlarmBursterService during WorkflowState.Running), and fault-clear-and-recover (from the burster's recovery step).
12. The 8-hour Soak8h capture (criterion B) produces a row block tagged slice-1-4-soak-8h in phase-1-measurements.md. The working-set steady-state drift, defined as avg(working_set during the last 60 minutes) − avg(working_set during minutes 5-60), is ≤ 50 MB. gen-2-gc-count ≤ 4× the equivalent rate from slice-1-2-real-frame-payloads (no Gen-2 runaway). The capture's CSV span is at least 28_800 seconds (8 hours real-time, no system-sleep gaps as in TASK-1.1's 63-min mid-capture incident). Per-tag samples.ingested distributions are in the same order-of-magnitude as slice-1-1-multi-tag-telemetry (low-rate dropout from TelemetryDropoutChance: 0.01 is acknowledged in the row notes).

   Criterion 12 — measurement amendment (2026-05-03). The original criterion was working-set growth = last − first ≤ 50 MB, where the values are sampled from the working-set CSV time series. The 2026-05-02 capture observed last − first = 186.5 MB (47.5 MB at t=0 → 234.0 MB at t=8h). Direct CSV inspection showed the entire delta is the process startup ramp, which completed in under 30 seconds (47.5 MB → 230.9 MB at t=29 s); the working-set then held a stable sawtooth between 224 MB and 240 MB for the remaining 7.5 hours, with avg(minutes 5-60) = 235.4 MB and avg(last 60 min) = 232.7 MB — a steady-state drift of −2.7 MB over 8 hours of real time. The original metric conflates startup cost (a one-time WPF + 50-tag-emitter + first-run allocation event) with in-flight allocation growth (the leak signal). The amended metric excludes the first 5 minutes of the capture, isolating the steady-state behavior that the slice was designed to measure. Against the amended metric, the 2026-05-02 capture passes the 50 MB ceiling by a wide margin. Same pattern as SLICE-1.1's criterion 7 amendment (per-tag-rate accuracy — Windows scheduling reality forced a methodology change) and SLICE-1.3's criterion 7 amendment (encoder receiver rate — Windows timer-resolution ceiling). The 50 MB ceiling itself is unchanged; only the measurement window is.
13. tools/Capture-Measurements.ps1 -Scenario MultiTagSoak -Profile ChaosMonkey -DurationSeconds 1800 and … -Profile Soak8h -DurationSeconds 28800 end-to-end produce row blocks with the new working-set growth (MB) and fault-cycles (count) rows present.
14. docs/runbook/capturing-measurements.md gains §4.5 (ChaosMonkey, 30-minute) and §4.6 (Soak8h, 8-hour) with the procedure, sleep-disable discipline reaffirmed, the criterion-A and criterion-B bars stated explicitly, and the "do not run on a host you also intend to use" warning for the soak.
15. The full existing test suite still passes, plus new tests covering: SimulatorProfile round-trip with the new fields; SimulatorProfilesValidator rejects each new invalid case; FlakySdkOptionsValidator rejects each invalid case; DefectShowerService activates and deactivates per its schedule (use FakeTimeProvider or Task.Delay-driven wall-clock with generous tolerance); AlarmBursterService issues exactly one inject + clear + recover cycle per tick (use a recording IFaultInjector and IWorkflowService fake); FlakySdkDecorator exhibits each fault branch when its probability is 1.0; SimulatedTagSource honors TelemetryDropoutChance (criterion 7); SimulatedMachineConnection honors TimeCompressionFactor and NetworkLatency* (criterion 8). dotnet test runtime stays under 90 seconds (the FlakySdk timeout-hang test must use a short configurable hang duration via DI, not the 30 s production default).
16. No regressions: rows 0 / 0a / 0b / slice-1-1-multi-tag-telemetry / slice-1-2-real-frame-payloads / slice-1-3-encoder-rate-motion are reproducible against the merged commit (i.e., Capture-Measurements.ps1 -Profile MultiTag still produces a row equivalent to slice-1-1-multi-tag-telemetry within the existing per-tag accuracy bounds noted in SLICE-1.1).
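The amended criterion-12 metric (average of the last 60 minutes minus average of minutes 5-60) can be sketched as below. This is an illustration in C# under stated assumptions, not the MeasurementExtraction.psm1 implementation: DriftSketch, the tuple shape, and the exact window boundaries (300 s inclusive to 3600 s exclusive, and strictly inside the final 3600 s) are hypothetical choices.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the steady-state drift formula.
public static class DriftSketch
{
    // samples: (seconds since capture start, working set in MB).
    // Average() throws InvalidOperationException if a window is empty,
    // e.g. for captures shorter than one hour.
    public static double SteadyStateDriftMb(IReadOnlyList<(double Seconds, double Mb)> samples)
    {
        double end = samples.Max(s => s.Seconds);

        // Baseline window: minutes 5-60, which skips the startup ramp.
        double baseline = samples
            .Where(s => s.Seconds >= 300 && s.Seconds < 3600)
            .Average(s => s.Mb);

        // Tail window: the last 60 minutes of the capture.
        double tail = samples
            .Where(s => s.Seconds > end - 3600)
            .Average(s => s.Mb);

        return tail - baseline;
    }
}
```

Applied to the window averages reported for the 2026-05-02 capture, the same subtraction gives 232.7 − 235.4 = −2.7 MB, whereas the original last − first metric gives 234.0 − 47.5 = 186.5 MB because it includes the startup ramp.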
Verification Notes
The implementation task for this spec must include verification for:
- the TelemetryDropoutChance mechanism does not introduce a samples.coalesced count where there should be none. Verified by the criterion-7 test plus a separate assertion that with TelemetryDropoutChance == 1.0 (full dropout), samples.coalesced stays at 0 for the entire test window.
- AlarmBursterService survives the case where IWorkflowService.RecoverAsync is rejected (e.g., the workflow is mid-Stopping when the recovery call lands). The service must log a Warning and continue ticking, not fault the host. Verified by a test that intercepts the recovery call and returns rejection.
- DefectShowerService does not produce defects after the host is shutting down (stoppingToken cancelled). Verified by a test that asserts IsShowerActive returns false after StopAsync.
- FlakySdkDecorator's IgnoreCancellationChance branch does not propagate the original cancellation — when the caller cancels and the decorator ignores, the inner call must complete normally and the result must reach the caller. The caller is then responsible for noticing the late completion (in our case, WorkflowService.DoConnectAsync already wraps in a CancellationTokenSource and observes the token; the test asserts that the late completion writes to _logger but does not corrupt AppState).
- the Soak8h capture's working-set growth math is computed from the actual CSV's first and last dotnet.process.memory.working_set rows, NOT from working-set peak − working-set min. Peak is a high-water mark dominated by transient allocations; growth is the trend that matters for leak detection. The runbook §4.6 shows the formula and the Get-WorkingSetGrowthMb Pester test asserts it is computed correctly on synthetic CSV input.
- the slice-1-4-soak-8h capture is committed against a single commit hash with the working tree clean. A capture interrupted by a system sleep, hibernate, or laptop-lid-close event must be discarded — the criterion-B 50 MB bar is meaningful only against an uninterrupted 8-hour wall-clock baseline. The powercfg /change standby-timeout-ac 0 discipline (TASK-1.5.1, reaffirmed in SLICE-1.6 and SLICE-1.3) applies; for the soak, also disable hibernate (powercfg /hibernate off) and screen-saver-induced sleep on the capture machine.
- the ChaosMonkey capture's fault-branch coverage is verified by log inspection, not by counter values. Counters say "X faults occurred"; logs say "fault occurred during Homing", "fault occurred during Running", etc. The runbook §4.5 includes a Select-String recipe (or PowerShell equivalent) over Logs/inspection-prototype-*.log that filters and counts each branch. The recipe is part of the runbook so future captures can reproduce the verification.
- the MultiTagSoakFlaUi scenario runs cleanly under the new ChaosMonkey and Soak8h profiles via the SIMULATOR_PROFILE env-var path established in SLICE-1.6 Pass 3. No new FlaUI scenario class is added; if profile selection through the combo-box is intrinsically flaky, the SLICE-1.6 fallback --start-with-profile <name> CLI flag is the resolution.