TASK-1.4: Implement Storm & Soak Profiles
- Status: Proposed (no passes started)
- Date: 2026-04-30
- Spec: SLICE-1.4: Storm & Soak Profiles
- Depends on: TASK-1.1: Implement Multi-Tag Telemetry, TASK-1.2: Implement Real Frame Payloads, TASK-1.3: Implement Encoder-Rate Motion, TASK-1.6: FlaUI Capture
Objective
Add storm-and-soak load-shaping knobs to SimulatorProfile, ship the two new profiles ChaosMonkey and Soak8h, and wire them through the simulator (TelemetryDropoutChance, TimeCompressionFactor, network-latency injection, defect-shower service, alarm-burster service, flaky-SDK decorator). Capture two row blocks: a 30-minute ChaosMonkey capture proving every WorkflowService fault branch is exercised, and an 8-hour Soak8h capture proving working-set growth stays under 50 MB. Together those two rows close the Phase 1 exit gate.
Scope
- new SimulatorProfile fields: DefectShowerEveryMs, DefectShowerDurationMs, AlarmBurstEveryMs, TelemetryDropoutChance, NetworkLatencyMeanMs, NetworkLatencyStddevMs, TimeCompressionFactor
- new FlakySdkOptions config block (Simulator:FlakySdk) and FlakySdkOptionsValidator
- extension of SimulatorProfilesValidator to enforce the new fields' ranges and consistency rules
- new IDefectShowerSchedule abstraction + DefectShowerService (IHostedService)
- new AlarmBursterService (IHostedService)
- SimulatedTagSource honors TelemetryDropoutChance per emitter cycle
- SimulatedMachineConnection honors TimeCompressionFactor and NetworkLatencyMeanMs/Stddev
- SimulatedMotionController.InterpolateAsync honors TimeCompressionFactor for per-tick wait only
- FlakySdkDecorator<IMachineConnection> (timeout-hang, ignore-cancellation, out-of-band-throw)
- conditional DI registration: when Simulator:FlakySdk:Enabled == true, IMachineConnection resolves through the decorator
- new seed profiles ChaosMonkey and Soak8h in appsettings.json, plus a top-level Simulator:FlakySdk block
- new MeasurementExtraction.psm1 helpers: Get-WorkingSetGrowthMb, Get-FaultCyclesCount; ConvertTo-MeasurementRow adds two new rows
- new runbook §4.5 (ChaosMonkey, 30 min) and §4.6 (Soak8h, 8 h)
- two row blocks in phase-1-measurements.md: slice-1-4-chaos-monkey, slice-1-4-soak-8h
- tests: validator rejection cases (profile + flaky-sdk); DefectShowerService activation cycle; AlarmBursterService inject-clear-recover cycle; FlakySdkDecorator per-branch behavior; SimulatedTagSource dropout honor; SimulatedMachineConnection time-compression and Gaussian latency honor; reproducibility of prior rows (no regression)
Non-Scope
- decorating IMotionController with FlakySdkDecorator: connection-only is enough to satisfy criterion 11's connect-failure / out-of-band-throw branches; motion-decorator is a documented follow-up
- per-profile FlakySdkOptions: the Simulator:FlakySdk block is global; per-profile is a follow-up
- introducing a TimeProvider-style global clock: TimeCompressionFactor is a localized scalar applied at two sites only
- TagQuality.Stale transitions on dropouts: staleness handling is Phase 2 (SLICE-2.3 data-plane lift-out)
- engineering-panel UI for runtime chaos tuning: Phase 3
- changing IFaultInjector's interface: AlarmBursterService calls the existing API as-is
- a new FlaUI scenario class: MultiTagSoakFlaUi with -Profile ChaosMonkey / -Profile Soak8h is the capture path
- automated CI runs of the Soak8h profile: manual capture per release; CI guards stay green for sub-second tests
- modifying SimulatedMotionController.InterpolateAsync's motion arithmetic: only the per-tick wait is scaled
Touched Projects
- src/InspectionPrototype.Application: IDefectShowerSchedule (Abstractions), DefectShowerService + AlarmBursterService (Services); SimulatorProfile.cs field additions (State); FramePipelineService.cs (consult IDefectShowerSchedule)
- src/InspectionPrototype.Infrastructure: SimulatorProfileOptions field additions, FlakySdkOptions + validator, SimulatorProfilesValidator extension, SimulatedTagSource (dropout), SimulatedMachineConnection (time-compression + latency), SimulatedMotionController (per-tick wait scale only), FlakySdkDecorator, InfrastructureServiceCollectionExtensions (DI wiring + conditional decorator)
- src/InspectionPrototype.App: appsettings.json (ChaosMonkey profile, Soak8h profile, Simulator:FlakySdk block, EncoderIntervalMs: 5 on the two new profiles)
- tests/InspectionPrototype.Tests: SimulatorProfileFieldsTests, SimulatorProfilesValidatorChaosTests, FlakySdkOptionsValidatorTests, DefectShowerServiceTests, AlarmBursterServiceTests, FlakySdkDecoratorTests, SimulatedTagSourceDropoutTests, SimulatedMachineConnectionTimeCompressionTests, SimulatedMachineConnectionNetworkLatencyTests; recording stubs for IFaultInjector and IWorkflowService (under Stubs/) if not already present
- tools/MeasurementExtraction.psm1: Get-WorkingSetGrowthMb, Get-FaultCyclesCount, ConvertTo-MeasurementRow extension
- tests/Tools/MeasurementExtraction.Tests.ps1: Pester tests for the new helpers
- docs/runbook/capturing-measurements.md: new §4.5 and §4.6 (replace the §4.5+ placeholder)
- docs/reviews/phase-1-measurements.md: two new row blocks
- docs/captures/: two new CSV files (slice-1-4-chaos-monkey-<date>.csv, slice-1-4-soak-8h-<date>.csv)
- (no changes to) IMotionController interface, IFaultInjector interface, IMachineConnection interface, IAppStateStore, MainViewModel, AppState record, AppMetrics (no new counters)
AI Tool Guidance
Three Copilot passes; one-pass-per-session protocol as in TASK-1.2 / TASK-1.3.
- Profile fields + validators + telemetry dropout + time compression + network latency. Add the seven new SimulatorProfile fields, extend SimulatorProfilesValidator, add FlakySdkOptions + validator (the config block only; decorator is Pass 2). Wire TelemetryDropoutChance into SimulatedTagSource. Wire TimeCompressionFactor into SimulatedMachineConnection.ConnectAsync and SimulatedMotionController.InterpolateAsync per-tick wait. Wire NetworkLatencyMeanMs/Stddev into SimulatedMachineConnection.ConnectAsync (Gaussian latency). Add ChaosMonkey and Soak8h profiles + Simulator:FlakySdk block to appsettings.json. Tests for each. NO DefectShowerService, NO AlarmBursterService, NO FlakySdkDecorator, NO measurement-extraction work.
- Defect-shower + alarm-burster services + flaky-SDK decorator. Implement IDefectShowerSchedule + DefectShowerService, wire FramePipelineService.ProcessDefectsForFrame to consult it. Implement AlarmBursterService. Implement FlakySdkDecorator<IMachineConnection> and conditional DI wiring (decorator only when Simulator:FlakySdk:Enabled == true). Add Get-WorkingSetGrowthMb and Get-FaultCyclesCount to MeasurementExtraction.psm1; extend ConvertTo-MeasurementRow. Tests for each. NO captures.
- 30-min ChaosMonkey + 8-hour Soak8h captures + row blocks + runbook §4.5 + §4.6 + session-handoff updates. Run both captures (30 min, then 8 h on a clean session; same machine, both wall-clock real-time). Append two row blocks. Write runbook §4.5 (ChaosMonkey) and §4.6 (Soak8h). Inspect logs for fault-branch coverage evidence. Update CLAUDE.md / roadmap-progress. NO code changes.
Acceptance Criteria Mapping
The implementation must satisfy all acceptance criteria from SLICE-1.4:
- Pass 1 covers criteria 1, 2, 3, 4 (config-only portions), 7, 8, 9, and the validator/simulator portions of 15
- Pass 2 covers criteria 5, 6, 10, 13, and the service / decorator / extraction portions of 15
- Pass 3 covers criteria 4 (capture-driven verification), 11, 12, 14, 16
Copilot Agent Prompts
Pass 1 — Profile fields + validators + telemetry dropout + time compression + network latency
You are implementing Pass 1 of TASK-1.4 in this repository: add the storm-and-
soak profile fields, extend SimulatorProfilesValidator, add FlakySdkOptions
(config-only — the decorator is Pass 2), and wire the three "low-friction"
load-shaping knobs (TelemetryDropoutChance into SimulatedTagSource;
TimeCompressionFactor and NetworkLatencyMeanMs/Stddev into
SimulatedMachineConnection; TimeCompressionFactor into
SimulatedMotionController per-tick wait). Add ChaosMonkey and Soak8h profiles
plus the Simulator:FlakySdk block to appsettings.json.
NO DefectShowerService, NO AlarmBursterService, NO FlakySdkDecorator, NO
MeasurementExtraction.psm1 changes, NO captures.
## Authoritative references
Read these before making changes:
- docs/specs/SLICE-1.4-storm-and-soak-profiles.md (the requirements)
- docs/tasks/TASK-1.4-implement-storm-and-soak-profiles.md (this task)
- src/InspectionPrototype.Application/State/SimulatorProfile.cs
- src/InspectionPrototype.Application/State/NoiseModelEvaluator.cs (Box-Muller helper to reuse)
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesOptions.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesValidator.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatorEncoderOptionsValidator.cs (parallel pattern)
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedTagSource.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedMachineConnection.cs
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedMotionController.cs
- src/InspectionPrototype.Infrastructure/InfrastructureServiceCollectionExtensions.cs
- src/InspectionPrototype.App/appsettings.json
Spec acceptance criteria 1, 2, 3, 4 (config portions), 7, 8, 9, and the
validator/simulator portions of 15 are the definition of done for this pass.
## Scope of this pass
Profile-record + options-class field additions, validator extensions,
SimulatedTagSource dropout, SimulatedMachineConnection time-compression and
Gaussian latency, SimulatedMotionController per-tick wait scale, appsettings.json
updates, tests for each. NO DefectShowerService, NO AlarmBursterService,
NO FlakySdkDecorator, NO MeasurementExtraction.psm1 changes.
## Deliverables
1. SimulatorProfile (src/InspectionPrototype.Application/State/SimulatorProfile.cs):
Add seven properties (with XML doc) at the end of the record:
int DefectShowerEveryMs = 0 // 0 = disabled
int DefectShowerDurationMs = 0
int AlarmBurstEveryMs = 0
double TelemetryDropoutChance = 0.0
double NetworkLatencyMeanMs = 0.0
double NetworkLatencyStddevMs = 0.0
double TimeCompressionFactor = 1.0
Update SimulatorProfile.Default to keep its current values (all new fields
default-zero / 1.0).
2. SimulatorProfileOptions
(src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesOptions.cs):
Mirror the seven property additions on the JSON-binding shape with the same
defaults. Keep the property setters (this is the binder's mutable shape).
3. SimulatorProfilesValidator
(src/InspectionPrototype.Infrastructure/Simulator/SimulatorProfilesValidator.cs):
Extend `Validate(...)` with rules per spec "Configuration validation":
- DefectShowerEveryMs in [0, 3_600_000]
- DefectShowerDurationMs in [0, 60_000]
- if DefectShowerEveryMs > 0 then DefectShowerDurationMs > 0 and DefectShowerDurationMs ≤ DefectShowerEveryMs
- AlarmBurstEveryMs in [0, 3_600_000]
- TelemetryDropoutChance in [0.0, 1.0]
- NetworkLatencyMeanMs in [0.0, 30_000.0]
- NetworkLatencyStddevMs in [0.0, 30_000.0]
- TimeCompressionFactor in [0.1, 100.0]
Each failure names the offending profile and field, mirroring the existing
EncoderIntervalMs failure-message style.
4. FlakySdkOptions (src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptions.cs):
New sealed class:
public const string SectionName = "Simulator:FlakySdk";
public bool Enabled { get; set; } = false;
public double TimeoutChance { get; set; } = 0.0;
public double IgnoreCancellationChance { get; set; } = 0.0;
public double OutOfBandThrowChance { get; set; } = 0.0;
/// <summary>For tests: how long the timeout-hang branch waits. Default 30 s in production.</summary>
public int TimeoutHangMs { get; set; } = 30_000;
5. FlakySdkOptionsValidator
(src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptionsValidator.cs):
IValidateOptions<FlakySdkOptions> — reject any of the three Chance fields
outside [0.0, 1.0]; reject TimeoutHangMs < 1 or > 600_000.
6. SimulatedTagSource (dropout wiring):
In the per-emitter loop, before computing/publishing the sample, draw
`Random.Shared.NextDouble()`. If the draw is less than the active
profile's TelemetryDropoutChance, skip the publish but still advance the
noise ref state (so the random-walk is visibly continuous after dropout
ends). The active profile is read from ISimulatorProfileProvider.CurrentProfile.
IMPORTANT: do NOT call _metrics.SamplesIngested.Add when dropping; do NOT
call _metrics.SamplesCoalesced.Add (a dropout is not a coalesce).
7. SimulatedMachineConnection (time-compression + Gaussian latency):
Replace the current `await Task.Delay(_connectDelay, cancellationToken)`
block with:
var profile = _profileProvider.CurrentProfile;
var compressedDelay = TimeSpan.FromMilliseconds(
_connectDelay.TotalMilliseconds / profile.TimeCompressionFactor);
await Task.Delay(compressedDelay, cancellationToken);
if (profile.NetworkLatencyMeanMs > 0) {
var jitter = SampleGaussianClampedMs(
profile.NetworkLatencyMeanMs, profile.NetworkLatencyStddevMs);
if (jitter > 0)
await Task.Delay(TimeSpan.FromMilliseconds(jitter), cancellationToken);
}
`SampleGaussianClampedMs(mean, stddev)` is a private static helper that
uses Box-Muller (mirror NoiseModelEvaluator's existing implementation; do
not depend on or import from Application). Negative draws clamp to 0.
8. SimulatedMotionController (per-tick wait scale):
In `InterpolateAsync`, scale only the per-iteration `Task.Delay(20ms, ct)`
call. The motion arithmetic, _currentX/_currentY, the PositionChanged event,
and the loop count are unchanged. Read TimeCompressionFactor from
ISimulatorProfileProvider.CurrentProfile each iteration (so a profile change
mid-move takes effect on the next tick — matches existing convention).
9. DI wiring (InfrastructureServiceCollectionExtensions):
- services.AddSingleton<IValidateOptions<FlakySdkOptions>, FlakySdkOptionsValidator>();
- services.AddOptions<FlakySdkOptions>()
.BindConfiguration(FlakySdkOptions.SectionName)
.ValidateOnStart();
The decorator wiring is Pass 2; in Pass 1, IMachineConnection still resolves
directly to SimulatedMachineConnection.
10. appsettings.json (src/InspectionPrototype.App/appsettings.json):
Add the ChaosMonkey profile (after EncoderRate):
Name: "ChaosMonkey"
MotionSpeedUnitsPerSecond: 50.0
TelemetryIntervalMs: 50
FrameIntervalMs: 100
FrameWidth: 1024
FrameHeight: 768
BytesPerPixel: 1
EncoderIntervalMs: 5
DefectProbabilityPerFrame: 0.05
ConnectionFailureProbability: 0.30
DefectShowerEveryMs: 30000
DefectShowerDurationMs: 3000
AlarmBurstEveryMs: 45000
TelemetryDropoutChance: 0.05
NetworkLatencyMeanMs: 250
NetworkLatencyStddevMs: 150
TimeCompressionFactor: 1.0
Add the Soak8h profile (after ChaosMonkey):
Name: "Soak8h"
MotionSpeedUnitsPerSecond: 30.0
TelemetryIntervalMs: 100
FrameIntervalMs: 250
FrameWidth: 1024
FrameHeight: 768
BytesPerPixel: 1
EncoderIntervalMs: 5
DefectProbabilityPerFrame: 0.05
ConnectionFailureProbability: 0.05
DefectShowerEveryMs: 600000
DefectShowerDurationMs: 5000
AlarmBurstEveryMs: 0
TelemetryDropoutChance: 0.01
NetworkLatencyMeanMs: 50
NetworkLatencyStddevMs: 20
TimeCompressionFactor: 1.0
Add a new "FlakySdk" block nested under "Simulator" (so that it binds at the Simulator:FlakySdk path):
Enabled: true
TimeoutChance: 0.05
IgnoreCancellationChance: 0.05
OutOfBandThrowChance: 0.05
TimeoutHangMs: 30000
11. Tests under tests/InspectionPrototype.Tests/:
- SimulatorProfileFieldsTests:
record round-trip for the seven new fields; SimulatorProfile.Default
carries all defaults.
- SimulatorProfilesValidatorChaosTests (or extend the existing
SimulatorProfilesValidatorTests): one [Theory] per new rule covering
the boundary cases (e.g., DefectShowerEveryMs = -1, 0, 1, 3_600_000,
3_600_001).
- FlakySdkOptionsValidatorTests: reject each Chance field outside [0.0, 1.0];
reject TimeoutHangMs of 0 and 600_001.
- SimulatedTagSourceDropoutTests:
Construct SimulatedTagSource with a single tag at 100 Hz, a fake
ISimulatorProfileProvider that yields a profile with
TelemetryDropoutChance = 0.5, run for 10 seconds (about 1,000 emitter
cycles), count AppMetrics.SamplesIngested entries against a MeterListener
(or count cell-write events via a recording cell-store). Assert count is in
[400, 600] (50% ± 10% of the cycles). Run a second test with
TelemetryDropoutChance = 1.0 and assert SamplesIngested == 0 and
SamplesCoalesced == 0.
- SimulatedMachineConnectionTimeCompressionTests:
With TimeCompressionFactor = 1.0, ConnectAsync wall-clock is in
[1400ms, 1700ms]. With TimeCompressionFactor = 5.0, in [280ms, 360ms]
(ignoring jitter; both tests set NetworkLatencyMeanMs = 0).
- SimulatedMachineConnectionNetworkLatencyTests:
With TimeCompressionFactor = 5.0 (to keep total wall-clock low),
NetworkLatencyMeanMs = 100, NetworkLatencyStddevMs = 20:
run 100 ConnectAsync calls; from each measured duration subtract the
compressed connect-delay floor to get the injected latency; assert the
mean of the 100 latencies is in [80, 120] ms.
Set ConnectionFailureProbability = 0 to remove the false-return path.
- SimulatedMotionControllerTimeCompressionTests:
Construct with FakeSimulatorProfileProvider yielding TimeCompressionFactor
= 5.0; call MoveToAsync(distance such that real-time would take ~500ms);
assert wall-clock duration in [80ms, 200ms] and the final
_currentX/_currentY match the destination.
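For deliverable 7, the clamped Gaussian draw could look like the sketch below. It is a minimal illustration of Box-Muller with a zero clamp, not the project's NoiseModelEvaluator code. The Random parameter is an assumption added so that a test can seed it; the deliverable itself describes a private static helper taking only mean and stddev.

```csharp
using System;

// Sketch only: one Box-Muller draw from N(meanMs, stddevMs), clamped at zero
// so a negative latency is never passed to Task.Delay.
// The Random parameter is hypothetical (added for deterministic tests).
static double SampleGaussianClampedMs(double meanMs, double stddevMs, Random rng)
{
    // 1 - NextDouble() lies in (0, 1], which keeps Math.Log finite.
    double u1 = 1.0 - rng.NextDouble();
    double u2 = rng.NextDouble();
    double standardNormal =
        Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    return Math.Max(0.0, meanMs + stddevMs * standardNormal);
}

// Usage: average many draws and compare against the configured mean.
var seeded = new Random(12345);
double sum = 0.0;
for (int i = 0; i < 10_000; i++)
    sum += SampleGaussianClampedMs(100.0, 20.0, seeded);
Console.WriteLine($"sample mean: {sum / 10_000:F1} ms");
```

Note that the clamp shifts the sample mean upward when the mean is small relative to the stddev, which is worth keeping in mind when choosing test tolerances.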
## Constraints
- Do NOT implement DefectShowerService or AlarmBursterService — Pass 2.
- Do NOT implement FlakySdkDecorator — Pass 2.
- Do NOT change AppMetrics or AppState — no new counters or fields outside
SimulatorProfile.
- Do NOT introduce a TimeProvider abstraction. TimeCompressionFactor is a
scalar applied at exactly two sites.
- Do NOT modify SimulatedMotionController's motion arithmetic, _currentX,
_currentY, or the PositionChanged event. Only the per-tick wait is scaled.
- Do NOT remove or rename any existing SimulatorProfile field. Existing
profiles must compile and run unchanged at runtime when no chaos field is
set.
- Do NOT change MainViewModel, IFaultInjector, IWorkflowService, IAppStateStore.
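One way to sanity-check the "existing profiles run unchanged" constraint is to confirm that the default chaos values pass every range rule from deliverable 3. The sketch below is illustrative only: the function name, the flat parameter list, and the message wording are hypothetical, and the real rules belong in SimulatorProfilesValidator, following the existing EncoderIntervalMs failure-message style.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: range and consistency rules for the seven new fields.
// Returns one message per violated rule; an empty list means the profile passes.
static List<string> ValidateChaosFields(
    string profile,
    int defectShowerEveryMs, int defectShowerDurationMs, int alarmBurstEveryMs,
    double telemetryDropoutChance, double networkLatencyMeanMs,
    double networkLatencyStddevMs, double timeCompressionFactor)
{
    var failures = new List<string>();

    // Records a failure that names the profile and the offending field.
    void Require(bool ok, string field, string rule)
    {
        if (!ok) failures.Add($"Profile '{profile}': {field} must be {rule}.");
    }

    Require(defectShowerEveryMs >= 0 && defectShowerEveryMs <= 3_600_000,
        "DefectShowerEveryMs", "in [0, 3600000]");
    Require(defectShowerDurationMs >= 0 && defectShowerDurationMs <= 60_000,
        "DefectShowerDurationMs", "in [0, 60000]");
    if (defectShowerEveryMs > 0)
    {
        // Consistency rule: an enabled shower needs a non-zero window that fits the period.
        Require(defectShowerDurationMs > 0 && defectShowerDurationMs <= defectShowerEveryMs,
            "DefectShowerDurationMs", "in [1, DefectShowerEveryMs] when DefectShowerEveryMs > 0");
    }
    Require(alarmBurstEveryMs >= 0 && alarmBurstEveryMs <= 3_600_000,
        "AlarmBurstEveryMs", "in [0, 3600000]");
    // These comparisons are false for NaN, so NaN is rejected as well.
    Require(telemetryDropoutChance >= 0.0 && telemetryDropoutChance <= 1.0,
        "TelemetryDropoutChance", "in [0.0, 1.0]");
    Require(networkLatencyMeanMs >= 0.0 && networkLatencyMeanMs <= 30_000.0,
        "NetworkLatencyMeanMs", "in [0.0, 30000.0]");
    Require(networkLatencyStddevMs >= 0.0 && networkLatencyStddevMs <= 30_000.0,
        "NetworkLatencyStddevMs", "in [0.0, 30000.0]");
    Require(timeCompressionFactor >= 0.1 && timeCompressionFactor <= 100.0,
        "TimeCompressionFactor", "in [0.1, 100.0]");
    return failures;
}

// All-default chaos fields (zeros and a factor of 1.0) violate no rule.
var defaults = ValidateChaosFields("Normal", 0, 0, 0, 0.0, 0.0, 0.0, 1.0);
Console.WriteLine(defaults.Count); // prints 0

// An enabled shower with a zero duration, plus an out-of-range factor.
var broken = ValidateChaosFields("Broken", 1000, 0, 0, 0.0, 0.0, 0.0, 0.05);
foreach (var failure in broken) Console.WriteLine(failure);
```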
## Verification before you report done
dotnet build --configuration Release
dotnet test --configuration Release
Manual smoke test:
- Launch interactively (no --scenario flag); app starts with Normal profile
(all chaos fields default-zero); no warnings or crashes about missing
Simulator:FlakySdk config; no diagnostics-warning log spam.
- Switch to ChaosMonkey via the profile combo-box; observe the simulator
log warnings about ConnectionFailureProbability=0.3 and the latency
injection paths firing. The DefectShowerEvery/AlarmBurstEvery fields are
set in config but no service consumes them yet (Pass 2's job) — the
interactive smoke test should NOT see defect storms or alarm bursts. That
is correct.
- Switch to Soak8h; behavior similarly muted (most chaos fields set low).
- Switch back to Normal; everything returns to current behavior.
## Report format when finished
- files created and modified
- confirmation that all existing tests pass plus new tests
- a single commit hash
- commit message: "feat(sim): add storm/soak profile fields, validators, and three load-shaping knobs (pass 1/3 of TASK-1.4)"
Pass 2 — Defect-shower + alarm-burster services + flaky-SDK decorator
You are implementing Pass 2 of TASK-1.4. Pass 1 (profile fields, validators,
TelemetryDropoutChance / TimeCompressionFactor / NetworkLatency wiring,
ChaosMonkey + Soak8h profiles, Simulator:FlakySdk config block) is already
merged. This pass adds the three new background services / decorators that
consume the chaos fields, plus the measurement-extraction helpers.
## Authoritative references
Read these before making changes:
- docs/specs/SLICE-1.4-storm-and-soak-profiles.md (criteria 5, 6, 10, 13)
- src/InspectionPrototype.Application/State/SimulatorProfile.cs (Pass 1 — chaos fields are present)
- src/InspectionPrototype.Application/Services/FramePipelineService.cs
- src/InspectionPrototype.Application/Abstractions/IFaultInjector.cs
- src/InspectionPrototype.Application/Abstractions/IWorkflowService.cs
- src/InspectionPrototype.Application/Services/EncoderStreamPipelineService.cs (parallel BackgroundService pattern)
- src/InspectionPrototype.Infrastructure/Simulator/SimulatedMachineConnection.cs
- src/InspectionPrototype.Infrastructure/Simulator/FlakySdkOptions.cs (Pass 1)
- src/InspectionPrototype.Infrastructure/InfrastructureServiceCollectionExtensions.cs
- tools/MeasurementExtraction.psm1
- tests/Tools/MeasurementExtraction.Tests.ps1
Pass 1 must be merged. Confirm by inspecting that SimulatorProfile carries
the seven new fields and that appsettings.json has ChaosMonkey, Soak8h, and
the Simulator:FlakySdk block.
## Scope of this pass
DefectShowerService, AlarmBursterService, FlakySdkDecorator<IMachineConnection>
+ conditional DI, MeasurementExtraction.psm1 helpers (Get-WorkingSetGrowthMb,
Get-FaultCyclesCount), ConvertTo-MeasurementRow extension. Tests for each.
NO captures, NO simulator-side changes beyond the FramePipelineService
consult of IDefectShowerSchedule, NO new FlaUI tests.
## Deliverables
1. IDefectShowerSchedule
(src/InspectionPrototype.Application/Abstractions/IDefectShowerSchedule.cs):
public interface IDefectShowerSchedule {
/// <summary>
/// True iff the active simulator profile has DefectShowerEveryMs > 0
/// AND we are currently inside an active shower window.
/// FramePipelineService consults this on every frame.
/// </summary>
bool IsShowerActive { get; }
}
2. DefectShowerService
(src/InspectionPrototype.Application/Services/DefectShowerService.cs):
- Implements IDefectShowerSchedule and IHostedService
- Constructor takes ISimulatorProfileProvider, IAppStateStore (for
WithDiagnosticsEntry on transitions), ILogger<DefectShowerService>
- Field: private volatile bool _isActive = false
- StartAsync: subscribe to ISimulatorProfileProvider.ProfileChanged;
start background task that, while not stopping:
* read profile.DefectShowerEveryMs; if 0, await ProfileChanged event
(or a 1-second poll), continue
* else: await Task.Delay(EveryMs - DurationMs); set _isActive = true,
log Info "Defect shower active"; await Task.Delay(DurationMs); set
_isActive = false, log Info "Defect shower ended". Loop.
Also write a one-line diagnostics-timeline entry on each transition
via _store.Update(s => s.WithDiagnosticsEntry(DiagnosticsSource.Pipeline, ...))
- StopAsync: cancel the background task; _isActive = false.
FramePipelineService.ProcessDefectsForFrame: inject IDefectShowerSchedule;
when IsShowerActive == true, skip the per-frame probability check (always
produce a defect). The per-defect severity distribution and
AppState.ActiveRun.DefectsCritical/Major/Minor accumulation flow as today.
3. AlarmBursterService
(src/InspectionPrototype.Application/Services/AlarmBursterService.cs):
- Implements IHostedService
- Constructor takes IFaultInjector, IWorkflowService, ISimulatorProfileProvider,
ILogger<AlarmBursterService>
- Field: private static readonly string[] _pool = ["CHAOS-001",...,"CHAOS-005"]
- Field: private int _index = 0
- StartAsync: subscribe to ProfileChanged; start background task:
* read profile.AlarmBurstEveryMs; if 0, await ProfileChanged event
(or a 1-second poll), continue
* else: while not stopping:
await Task.Delay(EveryMs, ct);
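// Interlocked.Increment returns the incremented value; subtract 1 so the first burst uses CHAOS-001.
var code = _pool[(Interlocked.Increment(ref _index) - 1) % _pool.Length];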
try {
_faultInjector.InjectCriticalFault(code, $"ChaosMonkey burst at {DateTimeOffset.UtcNow:HH:mm:ss.fff}");
await Task.Delay(500, ct);
_faultInjector.ClearFault(code);
await Task.Delay(500, ct);
await _workflow.RecoverAsync();
} catch (Exception ex) when (ex is not OperationCanceledException) {
_logger.LogWarning(ex, "Alarm burst cycle failed; continuing.");
}
- StopAsync: cancel; do not re-throw.
4. FlakySdkDecorator<IMachineConnection>
(src/InspectionPrototype.Infrastructure/Simulator/FlakySdkDecorator.cs):
public sealed class FlakySdkDecorator : IMachineConnection {
    private readonly IMachineConnection _inner;
    private readonly IOptionsMonitor<FlakySdkOptions> _options;
    private readonly ILogger<FlakySdkDecorator> _logger;

    // Matches the three-argument construction used in the DI wiring (deliverable 5).
    public FlakySdkDecorator(
        IMachineConnection inner,
        IOptionsMonitor<FlakySdkOptions> options,
        ILogger<FlakySdkDecorator> logger) {
        _inner = inner;
        _options = options;
        _logger = logger;
    }

    public async Task<bool> ConnectAsync(CancellationToken ct) {
        var opts = _options.CurrentValue;
        if (!opts.Enabled) return await _inner.ConnectAsync(ct);
        if (Random.Shared.NextDouble() < opts.TimeoutChance) {
            _logger.LogInformation("FlakySdk: timeout-hang branch fired.");
            // Cancellation during the hang propagates as OperationCanceledException.
            await Task.Delay(opts.TimeoutHangMs, ct);
            // If not cancelled (long enough wait), fall through to inner.
        }
        if (Random.Shared.NextDouble() < opts.IgnoreCancellationChance) {
            _logger.LogInformation("FlakySdk: ignore-cancellation branch fired.");
            // Ignore caller's CancellationToken.
            return await _inner.ConnectAsync(CancellationToken.None);
        }
        if (Random.Shared.NextDouble() < opts.OutOfBandThrowChance) {
            _logger.LogWarning("FlakySdk: out-of-band-throw branch fired.");
            throw new InvalidOperationException(
                "FlakySdk: simulated out-of-band SDK exception.");
        }
        return await _inner.ConnectAsync(ct);
    }

    public Task DisconnectAsync() => _inner.DisconnectAsync();
}
5. Conditional DI wiring (InfrastructureServiceCollectionExtensions):
Replace the current line
services.AddSingleton<IMachineConnection, SimulatedMachineConnection>();
with:
services.AddSingleton<SimulatedMachineConnection>();
services.AddSingleton<IMachineConnection>(sp => {
var opts = sp.GetRequiredService<IOptionsMonitor<FlakySdkOptions>>().CurrentValue;
var inner = sp.GetRequiredService<SimulatedMachineConnection>();
return opts.Enabled
? new FlakySdkDecorator(
inner,
sp.GetRequiredService<IOptionsMonitor<FlakySdkOptions>>(),
sp.GetRequiredService<ILogger<FlakySdkDecorator>>())
: (IMachineConnection)inner;
});
Also register:
services.AddSingleton<DefectShowerService>();
services.AddSingleton<IDefectShowerSchedule>(sp => sp.GetRequiredService<DefectShowerService>());
services.AddHostedService(sp => sp.GetRequiredService<DefectShowerService>());
services.AddSingleton<AlarmBursterService>();
services.AddHostedService(sp => sp.GetRequiredService<AlarmBursterService>());
6. tools/MeasurementExtraction.psm1:
Add and export Get-WorkingSetGrowthMb:
function Get-WorkingSetGrowthMb {
    [CmdletBinding()]
    param(
        # Mandatory collection parameters reject $null and empty arrays unless allowed explicitly.
        [Parameter(Mandatory)][AllowNull()][AllowEmptyCollection()][object[]] $Csv
    )
    # @(...) keeps .Count reliable when zero or one row matches.
    $rows = @($Csv | Where-Object {
        $_.'Counter Name' -match 'dotnet\.process\.memory\.working_set'
    } | Sort-Object Timestamp)
    if ($rows.Count -lt 2) { return $null }
    $first = [double]$rows[0].'Mean/Increment'
    $last = [double]$rows[-1].'Mean/Increment'
    return [math]::Round(($last - $first) / 1MB, 1)
}
Add and export Get-FaultCyclesCount:
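function Get-FaultCyclesCount {
    [CmdletBinding()]
    param(
        # Mandatory collection parameters reject $null and empty arrays unless allowed explicitly.
        [Parameter(Mandatory)][AllowNull()][AllowEmptyCollection()][object[]] $Csv
    )
    # @(...) keeps .Count reliable when zero or one row matches.
    $rows = @($Csv | Where-Object {
        $_.'Counter Name' -match 'runs\.faulted'
    })
    if ($rows.Count -eq 0) { return 0 }
    return [int](($rows | Measure-Object -Property 'Mean/Increment' -Sum).Sum)
}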
Update ConvertTo-MeasurementRow to call both helpers and append two rows:
| working-set growth (MB) | <Get-WorkingSetGrowthMb output, or "—" if null> |
| fault-cycles (count) | <Get-FaultCyclesCount output> |
Use the same "—" sentinel pattern from SLICE-1.2 / SLICE-1.3 for missing data.
7. tests/Tools/MeasurementExtraction.Tests.ps1:
Four new Pester tests:
- "WorkingSetGrowthMb_OnFixture_ComputesLastMinusFirst": synthetic CSV
with two working_set rows, assert correct (last - first) / 1MB.
- "WorkingSetGrowthMb_OnEmptyCsv_ReturnsNull": empty CSV → $null.
- "FaultCyclesCount_OnFixture_SumsRunsFaulted": fixture with three rows of
runs.faulted, Mean/Increment = 1, 2, 3 → returns 6.
- "ConvertTo-MeasurementRow_AppendsTwoNewRows_WhenCsvHasData": fixture
with working_set + runs.faulted; assert markdown contains both rows.
8. Tests under tests/InspectionPrototype.Tests/:
- DefectShowerServiceTests: with DefectShowerEveryMs = 200, DurationMs = 100,
run for 1 second; assert IsShowerActive transitions ≥ 3 times.
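- AlarmBursterServiceTests: with AlarmBurstEveryMs = 100, recording
IFaultInjector + IWorkflowService stubs, run for about 7 s. Each cycle
takes AlarmBurstEveryMs plus the two fixed 500 ms waits (about 1.1 s), so
six cycles, enough to wrap from CHAOS-005 back to CHAOS-001, need roughly
6.6 s. Assert InjectCriticalFault was called ≥ 3 times AND ClearFault was
called for each completed cycle AND RecoverAsync was called the same
number of times AND the alarm codes cycled through
CHAOS-001 → 005 → 001 in order.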
Second test: stub IWorkflowService.RecoverAsync to throw
InvalidOperationException; assert the service logs a Warning and continues
ticking (no host fault).
- FlakySdkDecoratorTests: three [Fact]s — one per branch — by setting one
Chance to 1.0 and the others to 0.0.
* TimeoutHang: with TimeoutHangMs = 100 and caller CTS that cancels
at 50ms, assert OperationCanceledException is thrown.
* IgnoreCancellation: caller cancels CTS immediately; assert
ConnectAsync still returns successfully (the wrapped inner sees no
cancellation).
* OutOfBandThrow: assert InvalidOperationException is thrown.
Plus a passthrough [Fact] with Enabled=false: assert no extra delay,
no extra throws — bypasses to inner directly.
- FramePipelineServiceShowerTests (extend the existing suite): with a
fake IDefectShowerSchedule that returns IsShowerActive=true and a
profile DefectProbabilityPerFrame=0.0, push 10 frames; assert
ActiveRun.DefectCount == 10.
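The AlarmBursterServiceTests ordering assertion depends on the first burst using the first pool entry. A minimal sketch of a thread-safe round-robin pick with that property follows; the function name and the three-entry demo pool are hypothetical and stand in for the service's own pool.

```csharp
using System;
using System.Threading;

// Sketch only: round-robin selection that starts at index 0.
static string NextCode(string[] codes, ref int counter)
{
    // Interlocked.Increment returns the value after incrementing, so subtract 1
    // to make the first pick index 0. Unsigned arithmetic keeps the index
    // non-negative even if the counter eventually wraps around.
    uint ticket = unchecked((uint)Interlocked.Increment(ref counter) - 1u);
    return codes[(int)(ticket % (uint)codes.Length)];
}

string[] pool = { "DEMO-A", "DEMO-B", "DEMO-C" };
int cursor = 0;
for (int n = 0; n < 4; n++)
    Console.WriteLine(NextCode(pool, ref cursor)); // DEMO-A, DEMO-B, DEMO-C, then DEMO-A again
```

Indexing directly with the value returned by Interlocked.Increment would instead start at the second entry, because the first returned value is 1.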
## Constraints
- Do NOT change AppMetrics, AppState, IAppStateStore, IFaultInjector, or
IWorkflowService.
- Do NOT add new metric counters. The two new measurement-extraction rows
are derived from existing counters (working_set, runs.faulted).
- Do NOT decorate IMotionController. Connection-only is the slice's scope.
- Do NOT make AlarmBursterService crash the host on any inner exception.
The "swallow and log" pattern is intentional — the service must outlive
individual cycle failures.
- Do NOT make DefectShowerService produce defects directly. It is a *schedule*;
FramePipelineService is what produces the defects when IsShowerActive == true.
- The FlakySdk decorator must not retry, log on every call, or accumulate
state. It logs only when a fault branch fires. The three branches are
independent and stateless beyond the options snapshot.
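To illustrate the "schedule, not producer" constraint, the per-frame decision inside FramePipelineService can be reduced to a pure function. This is a sketch under assumptions: the function name and the explicit draw parameter are hypothetical, chosen so that the rule can be exercised without a Random instance.

```csharp
using System;

// Sketch only: decides whether a frame yields a defect.
// "draw" is a uniform sample in [0, 1) supplied by the caller.
static bool ShouldProduceDefect(bool showerActive, double defectProbabilityPerFrame, double draw)
{
    // An active shower bypasses the probability check entirely.
    if (showerActive) return true;
    return draw < defectProbabilityPerFrame;
}

// Mirrors the FramePipelineServiceShowerTests setup: shower on, probability 0.0, ten frames.
var rng = new Random(42);
int defects = 0;
for (int frame = 0; frame < 10; frame++)
{
    if (ShouldProduceDefect(showerActive: true, defectProbabilityPerFrame: 0.0, draw: rng.NextDouble()))
        defects++;
}
Console.WriteLine(defects); // prints 10
```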
## Verification before you report done
dotnet build --configuration Release
dotnet test --configuration Release
Plus:
- Pester: Invoke-Pester tests/Tools/MeasurementExtraction.Tests.ps1
All four new tests pass plus the existing tests.
- Manual smoke capture (60 seconds, ChaosMonkey profile):
tools/Capture-Measurements.ps1 -Scenario MultiTagSoak `
-DurationSeconds 60 -Profile ChaosMonkey `
-OutputCsv docs/captures/_smoke.csv `
-CommitHash $(git rev-parse --short HEAD) -AllowDirty
Verify:
* exit code 0 OR a non-zero exit due to a fault-induced run termination
(this is acceptable for ChaosMonkey — log the exit reason)
* the printed row block has working-set growth (MB) and
fault-cycles (count) rows present
* fault-cycles ≥ 1 (at least one alarm-burster cycle landed in 60 s
given AlarmBurstEveryMs = 45_000)
* the diagnostics-timeline log shows "ChaosMonkey burst at ..." entries
Delete the smoke CSV before commit.
## Report format when finished
- files created and modified
- confirmation all C# tests + Pester tests pass
- the smoke-capture stdout (the row block) included as evidence
- a single commit hash
- commit message: "feat(app,sim,tools): add chaos services + flaky-SDK decorator + measurement helpers (pass 2/3 of TASK-1.4)"
Pass 3 — Captures + row blocks + runbook §4.5 + §4.6
You are implementing Pass 3 of TASK-1.4, the final pass. Passes 1 and 2 are
merged. This pass runs the 30-minute ChaosMonkey capture and the 8-hour Soak8h
capture, appends two row blocks, writes runbook §4.5 and §4.6, and updates
session-handoff documents. NO code changes — Passes 1 and 2 own those.
## Authoritative references
Read these before making changes:
- docs/specs/SLICE-1.4-storm-and-soak-profiles.md (criteria 11, 12, 14, 16)
- docs/runbook/capturing-measurements.md (existing §3a, §4.1–§4.4)
- docs/reviews/phase-1-measurements.md (slice-1-2-real-frame-payloads and
slice-1-3-encoder-rate-motion rows to mirror)
- CLAUDE.md, docs/reviews/roadmap-progress.md
- tools/Capture-Measurements.ps1
## Scope of this pass
Two captures, two table edits, two runbook sections (§4.5 + §4.6), session-
handoff updates. No code or test changes.
## Deliverables
1. Disable system sleep AND hibernate AND screen-saver before the 8-hour
soak; for the 30-min ChaosMonkey, sleep-disable is sufficient.
powercfg /change standby-timeout-ac 0
powercfg /change monitor-timeout-ac 0
powercfg /hibernate off # for the soak only
Note the previous values in the session-log entry so they can be restored.
2. Run the 30-minute ChaosMonkey capture FIRST (it is shorter; if it surfaces
a regression, do not waste 8 hours on the soak):
$date = Get-Date -Format 'yyyy-MM-dd'
tools/Capture-Measurements.ps1 -Scenario MultiTagSoak `
-DurationSeconds 1800 -Profile ChaosMonkey `
-OutputCsv "docs/captures/slice-1-4-chaos-monkey-$date.csv" `
-CommitHash $(git rev-parse --short HEAD) `
-SliceTag slice-1-4-chaos-monkey
Verify:
* The capture completed (no host crash; CSV span ≥ 1700 s).
* runs.started ≥ 5 (criterion 11).
* runs.faulted ≥ 5 (criterion 11).
* The Logs/inspection-prototype-*.log file from the capture window
contains entries showing every fault branch landed:
(a) connect-failure: "Connection failed (simulated failure)" OR
"FlakySdk: out-of-band-throw branch fired" OR
"Connection error:" from DoConnectAsync
(b) fault-during-home: "CRITICAL FAULT: [CHAOS-..." entries with
surrounding "Homing started" / "Homing aborted"
(c) fault-during-run: "CRITICAL FAULT: [CHAOS-..." entries with
surrounding "Run running" / "Run loop interrupted"
(d) fault-clear-and-recover: "Fault condition cleared: [CHAOS-..."
followed by "Recovery completed."
* The printed row block has 22 metrics, including the two new
working-set growth (MB) and fault-cycles (count) rows.
If criterion 11 fails (any branch is missing from the log), STOP — file
the gap as a follow-up, do not proceed to §4.5 / §4.6 edits or to the
soak. The most likely cause for a missing branch is a profile-config
typo (e.g., AlarmBurstEveryMs accidentally 0); inspect the active
profile snapshot in the diagnostics timeline.
3. Run the 8-hour Soak8h capture on a sleep-disabled session that is NOT
shared with other workloads:
tools/Capture-Measurements.ps1 -Scenario MultiTagSoak `
-DurationSeconds 28800 -Profile Soak8h `
-OutputCsv "docs/captures/slice-1-4-soak-8h-$date.csv" `
-CommitHash $(git rev-parse --short HEAD) `
-SliceTag slice-1-4-soak-8h
Verify:
* Capture span ≥ 28_500 s (at most 300 s, about 1%, short of 8 h).
* working-set growth (MB) ≤ 50 (criterion 12).
* gen-2-gc-count, normalized to a per-hour rate, is in the same order of
magnitude as slice-1-2-real-frame-payloads (no Gen-2 runaway). If the
rate is more than 4× the slice-1-2 rate, criterion 12 fails.
* runs.faulted = 0. Under Soak8h AlarmBurstEveryMs = 0, so no alarm
burst faults a run. Misconnects from ConnectionFailureProbability =
0.05 log warnings but are not critical-fault paths, so they should
not raise runs.faulted under nominal Soak8h.
* No unhandled-exception entries in the log.
If any of these fails, STOP. The 50 MB ceiling is the slice's primary
exit gate — failing it is not a documentation problem.
4. Append TWO row blocks to docs/reviews/phase-1-measurements.md:
"### Row — slice-1-4-chaos-monkey" (mirror slice-1-3 format):
- 22 metrics: existing 18 + gc-pause-p95 + LOH-alloc-rate avg +
working-set growth (MB) + fault-cycles (count)
- Baseline = slice-1-3-encoder-rate-motion values for the 20 metrics that
overlap; "—" for working-set growth and fault-cycles since SLICE-1.3
predates them
- Notes section with at least:
(a) Why slice-1-3 is the baseline.
(b) Per-fault-branch evidence: one bullet per (a)/(b)/(c)/(d) listing
the log line counts confirming the branch was hit.
(c) Whether anything surprised — e.g., did the ignore-cancellation
branch surface a new race? did the diagnostics timeline survive the
alarm-burst cadence?
"### Row — slice-1-4-soak-8h":
- 22 metrics: same set as above
- Baseline = slice-1-2-real-frame-payloads values for 18 overlapping
metrics; "—" for the 4 new ones
- Notes section with:
(a) Why slice-1-2 is the baseline (continuous-load, FlaUI-captured row).
(b) Working-set first-second / last-second numbers and the (last - first)
growth math, evidencing criterion 12.
(c) Gen-2 GC count rate (per hour) compared to slice-1-2's rate.
(d) Per-tag samples.ingested distribution at a coarse level — note any
tag whose rate dropped by more than the 1% TelemetryDropoutChance
predicts.
(e) Anything else that surprised: working-set sawtooth shape vs
monotonic, alloc-rate trend, etc.
5. Add §4.5 to docs/runbook/capturing-measurements.md:
- title: "### 4.5 Chaos-monkey scenario — SLICE-1.4, `ChaosMonkey` profile"
- placement: after §4.4 (encoder-rate motion) and before §4.6 (Soak8h)
- content:
* one-paragraph rationale (links back to SLICE-1.4 spec)
* 30-minute step list mirroring §4.4 but with profile = ChaosMonkey
* sanity checks: runs.started ≥ 5, runs.faulted ≥ 5, fault-cycles
(count) ≥ 5, frames.dropped recorded, the four log-line branches
(a)/(b)/(c)/(d) all present
* the row block is 22-metric; name working-set growth (MB) and
fault-cycles (count) and where they come from
* a PowerShell `Select-String` recipe over the inspection-prototype
log files to count each fault-branch landing — copy-pasteable;
this is the criterion-11 verification recipe
* Implemented by: MultiTagSoakFlaUi with `--profile ChaosMonkey`
6. Add §4.6 to docs/runbook/capturing-measurements.md:
- title: "### 4.6 Soak scenario — SLICE-1.4, `Soak8h` profile"
- placement: after §4.5
- content:
* one-paragraph rationale: this is the slice's leak-detection bar;
8 hours real-time on a dedicated session
* a strong "do not run on a host you also intend to use" warning
* prerequisites stronger than §4.5: hibernate disabled, screen-saver
disabled, no other interactive use of the host during the run
* 8-hour step list — same Capture-Measurements.ps1 invocation with
-DurationSeconds 28800 -Profile Soak8h
* sanity checks: working-set growth (MB) ≤ 50, gen-2-gc-count rate
within 4× of slice-1-2-real-frame-payloads's rate, no
unhandled-exception entries in the log, capture span ≥ 28_500 s
* what to do if the capture is interrupted: discard the partial CSV
and restart; partial captures dilute the leak math
* Implemented by: MultiTagSoakFlaUi with `--profile Soak8h`
7. Replace the "### 4.5+ — pending Phase 1 scenarios" placeholder section
with: "### 4.7+ — pending Phase 2 scenarios" listing only "Reserved for
Phase 2 slices once they open." Phase 1 is complete after this slice.
8. Update CLAUDE.md "Current position" block:
- Phase: 1 (Simulator to scale) — complete
- Last completed action: TASK-1.4 Pass 3 — captured 30-min ChaosMonkey
(fault-cycles=<N>, runs.faulted=<N>) and 8-hour Soak8h
(working-set growth=<X> MB), 22-metric row blocks appended, runbook
§4.5 + §4.6 added; commit <hash>
- Next action: open Phase 2 — review Phase-1 measurement evidence to
prioritize SLICE-2.1 / 2.2 / 2.3 / 2.4 ordering
- Blocked on: nothing
- Last updated: <today's date>
9. Append a session-log entry to docs/reviews/roadmap-progress.md under
today's date covering: both CSV paths, both row-block headline numbers,
the criterion-11 log-evidence recipe output (per-branch counts), the
commit hash, and a one-line "Phase 1 exit gate met on YYYY-MM-DD"
declaration. Mark SLICE-1.4 as Completed in the progress table. Add a
one-line note under the Phase 1 section heading: "**Phase 1 exit gate:**
met on YYYY-MM-DD, see rows slice-1-4-chaos-monkey and slice-1-4-soak-8h."
10. Restore powercfg settings after both captures complete:
powercfg /change standby-timeout-ac <previous_value>
powercfg /change monitor-timeout-ac <previous_value>
powercfg /hibernate on # if previously on
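Deliverables 1 and 10 both depend on knowing the previous timeout values. A minimal sketch of reading them, assuming English-language `powercfg /query` output (the "Current AC Power Setting Index" line) and the standard `SUB_SLEEP` / `STANDBYIDLE` and `SUB_VIDEO` / `VIDEOIDLE` aliases. `Get-AcTimeoutSeconds` is a hypothetical helper. Note that `powercfg /query` reports seconds in hex, while `powercfg /change` takes minutes. The hibernate state is not covered by this sketch.

```powershell
# Hypothetical helper: pull the AC timeout (in seconds) out of `powercfg /query` text.
function Get-AcTimeoutSeconds {
    param([string[]]$QueryOutput)
    foreach ($line in $QueryOutput) {
        # Assumes English output, e.g. "Current AC Power Setting Index: 0x00000708".
        if ($line -match 'Current AC Power Setting Index:\s*(0x[0-9a-fA-F]+)') {
            return [Convert]::ToInt32($Matches[1], 16)
        }
    }
    throw 'No "Current AC Power Setting Index" line found.'
}

# Only query on a host that actually has powercfg (Windows).
if (Get-Command powercfg -ErrorAction SilentlyContinue) {
    $standbySec = Get-AcTimeoutSeconds -QueryOutput (powercfg /query SCHEME_CURRENT SUB_SLEEP STANDBYIDLE)
    $monitorSec = Get-AcTimeoutSeconds -QueryOutput (powercfg /query SCHEME_CURRENT SUB_VIDEO VIDEOIDLE)
    # /change expects minutes, so convert before writing the session-log entry.
    "standby-timeout-ac previous value: $($standbySec / 60) min"
    "monitor-timeout-ac previous value: $($monitorSec / 60) min"
}
```

Run this before the `powercfg /change` calls in deliverable 1 and paste the two printed lines into the session-log entry.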
## Constraints
- Do NOT make any code or test changes in this pass.
- Do NOT modify the SLICE-1.4 spec — the row blocks are the slice's exit-gate
evidence, not an opportunity to amend the spec.
- Do NOT skip the 8-hour soak. A shorter soak does not satisfy criterion 12;
leak signal needs the longer wall-clock window to separate from
steady-state fluctuation.
- Do NOT capture without disabling system sleep first (deliverable 1).
- Do NOT capture the soak with another high-CPU workload running on the host.
- Do NOT skip the 30-min ChaosMonkey capture in favor of the soak alone:
the two captures evidence different exit-gate criteria, and both rows are required.
- Do NOT proceed to §4.5/§4.6 edits if criterion 11 (every fault branch hit)
fails. File the gap as a follow-up first.
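The criterion-11 check that gates these constraints can be sketched with `Select-String`. This is one possible shape for the runbook recipe, not the project's own: the marker strings are the ones listed under deliverable 2, while the 5-line context window and the rule "classify a CHAOS fault by the workflow markers near it" are assumptions. `Get-FaultBranchCounts` is a hypothetical helper.

```powershell
# Hypothetical helper: count log evidence for fault branches (a) to (d).
function Get-FaultBranchCounts {
    param(
        [Parameter(Mandatory)][string]$LogGlob,
        [int]$ContextLines = 5   # assumed window for "surrounding" entries
    )

    # (a) connect-failure: any of the three literal markers counts.
    $connect = @(Select-String -Path $LogGlob -SimpleMatch -Pattern @(
        'Connection failed (simulated failure)',
        'FlakySdk: out-of-band-throw branch fired',
        'Connection error:'))

    # (b)/(c): classify each CHAOS critical fault by nearby workflow markers.
    $faults = @(Select-String -Path $LogGlob -SimpleMatch -Pattern 'CRITICAL FAULT: [CHAOS-' -Context $ContextLines, $ContextLines)
    $duringHome = 0
    $duringRun = 0
    foreach ($f in $faults) {
        $near = (@($f.Context.PreContext) + @($f.Context.PostContext)) -join "`n"
        if ($near -match 'Homing (started|aborted)') { $duringHome++ }
        if ($near -match 'Run (running|loop interrupted)') { $duringRun++ }
    }

    # (d) clear-and-recover: a cleared fault followed by "Recovery completed."
    $cleared = @(Select-String -Path $LogGlob -SimpleMatch -Pattern 'Fault condition cleared: [CHAOS-' -Context 0, $ContextLines)
    $recovered = @($cleared | Where-Object {
        (@($_.Context.PostContext) -join "`n") -match 'Recovery completed\.'
    }).Count

    [pscustomobject]@{
        ConnectFailure  = $connect.Count
        FaultDuringHome = $duringHome
        FaultDuringRun  = $duringRun
        ClearAndRecover = $recovered
    }
}

# Example call (path pattern from deliverable 2):
# Get-FaultBranchCounts -LogGlob 'Logs/inspection-prototype-*.log'
```

A zero in any of the four fields means that branch has no log evidence. The glob matches every log file in the folder, so restrict it to files from the capture window before trusting the counts.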
## Verification before you report done
dotnet build --configuration Release
dotnet test --configuration Release
Plus:
- both docs/captures/slice-1-4-chaos-monkey-<date>.csv and
docs/captures/slice-1-4-soak-8h-<date>.csv exist and are committed
- both row blocks are in docs/reviews/phase-1-measurements.md with all 22
metrics filled, criterion 11 + 12 satisfied
- §4.5 and §4.6 render correctly (no broken markdown tables or links)
- CLAUDE.md current-position block reflects SLICE-1.4 closure and states
that the Phase 1 exit gate is met
- The Phase 1 exit-gate banner line is present in roadmap-progress.md
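The two numeric criterion-12 checks (working-set growth ≤ 50 MB, Gen-2 rate within 4× of slice-1-2) can be sanity-checked by hand before the row is written. A minimal sketch, assuming the first-second and last-second working-set values are available in bytes and that "MB" means 1,048,576 bytes. `Test-SoakGate` is a hypothetical helper and does not replace `Get-WorkingSetGrowthMb`.

```powershell
# Hypothetical helper: evaluate the two numeric Soak8h checks from raw inputs.
function Test-SoakGate {
    param(
        [Parameter(Mandatory)][double]$FirstWorkingSetBytes,
        [Parameter(Mandatory)][double]$LastWorkingSetBytes,
        [Parameter(Mandatory)][double]$SoakGen2Count,
        [Parameter(Mandatory)][double]$SoakHours,
        [Parameter(Mandatory)][double]$BaselineGen2PerHour
    )
    if ($SoakHours -le 0 -or $BaselineGen2PerHour -le 0) {
        throw 'SoakHours and BaselineGen2PerHour must be positive.'
    }
    # Growth is last minus first; 1MB is 1,048,576 bytes in PowerShell (assumed unit).
    $growthMb = ($LastWorkingSetBytes - $FirstWorkingSetBytes) / 1MB
    # Normalize the soak's Gen-2 count to a per-hour rate before comparing.
    $gen2PerHour = $SoakGen2Count / $SoakHours
    $ratio = $gen2PerHour / $BaselineGen2PerHour
    [pscustomobject]@{
        GrowthMb    = [math]::Round($growthMb, 1)
        Gen2PerHour = [math]::Round($gen2PerHour, 2)
        Gen2Ratio   = [math]::Round($ratio, 2)
        Pass        = ($growthMb -le 50) -and ($ratio -le 4)
    }
}

# Illustrative numbers only, not measured values.
$example = @{
    FirstWorkingSetBytes = 200MB
    LastWorkingSetBytes  = 230MB
    SoakGen2Count        = 80
    SoakHours            = 8
    BaselineGen2PerHour  = 5
}
Test-SoakGate @example
```

With the illustrative inputs the growth is 30 MB and the Gen-2 rate is 2× the baseline, so `Pass` is true; raising the last-second value to 260MB flips it to false.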
## Report format when finished
- files created and modified (note: there is no source code change in this pass)
- both captured row blocks (the 22-metric markdown tables) included in the report
- working-set growth (MB) value for Soak8h, fault-cycles (count) value and
per-branch log-evidence counts for ChaosMonkey, with one-sentence
interpretation of each
- a single commit hash
- commit message: "feat(measurements): close SLICE-1.4 and Phase 1; chaos-monkey + 8h soak rows + runbook §4.5/§4.6 (pass 3/3 of TASK-1.4)"
Operator notes
- One pass per Copilot session. Same protocol as TASK-1.2 / TASK-1.3.
- Pass 1 keeps the seven new fields default-zero / 1.0. Existing profiles must compile and run unchanged. Any test regression in the existing slice-1-1 / slice-1-2 / slice-1-3 measurement reproducibility is a Pass 1 bug — do not paper over it in Pass 2.
- Pass 2's load-bearing detail is the conditional decorator wiring. When `Simulator:FlakySdk:Enabled == false`, the decorator must not be in the call path: `IMachineConnection` resolves directly to `SimulatedMachineConnection`. A test asserts this. The decorator's no-op-when-disabled branch is not the same as not registering the decorator at all; we want the latter so an existing `Normal` / `Demo` / `MultiTag` capture is bit-for-bit reproducible.
- Pass 3's 8-hour soak is the slice's one true gate. Working-set growth ≤ 50 MB is non-negotiable. If the soak fails, the slice is not done; do not paper over it by adjusting the criterion. Phase 2 may end up motivated by exactly the leak that the soak surfaces, in which case the row stays in the table as evidence and Phase 2 opens.
- `AlarmBurstEveryMs = 0` under Soak8h is intentional. The alarm-burst path interrupts runs and dominates run throughput; under the soak, we want continuous runs accumulating wall-clock hours. ChaosMonkey gets `AlarmBurstEveryMs = 45_000`; Soak8h gets `0`. Do not make the two profiles share the same alarm cadence "for symmetry"; they have different jobs.
- `TimeCompressionFactor: 1.0` on both new profiles. Time compression's main use is for tightening developer feedback loops on the chaos paths; for the slice's exit-gate captures, real-time keeps the data plane representative. Keep the field plumbed (Pass 1) and validated (Pass 1) but use it at `1.0` for the captures (Pass 3).
- The 30-minute ChaosMonkey is verified by log inspection, not counters. Counters say "X faults occurred"; logs say "the fault occurred during Homing" / "during Running" / etc. Pass 3's runbook §4.5 includes the recipe so future captures reproduce the verification without bespoke per-capture code.
- The flaky-SDK decorator is connection-only. Wrapping `IMotionController` is a documented follow-up. The slice's spec does not gate on motion-side flakiness; the connect-side coverage of `DoConnectAsync` is enough to claim the criterion-A coverage of the connect-fail / out-of-band-throw branches of `WorkflowService`.
- Update the index files only at the end of the phase, not per-slice. Same rationale as earlier tasks. Phase 1's full retrospective banner goes into `roadmap-progress.md` under the Phase 1 heading after Pass 3 lands.