Skip to content

SLICE-2.0: Store Allocation & Contention Profiling

Goal

Instrument AppStateStore.Update to measure (a) per-call allocation, (b) lock-wait time, (c) call rate, and (d) caller distribution. Run a 30-minute capture under the MultiTag profile and produce a slice-2-0-store-profiling row that quantifies what fraction of the application's total allocation rate flows through AppStateStore.Update and how contended the store lock actually is. The resulting numbers determine whether the originally planned Phase 2 slices (2.1, 2.2, 2.3, 2.4) are load-bearing as specified, need reordering, or can be deferred. This slice ships measurement only — no architectural refactoring.

Why This Slice

The Phase 1 exit-gate evidence (rows slice-1-1 through slice-1-4-soak-8h) shows the application surviving load beautifully: frames.dropped near zero, runs.faulted bounded, working-set steady-state flat across 8 hours, GC pause p95 stable at 7.9–12.4 ms. The roadmap §3 explicitly says: "If the app survives this run beautifully, Phase 2 is deferred. If not, we know exactly which slice of the store to attack first."

The app did survive beautifully — but AppStateStore.Update allocation share and lock-wait time are unmeasured. Phase 2's planned slices were specified with concrete exit gates that depend on these unmeasured baselines:

  • SLICE-2.1 (slice AppState into sub-records) — exit gate: "Allocations/sec at AppStateStore.Update drop by ≥ 40% under the Phase-1 soak profile." Requires a baseline.
  • SLICE-2.2 (immutable collections) — purpose: avoid List<T> rebuilds on hot paths. Without seeing a measurable allocation share, we can't tell if the rebuilds are even visible in the trace.
  • SLICE-2.3 (data-plane lift-out for telemetry / frames / diagnostics) — already done for the encoder stream (SLICE-1.3 bypasses AppState). For tags/frames/diagnostics: justified only if those callers dominate the AppState write rate.
  • SLICE-2.4 (per-slice observables) — exit gate: "Lock-wait on the store stays under a measured threshold." Requires the threshold to be measured first.

Opening any of those slices without a baseline means committing to a refactor whose payoff cannot be quantified. SLICE-2.0 produces the baseline. Its row block is the measured justification for Phase 2 that the roadmap has been pointing at since 2026-04-22.

Requirements Coverage

In Scope

Instrumentation in AppStateStore.Update

The current AppStateStore.Update(Func<AppState, AppState> reducer) becomes:

csharp
public void Update(
    Func<AppState, AppState> reducer,
    [CallerMemberName] string callerMember = "",
    [CallerFilePath]   string callerFile   = "")
{
    var caller = string.IsNullOrEmpty(callerFile)
        ? callerMember
        : $"{Path.GetFileNameWithoutExtension(callerFile)}.{callerMember}";

    var sw = Stopwatch.StartNew();
    long allocBefore = GC.GetAllocatedBytesForCurrentThread();
    AppState next;
    lock (_lock)
    {
        // Lock-wait time = total elapsed up to this point (acquired the lock)
        var lockWaitMicros = sw.Elapsed.TotalMicroseconds;
        _metrics.StoreUpdateLockWaitMicros.Record(lockWaitMicros);

        next = reducer(_current);
        _current = next;
    }
    long allocBytes = GC.GetAllocatedBytesForCurrentThread() - allocBefore;
    _metrics.StoreUpdateAllocBytes.Add(allocBytes);
    _metrics.StoreUpdateCalls.Add(1, new KeyValuePair<string, object?>("caller", caller));

    StateChanged?.Invoke(next);
}

The two [Caller*] parameters are compile-resolved by the C# compiler at every callsite — no callsite changes required. The resulting caller string is a <TypeFileName>.<MethodName> pair, e.g. FramePipelineService.ExecuteAsync, WorkflowService.OnFaultInjected. Multiple Update calls inside the same method aggregate under one key; that granularity is sufficient for SLICE-2.0's prioritization purpose.

The IAppStateStore interface gains the same two optional parameters so the abstraction matches the implementation; existing test doubles compile unchanged because the parameters are optional and the values default at the callsite.

New AppMetrics counters

Three new counters on the existing InspectionPrototype meter:

  • store.update.calls (counter, dimension caller) — total count, broken down by caller identity.
  • store.update.alloc.bytes (counter, no dimension) — total allocation bytes summed across all Update calls.
  • store.update.lock_wait.micros (histogram) — distribution of lock-wait time in microseconds. Histogram boundaries [1, 10, 100, 1_000, 10_000, 100_000] (1 µs → 100 ms).

StateChanged event firing semantics, return value, and the Current getter remain unchanged. The lock is held for the same duration as before plus the cost of one _metrics.StoreUpdateLockWaitMicros.Record(...) call (~50 ns); the alloc-delta and call-count emissions happen outside the lock.

Measurement-extraction helpers

tools/MeasurementExtraction.psm1 gains four helpers:

  • Get-StoreUpdateRate -Csv $rows -DurationSeconds $dur — returns store.update.calls total / duration, in calls/sec.
  • Get-StoreUpdateAllocShare -Csv $rows -DurationSeconds $dur — returns the ratio store.update.alloc.bytes total / dotnet.gc.heap.total_allocated total as a percentage, rounded to 1 decimal.
  • Get-StoreUpdateLockWaitP95 -Csv $rows — returns the p95 of store.update.lock_wait.micros as a number in microseconds.
  • Get-StoreUpdateCallerDistribution -Csv $rows — returns the top 5 callers by call count as an [ordered] array [{caller=…; calls=…; share_pct=…}].

ConvertTo-MeasurementRow adds four rows to the markdown block (becomes 26-metric for captures that include the new counters; preserves the SLICE-1.2/1.3/1.4 "—" sentinel pattern when the input CSV pre-dates SLICE-2.0):

| store.update rate (calls/s)         | <Get-StoreUpdateRate>            |
| store.update alloc share (%)        | <Get-StoreUpdateAllocShare>      |
| store.update lock-wait p95 (µs)     | <Get-StoreUpdateLockWaitP95>     |
| store.update top-caller             | <top entry from Get-StoreUpdateCallerDistribution> |

The full caller-distribution table (top 5) lives in the row's Notes section, not in the metric table, because it is multi-row and exceeds the table format.

Measurement scenario

A new §5.1 in docs/runbook/capturing-measurements.md (the new §5.x section bracket signals Phase 2 captures, distinct from §4.x Phase-1):

  • §5.1 Store-allocation profiling — SLICE-2.0, MultiTag profile. 30-minute capture using the existing MultiTagSoakFlaUi scenario with -Profile MultiTag. The MultiTag profile is chosen because it exercises the highest sustained Update rate of any baseline profile (50-tag emitters at the 50 ms snapshot interval drive ~20 Hz tag-snapshot writes plus per-frame writes plus per-position-change writes).

No new FlaUI scenario class. Reuses the SLICE-1.6 rig.

Measurement row

A new file docs/reviews/phase-2-measurements.md is created (mirrors phase-1-measurements.md structure) with one row block, slice-2-0-store-profiling. Baseline column: slice-1-1-multi-tag-telemetry for the 22 metrics that overlap (tag/frame/encoder rates, GC, working-set, etc.); "—" for the four new SLICE-2.0 metrics since the baseline pre-dates them.

Decision rubric (informative, not gating)

The row block's Notes section includes an explicit Phase 2 prioritization recommendation based on the captured numbers:

  • store.update alloc share < 10% → SLICE-2.1 (slice AppState into sub-records) and SLICE-2.2 (immutable collections) are deferred. Their stated exit gates require ≥ 40% allocation reduction at AppStateStore.Update; if the absolute number is small, even a 100% reduction is below noise. Document the deferral with the measured share.
  • store.update alloc share ≥ 10% → SLICE-2.1 opens. Write its spec with the measured share as the baseline; the ≥ 40% reduction target becomes a concrete number (e.g., "from 22% share to ≤ 13% share").
  • store.update lock-wait p95 < 100 µs → SLICE-2.4 (per-slice observables) is deferred. Lock contention is not the bottleneck.
  • store.update lock-wait p95 ≥ 100 µs → SLICE-2.4 opens, with the measured p95 as the baseline.
  • Top caller is TagStreamPipelineService.* with > 50% of total calls → SLICE-2.3 (data-plane lift-out for tags) becomes the priority. Mirrors the SLICE-1.3 encoder bypass: tag stream becomes a channel-drain pipeline that doesn't write through AppState. The other 2.3 sub-targets (frames, diagnostics) are evaluated separately.
  • Top caller is FramePipelineService.* with > 50% → SLICE-2.3 (frames bypass) is the priority instead.
  • Top caller is workflow-state-machine code (WorkflowService.*, OnFaultInjected, etc.) with > 50% → no Phase 2 slice fits cleanly. The store is doing exactly the job it was designed for. Recommend Phase 2 deferral entirely.

The row's Notes section spells out which case the captured data lands in and explicitly proposes (or defers) the next slice. Phase 2 spec authoring follows from this recommendation, not from the original roadmap text.

Out of Scope

  • Refactoring AppStateStore (slicing, immutable collections, observables, lift-out) — that's Phase 2.1/2.2/2.3/2.4, opened conditionally based on this slice's evidence.
  • Instrumenting subscribers of StateChanged — that's a Phase 2.4 measurement, not Phase 2 baseline.
  • Per-AppState-field allocation breakdown (e.g., "how much of the alloc comes from LatestTagValues rebuild vs RunHistory.Prepend") — out of scope for this baseline; would require deeper instrumentation. If 2.0 evidence motivates 2.1, the 2.1 task can add per-field accounting at that point.
  • Stack-walk-based caller resolution (slower, more accurate) — [CallerMemberName] + [CallerFilePath] is sufficient for the prioritization decision.
  • Removing or renaming the StateChanged event — semantics unchanged.
  • Adding new profiles — uses existing MultiTag.
  • A new IScenario class — MultiTagSoakFlaUi with -Profile MultiTag is the capture path.
  • Writing the actual Phase 2 specs (2.1/2.2/2.3/2.4) — out of scope; SLICE-2.0's row block names which Phase 2 slices to write next, but writing them is itself a separate slice-opening session.
  • Changing IAppStateStore.Update's required arguments — the two new parameters are optional with [Caller*] defaults; no existing callsite changes.

Runtime Behavior

Performance impact of instrumentation

  • Stopwatch.StartNew(): ~20 ns per call.
  • GC.GetAllocatedBytesForCurrentThread(): ~30 ns per call (×2 = before + after).
  • Counter.Add(1, KVP): ~50 ns per call.
  • Histogram.Record(double): ~100 ns per call.
  • Total per-Update overhead: ~250 ns.

AppStateStore.Update is called at most ~50 Hz under MultiTag (extrapolating from row 0a's 4.95 Hz × the 10× higher rate MultiTag drives via the 50 ms snapshot interval). 50 calls/sec × 250 ns/call = 12.5 µs/sec of CPU overhead, < 0.001% of a 1 GHz core. Below noise. The instrumentation is always-on; no opt-in flag.

Caller-string semantics

[CallerMemberName] returns the method name where the call site lives. [CallerFilePath] returns the absolute file path of that source file. We strip the path to the filename (no extension) and concatenate. Examples:

  • FramePipelineService.ExecuteAsync (frame-pipeline drain loop)
  • FramePipelineService.ProcessDefectsForFrame (defect detection)
  • TagStreamPipelineService.ExecuteAsync (tag snapshot drain)
  • WorkflowService.OnFaultInjected (critical fault path)
  • WorkflowService.OnPositionChanged (motion-update event)
  • MainViewModel.OnRecipeListUpdated (UI-driven update)

Some methods make multiple Update calls (e.g., WorkflowService.RunLoopAsync has several writes per scan-point cycle). Those aggregate under one key; sub-key granularity is out of scope.

Histogram boundaries

store.update.lock_wait.micros uses fixed boundaries [1, 10, 100, 1_000, 10_000, 100_000] (1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms). Most calls under steady load should land in the 1-10 µs bucket (uncontended lock acquisition). Contended cases (frame-pipeline write fighting with tag-stream write fighting with workflow write) would land in 10-100 µs. Anything in 1 ms+ buckets indicates real contention and would directly motivate SLICE-2.4.

Behavior when no other consumers register

store.update.calls, .alloc.bytes, and .lock_wait.micros follow the same lifecycle as the existing InspectionPrototype meter counters. They appear in dotnet-counters monitor output and dotnet-counters collect --format csv. Unit tests use a MeterListener to verify increments (same pattern as AppMetricsTagDimensionTests and AppMetricsEncoderDimensionTests).

Test-double compatibility

Existing IAppStateStore test fakes (e.g., RecordingAppStateStore from SLICE-1.3 tests) implement Update(Func<AppState, AppState>) directly. The new optional parameters mean void Update(Func<AppState, AppState> reducer, string callerMember = "", string callerFile = "") — fakes that implement the interface need to add the parameters (compile-only). Existing test setups do not need to pass the parameters explicitly because the C# compiler injects the [CallerMemberName] / [CallerFilePath] defaults.

Acceptance Criteria

This slice is satisfied only if all of the following are true:

  1. AppStateStore.Update is instrumented per "Instrumentation in AppStateStore.Update": [CallerMemberName] + [CallerFilePath] parameters added; Stopwatch measures lock-wait; GC.GetAllocatedBytesForCurrentThread() measures alloc delta; three new AppMetrics counters are emitted per call. The lock is held for the same logical duration as before (measure lock-wait before entering critical section; emit alloc/calls after exiting); reducer execution and _current = next assignment remain inside the lock.
  2. IAppStateStore interface gains the two optional [Caller*] parameters with empty-string defaults. All existing Update(...) callsites in the codebase compile unchanged. Existing test fakes (RecordingAppStateStore and any others under tests/.../Stubs/) implement the new signature with the same defaults.
  3. AppMetrics exposes StoreUpdateCalls (Counter<long>), StoreUpdateAllocBytes (Counter<long>), and StoreUpdateLockWaitMicros (Histogram<double>). XML doc explains the caller dimension on StoreUpdateCalls. The existing 11 counters (frames, telemetry, runs, samples, encoder, tags-active gauge) are unchanged.
  4. MeasurementExtraction.psm1 exports Get-StoreUpdateRate, Get-StoreUpdateAllocShare, Get-StoreUpdateLockWaitP95, Get-StoreUpdateCallerDistribution. Pester tests cover each helper with synthetic CSV input, including the "absent counter" sentinel case (returns $null or "—" per the SLICE-1.2/1.3/1.4 convention).
  5. ConvertTo-MeasurementRow emits the four new metric rows when the input CSV contains the new counters, and "—" sentinels otherwise. Existing rows in phase-1-measurements.md remain reproducible: re-running ConvertTo-MeasurementRow against slice-1-1-multi-tag-telemetry.csv produces a row whose first 22 metrics match the existing markdown bit-for-bit, with four new "—" rows appended.
  6. A new file docs/reviews/phase-2-measurements.md exists with the same conventions as phase-1-measurements.md (Conventions / Fixed metric set / row blocks / Notes) and one row block tagged slice-2-0-store-profiling.
  7. A 30-minute MultiTag capture under MultiTagSoakFlaUi (commit at-or-after this slice's merge) produces the row block. The CSV is committed at docs/captures/slice-2-0-store-profiling-<date>.csv.
  8. The row's Notes section includes an explicit Phase 2 prioritization recommendation following the rubric in "Decision rubric": one of {2.1 opens / 2.2 opens / 2.3 (tags) opens / 2.3 (frames) opens / 2.4 opens / Phase 2 deferred entirely}. The recommendation cites the measured numbers that drove the conclusion.
  9. docs/runbook/capturing-measurements.md gains §5.1 (Phase-2 captures) describing the procedure. The §4.7+ Phase-2 placeholder added in SLICE-1.4 close-out is replaced.
  10. The full existing test suite still passes, plus new tests covering: AppStateStoreInstrumentationTests verifies that calling Update increments StoreUpdateCalls{caller=…}, StoreUpdateAllocBytes (≥ 0), and records to StoreUpdateLockWaitMicros; verifies the lock-wait timing path (no measurable alloc inside the lock-wait Stopwatch.Elapsed read); verifies caller dimension correctly identifies the calling type+method via [CallerMemberName]/[CallerFilePath]. Pester tests for the four new helpers. dotnet test runtime stays under 90 seconds.
  11. No regression to prior measurement rows: Capture-Measurements.ps1 -Profile MultiTag produces a row equivalent to slice-1-1-multi-tag-telemetry within the existing per-tag accuracy bounds (criterion 16 of SLICE-1.4 carried forward).
  12. Phase 2 prioritization decision (criterion 8) drives the next slice-opening session: if the recommendation says "SLICE-2.1 opens", the next session writes SLICE-2.1 spec and TASK-2.1 task; if it says "Phase 2 deferred", CLAUDE.md current-position records that and the next session opens Phase 3 evaluation. SLICE-2.0's job ends at producing the recommendation.

Verification Notes

The implementation task for this spec must include verification for:

  • Instrumentation overhead is below the per-Update budget. The slice's "Performance impact of instrumentation" estimate is ~250 ns per call. Verify by capturing a 60-second MultiTag smoke and asserting that cpu-usage avg (%) row delta vs the pre-SLICE-2.0 slice-1-1-multi-tag-telemetry row is within ±2 percentage points. If the delta is larger, the instrumentation is too heavy; investigate before the 30-min capture.
  • The caller dimension correctly identifies the caller. Unit test calls Update from inside a method named TestCaller_RegistersAsCallerName and asserts the emitted dimension contains both the method name and the test file's name (sans extension).
  • Allocation measurement is deterministic. A unit test invokes Update with a reducer that does a known allocation (e.g., new List<int>(1024) of known size), reads the emitted StoreUpdateAllocBytes increment, and asserts it is within ±1 KB of the expected size. Unit tests must call GC.Collect() + GC.WaitForPendingFinalizers() before the assertion to drain background allocations.
  • Lock-wait timing distribution. Under no contention (single thread calling Update 1 000 times serially), assert p95 < 10 µs. Under contention (10 threads each calling Update 100 times), assert p95 > 0 µs and < 1 ms (the lock should not be the dominant cost; if p95 hits 10 ms+, something is wrong).
  • The 30-minute capture decision rubric. The Notes section's prioritization recommendation must cite both the measured numbers and the matching rubric line. A reviewer reading only the Notes should be able to reproduce the decision without re-running the capture.
  • MultiTag profile reproducibility. The first 22 metrics of the slice-2-0-store-profiling row must match the slice-1-1-multi-tag-telemetry row within the per-tag accuracy bounds amended in SLICE-1.1's criterion 7. This is the criterion-11 reproducibility check carried forward; if the new instrumentation perturbs the data plane enough to break reproducibility, the slice is too heavy.
  • The deferred-Phase-2 case is not a regression. If the captured data leads the rubric to "Phase 2 deferred entirely", that is a successful outcome — not a failure to motivate refactoring. The roadmap's principle ("simulator-first, measured, then refactor — premature refactoring is the most expensive mistake available here") makes deferral a first-class outcome of the measurement.

Docs-first project memory for AI-assisted implementation.