# Feature Specification: Test Runtime Trend Reporting & Baseline Recalibration

**Feature Branch**: `211-runtime-trend-recalibration`
**Created**: 2026-04-17
**Status**: Implemented (local validation complete)
**Input**: User description: "Spec 211 - Test Runtime Trend Reporting & Baseline Recalibration"

## Spec Candidate Check *(mandatory — SPEC-GATE-001)*

- **Problem**: TenantPilot's test-suite governance is now enforceable per run, but maintainers still lack a shared time-series view of how lane runtime, hotspot cost, and budget headroom evolve over time.
- **Today's failure**: A lane can erode gradually without obvious alarm, a noisy outlier can be mistaken for structural regression, and baseline or budget changes can happen without consistent evidence or policy.
- **User-visible improvement**: Contributors and reviewers get readable lane trend summaries that show health, deterioration, hotspot drift, and whether recalibration is justified before a lane becomes a repeated blocker.
- **Smallest enterprise-capable version**: Reuse the existing governed lane artifacts to retain bounded runtime history, compare current versus previous versus baseline versus budget, classify drift states, surface dominant hotspots, and document explicit baseline and budget recalibration rules.
- **Explicit non-goals**: No new lane taxonomy, no new general-purpose analytics platform, no automatic budget inflation, no mandate to optimize every slow file inside this spec, and no unlimited raw-history retention.
- **Permanent complexity imported**: Runtime trend data contract, bounded history artifacts, drift classification vocabulary, hotspot comparison rules, recalibration policy, summary semantics, and contributor guidance.
- **Why now**: Specs 206 through 210 established lane execution, fixture cost reduction, heavy-lane separation, and CI enforcement; without a trend layer the team can only react after drift already starts blocking shared flow.
- **Why not local**: Private spreadsheets or ad hoc comparisons cannot produce shared, reviewable evidence or a consistent recalibration process that survives reviewer and maintainer turnover.
- **Approval class**: Cleanup
- **Red flags triggered**: New historical artifact retention, new drift-status vocabulary, and new recalibration policy. Defense: the feature stays repo-scoped, derives from existing lane outputs, and intentionally avoids becoming a second analytics system.
- **Score**: Benefit: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 1 | Reuse: 2 | **Total: 10/12**
- **Decision**: approve

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace
- **Primary Routes**: No end-user HTTP routes change. The affected surfaces are repository-owned lane reports, trend summaries, recalibration guidance, and CI/runtime artifacts.
- **Data Ownership**: Workspace-owned runtime history artifacts, trend summaries, budget and baseline policy, and contributor guidance. No tenant-owned records or product runtime tables are introduced.
- **RBAC**: No end-user authorization behavior changes. The actors are contributors, reviewers, maintainers, and CI runners consuming the shared test-governance contract.

## Proportionality Review *(mandatory when structural complexity is introduced)*

- **New source of truth?**: no
- **New persisted entity/table/artifact?**: yes, but only repository-owned historical runtime and trend artifacts derived from existing governed lane outputs
- **New abstraction?**: yes, but limited to a repo-level trend model, drift classification, and recalibration policy
- **New enum/state/reason family?**: yes, but only repository-level lane health states such as `healthy`, `budget-near`, `trending-worse`, `regressed`, and `unstable`
- **New cross-domain UI framework/taxonomy?**: no
- **Current operator problem**: Maintainers can enforce budgets per run, but they cannot yet see whether a lane is drifting, whether a hotspot is growing, or whether recalibration is evidence-based instead of reactive.
- **Existing structure is insufficient because**: Current CI evidence is mostly run-by-run and cannot reliably distinguish sustained erosion, legitimate suite growth, or runner noise without manual reconstruction.
- **Narrowest correct implementation**: Add bounded history and derived trend summaries on top of existing governed lane artifacts instead of inventing new lanes, new product persistence, or a broader analytics stack.
- **Ownership cost**: The team must maintain trend retention rules, drift thresholds, recalibration guidance, and representative example evidence as runner behavior and suite composition evolve.
- **Alternative intentionally rejected**: Ad hoc manual comparisons, one-off spreadsheets, or allowing budgets to silently move upward with each overrun.
- **Release truth**: Current-release repository truth required to make Specs 206 through 210 durable over time.

## Problem Statement

Specs 206 through 210 moved TenantPilot's test suite into a governed operating model:

- Lanes and budgets exist.
- Shared fixture cost has been reduced.
- Heavy Filament or Livewire families have been segmented.
- Heavy-governance cost is treated honestly.
- CI runs the governed lanes and evaluates budgets.

What is still missing is the time dimension. The repository can usually tell whether one run is green or red, but it still cannot answer the more strategic questions:

- Is a lane slowly getting worse even though it still passes?
- Is a budget warning noise, early erosion, or a genuine regression?
- Did the suite legitimately grow, or did the budget simply drift upward by habit?
- Are the dominant hotspots stable, worsening, or newly emerging?

Without historical observability and explicit recalibration policy, test governance remains operational rather than strategic.

## Dependencies

- Depends on Spec 206 - Test Suite Governance & Performance Foundation for lane vocabulary, budgets, and checked-in reporting entry points.
- Depends on Spec 207 - Shared Test Fixture Slimming for more credible lane cost signals.
- Depends on Spec 208 - Filament/Livewire Heavy Suite Segmentation for honest separation of expensive families.
- Depends on Spec 209 - Heavy Governance Lane Cost Reduction for a more stable heavy-lane baseline.
- Depends on Spec 210 - CI Test Matrix & Runtime Budget Enforcement for governed CI artifacts, budget evidence, and per-run enforcement semantics.
- Recommended after stable CI lanes, reproducible lane artifacts, and functioning budget enforcement are already available.
- Blocks durable long-horizon budget stewardship and trend-based test governance.
- Does not block normal feature delivery or daily CI execution.

## Goals

- Make runtime evolution visible for each primary lane over time.
- Compare current values against both baselines and budgets.
- Detect budget erosion before hard gates fail repeatedly.
- Define explicit policy for baseline recalibration.
- Define explicit policy for budget recalibration.
- Track hotspot and family cost shifts over time.
- Distinguish runner noise from true regression.

## Non-Goals

- Optimizing every individual slow test file within this spec.
- Creating another lane-segmentation feature.
- Replacing CI budget enforcement rather than complementing it.
- Building a general analytics platform for every CI metric.
- Turning trend reporting into a broad dashboard project unrelated to test runtime governance.
- Requiring unlimited historical retention of raw CI outputs.

## Assumptions

- The lane wrappers and artifact contracts created by Specs 206 through 210 remain the authoritative inputs for any trend layer.
- Representative run references, timestamps, or commit identifiers are available for governed lane outputs.
- History retention can be bounded without losing enough evidence to justify recalibration decisions.
- CI noise is real and should be treated as ordinary variance rather than proof of regression by default.

## Key Decisions

- **Budgets and baselines are different**: A budget is a governance limit, while a baseline is a reference point. They must not drift together automatically.
- **Trend visibility complements hard enforcement**: The existing red or green contract stays in place; trend reporting adds foresight rather than replacing gates.
- **Recalibration must be explicit**: Baseline or budget changes require documented evidence and reasoning.
- **Noise-aware governance matters**: Single noisy runs should not dominate decisions.
- **Lane-first governance remains primary**: File and family hotspots inform the decision, but the lane stays the main governance unit.
- **Historical observability must stay lightweight**: The first slice should aid decisions without becoming a second BI system.

## Test Governance Impact *(mandatory — TEST-GOV-001)*

- **Validation lane(s)**: `fast-feedback`, `confidence`, `heavy-governance`, `browser`, plus `junit` and `profiling` when they supply hotspot or comparison evidence.
- **Why these lanes are sufficient**: They cover the full governed cost classes already recognized by the repository, including both primary operational lanes and the support evidence used to explain hotspots and compare scope.
- **New or expanded test families**: No new product-facing test family is required. The feature may add lightweight repo-level guard coverage for trend parsing, drift classification, recalibration reasoning, and summary generation.
- **Fixture / helper cost impact**: Low and bounded. The feature MUST stay inside repo-level reporting, artifact retention, and documentation. It MUST NOT add shared product fixtures, broaden default setup, or widen heavy suite membership.
- **Heavy coverage justification**: None beyond consuming the existing `heavy-governance` and `browser` lanes as evidence sources. The feature introduces no new heavy-governance or browser scenarios.
- **Budget / baseline / trend impact**: This feature formalizes trend headroom, drift states, and recalibration criteria. Any threshold tuning, material runtime drift, or recalibration follow-up discovered during rollout MUST be documented in this spec or the active implementation PR rather than silently absorbed into budgets or left only in quickstart notes.
- **Planned validation commands**: `./scripts/platform-test-lane fast-feedback`, `./scripts/platform-test-lane confidence`, `./scripts/platform-test-report fast-feedback`, and `./scripts/platform-test-report confidence` for routine reviewer validation. Representative `heavy-governance`, `browser`, `junit`, and `profiling` evidence should come from the same checked-in lane/report entry points rather than ad hoc commands.

## Trend Reporting Minimum Surface

### Lane Runtime Trend Model

For each relevant lane, the trend surface must show at least:

- current runtime
- previous comparable runtime
- baseline runtime
- budget target
- delta to previous runtime
- delta to baseline runtime
- current health classification
- recent history window sufficient to show direction rather than a single point

### Runtime History Contract

Each retained trend record must remain reproducible enough to justify later decisions and must preserve at least:

- run, commit, or timestamp reference
- lane name
- measured runtime
- budget outcome or headroom state
- baseline reference used for comparison
- hotspot or family summary when available
- enough provenance to explain whether the record is directly comparable to adjacent runs

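A retained record satisfying this contract could be as small as the JSON sketch below. The key names are illustrative rather than the repository's actual contract; the run reference is a placeholder, and the numbers mirror the representative stable `fast-feedback` window recorded elsewhere in this spec.

```json
{
  "lane": "fast-feedback",
  "runRef": "<commit-or-run-id>",
  "recordedAt": "2026-04-17T09:30:00Z",
  "runtimeSeconds": 176.73,
  "baselineSeconds": 176.74,
  "budgetSeconds": 200,
  "budgetOutcome": "within-budget",
  "health": "healthy",
  "comparableToPrevious": true,
  "hotspots": null
}
```

A `null` hotspot field here stands for "hotspot evidence unavailable", matching the explicit-disclosure rule used throughout this spec.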
### Drift Detection Outcomes

Trend reporting must distinguish at least these lane states:

- `healthy`
- `budget-near`
- `trending-worse`
- `regressed`
- `unstable`

The model must be able to show intermediate deterioration without collapsing every non-healthy case into a single hard failure signal.

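A minimal classification sketch follows, using the state names above. The thresholds, window length, and overrun-count rule are illustrative defaults, not repository policy; the actual cutoffs are a recalibration-governed decision.

```python
def classify_lane(history_s: list[float], baseline_s: float, budget_s: float,
                  near_ratio: float = 0.9, regress_ratio: float = 1.15) -> str:
    """Map a bounded runtime window (oldest to newest, seconds) to a drift state.

    Thresholds and window sizes here are illustrative, not policy.
    """
    if len(history_s) < 3:
        return "unstable"  # insufficient history: refuse to claim a trend
    current = history_s[-1]
    if current > budget_s:
        # only sustained overruns count as regression; a lone spike stays "unstable"
        overruns = [s for s in history_s[-3:] if s > budget_s]
        return "regressed" if len(overruns) >= 2 else "unstable"
    if current >= near_ratio * budget_s:
        return "budget-near"
    # worsening: clearly above baseline AND monotonically rising over the last three runs
    rising = all(a < b for a, b in zip(history_s[-3:], history_s[-2:]))
    if current > regress_ratio * baseline_s and rising:
        return "trending-worse"
    return "healthy"
```

Note how this shape satisfies FR-005 implicitly: a single over-budget spike inside an otherwise normal window yields `unstable`, not `regressed`, so one noisy run never receives the same treatment as repeated deterioration.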
### Hotspot Trend Visibility

Trend reporting must expose the dominant cost drivers for each primary lane in a way that shows:

- top cost drivers for the current reporting window
- change against the reference window
- newly dominant families or files
- persistent known hotspots that continue to dominate cost

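One way to derive the stable / worsening / newly-dominant split, assuming per-family (or per-file) cost maps for the current and reference windows. The function name, `top_n`, and the 10% worsening ratio are illustrative assumptions, not the repository's actual rules.

```python
def compare_hotspots(current: dict[str, float], reference: dict[str, float],
                     top_n: int = 3, worsen_ratio: float = 1.10) -> dict[str, list[str]]:
    """Split the top current cost drivers into newly-dominant / worsening / stable.

    `current` and `reference` map a family or file name to its cost in seconds.
    """
    top = sorted(current, key=current.get, reverse=True)[:top_n]
    ref_top = set(sorted(reference, key=reference.get, reverse=True)[:top_n])
    out: dict[str, list[str]] = {"newly-dominant": [], "worsening": [], "stable": []}
    for name in top:
        if name not in ref_top:
            out["newly-dominant"].append(name)      # absent from the reference top set
        elif current[name] > worsen_ratio * reference.get(name, 0.0):
            out["worsening"].append(name)           # still dominant, and measurably slower
        else:
            out["stable"].append(name)              # persistent known hotspot
    return out
```

Persistence falls out for free: a family that stays in the top set across successive windows keeps appearing under `stable` (or `worsening`), which is exactly the signal needed to justify targeted follow-up work.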
### Readable Summary Surface

Each reporting cycle must publish a concise summary that makes it immediately clear:

- which lanes are healthy
- which lanes are near budget
- which lanes are worsening or regressed
- whether recalibration should be discussed
- which hotspots dominate the lanes that need attention

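A summary meeting these bullets can be as small as one table per cycle. The layout below is a sketch, not a mandated format; the `## Lane trend` section name and the figures mirror the recorded validation evidence in this spec, while the column set is an illustrative choice.

```markdown
## Lane trend

| Lane | Current | Budget | Health | Top hotspot | Recalibration? |
|------|---------|--------|--------|-------------|----------------|
| fast-feedback | 176.73s | 200s | healthy | unavailable | no |
| confidence | 433.00s | 450s | budget-near | unavailable | discuss |
```

The point of the format is decidability: a reviewer should be able to answer all five bullets from this one table without opening raw lane outputs.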
## Required Validation Evidence Set

- One recent sequence of at least three comparable run samples for each primary lane: `fast-feedback`, `confidence`, `heavy-governance`, and `browser`.
- One support-lane example from `junit` or `profiling` when it materially improves hotspot or comparison evidence.
- One example each for `healthy`, `budget-near`, `trending-worse` or `regressed`, and `unstable` outcomes.
- One example where legitimate lane-scope change justifies baseline recalibration.
- One example where an overrun does not justify either baseline or budget recalibration.
- Material runtime drift, bundle-hydration caveats, and approved or rejected recalibration follow-up must be recorded in this spec or the active implementation PR; quickstart may mirror the same evidence but does not replace the delivery record.
- Each evidence record must identify the run reference, lane, current runtime, previous runtime, baseline, budget, health class, and hotspot summary or an explicit note that hotspot evidence is unavailable.

## Recorded Validation Evidence (2026-04-17)

| Evidence | Lane | Current / Previous / Baseline / Budget | Health | Hotspots | Recalibration |
|----------|------|-----------------------------------------|--------|----------|---------------|
| Live cold-start wrapper refresh via `./scripts/platform-test-report fast-feedback --skip-latest-history` | `fast-feedback` | `120.29s / 120.29s / 176.74s / 200s` | `unstable` with `windowStatus=insufficient-history` | unavailable | budget `rejected` with `manual-hold` because the comparable window was still too short |
| Representative stable window from generated trend-summary fixtures | `fast-feedback` | `176.73s / 178.91s / 176.74s / 200s` | `healthy` | unavailable | none |
| Representative near-budget window from generated trend-classification fixtures | `confidence` | `433.00s / 430.00s / 394.38s / 450s` | `budget-near` | not the focus of this case | investigate only; no automatic repository-truth change |
| Representative noisy window from generated trend-classification fixtures | `fast-feedback` | `170.00s / 195.00s / 176.74s / 200s` | `unstable` with `windowStatus=noisy` | unavailable | none; the report explicitly treats the spike as noise instead of a structural regression |
| Representative hotspot-stable window from generated trend-summary and trend-hotspots fixtures | `confidence` | `394.38s / 401.12s / 394.38s / 450s` | `healthy` | available; `baseline-compare-matrix-workflow` and `onboarding-wizard-enforcement` stayed flat, with the compare-matrix pair remaining the top file hotspots | none |
| Approved baseline reset from generated recalibration fixtures | `fast-feedback` | `176.30s / 176.00s / 182.00s / 200s` | `healthy` | unavailable | baseline `approved` with `post-improvement-reset` after the lane stabilized |
| Rejected budget movement from generated recalibration fixtures | `fast-feedback` | `193.00s / 176.00s / 176.74s / 200s` | `budget-near` | unavailable | budget `rejected` with `noise-rejected`; repository truth stayed unchanged |
| Candidate budget review from generated recalibration fixtures | `confidence` | `460.00s / 420.00s / 394.38s / 450s` | `regressed` | not the focus of this case | budget `candidate` only after a five-run evidence window, proposed `505s`, still requiring human approval |
| Live primary-lane cold-start refresh via repo-root wrappers | `browser` and `heavy-governance` | `109.67s / n/a / n/a / 150s` and `228.34s / n/a / n/a / 300s` | both `unstable` on first refresh | unavailable until a comparable prior window exists | budget `rejected` with `manual-hold` on both first-pass reports |
| Live support-lane refresh via repo-root wrappers | `profiling` and `junit` | `2701.51s / n/a / n/a / 3000s` and `380.14s / n/a / n/a / 450s` | both `unstable` on first refresh | unavailable on cold start | budget `rejected` with `manual-hold`; the `junit` report wrapper path was repaired during this implementation so the documented command now executes |

- Reviewer dry run: the generated markdown summaries remained decidable from the `## Lane trend` section alone within the intended two-minute review window, without opening the raw JSON payloads.
- Bundle hydration note: workflow-owned report refresh now relies on `--fetch-latest-history` plus `TENANTATLAS_GITEA_TOKEN` and explicit `actions: read` plus `contents: read` permissions to pull the newest comparable artifact bundle before regenerating `trend-history.json`.
- Runtime follow-up note: no baseline or budget changed automatically in repository truth during implementation. All recalibration output stayed advisory unless a fixture or spec entry explicitly marked it approved.

## User Scenarios & Testing *(mandatory)*

### User Story 1 - See Lane Drift Before It Becomes A Repeated Gate (Priority: P1)

As a maintainer reviewing governed test runs, I want lane summaries to compare the current runtime against the previous run, the baseline, and the budget so I can spot erosion before a lane becomes a recurring blocker.

**Why this priority**: Early drift detection is the core value of the feature. Without it, governance remains reactive and only responds after breakage is already frequent.

**Independent Test**: Review a representative run sequence for `fast-feedback` and `confidence`, confirm that the summary shows current, previous, baseline, and budget values, and verify that healthy, near-budget, and worsening cases are distinguishable without manual arithmetic.

**Acceptance Scenarios**:

1. **Given** a lane stays near its baseline with comfortable headroom, **When** the trend summary is generated, **Then** the lane is shown as healthy with current, previous, baseline, and budget values visible.
2. **Given** a lane moves closer to its budget across multiple comparable runs, **When** the trend summary is generated, **Then** the lane is shown as budget-near or trending-worse before repeated hard failures begin.
3. **Given** a single run spikes but adjacent runs remain normal, **When** the trend summary is generated, **Then** the lane is treated as unstable or noisy rather than immediately treated as baseline regression.

---

### User Story 2 - Decide Recalibration With Evidence Instead Of Habit (Priority: P1)

As a maintainer responsible for budgets, I want explicit recalibration rules and supporting trend evidence so I can distinguish legitimate suite growth, lane reshaping, infrastructure change, and true regression.

**Why this priority**: Without explicit policy, every slowdown invites arbitrary budget inflation or blanket refusal to recalibrate, and both outcomes weaken governance.

**Independent Test**: Review one representative justified recalibration case and one rejected recalibration case, and confirm that the report plus policy make the outcome understandable without relying on private notes.

**Acceptance Scenarios**:

1. **Given** a lane slows because approved coverage legitimately expands its scope, **When** maintainers review the trend evidence, **Then** baseline recalibration is presented as discussable rather than automatic.
2. **Given** a lane slows because of a regression without approved scope change, **When** maintainers review the trend evidence, **Then** baseline and budget remain unchanged and follow-up performance work is indicated instead.
3. **Given** only runner noise is present, **When** the trend evidence is reviewed, **Then** no immediate baseline or budget recalibration is recommended.

---

### User Story 3 - Track Dominant Hotspots Over Time (Priority: P2)

As a contributor investigating suite slowdown, I want hotspot trend summaries per lane so I can target the dominant family or file based on persistent evidence rather than a single anecdotal slow run.

**Why this priority**: Lane-level health points maintainers toward trouble, but hotspot trend visibility makes follow-up work actionable.

**Independent Test**: Review representative hotspot summaries for each primary lane across multiple runs and confirm that persistent, worsening, newly dominant, and unavailable hotspot states are visible.

**Acceptance Scenarios**:

1. **Given** the dominant hotspot families change between reporting windows, **When** the summary is generated, **Then** newly dominant families are visible without reading raw per-test output.
2. **Given** a known expensive family remains the major cost driver across several runs, **When** the summary is reviewed, **Then** its persistence is clear enough to support targeted follow-up work.
3. **Given** hotspot detail is unavailable for one reporting cycle, **When** the summary is generated, **Then** the report states that the hotspot evidence is incomplete instead of silently omitting context.

### Edge Cases

- The first rollout window has too little history for a given lane; the summary must clearly mark the comparison as insufficient rather than pretending a stable trend exists.
- Lane membership or scope changes make old and new runs only partially comparable; the report must flag that boundary before trend conclusions are drawn.
- A budget exists but the prior baseline is outdated or missing; the report must surface the mismatch rather than hiding it.
- Several lanes move at once after an infrastructure or runner change; the recalibration policy must prevent accidental budget inflation across the board.
- Hotspot evidence is only partially available for one lane; the lane health summary must remain readable while clearly disclosing the missing hotspot context.

## Requirements *(mandatory)*

**Constitution alignment (required):** This feature is repository-only test-governance work. It introduces no Microsoft Graph calls, no product write behavior, no `OperationRun`, and no end-user authorization changes.

**Constitution alignment (PROP-001 / ABSTR-001 / PERSIST-001 / STATE-001 / BLOAT-001):** This feature introduces repository-owned historical artifacts, drift states, and recalibration policy only because per-run enforcement alone is insufficient to govern long-horizon suite behavior. The Proportionality Review above explains why bounded derived history is the narrowest correct implementation.

**Constitution alignment (TEST-GOV-001):** The feature covers the affected validation lanes, keeps heavy and browser scope unchanged, avoids new shared fixture cost, documents expected baseline and budget follow-up, and records the minimal reviewer commands above.

### Functional Requirements

- **FR-001 History Coverage**: The repository MUST retain or derive comparable runtime history for each primary governed lane: Fast Feedback, Confidence, Heavy Governance, and Browser. Support lanes such as JUnit or Profiling MUST be included when they materially improve hotspot or comparison evidence.
- **FR-002 Trend Record Contract**: Each retained trend record MUST include a lane identifier, a run or commit reference, measured runtime, baseline reference, budget context, and enough provenance to compare the record with the immediately preceding relevant record.
- **FR-003 Lane Summary Contract**: Each reporting cycle MUST expose, for every relevant lane, the current runtime, previous runtime, baseline, budget, delta to previous run, delta to baseline, and current lane health classification.
- **FR-004 Drift Health States**: The reporting model MUST distinguish at least the states `healthy`, `budget-near`, `trending-worse`, `regressed`, and `unstable`.
- **FR-005 Noise Handling**: A single anomalous run MUST NOT by itself force a lane into the same treatment as repeated deterioration; the trend model MUST differentiate one-off spikes from sustained erosion.
- **FR-006 Baseline Recalibration Policy**: The repository MUST document when a baseline may be reset, when it must remain unchanged, what evidence window is required, and who is expected to justify the decision.
- **FR-007 Budget Recalibration Policy**: The repository MUST document when a budget may change, when it must not change, and which reasons are considered valid, including deliberate lane-scope change, infrastructure shift, or post-improvement tightening.
- **FR-008 Explicit Recalibration Evidence**: Any approved baseline or budget recalibration MUST be tied to documented evidence showing the before-and-after rationale rather than silently adopting the latest run as the new truth.
- **FR-009 Hotspot Trend Visibility**: Each primary lane trend report MUST expose dominant cost drivers and indicate whether a hotspot is stable, worsening, or newly dominant compared with the reference window.
- **FR-010 Readable Summary**: Each reporting cycle MUST publish a concise summary that lets a reviewer tell which lanes are healthy, near budget, worsening, regressed, or candidates for recalibration without opening raw lane outputs first.
- **FR-011 Contributor Guidance**: Repository guidance MUST explain how to read the trend summary, when authors should react to budget-near or worsening status, when recalibration discussion is appropriate, and when a follow-up performance pass is the correct response instead.
- **FR-012 Bounded Retention**: The history model MUST remain lightweight by using bounded retained evidence sufficient for governance decisions rather than requiring unlimited archival of raw run outputs.
- **FR-013 Validation Examples**: Completion of this feature MUST include representative examples covering at least one healthy lane, one budget-near lane, one repeated worsening or regressed lane, one unstable case, and one justified recalibration case.
- **FR-014 Lane-First Governance**: Trend reporting MUST remain lane-first; hotspot detail may inform the decision, but it MUST NOT replace lane-level status as the primary governance unit.

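FR-006 through FR-008 can be pictured as an explicit, advisory-only gate. The sketch below is an assumption about shape, not the repository's actual policy code: the five-run window length echoes the recorded candidate-budget evidence, and the outcome strings reuse reason codes that appear in that evidence (`manual-hold`, `noise-rejected`) plus one invented here (`evidence-missing`).

```python
def recalibration_status(window_s: list[float], min_window: int = 5,
                         evidence_documented: bool = False,
                         approved_scope_change: bool = False) -> str:
    """Advisory outcome only: even 'candidate' never mutates repository truth.

    Window length and outcome names are illustrative; actual policy lives in
    the repository documentation, not in this sketch.
    """
    if len(window_s) < min_window:
        return "rejected: manual-hold"        # comparable window still too short
    if not evidence_documented:
        return "rejected: evidence-missing"   # FR-008: no silent recalibration
    if not approved_scope_change:
        return "rejected: noise-rejected"     # overrun without approved scope change
    return "candidate"                        # still requires human approval
```

This mirrors the key asymmetry in FR-006/FR-007: every path defaults to rejection, and the only positive outcome is a candidate for human review, never an automatic budget or baseline update.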
### Non-Functional Requirements

- **NFR-001 Decision Speed**: A reviewer must be able to determine the health class of each governed lane from the summary in under two minutes for a normal reporting cycle.
- **NFR-002 Noise Resilience**: The trend model must reduce false regression calls caused by normal CI variance so that a single noisy run remains an exception rather than the default explanation.
- **NFR-003 Operational Weight**: The trend layer must reuse existing governed lane outputs and must not require duplicate full reruns of every primary lane solely to produce routine reporting.

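One noise-resilience tactic consistent with NFR-002 is to compare each run against a robust statistic of its window rather than against the single previous run. The median-absolute-deviation rule below is one illustrative option, not the repository's actual rule; `k` and the 1% spread floor are assumptions.

```python
import statistics


def is_noise_spike(window_s: list[float], current_s: float, k: float = 3.0) -> bool:
    """Treat the current run as a one-off spike when it sits far outside the
    window's robust spread. The MAD rule and k are illustrative choices."""
    if len(window_s) < 3:
        return False  # too little history to call anything noise
    med = statistics.median(window_s)
    mad = statistics.median(abs(s - med) for s in window_s)
    # floor the spread so a perfectly flat window does not become zero-tolerance
    spread = max(mad, 0.01 * med)
    return abs(current_s - med) > k * spread
```

Because the comparison uses the window median, a spike flagged here can be reported as `unstable`/noisy while the baseline, budget, and health trend all stay untouched, which is exactly the separation NFR-002 asks for.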
## Risks

- **Overreacting to CI noise**: If the thresholds are too sensitive, normal runner variability could look like a structural regression.
- **Baseline inflation**: If recalibration is too easy, baseline history loses its value as a reference point.
- **Budget normalization drift**: If every overrun becomes a budget update, the budget model stops functioning as governance.
- **Over-complex reporting**: Too many metrics can make the summary harder to use instead of easier.
- **False precision**: Historical numbers can look more exact than the runner environment really allows.
- **Hotspot overload**: Too much hotspot detail can crowd out the lane-first decision that the report is supposed to support.

## Rollout Guidance

- Define the minimal trend data contract before adding new summary states.
- Introduce per-lane summaries showing current, previous, baseline, and budget values first.
- Add drift classification only after the comparison window is clear.
- Document baseline and budget recalibration policy before tuning thresholds.
- Add hotspot trend visibility for the highest-value lanes after the lane summary is readable.
- Validate the output with real or representative run sequences and adjust thresholds only when the examples show misleading outcomes.
- Keep the first slice minimal and decision-oriented rather than exhaustive.

## Design Rules

- **Budgets are policy, baselines are reference**.
- **Trend output must aid decisions**.
- **No silent recalibration**.
- **Noise-aware, not noise-blind**.
- **Lane-first observability**.
- **Hotspots support, not dominate, governance**.
- **Readable over exhaustive**.

## Deliverables

- A trend-capable runtime history contract or artifact for governed lanes.
- A per-lane trend summary showing current, previous, baseline, budget, and health state.
- A drift-classification model for lane health.
- Documented baseline recalibration policy.
- Documented budget recalibration policy.
- A hotspot trend view for relevant lanes.
- Contributor and reviewer guidance.
- Validation evidence from real or representative governed runs.

### Key Entities *(include if feature involves data)*

- **Lane Trend Record**: A retained runtime snapshot for one governed lane at one reporting point, including runtime, comparison context, and health state.
- **Baseline Reference**: The agreed reference value used to compare later lane runs without acting as the budget itself.
- **Budget Policy**: The governance limit and enforcement posture applied to a lane, distinct from the baseline reference.
- **Drift Status**: The named lane-health classification that distinguishes healthy behavior from near-budget, worsening, regressed, or unstable patterns.
- **Hotspot Trend Snapshot**: A ranked summary of the dominant cost drivers for a lane together with their change relative to the comparison window.
- **Recalibration Decision**: A documented decision that keeps, adjusts, or tightens a baseline or budget based on explicit trend evidence.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: The trend summary covers 100% of primary governed lanes with current runtime, previous runtime, baseline, budget, and health classification visible in the validation evidence.
- **SC-002**: At least three sequential comparable samples are available for each primary governed lane in the validation evidence without requiring manual reconstruction outside repository-owned artifacts or summaries.
- **SC-003**: In the documented validation examples, single noisy outliers are classified differently from repeated deterioration in 100% of cases.
- **SC-004**: The validation evidence includes at least one justified recalibration case and at least one rejected recalibration case, each explainable from retained trend evidence without relying on private notes.
- **SC-005**: For each primary governed lane, the trend output identifies at least the top three dominant cost drivers or explicitly states that hotspot evidence is unavailable.
- **SC-006**: Reviewers can determine within two minutes whether a lane is healthy, budget-near, worsening, regressed, or recalibration-worthy from the generated summary.