feat: implement runtime trend recalibration reporting (#244)

## Summary
- implement Spec 211 runtime trend reporting with bounded lane history, drift classification, hotspot trend output, and recalibration evidence handling
- extend the repo-truth governance seams and workflow wrappers for comparable-bundle hydration, trend artifact publication, and contract-backed reporting
- add the Spec 211 planning artifacts, data model, quickstart, tasks, and repository contract documents

## Validation
- parsed `specs/211-runtime-trend-recalibration/contracts/test-runtime-trend-history.schema.json`
- parsed `specs/211-runtime-trend-recalibration/contracts/test-runtime-trend.logical.openapi.yaml`
- re-ran cross-artifact consistency analysis for the Spec 211 artifact set until no material findings remained
- no application test suite was re-run as part of this final commit/push/PR step

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #244

2026-04-18 07:36:05 +00:00

Research: Test Runtime Trend Reporting & Baseline Recalibration

Decision 1: Persist bounded lane history as an artifact beside the existing lane report outputs

Decision: Add one bounded trend-history.json artifact per governed lane under the existing lane artifact root and stage that same file into the existing CI upload bundle for the lane's workflow profile.
Rationale: The repo already treats summary.md, report.json, budget.json, and junit.xml as the canonical lane outputs. A bounded history file beside those artifacts preserves repository truth, avoids product persistence, and gives the next CI run a portable history window without inventing a database, cache, or commit-on-every-run workflow.
Alternatives considered:
- Store history in a new product database table: rejected because the feature is repository governance, not application runtime truth.
- Commit history files back into the repository on every run: rejected because runtime-generated governance evidence should not create noisy git churn.
- Reconstruct history from many previous artifact bundles every time: rejected because it depends on broader artifact retention and more CI/API complexity than necessary.
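The bounded append described above can be sketched as follows. This is an illustration in Python, not the repository's PHP support code; the `records` wrapper key, the record fields, and the function name `append_bounded` are assumptions rather than the Spec 211 schema.

```python
import json
from pathlib import Path


def append_bounded(history_path: Path, record: dict, limit: int) -> list:
    """Append one run record and keep only the newest `limit` records."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    records = []
    if history_path.exists():
        # Hypothetical layout: {"records": [...]}, oldest first.
        payload = json.loads(history_path.read_text(encoding="utf-8"))
        records = payload.get("records", [])
    records.append(record)
    records = records[-limit:]  # drop the oldest entries beyond the window
    history_path.write_text(
        json.dumps({"records": records}, indent=2), encoding="utf-8"
    )
    return records
```

Because the file always holds the full window, the next run can continue from this single artifact without reading any older bundles.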

Decision 2: Use the latest matching uploaded artifact bundle as the shared CI history source

Decision: Hydrate the next lane history window from the latest matching uploaded bundle for the same lane/workflow profile when CI credentials are available, and allow an explicit local artifact directory or prior trend-history.json file as the fallback source for local validation.
Rationale: Once each bundle already contains the full bounded history window, the next run only needs the most recent comparable bundle rather than a multi-run artifact crawl. This stays lightweight and lets local development validate the exact same contract using checked-out or copied artifacts.
Alternatives considered:
- Depend on an external metrics store or dashboard backend: rejected because it would import a second analytics system.
- Assume shared workspace persistence across CI runs: rejected because Gitea runners should be treated as ephemeral.
- Require local developers to manually build history state for every validation: rejected because the workflow would be too fragile and easy to bypass.
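One way to picture the source selection is a small resolver that prefers the CI bundle and falls back to an explicit local path. This is a hedged Python sketch: `resolve_history_source`, the injected `fetch_latest_bundle` callable, and the rule that a missing CI bundle falls through to the local source are all assumptions, and no real Gitea API call is shown.

```python
from pathlib import Path
from typing import Callable, Optional

HISTORY_FILE = "trend-history.json"


def resolve_history_source(
    ci_token: Optional[str],
    fetch_latest_bundle: Callable[[str], Optional[Path]],
    local_source: Optional[Path],
) -> Optional[Path]:
    """Return the path of the prior history file, or None for a cold start."""
    if ci_token:
        # Hypothetical fetcher: downloads the latest comparable bundle and
        # returns its directory, or None when no matching bundle exists.
        bundle_dir = fetch_latest_bundle(ci_token)
        if bundle_dir is not None:
            return bundle_dir / HISTORY_FILE
    if local_source is not None:
        # Accept either the history file itself or an artifact directory.
        if local_source.name == HISTORY_FILE:
            return local_source
        return local_source / HISTORY_FILE
    return None
```

Injecting the fetcher keeps the selection logic testable without credentials, which matches the goal of validating the same contract locally.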

Decision 3: Keep trend policy in `TestLaneManifest` and trend evaluation inside the existing reporting seams

Decision: Extend TestLaneManifest with lane trend metadata and keep history/trend generation inside TestLaneReport, with TestLaneBudget providing recalibration and tolerance-aware recommendation helpers.
Rationale: Budgets, workflow bindings, artifact contracts, and existing comparison rules already live in these seams. Trend reporting is a governance extension of the same truth, not a separate subsystem. Keeping policy and evaluation together prevents duplication between wrappers, tests, and CI configuration.
Alternatives considered:
- Introduce a new generalized analytics service layer: rejected because there is only one real consumer and one real domain.
- Push all trend logic into shell scripts: rejected because the classification rules and JSON contracts belong in versioned PHP support code with guard tests.
- Scatter thresholds across workflow YAML and README prose: rejected because repository truth would become inconsistent.
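To illustrate what "trend metadata in one place" might look like, the sketch below keeps per-lane policy in a single lookup table. It is written in Python for illustration only; `LaneTrendPolicy`, its field names, and `trend_policy` are hypothetical, and the numbers are the retention and window values this document settles on under Decision 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneTrendPolicy:
    retention_limit: int             # records kept in trend-history.json
    comparison_window: int = 5       # recent comparable records evaluated
    min_comparable_samples: int = 3  # below this the lane stays "unstable"


PRIMARY_LANES = ("fast-feedback", "confidence", "heavy-governance", "browser")
SUPPORT_LANES = ("junit", "profiling")

# Single source of truth, so wrappers, tests, and CI read the same values.
LANE_TREND_POLICY = {
    **{lane: LaneTrendPolicy(retention_limit=20) for lane in PRIMARY_LANES},
    **{lane: LaneTrendPolicy(retention_limit=10) for lane in SUPPORT_LANES},
}


def trend_policy(lane_id: str) -> LaneTrendPolicy:
    """Look up the policy for a governed lane; unknown lanes raise KeyError."""
    return LANE_TREND_POLICY[lane_id]
```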

Decision 4: Use a bounded comparable window with explicit retention and comparison fingerprints

Decision: Retain the latest 20 records for primary lanes (fast-feedback, confidence, heavy-governance, browser) and the latest 10 records for support lanes (junit, profiling). Evaluate drift from the latest 5 comparable records, and require at least 3 comparable samples before assigning any health class other than unstable. Each history record carries a comparison fingerprint built from lane ID, workflow ID, trigger class, contract version, baseline source, and lane-scope signature.
Rationale: Twenty primary-lane records preserve enough runway to separate short-term noise from structural erosion while staying small enough for artifact bundles. Five recent comparable records are enough to show worsening or stabilization trends without overfitting old runs. The comparison fingerprint prevents silent apples-to-oranges comparisons when lane membership, workflow class, or contract shape changes.
Alternatives considered:
- Retain every historical record forever: rejected because the feature explicitly calls for bounded lightweight history.
- Compare only the immediately previous run: rejected because it cannot reliably distinguish streaks, noise, and recalibration boundaries.
- Compare by lane ID alone: rejected because workflow class and lane-scope changes would produce misleading trends.
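The fingerprint and comparable window can be sketched as below. This is a Python illustration under stated assumptions: the snake_case field names, the SHA-256 digest over canonical JSON, and the function names are hypothetical; only the six fingerprint inputs and the window size of 5 come from the decision above.

```python
import hashlib
import json

# The six inputs named in the decision; the key spellings are assumptions.
FINGERPRINT_FIELDS = (
    "lane_id",
    "workflow_id",
    "trigger_class",
    "contract_version",
    "baseline_source",
    "lane_scope_signature",
)


def comparison_fingerprint(record: dict) -> str:
    """Hash the comparison-relevant fields into a stable hex digest."""
    subset = {key: record[key] for key in FINGERPRINT_FIELDS}
    canonical = json.dumps(subset, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable_window(history: list, current: dict, window: int = 5) -> list:
    """Return the newest `window` history records comparable to `current`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    target = comparison_fingerprint(current)
    matches = [r for r in history if comparison_fingerprint(r) == target]
    return matches[-window:]  # history is assumed to be ordered oldest first
```

A record whose contract version or lane scope changed produces a different digest and simply drops out of the window, which is how a fingerprint break shows up as too few comparable samples.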

Decision 5: Derive health classes from existing variance tolerances plus a trend policy, not from raw runtime deltas alone

Decision: Classify lane health with the fixed vocabulary healthy, budget-near, trending-worse, regressed, and unstable. Use the existing lane-specific variance allowances from the current enforcement profiles as the minimum noise floor, combine them with a near-budget headroom rule, and reserve unstable for insufficient comparable history, comparison-fingerprint breaks, or high-variance (noisy) windows.
Rationale: The repo already documents lane-specific tolerance in budget enforcement. Reusing that allowance as the floor for trend significance keeps the new model aligned with current governance truth and avoids inventing unrelated threshold systems.
Alternatives considered:
- Binary healthy/regressed classification: rejected because it hides erosion before a lane breaches budget.
- Pure percentage-only thresholds: rejected because current lane budgets and tolerances already vary meaningfully in absolute seconds.
- Automatically downgrade every spike to regressed: rejected because one-off noise should remain visible without looking structural.
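A minimal sketch of how the five classes could be derived, assuming one particular rule order. Everything numeric or procedural here is an assumption for illustration: the 10% headroom ratio, using the population standard deviation as the noise test, comparing against the window mean, and the precedence of the checks. Only the vocabulary, the tolerance-as-noise-floor idea, and the 3-sample minimum come from the text. The code is Python, not the repository's PHP.

```python
from statistics import mean, pstdev


def classify_lane(
    window_seconds: list,
    current_seconds: float,
    budget_seconds: float,
    tolerance_seconds: float,
    headroom_ratio: float = 0.10,  # hypothetical near-budget rule
    min_samples: int = 3,
) -> str:
    """Map a comparable runtime window plus the current run to a health class."""
    if len(window_seconds) < min_samples:
        return "unstable"  # not enough comparable history
    if pstdev(window_seconds) > tolerance_seconds:
        return "unstable"  # window is noisier than the lane's allowance
    if current_seconds > budget_seconds + tolerance_seconds:
        return "regressed"  # over budget beyond the allowed variance
    if current_seconds - mean(window_seconds) > tolerance_seconds:
        return "trending-worse"  # drift exceeds the noise floor
    if budget_seconds - current_seconds <= budget_seconds * headroom_ratio:
        return "budget-near"  # within budget but little headroom left
    return "healthy"
```

Passing only fingerprint-comparable records as `window_seconds` means a fingerprint break naturally lands in the first branch.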

Decision 6: Keep hotspot trend visibility family-first and summary-friendly

Decision: Reuse TestLaneReport's existing classification totals, family totals, hotspot files, and slowest-entry output; show the top 5 family deltas and top 3 file hotspots in human-readable summaries, while retaining up to the current top 10 slowest entries in JSON evidence.
Rationale: The existing report already derives the expensive attribution data. Trend reporting only needs to answer which dominant contributors worsened or stabilized, not preserve exhaustive per-test history.
Alternatives considered:
- Store and diff every test case over time: rejected because the storage and readability cost is not justified.
- Show only lane-level runtime without hotspot context: rejected because recalibration and regression review would remain too opaque.
- Make hotspot output file-first only: rejected because family-level attribution is the more stable governance lens already used by the repo.
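The family-first delta view can be sketched as a comparison of two totals maps, truncated to the summary limit. This Python example is illustrative only: `top_deltas`, the input shape (name to seconds), and ranking by absolute change are assumptions, while the limits of 5 families and 3 files come from the decision above.

```python
def top_deltas(previous: dict, current: dict, limit: int) -> list:
    """Return the `limit` largest runtime changes as (name, delta) pairs.

    Names missing from one side are treated as 0 seconds, so new and
    removed families or files still appear in the ranking.
    """
    names = set(previous) | set(current)
    deltas = {n: current.get(n, 0.0) - previous.get(n, 0.0) for n in names}
    # Largest absolute change first; ties broken by name for stable output.
    ranked = sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))
    return ranked[:limit]
```

A summary writer under these assumptions would call `top_deltas(prev_families, cur_families, 5)` for families and `top_deltas(prev_files, cur_files, 3)` for file hotspots.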

Decision 7: Separate recalibration recommendation from health status and keep recalibration explicitly human-approved

Decision: Emit recalibration recommendations separately from the lane health class and record explicit evidence for approved or rejected recalibration decisions. Baseline recalibration is only justified by documented lane-scope change, lasting infrastructure change, or deliberate post-improvement reset. Budget recalibration requires a stronger sustained evidence window and must never happen automatically because of a single regression or a noisy streak.
Rationale: Health status answers "what is happening now". Recalibration answers "should repository truth change". Keeping those separate prevents a degraded lane from appearing self-healing just because the tool auto-adjusted the benchmark.
Alternatives considered:
- Auto-adjust baselines or budgets from rolling averages: rejected because it would erase regression history.
- Treat recalibration as free-form README guidance only: rejected because reviewers need a structured evidence record.
- Merge recalibration directly into the health-class vocabulary: rejected because review semantics and current-state semantics are different concerns.
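A structured evidence record for recalibration decisions might look like the sketch below. It is a Python illustration under assumptions: the class name, field names, reason identifiers, and the 5-run minimum for budget changes are hypothetical (the text only requires "a stronger sustained evidence window" and names the three justified baseline reasons).

```python
from dataclasses import dataclass

# The three justifications named in the decision; identifiers are assumptions.
BASELINE_REASONS = frozenset(
    {"lane-scope-change", "infrastructure-change", "post-improvement-reset"}
)


@dataclass(frozen=True)
class RecalibrationDecision:
    lane_id: str
    target: str         # "baseline" or "budget"
    decision: str       # "approved" or "rejected"
    reason: str
    decided_by: str     # a human reviewer; never filled in automatically
    evidence_runs: int  # comparable runs supporting the decision


def validate(d: RecalibrationDecision, min_budget_runs: int = 5) -> list:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if d.target not in ("baseline", "budget"):
        problems.append("unknown target")
    if d.decision not in ("approved", "rejected"):
        problems.append("unknown decision")
    if not d.decided_by.strip():
        problems.append("missing human reviewer")
    if d.decision == "approved":
        if d.target == "baseline" and d.reason not in BASELINE_REASONS:
            problems.append("baseline recalibration lacks a documented justification")
        if d.target == "budget" and d.evidence_runs < min_budget_runs:
            problems.append("budget recalibration needs a longer evidence window")
    return problems
```

Keeping this record separate from the health class means a lane can be reported as regressed while a recalibration request for it is still pending or rejected.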

Decision 8: Extend the existing summary/report surfaces instead of introducing a new dashboard surface

Decision: Add a trend section to the existing lane summary.md, add a trend block to the current JSON report payloads, and use trend-history.json as the dedicated bounded-history artifact.
Rationale: Maintainers already read the current summary and JSON artifacts. Extending those surfaces makes trend output immediately usable in local runs, CI logs, and uploaded bundles without inventing a parallel UI or artifact family.
Alternatives considered:
- Create a new UI page or dashboard: rejected because this feature is repository-governance-only.
- Emit a second human-readable markdown file for trend alone: rejected because it would split the operator reading surface unnecessarily.
- Keep trend data only inside JSON: rejected because reviewers need readable summaries during ordinary PR and CI triage.
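As a rough picture of the summary extension, the sketch below renders a trend block as Markdown text that could be appended to a lane's summary.md. It is Python for illustration only; the `## Trend` heading, the bullet wording, and the keys of the `trend` dictionary are assumptions, not the contract defined by Spec 211.

```python
def render_trend_section(trend: dict) -> str:
    """Render a hypothetical trend block as a Markdown section."""
    lines = [
        "## Trend",
        "",
        f"- Health class: {trend['health_class']}",
        f"- Comparable samples: {trend['comparable_samples']}",
        f"- Recalibration recommendation: {trend['recalibration']}",
    ]
    # Optional (family, delta_seconds) pairs, already truncated by the caller.
    for family, delta in trend.get("family_deltas", []):
        lines.append(f"- {family}: {delta:+.1f}s")
    return "\n".join(lines) + "\n"
```

The same dictionary could be embedded as the trend block in the JSON report, so the readable summary and the machine-readable payload stay derived from one structure.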
