feat: implement runtime trend recalibration reporting (#244)

## Summary
- implement Spec 211 runtime trend reporting with bounded lane history, drift classification, hotspot trend output, and recalibration evidence handling
- extend the repo-truth governance seams and workflow wrappers for comparable-bundle hydration, trend artifact publication, and contract-backed reporting
- add the Spec 211 planning artifacts, data model, quickstart, tasks, and repository contract documents

## Validation
- parsed `specs/211-runtime-trend-recalibration/contracts/test-runtime-trend-history.schema.json`
- parsed `specs/211-runtime-trend-recalibration/contracts/test-runtime-trend.logical.openapi.yaml`
- re-ran cross-artifact consistency analysis for the Spec 211 artifact set until no material findings remained
- no application test suite was re-run as part of this final commit/push/PR step

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #244

2026-04-18 07:36:05 +00:00

Quickstart: Test Runtime Trend Reporting & Baseline Recalibration

Preconditions

Specs 206 through 210 are already implemented and remain the governing baseline for lane selection, budgets, CI workflow routing, and artifact publication.
Local validation runs from the repository root and uses Sail-backed commands for PHP and test execution.
At least one prior comparable artifact bundle, or a prior *-latest.trend-history.json file for the lane, is available whenever local validation needs a history window that should classify as something other than unstable.
No database migration, product route, Filament panel, or frontend asset step is required for this feature.

Planned Artifact Additions

Extend the existing lane artifact set with apps/platform/storage/logs/test-lanes/<lane>-latest.trend-history.json.
Extend the existing summary.md, report.json, and budget.json outputs with trend-aware sections and fields rather than creating a parallel human-readable artifact surface.
Stage the new history artifact into the existing .gitea-artifacts/<workflow-profile> upload bundle for the owning lane.

Recommended Implementation Order

Extend TestLaneManifest with the lane trend policy, bounded retention limits, comparison-fingerprint inputs, and recalibration guidance anchors.
Extend TestLaneReport so it can read a prior *-latest.trend-history.json, append the current LaneTrendRecord, trim to the lane retention limit, compute the trend window, emit drift status, and surface hotspot deltas.
Extend TestLaneBudget with recalibration recommendation helpers that stay separate from current budget outcome.
Extend scripts/platform-test-report so it refreshes trend-aware outputs after a prior history file has been hydrated into apps/platform/storage/logs/test-lanes.
Extend scripts/platform-test-artifacts and the checked-in artifact contracts so the trend history file is staged and uploaded with the existing lane bundle.
Update only the necessary Gitea workflow steps so each lane can hydrate the previous matching history artifact before report generation without widening lane execution.
Add or update Pest guard coverage for trend history, drift classes, hotspot deltas, recalibration rules, and workflow/artifact publication contracts.
Update README.md with reviewer guidance and capture representative validation evidence for the main trend cases.
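
The history handling described for TestLaneReport (read the prior history, append the current record, trim to the lane retention limit) can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the PHP implementation: the function name, the `runtimeSeconds` key, and the `retention_limit` parameter are hypothetical.

```python
from typing import Any


def append_and_trim(history: list[dict[str, Any]],
                    current: dict[str, Any],
                    retention_limit: int) -> list[dict[str, Any]]:
    """Return a new history with `current` appended, keeping only the newest records."""
    if retention_limit < 1:
        raise ValueError("retention_limit must be at least 1")
    # Build a new list so the hydrated prior history is never mutated in place.
    updated = [*history, current]
    # Slicing from the end drops the oldest records first.
    return updated[-retention_limit:]


# Hypothetical prior records, ordered oldest to newest.
prior_history = [
    {"runtimeSeconds": 175.60},
    {"runtimeSeconds": 176.00},
    {"runtimeSeconds": 176.10},
]
trimmed = append_and_trim(prior_history, {"runtimeSeconds": 176.30}, retention_limit=3)
print([record["runtimeSeconds"] for record in trimmed])
```

Returning a new list keeps the hydrated file content available for comparison until the refreshed history is written back.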

Local Validation Flow

1. Generate current lane artifacts

./scripts/platform-test-lane fast-feedback
./scripts/platform-test-lane confidence
./scripts/platform-test-report fast-feedback --skip-latest-history
./scripts/platform-test-report confidence --skip-latest-history

2. Hydrate prior comparable history for a stable-window validation

Use the wrapper flags instead of manual artifact copying so local runs exercise the same hydration contract as CI.

./scripts/platform-test-report fast-feedback --history-file=/absolute/path/to/fast-feedback-latest.trend-history.json
./scripts/platform-test-report confidence --history-bundle=/absolute/path/to/comparable-bundle-or-zip

3. Rebuild workflow-shaped evidence without widening lane execution

./scripts/platform-test-report fast-feedback --workflow-id=pr-fast-feedback --trigger-class=pull-request --fetch-latest-history
./scripts/platform-test-report confidence --workflow-id=main-confidence --trigger-class=mainline-push --fetch-latest-history
./scripts/platform-test-report heavy-governance --workflow-id=heavy-governance --trigger-class=manual --skip-latest-history
./scripts/platform-test-report browser --workflow-id=browser-manual --trigger-class=manual --skip-latest-history
./scripts/platform-test-report profiling --skip-latest-history
./scripts/platform-test-report junit --skip-latest-history

4. Stage artifact bundles exactly as CI will publish them

./scripts/platform-test-artifacts fast-feedback .gitea-artifacts/pr-fast-feedback --workflow-id=pr-fast-feedback --trigger-class=pull-request
./scripts/platform-test-artifacts confidence .gitea-artifacts/main-confidence --workflow-id=main-confidence --trigger-class=mainline-push

5. Run focused guard coverage and formatting

cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/Guards
cd apps/platform && ./vendor/bin/sail bin pint --dirty --format agent

6. Time-box one reviewer summary check

Use the generated summary only, set a two-minute timer, and verify that the reviewer can name the health class for each primary lane plus whether recalibration discussion is warranted before opening raw lane outputs.

Health Class Cheat Sheet

healthy: the lane has enough comparable history, remains comfortably under budget, and recent variance stays below the lane noise floor.
budget-near: the lane is still passing, but its headroom is inside the lane's warning band.
trending-worse: multiple comparable samples are worsening above the documented variance floor.
regressed: the lane is over budget, or it has worsened repeatedly and by enough that the report should no longer describe it as normal erosion.
unstable: the report is intentionally refusing a stronger label because the window is too short, too noisy, or no longer comparable.

Recalibration is separate from health. The report can emit candidate, approved, or rejected baseline or budget decisions, but it never mutates repository truth automatically.
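
One way to read the cheat sheet is as an ordered set of checks. The Python sketch below is an illustration only: the `LanePolicy` fields, the threshold values, and the precedence of the checks are assumptions made for this example and may differ from what TestLaneReport actually does. Only the 200s budget is taken from the recorded fast-feedback evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanePolicy:
    """Hypothetical per-lane policy; names and values are illustrative."""
    budget_seconds: float
    warning_band_seconds: float  # headroom at or below this counts as budget-near
    noise_floor_seconds: float   # run-to-run deltas within this are treated as noise
    min_samples: int             # comparable samples needed before a stable label


def classify(runtimes: list[float], policy: LanePolicy, comparable: bool = True) -> str:
    """Classify a lane; `runtimes` is ordered oldest to newest, last entry is current."""
    if not comparable or len(runtimes) < policy.min_samples:
        return "unstable"  # window too short or no longer comparable
    current = runtimes[-1]
    if current > policy.budget_seconds:
        return "regressed"  # budget breach
    deltas = [later - earlier for earlier, later in zip(runtimes, runtimes[1:])]
    rises = [d for d in deltas if d > policy.noise_floor_seconds]
    drops = [d for d in deltas if d < -policy.noise_floor_seconds]
    if rises and drops:
        return "unstable"  # large swings in both directions: noisy window
    if len(rises) >= 2:
        return "trending-worse"  # repeated worsening above the noise floor
    if policy.budget_seconds - current <= policy.warning_band_seconds:
        return "budget-near"
    return "healthy"


policy = LanePolicy(budget_seconds=200.0, warning_band_seconds=10.0,
                    noise_floor_seconds=5.0, min_samples=3)
print(classify([176.74, 175.60, 176.10], policy))  # stable window
print(classify([176.74, 195.00, 170.00], policy))  # spike then drop
print(classify([120.29, 120.29], policy))          # only two samples
```

Recalibration is deliberately absent from this function: a health label never implies a baseline or budget change, which matches the separation described above.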

Recorded Evidence Snapshot (2026-04-17)

| Scenario | Lane | Runtime Window | Outcome |
| --- | --- | --- | --- |
| Live cold-start wrapper run | `fast-feedback` | current `120.29s`, previous `120.29s`, baseline `176.74s`, budget `200s` | `unstable`, hotspot evidence unavailable, budget recalibration rejected (`manual-hold`) because only two comparable samples existed |
| Stable healthy window | `fast-feedback` | current `176.10s`, previous `175.60s`, baseline `176.74s`, budget `200s` | `healthy`, no recalibration recommended |
| Stable budget-near window | `confidence` | current `433.00s`, previous `430.00s`, baseline `394.38s`, budget `450s` | `budget-near`, investigate before the lane becomes a repeated blocker |
| Noisy window | `fast-feedback` | current `170.00s`, previous `195.00s`, baseline `176.74s`, budget `200s` | `unstable` with `windowStatus=noisy`, so the spike is treated as noise instead of structural regression |
| Hotspot-stable example | `confidence` | current `394.38s`, previous `401.12s`, baseline `394.38s`, budget `450s` | `healthy`; dominant families stayed flat and the top files remained the baseline compare matrix pair plus onboarding-wizard enforcement |
| Approved baseline recalibration | `fast-feedback` | current `176.30s`, previous `176.00s`, baseline reset from `176.74s` to `182.00s`, budget `200s` | baseline recalibration recorded as `approved` with rationale `post-improvement-reset` after the lane stabilized |
| Rejected budget recalibration | `fast-feedback` | current `193.00s`, previous `176.00s`, baseline `176.74s`, budget `200s` | `budget-near`, but budget recalibration stayed `rejected` with rationale `noise-rejected` |
| Candidate budget review | `confidence` | current `460.00s`, previous `420.00s`, baseline `394.38s`, budget `450s` | `regressed`, budget review emitted as a `candidate` only after a five-run evidence window |
| Primary-lane cold starts | `browser`, `heavy-governance` | `109.67s/150s` and `228.34s/300s` | both reported `unstable` on first refresh, which is the intended cold-start behavior |
| Support-lane path | `profiling`, `junit` | `2701.51s/3000s` and `380.14s/450s` | both wrappers now emit bounded `trend-history.json`; `junit` support-lane report refresh was repaired so the documented command actually works |
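
The runtime windows in the snapshot above are easier to compare when expressed as headroom against budget. The Python sketch below works through that arithmetic for three of the recorded rows; the `headroom` helper is illustrative and is not part of the repository.

```python
def headroom(current: float, budget: float) -> tuple[float, float]:
    """Return (seconds of headroom, headroom as a percentage of the budget)."""
    seconds = budget - current
    return round(seconds, 2), round(100.0 * seconds / budget, 2)


# (lane, scenario, current seconds, budget seconds) copied from the snapshot rows.
rows = [
    ("fast-feedback", "stable healthy window", 176.10, 200.0),
    ("confidence", "stable budget-near window", 433.00, 450.0),
    ("confidence", "candidate budget review", 460.00, 450.0),
]
for lane, scenario, current, budget in rows:
    seconds, percent = headroom(current, budget)
    print(f"{lane} ({scenario}): {seconds:+.2f}s headroom, {percent:+.2f}% of budget")
```

The healthy row keeps roughly 12% of its budget free, the budget-near row under 4%, and the candidate row is negative, which means the budget has been breached.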

Representative Evidence Set

Capture at least one example for each of the following before calling the feature complete:

Three sequential comparable samples for each primary lane: fast-feedback, confidence, heavy-governance, and browser.
healthy: current runtime comfortably below budget with stable or improving recent comparable history.
budget-near: current runtime remains under budget but inside the lane's near-budget headroom band.
trending-worse: a bounded comparable window shows repeated worsening that is larger than the lane noise floor.
regressed: a budget breach or materially repeated worsening is clearly visible.
unstable: insufficient comparable history, fingerprint mismatch, or noisy evidence makes a stable label unsafe.
Approved recalibration case: explicit evidence shows why repository truth should change.
Rejected recalibration case: explicit evidence shows why repository truth should stay unchanged.
One support-lane example from junit or profiling when it materially improves hotspot or comparison evidence.

Each recorded example should name the lane, current runtime, previous runtime, baseline, budget, health class, hotspot summary, and the recalibration conclusion when relevant.
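
A small record type can make the required fields hard to forget. The Python sketch below is an assumption-laden illustration: the class name, field names, and validation are hypothetical, and the sample values are copied from the hotspot-stable row of the recorded snapshot.

```python
from dataclasses import dataclass
from typing import Optional

HEALTH_CLASSES = {"healthy", "budget-near", "trending-worse", "regressed", "unstable"}


@dataclass(frozen=True)
class EvidenceExample:
    """One recorded example; field names are illustrative, not a repository contract."""
    lane: str
    current_seconds: float
    previous_seconds: float
    baseline_seconds: float
    budget_seconds: float
    health_class: str
    hotspot_summary: str
    recalibration: Optional[str] = None  # filled in only when relevant

    def __post_init__(self) -> None:
        # Reject labels outside the five documented health classes.
        if self.health_class not in HEALTH_CLASSES:
            raise ValueError(f"unknown health class: {self.health_class}")


example = EvidenceExample(
    lane="confidence",
    current_seconds=394.38,
    previous_seconds=401.12,
    baseline_seconds=394.38,
    budget_seconds=450.0,
    health_class="healthy",
    hotspot_summary="dominant families flat; top files unchanged",
)
print(example)
```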

Material runtime drift, bundle-hydration caveats, and approved or rejected recalibration follow-up must be recorded in specs/211-runtime-trend-recalibration/spec.md or the active implementation PR. This quickstart may mirror the same evidence, but it does not replace the delivery record.

CI Rollout Notes

CI should hydrate the previous matching *-latest.trend-history.json from the most recent comparable uploaded artifact bundle before the report refresh step.
The uploaded bundle for each governed workflow must include the refreshed *-latest.trend-history.json so the next run only needs one prior bundle.
The workflow-owned refresh steps now pass --fetch-latest-history together with TENANTATLAS_GITEA_TOKEN, and the workflows declare top-level actions: read and contents: read permissions, so bundle discovery stays explicit.
Pull request and dev push validation remain the narrowest proving paths; heavy/browser/manual/scheduled lanes provide representative cross-lane evidence and must not be widened.
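
The hydration rule in these notes (take the most recent comparable bundle for the same lane) can be sketched as a simple selection. This Python example is hypothetical throughout: the `lane`, `fingerprint`, and `uploadedAt` keys and the sample bundle list are invented for illustration and do not describe the Gitea API or the wrapper's real data model.

```python
from typing import Optional


def pick_prior_bundle(bundles: list[dict], lane: str, fingerprint: str) -> Optional[dict]:
    """Return the newest bundle for `lane` whose fingerprint matches, or None."""
    comparable = [
        b for b in bundles
        if b["lane"] == lane and b["fingerprint"] == fingerprint
    ]
    # ISO-8601 UTC timestamps in one fixed format sort correctly as plain strings.
    return max(comparable, key=lambda b: b["uploadedAt"], default=None)


# Made-up sample data: the newest confidence bundle has a different fingerprint.
bundles = [
    {"lane": "confidence", "fingerprint": "abc", "uploadedAt": "2026-04-15T07:00:00Z"},
    {"lane": "confidence", "fingerprint": "abc", "uploadedAt": "2026-04-17T07:00:00Z"},
    {"lane": "confidence", "fingerprint": "xyz", "uploadedAt": "2026-04-18T07:00:00Z"},
    {"lane": "fast-feedback", "fingerprint": "abc", "uploadedAt": "2026-04-18T09:00:00Z"},
]
chosen = pick_prior_bundle(bundles, "confidence", "abc")
print(chosen["uploadedAt"] if chosen else "no comparable bundle found")
```

When no comparable bundle exists, the function returns None, which corresponds to the cold-start case where the report is expected to stay at unstable.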

Final Review Checklist

Trend policy lives in repository truth, not workflow prose.
summary.md, report.json, budget.json, and *-latest.trend-history.json agree on lane runtime and health class.
Baseline and budget recalibration remain explicit, reviewable, and separate.
Hotspot summaries stay readable and bounded.
A timed reviewer dry run confirms the generated summary remains decidable within two minutes.
The implementation does not add product persistence, routes, assets, or a second analytics surface.
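
The artifact agreement item in the checklist above lends itself to a mechanical check. The Python sketch below assumes the lane runtime and health class have already been extracted from each artifact; the extraction step, the choice of report.json as the reference, and the sample values are assumptions made for this illustration.

```python
def disagreeing_artifacts(views: dict[str, tuple[float, str]],
                          tolerance_seconds: float = 0.005) -> list[str]:
    """List artifacts whose (runtime, health class) differ from report.json."""
    ref_runtime, ref_health = views["report.json"]
    return [
        name for name, (runtime, health) in views.items()
        if abs(runtime - ref_runtime) > tolerance_seconds or health != ref_health
    ]


# Made-up sample: the history file carries a stale health class.
views = {
    "summary.md": (176.10, "healthy"),
    "report.json": (176.10, "healthy"),
    "budget.json": (176.10, "healthy"),
    "fast-feedback-latest.trend-history.json": (176.10, "budget-near"),
}
print(disagreeing_artifacts(views))
```

An empty list means all four artifacts agree; any returned name points the reviewer at the file to inspect first.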
