TenantAtlas/specs/211-runtime-trend-recalibration/quickstart.md
ahmido 81a07a41e4
2026-04-18 07:36:05 +00:00


Quickstart: Test Runtime Trend Reporting & Baseline Recalibration

Preconditions

  • Specs 206 through 210 are already implemented and remain the governing baseline for lane selection, budgets, CI workflow routing, and artifact publication.
  • Local validation runs from the repository root and uses Sail-backed commands for PHP and test execution.
  • Validating anything other than an unstable history window locally requires at least one prior comparable artifact bundle or a prior lane *-latest.trend-history.json file.
  • No database migration, product route, Filament panel, or frontend asset step is required for this feature.

Planned Artifact Additions

  • Extend the existing lane artifact set with apps/platform/storage/logs/test-lanes/<lane>-latest.trend-history.json.
  • Extend the existing summary.md, report.json, and budget.json outputs with trend-aware sections and fields rather than creating a parallel human-readable artifact surface.
  • Stage the new history artifact into the existing .gitea-artifacts/<workflow-profile> upload bundle for the owning lane.
  1. Extend TestLaneManifest with the lane trend policy, bounded retention limits, comparison-fingerprint inputs, and recalibration guidance anchors.
  2. Extend TestLaneReport so it can read a prior *-latest.trend-history.json, append the current LaneTrendRecord, trim to the lane retention limit, compute the trend window, emit drift status, and surface hotspot deltas.
  3. Extend TestLaneBudget with recalibration recommendation helpers that stay separate from current budget outcome.
  4. Extend scripts/platform-test-report so it refreshes trend-aware outputs after a prior history file has been hydrated into apps/platform/storage/logs/test-lanes.
  5. Extend scripts/platform-test-artifacts and the checked-in artifact contracts so the trend history file is staged and uploaded with the existing lane bundle.
  6. Update only the necessary Gitea workflow steps so each lane can hydrate the previous matching history artifact before report generation without widening lane execution.
  7. Add or update Pest guard coverage for trend history, drift classes, hotspot deltas, recalibration rules, and workflow/artifact publication contracts.
  8. Update README.md with reviewer guidance and capture representative validation evidence for the main trend cases.
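
The read, append, and trim behavior described in step 2 can be sketched as follows. This is a minimal Python illustration, not the PHP implementation in TestLaneReport. The function name update_trend_history, the "records" key, and the record fields are assumptions made for the example; the real layout is governed by the checked-in trend-history schema.

```python
import json
from pathlib import Path


def update_trend_history(history_path, current_record, retention_limit):
    """Append current_record to a lane history file and keep only the newest entries.

    The "records" key and the record shape are hypothetical placeholders.
    """
    if retention_limit < 1:
        raise ValueError("retention_limit must be at least 1")
    path = Path(history_path)
    # A missing file is the cold-start case: begin with an empty history.
    history = json.loads(path.read_text()) if path.exists() else {"records": []}
    records = history.get("records", []) + [current_record]
    # Bounded retention: drop the oldest records beyond the lane limit.
    history["records"] = records[-retention_limit:]
    path.write_text(json.dumps(history, indent=2))
    return history["records"]
```

Trimming after the append means the current run always survives, and the file never grows past the lane retention limit regardless of how many prior runs were hydrated.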

Local Validation Flow

1. Generate current lane artifacts

./scripts/platform-test-lane fast-feedback
./scripts/platform-test-lane confidence
./scripts/platform-test-report fast-feedback --skip-latest-history
./scripts/platform-test-report confidence --skip-latest-history

2. Hydrate prior comparable history for a stable-window validation

Use the wrapper flags instead of manual artifact copying so local runs exercise the same hydration contract as CI.

./scripts/platform-test-report fast-feedback --history-file=/absolute/path/to/fast-feedback-latest.trend-history.json
./scripts/platform-test-report confidence --history-bundle=/absolute/path/to/comparable-bundle-or-zip

3. Rebuild workflow-shaped evidence without widening lane execution

./scripts/platform-test-report fast-feedback --workflow-id=pr-fast-feedback --trigger-class=pull-request --fetch-latest-history
./scripts/platform-test-report confidence --workflow-id=main-confidence --trigger-class=mainline-push --fetch-latest-history
./scripts/platform-test-report heavy-governance --workflow-id=heavy-governance --trigger-class=manual --skip-latest-history
./scripts/platform-test-report browser --workflow-id=browser-manual --trigger-class=manual --skip-latest-history
./scripts/platform-test-report profiling --skip-latest-history
./scripts/platform-test-report junit --skip-latest-history

4. Stage artifact bundles exactly as CI will publish them

./scripts/platform-test-artifacts fast-feedback .gitea-artifacts/pr-fast-feedback --workflow-id=pr-fast-feedback --trigger-class=pull-request
./scripts/platform-test-artifacts confidence .gitea-artifacts/main-confidence --workflow-id=main-confidence --trigger-class=mainline-push

5. Run focused guard coverage and formatting

(cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/Guards)
(cd apps/platform && ./vendor/bin/sail bin pint --dirty --format agent)

6. Time-box one reviewer summary check

Using only the generated summary, set a two-minute timer and verify that the reviewer can name the health class for each primary lane, and say whether a recalibration discussion is warranted, before opening raw lane outputs.

Health Class Cheat Sheet

  • healthy: the lane has enough comparable history, remains comfortably under budget, and recent variance stays below the lane noise floor.
  • budget-near: the lane is still passing, but its headroom is inside the lane's warning band.
  • trending-worse: multiple comparable samples are worsening above the documented variance floor.
  • regressed: the lane is over budget or repeatedly worsening enough that the report should stop calling it normal erosion.
  • unstable: the report is intentionally refusing a stronger label because the window is too short, too noisy, or no longer comparable.
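
One way to read the cheat sheet as a decision order is sketched below in Python. Everything numeric here is an assumption: the LanePolicy fields, the three-sample minimum, the rule that two worsening steps mean trending-worse, and the precedence between classes are illustrative only. The sketch also covers only the over-budget branch of regressed, not the repeated-worsening branch. The governing values and rules live in the lane trend policy.

```python
from dataclasses import dataclass


@dataclass
class LanePolicy:
    # All fields are hypothetical stand-ins for the lane trend policy.
    budget_seconds: float
    warning_band_seconds: float  # headroom at or below this is budget-near
    noise_floor_seconds: float   # run-to-run change treated as noise
    min_samples: int = 3         # comparable samples needed for a stable label


def classify_health(runtimes, policy, comparable=True):
    """Return a health class for runtimes ordered oldest to newest."""
    if not comparable or len(runtimes) < policy.min_samples:
        return "unstable"
    deltas = [b - a for a, b in zip(runtimes, runtimes[1:])]
    # Swings above the noise floor in both directions make the window too noisy.
    big_up = any(d > policy.noise_floor_seconds for d in deltas)
    big_down = any(d < -policy.noise_floor_seconds for d in deltas)
    if big_up and big_down:
        return "unstable"
    current = runtimes[-1]
    if current > policy.budget_seconds:
        return "regressed"
    worsening_steps = sum(1 for d in deltas if d > policy.noise_floor_seconds)
    if worsening_steps >= 2:
        return "trending-worse"
    if policy.budget_seconds - current <= policy.warning_band_seconds:
        return "budget-near"
    return "healthy"
```

Checking comparability and noise first mirrors the intent of the unstable class: the report refuses a stronger label before it considers budget position at all.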

Recalibration is separate from health. The report can emit candidate, approved, or rejected baseline or budget decisions, but it never mutates repository truth automatically.

Recorded Evidence Snapshot (2026-04-17)

| Scenario | Lane | Runtime Window | Outcome |
| --- | --- | --- | --- |
| Live cold-start wrapper run | fast-feedback | current 120.29s, previous 120.29s, baseline 176.74s, budget 200s | unstable, hotspot evidence unavailable, budget recalibration rejected (manual-hold) because only two comparable samples existed |
| Stable healthy window | fast-feedback | current 176.10s, previous 175.60s, baseline 176.74s, budget 200s | healthy, no recalibration recommended |
| Stable budget-near window | confidence | current 433.00s, previous 430.00s, baseline 394.38s, budget 450s | budget-near, investigate before the lane becomes a repeated blocker |
| Noisy window | fast-feedback | current 170.00s, previous 195.00s, baseline 176.74s, budget 200s | unstable with windowStatus=noisy, so the spike is treated as noise instead of structural regression |
| Hotspot-stable example | confidence | current 394.38s, previous 401.12s, baseline 394.38s, budget 450s | healthy; dominant families stayed flat and the top files remained the baseline compare matrix pair plus onboarding-wizard enforcement |
| Approved baseline recalibration | fast-feedback | current 176.30s, previous 176.00s, baseline reset from 176.74s to 182.00s, budget 200s | baseline recalibration recorded as approved with rationale post-improvement-reset after the lane stabilized |
| Rejected budget recalibration | fast-feedback | current 193.00s, previous 176.00s, baseline 176.74s, budget 200s | budget-near, but budget recalibration stayed rejected with rationale noise-rejected |
| Candidate budget review | confidence | current 460.00s, previous 420.00s, baseline 394.38s, budget 450s | regressed, budget review emitted as a candidate only after a five-run evidence window |
| Primary-lane cold starts | browser, heavy-governance | 109.67s/150s and 228.34s/300s | both reported unstable on first refresh, which is the intended cold-start behavior |
| Support-lane path | profiling, junit | 2701.51s/3000s and 380.14s/450s | both wrappers now emit bounded trend-history.json; junit support-lane report refresh was repaired so the documented command actually works |
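
The budget headroom behind several of the recorded outcomes can be recomputed directly from the snapshot values. The helper below is a small Python sketch; the name headroom and the choice to express headroom as a percentage of budget are assumptions for illustration, not the report's own fields.

```python
def headroom(current_seconds, budget_seconds):
    """Return (seconds, percent of budget) left before the lane breaches its budget."""
    seconds = budget_seconds - current_seconds
    return round(seconds, 2), round(100 * seconds / budget_seconds, 2)


# Current runtime and budget taken from the recorded evidence snapshot.
SNAPSHOT = [
    ("Stable healthy window", 176.10, 200),
    ("Stable budget-near window", 433.00, 450),
    ("Rejected budget recalibration", 193.00, 200),
    ("Candidate budget review", 460.00, 450),
]

for scenario, current, budget in SNAPSHOT:
    seconds, percent = headroom(current, budget)
    print(f"{scenario}: {seconds:.2f}s headroom ({percent:.2f}% of budget)")
```

The healthy window keeps 23.90s (11.95%) of headroom, the two budget-near rows keep 17.00s (3.78%) and 7.00s (3.50%), and the candidate budget review row is 10.00s over budget (-2.22%).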

Representative Evidence Set

Capture at least one example for each of the following before calling the feature complete:

  1. Three sequential comparable samples for each primary lane: fast-feedback, confidence, heavy-governance, and browser.
  2. healthy: current runtime comfortably below budget with stable or improving recent comparable history.
  3. budget-near: current runtime remains under budget but inside the lane's near-budget headroom band.
  4. trending-worse: a bounded comparable window shows repeated worsening that is larger than the lane noise floor.
  5. regressed: a budget breach or materially repeated worsening is clearly visible.
  6. unstable: insufficient comparable history, fingerprint mismatch, or noisy evidence makes a stable label unsafe.
  7. Approved recalibration case: explicit evidence shows why repository truth should change.
  8. Rejected recalibration case: explicit evidence shows why repository truth should stay unchanged.
  9. One support-lane example from junit or profiling when it materially improves hotspot or comparison evidence.

Each recorded example should name the lane, current runtime, previous runtime, baseline, budget, health class, hotspot summary, and the recalibration conclusion when relevant.
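
A recorded example with the fields listed above could be captured in a small structure like the following Python sketch. The class name EvidenceExample, its field names, and the describe() output format are hypothetical; only the list of fields and the five health classes come from this quickstart.

```python
from dataclasses import dataclass
from typing import Optional

HEALTH_CLASSES = {"healthy", "budget-near", "trending-worse", "regressed", "unstable"}


@dataclass(frozen=True)
class EvidenceExample:
    lane: str
    current_seconds: float
    previous_seconds: float
    baseline_seconds: float
    budget_seconds: float
    health_class: str
    hotspot_summary: str
    recalibration: Optional[str] = None  # only set when a conclusion is relevant

    def __post_init__(self):
        # Reject labels outside the documented health classes.
        if self.health_class not in HEALTH_CLASSES:
            raise ValueError(f"unknown health class: {self.health_class}")

    def describe(self):
        """Render the example as a single reviewer-readable line."""
        recalibration = self.recalibration or "none"
        return (
            f"{self.lane}: current {self.current_seconds:.2f}s, "
            f"previous {self.previous_seconds:.2f}s, "
            f"baseline {self.baseline_seconds:.2f}s, "
            f"budget {self.budget_seconds:.0f}s; "
            f"health {self.health_class}; hotspots: {self.hotspot_summary}; "
            f"recalibration: {recalibration}"
        )
```

Keeping the record immutable and validating the health class at construction makes it harder for a hand-written evidence entry to drift from the vocabulary used in the generated report.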

Material runtime drift, bundle-hydration caveats, and approved or rejected recalibration follow-up must be recorded in specs/211-runtime-trend-recalibration/spec.md or the active implementation PR. This quickstart may mirror the same evidence, but it does not replace the delivery record.

CI Rollout Notes

  • CI should hydrate the previous matching *-latest.trend-history.json from the most recent comparable uploaded artifact bundle before the report refresh step.
  • The uploaded bundle for each governed workflow must include the refreshed *-latest.trend-history.json so the next run only needs one prior bundle.
  • The workflow-owned refresh steps now pass --fetch-latest-history together with TENANTATLAS_GITEA_TOKEN, and the workflows declare top-level actions: read and contents: read permissions, so bundle discovery stays explicit.
  • Pull request and dev push validation remain the narrowest proving paths; heavy/browser/manual/scheduled lanes provide representative cross-lane evidence and must not be widened.

Final Review Checklist

  • Trend policy lives in repository truth, not workflow prose.
  • summary.md, report.json, budget.json, and *-latest.trend-history.json agree on lane runtime and health class.
  • Baseline and budget recalibration remain explicit, reviewable, and separate.
  • Hotspot summaries stay readable and bounded.
  • A timed reviewer dry run confirms the generated summary remains decidable within two minutes.
  • The implementation does not add product persistence, routes, assets, or a second analytics surface.
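
The artifact-agreement item in the checklist above can be spot-checked mechanically once the lane runtime and health class have been read out of each artifact. The Python sketch below assumes that extraction has already happened; the function name find_disagreements, the input shape, and the tolerance are illustrative assumptions, not part of the checked-in contracts.

```python
def find_disagreements(artifacts, tolerance_seconds=0.01):
    """Compare runtime and health class across artifacts.

    artifacts maps an artifact name to a dict with "runtime_seconds" and
    "health_class" keys (a hypothetical, pre-extracted shape).
    Returns a list of human-readable problems; empty means they agree.
    """
    if not artifacts:
        return []
    names = sorted(artifacts)
    reference_name = names[0]
    reference = artifacts[reference_name]
    problems = []
    for name in names[1:]:
        candidate = artifacts[name]
        delta = abs(candidate["runtime_seconds"] - reference["runtime_seconds"])
        if delta > tolerance_seconds:
            problems.append(f"{name} runtime differs from {reference_name}")
        if candidate["health_class"] != reference["health_class"]:
            problems.append(f"{name} health class differs from {reference_name}")
    return problems
```

An empty result only shows that the extracted values match each other; it does not replace reading the generated summary during the timed reviewer dry run.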