# Quickstart: Test Runtime Trend Reporting & Baseline Recalibration

## Preconditions

- Specs 206 through 210 are already implemented and remain the governing baseline for lane selection, budgets, CI workflow routing, and artifact publication.
- Local validation runs from the repository root and uses Sail-backed commands for PHP and test execution.
- At least one prior comparable artifact bundle or prior lane `*-latest.trend-history.json` file is available when validating a non-`unstable` history window locally.
- No database migration, product route, Filament panel, or frontend asset step is required for this feature.

## Planned Artifact Additions

- Extend the existing lane artifact set with `apps/platform/storage/logs/test-lanes/<lane>-latest.trend-history.json`.
- Extend the existing `summary.md`, `report.json`, and `budget.json` outputs with trend-aware sections and fields rather than creating a parallel human-readable artifact surface.
- Stage the new history artifact into the existing `.gitea-artifacts/<workflow-profile>` upload bundle for the owning lane.

## Recommended Implementation Order

1. Extend `TestLaneManifest` with the lane trend policy, bounded retention limits, comparison-fingerprint inputs, and recalibration guidance anchors.
2. Extend `TestLaneReport` so it can read a prior `*-latest.trend-history.json`, append the current `LaneTrendRecord`, trim to the lane retention limit, compute the trend window, emit drift status, and surface hotspot deltas.
3. Extend `TestLaneBudget` with recalibration recommendation helpers that stay separate from current budget outcome.
4. Extend `scripts/platform-test-report` so it refreshes trend-aware outputs after a prior history file has been hydrated into `apps/platform/storage/logs/test-lanes`.
5. Extend `scripts/platform-test-artifacts` and the checked-in artifact contracts so the trend history file is staged and uploaded with the existing lane bundle.
6. Update only the necessary Gitea workflow steps so each lane can hydrate the previous matching history artifact before report generation without widening lane execution.
7. Add or update Pest guard coverage for trend history, drift classes, hotspot deltas, recalibration rules, and workflow/artifact publication contracts.
8. Update `README.md` with reviewer guidance and capture representative validation evidence for the main trend cases.

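
The append-and-trim behavior described in step 2 can be illustrated with a minimal shell sketch. It assumes a simplified, hypothetical history format of one record per line; the real `*-latest.trend-history.json` is a JSON document handled by `TestLaneReport`. The function name `append_and_trim` and its arguments are illustrative and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: a bounded history stored as one record per line.
# Usage: append_and_trim FILE RECORD LIMIT
append_and_trim() {
  local file="$1" record="$2" limit="$3" tmp
  tmp="$(mktemp)"
  # Emit existing records (if any), then the current record,
  # and keep only the newest LIMIT lines.
  { [ -f "$file" ] && cat "$file"; printf '%s\n' "$record"; } \
    | tail -n "$limit" > "$tmp"
  mv "$tmp" "$file"
}
```

Writing to a temporary file and moving it into place avoids reading and truncating the same file in one pipeline. The oldest records fall off first, so the window always ends at the current run.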
## Local Validation Flow

### 1. Generate current lane artifacts

```bash
./scripts/platform-test-lane fast-feedback
./scripts/platform-test-lane confidence
./scripts/platform-test-report fast-feedback --skip-latest-history
./scripts/platform-test-report confidence --skip-latest-history
```

### 2. Hydrate prior comparable history for a stable-window validation

Use the wrapper flags instead of manual artifact copying so local runs exercise the same hydration contract as CI.

```bash
./scripts/platform-test-report fast-feedback --history-file=/absolute/path/to/fast-feedback-latest.trend-history.json
./scripts/platform-test-report confidence --history-bundle=/absolute/path/to/comparable-bundle-or-zip
```

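
When scripting local runs, the choice between the two hydration flags shown above can be made from the path itself. The helper below is a hypothetical convenience, not part of the wrapper scripts: it assumes that a path ending in `.trend-history.json` is a single history file and that anything else is a comparable bundle directory or zip.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the hydration flag that matches a given path.
# Usage: history_flag PATH
history_flag() {
  local path="$1"
  case "$path" in
    # A lane history file goes through --history-file.
    *.trend-history.json) printf -- '--history-file=%s\n' "$path" ;;
    # Anything else is treated as a bundle directory or zip (assumption).
    *)                    printf -- '--history-bundle=%s\n' "$path" ;;
  esac
}
```

A call such as `./scripts/platform-test-report confidence "$(history_flag "$prior")"` would then pass whichever flag fits the value of `$prior`.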
### 3. Rebuild workflow-shaped evidence without widening lane execution

```bash
./scripts/platform-test-report fast-feedback --workflow-id=pr-fast-feedback --trigger-class=pull-request --fetch-latest-history
./scripts/platform-test-report confidence --workflow-id=main-confidence --trigger-class=mainline-push --fetch-latest-history
./scripts/platform-test-report heavy-governance --workflow-id=heavy-governance --trigger-class=manual --skip-latest-history
./scripts/platform-test-report browser --workflow-id=browser-manual --trigger-class=manual --skip-latest-history
./scripts/platform-test-report profiling --skip-latest-history
./scripts/platform-test-report junit --skip-latest-history
```

### 4. Stage artifact bundles exactly as CI will publish them

```bash
./scripts/platform-test-artifacts fast-feedback .gitea-artifacts/pr-fast-feedback --workflow-id=pr-fast-feedback --trigger-class=pull-request
./scripts/platform-test-artifacts confidence .gitea-artifacts/main-confidence --workflow-id=main-confidence --trigger-class=mainline-push
```

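### 5. Run focused guard coverage and formatting

```bash
# Each command runs in a subshell so both start from the repository root;
# a second bare `cd apps/platform` would fail once the first one has run.
(cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/Guards)
(cd apps/platform && ./vendor/bin/sail bin pint --dirty --format agent)
```
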
### 6. Time-box one reviewer summary check

Use the generated summary only, set a two-minute timer, and verify that the reviewer can name the health class for each primary lane plus whether recalibration discussion is warranted before opening raw lane outputs.

## Health Class Cheat Sheet

- `healthy`: the lane has enough comparable history, remains comfortably under budget, and recent variance stays below the lane noise floor.
- `budget-near`: the lane is still passing, but its headroom is inside the lane's warning band.
- `trending-worse`: multiple comparable samples are worsening above the documented variance floor.
- `regressed`: the lane is over budget or repeatedly worsening enough that the report should stop calling it normal erosion.
- `unstable`: the report is intentionally refusing a stronger label because the window is too short, too noisy, or no longer comparable.

Recalibration is separate from health. The report can emit `candidate`, `approved`, or `rejected` baseline or budget decisions, but it never mutates repository truth automatically.

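
To make the ordering of these classes concrete, here is a rough shell sketch of one possible decision sequence. It is not the `TestLaneReport` implementation. The thresholds (a minimum of three samples, a 10% warning band, a 5% noise floor), the precedence between classes, and the function name `classify_health` are all assumptions chosen for illustration; the real values live in the lane trend policy. It also ignores fingerprint comparability.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: classify a lane from its budget and a window of
# runtimes in seconds, ordered oldest to newest.
# Usage: classify_health BUDGET RUNTIME...
classify_health() {
  local budget="$1"; shift
  printf '%s\n' "$@" | awk -v budget="$budget" -v band=0.10 -v floor=0.05 '
    { r[NR] = $1 + 0 }
    END {
      n = NR; budget += 0
      # Too little history: refuse a stronger label.
      if (n < 3) { print "unstable"; exit }
      cur = r[n]
      d1 = (r[n-1] - r[n-2]) / r[n-2]   # older relative change
      d2 = (r[n]   - r[n-1]) / r[n-1]   # newest relative change
      # Over budget wins over every other label.
      if (cur > budget) { print "regressed"; exit }
      # Swings in opposite directions above the noise floor are noise.
      if ((d1 > floor && d2 < -floor) || (d1 < -floor && d2 > floor)) { print "unstable"; exit }
      # Two consecutive worsenings above the noise floor.
      if (d1 > floor && d2 > floor) { print "trending-worse"; exit }
      # Still passing, but inside the warning band below the budget.
      if (cur >= budget * (1 - band)) { print "budget-near"; exit }
      print "healthy"
    }'
}

classify_health 200 150 160 170   # prints: trending-worse
```

The sketch only assigns a health class. In line with the note above, any recalibration recommendation would be computed separately and never applied automatically.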
## Recorded Evidence Snapshot (2026-04-17)

| Scenario | Lane | Runtime Window | Outcome |
|----------|------|----------------|---------|
| Live cold-start wrapper run | `fast-feedback` | current `120.29s`, previous `120.29s`, baseline `176.74s`, budget `200s` | `unstable`, hotspot evidence unavailable, budget recalibration rejected (`manual-hold`) because only two comparable samples existed |
| Stable healthy window | `fast-feedback` | current `176.10s`, previous `175.60s`, baseline `176.74s`, budget `200s` | `healthy`, no recalibration recommended |
| Stable budget-near window | `confidence` | current `433.00s`, previous `430.00s`, baseline `394.38s`, budget `450s` | `budget-near`, investigate before the lane becomes a repeated blocker |
| Noisy window | `fast-feedback` | current `170.00s`, previous `195.00s`, baseline `176.74s`, budget `200s` | `unstable` with `windowStatus=noisy`, so the spike is treated as noise instead of structural regression |
| Hotspot-stable example | `confidence` | current `394.38s`, previous `401.12s`, baseline `394.38s`, budget `450s` | `healthy`; dominant families stayed flat and the top files remained the baseline compare matrix pair plus onboarding-wizard enforcement |
| Approved baseline recalibration | `fast-feedback` | current `176.30s`, previous `176.00s`, baseline reset from `176.74s` to `182.00s`, budget `200s` | baseline recalibration recorded as `approved` with rationale `post-improvement-reset` after the lane stabilized |
| Rejected budget recalibration | `fast-feedback` | current `193.00s`, previous `176.00s`, baseline `176.74s`, budget `200s` | `budget-near`, but budget recalibration stayed `rejected` with rationale `noise-rejected` |
| Candidate budget review | `confidence` | current `460.00s`, previous `420.00s`, baseline `394.38s`, budget `450s` | `regressed`, budget review emitted as a `candidate` only after a five-run evidence window |
| Primary-lane cold starts | `browser`, `heavy-governance` | `109.67s/150s` and `228.34s/300s` | both reported `unstable` on first refresh, which is the intended cold-start behavior |
| Support-lane path | `profiling`, `junit` | `2701.51s/3000s` and `380.14s/450s` | both wrappers now emit bounded `trend-history.json`; `junit` support-lane report refresh was repaired so the documented command actually works |

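
The headroom behind each row is simple arithmetic: budget utilization is `current / budget`, and headroom is `budget - current`. The helper below is a hypothetical sketch for spot-checking rows by hand; the name `budget_usage` and the output format are illustrative and do not mirror any field in `budget.json`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print budget utilization and remaining headroom.
# Usage: budget_usage CURRENT_SECONDS BUDGET_SECONDS
budget_usage() {
  awk -v cur="$1" -v budget="$2" 'BEGIN {
    # Negative headroom means the lane is over budget.
    printf "%.1f%% used, %.2fs headroom\n", cur * 100 / budget, budget - cur
  }'
}

budget_usage 433.00 450   # prints: 96.2% used, 17.00s headroom
budget_usage 460.00 450   # prints: 102.2% used, -10.00s headroom
```

The first call corresponds to the stable budget-near `confidence` row, the second to the candidate budget review row.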
## Representative Evidence Set

Capture at least one example for each of the following before calling the feature complete:

1. Three sequential comparable samples for each primary lane: `fast-feedback`, `confidence`, `heavy-governance`, and `browser`.
2. `healthy`: current runtime comfortably below budget with stable or improving recent comparable history.
3. `budget-near`: current runtime remains under budget but inside the lane's near-budget headroom band.
4. `trending-worse`: a bounded comparable window shows repeated worsening that is larger than the lane noise floor.
5. `regressed`: a budget breach or materially repeated worsening is clearly visible.
6. `unstable`: insufficient comparable history, fingerprint mismatch, or noisy evidence makes a stable label unsafe.
7. Approved recalibration case: explicit evidence shows why repository truth should change.
8. Rejected recalibration case: explicit evidence shows why repository truth should stay unchanged.
9. One support-lane example from `junit` or `profiling` when it materially improves hotspot or comparison evidence.

Each recorded example should name the lane, current runtime, previous runtime, baseline, budget, health class, hotspot summary, and the recalibration conclusion when relevant.

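
One way to keep recorded examples uniform is to print each one as a Markdown table row in the same column layout as the evidence snapshot. The function below is a hypothetical sketch; `evidence_row` is not an existing script, and the outcome argument is free text that should carry the health class, hotspot summary, and recalibration conclusion.

```bash
#!/usr/bin/env bash
# Hypothetical helper: format one evidence example as a Markdown table row.
# Usage: evidence_row SCENARIO LANE CURRENT PREVIOUS BASELINE BUDGET OUTCOME
evidence_row() {
  local scenario="$1" lane="$2" current="$3" previous="$4"
  local baseline="$5" budget="$6" outcome="$7"
  # Runtimes are given in seconds without the unit; the "s" is appended here.
  printf '| %s | `%s` | current `%ss`, previous `%ss`, baseline `%ss`, budget `%ss` | %s |\n' \
    "$scenario" "$lane" "$current" "$previous" "$baseline" "$budget" "$outcome"
}
```

For example, `evidence_row "Stable healthy window" fast-feedback 176.10 175.60 176.74 200 '`healthy`, no recalibration recommended'` reproduces the corresponding snapshot row.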
Material runtime drift, bundle-hydration caveats, and approved or rejected recalibration follow-up must be recorded in `specs/211-runtime-trend-recalibration/spec.md` or the active implementation PR. This quickstart may mirror the same evidence, but it does not replace the delivery record.

## CI Rollout Notes

- CI should hydrate the previous matching `*-latest.trend-history.json` from the most recent comparable uploaded artifact bundle before the report refresh step.
- The uploaded bundle for each governed workflow must include the refreshed `*-latest.trend-history.json` so the next run only needs one prior bundle.
- The workflow-owned refresh steps now pass `--fetch-latest-history` together with `TENANTATLAS_GITEA_TOKEN` and top-level `actions: read` plus `contents: read` permissions so bundle discovery stays explicit.
- Pull request and `dev` push validation remain the narrowest proving paths; heavy/browser/manual/scheduled lanes provide representative cross-lane evidence and must not be widened.

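
When reproducing a workflow refresh step locally, the token may not be present. A small guard can pick the history flag accordingly. This is a hypothetical sketch: the function name `latest_history_flag` is illustrative, and falling back to `--skip-latest-history` when `TENANTATLAS_GITEA_TOKEN` is unset is an assumption about a sensible local default, not documented workflow behavior.

```bash
#!/usr/bin/env bash
# Hypothetical guard: choose the history flag based on token availability.
# Usage: latest_history_flag
latest_history_flag() {
  if [ -n "${TENANTATLAS_GITEA_TOKEN:-}" ]; then
    # Token available: bundle discovery can run.
    printf -- '--fetch-latest-history\n'
  else
    # No token (assumed local fallback): skip discovery.
    printf -- '--skip-latest-history\n'
  fi
}
```

A local call could then look like `./scripts/platform-test-report confidence "$(latest_history_flag)"`. Without prior history the report is expected to show the cold-start `unstable` class, as in the recorded cold-start rows.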
## Final Review Checklist

- Trend policy lives in repository truth, not workflow prose.
- `summary.md`, `report.json`, `budget.json`, and `*-latest.trend-history.json` agree on lane runtime and health class.
- Baseline and budget recalibration remain explicit, reviewable, and separate.
- Hotspot summaries stay readable and bounded.
- A timed reviewer dry run confirms the generated summary remains decidable within two minutes.
- The implementation does not add product persistence, routes, assets, or a second analytics surface.