Main Confidence / confidence (push) Failing after 46s

feat: implement runtime trend recalibration reporting (#244)

## Summary
- implement Spec 211 runtime trend reporting with bounded lane history, drift classification, hotspot trend output, and recalibration evidence handling
- extend the repo-truth governance seams and workflow wrappers for comparable-bundle hydration, trend artifact publication, and contract-backed reporting
- add the Spec 211 planning artifacts, data model, quickstart, tasks, and repository contract documents

## Validation
- parsed `specs/211-runtime-trend-recalibration/contracts/test-runtime-trend-history.schema.json`
- parsed `specs/211-runtime-trend-recalibration/contracts/test-runtime-trend.logical.openapi.yaml`
- re-ran cross-artifact consistency analysis for the Spec 211 artifact set until no material findings remained
- no application test suite was re-run as part of this final commit/push/PR step

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #244

2026-04-18 07:36:05 +00:00

Data Model: Test Runtime Trend Reporting & Baseline Recalibration

This feature adds repository-owned governance artifacts only. It does not add product database tables. All objects below are implemented as manifest metadata, generated JSON payloads, markdown summaries, or guard-test fixtures derived from the existing lane report outputs.

1. LaneTrendPolicy

Purpose: Defines the lane-specific rules for bounded history retention, comparable-window evaluation, hotspot visibility, and recalibration guidance.

| Field | Type | Description |
| --- | --- | --- |
| `laneId` | string | Canonical lane identifier (`fast-feedback`, `confidence`, `heavy-governance`, `browser`, `junit`, `profiling`). |
| `workflowProfile` | string | Workflow profile that owns the lane history source in CI. |
| `retentionLimit` | integer | Maximum number of history records retained for the lane. |
| `comparisonWindowSize` | integer | Number of recent comparable records used for drift evaluation. |
| `minimumComparableSamples` | integer | Minimum comparable sample count required before any health class other than `unstable` is allowed. |
| `varianceFloorSeconds` | integer | Minimum meaningful delta for the lane, aligned with current enforcement tolerance. |
| `nearBudgetHeadroomSeconds` | integer | Headroom threshold for `budget-near`. |
| `hotspotFamilyLimit` | integer | Maximum number of family deltas shown in readable summaries. |
| `hotspotFileLimit` | integer | Maximum number of file hotspots shown in readable summaries. |
| `slowestEntryRetention` | integer | Maximum number of slowest test entries retained in JSON evidence. |
| `recalibrationPolicy` | array | Rule summary for acceptable baseline and budget recalibration triggers. |

Relationships

- One `LaneTrendPolicy` governs many `LaneTrendRecord` entries for the same lane.
- One `LaneTrendPolicy` informs one `TrendComparisonWindow`, one `LaneDriftAssessment`, and zero or more `RecalibrationDecisionRecord` entries per reporting cycle.

Validation Rules

- `retentionLimit` must be greater than or equal to `comparisonWindowSize`.
- `minimumComparableSamples` must be at least 3.
- `varianceFloorSeconds` must align with or exceed the lane's existing enforcement tolerance.
- Primary lanes use a larger retention window than support lanes.
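
The per-policy rules above could be checked with a small guard along these lines. This is a minimal sketch, not the repository's implementation: the `LaneTrendPolicy` dataclass, the `validate_policy` helper, and the `enforcement_tolerance_seconds` parameter are hypothetical names introduced for illustration, and only the field names come from the table. The primary-versus-support retention rule needs a comparison across policies and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneTrendPolicy:
    # Hypothetical container holding only the fields the per-policy rules need.
    lane_id: str
    retention_limit: int
    comparison_window_size: int
    minimum_comparable_samples: int
    variance_floor_seconds: int


def validate_policy(policy, enforcement_tolerance_seconds):
    """Return a list of rule violations; an empty list means the policy passes."""
    errors = []
    if policy.retention_limit < policy.comparison_window_size:
        errors.append("retentionLimit must be >= comparisonWindowSize")
    if policy.minimum_comparable_samples < 3:
        errors.append("minimumComparableSamples must be at least 3")
    # The tolerance is passed in because this document does not say where it is stored.
    if policy.variance_floor_seconds < enforcement_tolerance_seconds:
        errors.append("varianceFloorSeconds must align with or exceed the enforcement tolerance")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a guard test report every violated rule in one pass.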

2. LaneTrendRecord

Purpose: Captures the per-run evidence snapshot that can safely be compared over time.

| Field | Type | Description |
| --- | --- | --- |
| `runRef` | string | Stable run reference from CI or local execution. |
| `laneId` | string | Governed lane identifier. |
| `workflowId` | string | Workflow profile or logical workflow owner for the run. |
| `triggerClass` | string | Pull request, mainline push, manual, scheduled, or local classification. |
| `generatedAt` | datetime | When the record was emitted. |
| `wallClockSeconds` | number | Current lane runtime in seconds. |
| `baselineSeconds` | number or null | Current comparison baseline for the lane if defined. |
| `baselineSource` | string | Manifest source or comparison source that supplied the baseline. |
| `budgetSeconds` | number | Current lane budget threshold in seconds. |
| `budgetStatus` | string | Current lane budget status from the existing budget evaluator. |
| `blockingStatus` | string | Whether the current CI context blocks on this outcome. |
| `comparisonFingerprint` | string | Hash or structured fingerprint capturing comparability boundaries. |
| `classificationTotals` | array | Runtime grouped by current classification totals. |
| `familyTotals` | array | Runtime grouped by current family totals. |
| `hotspotFiles` | array | Current dominant hotspot files. |
| `slowestEntries` | array | Current slowest test entries, capped by policy. |
| `artifactRefs` | array | References to the summary, report, budget, JUnit, and history artifacts backing the record. |

Validation Rules

- A record must derive from the same lane's current `summary.md`, `report.json`, `budget.json`, and available JUnit output.
- `comparisonFingerprint` must be present for any record eligible for comparison.
- `wallClockSeconds`, `budgetSeconds`, and `generatedAt` are required.
- `slowestEntries` must not exceed the lane policy retention cap.

3. TrendComparisonWindow

Purpose: Represents the bounded comparable history used to evaluate one lane in one reporting cycle.

| Field | Type | Description |
| --- | --- | --- |
| `laneId` | string | Governed lane identifier. |
| `policyRef` | string | Reference to the governing `LaneTrendPolicy`. |
| `currentRecord` | object | The latest `LaneTrendRecord`. |
| `previousComparableRecord` | object or null | The most recent prior comparable record, if one exists. |
| `comparableRecords` | array | Ordered comparable records used for trend evaluation. |
| `excludedRecords` | array | Recent records skipped because of fingerprint mismatch or invalid evidence. |
| `windowStatus` | enum | `stable`, `insufficient-history`, `scope-changed`, or `noisy`. |
| `sampleCount` | integer | Number of comparable records in the active window. |

Validation Rules

- Every comparable record must share the same `comparisonFingerprint`.
- `sampleCount` may not exceed `comparisonWindowSize`.
- `previousComparableRecord` must be the immediately preceding entry in `comparableRecords` when present.
- `windowStatus` becomes `insufficient-history` whenever `sampleCount` is below `minimumComparableSamples`.
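
One way to assemble a window that satisfies these rules is sketched below. It rests on several assumptions that this document does not confirm: `comparableRecords` is ordered oldest to newest and ends with the current record, `sampleCount` counts the current record, and the `Record` dataclass and `build_window` function are hypothetical names. The `scope-changed` and `noisy` statuses are not modeled here, because their triggering conditions are not spelled out in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical reduced LaneTrendRecord with only the fields used here.
    run_ref: str
    comparison_fingerprint: str
    wall_clock_seconds: float


def build_window(current, history, window_size, minimum_samples):
    """Build a bounded window; `history` is oldest to newest and excludes `current`."""
    matching, excluded = [], []
    for record in history:
        if record.comparison_fingerprint == current.comparison_fingerprint:
            matching.append(record)
        else:
            excluded.append(record)  # fingerprint mismatch

    # Keep the newest prior records, leaving one slot for the current record.
    prior = matching[-(window_size - 1):] if window_size > 1 else []
    comparable = prior + [current]
    sample_count = len(comparable)

    return {
        "currentRecord": current,
        "previousComparableRecord": prior[-1] if prior else None,
        "comparableRecords": comparable,
        "excludedRecords": excluded,
        "sampleCount": sample_count,
        "windowStatus": "stable" if sample_count >= minimum_samples else "insufficient-history",
    }
```

With this ordering, `previousComparableRecord` is always the entry directly before the current record in `comparableRecords`, and `sampleCount` can never exceed `window_size`.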

4. LaneDriftAssessment

Purpose: Summarizes the current drift verdict for one lane using the bounded comparison window.

| Field | Type | Description |
| --- | --- | --- |
| `laneId` | string | Governed lane identifier. |
| `healthClass` | enum | `healthy`, `budget-near`, `trending-worse`, `regressed`, or `unstable`. |
| `deltaToPreviousSeconds` | number or null | Current runtime delta vs previous comparable run. |
| `deltaToPreviousPercent` | number or null | Percent delta vs previous comparable run. |
| `deltaToBaselineSeconds` | number or null | Current runtime delta vs lane baseline. |
| `deltaToBaselinePercent` | number or null | Percent delta vs lane baseline. |
| `budgetHeadroomSeconds` | number | Remaining headroom before budget breach. |
| `worseningStreak` | integer | Count of recent comparable records showing meaningful worsening. |
| `varianceObservedSeconds` | number | Effective variance observed across the active window. |
| `recalibrationRecommendation` | enum | `none`, `investigate`, `review-baseline`, or `review-budget`. |
| `summaryLine` | string | Human-readable explanation emitted into markdown summaries. |

Validation Rules

- `healthClass` may only be non-`unstable` when the comparison window has at least `minimumComparableSamples` comparable records.
- `recalibrationRecommendation` must remain separate from `healthClass`.
- `budgetHeadroomSeconds` may be negative only when the lane is over budget.
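
The delta and headroom fields can be derived from four numbers, as in the sketch below. The formulas are assumptions rather than documented definitions: each delta is taken as current minus reference, each percent is relative to the reference value, and headroom is budget minus current runtime, which makes it negative exactly when the lane is over budget. `drift_metrics` is a hypothetical helper name.

```python
def drift_metrics(current_seconds, previous_seconds, baseline_seconds, budget_seconds):
    """Compute the numeric LaneDriftAssessment fields from raw runtimes."""

    def delta(reference):
        # A missing reference (no previous run, no baseline) yields nulls.
        if reference is None:
            return None, None
        seconds = current_seconds - reference
        # Guard against a zero reference to avoid division by zero.
        percent = seconds * 100.0 / reference if reference else None
        return seconds, percent

    prev_seconds, prev_percent = delta(previous_seconds)
    base_seconds, base_percent = delta(baseline_seconds)
    return {
        "deltaToPreviousSeconds": prev_seconds,
        "deltaToPreviousPercent": prev_percent,
        "deltaToBaselineSeconds": base_seconds,
        "deltaToBaselinePercent": base_percent,
        "budgetHeadroomSeconds": budget_seconds - current_seconds,
    }
```

For example, a 110-second run against a 100-second previous run and a 120-second budget gives a 10-second delta, a 10 percent delta, and 10 seconds of headroom.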

5. HotspotTrendSnapshot

Purpose: Captures how the dominant runtime contributors changed between the current and previous comparable run.

| Field | Type | Description |
| --- | --- | --- |
| `laneId` | string | Governed lane identifier. |
| `familyDeltas` | array | Top family-level deltas with current seconds, previous seconds, and delta values. |
| `fileHotspots` | array | Top file hotspots with current/previous runtime and rank movement. |
| `newEntrants` | array | Families or files newly entering the visible hotspot set. |
| `droppedEntrants` | array | Families or files leaving the visible hotspot set. |
| `evidenceAvailability` | enum | `available` or `unavailable`, used when JUnit or attribution evidence is missing. |

Validation Rules

- Human-readable summaries must cap output at the policy's family/file limits.
- JSON evidence may retain more detail, but must not exceed `slowestEntryRetention`.
- If hotspot evidence is unavailable, the summary must say so explicitly.
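
A capped family-level snapshot might be computed as follows. This is an illustrative sketch under stated assumptions: the inputs are plain mappings from family name to seconds, the "visible hotspot set" is taken to be the top `family_limit` families by runtime in each run, and the helper names and the keys inside each `familyDeltas` item are hypothetical. File hotspots and rank movement are omitted.

```python
def top_names(totals, limit):
    # Sort by descending runtime, then by name so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


def hotspot_snapshot(lane_id, current, previous, family_limit):
    """Compare family runtimes between the current and previous comparable run."""
    if not current or not previous:
        # Missing evidence is reported explicitly instead of as an empty result.
        return {
            "laneId": lane_id,
            "familyDeltas": [],
            "newEntrants": [],
            "droppedEntrants": [],
            "evidenceAvailability": "unavailable",
        }

    visible_now = top_names(current, family_limit)
    visible_before = top_names(previous, family_limit)

    deltas = []
    for name in visible_now:
        before = previous.get(name)
        deltas.append({
            "name": name,
            "currentSeconds": current[name],
            "previousSeconds": before,
            "deltaSeconds": None if before is None else current[name] - before,
        })

    return {
        "laneId": lane_id,
        "familyDeltas": deltas,
        "newEntrants": [n for n in visible_now if n not in visible_before],
        "droppedEntrants": [n for n in visible_before if n not in visible_now],
        "evidenceAvailability": "available",
    }
```

Because the cap is applied before the deltas are built, the readable summary can never list more families than the policy limit allows.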

6. RecalibrationDecisionRecord

Purpose: Records structured evidence for a proposed, approved, or rejected baseline/budget recalibration.

| Field | Type | Description |
| --- | --- | --- |
| `laneId` | string | Governed lane identifier. |
| `targetType` | enum | `baseline` or `budget`. |
| `decisionStatus` | enum | `candidate`, `approved`, or `rejected`. |
| `evidenceRunRefs` | array | Comparable runs supporting the decision. |
| `previousValueSeconds` | number | Existing baseline or budget value. |
| `proposedValueSeconds` | number or null | Proposed replacement value. |
| `rationaleCode` | enum | `lane-scope-change`, `infrastructure-shift`, `post-improvement-reset`, `sustained-erosion`, `noise-rejected`, or `manual-hold`. |
| `recordedIn` | string | Active spec path or implementation PR reference where the decision is documented. |
| `notes` | string | Concise reviewer-facing explanation. |

Validation Rules

- Approved baseline changes require at least one accepted rationale tied to scope or environment truth.
- Approved budget changes require a stronger evidence window than approved baseline changes.
- Rejected decisions must retain the rejection reason.
- The artifact may propose candidates, but approval remains human-controlled.

7. TrendSummaryCycle

Purpose: Represents one generated trend-aware reporting cycle across the relevant lanes.

| Field | Type | Description |
| --- | --- | --- |
| `cycleId` | string | Reporting-cycle identifier, typically anchored to the current lane run or summary generation timestamp. |
| `generatedAt` | datetime | When the cycle summary was emitted. |
| `laneSummaries` | array | Per-lane summary entries containing `laneId`, current runtime, previous comparable runtime, baseline, budget, and the embedded drift assessment used by the readable summary surface. |
| `laneAssessments` | array | `LaneDriftAssessment` items for all relevant lanes. |
| `hotspotSnapshots` | array | `HotspotTrendSnapshot` items for lanes with available evidence. |
| `recalibrationDecisions` | array | Candidate, approved, or rejected recalibration records emitted for the cycle. |
| `artifactPublicationStatus` | array | Whether required current-run and history artifacts were published successfully. |
| `warnings` | array | Legibility notes such as missing comparable history or unavailable hotspot evidence. |

Validation Rules

- Every relevant primary lane must have exactly one `laneSummaries` entry and exactly one `LaneDriftAssessment` per cycle.
- Each `laneSummaries` entry must expose the current runtime, previous comparable runtime, baseline, budget, and embedded health assessment needed by the readable summary surface.
- `warnings` must be explicit when any required evidence is unavailable.
- The cycle summary must stay readable without requiring a second dashboard surface.
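
The exactly-one-entry rule lends itself to a simple count check, sketched below. The function name `validate_cycle` and the `primary_lanes` parameter are hypothetical; which lanes count as primary is not listed in this section, so the caller supplies them. The sketch assumes each entry in `laneSummaries` and `laneAssessments` carries a `laneId` key.

```python
from collections import Counter


def validate_cycle(cycle, primary_lanes):
    """Check that each primary lane appears exactly once in both per-lane arrays."""
    errors = []
    for key in ("laneSummaries", "laneAssessments"):
        counts = Counter(entry["laneId"] for entry in cycle.get(key, []))
        for lane in primary_lanes:
            # Counter returns 0 for lanes that never appear.
            if counts[lane] != 1:
                errors.append(
                    f"{lane}: expected exactly one {key} entry, found {counts[lane]}"
                )
    return errors
```

Both a missing lane and a duplicated lane are reported, since either one breaks the one-entry-per-cycle guarantee.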

State Transitions

LaneDriftAssessment.healthClass

- `unstable` -> `healthy`: allowed once there are enough comparable samples and the lane is comfortably below budget without sustained worsening.
- `unstable` -> `budget-near`: allowed once there are enough comparable samples and budget headroom falls inside the near-budget window.
- `unstable` -> `trending-worse`: allowed once there are enough comparable samples and worsening exceeds the lane variance floor across the bounded window.
- `healthy` <-> `budget-near`: allowed as headroom enters or leaves the near-budget band.
- `healthy` or `budget-near` -> `trending-worse`: allowed when sustained worsening appears without a budget breach.
- `trending-worse` -> `regressed`: allowed when the lane breaches budget or shows a repeated worsening trend too severe to be classified as gradual erosion.
- Any state -> `unstable`: allowed when comparability breaks, history is insufficient, or the window is too noisy to classify reliably.
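
The transitions listed above can be encoded as a lookup table, as in this sketch. Two points are assumptions rather than statements from this document: transitions that are not listed are treated as disallowed, and staying in the same class between cycles is treated as allowed. The names `HEALTH_TRANSITIONS` and `is_allowed_health_transition` are hypothetical.

```python
# Targets reachable from each class, excluding the universal move to "unstable".
HEALTH_TRANSITIONS = {
    "unstable": {"healthy", "budget-near", "trending-worse"},
    "healthy": {"budget-near", "trending-worse"},
    "budget-near": {"healthy", "trending-worse"},
    "trending-worse": {"regressed"},
    "regressed": set(),
}


def is_allowed_health_transition(previous, current):
    """Return True if moving from `previous` to `current` matches the listed rules."""
    for name in (previous, current):
        if name not in HEALTH_TRANSITIONS:
            raise ValueError(f"unknown healthClass: {name}")
    # Any state may fall back to "unstable"; an unchanged class is assumed valid.
    if current == "unstable" or current == previous:
        return True
    return current in HEALTH_TRANSITIONS[previous]
```

Under this reading, a `regressed` lane can only leave that class by passing through `unstable`, which a reviewer may or may not intend; the table makes that consequence easy to spot and adjust.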

RecalibrationDecisionRecord.decisionStatus

- `candidate` -> `approved`: allowed only by explicit human review with structured evidence.
- `candidate` -> `rejected`: allowed when the evidence is noisy, incomplete, or policy says repository truth should not move.
- `approved` and `rejected`: terminal statuses for the recorded decision.
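
A guard for these status changes might look like the sketch below. It is an assumption-laden illustration: the record is modeled as a plain dict using the documented field names, while `transition_decision`, `DecisionTransitionError`, and the `reviewed_by` argument are hypothetical and stand in for whatever mechanism records the human review. The reviewer value is only checked, not stored, because the table defines no field for it.

```python
class DecisionTransitionError(ValueError):
    """Raised when a decisionStatus change violates the transition rules."""


def transition_decision(record, new_status, reviewed_by=None):
    """Return a copy of `record` with the new status, or raise if not allowed."""
    current = record["decisionStatus"]
    if current != "candidate":
        # approved and rejected are terminal.
        raise DecisionTransitionError(f"{current} is a terminal status")
    if new_status not in ("approved", "rejected"):
        raise DecisionTransitionError(f"cannot move candidate to {new_status}")
    if new_status == "approved":
        # Approval needs an explicit human reviewer and supporting evidence.
        if not reviewed_by:
            raise DecisionTransitionError("approval requires explicit human review")
        if not record.get("evidenceRunRefs"):
            raise DecisionTransitionError("approval requires structured evidence")
    return {**record, "decisionStatus": new_status}
```

Returning a new dict instead of mutating the input keeps the original candidate record available for audit.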