Spec 210: implement CI test matrix budget enforcement (#243)

## Summary
- add explicit Gitea workflow files for PR Fast Feedback, `dev` Confidence, Heavy Governance, and Browser lanes
- extend the repo-truth lane support seams with workflow profiles, trigger-aware budget enforcement, artifact publication contracts, CI summaries, and failure classification
- add deterministic artifact staging, new CI governance guard coverage, and Spec 210 planning/contracts/docs updates

## Validation
- `cd apps/platform && ./vendor/bin/sail bin pint --dirty --format agent`
- `cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/Guards/CiFastFeedbackWorkflowContractTest.php tests/Feature/Guards/CiConfidenceWorkflowContractTest.php tests/Feature/Guards/CiHeavyBrowserWorkflowContractTest.php tests/Feature/Guards/CiLaneFailureClassificationContractTest.php tests/Feature/Guards/FastFeedbackLaneContractTest.php tests/Feature/Guards/ConfidenceLaneContractTest.php tests/Feature/Guards/HeavyGovernanceLaneContractTest.php tests/Feature/Guards/BrowserLaneIsolationTest.php tests/Feature/Guards/FixtureLaneImpactBudgetTest.php tests/Feature/Guards/TestLaneManifestTest.php tests/Feature/Guards/TestLaneArtifactsContractTest.php tests/Feature/Guards/TestLaneCommandContractTest.php`
- `./scripts/platform-test-lane fast-feedback`
- `./scripts/platform-test-lane confidence`
- `./scripts/platform-test-lane heavy-governance`
- `./scripts/platform-test-lane browser`
- `./scripts/platform-test-report fast-feedback`
- `./scripts/platform-test-report confidence`

## Notes
- scheduled Heavy Governance and Browser workflows stay gated behind `TENANTATLAS_ENABLE_HEAVY_GOVERNANCE_SCHEDULE=1` and `TENANTATLAS_ENABLE_BROWSER_SCHEDULE=1`
- the remaining rollout evidence task is capturing the live Gitea run set this PR enables: PR Fast Feedback, `dev` Confidence, manual and scheduled Heavy Governance, and manual and scheduled Browser runs

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #243

2026-04-17 18:04:35 +00:00


Feature Specification: CI Test Matrix & Runtime Budget Enforcement

Feature Branch: 210-ci-matrix-budget-enforcement
Created: 2026-04-17
Status: Draft
Input: User description: "Spec 210 — CI Test Matrix & Runtime Budget Enforcement"

Spec Candidate Check (mandatory — SPEC-GATE-001)

Problem: TenantPilot's test-lane governance is now credible locally, but shared repository validation still does not enforce the agreed lane paths, runtime budgets, or artifact contract.
Today's failure: Pull requests can silently widen the fast path, budget drift can accumulate without shared visibility, expensive lanes can bleed into the wrong triggers, and reviewers cannot rely on standardized CI evidence.
User-visible improvement: Contributors and reviewers get predictable fast, confidence, heavy, and browser validation paths with visible budget outcomes, standardized artifacts, and clear blocking versus non-blocking semantics.
Smallest enterprise-capable version: Wire repository CI to the existing checked-in lane entry points, classify each lane's budget outcome as blocking, warning, or informational per trigger, standardize per-lane artifacts, and document how contributors reproduce and interpret the governed paths.
Explicit non-goals: No new lane taxonomy, no new fixture-slimming or hotspot-optimization program, no broad build or deploy redesign, no new browser strategy, and no historical trend platform beyond per-run visibility.
Permanent complexity imported: Trigger-to-lane policy, budget enforcement classes, standardized artifact naming and retention rules, failure classification vocabulary, validation evidence, and concise contributor guidance.
Why now: Specs 206 through 209 created the necessary lane wrappers, fixture cost reductions, heavy-lane separation, and budget honesty; without CI enforcement those gains can drift during ordinary team and PR flow.
Why not local: Local wrappers and scripts cannot protect shared pull request behavior, enforce consistent blocking semantics, or publish comparable artifacts for reviewers and maintainers.
Approval class: Cleanup
Red flags triggered: New CI enforcement vocabulary and harder policy semantics. Defense: the scope stays repository-level, reuses existing lane truth, and avoids inventing a parallel execution model or new product runtime structures.
Score: Benefit: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 1 | Reuse: 2 | Total: 10/12
Decision: approve

Spec Scope Fields (mandatory)

Scope: workspace
Primary Routes: No end-user HTTP routes change. The affected surfaces are repository-owned CI validation paths, checked-in lane entry points, run summaries, artifact outputs, and contributor guidance.
Data Ownership: Workspace-owned CI workflow definitions, lane policy, artifact naming conventions, budget classification results, validation evidence, and contributor documentation. No tenant-owned records or product runtime tables are introduced.
RBAC: No end-user authorization behavior changes. The affected actors are contributors, reviewers, maintainers, and CI runners operating against the shared test-governance contract.

Proportionality Review (mandatory when structural complexity is introduced)

New source of truth?: no
New persisted entity/table/artifact?: yes, but only repository-owned CI artifacts and validation outputs; no new product database persistence is introduced
New abstraction?: yes, but limited to a repository-level trigger-to-lane enforcement model, artifact contract, and budget outcome policy
New enum/state/reason family?: yes, but only repository-level failure and budget classification values used to keep CI outcomes legible
New cross-domain UI framework/taxonomy?: no
Current operator problem: Contributors and reviewers cannot rely on CI to preserve lane discipline, budget honesty, or standardized evidence across pull request and mainline work.
Existing structure is insufficient because: Local wrappers, budgets, and reports prove the governance model conceptually, but they do not stop shared workflows from drifting or guarantee that reviewers see the same artifacts and enforcement semantics.
Narrowest correct implementation: Reuse the existing lane wrappers and budgets, map them to explicit CI triggers, standardize artifacts and failure classes, and document how to reproduce each governed path locally.
Ownership cost: The team must maintain trigger policy, artifact naming, budget classes, validation coverage, and contributor guidance as lane budgets and runner behavior evolve.
Alternative intentionally rejected: Inline CI commands or a one-size-fits-all full-suite gate, because both duplicate repository truth and would either hide drift or overburden the fast path.
Release truth: Current-release repository truth required to operationalize the already approved local governance work from Specs 206 through 209.

Problem Statement

TenantPilot has already reshaped the test suite structurally:

Lane governance and runtime budgets now exist.
Shared fixture cost has been reduced.
Heavy Filament or Livewire families have been separated from lighter paths.
Heavy-governance cost has been made honest and explicitly visible.

That progress is still vulnerable to drift until repository CI becomes the enforcement surface.

The open risks are now operational rather than conceptual:

Shared pull request validation can still grow beyond the intended fast path.
Runtime budgets remain advisory until CI classifies and reacts to overruns.
JUnit-style results, lane reports, and budget evidence remain too easy to treat as local debugging aids instead of shared run outputs.
Heavy Governance and Browser have a different cost class from Fast Feedback and Confidence, but they are not yet guaranteed to stay on separate trigger cadences.
Without explicit failure semantics, the team either over-blocks the normal flow or under-enforces the very governance it just created.

Without this feature, the repository has a better-organized suite but not yet an institutionalized validation contract.

Dependencies

Depends on Spec 206 - Test Suite Governance & Performance Foundation for lane vocabulary, budgets, and checked-in entry points.
Depends on Spec 207 - Shared Test Fixture Slimming for the reduced default per-test cost that makes lane budgets credible.
Depends on Spec 208 - Heavy Suite Segmentation for the honest separation of heavy Filament or Livewire families.
Depends on Spec 209 - Heavy Governance Lane Cost Reduction for a more stable heavy-lane budget basis before CI enforcement hardens.
Recommended only after the local lane wrappers are stable, the initial budgets are credible, and the Heavy Governance classification is clean.
Blocks durable team-wide enforcement of the new test-governance model in everyday pull request and mainline flow.
Does not block isolated local development while contributors still follow the existing lane rules manually.

Goals

Establish the existing lane wrappers as the required CI execution paths.
Make runtime budgets machine-checked and visible in shared validation.
Separate pull request, mainline, scheduled, and manual validation intentionally.
Standardize machine-readable test results, lane reports, and budget evidence as part of the CI contract.
Surface lane drift and budget erosion early enough to correct.
Distinguish blocking signals clearly from non-blocking (warning or informational) signals.
Make the repository's test governance durable for the team rather than dependent on local discipline.

Non-Goals

Creating another fixture-slimming effort.
Re-segmenting the lane model created in earlier specs.
Fixing every individual performance hotspot inside this spec.
Redesigning the broader build, deploy, or environment pipeline.
Replacing the existing browser strategy beyond its CI placement and enforcement level.
Introducing a long-horizon historical trend platform; this spec is about per-run enforcement and visibility.

Assumptions

The lane wrappers and budget definitions established by Specs 206 through 209 are mature enough to serve as CI entry points with only targeted hardening.
Repository CI can retain per-lane artifacts and expose them to contributors or reviewers for governed runs.
Fast Feedback and Confidence can become stricter earlier than Heavy Governance or Browser if the latter still need softer enforcement while their budgets stabilize.
Validation can use representative CI runs to prove the matrix, without requiring multi-week trend infrastructure in this feature.

Test Governance Impact (mandatory — TEST-GOV-001)

Affected validation lanes: `fast-feedback` for blocking pull request validation, `confidence` for `dev` push validation, `heavy-governance` for separate manual and scheduled heavy validation, and `browser` for separate manual and scheduled browser validation. `profiling` and `junit` remain support-only lanes outside the default CI trigger matrix.
Fixture/helper cost risk: Low and bounded. This feature adds CI workflow files, CI-governance guards, and an artifact staging helper only. It MUST NOT introduce new shared product fixtures, widen default guard setup, or accidentally promote CI-governance coverage into Heavy Governance or Browser lane membership.
Heavy/browser impact: No new browser scenarios or heavy-governance families are introduced. The feature only operationalizes the existing Heavy Governance and Browser lanes as explicit CI trigger classes and evidence bundles.
Budget/baseline follow-up: Fast Feedback hard-fail budget enforcement requires a documented CI variance tolerance before rollout is complete. Any material runtime drift or recalibration discovered during rollout MUST be recorded in this spec or the implementation PR.

Required Validation Evidence Set

One representative pull_request Fast Feedback run.
One representative push to dev Confidence run.
One manual heavy-governance workflow run.
One scheduled heavy-governance workflow run after schedules are enabled.
One manual browser workflow run.
One scheduled browser workflow run after schedules are enabled.
Each evidence record MUST identify the trigger, executed lane, published artifact bundle, budget outcome class, and primary failure class or explicit clean-success result. The Fast Feedback evidence record MUST reference the chosen CI variance tolerance, and any material runtime recalibration discovered during rollout MUST be recorded in this spec or the implementation PR and may be linked from the affected evidence records.
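
The evidence fields listed above could be captured in a small record type. The following is a minimal sketch, assuming Python tooling; the class name, field names, and the `within-budget` label are illustrative assumptions and do not come from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical vocabulary: the spec names blocking, warning, and informational
# budget classes; "within-budget" is an added label for a run with no overrun.
BUDGET_OUTCOMES = {"within-budget", "informational", "warning", "blocking"}


@dataclass(frozen=True)
class EvidenceRecord:
    trigger: str                                  # e.g. "pull_request"
    lane: str                                     # e.g. "fast-feedback"
    artifact_bundle: str                          # e.g. ".gitea-artifacts/pr-fast-feedback"
    budget_outcome: str                           # one of BUDGET_OUTCOMES
    primary_failure_class: Optional[str] = None   # None records an explicit clean success
    variance_tolerance_seconds: Optional[int] = None

    def problems(self) -> List[str]:
        """Return human-readable reasons why this record is incomplete."""
        found = []
        for field_name in ("trigger", "lane", "artifact_bundle"):
            if not getattr(self, field_name).strip():
                found.append(f"{field_name} is empty")
        if self.budget_outcome not in BUDGET_OUTCOMES:
            found.append(f"unknown budget outcome: {self.budget_outcome}")
        # The Fast Feedback record must reference the chosen CI variance tolerance.
        if self.lane == "fast-feedback" and self.variance_tolerance_seconds is None:
            found.append("fast-feedback evidence must reference the CI variance tolerance")
        return found
```

A record with an empty `problems()` list would be a candidate for inclusion in the evidence set; anything else points at the field that still needs to be captured.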

Artifact Publication Contract

Pull request runs stage upload bundles in .gitea-artifacts/pr-fast-feedback.
dev confidence runs stage upload bundles in .gitea-artifacts/main-confidence.
Heavy Governance runs stage upload bundles in .gitea-artifacts/heavy-governance.
Browser runs stage upload bundles in .gitea-artifacts/browser.
Every governed bundle MUST contain summary.md, budget.json, report.json, and junit.xml for the lane executed by that workflow.
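
A guard for this contract can be very small. The sketch below assumes the bundle has already been staged on disk and only checks for the four required file names; the function name is hypothetical, and the repository's actual guard tests may work differently.

```python
from pathlib import Path
from typing import List

# File names required in every governed bundle, as listed in the contract above.
REQUIRED_FILES = ("summary.md", "budget.json", "report.json", "junit.xml")


def missing_artifacts(bundle_dir: Path) -> List[str]:
    """Return the required file names that are absent from the staged bundle."""
    return [name for name in REQUIRED_FILES if not (bundle_dir / name).is_file()]
```

For example, `missing_artifacts(Path(".gitea-artifacts/pr-fast-feedback"))` returns an empty list only when all four files are present; a non-empty result would map to an artifact publication failure.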

Frozen Trigger Matrix

| Trigger | Workflow profile | Executed lane | Blocking semantics | Schedule state |
| --- | --- | --- | --- | --- |
| `pull_request` (`opened`, `reopened`, `synchronize`) | `pr-fast-feedback` | `fast-feedback` | Blocking for test, wrapper or manifest, artifact, and mature hard-fail budget failures | N/A |
| `push` to `dev` | `main-confidence` | `confidence` | Blocking for test, wrapper or manifest, and artifact failures; budget remains visible as warning-first | N/A |
| `workflow_dispatch` heavy run | `heavy-governance-manual` | `heavy-governance` | Warning-first / trend-oriented while baselines stabilize | Enabled at rollout |
| Scheduled heavy run | `heavy-governance-scheduled` | `heavy-governance` | Warning-first / trend-oriented while baselines stabilize | Enable only after one successful manual validation |
| `workflow_dispatch` browser run | `browser-manual` | `browser` | Warning-first / trend-oriented while baselines stabilize | Enabled at rollout |
| Scheduled browser run | `browser-scheduled` | `browser` | Informational or warning-first until stability evidence justifies more | Enable only after one successful manual validation |
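
The matrix can be read as a lookup from workflow profile to lane, plus an opt-in gate for the scheduled profiles. The following is a minimal sketch under stated assumptions: the profile and lane names come from the matrix, the two gate variable names come from the PR notes, and the Python types and functions are hypothetical illustrations rather than the repository's implementation.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkflowProfile:
    lane: str
    # Environment variable that must equal "1" before a scheduled run is allowed;
    # None means the profile is not schedule-gated.
    schedule_gate: Optional[str] = None


PROFILES = {
    "pr-fast-feedback": WorkflowProfile("fast-feedback"),
    "main-confidence": WorkflowProfile("confidence"),
    "heavy-governance-manual": WorkflowProfile("heavy-governance"),
    "heavy-governance-scheduled": WorkflowProfile(
        "heavy-governance", "TENANTATLAS_ENABLE_HEAVY_GOVERNANCE_SCHEDULE"
    ),
    "browser-manual": WorkflowProfile("browser"),
    "browser-scheduled": WorkflowProfile("browser", "TENANTATLAS_ENABLE_BROWSER_SCHEDULE"),
}


def lane_for(profile_name: str) -> str:
    """Resolve the single lane a workflow profile may execute."""
    try:
        return PROFILES[profile_name].lane
    except KeyError:
        # An unresolvable profile would be reported as a wrapper or manifest failure.
        raise ValueError(f"unknown workflow profile: {profile_name}") from None


def is_enabled(profile_name: str, env: Mapping[str, str]) -> bool:
    """Scheduled profiles run only when their gate variable is set to "1"."""
    gate = PROFILES[profile_name].schedule_gate
    return gate is None or env.get(gate) == "1"
```

Keeping one lane per profile makes it easy for a guard to assert that a pull request profile can never resolve to `heavy-governance` or `browser`.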

No-New-Fixture-Cost Rule

CI-governance changes in this spec MUST stay inside repo-level workflows, lightweight guard coverage, manifest or budget policy, and artifact staging.
The feature MUST NOT add shared product fixtures, broaden default setup in existing feature tests, or promote CI-governance coverage into Heavy Governance or Browser lane membership.
The documented Fast Feedback CI variance allowance is 15 seconds above the 200-second baseline threshold: a pull request run that exceeds the baseline by no more than that allowance reports a budget warning, and only a larger overrun is upgraded to a blocking failure.
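
The variance rule amounts to a three-way classification of the measured runtime. This sketch uses the 200-second baseline and 15-second allowance stated above; the function name, the outcome labels, and the choice to treat both boundaries as inclusive are assumptions.

```python
def classify_fast_feedback_budget(
    runtime_seconds: float,
    baseline_seconds: float = 200.0,
    tolerance_seconds: float = 15.0,
) -> str:
    """Classify a Fast Feedback runtime against the baseline and CI variance allowance."""
    if runtime_seconds <= baseline_seconds:
        return "within-budget"
    # Overruns inside the allowance stay visible but do not fail the pull request.
    if runtime_seconds <= baseline_seconds + tolerance_seconds:
        return "warning"
    return "blocking"
```

With these defaults, a 210-second run is a warning and a 216-second run is blocking.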

User Scenarios & Testing (mandatory)

User Story 1 - Enforce The Fast Pull Request Path (Priority: P1)

As a contributor opening a pull request, I want CI to run only the intended fast validation path and to fail quickly when that path or its hard runtime budget contract is violated.

Why this priority: Pull request validation is the highest-frequency shared feedback loop. If it stays ambiguous or slow, the rest of the governance model loses credibility.

Independent Test: Run a representative pull request validation against a minimal non-lane-expanding sample change, such as a documentation-only edit or workflow-comment change on a feature branch, and confirm that only the assigned fast lane executes, required artifacts are produced, and blocking test or budget failures stop the run.

Acceptance Scenarios:

Given a contributor opens or updates a pull request, When CI starts validation, Then it invokes only the assigned fast lane entry point and does not also execute Heavy Governance or Browser lanes.
Given the fast lane has a test failure or a blocking budget breach, When the run completes, Then the pull request path is marked failed and the summary identifies the failure class.
Given the fast lane succeeds, When the run completes, Then the run exposes the required per-lane artifacts and budget outcome from that same execution.

User Story 2 - Publish Mainline Confidence Evidence (Priority: P1)

As a maintainer protecting shared quality, I want the broader mainline validation path to run the intended confidence checks and publish standardized evidence that reviewers can inspect without reconstructing the run manually.

Why this priority: The repository needs a shared confidence path between quick author feedback and the expensive heavy lanes. That path must be broad enough to trust and explicit enough to review.

Independent Test: Execute a representative mainline validation run and verify that the assigned confidence lane executes, machine-readable test results and lane reports are published, and the budget result is classified separately from test success or failure.

Acceptance Scenarios:

Given a push reaches the protected mainline path, When repository validation runs, Then CI executes the assigned broader lane and publishes the required lane artifacts for that path.
Given the broader lane exceeds a budget with non-blocking semantics, When the run completes, Then the budget warning remains visible without being confused with a test failure.
Given the broader lane has a real test failure, When a reviewer inspects the run, Then the machine-readable results and lane-specific report are available without requiring a local rerun.

User Story 3 - Keep Heavy And Browser Validation Separate (Priority: P2)

As a maintainer or release owner, I want Heavy Governance and Browser validation to remain available in CI as separate cost classes so the normal fast path stays lean while the expensive lanes still stay visible and governable.

Why this priority: Expensive lanes remain important, but they should not silently piggyback onto ordinary pull request feedback or become invisible because they are too awkward to run and interpret.

Independent Test: Execute representative scheduled or manual Heavy Governance and Browser runs and confirm that each path runs separately, publishes lane-specific artifacts, and shows its declared enforcement semantics.

Acceptance Scenarios:

Given a scheduled or manually requested Heavy Governance run, When it executes, Then only the assigned heavy lane runs and its budget result is classified according to policy.
Given a scheduled or manually requested Browser run, When it executes, Then Browser evidence and artifacts are published separately from the fast and confidence paths.
Given a contributor checks the repository guidance, When they look up expected triggers, Then they can tell which triggers run Fast Feedback, Confidence, Heavy Governance, and Browser validation.

Edge Cases

What happens when a lane passes tests but fails to publish one or more required artifacts?
How does the system handle a lane whose budget definition is missing, unreadable, or inconsistent with the declared trigger policy?
What happens when a run lands close to a budget threshold because of normal runner variability?
How does the system handle a trigger that points to a lane entry point or manifest that no longer resolves?
What happens when a contributor manually reruns one expensive lane and CI would otherwise be tempted to execute unrelated lanes implicitly?

Requirements (mandatory)

Constitution alignment (required): This feature is repository-only test-governance work. It introduces no Microsoft Graph calls, no product write behavior, no OperationRun, and no end-user authorization changes.

Constitution alignment (PROP-001 / ABSTR-001 / PERSIST-001 / STATE-001 / BLOAT-001): The feature adds repository-level artifacts, policy abstractions, and failure classes only because local governance alone is insufficient to keep shared validation honest. The Proportionality Review above explains why this is the narrowest correct implementation and why a more local workaround is not enough.

Functional Requirements

FR-001: The repository MUST define distinct CI validation paths for at least Fast Feedback pull request validation, broader mainline Confidence validation, Heavy Governance validation, and Browser validation.
FR-002: Each governed CI path MUST invoke the repository's checked-in lane entry point for that lane instead of inlining its own test-selection logic.
FR-003: The pull request Fast Feedback path MUST be blocking and MUST exclude Heavy Governance and Browser validation unless an explicitly documented trigger requests them.
FR-004: The mainline Confidence path MUST publish machine-readable test results, a lane report, and a budget evaluation for its assigned lane or lane set.
FR-005: Every governed lane MUST have a documented budget policy that states the budget target, the enforcement class for overruns, and the trigger contexts where that class applies.
FR-006: CI output MUST distinguish at least these failure classes: test failure, wrapper or manifest failure, budget breach or warning, artifact publication failure, and infrastructure or runner failure.
FR-007: Each lane run MUST produce standardized, lane-identifiable artifacts with reproducible naming and storage rules so runs remain comparable over time.
FR-008: The trigger policy MUST document and implement which lane set runs on pull request updates, protected-branch pushes, scheduled runs, and manual runs.
FR-009: The CI contract MUST surface drift signals when a lane exceeds its time budget, produces missing or corrupt artifacts, invokes the wrong lane, or fails to resolve its checked-in entry point.
FR-010: Contributor guidance MUST explain local reproduction, blocking versus non-blocking signals, artifact locations, and when Heavy Governance or Browser runs are expected.
FR-011: Completion of this feature MUST include validation evidence showing that each documented trigger-to-lane pairing executes as declared and produces the expected artifacts and outcome classification.
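
FR-006 names five failure classes, and SC-004 expects exactly one of them in each non-clean run summary. One way to reconcile several observed problems into a single label is a fixed precedence order, sketched below. The enum values and the ordering are assumptions chosen for illustration; the spec does not define a precedence.

```python
from enum import Enum
from typing import Optional, Set


class FailureClass(Enum):
    INFRASTRUCTURE = "infrastructure-or-runner-failure"
    WRAPPER_OR_MANIFEST = "wrapper-or-manifest-failure"
    TEST = "test-failure"
    ARTIFACT = "artifact-publication-failure"
    BUDGET = "budget-breach-or-warning"


# Assumed precedence: earlier entries explain away later ones, e.g. a runner
# crash makes any test or budget signal from the same run unreliable.
PRECEDENCE = [
    FailureClass.INFRASTRUCTURE,
    FailureClass.WRAPPER_OR_MANIFEST,
    FailureClass.TEST,
    FailureClass.ARTIFACT,
    FailureClass.BUDGET,
]


def primary_failure_class(observed: Set[FailureClass]) -> Optional[FailureClass]:
    """Pick the single class to show in the run summary; None means clean success."""
    for candidate in PRECEDENCE:
        if candidate in observed:
            return candidate
    return None
```

A run that has both a failing test and a budget overrun would be summarized as `FailureClass.TEST` under this ordering, while the budget result can still be reported separately in `budget.json`.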

Key Entities (include if feature involves data)

Lane Class: The governed execution family for a run, including Fast Feedback, Confidence, Heavy Governance, and Browser, with a defined cost class and intended trigger usage.
Trigger Policy: The checked-in mapping from pull request, protected-branch, scheduled, and manual triggers to the lane classes they are allowed or required to execute.
Budget Policy: The per-lane runtime contract that states the target runtime, the enforcement class for overruns, and the contexts where that class is blocking, warning, or informational.
Artifact Set: The machine-readable test results, lane report, budget evaluation, and optional run overview that together form the evidence contract for a governed run.
Failure Classification: The named outcome family used to explain why a governed run did not succeed cleanly, such as test failure, budget problem, artifact failure, or infrastructure failure.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: Within the required validation evidence set for this feature, 100% of pull request runs execute only the intended Fast Feedback lane and do not unintentionally invoke Heavy Governance or Browser validation.
SC-002: Within the required validation evidence set for this feature, 100% of mainline Confidence runs publish the required machine-readable test result, lane report, and budget evaluation for the assigned lane or lane set.
SC-003: Within the required validation evidence set for this feature, 100% of Heavy Governance and Browser runs publish lane-specific artifacts and display the enforcement class declared for that lane.
SC-004: For every governed lane run captured in the required validation evidence set, any non-success or warning outcome is labeled with exactly one named failure class in the run summary.
SC-005: 100% of documented CI triggers represented in the required validation evidence set match their implemented lane mapping, with no undocumented trigger exceptions.
SC-006: Contributors can reproduce each governed lane locally using only documented checked-in entry points, with no undocumented fallback commands required.
