TenantAtlas/specs/206-test-suite-governance/spec.md
ahmido 3c38192405 Spec 206: implement test suite governance foundation (#239)
## Summary

This PR implements Spec 206 end to end and establishes the first checked-in test suite governance foundation for the platform app.

Key changes:
- add manifest-backed test lanes for fast-feedback, confidence, browser, heavy-governance, profiling, and junit
- add budget and report helpers plus app-local artifact generation under `apps/platform/storage/logs/test-lanes`
- add repo-root Sail-friendly lane/report wrappers
- switch the default contributor test path to the fast-feedback lane
- introduce explicit fixture profiles and cheaper defaults for shared tenant/provider test setup
- add minimal/heavy factory states for tenant and provider connection setup
- migrate the first high-usage and provider-sensitive tests to explicit fixture profiles
- document budgets, taxonomy rules, DB reset guidance, and the full Spec 206 plan/contracts/tasks set

## Validation

Executed during implementation:
- focused Spec 206 guard/support/factory validation pack: 31 passed
- provider-sensitive regression pack: 29 passed
- first high-usage caller migration pack: 120 passed
- lane routing and wrapper validation succeeded
- pint completed successfully

Measured lane baselines captured in docs:
- fast-feedback: 176.74s
- confidence: 394.38s
- heavy-governance: 83.66s
- browser: 128.87s
- junit: 380.14s
- profiling: 2701.51s
- full-suite baseline anchor: 2624.60s

## Notes

- Livewire v4 / Filament v5 runtime behavior is unchanged by this PR.
- No new runtime routes, product UI flows, or database migrations are introduced.
- Panel provider registration remains unchanged in `bootstrap/providers.php`.

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #239
2026-04-16 13:58:50 +00:00

Feature Specification: Test Suite Governance & Performance Foundation

Feature Branch: 206-test-suite-governance
Created: 2026-04-16
Status: Draft
Input: User description: "Spec 206 — Test Suite Governance & Performance Foundation"

Spec Candidate Check (mandatory — SPEC-GATE-001)

  • Problem: TenantPilot's test suite has become expensive enough that feedback speed, suite honesty, and authoring discipline are now repo-level engineering concerns rather than local developer preferences.
  • Today's failure: The default run is too close to a broad full-suite execution, heavy helpers and factory defaults silently inflate test cost, and slow-test regressions are hard to see before they become the new normal.
  • User-visible improvement: Contributors get a clearly faster standard run, a clearly broader confidence run, separate heavy lanes, and visible slow-test signals without relying on private shell habits.
  • Smallest enterprise-capable version: Define four operational test lanes (Fast Feedback, Confidence, Browser, and Heavy Governance), add checked-in entry points plus profiling and machine-readable reporting runs, document honest taxonomy and cheap-default rules, standardize slow-test visibility, set baseline-backed runtime budgets, and split the heaviest shared setup paths into minimal versus full modes.
  • Explicit non-goals: No wholesale rewrite of the entire suite, no blanket removal of browser or database tests, no immediate hard-fail CI matrix rollout, and no attempt to optimize every slow file in one pass.
  • Permanent complexity imported: Lane vocabulary, taxonomy rules, runtime budgets, reporting conventions, helper and factory naming discipline, and guard tests that keep lane drift visible.
  • Why now: The suite is already large enough that further feature waves, wider browser coverage, and later CI hardening will compound current cost unless the repo establishes structure first.
  • Why not local: Personal scripts and tribal knowledge cannot stop default-run drift, hidden heavy setup, or ambiguous classification; the rules and entry points must live in the repo and be shareable.
  • Approval class: Cleanup
  • Red flags triggered: New classification vocabulary and "foundation" framing. Defense: the scope is intentionally narrow to repo test execution, visibility, and cheap-default discipline; it does not add new runtime product truth, new user-facing surfaces, or generalized platform abstractions.
  • Score: Benefit: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 1 | Reuse: 2 | Total: 10/12
  • Decision: approve

Spec Scope Fields (mandatory)

  • Scope: workspace
  • Primary Routes: No end-user HTTP routes are changed. The affected surfaces are repository-level test entry points, suite grouping rules, shared test helpers, factory defaults, and checked-in test-governance documentation.
  • Data Ownership: Workspace-owned test commands, helper conventions, grouping metadata, runtime budgets, and slow-test reporting outputs. No new tenant-owned runtime records are introduced.
  • RBAC: No end-user authorization behavior changes. The affected actors are repository contributors who need a consistent, checked-in way to run the right test lane.

Proportionality Review (mandatory when structural complexity is introduced)

  • New source of truth?: no
  • New persisted entity/table/artifact?: yes, but only ephemeral local test-run artifacts under apps/platform/storage/logs/test-lanes; no new product database table or persisted product-truth artifact is introduced
  • New abstraction?: yes, but limited to a repo-level lane and classification model for test execution governance
  • New enum/state/reason family?: no
  • New cross-domain UI framework/taxonomy?: no
  • Current operator problem: Developers and reviewers cannot reliably choose the right test run for quick feedback versus broad confidence, and they cannot see where the suite is getting slower until the cost is already entrenched.
  • Existing structure is insufficient because: A single serial default path, misclassified tests, expensive shared helpers, cascading factory defaults, and absent runtime budgets let expensive patterns spread without a shared correction mechanism.
  • Narrowest correct implementation: Introduce four operational lanes plus two support runs, checked-in entry points, written classification and helper rules, baseline-backed initial budgets, slow-test visibility, and the first cheap-default fixes for the heaviest setup paths.
  • Ownership cost: The team must maintain lane definitions, budgets, helper naming, classification rules, and a small set of guard tests or reports as the suite evolves.
  • Alternative intentionally rejected: Pure parallelization without cost discipline, or ad-hoc local scripts without a shared taxonomy and budget model.
  • Release truth: Current-release truth and a near-term prerequisite for larger feature waves, broader browser coverage, and later CI hardening.

Problem Statement

TenantPilot already has a valuable test suite, but its growth pattern now creates avoidable delay and ambiguity:

  • The standard run behaves too much like a broad suite instead of a fast feedback path.
  • A meaningful share of tests classified as lightweight are still integration-heavy in practice.
  • Shared helpers and factory defaults often create more tenant, provider, membership, workspace, or UI context than the test actually needs.
  • Browser, UI-heavy, contract, and guard-heavy families have real value but belong to a different cost class than the usual authoring loop.
  • Slow-file and slow-test drift is not yet a first-class repository signal.

Without a governance layer, each new feature wave makes the default path slower, increases the temptation to skip useful runs, and makes later CI hardening more expensive.

Goals

  • Restore a clearly fast standard feedback path for normal feature work.
  • Separate fast-feedback, confidence, browser, and heavy-governance execution paths into explicit lanes.
  • Make test taxonomy honest so a file's label matches its real cost and dependencies.
  • Stop shared helpers and factory defaults from smuggling in heavy setup unless the test opts into it.
  • Make slow-test regressions visible early enough to review and correct.
  • Prepare the suite to grow toward 10k+ tests without default-run cost exploding.

Non-Goals

  • Removing high-value security, governance, contract, or browser coverage solely to improve headline runtime.
  • Rewriting every existing test into a new classification scheme in one feature.
  • Forcing an immediate CI matrix rollout before the repo has stable lane definitions and budgets.
  • Replacing the current application testing foundations or browser-testing approach.

Assumptions

  • The current full-suite wall-clock time will be measured in the repository's standard development environment and recorded as the baseline before budgets are locked.
  • Heavy suites remain valuable and are being separated, not downgraded in importance.
  • This feature may change checked-in commands, grouping metadata, shared helpers, factory states, and test documentation, but it does not require a new product runtime surface.

User Scenarios & Testing (mandatory)

User Story 1 - Run The Fast Feedback Lane (Priority: P1)

As a contributor making a normal code change, I want one checked-in default run that gives representative feedback quickly, without silently dragging browser and deliberately heavy governance suites into my normal loop.

Why this priority: The default authoring loop is the highest-frequency path in the repository. If it stays slow and ambiguous, every other optimization has limited impact.

Independent Test: Run the default lane from a clean developer environment, confirm that it excludes browser and intentionally heavy families, and verify that the result arrives within the documented fast-lane budget.

Acceptance Scenarios:

  1. Given a contributor is working on a non-browser change, When they run the repository's default test entry point, Then only the fast-feedback lane executes and it completes within the documented fast-lane budget.
  2. Given browser or heavy governance tests exist in the repository, When the default lane runs, Then those families are not silently included.
  3. Given the fast lane fails, When the contributor reads the output, Then it is clear whether a broader confidence or heavy lane should be run next.

User Story 2 - Run The Broader Confidence Lane Before Merge (Priority: P1)

As a contributor or reviewer preparing a higher-confidence check, I want a broader lane that covers most feature and integration safety without forcing the cost of every browser and governance-heavy family.

Why this priority: A fast lane alone is not enough. The repository also needs a shared, predictable middle path between quick local feedback and the heaviest suite.

Independent Test: Run the confidence lane and verify that it includes the Unit suite plus the manifest-defined non-browser Feature or Integration selectors that remain outside explicit heavy-governance exclusions, stays under its documented budget, and remains separate from the heavy-governance lane.

Acceptance Scenarios:

  1. Given a change is ready for broader validation, When a contributor runs the confidence lane, Then it includes the Unit suite plus the manifest-defined non-browser Feature or Integration selectors that remain outside explicit heavy-governance exclusions while remaining under its documented budget.
  2. Given a suite family is explicitly classified as heavy governance or browser-only, When the confidence lane runs, Then that family is excluded unless the lane definition says otherwise.
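The include and exclude behavior in these scenarios can be sketched as a small selector. This is a minimal illustration in Python, not the repository's implementation: the `SuiteFile` and `LaneRule` types, the `select_for_lane` helper, the group names, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SuiteFile:
    path: str
    suite: str  # e.g. "Unit", "Feature", "Browser"
    groups: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class LaneRule:
    name: str
    included_suites: frozenset
    excluded_groups: frozenset


def select_for_lane(rule, files):
    """Keep files whose suite is included and that carry no excluded group."""
    return [
        f for f in files
        if f.suite in rule.included_suites and not (f.groups & rule.excluded_groups)
    ]


# Hypothetical rule: Unit plus Feature, minus browser and heavy-governance families.
CONFIDENCE = LaneRule(
    name="confidence",
    included_suites=frozenset({"Unit", "Feature"}),
    excluded_groups=frozenset({"browser", "heavy-governance"}),
)

# Hypothetical files; the paths are placeholders, not real repository files.
FILES = [
    SuiteFile("tests/Unit/BudgetTest.php", "Unit"),
    SuiteFile("tests/Feature/TenantListTest.php", "Feature"),
    SuiteFile("tests/Feature/SurfaceScanTest.php", "Feature", frozenset({"heavy-governance"})),
    SuiteFile("tests/Browser/LoginTest.php", "Browser", frozenset({"browser"})),
]

selected = [f.path for f in select_for_lane(CONFIDENCE, FILES)]
print(selected)
```

In this sketch a heavy family is dropped because of its group even though it lives in an included suite, which is the "excluded unless the lane definition says otherwise" rule from scenario 2.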

User Story 3 - See Where Runtime Is Getting Worse (Priority: P2)

As a maintainer investigating suite cost, I want machine-readable reporting and slow-test profiling available through checked-in entry points so I can identify runtime drift before it becomes accepted baseline behavior.

Why this priority: Visibility must come before targeted optimization. Without it, expensive regressions are discovered late and argued from anecdotes.

Independent Test: Generate the machine-readable report and slow-test profile from the repository entry points and confirm that the slowest files or test cases are ranked and attributable to a lane or family.

Acceptance Scenarios:

  1. Given a maintainer wants to inspect suite cost, When they run the reporting and profiling entry points, Then the repository produces ranked slow-test output and machine-readable results without ad-hoc local scripting.
  2. Given a lane exceeds its documented budget, When the report is reviewed, Then the over-budget lane or file cluster is visible enough to trigger follow-up work.
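As a rough illustration of the ranked slow-test output described here, the sketch below ranks test cases from a JUnit-style XML report. It is written in Python for brevity and is not the repository's reporting helper. The sample XML, the `rank_slowest` function, and the assumption that each `testcase` element carries `file` and `time` attributes are all illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical report content; a real run would read the lane's generated artifact.
SAMPLE_JUNIT = """<testsuites>
  <testsuite name="Feature" tests="3" time="4.75">
    <testcase name="it lists tenants" file="tests/Feature/TenantListTest.php" time="0.25"/>
    <testcase name="it scans surfaces" file="tests/Feature/SurfaceScanTest.php" time="3.50"/>
    <testcase name="it syncs providers" file="tests/Feature/ProviderSyncTest.php" time="1.00"/>
  </testsuite>
</testsuites>
"""


def rank_slowest(junit_xml, limit=10):
    """Return (seconds, file, test name) tuples, slowest first, capped at `limit`."""
    root = ET.fromstring(junit_xml)
    cases = [
        (float(case.get("time", "0")), case.get("file", "unknown"), case.get("name", ""))
        for case in root.iter("testcase")
    ]
    return sorted(cases, key=lambda c: c[0], reverse=True)[:limit]


for seconds, file, name in rank_slowest(SAMPLE_JUNIT, limit=2):
    print(f"{seconds:6.2f}s  {file}  {name}")
```

Grouping the same tuples by file or by lane would give the per-family view that scenario 2 relies on.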

User Story 4 - Author Cheap, Honest Tests By Default (Priority: P2)

As a contributor adding or editing tests, I want written rules and cheap shared defaults so I can place the test in the right lane and avoid inheriting an expensive tenant or UI context I did not ask for.

Why this priority: The suite only stays healthy if new tests default to minimal setup and honest classification instead of repeating the patterns that caused the slowdown.

Independent Test: Add or update a test that uses shared helpers or factories, verify that the default path is minimal, and confirm that heavy context requires an explicit opt-in.

Acceptance Scenarios:

  1. Given a contributor writes a lightweight test, When they use shared helpers or factories, Then the default path creates only the minimum required records and relationships.
  2. Given a test genuinely needs full tenant, provider, membership, workspace, or UI context, When the author opts into that path, Then the helper or factory name makes the heavier cost explicit.
  3. Given a test's real dependencies no longer match its current classification, When it is reviewed, Then the written taxonomy provides a clear destination lane or group.
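The minimal-by-default, heavy-by-opt-in pattern behind these scenarios might look like the following. This is a language-neutral sketch in Python under stated assumptions: the real helpers and factories are not shown in this spec, so `TenantFixture`, `create_minimal_tenant_user`, and `create_full_tenant_user_context` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TenantFixture:
    tenant: dict
    user: dict
    extras: dict = field(default_factory=dict)  # heavy context, empty unless opted in


def create_minimal_tenant_user():
    """Cheap default: one tenant and one user, nothing else."""
    return TenantFixture(tenant={"name": "Acme"}, user={"email": "user@example.test"})


def create_full_tenant_user_context():
    """Explicit heavy path: the name signals that extra records are built."""
    fixture = create_minimal_tenant_user()
    fixture.extras.update(
        workspace={"name": "Default"},
        membership={"role": "owner"},
        provider_connection={"status": "connected"},
    )
    return fixture


light = create_minimal_tenant_user()
heavy = create_full_tenant_user_context()
print(sorted(light.extras), sorted(heavy.extras))
```

A test that silently depended on the workspace or provider connection would fail against the minimal fixture, which is the loud failure the edge cases below ask for.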

Edge Cases

  • A file mixes lightweight and heavy behavior; the repository must force an explicit grouping or classification decision instead of letting the file float ambiguously in the default path.
  • Slimming a shared helper exposes tests that depended on hidden side effects; those tests must fail loudly and move to an explicit heavy path rather than re-inflating the default helper.
  • Browser tests that cross real HTTP or browser boundaries must not be forced into the same reset expectations as in-process tests.
  • Initial budgets may be transitional; exceeding them should still be visible even if some runs start as warning-level governance rather than immediate hard failure.
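One way to read the last edge case is that over-budget runs are always reported, and only the enforcement level decides whether the run fails. The sketch below shows that split in Python. The `Enforcement` enum, the `evaluate_budget` function, and the budget and timing values are hypothetical and are not taken from the repository.

```python
from enum import Enum


class Enforcement(Enum):
    WARN = "warn"  # report the overrun, keep the run green
    FAIL = "fail"  # report the overrun and fail the run


def evaluate_budget(lane, elapsed_seconds, budget_seconds, enforcement=Enforcement.WARN):
    """Return (run_ok, message); the message is produced in both enforcement modes."""
    if elapsed_seconds <= budget_seconds:
        return True, f"{lane}: within budget ({elapsed_seconds:.2f}s <= {budget_seconds:.2f}s)"
    over = elapsed_seconds - budget_seconds
    message = f"{lane}: over budget by {over:.2f}s ({elapsed_seconds:.2f}s > {budget_seconds:.2f}s)"
    return enforcement is Enforcement.WARN, message


# Placeholder numbers for illustration only.
ok, message = evaluate_budget("fast-feedback", 212.40, 200.00)
print(ok, message)
```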

Requirements (mandatory)

Constitution alignment: This feature changes no end-user routes, no Microsoft Graph behavior, no runtime authorization plane, and no operator-facing product surface. It does introduce repository-wide governance vocabulary and checked-in execution rules, so lane definitions, helper discipline, and reporting outputs must remain explicit, reviewable, and testable.

Functional Requirements

  • FR-001 Lane Model: The repository MUST define four named operational test lanes: Fast Feedback, Confidence, Browser, and Heavy Governance. Each lane MUST have a written purpose, intended audience, included families, excluded families, and ownership expectations.
  • FR-002 Checked-In Entry Points: The repository MUST provide checked-in entry points for the fast-feedback lane, the broader confidence lane, the browser lane, the heavy-governance lane, a profiling support run, and a machine-readable reporting support run. The profiling and JUnit reporting runs MUST be represented as support-lane entries in the checked-in manifest and logical reporting contract so they share the same contract shape without being treated as default operational lanes.
  • FR-003 Default Run Semantics: The standard contributor test run MUST resolve to the fast-feedback lane instead of behaving like an implicit broad full-suite run.
  • FR-004 Honest Taxonomy: The repository MUST define written classification rules for Unit, Feature or Integration, Browser, and Architecture or Governance tests. The first implementation slice MUST audit the obvious misfit families surfaced during rollout and reclassify or regroup the first batch that does not match its current label.
  • FR-005 Browser Isolation: Browser tests MUST belong to a dedicated lane with a dedicated entry point and dedicated runtime budget. They MUST NOT be silently included in the fast-feedback lane.
  • FR-006 Shared Helper Discipline: Shared test helpers MUST default to a minimal setup path. Provider, credential, membership, workspace, session, cache, and UI-heavy behavior MUST require explicit opt-in and clear naming.
  • FR-007 Factory Discipline: Factories touched by this rollout, plus the first additional heavy helper or factory cluster identified during rollout, MUST support minimal states for cheap records and clearly named heavy states for richer graphs. Expensive cascading defaults encountered in this slice MUST be removed, isolated, or explicitly documented, and the documented taxonomy MUST establish the same expectation for future touched factories.
  • FR-008 Slow-Test Observability: Contributors MUST be able to generate ranked slow-test information and machine-readable results from checked-in entry points, including visibility into the slowest files or test cases and the lane they affect.
  • FR-009 Runtime Budgets: Every checked-in lane entry, including heavy-governance and the two support-lane entries, MUST carry a baseline-backed runtime budget in the manifest and reporting contract. At minimum, the first slice MUST define explicit review targets for the fast-feedback lane, the confidence lane, the browser lane, and the heaviest known files or groups. Those budgets MUST be based on recorded baseline measurements from the current full-suite run and the first lane runs in the standard development environment. The reporting output MUST surface over-budget results.
  • FR-010 Database Strategy Guidance: The repository MUST document when database-backed tests are appropriate, when database resets are justified, when seeds are prohibited or limited, and when a prebuilt schema baseline should be evaluated for the testing environment.
  • FR-011 Initial Cheap Defaults: The first implementation slice MUST split at least the largest known shared setup path into minimal versus full behavior and MUST introduce a comparable minimal path for at least one additional heavy helper or factory cluster.
  • FR-012 Heavy Suite Separation: The heaviest browser, discovery, guard, contract, or surface-scan families identified during rollout MUST be grouped or lane-separated so they are intentionally run rather than silently inherited by the default lane. The initial heavy-governance lane MAY start from obvious seed families, but it MUST be refined using the first profiling evidence captured during rollout.
  • FR-013 Growth Governance: New test files, helper changes, and factory changes MUST be placeable into the written lane and taxonomy model without relying on undocumented local knowledge. For this first slice, the checked-in manifest is intentionally capped at the four operational lanes plus the two support-lane entries; expanding beyond those six entries requires an explicit follow-up spec and contract revision rather than silent manifest drift.
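FR-009 and FR-013 together imply a guard over the manifest: exactly four operational lanes plus two support-lane entries, each with a runtime budget. A minimal sketch of such a guard follows, in Python rather than the repository's own tooling. The lane names come from the PR summary; the manifest shape, the `budget_seconds` key, the `validate_manifest` function, and the budget values are assumptions.

```python
OPERATIONAL_LANES = ("fast-feedback", "confidence", "browser", "heavy-governance")
SUPPORT_LANES = ("profiling", "junit")


def validate_manifest(manifest):
    """Return a list of problems; an empty list means the manifest passes the guard."""
    problems = []
    expected = set(OPERATIONAL_LANES) | set(SUPPORT_LANES)
    actual = set(manifest)
    for name in sorted(expected - actual):
        problems.append(f"missing lane entry: {name}")
    for name in sorted(actual - expected):
        problems.append(f"unexpected lane entry: {name} (requires a follow-up spec)")
    for name, entry in manifest.items():
        budget = entry.get("budget_seconds")
        if not isinstance(budget, (int, float)) or budget <= 0:
            problems.append(f"lane {name} has no positive runtime budget")
    return problems


# Placeholder budgets; the real values live in the repository's manifest and docs.
MANIFEST = {
    name: {"budget_seconds": 600}
    for name in OPERATIONAL_LANES + SUPPORT_LANES
}

print(validate_manifest(MANIFEST))
print(validate_manifest({**MANIFEST, "nightly": {"budget_seconds": 900}}))
```

Returning a list of problems instead of raising on the first one keeps every drift signal visible in a single guard-test failure.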

Non-Functional Requirements

  • NFR-001 Feedback Speed: The fast-feedback lane must be materially faster than the current full-suite baseline and stable enough to become the repository's normal authoring loop.
  • NFR-002 Confidence Separation: The confidence lane must preserve broader trust while remaining distinct from the heaviest governance and browser cost classes.
  • NFR-003 Observability Usability: Runtime and slow-test visibility must be available through checked-in repository paths that are easy for contributors and reviewers to run consistently.
  • NFR-004 Scalability Readiness: The lane model and taxonomy must be durable enough to support suite growth toward 10k+ tests without constant redefinition of the default path.

Rollout Guidance

  • Establish visibility, lane definitions, entry points, and initial budgets first.
  • Slim the default shared helpers and factory paths next so new tests stop inheriting full-context setup by accident.
  • Reclassify or separate the heaviest misfit suites once the lane model exists.
  • Evaluate broader framework-level optimization only after the lane model and cheap defaults make the expensive clusters visible.

Key Entities (include if feature involves data)

  • Test Lane: A named execution path with a purpose, intended use frequency, included families, excluded families, and a runtime budget.
  • Test Classification Rule: A written rule set that defines which dependencies and behaviors belong to Unit, Feature or Integration, Browser, and Architecture or Governance tests.
  • Shared Fixture Path: A reusable helper or factory path that can be minimal by default or full by explicit opt-in.
  • Runtime Budget: A documented wall-clock target and drift signal for a lane, file family, or heavy test cluster.
  • Slow-Test Report: A reporting output, generated through checked-in entry points, that shows the slowest files or test cases and ties them back to a lane or heavy family.


Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: Contributors can run the fast-feedback lane through one checked-in entry point, and its documented wall-clock budget is at least 50% lower than the current full-suite baseline measured on the same standard environment.
  • SC-002: The confidence lane covers the Unit suite plus the manifest-defined non-browser Feature or Integration selectors that remain outside explicit heavy-governance exclusions while staying within its documented budget and below the current full-suite baseline.
  • SC-003: Browser tests are excluded from the fast-feedback lane in normal use and are available only through their dedicated lane and budget.
  • SC-004: The repository can produce a machine-readable test result artifact and a ranked slow-test report through checked-in entry points, showing at least the top 10 slowest files or test cases.
  • SC-005: At least two of the currently heaviest shared setup paths, including the current tenant-user helper path, expose a documented minimal mode so new tests do not inherit full-context setup by default.
  • SC-006: Reviewers can determine from the written taxonomy where a new or reworked test belongs without case-by-case reinvention, supporting continued suite growth toward 10k+ tests.
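The SC-001 and SC-002 thresholds reduce to simple arithmetic against the full-suite baseline. The Python sketch below uses the measured lane baselines from the PR summary as stand-ins, because the documented budget values themselves are not listed in this spec. The function names are hypothetical.

```python
FULL_SUITE_BASELINE = 2624.60  # seconds, full-suite baseline anchor from the PR summary

# Measured lane baselines from the PR summary, used here in place of documented budgets.
MEASURED = {"fast-feedback": 176.74, "confidence": 394.38}


def reduction_vs_baseline(seconds, baseline=FULL_SUITE_BASELINE):
    """Fraction by which `seconds` is lower than the baseline (0.5 means 50% lower)."""
    return 1.0 - seconds / baseline


def meets_sc_001(budget_seconds, baseline=FULL_SUITE_BASELINE):
    """SC-001: the fast-feedback budget must be at least 50% below the baseline."""
    return reduction_vs_baseline(budget_seconds, baseline) >= 0.50


def meets_sc_002(budget_seconds, baseline=FULL_SUITE_BASELINE):
    """SC-002 (runtime part only): the confidence lane must stay below the baseline."""
    return budget_seconds < baseline


for lane, seconds in MEASURED.items():
    print(f"{lane}: {reduction_vs_baseline(seconds):.1%} below the full-suite baseline")
```

With these measurements the fast-feedback lane sits roughly 93% below the baseline and the confidence lane roughly 85% below it, so any documented budget up to 1312.30s would still satisfy the 50% rule in SC-001.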