docs: add Spec 212 test authoring guardrails (#245)

## Summary

- add Spec 212 planning artifacts for test authoring constitution and review guardrails
- expand `TEST-GOV-001` and sync the SpecKit spec/plan/tasks/checklist templates plus contributor guidance
- define the canonical review checklist outcomes and record low-impact and higher-cost validation examples

## Validation

- docs/workflow only; no runtime Pest or Sail test lanes were run
- validation is recorded in `specs/212-test-authoring-guardrails/spec.md` and `specs/212-test-authoring-guardrails/quickstart.md`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #245

2026-04-18 10:08:00 +00:00

Feature Specification: Test Suite Authoring Constitution & Review Guardrails

Feature Branch: 212-test-authoring-guardrails
Created: 2026-04-18
Status: Draft
Input: User description: "Spec 212 — Test Suite Authoring Constitution & Review Guardrails"

Spec Candidate Check (mandatory — SPEC-GATE-001)

Problem: TenantPilot can now measure, segment, and enforce test-suite cost, but contributors still lack a mandatory authoring and review routine that keeps new tests correctly classified, minimally provisioned, and lane-aware before they become permanent suite cost.
Today's failure: New tests can still be written convenience-first, broaden shared helpers or fixtures, or expand heavy families without early disclosure, so the first strong signal often appears only after review fatigue or CI slowdown.
User-visible improvement: Contributors and reviewers get lightweight, repeatable prompts that make lane impact, heavy risk, fixture cost, and escalation needs explicit while the change is still small and easy to redirect.
Smallest enterprise-capable version: Extend the existing governance system with a short authoring constitution, mandatory test-impact prompts in spec and plan flows, a compact task checklist, a concise review checklist, explicit escalation rules, and contributor guidance validated against representative specs.
Explicit non-goals: No new CI lanes, no new runtime-optimization program, no automatic PR bot, no broader coding constitution outside test authoring and review, and no attempt to replace human test design judgment with bureaucracy.
Permanent complexity imported: A small set of governance prompts, reviewer questions, escalation vocabulary, contributor guidance, and maintenance responsibility for keeping those artifacts aligned with Specs 206 through 211.
Why now: Specs 206 through 211 built the lane, budget, heavy-segmentation, CI, and trend foundation; without authoring-time and review-time guardrails, the suite can still drift back toward hidden cost growth at the exact point new tests are introduced.
Why not local: Ad hoc reviewer discipline and tribal memory do not scale across contributors, feature specs, or future maintainers, and they do not leave a durable, reviewable record of why a costly test choice was accepted.
Approval class: Cleanup
Red flags triggered: New governance prompts across several authoring surfaces and some new shared vocabulary around escalation. Defense: the feature stays repository-scoped, avoids new runtime infrastructure, and intentionally closes an already active governance program instead of opening a new one.
Score: Benefit: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 1 | Reuse: 2 | Total: 10/12
Decision: approve

Spec Scope Fields (mandatory)

Scope: workspace
Primary Routes: No end-user HTTP routes change. The affected surfaces are repository-owned governance artifacts: the test authoring constitution, specification routine, planning routine, task routine, review checklist, and contributor guidance.
Data Ownership: Workspace-owned authoring templates, governance rules, review prompts, and validation notes. No tenant-owned records or product runtime tables are introduced.
RBAC: No product authorization behavior changes. The actors are contributors, reviewers, and maintainers applying repository governance.

Proportionality Review (mandatory when structural complexity is introduced)

New source of truth?: no
New persisted entity/table/artifact?: yes, but only repository-owned governance artifacts such as constitution text, prompt blocks, checklists, and validation notes
New abstraction?: no new software abstraction; only a documented decision routine for authoring and review
New enum/state/reason family?: no
New cross-domain UI framework/taxonomy?: no
Current operator problem: Contributors can still introduce expensive or misclassified tests at authoring time, while reviewers lack a short, explicit checklist for catching avoidable suite-cost drift before merge.
Existing structure is insufficient because: Lane budgets, CI enforcement, and trend reporting detect problems after a test already exists, but they do not reliably force contributors to justify heavy surfaces or shared setup cost while the change is still being designed.
Narrowest correct implementation: Add lightweight governance text and prompt surfaces directly to the existing spec, plan, task, and review workflow instead of inventing new runtime tooling or a separate approval system.
Ownership cost: Maintainers must keep the constitution text, prompt blocks, checklist language, and representative examples aligned as lane vocabulary and governance expectations evolve.
Alternative intentionally rejected: Relying on informal reviewer comments, CI failures alone, or scattered contributor notes with no shared authoring contract.
Release truth: Current-release repository truth needed to make the test-governance foundation from Specs 206 through 211 durable.

Problem Statement

Specs 206 through 211 gave TenantPilot a strong technical foundation for test-suite governance:

- lane structure and runtime budgets exist
- shared fixture cost has been reduced
- heavy Filament and Livewire families have been segmented
- heavy-governance cost is treated explicitly
- CI runs the governed lanes and enforces their runtime expectations
- runtime trend and baseline logic make erosion visible over time

That foundation is strong, but it is still mostly reactive. The main remaining gap is the moment where new tests are conceived, written, and reviewed.

The biggest slowdown risks are still created at authoring time, when a contributor chooses whether a test stays narrow or immediately reaches for database, Livewire, Filament, browser, or broad shared helpers. Review is the last reliable checkpoint before those choices become permanent suite cost. If authoring and review lack explicit guardrails, the repository drifts back toward convenience-first testing and only learns about the damage after CI or runtime budgets start complaining.

This feature closes that gap by embedding the existing governance model directly into the daily workflow for specification, planning, tasking, authoring, and review.

Dependencies

Depends on Spec 206 — Test Suite Governance & Performance Foundation for lane vocabulary, cost awareness, and the original governance contract.
Depends on Spec 207 — Shared Test Fixture Slimming for the expectation that common setup remains intentionally small.
Depends on Spec 208 — Filament/Livewire Heavy Suite Segmentation for the definition and containment of expensive UI-driven families.
Depends on Spec 209 — Heavy Governance Lane Cost Reduction for the principle that heavy governance must be deliberate rather than accidental.
Depends on Spec 210 — CI Test Matrix & Runtime Budget Enforcement for enforced lane boundaries and runtime budget evidence.
Depends on Spec 211 — Test Runtime Trend Reporting & Baseline Recalibration for the ability to see long-horizon cost drift and justify escalation.
Recommended after the governed lanes, CI enforcement, and trend visibility are stable enough that the remaining problem is authoring and review behavior rather than missing infrastructure.
Blocks durable, everyday embedding of the existing governance model at the point where new tests enter the suite.
Does not block normal feature delivery when current reviewer discipline is already handling the risk manually.

Goals

Embed test-governance thinking directly into the normal development routine.
Give contributors explicit rules for classifying and justifying new tests.
Give reviewers concrete prompts that catch hidden suite-cost drift before merge.
Require new specs, plans, and task lists to state their test and lane impact deliberately.
Keep heavy-family creation, browser expansion, and shared setup cost from appearing silently.
Prevent drift earlier than CI or budget failures.
Close the open loop in the existing test-governance program.

Non-Goals

Creating another runtime-optimization or lane-segmentation spec.
Expanding the CI matrix or adding new infrastructure by default.
Replacing thoughtful test design with a rigid checklist ritual.
Creating a universal engineering constitution for every domain outside test authoring and review.
Introducing PR bots or fully automated review comments as a requirement for this slice.
Reopening lane or budget design that Specs 206 through 211 already settled.

Assumptions

Specs 206 through 211 remain the authoritative source for lane vocabulary, heavy-family expectations, budget stewardship, and runtime-trend interpretation.
The existing specification, planning, and task routines are the correct places to force early test-impact thinking.
Reviewers will continue to use judgment; the checklist is meant to sharpen decisions, not replace them.
Most feature work should still be able to satisfy the added prompts with concise answers rather than long essays.

Key Decisions

Prevention is better than post-facto enforcement: The cheapest place to control suite cost is before the test is committed, not after CI exposes the damage.
Constitution rules must stay lightweight but binding: The authoring contract must be short enough to use every day and strong enough to matter when a costly choice appears.
Every spec must consider test impact explicitly: New feature work should say which lane, family, and runtime implications it touches instead of leaving that question implicit.
Reviewers need decision-grade prompts: Review guardrails should ask direct questions about lane fit, breadth, fixture cost, heavy-family creation, and escalation need rather than vague reminders to care about performance.
Classification must happen at authoring time: Contributors should decide up front whether a test belongs in a narrow lane, a heavier lane, or a new heavy family.
New heavy cost centers must announce themselves: New browser scope, new heavy families, major lane-cost movement, and revived expensive defaults require explicit escalation instead of silent normalization.

Required Outcomes

Test Authoring Constitution

The repository must gain a short, durable constitution section that states the standing rules for test classification, lane awareness, justified use of database or UI-heavy surfaces, minimal fixtures by default, and refusal of hidden shared-cost growth.

Specification Routine Extension

Every new feature specification must answer a small, standard test-impact block that covers affected lanes, new or expanded test families, heavy or browser relevance, expected budget or trend effect, and the validation expected at review time.
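
As a rough illustration of the test-impact block described above, the following Python sketch models the five topics as fields and reports any that were left blank. The class name, field names, and the lane label are hypothetical; the spec lists the topics but does not prescribe a schema.

```python
from dataclasses import dataclass, fields


# Hypothetical schema: one field per topic named in the test-impact block.
@dataclass
class ImpactBlock:
    affected_lanes: str
    test_families: str
    heavy_or_browser_relevance: str
    budget_or_trend_effect: str
    review_validation: str


def unanswered_fields(block: ImpactBlock) -> list[str]:
    """Return the names of fields that were left blank.

    "N/A" and "none" count as valid answers, so a low-impact
    change can complete the block without extra prose.
    """
    return [f.name for f in fields(block) if not getattr(block, f.name).strip()]


# A docs-only change answers every prompt concisely.
docs_only = ImpactBlock("N/A", "none", "none", "none", "N/A")

# "example-lane" is a placeholder label, not a lane defined by the spec.
incomplete = ImpactBlock("example-lane", "", "none", "none", "")

print(unanswered_fields(docs_only))
print(unanswered_fields(incomplete))
```

The point of the sketch is that a blank answer is treated differently from an explicit N/A or none, which keeps the prompt cheap for low-impact work while still forcing a deliberate answer.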

Planning Routine Extension

The planning workflow must make test-impact decisions visible before implementation by asking what test types change, whether helpers or fixtures widen, whether lane reshaping is needed, and what final validation is required.

Task Routine Extension

Task lists must carry a short test-governance checklist that keeps lane assignment, minimal setup, relevant validation, and budget or trend disclosure visible while work is broken down.

Review Guardrails

Reviewers must have a fast checklist that asks whether a test is in the right lane, whether it is unnecessarily broad, whether database or UI-heavy surfaces are actually required, whether setup is secretly expensive, whether the change should be split, and whether escalation is required. The canonical daily-use review surface is the generated checklist based on .specify/templates/checklist-template.md, with .specify/README.md acting as the reviewer entry point for how to apply it.
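
The checklist questions above can be pictured as a small decision helper. The Python sketch below is an assumption-laden illustration, not the generated checklist itself: the field names are hypothetical, and the precedence (escalate first, then split, otherwise keep) is a guess at how a reviewer would combine the answers. Concerns that do not map directly to split or escalate are only reported, because the spec leaves their weighting to reviewer judgment.

```python
from dataclasses import dataclass


# Hypothetical answer sheet: one flag per checklist question.
@dataclass
class ReviewAnswers:
    right_lane: bool
    unnecessarily_broad: bool
    heavy_surface_justified: bool
    setup_secretly_expensive: bool
    should_split: bool
    escalation_required: bool


def review(answers: ReviewAnswers) -> tuple[str, list[str]]:
    """Return a keep/split/escalate decision plus the concerns raised."""
    findings = []
    if not answers.right_lane:
        findings.append("wrong lane")
    if answers.unnecessarily_broad:
        findings.append("unnecessarily broad")
    if not answers.heavy_surface_justified:
        findings.append("heavy surface not justified")
    if answers.setup_secretly_expensive:
        findings.append("expensive setup")

    # Assumed precedence: escalation outranks splitting, which outranks keeping.
    if answers.escalation_required:
        decision = "escalate"
    elif answers.should_split:
        decision = "split"
    else:
        decision = "keep"
    return decision, findings


clean = ReviewAnswers(True, False, True, False, False, False)
broad = ReviewAnswers(True, True, True, False, True, False)

print(review(clean))
print(review(broad))
```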

Escalation Rules

The governance model must define when a change stops being an ordinary test delta and becomes a governance signal that needs explicit documentation or a follow-up spec, especially for new heavy families, new browser coverage, material lane-cost changes, revived expensive defaults, or broad suite reshaping.

Contributor Guidance

Contributors must get short guidance that explains how to choose between narrow and heavy test surfaces, how to detect an overly broad test, when shared setup is justified, and when a change belongs inside an existing family versus creating a new one.

Workflow Integration

The resulting rules must appear where they are used: in the constitution, specification routine, planning routine, task routine, review checklist, and lightweight contributor-facing guidance.

Testing / Lane / Runtime Impact (mandatory for runtime behavior changes)

Validation lane(s): N/A
Why these lanes are sufficient: N/A. This feature changes repository authoring and review artifacts rather than product runtime behavior.
New or expanded test families: none
Fixture / helper cost impact: none directly. The intended effect is future prevention of unnecessary shared setup cost rather than immediate new fixture or helper behavior.
Heavy coverage justification: none
Budget / baseline / trend impact: none directly. The feature should improve earlier disclosure of future drift, but it does not itself change lane membership, budgets, baselines, or runtime measurements.
Planned validation commands: N/A. Validation is document-based and consists of applying the new prompts and guardrails to representative specs, plans, and task flows.

Workflow Validation Notes (2026-04-18)

Low-Impact Authoring Dry Run

Scenario: Apply the updated prompts to a template-only change limited to .specify/templates/checklist-template.md and .specify/README.md.
Result: The authoring flow can be completed with concise N/A or none answers in under 1 minute because the prompts only ask for runtime-specific detail when impact actually exists.
Wording adjustment captured: The spec and plan templates now ask for a short reviewer handoff and an explicit escalation outcome so low-impact work stays lightweight while still ending in a clear review disposition.

Higher-Cost Review Dry Run

Scenario: Apply the updated review guardrails to specs/211-runtime-trend-recalibration/spec.md and specs/211-runtime-trend-recalibration/plan.md.
Result: The reviewer can confirm lane fit, bounded helper cost, no new heavy/browser promotion, and explicit validation commands in under 3 minutes. The correct outcome is document-in-feature because Spec 211 changes governed runtime-reporting behavior inside existing lane families and already records its own drift and recalibration notes.
Escalation boundary proved: A true follow-up-spec remains reserved for recurring pain or structural lane-model changes, such as introducing a new heavy family, normalizing browser coverage for a new workflow class, or reviving an expensive shared default across unrelated tests.
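
The boundary between the two escalation outcomes named in this dry run can be sketched as a small classifier. This is a minimal Python sketch under stated assumptions: the signal names are hypothetical labels for the examples listed above, and the "none" result for ordinary changes is an invented placeholder rather than vocabulary from the spec.

```python
# Hypothetical labels for the structural cases listed in the escalation boundary.
STRUCTURAL_SIGNALS = {
    "recurring_pain",
    "new_heavy_family",
    "browser_coverage_for_new_workflow_class",
    "expensive_shared_default_revived",
}


def escalation_outcome(signals: set[str], governed_behavior_changed: bool) -> str:
    """Classify a change as follow-up-spec, document-in-feature, or none."""
    # Any structural signal reserves the change for a follow-up spec.
    if signals & STRUCTURAL_SIGNALS:
        return "follow-up-spec"
    # Governed behavior changed inside existing lane families:
    # record the reasoning in the feature's own spec.
    if governed_behavior_changed:
        return "document-in-feature"
    # Assumed placeholder for an ordinary test delta.
    return "none"


# Mirrors the Spec 211 dry run: governed behavior changes, no structural signal.
print(escalation_outcome(set(), governed_behavior_changed=True))
print(escalation_outcome({"new_heavy_family"}, governed_behavior_changed=True))
```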

User Scenarios & Testing (mandatory)

User Story 1 - Classify Test Impact While Authoring (Priority: P1)

As a contributor preparing a new feature spec or plan, I want the workflow to ask about lane impact, heavy coverage, and fixture cost before implementation begins so I choose the smallest justified test surface instead of defaulting to convenience-first coverage.

Why this priority: This is the earliest and cheapest place to stop avoidable suite-cost drift.

Independent Test: Apply the workflow to a genuinely low-impact docs-only or template-only scenario, such as a change limited to .specify/templates/checklist-template.md and .specify/README.md, and confirm that the author can answer with concise N/A or none responses while still making any affected lanes, new or expanded test families, heavy-surface justification, and minimal validation expectations explicit when they exist.

Acceptance Scenarios:

Given a feature spec that introduces or changes tests, When the author completes the required governance prompts, Then the spec states the affected lane or lanes, any family expansion, and the required validation scope explicitly.
Given a proposed test that reaches for database, Livewire, Filament, or browser coverage, When the author documents the approach, Then the justification and minimal-setup expectation are stated rather than assumed.

User Story 2 - Reviewers Catch Hidden Suite Cost Before Merge (Priority: P1)

As a reviewer evaluating new or changed tests, I want a short guardrail checklist so I can quickly judge whether the test belongs in the chosen lane, whether the setup is too broad, and whether the change needs escalation instead of silent acceptance.

Why this priority: Review is the last reliable checkpoint before hidden cost becomes permanent repository truth.

Independent Test: Use the canonical generated review checklist on representative test changes and confirm that the reviewer can reach a clear keep, split, or escalate decision without relying on unwritten tribal knowledge.

Acceptance Scenarios:

Given a test that is broader than necessary for its intent, When the reviewer applies the checklist, Then the checklist makes the breadth and likely narrower alternative visible.
Given a change that quietly expands a heavy family or shared helper default, When the reviewer applies the checklist, Then the need for explicit escalation or follow-up governance is surfaced before merge.

User Story 3 - Escalate New Cost Centers Deliberately (Priority: P2)

As a maintainer stewarding suite health, I want clear escalation rules so that new heavy families, new browser scope, or material lane-cost shifts are documented and evaluated explicitly instead of being normalized through drift.

Why this priority: The governance model stays durable only if major new cost centers announce themselves early and visibly.

Independent Test: Apply the escalation rules to representative examples involving new browser or heavy scope and confirm that the outcome is either a documented local exception or an explicit follow-up governance action.

Acceptance Scenarios:

Given a change that introduces a new heavy family or new browser coverage, When the escalation rules are applied, Then the change is classified as an explicit governance decision rather than a routine test edit.
Given a small test change that stays within an existing lane and family, When the escalation rules are applied, Then the workflow allows the change to proceed without forcing unnecessary process overhead.

Edge Cases

A feature has no meaningful runtime or test impact; the workflow must allow concise N/A or none answers instead of forcing boilerplate.
One feature legitimately affects multiple existing lanes; the prompts must allow multi-lane disclosure without implying a new family.
A seemingly small helper or factory default would silently broaden setup cost across many tests; the guardrails must treat this as a governance concern even if the local diff looks minor.
A reviewer sees budget or baseline implications before CI is red; the escalation rules must allow early documentation rather than waiting for a hard failure.
A single justified browser or heavy scenario must not automatically bless wider copy-paste expansion into nearby tests.

Requirements (mandatory)

Functional Requirements

FR-001: The repository MUST define a permanent test authoring constitution that requires explicit test classification, deliberate lane awareness, justified use of database or UI-heavy surfaces, minimal fixtures by default, and rejection of hidden shared-cost growth.
FR-002: The specification routine MUST require a standard test-impact section for every new spec that captures affected lane or lanes, new or expanded test families, heavy or browser relevance, expected budget or trend implications, and reviewer validation expectations, or explicit N/A or none answers when no such impact exists.
FR-003: The planning routine MUST require a test-impact block that identifies which test types change, whether shared helpers, fixtures, factories, or defaults widen, whether lane reassignment or lane addition is implicated, and what final validation is required.
FR-004: The task routine MUST include a short standardized checklist that confirms lane assignment, avoidance of unnecessary heavy cost, use of minimal fixtures or helpers, relevant validation, and documentation of budget or trend implications when present.
FR-005: The review routine MUST provide a concise guardrail checklist that asks whether the test is in the correct lane, whether it is unnecessarily broad, whether database, Livewire, Filament, or browser usage is justified, whether setup is secretly expensive, whether the test should be split, and whether escalation is required. The canonical checklist surface is the generated review checklist based on .specify/templates/checklist-template.md, with .specify/README.md linking reviewers to its use.
FR-006: The governance model MUST define explicit escalation rules for new heavy families, new browser coverage, material lane-cost changes, broad new Filament or Livewire governance surfaces, revived expensive helper or factory defaults, budget- or baseline-relevant shifts, and major suite reshaping.
FR-007: Contributor guidance MUST explain how to choose between narrow and heavy test surfaces, when database or UI-heavy coverage is justified, how to recognize an overly broad test, and when to extend an existing family versus introduce a new one.
FR-008: The guardrails MUST be integrated into the everyday authoring and review surfaces used by contributors and reviewers, including the constitution, specification routine, planning routine, task routine, review checklist, and contributor guidance.
FR-009: The added governance prompts MUST remain lightweight enough that an ordinary feature with little or no test impact can satisfy them with concise answers and without material process drag.
FR-010: The completed guidance MUST be validated against at least one representative low-risk docs-only or template-only flow, such as a change limited to .specify/templates/checklist-template.md and .specify/README.md, and one representative higher-cost or multi-lane scenario to confirm that the rules are usable, do not contradict existing lane or budget governance, and catch the intended escalation cases.
FR-011: The governance rules MUST explicitly forbid introducing new expensive shared helper, factory, seed, or fixture defaults without disclosing the cost impact and either containing the change locally or escalating it as governance-relevant work.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: In dry runs on at least two representative feature specs, authors can complete the required test-impact prompts with no unanswered required field and with the affected lane or lanes, family impact, and validation scope made explicit.
SC-002: In representative review exercises, reviewers can use the guardrail checklist to reach a clear keep, split, or escalate decision within 3 minutes for each sample change.
SC-003: Every validation example that introduces new browser coverage, a new heavy family, or a material lane-cost shift is explicitly classified as either a documented local exception or a governance escalation; none remain implicit.
SC-004: A representative low-impact docs-only or template-only scenario with no runtime or meaningful test change can satisfy the added governance prompts in under 1 minute using concise N/A or none answers.
SC-005: Validation against representative specs, plans, and task flows shows no contradiction with the existing lane, budget, baseline, or runtime-trend model established by Specs 206 through 211.
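
The two time limits in SC-002 and SC-004 could be checked against recorded dry-run durations along these lines. The Python sketch below is illustrative only: the scenario keys are hypothetical, and treating "within 3 minutes" as inclusive and "under 1 minute" as exclusive is an interpretation of the wording.

```python
from datetime import timedelta

# Limits from SC-002 and SC-004; the boolean marks whether the bound is inclusive.
LIMITS = {
    "review": (timedelta(minutes=3), True),                 # "within 3 minutes"
    "low_impact_authoring": (timedelta(minutes=1), False),  # "under 1 minute"
}


def meets_limit(kind: str, elapsed: timedelta) -> bool:
    """Return True when a dry run of the given kind finished inside its limit."""
    limit, inclusive = LIMITS[kind]
    return elapsed <= limit if inclusive else elapsed < limit


print(meets_limit("review", timedelta(minutes=2, seconds=40)))
print(meets_limit("low_impact_authoring", timedelta(seconds=60)))
```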
