TenantAtlas/docs/product/spec-candidates.md
ahmido a74ab12f04 feat: implement evidence domain foundation (#183)
## Summary
- add the Evidence Snapshot domain with immutable tenant-scoped snapshots, per-dimension items, queued generation, audit actions, badge mappings, and Filament list/detail surfaces
- add the workspace evidence overview, capability and policy wiring, Livewire update-path hardening, and review-pack integration through explicit evidence snapshot resolution
- add spec 153 artifacts, migrations, factories, and focused Pest coverage for evidence, review-pack reuse, authorization, action-surface regressions, and audit behavior

## Testing
- `vendor/bin/sail artisan test --compact --stop-on-failure`
- `CI=1 vendor/bin/sail artisan test --compact`
- `vendor/bin/sail bin pint --dirty --format agent`

## Notes
- branch: `153-evidence-domain-foundation`
- commit: `b7dfa279`
- spec: `specs/153-evidence-domain-foundation/`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #183
2026-03-19 13:32:52 +00:00

# Spec Candidates
> Concrete future specs waiting for prioritization.
> Each entry has enough structure to become a real spec when the time comes.
>
> **Flow**: Inbox → Qualified → Planned → Spec created → removed from this file

**Last reviewed**: 2026-03-18 (Help/guidance capability line refactored into 4 bounded candidates)

---
## Inbox
> Unfiltered. A short note is enough. Review weekly.
- Dashboard trend visualizations (sparklines, compliance gauge, drift-over-time chart)
- Dashboard "Needs Attention" should be visually louder (alert color, icon, severity weighting)
- Dashboard enterprise polish: severity-weighted drift table, actionable alert buttons, progressive disclosure (demoted from Qualified — needs bounded scope before re-qualifying)
- Operations table should show duration + affected policy count
- Density control / comfortable view toggle for admin tables
- Inventory landing page may be redundant — consider pure navigation section
- Settings change history → explainable change tracking
- Workspace chooser v2: search, sort, favorites, pins, environment badges, last activity
- Workspace-level PII override for review packs (deferred from Spec 109 — controls whether PII is included/redacted in tenant review pack exports at workspace scope)
- CSV export for filtered run metadata (deferred from Spec 114 — allow operators to export filtered operation run lists from the system console as CSV)
- Raw error/context drilldowns for system console (deferred from Spec 114 — in-product drilldown into raw error payloads and execution context for failed/stuck runs in the system console)
---
## Qualified
> Problem and benefit are clear. Scope is still open. Still needs prioritization.
### Queued Execution Reauthorization and Scope Continuity
- **Type**: hardening
- **Source**: architecture audit 2026-03-15
- **Problem**: Queued work still relies too heavily on dispatch-time actor and tenant state. Execution-time scope continuity and capability revalidation are not yet hardened as a canonical backend contract.
- **Why it matters**: This is a backend trust-gap on the mutation path. It creates the class of failure where a UI action was valid at dispatch time but the queued execution is no longer legitimate when it runs.
- **Proposed direction**: Define execution-time reauthorization, tenant operability rechecks, denial semantics, and audit visibility as a dedicated spec instead of scattering local `authorize()` patches.
- **Dependencies**: Existing operations semantics, audit log foundation, queued job execution paths
- **Priority**: high
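
To make the dispatch-time versus execution-time gap concrete, here is a minimal sketch of an execution-time recheck. It is an illustration only, not the project's implementation: `QueuedAction`, `Denial`, `reauthorize`, and the capability key `backup.restore` are hypothetical names, and the capability and operability lookups are passed in as plain collections.

```python
from dataclasses import dataclass
from enum import Enum


class Denial(Enum):
    # Hypothetical denial reasons a job could record in the audit log.
    ACTOR_LOST_CAPABILITY = "actor_lost_capability"
    TENANT_NOT_OPERABLE = "tenant_not_operable"


@dataclass(frozen=True)
class QueuedAction:
    actor_id: int
    tenant_id: int
    capability: str  # hypothetical capability key, e.g. "backup.restore"


def reauthorize(action, capabilities_now, operable_tenants_now):
    """Re-check the action against current state, not dispatch-time state.

    capabilities_now maps actor id -> set of capability keys held right now.
    operable_tenants_now is the set of tenant ids that may be mutated right now.
    Returns None when execution may proceed, otherwise a Denial reason.
    """
    if action.tenant_id not in operable_tenants_now:
        return Denial.TENANT_NOT_OPERABLE
    granted = capabilities_now.get(action.actor_id, set())
    if action.capability not in granted:
        return Denial.ACTOR_LOST_CAPABILITY
    return None
```

A job handler would call `reauthorize` as its first step and stop with an audited denial when the result is not `None`, which keeps the check in one place rather than in scattered local patches.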
### Livewire Context Locking and Trusted-State Reduction
- **Type**: hardening
- **Source**: architecture audit 2026-03-15
- **Problem**: Complex Livewire and Filament flows still expose ownership-relevant context in public component state without one explicit repo-wide hardening standard.
- **Why it matters**: This is a trust-boundary problem. Even without a known exploit, mutable client-visible identifiers and workflow context make future authorization and isolation mistakes more likely.
- **Proposed direction**: Define a reusable hardening pattern for locked identifiers, server-derived workflow truth, and forged-state regression tests on tier-1 component families.
- **Dependencies**: Managed tenant onboarding draft identity (Spec 138), onboarding lifecycle checkpoint work (Spec 140)
- **Priority**: medium
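
The idea behind a forged-state regression test can be sketched outside Livewire. The example below is a language-neutral illustration in Python, assuming a server-held secret; it is not how Livewire or Filament implement locking, and `seal`, `verify`, and `SECRET` are hypothetical names.

```python
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # assumption: held by the server, never sent to the client


def seal(locked: dict) -> str:
    """Sign the ownership-relevant identifiers before they leave the server."""
    payload = json.dumps(locked, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def verify(locked: dict, signature: str) -> bool:
    """Reject any request whose identifiers no longer match the signature."""
    return hmac.compare_digest(seal(locked), signature)
```

A regression test in this style submits a payload with a changed `tenant_id` and asserts that `verify` returns `False`, so a tampered identifier is rejected before any authorization logic runs.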
### Tenant Draft Discard Lifecycle and Orphaned Draft Visibility
- **Type**: hardening
- **Source**: domain architecture analysis 2026-03-16 — tenant lifecycle vs onboarding workflow lifecycle review
- **Problem**: TenantPilot correctly separates durable tenant lifecycle (`draft`, `onboarding`, `active`, `archived`) from onboarding workflow lifecycle (`draft` → `completed` / `cancelled`), but there is no end-of-life path for abandoned draft tenants. When all onboarding sessions for a tenant are cancelled, the tenant reverts to `draft` and remains visible indefinitely without a semantically correct cleanup action. Archive/restore do not apply (draft tenants have no operational data worth preserving), and force delete requires archive first (which is semantically wrong for a provisional record). Operators cannot remove orphaned drafts.
- **Why it matters**: Without a discard path, abandoned draft tenants accumulate as orphaned rows in the tenant list. This creates operator confusion (draft vs. archived vs. active ambiguity), data hygiene issues, and forces operators to either ignore stale records or misuse lifecycle actions that don't fit the domain semantics. The gap also makes tenant list UX harder to trust for enterprise operators managing many tenants.
- **Proposed direction**:
  - Introduce a canonical **draft discardability contract** (central service/policy, not scattered UI visibility logic) that determines whether a draft tenant may be safely removed, considering linked onboarding sessions, downstream artifacts, and operational traces
  - Add a **discard draft** destructive action for tenant records in `draft` status with no resumable onboarding sessions, gated by the discardability contract, capability authorization (`tenant.delete` or a dedicated `tenant.discard_draft`), and a confirmation modal
  - Add an **orphaned draft indicator** to the tenant list/detail views: a visual distinction between a resumable draft (has an active session) and an abandoned draft (all sessions terminal or none exist)
  - Emit a **distinct audit event** (`tenant.draft_discarded`) separate from `tenant.force_deleted`, capturing workspace context, tenant identifiers, linked session state, and acting user
  - Preserve and reinforce the existing domain separation: `archive/restore/force_delete` remain reserved for durable tenant lifecycle; `cancel/delete` remain reserved for onboarding workflow lifecycle; `discard` is the new end-of-life action for provisional drafts
- **Key domain rules**:
  - `archive` = preserve durable tenant for compliance while removing from active use
  - `restore` = reactivate an archived durable tenant
  - `force delete` = permanently destroy an already archived durable tenant
  - `discard draft` = permanently remove a provisional tenant that never became a durable operational entity
  - Draft tenants must NOT become archivable or restorable
- **Safety preconditions for discard**: tenant is in `draft` status, not trashed, no resumable onboarding sessions exist, no accumulated operational data (no policies, backups, operation runs beyond onboarding)
- **Out of scope**: automatic cleanup without operator confirmation, retention policy for cancelled onboarding sessions, changes to the 4-state tenant lifecycle enum, changes to the 7-state onboarding session lifecycle enum
- **Dependencies**: Spec 140 (onboarding lifecycle checkpoints — already shipped), Spec 143 (tenant lifecycle operability context semantics)
- **Related specs**: Spec 138 (draft identity), Spec 140 (lifecycle checkpoints), Spec 143 (lifecycle operability semantics)
- **Priority**: medium
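
The safety preconditions listed above could be centralized in a single check that returns the reasons a discard is blocked. The following is a minimal sketch under assumed names: `DraftTenant`, its fields, and the blocker strings are hypothetical and only mirror the preconditions stated in this candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftTenant:
    status: str               # "draft", "onboarding", "active", or "archived"
    trashed: bool
    resumable_sessions: int   # onboarding sessions that can still be resumed
    policies: int
    backups: int
    non_onboarding_runs: int  # operation runs beyond onboarding


def discard_blockers(tenant: DraftTenant) -> list:
    """Return the reasons a tenant may not be discarded; an empty list means discardable."""
    blockers = []
    if tenant.status != "draft":
        blockers.append("not_in_draft_status")
    if tenant.trashed:
        blockers.append("already_trashed")
    if tenant.resumable_sessions > 0:
        blockers.append("resumable_onboarding_session")
    if tenant.policies or tenant.backups or tenant.non_onboarding_runs:
        blockers.append("has_operational_data")
    return blockers
```

Returning reasons instead of a bare boolean lets the same contract drive the action's visibility, the confirmation modal text, and the audit payload.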
### Exception / Risk-Acceptance Workflow for Findings
- **Type**: feature
- **Source**: HANDOVER gap analysis, Spec 111 follow-up
- **Problem**: Finding has a `risk_accepted` status value but no formal exception lifecycle. Today, accepting risk is a status transition — there is no dedicated entity to record who accepted the risk, why, under what conditions, or when the acceptance expires. No approval workflow, no expiry/renewal semantics, no structured justification. Auditors cannot answer "who accepted this risk, what was the justification, and is it still valid?"
- **Why it matters**: Enterprise compliance frameworks (ISO 27001, SOC 2, CIS) require documented, time-bounded risk acceptance with clear ownership. A bare status flag does not meet this bar. Without a formal exception lifecycle, risk acceptance becomes invisible to audit trails and impossible to govern at scale.
- **Proposed direction**: First-class `RiskException` (or `FindingException`) entity linked to Finding, with: justification text, owner (actor), `accepted_at`, `expires_at`, renewal/reminder semantics, optional linkage to verification checks or related findings. Approval flow with capability-gated acceptance. Audit trail for creation, renewal, expiry, and revocation. Findings in `risk_accepted` state without a valid exception should surface as governance warnings.
- **Dependencies**: Findings workflow (Spec 111) complete, audit log foundation (Spec 134)
- **Priority**: high
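
As a rough illustration of the proposed entity and the governance-warning rule, the sketch below models an exception with an acceptance window. The class name follows the `RiskException` suggestion above; the `revoked_at` field, `is_valid`, and `governance_warning` are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RiskException:
    finding_id: int
    owner: str
    justification: str
    accepted_at: datetime
    expires_at: datetime
    revoked_at: Optional[datetime] = None  # assumed field for revocation

    def is_valid(self, now: datetime) -> bool:
        """Valid only when not revoked and inside the acceptance window."""
        return self.revoked_at is None and self.accepted_at <= now < self.expires_at


def governance_warning(status: str, exception: Optional[RiskException], now: datetime) -> bool:
    """True when a finding claims risk acceptance without a valid exception."""
    if status != "risk_accepted":
        return False
    return exception is None or not exception.is_valid(now)
```

With this shape, the auditor's question of who accepted the risk, why, and whether it is still valid maps to `owner`, `justification`, and `is_valid`.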
### Evidence Domain Foundation
- **Type**: feature
- **Source**: HANDOVER gap, R2 theme completion
- **Problem**: Review pack export (Spec 109) and permission posture reports (Specs 104/105) exist as separate output artifacts. There is no first-class evidence domain model that curates, bundles, and tracks these artifacts as a coherent compliance deliverable for external audit submission.
- **Why it matters**: Enterprise customers need a single, versioned, auditor-ready package — not a collection of separate exports assembled manually. The gap is not export packaging (Spec 109 handles that); it is the absence of an evidence domain layer that owns curation, completeness tracking, and audit-trail linkage.
- **Proposed direction**: Evidence domain model with curated artifact references (review packs, posture reports, findings summaries, baseline governance snapshots). Completeness metadata. Immutable snapshots with generation timestamp and actor. Not a re-implementation of export — a higher-order assembly layer.
- **Explicit non-goals**: Not a presentation or reporting layer — this candidate owns data curation, completeness tracking, artifact storage, and immutable snapshots. Executive summaries, framework-oriented readiness views, management-ready outputs, and stakeholder-facing packaging belong to the Compliance Readiness & Executive Review Packs candidate, which consumes this foundation. Not a replacement for Spec 109's export packaging. Not a generic BI or data warehouse initiative.
- **Boundary with Compliance Readiness**: Evidence Domain Foundation = lower-level data assembly (what artifacts exist, are they complete, are they immutable). Compliance Readiness = upper-level presentation (how to arrange evidence into framework-oriented, stakeholder-facing deliverables). This candidate is a prerequisite; Compliance Readiness is a downstream consumer.
- **Dependencies**: Review pack export (Spec 109), permission posture (Specs 104/105)
- **Priority**: high
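
One way to picture an immutable snapshot with completeness metadata is a frozen record that reports which artifact dimensions are missing. This is a sketch under assumptions, not the shipped model: `EvidenceSnapshot`, `REQUIRED_DIMENSIONS`, and the dimension names are hypothetical and derived from the artifact types named in this candidate.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical dimension names based on the artifact types listed above.
REQUIRED_DIMENSIONS = (
    "review_pack",
    "posture_report",
    "findings_summary",
    "baseline_snapshot",
)


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class EvidenceSnapshot:
    tenant_id: int
    generated_at: datetime
    generated_by: str
    artifacts: tuple  # (dimension, artifact_reference) pairs

    def missing_dimensions(self) -> tuple:
        """Required dimensions that have no artifact reference in this snapshot."""
        present = {dimension for dimension, _ in self.artifacts}
        return tuple(d for d in REQUIRED_DIMENSIONS if d not in present)

    def is_complete(self) -> bool:
        return not self.missing_dimensions()
```

The snapshot stores references to existing artifacts rather than copies of them, which matches the framing of this candidate as an assembly layer and not a re-implementation of export.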
### Compliance Readiness & Executive Review Packs
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, R2 theme completion, product positioning for German midmarket / MSP governance
- **Problem**: TenantPilot is building a strong evidence/data foundation (Evidence Domain Foundation candidate, StoredReports, review pack export via Spec 109, findings, baselines), but there is no product-level capability that assembles this data into management-ready, customer-facing, or auditor-oriented readiness views. Enterprise customers, MSP account managers, and CISOs need structured governance outputs for recurring tenant reviews, audit preparation, and compliance conversations — not raw artifact collections or manual export assembly. The gap is not data availability; it is the absence of a dedicated readiness presentation and packaging layer that turns existing governance evidence into actionable, consumable deliverables.
- **Why it matters**: This is a core product differentiator and revenue-relevant capability for the MSP and German midmarket audience. Without it, TenantPilot remains an operator tool — powerful but invisible to the stakeholders who sign off on governance, approve budgets, and evaluate vendor value. Structured readiness outputs (lightweight BSI/NIS2/CIS-oriented views, executive summaries, customer review packs) make TenantPilot sellable as a governance review platform, not just a backup and configuration tool. This directly strengthens the MSP sales story for quarterly reviews, security health checks, and audit preparation.
- **Proposed direction**:
  - A dedicated readiness/review presentation layer that consumes evidence domain artifacts, findings summaries, baseline/drift posture, permission posture signals, and operational health data
  - Management-ready output surfaces: executive summary views, customer-facing review dashboards, structured compliance readiness pages oriented toward frameworks such as BSI Grundschutz, NIS2, and CIS, in a lightweight, non-certification sense (governance evidence, not formal compliance claims)
  - Exportable review packs that combine multiple evidence dimensions into a single coherent deliverable (PDF or structured export) for external stakeholders
  - Tenant-scoped and workspace-scoped views: individual tenant reviews as well as portfolio-level readiness summaries
  - Clear separation from the Evidence Domain Foundation: evidence foundation owns curation, completeness tracking, and artifact storage; compliance readiness owns presentation, assembly, and stakeholder-facing output
  - Readiness views should be composable: an operator selects which dimensions to include in a review pack (e.g. baseline posture + findings summary + permission evidence + operational health) rather than a monolithic fixed report
- **Explicit non-goals**: Not a formal certification engine — TenantPilot does not certify compliance or issue attestations. Not a legal or compliance advice system. Not a replacement for the Evidence Domain Foundation (which owns the data layer). Not a generic BI dashboard or data warehouse initiative. Not a PDF-only export task — the primary value is the structured readiness view, with export as a secondary delivery mechanism. Not a reimplementation of review pack export (Spec 109 handles CSV/ZIP). Not a customer-facing analytics suite.
- **Boundary with Evidence Domain Foundation**: Evidence Domain Foundation = curation, completeness tracking, artifact storage, immutable snapshots. Compliance Readiness = presentation, assembly, framework-oriented views, stakeholder-facing outputs. Evidence Foundation is a prerequisite; Compliance Readiness is a consumer.
- **Dependencies**: Evidence Domain Foundation (data layer), review pack export (Spec 109), findings workflow (Spec 111), baseline/drift engine (Specs 116-119), permission posture (Specs 104/105), audit log foundation (Spec 134)
- **Priority**: medium (high strategic value, but depends on evidence foundation maturity)
### Enterprise App / Service Principal Governance
- **Type**: feature
- **Source**: platform domain coverage planning, governance gap analysis
- **Problem**: TenantPilot covers tenant configuration and governance workflows, but lacks a first-class governance surface for enterprise applications and service principals. Operators cannot easily answer which app identities exist, which ones hold privileged permissions, which credentials are nearing expiry, and where renewal/review workflows are needed.
- **Why it matters**: Enterprise apps and service principals are a major governance and security pain point in Microsoft cloud environments. Expiring secrets/certificates, over-privileged app permissions, and unclear ownership create real audit, operational, and risk-management gaps. This is highly relevant for MSP reviews, customer reporting, and exception workflows.
- **Proposed direction**: Add a governance-oriented domain surface for enterprise applications and service principals, starting with inventory, privileged-permission visibility, expiring credential visibility, ownership/review metadata, alerting hooks, and exception/renewal workflow support. Keep the scope centered on governance and reviewability rather than trying to model all enterprise app administration.
- **Dependencies**: Evidence/reporting direction, alerting foundations, RBAC/capability model, domain coverage strategy
- **Priority**: high
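
Expiring-credential visibility reduces to a date-window filter over app credentials. The sketch below assumes a flat list of credentials has already been inventoried; `AppCredential` and `expiring_within` are hypothetical names, and no Microsoft Graph call is shown or implied.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AppCredential:
    app_name: str
    kind: str         # "secret" or "certificate"
    expires_on: date


def expiring_within(credentials, today: date, days: int):
    """Credentials that expire within `days` of `today`, soonest first.

    Already-expired credentials are included because they also need review.
    """
    cutoff = today + timedelta(days=days)
    due = [c for c in credentials if c.expires_on <= cutoff]
    return sorted(due, key=lambda c: c.expires_on)
```

Sorting by expiry date keeps the most urgent renewals at the top, which suits both an alerting hook and a review list.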
### SharePoint Tenant-Level Sharing Governance
- **Type**: feature
- **Source**: platform domain coverage planning, audit/compliance positioning
- **Problem**: TenantPilot currently focuses on device and identity governance domains, but does not yet cover one of the most audit-relevant Microsoft 365 data-governance control surfaces: tenant-level SharePoint and OneDrive external sharing settings. Operators lack a governance view for high-risk sharing posture at tenant scope.
- **Why it matters**: Tenant-level sharing controls are central to data exposure, external collaboration, and audit readiness. For many customers, especially compliance-oriented SMB and midmarket environments, these settings are part of the core governance story and should not remain outside the platform's planned coverage.
- **Proposed direction**: Introduce a bounded governance surface for tenant-level SharePoint and OneDrive sharing/access settings, focused on inventory, reviewability, explainability, and later alignment with evidence/reporting workflows. Start at tenant-level controls rather than attempting full site-level administration or a broad SharePoint management surface.
- **Dependencies**: Domain coverage strategy, Microsoft 365 policy-domain expansion, reporting/evidence direction
- **Priority**: medium
### Entra Role Governance
- **Type**: feature
- **Source**: platform domain coverage planning, identity governance expansion
- **Problem**: TenantPilot does not yet provide a first-class governance surface for Microsoft Entra roles. Built-in roles, custom role definitions, and role assignments are highly relevant for identity governance, but today they are not planned as a dedicated product capability.
- **Why it matters**: Role governance is a central part of tenant security posture, privileged access control, and audit readiness. Customers need visibility into how administrative authority is defined and assigned, especially as Entra role usage grows beyond default out-of-the-box roles.
- **Proposed direction**: Add a first-class Entra role governance capability focused on role definitions and assignments as governable objects. Start with inventory, visibility, and review-oriented explainability. Preserve the possibility of future attestation/review workflows without making them mandatory in V1.
- **Dependencies**: Identity governance expansion, RBAC/capability model, reporting/evidence direction
- **Priority**: medium
### Security Posture Signals Foundation
- **Type**: feature
- **Source**: platform domain coverage planning, compliance/readiness reporting direction
- **Problem**: TenantPilot's evidence and reporting direction is strong, but high-value security posture signals such as Defender Vulnerability Management exposure data and backup assurance signals are not yet represented as a bounded product capability. This leaves a gap between governance findings and the operational evidence customers want in recurring reviews.
- **Why it matters**: Customers and MSP operators increasingly want proof that security operations are functioning, not just that configurations exist. Exposure trends, vulnerability posture, and backup success/failure signals are highly valuable inputs for executive reviews, customer reporting, and audit preparation.
- **Proposed direction**: Establish a bounded evidence/signal foundation for ingesting, historizing, correlating, and reporting on selected posture signals, starting with Defender Vulnerability Management and backup success/failure/protection-state signals. Keep this clearly in the evidence domain, not the policy domain.
- **Dependencies**: StoredReports/Evidence direction, signal ingestion foundations, reporting/export maturity
- **Priority**: medium
### Security Suite Layer — Posture Score, Blast Radius, High-Risk Opt-In Controls
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, 0800-future-features brainstorming, roadmap "Security Suite Layer" long-term theme
- **Problem**: TenantPilot's security-related capabilities are growing — findings, baselines, drift detection, permission posture — but they remain siloed as individual data outputs. The Security Posture Signals Foundation candidate addresses lower-level signal ingestion and evidence collection, but there is no product layer that aggregates, interprets, and prioritizes these signals into actionable security posture surfaces for operators. An MSP operator managing twenty tenants cannot currently answer: which tenants have the weakest security posture? Which single misconfiguration has the widest blast radius across users and devices? Which high-risk settings are intentionally enabled vs. accidentally exposed? The gap is not signal availability — it is the absence of a higher-level interpretation and prioritization layer that turns raw posture data into operator-facing security value.
- **Why it matters**: Raw signals and individual findings are necessary but insufficient. Operators, CISOs, and MSP account managers need aggregated, prioritized, and contextualized security views that surface the most consequential risks first. Without this, security-relevant data is scattered across findings tables, drift reports, permission posture views, and evidence exports — forcing operators to mentally assemble a posture picture themselves. A productized security posture layer is the difference between "we collect security data" and "we help you act on the most important risks." This is a strategic differentiator for MSP positioning and enterprise customer conversations where security posture is a recurring review topic.
- **Proposed direction**:
  - **Posture scoring or posture rollups**: Tenant-level and optionally workspace-level security posture summaries that aggregate signals from findings, baselines, drift state, permission posture, and (when available) external posture signals into a structured posture indicator. Not a single arbitrary number, but a structured rollup showing posture dimensions (configuration compliance, identity hygiene, protection coverage, exposure risk) with clear drill-down paths. The goal is "where should I focus?" not "what is my score?"
  - **Blast-radius and impact-oriented interpretation**: For high-severity findings, misconfigurations, or risky conditions, show the scope of impact: how many users, devices, or groups are affected? Which policies target broad populations with permissive or risky settings? Impact context helps operators prioritize consequential risks over technically-severe-but-narrow ones. This is interpretation layered on top of existing assignment and scope data, not a separate data collection effort.
  - **High-risk opt-in and guarded enablement surfaces**: Where tenants have intentionally enabled high-risk settings (e.g. broad sharing, disabled MFA for service accounts, permissive conditional access), make these visible as explicit, acknowledged decisions rather than hidden configuration details. Support opt-in acknowledgement patterns where operators confirm that a high-risk condition is intentional versus accidental. This is about operator awareness and explicit decision capture, not about enforcing or blocking configurations.
  - **Security prioritization surfaces**: Operator-facing views that rank and filter posture conditions by severity, blast radius, and recency, helping operators focus on the few conditions that matter most rather than reviewing flat lists. Supports "top 5 risks across my portfolio" and "highest-impact unresolved findings" patterns.
  - **Tenant-scoped and portfolio-aware**: Security posture is evaluated per tenant; portfolio-level aggregation surfaces which tenants are strongest and weakest for MSP operators managing fleets. Supports fleet-level security posture comparisons and trend tracking over time.
- **Explicit non-goals**: Not a SIEM, SOC, or XDR platform — this is posture interpretation for governance operators, not a security operations center tool. Not a vulnerability management system — TenantPilot does not own vulnerability remediation workflows or patch management. Not a generic security analytics platform or BI dashboard. Not a replacement for Security Posture Signals Foundation — this candidate consumes signals; the foundation candidate collects them. Not a compliance certification engine (Compliance Readiness handles audit-ready reporting). Not a threat detection or incident response system. Not a catch-all security backlog bucket — scope is bounded to posture aggregation, interpretation, prioritization, and guarded-visibility patterns. Not a broad platform hardening initiative — infrastructure, delivery, and application-level hardening are separate candidates.
- **Boundary with Security Posture Signals Foundation**: Security Posture Signals Foundation = signal ingestion, historization, correlation, and evidence-layer representation of security-relevant data (Defender exposure, backup health, etc.). Security Suite Layer = aggregation, interpretation, prioritization, and operator-facing posture value built on top of those signals (and other existing governance data like findings, baselines, and drift). Foundation is the substrate; Suite Layer is the product interpretation. Foundation answers "what signals exist?" Suite Layer answers "what do they mean, and what should I act on first?"
- **Boundary with Compliance Readiness & Executive Review Packs**: Compliance Readiness = framework-oriented, stakeholder-facing reporting and evidence assembly for audit conversations. Security Suite Layer = operator-facing, action-oriented posture interpretation for day-to-day security prioritization. Compliance readiness produces reports; security suite layer guides operational focus. Posture data may feed into compliance readiness outputs, but the two serve different audiences and decision patterns.
- **Boundary with Script & Secrets Governance**: Script & Secrets Governance = lifecycle controls, diff, review, and scanning for high-risk content (scripts, secrets). Security Suite Layer may consume secrets governance findings as posture inputs, but does not own the scanning, diffing, or lifecycle management of scripts and secrets.
- **Boundary with Findings and Baselines**: Findings (Spec 111+) and baselines (Specs 116-119) produce governance data points. Security Suite Layer aggregates and reinterprets those data points through a security-prioritization lens. Findings workflow owns the individual finding lifecycle; security suite layer owns the cross-finding posture picture.
- **Dependencies**: Security Posture Signals Foundation (primary signal source), findings workflow (Spec 111+), baseline/drift engine (Specs 116-119), permission posture (Specs 104/105), RBAC/capability system (Spec 066+), audit log foundation (Spec 134)
- **Priority**: medium (high strategic value for MSP positioning and enterprise security conversations, but realistically sequenced after signal foundations and current governance hardening work are stable)
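
The prioritization surface described above ranks conditions by severity and blast radius. The sketch below shows one possible ranking; the multiplicative score, the `SEVERITY_WEIGHT` scale, and the `PostureCondition` fields are assumptions made for illustration, and recency is left out to keep the example short.

```python
from dataclasses import dataclass

SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 4}  # assumed scale


@dataclass(frozen=True)
class PostureCondition:
    title: str
    severity: str
    affected_users: int
    affected_devices: int

    @property
    def blast_radius(self) -> int:
        # Assumed definition: total number of affected users and devices.
        return self.affected_users + self.affected_devices


def top_risks(conditions, limit=5):
    """Rank by severity weight times blast radius, highest first."""
    def score(condition):
        return SEVERITY_WEIGHT[condition.severity] * condition.blast_radius
    return sorted(conditions, key=score, reverse=True)[:limit]
```

Under this scoring, a medium-severity condition that touches 500 identities outranks a critical one that touches two, which is the "consequential over technically-severe-but-narrow" behavior the candidate asks for. A real ranking would likely need tuning so that narrow critical conditions are not buried.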
### Recovery Confidence — Automated Restore Testing & Readiness Reporting
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, 0800-future-features brainstorming ("Killer Feature"), product positioning for enterprise trust and MSP differentiation
- **Problem**: TenantPilot has a mature backup and restore pipeline — including restore preview, dry-run, risk checking, and audit logging — but there is no product-level capability that answers the question "how confident are we that restores will actually succeed when needed?" Backup existence proves data is captured; restore execution proves the mechanism works when manually triggered. Neither proves ongoing recoverability. Operators cannot answer: when was recoverability last validated for a given tenant or policy family? Which restore paths have never been exercised? Which tenants have backup coverage but zero restore confidence? What is the overall recovery posture across the portfolio? The gap is not restore capability — it is the absence of a confidence and readiness layer that continuously proves, measures, and reports on recoverability.
- **Why it matters**: Backup without proven recoverability is a false safety net. Enterprise customers, auditors, and MSP account managers increasingly ask not "do you have backups?" but "can you prove you can recover?" Recovery confidence is the difference between a backup tool and a trusted governance platform. It directly strengthens audit conversations (proving restore paths work), MSP differentiation (recovery readiness as a reportable SLA dimension), and operator trust (visibility into which restore paths are validated vs. assumed). This was identified as a "killer feature" in product brainstorming because it shifts TenantPilot from reactive restore capability to proactive recovery assurance — a category few competitors occupy.
- **Proposed direction**:
  - **Automated restore confidence checks**: scheduled or operator-triggered restore validation runs that exercise restore paths without modifying the production tenant, leveraging existing dry-run/preview infrastructure, targeted at proving that backed-up configurations can be successfully restored. Confidence checks produce structured results (pass/fail/partial, coverage, blockers) rather than just logs.
  - **Recoverability tracking model**: per-tenant, per-policy-family tracking of when each restore path was last validated, what the result was, and which paths remain unexercised. This is the persistent readiness state, not a one-time report. Tracks coverage (which policy families have been validated), freshness (how recently), and result quality (clean pass vs. partial vs. failed).
  - **Restore readiness summaries and reporting**: tenant-level and workspace-level views that show recovery posture: coverage gaps, stale validations, unexercised restore paths, confidence scores or readiness indicators. Exportable for audit evidence, customer reviews, and management reporting. Integrates with the evidence/reporting direction as a high-value signal source.
  - **Preflight scoring**: before a real restore is needed, operators can see a structured readiness assessment: which policy families are covered by recent successful validation, which have known blockers, which have never been tested. This turns restore from a "hope it works" moment into a predictable, pre-validated operation.
  - **Validation evidence trail**: each confidence check produces an immutable evidence artifact: what was tested, when, by whom, what the result was, what blockers were found. This evidence feeds into review packs, audit conversations, and compliance readiness outputs. The validation run itself is an auditable governance event.
  - **Tenant-scoped and portfolio-aware**: recovery confidence is evaluated per tenant; portfolio-level aggregation surfaces which tenants have strong vs. weak recovery posture for MSP operators managing fleets.
- **Explicit non-goals**: Not a rewrite or replacement of the restore engine (Spec 011 and related specs handle restore execution; this handles confidence measurement and readiness reporting on top of that foundation). Not a full disaster-recovery orchestration platform or automated failover system. Not a synthetic test lab that provisions isolated test tenants and deploys configurations into them — confidence checks leverage existing dry-run/preview/validation infrastructure, not a separate execution environment. Not a generic backup-health dashboard — backup health (coverage, freshness, size) is a prerequisite signal, not the same problem as restore confidence (proven recoverability). Not a vague "resilience" umbrella — this is specifically about proving and reporting on restore path readiness. Not a replacement for the Evidence Domain Foundation (which owns artifact curation) or Compliance Readiness (which owns presentation assembly) — recovery confidence produces evidence artifacts that those layers consume.
- **Boundary with restore execution (Spec 011, restore pipeline)**: Restore execution = the mechanism to restore configurations from backup to a tenant. Recovery confidence = the layer that exercises, measures, tracks, and reports on whether those mechanisms are ready and reliable. Execution is the tool; confidence is the proof.
- **Boundary with Security Posture Signals Foundation**: Security Posture Signals = ingestion and historization of external posture data (Defender, backup success/failure signals) as evidence inputs. Recovery Confidence = active validation of restore paths and structured readiness reporting. Posture signals may include backup-health inputs; recovery confidence actively exercises restore paths and produces readiness-specific evidence. Backup health signals are passive; restore confidence checks are active. They are complementary: posture signals feed portfolio health views; recovery confidence proves operational readiness.
- **Boundary with Evidence Domain Foundation**: Evidence Foundation = curation, completeness tracking, immutable artifact storage. Recovery Confidence = produces validation evidence artifacts (confidence check results) that Evidence Foundation may curate and bundle. Recovery confidence is a producer; evidence foundation is a curator/consumer.
- **Boundary with MSP Portfolio Dashboard**: Portfolio Dashboard = fleet-level health aggregation and SLA reporting. Recovery confidence signals (per-tenant readiness posture) are a high-value input to the portfolio dashboard, not a replacement for it. The dashboard consumes; this candidate produces the recovery-specific signal.
- **Dependencies**: Restore pipeline stable (Spec 011 and follow-ups), backup infrastructure mature, dry-run/preview infrastructure (restore preview), audit log foundation (Spec 134), RBAC/capability system (066+), evidence/reporting direction for downstream consumption
- **Priority**: medium (high strategic value and strong product differentiation potential, but depends on restore pipeline maturity and is realistically sequenced after current hardening work)
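
The tracking model and preflight assessment described above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not a proposed implementation: `ValidationRecord`, `assess_readiness`, `coverage`, the four status names, and the 30-day freshness window are all hypothetical and would be settled in the actual spec.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# Hypothetical record shape; the real model would be defined by the spec.
@dataclass(frozen=True)
class ValidationRecord:
    policy_family: str
    validated_at: datetime
    result: str  # "pass", "partial", or "fail"


def assess_readiness(families, records, now, max_age=timedelta(days=30)):
    """Classify each policy family as ready, stale, blocked, or untested."""
    # Keep only the most recent validation per policy family.
    latest = {}
    for rec in records:
        current = latest.get(rec.policy_family)
        if current is None or rec.validated_at > current.validated_at:
            latest[rec.policy_family] = rec

    report = {}
    for family in families:
        rec = latest.get(family)
        if rec is None:
            report[family] = "untested"   # restore path never exercised
        elif rec.result != "pass":
            report[family] = "blocked"    # latest check was partial or failed
        elif now - rec.validated_at > max_age:
            report[family] = "stale"      # clean pass, but too old (assumed window)
        else:
            report[family] = "ready"
    return report


def coverage(report):
    """Share of policy families with a recent clean pass."""
    if not report:
        return 0.0
    return sum(1 for status in report.values() if status == "ready") / len(report)
```

Only the latest result per family counts in this sketch, so an older clean pass does not mask a newer partial result. Whether that is the desired semantics is an open design question.
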
### MSP Multi-Tenant Portfolio Dashboard & SLA Reporting
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, 0800-future-features brainstorming (pillar #1 — MSP Portfolio & Operations), product positioning for MSP portfolio owners
- **Problem**: TenantPilot provides strong per-tenant governance, monitoring, and operational surfaces, but MSP operators and portfolio owners managing 10-100+ tenants across workspaces have no fleet-level view that answers "how is my portfolio doing?" There is no cross-tenant health summary, no SLA/compliance risk overview, no portfolio-level operational monitoring, and no structured reporting surface that supports recurring customer portfolio reviews. Operators must navigate tenant by tenant to assemble a portfolio picture, which does not scale and prevents proactive governance.
- **Why it matters**: MSP portfolio visibility is the #1 brainstorming priority and a core product differentiator. Without it, TenantPilot serves individual tenant management well but cannot position itself as the operational cockpit for MSP businesses. Portfolio-level health, SLA tracking, compliance risk summaries, and cross-tenant operational monitoring are the capabilities that justify platform-level pricing and recurring MSP engagement. This is the difference between a per-tenant tool and an MSP operations platform.
- **Proposed direction**:
- Workspace-level portfolio dashboard: aggregated health, governance, and operational status across all managed tenants in a workspace
- Key portfolio signals: backup health (last successful backup age, coverage), sync health (last successful sync, staleness), drift/findings posture (open findings count, severity distribution, trend), operational health (recent failures, stuck runs, throttling indicators), provider connection status (consent/verification posture across fleet)
- SLA/compliance risk summary views: which tenants are below operational health thresholds, which tenants have governance gaps, which tenants need attention — sortable, filterable, visually prioritized
- Cross-tenant operational monitoring: portfolio-level view of recent operation runs, failure clustering, and common error patterns across tenants
- Structured portfolio reporting: exportable portfolio health summaries for MSP-internal use, customer-facing SLA reports, and recurring review preparation
- Workspace-scoped, RBAC-gated: portfolio views respect workspace membership and capability authorization
- **Explicit non-goals**: Not a replacement for per-tenant dashboards or detail views (those remain the primary tenant-level surfaces). Not a generic BI/data warehouse initiative or a drag-and-drop report builder. Not a customer-facing analytics suite — this is an operator/MSP-internal tool. Not a cross-tenant compare/diff/promotion surface (that is the Cross-Tenant Compare & Promotion candidate). Not a system-console-level platform triage view (that is the System Console Multi-Workspace Operator UX candidate). Not a replacement for alerting (Specs 099/100 handle event-driven notifications; this is a review/monitoring surface).
- **Boundary with Cross-Tenant Compare & Promotion**: Portfolio Dashboard = fleet-level monitoring, health aggregation, SLA reporting, operational overview. Cross-Tenant Compare = policy-level diff, staging-to-production promotion, configuration comparison. They share the multi-tenant dimension but solve fundamentally different problems.
- **Boundary with System Console Multi-Workspace Operator UX**: Portfolio Dashboard = workspace-scoped MSP operator view, health/SLA/governance focus. System Console = platform-level triage, cross-workspace operator tooling, infrastructure focus. Different audiences, different panels.
- **Dependencies**: Per-tenant operational health signals (backup, sync, drift, findings, provider connection status), workspace model, tenant inventory, alerting foundations (Specs 099/100), RBAC/capability system (066+)
- **Priority**: medium (high strategic value, significant data aggregation effort; depends on per-tenant signal maturity)
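
To make the "which tenants need attention" view more concrete, the following sketch ranks tenants by how many health thresholds they breach. It is illustrative only: `TenantHealth`, `THRESHOLDS`, the threshold values, and the ranking rule are assumptions, not existing TenantPilot structures.

```python
from dataclasses import dataclass


# Hypothetical per-tenant signal snapshot, fed by the existing per-tenant surfaces.
@dataclass(frozen=True)
class TenantHealth:
    tenant: str
    backup_age_hours: float
    sync_age_hours: float
    open_high_findings: int
    failed_runs_24h: int


# Assumed limits; real thresholds would likely be configurable per workspace.
THRESHOLDS = {
    "backup_age_hours": 48,
    "sync_age_hours": 24,
    "open_high_findings": 0,
    "failed_runs_24h": 2,
}


def breaches(health):
    """Names of the signals that exceed their threshold for one tenant."""
    return [name for name, limit in THRESHOLDS.items() if getattr(health, name) > limit]


def needs_attention(portfolio):
    """Tenants with at least one breach, most breaches first, then by name."""
    flagged = [(h.tenant, breaches(h)) for h in portfolio]
    flagged = [(tenant, found) for tenant, found in flagged if found]
    return sorted(flagged, key=lambda item: (-len(item[1]), item[0]))
```

Counting breaches is the simplest possible prioritization. A severity-weighted score would be a natural refinement but is out of scope for this sketch.
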
### Policy Lifecycle / Ghost Policies (Spec 900 refresh)
- **Type**: feature
- **Source**: Spec 900 draft (2025-12-22), HANDOVER risk #9
- **Problem**: Policies deleted in Intune remain in TenantAtlas indefinitely. No deletion indicators. Backup items reference "ghost" policies.
- **Why it matters**: Data integrity, user confusion, backup reliability
- **Proposed direction**: Detect deletions during sync and soft-delete the local record, automatically un-delete a policy that reappears, show a "Deleted" badge, and allow restore from backup. Draft in Spec 900.
- **Dependencies**: Inventory sync stable
- **Priority**: medium
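
The soft-delete detection could look roughly like the reconciliation step below. This is a sketch under stated assumptions, not the Spec 900 design: the `deleted_at` field, the `reconcile` function, and the change labels are hypothetical.

```python
from datetime import datetime


def reconcile(local, remote_ids, now):
    """Soft-delete local policies missing from the sync result; un-delete ones that reappear.

    `local` maps policy id -> {"deleted_at": datetime or None} and is updated in place.
    Returns the list of (policy_id, change) pairs that were applied.
    """
    changes = []
    for policy_id, record in local.items():
        present = policy_id in remote_ids
        if not present and record["deleted_at"] is None:
            record["deleted_at"] = now      # policy vanished upstream
            changes.append((policy_id, "marked_deleted"))
        elif present and record["deleted_at"] is not None:
            record["deleted_at"] = None     # policy reappeared upstream
            changes.append((policy_id, "restored"))
    return changes
```

A step like this is presumably only safe when `remote_ids` comes from a complete, successful listing. A partial or throttled sync would otherwise mark live policies as deleted.
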
### Standardization & Policy Quality — Linting, Company Standards, Hygiene
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, 0800-future-features brainstorming (pillar #3 — Standardization & Policy Quality / "Intune Linting")
- **Problem**: TenantPilot captures, versions, and governs Intune policy configurations, but provides no capability to evaluate whether those configurations meet quality, consistency, or organizational standards. Operators cannot answer questions like: "Do all policies follow our naming convention?", "Are there duplicate or near-duplicate policies?", "Which policies have no assignments?", "Are scope tags applied consistently?", "Does this tenant meet our company's minimum configuration standard?" Today, quality and hygiene assessment is manual, tenant-by-tenant, and invisible to governance workflows.
- **Why it matters**: Configuration quality is a distinct governance dimension from baseline drift and compliance findings. Drift detection answers "has something changed?"; standardization answers "is it correct and well-structured?" Enterprise customers and MSPs need both. Policy linting, hygiene checks, and company standards create a repeatable quality layer that reduces configuration debt, catches structural problems early, and supports standardization across managed tenants. This is the #3 brainstorming priority and a natural complement to the existing governance stack.
- **Proposed direction**:
- **Policy linting / quality checks**: rule-based evaluation of policy configurations against defined quality criteria — naming conventions, scope tag requirements, assignment presence, setting completeness, structural validity. Rules should be composable and extensible per workspace or tenant.
- **Company standards as reusable reference packs**: operators or MSPs define their own configuration standards ("Company Standard 2026") as reference expectations that policies can be evaluated against. Distinct from Microsoft baselines — these are organization-defined, not vendor-defined. A standard pack is a set of expected configuration postures, not a deployable template.
- **Hygiene checks**: automated detection of structural problems — duplicate or near-duplicate policies, unassigned policies, orphaned scope tags or filters, policies with no settings or empty payloads, stale policies not updated in extended periods, inconsistent naming patterns across policy families.
- **Quality findings integration**: hygiene and linting results should produce structured findings or quality signals that integrate with the existing findings workflow, not a separate parallel reporting system.
- **Tenant-scoped and portfolio-aware**: quality evaluation runs per tenant; portfolio views can aggregate quality posture across tenants for MSP operators.
- **Explicit non-goals**: Not a full compliance framework or certification engine (compliance readiness is a separate candidate). Not a generic recommendation engine or AI assistant. Not a replacement for baseline/drift detection (which answers "has it changed from a known-good state?" — standardization answers "is it well-structured and consistent?"). Not a policy deployment or remediation engine — this is evaluation and visibility, not automated correction. Not a replacement for the existing findings workflow — quality signals should flow into findings, not bypass them.
- **Boundary with baseline/drift engine**: Baselines compare current state against a snapshot of known-good state. Standardization evaluates current state against quality rules and organizational expectations. A policy can be drift-free (unchanged from baseline) but still fail quality checks (bad naming, missing assignments, no scope tags). These are complementary, not overlapping.
- **Boundary with Policy Setting Explorer**: Policy Setting Explorer = reverse lookup ("where is this setting defined?"). Standardization = quality evaluation ("is this policy well-structured and consistent?"). Different questions, different surfaces.
- **Dependencies**: Inventory sync stable, policy versioning, tenant context model, findings workflow (Spec 111) for quality findings integration, RBAC/capability system (066+)
- **Priority**: medium (high strategic value, incremental delivery possible starting with high-value hygiene checks)
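
As an illustration of "composable" lint rules, the sketch below treats each rule as a small function that returns a message when violated. The `Policy` shape, the rule names, and the naming pattern are all hypothetical and chosen only for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical, heavily simplified policy shape.
@dataclass(frozen=True)
class Policy:
    name: str
    assignments: tuple = ()
    scope_tags: tuple = ()


# A rule returns a message when the policy violates it, otherwise None.
Rule = Callable[[Policy], Optional[str]]


def naming_rule(pattern: str) -> Rule:
    """Build a rule that requires the policy name to match a regex pattern."""
    regex = re.compile(pattern)

    def check(policy: Policy) -> Optional[str]:
        if regex.fullmatch(policy.name):
            return None
        return "name does not match " + pattern

    return check


def has_assignments(policy: Policy) -> Optional[str]:
    return None if policy.assignments else "policy has no assignments"


def has_scope_tags(policy: Policy) -> Optional[str]:
    return None if policy.scope_tags else "policy has no scope tags"


def lint(policies, rules):
    """Run every rule against every policy and collect (policy name, message) pairs."""
    findings = []
    for policy in policies:
        for rule in rules:
            message = rule(policy)
            if message is not None:
                findings.append((policy.name, message))
    return findings
```

A workspace-level "company standard" would then be little more than a named list of such rules, and the `(policy name, message)` pairs are the raw material for findings in the existing workflow.
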
### Schema-driven Secret Classification
- **Type**: hardening
- **Source**: Spec 120 deferred follow-up
- **Problem**: Secret redaction currently uses pattern-based detection. A schema-driven approach via `GraphContractRegistry` metadata would be more reliable.
- **Why it matters**: Reduces false negatives in secret redaction
- **Proposed direction**: Central classifier in `GraphContractRegistry`, regression corpus
- **Dependencies**: Secret redaction (120) stable, registry completeness (095)
- **Priority**: medium
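
The difference from pattern-based detection is that the classifier consults declared field metadata instead of guessing from names or values. A minimal sketch, assuming a registry that lists secret field paths per policy type (the `SECRET_FIELDS` table, its contents, and `redact` are hypothetical and do not reflect the actual `GraphContractRegistry` API):

```python
# Hypothetical registry metadata: policy type -> dotted paths of secret fields.
SECRET_FIELDS = {
    "wifi_configuration": {"preSharedKey"},
    "vpn_configuration": {"sharedSecret", "proxy.password"},
}

REDACTED = "[redacted]"


def redact(policy_type, payload, prefix=""):
    """Return a copy of payload with schema-declared secret fields replaced."""
    secret_paths = SECRET_FIELDS.get(policy_type, set())
    result = {}
    for key, value in payload.items():
        path = prefix + key
        if path in secret_paths:
            result[key] = REDACTED
        elif isinstance(value, dict):
            # Recurse with the dotted path so nested secrets can be declared.
            result[key] = redact(policy_type, value, prefix=path + ".")
        else:
            result[key] = value
    return result
```

A regression corpus would then pair sample payloads with their expected redacted form, so that registry changes cannot silently reintroduce false negatives.
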
### Script & Secrets Governance — Diff, Review, Scanning, and Lifecycle Controls for High-Risk Content
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, 0800-future-features brainstorming (Script & Secrets Governance pillar), platform hardening direction
- **Problem**: TenantPilot governs a wide range of Intune policy configurations, but a subset of these configurations carries disproportionate operational risk: PowerShell remediation scripts, detection scripts, custom compliance scripts, proactive remediations, and policy artifacts that embed or reference secrets (pre-shared keys, certificate data, credentials, API tokens). These artifacts are fundamentally different from declarative policy settings — they contain executable logic or sensitive material where a single change can have outsized blast radius, and where silent or unreviewed modification creates real security and operational exposure. Today, TenantPilot treats script-bearing and secret-sensitive artifacts with the same governance depth as any other policy: they are versioned and backed up, but there is no dedicated diff/review surface for script content, no approval or guarded workflow for high-risk script changes, no scanning or policy checks for obviously unsafe secret-handling patterns, and no structured visibility into which configurations carry elevated risk because they contain executable or secret-sensitive content.
- **Why it matters**: Script and secret governance is a distinct risk dimension that cuts across policy families. A naming convention violation in a device configuration policy is a hygiene problem; an unreviewed script change in a remediation policy is a potential security incident. Enterprise customers and MSP operators need to trust that high-risk content changes are visible, reviewable, and governable — not just captured as another version snapshot. This capability strengthens audit conversations (proving that script changes are reviewed), operator safety (preventing silent high-risk modifications from going unnoticed), and platform credibility (demonstrating that TenantPilot understands which parts of Intune configuration carry elevated risk). Without it, backup and versioning give a false sense of governance completeness — the most dangerous artifacts receive the same governance treatment as the least dangerous ones.
- **Proposed direction**:
- **Script-aware diff and review surfaces**: dedicated diff views for script-bearing policy artifacts that render script content changes readably — not just JSON diff of the enclosing policy payload, but structured presentation of the script text itself (before/after, syntax-highlighted where practical, change summary). These surfaces make script changes reviewable by operators rather than buried in raw payload diffs.
- **Risk classification for script/secret-bearing artifacts**: extend the inventory or governance metadata so that policy artifacts containing scripts or secret-sensitive fields are identifiable as elevated-risk items. This classification enables filtering, alerting, and governance workflow differentiation — operators can see "which of my policies are script-bearing?" or "which versions changed script content?" without manually inspecting payloads.
- **Guarded change workflows for high-risk content**: optional governance gates for script-bearing or secret-sensitive changes — such as requiring explicit acknowledgment, capability-gated approval, or elevated audit logging when a versioned change involves script content or secret-sensitive fields. These are governance-layer controls, not Intune-side mutation blocks (TenantPilot observes configuration, it does not control the Intune mutation path). The gates apply to how TenantPilot classifies and routes detected changes.
- **Scanning and policy checks for secret-handling patterns**: lightweight rule-based checks that flag obviously unsafe patterns in script or configuration content — hardcoded credentials, plaintext secrets, overly broad credential scopes, known-bad patterns. Not a full SAST engine — focused, high-signal checks that catch the most common and most dangerous mistakes. Results integrate with the findings workflow as governance signals, not a parallel detection system.
- **Rollback and auditability expectations**: script and secret-sensitive changes should have clear rollback visibility (which version introduced the script change, who triggered restore, what was the before/after state). Audit trail expectations should be elevated for this content class — change, review, approval, and rollback events should be distinctly traceable in audit logs.
- **Operator visibility into script/secret risk posture**: tenant-level and portfolio-level views that surface which tenants have unreviewed script changes, which script-bearing policies have never been reviewed, and where secret-handling patterns have been flagged. This is the governance visibility layer, not a generic dashboard initiative.
- **Explicit non-goals**: Not a replacement for external secret vault or key management systems (Azure Key Vault, HashiCorp Vault, etc.) — TenantPilot does not store or manage secrets as a vault; it governs configurations that may contain or reference sensitive material. Not a full code-signing or binary-signing platform — governance focus is on reviewability and risk visibility, not cryptographic attestation. Not a SIEM, DLP, or broad security-monitoring system — this is governance of specific high-risk content classes within the existing policy governance architecture, not a generic security operations capability. Not a catch-all bucket for every security topic — this is bounded to script-bearing and secret-sensitive configuration artifacts. Not a replacement for the baseline/drift engine (which detects *any* configuration change) — this adds risk-aware governance specifically for the content classes where changes carry elevated operational risk. Not a policy deployment or remediation engine — this is detection, review, and governance, not automated correction. Not a full static analysis (SAST) engine for arbitrary scripts.
- **Boundary with Schema-driven Secret Classification**: Schema-driven Secret Classification = improving the *redaction mechanism's* reliability by using schema metadata to classify which fields contain secrets (a backend classification improvement for the existing redaction pipeline). Script & Secrets Governance = lifecycle governance around script-bearing and secret-sensitive *artifacts* — diff, review, scanning, approval workflows, risk visibility. Classification makes redaction more accurate; governance adds reviewability and lifecycle controls. Schema-driven classification may inform governance risk tagging (which fields are secret-sensitive), but the problems and deliverables are distinct.
- **Boundary with Standardization & Policy Quality**: Standardization = evaluating whether policies are well-structured, consistently named, properly assigned, and hygienically maintained. Script & Secrets Governance = evaluating whether high-risk content (scripts, secrets) is reviewed, safe, and governable. A policy can pass all quality checks (good naming, proper assignments, scope tags) but still have an unreviewed script change or a hardcoded credential. These are complementary governance dimensions, not overlapping.
- **Boundary with Security Posture Signals Foundation**: Security Posture Signals = ingesting and historizing external posture data (Defender, backup health) as evidence inputs for reporting. Script & Secrets Governance = internal governance of the product's own high-risk configuration content. Different data sources, different governance problems. Posture signals are external evidence; script governance is internal safety.
- **Boundary with baseline/drift engine (Specs 116-119)**: Drift detection = detecting that *something changed*. Script & Secrets Governance = applying differentiated governance treatment *because of what changed* (script content, secret-sensitive fields). Drift is content-agnostic detection; script governance is risk-aware response. They compose: drift detection finds the change, script governance classifies and routes it based on risk.
- **Dependencies**: Inventory sync stable, policy versioning and snapshot infrastructure, secret redaction (Spec 120) stable, findings workflow (Spec 111) for governance signal integration, audit log foundation (Spec 134), RBAC/capability system (066+), GraphContractRegistry maturity for field-level metadata
- **Priority**: medium (high security-governance value and clear product differentiation, but realistically sequenced after current hardening work and dependent on inventory/versioning/findings maturity)
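
Two of the proposed pieces, the script-aware diff and the lightweight secret-handling checks, can be sketched with standard-library tools. The rule names and the two regular expressions below are illustrative assumptions and are nowhere near a complete rule set.

```python
import difflib
import re

# Illustrative, deliberately small rule set; not a complete secret scanner.
UNSAFE_PATTERNS = {
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
    "plaintext_securestring": re.compile(r"(?i)ConvertTo-SecureString\b.*-AsPlainText"),
}


def script_diff(before, after):
    """Line-based unified diff of the script text itself, not the enclosing payload."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


def scan_script(text):
    """Return (line number, rule name) for every line matching an unsafe pattern."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in UNSAFE_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, rule))
    return hits
```

In this sketch a non-empty `script_diff` would be what marks a version as "changed script content", and each `scan_script` hit would become a governance signal in the findings workflow rather than a separate report.
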
### Cross-Tenant Compare & Promotion
- **Type**: feature
- **Source**: Spec 043 draft, 0800-future-features
- **Problem**: No way to compare policies between tenants or promote configurations from staging to production.
- **Why it matters**: Core MSP/enterprise workflow. Identified as top revenue lever in brainstorming.
- **Proposed direction**: Compare/diff UI, group/scope-tag mapping, promotion plan (preview → dry-run → cutover → verify)
- **Dependencies**: Inventory sync, backup/restore mature
- **Spec 043 relationship**: Spec 043 (`specs/043-cross-tenant-compare-and-promotion/spec.md`) is a lightweight draft (scenarios + FRs, created 2026-01-07, status: Draft) that covers the core compare/promotion contract. This candidate captures the expanded strategic direction and scope refinements accumulated since the draft was written. When this candidate is promoted, it should refresh and supersede the existing Spec 043 draft rather than creating a parallel spec.
- **Priority**: medium (high value, high effort)
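
At its core, the compare step is a keyed diff between two tenants' policy sets. The sketch below assumes policies are matched by a stable key such as a normalized name, since object IDs differ across tenants; that matching strategy and the `compare_tenants` function are assumptions, not part of the Spec 043 draft.

```python
def compare_tenants(source, target):
    """Diff two tenants' policies, each given as {policy key: settings dict}."""
    only_in_source = sorted(source.keys() - target.keys())
    only_in_target = sorted(target.keys() - source.keys())

    changed = {}
    for key in sorted(source.keys() & target.keys()):
        src, tgt = source[key], target[key]
        # Record (source value, target value) for every setting that differs.
        diff = {
            name: (src.get(name), tgt.get(name))
            for name in sorted(src.keys() | tgt.keys())
            if src.get(name) != tgt.get(name)
        }
        if diff:
            changed[key] = diff

    return {
        "only_in_source": only_in_source,
        "only_in_target": only_in_target,
        "changed": changed,
    }
```

A promotion plan could be derived from this result: entries in `only_in_source` become creates and entries in `changed` become updates, each still subject to group and scope-tag mapping before a dry run.
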
### System Console Scope Hardening
- **Type**: hardening
- **Source**: Spec 113/114 follow-up
- **Problem**: The system console (`/system`) needs a clear cross-workspace entitlement model. Current platform capabilities (Spec 114) define per-surface access, but cross-workspace query authorization and scope isolation for platform operators are not yet hardened as a standalone contract.
- **Why it matters**: Platform operators acting across workspaces need tight scope boundaries to prevent accidental cross-workspace data exposure in troubleshooting and monitoring flows.
- **Proposed direction**: Formalize cross-workspace query authorization model, scope isolation rules for platform operator sessions, and regression coverage for wrong-workspace access in system console surfaces.
- **Dependencies**: System console (114) stable, canonical tenant context (Specs 135/136)
- **Priority**: low
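
One way to picture the scope isolation rule is a guard that refuses any query touching a workspace outside the operator session's entitlement. `OperatorSession`, `ScopeViolation`, and `scoped_query` are hypothetical names for this sketch; the real contract would sit on top of the Spec 114 capabilities.

```python
from dataclasses import dataclass


class ScopeViolation(Exception):
    """Raised when a query requests workspaces outside the session's entitlement."""


# Hypothetical session shape for a platform operator.
@dataclass(frozen=True)
class OperatorSession:
    operator: str
    entitled_workspaces: frozenset


def scoped_query(session, rows, requested_workspaces):
    """Return rows for the requested workspaces, failing closed on any unentitled one."""
    requested = set(requested_workspaces)
    denied = requested - session.entitled_workspaces
    if denied:
        raise ScopeViolation("not entitled to workspaces: " + str(sorted(denied)))
    return [row for row in rows if row["workspace_id"] in requested]
```

Failing the whole query, rather than silently dropping the unentitled workspaces, is a deliberate choice in this sketch: it makes wrong-workspace access visible to regression tests.
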
### System Console Multi-Workspace Operator UX
- **Type**: feature
- **Source**: Spec 113 deferred
- **Problem**: The system console (`/system`) currently gives platform operators no way to select or filter across workspaces. Triage and monitoring require workspace-by-workspace navigation.
- **Why it matters**: Platform ops need cross-workspace visibility for troubleshooting and monitoring at scale.
- **Proposed direction**: Workspace selector/filter in system console views, cross-workspace run aggregation, unified triage entry point.
- **Dependencies**: System console (114) stable, System Console Scope Hardening
- **Priority**: low
### Operations Naming Harmonization Across Run Types, Catalog, UI, and Audit
- **Type**: hardening
- **Source**: coding discovery, operations UX consistency review
- **Why it matters**: Strategically important for enterprise UX, auditability, and long-term platform consistency. `OperationRun` is becoming a cross-domain execution and monitoring backbone, and the current naming drift will get more expensive as new run types and provider domains are added. This should reduce future naming drift, but it is not a blocker-critical refactor and should not be pulled in as a side quest during small UI changes.
- **Problem**: Naming around operations appears historically grown and not consistent enough across `OperationRunType` values, visible run labels, `OperationCatalog` mappings, notifications, audit events, filters, badges, and related UI copy. Internal type names and operator-facing language are not cleanly separated, domain/object/verb ordering is uneven, and small UX fixes risk reinforcing an already inconsistent scheme. If left as-is, new run types for baseline, review, alerts, and additional provider domains will extend the inconsistency instead of converging it.
- **Desired outcome**: A later spec should define a clear naming standard for `OperationRunType`, establish an explicit distinction between internal type identifiers and operator-facing labels, and align terminology across runs, notifications, audit text, monitoring views, and operations UI. New run types should have documented naming rules so they can be added without re-opening the vocabulary debate.
- **In scope**: Inventory of current operation-related naming surfaces; naming taxonomy for internal identifiers versus visible operator language; conventions for verb/object/domain ordering; alignment rules for `OperationCatalog`, run labels, notifications, audit events, filters, badges, and monitoring UI; forward-looking rules for adding new run types and provider/domain families; a pragmatic migration plan that minimizes churn and preserves audit clarity.
- **Out of scope**: Opportunistic mass-refactors during unrelated feature work; immediate renaming of all historical values without a compatibility plan; using a small UI wording issue such as "Sync from Intune" versus "Sync policies" as justification for broad churn; a full operations-domain rearchitecture unless later analysis proves it necessary.
- **Trigger / Best time to do this**: Best tackled when multiple new run types are about to land, when `OperationCatalog` / monitoring / operations hub work is already active, when new domains such as Entra or Teams are being integrated, or when a broader UI naming constitution is ready to be enforced technically. This is a good candidate for a planned cleanup window, not an ad hoc refactor.
- **Risks if ignored**: Continued terminology drift across UI and audit layers, higher cognitive load for operators, weaker enterprise polish, more brittle label mapping, and more expensive cleanup once additional domains and execution types are established. Audit/event language may diverge further from monitoring language, making cross-surface reasoning harder.
- **Suggested direction**: Define stable internal run-type identifiers separately from visible operator labels. Standardize a single naming grammar for operation concepts, including when to lead with verb, object, or domain, and when provider-specific wording is allowed. Apply changes incrementally with compatibility-minded mapping rather than a brute-force rename of every historical string. Prefer a staged migration that first defines rules and mapping layers, then updates high-value operator surfaces, and only later addresses legacy internals where justified.
- **Readiness level**: Qualified and strategically important, but intentionally deferred. This should be specified before substantially more run types and provider domains are introduced, yet it should not become an immediate side-track or be bundled into minor UI wording fixes.
- **Candidate quality**:
- Clearly identified cross-cutting problem with architectural and UX impact
- Strong future-facing trigger conditions instead of vague "sometime later"
- Explicit boundaries to prevent opportunistic churn
- Concrete desired outcome without overdesigning the solution
- Easy to promote into a full spec once operations-domain work is prioritized
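
The "mapping layer" idea from the suggested direction can be sketched as two small tables: one from canonical identifiers to operator-facing labels, and one from legacy identifiers to canonical ones. The identifiers and labels below are hypothetical examples, not the current `OperationCatalog` contents.

```python
# Hypothetical canonical identifiers and labels, purely for illustration.
LABELS = {
    "inventory.sync": "Sync inventory",
    "backup.create": "Create backup",
}

# Historical values stay resolvable without renaming stored data.
LEGACY_ALIASES = {
    "inventory_sync": "inventory.sync",
}


def canonical_type(raw):
    """Map a stored run-type string to its canonical identifier."""
    return LEGACY_ALIASES.get(raw, raw)


def label(raw):
    """Resolve the operator-facing label, failing loudly for unknown types."""
    key = canonical_type(raw)
    if key not in LABELS:
        raise KeyError("no operator label registered for run type " + repr(key))
    return LABELS[key]
```

Raising on an unknown type, instead of falling back to the raw string, is one possible way to keep internal identifiers from leaking into operator-visible text. A softer fallback may be preferable in production surfaces.
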
### Operator Presentation & Lifecycle Action Hardening
- **Type**: hardening
- **Source**: Evidence Snapshot / Ops-UX review 2026-03-19
- **Problem**: TenantPilot has strong shared presentation abstractions — `OperationCatalog` for operation labels, `BadgeRenderer` / `BadgeCatalog` for status/outcome badges, and some lifecycle-aware action gating patterns in selected resources — but these conventions are not consistently enforced across all operator-facing surfaces. Individual surfaces can bypass the shared sources of truth without triggering any architectural or CI feedback. This produces a repeatable class of operator-UX degradation:
- **Operation label bypass**: surfaces that render operation names directly from internal type keys instead of going through the shared operation catalog, leaking technical identifiers like `inventory_sync` or `compliance.snapshot` into operator-visible UI.
- **Status/outcome presentation bypass**: surfaces that render raw enum values (e.g. `queued`, `running`, `pending`, `succeeded`) directly from model attributes instead of using `BadgeRenderer`, producing unstyled debug-quality output where operators expect consistent badge rendering.
- **Missing lifecycle-aware action gating**: mutation and destructive actions (e.g. "Expire snapshot", "Refresh snapshot") that remain visible and invocable on records in terminal lifecycle states (Expired, Failed), because no shared convention requires actions to derive visibility from valid lifecycle transitions. Backend idempotency guards prevent data corruption but do not prevent operator confusion.
- **Unscoped global widget polling**: global widgets (e.g. `BulkOperationProgress`) that poll on every page including non-operational pages where no active runs are expected, creating unnecessary network noise and giving operators the impression that background activity is occurring when none is relevant.
- **Why it matters**: In enterprise SaaS, operator trust depends on consistent, predictable UI behavior across every surface. A single widget rendering raw `queued` instead of a styled badge, or a single page showing an "Expire" action on an already-expired record, undermines confidence in the product's governance capabilities. These are not cosmetic issues — they are operator-trust issues that compound as the product adds more lifecycle-driven surfaces (Findings, Review Packs, Baselines, Exceptions, Alerts, Drift governance). Without shared enforceable conventions, every new surface risks re-introducing the same failure modes.
- **Proposed direction**:
- **Operation label convention**: codify the rule that all operator-visible operation names must resolve through `OperationCatalog::label()` (or the equivalent shared source of truth). Add a lightweight enforcement mechanism (CI check, architectural test, or documented anti-pattern) that catches direct usage of raw operation type strings in Blade templates and widget renders.
- **Status/outcome badge convention**: codify the rule that all operator-visible status and outcome rendering must go through `BadgeRenderer` (or equivalent shared badge helpers). Enumerate the known surfaces that currently comply and identify any that bypass the convention. Add a regression mechanism to prevent new surfaces from bypassing.
- **Lifecycle-aware action visibility convention**: define a shared contract or trait that mutation/destructive actions must consult to determine visibility based on the record's current lifecycle state. Terminal-state records must not expose invalid lifecycle transitions as available actions. Suggest introducing `isTerminal(): bool` (or equivalent) on lifecycle enums (`EvidenceSnapshotStatus`, `ReviewPackStatus`, `OperationRunStatus`, etc.) so action visibility can be derived from lifecycle semantics rather than ad hoc per-resource `->hidden()` conditions.
- **Polling ownership convention**: codify the rule that global widgets must declare their polling scope — which pages or contexts justify active polling vs. idle/suppressed behavior. Ensure idle discovery polling intervals are intentional and documented, and that non-operational pages are not subjected to unnecessary polling overhead.
- **Scope boundaries**:
- **In scope**: shared convention definitions, enforcement mechanisms, anti-pattern catalog, lifecycle enum enrichment (`isTerminal()` or equivalent), regression coverage for badge/label/action consistency
- **Out of scope**: local Evidence Snapshot fixes (those belong in the active Evidence-related spec), operations naming vocabulary redesign (tracked separately as "Operations Naming Harmonization"), visual language canon / design-system codification (tracked separately as "Admin Visual Language Canon"), BulkOperationProgress architectural redesign, new badge domains or new operation types
- **Examples of failure modes this should prevent**:
- A widget rendering `{{ $run->status }}` directly instead of using `BadgeRenderer::render(BadgeDomain::OperationRunStatus, $run->status)`
- A card showing raw `outcome: pending` text instead of a styled outcome badge
- An "Expire snapshot" action visible on a record with status `Expired`
- A "Refresh snapshot" action visible on a record with status `Failed`
- A global progress widget polling every 30 seconds on the Evidence detail page where no active operation runs are relevant
- A new governance surface (e.g. Baseline review, Alert detail) shipping without badge rendering because no convention required it
- **Why this is a follow-up candidate, not part of current local fixes**: The active Evidence-related spec should fix the specific Evidence Snapshot bugs (raw status in `RecentOperationsSummary`, missing `->hidden()` on expire/refresh actions). This candidate addresses the **shared convention layer** that prevents the same class of bugs from recurring on every future lifecycle-driven surface. The local fixes prove the bugs exist; this candidate prevents their recurrence.
- **Dependencies**: BadgeRenderer / BadgeCatalog system (already stable), OperationCatalog (already stable), lifecycle enums (already defined, need `isTerminal()` enrichment), RBAC/capability system (066+) for action gating patterns
- **Related candidates**: Operations Naming Harmonization (naming vocabulary — complementary but distinct), Admin Visual Language Canon (visual conventions — broader scope), Action Surface Contract v1.1 (interaction-level action rules — complementary)
- **Priority**: medium
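
The lifecycle-aware visibility convention proposed above can be sketched as follows. This is a minimal illustration, assuming a simplified `EvidenceSnapshotStatus` with hypothetical cases and a hypothetical `lifecycleActionVisible()` helper; the real enums and the eventual shared contract or trait may look different.

```php
<?php

declare(strict_types=1);

// Hypothetical cases: the real EvidenceSnapshotStatus enum may define different states.
enum EvidenceSnapshotStatus: string
{
    case Queued = 'queued';
    case Generating = 'generating';
    case Active = 'active';
    case Expired = 'expired';
    case Failed = 'failed';

    /** Terminal states accept no further lifecycle transitions. */
    public function isTerminal(): bool
    {
        return match ($this) {
            self::Expired, self::Failed => true,
            self::Queued, self::Generating, self::Active => false,
        };
    }
}

/**
 * Shared visibility rule (hypothetical helper): a mutation action is offered
 * only for non-terminal records. A resource could call this from the closure
 * it passes to ->hidden() or ->visible().
 */
function lifecycleActionVisible(EvidenceSnapshotStatus $status): bool
{
    return ! $status->isTerminal();
}

foreach (EvidenceSnapshotStatus::cases() as $status) {
    printf(
        "%-10s expire/refresh visible: %s\n",
        $status->value,
        lifecycleActionVisible($status) ? 'yes' : 'no'
    );
}
```

Using an exhaustive `match` without a `default` arm means a newly added case raises `UnhandledMatchError` until someone decides whether it is terminal, which is one way to keep the rule from silently drifting.
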
### Provider Connection Resolution Normalization
- **Type**: hardening
- **Source**: architecture audit, provider connection resolution analysis
- **Problem**: The codebase has a dual-resolution model for provider connections. Gen 2 jobs (`ProviderInventorySyncJob`, `ProviderConnectionHealthCheckJob`, `ProviderComplianceSnapshotJob`) receive an explicit `providerConnectionId` and pass it through the `ProviderOperationStartGate`. Gen 1 jobs (`ExecuteRestoreRunJob`, `EntraGroupSyncJob`, `SyncRoleDefinitionsJob`, policy sync jobs, etc.) do not: the services they call resolve the default connection at runtime via `MicrosoftGraphOptionsResolver::resolveForTenant()` or internal `resolveProviderConnection()` methods. This creates non-deterministic execution: a job dispatched against one connection may silently execute against a different one if the default changes between dispatch and execution. Roughly 20 services use the Gen 1 implicit resolution pattern.
- **Why it matters**: Non-deterministic credential binding is a correctness and audit gap. Enterprise customers need to know exactly which connection identity was used for every Graph API call. The implicit pattern also prevents connection-scoped rate limiting, error attribution, and consent-scope validation. This is the foundational refactor that unblocks all other provider connection improvements.
- **Proposed direction**:
- Refactor all Gen 1 services to accept an explicit `ProviderConnection` (or `providerConnectionId`) parameter instead of resolving default internally
- Update all Gen 1 jobs to accept `providerConnectionId` at dispatch time (resolved at the UI/controller layer via `ProviderOperationStartGate` or equivalent)
- Deprecate `MicrosoftGraphOptionsResolver` — callers should use `ProviderGateway::graphOptions($connection)` directly
- Ensure `provider_connection_id` is recorded in every `OperationRun` context and audit event
- Standardize error handling: all resolution failures produce `ProviderConnectionResolution::blocked()` with structured `ProviderReasonCodes`, not mixed exceptions (`ProviderConfigurationRequiredException`, `RuntimeException`, `InvalidArgumentException`)
- **Known affected services** (Gen 1 / implicit resolution): `RestoreService` (line 2913 internal `resolveProviderConnection()`), `PolicySyncService` (lines 58, 450), `PolicySnapshotService` (line 752), `RbacHealthService` (line 192), `InventorySyncService` (line 730 internal `resolveProviderConnection()`), `EntraGroupSyncService`, `RoleDefinitionsSyncService`, `EntraAdminRolesReportService`, `AssignmentBackupService`, `AssignmentRestoreService`, `ScopeTagResolver`, `TenantPermissionService`, `VersionService`, `ConfigurationPolicyTemplateResolver`, `FoundationSnapshotService`, `FoundationMappingService`, `RestoreRiskChecker`, `PolicyCaptureOrchestrator`, `AssignmentFilterResolver`, `RbacOnboardingService`, `TenantConfigService`
- **Known affected jobs** (Gen 1 / no explicit connectionId): `ExecuteRestoreRunJob`, `EntraGroupSyncJob`, `SyncRoleDefinitionsJob`, `SyncEntraAdminRolesJob`, plus any job that calls a Gen 1 service
- **Gen 2 reference implementations** (correct pattern): `ProviderInventorySyncJob`, `ProviderConnectionHealthCheckJob`, `ProviderComplianceSnapshotJob` — all receive `providerConnectionId`, pass through `ProviderOperationStartGate`, lock row, create `OperationRun` with connection in context
- **Key architecture components**:
- `ProviderConnectionResolver` — correct, keep as-is. `resolveDefault()` returns `ProviderConnectionResolution` value object
- `ProviderOperationStartGate` — canonical dispatch-time gate, correct Gen 2 pattern. Handles 3 operation types: `provider.connection.check`, `inventory_sync`, `compliance.snapshot`
- `MicrosoftGraphOptionsResolver` — legacy bridge (32 lines), target for deprecation. Calls `resolveDefault()` internally, hides connection identity
- `ProviderGateway` — lower-level primitive, builds graph options from explicit connection. Correct, keep as-is
- `ProviderIdentityResolver` — resolves identity (platform vs dedicated) from connection. Correct, keep as-is
- Partial unique index on `provider_connections`: `(tenant_id, provider) WHERE is_default = true`
- **Out of scope**: UX label changes, UI banners, legacy credential field removal (those are separate candidates below)
- **Dependencies**: None — this is the foundational refactor
- **Related specs**: Spec 081 (Tenant credential migration CI guardrails), Spec 088 (provider connection model), Spec 089 (provider gateway), Spec 137 (data-layer provider prep)
- **Priority**: high
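
The dispatch-versus-execution gap described in the problem statement can be illustrated with a small self-contained sketch. `ConnectionRegistry`, `ImplicitSyncJob`, and `ExplicitSyncJob` are hypothetical stand-ins (not the real jobs, `ProviderConnectionResolver`, or `ProviderOperationStartGate`); the point is only to show why binding `providerConnectionId` at dispatch time makes execution deterministic.

```php
<?php

declare(strict_types=1);

// All names below are hypothetical stand-ins, not the real application classes.
final class ConnectionRegistry
{
    /** @var array<int, string> connection id => label */
    public array $connections = [1 => 'conn-A', 2 => 'conn-B'];

    public int $defaultId = 1;

    public function resolveDefault(): int
    {
        return $this->defaultId;
    }
}

// Gen 1 style: the job carries no connection; the default is resolved at run time.
final class ImplicitSyncJob
{
    public function handle(ConnectionRegistry $registry): string
    {
        return $registry->connections[$registry->resolveDefault()];
    }
}

// Gen 2 style: the connection id is resolved once at dispatch and travels with the job.
final class ExplicitSyncJob
{
    public function __construct(public readonly int $providerConnectionId) {}

    public function handle(ConnectionRegistry $registry): string
    {
        return $registry->connections[$this->providerConnectionId];
    }
}

$registry = new ConnectionRegistry();

// Dispatch time: the default is connection 1.
$implicit = new ImplicitSyncJob();
$explicit = new ExplicitSyncJob($registry->resolveDefault());

// The default changes while both jobs wait in the queue.
$registry->defaultId = 2;

echo 'implicit ran against: ', $implicit->handle($registry), PHP_EOL; // conn-B
echo 'explicit ran against: ', $explicit->handle($registry), PHP_EOL; // conn-A
```

The explicit job also has a concrete id it can record in the `OperationRun` context and audit event, which the implicit job cannot provide without re-resolving.
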
### Provider Connection UX Clarity
- **Type**: polish
- **Source**: architecture audit, provider connection resolution analysis
- **Problem**: The operator-facing language and information architecture around provider connections creates confusion about why a "default" connection is required, what happens when it's missing, and when actions are tenant-wide vs connection-scoped. Specific issues: (1) "Set as Default" is misleading — it implies preference, but the connection is actually the canonical operational identity; (2) missing-default errors surface as blocked `OperationRun` records or exceptions, but there is no proactive banner/hint on the tenant or connection pages; (3) action labels don't distinguish tenant-wide operations (verify, sync) from connection-scoped operations (health check, test); (4) the singleton auto-promotion (first connection becomes default automatically) is invisible — operators don't understand why their first connection was special.
- **Why it matters**: Reduces support friction and operator confusion. Enterprise operators managing multiple tenants need clear, predictable language about connection lifecycle. The current UX makes the correct architecture feel like a bug ("why do I need a default?").
- **Proposed direction**:
- Rename "Set as Default" → "Promote to Primary" (or "Set as Primary Connection") across all surfaces
- Add a missing-primary-connection banner on tenant detail / connection list when no default exists — with a direct "Promote" action
- Distinguish action labels: tenant-wide actions ("Sync Tenant", "Verify Tenant") vs connection-scoped actions ("Check Connection Health", "Test Connection")
- Improve blocked-notification copy: instead of generic "provider connection required", show "No primary connection configured for [Provider]. Promote a connection to continue."
- Show a transient success notification when auto-promotion happens on first connection creation ("This connection was automatically set as primary because it's the first for this provider")
- Consider an info tooltip or help text explaining the primary connection concept on the connection resource pages
- **Key surfaces to update**: `ProviderConnectionResource` (row actions, header actions, table empty state), `TenantResource` (verify action, connection tab), onboarding wizard consent step, `ProviderNextStepsRegistry` remediation links, notification templates for blocked operations
- **Auto-default creation locations** (4 places, need UX feedback): `CreateProviderConnection` action, `TenantOnboardingController`, `AdminConsentCallbackController`, `ManagedTenantOnboardingWizard`
- **Out of scope**: Backend resolution refactoring (that's the normalization candidate above), legacy field removal
- **Dependencies**: Soft dependency on "Provider Connection Resolution Normalization" — UX improvements are more coherent when the backend consistently uses explicit connections, but many label/banner changes can proceed independently
- **Related specs**: Spec 061 (provider connection UX), Spec 088 (provider connection model)
- **Priority**: medium
### Provider Connection Legacy Cleanup
- **Type**: hardening
- **Source**: architecture audit, provider connection resolution analysis
- **Problem**: After normalization is complete, several legacy artifacts remain: (1) `MicrosoftGraphOptionsResolver` — a 32-line convenience bridge that exists only because ~20 services haven't been updated to use explicit connections; (2) service-internal `resolveProviderConnection()` methods in `RestoreService` (line 2913), `InventorySyncService` (line 730), and similar — these are local resolution logic that should not exist once services receive explicit connections; (3) `Tenant` model legacy credential accessors (`app_client_id`, `app_client_secret` fields) — `graphOptions()` already throws `BadMethodCallException`, but the fields and accessors remain; (4) `migration_review_required` flag on `ProviderConnection` — used during the credential migration from tenant-level to connection-level, should be retired once all tenants are migrated.
- **Why it matters**: Dead code increases cognitive load and creates false affordances. New developers may use `MicrosoftGraphOptionsResolver` or internal resolution methods thinking they're the correct pattern. Legacy credential fields on `Tenant` suggest credentials still live there. Cleaning up after normalization makes the correct architecture self-documenting.
- **Proposed direction**:
- Remove `MicrosoftGraphOptionsResolver` class entirely (after normalization ensures zero callers)
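- Remove all service-internal `resolveProviderConnection()` / `resolveDefault()` methods; the shared `ProviderConnectionResolver::resolveDefault()` is not affected and stays as described in the normalization candidate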
- Remove legacy credential fields from `Tenant` model (migration to drop columns, update factory, update tests)
- Evaluate `migration_review_required` — if all tenants have migrated, remove the flag and related UI (banner, filter)
- Update CI guardrails: `NoLegacyTenantGraphOptionsTest` and `NoTenantCredentialRuntimeReadsSpec081Test` can be simplified or removed once the code they guard against is gone
- Verify no seeders, factories, or test helpers reference legacy patterns
- **Out of scope**: Any new features — this is pure cleanup
- **Dependencies**: Hard dependency on "Provider Connection Resolution Normalization" — cleanup cannot proceed until all callers are migrated
- **Related specs**: Spec 081 (credential migration guardrails), Spec 088 (provider connection model), Spec 137 (data-layer provider prep)
- **Priority**: medium (deferred until normalization is complete)
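
Before deleting `MicrosoftGraphOptionsResolver`, the "zero callers" precondition could be checked mechanically. The following is a minimal sketch of such a scan, assuming a hypothetical `findLegacyReferences()` helper and in-memory file contents; it is not the existing `NoLegacyTenantGraphOptionsTest`, and a real guard would read files from disk and probably live in a Pest architecture test.

```php
<?php

declare(strict_types=1);

/**
 * Returns one "path:line uses symbol" entry per forbidden symbol found.
 *
 * @param array<string, string> $sources   file path => file contents
 * @param list<string>          $forbidden symbols that must no longer be referenced
 * @return list<string>
 */
function findLegacyReferences(array $sources, array $forbidden): array
{
    $hits = [];

    foreach ($sources as $path => $contents) {
        foreach (explode("\n", $contents) as $index => $line) {
            foreach ($forbidden as $symbol) {
                if (str_contains($line, $symbol)) {
                    $hits[] = sprintf('%s:%d uses %s', $path, $index + 1, $symbol);
                }
            }
        }
    }

    return $hits;
}

// In-memory stand-ins for real files; the paths are illustrative only.
$sources = [
    'app/Services/ExampleSyncService.php' => "<?php\n\$options = MicrosoftGraphOptionsResolver::resolveForTenant(\$tenant);\n",
    'app/Jobs/ExampleExplicitJob.php' => "<?php\n\$options = \$gateway->graphOptions(\$connection);\n",
];

$hits = findLegacyReferences(
    $sources,
    ['MicrosoftGraphOptionsResolver', 'resolveProviderConnection(']
);

print_r($hits);
```

A plain substring scan is deliberately crude: it can flag comments and docblocks too, which is usually acceptable for a removal gate because those references should disappear along with the class.
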
### Tenant App Status False-Truth Removal
- **Type**: hardening
- **Source**: legacy / orphaned truth audit 2026-03-16
- **Classification**: quick removal
- **Problem**: `Tenant.app_status` is displayed in tenant UI as current operational truth even though production code no longer writes it. Operators can see a frozen "OK" or other stale badge that does not reflect the real provider connection state.
- **Why it matters**: This is misleading operator-facing truth, not just dead schema. It creates false confidence on a tier-1 admin surface.
- **Target model**: `Tenant`
- **Canonical source of truth**: `ProviderConnection.consent_status` and `ProviderConnection.verification_status`
- **Must stop being read**: `Tenant.app_status` in `TenantResource` table columns, infolist/details, filters, and badge-domain mapping.
- **Can be removed immediately**:
- TenantResource reads of `app_status`
- tenant app-status badge domain / badge mapping usage
- factory defaults that seed `app_status`
- **Remove only after cutover**:
- the `tenants.app_status` column itself, once all UI/report/export reads are confirmed gone
- **Migration / backfill**: No backfill. One cleanup migration to drop `app_status`. `app_notes` may be dropped in the same migration only if it does not broaden the spec beyond tenant stale app fields.
- **UI / resource / policy / test impact**:
- UI/resources: remove misleading badge and filter from tenant surfaces
- Policy: none
- Tests: update `TenantFactory`, remove assertions that treat `app_status` as live truth
- **Scope boundaries**:
- In scope: remove stale tenant app-status reads and schema field
- Out of scope: provider connection UX redesign, credential migration, broader tenant health redesign
- **Dependencies**: None required if the immediate operator-facing action is removal rather than replacement with a new tenant-level derived badge.
- **Risks**: Low rollout risk. Main risk is short-term operator confusion about where to view connection health after removal.
- **Why it should be its own spec**: This is the cleanest high-severity operator-trust fix in the repo. It is bounded, low-coupling, and should not wait for the larger provider cutover work.
- **Priority**: high
### Provider Connection Status Vocabulary Cutover
- **Type**: hardening
- **Source**: legacy / orphaned truth audit 2026-03-16
- **Classification**: bounded cutover
- **Problem**: `ProviderConnection` currently exposes overlapping status vocabularies across `status`, `health_status`, `consent_status`, and `verification_status`. Resources, badges, and filters can read both projected legacy state and canonical enum state, creating drift and operator ambiguity.
- **Why it matters**: This is duplicate status truth on an operator-facing surface. It also leaves the system vulnerable to projector drift if legacy projected fields stop matching the enum source of truth.
- **Target model**: `ProviderConnection`
- **Canonical source of truth**: `ProviderConnection.consent_status` and `ProviderConnection.verification_status`
- **Must stop being read**: `ProviderConnection.status` and `ProviderConnection.health_status` in resources, filters, badges, and any operator-facing status summaries.
- **Can be removed immediately**:
- new operator-facing reads of legacy varchar status fields
- new badge/filter logic that depends on normalized legacy values
- **Remove only after cutover**:
- `status` and `health_status` columns
- projector persistence of those fields, if still retained for compatibility
- legacy badge normalization paths
- **Migration / backfill**: No data backfill if enum columns are already complete. Requires a later schema cleanup migration to drop legacy varchar columns after all reads are migrated.
- **UI / resource / policy / test impact**:
- UI/resources: `ProviderConnectionResource` and related badges/filters move to one coherent operator vocabulary
- Policy: none directly
- Tests: add exhaustive projection and badge mapping coverage during the transition; update resource/filter assertions to enum-driven behavior
- **Scope boundaries**:
- In scope: provider connection status fields, display semantics, badge/filter vocabulary, deprecation path for projected columns
- Out of scope: tenant credential migration, provider onboarding flow redesign, unrelated badge cleanup elsewhere
- **Dependencies**: Confirm all hidden read paths outside the main resource and define the operator-facing enum presentation.
- **Risks**: Medium rollout risk. Filters, badges, and operator language change together, and hidden reads may exist outside the primary resource.
- **Why it should be its own spec**: This is a self-contained source-of-truth cutover on one model. It is too important and too operationally visible to bury inside a generic provider cleanup spec.
- **Priority**: high
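
One way to picture the target state is a single operator-facing label derived only from the two canonical enums, with exhaustive mapping coverage. The sketch below assumes hypothetical enum cases and label strings; the real `consent_status` and `verification_status` values and the final operator vocabulary are exactly what the spec would need to define.

```php
<?php

declare(strict_types=1);

// Hypothetical cases; the real consent/verification enums may differ.
enum ConsentStatus: string
{
    case Required = 'required';
    case Granted = 'granted';
    case Revoked = 'revoked';
}

enum VerificationStatus: string
{
    case Unknown = 'unknown';
    case Healthy = 'healthy';
    case Failed = 'failed';
}

/** Single operator-facing label derived only from the two canonical enums. */
function operatorStatusLabel(ConsentStatus $consent, VerificationStatus $verification): string
{
    // Consent problems take precedence; verification only matters once consent is granted.
    return match (true) {
        $consent === ConsentStatus::Required => 'Consent required',
        $consent === ConsentStatus::Revoked => 'Consent revoked',
        $verification === VerificationStatus::Healthy => 'Connected',
        $verification === VerificationStatus::Failed => 'Verification failed',
        $verification === VerificationStatus::Unknown => 'Not verified',
    };
}

// Exhaustive coverage: every enum combination must map to a label.
foreach (ConsentStatus::cases() as $consent) {
    foreach (VerificationStatus::cases() as $verification) {
        printf(
            "%-9s + %-8s => %s\n",
            $consent->value,
            $verification->value,
            operatorStatusLabel($consent, $verification)
        );
    }
}
```

Because nothing in this mapping reads `status` or `health_status`, the legacy columns can later be dropped without touching the presentation logic.
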
### Tenant Legacy Credential Source Decommission
- **Type**: hardening
- **Source**: legacy / orphaned truth audit 2026-03-16
- **Classification**: staged migration
- **Problem**: Tenant-level credential fields remain in the data model after `ProviderCredential` became the canonical identity store. They are still used for migration classification and are kept artificially alive by factory defaults, which obscures the real architecture and prolongs the cutover.
- **Why it matters**: This is an incomplete architectural cutover around sensitive identity data. The system needs an explicit end-state where runtime credential resolution no longer depends on tenant legacy fields.
- **Target model**: `Tenant`, with `ProviderCredential` as the destination canonical model
- **Canonical source of truth**: `ProviderCredential.client_id` and `ProviderCredential.client_secret`
- **Must stop being read**: tenant legacy credential fields in normal runtime credential resolution. Transitional reads remain allowed only inside migration-classification paths until exit criteria are met.
- **Can be removed immediately**:
- factory defaults that populate legacy tenant credentials by default
- any non-classification runtime reads if discovered during spec work
- UI affordances that imply tenant-stored credentials are active
- **Remove only after cutover**:
- `Tenant.app_client_id`, `Tenant.app_client_secret`, `Tenant.app_certificate_thumbprint`
- migration-classification reads and related transitional guardrails
- **Migration / backfill**: Requires explicit completion criteria for the tenant-to-provider credential migration. No blind backfill; removal should follow confirmed migration review state for all affected tenants.
- **UI / resource / policy / test impact**:
- UI/resources: remove any residual legacy credential messaging once the cutover is complete
- Policy: none directly
- Tests: `TenantFactory` must stop creating legacy credentials by default; transition-only tests should use explicit legacy states
- **Scope boundaries**:
- In scope: tenant legacy credential fields, classification-only transition reads, factory/test cleanup tied to the cutover
- Out of scope: provider connection status vocabulary, unrelated tenant stale fields, onboarding UX redesign
- **Dependencies**: Hard dependency on the provider credential migration/review lifecycle being complete enough to identify all remaining transitional tenants safely.
- **Risks**: Higher rollout risk than simple cleanup because this touches credential-path architecture and transitional data needed for migration review.
- **Why it should be its own spec**: This has distinct exit criteria, migration gating, and rollback concerns. It is not the same problem as stale operator-facing badges or provider status vocabulary cleanup.
- **Priority**: high
### Entra Group Authorization Capability Alignment
- **Type**: hardening
- **Source**: legacy / orphaned truth audit 2026-03-16
- **Classification**: bounded cutover
- **Problem**: `EntraGroupPolicy` currently grants read access based on tenant access alone and bypasses the capability layer used by the rest of the repo's authorization model.
- **Why it matters**: This is a security- and RBAC-relevant inconsistency. Even if currently read-only, it weakens the capability-first architecture and increases the chance of future authorization drift.
- **Target model**: `EntraGroupPolicy` and the Entra group read-access surface
- **Canonical source of truth**: capability-based authorization decisions layered on top of tenant-access checks
- **Must stop being read**: implicit "tenant access alone is sufficient" as the effective rule for Entra group read access.
- **Can be removed immediately**:
- the direct bypass if the correct capability already exists and seeded roles already carry it
- **Remove only after cutover**:
- any compatibility allowances needed while role-capability mappings are updated and verified
- **Migration / backfill**: Usually no schema migration. May require role-capability seeding updates or RBAC backfill so intended operators retain access.
- **UI / resource / policy / test impact**:
- UI/resources: some users may lose access if role mapping is incomplete; tenant-facing Entra group screens need regression verification
- Policy: this spec is the policy change
- Tests: add authorization matrix coverage proving tenant access alone no longer grants read access
- **Scope boundaries**:
- In scope: read authorization semantics for Entra group surfaces and the required capability mapping
- Out of scope: new CRUD semantics, role mapping product UI, unrelated policy tidy-up
- **Dependencies**: Choose the correct capability and verify seeded/default roles include it where intended.
- **Risks**: Medium rollout risk because authorization mistakes become access regressions for legitimate operators.
- **Why it should be its own spec**: This is a targeted RBAC hardening change with its own stakeholders, rollout checks, and regression matrix. It should not be hidden inside data or UI cleanup work.
- **Priority**: high
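
The intended rule, capability checks layered on top of tenant access, can be sketched together with the authorization matrix the tests should cover. `Operator`, `canViewEntraGroups()`, and the capability name `entra_groups.view` are assumptions made for illustration; choosing the actual capability is listed as a dependency above.

```php
<?php

declare(strict_types=1);

// Hypothetical capability name; the real one still has to be chosen.
const ENTRA_GROUPS_VIEW = 'entra_groups.view';

final class Operator
{
    /**
     * @param list<int>    $tenantIds    tenants the operator is a member of
     * @param list<string> $capabilities capabilities granted through the operator's role
     */
    public function __construct(
        public readonly array $tenantIds,
        public readonly array $capabilities,
    ) {}
}

/** Read access requires tenant access AND the capability; neither alone is sufficient. */
function canViewEntraGroups(Operator $operator, int $tenantId): bool
{
    $hasTenantAccess = in_array($tenantId, $operator->tenantIds, true);
    $hasCapability = in_array(ENTRA_GROUPS_VIEW, $operator->capabilities, true);

    return $hasTenantAccess && $hasCapability;
}

// Authorization matrix for tenant 7.
$matrix = [
    'member with capability' => new Operator([7], [ENTRA_GROUPS_VIEW]),
    'member without capability' => new Operator([7], []),
    'non-member with capability' => new Operator([9], [ENTRA_GROUPS_VIEW]),
];

foreach ($matrix as $label => $operator) {
    printf("%-28s => %s\n", $label, canViewEntraGroups($operator, 7) ? 'allow' : 'deny');
}
```

The middle row is the regression this candidate targets: today that operator is allowed, and after the cutover they must be denied unless their role carries the capability.
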
### Support Intake with Context (MVP)
- **Type**: feature
- **Source**: Product design, operator feedback
- **Problem**: Users have no structured way to report problems directly from within the product. For technical errors, run, tenant, and provider details are missing; for access and UX problems, route and RBAC context is missing. The result is inefficient support loops and follow-up questions. A full-fledged ticket system would be the wrong priority.
- **Why it matters**: Reduces support friction, improves intake quality, and raises perceived product maturity. Creates a typed intake layer for later webhook, PSA, and ticketing extensions without introducing a helpdesk now.
- **Proposed direction**: New `SupportRequest` model (not a ticket/case) with `source_type` (operation_run, provider_connection, access_denied, generic) and `issue_kind` (technical_problem, access_problem, ux_feedback, other). Three entry paths: (1) context-bound from a failed OperationRun, (2) access-denied/403 context, (3) generic feedback entry point (user menu). Automatic context snapshot via `SupportRequestContextBuilder` per source_type. Persistence before delivery. Email delivery to a configured support address. Fingerprint-based spam guard. Audit events. RBAC via the `support.request.create` capability. Scope isolation. Secret redaction in context_jsonb.
- **Dependencies**: OperationRun domain stable, RBAC/capability system (066+), workspace/tenant scoping
- **Priority**: medium
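
Two of the mechanisms named in the proposed direction, secret redaction in the context snapshot and the fingerprint-based spam guard, can be sketched in a few lines. The key pattern, the `[REDACTED]` marker, and the fields that feed the fingerprint are assumptions for illustration, not a decided design.

```php
<?php

declare(strict_types=1);

/**
 * Recursively replaces values whose key looks secret-bearing.
 * The key pattern is an assumption; a real list would come from the spec.
 *
 * @param array<string|int, mixed> $context
 * @return array<string|int, mixed>
 */
function redactSecrets(array $context): array
{
    foreach ($context as $key => $value) {
        if (is_string($key) && preg_match('/secret|token|password|authorization/i', $key) === 1) {
            $context[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            $context[$key] = redactSecrets($value);
        }
    }

    return $context;
}

/** Stable fingerprint so repeated identical submissions can be throttled. */
function supportRequestFingerprint(int $userId, string $sourceType, string $issueKind, ?string $sourceId): string
{
    return hash('sha256', implode('|', [$userId, $sourceType, $issueKind, $sourceId ?? '-']));
}

$context = redactSecrets([
    'source_type' => 'operation_run',
    'run' => ['id' => 'run-123', 'status' => 'failed'],
    'provider' => ['client_id' => 'abc', 'client_secret' => 'do-not-store'],
]);

echo json_encode($context, JSON_PRETTY_PRINT), PHP_EOL;
echo supportRequestFingerprint(42, 'operation_run', 'technical_problem', 'run-123'), PHP_EOL;
```

Redacting before persistence (rather than before delivery) matters here because the candidate persists the request first; anything stored unredacted would already sit in `context_jsonb`.
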
### Policy Setting Explorer — Reverse Lookup for Tenant Configuration
- **Type**: feature
- **Source**: recurring enterprise pain point, governance/troubleshooting gap
- **Problem**: In medium-to-large Intune tenants with dozens of policy types and hundreds of policies, admins routinely face the question: "Where is this setting actually defined?" Examples: "Which policy configures BitLocker?", "Where is `EnableTPM` set to `true`?", "Why does this tenant enforce a specific firewall rule, and which policy is the source?" Today, answering this requires manually opening policies one by one across device configuration, compliance, endpoint security, admin templates, settings catalog, and more. TenantPilot inventories and versions these policies but provides no reverse-lookup surface that maps a setting name, key, or value back to the policies that explicitly define it.
- **Why it matters**: This is a governance, troubleshooting, and explainability gap — not a search convenience. Enterprise admins, auditors, and reviewers need authoritative answers to "where is X defined?" for incident triage, change review, compliance evidence, and duplicate detection. Without it, TenantPilot has deep policy data but cannot surface it from the operator's natural entry point (the setting, not the policy). This capability directly increases the product's value proposition for security reviews, audit preparation, and day-to-day configuration governance.
- **V1 scope**:
- Tenant-scoped only. User queries settings within the active tenant's indexed policies. No cross-tenant or portfolio-wide search in V1.
- Dedicated working surface: a tenant-level "Policy Explorer" or "Setting Search" page with query input, filters, and structured result inspection. Not a global header search widget.
- Query modes: search by setting name/label, by raw key/path, or by value-oriented query (e.g. `EnableTPM = true`).
- Results display: policy name, policy type/family, setting label/path/key, configured value, version/snapshot context, deep link to the policy detail or version inspector.
- Supported policy families: start with a curated subset of high-value indexed families (settings catalog, device configuration, compliance, endpoint security baselines, admin templates). Not every Microsoft policy type from day one.
- Search projection model: a lightweight extracted-setting-facts table per supported policy family. Preserves policy-family-local structure, retains raw path/key, stores search-friendly displayable rows. PostgreSQL-first (GIN indexes on JSONB or dedicated columns as appropriate). Not a universal canonical key normalization engine — a pragmatic, product-oriented search projection.
- Trust boundary: results reflect settings explicitly present in supported indexed policies. UI must clearly communicate this scope. No-result does NOT imply the setting is absent from effective tenant configuration — only that it was not found in currently supported indexed policy families. This distinction must be visible in the UX (scope indicator, help text, empty-state copy).
- "Defined in" only: V1 answers "where is this setting explicitly defined?" — it does NOT answer "is this setting effectively applied to devices/users?" The difference between explicit definition and effective state must be preserved and communicated.
- **Explicit non-goals (V1)**:
- No universal cross-provider canonical setting ontology. Avoid a large fragile semantic mapping project. Setting identity stays policy-family-local in V1.
- No effective-state guarantees. V1 does not resolve assignment targeting, conflict resolution, or platform-side precedence.
- No portfolio / cross-tenant / workspace-wide scope.
- No dependency on external search infrastructure (Elasticsearch, Meilisearch, etc.) if PostgreSQL-first is sufficient.
- No naive raw JSON full-text search as the product surface. The projection model must provide structured, rankable, explainable results — not grep output.
- No requirement to support every Microsoft policy family from day one.
- **Architectural direction**:
- Search projection layer: when policies are synced/versioned, extract setting facts into a dedicated search-friendly projection (e.g. `policy_setting_facts` table or JSONB-indexed structure). Each row captures: `tenant_id`, `policy_id`, `policy_version_id` (nullable), `policy_type`/`family`, `setting_key`/`path`, `setting_label` (display name where available), `configured_value`, `raw_payload_path`. Extraction logic is per-family, not a universal parser.
- PostgreSQL-first: use GIN indexes on JSONB or trigram indexes on text columns for efficient search. Evaluate `pg_trgm` for fuzzy matching.
- Extraction is append/rebuild on sync — not real-time transformation. Can be a post-sync projection step integrated into the existing inventory sync pipeline.
- Provider boundary stays explicit: the projection is populated by each policy family's extraction logic. No abstraction that pretends all policy families share the same schema.
- RBAC: tenant-scoped, gated by a capability (e.g. `policy.settings.search`). Results respect existing policy-level visibility rules.
- Audit: queries are loggable but do not require per-query audit entries in V1. The feature is read-only.
- **UX direction**:
- Primary surface: dedicated page under the tenant context (e.g. tenant → Policy Explorer or tenant → Setting Search). Full working surface with query input, optional filters (policy type, policy family, value match mode), and a results table.
- Result rows: policy name (linked), policy type badge, setting path/key, configured value, version indicator. Expandable detail or click-through to policy inspector.
- Empty state: clearly explains scope limitations ("No matching settings found in supported indexed policies. This does not mean the setting is absent from your tenant's effective configuration.").
- Scope indicator: persistent badge or label showing the search scope (e.g. "Searching N supported policy families in [tenant name]").
- Future quick-access entry point (e.g. command palette, header search shortcut) is a natural extension but not V1 scope.
- **Future expansion space** (not V1):
- Semantic aliases / display-name normalization across families
- Duplicate / conflict detection hints ("this setting is defined in 3 policies")
- Assignment-aware enrichment ("this policy targets group X")
- Setting history / change timeline ("this value changed from false to true in version 4")
- Baseline / drift linkage ("this setting deviates from the CIS baseline")
- Workspace-wide / portfolio search across tenants
- Quick-access command palette entry point
- **Risks / notes**:
- Extraction logic per policy family is the main incremental effort. Each new family supported requires a family-specific extractor. Start with the highest-value families and expand.
- Settings catalog policies have structured setting definitions that are relatively easy to extract. OMA-URI / admin template policies are less structured. The V1 family selection should favor extractability.
- The "no-result ≠ not configured" trust boundary is critical for enterprise credibility. Overcommitting search completeness erodes trust.
- Projection freshness depends on sync frequency. Stale projections must be visually flagged if the tenant hasn't been synced recently.
- **Dependencies**: Inventory sync stable, policy versioning (snapshots), tenant context model, RBAC capability system (066+)
- **Priority**: high
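
The projection idea can be made concrete with a small in-memory sketch: a per-family extractor flattens a policy payload into setting-fact rows, and a value-oriented query such as `EnableTPM = true` is matched against them. The payload shape and the functions `extractSettingFacts()`, `parseQuery()`, and `searchFacts()` are hypothetical; in the product the rows would live in a PostgreSQL projection table and be searched with indexes rather than `array_filter`.

```php
<?php

declare(strict_types=1);

/**
 * Flattens a nested policy payload into setting-fact rows.
 * One extractor like this would exist per supported policy family.
 *
 * @param array<string|int, mixed> $payload
 * @return list<array{path: string, key: string, value: string}>
 */
function extractSettingFacts(array $payload, string $prefix = ''): array
{
    $facts = [];

    foreach ($payload as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;

        if (is_array($value)) {
            $facts = array_merge($facts, extractSettingFacts($value, $path));
            continue;
        }

        // Booleans are stored as "true"/"false" so value queries read naturally.
        $display = is_bool($value) ? ($value ? 'true' : 'false') : (string) $value;
        $facts[] = ['path' => $path, 'key' => (string) $key, 'value' => $display];
    }

    return $facts;
}

/**
 * Parses "Key = value" into its parts; a bare term searches paths only.
 *
 * @return array{key: string, value: ?string}
 */
function parseQuery(string $query): array
{
    $parts = array_map('trim', explode('=', $query, 2));

    return ['key' => $parts[0], 'value' => $parts[1] ?? null];
}

/**
 * @param list<array{path: string, key: string, value: string}> $facts
 * @return list<array{path: string, key: string, value: string}>
 */
function searchFacts(array $facts, string $query): array
{
    $q = parseQuery($query);

    return array_values(array_filter($facts, static function (array $fact) use ($q): bool {
        $keyMatches = stripos($fact['path'], $q['key']) !== false;
        $valueMatches = $q['value'] === null || strcasecmp($fact['value'], $q['value']) === 0;

        return $keyMatches && $valueMatches;
    }));
}

// Hypothetical payload; real policy families have their own structures.
$payload = [
    'bitlocker' => ['EnableTPM' => true, 'MinimumPinLength' => 6],
    'firewall' => ['EnableFirewall' => true],
];

$facts = extractSettingFacts($payload);
print_r(searchFacts($facts, 'EnableTPM = true'));
```

Keeping the raw `path` next to the short `key` mirrors the `setting_key`/`path` and `raw_payload_path` columns proposed above, so a result can always be traced back to where it sits in the original policy payload.
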
### Help Center / Documentation Surface
- **Type**: feature
- **Source**: product planning, operator support friction analysis
- **Problem**: TenantPilot lacks a central, searchable, in-product knowledge surface for operators. Product documentation, glossary content, workflow walkthroughs, and conceptual explanations are fragmented across specs, internal docs, and implicit operator expectations. Operators who need to understand a concept, look up a term, or read a walkthrough must leave the admin experience entirely.
- **Why it matters**: A canonical in-product documentation surface reduces support friction, enables self-service resolution, and makes advanced governance features more understandable and adoptable. This is the product's central knowledge layer — distinct from contextual inline help (separate candidate), page-level instructional panels (separate candidate), and onboarding next-step guidance (separate candidate). It is also distinct from audit/evidence/reporting artifacts. This is a product maturity and support-efficiency capability, not a content management system.
- **Proposed direction**:
- Markdown-based documentation stored in-repo, rendered inside the Filament admin panel as a dedicated help center surface
- Global documentation search across all help articles
- Structured article hierarchy: conceptual docs, walkthroughs, glossary, role/capability explanations, domain-specific governance guidance
- Clear separation between product help/knowledge and audit/report/evidence exports
- Workspace/tenant context awareness only where helpful for navigation, not to turn docs into tenant data
- Native first-party rendering approach — no external CMS dependency, no third-party documentation platform required
- **Explicit non-goals**: Not a customer support ticket system. Not an audit pack feature. Not a generic CMS. Not a replacement for external knowledge bases if those exist separately. Not the delivery mechanism for contextual inline help or page-level guidance panels — those are separate capabilities that may link into this surface but are independently spec-able. Not a video-first help strategy. Not a forced guided tour infrastructure. v1 does not include a `/system` help-management UI. Content authoring, editing, and lifecycle remain code/repo-driven. Any future `/system` surface is limited to governance/observability concerns, not CMS behavior.
- **Dependencies**: Filament panel infrastructure, existing navigation/information architecture
- **Related candidates**: Contextual Help and Inline Guidance Framework, Page-Level Guidance Patterns, Onboarding Guidance and Next-Step Surfaces, Documentation Generation Pipeline and Editorial Workflow
- **Priority**: medium
### Contextual Help and Inline Guidance Framework
- **Type**: feature
- **Source**: product planning, operator support friction analysis, governance UX complexity
- **Problem**: As TenantPilot's governance surface area grows — findings workflows, RBAC capabilities, provider connection lifecycle, restore risk assessment, compliance baselines — operators encounter complex concepts and multi-step workflows where the purpose, consequences, or next steps are not self-evident from the UI alone. There is no standard mechanism for surfacing short, targeted, inline explanations at the point of need. Operators must either already know the domain or leave the product to find help.
- **Why it matters**: Contextual help reduces cognitive load on high-complexity governance surfaces, decreases support escalations, and makes advanced features accessible to operators who are not domain experts. This is distinct from the central documentation surface (which is a browsable knowledge base) and from page-level instructional panels (which provide section-level orientation). Contextual help is the layer that answers "what does this mean?" and "what happens if I do this?" at the point of interaction.
- **Proposed direction**:
- Standardized inline help entry points (e.g. `?` icon actions, help affordances on action buttons, info popovers on complex form fields) integrated with Filament's action and component patterns
- Short-form help content rendered in slideover or modal surfaces rather than tooltips, so help does not turn into a tooltip explosion
- Help content is text-first: concise explanation of what the feature does, why it matters, and what happens next
- Content sourced from markdown or structured help definitions stored in-repo, maintainable alongside code
- Optional deep-link from contextual help into the central Help Center for extended reading
- Clean integration with Filament v5 actions, render hooks, and CSS hook classes — no internal view publishing
- First-party, native approach — no third-party guided-tour or walkthrough library dependency
- **Explicit non-goals**: Not a tooltip explosion across every field. Not a video-first help strategy. Not a forced guided tour or product walkthrough system. Not a replacement for the central Help Center documentation surface. Not Intercom-style chat or conversational help. Not an onboarding checklist (separate candidate). Not a generic CMS feature. v1 does not include a `/system` help-management UI. Content authoring, editing, and lifecycle remain code/repo-driven. Any future `/system` surface is limited to governance/observability concerns, not CMS behavior.
- **Dependencies**: Filament panel infrastructure, action system (Filament v5 actions), Help Center / Documentation Surface (for deep-link target, but functionally independent)
- **Related candidates**: Help Center / Documentation Surface, Page-Level Guidance Patterns, Admin Visual Language Canon
- **Priority**: medium
### Page-Level Guidance Patterns
- **Type**: feature
- **Source**: product planning, governance UX complexity analysis
- **Problem**: Several TenantPilot admin pages serve governance-heavy, interpretation-heavy, or consequence-heavy functions — findings review, RBAC capability management, provider connection lifecycle, restore dry-run results, compliance baseline comparison, drift analysis. These pages present data and actions that require domain context to interpret correctly, but operators arrive without orientation. There is no standard pattern for providing page-level introductory guidance, "learn more" affordances, or instructional panels that help operators understand what a page shows, why it matters, and how to use it effectively.
- **Why it matters**: Page-level guidance reduces operator confusion on the product's most complex and highest-stakes surfaces. It bridges the gap between contextual inline help (which explains individual concepts) and the central documentation surface (which is a browsable reference). It provides section-level orientation without requiring operators to leave the page or consult external resources. For governance and compliance surfaces, clear page-level framing increases operator confidence and reduces misinterpretation of presented data.
- **Proposed direction**:
- Standardized page-level instructional panel pattern: optional intro/help section at the top of qualifying pages, rendered via Filament render hooks or section components
- "Learn more" affordances linking from the page context into the central Help Center for extended documentation
- Pattern supports dismissibility and operator preference (e.g. collapsible, "don't show again" per-page or per-user)
- Content stored as markdown or structured definitions in-repo, not hardcoded in Blade templates
- Visual pattern consistent with the Admin Visual Language Canon direction — does not introduce a new visual system
- Qualifying page criteria: governance-heavy pages, consequence-heavy action pages, interpretation-heavy data review pages. Not every list page or simple CRUD form.
- **Explicit non-goals**: Not a banner/notification system. Not contextual inline help for individual fields or actions (separate candidate). Not onboarding step-by-step flow (separate candidate). Not a forced walkthrough or modal tutorial. Not applied universally to every admin page — only where governance complexity warrants it. v1 does not include a `/system` help-management UI. Content definitions remain code/repo-driven. Any future `/system` surface is limited to governance/observability concerns, not CMS behavior.
- **Dependencies**: Filament panel infrastructure, render hooks, Admin Visual Language Canon (for visual consistency), Help Center / Documentation Surface (for "learn more" link targets)
- **Related candidates**: Help Center / Documentation Surface, Contextual Help and Inline Guidance Framework, Admin Visual Language Canon
- **Priority**: low
### Onboarding Guidance and Next-Step Surfaces
- **Type**: feature
- **Source**: product planning, operator time-to-value analysis
- **Problem**: New operators and new tenants arrive in TenantPilot without structured product-level guidance about what to do next. The managed-tenant onboarding wizard handles the technical setup flow (consent, connection, initial sync), but after onboarding completes — or for operators exploring governance features for the first time — there is no product-level guidance surface that helps operators understand what is available, what to configure next, or how to reach productive use of governance workflows. Empty states, action labels, and page descriptions do not consistently communicate next steps or paths to value.
- **Why it matters**: Reduces time-to-value for new operators without introducing a forced product tour. Improves discoverability of governance features (findings, baselines, RBAC, drift, reporting) by surfacing actionable next steps rather than relying on operators to discover capabilities through sidebar exploration. Improves the quality of empty states and first-use experiences across the product. This is a product adoption and operator efficiency capability, not a marketing or onboarding conversion feature.
- **Proposed direction**:
- Smart empty states with actionable next-step guidance on high-value surfaces (e.g. findings list with no findings yet, backup history with no backups, dashboard with no synced tenants)
- Optional post-onboarding next-step surface or checklist-style orientation (not a forced wizard — an opt-in guidance panel or dedicated page)
- Descriptive action labels and page descriptions that reduce confusion about what features do, especially on first encounter
- Progress-aware guidance: surfaces that adapt based on what the operator has already configured (e.g. "sync complete — next: review your compliance baseline" vs. "no sync yet — start here")
- Lightweight, maintainable: guidance content stored as structured definitions in-repo, not hardcoded across scattered Blade templates
- Native Filament integration via empty states, info sections, and render hooks
- **Explicit non-goals**: Not a forced guided tour or product walkthrough. Not a video-first onboarding strategy. Not a step-by-step wizard for every product feature. Not a tooltip explosion. Not a replacement for the managed-tenant onboarding wizard (which handles technical setup). Not a marketing/conversion funnel. Not a third-party onboarding library integration (e.g. no Appcues, no Intercom tours). Not the same as contextual inline help (separate candidate) or page-level instructional panels (separate candidate). v1 does not include a `/system` help-management UI. Guidance definitions remain code/repo-driven. Any future `/system` surface is limited to governance/observability concerns, not CMS behavior.
- **Dependencies**: Filament panel infrastructure, empty-state patterns, Admin Visual Language Canon (for visual consistency), managed-tenant onboarding wizard (Spec 001 and follow-ups — this candidate covers life after technical onboarding)
- **Related candidates**: Help Center / Documentation Surface, Contextual Help and Inline Guidance Framework, Page-Level Guidance Patterns
- **Priority**: medium
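
The progress-aware guidance idea above can be sketched as an ordered list of guidance definitions where the first unmet prerequisite wins. This is a minimal, language-neutral illustration in Python, not TenantPilot's implementation: `TenantProgress`, `NextStep`, `GUIDANCE`, and the flag names are all hypothetical, and only the two example messages are taken from the candidate text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantProgress:
    """Hypothetical snapshot of what an operator has already set up."""
    sync_completed: bool = False
    baseline_reviewed: bool = False


@dataclass(frozen=True)
class NextStep:
    key: str       # stable identifier for the guidance definition
    message: str   # short operator-facing text


# Ordered guidance definitions (assumed structure): the first unmet step wins.
GUIDANCE = [
    ("sync_completed", NextStep("start-sync", "No sync yet. Start here.")),
    ("baseline_reviewed", NextStep(
        "review-baseline",
        "Sync complete. Next: review your compliance baseline.",
    )),
]


def next_step(progress: TenantProgress) -> Optional[NextStep]:
    """Return the first step whose prerequisite flag is still False, or None if all are done."""
    for flag, step in GUIDANCE:
        if not getattr(progress, flag):
            return step
    return None
```

Keeping the definitions in one ordered structure is one way to satisfy the "stored as structured definitions in-repo" direction: an empty state or info section would only need to call `next_step()` rather than embed its own conditionals.
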
### Documentation Generation Pipeline and Editorial Workflow
- **Type**: feature
- **Source**: product planning, documentation sustainability analysis
- **Problem**: Even with a markdown-based knowledge layer, documentation quality and coverage will degrade without a lightweight authoring pipeline. The product needs a structured way to generate document skeletons/templates, support repeatable documentation workflows, and optionally use AI-assisted drafting without treating generated text as authoritative by default.
- **Why it matters**: Without a documentation pipeline, docs become inconsistent, coverage drifts as features grow, teams fall back to ad hoc writing, the help layer becomes expensive to maintain, and future AI-assisted documentation lacks guardrails.
- **Proposed direction**:
- Document skeleton or template generation (e.g. command/tooling such as `docs:generate`)
- Structured frontmatter / metadata expectations where useful
- Editorial states such as draft / needs review / published
- Explicit "AI draft needs review" semantics to distinguish generated drafts from canonical reviewed documentation
- Repo-native markdown workflow as the source of truth
- **Explicit non-goals**: Not a replacement for careful documentation authorship. Not a public marketing content engine. Not a promise of autonomous documentation generation. This is an internal/product documentation pipeline and editorial guardrail layer.
- **Dependencies**: Help Center / Documentation Surface (this candidate builds on the rendering/delivery surface)
- **Priority**: low
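
As a rough illustration of the editorial states and the "AI draft needs review" semantics, the sketch below validates a minimal frontmatter block. It is an assumption-laden example, not a proposed format: the field names `state`, `ai_drafted`, and `human_reviewed` are hypothetical, and the parser only handles flat `key: value` lines rather than full YAML.

```python
from dataclasses import dataclass
from enum import Enum


class EditorialState(Enum):
    DRAFT = "draft"
    NEEDS_REVIEW = "needs review"
    PUBLISHED = "published"


@dataclass(frozen=True)
class ArticleMeta:
    title: str
    state: EditorialState
    ai_drafted: bool = False      # hypothetical field
    human_reviewed: bool = False  # hypothetical field


def parse_frontmatter(text: str) -> dict:
    """Parse flat 'key: value' lines between two '---' delimiter lines."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter opening delimiter")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed frontmatter line: {line!r}")
        fields[key.strip()] = value.strip()
    raise ValueError("missing frontmatter closing delimiter")


def load_meta(text: str) -> ArticleMeta:
    """Build ArticleMeta and reject AI drafts that were published without review."""
    fields = parse_frontmatter(text)
    meta = ArticleMeta(
        title=fields["title"],
        state=EditorialState(fields["state"]),  # raises ValueError on unknown states
        ai_drafted=fields.get("ai_drafted", "false") == "true",
        human_reviewed=fields.get("human_reviewed", "false") == "true",
    )
    if meta.state is EditorialState.PUBLISHED and meta.ai_drafted and not meta.human_reviewed:
        raise ValueError("AI draft needs review before it can be published")
    return meta
```

A check of this kind could run in CI so that generated drafts cannot silently become canonical documentation.
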
### Drift Notifications Settings Surface
- **Type**: feature
- **Source**: product planning, governance alerting direction
- **Problem**: TenantPilot has governance/alerting direction, but operators still lack a clear product surface to configure drift-related notification behavior in a predictable way. Without a dedicated settings experience, alert routing feels infrastructural rather than operator-manageable.
- **Why it matters**: Operators need tenant/workspace-level control over how governance signals reach them — email, Microsoft Teams, severity-aware routing, notification fatigue reduction, and confidence that important drift events will not be silently missed. Especially relevant for MSP-style operations and ongoing tenant reviews.
- **Proposed direction**:
- Dedicated settings-level drift notification management surface
- Delivery targets such as email and Teams
- Routing preferences by severity / event type where appropriately bounded
- Sensible defaults with cooldown / dedup / quiet-hours framing if those concepts already exist in the broader alerting direction
- Clear alignment with broader Alerts v1 direction, focused on the operator settings UX and configuration model
- **Explicit non-goals**: Not a reinvention of the whole alerts engine. Not a generic notification center for every product event. This is the operator-facing configuration surface for drift/governance notifications.
- **Dependencies**: Alerting v1 direction, drift detection foundation (Spec 044), tenant/workspace context model
- **Priority**: medium
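
To make the routing idea concrete, here is a small sketch of severity-aware routing with a per-event cooldown. It assumes that cooldown and dedup concepts exist in the broader alerting direction, which the candidate leaves open; the `Route` class, the severity scale, and the 30-minute default are hypothetical and not part of Alerts v1.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass
class Route:
    """One delivery target with a severity floor and a dedup cooldown (hypothetical model)."""
    target: str                                     # e.g. "email" or "teams"
    min_severity: str = "low"
    cooldown: timedelta = timedelta(minutes=30)
    _last_sent: dict = field(default_factory=dict)  # dedup key -> last delivery time

    def should_deliver(self, severity: str, dedup_key: str, now: datetime) -> bool:
        """Deliver only if severity meets the floor and the key is outside its cooldown."""
        if SEVERITY_RANK[severity] < SEVERITY_RANK[self.min_severity]:
            return False
        last = self._last_sent.get(dedup_key)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[dedup_key] = now
        return True
```

The settings surface described above would then be the place where an operator edits `min_severity` and `cooldown` per target, while the delivery engine itself stays out of scope.
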
### Drift Change Governance — Approval Workflows, Freeze Windows, Tamper Detection
- **Type**: feature
- **Source**: roadmap-to-spec coverage audit 2026-03-18, 0800-future-features brainstorming (pillar #2: Drift & Change Governance, "Zahlhebel #1" / revenue lever #1)
- **Problem**: TenantPilot's drift/baseline engine (Specs 116–119) detects configuration changes and surfaces them as findings. But detection alone does not govern _when_ and _how_ changes are allowed. There is no approval workflow for high-risk configuration changes, no protected time windows during which changes are blocked or escalated, and no tamper detection that distinguishes authorized changes from suspicious or unauthorized modifications. The drift engine answers "what changed?"; this capability answers "was the change allowed, and should it have happened now?"
- **Why it matters**: Change governance is pillar #2 in product brainstorming, where it was flagged as revenue lever #1. Enterprise customers and MSPs managing production tenants need controlled change processes, especially for high-risk policy families (endpoint security, compliance, conditional access). Without approval workflows and freeze windows, every detected drift is reactive: something already happened, the only question is whether to accept or revert it. Adding a governance layer turns drift from a detection feature into a change-control platform, which is the core value proposition for regulated environments, MSP SLA enforcement, and enterprise change management.
- **Proposed direction**:
- **Change approval workflows**: define which policy families, change types, or severity levels require explicit approval before being accepted in the governance record. Approval can be gated by capability (e.g. `drift.change.approve`), with structured approval/rejection, justification, actor, and timestamp. Approval workflows are governance-layer constructs — they do not block changes in Intune (TenantPilot does not control the mutation path in the source tenant), but they govern how TenantPilot treats detected changes: approved (accepted into baseline), rejected (escalated as governance violation), pending (awaiting review).
- **Protected / frozen windows**: workspace- or tenant-level configuration of time periods during which detected changes are automatically escalated or flagged (e.g. "no high-risk changes accepted during weekend maintenance windows" or "freeze all baseline-covered policy families during audit preparation"). Freeze windows do not prevent changes in Intune — they elevate the governance response when changes are detected during protected periods.
- **Tamper / suspicious change detection**: heuristic or rule-based identification of changes that look unauthorized or anomalous — changes outside business hours, changes by unexpected actors, bulk changes across multiple policy families, changes to policies that haven't been modified in extended periods. These produce elevated findings or alerts, not automated blocks.
- **Integration with existing drift/findings pipeline**: change governance operates on top of the drift detection pipeline. It consumes drift findings and applies governance rules (approval requirements, freeze window evaluation, tamper heuristics) to classify and route them. It does not replace the detection engine.
- **Explicit non-goals**: Not a rewrite or replacement of the drift/baseline engine (Specs 116–119 handle detection; this handles governance response). Not a DevOps CI/CD pipeline or deployment system; TenantPilot does not deploy to Intune tenants. Not a mutation-path control mechanism: TenantPilot cannot prevent changes in the source tenant; it can only govern how detected changes are classified and acted upon. Not a generic security hardening bucket. Not a real-time blocking proxy between operators and Intune.
- **Boundary with drift/baseline engine (Specs 116–119)**: Drift engine = detect changes, capture content, produce findings. Change governance = classify detected changes against approval/freeze/tamper rules, route them through governance workflows. The engine feeds this layer; this layer does not modify detection behavior.
- **Boundary with Drift Notifications Settings Surface**: Drift notifications = operator-facing configuration of alert delivery for drift events. Change governance = approval workflows, freeze windows, tamper classification. Notifications deliver signals; governance adds workflow, classification, and control semantics.
- **Boundary with Exception / Risk-Acceptance Workflow for Findings**: Risk acceptance = post-hoc acknowledgment that a known finding is intentionally accepted. Change governance = pre/peri-change classification of whether a detected change was authorized and occurred under acceptable conditions. Different lifecycle positions, complementary capabilities.
- **Dependencies**: Drift/baseline engine (Specs 116–119) fully shipped, findings workflow (Spec 111), alerting foundations (Specs 099/100), audit log foundation (Spec 134), RBAC/capability system (066+)
- **Priority**: medium (high strategic and revenue value, but depends on drift engine and findings workflow maturity)
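
The governance layer described above classifies already-detected changes; it never blocks them in Intune. A minimal sketch of that classification follows, using the three treatments named in the proposed direction (approved, rejected, pending) plus freeze-window escalation. Everything else is an assumption for illustration: the type names, the `decision` field, and the rule that families outside the approval list are accepted without review.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Iterable, Optional


class Treatment(Enum):
    APPROVED = "approved"   # accepted into baseline
    REJECTED = "rejected"   # escalated as governance violation
    PENDING = "pending"     # awaiting review


@dataclass(frozen=True)
class FreezeWindow:
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        """True when the moment falls inside the half-open window [start, end)."""
        return self.start <= moment < self.end


@dataclass(frozen=True)
class DetectedChange:
    policy_family: str
    detected_at: datetime
    decision: Optional[bool] = None  # None = no reviewer decision yet (hypothetical field)


@dataclass(frozen=True)
class Classification:
    treatment: Treatment
    escalated: bool  # True when the change was detected during a freeze window


def classify(
    change: DetectedChange,
    approval_required_families: set,
    freeze_windows: Iterable[FreezeWindow],
) -> Classification:
    """Apply approval and freeze-window rules to one detected change."""
    in_freeze = any(w.covers(change.detected_at) for w in freeze_windows)
    if change.policy_family not in approval_required_families:
        treatment = Treatment.APPROVED  # assumption: no approval rule means auto-accept
    elif change.decision is None:
        treatment = Treatment.PENDING
    elif change.decision:
        treatment = Treatment.APPROVED
    else:
        treatment = Treatment.REJECTED
    return Classification(treatment, escalated=in_freeze)
```

Tamper heuristics are deliberately left out of this sketch; they would be a further input to the same classification step rather than a separate pipeline.
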
### User Invitations and Directory-based User Selection
- **Type**: feature
- **Source**: product planning, access-management UX analysis
- **Problem**: Workspace and tenant membership flows currently lack a polished enterprise-grade invitation and directory-assisted user selection experience. Operators should not need brittle manual steps to add the right person to the right workspace/tenant context.
- **Why it matters**: Improves onboarding speed, operator/admin efficiency, correctness of membership assignment, enterprise credibility of the access-management UX, and future scalability of workspace/tenant administration.
- **Proposed direction**:
- Directory-based user lookup / selection where supported
- Invitation flows initiated directly from membership management surfaces
- Invitation link / invitation lifecycle support
- Clear distinction between selecting an existing directory identity vs inviting a not-yet-active participant
- Alignment with existing RBAC / membership / workspace-first context model
- **Explicit non-goals**: Not a full identity-provider redesign. Not a replacement for the Entra auth architecture. Not a generic address-book feature. This is a bounded access-administration workflow improvement.
- **Dependencies**: RBAC/capability system (066+), workspace membership model, Entra identity integration
- **Priority**: medium
<!-- Row Interaction / Action Surface follow-up cluster (2026-03-16) -->
> **Action Surface follow-up direction**: The action-surface contract foundation (Specs 082, 090) and the follow-up taxonomy/viewer specs (143–146) are all fully implemented. The remaining gaps are not architectural redesign; they are incomplete adoption, missing decision criteria, and scope boundaries that haven't expanded to cover all product surfaces. The correct shape is: one foundation amendment to codify the missing rules and extend contract scope (v1.1), two compliance rollout specs to enroll currently-exempted surface families, and one targeted correction to fix the clearest remaining anti-pattern on a high-signal surface. This avoids reinventing the architecture, avoids umbrella "consistency" specs, and produces bounded, independently shippable work. TenantResource lifecycle-conditional actions and PolicyResource More-menu ordering are addressed by the updated foundation rules, not by standalone specs. Widgets, choosers, and pickers remain deferred/exempt.
### Action Surface Contract v1.1 — Decision Criteria, Ordering Rules, and System Scope Extension
- **Type**: foundation/spec amendment
- **Source**: row interaction / action surface architecture analysis 2026-03-16
- **Problem**: The action-surface contract (Spec 082) establishes profiles, slots, affordances, validator tests, and guard tests, but leaves three gaps: (1) no formal decision criteria for when a surface should use ClickableRow vs ViewAction vs PrimaryLinkColumn as its inspect affordance; (2) no ordering rules for actions inside the More menu (destructive-last, lifecycle position, stable grouping); (3) system-panel table surfaces are explicitly excluded from contract scope, meaning ~6 operational surfaces have no declaration and no CI coverage. The architecture is correct; it just cannot prevent inconsistent choices on new surfaces or catch drift on existing ones.
- **Why this is its own spec**: This is a foundation amendment — it changes the rules that all other surfaces must follow. Rollout specs (system panel enrollment, relation manager enrollment) depend on this spec's updated rules existing first. Merging rollout work into a foundation amendment blurs the boundary between "what the rules are" and "who must comply."
- **In scope**:
- Codify inspect-affordance decision tree (ClickableRow default, ViewAction exception criteria, PrimaryLinkColumn criteria) in `docs/ui/action-surface-contract.md`
- Define the "lone ViewAction" anti-pattern formally and add it to validator detection
- Codify More-menu action ordering rules (lifecycle actions, severity ordering, destructive-last)
- Extend contract scope so system-panel table surfaces are enrollable (not exempt by default)
- Add guidance that cross-panel surface taxonomy should converge where semantically equivalent
- Update `ActionSurfaceValidator` to enforce new criteria
- Update guard/contract tests to cover new rules
- **Non-goals**:
- Retrofitting all existing system-panel pages (separate rollout spec)
- Retrofitting all relation managers (separate rollout spec)
- One-off resource-level fixes (those are tasks within rollout or correction specs)
- TenantResource or PolicyResource redesign (addressed by applying the updated rules, not by dedicated specs)
- Chooser/picker/widget contracts (remain deferred/exempt)
- **Depends on**: Spec 082, Spec 090 (both fully complete — this extends their foundation)
- **Suggested order**: First. All other candidates in this cluster depend on the updated rules.
- **Risk**: Low. This adds rules and extends scope — it does not change existing compliant declarations.
- **Why this boundary is right**: Foundation rules must be codified before rollout enforcement. Mixing rule definition with compliance rollout makes it impossible to review the rules independently and creates circular dependencies.
- **Priority**: high
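
The More-menu ordering rules could be expressed as a stable sort by action group plus a validator-style check. The sketch below is illustrative Python, not the `ActionSurfaceValidator` API: only "destructive-last" and "stable grouping" come from the candidate text, while the group names and the position of lifecycle actions are assumptions that v1.1 would have to settle.

```python
from dataclasses import dataclass

# Assumed group order; only "destructive last" is stated in the candidate.
GROUP_RANK = {"general": 0, "lifecycle": 1, "destructive": 2}


@dataclass(frozen=True)
class MenuAction:
    name: str
    group: str = "general"


def order_more_menu(actions):
    """Order actions by group rank; sorted() is stable, so declared order survives within a group."""
    return sorted(actions, key=lambda a: GROUP_RANK[a.group])


def ordering_violations(actions):
    """Return names of non-destructive actions that appear after a destructive one."""
    violations = []
    seen_destructive = False
    for action in actions:
        if action.group == "destructive":
            seen_destructive = True
        elif seen_destructive:
            violations.append(action.name)
    return violations
```

A guard test could assert that `ordering_violations()` is empty for every declared surface, which keeps the rule enforceable without prescribing how each resource builds its menu.
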
### System Panel Action Surface Contract Enrollment
- **Type**: compliance rollout
- **Source**: row interaction / action surface architecture analysis 2026-03-16
- **Problem**: System-panel table surfaces (Ops/Runs, Ops/Failures, Ops/Stuck, Directory/Tenants, Directory/Workspaces, Security/AccessLogs) use `recordUrl()` consistently but have no `ActionSurfaceDeclaration`, no CI coverage, and are exempt from the contract by default. They are the largest family of undeclared table surfaces in the product.
- **Why this is its own spec**: System-panel surfaces belong to a different panel with different operator audiences and potentially different profile requirements. Enrolling them is a distinct compliance effort from tenant-panel relation managers or targeted resource corrections. The scope is bounded and independently shippable.
- **In scope**:
- Declare `ActionSurfaceDeclaration` for each system-panel table surface (~6 pages)
- Map to existing profiles where semantically correct (e.g., `ListOnlyReadOnly` for access logs, `RunLog` for ops run tables)
- Introduce new system-specific profiles only if existing profiles truly do not fit
- Remove enrolled system-panel pages from `ActionSurfaceExemptions` baseline
- Add guard test coverage for enrolled system surfaces
- **Non-goals**:
- Tenant-panel resource declarations (already covered by Spec 090)
- Relation manager enrollment (separate candidate)
- Non-table system pages (dashboards, diagnostics, choosers)
- System-panel RBAC redesign
- Cross-workspace query authorization (tracked as "System Console Scope Hardening" candidate)
- **Depends on**: Action Surface Contract v1.1 (must extend scope to system panel first)
- **Suggested order**: Second, in parallel with "Run Log Inspect Affordance Alignment" after v1.1 is complete.
- **Risk**: Low. These surfaces already behave consistently; this work adds formal declarations and CI coverage.
- **Why this boundary is right**: System-panel enrollment is self-contained — it doesn't touch tenant-panel resources or relation managers. Completing it independently gives CI coverage over a currently-invisible surface family.
- **Priority**: medium
### Relation Manager Action Surface Contract Enrollment
- **Type**: compliance rollout
- **Source**: row interaction / action surface architecture analysis 2026-03-16
- **Problem**: Three relation managers (`BackupItemsRelationManager`, `TenantMembershipsRelationManager`, `WorkspaceMembershipsRelationManager`) are in the `ActionSurfaceExemptions` baseline with no declaration. They were exempted during initial rollout (Spec 090) because relation-manager-specific profile semantics were not yet settled. Three other relation managers already have declarations. The exemptions should shrink over time, not become permanent.
- **Why this is its own spec**: Relation managers have different interaction expectations than standalone list resources (context is always nested under a parent record, pagination/empty-state semantics differ, attach/detach may replace create/delete in some cases). Enrollment requires relation-manager-specific review of profile fit, not just copying resource-level declarations.
- **In scope**:
- Declare `ActionSurfaceDeclaration` for each currently-exempted relation manager (3 components)
- Validate profile fit (`RelationManager` profile vs a more specific variant)
- Reduce `ActionSurfaceExemptions` baseline by removing enrolled relation managers
- Add guard test coverage
- **Non-goals**:
- Redesigning backup item management UX
- Redesigning membership management UX
- Parent resource changes (TenantResource, WorkspaceResource)
- Full restore/backup domain redesign
- Introducing new relation managers
- **Depends on**: Action Surface Contract v1.1 (for any updated profile guidance or relation-manager-specific ordering rules)
- **Suggested order**: Third, after both v1.1 and System Panel Enrollment are complete. Lowest urgency because these surfaces are low-traffic and already functionally correct.
- **Risk**: Low. These relation managers already work correctly. This adds formal compliance, not behavioral change.
- **Why this boundary is right**: Relation manager enrollment is a distinct surface family with its own profile semantics. Mixing it with system-panel enrollment or targeted resource corrections would create an unfocused rollout spec.
- **Priority**: low
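
Reducing the exemption baseline is easiest to keep honest with a guard that only lets the baseline shrink. The following sketch shows that idea in Python under stated assumptions: the function name and the set-based representation are hypothetical and do not describe how `ActionSurfaceExemptions` is actually stored.

```python
def check_exemption_baseline(baseline: set, exempted_now: set, declared: set) -> list:
    """Report exemptions that are new, or that overlap with surfaces that already have a declaration."""
    problems = []
    for surface in sorted(exempted_now - baseline):
        problems.append(f"{surface}: new exemption not present in the baseline")
    for surface in sorted(exempted_now & declared):
        problems.append(f"{surface}: declared surfaces must be removed from exemptions")
    return problems


# Example: the three relation managers named above form the current baseline.
BASELINE = {
    "BackupItemsRelationManager",
    "TenantMembershipsRelationManager",
    "WorkspaceMembershipsRelationManager",
}
```

With a check like this, enrolling a relation manager means adding its declaration and removing it from the exemption list in the same change; forgetting either half fails the guard.
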
### Run Log Inspect Affordance Alignment
- **Type**: targeted surface correction
- **Source**: row interaction / action surface architecture analysis 2026-03-16
- **Problem**: `OperationRunResource` declares the `RunLog` profile with `ViewAction` as its inspect affordance. In practice, it renders a lone `ViewAction` in the actions column — the "lone ViewAction" anti-pattern identified in `docs/ui/action-surface-contract.md`. The row-click-first direction means this surface should use `ClickableRow` drill-down to the canonical tenantless viewer (`OperationRunLinks::tenantlessView()`), not a standalone View button. This surface is also inherited by the `Monitoring/Operations` page (which delegates to `OperationRunResource::table()`), so the fix propagates to both surfaces.
- **Why this is its own spec**: This is the single highest-signal concrete violation of the action-surface contract direction. It is bounded to one resource declaration + one inherited page. It does not require rewriting the canonical viewer, redesigning the operations domain, or touching other monitoring surfaces. Keeping it separate from foundation amendments ensures it can ship quickly after v1.1 codifies the anti-pattern rule.
- **In scope**:
- Change `OperationRunResource` inspect affordance from `ViewAction` to `ClickableRow`
- Verify `recordUrl()` points to the canonical tenantless viewer
- Remove the lone `ViewAction` from the actions column
- Confirm the change propagates correctly to `Monitoring/Operations` (which delegates to `OperationRunResource::table()`)
- Update/add guard test assertion for the corrected declaration
- **Non-goals**:
- Rewriting the canonical operation viewer (Spec 144 already complete)
- Broad operations UX redesign
- All monitoring pages (Alerts, Stuck, Failures are separate surfaces with distinct interaction models)
- RestoreRunResource alignment (currently exempted — separate concern)
- Action hierarchy / More-menu changes on this surface (belong to a general rollout, not this correction)
- **Depends on**: Action Surface Contract v1.1 (for codified anti-pattern rule and ClickableRow-default guidance)
- **Suggested order**: Second, in parallel with "System Panel Enrollment" after v1.1 is complete. Quickest win and highest-signal correction.
- **Risk**: Low. Single resource, no behavioral regression, no data model change.
- **Why this boundary is right**: One resource, one anti-pattern, one fix. Expanding scope to "all run-log surfaces" or "all operation views" would turn a quick correction into a rollout spec and delay the most visible improvement.
- **Priority**: medium
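
For illustration, the "lone ViewAction" check could look like the sketch below. The formal definition is left to Action Surface Contract v1.1, so the rule used here (inspect affordance is `ViewAction` and the row actions contain nothing but a view action) is an assumption, as are the `SurfaceDeclaration` shape and the `"view"` action key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceDeclaration:
    """Hypothetical, simplified stand-in for an action-surface declaration."""
    name: str
    inspect_affordance: str   # "ClickableRow", "ViewAction", or "PrimaryLinkColumn"
    row_actions: tuple = ()   # action keys rendered in the actions column


def has_lone_view_action(surface: SurfaceDeclaration) -> bool:
    """Assumed rule: the actions column holds exactly one action, and it is the view action."""
    return surface.inspect_affordance == "ViewAction" and surface.row_actions == ("view",)


# Before and after the correction described above (illustrative values only).
before = SurfaceDeclaration("OperationRunResource", "ViewAction", ("view",))
after = SurfaceDeclaration("OperationRunResource", "ClickableRow", ())
```

Under this assumed rule, `before` is flagged and `after` is not, which matches the intent of moving the surface to row-click drill-down.
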
### Admin Visual Language Canon — First-Party UI Convention Codification and Drift Prevention
- **Type**: foundation
- **Source**: admin UI consistency analysis 2026-03-17
- **Problem**: TenantPilot has accumulated a strong set of first-party visual conventions across Filament resources, widgets, detail pages, badges, status indicators, action hierarchies, and operational surfaces. These conventions are emerging organically and are already broadly consistent — but they remain implicit. No canonical reference defines when to use native Filament patterns vs custom enterprise-detail compositions, which badge/status semantics apply to which domain states, how timestamps should render (`since()` vs absolute datetime vs contextual format), what the card/section/surface hierarchy rules are, which widget composition strategies are canonical, or where cross-panel visual divergence is intentional vs accidental. As the product's surface area grows — new policy families, new governance domains, new operational pages, new evidence/reporting surfaces — the risk is not current visual chaos but future drift caused by missing written selection criteria and decision rules.
- **Why it matters**: Without a codified visual language reference, each new surface is a local design decision made without canonical guidance. This produces slow, cumulative inconsistency that becomes expensive to correct retroactively and degrades enterprise UX credibility. The problem is amplified by multi-agent development: multiple contributors (human and AI) cannot converge on implicit conventions they haven't seen documented. The value is not aesthetic — it is architectural: a canonical reference prevents divergent local choices, reduces review friction, accelerates new surface development, and establishes a stable foundation for the product's long-term visual identity without introducing third-party theme dependencies.
- **Proposed direction**:
- Codify the existing first-party admin visual conventions as a canonical reference document (e.g. `docs/ui/admin-visual-language.md` or similar), covering:
- Badge/status semantics: color mapping rules, icon usage criteria, domain-specific badge extraction patterns, when to use Filament native badge vs custom status composition
- Timestamp rendering: decision rules for `since()` (relative) vs absolute datetime vs contextual format, with domain-specific overrides where justified
- Action hierarchy: primary action vs header actions vs row actions vs bulk actions presentation conventions (complementing the Action Surface Contract's interaction-level rules with visual-level guidance)
- Widget composition: selection criteria for stat cards, chart widgets, list widgets, and custom compositions; density and grouping rules
- Surface/card/section hierarchy: when to use native Filament sections vs custom detail cards vs grouped infoblocks; nesting and visual weight rules
- Enterprise-detail page composition: canonical structure for entity detail/view pages (header, metadata, status, content sections, related data)
- Cross-panel visual divergence: explicit rules for where admin-panel and system-panel styling may diverge and where they must converge
- Typography and spacing: canonical use of Filament's built-in text scales and spacing tokens; rules against ad hoc inline styles
- Establish guardrails against ad hoc local visual overrides (documented anti-patterns, PR review checklist items, or lightweight CI checks where practical)
- Explicitly state that native Filament v5 configuration and CSS hook classes remain the primary styling foundation; a thin first-party theme layer is only justified if native configuration proves insufficient for a documented, bounded set of requirements
- Explicitly reject third-party theme packages (e.g. Filament theme marketplace packages) as an architectural baseline unless separately justified by a dedicated evaluation spec with clear acceptance criteria
- Where existing conventions have already diverged, define the canonical choice and flag surfaces that need alignment (as future cleanup tasks, not as part of this spec's implementation scope)
- **In scope**:
- Inventory of existing visual conventions across tier-1 admin surfaces (resources, detail pages, dashboards, operational views)
- Canonical reference document with decision rules and examples
- Anti-pattern catalog (known visual drift patterns to avoid)
- Lightweight enforcement strategy (review checklist, optional CI, or validator approach)
- Explicit architectural position on theme dependencies
- **Out of scope**:
- Visual redesign of any existing surface (this is codification, not redesign)
- Aesthetic refresh or "make it look nicer" polish work
- Third-party theme evaluation, selection, or integration
- Broad Filament view publishing or deep customization layer
- Marketing/branding/identity work (this is internal admin UX, not external brand)
- Color palette redesign or new design-system creation
- Retrofitting all existing surfaces to strict compliance (alignment cleanup is tracked separately per surface)
- **Key architectural positions**:
- Native Filament v5 remains the primary visual foundation. The product's visual identity is expressed through intentional use of native Filament configuration, not through override layers.
- CSS hook classes are the canonical customization mechanism where native configuration is insufficient. No publishing of Filament internal views for styling purposes.
- The main gap is missing canonical reference and decision rules, not missing components or missing technology.
- The value proposition is preventing future UI drift as more surfaces are added, not correcting a current visual crisis.
- **Dependencies**: Action Surface Contract (Spec 082 / v1.1 candidate) for interaction-level conventions that this visual-level reference complements but does not duplicate. Operations Naming Harmonization candidate for operator-facing terminology alignment that is a distinct concern from visual conventions.
- **Related candidates**: Action Surface Contract v1.1, Operations Naming Harmonization, Help Center / Documentation Surface (the visual language reference could eventually link from contextual help)
- **Trigger / best time to do this**: Before the next wave of new governance domain surfaces (Entra Role Governance, Enterprise App Governance, SharePoint Sharing Governance, Evidence Domain) and before the Policy Setting Explorer UX, so those surfaces are built against documented canonical conventions rather than best-effort pattern matching.
- **Risks if ignored**: Slow visual drift across surfaces, increasing review friction for new surfaces, divergent local conventions that become expensive to reconcile, weakened enterprise UX credibility as surface count grows, and higher cost of eventual systematic alignment.
- **Priority**: medium
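
One of the "lightweight CI checks" mentioned above could be as small as a scan for inline `style` attributes in Blade views. The following is a sketch under assumptions: the function name `findInlineStyles`, the choice to scan only `*.blade.php` files, and the rule itself are hypothetical and are not part of any agreed enforcement strategy.

```php
<?php

declare(strict_types=1);

/**
 * Returns "path:line" entries for every inline style attribute found
 * in Blade views under $viewsDir. Bound attributes such as :style or
 * x-bind:style are deliberately ignored by the lookbehind.
 *
 * @return list<string>
 */
function findInlineStyles(string $viewsDir): array
{
    $violations = [];

    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($viewsDir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($files as $file) {
        if (! str_ends_with($file->getFilename(), '.blade.php')) {
            continue;
        }

        $lines = file($file->getPathname(), FILE_IGNORE_NEW_LINES) ?: [];

        foreach ($lines as $index => $line) {
            // Matches a literal style="..." or style='...' attribute.
            if (preg_match('/(?<![\w:.-])style\s*=\s*["\']/i', $line) === 1) {
                $violations[] = $file->getPathname() . ':' . ($index + 1);
            }
        }
    }

    sort($violations);

    return $violations;
}
```

In a Pest suite this could back a single assertion such as `expect(findInlineStyles(resource_path('views')))->toBeEmpty();`, with documented exceptions handled through an allowlist if the canon permits any.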
### Infrastructure & Platform Debt — CI, Static Analysis, Test Parity, Release Process
- **Type**: hardening
- **Source**: roadmap-to-spec coverage audit 2026-03-18, Infrastructure & Platform Debt table in `docs/product/roadmap.md`
- **Problem**: TenantPilot's product architecture and governance domain have matured significantly, but the surrounding delivery infrastructure has not kept pace. The roadmap acknowledges six open infrastructure debt items — no CI pipeline, no static analysis (PHPStan/Larastan), SQLite-for-tests vs. PostgreSQL-in-production schema drift risk, no `.env.example`, no formal release process, and Dokploy configuration external to the repo — but none of these has a planning home or a specifiable path to resolution. Individually, each is a small-to-medium task. Collectively, they represent a real delivery confidence and maintainability gap: regressions are caught manually, schema drift between test and runtime is a known risk, deploys are manual, there is no static analysis baseline, and developer onboarding has unnecessary friction. As surface area and contributor count grow, this gap becomes more expensive and more dangerous.
- **Why it matters**: Delivery infrastructure is the foundation that makes product-level correctness sustainable. Without CI, regressions that product architecture hardening work has eliminated can silently return. Without static analysis, type-safety gains from PHP 8.4 and strict Filament/Livewire patterns are unenforced. Test/runtime environment parity gaps mean that passing tests do not prove production correctness — a particularly dangerous problem for a product that governs enterprise tenant configurations. No formal release process means deploy confidence depends on human discipline, which degrades as velocity increases. These are not individually urgent, but they are collectively a prerequisite for scaling the product safely. A platform that governs enterprise Intune tenants should have its own delivery governance in order.
- **Proposed direction**:
- **CI pipeline**: establish a CI configuration (compatible with Gitea runners or external CI) that runs the test suite, Pint formatting checks, and (once added) static analysis on every push/PR. Start with a minimal pipeline that provides a pass/fail quality gate rather than a complex multi-stage build system. The goal is "every merge request is automatically validated" — not a full platform engineering initiative.
- **Static analysis baseline**: introduce PHPStan or Larastan at a pragmatic starting level (e.g. level 5 or 6), baselined against the current codebase. Focus on catching type errors, undefined method calls, and incorrect return types. Do not aim for level-max compliance as a first step: establish the tool, baseline the noise, and raise the level incrementally.
- **Test/runtime environment parity**: resolve the SQLite-for-tests vs. PostgreSQL-in-production gap. The existing `phpunit.pgsql.xml` suggests this work is partially started. The goal is that the default test suite runs against the same database engine used in production, so that schema-level and query-level differences do not create silent correctness gaps. This is particularly important for JSONB-dependent domains (policy snapshots, backup payloads, operation context).
- **Developer onboarding hygiene**: add `.env.example` with documented defaults. The missing file is a small but persistent source of friction for new contributors, and adding it also reduces setup-related support burden.
- **Release process formalization**: define a lightweight, documented release process covering version tagging, migration verification, asset compilation (`filament:assets`, `npm run build`), and staging-to-production promotion checks. Not a full release engineering overhaul — a minimal repeatable process that replaces purely manual deploys with a documented, verifiable workflow.
- **Deployment configuration traceability**: evaluate bringing essential Dokploy/deploy configuration into the repo (or at minimum documenting the external configuration surface) so that environment drift between staging and production is detectable rather than discovered after deployment.
- **Explicit non-goals**: Not a full platform engineering or DevOps transformation initiative. Not a rewrite of deployment architecture or infrastructure provisioning. Not a generic "clean up the repo" bucket for unrelated code quality tasks. Not a replacement for product-level architecture hardening work (queued execution reauthorization, Livewire context locking, etc. are distinct product-safety concerns). Not a mandate to achieve maximum static analysis strictness immediately. Not a CI/CD feature-flag or canary-deployment system. Not an internal developer tooling platform with custom CLIs, dashboards, or abstraction layers. The scope is bounded to the six concrete items identified in the roadmap's Infrastructure & Platform Debt table, plus the minimal CI/release process that connects them into an actionable delivery improvement.
- **Boundary with product architecture hardening (Queued Execution Reauthorization, Livewire Context Locking, etc.)**: Product hardening candidates address trust, authorization, and isolation correctness in the running application. Infrastructure debt addresses delivery confidence — the tooling and process that ensures correctness is verified continuously and shipped reliably. These are complementary layers: product hardening fixes what the code does; infrastructure maturity ensures the fixes stay fixed.
- **Boundary with Operations Naming Harmonization**: Operations naming is about operator-facing terminology consistency across product surfaces. Infrastructure debt is about developer-facing delivery tooling and process. Different audiences, different concerns.
- **Boundary with Admin Visual Language Canon**: The visual language canon mentions lightweight CI enforcement as a possible delivery mechanism for visual convention compliance. If this infrastructure candidate delivers CI, the visual canon can use it — but the CI pipeline itself is infrastructure scope, not visual-canon scope.
- **Dependencies**: None — this is foundational work that other candidates can build on. CI pipeline benefits every future spec by providing automated regression coverage. Static analysis benefits every future hardening spec by enforcing type-safety contractually.
- **Priority**: medium (high cumulative value for delivery confidence and maintainability, but individual items are execution-level tasks rather than product-architecture blockers; should be prioritized pragmatically alongside product work rather than treated as urgent or deferred indefinitely)
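
The test/runtime parity goal above could be backed by a guard test that fails fast when the suite is not running on the production engine. This is a hypothetical sketch assuming a Pest suite inside a booted Laravel application; the test name and its placement are illustrative only.

```php
<?php

use Illuminate\Support\Facades\DB;

// Hypothetical parity guard: fails when the default connection used by
// the test suite is not PostgreSQL (for example, when it is still SQLite).
it('runs the test suite against PostgreSQL', function (): void {
    expect(DB::connection()->getDriverName())->toBe('pgsql');
});
```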
---
## Covered / Absorbed
> Candidates that were previously qualified but are now substantially covered by existing specs, or were umbrella labels whose children have been promoted individually.
### Governance Architecture Hardening Wave (umbrella — dissolved)
- **Original source**: architecture audit 2026-03-15
- **Status**: Dissolved into individual candidates. Three of the four children are now tracked separately in Qualified: Queued Execution Reauthorization, Tenant-Owned Query Canon, and Livewire Context Locking. The fourth child (Findings Workflow Enforcement) is absorbed below.
- **Reference**: [../audits/tenantpilot-architecture-audit-constitution.md](../audits/tenantpilot-architecture-audit-constitution.md), [../audits/2026-03-15-audit-spec-candidates.md](../audits/2026-03-15-audit-spec-candidates.md)
### Findings Workflow Enforcement and Audit Backstop
- **Original source**: architecture audit 2026-03-15, candidate C
- **Status**: Largely absorbed by Spec 111 (findings workflow v2) which defines transition enforcement, timestamp tracking, reason validation, and audit logging. The remaining architectural enforcement gap (model-level bypass prevention) is a hardening follow-up to Spec 111, not a standalone spec-sized problem. Re-qualify only if enforcement softness surfaces as a concrete regression or audit finding.
### Workspace Chooser v2
- **Original source**: Spec 107 deferred backlog
- **Status**: Workspace chooser v1 is covered by Spec 107 + semantic fix in Spec 121. The v2 polish items (search, sort, favorites, pins, environment badges) remain tracked as an Inbox entry. Not qualified as a standalone spec candidate at current priority.
### Dashboard Polish (Enterprise-grade)
- **Original source**: Product review 2026-03-08
- **Status**: Core tenant dashboard is covered by Spec 058 (drift-first KPIs, needs attention, recent lists). Workspace-level landing is in progress via Spec 129. The remaining polish items (sparklines, compliance gauge, progressive disclosure) are tracked in Inbox. This was demoted because the candidate lacked a bounded spec scope — it read as a wish list rather than a specifiable problem.
---
## Planned
> Ready for spec creation. Waiting for slot in active work.
*(empty — move items here when prioritized for next sprint)*
---
## Template
```md
### Title
- **Type**: feature | polish | hardening | foundation | bug | research
- **Source**: chat | audit | coding discovery | customer feedback | spec N follow-up
- **Problem**:
- **Why it matters**:
- **Proposed direction**:
- **Dependencies**:
- **Priority**: low | medium | high
```