ahmido 5bcb4f6ab8 feat: harden queued execution legitimacy (#179)

## Summary
- add a canonical queued execution legitimacy contract for actor-bound and system-authority operation runs
- enforce legitimacy before queued jobs transition runs to running across provider, inventory, restore, bulk, sync, and scheduled backup flows
- surface blocked execution outcomes consistently in Monitoring, notifications, audit data, and the tenantless operation viewer
- add Spec 149 artifacts and focused Pest coverage for legitimacy decisions, middleware ordering, blocked presentation, retry behavior, and cross-family adoption

## Testing
- vendor/bin/sail artisan test --compact tests/Unit/Operations/QueuedExecutionLegitimacyGateTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/QueuedExecutionMiddlewareOrderingTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Verification/ProviderExecutionReauthorizationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/RunInventorySyncExecutionReauthorizationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/ExecuteRestoreRunExecutionReauthorizationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/SystemRunBlockedExecutionNotificationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/BulkOperationExecutionReauthorizationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/QueuedExecutionRetryReauthorizationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/QueuedExecutionContractMatrixTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/OperationRunBlockedExecutionPresentationTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/QueuedExecutionAuditTrailTest.php
- vendor/bin/sail artisan test --compact tests/Feature/Operations/TenantlessOperationRunViewerTest.php
- vendor/bin/sail bin pint --dirty --format agent

## Manual validation
- validated queued provider execution blocking for tenant operability drift in the integrated browser on /admin/operations and /admin/operations/{run}
- validated 404 vs 403 route behavior for non-membership vs in-scope capability denial
- validated initiator-null blocked system-run behavior without creating a user terminal notification

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #179

2026-03-17 21:52:40 +00:00


Spec Candidates

Concrete future specs waiting for prioritization. Each entry has enough structure to become a real spec when the time comes.

Flow: Inbox → Qualified → Planned → Spec created → removed from this file

Last reviewed: 2026-03-17 (Enterprise App / SP Governance, SharePoint Sharing Governance, Entra Role Governance, Security Posture Signals added)

Inbox

Unfiltered. A short note is enough. Review weekly.

Dashboard trend visualizations (sparklines, compliance gauge, drift-over-time chart)
Dashboard "Needs Attention" should be visually louder (alert color, icon, severity weighting)
Dashboard enterprise polish: severity-weighted drift table, actionable alert buttons, progressive disclosure (demoted from Qualified — needs bounded scope before re-qualifying)
Operations table should show duration + affected policy count
Density control / comfortable view toggle for admin tables
Inventory landing page may be redundant — consider pure navigation section
Settings change history → explainable change tracking
Workspace chooser v2: search, sort, favorites, pins, environment badges, last activity

Qualified

Problem and benefit are clear. Scope is still open. Still needs prioritization.

Queued Execution Reauthorization and Scope Continuity

Type: hardening
Source: architecture audit 2026-03-15
Problem: Queued work still relies too heavily on dispatch-time actor and tenant state. Execution-time scope continuity and capability revalidation are not yet hardened as a canonical backend contract.
Why it matters: This is a backend trust-gap on the mutation path. It creates the class of failure where a UI action was valid at dispatch time but the queued execution is no longer legitimate when it runs.
Proposed direction: Define execution-time reauthorization, tenant operability rechecks, denial semantics, and audit visibility as a dedicated spec instead of scattering local authorize() patches.
Dependencies: Existing operations semantics, audit log foundation, queued job execution paths
Priority: high
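
As an illustration of the proposed direction, here is a minimal, language-neutral sketch in Python (the project itself is PHP/Laravel). Every name in it (`RunContext`, `Decision`, `check_legitimacy`, the reason codes, the capability string) is hypothetical. It only shows the idea of re-evaluating tenant operability, scope membership, and capability when the job runs, rather than trusting dispatch-time state.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class RunContext:
    """State re-read at execution time, not captured at dispatch (hypothetical shape)."""
    tenant_operable: bool
    actor_id: Optional[int]          # None stands for a system-authority run
    actor_is_member: bool            # actor still belongs to the tenant scope
    actor_capabilities: Set[str]
    required_capability: str


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: Optional[str] = None     # hypothetical reason code for audit/monitoring


def check_legitimacy(ctx: RunContext) -> Decision:
    """Decide whether a queued run may still transition to running."""
    if not ctx.tenant_operable:
        return Decision(False, "tenant_not_operable")
    if ctx.actor_id is None:
        # System-authority run: there is no actor to revalidate.
        return Decision(True)
    if not ctx.actor_is_member:
        return Decision(False, "actor_out_of_scope")
    if ctx.required_capability not in ctx.actor_capabilities:
        return Decision(False, "capability_denied")
    return Decision(True)
```

Returning a structured decision instead of raising keeps denial semantics in one place, so a blocked run can be recorded and surfaced the same way regardless of which job family hit the gate.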

Tenant-Owned Query Canon and Wrong-Tenant Guards

Type: hardening
Source: architecture audit 2026-03-15
Problem: Tenant isolation exists, but many reads still depend on local tenant_id filters instead of a reusable canonical query path. Wrong-tenant regression coverage is also uneven.
Why it matters: This is isolation drift. Repeated local filtering increases the chance of future cross-tenant mistakes across resources, widgets, actions, and detail pages.
Proposed direction: Define a canonical query entry pattern for tenant-owned models plus a required wrong-tenant regression matrix for tier-1 surfaces.
Dependencies: Canonical tenant context work in Specs 135 and 136
Priority: high
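
A rough sketch of what a canonical query entry could look like, written in Python for brevity rather than in the project's PHP. `Policy` and `TenantOwnedQuery` are hypothetical names, and an in-memory list stands in for the database. The point is that callers never apply a `tenant_id` filter themselves.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Policy:
    id: int
    tenant_id: int
    name: str


class TenantOwnedQuery:
    """Single entry point for reads; the tenant filter is applied once, here."""

    def __init__(self, rows: Iterable[Policy], tenant_id: int) -> None:
        self._rows = [row for row in rows if row.tenant_id == tenant_id]

    def all(self) -> List[Policy]:
        return list(self._rows)

    def find(self, row_id: int) -> Policy:
        for row in self._rows:
            if row.id == row_id:
                return row
        # A wrong-tenant id and a missing id look identical to the caller.
        raise LookupError(f"policy {row_id} not found")
```

A wrong-tenant regression test then reduces to one assertion per surface: looking up another tenant's record id through the canonical entry must fail exactly like a missing id.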

Livewire Context Locking and Trusted-State Reduction

Type: hardening
Source: architecture audit 2026-03-15
Problem: Complex Livewire and Filament flows still expose ownership-relevant context in public component state without one explicit repo-wide hardening standard.
Why it matters: This is a trust-boundary problem. Even without a known exploit, mutable client-visible identifiers and workflow context make future authorization and isolation mistakes more likely.
Proposed direction: Define a reusable hardening pattern for locked identifiers, server-derived workflow truth, and forged-state regression tests on tier-1 component families.
Dependencies: Managed tenant onboarding draft identity (Spec 138), onboarding lifecycle checkpoint work (Spec 140)
Priority: medium

Tenant Draft Discard Lifecycle and Orphaned Draft Visibility

Type: hardening
Source: domain architecture analysis 2026-03-16 — tenant lifecycle vs onboarding workflow lifecycle review
Problem: TenantPilot correctly separates durable tenant lifecycle (draft, onboarding, active, archived) from onboarding workflow lifecycle (draft → completed / cancelled), but there is no end-of-life path for abandoned draft tenants. When all onboarding sessions for a tenant are cancelled, the tenant reverts to draft and remains visible indefinitely without a semantically correct cleanup action. Archive/restore do not apply (draft tenants have no operational data worth preserving), and force delete requires archive first (which is semantically wrong for a provisional record). Operators cannot remove orphaned drafts.
Why it matters: Without a discard path, abandoned draft tenants accumulate as orphaned rows in the tenant list. This creates operator confusion (draft vs. archived vs. active ambiguity), data hygiene issues, and forces operators to either ignore stale records or misuse lifecycle actions that don't fit the domain semantics. The gap also makes tenant list UX harder to trust for enterprise operators managing many tenants.
Proposed direction:
- Introduce a canonical draft discardability contract (central service/policy, not scattered UI visibility logic) that determines whether a draft tenant may be safely removed, considering linked onboarding sessions, downstream artifacts, and operational traces
- Add a discard draft destructive action for tenant records in draft status with no resumable onboarding sessions, gated by the discardability contract, capability authorization (tenant.delete or a dedicated tenant.discard_draft), and confirmation modal
- Add an orphaned draft indicator to the tenant list/detail views — visual distinction between a resumable draft (has active session) and an abandoned draft (all sessions terminal or none exist)
- Emit a distinct audit event (tenant.draft_discarded) separate from tenant.force_deleted, capturing workspace context, tenant identifiers, linked session state, and acting user
- Preserve and reinforce the existing domain separation: archive/restore/force_delete remain reserved for durable tenant lifecycle; cancel/delete remain reserved for onboarding workflow lifecycle; discard is the new end-of-life action for provisional drafts
Key domain rules:
- archive = preserve durable tenant for compliance while removing from active use
- restore = reactivate an archived durable tenant
- force delete = permanently destroy an already archived durable tenant
- discard draft = permanently remove a provisional tenant that never became a durable operational entity
- Draft tenants must NOT become archivable or restorable
Safety preconditions for discard: tenant is in draft status, not trashed, no resumable onboarding sessions exist, no accumulated operational data (no policies, backups, operation runs beyond onboarding)
Out of scope: automatic cleanup without operator confirmation, retention policy for cancelled onboarding sessions, changes to the 4-state tenant lifecycle enum, changes to the 7-state onboarding session lifecycle enum
Dependencies: Spec 140 (onboarding lifecycle checkpoints — already shipped), Spec 143 (tenant lifecycle operability context semantics)
Related specs: Spec 138 (draft identity), Spec 140 (lifecycle checkpoints), Spec 143 (lifecycle operability semantics)
Priority: medium
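
The safety preconditions listed above could be centralized roughly as follows. This is a Python sketch under stated assumptions, not the project's PHP implementation: `DraftTenantState`, `discard_blockers`, and the blocker codes are hypothetical, and the counts would come from real queries.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DraftTenantState:
    status: str                  # one of: draft, onboarding, active, archived
    trashed: bool
    resumable_sessions: int      # onboarding sessions that can still be resumed
    operational_records: int     # policies, backups, runs beyond onboarding


def discard_blockers(state: DraftTenantState) -> List[str]:
    """Return every reason the tenant may not be discarded (empty means allowed)."""
    blockers: List[str] = []
    if state.status != "draft":
        blockers.append("not_draft")
    if state.trashed:
        blockers.append("trashed")
    if state.resumable_sessions > 0:
        blockers.append("resumable_sessions_exist")
    if state.operational_records > 0:
        blockers.append("operational_data_exists")
    return blockers


def can_discard(state: DraftTenantState) -> bool:
    return not discard_blockers(state)
```

Returning the full list of blockers, rather than a bare boolean, would let the UI explain why the discard action is unavailable and lets the same contract drive the orphaned-draft indicator.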

Exception / Risk-Acceptance Workflow for Findings

Type: feature
Source: HANDOVER gap analysis, Spec 111 follow-up
Problem: Finding has a risk_accepted status value but no formal exception lifecycle. Today, accepting risk is a status transition — there is no dedicated entity to record who accepted the risk, why, under what conditions, or when the acceptance expires. No approval workflow, no expiry/renewal semantics, no structured justification. Auditors cannot answer "who accepted this risk, what was the justification, and is it still valid?"
Why it matters: Enterprise compliance frameworks (ISO 27001, SOC 2, CIS) require documented, time-bounded risk acceptance with clear ownership. A bare status flag does not meet this bar. Without a formal exception lifecycle, risk acceptance becomes invisible to audit trails and impossible to govern at scale.
Proposed direction: First-class RiskException (or FindingException) entity linked to Finding, with: justification text, owner (actor), accepted_at, expires_at, renewal/reminder semantics, optional linkage to verification checks or related findings. Approval flow with capability-gated acceptance. Audit trail for creation, renewal, expiry, and revocation. Findings in risk_accepted state without a valid exception should surface as governance warnings.
Dependencies: Findings workflow (Spec 111) complete, audit log foundation (Spec 134)
Priority: high
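
To make the expiry semantics concrete, here is a small Python sketch (illustrative only; the project is PHP). The field names follow the proposed direction above, while `revoked_at`, `is_valid`, and `needs_governance_warning` are assumptions added for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RiskException:
    finding_id: int
    owner_id: int
    justification: str
    accepted_at: datetime
    expires_at: datetime
    revoked_at: Optional[datetime] = None   # assumed field for revocation

    def is_valid(self, now: datetime) -> bool:
        """Valid from acceptance until expiry, unless revoked."""
        return self.revoked_at is None and self.accepted_at <= now < self.expires_at


def needs_governance_warning(
    status: str, exception: Optional[RiskException], now: datetime
) -> bool:
    """A risk_accepted finding without a currently valid exception is a warning."""
    if status != "risk_accepted":
        return False
    return exception is None or not exception.is_valid(now)


now = datetime(2026, 3, 17, tzinfo=timezone.utc)
print(needs_governance_warning("risk_accepted", None, now))  # True: no exception recorded
```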

Evidence Domain Foundation

Type: feature
Source: HANDOVER gap, R2 theme completion
Problem: Review pack export (Spec 109) and permission posture reports (104/105) exist as separate output artifacts. There is no first-class evidence domain model that curates, bundles, and tracks these artifacts as a coherent compliance deliverable for external audit submission.
Why it matters: Enterprise customers need a single, versioned, auditor-ready package — not a collection of separate exports assembled manually. The gap is not export packaging (Spec 109 handles that); it is the absence of an evidence domain layer that owns curation, completeness tracking, and audit-trail linkage.
Proposed direction: Evidence domain model with curated artifact references (review packs, posture reports, findings summaries, baseline governance snapshots). Completeness metadata. Immutable snapshots with generation timestamp and actor. Not a re-implementation of export — a higher-order assembly layer.
Dependencies: Review pack export (109), permission posture (104/105)
Priority: high

Enterprise App / Service Principal Governance

Type: feature
Source: platform domain coverage planning, governance gap analysis
Problem: TenantPilot covers tenant configuration and governance workflows, but lacks a first-class governance surface for enterprise applications and service principals. Operators cannot easily answer which app identities exist, which ones hold privileged permissions, which credentials are nearing expiry, and where renewal/review workflows are needed.
Why it matters: Enterprise apps and service principals are a major governance and security pain point in Microsoft cloud environments. Expiring secrets/certificates, over-privileged app permissions, and unclear ownership create real audit, operational, and risk-management gaps. This is highly relevant for MSP reviews, customer reporting, and exception workflows.
Proposed direction: Add a governance-oriented domain surface for enterprise applications and service principals, starting with inventory, privileged-permission visibility, expiring credential visibility, ownership/review metadata, alerting hooks, and exception/renewal workflow support. Keep the scope centered on governance and reviewability rather than trying to model all enterprise app administration.
Dependencies: Evidence/reporting direction, alerting foundations, RBAC/capability model, domain coverage strategy
Priority: high

SharePoint Sharing Governance

Type: feature
Source: platform domain coverage planning, audit/compliance positioning
Problem: TenantPilot currently focuses on device and identity governance domains, but does not yet cover one of the most audit-relevant Microsoft 365 data-governance control surfaces: tenant-level SharePoint and OneDrive external sharing settings. Operators lack a governance view for high-risk sharing posture at tenant scope.
Why it matters: Tenant-level sharing controls are central to data exposure, external collaboration, and audit readiness. For many customers, especially compliance-oriented SMB and midmarket environments, these settings are part of the core governance story and should not remain outside the platform's planned coverage.
Proposed direction: Introduce a bounded governance surface for tenant-level SharePoint and OneDrive sharing/access settings, focused on inventory, reviewability, explainability, and later alignment with evidence/reporting workflows. Start at tenant-level controls rather than attempting full site-level administration or a broad SharePoint management surface.
Dependencies: Domain coverage strategy, Microsoft 365 policy-domain expansion, reporting/evidence direction
Priority: medium

Entra Role Governance

Type: feature
Source: platform domain coverage planning, identity governance expansion
Problem: TenantPilot does not yet provide a first-class governance surface for Microsoft Entra roles. Built-in roles, custom role definitions, and role assignments are highly relevant for identity governance, but today they are not planned as a dedicated product capability.
Why it matters: Role governance is a central part of tenant security posture, privileged access control, and audit readiness. Customers need visibility into how administrative authority is defined and assigned, especially as Entra role usage grows beyond default out-of-the-box roles.
Proposed direction: Add a first-class Entra role governance capability focused on role definitions and assignments as governable objects. Start with inventory, visibility, and review-oriented explainability. Preserve the possibility of future attestation/review workflows without making them mandatory in V1.
Dependencies: Identity governance expansion, RBAC/capability model, reporting/evidence direction
Priority: medium

Security Posture Signals Foundation

Type: feature
Source: platform domain coverage planning, compliance/readiness reporting direction
Problem: TenantPilot's evidence and reporting direction is strong, but high-value security posture signals such as Defender Vulnerability Management exposure data and backup assurance signals are not yet represented as a bounded product capability. This leaves a gap between governance findings and the operational evidence customers want in recurring reviews.
Why it matters: Customers and MSP operators increasingly want proof that security operations are functioning, not just that configurations exist. Exposure trends, vulnerability posture, and backup success/failure signals are highly valuable inputs for executive reviews, customer reporting, and audit preparation.
Proposed direction: Establish a bounded evidence/signal foundation for ingesting, historizing, correlating, and reporting on selected posture signals, starting with Defender Vulnerability Management and backup success/failure/protection-state signals. Keep this clearly in the evidence domain, not the policy domain.
Dependencies: StoredReports/Evidence direction, signal ingestion foundations, reporting/export maturity
Priority: medium

Policy Lifecycle / Ghost Policies (Spec 900 refresh)

Type: feature
Source: Spec 900 draft (2025-12-22), HANDOVER risk #9
Problem: Policies deleted in Intune remain in TenantAtlas indefinitely. No deletion indicators. Backup items reference "ghost" policies.
Why it matters: Data integrity, user confusion, backup reliability
Proposed direction: Soft delete detection during sync, auto-restore on reappear, "Deleted" badge, restore from backup. Draft in Spec 900.
Dependencies: Inventory sync stable
Priority: medium
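
The detection step in the proposed direction can be sketched as a set comparison. This Python snippet is illustrative only; `reconcile_deletions` is a hypothetical name, and it assumes the remote id set comes from a complete, successful sync (a partial sync would wrongly flag policies as deleted).

```python
from typing import Dict, List, Set, Tuple


def reconcile_deletions(
    local_deleted: Dict[str, bool], remote_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Compare local policies against the ids seen in the latest full sync.

    local_deleted maps policy id -> whether it is currently soft-deleted locally.
    Returns (newly_deleted, restored), each sorted for stable output.
    """
    newly_deleted = sorted(
        pid for pid, deleted in local_deleted.items()
        if not deleted and pid not in remote_ids
    )
    restored = sorted(
        pid for pid, deleted in local_deleted.items()
        if deleted and pid in remote_ids
    )
    return newly_deleted, restored
```

Policies in the first list would get the "Deleted" badge; policies in the second list cover the auto-restore-on-reappear case.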

Schema-driven Secret Classification

Type: hardening
Source: Spec 120 deferred follow-up
Problem: Secret redaction currently uses pattern-based detection. A schema-driven approach via GraphContractRegistry metadata would be more reliable.
Why it matters: Reduces false negatives in secret redaction
Proposed direction: Central classifier in GraphContractRegistry, regression corpus
Dependencies: Secret redaction (120) stable, registry completeness (095)
Priority: medium
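
A minimal sketch of the schema-driven idea, in Python for illustration. The registry contents, the policy type key, and the field names are all made up; the actual GraphContractRegistry metadata format is not described here.

```python
from typing import Any, Dict, Set

# Hypothetical registry metadata: policy type -> field names classified as secret.
SECRET_FIELDS: Dict[str, Set[str]] = {
    "hypothetical.wifi_profile": {"preSharedKey"},
}


def redact(policy_type: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Redact fields by declared classification instead of by value pattern."""
    secret = SECRET_FIELDS.get(policy_type, set())
    return {
        key: ("[REDACTED]" if key in secret else value)
        for key, value in payload.items()
    }
```

In this sketch an unregistered policy type passes through unredacted, which is why the candidate depends on registry completeness; a real classifier might keep pattern-based detection as a fallback for unknown types.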

Cross-Tenant Compare & Promotion

Type: feature
Source: Spec 043 draft, 0800-future-features
Problem: No way to compare policies between tenants or promote configurations from staging to production.
Why it matters: Core MSP/enterprise workflow. Identified as top revenue lever in brainstorming.
Proposed direction: Compare/diff UI, group/scope-tag mapping, promotion plan (preview → dry-run → cutover → verify)
Dependencies: Inventory sync, backup/restore mature
Priority: medium (high value, high effort)

System Console Scope Hardening

Type: hardening
Source: Spec 113/114 follow-up
Problem: The system console (/system) needs a clear cross-workspace entitlement model. Current platform capabilities (Spec 114) define per-surface access, but cross-workspace query authorization and scope isolation for platform operators are not yet hardened as a standalone contract.
Why it matters: Platform operators acting across workspaces need tight scope boundaries to prevent accidental cross-workspace data exposure in troubleshooting and monitoring flows.
Proposed direction: Formalize cross-workspace query authorization model, scope isolation rules for platform operator sessions, and regression coverage for wrong-workspace access in system console surfaces.
Dependencies: System console (114) stable, canonical tenant context (Specs 135/136)
Priority: low

System Console Multi-Workspace Operator UX

Type: feature
Source: Spec 113 deferred
Problem: The system console (/system) currently gives platform operators no way to select or filter across workspaces. Triage and monitoring require workspace-by-workspace navigation.
Why it matters: Platform ops need cross-workspace visibility for troubleshooting and monitoring at scale.
Proposed direction: Workspace selector/filter in system console views, cross-workspace run aggregation, unified triage entry point.
Dependencies: System console (114) stable, System Console Scope Hardening
Priority: low

Operations Naming Harmonization Across Run Types, Catalog, UI, and Audit

Type: hardening
Source: coding discovery, operations UX consistency review
Why it matters: Strategically important for enterprise UX, auditability, and long-term platform consistency. OperationRun is becoming a cross-domain execution and monitoring backbone, and the current naming drift will get more expensive as new run types and provider domains are added. This should reduce future naming drift, but it is not a blocker-critical refactor and should not be pulled in as a side quest during small UI changes.
Problem: Naming around operations appears historically grown and not consistent enough across OperationRunType values, visible run labels, OperationCatalog mappings, notifications, audit events, filters, badges, and related UI copy. Internal type names and operator-facing language are not cleanly separated, domain/object/verb ordering is uneven, and small UX fixes risk reinforcing an already inconsistent scheme. If left as-is, new run types for baseline, review, alerts, and additional provider domains will extend the inconsistency instead of converging it.
Desired outcome: A later spec should define a clear naming standard for OperationRunType, establish an explicit distinction between internal type identifiers and operator-facing labels, and align terminology across runs, notifications, audit text, monitoring views, and operations UI. New run types should have documented naming rules so they can be added without re-opening the vocabulary debate.
In scope: Inventory of current operation-related naming surfaces; naming taxonomy for internal identifiers versus visible operator language; conventions for verb/object/domain ordering; alignment rules for OperationCatalog, run labels, notifications, audit events, filters, badges, and monitoring UI; forward-looking rules for adding new run types and provider/domain families; a pragmatic migration plan that minimizes churn and preserves audit clarity.
Out of scope: Opportunistic mass-refactors during unrelated feature work; immediate renaming of all historical values without a compatibility plan; using a small UI wording issue such as "Sync from Intune" versus "Sync policies" as justification for broad churn; a full operations-domain rearchitecture unless later analysis proves it necessary.
Trigger / Best time to do this: Best tackled when multiple new run types are about to land, when OperationCatalog / monitoring / operations hub work is already active, when new domains such as Entra or Teams are being integrated, or when a broader UI naming constitution is ready to be enforced technically. This is a good candidate for a planned cleanup window, not an ad hoc refactor.
Risks if ignored: Continued terminology drift across UI and audit layers, higher cognitive load for operators, weaker enterprise polish, more brittle label mapping, and more expensive cleanup once additional domains and execution types are established. Audit/event language may diverge further from monitoring language, making cross-surface reasoning harder.
Suggested direction: Define stable internal run-type identifiers separately from visible operator labels. Standardize a single naming grammar for operation concepts, including when to lead with verb, object, or domain, and when provider-specific wording is allowed. Apply changes incrementally with compatibility-minded mapping rather than a brute-force rename of every historical string. Prefer a staged migration that first defines rules and mapping layers, then updates high-value operator surfaces, and only later addresses legacy internals where justified.
Readiness level: Qualified and strategically important, but intentionally deferred. This should be specified before substantially more run types and provider domains are introduced, yet it should not become an immediate side-track or be bundled into minor UI wording fixes.
Candidate quality:
- Clearly identified cross-cutting problem with architectural and UX impact
- Strong future-facing trigger conditions instead of vague "sometime later"
- Explicit boundaries to prevent opportunistic churn
- Concrete desired outcome without overdesigning the solution
- Easy to promote into a full spec once operations-domain work is prioritized
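
The separation of internal identifiers from operator-facing labels could look roughly like this. It is a Python sketch, not the project's PHP code: the three identifier values are the operation types mentioned elsewhere in this file, while the label texts and helper names are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class OperationRunType(Enum):
    """Stable internal identifiers; these values are stored and never shown raw."""
    INVENTORY_SYNC = "inventory_sync"
    PROVIDER_CONNECTION_CHECK = "provider.connection.check"
    COMPLIANCE_SNAPSHOT = "compliance.snapshot"


# Operator-facing labels live in one mapping layer (label texts are made up).
OPERATOR_LABELS: Dict[OperationRunType, str] = {
    OperationRunType.INVENTORY_SYNC: "Inventory sync",
    OperationRunType.PROVIDER_CONNECTION_CHECK: "Connection health check",
    OperationRunType.COMPLIANCE_SNAPSHOT: "Compliance snapshot",
}


def label_for(run_type: OperationRunType) -> str:
    return OPERATOR_LABELS[run_type]


def missing_labels() -> List[str]:
    """Identifiers without a label; a test can assert this stays empty."""
    return [t.value for t in OperationRunType if t not in OPERATOR_LABELS]
```

With a mapping layer like this, wording can change without touching stored identifiers, and a completeness test catches new run types that land without a label.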

Provider Connection Resolution Normalization

Type: hardening
Source: architecture audit – provider connection resolution analysis
Problem: The codebase has a dual-resolution model for provider connections. Gen 2 jobs (ProviderInventorySyncJob, ProviderConnectionHealthCheckJob, ProviderComplianceSnapshotJob) receive an explicit providerConnectionId and pass it through the ProviderOperationStartGate. Gen 1 jobs (ExecuteRestoreRunJob, EntraGroupSyncJob, SyncRoleDefinitionsJob, policy sync jobs, etc.) do NOT — their called services resolve the default connection at runtime via MicrosoftGraphOptionsResolver::resolveForTenant() or internal resolveProviderConnection() methods. This creates non-deterministic execution: a job dispatched against one connection may silently execute against a different one if the default changes between dispatch and execution. ~20 services use the Gen 1 implicit resolution pattern.
Why it matters: Non-deterministic credential binding is a correctness and audit gap. Enterprise customers need to know exactly which connection identity was used for every Graph API call. The implicit pattern also prevents connection-scoped rate limiting, error attribution, and consent-scope validation. This is the foundational refactor that unblocks all other provider connection improvements.
Proposed direction:
- Refactor all Gen 1 services to accept an explicit ProviderConnection (or providerConnectionId) parameter instead of resolving default internally
- Update all Gen 1 jobs to accept providerConnectionId at dispatch time (resolved at the UI/controller layer via ProviderOperationStartGate or equivalent)
- Deprecate MicrosoftGraphOptionsResolver — callers should use ProviderGateway::graphOptions($connection) directly
- Ensure provider_connection_id is recorded in every OperationRun context and audit event
- Standardize error handling: all resolution failures produce ProviderConnectionResolution::blocked() with structured ProviderReasonCodes, not mixed exceptions (ProviderConfigurationRequiredException, RuntimeException, InvalidArgumentException)
Known affected services (Gen 1 / implicit resolution): RestoreService (line 2913 internal resolveProviderConnection()), PolicySyncService (lines 58, 450), PolicySnapshotService (line 752), RbacHealthService (line 192), InventorySyncService (line 730 internal resolveProviderConnection()), EntraGroupSyncService, RoleDefinitionsSyncService, EntraAdminRolesReportService, AssignmentBackupService, AssignmentRestoreService, ScopeTagResolver, TenantPermissionService, VersionService, ConfigurationPolicyTemplateResolver, FoundationSnapshotService, FoundationMappingService, RestoreRiskChecker, PolicyCaptureOrchestrator, AssignmentFilterResolver, RbacOnboardingService, TenantConfigService
Known affected jobs (Gen 1 / no explicit connectionId): ExecuteRestoreRunJob, EntraGroupSyncJob, SyncRoleDefinitionsJob, SyncEntraAdminRolesJob, plus any job that calls a Gen 1 service
Gen 2 reference implementations (correct pattern): ProviderInventorySyncJob, ProviderConnectionHealthCheckJob, ProviderComplianceSnapshotJob — all receive providerConnectionId, pass through ProviderOperationStartGate, lock row, create OperationRun with connection in context
Key architecture components:
- ProviderConnectionResolver — correct, keep as-is. resolveDefault() returns ProviderConnectionResolution value object
- ProviderOperationStartGate — canonical dispatch-time gate, correct Gen 2 pattern. Handles 3 operation types: provider.connection.check, inventory_sync, compliance.snapshot
- MicrosoftGraphOptionsResolver — legacy bridge (32 lines), target for deprecation. Calls resolveDefault() internally, hides connection identity
- ProviderGateway — lower-level primitive, builds graph options from explicit connection. Correct, keep as-is
- ProviderIdentityResolver — resolves identity (platform vs dedicated) from connection. Correct, keep as-is
- Partial unique index on provider_connections: (tenant_id, provider) WHERE is_default = true
Out of scope: UX label changes, UI banners, legacy credential field removal (those are separate candidates below)
Dependencies: None — this is the foundational refactor
Related specs: Spec 081 (Tenant credential migration CI guardrails), Spec 088 (provider connection model), Spec 089 (provider gateway), Spec 137 (data-layer provider prep)
Priority: high
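
The non-determinism described in the problem statement can be reproduced in a few lines. This Python sketch uses hypothetical stand-ins (`ConnectionStore`, `Gen1Job`, `Gen2Job`) for the PHP classes; it only contrasts resolving the default at run time with binding the connection id at dispatch time.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ConnectionStore:
    """Stand-in for the default-connection lookup (tenant_id -> connection id)."""
    default_by_tenant: Dict[int, int]

    def resolve_default(self, tenant_id: int) -> Optional[int]:
        return self.default_by_tenant.get(tenant_id)


@dataclass(frozen=True)
class Gen1Job:
    tenant_id: int

    def run(self, store: ConnectionStore) -> Optional[int]:
        # Implicit: whichever connection is the default when the job executes.
        return store.resolve_default(self.tenant_id)


@dataclass(frozen=True)
class Gen2Job:
    tenant_id: int
    provider_connection_id: int      # bound once, at dispatch time

    def run(self, store: ConnectionStore) -> Optional[int]:
        # Explicit: the identity chosen at dispatch, regardless of later changes.
        return self.provider_connection_id


store = ConnectionStore({1: 10})
gen1 = Gen1Job(tenant_id=1)
gen2 = Gen2Job(tenant_id=1, provider_connection_id=store.resolve_default(1))
store.default_by_tenant[1] = 20      # default changes while the jobs are queued
print(gen1.run(store), gen2.run(store))  # 20 10
```

In this toy model the Gen 1 job silently executes against connection 20 although it was dispatched while 10 was the default, which is the audit gap the candidate describes.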

Provider Connection UX Clarity

Type: polish
Source: architecture audit – provider connection resolution analysis
Problem: The operator-facing language and information architecture around provider connections creates confusion about why a "default" connection is required, what happens when it's missing, and when actions are tenant-wide vs connection-scoped. Specific issues: (1) "Set as Default" is misleading — it implies preference, but the connection is actually the canonical operational identity; (2) missing-default errors surface as blocked OperationRun records or exceptions, but there is no proactive banner/hint on the tenant or connection pages; (3) action labels don't distinguish tenant-wide operations (verify, sync) from connection-scoped operations (health check, test); (4) the singleton auto-promotion (first connection becomes default automatically) is invisible — operators don't understand why their first connection was special.
Why it matters: Reduces support friction and operator confusion. Enterprise operators managing multiple tenants need clear, predictable language about connection lifecycle. The current UX makes the correct architecture feel like a bug ("why do I need a default?").
Proposed direction:
- Rename "Set as Default" → "Promote to Primary" (or "Set as Primary Connection") across all surfaces
- Add a missing-primary-connection banner on tenant detail / connection list when no default exists — with a direct "Promote" action
- Distinguish action labels: tenant-wide actions ("Sync Tenant", "Verify Tenant") vs connection-scoped actions ("Check Connection Health", "Test Connection")
- Improve blocked-notification copy: instead of generic "provider connection required", show "No primary connection configured for [Provider]. Promote a connection to continue."
- Show a transient success notification when auto-promotion happens on first connection creation ("This connection was automatically set as primary because it's the first for this provider")
- Consider an info tooltip or help text explaining the primary connection concept on the connection resource pages
Key surfaces to update: ProviderConnectionResource (row actions, header actions, table empty state), TenantResource (verify action, connection tab), onboarding wizard consent step, ProviderNextStepsRegistry remediation links, notification templates for blocked operations
Auto-default creation locations (4 places, need UX feedback): CreateProviderConnection action, TenantOnboardingController, AdminConsentCallbackController, ManagedTenantOnboardingWizard
Out of scope: Backend resolution refactoring (that's the normalization candidate above), legacy field removal
Dependencies: Soft dependency on "Provider Connection Resolution Normalization" — UX improvements are more coherent when the backend consistently uses explicit connections, but many label/banner changes can proceed independently
Related specs: Spec 061 (provider connection UX), Spec 088 (provider connection model)
Priority: medium

Provider Connection Legacy Cleanup

Type: hardening
Source: architecture audit – provider connection resolution analysis
Problem: After normalization is complete, several legacy artifacts remain: (1) MicrosoftGraphOptionsResolver — a 32-line convenience bridge that exists only because ~20 services haven't been updated to use explicit connections; (2) service-internal resolveProviderConnection() methods in RestoreService (line 2913), InventorySyncService (line 730), and similar — these are local resolution logic that should not exist once services receive explicit connections; (3) Tenant model legacy credential accessors (app_client_id, app_client_secret fields) — graphOptions() already throws BadMethodCallException, but the fields and accessors remain; (4) migration_review_required flag on ProviderConnection — used during the credential migration from tenant-level to connection-level, should be retired once all tenants are migrated.
Why it matters: Dead code increases cognitive load and creates false affordances. New developers may use MicrosoftGraphOptionsResolver or internal resolution methods thinking they're the correct pattern. Legacy credential fields on Tenant suggest credentials still live there. Cleaning up after normalization makes the correct architecture self-documenting.
Proposed direction:
- Remove MicrosoftGraphOptionsResolver class entirely (after normalization ensures zero callers)
- Remove all service-internal resolveProviderConnection() / resolveDefault() methods
- Remove legacy credential fields from Tenant model (migration to drop columns, update factory, update tests)
- Evaluate migration_review_required — if all tenants have migrated, remove the flag and related UI (banner, filter)
- Update CI guardrails: NoLegacyTenantGraphOptionsTest and NoTenantCredentialRuntimeReadsSpec081Test can be simplified or removed once the code they guard against is gone
- Verify no seeders, factories, or test helpers reference legacy patterns
Out of scope: Any new features — this is pure cleanup
Dependencies: Hard dependency on "Provider Connection Resolution Normalization" — cleanup cannot proceed until all callers are migrated
Related specs: Spec 081 (credential migration guardrails), Spec 088 (provider connection model), Spec 137 (data-layer provider prep)
Priority: medium (deferred until normalization is complete)

Tenant App Status False-Truth Removal

Type: hardening
Source: legacy / orphaned truth audit 2026-03-16
Classification: quick removal
Problem: Tenant.app_status is displayed in tenant UI as current operational truth even though production code no longer writes it. Operators can see a frozen "OK" or other stale badge that does not reflect the real provider connection state.
Why it matters: This is misleading operator-facing truth, not just dead schema. It creates false confidence on a tier-1 admin surface.
Target model: Tenant
Canonical source of truth: ProviderConnection.consent_status and ProviderConnection.verification_status
Must stop being read: Tenant.app_status in TenantResource table columns, infolist/details, filters, and badge-domain mapping.
Can be removed immediately:
- TenantResource reads of app_status
- tenant app-status badge domain / badge mapping usage
- factory defaults that seed app_status
Remove only after cutover:
- the tenants.app_status column itself, once all UI/report/export reads are confirmed gone
Migration / backfill: No backfill. One cleanup migration to drop app_status. app_notes may be dropped in the same migration only if it does not broaden the spec beyond tenant stale app fields.
UI / resource / policy / test impact:
- UI/resources: remove misleading badge and filter from tenant surfaces
- Policy: none
- Tests: update TenantFactory, remove assertions that treat app_status as live truth
Scope boundaries:
- In scope: remove stale tenant app-status reads and schema field
- Out of scope: provider connection UX redesign, credential migration, broader tenant health redesign
Dependencies: None required if the immediate operator-facing action is removal rather than replacement with a new tenant-level derived badge.
Risks: Low rollout risk. Main risk is short-term operator confusion about where to view connection health after removal.
Why it should be its own spec: This is the cleanest high-severity operator-trust fix in the repo. It is bounded, low-coupling, and should not wait for the larger provider cutover work.
Priority: high
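
The "must stop being read" rule above lends itself to a small source-scanning guard. The following is a minimal sketch in Python, shown for illustration rather than as project code (the repository's own guards are Pest tests). The regular expression and the in-memory `sources` mapping are assumptions; a real guard would read the files under the resource and badge directories.

```python
import re

# Assumed pattern: a quoted 'app_status' key or a ->app_status property read.
# The trailing \b keeps longer names such as app_status_history from matching.
LEGACY_READ = re.compile(r"""(['"]app_status['"]|->app_status\b)""")


def find_legacy_reads(sources: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, line_number) pairs where app_status is still referenced."""
    hits = []
    for path, text in sorted(sources.items()):
        for number, line in enumerate(text.splitlines(), start=1):
            if LEGACY_READ.search(line):
                hits.append((path, number))
    return hits
```

An empty result is the pass condition; any hit names the file and line that still treats the stale field as live truth.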

Provider Connection Status Vocabulary Cutover

Type: hardening
Source: legacy / orphaned truth audit 2026-03-16
Classification: bounded cutover
Problem: ProviderConnection currently exposes overlapping status vocabularies across status, health_status, consent_status, and verification_status. Resources, badges, and filters can read both projected legacy state and canonical enum state, creating drift and operator ambiguity.
Why it matters: This is duplicate status truth on an operator-facing surface. It also leaves the system vulnerable to projector drift if legacy projected fields stop matching the enum source of truth.
Target model: ProviderConnection
Canonical source of truth: ProviderConnection.consent_status and ProviderConnection.verification_status
Must stop being read: ProviderConnection.status and ProviderConnection.health_status in resources, filters, badges, and any operator-facing status summaries.
Can be removed immediately:
- new operator-facing reads of legacy varchar status fields
- new badge/filter logic that depends on normalized legacy values
Remove only after cutover:
- status and health_status columns
- projector persistence of those fields, if still retained for compatibility
- legacy badge normalization paths
Migration / backfill: No data backfill if enum columns are already complete. Requires a later schema cleanup migration to drop legacy varchar columns after all reads are migrated.
UI / resource / policy / test impact:
- UI/resources: ProviderConnectionResource and related badges/filters move to one coherent operator vocabulary
- Policy: none directly
- Tests: add exhaustive projection and badge mapping coverage during the transition; update resource/filter assertions to enum-driven behavior
Scope boundaries:
- In scope: provider connection status fields, display semantics, badge/filter vocabulary, deprecation path for projected columns
- Out of scope: tenant credential migration, provider onboarding flow redesign, unrelated badge cleanup elsewhere
Dependencies: Confirm all hidden read paths outside the main resource and define the operator-facing enum presentation.
Risks: Medium rollout risk. Filters, badges, and operator language change together, and hidden reads may exist outside the primary resource.
Why it should be its own spec: This is a self-contained source-of-truth cutover on one model. It is too important and too operationally visible to bury inside a generic provider cleanup spec.
Priority: high

Tenant Legacy Credential Source Decommission

Type: hardening
Source: legacy / orphaned truth audit 2026-03-16
Classification: staged migration
Problem: Tenant-level credential fields remain in the data model after ProviderCredential became the canonical identity store. They are still used for migration classification and are kept artificially alive by factory defaults, which obscures the real architecture and prolongs the cutover.
Why it matters: This is an incomplete architectural cutover around sensitive identity data. The system needs an explicit end-state where runtime credential resolution no longer depends on tenant legacy fields.
Target model: Tenant, with ProviderCredential as the destination canonical model
Canonical source of truth: ProviderCredential.client_id and ProviderCredential.client_secret
Must stop being read: tenant legacy credential fields in normal runtime credential resolution. Transitional reads remain allowed only inside migration-classification paths until exit criteria are met.
Can be removed immediately:
- factory defaults that populate legacy tenant credentials by default
- any non-classification runtime reads if discovered during spec work
- UI affordances that imply tenant-stored credentials are active
Remove only after cutover:
- Tenant.app_client_id, Tenant.app_client_secret, Tenant.app_certificate_thumbprint
- migration-classification reads and related transitional guardrails
Migration / backfill: Requires explicit completion criteria for the tenant-to-provider credential migration. No blind backfill; removal should follow confirmed migration review state for all affected tenants.
UI / resource / policy / test impact:
- UI/resources: remove any residual legacy credential messaging once the cutover is complete
- Policy: none directly
- Tests: TenantFactory must stop creating legacy credentials by default; transition-only tests should use explicit legacy states
Scope boundaries:
- In scope: tenant legacy credential fields, classification-only transition reads, factory/test cleanup tied to the cutover
- Out of scope: provider connection status vocabulary, unrelated tenant stale fields, onboarding UX redesign
Dependencies: Hard dependency on the provider credential migration/review lifecycle being complete enough to identify all remaining transitional tenants safely.
Risks: Higher rollout risk than simple cleanup because this touches credential-path architecture and transitional data needed for migration review.
Why it should be its own spec: This has distinct exit criteria, migration gating, and rollback concerns. It is not the same problem as stale operator-facing badges or provider status vocabulary cleanup.
Priority: high

Entra Group Authorization Capability Alignment

Type: hardening
Source: legacy / orphaned truth audit 2026-03-16
Classification: bounded cutover
Problem: EntraGroupPolicy currently grants read access based on tenant access alone and bypasses the capability layer used by the rest of the repo's authorization model.
Why it matters: This is a security- and RBAC-relevant inconsistency. Even if currently read-only, it weakens the capability-first architecture and increases the chance of future authorization drift.
Target model: EntraGroupPolicy and the Entra group read-access surface
Canonical source of truth: capability-based authorization decisions layered on top of tenant-access checks
Must stop being read: implicit "tenant access alone is sufficient" as the effective rule for Entra group read access.
Can be removed immediately:
- the direct bypass if the correct capability already exists and seeded roles already carry it
Remove only after cutover:
- any compatibility allowances needed while role-capability mappings are updated and verified
Migration / backfill: Usually no schema migration. May require role-capability seeding updates or RBAC backfill so intended operators retain access.
UI / resource / policy / test impact:
- UI/resources: some users may lose access if role mapping is incomplete; tenant-facing Entra group screens need regression verification
- Policy: this spec is the policy change
- Tests: add authorization matrix coverage proving tenant access alone no longer grants read access
Scope boundaries:
- In scope: read authorization semantics for Entra group surfaces and the required capability mapping
- Out of scope: new CRUD semantics, role mapping product UI, unrelated policy tidy-up
Dependencies: Choose the correct capability and verify seeded/default roles include it where intended.
Risks: Medium rollout risk because authorization mistakes become access regressions for legitimate operators.
Why it should be its own spec: This is a targeted RBAC hardening change with its own stakeholders, rollout checks, and regression matrix. It should not be hidden inside data or UI cleanup work.
Priority: high
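
The intended rule, a capability check layered on top of tenant access, can be sketched as follows. This is an illustrative Python model, not the actual EntraGroupPolicy; the capability name `entra_groups.view` and the `Actor` shape are hypothetical, since the spec still has to choose the correct capability.

```python
from dataclasses import dataclass, field

# Hypothetical capability name; the real one is a dependency of this spec.
ENTRA_GROUP_VIEW = "entra_groups.view"


@dataclass(frozen=True)
class Actor:
    tenant_ids: frozenset[str]
    capabilities: frozenset[str] = field(default_factory=frozenset)


def can_view_entra_groups(actor: Actor, tenant_id: str) -> bool:
    """Tenant access is necessary but no longer sufficient."""
    if tenant_id not in actor.tenant_ids:
        return False
    return ENTRA_GROUP_VIEW in actor.capabilities
```

The authorization matrix the spec calls for then reduces to four cases: tenant access with capability (allowed), tenant access without capability (denied), capability without tenant access (denied), and neither (denied).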

Support Intake with Context (MVP)

Type: feature
Source: Product design, operator feedback
Problem: Users have no structured way to report problems directly from within the product. For technical errors, run, tenant, and provider details are missing; for access or UX problems, route and RBAC context is missing. The result is inefficient support loops and follow-up questions. A full-fledged ticket system is the wrong priority at this stage.
Why it matters: Reduces support friction, improves the quality of captured reports, and raises perceived product maturity. Establishes a typed intake layer for later webhook, PSA, and ticketing extensions without introducing a helpdesk now.
Proposed direction: New SupportRequest model (not a ticket/case) with source_type (operation_run, provider_connection, access_denied, generic) and issue_kind (technical_problem, access_problem, ux_feedback, other). Three entry paths: (1) context-bound from a failed OperationRun, (2) access-denied/403 context, (3) generic feedback entry point (user menu). Automatic context snapshot via SupportRequestContextBuilder per source_type. Persistence before delivery. Email delivery to a configured support address. Fingerprint-based spam guard. Audit events. RBAC via the support.request.create capability. Scope isolation. Secret redaction in context_jsonb.
Dependencies: OperationRun-Domain stabil, RBAC/Capability-System (066+), Workspace-/Tenant-Scoping
Priority: medium
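
Two pieces of the proposed direction, secret redaction in the context snapshot and the fingerprint-based spam guard, can be sketched briefly. This is a Python illustration under stated assumptions, not the SupportRequestContextBuilder itself: the marker list, the fingerprint inputs, and the ten-minute window are all hypothetical choices.

```python
import hashlib
import json

# Assumed markers: any key containing one of these is treated as a secret.
SECRET_MARKERS = ("secret", "token", "password")


def redact(value):
    """Recursively replace secret-looking values before persisting context."""
    if isinstance(value, dict):
        return {
            key: "[redacted]"
            if any(marker in key.lower() for marker in SECRET_MARKERS)
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def fingerprint(user_id: str, source_type: str, issue_kind: str, message: str) -> str:
    """Hash the request identity; case and whitespace differences are ignored."""
    normalized = " ".join(message.lower().split())
    payload = json.dumps([user_id, source_type, issue_kind, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_duplicate(fp: str, seen: dict[str, float], now: float,
                 window_seconds: float = 600.0) -> bool:
    """True when the same fingerprint was submitted within the window."""
    last = seen.get(fp)
    seen[fp] = now  # every attempt refreshes the window (an assumed policy)
    return last is not None and now - last < window_seconds
```

Redacting before persistence means the stored context_jsonb never holds the raw secret, which fits the "persistence before delivery" ordering in the proposal.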

Policy Setting Explorer — Reverse Lookup for Tenant Configuration

Type: feature
Source: recurring enterprise pain point, governance/troubleshooting gap
Problem: In medium-to-large Intune tenants with dozens of policy types and hundreds of policies, admins routinely face the question: "Where is this setting actually defined?" Examples: "Which policy configures BitLocker?", "Where is EnableTPM set to true?", "Why does this tenant enforce a specific firewall rule, and which policy is the source?" Today, answering this requires manually opening policies one by one across device configuration, compliance, endpoint security, admin templates, settings catalog, and more. TenantPilot inventories and versions these policies but provides no reverse-lookup surface that maps a setting name, key, or value back to the policies that explicitly define it.
Why it matters: This is a governance, troubleshooting, and explainability gap — not a search convenience. Enterprise admins, auditors, and reviewers need authoritative answers to "where is X defined?" for incident triage, change review, compliance evidence, and duplicate detection. Without it, TenantPilot has deep policy data but cannot surface it from the operator's natural entry point (the setting, not the policy). This capability directly increases the product's value proposition for security reviews, audit preparation, and day-to-day configuration governance.
V1 scope:
- Tenant-scoped only. User queries settings within the active tenant's indexed policies. No cross-tenant or portfolio-wide search in V1.
- Dedicated working surface: a tenant-level "Policy Explorer" or "Setting Search" page with query input, filters, and structured result inspection. Not a global header search widget.
- Query modes: search by setting name/label, by raw key/path, or by value-oriented query (e.g. EnableTPM = true).
- Results display: policy name, policy type/family, setting label/path/key, configured value, version/snapshot context, deep link to the policy detail or version inspector.
- Supported policy families: start with a curated subset of high-value indexed families (settings catalog, device configuration, compliance, endpoint security baselines, admin templates). Not every Microsoft policy type from day one.
- Search projection model: a lightweight extracted-setting-facts table per supported policy family. Preserves policy-family-local structure, retains raw path/key, stores search-friendly displayable rows. PostgreSQL-first (GIN indexes on JSONB or dedicated columns as appropriate). Not a universal canonical key normalization engine — a pragmatic, product-oriented search projection.
- Trust boundary: results reflect settings explicitly present in supported indexed policies. UI must clearly communicate this scope. No-result does NOT imply the setting is absent from effective tenant configuration — only that it was not found in currently supported indexed policy families. This distinction must be visible in the UX (scope indicator, help text, empty-state copy).
- "Defined in" only: V1 answers "where is this setting explicitly defined?" — it does NOT answer "is this setting effectively applied to devices/users?" The difference between explicit definition and effective state must be preserved and communicated.
Explicit non-goals (V1):
- No universal cross-provider canonical setting ontology. Avoid a large fragile semantic mapping project. Setting identity stays policy-family-local in V1.
- No effective-state guarantees. V1 does not resolve assignment targeting, conflict resolution, or platform-side precedence.
- No portfolio / cross-tenant / workspace-wide scope.
- No dependency on external search infrastructure (Elasticsearch, Meilisearch, etc.) if PostgreSQL-first is sufficient.
- No naive raw JSON full-text search as the product surface. The projection model must provide structured, rankable, explainable results — not grep output.
- No requirement to support every Microsoft policy family from day one.
Architectural direction:
- Search projection layer: when policies are synced/versioned, extract setting facts into a dedicated search-friendly projection (e.g. policy_setting_facts table or JSONB-indexed structure). Each row captures: tenant_id, policy_id, policy_version_id (nullable), policy_type/family, setting_key/path, setting_label (display name where available), configured_value, raw_payload_path. Extraction logic is per-family, not a universal parser.
- PostgreSQL-first: use GIN indexes on JSONB or trigram indexes on text columns for efficient search. Evaluate pg_trgm for fuzzy matching.
- Extraction is append/rebuild on sync — not real-time transformation. Can be a post-sync projection step integrated into the existing inventory sync pipeline.
- Provider boundary stays explicit: the projection is populated by each policy family's extraction logic. No abstraction that pretends all policy families share the same schema.
- RBAC: tenant-scoped, gated by a capability (e.g. policy.settings.search). Results respect existing policy-level visibility rules.
- Audit: queries are loggable but do not require per-query audit entries in V1. The feature is read-only.
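
The projection idea above (per-family extractors feeding one search-friendly fact table) can be sketched in a few lines. This is a Python illustration, not the planned implementation: the payload shapes are deliberately simplified assumptions, real Microsoft Graph payloads are more deeply nested, and the in-memory list stands in for a PostgreSQL table such as policy_setting_facts.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass(frozen=True)
class SettingFact:
    tenant_id: str
    policy_id: str
    policy_family: str
    setting_key: str
    configured_value: Any
    raw_payload_path: str


# (setting_key, configured_value, raw_payload_path)
Extracted = tuple[str, Any, str]


def extract_settings_catalog(payload: dict) -> Iterator[Extracted]:
    # Assumed shape: {"settings": [{"definitionId": ..., "value": ...}]}
    for index, item in enumerate(payload.get("settings", [])):
        yield item["definitionId"], item["value"], f"settings[{index}].value"


def extract_device_configuration(payload: dict) -> Iterator[Extracted]:
    # Assumed shape: flat key/value properties plus metadata keys to skip.
    for key, value in payload.items():
        if key.startswith("@") or key in {"id", "displayName"}:
            continue
        yield key, value, key


# One extractor per supported family; there is no universal parser.
EXTRACTORS: dict[str, Callable[[dict], Iterator[Extracted]]] = {
    "settings_catalog": extract_settings_catalog,
    "device_configuration": extract_device_configuration,
}


def project_policy(tenant_id: str, policy_id: str, family: str,
                   payload: dict) -> list[SettingFact]:
    extractor = EXTRACTORS.get(family)
    if extractor is None:
        return []  # unsupported family: nothing is indexed
    return [SettingFact(tenant_id, policy_id, family, key, value, path)
            for key, value, path in extractor(payload)]


_ANY = object()  # sentinel meaning "match any value"


def search(facts: Iterable[SettingFact], key_fragment: str,
           value: Any = _ANY) -> list[SettingFact]:
    needle = key_fragment.lower()
    return [fact for fact in facts
            if needle in fact.setting_key.lower()
            and (value is _ANY or fact.configured_value == value)]
```

An unsupported family yields no rows, which is exactly why an empty search result can only mean "not found in supported indexed families" and never "not configured".
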
UX direction:
- Primary surface: dedicated page under the tenant context (e.g. tenant → Policy Explorer or tenant → Setting Search). Full working surface with query input, optional filters (policy type, policy family, value match mode), and a results table.
- Result rows: policy name (linked), policy type badge, setting path/key, configured value, version indicator. Expandable detail or click-through to policy inspector.
- Empty state: clearly explains scope limitations ("No matching settings found in supported indexed policies. This does not mean the setting is absent from your tenant's effective configuration.").
- Scope indicator: persistent badge or label showing the search scope (e.g. "Searching N supported policy families in [tenant name]").
- Future quick-access entry point (e.g. command palette, header search shortcut) is a natural extension but not V1 scope.
Future expansion space (not V1):
- Semantic aliases / display-name normalization across families
- Duplicate / conflict detection hints ("this setting is defined in 3 policies")
- Assignment-aware enrichment ("this policy targets group X")
- Setting history / change timeline ("this value changed from false to true in version 4")
- Baseline / drift linkage ("this setting deviates from the CIS baseline")
- Workspace-wide / portfolio search across tenants
- Quick-access command palette entry point
Risks / notes:
- Extraction logic per policy family is the main incremental effort. Each new family supported requires a family-specific extractor. Start with the highest-value families and expand.
- Settings catalog policies have structured setting definitions that are relatively easy to extract. OMA-URI / admin template policies are less structured. The V1 family selection should favor extractability.
- The "no-result ≠ not configured" trust boundary is critical for enterprise credibility. Overpromising search completeness erodes trust.
- Projection freshness depends on sync frequency. Stale projections must be visually flagged if the tenant hasn't been synced recently.
Dependencies: Inventory sync stable, policy versioning (snapshots), tenant context model, RBAC capability system (066+)
Priority: high

Help Center / Documentation Surface

Type: feature
Source: product planning, operator support friction analysis
Problem: TenantPilot lacks a first-class in-product knowledge surface for operators. As the platform grows in governance depth, operators need contextual guidance, workflow explanations, role/capability explanations, remediation help, and product documentation without leaving the admin experience. Today, knowledge is fragmented across specs, internal docs, and implicit operator expectations.
Why it matters: Reduces support friction, improves operator onboarding, enables self-service resolution, and makes advanced governance features more understandable and adoptable. Provides a canonical product-help layer distinct from audit/evidence/reporting artifacts. This is a product maturity and support-efficiency capability, not a content management system.
Proposed direction:
- Markdown-based documentation stored in-repo, rendered inside the Filament admin product
- Global documentation search
- Contextual help entry points on relevant resources/pages (modal / slideover preview where appropriate)
- Clear separation between product help/knowledge and audit/report/evidence exports
- Workspace/tenant context awareness only where helpful for navigation, not to turn docs into tenant data
Explicit non-goals: Not a customer support ticket system. Not an audit pack feature. Not a generic CMS. Not a replacement for external knowledge bases if those exist separately.
Dependencies: Filament panel infrastructure, existing navigation/information architecture
Priority: medium

Documentation Generation Pipeline and Editorial Workflow

Type: feature
Source: product planning, documentation sustainability analysis
Problem: Even with a markdown-based knowledge layer, documentation quality and coverage will degrade without a lightweight authoring pipeline. The product needs a structured way to generate document skeletons/templates, support repeatable documentation workflows, and optionally use AI-assisted drafting without treating generated text as authoritative by default.
Why it matters: Without a documentation pipeline, docs become inconsistent, coverage drifts as features grow, teams fall back to ad hoc writing, the help layer becomes expensive to maintain, and future AI-assisted documentation lacks guardrails.
Proposed direction:
- Document skeleton or template generation (e.g. command/tooling such as docs:generate)
- Structured frontmatter / metadata expectations where useful
- Editorial states such as draft / needs review / published
- Explicit "AI draft needs review" semantics to distinguish generated drafts from canonical reviewed documentation
- Repo-native markdown workflow as the source of truth
Explicit non-goals: Not a replacement for careful documentation authorship. Not a public marketing content engine. Not a promise of autonomous documentation generation. This is an internal/product documentation pipeline and editorial guardrail layer.
Dependencies: Help Center / Documentation Surface (this candidate builds on the rendering/delivery surface)
Priority: low
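
The editorial states and frontmatter expectations above could be enforced with a very small gate. The sketch below is Python for illustration only; the frontmatter keys, the state names (`draft`, `needs_review`, `ai_draft_needs_review`, `published`), and the simple `key: value` parsing are assumptions, not a defined schema.

```python
# Assumed state names derived from the proposal; not a settled vocabulary.
EDITORIAL_STATES = {"draft", "needs_review", "ai_draft_needs_review", "published"}


def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a '---' delimited key: value header from the markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for index in range(1, len(lines)):
        line = lines[index]
        if line.strip() == "---":
            return meta, "\n".join(lines[index + 1:])
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return {}, text  # unterminated header: treat the whole text as body


def is_visible_in_help_center(meta: dict[str, str]) -> bool:
    """Only reviewed, published documents reach operators."""
    status = meta.get("status", "draft")
    if status not in EDITORIAL_STATES:
        raise ValueError(f"unknown editorial state: {status}")
    return status == "published"
```

Defaulting a missing status to `draft` keeps generated skeletons out of the help surface until someone explicitly promotes them.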

Drift Notifications Settings Surface

Type: feature
Source: product planning, governance alerting direction
Problem: TenantPilot has governance/alerting direction, but operators still lack a clear product surface to configure drift-related notification behavior in a predictable way. Without a dedicated settings experience, alert routing feels infrastructural rather than operator-manageable.
Why it matters: Operators need tenant/workspace-level control over how governance signals reach them — email, Microsoft Teams, severity-aware routing, notification fatigue reduction, and confidence that important drift events will not be silently missed. Especially relevant for MSP-style operations and ongoing tenant reviews.
Proposed direction:
- Dedicated settings-level drift notification management surface
- Delivery targets such as email and Teams
- Routing preferences by severity / event type where appropriately bounded
- Sensible defaults with cooldown / dedup / quiet-hours framing if those concepts already exist in the broader alerting direction
- Clear alignment with broader Alerts v1 direction, focused on the operator settings UX and configuration model
Explicit non-goals: Not a reinvention of the whole alerts engine. Not a generic notification center for every product event. This is the operator-facing configuration surface for drift/governance notifications.
Dependencies: Alerting v1 direction, drift detection foundation (Spec 044), tenant/workspace context model
Priority: medium
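
If cooldown, dedup, and quiet-hours concepts are adopted from the broader alerting direction, the delivery decision behind the settings surface might look like the following. This is a hedged Python sketch: the function names, the rule that `critical` events bypass quiet hours, and the per-event-key cooldown are all assumptions rather than decided behavior.

```python
from datetime import datetime, time, timedelta


def in_quiet_hours(moment: time, start: time, end: time) -> bool:
    """Support windows that wrap past midnight, e.g. 22:00 to 06:00."""
    if start <= end:
        return start <= moment < end
    return moment >= start or moment < end


def should_deliver(event_key: str, severity: str, now: datetime,
                   last_sent: dict[str, datetime], cooldown: timedelta,
                   quiet_start: time, quiet_end: time,
                   bypass_severity: str = "critical") -> bool:
    # Assumption: only the bypass severity may interrupt quiet hours.
    if severity != bypass_severity and in_quiet_hours(now.time(), quiet_start, quiet_end):
        return False
    previous = last_sent.get(event_key)
    if previous is not None and now - previous < cooldown:
        return False  # deduplicated within the cooldown window
    last_sent[event_key] = now
    return True
```

Suppressed events do not refresh the cooldown timestamp, so a burst of repeats cannot postpone the next legitimate delivery indefinitely.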

User Invitations and Directory-based User Selection

Type: feature
Source: product planning, access-management UX analysis
Problem: Workspace and tenant membership flows currently lack a polished enterprise-grade invitation and directory-assisted user selection experience. Operators should not need brittle manual steps to add the right person to the right workspace/tenant context.
Why it matters: Improves onboarding speed, operator/admin efficiency, correctness of membership assignment, enterprise credibility of the access-management UX, and future scalability of workspace/tenant administration.
Proposed direction:
- Directory-based user lookup / selection where supported
- Invitation flows initiated directly from membership management surfaces
- Invitation link / invitation lifecycle support
- Clear distinction between selecting an existing directory identity vs inviting a not-yet-active participant
- Alignment with existing RBAC / membership / workspace-first context model
Explicit non-goals: Not a full identity-provider redesign. Not a replacement for the Entra auth architecture. Not a generic address-book feature. This is a bounded access-administration workflow improvement.
Dependencies: RBAC/capability system (066+), workspace membership model, Entra identity integration
Priority: medium

Action Surface follow-up direction — The action-surface contract foundation (Specs 082, 090) and the follow-up taxonomy/viewer specs (143–146) are all fully implemented. The remaining gaps are not architectural redesign — they are incomplete adoption, missing decision criteria, and scope boundaries that haven't expanded to cover all product surfaces. The correct shape is: one foundation amendment to codify the missing rules and extend contract scope (v1.1), two compliance rollout specs to enroll currently-exempted surface families, and one targeted correction to fix the clearest remaining anti-pattern on a high-signal surface. This avoids reinventing the architecture, avoids umbrella "consistency" specs, and produces bounded, independently shippable work. TenantResource lifecycle-conditional actions and PolicyResource More-menu ordering are addressed by the updated foundation rules, not by standalone specs. Widgets, choosers, and pickers remain deferred/exempt.

Action Surface Contract v1.1 — Decision Criteria, Ordering Rules, and System Scope Extension

Type: foundation/spec amendment
Source: row interaction / action surface architecture analysis 2026-03-16
Problem: The action-surface contract (Spec 082) establishes profiles, slots, affordances, validator tests, and guard tests, but leaves three gaps: (1) no formal decision criteria for when a surface should use ClickableRow vs ViewAction vs PrimaryLinkColumn as its inspect affordance; (2) no ordering rules for actions inside the More menu (destructive-last, lifecycle position, stable grouping); (3) system-panel table surfaces are explicitly excluded from contract scope, meaning ~6 operational surfaces have no declaration and no CI coverage. The architecture is correct; it just cannot prevent inconsistent choices on new surfaces or catch drift on existing ones.
Why this is its own spec: This is a foundation amendment — it changes the rules that all other surfaces must follow. Rollout specs (system panel enrollment, relation manager enrollment) depend on this spec's updated rules existing first. Merging rollout work into a foundation amendment blurs the boundary between "what the rules are" and "who must comply."
In scope:
- Codify inspect-affordance decision tree (ClickableRow default, ViewAction exception criteria, PrimaryLinkColumn criteria) in docs/ui/action-surface-contract.md
- Define the "lone ViewAction" anti-pattern formally and add it to validator detection
- Codify More-menu action ordering rules (lifecycle actions, severity ordering, destructive-last)
- Extend contract scope so system-panel table surfaces are enrollable (not exempt by default)
- Add guidance that cross-panel surface taxonomy should converge where semantically equivalent
- Update ActionSurfaceValidator to enforce new criteria
- Update guard/contract tests to cover new rules
Non-goals:
- Retrofitting all existing system-panel pages (separate rollout spec)
- Retrofitting all relation managers (separate rollout spec)
- One-off resource-level fixes (those are tasks within rollout or correction specs)
- TenantResource or PolicyResource redesign (addressed by applying the updated rules, not by dedicated specs)
- Chooser/picker/widget contracts (remain deferred/exempt)
Depends on: Spec 082, Spec 090 (both fully complete — this extends their foundation)
Suggested order: First. All other candidates in this cluster depend on the updated rules.
Risk: Low. This adds rules and extends scope — it does not change existing compliant declarations.
Why this boundary is right: Foundation rules must be codified before rollout enforcement. Mixing rule definition with compliance rollout makes it impossible to review the rules independently and creates circular dependencies.
Priority: high
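
The More-menu ordering rule (destructive-last, stable grouping) is easy to state as a stable sort. The Python sketch below is illustrative only: the three group names and the choice to place lifecycle actions between general and destructive ones are assumptions, since the exact lifecycle position is what this spec would codify.

```python
# Assumed group ranks; the real taxonomy is defined by the v1.1 amendment.
GROUP_RANK = {"general": 0, "lifecycle": 1, "destructive": 2}


def order_more_menu(actions: list[tuple[str, str]]) -> list[str]:
    """Order (name, group) pairs by group rank, destructive actions last.

    sorted() is stable, so actions keep their declared order within a group.
    """
    ranked = sorted(actions, key=lambda action: GROUP_RANK[action[1]])
    return [name for name, _group in ranked]
```

A validator could compare a surface's declared order against this function's output and report any mismatch as a contract violation.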

System Panel Action Surface Contract Enrollment

Type: compliance rollout
Source: row interaction / action surface architecture analysis 2026-03-16
Problem: System-panel table surfaces (Ops/Runs, Ops/Failures, Ops/Stuck, Directory/Tenants, Directory/Workspaces, Security/AccessLogs) use recordUrl() consistently but have no ActionSurfaceDeclaration, no CI coverage, and are exempt from the contract by default. They are the largest family of undeclared table surfaces in the product.
Why this is its own spec: System-panel surfaces belong to a different panel with different operator audiences and potentially different profile requirements. Enrolling them is a distinct compliance effort from tenant-panel relation managers or targeted resource corrections. The scope is bounded and independently shippable.
In scope:
- Declare ActionSurfaceDeclaration for each system-panel table surface (~6 pages)
- Map to existing profiles where semantically correct (e.g., ListOnlyReadOnly for access logs, RunLog for ops run tables)
- Introduce new system-specific profiles only if existing profiles truly do not fit
- Remove enrolled system-panel pages from ActionSurfaceExemptions baseline
- Add guard test coverage for enrolled system surfaces
Non-goals:
- Tenant-panel resource declarations (already covered by Spec 090)
- Relation manager enrollment (separate candidate)
- Non-table system pages (dashboards, diagnostics, choosers)
- System-panel RBAC redesign
- Cross-workspace query authorization (tracked as "System Console Scope Hardening" candidate)
Depends on: Action Surface Contract v1.1 (must extend scope to system panel first)
Suggested order: Second, in parallel with "Run Log Inspect Affordance Alignment" after v1.1 is complete.
Risk: Low. These surfaces already behave consistently; this work adds formal declarations and CI coverage.
Why this boundary is right: System-panel enrollment is self-contained — it doesn't touch tenant-panel resources or relation managers. Completing it independently gives CI coverage over a currently-invisible surface family.
Priority: medium
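
The enrollment goal (every table surface is either declared or explicitly exempt, and the exemption baseline shrinks as surfaces enroll) can be expressed as a set comparison. This Python sketch is an illustration of that guard logic, not the existing guard test; the function name and the report keys are hypothetical.

```python
def audit_surfaces(discovered: set[str], declared: set[str],
                   exempt: set[str]) -> dict[str, set[str]]:
    """Compare discovered table surfaces against declarations and exemptions."""
    return {
        # Surfaces with neither a declaration nor an exemption.
        "undeclared": discovered - declared - exempt,
        # Surfaces that enrolled but were never removed from the baseline.
        "stale_exemptions": exempt & declared,
        # Baseline entries that no longer correspond to any surface.
        "unknown_exemptions": exempt - discovered,
    }
```

A guard passes when all three sets are empty, which forces each enrollment to remove its baseline entry in the same change.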

Relation Manager Action Surface Contract Enrollment

Type: compliance rollout
Source: row interaction / action surface architecture analysis 2026-03-16
Problem: Three relation managers (BackupItemsRelationManager, TenantMembershipsRelationManager, WorkspaceMembershipsRelationManager) are in the ActionSurfaceExemptions baseline with no declaration. They were exempted during initial rollout (Spec 090) because relation-manager-specific profile semantics were not yet settled. Three other relation managers already have declarations. The exemption should be reduced, not permanent.
Why this is its own spec: Relation managers have different interaction expectations than standalone list resources (context is always nested under a parent record, pagination/empty-state semantics differ, attach/detach may replace create/delete in some cases). Enrollment requires relation-manager-specific review of profile fit, not just copying resource-level declarations.
In scope:
- Declare ActionSurfaceDeclaration for each currently-exempted relation manager (3 components)
- Validate profile fit (RelationManager profile vs a more specific variant)
- Reduce ActionSurfaceExemptions baseline by removing enrolled relation managers
- Add guard test coverage
Non-goals:
- Redesigning backup item management UX
- Redesigning membership management UX
- Parent resource changes (TenantResource, WorkspaceResource)
- Full restore/backup domain redesign
- Introducing new relation managers
Depends on: Action Surface Contract v1.1 (for any updated profile guidance or relation-manager-specific ordering rules)
Suggested order: Third, after both v1.1 and System Panel Enrollment are complete. Lowest urgency because these surfaces are low-traffic and already functionally correct.
Risk: Low. These relation managers already work correctly. This adds formal compliance, not behavioral change.
Why this boundary is right: Relation manager enrollment is a distinct surface family with its own profile semantics. Mixing it with system-panel enrollment or targeted resource corrections would create an unfocused rollout spec.
Priority: low
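
The enrollment guard described in this candidate (and in System Panel Enrollment) comes down to set arithmetic over three lists: the components that exist, the components with a declaration, and the exemption baseline. Below is a minimal, language-neutral sketch in Python. It is an illustration only, not the project's actual guard test. `check_enrollment` and `GuardResult` are hypothetical names, and the real check would presumably live in the existing test suite against ActionSurfaceExemptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GuardResult:
    # Components that are neither declared nor exempted: the guard should fail.
    undeclared: frozenset[str]
    # Components that gained a declaration but were never removed from the
    # baseline: the guard should fail so the baseline actually shrinks.
    stale_exemptions: frozenset[str]


def check_enrollment(
    components: Iterable[str],
    declared: Iterable[str],
    exempt_baseline: Iterable[str],
) -> GuardResult:
    """Compare discovered components against declarations and exemptions."""
    all_components = set(components)
    declared_set = set(declared)
    exempt_set = set(exempt_baseline)
    return GuardResult(
        undeclared=frozenset(all_components - declared_set - exempt_set),
        stale_exemptions=frozenset(declared_set & exempt_set),
    )


if __name__ == "__main__":
    managers = [
        "BackupItemsRelationManager",
        "TenantMembershipsRelationManager",
        "WorkspaceMembershipsRelationManager",
    ]
    # One manager has been declared, but the baseline was not reduced.
    result = check_enrollment(
        managers,
        declared=["BackupItemsRelationManager"],
        exempt_baseline=managers,
    )
    print(sorted(result.stale_exemptions))
```

Treating a stale exemption as a failure, not just a warning, is what makes "reduce the baseline" enforceable: a declaration cannot land without the matching baseline removal.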

Run Log Inspect Affordance Alignment

Type: targeted surface correction
Source: row interaction / action surface architecture analysis 2026-03-16
Problem: OperationRunResource declares the RunLog profile with ViewAction as its inspect affordance. In practice, it renders a lone ViewAction in the actions column — the "lone ViewAction" anti-pattern identified in docs/ui/action-surface-contract.md. The row-click-first direction means this surface should use ClickableRow drill-down to the canonical tenantless viewer (OperationRunLinks::tenantlessView()), not a standalone View button. This surface is also inherited by the Monitoring/Operations page (which delegates to OperationRunResource::table()), so the fix propagates to both surfaces.
Why this is its own spec: This is the single highest-signal concrete violation of the action-surface contract direction. It is bounded to one resource declaration + one inherited page. It does not require rewriting the canonical viewer, redesigning the operations domain, or touching other monitoring surfaces. Keeping it separate from foundation amendments ensures it can ship quickly after v1.1 codifies the anti-pattern rule.
In scope:
- Change OperationRunResource inspect affordance from ViewAction to ClickableRow
- Verify recordUrl() points to the canonical tenantless viewer
- Remove the lone ViewAction from the actions column
- Confirm the change propagates correctly to Monitoring/Operations (which delegates to OperationRunResource::table())
- Update/add guard test assertion for the corrected declaration
Non-goals:
- Rewriting the canonical operation viewer (Spec 144 already complete)
- Broad operations UX redesign
- All monitoring pages (Alerts, Stuck, Failures are separate surfaces with distinct interaction models)
- RestoreRunResource alignment (currently exempted — separate concern)
- Action hierarchy / More-menu changes on this surface (belong to a general rollout, not this correction)
Depends on: Action Surface Contract v1.1 (for codified anti-pattern rule and ClickableRow-default guidance)
Suggested order: Second, in parallel with "System Panel Enrollment" after v1.1 is complete. This is the quickest win and the highest-signal correction.
Risk: Low. Single resource, no behavioral regression, no data model change.
Why this boundary is right: One resource, one anti-pattern, one fix. Expanding scope to "all run-log surfaces" or "all operation views" would turn a quick correction into a rollout spec and delay the most visible improvement.
Priority: medium

Admin Visual Language Canon — First-Party UI Convention Codification and Drift Prevention

Type: foundation
Source: admin UI consistency analysis 2026-03-17
Problem: TenantPilot has accumulated a strong set of first-party visual conventions across Filament resources, widgets, detail pages, badges, status indicators, action hierarchies, and operational surfaces. These conventions are emerging organically and are already broadly consistent — but they remain implicit. No canonical reference defines when to use native Filament patterns vs custom enterprise-detail compositions, which badge/status semantics apply to which domain states, how timestamps should render (since() vs absolute datetime vs contextual format), what the card/section/surface hierarchy rules are, which widget composition strategies are canonical, or where cross-panel visual divergence is intentional vs accidental. As the product's surface area grows — new policy families, new governance domains, new operational pages, new evidence/reporting surfaces — the risk is not current visual chaos but future drift caused by missing written selection criteria and decision rules.
Why it matters: Without a codified visual language reference, each new surface is a local design decision made without canonical guidance. This produces slow, cumulative inconsistency that becomes expensive to correct retroactively and degrades enterprise UX credibility. The problem is amplified by multi-agent development: multiple contributors (human and AI) cannot converge on implicit conventions they haven't seen documented. The value is not aesthetic — it is architectural: a canonical reference prevents divergent local choices, reduces review friction, accelerates new surface development, and establishes a stable foundation for the product's long-term visual identity without introducing third-party theme dependencies.
Proposed direction:
- Codify the existing first-party admin visual conventions as a canonical reference document (e.g. docs/ui/admin-visual-language.md or similar), covering:
  - Badge/status semantics: color mapping rules, icon usage criteria, domain-specific badge extraction patterns, when to use Filament native badge vs custom status composition
  - Timestamp rendering: decision rules for since() (relative) vs absolute datetime vs contextual format, with domain-specific overrides where justified
  - Action hierarchy: primary action vs header actions vs row actions vs bulk actions presentation conventions (complementing the Action Surface Contract's interaction-level rules with visual-level guidance)
  - Widget composition: selection criteria for stat cards, chart widgets, list widgets, and custom compositions; density and grouping rules
  - Surface/card/section hierarchy: when to use native Filament sections vs custom detail cards vs grouped infoblocks; nesting and visual weight rules
  - Enterprise-detail page composition: canonical structure for entity detail/view pages (header, metadata, status, content sections, related data)
  - Cross-panel visual divergence: explicit rules for where admin-panel and system-panel styling may diverge and where they must converge
  - Typography and spacing: canonical use of Filament's built-in text scales and spacing tokens; rules against ad hoc inline styles
- Establish guardrails against ad hoc local visual overrides (documented anti-patterns, PR review checklist items, or lightweight CI checks where practical)
- Explicitly state that native Filament v5 configuration and CSS hook classes remain the primary styling foundation; a thin first-party theme layer is only justified if native configuration proves insufficient for a documented, bounded set of requirements
- Explicitly reject third-party theme packages (e.g. Filament theme marketplace packages) as an architectural baseline unless separately justified by a dedicated evaluation spec with clear acceptance criteria
- Where existing conventions have already diverged, define the canonical choice and flag surfaces that need alignment (as future cleanup tasks, not as part of this spec's implementation scope)
In scope:
- Inventory of existing visual conventions across tier-1 admin surfaces (resources, detail pages, dashboards, operational views)
- Canonical reference document with decision rules and examples
- Anti-pattern catalog (known visual drift patterns to avoid)
- Lightweight enforcement strategy (review checklist, optional CI, or validator approach)
- Explicit architectural position on theme dependencies
Out of scope:
- Visual redesign of any existing surface (this is codification, not redesign)
- Aesthetic refresh or "make it look nicer" polish work
- Third-party theme evaluation, selection, or integration
- Broad Filament view publishing or deep customization layer
- Marketing/branding/identity work (this is internal admin UX, not external brand)
- Color palette redesign or new design-system creation
- Retrofitting all existing surfaces to strict compliance (alignment cleanup is tracked separately per surface)
Key architectural positions:
- Native Filament v5 remains the primary visual foundation. The product's visual identity is expressed through intentional use of native Filament configuration, not through override layers.
- CSS hook classes are the canonical customization mechanism where native configuration is insufficient. No publishing of Filament internal views for styling purposes.
- The main gap is missing canonical reference and decision rules, not missing components or missing technology.
- The value proposition is preventing future UI drift as more surfaces are added, not correcting a current visual crisis.
Dependencies: Action Surface Contract (Spec 082 / v1.1 candidate) for interaction-level conventions that this visual-level reference complements but does not duplicate. Operations Naming Harmonization candidate for operator-facing terminology alignment that is a distinct concern from visual conventions.
Related candidates: Action Surface Contract v1.1, Operations Naming Harmonization, Help Center / Documentation Surface (the visual language reference could eventually link from contextual help)
Trigger / best time to do this: Before the next wave of new governance domain surfaces (Entra Role Governance, Enterprise App Governance, SharePoint Sharing Governance, Evidence Domain) and before the Policy Setting Explorer UX, so those surfaces are built against documented canonical conventions rather than best-effort pattern matching.
Risks if ignored: Slow visual drift across surfaces, increasing review friction for new surfaces, divergent local conventions that become expensive to reconcile, weakened enterprise UX credibility as surface count grows, and higher cost of eventual systematic alignment.
Priority: medium
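
One of the "lightweight CI checks" mentioned above could be a scan of view templates for ad hoc inline `style` attributes. The sketch below, in Python, shows the idea under stated assumptions. `find_inline_styles` is a hypothetical helper, the file names are made up, and a real check would need an allowlist for any justified exceptions the canon defines.

```python
import re

# Matches a `style=` attribute, but not attributes that merely end in
# "style", such as `data-style=`.
INLINE_STYLE = re.compile(r'(?<![\w-])style\s*=\s*["\']', re.IGNORECASE)


def find_inline_styles(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, stripped line) for each inline style hit."""
    hits: list[tuple[str, int, str]] = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if INLINE_STYLE.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical template contents, keyed by path.
    sample = {
        "resources/views/example-a.blade.php": (
            '<div class="fi-section">\n'
            '<span style="color: red">x</span>\n'
        ),
        "resources/views/example-b.blade.php": '<div data-style="x"></div>\n',
    }
    for path, lineno, line in find_inline_styles(sample):
        print(f"{path}:{lineno}: {line}")
```

A line-based scan like this is deliberately crude. It is cheap to run on every PR and easy to reason about, but it would miss styles injected from PHP and could flag legitimate cases, so it fits best as a review aid alongside the checklist, not as a hard gate.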

Covered / Absorbed

Candidates that were previously qualified but are now substantially covered by existing specs, or were umbrella labels whose children have been promoted individually.

Governance Architecture Hardening Wave (umbrella — dissolved)

Original source: architecture audit 2026-03-15
Status: Dissolved into individual candidates. Three of the four children are now tracked separately in Qualified: Queued Execution Reauthorization, Tenant-Owned Query Canon, and Livewire Context Locking. The fourth child (Findings Workflow Enforcement) is absorbed below.
Reference: ../audits/tenantpilot-architecture-audit-constitution.md, ../audits/2026-03-15-audit-spec-candidates.md

Findings Workflow Enforcement and Audit Backstop

Original source: architecture audit 2026-03-15, candidate C
Status: Largely absorbed by Spec 111 (findings workflow v2) which defines transition enforcement, timestamp tracking, reason validation, and audit logging. The remaining architectural enforcement gap (model-level bypass prevention) is a hardening follow-up to Spec 111, not a standalone spec-sized problem. Re-qualify only if enforcement softness surfaces as a concrete regression or audit finding.

Workspace Chooser v2

Original source: Spec 107 deferred backlog
Status: Workspace chooser v1 is covered by Spec 107 + semantic fix in Spec 121. The v2 polish items (search, sort, favorites, pins, environment badges) remain tracked as an Inbox entry. Not qualified as a standalone spec candidate at current priority.

Dashboard Polish (Enterprise-grade)

Original source: Product review 2026-03-08
Status: Core tenant dashboard is covered by Spec 058 (drift-first KPIs, needs attention, recent lists). Workspace-level landing is in progress via Spec 129. The remaining polish items (sparklines, compliance gauge, progressive disclosure) are tracked in Inbox. This was demoted because the candidate lacked a bounded spec scope — it read as a wish list rather than a specifiable problem.

Planned

Ready for spec creation. Waiting for slot in active work.

(empty — move items here when prioritized for next sprint)

Template

### Title
- **Type**: feature | polish | hardening | bug | research
- **Source**: chat | audit | coding discovery | customer feedback | spec N follow-up
- **Problem**:
- **Why it matters**:
- **Proposed direction**:
- **Dependencies**:
- **Priority**: low | medium | high
