Implements spec 094 (assignment fetch/restore observability hardening):

- Adds OperationRun tracking for assignment fetch (during backup) and assignment restore (during restore execution)
- Normalizes failure codes/reason_code and sanitizes failure messages
- Ensures exactly one audit log entry per assignment restore execution
- Enforces correct guard/membership vs capability semantics on affected admin surfaces
- Switches assignment Graph services to depend on GraphClientInterface

Also includes Postgres-only FK defense-in-depth check and a discoverable `composer test:pgsql` runner (scoped to the FK constraint test).

Tests:

- `vendor/bin/sail artisan test --compact` (passed)
- `vendor/bin/sail composer test:pgsql` (passed)

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #113
Feature Specification: Assignment Operations Observability Hardening
Feature Branch: 094-assignment-ops-observability-hardening
Created: 2026-02-14
Status: Draft
Input: User description: "Harden assignment operation observability and close remaining authorization inconsistencies so operations are fully traceable, diagnosable, and correctly access-controlled."
Spec Scope Fields (mandatory)
- Scope: canonical-view
- Primary Routes:
  - Admin Monitoring → Operations (list + detail views)
  - Provider Connections list
  - Backup Sets list
  - Restore Runs list
  - Backup Items relationship view under a backup set
- Data Ownership: operational activity records must be scoped to the correct workspace and tenant context where applicable; no new domain entities are introduced.
- RBAC: all affected admin surfaces require workspace membership; actions that change configuration or trigger restores require explicit permissions as defined in the central capability registry.
Canonical-view Constraints
- Default behavior when tenant-context is active: Monitoring → Operations defaults to showing operational records for the currently selected tenant context (if one is selected).
- Entitlement checks: Monitoring → Operations MUST NOT reveal tenant-owned operational records unless the actor is entitled to that tenant scope within the active workspace context.
Clarifications
Session 2026-02-14
- Q: For FR-005 / SC-004, what should “identity scope” mean for deduping “active” runs? → A: Dedupe per tenant and operation target/scope.
- Q: For FR-003 (“summary counters”) what exact semantics should total / processed / failed follow? → A: total = items attempted; processed = succeeded; failed = failed.
- Q: For FR-004 (“stable failure code”), which convention should we standardize on for assignment fetch/restore runs? → A: Operation-specific code namespaces + normalized reason_code for the cause.
- Q: For FR-006 (identity scope = tenant + operation type + target/scope), which identifiers should define “target/scope” for these two operations? → A: Fetch targets the backup item (or policy version) identifier; restore targets the restore run identifier.
- Q: For US1 / FR-002 (“restore is observable/auditable”), what audit log granularity do you want for assignment restore? → A: One audit log entry per assignment restore execution (per restore run).
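The counter semantics clarified above (total = items attempted, processed = succeeded, failed = failed) can be illustrated with a minimal sketch. The `SummaryCounters` class and `summarize` helper are hypothetical names introduced for illustration only; they are not taken from the codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SummaryCounters:
    """Hypothetical container following the clarified counter semantics."""

    total: int = 0      # items attempted
    processed: int = 0  # items that succeeded
    failed: int = 0     # items that failed

    def record(self, succeeded: bool) -> None:
        # Every attempt counts toward total and toward exactly one bucket.
        self.total += 1
        if succeeded:
            self.processed += 1
        else:
            self.failed += 1


def summarize(results: List[bool]) -> SummaryCounters:
    """Fold a list of per-item outcomes into summary counters."""
    counters = SummaryCounters()
    for succeeded in results:
        counters.record(succeeded)
    return counters
```

Under these semantics `total` always equals `processed + failed`, which keeps partially completed runs interpretable.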
User Scenarios & Testing (mandatory)
User Story 1 — Observe assignment operations end-to-end (Priority: P1)
Workspace administrators need to see assignment-related operations (both read-only fetch and destructive restore) in Monitoring so they can confirm what ran, what changed, and why something failed, without relying on server logs.
Why this priority: Assignment restore is high-risk; missing visibility creates operational and audit gaps.
Independent Test: Trigger both an assignment fetch and an assignment restore; verify each produces a monitoring-visible run record with correct lifecycle and failure details.
Acceptance Scenarios:
- Given an administrator triggers an assignment fetch, When the operation starts and completes, Then Monitoring shows a run record with start/end timestamps, final outcome, and summary counters.
- Given an administrator triggers an assignment restore, When the operation starts and completes, Then Monitoring shows a run record including a clear indication that it was a change-making operation.
  - And exactly one audit log entry is written for the restore execution.
- Given an assignment fetch or restore fails due to an external dependency error, When the run completes, Then Monitoring shows a stable failure code and a sanitized, user-readable message.
- Given the same assignment operation is triggered multiple times concurrently for the same tenant and scope, When the system creates tracking records, Then the admin sees a single “active” run per identity (or an equivalent deduped representation).
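The last scenario, a single active run per identity, can be sketched as follows. This is an in-memory illustration only: `RunRegistry`, `Run`, the state names, and the operation type strings are assumptions, and a real implementation would presumably enforce uniqueness in the database rather than in process memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Identity scope as clarified: tenant + operation type + target/scope.
Identity = Tuple[str, str, str]

ACTIVE_STATES = {"queued", "running"}  # assumed lifecycle state names


@dataclass
class Run:
    identity: Identity
    state: str = "queued"


@dataclass
class RunRegistry:
    """Hypothetical registry that hands out at most one active run per identity."""

    _active: Dict[Identity, Run] = field(default_factory=dict)

    def ensure_run(self, tenant_id: str, operation_type: str, target_id: str) -> Run:
        identity = (tenant_id, operation_type, target_id)
        existing = self._active.get(identity)
        if existing is not None and existing.state in ACTIVE_STATES:
            # A duplicate dispatch reuses the run that is already active.
            return existing
        run = Run(identity)
        self._active[identity] = run
        return run

    def complete(self, run: Run) -> None:
        run.state = "completed"
        # Only drop the entry if it still points at this run.
        if self._active.get(run.identity) is run:
            del self._active[run.identity]
```

Two triggers for the same restore run return the same `Run` object, while a different restore run identifier, or a trigger after completion, starts a new one.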
User Story 2 — Enforce correct access control semantics on affected admin surfaces (Priority: P2)
Workspace administrators and platform operators must not be able to cross authentication “planes” accidentally, and the admin UI must not expose bypasses that let users initiate sensitive actions without authorization.
Why this priority: Prevents cross-plane leakage and closes known authorization inconsistencies.
Independent Test: Attempt access with the wrong authentication plane and with insufficient permissions; verify outcomes are deny-as-not-found (404) or forbidden (403) per policy.
Acceptance Scenarios:
- Given a user is authenticated in a different auth plane, When they attempt to access workspace-scoped admin routes, Then the response is deny-as-not-found (404).
- Given a user is not a member of the workspace, When they attempt to view backup items under a backup set, Then the response is deny-as-not-found (404) and does not reveal record existence.
- Given a user is a workspace member but lacks the required permission, When they attempt a protected action (such as managing provider connections), Then the response is forbidden (403).
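The ordering implied by these scenarios (plane check, then membership, then capability) can be sketched as below. The `Actor` shape, the plane labels, and the capability string are hypothetical; the point is only that both 404 checks run before the 403 check so that record existence is never revealed to non-members.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Actor:
    auth_plane: str                 # assumed labels, e.g. "workspace" or "platform"
    workspace_ids: FrozenSet[str]   # workspaces the actor is a member of
    capabilities: FrozenSet[str]    # permissions granted to the actor


def authorize(actor: Actor, workspace_id: str, required_capability: Optional[str] = None) -> int:
    """Return the HTTP status a workspace-scoped admin route would answer with."""
    # 1. Wrong authentication plane: deny-as-not-found.
    if actor.auth_plane != "workspace":
        return 404
    # 2. Not a member of the workspace: deny-as-not-found, before any capability check.
    if workspace_id not in actor.workspace_ids:
        return 404
    # 3. Member without the required permission: forbidden.
    if required_capability is not None and required_capability not in actor.capabilities:
        return 403
    return 200
```

A non-member who also lacks the capability still receives 404, because the membership check is evaluated first.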
User Story 3 — Validate assignment operations safely in non-production contexts (Priority: P3)
Platform engineers need to validate that assignment operations behave correctly without requiring live external dependencies, so regressions can be caught early.
Why this priority: Improves reliability and reduces the risk of shipping changes that only fail when external services are slow/unavailable.
Independent Test: Run automated tests that simulate both successful and failing external interactions; verify monitoring records and authorization behaviors.
Acceptance Scenarios:
- Given external interactions are simulated as “successful”, When the operation runs, Then the run is marked successful and includes expected summary counters.
- Given external interactions are simulated as “failing”, When the operation runs, Then the run is marked failed with a stable failure code and sanitized message.
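One way to simulate both scenarios is to inject a fake client behind an interface. The sketch below mirrors the idea of depending on GraphClientInterface, but the `fetch_assignments` method, the `FakeGraphClient` class, the outcome labels, and the `assignments.fetch.failed` code are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Protocol


class GraphClient(Protocol):
    """Stand-in for an injected Graph client interface; the method name is hypothetical."""

    def fetch_assignments(self, policy_id: str) -> List[Dict[str, str]]: ...


class FakeGraphClient:
    """Test double that either returns canned assignments or raises a canned error."""

    def __init__(self, assignments: Optional[List[Dict[str, str]]] = None,
                 fail_with: Optional[Exception] = None) -> None:
        self.assignments = assignments or []
        self.fail_with = fail_with

    def fetch_assignments(self, policy_id: str) -> List[Dict[str, str]]:
        if self.fail_with is not None:
            raise self.fail_with
        return list(self.assignments)


def run_assignment_fetch(client: GraphClient, policy_ids: List[str]) -> dict:
    """Run a fetch over the given policies and return a run-like summary record."""
    run = {"outcome": None, "failure_code": None, "total": 0, "processed": 0, "failed": 0}
    for policy_id in policy_ids:
        run["total"] += 1
        try:
            client.fetch_assignments(policy_id)
            run["processed"] += 1
        except Exception:
            run["failed"] += 1
    if run["failed"] == 0:
        run["outcome"] = "succeeded"
    else:
        run["outcome"] = "failed"
        run["failure_code"] = "assignments.fetch.failed"  # assumed code namespace
    return run
```

Because the fake satisfies the same interface as the real client, the operation code under test does not need to know whether it is talking to a live dependency.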
Edge Cases
- External dependency timeouts, throttling, or transient failures.
- Retries and duplicate dispatches (ensure tracking remains coherent and non-spammy).
- Missing or inconsistent tenant/workspace context (must fail safely, not leak).
- Partial completion (some items processed, some failed): counters and failure details must remain interpretable.
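For the retry and duplicate-dispatch case, one possible way to keep audit output non-spammy is to key the audit entry on the restore run identifier so a retried job cannot write a second entry. The `AuditLog` class and the `assignment.restore` action name below are hypothetical; this is a sketch of the idea, not the project's audit implementation.

```python
from typing import Dict, List


class AuditLog:
    """Hypothetical audit sink that accepts at most one entry per restore run."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, str]] = {}

    def record_restore(self, restore_run_id: str, outcome: str) -> bool:
        """Write the entry if none exists yet; return whether a write happened."""
        if restore_run_id in self._entries:
            # A retry or duplicate dispatch: keep the original entry untouched.
            return False
        self._entries[restore_run_id] = {
            "action": "assignment.restore",  # assumed action name
            "restore_run_id": restore_run_id,
            "outcome": outcome,
        }
        return True

    def entries(self) -> List[Dict[str, str]]:
        return list(self._entries.values())
```

Calling `record_restore` twice for the same restore run leaves a single entry in the log.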
Requirements (mandatory)
Functional Requirements
- FR-001: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment fetch operation execution.
- FR-002: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment restore operation execution.
- FR-003: Tracking records MUST include lifecycle state (queued/running/completed), timestamps, outcome, and summary counters sufficient to understand progress and results.
  - Counter semantics: `total` = items attempted, `processed` = succeeded, `failed` = failed.
- FR-004: Failed runs MUST include a stable failure code and a sanitized, user-readable failure message.
  - Failure convention: use operation-specific `code` namespaces for the run, and store the underlying cause as a normalized `reason_code`.
- FR-005: The system MUST prevent duplicate “active” runs for the same tenant and identity scope (tenant + operation type + operation target/scope), or otherwise present a deduped representation that avoids operator confusion.
- FR-006: Identity scope MUST be defined as:
  - For assignment fetch operations: tenant + operation type + backup item identifier (or equivalent policy-version identifier).
  - For assignment restore operations: tenant + operation type + restore run identifier.
- FR-007: Monitoring pages MUST render using persisted operational data only (no outbound calls during page render).
- FR-008: Cross-plane access MUST be deny-as-not-found (404) on affected routes.
- FR-009: Authorization MUST follow consistent semantics:
  - non-member / not entitled → 404 (deny-as-not-found)
  - member without required permission → 403
- FR-010: Any action that can change configuration or trigger a restore MUST be server-authorized; UI visibility MUST NOT be treated as authorization.
- FR-011: Sensitive or destructive-like actions MUST require explicit confirmation.
- FR-012: Each assignment restore execution MUST write exactly one audit log entry for the restore run.
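The failure convention in FR-004 (an operation-specific code, a normalized `reason_code`, and a sanitized message) could look roughly like the sketch below. The reason code vocabulary, the redaction patterns, the `.failed` suffix, and the `normalize_failure` helper are assumptions chosen for illustration; the spec does not define them.

```python
import re

# Assumed mapping from exception type to a normalized reason_code.
REASON_CODES = {
    TimeoutError: "timeout",
    PermissionError: "forbidden",
    ConnectionError: "unreachable",
}

# Assumed redaction patterns for secrets that may leak into error text.
_SECRET_PATTERNS = [
    re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+"),
    re.compile(r"(?i)(client_secret|password|token)=[^\s&]+"),
]


def normalize_failure(operation_type: str, exc: Exception, max_len: int = 200) -> dict:
    """Build a stable failure record from an exception raised during a run."""
    reason = "unknown"
    for exc_type, code in REASON_CODES.items():
        if isinstance(exc, exc_type):
            reason = code
            break

    message = str(exc)
    for pattern in _SECRET_PATTERNS:
        message = pattern.sub("[redacted]", message)
    # Collapse whitespace and cap the length so the message stays readable in the UI.
    message = " ".join(message.split())[:max_len]

    return {
        "code": f"{operation_type}.failed",  # operation-specific namespace
        "reason_code": reason,               # normalized underlying cause
        "message": message,                  # sanitized, user-readable text
    }
```

Keeping `code` tied to the operation and `reason_code` tied to the cause lets operators filter by either dimension without parsing the message.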
Dependencies & Assumptions
- Assignment fetch and assignment restore operations can be triggered by administrators or scheduled/queued execution paths.
- Monitoring users have access to the Operations area only when they have appropriate workspace membership and permissions.
- External dependency failures may occur and must be represented consistently (stable failure codes + sanitized messages).
UI Action Matrix (mandatory when the admin UI is changed)
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Provider Connections | Admin → Provider Connections | Create connection (permission-gated) | View connection details | Edit / Delete (permission-gated; destructive confirms) | None | Create connection | N/A | Save + Cancel | Yes | Ensures no authorization bypass exists on header/empty-state CTAs. |
| Backup Sets | Admin → Backups | None (unchanged) | View backup set | Actions unchanged | None | None | N/A | N/A | N/A | Only enforcement helper consistency is affected. |
| Restore Runs | Admin → Restores | None (unchanged) | View restore run | Actions unchanged | None | None | N/A | N/A | Yes | Restore operations must be observable via Monitoring. |
| Backup Items | Under a backup set | None | View backup item details | Actions unchanged | None | None | N/A | N/A | N/A | Membership/404 checks must occur before capability/403 checks. |
| Monitoring → Operations | Admin → Monitoring | None | View operation run details | None | None | None | N/A | N/A | Yes | Read-only view; must not call external services during render. |
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: 100% of assignment fetch and assignment restore executions appear in Monitoring with a completed outcome (success or failure) and timestamps.
- SC-002: For failed runs, operators can identify a stable failure code and a readable failure message from Monitoring within 60 seconds, without checking server logs.
- SC-003: Automated tests verify deny-as-not-found (404) vs forbidden (403) semantics for the affected surfaces.
- SC-004: Duplicate active-run confusion is eliminated: repeated triggers produce a single active run per identity scope (tenant + operation type + operation target/scope) or equivalent deduped visibility.