Implements spec 094 (assignment fetch/restore observability hardening):

- Adds OperationRun tracking for assignment fetch (during backup) and assignment restore (during restore execution)
- Normalizes failure codes/reason_code and sanitizes failure messages
- Ensures exactly one audit log entry per assignment restore execution
- Enforces correct guard/membership vs capability semantics on affected admin surfaces
- Switches assignment Graph services to depend on GraphClientInterface

Also includes Postgres-only FK defense-in-depth check and a discoverable `composer test:pgsql` runner (scoped to the FK constraint test).

Tests:

- `vendor/bin/sail artisan test --compact` (passed)
- `vendor/bin/sail composer test:pgsql` (passed)

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #113

# Feature Specification: Assignment Operations Observability Hardening

**Feature Branch**: `094-assignment-ops-observability-hardening`
**Created**: 2026-02-14
**Status**: Draft
**Input**: User description: "Harden assignment operation observability and close remaining authorization inconsistencies so operations are fully traceable, diagnosable, and correctly access-controlled."

## Spec Scope Fields *(mandatory)*

- **Scope**: canonical-view
- **Primary Routes**:
  - Admin Monitoring → Operations (list + detail views)
  - Provider Connections list
  - Backup Sets list
  - Restore Runs list
  - Backup Items relationship view under a backup set
- **Data Ownership**: operational activity records must be scoped to the correct workspace and tenant context where applicable; no new domain entities are introduced.
- **RBAC**: all affected admin surfaces require workspace membership; actions that change configuration or trigger restores require explicit permissions as defined in the central capability registry.

### Canonical-view Constraints

- **Default behavior when tenant-context is active**: Monitoring → Operations defaults to showing operational records for the currently selected tenant context (if one is selected).
- **Entitlement checks**: Monitoring → Operations MUST NOT reveal tenant-owned operational records unless the actor is entitled to that tenant scope within the active workspace context.

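
The two constraints above can be read as a filter over persisted run records. The following is a minimal Python sketch under stated assumptions, not the project's implementation: the `OperationRun` fields, `visible_runs`, and `entitled_tenant_ids` are hypothetical names, and treating a record with no tenant as workspace-level is an assumption the spec does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationRun:
    id: int
    workspace_id: int
    tenant_id: Optional[int]  # None = workspace-level record (assumption)


def visible_runs(runs, workspace_id, entitled_tenant_ids, selected_tenant_id=None):
    """Return the runs an actor may see in Monitoring -> Operations."""
    result = []
    for run in runs:
        if run.workspace_id != workspace_id:
            continue  # never cross workspace boundaries
        if run.tenant_id is not None and run.tenant_id not in entitled_tenant_ids:
            continue  # tenant-owned record, actor not entitled to that tenant
        if selected_tenant_id is not None and run.tenant_id != selected_tenant_id:
            continue  # default to the currently selected tenant context
        result.append(run)
    return result
```

Because the entitlement check runs before the tenant-context default, selecting a tenant the actor is not entitled to yields an empty list rather than leaking records.
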
## Clarifications

### Session 2026-02-14

- Q: For FR-005 / SC-004, what should “identity scope” mean for deduping “active” runs? → A: Dedupe per tenant and operation target/scope.
- Q: For FR-003 (“summary counters”) what exact semantics should total / processed / failed follow? → A: total = items attempted; processed = succeeded; failed = failed.
- Q: For FR-004 (“stable failure code”), which convention should we standardize on for assignment fetch/restore runs? → A: Operation-specific code namespaces + normalized reason_code for the cause.
- Q: For FR-006 (identity scope = tenant + operation type + target/scope), which identifiers should define “target/scope” for these two operations? → A: Fetch targets the backup item (or policy version) identifier; restore targets the restore run identifier.
- Q: For US1 / FR-002 (“restore is observable/auditable”), what audit log granularity do you want for assignment restore? → A: One audit log entry per assignment restore execution (per restore run).

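
The identity-scope answers above (tenant + operation type + target) can be illustrated with a small in-memory registry. This is a hedged Python sketch only: `RunRegistry`, `ensure_run`, the operation type strings, and the choice of `queued`/`running` as the "active" states are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumption: a run counts as "active" while it is queued or running.
ACTIVE_STATES = {"queued", "running"}


@dataclass
class Run:
    identity: tuple  # (tenant_id, operation_type, target_id)
    state: str = "queued"


@dataclass
class RunRegistry:
    runs: list = field(default_factory=list)

    def ensure_run(self, tenant_id, operation_type, target_id):
        """Return the existing active run for this identity, or create one."""
        identity = (tenant_id, operation_type, target_id)
        for run in self.runs:
            if run.identity == identity and run.state in ACTIVE_STATES:
                return run  # dedupe: reuse the active run
        run = Run(identity)
        self.runs.append(run)
        return run
```

The sketch ignores concurrency; a persistent store would need the lookup and the insert to be atomic (for example, through a uniqueness constraint on the identity of active runs).
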
## User Scenarios & Testing *(mandatory)*

### User Story 1 - Observe assignment operations end-to-end (Priority: P1)

Workspace administrators need to see assignment-related operations (both read-only fetch and destructive restore) in Monitoring so they can confirm what ran, what changed, and why something failed, without relying on server logs.

**Why this priority**: Assignment restore is high-risk; missing visibility creates operational and audit gaps.

**Independent Test**: Trigger both an assignment fetch and an assignment restore; verify each produces a monitoring-visible run record with correct lifecycle and failure details.

**Acceptance Scenarios**:

1. **Given** an administrator triggers an assignment fetch, **When** the operation starts and completes, **Then** Monitoring shows a run record with start/end timestamps, final outcome, and summary counters.
2. **Given** an administrator triggers an assignment restore, **When** the operation starts and completes, **Then** Monitoring shows a run record including a clear indication that it was a change-making operation.
   - And exactly one audit log entry is written for the restore execution.
3. **Given** an assignment fetch or restore fails due to an external dependency error, **When** the run completes, **Then** Monitoring shows a stable failure code and a sanitized, user-readable message.
4. **Given** the same assignment operation is triggered multiple times concurrently for the same tenant and scope, **When** the system creates tracking records, **Then** the admin sees a single “active” run per identity scope (or an equivalent deduped representation).

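
The summary counters named in scenario 1 follow the semantics agreed in the clarifications (`total` = items attempted, `processed` = succeeded, `failed` = failed). A minimal Python sketch of that bookkeeping, assuming hypothetical names `SummaryCounters` and `run_items` and a per-item `handler` that raises on failure:

```python
from dataclasses import dataclass


@dataclass
class SummaryCounters:
    total: int = 0      # items attempted
    processed: int = 0  # items that succeeded
    failed: int = 0     # items that failed

    def record(self, succeeded: bool) -> None:
        self.total += 1
        if succeeded:
            self.processed += 1
        else:
            self.failed += 1


def run_items(items, handler):
    """Apply handler to each item and count attempts, successes, failures."""
    counters = SummaryCounters()
    for item in items:
        try:
            handler(item)
            counters.record(True)
        except Exception:
            # Hypothetical policy: one failing item does not abort the run.
            counters.record(False)
    return counters
```
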

---

### User Story 2 - Enforce correct access control semantics on affected admin surfaces (Priority: P2)

Workspace administrators and platform operators must not be able to cross authentication “planes” accidentally, and the admin UI must not expose bypasses that let users initiate sensitive actions without authorization.

**Why this priority**: Prevents cross-plane leakage and closes known authorization inconsistencies.

**Independent Test**: Attempt access with the wrong authentication plane and with insufficient permissions; verify outcomes are deny-as-not-found (404) or forbidden (403) per policy.

**Acceptance Scenarios**:

1. **Given** a user is authenticated in a different auth plane, **When** they attempt to access workspace-scoped admin routes, **Then** the response is deny-as-not-found (404).
2. **Given** a user is not a member of the workspace, **When** they attempt to view backup items under a backup set, **Then** the response is deny-as-not-found (404) and does not reveal record existence.
3. **Given** a user is a workspace member but lacks the required permission, **When** they attempt a protected action (such as managing provider connections), **Then** the response is forbidden (403).

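
The three scenarios above imply an ordering: plane and membership checks come first and yield 404, and the capability check comes last and yields 403. A small Python sketch of that decision order follows; the function name `authorize` and its boolean inputs are hypothetical simplifications of whatever guards and policies the application actually uses.

```python
def authorize(same_auth_plane: bool, is_member: bool, has_capability: bool) -> int:
    """Return the HTTP status for a protected admin action."""
    if not same_auth_plane:
        return 404  # cross-plane access: deny-as-not-found
    if not is_member:
        return 404  # non-member: do not reveal that the record exists
    if not has_capability:
        return 403  # member without the required permission
    return 200
```
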

---

### User Story 3 - Validate assignment operations safely in non-production contexts (Priority: P3)

Platform engineers need to validate that assignment operations behave correctly without requiring live external dependencies, so regressions can be caught early.

**Why this priority**: Improves reliability and reduces the risk of shipping changes that only fail when external services are slow/unavailable.

**Independent Test**: Run automated tests that simulate both successful and failing external interactions; verify monitoring records and authorization behaviors.

**Acceptance Scenarios**:

1. **Given** external interactions are simulated as “successful”, **When** the operation runs, **Then** the run is marked successful and includes expected summary counters.
2. **Given** external interactions are simulated as “failing”, **When** the operation runs, **Then** the run is marked failed with a stable failure code and sanitized message.

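
Simulating the external dependency is easiest when the operation depends on an interface rather than a concrete client. The Python sketch below illustrates the idea with a fake client; `GraphClient`, `FakeGraphClient`, `fetch_assignments`, and `run_fetch` are hypothetical names, and the sketch covers only outcome and counters, not failure codes.

```python
from typing import Protocol


class GraphClient(Protocol):
    """Hypothetical interface the fetch operation depends on."""

    def fetch_assignments(self, policy_id: str) -> list: ...


class FakeGraphClient:
    """Test double that either succeeds or fails on every call."""

    def __init__(self, fail: bool = False):
        self.fail = fail

    def fetch_assignments(self, policy_id: str) -> list:
        if self.fail:
            raise ConnectionError("simulated upstream failure")
        return [{"policy_id": policy_id, "target": "all-users"}]


def run_fetch(client: GraphClient, policy_ids: list) -> dict:
    """Run a fetch against the client and return a run summary."""
    run = {"outcome": "succeeded", "total": 0, "processed": 0, "failed": 0}
    for policy_id in policy_ids:
        run["total"] += 1
        try:
            client.fetch_assignments(policy_id)
            run["processed"] += 1
        except ConnectionError:
            run["failed"] += 1
            run["outcome"] = "failed"
    return run
```

A test can then call `run_fetch(FakeGraphClient(), [...])` for the success scenario and `run_fetch(FakeGraphClient(fail=True), [...])` for the failure scenario, with no network access in either case.
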
### Edge Cases

- External dependency timeouts, throttling, or transient failures.
- Retries and duplicate dispatches (ensure tracking remains coherent and non-spammy).
- Missing or inconsistent tenant/workspace context (must fail safely, not leak).
- Partial completion (some items processed, some failed): counters and failure details must remain interpretable.

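
Retries and duplicate dispatches interact with the requirement that a restore execution writes exactly one audit log entry. One way to keep the write idempotent is to key it on the restore run identifier, as in this Python sketch; the `AuditLog` class and its methods are hypothetical and stand in for whatever persistence the application uses.

```python
class AuditLog:
    """In-memory stand-in for an audit log keyed by restore run identifier."""

    def __init__(self):
        self._entries = {}

    def record_restore(self, restore_run_id: str, outcome: str) -> bool:
        """Write the audit entry once; return False on a repeated call."""
        if restore_run_id in self._entries:
            return False  # a retry or duplicate dispatch: do not write again
        self._entries[restore_run_id] = {
            "restore_run_id": restore_run_id,
            "outcome": outcome,
        }
        return True

    def entries(self) -> list:
        return list(self._entries.values())
```

With this shape, a job that is retried after already recording its entry leaves the log unchanged.
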
## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment fetch operation execution.
- **FR-002**: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment restore operation execution.
- **FR-003**: Tracking records MUST include lifecycle state (queued/running/completed), timestamps, outcome, and summary counters sufficient to understand progress and results.
  - Counter semantics: `total` = items attempted, `processed` = succeeded, `failed` = failed.
- **FR-004**: Failed runs MUST include a stable failure code and a sanitized, user-readable failure message.
  - Failure convention: use operation-specific `code` namespaces for the run, and store the underlying cause as a normalized `reason_code`.
- **FR-005**: The system MUST prevent duplicate “active” runs for the same identity scope (tenant + operation type + operation target/scope), or otherwise present a deduped representation that avoids operator confusion.
- **FR-006**: Identity scope MUST be defined as:
  - For assignment fetch operations: tenant + operation type + backup item identifier (or equivalent policy-version identifier).
  - For assignment restore operations: tenant + operation type + restore run identifier.
- **FR-007**: Monitoring pages MUST render using persisted operational data only (no outbound calls during page render).
- **FR-008**: Cross-plane access MUST be deny-as-not-found (404) on affected routes.
- **FR-009**: Authorization MUST follow consistent semantics:
  - non-member / not entitled → 404 (deny-as-not-found)
  - member without required permission → 403
- **FR-010**: Any action that can change configuration or trigger a restore MUST be server-authorized; UI visibility MUST NOT be treated as authorization.
- **FR-011**: Sensitive or destructive-like actions MUST require explicit confirmation.
- **FR-012**: Each assignment restore execution MUST write exactly one audit log entry for the restore run.

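
The FR-004 convention (an operation-specific `code` plus a normalized `reason_code` and a sanitized message) can be sketched as follows. This Python example is illustrative only: the mapping in `REASON_CODES`, the `.failed` suffix, the redaction pattern, and the 200-character limit are all assumptions, not values defined by this spec.

```python
import re

# Hypothetical mapping from exception type to a normalized reason_code.
REASON_CODES = {
    TimeoutError: "timeout",
    PermissionError: "forbidden",
    ConnectionError: "dependency_unavailable",
}

# Hypothetical redaction rule: hide anything that looks like a credential.
SECRET_PATTERN = re.compile(r"(bearer\s+|token=|secret=)\S+", re.IGNORECASE)


def build_failure(operation: str, error: Exception) -> dict:
    """Build the failure payload stored on a failed run."""
    reason = next(
        (code for exc_type, code in REASON_CODES.items() if isinstance(error, exc_type)),
        "unknown",
    )
    # Redact credential-like substrings, then cap the message length.
    message = SECRET_PATTERN.sub(r"\1[redacted]", str(error))[:200]
    return {
        "code": f"{operation}.failed",  # operation-specific namespace
        "reason_code": reason,          # normalized underlying cause
        "message": message,             # sanitized, user-readable text
    }
```

Keeping `code` stable per operation while varying only `reason_code` lets Monitoring group failures by operation and still show the underlying cause.
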
### Dependencies & Assumptions

- Assignment fetch and assignment restore operations can be triggered by administrators or scheduled/queued execution paths.
- Monitoring users have access to the Operations area only when they have appropriate workspace membership and permissions.
- External dependency failures may occur and must be represented consistently (stable failure codes + sanitized messages).

## UI Action Matrix *(mandatory when the admin UI is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Provider Connections | Admin → Provider Connections | Create connection (permission-gated) | View connection details | Edit / Delete (permission-gated; destructive confirms) | None | Create connection | N/A | Save + Cancel | Yes | Ensures no authorization bypass exists on header/empty-state CTAs. |
| Backup Sets | Admin → Backups | None (unchanged) | View backup set | Actions unchanged | None | None | N/A | N/A | N/A | Only enforcement helper consistency is affected. |
| Restore Runs | Admin → Restores | None (unchanged) | View restore run | Actions unchanged | None | None | N/A | N/A | Yes | Restore operations must be observable via Monitoring. |
| Backup Items | Under a backup set | None | View backup item details | Actions unchanged | None | None | N/A | N/A | N/A | Membership/404 checks must occur before capability/403 checks. |
| Monitoring → Operations | Admin → Monitoring | None | View operation run details | None | None | None | N/A | N/A | Yes | Read-only view; must not call external services during render. |

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: 100% of assignment fetch and assignment restore executions appear in Monitoring with a completed outcome (success or failure) and timestamps.
- **SC-002**: For failed runs, operators can identify a stable failure code and a readable failure message from Monitoring within 60 seconds, without checking server logs.
- **SC-003**: Automated tests verify deny-as-not-found (404) vs forbidden (403) semantics for the affected surfaces.
- **SC-004**: Duplicate active-run confusion is eliminated: repeated triggers produce a single active run per identity scope (tenant + operation type + operation target/scope) or equivalent deduped visibility.