# Feature Specification: Assignment Operations Observability Hardening

**Feature Branch**: `094-assignment-ops-observability-hardening`
**Created**: 2026-02-14
**Status**: Draft
**Input**: User description: "Harden assignment operation observability and close remaining authorization inconsistencies so operations are fully traceable, diagnosable, and correctly access-controlled."

## Spec Scope Fields *(mandatory)*

- **Scope**: canonical-view
- **Primary Routes**:
  - Admin Monitoring → Operations (list + detail views)
  - Provider Connections list
  - Backup Sets list
  - Restore Runs list
  - Backup Items relationship view under a backup set
- **Data Ownership**: operational activity records must be scoped to the correct workspace and tenant context where applicable; no new domain entities are introduced.
- **RBAC**: all affected admin surfaces require workspace membership; actions that change configuration or trigger restores require explicit permissions as defined in the central capability registry.

### Canonical-view Constraints

- **Default behavior when tenant-context is active**: Monitoring → Operations defaults to showing operational records for the currently selected tenant context (if one is selected).
- **Entitlement checks**: Monitoring → Operations MUST NOT reveal tenant-owned operational records unless the actor is entitled to that tenant scope within the active workspace context.

## Clarifications

### Session 2026-02-14

- Q: For FR-005 / SC-004, what should “identity scope” mean for deduping “active” runs? → A: Dedupe per tenant and operation target/scope.
- Q: For FR-003 (“summary counters”) what exact semantics should total / processed / failed follow? → A: total = items attempted; processed = succeeded; failed = failed.
- Q: For FR-004 (“stable failure code”), which convention should we standardize on for assignment fetch/restore runs? → A: Operation-specific code namespaces + normalized reason_code for the cause.
- Q: For FR-006 (identity scope = tenant + operation type + target/scope), which identifiers should define “target/scope” for these two operations? → A: Fetch targets the backup item (or policy version) identifier; restore targets the restore run identifier.
- Q: For US1 / FR-002 (“restore is observable/auditable”), what audit log granularity do you want for assignment restore? → A: One audit log entry per assignment restore execution (per restore run).

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Observe assignment operations end-to-end (Priority: P1)

Workspace administrators need to see assignment-related operations (both read-only fetch and destructive restore) in Monitoring so they can confirm what ran, what changed, and why something failed, without relying on server logs.

**Why this priority**: Assignment restore is high-risk; missing visibility creates operational and audit gaps.

**Independent Test**: Trigger both an assignment fetch and an assignment restore; verify each produces a monitoring-visible run record with correct lifecycle and failure details.

**Acceptance Scenarios**:

1. **Given** an administrator triggers an assignment fetch, **When** the operation starts and completes, **Then** Monitoring shows a run record with start/end timestamps, final outcome, and summary counters.
2. **Given** an administrator triggers an assignment restore, **When** the operation starts and completes, **Then** Monitoring shows a run record including a clear indication that it was a change-making operation.
   - And exactly one audit log entry is written for the restore execution.
3. **Given** an assignment fetch or restore fails due to an external dependency error, **When** the run completes, **Then** Monitoring shows a stable failure code and a sanitized, user-readable message.
4. **Given** the same assignment operation is triggered multiple times concurrently for the same tenant and scope, **When** the system creates tracking records, **Then** the admin sees a single “active” run per identity (or an equivalent deduped representation).

---

### User Story 2 - Enforce correct access control semantics on affected admin surfaces (Priority: P2)

Workspace administrators and platform operators must not be able to cross authentication “planes” accidentally, and the admin UI must not expose bypasses that let users initiate sensitive actions without authorization.

**Why this priority**: Prevents cross-plane leakage and closes known authorization inconsistencies.

**Independent Test**: Attempt access with the wrong authentication plane and with insufficient permissions; verify outcomes are deny-as-not-found (404) or forbidden (403) per policy.

**Acceptance Scenarios**:

1. **Given** a user is authenticated in a different auth plane, **When** they attempt to access workspace-scoped admin routes, **Then** the response is deny-as-not-found (404).
2. **Given** a user is not a member of the workspace, **When** they attempt to view backup items under a backup set, **Then** the response is deny-as-not-found (404) and does not reveal record existence.
3. **Given** a user is a workspace member but lacks the required permission, **When** they attempt a protected action (such as managing provider connections), **Then** the response is forbidden (403).

---

### User Story 3 - Validate assignment operations safely in non-production contexts (Priority: P3)

Platform engineers need to validate that assignment operations behave correctly without requiring live external dependencies, so regressions can be caught early.

**Why this priority**: Improves reliability and reduces the risk of shipping changes that only fail when external services are slow/unavailable.

**Independent Test**: Run automated tests that simulate both successful and failing external interactions; verify monitoring records and authorization behaviors.

**Acceptance Scenarios**:

1. **Given** external interactions are simulated as “successful”, **When** the operation runs, **Then** the run is marked successful and includes expected summary counters.
2. **Given** external interactions are simulated as “failing”, **When** the operation runs, **Then** the run is marked failed with a stable failure code and sanitized message.

### Edge Cases

- External dependency timeouts, throttling, or transient failures.
- Retries and duplicate dispatches (ensure tracking remains coherent and non-spammy).
- Missing or inconsistent tenant/workspace context (must fail safely, not leak).
- Partial completion (some items processed, some failed): counters and failure details must remain interpretable.

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment fetch operation execution.
- **FR-002**: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment restore operation execution.
- **FR-003**: Tracking records MUST include lifecycle state (queued/running/completed), timestamps, outcome, and summary counters sufficient to understand progress and results.
  - Counter semantics: `total` = items attempted, `processed` = succeeded, `failed` = failed.
- **FR-004**: Failed runs MUST include a stable failure code and a sanitized, user-readable failure message.
  - Failure convention: use operation-specific `code` namespaces for the run, and store the underlying cause as a normalized `reason_code`.
- **FR-005**: The system MUST prevent duplicate “active” runs for the same tenant and identity scope (tenant + operation type + operation target/scope), or otherwise present a deduped representation that avoids operator confusion.
- **FR-006**: Identity scope MUST be defined as:
  - For assignment fetch operations: tenant + operation type + backup item identifier (or equivalent policy-version identifier).
  - For assignment restore operations: tenant + operation type + restore run identifier.
- **FR-007**: Monitoring pages MUST render using persisted operational data only (no outbound calls during page render).
- **FR-008**: Cross-plane access MUST be deny-as-not-found (404) on affected routes.
- **FR-009**: Authorization MUST follow consistent semantics:
  - non-member / not entitled → 404 (deny-as-not-found)
  - member without required permission → 403
- **FR-010**: Any action that can change configuration or trigger a restore MUST be server-authorized; UI visibility MUST NOT be treated as authorization.
- **FR-011**: Sensitive or destructive-like actions MUST require explicit confirmation.
- **FR-012**: Each assignment restore execution MUST write exactly one audit log entry for the restore run.

### Dependencies & Assumptions

- Assignment fetch and assignment restore operations can be triggered by administrators or scheduled/queued execution paths.
- Monitoring users have access to the Operations area only when they have appropriate workspace membership and permissions.
- External dependency failures may occur and must be represented consistently (stable failure codes + sanitized messages).

## UI Action Matrix *(mandatory when the admin UI is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Provider Connections | Admin → Provider Connections | Create connection (permission-gated) | View connection details | Edit / Delete (permission-gated; destructive confirms) | None | Create connection | N/A | Save + Cancel | Yes | Ensures no authorization bypass exists on header/empty-state CTAs. |
| Backup Sets | Admin → Backups | None (unchanged) | View backup set | Actions unchanged | None | None | N/A | N/A | N/A | Only enforcement helper consistency is affected. |
| Restore Runs | Admin → Restores | None (unchanged) | View restore run | Actions unchanged | None | None | N/A | N/A | Yes | Restore operations must be observable via Monitoring. |
| Backup Items | Under a backup set | None | View backup item details | Actions unchanged | None | None | N/A | N/A | N/A | Membership/404 checks must occur before capability/403 checks. |
| Monitoring → Operations | Admin → Monitoring | None | View operation run details | None | None | None | N/A | N/A | Yes | Read-only view; must not call external services during render. |

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: 100% of assignment fetch and assignment restore executions appear in Monitoring with a completed outcome (success or failure) and timestamps.
- **SC-002**: For failed runs, operators can identify a stable failure code and a readable failure message from Monitoring within 60 seconds, without checking server logs.
- **SC-003**: Automated tests verify deny-as-not-found (404) vs forbidden (403) semantics for the affected surfaces.
- **SC-004**: Duplicate active-run confusion is eliminated: repeated triggers produce a single active run per identity scope (tenant + operation type + operation target/scope) or equivalent deduped visibility.
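
As an illustration of what SC-003 and SC-004 are checking, the sketch below models the identity-scope dedupe from FR-005/FR-006 and the 404-before-403 ordering from FR-009. It is a minimal in-memory sketch, not the implementation: `IdentityScope`, `OperationRun`, `RunRegistry`, `authorize`, and the operation-type labels are all hypothetical names chosen for this example, and a real system would presumably enforce uniqueness in persistent storage rather than in a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


# All names in this block are hypothetical; the spec does not define an API.
@dataclass(frozen=True)
class IdentityScope:
    tenant_id: str
    operation_type: str  # assumed labels, e.g. "assignment.fetch" / "assignment.restore"
    target_id: str       # backup item id for fetch, restore run id for restore (FR-006)


@dataclass
class OperationRun:
    scope: IdentityScope
    status: str = "queued"  # queued -> running -> completed (FR-003)


class RunRegistry:
    """In-memory stand-in for persisted tracking records."""

    def __init__(self) -> None:
        self._active: Dict[IdentityScope, OperationRun] = {}

    def start(self, scope: IdentityScope) -> Tuple[OperationRun, bool]:
        """Return (run, created); reuse the active run for the same scope."""
        existing = self._active.get(scope)
        if existing is not None:
            return existing, False
        run = OperationRun(scope=scope)
        self._active[scope] = run
        return run, True

    def complete(self, run: OperationRun) -> None:
        """Mark the run completed and free its identity scope for new runs."""
        run.status = "completed"
        self._active.pop(run.scope, None)


def authorize(is_member: bool, is_entitled: bool, has_permission: bool) -> int:
    """Check membership/entitlement (404) before capability (403), per FR-009."""
    if not (is_member and is_entitled):
        return 404  # deny-as-not-found: do not reveal that the record exists
    if not has_permission:
        return 403
    return 200
```

Under these assumptions, two concurrent triggers for the same tenant, operation type, and target resolve to one active run, and a new run can only be created once the previous one completes. The `authorize` helper returns 404 for a non-member even when a permission flag is set, which mirrors the Backup Items note that membership checks must run before capability checks.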