# Feature Specification: Assignment Operations Observability Hardening
**Feature Branch**: `094-assignment-ops-observability-hardening`
**Created**: 2026-02-14
**Status**: Draft
**Input**: User description: "Harden assignment operation observability and close remaining authorization inconsistencies so operations are fully traceable, diagnosable, and correctly access-controlled."
## Spec Scope Fields *(mandatory)*
- **Scope**: canonical-view
- **Primary Routes**:
  - Admin Monitoring → Operations (list + detail views)
  - Provider Connections list
  - Backup Sets list
  - Restore Runs list
  - Backup Items relationship view under a backup set
- **Data Ownership**: operational activity records must be scoped to the correct workspace and tenant context where applicable; no new domain entities are introduced.
- **RBAC**: all affected admin surfaces require workspace membership; actions that change configuration or trigger restores require explicit permissions as defined in the central capability registry.
### Canonical-view Constraints
- **Default behavior when tenant-context is active**: Monitoring → Operations defaults to showing operational records for the currently selected tenant context (if one is selected).
- **Entitlement checks**: Monitoring → Operations MUST NOT reveal tenant-owned operational records unless the actor is entitled to that tenant scope within the active workspace context.
## Clarifications
### Session 2026-02-14
- Q: For FR-005 / SC-004, what should “identity scope” mean for deduping “active” runs? → A: Dedupe per tenant and operation target/scope.
- Q: For FR-003 (“summary counters”) what exact semantics should total / processed / failed follow? → A: total = items attempted; processed = succeeded; failed = failed.
- Q: For FR-004 (“stable failure code”), which convention should we standardize on for assignment fetch/restore runs? → A: Operation-specific code namespaces + normalized reason_code for the cause.
- Q: For FR-006 (identity scope = tenant + operation type + target/scope), which identifiers should define “target/scope” for these two operations? → A: Fetch targets the backup item (or policy version) identifier; restore targets the restore run identifier.
- Q: For US1 / FR-002 (“restore is observable/auditable”), what audit log granularity do you want for assignment restore? → A: One audit log entry per assignment restore execution (per restore run).
## User Scenarios & Testing *(mandatory)*
### User Story 1 — Observe assignment operations end-to-end (Priority: P1)
Workspace administrators need to see assignment-related operations (both read-only fetch and destructive restore) in Monitoring so they can confirm what ran, what changed, and why something failed, without relying on server logs.
**Why this priority**: Assignment restore is high-risk; missing visibility creates operational and audit gaps.
**Independent Test**: Trigger both an assignment fetch and an assignment restore; verify each produces a monitoring-visible run record with correct lifecycle and failure details.
**Acceptance Scenarios**:
1. **Given** an administrator triggers an assignment fetch, **When** the operation starts and completes, **Then** Monitoring shows a run record with start/end timestamps, final outcome, and summary counters.
2. **Given** an administrator triggers an assignment restore, **When** the operation starts and completes, **Then** Monitoring shows a run record including a clear indication that it was a change-making operation.
   - And exactly one audit log entry is written for the restore execution.
3. **Given** an assignment fetch or restore fails due to an external dependency error, **When** the run completes, **Then** Monitoring shows a stable failure code and a sanitized, user-readable message.
4. **Given** the same assignment operation is triggered multiple times concurrently for the same tenant and scope, **When** the system creates tracking records, **Then** the admin sees a single “active” run per identity scope (or an equivalent deduped representation).
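
The deduplication expected in scenario 4 can be sketched as a "start or reuse" lookup keyed by identity scope. This is a minimal illustration, not the project's implementation: `RunRegistry`, `start_or_reuse`, and the operation type strings are hypothetical names, and the in-memory dictionary stands in for persisted storage. Under real concurrency, a database-level uniqueness guarantee would be needed instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (tenant_id, operation_type, target_id); the tuple layout is an assumption.
IdentityKey = Tuple[str, str, str]


@dataclass
class OperationRun:
    identity: IdentityKey
    state: str = "queued"  # queued -> running -> completed


@dataclass
class RunRegistry:
    """In-memory stand-in for persisted run storage (hypothetical)."""

    _active: Dict[IdentityKey, OperationRun] = field(default_factory=dict)

    def start_or_reuse(self, tenant_id: str, operation_type: str, target_id: str) -> OperationRun:
        """Return the existing active run for this identity, or create one."""
        key = (tenant_id, operation_type, target_id)
        run = self._active.get(key)
        if run is None:
            run = OperationRun(identity=key)
            self._active[key] = run
        return run

    def complete(self, run: OperationRun) -> None:
        """Mark the run completed so a later trigger creates a fresh run."""
        run.state = "completed"
        self._active.pop(run.identity, None)


registry = RunRegistry()
first = registry.start_or_reuse("tenant-1", "assignment.fetch", "backup-item-7")
second = registry.start_or_reuse("tenant-1", "assignment.fetch", "backup-item-7")
print(first is second)  # the duplicate trigger reuses the active run
```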
---
### User Story 2 — Enforce correct access control semantics on affected admin surfaces (Priority: P2)
Workspace administrators and platform operators must not be able to cross authentication “planes” accidentally, and the admin UI must not expose bypasses that let users initiate sensitive actions without authorization.
**Why this priority**: Prevents cross-plane leakage and closes known authorization inconsistencies.
**Independent Test**: Attempt access with the wrong authentication plane and with insufficient permissions; verify outcomes are deny-as-not-found (404) or forbidden (403) per policy.
**Acceptance Scenarios**:
1. **Given** a user is authenticated in a different auth plane, **When** they attempt to access workspace-scoped admin routes, **Then** the response is deny-as-not-found (404).
2. **Given** a user is not a member of the workspace, **When** they attempt to view backup items under a backup set, **Then** the response is deny-as-not-found (404) and does not reveal record existence.
3. **Given** a user is a workspace member but lacks the required permission, **When** they attempt a protected action (such as managing provider connections), **Then** the response is forbidden (403).
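
The three scenarios above imply an ordering of checks: auth plane first, then workspace membership, then capability. The sketch below illustrates that ordering under stated assumptions. `Actor`, `authorize`, the plane label `"workspace"`, and the capability string are all hypothetical. Returning 404 before any capability check is what keeps record existence hidden from non-members.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Actor:
    auth_plane: str                 # hypothetical labels, e.g. "workspace" or "platform"
    workspace_ids: FrozenSet[str]   # workspaces the actor is a member of
    capabilities: FrozenSet[str]    # capabilities granted to the actor


def authorize(actor: Actor, workspace_id: str, required_capability: Optional[str] = None) -> int:
    """Return the HTTP status the request should receive."""
    # 1. Wrong auth plane: deny-as-not-found.
    if actor.auth_plane != "workspace":
        return 404
    # 2. Not a member of the workspace: deny-as-not-found, no existence leak.
    if workspace_id not in actor.workspace_ids:
        return 404
    # 3. Member without the required capability: forbidden.
    if required_capability is not None and required_capability not in actor.capabilities:
        return 403
    return 200


member = Actor("workspace", frozenset({"ws-1"}), frozenset())
print(authorize(member, "ws-1", "provider_connections.manage"))  # member lacking capability
print(authorize(member, "ws-2", "provider_connections.manage"))  # non-member
```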
---
### User Story 3 — Validate assignment operations safely in non-production contexts (Priority: P3)
Platform engineers need to validate that assignment operations behave correctly without requiring live external dependencies, so regressions can be caught early.
**Why this priority**: Improves reliability and reduces the risk of shipping changes that only fail when external services are slow/unavailable.
**Independent Test**: Run automated tests that simulate both successful and failing external interactions; verify monitoring records and authorization behaviors.
**Acceptance Scenarios**:
1. **Given** external interactions are simulated as “successful”, **When** the operation runs, **Then** the run is marked successful and includes expected summary counters.
2. **Given** external interactions are simulated as “failing”, **When** the operation runs, **Then** the run is marked failed with a stable failure code and sanitized message.
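
One way to exercise both scenarios without a live dependency is a small test double. The following is a sketch only: `FakeAssignmentClient`, `DependencyTimeout`, `run_assignment_fetch`, and the code strings `assignment_fetch.failed` and `dependency_timeout` are hypothetical names chosen for illustration. The point it shows is that the stored message comes from a fixed mapping, never from the raw exception text.

```python
class DependencyTimeout(Exception):
    """Hypothetical error raised by the simulated external client."""


class FakeAssignmentClient:
    """Test double standing in for the external dependency (hypothetical)."""

    def __init__(self, assignments=None, error=None):
        self._assignments = list(assignments or [])
        self._error = error

    def fetch_assignments(self, target_id):
        if self._error is not None:
            raise self._error
        return list(self._assignments)


# Assumed mapping from exception type to (reason_code, sanitized message).
REASONS = {
    DependencyTimeout: ("dependency_timeout", "The external service did not respond in time."),
}


def run_assignment_fetch(client, target_id):
    """Run the fetch and return a dict shaped like a tracking record."""
    run = {"state": "running", "outcome": None, "failure": None,
           "counters": {"total": 0, "processed": 0, "failed": 0}}
    try:
        items = client.fetch_assignments(target_id)
    except Exception as exc:
        # Never copy str(exc) into the record; it may contain sensitive details.
        reason_code, message = REASONS.get(
            type(exc), ("unknown", "The operation failed unexpectedly."))
        run.update(state="completed", outcome="failed",
                   failure={"code": "assignment_fetch.failed",
                            "reason_code": reason_code,
                            "message": message})
        return run
    run.update(state="completed", outcome="succeeded",
               counters={"total": len(items), "processed": len(items), "failed": 0})
    return run
```

A test can then pass `FakeAssignmentClient(error=DependencyTimeout(...))` and assert on the `code`, `reason_code`, and `message` fields of the returned record.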
### Edge Cases
- External dependency timeouts, throttling, or transient failures.
- Retries and duplicate dispatches (ensure tracking remains coherent and non-spammy).
- Missing or inconsistent tenant/workspace context (must fail safely, not leak).
- Partial completion (some items processed, some failed): counters and failure details must remain interpretable.
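
For the partial-completion case, the counter semantics agreed in the clarifications (total = items attempted, processed = succeeded, failed = failed) can be illustrated with a small tally. `SummaryCounters` and `tally` are hypothetical names. The sketch assumes every attempted item ends as either succeeded or failed, so `total == processed + failed` holds once the run completes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SummaryCounters:
    total: int = 0      # items attempted
    processed: int = 0  # items that succeeded
    failed: int = 0     # items that failed

    def record(self, succeeded: bool) -> None:
        """Count one attempted item and its result."""
        self.total += 1
        if succeeded:
            self.processed += 1
        else:
            self.failed += 1


def tally(results: Iterable[bool]) -> SummaryCounters:
    """Fold per-item results (True = succeeded) into summary counters."""
    counters = SummaryCounters()
    for succeeded in results:
        counters.record(succeeded)
    return counters


partial = tally([True, False, True])
print(partial)  # SummaryCounters(total=3, processed=2, failed=1)
```

With these semantics an operator can read a partially failed run directly: two of three attempted items succeeded and one failed.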
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment fetch operation execution.
- **FR-002**: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment restore operation execution.
- **FR-003**: Tracking records MUST include lifecycle state (queued/running/completed), timestamps, outcome, and summary counters sufficient to understand progress and results.
  - Counter semantics: `total` = items attempted, `processed` = succeeded, `failed` = failed.
- **FR-004**: Failed runs MUST include a stable failure code and a sanitized, user-readable failure message.
  - Failure convention: use operation-specific `code` namespaces for the run, and store the underlying cause as a normalized `reason_code`.
- **FR-005**: The system MUST prevent duplicate “active” runs for the same identity scope (tenant + operation type + operation target/scope), or otherwise present a deduped representation that avoids operator confusion.
- **FR-006**: Identity scope MUST be defined as:
  - For assignment fetch operations: tenant + operation type + backup item identifier (or equivalent policy-version identifier).
  - For assignment restore operations: tenant + operation type + restore run identifier.
- **FR-007**: Monitoring pages MUST render using persisted operational data only (no outbound calls during page render).
- **FR-008**: Cross-plane access MUST be deny-as-not-found (404) on affected routes.
- **FR-009**: Authorization MUST follow consistent semantics:
  - non-member / not entitled → 404 (deny-as-not-found)
  - member without required permission → 403
- **FR-010**: Any action that can change configuration or trigger a restore MUST be server-authorized; UI visibility MUST NOT be treated as authorization.
- **FR-011**: Sensitive or destructive-like actions MUST require explicit confirmation.
- **FR-012**: Each assignment restore execution MUST write exactly one audit log entry for the restore run.
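
FR-012 ("exactly one audit log entry" per restore run) matters most when retries or duplicate dispatches occur. A minimal sketch of one way to keep the write idempotent is to key audit entries by restore run identifier. `AuditLog` and `record_restore` are hypothetical names, and the dictionary stands in for a persisted table with a uniqueness constraint on that identifier.

```python
class AuditLog:
    """In-memory stand-in for an audit store (hypothetical)."""

    def __init__(self):
        self._by_restore_run = {}

    def record_restore(self, restore_run_id: str, actor_id: str, outcome: str) -> bool:
        """Write the audit entry once; return False if one already exists."""
        if restore_run_id in self._by_restore_run:
            return False
        self._by_restore_run[restore_run_id] = {
            "restore_run_id": restore_run_id,
            "actor_id": actor_id,
            "outcome": outcome,
        }
        return True

    def entries(self):
        """Return all audit entries in insertion order."""
        return list(self._by_restore_run.values())


log = AuditLog()
log.record_restore("restore-run-3", "admin-1", "succeeded")
log.record_restore("restore-run-3", "admin-1", "succeeded")  # retried dispatch
print(len(log.entries()))  # 1
```

Keying on the restore run identifier matches the identity scope defined for restore operations in FR-006, so a retried job cannot produce a second entry.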
### Dependencies & Assumptions
- Assignment fetch and assignment restore operations can be triggered by administrators or scheduled/queued execution paths.
- Monitoring users have access to the Operations area only when they have appropriate workspace membership and permissions.
- External dependency failures may occur and must be represented consistently (stable failure codes + sanitized messages).
## UI Action Matrix *(mandatory when the admin UI is changed)*
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Provider Connections | Admin → Provider Connections | Create connection (permission-gated) | View connection details | Edit / Delete (permission-gated; destructive confirms) | None | Create connection | N/A | Save + Cancel | Yes | Ensures no authorization bypass exists on header/empty-state CTAs. |
| Backup Sets | Admin → Backups | None (unchanged) | View backup set | Actions unchanged | None | None | N/A | N/A | N/A | Only enforcement helper consistency is affected. |
| Restore Runs | Admin → Restores | None (unchanged) | View restore run | Actions unchanged | None | None | N/A | N/A | Yes | Restore operations must be observable via Monitoring. |
| Backup Items | Under a backup set | None | View backup item details | Actions unchanged | None | None | N/A | N/A | N/A | Membership/404 checks must occur before capability/403 checks. |
| Monitoring → Operations | Admin → Monitoring | None | View operation run details | None | None | None | N/A | N/A | Yes | Read-only view; must not call external services during render. |
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: 100% of assignment fetch and assignment restore executions appear in Monitoring with a completed outcome (success or failure) and timestamps.
- **SC-002**: For failed runs, operators can identify a stable failure code and a readable failure message from Monitoring within 60 seconds, without checking server logs.
- **SC-003**: Automated tests verify deny-as-not-found (404) vs forbidden (403) semantics for the affected surfaces.
- **SC-004**: Duplicate active-run confusion is eliminated: repeated triggers produce a single active run per identity scope (tenant + operation type + operation target/scope) or equivalent deduped visibility.