TenantAtlas/specs/054-unify-runs-suitewide/spec.md
2026-01-16 19:06:30 +01:00

149 lines
13 KiB
Markdown

# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)
**Feature Branch**: `feat/054-unify-operations-runs-suitewide`
**Created**: 2026-01-16
**Status**: Draft
**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."
## Clarifications
### Session 2026-01-16
- Q: Welche Default-Retention soll 054 für canonical Operation Runs festlegen? → A: 90 days
- Q: Transition-Strategie in 054: schreiben wir canonical Runs parallel zu Legacy-Run-Tabellen, oder ersetzen wir sofort? → A: Parallel write (canonical + legacy)
- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record).
- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where outcome is `queued` or `running`).
- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string).
## User Scenarios & Testing *(mandatory)*
### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)
As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.
**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility.
**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.
**Acceptance Scenarios**:
1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, outcome bucket, time range, and initiator.
2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown.
3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, outcome bucket, timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
4. **Given** restore execution exists, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record).
5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls.
6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed.
---
### User Story 2 - Start Operations Without Blocking (Priority: P2)
As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.
**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed.
**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.
**Acceptance Scenarios**:
1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run.
3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link.
4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued.
---
### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)
As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.
**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.
**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.
**Acceptance Scenarios**:
1. **Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one.
2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it.
### Edge Cases
- Background execution unavailable: start fails fast with a clear message; the system MUST NOT create misleading “queued” runs.
- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.
## Requirements *(mandatory)*
**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior,
the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.
### Scope & Assumptions
**Phase 1 adoption set (must be implemented):**
- `inventory.sync` (Inventory “Sync now”)
- `policy.sync` (Policies “Sync now”)
- `directory_groups.sync` (Directory → Groups “Sync groups”)
- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible)
- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”)
**Restore visibility (adapter only):**
- `restore.execute` appears as a canonical run entry that links to an existing restore domain record.
- Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
**Out of scope for 054 (explicit):**
- Cross-tenant compare/promotion
- UI redesign/styling polish (separate UI polish work)
- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
- Replacing restore domain records with canonical runs
- A full settings UI for retention/notifications/etc.
**Assumptions (defaults to remove ambiguity in Phase 1):**
- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
- System-initiated runs (if any) do not notify users by default in Phase 1.
- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.
### Functional Requirements
- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, outcome bucket, summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
- **FR-002 Run taxonomy**: Run type MUST be stable and follow `"<resource>.<action>"`.
- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity.
- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, outcome bucket, time range, and initiator; default sort is most recent first; default time window is last 30 days.
- **FR-005 Run detail**: Run detail MUST show initiator, run type, outcome bucket, timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where outcome is `queued` or `running`.
- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows:
- `inventory.sync`: tenant + selection scope
- `policy.sync`: tenant + effective policy scope
- `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”)
- `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed)
- `drift.generate`: tenant + scope key + baseline/current comparison inputs
- **FR-011 Outcome buckets**: Monitoring MUST present consistent outcome buckets: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`.
- **FR-012 Partial vs failed**: “Partially succeeded” means at least one success and at least one failure; “Failed” means zero successes or cannot proceed.
- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`).
- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only.
### Key Entities *(include if feature involves data)*
- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, outcome bucket, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).