# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)

**Feature Branch**: `feat/054-unify-operations-runs-suitewide`

**Created**: 2026-01-16

**Status**: Draft

**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."

## Clarifications

### Session 2026-01-16

- Q: Which default retention should 054 set for canonical Operation Runs? → A: 90 days
- Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or do we replace them immediately? → A: Parallel write (canonical + legacy)
- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record).
- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where status is `queued` or `running`).
- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string).

### Session 2026-01-17

- Q: Should `backup_schedule.run_now` and `backup_schedule.retry` be part of the Phase 1 adoption set (must be implemented) in 054? → A: Yes, both are Phase 1 in 054 (OperationRun producers + worker tracking).
- Q: If queue dispatch fails (background processing unavailable), should we still create an `OperationRun` and immediately complete it as failed?
  → A: Yes. Create an `OperationRun` and immediately complete it as `failed` (e.g., failure code `queue.dispatch_failed`); show a clear error and MAY include a “View run” link.
- Q: If a start is deduped (the run is reused), who should receive the in-app notifications (“queued” + terminal outcome)? → A: Only the original initiator (`operation_runs.user_id`); no additional notifications are sent to the second starter on reuse.
- Q: For `restore.execute`: in which `RestoreRunStatus` phases should an `OperationRun` adapter row be created/shown at all? → A: From `previewed` onwards (previewed + execution statuses); no adapter row for `draft`/`scoped`/`checked`.
- Q: If the `restore.execute` adapter is already visible from `RestoreRunStatus=previewed`: which `OperationRun` state should we set for this phase? → A: `status=queued`, `outcome=pending` (until `running`, then `completed` + terminal outcome).
- Q: RBAC Wizard (`TenantResource`): how does group search work? → A: Group search is delegated-Graph-based and the picker MUST be disabled without delegated auth.
- Q: Restore Wizard (`RestoreRunResource`), Group Mapping phase: Graph or DB-only? → A: DB-only via Directory Cache (`entra_groups`), no Graph calls during mapping; helper text is always shown (fallback included).

## User Scenarios & Testing *(mandatory)*

### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)

As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.

**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility.

**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.

**Acceptance Scenarios**:

1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, run state (queued/running/terminal outcome), time range, and initiator.
2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown.
3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, run state (queued/running/terminal outcome), timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
4. **Given** a restore run has reached `previewed` or later, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record).
5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls.
6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed.

---

### User Story 2 - Start Operations Without Blocking (Priority: P2)

As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.

**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed.

**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.

**Acceptance Scenarios**:

1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run.
3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link.
4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued.
   - If an `OperationRun` record was created during the attempt, it MUST be completed immediately with outcome `failed` (never left `queued`) and MAY be linked via “View run”.

---

### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)

As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.

**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.

**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.

**Acceptance Scenarios**:

1. **Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one.
2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it.
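
The reuse-on-duplicate behavior in User Story 3 can be sketched as follows. This is a minimal illustration, not the project's implementation: it uses Python's built-in `sqlite3` only because SQLite supports partial unique indexes and runs without setup, and the helper names `run_identity_hash`, `connect`, and `start_or_reuse` as well as the exact column set are assumptions made for this example. The identity hash deliberately excludes the initiator, and the partial unique index lets the database resolve the race.

```python
import hashlib
import json
import sqlite3


def run_identity_hash(run_type: str, effective_inputs: dict) -> str:
    """Deterministic identity from run type + effective inputs (never the initiator)."""
    canonical = json.dumps(
        {"run_type": run_type, "inputs": effective_inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def connect() -> sqlite3.Connection:
    """In-memory stand-in for the operation_runs table (columns are illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE operation_runs (
            id INTEGER PRIMARY KEY,
            tenant_id INTEGER NOT NULL,
            run_type TEXT NOT NULL,
            run_identity_hash TEXT NOT NULL,
            user_id INTEGER,
            initiator_name TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'queued',
            outcome TEXT NOT NULL DEFAULT 'pending'
        )"""
    )
    # Partial unique index: uniqueness applies only to active runs.
    conn.execute(
        """CREATE UNIQUE INDEX operation_runs_active_identity
           ON operation_runs (tenant_id, run_identity_hash)
           WHERE status IN ('queued', 'running')"""
    )
    return conn


def start_or_reuse(conn, tenant_id, run_type, inputs, user_id, initiator_name):
    """Return (run_id, created). On a duplicate active identity, reuse the existing run."""
    identity = run_identity_hash(run_type, inputs)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO operation_runs "
                "(tenant_id, run_type, run_identity_hash, user_id, initiator_name) "
                "VALUES (?, ?, ?, ?, ?)",
                (tenant_id, run_type, identity, user_id, initiator_name),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM operation_runs "
            "WHERE tenant_id = ? AND run_identity_hash = ? "
            "AND status IN ('queued', 'running')",
            (tenant_id, identity),
        ).fetchone()
        if row is None:
            # The competing run finished between INSERT and SELECT; a real
            # implementation would retry the insert here.
            raise
        return row[0], False
```

In this sketch a second caller with the same tenant and inputs receives the first caller's run id with `created=False`, so both users can be sent to the same “View run” link. Inputs that contain lists (for example selected policies) would need to be sorted before hashing so that ordering does not change the identity.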

### Edge Cases

- Background execution unavailable: start fails fast with a clear message; if an `OperationRun` record was created, it MUST be immediately completed as `failed` (e.g., `queue.dispatch_failed`) and MUST NOT be left `queued`.
- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

### Scope & Assumptions

**Phase 1 adoption set (must be implemented):**

- `inventory.sync` (Inventory “Sync now”)
- `policy.sync` (Policies “Sync now”)
- `directory_groups.sync` (Directory → Groups “Sync groups”)
- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible)
- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”)
- `backup_schedule.run_now` (Backup Schedules “Run now”)
- `backup_schedule.retry` (Backup Schedules “Retry”)

**Restore visibility (adapter only):**

- `restore.execute` appears as a canonical run entry that links to an existing restore domain record.
- The adapter row MUST be created/visible only once a restore run reaches `previewed` (or later) and MUST NOT be created for `draft`, `scoped`, or `checked`.
- When the restore run is `previewed`, the adapter `OperationRun` MUST use `status=queued` and `outcome=pending`.
- Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
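
The restore adapter rules above can be read as a small state mapping. The sketch below is one possible reading, under stated assumptions: the function name `adapter_state` is hypothetical, and because the spec does not name the restore domain's terminal statuses, the terminal outcome is passed in separately instead of being derived from a restore status name.

```python
from typing import Optional, Tuple

# Restore phases for which no adapter row may exist (from the rules above).
NO_ADAPTER_STATUSES = {"draft", "scoped", "checked"}
TERMINAL_OUTCOMES = {"succeeded", "partially_succeeded", "failed"}


def adapter_state(
    restore_status: str, terminal_outcome: Optional[str] = None
) -> Optional[Tuple[str, str]]:
    """Map a restore run to the (status, outcome) of its OperationRun adapter row.

    Returns None when no adapter row should exist yet.
    `terminal_outcome` is assumed to be set only once the restore has finished.
    """
    if restore_status in NO_ADAPTER_STATUSES:
        return None
    if terminal_outcome is not None:
        if terminal_outcome not in TERMINAL_OUTCOMES:
            raise ValueError(f"unknown terminal outcome: {terminal_outcome}")
        return ("completed", terminal_outcome)
    if restore_status == "running":
        return ("running", "pending")
    # `previewed`, and any later phase that has not started running yet.
    return ("queued", "pending")
```

A caller would create or update the `restore.execute` row only when this function returns a value, which keeps `draft`, `scoped`, and `checked` restores out of Monitoring → Operations.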

**Out of scope for 054 (explicit):**

- Cross-tenant compare/promotion
- UI redesign/styling polish (separate UI polish work)
- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
- Replacing restore domain records with canonical runs
- A full settings UI for retention/notifications/etc.
- Implementing or validating `AuditLog` behavior for audit-only actions (FR-019) beyond actions explicitly changed by 054

**Assumptions (defaults to remove ambiguity in Phase 1):**

- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
- System-initiated runs (if any) do not notify users by default in Phase 1.
- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.

**Run vs Audit-only Adoption Matrix (Phase 1):**

| Feature Area | Action | Tracking | run_type / audit action |
|-------------|--------|----------|--------------------------|
| Policies | Sync now | OperationRun | `policy.sync` |
| Policies | Ignore policy | Audit-only | `policy.ignore` |
| Policies | Export to backup | OperationRun (queued) | `policy.export_backup` |
| Policy Versions | Capture snapshot | OperationRun | `policy.capture_snapshot` |
| Policy Versions | Prune versions | Audit-only | `policy_versions.prune` |
| Policy Versions | Archive versions | Audit-only | `policy_versions.archive` |
| Inventory | Sync now | OperationRun | `inventory.sync` |
| Directory Groups | Sync groups | OperationRun | `directory_groups.sync` |
| Drift | Generate drift | OperationRun | `drift.generate` |
| Backup Sets | Add policies | OperationRun | `backup_set.add_policies` |
| Backup Sets | Archive | Audit-only (DB-only) | `backup_set.archive` |
| Backup Sets | Restore (bulk) | OperationRun | `backup_set.restore` |
| Backup Sets | Force delete | Audit-only (admin-only) | `backup_set.force_delete` |
| Backup Schedules | Run now | OperationRun | `backup_schedule.run_now` |
| Backup Schedules | Retry | OperationRun | `backup_schedule.retry` |
| Backup Schedules | Edit | Audit-only | `backup_schedule.edit` |
| Backup Schedules | Delete | Audit-only | `backup_schedule.delete` |
| Tenants | Sync tenant | OperationRun | `tenant.sync` |
| Tenants | Admin consent | Audit-only | `tenant.admin_consent` |
| Tenants | Verify configuration | Audit-only | `tenant.verify_config` |
| Tenants | Setup Intune RBAC | Audit-only | `tenant.setup_rbac` |
| Tenants | Deactivate | Audit-only | `tenant.deactivate` |
| Restore | Execute restore | OperationRun (adapter) | `restore.execute` (context → `restore_run_id`) |

**Rule**: If an action is queued/background, long-running, or requires remote/external calls (e.g., Microsoft Graph), it MUST be tracked as an OperationRun. Only fast DB-only changes MAY be Audit-only.

### Functional Requirements

- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, terminal outcome (pending while active), summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
  - **Status semantics**: `status` represents lifecycle stage (`queued` → `running` → `completed`).
  - **Outcome semantics (stored tokens)**: `outcome` stores machine tokens: `pending` while active, otherwise `succeeded` / `partially_succeeded` / `failed`.
  - **UI labels**: Monitoring displays human labels derived from stored tokens (e.g., `partially_succeeded` → “Partially succeeded”).
  - **Reserved**: `cancelled` is reserved for future use and MUST NOT be produced by 054 (Monitoring hub has no cancel controls).
  - **Context safety**: `context` MUST be sanitized and MUST include only safe references (e.g., stable IDs, selection scope keys, correlation IDs). It MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
- **FR-002 Run taxonomy**: Run type MUST be stable and follow a dot-separated, two-segment form (e.g., `inventory.sync`, `backup_set.add_policies`).
- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, `backup_schedule.run_now`, `backup_schedule.retry`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity.
- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, run state (queued/running/terminal outcome), time range, and initiator; default sort is most recent first; default time window is last 30 days.
- **FR-005 Run detail**: Run detail MUST show initiator, run type, run state (queued/running/terminal outcome), timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity.
  **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where status is `queued` or `running`.
- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows:
  - `inventory.sync`: tenant + selection scope
  - `policy.sync`: tenant + effective policy scope
  - `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”)
  - `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed)
  - `backup_schedule.run_now`: tenant + backup schedule id
  - `backup_schedule.retry`: tenant + backup schedule id
  - `drift.generate`: tenant + scope key + baseline/current comparison inputs
- **FR-011 Run state presentation**: Monitoring MUST present a consistent run state using a single display bucket derived from lifecycle status and terminal outcome:
  - If status is `queued` or `running`, display that status.
  - If status is `completed`, display the terminal outcome derived from the stored token (`succeeded`, `partially_succeeded`, or `failed`) using the UI label mapping.
- **FR-012 Partial vs failed (terminal outcomes)**: “Partially succeeded” (`partially_succeeded`) means at least one success and at least one failure; “Failed” (`failed`) means zero successes or cannot proceed.
- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
  - **Reason codes** MUST be stable, machine-readable identifiers (lowercase, dot-separated), e.g. `graph.throttled`, `auth.forbidden`, `validation.invalid_input`, `unexpected.exception`.
  - **Messages** MUST be short (≤ 200 characters), sanitized, and written for operators (no secrets/tokens/credentials/PII; no raw external payloads). If needed, messages MAY include a non-sensitive correlation identifier.
- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`).
- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
  - If a start request reuses an existing active run (dedupe), the run initiator (as stored on the `OperationRun`) remains the sole notification recipient; the second starter receives no additional notifications.
- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only.
- **FR-019 Audit-only actions (no OperationRun)**: Actions that are DB-only and complete within ≤2 seconds under normal conditions MAY be executed without an OperationRun, as long as they do not start long-running background execution and do not require any remote/external calls.
  - **054 scope note**: 054 does not implement or modify audit-only actions. If any audit-only action is touched as part of implementing 054 in the future, it MUST comply with this requirement and MUST be covered by tests.

  If such an action is security-relevant or changes operational behavior (e.g., “Ignore policy”, “Deactivate tenant”, “Admin consent”, “Prune versions”, “Force delete”), it MUST write exactly one tenant-scoped AuditLog entry with, at minimum:

  - `tenant_id`
  - `actor_user_id`
  - `action` (stable action identifier, e.g., `policy.ignore`)
  - `target_type`, `target_id`
  - `before` / `after` (sanitized JSON) **or** `diff` (sanitized JSON)
  - `created_at`

  **Trigger guidance (to make classification reviewable)**:

  - “Security-relevant” includes actions that grant/revoke access, change authorization posture, change admin consent, or otherwise modify who/what can read/write tenant data.
  - “Operational behavior change” includes actions that change what the system will do in future runs (e.g., ignore/exclude resources, enable/disable schedules, retention/prune/archive actions, force deletes).
  - If unclear whether an Audit-only action is security/ops-relevant, the default is to treat it as such and write an AuditLog entry.

  **Sanitization (AuditLog before/after/diff)**:

  - AuditLog payloads MUST include only the minimum fields needed to understand the change.
  - AuditLog payloads MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
  - If a field is sensitive, it MUST be omitted or replaced with a non-sensitive placeholder (e.g., `"[REDACTED]"`).

  Monitoring/Operations remains reserved for OperationRun-tracked long-running/queued operations.

  **Acceptance checks (testable)**:

  - Audit-only action creates no OperationRun.
  - Audit-only action creates exactly one AuditLog event containing the required fields.
  - Audit-only action is tenant-scoped; cross-tenant access is forbidden and MUST NOT create AuditLog entries.
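
The outcome and presentation rules in FR-011 and FR-012 can be expressed as two small pure functions. The following is a sketch under assumptions: the function names and the `UI_LABELS` entries other than “Partially succeeded” and “Failed” are illustrative, and the handling of a run with zero successes and zero failures follows the literal wording of FR-012 ("zero successes" means failed), which the spec may want to refine for empty runs.

```python
ACTIVE_STATUSES = {"queued", "running"}
TERMINAL_OUTCOMES = {"succeeded", "partially_succeeded", "failed"}

# Human labels derived from stored tokens; only two of these labels are
# spelled out in the spec, the others are assumed for illustration.
UI_LABELS = {
    "queued": "Queued",
    "running": "Running",
    "succeeded": "Succeeded",
    "partially_succeeded": "Partially succeeded",
    "failed": "Failed",
}


def derive_outcome(success_count: int, failure_count: int) -> str:
    """Terminal outcome per FR-012, computed from per-item counts."""
    if success_count > 0 and failure_count > 0:
        return "partially_succeeded"
    if success_count > 0:
        return "succeeded"
    # Zero successes (or the run could not proceed at all).
    return "failed"


def display_bucket(status: str, outcome: str) -> str:
    """Single display bucket per FR-011."""
    if status in ACTIVE_STATUSES:
        return status
    if status == "completed" and outcome in TERMINAL_OUTCOMES:
        return outcome
    raise ValueError(f"inconsistent run state: status={status}, outcome={outcome}")


def display_label(status: str, outcome: str) -> str:
    """Human label shown in Monitoring for a run."""
    return UI_LABELS[display_bucket(status, outcome)]
```

Keeping the bucket derivation separate from the label lookup means filters can work on stable tokens while the list and detail views share one label mapping.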

### Key Entities *(include if feature involves data)*

- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, terminal outcome, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
  - **SC-003 Measurement scope (definition)**: An “attempt” counts when a start request is made for an operation with identical effective inputs while an identical run is already `queued` or `running`. The success condition is that the system reuses the existing active run reference rather than creating a second active run. “Normal conditions” exclude infrastructure outages (e.g., database unavailable) that prevent either run creation or dedupe evaluation.
- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
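
SC-004 and FR-013 together suggest a single sanitization step that every persisted failure passes through. The sketch below shows one way this could look. The function name `sanitize_failure`, the fallback to `unexpected.exception` for malformed codes, and the two redaction patterns (bearer tokens and e-mail addresses) are assumptions for illustration only; a real implementation would need a reviewed pattern list.

```python
import re

# Lowercase, dot-separated reason codes such as `graph.throttled` (FR-013).
REASON_CODE_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
MAX_MESSAGE_LENGTH = 200

# Illustrative redaction rules only; not an exhaustive secret/PII detector.
REDACTION_PATTERNS = [
    re.compile(r"bearer\s+[a-z0-9._~+/=-]+", re.IGNORECASE),  # bearer tokens
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),                # e-mail addresses
]


def sanitize_failure(reason_code: str, message: str) -> dict:
    """Return a failure record that is safe to persist and display."""
    if not REASON_CODE_PATTERN.match(reason_code):
        # Assumed fallback: never persist a free-form string as a reason code.
        reason_code = "unexpected.exception"
    for pattern in REDACTION_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    # Collapse whitespace, then truncate AFTER redaction so a secret cannot
    # survive by being cut in half.
    message = " ".join(message.split())[:MAX_MESSAGE_LENGTH]
    return {"reason_code": reason_code, "message": message}
```

A test for SC-004 could feed known secret-shaped strings through this function and assert that none of them appear in the stored failure or in the notification text built from it.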