TenantAtlas/specs/054-unify-runs-suitewide/spec.md

# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)

**Feature Branch**: `feat/054-unify-operations-runs-suitewide`
**Created**: 2026-01-16
**Status**: Draft
**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."

## Clarifications

### Session 2026-01-16

- Q: What default retention should 054 set for canonical Operation Runs? → A: 90 days
- Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or do we replace them immediately? → A: Parallel write (canonical + legacy)
- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record).
- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where status is `queued` or `running`).
- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string).

### Session 2026-01-17

- Q: Should `backup_schedule.run_now` and `backup_schedule.retry` be part of the Phase 1 adoption set (must be implemented) in 054? → A: Yes, both are Phase 1 in 054 (OperationRun producers + worker tracking).
- Q: If queue dispatch fails (background processing unavailable), should we still create an `OperationRun` and immediately complete it as failed? → A: Yes. Create an `OperationRun` and immediately complete it as `failed` (e.g., failure code `queue.dispatch_failed`); show a clear error and MAY include a “View run” link.
- Q: When a start is deduped (the run is reused), who should receive the in-app notifications (“queued” + terminal outcome)? → A: Only the original initiator (`operation_runs.user_id`); no additional notifications are sent to the second starter on reuse.
- Q: For `restore.execute`: in which `RestoreRunStatus` phases should an `OperationRun` adapter row be created/shown at all? → A: From `previewed` onwards (previewed + execution statuses); no adapter row for `draft`/`scoped`/`checked`.
- Q: If the `restore.execute` adapter is already visible from `RestoreRunStatus=previewed`: which `OperationRun` state should we set for this phase? → A: `status=queued`, `outcome=pending` (until `running`, then `completed` + terminal outcome).
- Q: RBAC Wizard (`TenantResource`): how does group search work? → A: Group search is delegated-Graph-based and the picker MUST be disabled without delegated auth.
- Q: Restore Wizard (`RestoreRunResource`), group mapping phase: Graph or DB-only? → A: DB-only via Directory Cache (`entra_groups`), no Graph calls during mapping; helper text is always shown (fallback included).

## User Scenarios & Testing *(mandatory)*

### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)

As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.

**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility.

**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.

**Acceptance Scenarios**:

1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, run state (queued/running/terminal outcome), time range, and initiator.
2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown.
3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, run state (queued/running/terminal outcome), timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
4. **Given** a restore run has reached `previewed` or later, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record).
5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls.
6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed.
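
The "run state (queued/running/terminal outcome)" used in the scenarios above can be read as a single display bucket derived from two stored fields. The following Python sketch is illustrative only: the function name `display_state` and every label except “Partially succeeded” are assumptions, not part of this spec.

```python
# Hypothetical label table; only "Partially succeeded" is named in the spec.
UI_LABELS = {
    "queued": "Queued",
    "running": "Running",
    "succeeded": "Succeeded",
    "partially_succeeded": "Partially succeeded",
    "failed": "Failed",
}

TERMINAL_OUTCOMES = ("succeeded", "partially_succeeded", "failed")


def display_state(status: str, outcome: str) -> str:
    """Collapse lifecycle status and stored outcome token into one label."""
    if status in ("queued", "running"):
        # Active runs show their lifecycle status; outcome is still `pending`.
        return UI_LABELS[status]
    if status == "completed" and outcome in TERMINAL_OUTCOMES:
        # Completed runs show the terminal outcome instead of "completed".
        return UI_LABELS[outcome]
    raise ValueError(f"inconsistent run state: status={status!r}, outcome={outcome!r}")
```

Raising on an inconsistent pair (for example `completed` with `pending`) is one possible design choice; an implementation might instead log and fall back to a neutral label.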

---

### User Story 2 - Start Operations Without Blocking (Priority: P2)

As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.

**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed.

**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.

**Acceptance Scenarios**:

1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run.
3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link.
4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued.
   - If an `OperationRun` record was created during the attempt, it MUST be completed immediately with outcome `failed` (never left `queued`) and MAY be linked via “View run”.
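
Scenario 4 can be sketched as follows, assuming the run record is created before dispatch is attempted. This is a minimal Python illustration; `OperationRun` here is a simplified stand-in, and `DispatchError`, `start_operation`, and the `dispatch` callable are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OperationRun:
    """Simplified stand-in for the canonical run record."""
    run_type: str
    status: str = "queued"
    outcome: str = "pending"
    failure_code: Optional[str] = None


class DispatchError(Exception):
    """Hypothetical error raised when background processing is unavailable."""


def start_operation(run_type: str, dispatch: Callable[[OperationRun], None]) -> OperationRun:
    run = OperationRun(run_type=run_type)
    try:
        dispatch(run)
    except DispatchError:
        # Never leave the run `queued`: complete it immediately as failed.
        run.status = "completed"
        run.outcome = "failed"
        run.failure_code = "queue.dispatch_failed"
    return run
```

A caller would inspect the returned run: if `outcome` is `failed`, it shows the clear error message (optionally with “View run”) instead of a "queued" confirmation.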

---

### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)

As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.

**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.

**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.

**Acceptance Scenarios**:

1. **Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one.
2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it.
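
The reuse behaviour in both scenarios can be illustrated with an in-memory Python sketch. The lock stands in for the database-level uniqueness guarantee on active runs; `RunStore`, `Run`, and `start_or_reuse` are hypothetical names, and a real implementation would rely on the database constraint rather than a process-local lock.

```python
import threading
from dataclasses import dataclass
from typing import List, Tuple

ACTIVE_STATUSES = {"queued", "running"}


@dataclass
class Run:
    id: int
    tenant_id: str
    identity_hash: str
    status: str = "queued"


class RunStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._runs: List[Run] = []

    def start_or_reuse(self, tenant_id: str, identity_hash: str) -> Tuple[Run, bool]:
        """Return (run, created). At most one active run per (tenant, identity)."""
        with self._lock:
            for run in self._runs:
                if (run.tenant_id == tenant_id
                        and run.identity_hash == identity_hash
                        and run.status in ACTIVE_STATUSES):
                    return run, False  # duplicate start: reuse the active run
            run = Run(len(self._runs) + 1, tenant_id, identity_hash)
            self._runs.append(run)
            return run, True
```

Because only active runs block creation, the same identity can be started again once the earlier run has completed.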

### Edge Cases

- Background execution unavailable: start fails fast with a clear message; if an `OperationRun` record was created, it MUST be immediately completed as `failed` (e.g., `queue.dispatch_failed`) and MUST NOT be left `queued`.
- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior,
the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

### Scope & Assumptions

**Phase 1 adoption set (must be implemented):**

- `inventory.sync` (Inventory “Sync now”)
- `policy.sync` (Policies “Sync now”)
- `directory_groups.sync` (Directory → Groups “Sync groups”)
- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible)
- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”)
- `backup_schedule.run_now` (Backup Schedules “Run now”)
- `backup_schedule.retry` (Backup Schedules “Retry”)

**Restore visibility (adapter only):**

- `restore.execute` appears as a canonical run entry that links to an existing restore domain record.
- The adapter row MUST be created/visible only once a restore run reaches `previewed` (or later) and MUST NOT be created for `draft`, `scoped`, or `checked`.
- When the restore run is `previewed`, the adapter `OperationRun` MUST use `status=queued` and `outcome=pending`.
- Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
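
A possible mapping from restore status to adapter state is sketched below in Python. Only `draft`, `scoped`, `checked`, and `previewed` are status names taken from this spec; the execution-phase names (`running`, `completed`, `partial`, `failed`) are placeholders assumed for illustration and may not match the actual `RestoreRunStatus` values.

```python
from typing import Optional, Tuple

# Phases for which no adapter row may exist.
NO_ADAPTER_STATUSES = {"draft", "scoped", "checked"}

# Hypothetical terminal restore statuses mapped to stored outcome tokens.
TERMINAL_MAP = {
    "completed": "succeeded",
    "partial": "partially_succeeded",
    "failed": "failed",
}


def adapter_state(restore_status: str) -> Optional[Tuple[str, str]]:
    """Return (status, outcome) for the adapter row, or None if no row is allowed."""
    if restore_status in NO_ADAPTER_STATUSES:
        return None
    if restore_status == "previewed":
        return ("queued", "pending")
    if restore_status == "running":  # assumed name for the executing phase
        return ("running", "pending")
    if restore_status in TERMINAL_MAP:
        return ("completed", TERMINAL_MAP[restore_status])
    raise ValueError(f"unknown restore status: {restore_status!r}")
```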

**Out of scope for 054 (explicit):**

- Cross-tenant compare/promotion
- UI redesign/styling polish (separate UI polish work)
- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
- Replacing restore domain records with canonical runs
- A full settings UI for retention/notifications/etc.
- Implementing or validating `AuditLog` behavior for audit-only actions (FR-019) beyond actions explicitly changed by 054

**Assumptions (defaults to remove ambiguity in Phase 1):**

- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
- System-initiated runs (if any) do not notify users by default in Phase 1.
- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.

**Run vs Audit-only Adoption Matrix (Phase 1):**

| Feature Area | Action | Tracking | run_type / audit action |
|-------------|--------|----------|--------------------------|
| Policies | Sync now | OperationRun | `policy.sync` |
| Policies | Ignore policy | Audit-only | `policy.ignore` |
| Policies | Export to backup | OperationRun (queued) | `policy.export_backup` |
| Policy Versions | Capture snapshot | OperationRun | `policy.capture_snapshot` |
| Policy Versions | Prune versions | Audit-only | `policy_versions.prune` |
| Policy Versions | Archive versions | Audit-only | `policy_versions.archive` |
| Inventory | Sync now | OperationRun | `inventory.sync` |
| Directory Groups | Sync groups | OperationRun | `directory_groups.sync` |
| Drift | Generate drift | OperationRun | `drift.generate` |
| Backup Sets | Add policies | OperationRun | `backup_set.add_policies` |
| Backup Sets | Archive | Audit-only (DB-only) | `backup_set.archive` |
| Backup Sets | Restore (bulk) | OperationRun | `backup_set.restore` |
| Backup Sets | Force delete | Audit-only (admin-only) | `backup_set.force_delete` |
| Backup Schedules | Run now | OperationRun | `backup_schedule.run_now` |
| Backup Schedules | Retry | OperationRun | `backup_schedule.retry` |
| Backup Schedules | Edit | Audit-only | `backup_schedule.edit` |
| Backup Schedules | Delete | Audit-only | `backup_schedule.delete` |
| Tenants | Sync tenant | OperationRun | `tenant.sync` |
| Tenants | Admin consent | Audit-only | `tenant.admin_consent` |
| Tenants | Verify configuration | Audit-only | `tenant.verify_config` |
| Tenants | Setup Intune RBAC | Audit-only | `tenant.setup_rbac` |
| Tenants | Deactivate | Audit-only | `tenant.deactivate` |
| Restore | Execute restore | OperationRun (adapter) | `restore.execute` (context → `restore_run_id`) |

**Rule**: If an action is queued/background, long-running, or requires remote/external calls (e.g., Microsoft Graph),
it MUST be tracked as an OperationRun. Only fast DB-only changes MAY be Audit-only.
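
For review purposes, the rule can be read as a simple predicate. The Python sketch below uses hypothetical parameter names; it returns the minimum tracking the rule permits, so an action classified as "Audit-only" may still be tracked as an OperationRun if a team prefers.

```python
def required_tracking(is_background: bool, is_long_running: bool, makes_remote_calls: bool) -> str:
    """Classify an action according to the adoption rule."""
    if is_background or is_long_running or makes_remote_calls:
        return "OperationRun"
    # Only fast, DB-only changes reach this branch.
    return "Audit-only"
```

For example, "Sync now" makes remote calls and is therefore an OperationRun, while "Ignore policy" is a fast DB-only change and may be Audit-only.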

### Functional Requirements

- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, terminal outcome (pending while active), summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
  - **Status semantics**: `status` represents lifecycle stage (`queued` → `running` → `completed`).
  - **Outcome semantics (stored tokens)**: `outcome` stores machine tokens: `pending` while active, otherwise `succeeded` / `partially_succeeded` / `failed`.
    - **UI labels**: Monitoring displays human labels derived from stored tokens (e.g., `partially_succeeded` → “Partially succeeded”).
    - **Reserved**: `cancelled` is reserved for future use and MUST NOT be produced by 054 (Monitoring hub has no cancel controls).
  - **Context safety**: `context` MUST be sanitized and MUST include only safe references (e.g., stable IDs, selection scope keys, correlation IDs). It MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
- **FR-002 Run taxonomy**: Run type MUST be stable and follow `"<resource>.<action>"`.
- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, `backup_schedule.run_now`, `backup_schedule.retry`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity.
- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, run state (queued/running/terminal outcome), time range, and initiator; default sort is most recent first; default time window is last 30 days.
- **FR-005 Run detail**: Run detail MUST show initiator, run type, run state (queued/running/terminal outcome), timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where status is `queued` or `running`.
- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows:
  - `inventory.sync`: tenant + selection scope
  - `policy.sync`: tenant + effective policy scope
  - `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”)
  - `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed)
  - `backup_schedule.run_now`: tenant + backup schedule id
  - `backup_schedule.retry`: tenant + backup schedule id
  - `drift.generate`: tenant + scope key + baseline/current comparison inputs
- **FR-011 Run state presentation**: Monitoring MUST present a consistent run state using a single display bucket derived from lifecycle status and terminal outcome:
  - If status is `queued` or `running`, display that status.
  - If status is `completed`, display the terminal outcome derived from the stored token (`succeeded`, `partially_succeeded`, or `failed`) using the UI label mapping.
- **FR-012 Partial vs failed (terminal outcomes)**: “Partially succeeded” (`partially_succeeded`) means at least one success and at least one failure; “Failed” (`failed`) means zero successes or cannot proceed.
- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
  - **Reason codes** MUST be stable, machine-readable identifiers (lowercase, dot-separated), e.g. `graph.throttled`, `auth.forbidden`, `validation.invalid_input`, `unexpected.exception`.
  - **Messages** MUST be short (≤ 200 characters), sanitized, and written for operators (no secrets/tokens/credentials/PII; no raw external payloads). If needed, messages MAY include a non-sensitive correlation identifier.
- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`).
- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
  - If a start request reuses an existing active run (dedupe), the run initiator (as stored on the `OperationRun`) remains the sole notification recipient; the second starter receives no additional notifications.
- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only.
- **FR-019 Audit-only actions (no OperationRun)**: Actions that are DB-only and complete within 2 seconds under normal
  conditions MAY be executed without an OperationRun, as long as they do not start long-running background execution and
  do not require any remote/external calls.
  - **054 scope note**: 054 does not implement or modify audit-only actions. If any audit-only action is touched as part
    of implementing 054 in the future, it MUST comply with this requirement and MUST be covered by tests.
  If such an action is security-relevant or changes operational behavior (e.g., “Ignore policy”, “Deactivate tenant”,
  “Admin consent”, “Prune versions”, “Force delete”), it MUST write exactly one tenant-scoped AuditLog entry with, at minimum:
  - `tenant_id`
  - `actor_user_id`
  - `action` (stable action identifier, e.g., `policy.ignore`)
  - `target_type`, `target_id`
  - `before` / `after` (sanitized JSON) **or** `diff` (sanitized JSON)
  - `created_at`
  **Trigger guidance (to make classification reviewable)**:
  - “Security-relevant” includes actions that grant/revoke access, change authorization posture, change admin consent, or otherwise modify who/what can read/write tenant data.
  - “Operational behavior change” includes actions that change what the system will do in future runs (e.g., ignore/exclude resources, enable/disable schedules, retention/prune/archive actions, force deletes).
  - If unclear whether an Audit-only action is security/ops-relevant, the default is to treat it as such and write an AuditLog entry.
  **Sanitization (AuditLog before/after/diff)**:
  - AuditLog payloads MUST include only the minimum fields needed to understand the change.
  - AuditLog payloads MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
  - If a field is sensitive, it MUST be omitted or replaced with a non-sensitive placeholder (e.g., `"[REDACTED]"`).
  Monitoring/Operations remains reserved for OperationRun-tracked long-running/queued operations.
  **Acceptance checks (testable)**:
  - Audit-only action creates no OperationRun.
  - Audit-only action creates exactly one AuditLog event containing the required fields.
  - Audit-only action is tenant-scoped; cross-tenant access is forbidden and MUST NOT create AuditLog entries.
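
The required AuditLog fields and the placeholder-based sanitization in FR-019 could be combined as in the following Python sketch. The helper names and the `SENSITIVE_KEYS` list are assumptions made for illustration; the spec defines the field names and the `"[REDACTED]"` placeholder but not how sensitive fields are identified.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# Assumed list of sensitive keys; a real list would be project-specific.
SENSITIVE_KEYS = {"secret", "client_secret", "token", "password"}


def redact(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Replace sensitive values with a non-sensitive placeholder."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }


def build_audit_entry(tenant_id: str, actor_user_id: str, action: str,
                      target_type: str, target_id: str,
                      before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble one tenant-scoped audit entry with the minimum required fields."""
    return {
        "tenant_id": tenant_id,
        "actor_user_id": actor_user_id,
        "action": action,  # stable identifier, e.g. "policy.ignore"
        "target_type": target_type,
        "target_id": target_id,
        "before": redact(before),
        "after": redact(after),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```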

### Key Entities *(include if feature involves data)*

- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, terminal outcome, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it.
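
One way to derive the idempotency identity is to hash a canonical serialization of the tenant, run type, and effective inputs, deliberately leaving the initiator out. The Python sketch below is an assumption-laden illustration: the spec does not prescribe SHA-256, JSON canonicalization, or order-insensitive selections, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any


def _normalize(value: Any) -> Any:
    """Make inputs order-insensitive so equal selections hash identically."""
    if isinstance(value, dict):
        return {key: _normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        items = [_normalize(item) for item in value]
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def run_identity_hash(tenant_id: str, run_type: str, inputs: dict) -> str:
    """Deterministic identity for an 'identical run' (initiator is excluded)."""
    canonical = json.dumps(
        {"tenant_id": tenant_id, "run_type": run_type, "inputs": _normalize(inputs)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this approach, two operators selecting the same policies in a different order for the same backup set would produce the same hash and therefore land on the same active run.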

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
- **SC-003 Measurement scope (definition)**: An “attempt” counts when a start request is made for an operation with identical effective inputs while an identical run is already `queued` or `running`. The success condition is that the system reuses the existing active run reference rather than creating a second active run. “Normal conditions” exclude infrastructure outages (e.g., database unavailable) that prevent either run creation or dedupe evaluation.
- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
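
A test for SC-004 needs a sanitization step to exercise. The following Python sketch shows one possible shape, assuming a regex for lowercase dot-separated reason codes and a single bearer-token redaction pattern; both patterns and the function name are illustrative assumptions, and a real sanitizer would likely cover more secret formats.

```python
import re
from typing import Dict

# Lowercase, dot-separated identifiers such as "graph.throttled".
REASON_CODE_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
BEARER_PATTERN = re.compile(r"(?i)bearer\s+\S+")  # assumed secret shape
MAX_MESSAGE_LENGTH = 200


def sanitize_failure(code: str, message: str) -> Dict[str, str]:
    """Return a failure summary that is safe to persist and display."""
    if not REASON_CODE_PATTERN.match(code):
        # Fall back to a stable generic code rather than storing free text.
        code = "unexpected.exception"
    cleaned = BEARER_PATTERN.sub("[REDACTED]", message)
    cleaned = " ".join(cleaned.split())  # collapse whitespace and newlines
    if len(cleaned) > MAX_MESSAGE_LENGTH:
        cleaned = cleaned[: MAX_MESSAGE_LENGTH - 3] + "..."
    return {"code": code, "message": cleaned}
```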