TenantAtlas/specs/053-unify-runs-monitoring/spec.md

# Feature Specification: Unified Operations Runs + Monitoring Hub (053)

**Feature Branch**: `feat/053-unify-runs-monitoring`
**Created**: 2026-01-15
**Status**: Draft
**Input**: User description: "Unify long-running operations into a consistent, observable run model and provide a single Monitoring/Operations hub with tenant-scoped authorization, consistent status/count semantics, safe failure visibility, duplicate prevention, and Drift generation as a first-class tracked operation."

## Clarifications

### Session 2026-01-15

- Q: Who can view runs in Monitoring/Operations? → A: `Owner`, `Manager`, `Operator`, and `Readonly` (view-only; `Readonly` sees sanitized details but no action controls).
- Q: Which operations are included in Phase 1? → A: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
- Q: Should Monitoring/Operations include run management actions (rerun/cancel/delete)? → A: No; Monitoring/Operations is view-only in Phase 1 (start/re-run remain in feature UIs).
- Q: How is “partially succeeded” determined vs “failed”? → A: `partially succeeded` means both successes and failures occurred; `failed` means nothing succeeded (or the run could not proceed).
- Q: What failure detail should be stored and shown for runs? → A: Stable reason codes + short sanitized messages; itemized operations also include per-item failures (sanitized).

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Monitor operations in one place (Priority: P1)

As an operator, I want a single Monitoring/Operations area where I can see what operations are queued/running/succeeded/partially succeeded/failed for my tenant and drill into details, so I can quickly answer “what is happening” without jumping between modules.

**Why this priority**: Provides immediate operational visibility, reduces time-to-diagnose failures, and prevents “run sprawl” (different screens/status meanings per module).

**Independent Test**: Seed multiple runs with different types/statuses and verify the hub lists them tenant-scoped, supports filtering, and a run detail view is available with status, timing, summary counts, and safe error summaries.

**Acceptance Scenarios**:

1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see runs for tenant A sorted most-recent-first and the list defaults to the last 30 days.
2. **Given** the list contains multiple run types and statuses, **When** I filter by run type and status, **Then** I only see matching runs and can open a run detail page for any result.
3. **Given** I need an older or narrower timeframe, **When** I set a custom time range, **Then** the list updates to match the selected range.
4. **Given** I am a `Readonly` user in tenant A, **When** I open a run detail page, **Then** I can view status and sanitized failure information but I do not see controls to start, rerun, cancel, or delete runs.
5. **Given** I can view Monitoring/Operations, **When** I use the hub, **Then** it is view-only (no controls to start, rerun, cancel, or delete runs).
6. **Given** an itemized run has failures, **When** I open the run detail page, **Then** I can see which items failed with stable reason codes and short sanitized messages.
7. **Given** I attempt to access a run from another tenant (via list or direct link), **When** I request it, **Then** access is denied and no run data is disclosed.
8. **Given** a run has both successes and failures, **When** I view it, **Then** its outcome is “partially succeeded”; **And given** it has zero successes (or cannot proceed), **Then** its outcome is “failed”.
9. **Given** a run was initiated by the system, **When** I view it, **Then** the initiator is shown as “System”.

---

### User Story 2 - Start long-running actions without waiting (Priority: P2)

As an operator, when I trigger a long-running action, I want an immediate confirmation with a “View run” link so I can keep working while the operation completes in the background.

**Why this priority**: Prevents timeouts and long waits, improves reliability under retries/double-clicks, and standardizes the start experience across modules.

**Independent Test**: Start a supported operation (e.g., drift generation) and verify an operation run is created/reused, the UI returns quickly with a “View run” link, and the run progresses to a terminal status.

**Acceptance Scenarios**:

1. **Given** I have permission to start a supported operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run shows as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start a supported operation, **Then** the system denies the request and no new run is created.
3. **Given** I start Backup Set “Add Policies”, **When** I submit the selection, **Then** I receive immediate confirmation with a “View run” link and the run is visible in Monitoring/Operations.
4. **Given** I start a supported operation successfully, **When** it is queued, **Then** I receive a “Queued” notification with a “View run” link to the run detail view.
5. **Given** my run reaches a terminal outcome, **When** that occurs, **Then** I receive a completion notification stating `succeeded`, `partially succeeded`, or `failed`, including summary counts when applicable.
6. **Given** background execution is unavailable, **When** I attempt to start a supported operation, **Then** I receive a clear message that the operation cannot be queued and I do not receive a misleading “queued” confirmation.

---

### User Story 3 - Drift generation is observable like other operations (Priority: P3)

As an operator, I want drift generation to behave like any other operation: it creates a run, shows progress and outcomes, and links to results so drift is easy to monitor and troubleshoot.

**Why this priority**: Drift is an operator-facing workflow; making it observable avoids confusion and reduces support burden when it fails or takes time.

**Independent Test**: Trigger drift generation for a defined scope, verify it appears in the monitoring hub, reaches a terminal status, and provides either a results link or a safe failure summary.

**Acceptance Scenarios**:

1. **Given** drift generation is available for a selected scope, **When** I trigger “Generate drift now”, **Then** I see a drift run in Monitoring and can open its details.
2. **Given** drift generation is already queued/running for the same scope, **When** I trigger it again, **Then** the system reuses the existing run and does not start a duplicate.
3. **Given** drift generation succeeds for a scope, **When** I open the run detail view, **Then** I see a link to the drift findings produced for that scope.
4. **Given** drift generation fails, **When** I open the run detail view, **Then** I see a safe failure summary (reason code + short sanitized message) and no sensitive data.
5. **Given** drift generation is requested but there is not enough eligible data to compare, **When** I trigger it, **Then** the system produces no findings and communicates a clear, actionable reason (for example, “insufficient data”).

---

### Edge Cases

- If drift generation is requested for a scope without enough eligible data to compare, the system MUST refuse to start the operation or MUST complete the run as `failed` with a stable reason code (for example, `insufficient_data`) and a short, actionable message.
- If repeated start attempts occur for the same tenant + run type + scope while an existing run is `queued` or `running`, the system MUST reuse the existing run and MUST NOT start duplicate background work.
- If the initiator’s permissions change while a run is in progress, the run continues; visibility is evaluated at time of access; and completion notifications are delivered only if the recipient remains authorized to view the run.
- Failure details MUST remain sanitized even when underlying errors contain sensitive data: only stable reason codes and short sanitized messages are shown; secrets/tokens/raw payload dumps are never shown.
- For very large scopes, the system MAY summarize or truncate per-item failure listings in the UI, but it MUST preserve accurate summary counts and MUST indicate when failure details are truncated.

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any new external tenant reads/writes or any write/change behavior,
the spec MUST describe safety gates (preview/confirmation/audit), tenant isolation, auditability, and tests. This feature is primarily about observability and standardization; monitoring views MUST be read-only and MUST NOT trigger external data collection.

### Scope & Assumptions

- This feature standardizes tracking and monitoring for long-running operations that already exist in the product.
- **Phase 1 supported operations** are: Drift generation; Backup Set “Add Policies”. All other candidate operations are explicitly deferred to Phase 2+ adoption (for example: inventory sync, directory group sync, snapshot/backup capture, restore execution, cross-tenant comparison/promotion).
- System-initiated runs may exist (for example, scheduled operations); they appear in Monitoring/Operations with initiator shown as “System”. Notification routing for system-initiated runs is deferred to Phase 2 (Owner: Product) to avoid unintended noise.
- **Roles in this spec**: `Owner`, `Manager`, and `Operator` can view Monitoring/Operations and can start Phase 1 supported operations; `Readonly` can view Monitoring/Operations but cannot start or manage runs and must not see those controls.
- Cross-tenant monitoring aggregation is out of scope; monitoring is tenant-scoped.
- Advanced dashboards (charts, badges, progress widgets) are out of scope; focus is on consistent run tracking, filtering, and drill-down.
- Run retention horizon and scale targets (volume per tenant, archiving/export) are deferred to Phase 2 (Owner: Product) to align with real usage data and storage constraints. Phase 1 focuses on operational clarity and defaults the list view to a recent time window.
- This feature assumes background execution is enabled in each environment so long-running operations can complete outside of interactive user sessions; when it is unavailable, the system must communicate clearly and must not mislead operators into thinking work is queued.
- Monitoring/Operations is view-only in Phase 1; run start/re-run controls remain in their respective feature areas.
- Auditability for Phase 1 is achieved via the run record (initiator, timestamps, outcome, counts, safe failure summary) and lifecycle notifications. Auditing “who viewed which run” is deferred to Phase 2 (Owner: Product).

### Functional Requirements

- **FR-001**: System MUST track each supported long-running operation as a tenant-scoped run with: run type, scope/target (when applicable), status, timestamps (created/started/finished), initiator (user or system), and a human-readable label. The label MUST be a stable operator-facing description combining run type and scope/target (English-only in Phase 1; localization deferred to Phase 2).
- **FR-002**: System MUST provide a Monitoring/Operations area that lists runs for the current tenant, sorted most-recent-first by default, defaulting to a recent time window (last 30 days), and supporting filtering by run type, status, and time range. Status filter values MUST include: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Run type filtering MUST include the Phase 1 supported operations (Drift generation; Backup Set “Add Policies”).
- **FR-003**: System MUST provide a run detail view that shows status/outcome, timing, summary counts, and safe failure summaries. For itemized operations (operations that process a set of items), counts MUST include `total`, `succeeded`, `failed`, and `skipped` (if applicable).
- **FR-004**: System MUST use consistent run status semantics across run types using the Phase 1 status set: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Status meanings MUST be unambiguous: `partially succeeded` indicates at least one success and at least one failure; `failed` indicates zero successes (or the run could not proceed). Cancellation/abort outcomes are deferred to Phase 2.
- **FR-005**: When an operator starts a supported long-running operation, the system MUST provide immediate confirmation and a “View run” link that opens the run detail view without blocking on completion. If background execution is unavailable, the system MUST provide a clear error and MUST NOT present a misleading “queued” confirmation.
- **FR-006**: System MUST avoid duplicate runs for the same tenant + run type + scope when an identical run is already `queued` or `running` by reusing the existing run. “Identical” means the same tenant, the same run type, the same scope/target, and the same effective inputs (for example: the same drift scope selection; the same backup set and selected policies). The initiator MUST NOT be part of the identity for duplicate prevention.
- **FR-007**: Drift generation MUST be tracked as a run and MUST surface completion status and either a link to produced findings or an actionable, safe failure summary.
- **FR-008**: Run list, run view, and run start actions MUST be tenant-scoped and forbidden cross-tenant. Tenant scoping MUST be applied before any filtering or lookup to prevent cross-tenant data leakage, and cross-tenant access attempts MUST NOT disclose run existence or details.
- **FR-009**: Run visibility and run start actions MUST be permission-gated by run type (least privilege). By default, `Owner`, `Manager`, `Operator`, and `Readonly` can view runs, but `Readonly` MUST NOT be able to start or manage runs (and must not see those controls).
- **FR-010**: Failure information stored and displayed for runs MUST be sanitized and minimized; it MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps. Runs MUST store stable reason codes with short sanitized messages. Itemized operations MUST additionally store a sanitized per-item failures list that identifies the affected item using a non-sensitive reference (for example, an item name and/or ID that is safe to display) plus reason code and short message.
- **FR-011**: System MUST provide consistent user notifications for run lifecycle events: when queued and when reaching a terminal outcome (`succeeded` / `partially succeeded` / `failed`). Notifications MUST include a “View run” link that opens the run detail view. Phase 1 notifications MUST be delivered to the initiating user; notification routing for system-initiated runs is deferred to Phase 2. If a notification cannot be delivered, Monitoring/Operations remains the source of truth for run status.
- **FR-012**: Where a run produces or relates to a separate business artifact (for example, drift findings), the run detail view MUST provide a link to that artifact.
- **FR-013**: Backup Set “Add Policies” MUST be tracked as a run and MUST surface terminal outcome status, summary counts, and a link back to the related backup set context.
- **FR-014**: Monitoring/Operations MUST be view-only in Phase 1 and MUST NOT offer controls to start, rerun, cancel, or delete runs.

### Key Entities *(include if feature involves data)*

- **Run**: A tenant-scoped record representing the execution of a long-running operation, including status, timestamps, summary counts, and safe failure information.
- **Failure Detail**: For a run, a stable reason code plus a short sanitized message; for itemized operations, a sanitized list of per-item failures to identify affected items without exposing sensitive data.
- **Run Type**: A classification of what the run does (e.g., inventory sync, backup capture, restore, drift generation) used for filtering, permissions, and consistent status/count semantics.
- **Run Scope/Target**: The object or scope the run applies to (e.g., a backup set, a policy, a drift scope), used for duplicate prevention and navigation.
- **Related Artifact**: A separate business record produced by an operation (e.g., restore details, drift findings) that can be linked from the run.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: An operator can determine the current state (`queued` / `running` / `succeeded` / `partially succeeded` / `failed`) of any recent operation in under 30 seconds using Monitoring/Operations (measured via timed operator walkthroughs using Monitoring/Operations only).
- **SC-002**: Starting a Phase 1 supported long-running operation provides user-visible confirmation and a “View run” link within 2 seconds under normal conditions (normal conditions: no active service degradation and typical tenant dataset sizes; excludes maintenance/outage windows; measured via timed start flows during operator walkthroughs).
- **SC-003**: For identical Phase 1 operation requests (same tenant + run type + scope + effective inputs), the system creates no more than one active run at a time (`queued` or `running`) in at least 99% of repeated-start attempts over a rolling 30-day window (measured by sampling repeated-start attempts and counting resulting active runs).
- **SC-004**: For Phase 1 terminal runs, operators can identify a clear outcome (`succeeded` / `partially succeeded` / `failed`) and a short, non-sensitive failure reason (when applicable) using Monitoring/Operations without inspecting server logs in at least 95% of cases over a rolling 30-day window (measured via support/operator triage review of run records).
- **SC-005**: Operator- or support-reported incidents caused by “unknown/stuck status” long-running operations decrease by at least 50% within one release cycle after Phase 1 adoption (measured via support ticket tagging/categorization).