TenantAtlas/specs/053-unify-runs-monitoring/spec.md
ahmido 30ad57baab feat/053-unify-runs-monitoring (#60)
Summary

This PR introduces Unified Operations Runs + Monitoring Hub (053).

Goal: Standardize how long-running operations are tracked and monitored using the existing tenant-scoped run record (BulkOperationRun) as the canonical “operation run”, and surface it in a single Monitoring → Operations hub (view-only, tenant-scoped, role-aware).

Phase 1 adoption scope (per spec):
	•	Drift generation (drift.generate)
	•	Backup Set “Add Policies” (backup_set.add_policies)

Note: This PR does not convert every run type yet (e.g. GroupSyncRuns / InventorySyncRuns remain separate for now). This is intentionally incremental.

⸻

What changed

Monitoring / Operations hub
	•	Moved/organized run monitoring under Monitoring → Operations
	•	Added:
		•	status buckets (queued / running / succeeded / partially succeeded / failed)
		•	filters (run type, status bucket, time range)
		•	run detail “Related” links (e.g. Drift findings, Backup Set context)
	•	All hub pages are DB-only and view-only (no rerun/cancel/delete actions)

Canonical run semantics
	•	Added canonical helpers on BulkOperationRun:
		•	runType() (resource.action)
		•	statusBucket() derived from status + counts (testable semantics)

Drift integration (Phase 1)
	•	Drift generation start behavior now:
		•	creates/reuses a BulkOperationRun with drift context payload (scope_key + baseline/current run ids)
		•	dispatches generation job
		•	emits DB notifications including “View run” link
	•	On generation failure: stores sanitized failure entries + sends failure notification

Permissions / tenant isolation
	•	Monitoring run list/view is tenant-scoped and returns 403 for cross-tenant access
	•	Readonly can view runs but cannot start drift generation

⸻

Tests

Added/updated Pest coverage:
	•	BulkOperationRunStatusBucketTest.php
	•	DriftGenerationDispatchTest.php
	•	GenerateDriftFindingsJobNotificationTest.php
	•	RunAuthorizationTenantIsolationTest.php

Validation run locally:
	•	./vendor/bin/pint --dirty
	•	targeted tests from feature quickstart / drift monitoring tests

⸻

Manual QA
	1.	Go to Monitoring → Operations
		•	verify filters (run type / status / time range)
		•	verify run detail shows counts + sanitized failures + “Related” links
	2.	Open Drift Landing
		•	with >=2 successful inventory runs for scope: should queue drift generation + show notification with “View run”
		•	as readonly: should not start generation
	3.	Run detail
		•	drift.generate runs show “Drift findings” related link
		•	failure entries are sanitized (no secrets/tokens/raw payload dumps)

⸻

Notes / Ops
	•	Queue workers must be restarted after deploy so they load the new code:
		•	php artisan queue:restart (or Sail equivalent)
	•	This PR standardizes monitoring for Phase 1 producers only; follow-ups will migrate additional run types into the unified pattern.

⸻

Spec / Docs
	•	SpecKit artifacts added under specs/053-unify-runs-monitoring/
	•	Checklists are complete:
		•	requirements checklist PASS
		•	writing checklist PASS

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #60
2026-01-16 15:10:31 +00:00

# Feature Specification: Unified Operations Runs + Monitoring Hub (053)
**Feature Branch**: `feat/053-unify-runs-monitoring`
**Created**: 2026-01-15
**Status**: Draft
**Input**: User description: "Unify long-running operations into a consistent, observable run model and provide a single Monitoring/Operations hub with tenant-scoped authorization, consistent status/count semantics, safe failure visibility, duplicate prevention, and Drift generation as a first-class tracked operation."
## Clarifications
### Session 2026-01-15
- Q: Who can view runs in Monitoring/Operations? → A: `Owner`, `Manager`, `Operator`, and `Readonly` (view-only; `Readonly` sees sanitized details but no action controls).
- Q: Which operations are included in Phase 1? → A: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
- Q: Should Monitoring/Operations include run management actions (rerun/cancel/delete)? → A: No; Monitoring/Operations is view-only in Phase 1 (start/re-run remain in feature UIs).
- Q: How is “partially succeeded” determined vs “failed”? → A: `partially succeeded` means both successes and failures occurred; `failed` means nothing succeeded (or the run could not proceed).
- Q: What failure detail should be stored and shown for runs? → A: Stable reason codes + short sanitized messages; itemized operations also include per-item failures (sanitized).
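
The “partially succeeded” versus “failed” rule above can be sketched as a small pure function. This is a minimal illustration in Python, not the project's `statusBucket()` helper: the raw status names, the `could_not_proceed` flag, and the convention that a non-itemized run records one unit of work are all assumptions made for the example.

```python
def status_bucket(status: str, succeeded: int, failed: int,
                  could_not_proceed: bool = False) -> str:
    """Derive the hub's outcome bucket from a raw status plus item counts.

    Assumption: any status other than "queued" or "running" is terminal,
    and a non-itemized run records succeeded=1 when it completes normally.
    """
    # Non-terminal runs keep their lifecycle status as the bucket.
    if status in ("queued", "running"):
        return status
    # Terminal run where nothing succeeded, or that could not proceed.
    if could_not_proceed or succeeded == 0:
        return "failed"
    # At least one success and at least one failure.
    if failed > 0:
        return "partially succeeded"
    return "succeeded"
```

Keeping the derivation free of I/O means the bucket semantics can be unit-tested without a database or queue.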
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Monitor operations in one place (Priority: P1)
As an operator, I want a single Monitoring/Operations area where I can see what operations are queued/running/succeeded/partially succeeded/failed for my tenant and drill into details, so I can quickly answer “what is happening” without jumping between modules.
**Why this priority**: Provides immediate operational visibility, reduces time-to-diagnose failures, and prevents “run sprawl” (different screens/status meanings per module).
**Independent Test**: Seed multiple runs with different types/statuses and verify the hub lists them tenant-scoped, supports filtering, and a run detail view is available with status, timing, summary counts, and safe error summaries.
**Acceptance Scenarios**:
1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see runs for tenant A sorted most-recent-first and the list defaults to the last 30 days.
2. **Given** the list contains multiple run types and statuses, **When** I filter by run type and status, **Then** I only see matching runs and can open a run detail page for any result.
3. **Given** I need an older or narrower timeframe, **When** I set a custom time range, **Then** the list updates to match the selected range.
4. **Given** I am a `Readonly` user in tenant A, **When** I open a run detail page, **Then** I can view status and sanitized failure information but I do not see controls to start, rerun, cancel, or delete runs.
5. **Given** I can view Monitoring/Operations, **When** I use the hub, **Then** it is view-only (no controls to start, rerun, cancel, or delete runs).
6. **Given** an itemized run has failures, **When** I open the run detail page, **Then** I can see which items failed with stable reason codes and short sanitized messages.
7. **Given** I attempt to access a run from another tenant (via list or direct link), **When** I request it, **Then** access is denied and no run data is disclosed.
8. **Given** a run has both successes and failures, **When** I view it, **Then** its outcome is “partially succeeded”; **And given** it has zero successes (or cannot proceed), **Then** its outcome is “failed”.
9. **Given** a run was initiated by the system, **When** I view it, **Then** the initiator is shown as “System”.
---
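The list and lookup behavior in the acceptance scenarios above (tenant scoping, 30-day default window, filters, most-recent-first ordering, no cross-tenant disclosure) might look roughly like this. It is an in-memory Python sketch under stated assumptions: the `Run` fields, `list_runs`, and `find_run` are hypothetical names, and a real implementation would express the same steps as database queries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Run:
    id: int
    tenant_id: int
    run_type: str
    bucket: str
    created_at: datetime


def list_runs(runs: Iterable[Run], tenant_id: int, *,
              run_type: Optional[str] = None,
              bucket: Optional[str] = None,
              since: Optional[datetime] = None,
              until: Optional[datetime] = None,
              now: Optional[datetime] = None) -> List[Run]:
    now = now or datetime.now(timezone.utc)
    # Default window: the last 30 days, unless a custom range is given.
    since = since if since is not None else now - timedelta(days=30)
    until = until if until is not None else now
    # Tenant scoping is applied first, before any other filter.
    scoped = [r for r in runs if r.tenant_id == tenant_id]
    matched = [
        r for r in scoped
        if since <= r.created_at <= until
        and (run_type is None or r.run_type == run_type)
        and (bucket is None or r.bucket == bucket)
    ]
    # Most recent first.
    return sorted(matched, key=lambda r: r.created_at, reverse=True)


def find_run(runs: Iterable[Run], tenant_id: int, run_id: int) -> Optional[Run]:
    # A run from another tenant is indistinguishable from a missing run.
    for r in runs:
        if r.tenant_id == tenant_id and r.id == run_id:
            return r
    return None
```

Returning the same "not found" result for missing and foreign runs is one way to avoid disclosing run existence across tenants.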
### User Story 2 - Start long-running actions without waiting (Priority: P2)
As an operator, when I trigger a long-running action, I want an immediate confirmation with a “View run” link so I can keep working while the operation completes in the background.
**Why this priority**: Prevents timeouts and long waits, improves reliability under retries/double-clicks, and standardizes the start experience across modules.
**Independent Test**: Start a supported operation (e.g., drift generation) and verify an operation run is created/reused, the UI returns quickly with a “View run” link, and the run progresses to a terminal status.
**Acceptance Scenarios**:
1. **Given** I have permission to start a supported operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run shows as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start a supported operation, **Then** the system denies the request and no new run is created.
3. **Given** I start Backup Set “Add Policies”, **When** I submit the selection, **Then** I receive immediate confirmation with a “View run” link and the run is visible in Monitoring/Operations.
4. **Given** I start a supported operation successfully, **When** it is queued, **Then** I receive a “Queued” notification with a “View run” link to the run detail view.
5. **Given** my run reaches a terminal outcome, **When** that occurs, **Then** I receive a completion notification stating `succeeded`, `partially succeeded`, or `failed`, including summary counts when applicable.
6. **Given** background execution is unavailable, **When** I attempt to start a supported operation, **Then** I receive a clear message that the operation cannot be queued and I do not receive a misleading “queued” confirmation.
---
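The start flow described in the scenarios above can be sketched as follows. This is a hedged Python illustration, not the product's code: `start_operation`, the injected callables, and the `/monitoring/operations/{id}` URL shape are hypothetical. It shows the two refusal paths (role not permitted, background execution unavailable) happening before any run is created, so no misleading “queued” confirmation is returned.

```python
from dataclasses import dataclass
from typing import Callable

# Roles allowed to start Phase 1 operations; Readonly is view-only.
START_ROLES = frozenset({"Owner", "Manager", "Operator"})


class StartDenied(Exception):
    """The caller's role may view runs but not start them."""


class QueueUnavailable(Exception):
    """Background execution is not available, so nothing was queued."""


@dataclass(frozen=True)
class StartConfirmation:
    run_id: int
    status: str
    view_run_url: str


def start_operation(role: str, tenant_id: int, run_type: str, *,
                    queue_available: Callable[[], bool],
                    create_run: Callable[[int, str], int],
                    enqueue: Callable[[int], None]) -> StartConfirmation:
    # Deny before touching storage so no run record is created.
    if role not in START_ROLES:
        raise StartDenied(f"{role} cannot start {run_type}")
    # Fail loudly instead of returning a misleading "queued" confirmation.
    if not queue_available():
        raise QueueUnavailable("The operation cannot be queued right now")
    run_id = create_run(tenant_id, run_type)
    enqueue(run_id)
    # Hypothetical URL shape for the run detail view.
    return StartConfirmation(run_id, "queued", f"/monitoring/operations/{run_id}")
```

Injecting `queue_available`, `create_run`, and `enqueue` keeps the ordering of checks testable without a real queue.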
### User Story 3 - Drift generation is observable like other operations (Priority: P3)
As an operator, I want drift generation to behave like any other operation: it creates a run, shows progress and outcomes, and links to results so drift is easy to monitor and troubleshoot.
**Why this priority**: Drift is an operator-facing workflow; making it observable avoids confusion and reduces support burden when it fails or takes time.
**Independent Test**: Trigger drift generation for a defined scope, verify it appears in the monitoring hub, reaches a terminal status, and provides either a results link or a safe failure summary.
**Acceptance Scenarios**:
1. **Given** drift generation is available for a selected scope, **When** I trigger “Generate drift now”, **Then** I see a drift run in Monitoring and can open its details.
2. **Given** drift generation is already queued/running for the same scope, **When** I trigger it again, **Then** the system reuses the existing run and does not start a duplicate.
3. **Given** drift generation succeeds for a scope, **When** I open the run detail view, **Then** I see a link to the drift findings produced for that scope.
4. **Given** drift generation fails, **When** I open the run detail view, **Then** I see a safe failure summary (reason code + short sanitized message) and no sensitive data.
5. **Given** drift generation is requested but there is not enough eligible data to compare, **When** I trigger it, **Then** the system produces no findings and communicates a clear, actionable reason (for example, “insufficient data”).
---
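The “insufficient data” scenario above can be illustrated with a small pre-check. This Python sketch rests on assumptions: that at least two successful inventory runs for the scope are required, that the two most recent ones serve as baseline and current, and that run ids increase over time. `plan_drift_generation` is a hypothetical name; only the `insufficient_data` reason code comes from this spec.

```python
from typing import Dict, List, Union


def plan_drift_generation(
    successful_inventory_run_ids: List[int],
) -> Dict[str, Union[bool, int, str]]:
    """Decide whether drift generation can start for one scope."""
    # Comparing needs a baseline and a current inventory run.
    if len(successful_inventory_run_ids) < 2:
        return {
            "start": False,
            "reason_code": "insufficient_data",
            "message": "At least two successful inventory runs are needed to compare.",
        }
    # Assumption: ids increase over time, so the two largest are the latest.
    ordered = sorted(successful_inventory_run_ids)
    return {
        "start": True,
        "baseline_run_id": ordered[-2],
        "current_run_id": ordered[-1],
    }
```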
### Edge Cases
- If drift generation is requested for a scope without enough eligible data to compare, the system MUST refuse to start the operation or MUST complete the run as `failed` with a stable reason code (for example, `insufficient_data`) and a short, actionable message.
- If repeated start attempts occur for the same tenant + run type + scope while an existing run is `queued` or `running`, the system MUST reuse the existing run and MUST NOT start duplicate background work.
- If the initiator's permissions change while a run is in progress, the run continues; visibility is evaluated at time of access; and completion notifications are delivered only if the recipient remains authorized to view the run.
- Failure details MUST remain sanitized even when underlying errors contain sensitive data: only stable reason codes and short sanitized messages are shown; secrets/tokens/raw payload dumps are never shown.
- For very large scopes, the system MAY summarize or truncate per-item failure listings in the UI, but it MUST preserve accurate summary counts and MUST indicate when failure details are truncated.
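
The duplicate-prevention edge case above can be sketched as an identity key plus a reuse-or-create step. This is a minimal Python illustration with hypothetical names (`run_identity`, `find_or_create_run`) and an in-memory dict standing in for storage. A production version would also need an atomic guard, such as a lock or a uniqueness constraint, which this sketch does not show.

```python
import hashlib
import json
from typing import Callable, Dict, Tuple

ACTIVE_STATUSES = {"queued", "running"}


def run_identity(tenant_id: int, run_type: str, scope: str, inputs: dict) -> str:
    """Stable key over tenant, run type, scope, and effective inputs.

    The initiator is deliberately not part of the key.
    """
    payload = json.dumps(
        {"tenant": tenant_id, "type": run_type, "scope": scope, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_or_create_run(runs_by_identity: Dict[str, dict], identity: str,
                       create: Callable[[], dict]) -> Tuple[dict, bool]:
    """Return (run, created). Reuse an active run instead of starting a duplicate."""
    existing = runs_by_identity.get(identity)
    if existing is not None and existing["status"] in ACTIVE_STATUSES:
        return existing, False
    run = create()
    runs_by_identity[identity] = run
    return run, True
```

Callers are expected to normalize inputs first (for example, sorting the selected policy ids) so that equivalent selections produce the same key.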
## Requirements *(mandatory)*
**Constitution alignment (required):** If this feature introduces any new external tenant reads/writes or any write/change behavior, the spec MUST describe safety gates (preview/confirmation/audit), tenant isolation, auditability, and tests. This feature is primarily about observability and standardization; monitoring views MUST be read-only and MUST NOT trigger external data collection.
### Scope & Assumptions
- This feature standardizes tracking and monitoring for long-running operations that already exist in the product.
- **Phase 1 supported operations** are: Drift generation; Backup Set “Add Policies”. All other candidate operations are explicitly deferred to Phase 2+ adoption (for example: inventory sync, directory group sync, snapshot/backup capture, restore execution, cross-tenant comparison/promotion).
- System-initiated runs may exist (for example, scheduled operations); they appear in Monitoring/Operations with initiator shown as “System”. Notification routing for system-initiated runs is deferred to Phase 2 (Owner: Product) to avoid unintended noise.
- **Roles in this spec**: `Owner`, `Manager`, and `Operator` can view Monitoring/Operations and can start Phase 1 supported operations; `Readonly` can view Monitoring/Operations but cannot start or manage runs and must not see those controls.
- Cross-tenant monitoring aggregation is out of scope; monitoring is tenant-scoped.
- Advanced dashboards (charts, badges, progress widgets) are out of scope; focus is on consistent run tracking, filtering, and drill-down.
- Run retention horizon and scale targets (volume per tenant, archiving/export) are deferred to Phase 2 (Owner: Product) to align with real usage data and storage constraints. Phase 1 focuses on operational clarity and defaults the list view to a recent time window.
- This feature assumes background execution is enabled in each environment so long-running operations can complete outside of interactive user sessions; when it is unavailable, the system must communicate clearly and must not mislead operators into thinking work is queued.
- Monitoring/Operations is view-only in Phase 1; run start/re-run controls remain in their respective feature areas.
- Auditability for Phase 1 is achieved via the run record (initiator, timestamps, outcome, counts, safe failure summary) and lifecycle notifications. Auditing “who viewed which run” is deferred to Phase 2 (Owner: Product).
### Functional Requirements
- **FR-001**: System MUST track each supported long-running operation as a tenant-scoped run with: run type, scope/target (when applicable), status, timestamps (created/started/finished), initiator (user or system), and a human-readable label. The label MUST be a stable operator-facing description combining run type and scope/target (English-only in Phase 1; localization deferred to Phase 2).
- **FR-002**: System MUST provide a Monitoring/Operations area that lists runs for the current tenant, sorted most-recent-first by default, defaulting to a recent time window (last 30 days), and supporting filtering by run type, status, and time range. Status filter values MUST include: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Run type filtering MUST include the Phase 1 supported operations (Drift generation; Backup Set “Add Policies”).
- **FR-003**: System MUST provide a run detail view that shows status/outcome, timing, summary counts, and safe failure summaries. For itemized operations (operations that process a set of items), counts MUST include `total`, `succeeded`, `failed`, and `skipped` (if applicable).
- **FR-004**: System MUST use consistent run status semantics across run types using the Phase 1 status set: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Status meanings MUST be unambiguous: `partially succeeded` indicates at least one success and at least one failure; `failed` indicates zero successes (or the run could not proceed). Cancellation/abort outcomes are deferred to Phase 2.
- **FR-005**: When an operator starts a supported long-running operation, the system MUST provide immediate confirmation and a “View run” link that opens the run detail view without blocking on completion. If background execution is unavailable, the system MUST provide a clear error and MUST NOT present a misleading “queued” confirmation.
- **FR-006**: System MUST avoid duplicate runs for the same tenant + run type + scope when an identical run is already `queued` or `running` by reusing the existing run. “Identical” means the same tenant, the same run type, the same scope/target, and the same effective inputs (for example: the same drift scope selection; the same backup set and selected policies). The initiator MUST NOT be part of the identity for duplicate prevention.
- **FR-007**: Drift generation MUST be tracked as a run and MUST surface completion status and either a link to produced findings or an actionable, safe failure summary.
- **FR-008**: Run list, run view, and run start actions MUST be tenant-scoped and forbidden cross-tenant. Tenant scoping MUST be applied before any filtering or lookup to prevent cross-tenant data leakage, and cross-tenant access attempts MUST NOT disclose run existence or details.
- **FR-009**: Run visibility and run start actions MUST be permission-gated by run type (least privilege). By default, `Owner`, `Manager`, `Operator`, and `Readonly` can view runs, but `Readonly` MUST NOT be able to start or manage runs (and must not see those controls).
- **FR-010**: Failure information stored and displayed for runs MUST be sanitized and minimized; it MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps. Runs MUST store stable reason codes with short sanitized messages. Itemized operations MUST additionally store a sanitized per-item failures list that identifies the affected item using a non-sensitive reference (for example, an item name and/or ID that is safe to display) plus reason code and short message.
- **FR-011**: System MUST provide consistent user notifications for run lifecycle events: when queued and when reaching a terminal outcome (`succeeded` / `partially succeeded` / `failed`). Notifications MUST include a “View run” link that opens the run detail view. Phase 1 notifications MUST be delivered to the initiating user; notification routing for system-initiated runs is deferred to Phase 2. If a notification cannot be delivered, Monitoring/Operations remains the source of truth for run status.
- **FR-012**: Where a run produces or relates to a separate business artifact (for example, drift findings), the run detail view MUST provide a link to that artifact.
- **FR-013**: Backup Set “Add Policies” MUST be tracked as a run and MUST surface terminal outcome status, summary counts, and a link back to the related backup set context.
- **FR-014**: Monitoring/Operations MUST be view-only in Phase 1 and MUST NOT offer controls to start, rerun, cancel, or delete runs.
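
The sanitization requirement in FR-010 can be sketched as a function that pairs a stable reason code with a short, redacted message. This Python example is illustrative only: the patterns, the 160-character cap, and the `sanitize_failure` name are assumptions, and pattern-based redaction is not exhaustive. Mapping known error classes to fixed messages would be a stricter alternative.

```python
import re
from typing import Dict

MAX_MESSAGE_LENGTH = 160  # hypothetical cap for a "short" message

# Illustrative patterns only; real secrets take many more shapes.
_SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"),
    re.compile(r"(?i)\b(password|secret|token|api[_-]?key)\b\s*[=:]\s*\S+"),
]


def sanitize_failure(reason_code: str, raw_message: str) -> Dict[str, str]:
    """Return a stable reason code with a short, redacted message."""
    message = raw_message
    for pattern in _SECRET_PATTERNS:
        message = pattern.sub("[redacted]", message)
    # Collapse newlines and repeated whitespace so dumps become one line.
    message = " ".join(message.split())
    # Truncate long messages, keeping the total within the cap.
    if len(message) > MAX_MESSAGE_LENGTH:
        message = message[: MAX_MESSAGE_LENGTH - 3] + "..."
    return {"reason_code": reason_code, "message": message}
```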
### Key Entities *(include if feature involves data)*
- **Run**: A tenant-scoped record representing the execution of a long-running operation, including status, timestamps, summary counts, and safe failure information.
- **Failure Detail**: For a run, a stable reason code plus a short sanitized message; for itemized operations, a sanitized list of per-item failures to identify affected items without exposing sensitive data.
- **Run Type**: A classification of what the run does (e.g., inventory sync, backup capture, restore, drift generation) used for filtering, permissions, and consistent status/count semantics.
- **Run Scope/Target**: The object or scope the run applies to (e.g., a backup set, a policy, a drift scope), used for duplicate prevention and navigation.
- **Related Artifact**: A separate business record produced by an operation (e.g., restore details, drift findings) that can be linked from the run.
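
The entities above could be modeled roughly as follows. This is a hypothetical Python sketch, not the schema of the existing run record: field names, the label format, and the display limit are assumptions. It also shows one way to truncate per-item failures for display while keeping the summary count accurate and flagging the truncation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ItemFailure:
    item_ref: str      # non-sensitive reference that is safe to display
    reason_code: str   # stable reason code
    message: str       # short sanitized message


@dataclass
class Run:
    tenant_id: int
    run_type: str
    scope: Optional[str]
    status: str
    created_at: datetime
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    initiator: Optional[str] = None  # None means system-initiated
    total: int = 0
    succeeded: int = 0
    failed: int = 0
    skipped: int = 0
    item_failures: List[ItemFailure] = field(default_factory=list)

    def initiator_label(self) -> str:
        return self.initiator or "System"

    def label(self) -> str:
        # Operator-facing label combining run type and scope/target.
        return f"{self.run_type} ({self.scope})" if self.scope else self.run_type


def failures_for_display(run: Run, limit: int = 50) -> dict:
    """Truncate the per-item list but keep the true failed count."""
    return {
        "failed_count": run.failed,
        "shown": run.item_failures[:limit],
        "truncated": len(run.item_failures) > limit,
    }
```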
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: An operator can determine the current state (`queued` / `running` / `succeeded` / `partially succeeded` / `failed`) of any recent operation in under 30 seconds using Monitoring/Operations (measured via timed operator walkthroughs using Monitoring/Operations only).
- **SC-002**: Starting a Phase 1 supported long-running operation provides user-visible confirmation and a “View run” link within 2 seconds under normal conditions (normal conditions: no active service degradation and typical tenant dataset sizes; excludes maintenance/outage windows; measured via timed start flows during operator walkthroughs).
- **SC-003**: For identical Phase 1 operation requests (same tenant + run type + scope + effective inputs), the system creates no more than one active run at a time (`queued` or `running`) in at least 99% of repeated-start attempts over a rolling 30-day window (measured by sampling repeated-start attempts and counting resulting active runs).
- **SC-004**: For Phase 1 terminal runs, operators can identify a clear outcome (`succeeded` / `partially succeeded` / `failed`) and a short, non-sensitive failure reason (when applicable) using Monitoring/Operations without inspecting server logs in at least 95% of cases over a rolling 30-day window (measured via support/operator triage review of run records).
- **SC-005**: Operator- or support-reported incidents caused by “unknown/stuck status” long-running operations decrease by at least 50% within one release cycle after Phase 1 adoption (measured via support ticket tagging/categorization).