Summary

This PR introduces Unified Operations Runs + Monitoring Hub (053).

Goal: Standardize how long-running operations are tracked and monitored using the existing tenant-scoped run record (BulkOperationRun) as the canonical “operation run”, and surface it in a single Monitoring → Operations hub (view-only, tenant-scoped, role-aware).

Phase 1 adoption scope (per spec):
• Drift generation (drift.generate)
• Backup Set “Add Policies” (backup_set.add_policies)

Note: This PR does not convert every run type yet (e.g. GroupSyncRuns / InventorySyncRuns remain separate for now). This is intentionally incremental.

⸻

What changed

Monitoring / Operations hub
• Moved/organized run monitoring under Monitoring → Operations
• Added:
  • status buckets (queued / running / succeeded / partially succeeded / failed)
  • filters (run type, status bucket, time range)
  • run detail “Related” links (e.g. Drift findings, Backup Set context)
• All hub pages are DB-only and view-only (no rerun/cancel/delete actions)

Canonical run semantics
• Added canonical helpers on BulkOperationRun:
  • runType() (resource.action)
  • statusBucket() derived from status + counts (testable semantics)

Drift integration (Phase 1)
• Drift generation start behavior now:
  • creates/reuses a BulkOperationRun with drift context payload (scope_key + baseline/current run ids)
  • dispatches generation job
  • emits DB notifications including “View run” link
• On generation failure: stores sanitized failure entries + sends failure notification

Permissions / tenant isolation
• Monitoring run list/view is tenant-scoped and returns 403 for cross-tenant access
• Readonly can view runs but cannot start drift generation

⸻

Tests

Added/updated Pest coverage:
• BulkOperationRunStatusBucketTest.php
• DriftGenerationDispatchTest.php
• GenerateDriftFindingsJobNotificationTest.php
• RunAuthorizationTenantIsolationTest.php

Validation run locally:
• ./vendor/bin/pint --dirty
• targeted tests from feature quickstart / drift monitoring tests

⸻

Manual QA

1. Go to Monitoring → Operations
  • verify filters (run type / status / time range)
  • verify run detail shows counts + sanitized failures + “Related” links
2. Open Drift Landing
  • with >=2 successful inventory runs for scope: should queue drift generation + show notification with “View run”
  • as readonly: should not start generation
3. Run detail
  • drift.generate runs show “Drift findings” related link
  • failure entries are sanitized (no secrets/tokens/raw payload dumps)

⸻

Notes / Ops
• Queue workers must be restarted after deploy so they load the new code:
  • php artisan queue:restart (or Sail equivalent)
• This PR standardizes monitoring for Phase 1 producers only; follow-ups will migrate additional run types into the unified pattern.

⸻

Spec / Docs
• SpecKit artifacts added under specs/053-unify-runs-monitoring/
• Checklists are complete:
  • requirements checklist PASS
  • writing checklist PASS

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #60

# Feature Specification: Unified Operations Runs + Monitoring Hub (053)

**Feature Branch**: `feat/053-unify-runs-monitoring`
**Created**: 2026-01-15
**Status**: Draft
**Input**: User description: "Unify long-running operations into a consistent, observable run model and provide a single Monitoring/Operations hub with tenant-scoped authorization, consistent status/count semantics, safe failure visibility, duplicate prevention, and Drift generation as a first-class tracked operation."

## Clarifications

### Session 2026-01-15

- Q: Who can view runs in Monitoring/Operations? → A: `Owner`, `Manager`, `Operator`, and `Readonly` (view-only; `Readonly` sees sanitized details but no action controls).
- Q: Which operations are included in Phase 1? → A: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
- Q: Should Monitoring/Operations include run management actions (rerun/cancel/delete)? → A: No; Monitoring/Operations is view-only in Phase 1 (start/re-run remain in feature UIs).
- Q: How is “partially succeeded” determined vs “failed”? → A: `partially succeeded` means both successes and failures occurred; `failed` means nothing succeeded (or the run could not proceed).
- Q: What failure detail should be stored and shown for runs? → A: Stable reason codes + short sanitized messages; itemized operations also include per-item failures (sanitized).

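
The outcome rule from the clarification above can be written as a small pure function. This is a minimal sketch in Python for illustration only, not the project's implementation; the raw status names (`queued`, `running`, `completed`, `failed`) and the function name are assumptions. It also assumes that a completed run with no failures counts as `succeeded` even when it processed zero items, which the spec does not state.

```python
def status_bucket(status: str, succeeded: int, failed: int) -> str:
    """Map a raw run status plus item counts to a Phase 1 outcome bucket.

    Assumption: the raw status is one of 'queued', 'running', 'completed'
    or 'failed' (hypothetical names, not taken from the spec).
    """
    if status in ("queued", "running"):
        # Non-terminal runs keep their lifecycle status.
        return status
    if status == "failed":
        # The run could not proceed at all.
        return "failed"
    if failed == 0:
        return "succeeded"
    # At least one failure: partial only if something also succeeded.
    return "partially succeeded" if succeeded > 0 else "failed"
```

Keeping the rule free of I/O makes each branch easy to cover with a table of status/count combinations.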
## User Scenarios & Testing *(mandatory)*

### User Story 1 - Monitor operations in one place (Priority: P1)

As an operator, I want a single Monitoring/Operations area where I can see what operations are queued/running/succeeded/partially succeeded/failed for my tenant and drill into details, so I can quickly answer “what is happening” without jumping between modules.

**Why this priority**: Provides immediate operational visibility, reduces time-to-diagnose failures, and prevents “run sprawl” (different screens/status meanings per module).

**Independent Test**: Seed multiple runs with different types/statuses and verify the hub lists them tenant-scoped, supports filtering, and a run detail view is available with status, timing, summary counts, and safe error summaries.

**Acceptance Scenarios**:

1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see runs for tenant A sorted most-recent-first and the list defaults to the last 30 days.
2. **Given** the list contains multiple run types and statuses, **When** I filter by run type and status, **Then** I only see matching runs and can open a run detail page for any result.
3. **Given** I need an older or narrower timeframe, **When** I set a custom time range, **Then** the list updates to match the selected range.
4. **Given** I am a `Readonly` user in tenant A, **When** I open a run detail page, **Then** I can view status and sanitized failure information but I do not see controls to start, rerun, cancel, or delete runs.
5. **Given** I can view Monitoring/Operations, **When** I use the hub, **Then** it is view-only (no controls to start, rerun, cancel, or delete runs).
6. **Given** an itemized run has failures, **When** I open the run detail page, **Then** I can see which items failed with stable reason codes and short sanitized messages.
7. **Given** I attempt to access a run from another tenant (via list or direct link), **When** I request it, **Then** access is denied and no run data is disclosed.
8. **Given** a run has both successes and failures, **When** I view it, **Then** its outcome is “partially succeeded”; **And given** it has zero successes (or cannot proceed), **Then** its outcome is “failed”.
9. **Given** a run was initiated by the system, **When** I view it, **Then** the initiator is shown as “System”.

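
Scenarios 1 to 3 and 7 describe a list that is scoped to the tenant first and filtered second. The following is a minimal in-memory sketch in Python for illustration, not the project's query code; the `Run` fields and the `list_runs` / `find_run` names are assumptions. A cross-tenant lookup returns the same result as a missing run, so nothing about the other tenant's run is disclosed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical field names; the spec only lists the concepts.
    id: int
    tenant_id: int
    run_type: str
    status: str
    created_at: datetime


def list_runs(runs: List[Run], tenant_id: int, *, run_type: Optional[str] = None,
              status: Optional[str] = None, since: Optional[datetime] = None,
              until: Optional[datetime] = None,
              now: Optional[datetime] = None) -> List[Run]:
    """Return the tenant's runs, newest first, defaulting to the last 30 days."""
    now = now or datetime.now(timezone.utc)
    since = since or now - timedelta(days=30)
    until = until or now
    # Tenant scoping is applied before any other filter.
    scoped = [r for r in runs if r.tenant_id == tenant_id]
    matches = [
        r for r in scoped
        if since <= r.created_at <= until
        and (run_type is None or r.run_type == run_type)
        and (status is None or r.status == status)
    ]
    return sorted(matches, key=lambda r: r.created_at, reverse=True)


def find_run(runs: List[Run], tenant_id: int, run_id: int) -> Optional[Run]:
    """Look up one run; another tenant's run is indistinguishable from 'not found'."""
    for r in runs:
        if r.tenant_id == tenant_id and r.id == run_id:
            return r
    return None
```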
---
### User Story 2 - Start long-running actions without waiting (Priority: P2)

As an operator, when I trigger a long-running action, I want an immediate confirmation with a “View run” link so I can keep working while the operation completes in the background.

**Why this priority**: Prevents timeouts and long waits, improves reliability under retries/double-clicks, and standardizes the start experience across modules.

**Independent Test**: Start a supported operation (e.g., drift generation) and verify an operation run is created/reused, the UI returns quickly with a “View run” link, and the run progresses to a terminal status.

**Acceptance Scenarios**:

1. **Given** I have permission to start a supported operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run shows as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start a supported operation, **Then** the system denies the request and no new run is created.
3. **Given** I start Backup Set “Add Policies”, **When** I submit the selection, **Then** I receive immediate confirmation with a “View run” link and the run is visible in Monitoring/Operations.
4. **Given** I start a supported operation successfully, **When** it is queued, **Then** I receive a “Queued” notification with a “View run” link to the run detail view.
5. **Given** my run reaches a terminal outcome, **When** that occurs, **Then** I receive a completion notification stating `succeeded`, `partially succeeded`, or `failed`, including summary counts when applicable.
6. **Given** background execution is unavailable, **When** I attempt to start a supported operation, **Then** I receive a clear message that the operation cannot be queued and I do not receive a misleading “queued” confirmation.

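
The start flow in scenarios 1, 2, and 6 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the project's implementation: the `RunStore`, `start_operation`, exception names, the `queue_unavailable` reason code, and the URL shape are all hypothetical. One design choice is also assumed here: when queuing fails, the run record that was just created is marked `failed` rather than left as `queued`, so the hub never shows work that was not actually queued.

```python
from dataclasses import dataclass, field
from typing import Callable

# Roles allowed to start Phase 1 operations, per the spec.
START_ROLES = {"Owner", "Manager", "Operator"}


class StartDenied(Exception):
    """The caller's role may not start operations."""


class QueueUnavailable(Exception):
    """Background execution could not accept the job (hypothetical signal)."""


@dataclass
class RunStore:
    """Tiny in-memory stand-in for the run table."""
    runs: dict = field(default_factory=dict)

    def create(self, tenant_id: int, run_type: str) -> int:
        run_id = len(self.runs) + 1
        self.runs[run_id] = {"tenant_id": tenant_id, "run_type": run_type, "status": "queued"}
        return run_id


def start_operation(store: RunStore, role: str, tenant_id: int, run_type: str,
                    enqueue: Callable[[int], None]) -> dict:
    """Create a run, hand it to the queue, and return a confirmation payload."""
    if role not in START_ROLES:
        # Denied before anything is written: no run is created.
        raise StartDenied(f"role {role!r} cannot start {run_type}")
    run_id = store.create(tenant_id, run_type)
    try:
        enqueue(run_id)
    except QueueUnavailable:
        # Do not leave a misleading 'queued' run behind.
        store.runs[run_id]["status"] = "failed"
        store.runs[run_id]["reason_code"] = "queue_unavailable"
        raise
    # Hypothetical URL shape for the "View run" link.
    return {"run_id": run_id, "status": "queued",
            "view_run_url": f"/monitoring/operations/{run_id}"}
```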
---
### User Story 3 - Drift generation is observable like other operations (Priority: P3)

As an operator, I want drift generation to behave like any other operation: it creates a run, shows progress and outcomes, and links to results so drift is easy to monitor and troubleshoot.

**Why this priority**: Drift is an operator-facing workflow; making it observable avoids confusion and reduces support burden when it fails or takes time.

**Independent Test**: Trigger drift generation for a defined scope, verify it appears in the monitoring hub, reaches a terminal status, and provides either a results link or a safe failure summary.

**Acceptance Scenarios**:

1. **Given** drift generation is available for a selected scope, **When** I trigger “Generate drift now”, **Then** I see a drift run in Monitoring and can open its details.
2. **Given** drift generation is already queued/running for the same scope, **When** I trigger it again, **Then** the system reuses the existing run and does not start a duplicate.
3. **Given** drift generation succeeds for a scope, **When** I open the run detail view, **Then** I see a link to the drift findings produced for that scope.
4. **Given** drift generation fails, **When** I open the run detail view, **Then** I see a safe failure summary (reason code + short sanitized message) and no sensitive data.
5. **Given** drift generation is requested but there is not enough eligible data to compare, **When** I trigger it, **Then** the system produces no findings and communicates a clear, actionable reason (for example, “insufficient data”).

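
Scenario 5 can be checked before any work is queued. The sketch below is illustrative Python, not the project's code. It assumes that "enough eligible data" means at least two successful inventory runs for the scope (the threshold used in the PR's manual QA notes), that run IDs increase over time, and that the payload keys and message text are free to choose; all of these are assumptions.

```python
from typing import List, Optional, Tuple


def drift_context(scope_key: str,
                  successful_run_ids: List[int]) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (context, None) when a comparison is possible, else (None, failure)."""
    if len(successful_run_ids) < 2:
        failure = {
            "reason_code": "insufficient_data",
            # Hypothetical wording; short and safe to display.
            "message": "At least two successful inventory runs are needed to compare.",
        }
        return None, failure
    # Assumption: higher ID means newer run, so compare the two most recent.
    baseline, current = sorted(successful_run_ids)[-2:]
    context = {
        "scope_key": scope_key,
        "baseline_run_id": baseline,
        "current_run_id": current,
    }
    return context, None
```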
---
### Edge Cases

- If drift generation is requested for a scope without enough eligible data to compare, the system MUST refuse to start the operation or MUST complete the run as `failed` with a stable reason code (for example, `insufficient_data`) and a short, actionable message.
- If repeated start attempts occur for the same tenant + run type + scope while an existing run is `queued` or `running`, the system MUST reuse the existing run and MUST NOT start duplicate background work.
- If the initiator’s permissions change while a run is in progress, the run continues; visibility is evaluated at time of access; and completion notifications are delivered only if the recipient remains authorized to view the run.
- Failure details MUST remain sanitized even when underlying errors contain sensitive data: only stable reason codes and short sanitized messages are shown; secrets/tokens/raw payload dumps are never shown.
- For very large scopes, the system MAY summarize or truncate per-item failure listings in the UI, but it MUST preserve accurate summary counts and MUST indicate when failure details are truncated.

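
The last edge case (truncated listings with accurate counts) can be illustrated with a short Python sketch. The function name, the result keys, and the default limit of 50 are assumptions made for the example, not values from the spec.

```python
from typing import List


def summarize_failures(failures: List[dict], limit: int = 50) -> dict:
    """Cap the displayed per-item failures while keeping the true count."""
    shown = failures[:limit]
    return {
        "failed_count": len(failures),         # always the full count
        "failures": shown,                     # at most `limit` entries
        "truncated": len(failures) > limit,    # tells the UI to show a notice
        "omitted": len(failures) - len(shown),
    }
```

The count is taken from the full list before slicing, so the summary stays accurate regardless of how many entries the UI chooses to display.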
## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any new external tenant reads/writes or any write/change behavior, the spec MUST describe safety gates (preview/confirmation/audit), tenant isolation, auditability, and tests. This feature is primarily about observability and standardization; monitoring views MUST be read-only and MUST NOT trigger external data collection.

### Scope & Assumptions

- This feature standardizes tracking and monitoring for long-running operations that already exist in the product.
- **Phase 1 supported operations** are: Drift generation; Backup Set “Add Policies”. All other candidate operations are explicitly deferred to Phase 2+ adoption (for example: inventory sync, directory group sync, snapshot/backup capture, restore execution, cross-tenant comparison/promotion).
- System-initiated runs may exist (for example, scheduled operations); they appear in Monitoring/Operations with initiator shown as “System”. Notification routing for system-initiated runs is deferred to Phase 2 (Owner: Product) to avoid unintended noise.
- **Roles in this spec**: `Owner`, `Manager`, and `Operator` can view Monitoring/Operations and can start Phase 1 supported operations; `Readonly` can view Monitoring/Operations but cannot start or manage runs and must not see those controls.
- Cross-tenant monitoring aggregation is out of scope; monitoring is tenant-scoped.
- Advanced dashboards (charts, badges, progress widgets) are out of scope; focus is on consistent run tracking, filtering, and drill-down.
- Run retention horizon and scale targets (volume per tenant, archiving/export) are deferred to Phase 2 (Owner: Product) to align with real usage data and storage constraints. Phase 1 focuses on operational clarity and defaults the list view to a recent time window.
- This feature assumes background execution is enabled in each environment so long-running operations can complete outside of interactive user sessions; when it is unavailable, the system must communicate clearly and must not mislead operators into thinking work is queued.
- Monitoring/Operations is view-only in Phase 1; run start/re-run controls remain in their respective feature areas.
- Auditability for Phase 1 is achieved via the run record (initiator, timestamps, outcome, counts, safe failure summary) and lifecycle notifications. Auditing “who viewed which run” is deferred to Phase 2 (Owner: Product).

### Functional Requirements

- **FR-001**: System MUST track each supported long-running operation as a tenant-scoped run with: run type, scope/target (when applicable), status, timestamps (created/started/finished), initiator (user or system), and a human-readable label. The label MUST be a stable operator-facing description combining run type and scope/target (English-only in Phase 1; localization deferred to Phase 2).
- **FR-002**: System MUST provide a Monitoring/Operations area that lists runs for the current tenant, sorted most-recent-first by default, defaulting to a recent time window (last 30 days), and supporting filtering by run type, status, and time range. Status filter values MUST include: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Run type filtering MUST include the Phase 1 supported operations (Drift generation; Backup Set “Add Policies”).
- **FR-003**: System MUST provide a run detail view that shows status/outcome, timing, summary counts, and safe failure summaries. For itemized operations (operations that process a set of items), counts MUST include `total`, `succeeded`, `failed`, and `skipped` (if applicable).
- **FR-004**: System MUST use consistent run status semantics across run types using the Phase 1 status set: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Status meanings MUST be unambiguous: `partially succeeded` indicates at least one success and at least one failure; `failed` indicates zero successes (or the run could not proceed). Cancellation/abort outcomes are deferred to Phase 2.
- **FR-005**: When an operator starts a supported long-running operation, the system MUST provide immediate confirmation and a “View run” link that opens the run detail view without blocking on completion. If background execution is unavailable, the system MUST provide a clear error and MUST NOT present a misleading “queued” confirmation.
- **FR-006**: System MUST avoid duplicate runs for the same tenant + run type + scope when an identical run is already `queued` or `running` by reusing the existing run. “Identical” means the same tenant, the same run type, the same scope/target, and the same effective inputs (for example: the same drift scope selection; the same backup set and selected policies). The initiator MUST NOT be part of the identity for duplicate prevention.
- **FR-007**: Drift generation MUST be tracked as a run and MUST surface completion status and either a link to produced findings or an actionable, safe failure summary.
- **FR-008**: Run list, run view, and run start actions MUST be tenant-scoped and forbidden cross-tenant. Tenant scoping MUST be applied before any filtering or lookup to prevent cross-tenant data leakage, and cross-tenant access attempts MUST NOT disclose run existence or details.
- **FR-009**: Run visibility and run start actions MUST be permission-gated by run type (least privilege). By default, `Owner`, `Manager`, `Operator`, and `Readonly` can view runs, but `Readonly` MUST NOT be able to start or manage runs (and must not see those controls).
- **FR-010**: Failure information stored and displayed for runs MUST be sanitized and minimized; it MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps. Runs MUST store stable reason codes with short sanitized messages. Itemized operations MUST additionally store a sanitized per-item failures list that identifies the affected item using a non-sensitive reference (for example, an item name and/or ID that is safe to display) plus reason code and short message.
- **FR-011**: System MUST provide consistent user notifications for run lifecycle events: when queued and when reaching a terminal outcome (`succeeded` / `partially succeeded` / `failed`). Notifications MUST include a “View run” link that opens the run detail view. Phase 1 notifications MUST be delivered to the initiating user; notification routing for system-initiated runs is deferred to Phase 2. If a notification cannot be delivered, Monitoring/Operations remains the source of truth for run status.
- **FR-012**: Where a run produces or relates to a separate business artifact (for example, drift findings), the run detail view MUST provide a link to that artifact.
- **FR-013**: Backup Set “Add Policies” MUST be tracked as a run and MUST surface terminal outcome status, summary counts, and a link back to the related backup set context.
- **FR-014**: Monitoring/Operations MUST be view-only in Phase 1 and MUST NOT offer controls to start, rerun, cancel, or delete runs.

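
FR-006 defines run identity as tenant + run type + scope + effective inputs, excluding the initiator. One way to express that is a deterministic key, sketched below in Python for illustration only. The names `run_identity` and `ActiveRuns` are hypothetical, the scope string format is invented for the example, and the in-memory registry stands in for whatever storage the system uses. In a real database the check-then-create step would need to be atomic (for example via a unique constraint or a lock), which this sketch does not show.

```python
import hashlib
import json
from typing import Dict, Tuple


def run_identity(tenant_id: int, run_type: str, scope: str, inputs: dict) -> str:
    """Build a stable key from the fields that define an 'identical' run.

    The initiator is deliberately not a parameter. Callers should normalize
    order-insensitive inputs (e.g. sort selected policy IDs) beforehand.
    """
    payload = json.dumps(
        {"tenant_id": tenant_id, "run_type": run_type, "scope": scope, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ActiveRuns:
    """In-memory registry of queued/running runs, keyed by identity."""

    def __init__(self) -> None:
        self._by_key: Dict[str, int] = {}
        self._next_id = 1

    def start_or_reuse(self, key: str) -> Tuple[int, bool]:
        """Return (run_id, created); reuse the active run when one exists."""
        if key in self._by_key:
            return self._by_key[key], False
        run_id = self._next_id
        self._next_id += 1
        self._by_key[key] = run_id
        return run_id, True

    def finish(self, key: str) -> None:
        """Release the key once the run reaches a terminal outcome."""
        self._by_key.pop(key, None)
```

Because the key is released when a run finishes, a later identical request starts a fresh run rather than pointing at a completed one.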
### Key Entities *(include if feature involves data)*

- **Run**: A tenant-scoped record representing the execution of a long-running operation, including status, timestamps, summary counts, and safe failure information.
- **Failure Detail**: For a run, a stable reason code plus a short sanitized message; for itemized operations, a sanitized list of per-item failures to identify affected items without exposing sensitive data.
- **Run Type**: A classification of what the run does (e.g., inventory sync, backup capture, restore, drift generation) used for filtering, permissions, and consistent status/count semantics.
- **Run Scope/Target**: The object or scope the run applies to (e.g., a backup set, a policy, a drift scope), used for duplicate prevention and navigation.
- **Related Artifact**: A separate business record produced by an operation (e.g., restore details, drift findings) that can be linked from the run.

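
A Failure Detail entry could be built along these lines. This Python sketch is illustrative only: the redaction patterns, the 200-character limit, and the function names are assumptions, and pattern-based redaction is a heuristic that can miss secrets in unexpected formats. Mapping known error types to reason codes with fixed, pre-written messages is the safer primary mechanism; a scrubber like this is best treated as a backstop.

```python
import re
from typing import Optional

# Heuristic patterns (assumptions, not an exhaustive list).
_SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"),
    re.compile(r"(?i)\b(password|secret|token|api[_-]?key)\b\s*[=:]\s*\S+"),
]


def sanitize_message(raw: str, max_len: int = 200) -> str:
    """Collapse whitespace, redact likely secrets, and cap the length."""
    text = " ".join(raw.split())
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[redacted]", text)
    if len(text) > max_len:
        text = text[: max_len - 3].rstrip() + "..."
    return text


def failure_entry(reason_code: str, raw_message: str,
                  item_ref: Optional[str] = None) -> dict:
    """Build one failure record; item_ref must already be safe to display."""
    entry = {"reason_code": reason_code, "message": sanitize_message(raw_message)}
    if item_ref is not None:
        entry["item"] = item_ref
    return entry
```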
## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: An operator can determine the current state (`queued` / `running` / `succeeded` / `partially succeeded` / `failed`) of any recent operation in under 30 seconds using Monitoring/Operations (measured via timed operator walkthroughs using Monitoring/Operations only).
- **SC-002**: Starting a Phase 1 supported long-running operation provides user-visible confirmation and a “View run” link within 2 seconds under normal conditions (normal conditions: no active service degradation and typical tenant dataset sizes; excludes maintenance/outage windows; measured via timed start flows during operator walkthroughs).
- **SC-003**: For identical Phase 1 operation requests (same tenant + run type + scope + effective inputs), the system creates no more than one active run at a time (`queued` or `running`) in at least 99% of repeated-start attempts over a rolling 30-day window (measured by sampling repeated-start attempts and counting resulting active runs).
- **SC-004**: For Phase 1 terminal runs, operators can identify a clear outcome (`succeeded` / `partially succeeded` / `failed`) and a short, non-sensitive failure reason (when applicable) using Monitoring/Operations without inspecting server logs in at least 95% of cases over a rolling 30-day window (measured via support/operator triage review of run records).
- **SC-005**: Operator- or support-reported incidents caused by “unknown/stuck status” long-running operations decrease by at least 50% within one release cycle after Phase 1 adoption (measured via support ticket tagging/categorization).