Summary

This PR introduces Unified Operations Runs + Monitoring Hub (053).

Goal: Standardize how long-running operations are tracked and monitored using the existing tenant-scoped run record (BulkOperationRun) as the canonical “operation run”, and surface it in a single Monitoring → Operations hub (view-only, tenant-scoped, role-aware).

Phase 1 adoption scope (per spec):
• Drift generation (drift.generate)
• Backup Set “Add Policies” (backup_set.add_policies)

Note: This PR does not convert every run type yet (e.g. GroupSyncRuns / InventorySyncRuns remain separate for now). This is intentionally incremental.

⸻

What changed

Monitoring / Operations hub
• Moved/organized run monitoring under Monitoring → Operations
• Added:
  • status buckets (queued / running / succeeded / partially succeeded / failed)
  • filters (run type, status bucket, time range)
  • run detail “Related” links (e.g. Drift findings, Backup Set context)
• All hub pages are DB-only and view-only (no rerun/cancel/delete actions)

Canonical run semantics
• Added canonical helpers on BulkOperationRun:
  • runType() (resource.action)
  • statusBucket() derived from status + counts (testable semantics)

Drift integration (Phase 1)
• Drift generation start behavior now:
  • creates/reuses a BulkOperationRun with drift context payload (scope_key + baseline/current run ids)
  • dispatches generation job
  • emits DB notifications including “View run” link
• On generation failure: stores sanitized failure entries + sends failure notification

Permissions / tenant isolation
• Monitoring run list/view is tenant-scoped and returns 403 for cross-tenant access
• Readonly can view runs but cannot start drift generation

⸻

Tests

Added/updated Pest coverage:
• BulkOperationRunStatusBucketTest.php
• DriftGenerationDispatchTest.php
• GenerateDriftFindingsJobNotificationTest.php
• RunAuthorizationTenantIsolationTest.php

Validation run locally:
• ./vendor/bin/pint --dirty
• targeted tests from feature quickstart / drift monitoring tests

⸻

Manual QA
1. Go to Monitoring → Operations
  • verify filters (run type / status / time range)
  • verify run detail shows counts + sanitized failures + “Related” links
2. Open Drift Landing
  • with >=2 successful inventory runs for scope: should queue drift generation + show notification with “View run”
  • as readonly: should not start generation
3. Run detail
  • drift.generate runs show “Drift findings” related link
  • failure entries are sanitized (no secrets/tokens/raw payload dumps)

⸻

Notes / Ops
• Queue workers must be restarted after deploy so they load the new code:
  • php artisan queue:restart (or Sail equivalent)
• This PR standardizes monitoring for Phase 1 producers only; follow-ups will migrate additional run types into the unified pattern.

⸻

Spec / Docs
• SpecKit artifacts added under specs/053-unify-runs-monitoring/
• Checklists are complete:
  • requirements checklist PASS
  • writing checklist PASS

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #60
Feature Specification: Unified Operations Runs + Monitoring Hub (053)
Feature Branch: feat/053-unify-runs-monitoring
Created: 2026-01-15
Status: Draft
Input: User description: "Unify long-running operations into a consistent, observable run model and provide a single Monitoring/Operations hub with tenant-scoped authorization, consistent status/count semantics, safe failure visibility, duplicate prevention, and Drift generation as a first-class tracked operation."
Clarifications
Session 2026-01-15
- Q: Who can view runs in Monitoring/Operations? → A: Owner, Manager, Operator, and Readonly (view-only; Readonly sees sanitized details but no action controls).
- Q: Which operations are included in Phase 1? → A: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
- Q: Should Monitoring/Operations include run management actions (rerun/cancel/delete)? → A: No; Monitoring/Operations is view-only in Phase 1 (start/re-run remain in feature UIs).
- Q: How is “partially succeeded” determined vs “failed”? → A: partially succeeded means both successes and failures occurred; failed means nothing succeeded (or the run could not proceed).
- Q: What failure detail should be stored and shown for runs? → A: Stable reason codes + short sanitized messages; itemized operations also include per-item failures (sanitized).
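The outcome rule from the clarification above can be sketched as a small pure function. This is an illustrative Python sketch, not the project's statusBucket() helper: the names status_bucket, RunCounts, and could_not_proceed are hypothetical, and treating a terminal run with no recorded failures as succeeded is an assumption the spec does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCounts:
    # Hypothetical summary counts for an itemized run.
    succeeded: int = 0
    failed: int = 0


def status_bucket(raw_status: str, counts: RunCounts, could_not_proceed: bool = False) -> str:
    """Return one of: queued, running, succeeded, partially succeeded, failed."""
    if raw_status in ("queued", "running"):
        return raw_status  # non-terminal statuses pass through unchanged
    if could_not_proceed:
        return "failed"  # the run could not proceed at all
    if counts.failed > 0 and counts.succeeded > 0:
        return "partially succeeded"  # both successes and failures occurred
    if counts.failed > 0:
        return "failed"  # failures with zero successes
    # Assumption: a terminal run with no recorded failures counts as succeeded.
    return "succeeded"
```

Keeping the rule free of I/O makes each branch straightforward to cover with a unit test.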
User Scenarios & Testing (mandatory)
User Story 1 - Monitor operations in one place (Priority: P1)
As an operator, I want a single Monitoring/Operations area where I can see what operations are queued/running/succeeded/partially succeeded/failed for my tenant and drill into details, so I can quickly answer “what is happening” without jumping between modules.
Why this priority: Provides immediate operational visibility, reduces time-to-diagnose failures, and prevents “run sprawl” (different screens/status meanings per module).
Independent Test: Seed multiple runs with different types/statuses and verify the hub lists them tenant-scoped, supports filtering, and a run detail view is available with status, timing, summary counts, and safe error summaries.
Acceptance Scenarios:
- Given I am signed into tenant A, When I open Monitoring → Operations, Then I see runs for tenant A sorted most-recent-first and the list defaults to the last 30 days.
- Given the list contains multiple run types and statuses, When I filter by run type and status, Then I only see matching runs and can open a run detail page for any result.
- Given I need an older or narrower timeframe, When I set a custom time range, Then the list updates to match the selected range.
- Given I am a Readonly user in tenant A, When I open a run detail page, Then I can view status and sanitized failure information but I do not see controls to start, rerun, cancel, or delete runs.
- Given I can view Monitoring/Operations, When I use the hub, Then it is view-only (no controls to start, rerun, cancel, or delete runs).
- Given an itemized run has failures, When I open the run detail page, Then I can see which items failed with stable reason codes and short sanitized messages.
- Given I attempt to access a run from another tenant (via list or direct link), When I request it, Then access is denied and no run data is disclosed.
- Given a run has both successes and failures, When I view it, Then its outcome is “partially succeeded”; And given it has zero successes (or cannot proceed), Then its outcome is “failed”.
- Given a run was initiated by the system, When I view it, Then the initiator is shown as “System”.
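The list and lookup behavior in these scenarios can be sketched as follows. This is a minimal in-memory Python illustration under stated assumptions: the Run fields and the list_runs and find_run names are hypothetical, and a real implementation would express the same ordering (tenant scope first, then filters) in its database queries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical minimal run record for illustration.
    id: int
    tenant_id: str
    run_type: str
    status: str
    created_at: datetime


def list_runs(runs, tenant_id, now, run_type=None, status=None, since=None, until=None):
    """Tenant-scope first, then filter, then sort most-recent-first."""
    scoped = [r for r in runs if r.tenant_id == tenant_id]  # scoping precedes any filter
    since = since if since is not None else now - timedelta(days=30)  # default window
    until = until if until is not None else now
    matches = [
        r for r in scoped
        if since <= r.created_at <= until
        and (run_type is None or r.run_type == run_type)
        and (status is None or r.status == status)
    ]
    return sorted(matches, key=lambda r: r.created_at, reverse=True)


def find_run(runs, tenant_id, run_id) -> Optional[Run]:
    """Look up a run inside the tenant scope only; other tenants' runs look absent."""
    return next((r for r in runs if r.tenant_id == tenant_id and r.id == run_id), None)
```

Because find_run never searches outside the tenant scope, a direct link to another tenant's run yields the same result as a nonexistent run, and the caller can map that to an access-denied response without disclosing anything.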
User Story 2 - Start long-running actions without waiting (Priority: P2)
As an operator, when I trigger a long-running action, I want an immediate confirmation with a “View run” link so I can keep working while the operation completes in the background.
Why this priority: Prevents timeouts and long waits, improves reliability under retries/double-clicks, and standardizes the start experience across modules.
Independent Test: Start a supported operation (e.g., drift generation) and verify an operation run is created/reused, the UI returns quickly with a “View run” link, and the run progresses to a terminal status.
Acceptance Scenarios:
- Given I have permission to start a supported operation in tenant A, When I start it, Then I receive immediate confirmation with a “View run” link and the run shows as queued or running.
- Given I am a Readonly user in tenant A, When I attempt to start a supported operation, Then the system denies the request and no new run is created.
- Given I start Backup Set “Add Policies”, When I submit the selection, Then I receive immediate confirmation with a “View run” link and the run is visible in Monitoring/Operations.
- Given I start a supported operation successfully, When it is queued, Then I receive a “Queued” notification with a “View run” link to the run detail view.
- Given my run reaches a terminal outcome, When that occurs, Then I receive a completion notification stating succeeded, partially succeeded, or failed, including summary counts when applicable.
- Given background execution is unavailable, When I attempt to start a supported operation, Then I receive a clear message that the operation cannot be queued and I do not receive a misleading “queued” confirmation.
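The start flow in these scenarios amounts to two guards before any run is created. The Python sketch below is illustrative only: RunStore, start_operation, the exception names, and the /monitoring/operations/{id} path are all hypothetical, and the queue_available flag stands in for whatever health check the real system would use.

```python
from dataclasses import dataclass, field

# Roles allowed to start Phase 1 operations; Readonly can view but not start.
START_ROLES = {"Owner", "Manager", "Operator"}


class StartDenied(Exception):
    """Raised when the caller's role may not start operations."""


class QueueUnavailable(Exception):
    """Raised when background execution is unavailable."""


@dataclass
class RunStore:
    # Hypothetical in-memory stand-in for the run table.
    runs: list = field(default_factory=list)

    def create(self, tenant_id: str, run_type: str) -> dict:
        run = {"id": len(self.runs) + 1, "tenant_id": tenant_id,
               "run_type": run_type, "status": "queued"}
        self.runs.append(run)
        return run


def start_operation(store: RunStore, tenant_id: str, role: str,
                    run_type: str, queue_available: bool) -> dict:
    if role not in START_ROLES:
        raise StartDenied("This role cannot start operations.")  # no run is created
    if not queue_available:
        # Fail loudly instead of returning a misleading "queued" confirmation.
        raise QueueUnavailable("The operation cannot be queued right now.")
    run = store.create(tenant_id, run_type)
    return {"status": run["status"],
            "view_run_url": f"/monitoring/operations/{run['id']}"}  # hypothetical path
```

Both guards run before store.create, so a denied or unqueueable request leaves no run behind.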
User Story 3 - Drift generation is observable like other operations (Priority: P3)
As an operator, I want drift generation to behave like any other operation: it creates a run, shows progress and outcomes, and links to results so drift is easy to monitor and troubleshoot.
Why this priority: Drift is an operator-facing workflow; making it observable avoids confusion and reduces support burden when it fails or takes time.
Independent Test: Trigger drift generation for a defined scope, verify it appears in the monitoring hub, reaches a terminal status, and provides either a results link or a safe failure summary.
Acceptance Scenarios:
- Given drift generation is available for a selected scope, When I trigger “Generate drift now”, Then I see a drift run in Monitoring and can open its details.
- Given drift generation is already queued/running for the same scope, When I trigger it again, Then the system reuses the existing run and does not start a duplicate.
- Given drift generation succeeds for a scope, When I open the run detail view, Then I see a link to the drift findings produced for that scope.
- Given drift generation fails, When I open the run detail view, Then I see a safe failure summary (reason code + short sanitized message) and no sensitive data.
- Given drift generation is requested but there is not enough eligible data to compare, When I trigger it, Then the system produces no findings and communicates a clear, actionable reason (for example, “insufficient data”).
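The reuse behavior for repeated starts can be sketched by deriving an identity key from tenant, run type, scope, and effective inputs, and deliberately leaving the initiator out. This Python sketch is an assumption-laden illustration: identity_key and start_or_reuse are hypothetical names, the in-memory list is not safe under concurrent requests (a real system would likely need a lock or a uniqueness constraint), and callers are assumed to pass list-valued inputs in a canonical order.

```python
import hashlib
import json

ACTIVE_STATUSES = {"queued", "running"}


def identity_key(tenant_id: str, run_type: str, scope: str, inputs: dict) -> str:
    """Hash the fields that define an identical run; the initiator is excluded."""
    payload = json.dumps(
        {"tenant": tenant_id, "type": run_type, "scope": scope, "inputs": inputs},
        sort_keys=True,  # stable key order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def start_or_reuse(runs: list, tenant_id: str, run_type: str, scope: str,
                   inputs: dict, initiator: str):
    """Return (run, created). Reuse an active identical run instead of duplicating it."""
    key = identity_key(tenant_id, run_type, scope, inputs)
    for run in runs:
        if run["key"] == key and run["status"] in ACTIVE_STATUSES:
            return run, False  # existing queued/running run is reused
    run = {"id": len(runs) + 1, "key": key, "status": "queued", "initiator": initiator}
    runs.append(run)
    return run, True
```

Once the earlier run reaches a terminal status it no longer matches, so a later identical request starts a fresh run.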
Edge Cases
- If drift generation is requested for a scope without enough eligible data to compare, the system MUST refuse to start the operation or MUST complete the run as failed with a stable reason code (for example, insufficient_data) and a short, actionable message.
- If repeated start attempts occur for the same tenant + run type + scope while an existing run is queued or running, the system MUST reuse the existing run and MUST NOT start duplicate background work.
- If the initiator’s permissions change while a run is in progress, the run continues; visibility is evaluated at time of access; and completion notifications are delivered only if the recipient remains authorized to view the run.
- Failure details MUST remain sanitized even when underlying errors contain sensitive data: only stable reason codes and short sanitized messages are shown; secrets/tokens/raw payload dumps are never shown.
- For very large scopes, the system MAY summarize or truncate per-item failure listings in the UI, but it MUST preserve accurate summary counts and MUST indicate when failure details are truncated.
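The truncation rule in the last edge case can be sketched as below: the displayed failure list is capped, while the summary counts stay exact and a flag signals that details were cut. This is an illustrative Python sketch; summarize_failures, the field names, and the default limit of 50 are hypothetical.

```python
def summarize_failures(failures: list, total: int, succeeded: int, limit: int = 50) -> dict:
    """Cap the per-item failure listing without distorting the summary counts."""
    shown = failures[:limit]
    return {
        # Counts always reflect the full run, never the truncated listing.
        "counts": {"total": total, "succeeded": succeeded, "failed": len(failures)},
        "failures": shown,
        "truncated": len(failures) > len(shown),  # UI must indicate truncation
        "hidden": len(failures) - len(shown),
    }
```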
Requirements (mandatory)
Constitution alignment (required): If this feature introduces any new external tenant reads/writes or any write/change behavior, the spec MUST describe safety gates (preview/confirmation/audit), tenant isolation, auditability, and tests. This feature is primarily about observability and standardization; monitoring views MUST be read-only and MUST NOT trigger external data collection.
Scope & Assumptions
- This feature standardizes tracking and monitoring for long-running operations that already exist in the product.
- Phase 1 supported operations are: Drift generation; Backup Set “Add Policies”. All other candidate operations are explicitly deferred to Phase 2+ adoption (for example: inventory sync, directory group sync, snapshot/backup capture, restore execution, cross-tenant comparison/promotion).
- System-initiated runs may exist (for example, scheduled operations); they appear in Monitoring/Operations with initiator shown as “System”. Notification routing for system-initiated runs is deferred to Phase 2 (Owner: Product) to avoid unintended noise.
- Roles in this spec: Owner, Manager, and Operator can view Monitoring/Operations and can start Phase 1 supported operations; Readonly can view Monitoring/Operations but cannot start or manage runs and must not see those controls.
- Cross-tenant monitoring aggregation is out of scope; monitoring is tenant-scoped.
- Advanced dashboards (charts, badges, progress widgets) are out of scope; focus is on consistent run tracking, filtering, and drill-down.
- Run retention horizon and scale targets (volume per tenant, archiving/export) are deferred to Phase 2 (Owner: Product) to align with real usage data and storage constraints. Phase 1 focuses on operational clarity and defaults the list view to a recent time window.
- This feature assumes background execution is enabled in each environment so long-running operations can complete outside of interactive user sessions; when it is unavailable, the system must communicate clearly and must not mislead operators into thinking work is queued.
- Monitoring/Operations is view-only in Phase 1; run start/re-run controls remain in their respective feature areas.
- Auditability for Phase 1 is achieved via the run record (initiator, timestamps, outcome, counts, safe failure summary) and lifecycle notifications. Auditing “who viewed which run” is deferred to Phase 2 (Owner: Product).
Functional Requirements
- FR-001: System MUST track each supported long-running operation as a tenant-scoped run with: run type, scope/target (when applicable), status, timestamps (created/started/finished), initiator (user or system), and a human-readable label. The label MUST be a stable operator-facing description combining run type and scope/target (English-only in Phase 1; localization deferred to Phase 2).
- FR-002: System MUST provide a Monitoring/Operations area that lists runs for the current tenant, sorted most-recent-first by default, defaulting to a recent time window (last 30 days), and supporting filtering by run type, status, and time range. Status filter values MUST include: queued, running, succeeded, partially succeeded, failed. Run type filtering MUST include the Phase 1 supported operations (Drift generation; Backup Set “Add Policies”).
- FR-003: System MUST provide a run detail view that shows status/outcome, timing, summary counts, and safe failure summaries. For itemized operations (operations that process a set of items), counts MUST include total, succeeded, failed, and skipped (if applicable).
- FR-004: System MUST use consistent run status semantics across run types using the Phase 1 status set: queued, running, succeeded, partially succeeded, failed. Status meanings MUST be unambiguous: partially succeeded indicates at least one success and at least one failure; failed indicates zero successes (or the run could not proceed). Cancellation/abort outcomes are deferred to Phase 2.
- FR-005: When an operator starts a supported long-running operation, the system MUST provide immediate confirmation and a “View run” link that opens the run detail view without blocking on completion. If background execution is unavailable, the system MUST provide a clear error and MUST NOT present a misleading “queued” confirmation.
- FR-006: System MUST avoid duplicate runs for the same tenant + run type + scope when an identical run is already queued or running by reusing the existing run. “Identical” means the same tenant, the same run type, the same scope/target, and the same effective inputs (for example: the same drift scope selection; the same backup set and selected policies). The initiator MUST NOT be part of the identity for duplicate prevention.
- FR-007: Drift generation MUST be tracked as a run and MUST surface completion status and either a link to produced findings or an actionable, safe failure summary.
- FR-008: Run list, run view, and run start actions MUST be tenant-scoped and forbidden cross-tenant. Tenant scoping MUST be applied before any filtering or lookup to prevent cross-tenant data leakage, and cross-tenant access attempts MUST NOT disclose run existence or details.
- FR-009: Run visibility and run start actions MUST be permission-gated by run type (least privilege). By default, Owner, Manager, Operator, and Readonly can view runs, but Readonly MUST NOT be able to start or manage runs (and must not see those controls).
- FR-010: Failure information stored and displayed for runs MUST be sanitized and minimized; it MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps. Runs MUST store stable reason codes with short sanitized messages. Itemized operations MUST additionally store a sanitized per-item failures list that identifies the affected item using a non-sensitive reference (for example, an item name and/or ID that is safe to display) plus reason code and short message.
- FR-011: System MUST provide consistent user notifications for run lifecycle events: when queued and when reaching a terminal outcome (succeeded/partially succeeded/failed). Notifications MUST include a “View run” link that opens the run detail view. Phase 1 notifications MUST be delivered to the initiating user; notification routing for system-initiated runs is deferred to Phase 2. If a notification cannot be delivered, Monitoring/Operations remains the source of truth for run status.
- FR-012: Where a run produces or relates to a separate business artifact (for example, drift findings), the run detail view MUST provide a link to that artifact.
- FR-013: Backup Set “Add Policies” MUST be tracked as a run and MUST surface terminal outcome status, summary counts, and a link back to the related backup set context.
- FR-014: Monitoring/Operations MUST be view-only in Phase 1 and MUST NOT offer controls to start, rerun, cancel, or delete runs.
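One way to satisfy the sanitization requirement in FR-010 is an allowlist: map each known error to a stable reason code with a fixed message, and never copy raw exception text into the stored entry. The Python sketch below illustrates that approach under stated assumptions; only insufficient_data appears in the spec, while the other reason codes, the InsufficientData exception, and sanitize_failure are hypothetical.

```python
# Fixed, pre-reviewed messages keyed by stable reason code.
# Only "insufficient_data" comes from the spec; the others are hypothetical.
REASON_MESSAGES = {
    "insufficient_data": "Not enough eligible data to compare.",
    "permission_denied": "The operation was not permitted.",
    "unknown_error": "The operation failed unexpectedly.",
}


class InsufficientData(Exception):
    """Hypothetical domain error for a scope with too little data."""


EXCEPTION_TO_REASON = {
    InsufficientData: "insufficient_data",
    PermissionError: "permission_denied",
}


def sanitize_failure(exc: Exception, item_ref: str = None) -> dict:
    """Build a storable failure entry without copying any raw exception text."""
    code = next(
        (c for exc_type, c in EXCEPTION_TO_REASON.items() if isinstance(exc, exc_type)),
        "unknown_error",  # unmapped errors fall back to a generic code
    )
    entry = {"reason_code": code, "message": REASON_MESSAGES[code]}
    if item_ref is not None:
        entry["item"] = item_ref  # must be a non-sensitive, display-safe reference
    return entry
```

Because the message always comes from the fixed catalog, a token or payload embedded in the original exception cannot leak into the run record.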
Key Entities (include if feature involves data)
- Run: A tenant-scoped record representing the execution of a long-running operation, including status, timestamps, summary counts, and safe failure information.
- Failure Detail: For a run, a stable reason code plus a short sanitized message; for itemized operations, a sanitized list of per-item failures to identify affected items without exposing sensitive data.
- Run Type: A classification of what the run does (e.g., inventory sync, backup capture, restore, drift generation) used for filtering, permissions, and consistent status/count semantics.
- Run Scope/Target: The object or scope the run applies to (e.g., a backup set, a policy, a drift scope), used for duplicate prevention and navigation.
- Related Artifact: A separate business record produced by an operation (e.g., restore details, drift findings) that can be linked from the run.
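The entities above can be pictured as a small record type. This Python sketch is illustrative and does not mirror the BulkOperationRun model: the OperationRun class, its field names, and the label format are hypothetical. The resource.action run type and the “System” initiator display follow the PR summary and the acceptance scenarios.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationRun:
    # Hypothetical fields; a real run also carries status, timestamps, and counts.
    tenant_id: str
    resource: str
    action: str
    scope: Optional[str] = None           # e.g. a drift scope or backup set
    initiator_name: Optional[str] = None  # None for system-initiated runs

    @property
    def run_type(self) -> str:
        return f"{self.resource}.{self.action}"  # resource.action, e.g. drift.generate

    @property
    def label(self) -> str:
        # Hypothetical label format combining run type and scope/target.
        return self.run_type if self.scope is None else f"{self.run_type} ({self.scope})"

    @property
    def initiator_label(self) -> str:
        return self.initiator_name or "System"
```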
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: An operator can determine the current state (queued/running/succeeded/partially succeeded/failed) of any recent operation in under 30 seconds using Monitoring/Operations (measured via timed operator walkthroughs using Monitoring/Operations only).
- SC-002: Starting a Phase 1 supported long-running operation provides user-visible confirmation and a “View run” link within 2 seconds under normal conditions (normal conditions: no active service degradation and typical tenant dataset sizes; excludes maintenance/outage windows; measured via timed start flows during operator walkthroughs).
- SC-003: For identical Phase 1 operation requests (same tenant + run type + scope + effective inputs), the system creates no more than one active run at a time (queued or running) in at least 99% of repeated-start attempts over a rolling 30-day window (measured by sampling repeated-start attempts and counting resulting active runs).
- SC-004: For Phase 1 terminal runs, operators can identify a clear outcome (succeeded/partially succeeded/failed) and a short, non-sensitive failure reason (when applicable) using Monitoring/Operations without inspecting server logs in at least 95% of cases over a rolling 30-day window (measured via support/operator triage review of run records).
- SC-005: Operator- or support-reported incidents caused by “unknown/stuck status” long-running operations decrease by at least 50% within one release cycle after Phase 1 adoption (measured via support ticket tagging/categorization).