Summary

This PR introduces Unified Operations Runs + Monitoring Hub (053).

Goal: Standardize how long-running operations are tracked and monitored, using the existing tenant-scoped run record (`BulkOperationRun`) as the canonical “operation run”, and surface it in a single Monitoring → Operations hub (view-only, tenant-scoped, role-aware).

Phase 1 adoption scope (per spec):
• Drift generation (`drift.generate`)
• Backup Set “Add Policies” (`backup_set.add_policies`)

Note: This PR does not convert every run type yet (e.g. `GroupSyncRuns` / `InventorySyncRuns` remain separate for now). This is intentionally incremental.

⸻

What changed

Monitoring / Operations hub
• Moved/organized run monitoring under Monitoring → Operations
• Added:
  • status buckets (queued / running / succeeded / partially succeeded / failed)
  • filters (run type, status bucket, time range)
  • run detail “Related” links (e.g. Drift findings, Backup Set context)
• All hub pages are DB-only and view-only (no rerun/cancel/delete actions)

Canonical run semantics
• Added canonical helpers on `BulkOperationRun`:
  • `runType()` (`resource.action`)
  • `statusBucket()` derived from status + counts (testable semantics)

Drift integration (Phase 1)
• Drift generation start behavior now:
  • creates/reuses a `BulkOperationRun` with a drift context payload (`scope_key` + baseline/current run ids)
  • dispatches the generation job
  • emits DB notifications including a “View run” link
• On generation failure: stores sanitized failure entries + sends a failure notification

Permissions / tenant isolation
• Monitoring run list/view is tenant-scoped and returns 403 for cross-tenant access
• Readonly can view runs but cannot start drift generation

⸻

Tests

Added/updated Pest coverage:
• `BulkOperationRunStatusBucketTest.php`
• `DriftGenerationDispatchTest.php`
• `GenerateDriftFindingsJobNotificationTest.php`
• `RunAuthorizationTenantIsolationTest.php`

Validation run locally:
• `./vendor/bin/pint --dirty`
• targeted tests from the feature quickstart / drift monitoring tests

⸻

Manual QA

1. Go to Monitoring → Operations
   • verify filters (run type / status / time range)
   • verify run detail shows counts + sanitized failures + “Related” links
2. Open Drift Landing
   • with >= 2 successful inventory runs for a scope: should queue drift generation + show a notification with “View run”
   • as Readonly: should not start generation
3. Run detail
   • `drift.generate` runs show a “Drift findings” related link
   • failure entries are sanitized (no secrets/tokens/raw payload dumps)

⸻

Notes / Ops

• Queue workers must be restarted after deploy so they load the new code: `php artisan queue:restart` (or the Sail equivalent)
• This PR standardizes monitoring for Phase 1 producers only; follow-ups will migrate additional run types into the unified pattern.

⸻

Spec / Docs

• SpecKit artifacts added under `specs/053-unify-runs-monitoring/`
• Checklists are complete:
  • requirements checklist PASS
  • writing checklist PASS

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #60
Research: Unified Operations Runs + Monitoring Hub (053)
This document resolves Phase 0 open questions and records design choices for Feature 053.
Decisions
1) Canonical run record (Phase 1)
Decision: Reuse the existing `bulk_operation_runs` table / `App\Models\BulkOperationRun` model as the canonical “operation run” record for Phase 1.
Rationale:
- The codebase already uses `BulkOperationRun` for long-running background work (including Drift generation and Backup Set “Add Policies”).
- It already supports tenant scoping, initiator attribution, counts, and safe failure persistence.
- Avoids a high-risk cross-feature migration before we have proven consistent semantics across modules.
Alternatives considered:
- Create a new generic `operation_runs` (+ optional `operation_run_items`) model and migrate all producers to it.
  - Rejected (Phase 1): higher schema + refactor cost, higher coordination risk, and would slow down delivering the Monitoring hub.
2) Monitoring/Operations hub surface
Decision: Implement the Monitoring/Operations hub by evolving the existing Filament `BulkOperationRunResource` (navigation group/label + filters), rather than creating a new custom monitoring page in Phase 1. A sketch of the intended change follows the alternatives below.
Rationale:
- The resource already provides a tenant-scoped list and a run detail view.
- Small changes deliver high value quickly and reduce risk.
Alternatives considered:
- New “Monitoring → Operations” Filament Page + bespoke table/detail.
  - Rejected (Phase 1): duplicates existing capabilities and increases maintenance burden.
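For concreteness, a minimal sketch of the kind of change this decision implies, assuming Filament v3 conventions; the class body, filter names, and options below are illustrative, not the actual implementation:

```php
use Filament\Resources\Resource;
use Filament\Tables\Filters\SelectFilter;
use Filament\Tables\Table;

class BulkOperationRunResource extends Resource
{
    // Surface the existing resource under Monitoring → Operations.
    protected static ?string $navigationGroup = 'Monitoring';

    protected static ?string $navigationLabel = 'Operations';

    public static function table(Table $table): Table
    {
        return $table->filters([
            // Phase 1 run types (see Decision 7).
            SelectFilter::make('run_type')->options([
                'drift.generate' => 'Drift generation',
                'backup_set.add_policies' => 'Backup Set: Add Policies',
            ]),
            // Status buckets per Decision 4. Since the bucket is derived
            // from status + counts, a real filter would need a custom
            // query callback rather than a plain column filter.
            SelectFilter::make('status_bucket')->options([
                'queued' => 'Queued',
                'running' => 'Running',
                'succeeded' => 'Succeeded',
                'partially_succeeded' => 'Partially succeeded',
                'failed' => 'Failed',
            ]),
        ]);
    }
}
```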
3) View-only guardrail and viewer roles
Decision: Monitoring/Operations is view-only in Phase 1 and is visible to tenant roles Owner, Manager, Operator, and Readonly. Start/re-run controls remain in the respective feature UIs.
Rationale:
- Adding run management actions implies introducing cancellation semantics, locks, permission matrices, and race handling across producers.
- View-only delivers the primary value (transparency + auditability) without expanding scope.
Alternatives considered:
- Add `Rerun` / `Cancel` actions in the hub.
  - Rejected (Phase 1): scope expansion into “run management”.
- Restrict viewing to non-Readonly roles.
  - Rejected: increases “what happened?” support loops; viewing is safe when sanitized.
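A minimal policy sketch of the view-only guardrail above, assuming a standard Laravel policy; `hasAnyTenantRole()` and `currentTenantId()` are hypothetical helper names used for illustration, not confirmed APIs:

```php
use App\Models\BulkOperationRun;
use App\Models\User;

class BulkOperationRunPolicy
{
    public function viewAny(User $user): bool
    {
        // All four tenant roles may view runs, including Readonly.
        return $user->hasAnyTenantRole(['owner', 'manager', 'operator', 'readonly']);
    }

    public function view(User $user, BulkOperationRun $run): bool
    {
        // Tenant isolation: cross-tenant access is denied (403).
        return $run->tenant_id === $user->currentTenantId()
            && $this->viewAny($user);
    }

    // No create/update/delete/rerun/cancel abilities are exposed here:
    // the hub is view-only in Phase 1; start controls stay in feature UIs.
}
```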
4) Status semantics and mapping
Decision: Standardize UI-level status semantics as queued → running → (succeeded | partially succeeded | failed) while allowing underlying storage to keep its current status vocabulary.
- `partially succeeded` = at least one success and at least one failure.
- `failed` = zero successes (or the run could not proceed).
- `BulkOperationRun.status` mapping: `pending` → queued, `running` → running, `completed` → succeeded, `completed_with_errors` → partially succeeded, `failed` / `aborted` → failed.
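A minimal sketch of what the canonical helpers might look like on the model, combining this mapping with the count-based bucket definitions; `resource`, `action`, and `success_count` are assumed attribute names used for illustration:

```php
use Illuminate\Database\Eloquent\Model;

class BulkOperationRun extends Model
{
    public function runType(): string
    {
        // Assumed `resource` and `action` attributes, e.g. "drift.generate".
        return "{$this->resource}.{$this->action}";
    }

    public function statusBucket(): string
    {
        return match ($this->status) {
            'pending' => 'queued',
            'running' => 'running',
            'completed' => 'succeeded',
            // Counts refine the bucket: zero successes means the run
            // effectively failed rather than partially succeeded.
            'completed_with_errors' => $this->success_count > 0
                ? 'partially succeeded'
                : 'failed',
            'failed', 'aborted' => 'failed',
            default => 'failed', // conservative fallback for unknown states
        };
    }
}
```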
Rationale:
- Keeps the operator-facing meaning consistent and testable without forcing a broad “rename statuses everywhere” refactor.
Alternatives considered:
- Normalize all run status values across all run tables immediately.
  - Rejected (Phase 1): broad blast radius across many features and tests.
5) Failure detail storage
Decision: Persist stable reason codes and short sanitized messages for failures; itemized operations also store a sanitized per-item failures list.
Rationale:
- Operators and support should understand failures without reading server logs.
- Per-item failures avoid rerunning large operations just to identify the affected item.
Alternatives considered:
- Summary-only failure storage.
  - Rejected: loses actionable “which item failed?” detail for itemized runs.
- Logs-only (no persisted failure detail).
  - Rejected: weaker observability and not aligned with “safe, actionable failures”.
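For illustration, one possible shape of a persisted per-item failure entry; the field names below are assumptions, not the actual schema:

```php
// Hypothetical sanitized failure entry for an itemized run. Messages are
// short and pre-sanitized: no secrets, tokens, or raw payload dumps.
$failureEntry = [
    'item_id'     => 'policy-123',                       // which item failed
    'reason_code' => 'graph_request_failed',             // stable, machine-readable
    'message'     => 'Request failed with HTTP 429 (throttled)',
];
```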
6) Idempotency & de-duplication
Decision: Use deterministic idempotency keys and active-run reuse as the primary dedupe mechanism:
- Key builder:
App\Support\RunIdempotency::buildKey(...)with stable, sorted context. - Active-run lookup: reuse when status is active (
pending/running). - Race reduction: rely on the existing partial unique index for active runs and handle collisions by finding and reusing the existing run.
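A hedged sketch of how such a deterministic key builder could work; the actual `RunIdempotency::buildKey(...)` signature is not shown in this document, so the parameters and hashing choice are assumptions:

```php
namespace App\Support;

class RunIdempotency
{
    public static function buildKey(string $runType, array $context): string
    {
        // Sort keys recursively so the same logical context always
        // serializes identically, regardless of input order.
        self::deepKsort($context);

        return hash('sha256', $runType.'|'.json_encode($context));
    }

    private static function deepKsort(array &$array): void
    {
        ksort($array);

        foreach ($array as &$value) {
            if (is_array($value)) {
                self::deepKsort($value);
            }
        }
    }
}
```

On insert, a unique-constraint violation from the partial index (which covers active runs only) can then be caught and resolved by re-querying for the existing active run with the same key.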
Rationale:
- Aligns with the constitution (“Automation must be Idempotent & Observable”).
- Durable across restarts and observable in the database.
Alternatives considered:
- Cache-only locks without persisted keys.
  - Rejected: less observable and easier to break across deploys/restarts.
7) Phase 1 producer scope
Decision: Phase 1 adopts the unified monitoring semantics for:
- Drift generation (`drift.generate`)
- Backup Set “Add Policies” (`backup_set.add_policies`)
Rationale:
- Both are already using `BulkOperationRun` and provide immediate value in the Monitoring hub.
- Keeps Phase 1 bounded while proving the pattern across two modules.
Alternatives considered:
- Include every long-running producer in one pass.
  - Rejected (Phase 1): larger blast radius and higher coordination cost.
Notes
- Retention/purge policy for run history should follow existing platform retention controls (defer to planning if changes are required).