Summary

This PR introduces Unified Operations Runs + Monitoring Hub (053).

Goal: Standardize how long-running operations are tracked and monitored, using the existing tenant-scoped run record (`BulkOperationRun`) as the canonical “operation run”, and surface it in a single Monitoring → Operations hub (view-only, tenant-scoped, role-aware).

Phase 1 adoption scope (per spec):
• Drift generation (`drift.generate`)
• Backup Set “Add Policies” (`backup_set.add_policies`)

Note: This PR does not convert every run type yet (e.g. `GroupSyncRuns` / `InventorySyncRuns` remain separate for now). This is intentionally incremental.

⸻

What changed

Monitoring / Operations hub
• Moved/organized run monitoring under Monitoring → Operations
• Added:
  • status buckets (queued / running / succeeded / partially succeeded / failed)
  • filters (run type, status bucket, time range)
  • run detail “Related” links (e.g. Drift findings, Backup Set context)
• All hub pages are DB-only and view-only (no rerun/cancel/delete actions)

Canonical run semantics
• Added canonical helpers on `BulkOperationRun`:
  • `runType()` (`resource.action`)
  • `statusBucket()` derived from status + counts (testable semantics)

Drift integration (Phase 1)
• Drift generation start behavior now:
  • creates/reuses a `BulkOperationRun` with a drift context payload (`scope_key` + baseline/current run ids)
  • dispatches the generation job
  • emits DB notifications including a “View run” link
• On generation failure: stores sanitized failure entries + sends a failure notification

Permissions / tenant isolation
• Monitoring run list/view is tenant-scoped and returns 403 for cross-tenant access
• Readonly can view runs but cannot start drift generation

⸻

Tests

Added/updated Pest coverage:
• `BulkOperationRunStatusBucketTest.php`
• `DriftGenerationDispatchTest.php`
• `GenerateDriftFindingsJobNotificationTest.php`
• `RunAuthorizationTenantIsolationTest.php`

Validation run locally:
• `./vendor/bin/pint --dirty`
• targeted tests from the feature quickstart / drift monitoring tests

⸻

Manual QA

1. Go to Monitoring → Operations
   • verify filters (run type / status / time range)
   • verify run detail shows counts + sanitized failures + “Related” links
2. Open Drift Landing
   • with >= 2 successful inventory runs for a scope: should queue drift generation + show a notification with “View run”
   • as Readonly: should not start generation
3. Run detail
   • `drift.generate` runs show a “Drift findings” related link
   • failure entries are sanitized (no secrets/tokens/raw payload dumps)

⸻

Notes / Ops

• Queue workers must be restarted after deploy so they load the new code: `php artisan queue:restart` (or the Sail equivalent)
• This PR standardizes monitoring for Phase 1 producers only; follow-ups will migrate additional run types into the unified pattern.

⸻

Spec / Docs

• SpecKit artifacts added under `specs/053-unify-runs-monitoring/`
• Checklists are complete:
  • requirements checklist PASS
  • writing checklist PASS

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #60
Research: Unified Operations Runs + Monitoring Hub (053)
This document resolves Phase 0 open questions and records design choices for Feature 053.
Decisions
1) Canonical run record (Phase 1)
Decision: Reuse the existing `bulk_operation_runs` table / `App\Models\BulkOperationRun` model as the canonical “operation run” record for Phase 1.
Rationale:
- The codebase already uses `BulkOperationRun` for long-running background work (including Drift generation and Backup Set “Add Policies”).
- It already supports tenant scoping, initiator attribution, counts, and safe failure persistence.
- Avoids a high-risk cross-feature migration before we have proven consistent semantics across modules.
Alternatives considered:
- Create a new generic `operation_runs` (+ optional `operation_run_items`) model and migrate all producers to it.
  - Rejected (Phase 1): higher schema + refactor cost, higher coordination risk, and would slow down delivering the Monitoring hub.
2) Monitoring/Operations hub surface
Decision: Implement the Monitoring/Operations hub by evolving the existing Filament `BulkOperationRunResource` (navigation group/label + filters), rather than creating a new custom monitoring page in Phase 1. A sketch of the intended change follows the alternatives below.
Rationale:
- The resource already provides a tenant-scoped list and a run detail view.
- Small changes deliver high value quickly and reduce risk.
Alternatives considered:
- New “Monitoring → Operations” Filament Page + bespoke table/detail.
  - Rejected (Phase 1): duplicates existing capabilities and increases maintenance burden.
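For concreteness, a minimal sketch of the kind of change this decision implies, assuming Filament v3 conventions; the class body, filter names, and options below are illustrative, not the actual implementation:

```php
use Filament\Resources\Resource;
use Filament\Tables\Filters\SelectFilter;
use Filament\Tables\Table;

class BulkOperationRunResource extends Resource
{
    // Surface the existing resource under Monitoring → Operations.
    protected static ?string $navigationGroup = 'Monitoring';

    protected static ?string $navigationLabel = 'Operations';

    public static function table(Table $table): Table
    {
        return $table->filters([
            // Phase 1 run types (see Decision 7).
            SelectFilter::make('run_type')->options([
                'drift.generate' => 'Drift generation',
                'backup_set.add_policies' => 'Backup Set: Add Policies',
            ]),
            // Status buckets per Decision 4. Since the bucket is derived
            // from status + counts, a real filter would need a custom
            // query callback rather than a plain column filter.
            SelectFilter::make('status_bucket')->options([
                'queued' => 'Queued',
                'running' => 'Running',
                'succeeded' => 'Succeeded',
                'partially_succeeded' => 'Partially succeeded',
                'failed' => 'Failed',
            ]),
        ]);
    }
}
```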
3) View-only guardrail and viewer roles
Decision: Monitoring/Operations is view-only in Phase 1 and is visible to tenant roles Owner, Manager, Operator, and Readonly. Start/re-run controls remain in the respective feature UIs.
Rationale:
- Adding run management actions implies introducing cancellation semantics, locks, permission matrices, and race handling across producers.
- View-only delivers the primary value (transparency + auditability) without expanding scope.
Alternatives considered:
- Add `Rerun` / `Cancel` actions in the hub.
  - Rejected (Phase 1): scope expansion into “run management”.
- Restrict viewing to non-Readonly roles.
  - Rejected: increases “what happened?” support loops; viewing is safe when sanitized.
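A minimal policy sketch of the view-only guardrail above, assuming a standard Laravel policy; `hasAnyTenantRole()` and `currentTenantId()` are hypothetical helper names used for illustration, not confirmed APIs:

```php
use App\Models\BulkOperationRun;
use App\Models\User;

class BulkOperationRunPolicy
{
    public function viewAny(User $user): bool
    {
        // All four tenant roles may view runs, including Readonly.
        return $user->hasAnyTenantRole(['owner', 'manager', 'operator', 'readonly']);
    }

    public function view(User $user, BulkOperationRun $run): bool
    {
        // Tenant isolation: cross-tenant access is denied (403).
        return $run->tenant_id === $user->currentTenantId()
            && $this->viewAny($user);
    }

    // No create/update/delete/rerun/cancel abilities are exposed here:
    // the hub is view-only in Phase 1; start controls stay in feature UIs.
}
```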
4) Status semantics and mapping
Decision: Standardize UI-level status semantics as queued → running → (succeeded | partially succeeded | failed) while allowing underlying storage to keep its current status vocabulary.
- `partially succeeded` = at least one success and at least one failure.
- `failed` = zero successes (or the run could not proceed).
- `BulkOperationRun.status` mapping: `pending` → queued, `running` → running, `completed` → succeeded, `completed_with_errors` → partially succeeded, `failed` / `aborted` → failed.
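A minimal sketch of what the canonical helpers might look like on the model, combining this mapping with the count-based bucket definitions; `resource`, `action`, and `success_count` are assumed attribute names used for illustration:

```php
use Illuminate\Database\Eloquent\Model;

class BulkOperationRun extends Model
{
    public function runType(): string
    {
        // Assumed `resource` and `action` attributes, e.g. "drift.generate".
        return "{$this->resource}.{$this->action}";
    }

    public function statusBucket(): string
    {
        return match ($this->status) {
            'pending' => 'queued',
            'running' => 'running',
            'completed' => 'succeeded',
            // Counts refine the bucket: zero successes means the run
            // effectively failed rather than partially succeeded.
            'completed_with_errors' => $this->success_count > 0
                ? 'partially succeeded'
                : 'failed',
            'failed', 'aborted' => 'failed',
            default => 'failed', // conservative fallback for unknown states
        };
    }
}
```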
Rationale:
- Keeps the operator-facing meaning consistent and testable without forcing a broad “rename statuses everywhere” refactor.
Alternatives considered:
- Normalize all run status values across all run tables immediately.
  - Rejected (Phase 1): broad blast radius across many features and tests.
5) Failure detail storage
Decision: Persist stable reason codes and short sanitized messages for failures; itemized operations also store a sanitized per-item failures list.
Rationale:
- Operators and support should understand failures without reading server logs.
- Per-item failures avoid rerunning large operations just to identify the affected item.
Alternatives considered:
- Summary-only failure storage.
  - Rejected: loses actionable “which item failed?” detail for itemized runs.
- Logs-only (no persisted failure detail).
  - Rejected: weaker observability and not aligned with “safe, actionable failures”.
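For illustration, one possible shape of a persisted per-item failure entry; the field names below are assumptions, not the actual schema:

```php
// Hypothetical sanitized failure entry for an itemized run. Messages are
// short and pre-sanitized: no secrets, tokens, or raw payload dumps.
$failureEntry = [
    'item_id'     => 'policy-123',                       // which item failed
    'reason_code' => 'graph_request_failed',             // stable, machine-readable
    'message'     => 'Request failed with HTTP 429 (throttled)',
];
```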
6) Idempotency & de-duplication
Decision: Use deterministic idempotency keys and active-run reuse as the primary dedupe mechanism:
- Key builder:
App\Support\RunIdempotency::buildKey(...)with stable, sorted context. - Active-run lookup: reuse when status is active (
pending/running). - Race reduction: rely on the existing partial unique index for active runs and handle collisions by finding and reusing the existing run.
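A hedged sketch of how such a deterministic key builder could work; the actual `RunIdempotency::buildKey(...)` signature is not shown in this document, so the parameters and hashing choice are assumptions:

```php
namespace App\Support;

class RunIdempotency
{
    public static function buildKey(string $runType, array $context): string
    {
        // Sort keys recursively so the same logical context always
        // serializes identically, regardless of input order.
        self::deepKsort($context);

        return hash('sha256', $runType.'|'.json_encode($context));
    }

    private static function deepKsort(array &$array): void
    {
        ksort($array);

        foreach ($array as &$value) {
            if (is_array($value)) {
                self::deepKsort($value);
            }
        }
    }
}
```

On insert, a unique-constraint violation from the partial index (which covers active runs only) can then be caught and resolved by re-querying for the existing active run with the same key.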
Rationale:
- Aligns with the constitution (“Automation must be Idempotent & Observable”).
- Durable across restarts and observable in the database.
Alternatives considered:
- Cache-only locks without persisted keys.
  - Rejected: less observable and easier to break across deploys/restarts.
7) Phase 1 producer scope
Decision: Phase 1 adopts the unified monitoring semantics for:
- Drift generation (`drift.generate`)
- Backup Set “Add Policies” (`backup_set.add_policies`)
Rationale:
- Both are already using `BulkOperationRun` and provide immediate value in the Monitoring hub.
- Keeps Phase 1 bounded while proving the pattern across two modules.
Alternatives considered:
- Include every long-running producer in one pass.
  - Rejected (Phase 1): larger blast radius and higher coordination cost.
Notes
- Retention/purge policy for run history should follow existing platform retention controls (defer to planning if changes are required).