TenantAtlas/specs/053-unify-runs-monitoring/plan.md
ahmido 30ad57baab feat/053-unify-runs-monitoring (#60)
Summary

This PR introduces Unified Operations Runs + Monitoring Hub (053).

Goal: Standardize how long-running operations are tracked and monitored using the existing tenant-scoped run record (BulkOperationRun) as the canonical “operation run”, and surface it in a single Monitoring → Operations hub (view-only, tenant-scoped, role-aware).

Phase 1 adoption scope (per spec):
	•	Drift generation (drift.generate)
	•	Backup Set “Add Policies” (backup_set.add_policies)

Note: This PR does not convert every run type yet (e.g. GroupSyncRuns / InventorySyncRuns remain separate for now). This is intentionally incremental.

⸻

What changed

Monitoring / Operations hub
	•	Moved/organized run monitoring under Monitoring → Operations
	•	Added:
	•	status buckets (queued / running / succeeded / partially succeeded / failed)
	•	filters (run type, status bucket, time range)
	•	run detail “Related” links (e.g. Drift findings, Backup Set context)
	•	All hub pages are DB-only and view-only (no rerun/cancel/delete actions)

Canonical run semantics
	•	Added canonical helpers on BulkOperationRun:
	•	runType() (resource.action)
	•	statusBucket() derived from status + counts (testable semantics)

Drift integration (Phase 1)
	•	Drift generation start behavior now:
	•	creates/reuses a BulkOperationRun with drift context payload (scope_key + baseline/current run ids)
	•	dispatches generation job
	•	emits DB notifications including “View run” link
	•	On generation failure: stores sanitized failure entries + sends failure notification

Permissions / tenant isolation
	•	Monitoring run list/view is tenant-scoped and returns 403 for cross-tenant access
	•	Readonly can view runs but cannot start drift generation

⸻

Tests

Added/updated Pest coverage:
	•	BulkOperationRunStatusBucketTest.php
	•	DriftGenerationDispatchTest.php
	•	GenerateDriftFindingsJobNotificationTest.php
	•	RunAuthorizationTenantIsolationTest.php

Validation run locally:
	•	./vendor/bin/pint --dirty
	•	targeted tests from feature quickstart / drift monitoring tests

⸻

Manual QA
	1.	Go to Monitoring → Operations
	•	verify filters (run type / status / time range)
	•	verify run detail shows counts + sanitized failures + “Related” links
	2.	Open Drift Landing
	•	with >=2 successful inventory runs for scope: should queue drift generation + show notification with “View run”
	•	as readonly: should not start generation
	3.	Run detail
	•	drift.generate runs show “Drift findings” related link
	•	failure entries are sanitized (no secrets/tokens/raw payload dumps)

⸻

Notes / Ops
	•	Queue workers must be restarted after deploy so they load the new code:
	•	php artisan queue:restart (or Sail equivalent)
	•	This PR standardizes monitoring for Phase 1 producers only; follow-ups will migrate additional run types into the unified pattern.

⸻

Spec / Docs
	•	SpecKit artifacts added under specs/053-unify-runs-monitoring/
	•	Checklists are complete:
	•	requirements checklist PASS
	•	writing checklist PASS

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #60
2026-01-16 15:10:31 +00:00


Implementation Plan: Unified Operations Runs + Monitoring Hub (053)

Branch: feat/053-unify-runs-monitoring | Date: 2026-01-16 | Spec: /Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/spec.md (spec.md)
Input: Feature specification from /Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/spec.md

Note: This plan is filled in by the /speckit.plan command. See /Users/ahmeddarrazi/Documents/projects/TenantAtlas/.specify/scripts/ for helper scripts.

Summary

Unify how long-running operations are tracked and monitored by using the existing tenant-scoped run record (BulkOperationRun) as the canonical “operation run”, surfacing it in a single Monitoring/Operations hub (view-only, tenant-scoped, role-aware), and standardizing status semantics, notifications, failure detail minimization, and idempotent de-duplication.

Phase 1 adoption scope (per clarifications): Drift generation + Backup Set “Add Policies”.

Technical Context

Language/Version: PHP 8.4.15 (Laravel 12)
Primary Dependencies: Filament v4, Livewire v3
Storage: PostgreSQL (JSONB for run item_ids and failures)
Testing: Pest v4 (PHPUnit v12)
Target Platform: Web application (Sail-first locally, Dokploy-first deploy)
Project Type: web
Performance Goals: Monitoring/Operations index renders in <1s for the most recent ~250 runs; start actions confirm and provide a “View run” link within 2 seconds (aligns with SC-002).
Constraints: Monitoring pages are DB-only and view-only; strict tenant isolation; no secrets/tokens stored; run failures use stable reason codes + short sanitized messages; itemized runs store per-item failures (sanitized).
Scale/Scope: Tenant-scoped run history across multiple modules; Phase 1 covers drift generation + backup set “add policies”, with more run producers added in later phases.

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

  • Inventory-first: PASS. Monitoring uses persisted run records; drift generation is based on inventory sync “last observed” state and stores findings (not raw snapshots).
  • Read/write separation: PASS. Monitoring/Operations is view-only (no start/rerun/cancel/delete). Write actions remain in their feature UIs and already use queued jobs + auditability.
  • Graph contract path: PASS (no new Graph calls introduced by Monitoring). Existing Graph calls remain behind existing abstractions and must not occur during Monitoring page render.
  • Deterministic capabilities: N/A for this feature (no new capability resolver). Existing idempotency key builder remains deterministic.
  • Tenant isolation: PASS. Run list/view/start remain tenant-scoped; cross-tenant access is forbidden (403).
  • Automation: PASS. Active-run de-duplication uses deterministic idempotency keys + partial unique indexes; runs remain observable with status + counts + safe errors.
  • Data minimization: PASS. Failures are sanitized/minimized; no secrets/tokens/raw external payload dumps stored or displayed.

Gate status (pre-Phase 0): PASS (no violations).

Gate status (post-Phase 1): PASS (design artifacts present: research.md, data-model.md, contracts/*, quickstart.md).

Project Structure

Documentation (this feature)

/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/
├── plan.md                     # This file (/speckit.plan command output)
├── spec.md                     # Feature specification (input)
├── checklists/
│   └── requirements.md         # Spec quality checklist
├── research.md                 # Phase 0 output
├── data-model.md               # Phase 1 output
├── quickstart.md               # Phase 1 output
├── contracts/                  # Phase 1 output
└── tasks.md                    # Phase 2 output (/speckit.tasks command - NOT created by /speckit.plan)

Source Code (repository root)

app/
├── Filament/
│   ├── Pages/
│   └── Resources/
├── Jobs/
├── Models/
├── Policies/
├── Services/
└── Support/

config/

database/
└── migrations/

routes/

tests/
├── Feature/
└── Unit/

Structure Decision: Laravel web application. Implement Monitoring/Operations primarily via Filament (Resources/Pages) and reuse existing run-record primitives (bulk_operation_runs) with tenant-scoped policies and Pest tests.

Key Implementation Decisions (Pinned)

  • Phase 1 scope: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
  • Monitoring permissions: Owner, Manager, Operator, Readonly can view. Readonly is strictly view-only.
  • Monitoring guardrail: Monitoring/Operations is view-only in Phase 1 (no start/rerun/cancel/delete).
  • Status semantics: UI-level statuses are consistent and testable:
    • partially succeeded = at least one success and at least one failure
    • failed = zero successes (or the run could not proceed)
  • Failure detail: Stable reason codes + short sanitized messages; itemized operations include per-item failures (sanitized).
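The permission decisions above can be illustrated with a small, language-neutral sketch (written in Python for brevity; the project itself is Laravel/PHP). All names here (`User`, `Run`, `can_view_run`, `can_start_operation`) are hypothetical, and the assumption that every role except Readonly may start operations is ours, not something the plan states.

```python
from dataclasses import dataclass

# Roles named in the plan; lowercase identifiers are an assumption.
VIEW_ROLES = {"owner", "manager", "operator", "readonly"}
# Assumption: everyone who can view, except readonly, may start operations.
START_ROLES = VIEW_ROLES - {"readonly"}


@dataclass(frozen=True)
class User:
    tenant_id: int
    role: str


@dataclass(frozen=True)
class Run:
    id: int
    tenant_id: int


def can_view_run(user: User, run: Run) -> bool:
    """A run is visible only inside its own tenant and only to known roles."""
    return user.tenant_id == run.tenant_id and user.role in VIEW_ROLES


def can_start_operation(user: User, tenant_id: int) -> bool:
    """Starting happens in the feature UIs, never in the Monitoring hub."""
    return user.tenant_id == tenant_id and user.role in START_ROLES


def view_run_status(user: User, run: Run) -> int:
    """HTTP status for the run detail page: 403 on any denied access."""
    return 200 if can_view_run(user, run) else 403
```

In this sketch a Readonly user gets 200 on a run in their own tenant, any user gets 403 on another tenant's run, and only the start check distinguishes Readonly from the other roles.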

Execution Model

Run record primitive

  • Canonical run record: App\Models\BulkOperationRun (tenant-scoped) for Phase 1.
  • Producers in Phase 1:
    • Drift generation: resource=drift, action=generate
    • Backup Set “Add Policies”: resource=backup_set, action=add_policies (or existing canonical action naming)

Status mapping (storage ↔ UI semantics)

The UI MUST present consistent meanings while allowing storage to keep existing vocabulary:

  • pending → queued
  • running → running
  • completed → succeeded
  • completed_with_errors → partially succeeded
  • failed / aborted → failed
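One way to make this mapping testable is a pure function from stored status plus counts to a UI bucket. The sketch below is in Python for illustration only. The function name, the bucket identifiers, and the rule that counts override the stored status for finished runs (so a finished run with failures and zero successes lands in "failed") are assumptions derived from the pinned status semantics, not the actual `statusBucket()` implementation.

```python
# Storage vocabulary -> UI bucket, as listed in the plan.
STORAGE_TO_BUCKET = {
    "pending": "queued",
    "running": "running",
    "completed": "succeeded",
    "completed_with_errors": "partially_succeeded",
    "failed": "failed",
    "aborted": "failed",
}

FINISHED = {"completed", "completed_with_errors"}


def status_bucket(status: str, succeeded: int = 0, failed: int = 0) -> str:
    """Derive the UI bucket from the stored status and the item counts."""
    if status not in STORAGE_TO_BUCKET:
        raise ValueError(f"unknown run status: {status!r}")
    if status in FINISHED and failed > 0:
        # Assumption: partial needs at least one success AND one failure;
        # failures with zero successes count as failed.
        return "partially_succeeded" if succeeded > 0 else "failed"
    return STORAGE_TO_BUCKET[status]
```

Keeping the derivation free of I/O means the partial-versus-failed boundary can be covered by plain unit tests.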

Idempotency & de-duplication

  • Deterministic idempotency key per tenant + operation type + scope via App\Support\RunIdempotency.
  • Active-run reuse: if an identical run is pending or running, reuse it (return the existing run and link to it).
  • Race reduction: rely on the existing partial unique index for active runs and handle collisions by “find existing and reuse”.
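The de-duplication flow above can be sketched end to end. The example uses Python with an in-memory SQLite database purely so it runs anywhere; the project uses PostgreSQL, which also supports partial unique indexes. The table, column, and function names are hypothetical, and the hashing scheme is an assumption rather than what `App\Support\RunIdempotency` actually does.

```python
import hashlib
import json
import sqlite3

ACTIVE = "('pending', 'running')"


def idempotency_key(tenant_id: int, run_type: str, scope: dict) -> str:
    """Deterministic key: same tenant + type + scope always hashes the same."""
    canonical = json.dumps(
        {"tenant": tenant_id, "type": run_type, "scope": scope},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE runs (id INTEGER PRIMARY KEY, tenant_id INTEGER NOT NULL,"
        " idempotency_key TEXT NOT NULL, status TEXT NOT NULL)"
    )
    # Partial unique index: at most one ACTIVE run per tenant + key.
    db.execute(
        "CREATE UNIQUE INDEX runs_active_key ON runs (tenant_id, idempotency_key)"
        f" WHERE status IN {ACTIVE}"
    )
    return db


def find_active(db: sqlite3.Connection, tenant_id: int, key: str):
    row = db.execute(
        "SELECT id FROM runs WHERE tenant_id = ? AND idempotency_key = ?"
        f" AND status IN {ACTIVE}",
        (tenant_id, key),
    ).fetchone()
    return row[0] if row else None


def start_or_reuse(db: sqlite3.Connection, tenant_id: int, key: str):
    """Return (run_id, created). Reuses an active run if one exists."""
    existing = find_active(db, tenant_id, key)
    if existing is not None:
        return existing, False
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO runs (tenant_id, idempotency_key, status)"
                " VALUES (?, ?, 'pending')",
                (tenant_id, key),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Lost a race: another request inserted first, so find and reuse it.
        existing = find_active(db, tenant_id, key)
        if existing is None:
            raise
        return existing, False
```

The lookup handles the common case cheaply, while the unique index remains the real guard: two concurrent starts can both miss the lookup, but only one insert succeeds and the other falls back to reuse.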

Notifications

  • Use DB notifications for “queued” and “completed” lifecycle events, linking to the run detail page.
  • Notifications and persisted run failures must remain sanitized (no secrets/tokens/raw payloads).
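To make the sanitization requirement concrete, here is a minimal Python sketch of a failure entry built from a stable reason code and a short, redacted message. The redaction patterns, the 200-character cap, and the function names are illustrative assumptions; they are neither exhaustive nor the project's actual sanitizer.

```python
import re

MAX_MESSAGE_LENGTH = 200  # assumed cap for "short" messages

# Illustrative patterns only; a real sanitizer would need a reviewed list.
_REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (
        re.compile(r"(?i)(password|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
        r"\1=[REDACTED]",
    ),
]


def sanitize_message(message: str) -> str:
    """Collapse whitespace, redact credential-like values, cap the length."""
    text = " ".join(message.split())
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    if len(text) > MAX_MESSAGE_LENGTH:
        text = text[: MAX_MESSAGE_LENGTH - 3] + "..."
    return text


def failure_entry(reason_code: str, message: str, item_id=None) -> dict:
    """Build a storable failure: stable code + sanitized message (+ item)."""
    entry = {"reason_code": reason_code, "message": sanitize_message(message)}
    if item_id is not None:
        entry["item_id"] = item_id
    return entry
```

Applying the same function before both persisting a failure and rendering a notification keeps the two outputs consistent.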

Monitoring/Operations hub

  • Central list + filters for the active tenant:
    • filter by resource/action, status bucket (queued/running/succeeded/partial/failed), and time range
    • drill-down to run detail (status + counts + sanitized failures + item identifiers)
  • View-only: no hub actions to start, rerun, cancel, or delete runs.
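The hub's list behaviour amounts to a tenant-scoped, filtered, newest-first query. The Python sketch below models it over in-memory rows; `RunRow`, `filter_runs`, and the default limit of 250 (borrowed from the performance goal) are assumptions for illustration, and the real hub would express this as a database query through Filament.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RunRow:
    id: int
    tenant_id: int
    run_type: str   # e.g. "drift.generate"
    bucket: str     # e.g. "queued", "running", "succeeded", "failed"
    created_at: datetime


def filter_runs(
    runs: list,
    tenant_id: int,
    run_type: Optional[str] = None,
    bucket: Optional[str] = None,
    since: Optional[datetime] = None,
    limit: int = 250,
) -> list:
    """Tenant scoping is mandatory; every other filter is optional."""
    selected = [
        r
        for r in runs
        if r.tenant_id == tenant_id
        and (run_type is None or r.run_type == run_type)
        and (bucket is None or r.bucket == bucket)
        and (since is None or r.created_at >= since)
    ]
    selected.sort(key=lambda r: r.created_at, reverse=True)  # newest first
    return selected[:limit]
```

Making `tenant_id` a required positional argument, rather than an optional filter, mirrors the plan's rule that the hub never lists runs outside the active tenant.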

Definition of Done (ends at Phase 2 planning)

Phase 2 (MVP implementation readiness)

  • Monitoring/Operations navigation exists and lists tenant runs with the required filters and drill-down.
  • Role guardrail enforced: Readonly can view list + detail but has no action controls.
  • Status bucket semantics are consistent and testable (including partial vs failed).
  • Drift generation and Backup Set “Add Policies” runs are visible and linkable from their feature entry points and from Monitoring/Operations.
  • Design artifacts exist and are referenced by this plan:
    • /Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/research.md
    • /Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/data-model.md
    • /Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/contracts/
    • /Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/quickstart.md

Complexity Tracking

Fill ONLY if Constitution Check has violations that must be justified

N/A (no constitution violations anticipated)