TenantAtlas/specs/053-unify-runs-monitoring/research.md
ahmido 3030dd9af2 054-unify-runs-suitewide (#63)
Summary

In short: implements Feature 054 (canonical OperationRun flow, Monitoring UI, dispatch safety, notifications, dedupe) plus small UX safety clarifications (RBAC group search delegated; Restore group mapping DB-only).
What Changed

Core service: OperationRun lifecycle, dedupe and dispatch helpers — OperationRunService.php.
Model + migration: OperationRun model and migration — OperationRun.php, 2026_01_16_180642_create_operation_runs_table.php.
Notifications: queued + terminal DB notifications (initiator-only) — OperationRunQueued.php, OperationRunCompleted.php.
Monitoring UI: Filament list/detail + Livewire pieces (DB-only render) — OperationRunResource.php and related pages/views.
Start surfaces / Jobs: instrumented start surfaces, job middleware, and job updates to use canonical runs — multiple app/Jobs/* and app/Filament/* updates (see tests for full coverage).
RBAC + Restore UX clarifications: RBAC group search is delegated-Graph-based and disabled without delegated token; Restore group mapping remains DB-only (directory cache) and helper text always visible — TenantResource.php, RestoreRunResource.php.
Specs / Constitution: updated spec & quickstart and added one-line constitution guideline about Graph usage:
spec.md
quickstart.md
constitution.md
Tests & Verification

Unit / Feature tests added/updated for run lifecycle, notifications, idempotency, and UI guards: see tests/Feature/* (notably OperationRunServiceTest, MonitoringOperationsTest, OperationRunNotificationTest, and various Filament feature tests).
Full test run locally: ./vendor/bin/sail artisan test → 587 passed, 5 skipped.
Migrations

Adds create_operation_runs_table migration; run php artisan migrate in staging after review.
Notes / Rationale

Monitoring pages are explicitly DB-only at render time (no Graph calls). Start surfaces enqueue work only and return a “View run” link.
Delegated Graph access is used only for explicit user actions (RBAC group search); restore mapping intentionally uses cached DB data only to avoid render-time Graph calls.
Dispatch wrapper marks runs failed immediately if background dispatch throws synchronously to avoid misleading “queued” states.
Upgrade / Deploy Considerations

Run migrations: ./vendor/bin/sail artisan migrate.
Background workers should be running to process queued jobs (recommended to monitor queue health during rollout).
No secret or token persistence changes.
PR checklist

 Tests updated/added for changed behavior
 Specs updated: 054-unify-runs-suitewide docs + quickstart
 Constitution note added (.specify)
 Pint formatting applied

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #63
2026-01-17 22:25:00 +00:00


Research: Unified Operations Runs + Monitoring Hub (053)

This document resolves Phase 0 open questions and records design choices for Feature 053.

Decisions

1) Canonical run record (Phase 1)

Decision: Reuse the existing bulk_operation_runs table and its App\Models\BulkOperationRun model as the canonical “operation run” record for Phase 1.

Rationale:

  • The codebase already uses BulkOperationRun for long-running background work (including Drift generation and Backup Set “Add Policies”).
  • It already supports tenant scoping, initiator attribution, counts, and safe failure persistence.
  • Avoids a high-risk cross-feature migration before we have proven consistent semantics across modules.

Alternatives considered:

  • Create a new generic operation_runs (+ optional operation_run_items) model and migrate all producers to it.
    • Rejected (Phase 1): higher schema + refactor cost, higher coordination risk, and would slow down delivering the Monitoring hub.

2) Monitoring/Operations hub surface

Decision: Implement the Monitoring/Operations hub by evolving the existing Filament BulkOperationRunResource (navigation group/label + filters), rather than creating a new custom monitoring page in Phase 1.

Rationale:

  • The resource already provides a tenant-scoped list and a run detail view.
  • Adjusting the navigation group/label and filters of an existing resource is a small change, so the hub ships quickly with little regression risk.

Alternatives considered:

  • New “Monitoring → Operations” Filament Page + bespoke table/detail.
    • Rejected (Phase 1): duplicates existing capabilities and increases maintenance.

3) View-only guardrail and viewer roles

Decision: Monitoring/Operations is view-only in Phase 1 and is visible to tenant roles Owner, Manager, Operator, and Readonly. Start/re-run controls remain in the respective feature UIs.

Rationale:

  • Adding run management actions implies introducing cancellation semantics, locks, permission matrices, and race handling across producers.
  • View-only delivers the primary value (transparency + auditability) without expanding scope.

Alternatives considered:

  • Add Rerun / Cancel actions in the hub.
    • Rejected (Phase 1): scope expansion into “run management”.
  • Restrict viewing to non-Readonly roles.
    • Rejected: Readonly users would have to ask support “what happened?” more often; viewing is safe because failure details are sanitized.

4) Status semantics and mapping

Decision: Standardize UI-level status semantics as queued → running → (succeeded | partially succeeded | failed) while allowing underlying storage to keep its current status vocabulary.

  • partially succeeded = at least one success and at least one failure.
  • failed = zero successes (or the run could not proceed).
  • BulkOperationRun.status mapping: pending→queued, running→running, completed→succeeded, completed_with_errors→partially succeeded, failed/aborted→failed.
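
The mapping and the terminal-status rules above can be sketched as a small lookup plus a count-based classifier. This is a language-neutral illustration in Python, not the application's PHP code; the function names are hypothetical, and the "succeeded" rule (at least one success, no failures) is inferred from the two definitions above rather than stated in them.

```python
# Storage-level status -> UI-level status, copied from the mapping above.
STORAGE_TO_UI = {
    "pending": "queued",
    "running": "running",
    "completed": "succeeded",
    "completed_with_errors": "partially succeeded",
    "failed": "failed",
    "aborted": "failed",
}


def ui_status(storage_status: str) -> str:
    """Translate a stored run status into the UI vocabulary."""
    try:
        return STORAGE_TO_UI[storage_status]
    except KeyError:
        # Surface unknown values instead of silently guessing a label.
        raise ValueError(f"unmapped run status: {storage_status!r}") from None


def terminal_ui_status(succeeded: int, failed: int) -> str:
    """Classify a finished run from its item counts."""
    if succeeded > 0 and failed == 0:
        return "succeeded"  # assumption: inferred, not stated in the text
    if succeeded > 0 and failed > 0:
        return "partially succeeded"
    # Zero successes, including a run that could not proceed at all.
    return "failed"
```

Keeping the translation in one place means the storage vocabulary can stay as it is while tests assert only on the UI-level values.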

Rationale:

  • Keeps the operator-facing meaning consistent and testable without forcing a broad “rename statuses everywhere” refactor.

Alternatives considered:

  • Normalize all run status values across all run tables immediately.
    • Rejected (Phase 1): broad blast radius across many features and tests.

5) Failure detail storage

Decision: Persist stable reason codes and short sanitized messages for failures; itemized operations also store a sanitized per-item failures list.
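
One possible shape for this failure detail is sketched below in Python as a language-neutral illustration. Everything in it is an assumption made for the example: the field names, the 200-character cap (the text only says "short"), and the single redaction rule for bearer tokens. The real sanitization rules are not described in this document.

```python
import re
from dataclasses import dataclass, field

MAX_MESSAGE_LENGTH = 200  # hypothetical cap; the document only says "short"

# Hypothetical redaction rule: mask anything that looks like a bearer token.
_BEARER = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+")


def sanitize(message: str) -> str:
    """Redact token-like text, collapse whitespace, and truncate."""
    redacted = _BEARER.sub("Bearer [redacted]", message)
    collapsed = " ".join(redacted.split())
    return collapsed[:MAX_MESSAGE_LENGTH]


@dataclass
class ItemFailure:
    item_id: str
    reason_code: str  # stable, machine-readable code
    message: str      # short, already sanitized


@dataclass
class RunFailureDetail:
    reason_code: str
    message: str
    item_failures: list[ItemFailure] = field(default_factory=list)

    def add_item_failure(self, item_id: str, reason_code: str, raw: str) -> None:
        """Record one failed item, sanitizing its message on the way in."""
        self.item_failures.append(ItemFailure(item_id, reason_code, sanitize(raw)))
```

Sanitizing at write time, rather than at render time, keeps the stored record safe to show to every viewer role.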

Rationale:

  • Operators and support should understand failures without reading server logs.
  • Per-item failures avoid rerunning large operations just to identify the affected item.

Alternatives considered:

  • Summary-only failure storage.
    • Rejected: loses actionable “which item failed?” detail for itemized runs.
  • Logs-only (no persisted failure detail).
    • Rejected: weaker observability and not aligned with “safe, actionable failures”.

6) Idempotency & de-duplication

Decision: Use deterministic idempotency keys and active-run reuse as the primary dedupe mechanism:

  • Key builder: App\Support\RunIdempotency::buildKey(...) with stable, sorted context.
  • Active-run lookup: reuse the existing run when its status is active (pending or running).
  • Race reduction: rely on the existing partial unique index for active runs and handle collisions by finding and reusing the existing run.
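
The three mechanisms above can be sketched together as follows. This is a language-neutral illustration in Python, not the signature or behaviour of App\Support\RunIdempotency::buildKey. The key inputs (tenant, operation type, context), the SHA-256 hashing, and the in-memory store that stands in for the table and its partial unique index are all assumptions made for the example.

```python
import hashlib
import json

ACTIVE_STATUSES = {"pending", "running"}


def build_key(tenant_id: int, operation_type: str, context: dict) -> str:
    """Deterministic key: dict key order does not affect the result.

    Note: sort_keys orders dict keys only; list values keep their order,
    so a caller would need to sort those before passing them in.
    """
    canonical = json.dumps(
        {"tenant": tenant_id, "type": operation_type, "context": context},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DuplicateActiveRun(Exception):
    """Stands in for a unique-index violation raised by the database."""


class RunStore:
    """In-memory stand-in for the runs table and its partial unique index."""

    def __init__(self):
        self.runs = []

    def find_active(self, key):
        for run in self.runs:
            if run["key"] == key and run["status"] in ACTIVE_STATUSES:
                return run
        return None

    def insert(self, key):
        # Emulates the index: at most one active run per key.
        if self.find_active(key) is not None:
            raise DuplicateActiveRun(key)
        run = {"id": len(self.runs) + 1, "key": key, "status": "pending"}
        self.runs.append(run)
        return run


def find_or_create_run(store, key):
    """Return (run, created), reusing an active run when one exists."""
    existing = store.find_active(key)
    if existing is not None:
        return existing, False
    try:
        return store.insert(key), True
    except DuplicateActiveRun:
        # Lost a race between lookup and insert: reuse the winner's run.
        winner = store.find_active(key)
        if winner is None:
            raise
        return winner, False
```

Because the uniqueness check covers active runs only, a finished run does not block a new run with the same key.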

Rationale:

  • Aligns with the constitution (“Operations / Run Observability Standard”).
  • Durable across restarts and observable in the database.

Alternatives considered:

  • Cache-only locks without persisted keys.
    • Rejected: less observable and easier to break across deploys/restarts.

7) Phase 1 producer scope

Decision: Phase 1 adopts the unified monitoring semantics for:

  • Drift generation (drift.generate)
  • Backup Set “Add Policies” (backup_set.add_policies)

Rationale:

  • Both are already using BulkOperationRun and provide immediate value in the Monitoring hub.
  • Keeps Phase 1 bounded while proving the pattern across two modules.

Alternatives considered:

  • Include every long-running producer in one pass.
    • Rejected (Phase 1): larger blast radius and higher coordination cost.

Notes

  • Retention/purge policy for run history should follow existing platform retention controls (defer to planning if changes are required).