## Summary

In brief: implements Feature 054 (canonical OperationRun flow, Monitoring UI, dispatch safety, notifications, dedupe) plus small UX safety clarifications (RBAC group search is delegated; Restore group mapping is DB-only).

## What Changed

- Core service: OperationRun lifecycle, dedupe, and dispatch helpers (`OperationRunService.php`).
- Model + migration: `OperationRun` model and migration (`OperationRun.php`, `2026_01_16_180642_create_operation_runs_table.php`).
- Notifications: queued + terminal DB notifications, initiator-only (`OperationRunQueued.php`, `OperationRunCompleted.php`).
- Monitoring UI: Filament list/detail + Livewire pieces with DB-only rendering (`OperationRunResource.php` and related pages/views).
- Start surfaces / jobs: instrumented start surfaces, job middleware, and job updates to use canonical runs (multiple `app/Jobs/*` and `app/Filament/*` updates; see tests for full coverage).
- RBAC + Restore UX clarifications: RBAC group search is delegated-Graph-based and disabled without a delegated token; Restore group mapping remains DB-only (directory cache) and its helper text is always visible (`TenantResource.php`, `RestoreRunResource.php`).
- Specs / constitution: updated spec and quickstart, and added a one-line constitution guideline about Graph usage (`spec.md`, `quickstart.md`, `constitution.md`).

## Tests & Verification

- Unit/feature tests added or updated for run lifecycle, notifications, idempotency, and UI guards: see `tests/Feature/*` (notably `OperationRunServiceTest`, `MonitoringOperationsTest`, `OperationRunNotificationTest`, and various Filament feature tests).
- Full local test run: `./vendor/bin/sail artisan test` → 587 passed, 5 skipped.

## Migrations

Adds the `create_operation_runs_table` migration; run `php artisan migrate` in staging after review.

## Notes / Rationale

- Monitoring pages are explicitly DB-only at render time (no Graph calls). Start surfaces only enqueue work and return a "View run" link.
- Delegated Graph access is used only for explicit user actions (RBAC group search); restore mapping intentionally uses cached DB data only, to avoid render-time Graph calls.
- The dispatch wrapper marks a run failed immediately if background dispatch throws synchronously, to avoid misleading "queued" states (sketched below).

## Upgrade / Deploy Considerations

- Run migrations: `./vendor/bin/sail artisan migrate`.
- Background workers should be running to process queued jobs (monitor queue health during rollout).
- No secret or token persistence changes.

## PR checklist

- Tests updated/added for changed behavior
- Specs updated: 054-unify-runs-suitewide docs + quickstart
- Constitution note added (`.specify`)
- Pint formatting applied

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #63
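To illustrate the dispatch-safety note above: a minimal sketch, not the actual `OperationRunService` code. The `dispatchSafely` method name and the failure columns are assumptions for illustration only.

```php
<?php

namespace App\Services;

use App\Models\OperationRun;
use Illuminate\Support\Facades\Bus;
use Throwable;

class OperationRunService
{
    /**
     * Dispatch the run's job while keeping the run record honest: if
     * the queue driver throws synchronously (e.g. the broker is down),
     * mark the run failed instead of leaving it "queued" forever.
     */
    public function dispatchSafely(OperationRun $run, object $job): void
    {
        try {
            Bus::dispatch($job);
        } catch (Throwable $e) {
            // Hypothetical columns; the real schema may differ.
            $run->forceFill([
                'status' => 'failed',
                'failure_reason' => 'dispatch_failed',
                'failure_message' => 'Background dispatch failed before the job reached the queue.',
            ])->save();

            throw $e; // surface the error to the start surface
        }
    }
}
```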
# Research: Unified Operations Runs + Monitoring Hub (053)
This document resolves Phase 0 open questions and records design choices for Feature 053.

## Decisions

### 1) Canonical run record (Phase 1)

**Decision:** Reuse the existing `bulk_operation_runs` / `App\Models\BulkOperationRun` as the canonical “operation run” record for Phase 1.
**Rationale:**

- The codebase already uses `BulkOperationRun` for long-running background work (including Drift generation and Backup Set “Add Policies”).
- It already supports tenant scoping, initiator attribution, counts, and safe failure persistence.
- Avoids a high-risk cross-feature migration before we have proven consistent semantics across modules.
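To make the reuse concrete, here is a minimal sketch of a producer creating a canonical run before enqueuing work. The column names (`tenant_id`, `initiated_by`, the counters) are assumptions based on the capabilities listed above, not the actual schema.

```php
use App\Models\BulkOperationRun;

// Illustrative only: create the canonical run record up front so the
// Monitoring hub can show it from the moment work is enqueued.
$run = BulkOperationRun::create([
    'tenant_id'     => $tenant->id,       // tenant scoping
    'initiated_by'  => $user->id,         // initiator attribution
    'operation'     => 'drift.generate',  // producer identifier
    'status'        => 'pending',
    'total_count'   => $policies->count(),
    'success_count' => 0,
    'failure_count' => 0,
]);
```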
**Alternatives considered:**

- **Create a new generic `operation_runs` (+ optional `operation_run_items`) model** and migrate all producers to it.
  - Rejected (Phase 1): higher schema + refactor cost, higher coordination risk, and would slow down delivering the Monitoring hub.
### 2) Monitoring/Operations hub surface

**Decision:** Implement the Monitoring/Operations hub by evolving the existing Filament `BulkOperationRunResource` (navigation group/label + filters), rather than creating a new custom monitoring page in Phase 1.
**Rationale:**

- The resource already provides a tenant-scoped list and a run detail view.
- Small changes deliver high value quickly and reduce risk.

**Alternatives considered:**

- **New “Monitoring → Operations” Filament Page + bespoke table/detail.**
  - Rejected (Phase 1): duplicates existing capabilities and increases maintenance.
### 3) View-only guardrail and viewer roles

**Decision:** Monitoring/Operations is view-only in Phase 1 and is visible to tenant roles `Owner`, `Manager`, `Operator`, and `Readonly`. Start/re-run controls remain in the respective feature UIs.
**Rationale:**

- Adding run management actions implies introducing cancellation semantics, locks, permission matrices, and race handling across producers.
- View-only delivers the primary value (transparency + auditability) without expanding scope.
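A minimal sketch of the resulting visibility rule on the Filament resource; `hasTenantRole()` is a hypothetical helper standing in for whatever role check the codebase actually uses.

```php
use Filament\Resources\Resource;

class BulkOperationRunResource extends Resource
{
    // The hub is view-only, so all four tenant roles may see it,
    // Readonly included: sanitized run data is safe to view.
    public static function canViewAny(): bool
    {
        return auth()->user()?->hasTenantRole([
            'Owner', 'Manager', 'Operator', 'Readonly',
        ]) ?? false;
    }
}
```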
**Alternatives considered:**

- **Add `Rerun` / `Cancel` actions in the hub.**
  - Rejected (Phase 1): scope expansion into “run management”.
- **Restrict viewing to non-Readonly roles.**
  - Rejected: increases “what happened?” support loops; viewing is safe when sanitized.
### 4) Status semantics and mapping

**Decision:** Standardize UI-level status semantics as `queued → running → (succeeded | partially succeeded | failed)` while allowing underlying storage to keep its current status vocabulary.
- `partially succeeded` = at least one success and at least one failure.
- `failed` = zero successes (or the run could not proceed).
- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partially succeeded`, `failed/aborted→failed` (see the sketch below).
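The mapping above translates directly into a small helper; a sketch:

```php
/**
 * Map stored BulkOperationRun statuses to the UI-level vocabulary.
 */
function uiStatus(string $stored): string
{
    return match ($stored) {
        'pending'               => 'queued',
        'running'               => 'running',
        'completed'             => 'succeeded',
        'completed_with_errors' => 'partially succeeded',
        'failed', 'aborted'     => 'failed',
        default                 => $stored, // unknown values pass through
    };
}
```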
**Rationale:**

- Keeps the operator-facing meaning consistent and testable without forcing a broad “rename statuses everywhere” refactor.

**Alternatives considered:**

- **Normalize all run status values across all run tables immediately.**
  - Rejected (Phase 1): broad blast radius across many features and tests.
### 5) Failure detail storage

**Decision:** Persist stable reason codes and short sanitized messages for failures; itemized operations also store a sanitized per-item failures list.
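As an illustration of the intended shape; all column and key names here are assumptions, and `item_failures` presumes a JSON-cast column.

```php
// Run-level summary: a stable reason code plus a short, sanitized
// message (no tokens, no raw exception dumps).
$run->update([
    'status'          => 'completed_with_errors',
    'failure_reason'  => 'graph_throttled',
    'failure_message' => '2 of 40 items failed; see per-item details.',
]);

// Itemized operations also persist a sanitized per-item failures list.
$run->update([
    'item_failures' => [
        ['item' => 'policy:0f3c…', 'code' => 'not_found',
         'message' => 'Policy no longer exists in the tenant.'],
        ['item' => 'policy:9a1b…', 'code' => 'graph_throttled',
         'message' => 'Graph throttled the update request.'],
    ],
]);
```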
**Rationale:**

- Operators and support should understand failures without reading server logs.
- Per-item failures avoid rerunning large operations just to identify the affected item.

**Alternatives considered:**

- **Summary-only failure storage.**
  - Rejected: loses actionable “which item failed?” detail for itemized runs.
- **Logs-only (no persisted failure detail).**
  - Rejected: weaker observability and not aligned with “safe, actionable failures”.
### 6) Idempotency & de-duplication
**Decision:** Use deterministic idempotency keys and active-run reuse as the primary dedupe mechanism:

- Key builder: `App\Support\RunIdempotency::buildKey(...)` with stable, sorted context.
- Active-run lookup: reuse when status is active (`pending`/`running`).
- Race reduction: rely on the existing partial unique index for active runs and handle collisions by finding and reusing the existing run (see the sketch below).
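A sketch of the dedupe flow inside the service method that starts a run; the exact `buildKey(...)` signature and the `idempotency_key` column are assumptions.

```php
use App\Models\BulkOperationRun;
use App\Support\RunIdempotency;
use Illuminate\Database\QueryException;

// Deterministic key: same operation + same (sorted) context => same key.
$key = RunIdempotency::buildKey('backup_set.add_policies', [
    'backup_set_id' => $set->id,
    'policy_ids'    => collect($policyIds)->sort()->values()->all(),
]);

$findActive = fn () => BulkOperationRun::query()
    ->where('idempotency_key', $key)
    ->whereIn('status', ['pending', 'running'])
    ->first();

// Primary dedupe: reuse an active run for the same key.
if ($existing = $findActive()) {
    return $existing;
}

try {
    return BulkOperationRun::create([
        'idempotency_key' => $key,
        'status'          => 'pending',
        // ...other run attributes...
    ]);
} catch (QueryException $e) {
    // The partial unique index on active runs caught a race: another
    // request created the run first, so find and reuse that one.
    return $findActive() ?? throw $e;
}
```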
**Rationale:**

- Aligns with the constitution (“Operations / Run Observability Standard”).
- Durable across restarts and observable in the database.

**Alternatives considered:**

- **Cache-only locks without persisted keys.**
  - Rejected: less observable and easier to break across deploys/restarts.
### 7) Phase 1 producer scope
**Decision:** Phase 1 adopts the unified monitoring semantics for:

- Drift generation (`drift.generate`)
- Backup Set “Add Policies” (`backup_set.add_policies`)

**Rationale:**

- Both are already using `BulkOperationRun` and provide immediate value in the Monitoring hub.
- Keeps Phase 1 bounded while proving the pattern across two modules.

**Alternatives considered:**

- **Include every long-running producer in one pass.**
  - Rejected (Phase 1): larger blast radius and higher coordination cost.
## Notes
- Retention/purge policy for run history should follow existing platform retention controls (defer to planning if changes are required).