# Research: Unified Operations Runs + Monitoring Hub (053)

This document resolves Phase 0 open questions and records design choices for Feature 053.

## Decisions

### 1) Canonical run record (Phase 1)

**Decision:** Reuse the existing `bulk_operation_runs` / `App\Models\BulkOperationRun` as the canonical “operation run” record for Phase 1.

**Rationale:**
- The codebase already uses `BulkOperationRun` for long-running background work (including Drift generation and Backup Set “Add Policies”).
- It already supports tenant scoping, initiator attribution, counts, and safe failure persistence.
- Avoids a high-risk cross-feature migration before we have proven consistent semantics across modules.

**Alternatives considered:**
- **Create a new generic `operation_runs` (+ optional `operation_run_items`) model** and migrate all producers to it.
  - Rejected (Phase 1): higher schema + refactor cost, higher coordination risk, and would slow down delivering the Monitoring hub.

### 2) Monitoring/Operations hub surface

**Decision:** Implement the Monitoring/Operations hub by evolving the existing Filament `BulkOperationRunResource` (navigation group/label + filters), rather than creating a new custom monitoring page in Phase 1.

**Rationale:**
- The resource already provides a tenant-scoped list and a run detail view.
- Small changes deliver high value quickly and reduce risk.

**Alternatives considered:**
- **New “Monitoring → Operations” Filament Page + bespoke table/detail.**
  - Rejected (Phase 1): duplicates existing capabilities and increases maintenance.

### 3) View-only guardrail and viewer roles

**Decision:** Monitoring/Operations is view-only in Phase 1 and is visible to tenant roles `Owner`, `Manager`, `Operator`, and `Readonly`. Start/re-run controls remain in the respective feature UIs.

**Rationale:**
- Adding run management actions implies introducing cancellation semantics, locks, permission matrices, and race handling across producers.
- View-only delivers the primary value (transparency + auditability) without expanding scope.

**Alternatives considered:**
- **Add `Rerun` / `Cancel` actions in the hub.**
  - Rejected (Phase 1): scope expansion into “run management”.
- **Restrict viewing to non-Readonly roles.**
  - Rejected: increases “what happened?” support loops; viewing is safe when sanitized.

### 4) Status semantics and mapping

**Decision:** Standardize UI-level status semantics as `queued → running → (succeeded | partially succeeded | failed)` while allowing underlying storage to keep its current status vocabulary.
- `partially succeeded` = at least one success and at least one failure.
- `failed` = zero successes (or the run could not proceed).
- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partially succeeded`, `failed/aborted→failed`.

**Rationale:**
- Keeps the operator-facing meaning consistent and testable without forcing a broad “rename statuses everywhere” refactor.

**Alternatives considered:**
- **Normalize all run status values across all run tables immediately.**
  - Rejected (Phase 1): broad blast radius across many features and tests.

### 5) Failure detail storage

**Decision:** Persist stable reason codes and short sanitized messages for failures; itemized operations also store a sanitized per-item failures list.

**Rationale:**
- Operators and support should understand failures without reading server logs.
- Per-item failures avoid rerunning large operations just to identify the affected item.

**Alternatives considered:**
- **Summary-only failure storage.**
  - Rejected: loses actionable “which item failed?” detail for itemized runs.
- **Logs-only (no persisted failure detail).**
  - Rejected: weaker observability and not aligned with “safe, actionable failures”.
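
As an illustration of the status mapping in Decision 4, the translation from stored status to UI-level label could be sketched as below. This is a minimal sketch, not the feature's actual code: the function names `uiStatusFor` and `terminalStatusFromCounts` are hypothetical, and throwing on an unknown status is an assumption rather than a recorded decision.

```php
<?php

declare(strict_types=1);

// Hypothetical helper (not taken from the codebase): translate a stored
// BulkOperationRun status into the UI-level label listed in Decision 4.
function uiStatusFor(string $storedStatus): string
{
    return match ($storedStatus) {
        'pending' => 'queued',
        'running' => 'running',
        'completed' => 'succeeded',
        'completed_with_errors' => 'partially succeeded',
        'failed', 'aborted' => 'failed',
        // Assumption: unknown values are surfaced loudly rather than guessed.
        default => throw new InvalidArgumentException("Unknown run status: {$storedStatus}"),
    };
}

// Hypothetical helper: derive the terminal UI label from item counts, using
// the definitions of "partially succeeded" and "failed" from Decision 4.
function terminalStatusFromCounts(int $succeeded, int $failed): string
{
    if ($succeeded === 0) {
        // Zero successes (or the run could not proceed) counts as failed.
        return 'failed';
    }

    return $failed > 0 ? 'partially succeeded' : 'succeeded';
}
```

Keeping the translation in one place would let the Monitoring hub and tests assert on UI-level labels while storage keeps its current vocabulary.
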
### 6) Idempotency & de-duplication

**Decision:** Use deterministic idempotency keys and active-run reuse as the primary dedupe mechanism:
- Key builder: `App\Support\RunIdempotency::buildKey(...)` with stable, sorted context.
- Active-run lookup: reuse when status is active (`pending`/`running`).
- Race reduction: rely on the existing partial unique index for active runs and handle collisions by finding and reusing the existing run.

**Rationale:**
- Aligns with the constitution (“Operations / Run Observability Standard”).
- Durable across restarts and observable in the database.

**Alternatives considered:**
- **Cache-only locks without persisted keys.**
  - Rejected: less observable and easier to break across deploys/restarts.

### 7) Phase 1 producer scope

**Decision:** Phase 1 adopts the unified monitoring semantics for:
- Drift generation (`drift.generate`)
- Backup Set “Add Policies” (`backup_set.add_policies`)

**Rationale:**
- Both are already using `BulkOperationRun` and provide immediate value in the Monitoring hub.
- Keeps Phase 1 bounded while proving the pattern across two modules.

**Alternatives considered:**
- **Include every long-running producer in one pass.**
  - Rejected (Phase 1): larger blast radius and higher coordination cost.

## Notes

- Retention/purge policy for run history should follow existing platform retention controls (defer to planning if changes are required).
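
For reference, the idempotency approach in Decision 6 (deterministic key from sorted context, then reuse of an active run) could look roughly like the sketch below. It is not the real `App\Support\RunIdempotency::buildKey(...)`, whose signature is not shown in this document. The function names, the `idempotency_key` field, the SHA-256 hash, and the choice to sort list values are all assumptions made for illustration, and the in-memory array only stands in for the database table.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in, NOT the real RunIdempotency::buildKey(): builds a
// deterministic key from tenant, operation type, and normalized context.
function sketchIdempotencyKey(int $tenantId, string $operationType, array $context): string
{
    $payload = json_encode(
        ['tenant' => $tenantId, 'type' => $operationType, 'context' => normalizeContext($context)],
        JSON_THROW_ON_ERROR
    );

    return hash('sha256', $payload);
}

// Recursively sort associative arrays by key and lists by value, so that
// argument order does not change the key (sorting lists is an assumption).
function normalizeContext(array $context): array
{
    foreach ($context as $key => $value) {
        if (is_array($value)) {
            $context[$key] = normalizeContext($value);
        }
    }

    if (array_is_list($context)) {
        sort($context);
    } else {
        ksort($context);
    }

    return $context;
}

// In-memory stand-in for the active-run lookup: reuse a run whose status is
// pending or running, otherwise register a new pending run.
function reuseOrCreateRun(array &$runs, string $key): array
{
    foreach ($runs as $run) {
        if ($run['idempotency_key'] === $key && in_array($run['status'], ['pending', 'running'], true)) {
            return ['run' => $run, 'reused' => true];
        }
    }

    $run = ['id' => count($runs) + 1, 'idempotency_key' => $key, 'status' => 'pending'];
    $runs[] = $run;

    return ['run' => $run, 'reused' => false];
}
```

The sketch does not model the race case. Per Decision 6, that is left to the partial unique index on active runs, with a collision handled by looking up and reusing the existing run.
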