**Summary**: Implements Feature 054: the canonical OperationRun flow, Monitoring UI, dispatch safety, notifications, and dedupe, plus small UX safety clarifications (RBAC group search is delegated; Restore group mapping is DB-only).

**What Changed**:

- **Core service**: OperationRun lifecycle, dedupe, and dispatch helpers (`OperationRunService.php`).
- **Model + migration**: OperationRun model and migration (`OperationRun.php`, `2026_01_16_180642_create_operation_runs_table.php`).
- **Notifications**: queued and terminal DB notifications, initiator-only (`OperationRunQueued.php`, `OperationRunCompleted.php`).
- **Monitoring UI**: Filament list/detail plus Livewire pieces, DB-only render (`OperationRunResource.php` and related pages/views).
- **Start surfaces / Jobs**: instrumented start surfaces, job middleware, and job updates to use canonical runs; multiple `app/Jobs/*` and `app/Filament/*` updates (see tests for full coverage).
- **RBAC + Restore UX clarifications**: RBAC group search is delegated-Graph-based and disabled without a delegated token; Restore group mapping remains DB-only (directory cache) and its helper text is always visible (`TenantResource.php`, `RestoreRunResource.php`).
- **Specs / Constitution**: updated spec and quickstart, and added a one-line constitution guideline about Graph usage (`spec.md`, `quickstart.md`, `constitution.md`).

**Tests & Verification**:

- Unit / Feature tests added/updated for run lifecycle, notifications, idempotency, and UI guards: see `tests/Feature/*` (notably `OperationRunServiceTest`, `MonitoringOperationsTest`, `OperationRunNotificationTest`, and various Filament feature tests).
- Full test run locally: `./vendor/bin/sail artisan test` → 587 passed, 5 skipped.

**Migrations**: Adds the `create_operation_runs_table` migration; run `php artisan migrate` in staging after review.

**Notes / Rationale**:

- Monitoring pages are explicitly DB-only at render time (no Graph calls).
- Start surfaces enqueue work only and return a “View run” link.
- Delegated Graph access is used only for explicit user actions (RBAC group search); restore mapping intentionally uses cached DB data only, to avoid render-time Graph calls.
- The dispatch wrapper marks a run failed immediately if background dispatch throws synchronously, to avoid misleading “queued” states.

**Upgrade / Deploy Considerations**:

- Run migrations: `./vendor/bin/sail artisan migrate`.
- Background workers should be running to process queued jobs (monitoring queue health during rollout is recommended).
- No secret or token persistence changes.

**PR checklist**:

- Tests updated/added for changed behavior
- Specs updated: 054-unify-runs-suitewide docs + quickstart
- Constitution note added (.specify)
- Pint formatting applied

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #63
# Research: Unified Operations Runs Suitewide

## 1. Technical Context & Unknowns

**Unknowns Resolved**:

- **Transition Strategy**: Parallel write. We will maintain existing legacy tables (e.g., `inventory_sync_runs`, `restore_runs`) for now but strictly use `operation_runs` for the Monitoring UI.
- **Restore Adapter**: `RestoreRun` remains the domain source of truth. An `OperationRun` adapter row will be created once a restore run reaches `previewed` (and later statuses), and will be kept in sync via `RestoreRun` lifecycle events or service-layer wrapping.
- **Run Logic Location**: Existing jobs like `RunInventorySyncJob` will be updated to manage the `OperationRun` state.
- **Concurrency**: Enforced by a partial unique index on `(tenant_id, run_identity_hash)` that applies only while the run is active (`status` is `queued` or `running`).
- **Dispatch Failure Semantics**: If queue dispatch fails, the system will immediately complete the run with outcome `failed` and a failure code such as `queue.dispatch_failed`, and show a clear UI message, so that no run is left misleadingly `queued`.
- **Notifications on Dedupe**: Only the original initiator (`operation_runs.user_id`) receives queued/terminal notifications; users who reuse an active run do not get additional notifications.
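The notification rule in the last bullet can be illustrated with a minimal sketch. The `OperationRun` dataclass and the `queued_recipients` / `terminal_recipients` helpers below are hypothetical names for illustration only, and the handling of system-initiated runs (no `user_id`, so nobody is notified) is an assumption rather than something this document specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OperationRun:
    """Illustrative stand-in for a row in operation_runs."""
    id: int
    user_id: Optional[int]  # initiator; None for system/scheduler runs


def queued_recipients(run: OperationRun, reused: bool) -> List[int]:
    """User ids to notify when a start request returns this run.

    A request that reuses an already-active run sends nothing.
    """
    if reused or run.user_id is None:
        return []
    return [run.user_id]


def terminal_recipients(run: OperationRun) -> List[int]:
    """User ids to notify when the run reaches a terminal outcome.

    Only the original initiator is notified, never the reusers.
    """
    return [] if run.user_id is None else [run.user_id]
```

Keeping the decision in one place like this makes the "initiator-only" rule easy to test independently of the queue and notification channels.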
## 2. Technology Choices

| Area | Decision | Rationale | Alternatives |
|------|----------|-----------|--------------|
| **Schema** | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (complex, harder to paginate/sort efficiently). |
| **Restore Integration** | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (overhead for a single exception). |
| **Idempotency** | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis), which can expire or fail. | Redis lock (soft guarantee), application check (race-prone). |
| **Initiator** | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (overkill for simple auditing). |
## 3. Implementation Patterns

### Canonical Run Lifecycle

1. **Start Request**:
   - Compute `run_identity_hash` from inputs.
   - Attempt `INSERT` into `operation_runs` (idempotent; enforced by partial unique index for active runs).
   - If an active run exists, return it (Idempotency).
   - If new, dispatch the background Job.
   - If dispatch fails, immediately mark the run `status=completed`, `outcome=failed` with a safe failure code such as `queue.dispatch_failed`.
2. **Job Execution**:
   - Update status to `running`.
   - Perform work.
   - Update status to `completed` and set terminal outcome (`succeeded` / `partially_succeeded` / `failed` / `cancelled`).
3. **Restore Adapter**:
   - Create the adapter row only once `RestoreRunStatus=previewed` (or later) is reached.
   - Map `RestoreRunStatus=previewed` to `OperationRun.status=queued` and `OperationRun.outcome=pending`.
   - Keep the adapter updated as the restore progresses:
     - `queued` → `status=queued`, `outcome=pending`
     - `running` → `status=running`, `outcome=pending`
     - `completed` → `status=completed`, `outcome=succeeded`
     - `partial` → `status=completed`, `outcome=partially_succeeded`
     - `failed` → `status=completed`, `outcome=failed`
     - `cancelled` → `status=completed`, `outcome=cancelled`
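The restore adapter mapping in step 3 can be sketched as a lookup table plus a small sync helper. This is a minimal illustration, not the service implementation: the in-memory `adapters` dict stands in for `operation_runs`, the function names are hypothetical, and `"draft"` in the usage note is an assumed example of a pre-preview status.

```python
from typing import Dict, Optional, Tuple

# RestoreRun status -> (OperationRun.status, OperationRun.outcome),
# taken from the mapping listed above.
RESTORE_TO_OPERATION: Dict[str, Tuple[str, str]] = {
    "previewed": ("queued", "pending"),
    "queued": ("queued", "pending"),
    "running": ("running", "pending"),
    "completed": ("completed", "succeeded"),
    "partial": ("completed", "partially_succeeded"),
    "failed": ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}


def map_restore_status(restore_status: str) -> Optional[Tuple[str, str]]:
    """Return (status, outcome), or None if no adapter row should exist yet."""
    return RESTORE_TO_OPERATION.get(restore_status)


def sync_adapter(adapters: Dict[int, dict], restore_run_id: int,
                 restore_status: str) -> Optional[dict]:
    """Create or update the adapter row for a restore run.

    `adapters` is a hypothetical in-memory store keyed by restore_run_id.
    """
    mapped = map_restore_status(restore_status)
    if mapped is None:
        # Before `previewed`: do not create an adapter row.
        return adapters.get(restore_run_id)
    status, outcome = mapped
    row = adapters.setdefault(restore_run_id, {
        "type": "restore.execute",
        "context": {"restore_run_id": restore_run_id},
    })
    row["status"], row["outcome"] = status, outcome
    return row
```

For example, `sync_adapter(store, 123, "draft")` leaves the store empty, while a later `sync_adapter(store, 123, "previewed")` creates the row as `queued` / `pending`.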
### Data Model

```sql
CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE UNIQUE INDEX operation_runs_active_unique
    ON operation_runs (tenant_id, run_identity_hash)
    WHERE status IN ('queued', 'running');
```
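To show how the partial unique index supports the start-request flow, here is a minimal runnable sketch. It makes several assumptions: SQLite (which also supports partial indexes) stands in for PostgreSQL, the table is trimmed to the columns the flow needs, `start_run` and the `dispatch` callable are hypothetical names, and hashing canonical JSON is only one possible reading of `SHA256(tenant_id + inputs)`.

```python
import hashlib
import json
import sqlite3


def run_identity_hash(tenant_id: int, inputs: dict) -> str:
    # Assumption: canonical JSON (sorted keys) so equal inputs hash equally.
    payload = json.dumps({"tenant_id": tenant_id, "inputs": inputs},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    # Trimmed-down version of operation_runs with the same partial index.
    conn.executescript("""
        CREATE TABLE operation_runs (
            id INTEGER PRIMARY KEY,
            tenant_id INTEGER NOT NULL,
            type TEXT NOT NULL,
            status TEXT NOT NULL,
            outcome TEXT NOT NULL,
            run_identity_hash TEXT NOT NULL,
            failure_summary TEXT NOT NULL DEFAULT '[]'
        );
        CREATE UNIQUE INDEX operation_runs_active_unique
            ON operation_runs (tenant_id, run_identity_hash)
            WHERE status IN ('queued', 'running');
    """)


def start_run(conn, tenant_id, run_type, inputs, dispatch):
    """Return (run_id, created); reuse the active run when one exists."""
    identity = run_identity_hash(tenant_id, inputs)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO operation_runs "
                "(tenant_id, type, status, outcome, run_identity_hash) "
                "VALUES (?, ?, 'queued', 'pending', ?)",
                (tenant_id, run_type, identity),
            )
        run_id = cur.lastrowid
    except sqlite3.IntegrityError:
        # The partial unique index rejected the insert: an active run exists.
        # (A concurrent system would also handle the run finishing in between.)
        row = conn.execute(
            "SELECT id FROM operation_runs WHERE tenant_id = ? "
            "AND run_identity_hash = ? AND status IN ('queued', 'running')",
            (tenant_id, identity),
        ).fetchone()
        return row[0], False
    try:
        dispatch(run_id)
    except Exception:
        # Dispatch failed synchronously: never leave a misleading queued run.
        with conn:
            conn.execute(
                "UPDATE operation_runs SET status = 'completed', "
                "outcome = 'failed', failure_summary = ? WHERE id = ?",
                (json.dumps([{"code": "queue.dispatch_failed"}]), run_id),
            )
    return run_id, True
```

Because the index only covers `queued` and `running` rows, a run that has completed (including one failed at dispatch) no longer blocks a new run with the same identity hash.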
## 4. Risks & Mitigations

- **Risk**: Desync between `RestoreRun` and `OperationRun`.
  - **Mitigation**: Use model observers or service-layer wrapping to keep the two updates as close to atomic as possible, or accept slight eventual consistency (Monitoring may lag a few milliseconds behind the Restore UI).
- **Risk**: Legacy runs not appearing.
  - **Mitigation**: We are NOT backfilling legacy runs. Only new runs created after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
- **Risk**: Confusion about `queued` for restore `previewed`.
  - **Mitigation**: Document that `restore.execute` appears from `previewed` onward and uses `queued/pending` until execution begins; Monitoring remains view-only and links to the restore domain detail.