TenantAtlas/specs/054-unify-runs-suitewide/research.md
ahmido 3030dd9af2 054-unify-runs-suitewide (#63)
Summary

In short: implements Feature 054, covering the canonical OperationRun flow, Monitoring UI, dispatch safety, notifications, and dedupe, plus small UX safety clarifications (RBAC group search is delegated; Restore group mapping is DB-only).
What Changed

Core service: OperationRun lifecycle, dedupe and dispatch helpers — OperationRunService.php.
Model + migration: OperationRun model and migration — OperationRun.php, 2026_01_16_180642_create_operation_runs_table.php.
Notifications: queued + terminal DB notifications (initiator-only) — OperationRunQueued.php, OperationRunCompleted.php.
Monitoring UI: Filament list/detail + Livewire pieces (DB-only render) — OperationRunResource.php and related pages/views.
Start surfaces / Jobs: instrumented start surfaces, job middleware, and job updates to use canonical runs — multiple app/Jobs/* and app/Filament/* updates (see tests for full coverage).
RBAC + Restore UX clarifications: RBAC group search is delegated-Graph-based and disabled without delegated token; Restore group mapping remains DB-only (directory cache) and helper text always visible — TenantResource.php, RestoreRunResource.php.
Specs / Constitution: updated spec and quickstart, and added a one-line constitution guideline about Graph usage (spec.md, quickstart.md, constitution.md).
Tests & Verification

Unit / Feature tests added/updated for run lifecycle, notifications, idempotency, and UI guards: see tests/Feature/* (notably OperationRunServiceTest, MonitoringOperationsTest, OperationRunNotificationTest, and various Filament feature tests).
Full test run locally: ./vendor/bin/sail artisan test → 587 passed, 5 skipped.
Migrations

Adds create_operation_runs_table migration; run php artisan migrate in staging after review.
Notes / Rationale

Monitoring pages are explicitly DB-only at render time (no Graph calls). Start surfaces enqueue work only and return a “View run” link.
Delegated Graph access is used only for explicit user actions (RBAC group search); restore mapping intentionally uses cached DB data only to avoid render-time Graph calls.
Dispatch wrapper marks runs failed immediately if background dispatch throws synchronously to avoid misleading “queued” states.
Upgrade / Deploy Considerations

Run migrations: ./vendor/bin/sail artisan migrate.
Background workers should be running to process queued jobs (recommended to monitor queue health during rollout).
No secret or token persistence changes.
PR checklist

 Tests updated/added for changed behavior
 Specs updated: 054-unify-runs-suitewide docs + quickstart
 Constitution note added (.specify)
 Pint formatting applied

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #63
2026-01-17 22:25:00 +00:00

Research: Unified Operations Runs Suitewide

1. Technical Context & Unknowns

Unknowns Resolved:

  • Transition Strategy: Parallel write. Existing legacy tables (e.g., inventory_sync_runs, restore_runs) are kept and still written for now, but the Monitoring UI reads exclusively from operation_runs.
  • Restore Adapter: RestoreRun remains the domain source of truth. An OperationRun adapter row will be created once a restore run reaches previewed (and later statuses), and will be kept in sync via RestoreRun lifecycle events or service-layer wrapping.
  • Run Logic Location: Existing jobs like RunInventorySyncJob will be updated to manage the OperationRun state.
  • Concurrency: Enforced by partial unique index on (tenant_id, run_identity_hash) where status is active (queued, running).
  • Dispatch Failure Semantics: If queue dispatch fails, the system immediately completes the run as failed with a safe failure code (e.g., queue.dispatch_failed) and shows a clear UI message, so a misleading queued run is never left behind.
  • Notifications on Dedupe: Only the original initiator (operation_runs.user_id) receives queued/terminal notifications; reusers of an active run do not get additional notifications.
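
The dispatch failure semantics above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the project's Laravel code: the `Run` class, `dispatch_or_fail`, and `broken_dispatch` are hypothetical names, and recording only the exception class name as the message is an assumption about what counts as a "safe" failure summary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Run:
    # A freshly created run starts out queued with a pending outcome.
    status: str = "queued"
    outcome: str = "pending"
    failure_summary: list = field(default_factory=list)


def dispatch_or_fail(run: Run, dispatch: Callable[[Run], None]) -> Run:
    """Dispatch the background job; if dispatch raises synchronously, fail the run."""
    try:
        dispatch(run)
    except Exception as exc:
        # Complete the run immediately so it never lingers as "queued".
        run.status = "completed"
        run.outcome = "failed"
        # Assumption: store only the exception class name, not its raw message.
        run.failure_summary.append(
            {"code": "queue.dispatch_failed", "message": type(exc).__name__}
        )
    return run


def broken_dispatch(run: Run) -> None:
    # Hypothetical dispatcher that simulates an unreachable queue backend.
    raise ConnectionError("queue backend unreachable")


failed = dispatch_or_fail(Run(), broken_dispatch)
ok = dispatch_or_fail(Run(), lambda run: None)
```

Here `failed` ends up completed/failed with the `queue.dispatch_failed` code, while `ok` stays queued and waits for a worker to pick it up.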

2. Technology Choices

| Area | Decision | Rationale | Alternatives |
|---|---|---|---|
| Schema | operation_runs table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (complex, harder to paginate/sort efficiently). |
| Restore Integration | Physical adapter row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The context JSON column will store { "restore_run_id": ... }. | Polymorphic relation (overhead for a single exception). |
| Idempotency | DB partial unique index | Hard guarantee against race conditions. Simpler than distributed locks (Redis), which can expire or fail. | Redis lock (soft guarantee), application check (race prone). |
| Initiator | Nullable FK + name | Handles both Users (FK) and System/Scheduler (name "System") uniformly. | Polymorphic relation (overkill for simple auditing). |

3. Implementation Patterns

Canonical Run Lifecycle

  1. Start Request:
    • Compute run_identity_hash from inputs.
    • Attempt INSERT into operation_runs (idempotent; enforced by partial unique index for active runs).
    • If an active run exists, return it (Idempotency).
    • If new, dispatch the background Job.
    • If dispatch fails, immediately mark the run status=completed, outcome=failed with a safe failure code such as queue.dispatch_failed.
  2. Job Execution:
    • Update status to running.
    • Perform work.
    • Update status to completed and set terminal outcome (succeeded / partially_succeeded / failed / cancelled).
  3. Restore Adapter:
    • Create the adapter row only once RestoreRunStatus=previewed (or later) is reached.
    • Map RestoreRunStatus=previewed to OperationRun.status=queued and OperationRun.outcome=pending.
    • Keep the adapter updated as the restore progresses:
      • queued → status=queued, outcome=pending
      • running → status=running, outcome=pending
      • completed → status=completed, outcome=succeeded
      • partial → status=completed, outcome=partially_succeeded
      • failed → status=completed, outcome=failed
      • cancelled → status=completed, outcome=cancelled
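
The start-request step of the lifecycle (insert, or return the existing active run) can be tried out against SQLite, which also supports partial unique indexes. This is a rough sketch under stated assumptions: it uses a reduced table in an in-memory SQLite database rather than the PostgreSQL schema, and `open_db` and `start_run` are hypothetical helper names, not part of OperationRunService.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with a partial unique index over active runs."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE operation_runs (
               id INTEGER PRIMARY KEY,
               tenant_id INTEGER NOT NULL,
               run_identity_hash TEXT NOT NULL,
               status TEXT NOT NULL,
               outcome TEXT NOT NULL
           )"""
    )
    db.execute(
        """CREATE UNIQUE INDEX operation_runs_active_unique
           ON operation_runs (tenant_id, run_identity_hash)
           WHERE status IN ('queued', 'running')"""
    )
    return db


def start_run(db: sqlite3.Connection, tenant_id: int, identity_hash: str):
    """Return (run_id, created); created is False when an active run is reused."""
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO operation_runs"
                " (tenant_id, run_identity_hash, status, outcome)"
                " VALUES (?, ?, 'queued', 'pending')",
                (tenant_id, identity_hash),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The partial unique index rejected the insert: an active run exists.
        # A real implementation would also handle that run finishing between
        # the failed insert and this lookup (for example by retrying).
        row = db.execute(
            "SELECT id FROM operation_runs"
            " WHERE tenant_id = ? AND run_identity_hash = ?"
            " AND status IN ('queued', 'running')",
            (tenant_id, identity_hash),
        ).fetchone()
        return row[0], False


db = open_db()
first = start_run(db, 1, "abc")   # new run
second = start_run(db, 1, "abc")  # reuses the active run
db.execute(
    "UPDATE operation_runs SET status = 'completed', outcome = 'succeeded'"
    " WHERE id = ?",
    (first[0],),
)
third = start_run(db, 1, "abc")   # previous run is terminal, so a new one starts
```

Because the uniqueness check lives in the index, two concurrent start requests cannot both insert an active row; the loser simply falls through to the lookup. Only the caller that receives `created == True` would go on to dispatch the job.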

Data Model

CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE UNIQUE INDEX operation_runs_active_unique 
ON operation_runs (tenant_id, run_identity_hash) 
WHERE status IN ('queued', 'running');
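
One way to compute a value that fits the run_identity_hash column (64 hex characters) is sketched below. It is illustrative only: the function name is hypothetical, and both the JSON canonicalization (sorted keys, compact separators) and the inclusion of the run type in the hashed payload are assumptions; the schema comment only states SHA256(tenant_id + inputs).

```python
import hashlib
import json


def run_identity_hash(tenant_id: int, run_type: str, inputs: dict) -> str:
    """Return a 64-character hex SHA-256 over tenant, type, and canonical inputs."""
    # Sorting keys makes the hash independent of the order in which the
    # caller happened to build the inputs dictionary.
    canonical = json.dumps(
        {"tenant_id": tenant_id, "type": run_type, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = run_identity_hash(1, "inventory.sync", {"scope": "all", "dry_run": False})
b = run_identity_hash(1, "inventory.sync", {"dry_run": False, "scope": "all"})
c = run_identity_hash(2, "inventory.sync", {"scope": "all", "dry_run": False})
```

`a` and `b` are equal because only key order differs, while `c` differs because the tenant differs. Whatever canonicalization is chosen, it has to be stable across releases, otherwise identical requests stop deduplicating.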

4. Risks & Mitigations

  • Risk: Desync between RestoreRun and OperationRun.
    • Mitigation: Use model observers or service-layer wrapping to keep both rows updated as close to atomically as possible, or accept slight eventual consistency (Monitoring might lag a few milliseconds behind the Restore UI).
  • Risk: Legacy runs not appearing.
    • Mitigation: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
  • Risk: Confusion when a restore run in the previewed state is shown as queued.
    • Mitigation: Document that restore.execute appears from previewed onward and uses queued/pending until execution begins; Monitoring remains view-only and links to the restore domain detail.
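
To make the restore adapter mapping from section 3 easier to document, it can be written down as a plain lookup table. This is a sketch, not the adapter's actual code: the names `RESTORE_TO_OPERATION` and `map_restore_status` are hypothetical, and "draft" in the usage lines stands in for any restore status earlier than previewed, which this document does not name.

```python
from typing import Optional, Tuple

# RestoreRun status -> (OperationRun.status, OperationRun.outcome)
RESTORE_TO_OPERATION = {
    "previewed": ("queued", "pending"),
    "queued": ("queued", "pending"),
    "running": ("running", "pending"),
    "completed": ("completed", "succeeded"),
    "partial": ("completed", "partially_succeeded"),
    "failed": ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}


def map_restore_status(restore_status: str) -> Optional[Tuple[str, str]]:
    """Return (status, outcome) for the adapter row, or None if no row should exist yet."""
    # Statuses before "previewed" are absent from the table, so no adapter
    # row is created or updated for them.
    return RESTORE_TO_OPERATION.get(restore_status)


previewed = map_restore_status("previewed")
partial = map_restore_status("partial")
too_early = map_restore_status("draft")  # hypothetical pre-preview status
```

Both previewed and queued map to queued/pending, which is exactly the case the last risk describes: Monitoring shows the run as queued until execution begins.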