TenantAtlas/specs/054-unify-runs-suitewide/spec.md
ahmido 3030dd9af2 054-unify-runs-suitewide (#63)
Summary

In short: implements Feature 054: canonical OperationRun flow, Monitoring UI, dispatch safety, notifications, dedupe, plus small UX safety clarifications (RBAC group search delegated; Restore group mapping DB-only).
What Changed

Core service: OperationRun lifecycle, dedupe and dispatch helpers — OperationRunService.php.
Model + migration: OperationRun model and migration — OperationRun.php, 2026_01_16_180642_create_operation_runs_table.php.
Notifications: queued + terminal DB notifications (initiator-only) — OperationRunQueued.php, OperationRunCompleted.php.
Monitoring UI: Filament list/detail + Livewire pieces (DB-only render) — OperationRunResource.php and related pages/views.
Start surfaces / Jobs: instrumented start surfaces, job middleware, and job updates to use canonical runs — multiple app/Jobs/* and app/Filament/* updates (see tests for full coverage).
RBAC + Restore UX clarifications: RBAC group search is delegated-Graph-based and disabled without delegated token; Restore group mapping remains DB-only (directory cache) and helper text always visible — TenantResource.php, RestoreRunResource.php.
Specs / Constitution: updated spec & quickstart and added one-line constitution guideline about Graph usage:
spec.md
quickstart.md
constitution.md
Tests & Verification

Unit / Feature tests added/updated for run lifecycle, notifications, idempotency, and UI guards: see tests/Feature/* (notably OperationRunServiceTest, MonitoringOperationsTest, OperationRunNotificationTest, and various Filament feature tests).
Full test run locally: ./vendor/bin/sail artisan test → 587 passed, 5 skipped.
Migrations

Adds create_operation_runs_table migration; run php artisan migrate in staging after review.
Notes / Rationale

Monitoring pages are explicitly DB-only at render time (no Graph calls). Start surfaces enqueue work only and return a “View run” link.
Delegated Graph access is used only for explicit user actions (RBAC group search); restore mapping intentionally uses cached DB data only to avoid render-time Graph calls.
Dispatch wrapper marks runs failed immediately if background dispatch throws synchronously to avoid misleading “queued” states.
Upgrade / Deploy Considerations

Run migrations: ./vendor/bin/sail artisan migrate.
Background workers should be running to process queued jobs (recommended to monitor queue health during rollout).
No secret or token persistence changes.
PR checklist

 Tests updated/added for changed behavior
 Specs updated: 054-unify-runs-suitewide docs + quickstart
 Constitution note added (.specify)
 Pint formatting applied

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #63
2026-01-17 22:25:00 +00:00


# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)
**Feature Branch**: `feat/054-unify-operations-runs-suitewide`
**Created**: 2026-01-16
**Status**: Draft
**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."
## Clarifications
### Session 2026-01-16
- Q: What default retention should 054 set for canonical Operation Runs? → A: 90 days
- Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or do we replace them immediately? → A: Parallel write (canonical + legacy)
- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record).
- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where status is `queued` or `running`).
- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string).
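
The two database-level answers above (partial unique index, nullable `user_id` plus required `initiator_name`) can be sketched as follows. This is a minimal illustration, not the project's migration: it uses in-memory SQLite only to stay runnable, and the index name and the reduced column set are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE operation_runs (
        id INTEGER PRIMARY KEY,
        tenant_id INTEGER NOT NULL,
        user_id INTEGER NULL,          -- nullable: system-initiated runs have no user
        initiator_name TEXT NOT NULL,  -- name snapshot, always required
        run_identity_hash TEXT NOT NULL,
        status TEXT NOT NULL
    )
    """
)
# Partial unique index: uniqueness applies to active runs only.
# The index name is hypothetical.
conn.execute(
    """
    CREATE UNIQUE INDEX operation_runs_active_identity
    ON operation_runs (tenant_id, run_identity_hash)
    WHERE status IN ('queued', 'running')
    """
)


def insert_run(tenant_id, user_id, initiator_name, identity_hash, status="queued"):
    """Insert a run row and return its id."""
    cur = conn.execute(
        "INSERT INTO operation_runs"
        " (tenant_id, user_id, initiator_name, run_identity_hash, status)"
        " VALUES (?, ?, ?, ?, ?)",
        (tenant_id, user_id, initiator_name, identity_hash, status),
    )
    return cur.lastrowid


first = insert_run(1, 42, "Alice", "abc")

# A second active run with the same identity is rejected by the index.
try:
    insert_run(1, 43, "Bob", "abc")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Once the first run is completed, the same identity may be started again.
conn.execute("UPDATE operation_runs SET status = 'completed' WHERE id = ?", (first,))
second = insert_run(1, None, "System", "abc")
```

Because the constraint covers only `queued` and `running` rows, completed history can hold any number of runs with the same identity.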
### Session 2026-01-17
- Q: Should `backup_schedule.run_now` and `backup_schedule.retry` be part of the Phase 1 adoption set (must be implemented) in 054? → A: Yes; both are Phase 1 in 054 (OperationRun producers + worker tracking).
- Q: If queue dispatch fails (background processing unavailable), should we still create an `OperationRun` and immediately complete it as failed? → A: Yes; create an `OperationRun` and immediately complete it as `failed` (e.g., failure code `queue.dispatch_failed`); show a clear error and MAY include a “View run” link.
- Q: When a start is deduped (the run is reused), who should receive the in-app notifications (“queued” + terminal outcome)? → A: Only the original initiator (`operation_runs.user_id`); no additional notifications are sent to the second starter on reuse.
- Q: For `restore.execute`: in which `RestoreRunStatus` phases should an `OperationRun` adapter row be created/shown at all? → A: From `previewed` onwards (previewed + execution statuses); no adapter row for `draft`/`scoped`/`checked`.
- Q: If the `restore.execute` adapter is already visible from `RestoreRunStatus=previewed`: which `OperationRun` state should we set for this phase? → A: `status=queued`, `outcome=pending` (until `running`, then `completed` + terminal outcome).
- Q: RBAC Wizard (`TenantResource`): how does group search work? → A: Group search is delegated-Graph-based and the picker MUST be disabled without delegated auth.
- Q: Restore Wizard (`RestoreRunResource`) group mapping phase: Graph or DB-only? → A: DB-only via Directory Cache (`entra_groups`), no Graph calls during mapping; helper text is always shown (fallback included).
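
The notification answer above (only the original initiator is notified, even when a second user reuses the run) could look roughly like this. The `OperationRun` class and the function name are illustrative stand-ins, not the project's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperationRun:
    id: int
    user_id: Optional[int]  # original initiator; None for system-initiated runs


def notification_recipients(run: OperationRun, requesting_user_id: int) -> list[int]:
    """Return the user ids that receive 'queued' and terminal notifications."""
    # requesting_user_id is deliberately ignored: a second starter who reuses
    # an active run is not added as a recipient.
    if run.user_id is None:
        # Assumption taken from the spec defaults: system-initiated runs
        # do not notify anyone in Phase 1.
        return []
    return [run.user_id]
```

For a run started by user 7 and reused by user 9, only user 7 is returned.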
## User Scenarios & Testing *(mandatory)*
### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)
As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.
**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility.
**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.
**Acceptance Scenarios**:
1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, run state (queued/running/terminal outcome), time range, and initiator.
2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown.
3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, run state (queued/running/terminal outcome), timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
4. **Given** a restore run has reached `previewed` or later, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record).
5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls.
6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed.
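
The "run state (queued/running/terminal outcome)" used for filtering and display in the scenarios above is a single bucket derived from lifecycle status and stored outcome (see FR-011). A minimal sketch, assuming the label strings shown here; only “Partially succeeded” is named in this spec, the other labels are placeholders.

```python
ACTIVE_STATUSES = {"queued", "running"}

# Stored outcome token -> UI label. Only "Partially succeeded" comes from
# the spec; the other labels are assumptions.
OUTCOME_LABELS = {
    "succeeded": "Succeeded",
    "partially_succeeded": "Partially succeeded",
    "failed": "Failed",
}


def display_state(status: str, outcome: str) -> str:
    """Collapse (status, outcome) into the single state shown in Monitoring."""
    if status in ACTIVE_STATUSES:
        # While active, the lifecycle status is shown; outcome is still 'pending'.
        return status.capitalize()
    if status == "completed" and outcome in OUTCOME_LABELS:
        return OUTCOME_LABELS[outcome]
    raise ValueError(f"unexpected status/outcome: {status}/{outcome}")
```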
---
### User Story 2 - Start Operations Without Blocking (Priority: P2)
As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.
**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed.
**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.
**Acceptance Scenarios**:
1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run.
3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link.
4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued.
   - If an `OperationRun` record was created during the attempt, it MUST be completed immediately with outcome `failed` (never left `queued`) and MAY be linked via “View run”.
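
Scenario 4 can be pictured as a small dispatch wrapper: if handing the work to the queue raises synchronously, the run is closed as `failed` right away instead of staying `queued`. The `Run` class, the function name, and the message text below are hypothetical; only the `queue.dispatch_failed` code comes from this spec.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Run:
    status: str = "queued"
    outcome: str = "pending"
    failures: list = field(default_factory=list)


def dispatch_or_fail(run: Run, dispatch: Callable[[Run], None]) -> bool:
    """Dispatch background work; on a synchronous error, complete the run as failed."""
    try:
        dispatch(run)
    except Exception:
        # Never leave the run 'queued' when nothing was actually queued.
        run.status = "completed"
        run.outcome = "failed"
        # Store a stable code and a fixed, sanitized message; the raw
        # exception text is intentionally not persisted.
        run.failures.append(
            {
                "code": "queue.dispatch_failed",
                "message": "Background processing is unavailable.",
            }
        )
        return False
    return True
```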
---
### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)
As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.
**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.
**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.
**Acceptance Scenarios**:
1. **Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one.
2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it.
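
"Identical effective inputs" implies a deterministic identity that ignores who started the run. One possible way to compute it is to hash a canonical JSON encoding of tenant, run type, and inputs. The function name, the use of SHA-256, and the input keys in the usage note are assumptions for illustration.

```python
import hashlib
import json


def run_identity_hash(tenant_id: int, run_type: str, inputs: dict) -> str:
    """Hash tenant + run type + effective inputs; the initiator is not included."""
    canonical = json.dumps(
        {"tenant_id": tenant_id, "run_type": run_type, "inputs": inputs},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # no whitespace differences
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Dictionary key order is normalized by `sort_keys`, but list order is not, so a caller would sort unordered selections first, e.g. `run_identity_hash(1, "backup_set.add_policies", {"backup_set_id": 5, "policy_ids": sorted([3, 1, 2])})`.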
### Edge Cases
- Background execution unavailable: start fails fast with a clear message; if an `OperationRun` record was created, it MUST be immediately completed as `failed` (e.g., `queue.dispatch_failed`) and MUST NOT be left `queued`.
- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.
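
The partial-processing rule can be expressed as a small function over per-item counts. This is a sketch of a literal reading of the rule (and of FR-012); the function and parameter names are hypothetical.

```python
def terminal_outcome(successes: int, failures: int, cannot_proceed: bool = False) -> str:
    """Derive the stored outcome token from per-item counts."""
    if cannot_proceed or successes == 0:
        # Literal reading of "zero successes or cannot proceed". Note that an
        # empty run (0 successes, 0 failures) also lands here; the spec does
        # not address that case separately.
        return "failed"
    if failures == 0:
        return "succeeded"
    # At least one success and at least one failure.
    return "partially_succeeded"
```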
## Requirements *(mandatory)*
**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior,
the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.
### Scope & Assumptions
**Phase 1 adoption set (must be implemented):**
- `inventory.sync` (Inventory “Sync now”)
- `policy.sync` (Policies “Sync now”)
- `directory_groups.sync` (Directory → Groups “Sync groups”)
- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible)
- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”)
- `backup_schedule.run_now` (Backup Schedules “Run now”)
- `backup_schedule.retry` (Backup Schedules “Retry”)
**Restore visibility (adapter only):**
- `restore.execute` appears as a canonical run entry that links to an existing restore domain record.
- The adapter row MUST be created/visible only once a restore run reaches `previewed` (or later) and MUST NOT be created for `draft`, `scoped`, or `checked`.
- When the restore run is `previewed`, the adapter `OperationRun` MUST use `status=queued` and `outcome=pending`.
- Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
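
The adapter rules above can be sketched as two helpers. The restore status names and the `restore_run_id` context key come from this spec; the helper names and the dictionary shape are assumptions, and the mapping for later execution statuses is left out because it is not detailed here.

```python
# Restore statuses for which no adapter row may exist.
PRE_PREVIEW_STATUSES = {"draft", "scoped", "checked"}


def needs_adapter_row(restore_status: str) -> bool:
    """True once the restore run has reached 'previewed' or any later status."""
    return restore_status not in PRE_PREVIEW_STATUSES


def adapter_row_at_preview(tenant_id: int, restore_run_id: int) -> dict:
    """State of the adapter OperationRun while the restore run is 'previewed'."""
    return {
        "tenant_id": tenant_id,
        "run_type": "restore.execute",
        "status": "queued",
        "outcome": "pending",
        # Safe reference back to the restore domain record.
        "context": {"restore_run_id": restore_run_id},
    }
```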
**Out of scope for 054 (explicit):**
- Cross-tenant compare/promotion
- UI redesign/styling polish (separate UI polish work)
- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
- Replacing restore domain records with canonical runs
- A full settings UI for retention/notifications/etc.
- Implementing or validating `AuditLog` behavior for audit-only actions (FR-019) beyond actions explicitly changed by 054
**Assumptions (defaults to remove ambiguity in Phase 1):**
- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
- System-initiated runs (if any) do not notify users by default in Phase 1.
- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.
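
As a small worked example of the retention default: with 90 days of retention, a run becomes eligible for pruning once it is older than the cutoff. Which timestamp retention is measured against is not specified here; `created_at` is assumed below, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # default from this spec; not user-configurable in 054


def retention_cutoff(now: datetime) -> datetime:
    """Oldest timestamp that is still retained."""
    return now - timedelta(days=RETENTION_DAYS)


def is_prunable(created_at: datetime, now: datetime) -> bool:
    # Assumption: retention is measured from the run's creation time.
    return created_at < retention_cutoff(now)


now = datetime(2026, 4, 30, tzinfo=timezone.utc)
cutoff = retention_cutoff(now)  # 2026-01-30 00:00 UTC
```
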
**Run vs Audit-only Adoption Matrix (Phase 1):**
| Feature Area | Action | Tracking | run_type / audit action |
|-------------|--------|----------|--------------------------|
| Policies | Sync now | OperationRun | `policy.sync` |
| Policies | Ignore policy | Audit-only | `policy.ignore` |
| Policies | Export to backup | OperationRun (queued) | `policy.export_backup` |
| Policy Versions | Capture snapshot | OperationRun | `policy.capture_snapshot` |
| Policy Versions | Prune versions | Audit-only | `policy_versions.prune` |
| Policy Versions | Archive versions | Audit-only | `policy_versions.archive` |
| Inventory | Sync now | OperationRun | `inventory.sync` |
| Directory Groups | Sync groups | OperationRun | `directory_groups.sync` |
| Drift | Generate drift | OperationRun | `drift.generate` |
| Backup Sets | Add policies | OperationRun | `backup_set.add_policies` |
| Backup Sets | Archive | Audit-only (DB-only) | `backup_set.archive` |
| Backup Sets | Restore (bulk) | OperationRun | `backup_set.restore` |
| Backup Sets | Force delete | Audit-only (admin-only) | `backup_set.force_delete` |
| Backup Schedules | Run now | OperationRun | `backup_schedule.run_now` |
| Backup Schedules | Retry | OperationRun | `backup_schedule.retry` |
| Backup Schedules | Edit | Audit-only | `backup_schedule.edit` |
| Backup Schedules | Delete | Audit-only | `backup_schedule.delete` |
| Tenants | Sync tenant | OperationRun | `tenant.sync` |
| Tenants | Admin consent | Audit-only | `tenant.admin_consent` |
| Tenants | Verify configuration | Audit-only | `tenant.verify_config` |
| Tenants | Setup Intune RBAC | Audit-only | `tenant.setup_rbac` |
| Tenants | Deactivate | Audit-only | `tenant.deactivate` |
| Restore | Execute restore | OperationRun (adapter) | `restore.execute` (context → `restore_run_id`) |
**Rule**: If an action is queued/background, long-running, or requires remote/external calls (e.g., Microsoft Graph),
it MUST be tracked as an OperationRun. Only fast DB-only changes MAY be Audit-only.
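
The rule reads as a simple decision: any one of the three properties forces OperationRun tracking, and Audit-only is merely permitted otherwise. A minimal sketch with hypothetical names and return tokens:

```python
def tracking_for(*, queued: bool, long_running: bool, remote_calls: bool) -> str:
    """Classify an action according to the rule above."""
    if queued or long_running or remote_calls:
        return "operation_run_required"
    # Fast, DB-only change: Audit-only is allowed (MAY), not mandated.
    return "audit_only_allowed"
```

For example, a sync that calls Microsoft Graph yields `operation_run_required`, while a fast DB-only archive yields `audit_only_allowed`.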
### Functional Requirements
- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, terminal outcome (pending while active), summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
  - **Status semantics**: `status` represents lifecycle stage (`queued` → `running` → `completed`).
  - **Outcome semantics (stored tokens)**: `outcome` stores machine tokens: `pending` while active, otherwise `succeeded` / `partially_succeeded` / `failed`.
  - **UI labels**: Monitoring displays human labels derived from stored tokens (e.g., `partially_succeeded` → “Partially succeeded”).
  - **Reserved**: `cancelled` is reserved for future use and MUST NOT be produced by 054 (Monitoring hub has no cancel controls).
  - **Context safety**: `context` MUST be sanitized and MUST include only safe references (e.g., stable IDs, selection scope keys, correlation IDs). It MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
- **FR-002 Run taxonomy**: Run type MUST be stable and follow `"<resource>.<action>"`.
- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, `backup_schedule.run_now`, `backup_schedule.retry`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity.
- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, run state (queued/running/terminal outcome), time range, and initiator; default sort is most recent first; default time window is last 30 days.
- **FR-005 Run detail**: Run detail MUST show initiator, run type, run state (queued/running/terminal outcome), timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where status is `queued` or `running`.
- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows:
  - `inventory.sync`: tenant + selection scope
  - `policy.sync`: tenant + effective policy scope
  - `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”)
  - `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed)
  - `backup_schedule.run_now`: tenant + backup schedule id
  - `backup_schedule.retry`: tenant + backup schedule id
  - `drift.generate`: tenant + scope key + baseline/current comparison inputs
- **FR-011 Run state presentation**: Monitoring MUST present a consistent run state using a single display bucket derived from lifecycle status and terminal outcome:
  - If status is `queued` or `running`, display that status.
  - If status is `completed`, display the terminal outcome derived from the stored token (`succeeded`, `partially_succeeded`, or `failed`) using the UI label mapping.
- **FR-012 Partial vs failed (terminal outcomes)**: “Partially succeeded” (`partially_succeeded`) means at least one success and at least one failure; “Failed” (`failed`) means zero successes or cannot proceed.
- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
  - **Reason codes** MUST be stable, machine-readable identifiers (lowercase, dot-separated), e.g. `graph.throttled`, `auth.forbidden`, `validation.invalid_input`, `unexpected.exception`.
  - **Messages** MUST be short (≤ 200 characters), sanitized, and written for operators (no secrets/tokens/credentials/PII; no raw external payloads). If needed, messages MAY include a non-sensitive correlation identifier.
- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`).
- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
  - If a start request reuses an existing active run (dedupe), the run initiator (as stored on the `OperationRun`) remains the sole notification recipient; the second starter receives no additional notifications.
- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only.
- **FR-019 Audit-only actions (no OperationRun)**: Actions that are DB-only and complete within ≤2 seconds under normal
conditions MAY be executed without an OperationRun, as long as they do not start long-running background execution and
do not require any remote/external calls.
- **054 scope note**: 054 does not implement or modify audit-only actions. If any audit-only action is touched as part
of implementing 054 in the future, it MUST comply with this requirement and MUST be covered by tests.
If such an action is security-relevant or changes operational behavior (e.g., “Ignore policy”, “Deactivate tenant”,
“Admin consent”, “Prune versions”, “Force delete”), it MUST write exactly one tenant-scoped AuditLog entry with, at minimum:
- `tenant_id`
- `actor_user_id`
- `action` (stable action identifier, e.g., `policy.ignore`)
- `target_type`, `target_id`
- `before` / `after` (sanitized JSON) **or** `diff` (sanitized JSON)
- `created_at`
**Trigger guidance (to make classification reviewable)**:
- “Security-relevant” includes actions that grant/revoke access, change authorization posture, change admin consent, or otherwise modify who/what can read/write tenant data.
- “Operational behavior change” includes actions that change what the system will do in future runs (e.g., ignore/exclude resources, enable/disable schedules, retention/prune/archive actions, force deletes).
- If unclear whether an Audit-only action is security/ops-relevant, the default is to treat it as such and write an AuditLog entry.
**Sanitization (AuditLog before/after/diff)**:
- AuditLog payloads MUST include only the minimum fields needed to understand the change.
- AuditLog payloads MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
- If a field is sensitive, it MUST be omitted or replaced with a non-sensitive placeholder (e.g., `"[REDACTED]"`).
Monitoring/Operations remains reserved for OperationRun-tracked long-running/queued operations.
**Acceptance checks (testable)**:
- Audit-only action creates no OperationRun.
- Audit-only action creates exactly one AuditLog event containing the required fields.
- Audit-only action is tenant-scoped; cross-tenant access is forbidden and MUST NOT create AuditLog entries.
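
The required AuditLog fields and the redaction placeholder from FR-019 can be illustrated with a small builder. This is a sketch only: the function names and the list of sensitive keys are assumptions, and a real implementation would persist the entry rather than return a dictionary.

```python
from datetime import datetime, timezone

REDACTED = "[REDACTED]"
# Illustrative only; the actual set of sensitive fields is not defined in this spec.
SENSITIVE_KEYS = {"password", "secret", "token", "client_secret"}


def sanitize(payload: dict) -> dict:
    """Replace sensitive values with the placeholder; keep other fields as-is."""
    return {k: (REDACTED if k in SENSITIVE_KEYS else v) for k, v in payload.items()}


def build_audit_entry(tenant_id, actor_user_id, action, target_type, target_id,
                      before: dict, after: dict) -> dict:
    """Build one tenant-scoped AuditLog entry with the minimum required fields."""
    return {
        "tenant_id": tenant_id,
        "actor_user_id": actor_user_id,
        "action": action,  # stable identifier, e.g. "policy.ignore"
        "target_type": target_type,
        "target_id": target_id,
        "before": sanitize(before),
        "after": sanitize(after),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```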
### Key Entities *(include if feature involves data)*
- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, terminal outcome, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it.
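
For orientation, the canonical run entity can be pictured as the following data structure. `tenant_id`, `user_id`, `initiator_name`, `run_identity_hash`, `status`, `outcome`, and `context` are named in this spec; the remaining field names are assumptions based on the descriptions above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class OperationRun:
    tenant_id: int
    run_type: str                    # "<resource>.<action>", e.g. "inventory.sync"
    initiator_name: str              # required name snapshot
    run_identity_hash: str           # idempotency identity used for dedupe
    user_id: Optional[int] = None    # nullable FK; None for system-initiated runs
    status: str = "queued"           # queued -> running -> completed
    outcome: str = "pending"         # pending | succeeded | partially_succeeded | failed
    summary_counts: dict = field(default_factory=dict)  # hypothetical field name
    failures: list = field(default_factory=list)        # sanitized reason codes + messages
    context: dict = field(default_factory=dict)         # safe references only
    created_at: Optional[datetime] = None
    started_at: Optional[datetime] = None               # hypothetical field name
    finished_at: Optional[datetime] = None              # hypothetical field name


run = OperationRun(
    tenant_id=1,
    run_type="inventory.sync",
    initiator_name="Alice",
    run_identity_hash="abc",
    user_id=42,
)
```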
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
  - **SC-003 Measurement scope (definition)**: An “attempt” counts when a start request is made for an operation with identical effective inputs while an identical run is already `queued` or `running`. The success condition is that the system reuses the existing active run reference rather than creating a second active run. “Normal conditions” exclude infrastructure outages (e.g., database unavailable) that prevent either run creation or dedupe evaluation.
- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
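
A test for SC-004 would typically start from format checks on persisted failures (FR-013). A minimal sketch of two such checks, assuming reason codes consist of lowercase letters, digits, and underscores in at least two dot-separated segments, and that over-long messages are truncated rather than rejected; both are assumptions, not requirements stated above.

```python
import re

# Lowercase, dot-separated; the exact character set is an assumption.
REASON_CODE = re.compile(r"[a-z0-9_]+(\.[a-z0-9_]+)+")
MAX_MESSAGE_LENGTH = 200


def is_valid_reason_code(code: str) -> bool:
    return REASON_CODE.fullmatch(code) is not None


def clamp_message(message: str) -> str:
    """Collapse whitespace and cap the message at 200 characters."""
    message = " ".join(message.split())
    if len(message) <= MAX_MESSAGE_LENGTH:
        return message
    return message[: MAX_MESSAGE_LENGTH - 3] + "..."
```

Length and format checks do not detect secrets or PII on their own; redaction has to happen before a message reaches this point.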