spec(052): async add policies

This commit is contained in:
Ahmed Darrazi 2026-01-15 21:08:14 +01:00
parent a449ecec5b
commit 19361579f2
4 changed files with 440 additions and 0 deletions

View File

@ -0,0 +1,32 @@
# Specification Quality Checklist: Async “Add Policies” to Backup Set (052)
**Purpose**: Validate specification completeness and quality before proceeding to implementation
**Created**: 2026-01-14
**Feature**: [specs/052-async-add-policies/spec.md](../spec.md)
## Content Quality
- [x] No implementation details leaked into user requirements (spec focuses on behaviors, run observability, and safety rules)
- [x] Focused on user value (fast submit, dedupe, observable progress, safe failures)
- [x] Written in operator language (Backup Sets, Add Policies, runs, failures)
- [x] Mandatory sections present (User Scenarios & Testing, Requirements, Success Criteria)
## Requirement Completeness
- [x] Scope is bounded (explicit in-scope/out-of-scope list)
- [x] Requirements are testable and unambiguous (FR-001..FR-005 + NFRs + acceptance scenarios)
- [x] Idempotency and dedupe behavior is explicitly specified
- [x] Tenant isolation and authorization are explicitly specified
- [x] Data minimization and safe logging rules are explicitly specified
- [x] “No UI-time Graph calls” is explicitly specified and test-gated
## Feature Readiness
- [x] Plan includes concrete touch points and test strategy (plan.md)
- [x] Tasks are grouped by user story and prioritize tests-first (tasks.md)
## Notes
- Items marked incomplete require spec updates before implementation.
- Constitution gate: this checklist must exist for features that change runtime behavior.

View File

@ -0,0 +1,108 @@
# Implementation Plan: Async “Add Policies” to Backup Set (052)
**Branch**: `feat/052-async-add-policies` | **Date**: 2026-01-14 | **Spec**: [specs/052-async-add-policies/spec.md](spec.md)
**Input**: Feature specification from `/specs/052-async-add-policies/spec.md`
## Summary
Convert Backup Set → Add Policies (“Add selected”) to job-only execution by creating/reusing a `BulkOperationRun`, dispatching a queued job to do Graph/snapshot work, and returning immediately with observable feedback and a “View run” link.
## Technical Context
**Language/Version**: PHP 8.4 (Laravel 12)
**Admin UI**: Filament v4 + Livewire v3
**Storage**: PostgreSQL
**Queue**: Laravel queues (Sail-first locally)
**Testing**: Pest v4
**Constraints**: No Graph calls during interactive request/response; tenant isolation; idempotent dedupe; safe error persistence.
## Constitution Check
- Inventory-first: this feature updates backup storage (explicit snapshot capture), never “inventory on render”.
- Read/write separation: interactive action is “write intent only” (enqueue); job performs DB writes with auditability.
- Graph contract path: any Graph reads happen only inside the job via `GraphClientInterface` (indirectly via existing services).
- Deterministic capabilities: selection hashing and idempotency key are deterministic.
- Tenant isolation: run records and backup set mutations are scoped to the active tenant.
- Automation: job has run record lifecycle + counts + failures; dedupe prevents duplicates.
- Data minimization: failures and notifications store sanitized messages only.
## Project Structure
### Documentation (this feature)
```text
specs/052-async-add-policies/
├── plan.md
├── spec.md
├── tasks.md
└── checklists/
└── requirements.md
```
### Source Code (planned touch points)
```text
app/Livewire/BackupSetPolicyPickerTable.php # “Add selected” handler becomes enqueue-only
app/Jobs/AddPoliciesToBackupSetJob.php # new queued job (name may vary)
app/Models/BulkOperationRun.php # reused run record
app/Services/BulkOperationService.php # reused counters + failure sanitization
app/Support/RunIdempotency.php # reused deterministic key builder + active-run lookup
app/Filament/Resources/BulkOperationRunResource.php # existing run UI used for “View run”
tests/Feature/Filament/BackupSetPolicyPickerTableTest.php # updated for async behavior
tests/Feature/* # new/updated tests for idempotency + tenant isolation
```
## Execution Model
### Action Handler (UI request)
Responsibilities (must remain fast):
1. Resolve tenant + user context and authorize “add policies to backup set”.
2. Convert selected policy records into an ID list (no Graph/snapshot work).
3. Compute a deterministic idempotency key using:
- tenant id
- target backup_set_id
- operation type (e.g., `backup_set.add_policies`)
- context payload containing (sorted) policy IDs + option flags (include assignments/scope tags/foundations)
4. Reuse an active run if one exists; otherwise create a new `BulkOperationRun` with `resource=backup_set`, `action=add_policies`, `item_ids` = selected policy IDs, `total_items` = count, and `idempotency_key` set.
5. Dispatch a queued job (always async; no sync shortcut) with the run id + backup_set_id + option flags.
6. Return immediately with a Filament notification and a “View run” link (DB notification preferred).
### Queued Job
Responsibilities (all heavy work):
1. Load the `BulkOperationRun` and ensure it is still active (`pending` → `running`).
2. Validate the backup set exists and belongs to the run tenant.
3. Process the runs stored policy IDs deterministically (sequentially or chunked):
- Add/capture each policy into the backup set using existing capture services.
- Update counts and record per-item failures with safe reasons.
- Continue through all items (no circuit-breaker abort).
4. Complete the run with:
- `completed` if no failures
- `completed_with_errors` (partial) if some failures
- `failed` if the job cannot proceed (e.g., backup set missing)
5. Emit a DB notification to the initiating user summarizing the outcome and linking to the run.
### Idempotency & Concurrency
- Primary dedupe mechanism: `RunIdempotency::findActiveBulkOperationRun(tenant_id, idempotency_key)`.
- Secondary guard: existing partial unique index on `(tenant_id, idempotency_key)` for active statuses.
- Race handling: if concurrent submissions collide, prefer “find existing run and redirect” over throwing.
- Dedupe scope: tenant-wide (idempotency key does not include `user_id`).
## Testing Strategy (Pest)
Minimum required tests for 052:
- Action dispatch test: `Add selected` queues the job and creates/reuses a run.
- Fail-hard Graph guard test: binding `GraphClientInterface` (or a capture service) to a mock that must not be called during the action.
- Idempotency test: calling the action twice with the same selection queues only one job and creates only one active run.
- Tenant isolation test: run view under another tenant context is forbidden (403).
Target commands:
- `./vendor/bin/sail artisan test --filter=BackupSetAddPolicies`
- `./vendor/bin/pint --dirty`
## Rollout Notes
- Requires queue workers running for background processing (Sail locally; Dokploy workers in staging/prod).
- No destructive migrations expected; if schema is extended for better observability, it must remain backwards compatible.

View File

@ -0,0 +1,196 @@
# Feature Specification: Async “Add Policies” to Backup Set (052)
**Feature Branch**: `feat/052-async-add-policies`
**Created**: 2026-01-14
**Status**: Draft (implementation-ready)
**Input**: Make Backup Sets → “Add Policies” (Add selected) non-blocking by moving all Graph/snapshot work into a queued job. The UI action only creates/reuses a Run record and dispatches the job.
## Purpose
Make “Add selected” in the Backup Set → Add Policies flow reliable and fast by ensuring heavy work (Graph reads, snapshot capture, per-policy loops) never runs inline in an interactive request.
Primary outcomes:
- Prevent request timeouts and long waits.
- Prevent duplicate work from double-clicks/retries.
- Provide observable progress and safe failure visibility via an existing Run record type.
## Clarifications
### Session 2026-01-15
- Q: Who may start “Add selected”? → A: Roles `Owner`, `Manager`, `Operator` may start; `Readonly` may not.
- Q: Is idempotency/dedupe per-user or tenant-wide? → A: Tenant-wide (dedupe key does not include `user_id`).
- Q: Should execution ever run synchronously for small selections? → A: No; always async (never in-request).
- Q: What happens for empty selection? → A: No run/job; show “No policies selected” (current behavior).
- Q: How are run counts defined? → A: `total_items` equals the number of selected policies at submit time; “already in backup set” counts as `skipped`.
- Q: Do we use a circuit breaker (e.g., abort if >50% fails)? → A: No; process all items and surface partial results.
- Q: How should failures be persisted? → A: Stable `reason_code` + sanitized short text; never store secrets/tokens/raw payload dumps.
## Pinned Decisions (052 defaults)
- **Authorization**: start allowed for `Owner`, `Manager`, `Operator`; forbidden for `Readonly`.
- **Dedupe scope**: tenant-wide idempotency (no `user_id` in dedupe key).
- **Execution**: always async; action never performs Graph/snapshot work inline.
- **Empty selection**: no run/job; show “No policies selected”.
- **Counts**: `total_items` = selected policies; “already in backup set” = `skipped`.
- **Circuit breaker**: none; job processes all items.
- **Failure format**: `reason_code` + sanitized short text (no secrets).
## In Scope (052)
- Convert the “Add selected” action (Backup Sets → Add Policies modal) to job-only execution.
- Ensure an observable Run exists (status, counts, errors) using an existing run record type (prefer `BulkOperationRun`).
- Ensure idempotency: repeated clicks while queued/running reuse the same run for the same tenant + backup set + selection.
- Emit DB notification + optional existing progress widget integration (if already supported by the run type); no new UI framework work required.
- Add guard tests ensuring no Graph calls occur in the action handler/request.
## Out of Scope (052)
- UI redesign/polish (layout, navigation reorg, fancy tables, etc.)
- New Monitoring area or unified operations dashboard
- Group browsing/typeahead improvements (directory cache / later)
- New run tables if an existing run record can be reused
- Changing what a “policy version” means or how snapshots are stored
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Add selected policies without blocking (Priority: P1)
As a tenant admin, I can add selected policies to a backup set without the UI hanging or timing out, because the system queues background work and gives me a Run record to monitor.
**Why this priority**: This is a common workflow and currently risks long requests/timeouts when Graph capture is slow or throttled.
**Independent Test**: Trigger “Add selected” and assert the request returns quickly, a Run exists, and a job is queued (no Graph work happens inline).
**Acceptance Scenarios**:
1. **Given** I am on a Backup Set and select policies in “Add Policies”, **When** I click “Add selected”, **Then** the UI returns quickly and a queued Run is created (or reused) with a link to its detail view.
2. **Given** the job runs, **When** it completes, **Then** the backup set contains the added policies and the Run shows final status + counts.
---
### User Story 2 - Double click / repeated submissions are deduplicated (Priority: P2)
As a tenant admin, I can click “Add selected” repeatedly (or double click) without creating duplicate work, because identical operations reuse the same active Run.
**Why this priority**: Prevents accidental duplication, inconsistent outcomes, and unnecessary Graph load.
**Independent Test**: Call the action twice with the same selection while the first run is active and assert only one Run and one queued job.
**Acceptance Scenarios**:
1. **Given** a matching Run for the same tenant + backup set + selection is queued or running, **When** I click “Add selected” again, **Then** the system reuses the existing Run and does not enqueue duplicate work.
---
### User Story 3 - Failures are visible and safe (Priority: P3)
As a tenant admin, I can see safe failure summaries when some policies cannot be captured from Graph, without secrets or raw payloads being stored or displayed.
**Why this priority**: Operators must be able to triage issues safely; secret leakage is unacceptable.
**Independent Test**: Force a simulated Graph failure in the job and assert the Run ends as failed/partial with a sanitized reason code/summary (no secrets).
**Acceptance Scenarios**:
1. **Given** Graph returns an error for some policies, **When** the job completes, **Then** the Run is marked partial/completed-with-errors and includes per-item failures with safe reason codes/summaries.
2. **Given** a failure includes sensitive substrings (e.g., “Bearer ”), **When** it is persisted, **Then** stored reasons are redacted/sanitized.
---
### Edge Cases
- Empty selection (no policies selected)
- Backup set deleted or tenant context missing between queue and execution
- Some selected policies are already in the backup set
- Policies deleted/ignored locally between queue and execution
- Graph throttling/transient failures (429/503) during capture
## Requirements *(mandatory)*
**Constitution alignment (required):** This feature performs Graph reads and writes local DB state, so it MUST be idempotent & observable, tenant-scoped, and safe-loggable. Graph calls MUST go through `GraphClientInterface` and MUST NOT occur during UI render or action request handling.
### Functional Requirements
- **FR-001 (Job-only action handler)**: The “Add selected” handler MUST:
- validate input + authorization
- create/reuse a Run record
- dispatch a queued job
- return immediately with a notification and a link to the Run
It MUST NOT:
- call `GraphClientInterface`
- call snapshot capture services
- loop over selected policies to do work inline
- run a synchronous “small selection” shortcut
- **FR-002 (Run record + observability)**: The system MUST persist for each run:
- `tenant_id`, `initiator_user_id` (or equivalent)
- `resource = backup_set`, `action = add_policies` (or existing taxonomy)
- status lifecycle: queued → running → succeeded|failed|partial (map to existing run status semantics where possible, e.g., queued=`pending`, succeeded=`completed`, partial=`completed_with_errors`)
- counts: `total_items` = selected policies at submit time; `processed_items` = succeeded+failed+skipped; “already in backup set” increments `skipped`
- safe error context (summary + per-item outcomes at least for failures)
- **FR-003 (Idempotency / dedupe)**: While a matching run is queued or running, the action MUST reuse it and MUST NOT enqueue duplicate work.
Recommended dedupe key: `tenant_id + backup_set_id + operation_type(add_policies) + selection_hash`.
- **FR-004 (Job execution semantics)**: The queued job MUST:
- load the selection deterministically (IDs or a stored selection payload)
- process items sequentially or in safe batches
- update run progress counts as it goes
- process all items (no circuit-breaker abort)
- record per-item outcomes (at minimum: failure entries with stable `reason_code` + sanitized short text)
**Reason codes (minimum set):**
- `already_in_backup_set` (skipped)
- `policy_not_found` (failed)
- `policy_ignored` (skipped)
- `backup_set_not_found` (failed)
- `backup_set_archived` (failed)
- `graph_forbidden` (failed)
- `graph_throttled` (failed)
- `graph_transient` (failed)
- `unknown` (failed)
- **FR-005 (User-visible feedback)**: On action submit, the UI MUST:
- show “Queued” feedback
- provide a “View run” link (DB notification preferred)
### Non-Functional Requirements
- **NFR-001 (Tenant isolation and authorization)**:
- Run list/view MUST be tenant-scoped
- Cross-tenant run access MUST be denied (403)
- Only authorized roles can start “Add Policies”
- **NFR-002 (Data minimization & safe logging)**:
- No access tokens, “Bearer ” strings, or raw Graph payload dumps in notifications, run failures, or logs
- Persist only sanitized/allowlisted error fields and stable identifiers
- **NFR-003 (Determinism)**: Given the same input selection, results are reproducible and the dedupe key is stable.
- **NFR-004 (No UI-time Graph calls)**: Rendering the Backup Set UI and Run detail pages MUST not require Graph calls.
### Key Entities *(include if feature involves data)*
- **BulkOperationRun**: Existing run record used for long-running operations (status + counts + failures).
- **BackupSet**: Target set to which policies are added.
- **BackupItem**: Persisted item rows representing a policy in a backup set.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001 (Fast submit)**: Clicking “Add selected” returns without timing out and does not perform Graph work in-request (proved by guard tests).
- **SC-002 (Observable run)**: A Run is created/reused and is visible in the UI with status and progress counts.
- **SC-003 (No duplicates)**: Double-click does not create duplicate runs/jobs (proved by idempotency tests).
- **SC-004 (Safe failures)**: Failures show safe reason codes/summaries and do not store secrets/tokens (proved by sanitization tests).
## Rollout / Migration
- No destructive migrations required for 052.
- If additional idempotency indexing is needed beyond existing run infrastructure, add a minimal migration (backwards compatible).
## Open Questions (Optional)
- Should progress appear in an existing global progress widget (only if already supported; new widget work is out of scope)?

View File

@ -0,0 +1,104 @@
---
description: "Task list for implementing Async “Add Policies” to Backup Set (052)"
---
# Tasks: Async “Add Policies” to Backup Set (052)
**Input**: Design documents from `/specs/052-async-add-policies/`
**Prerequisites**: plan.md (required), spec.md (required)
**Tests**: Required (Pest), per spec.md (SC-001..SC-004).
**Organization**: Tasks are grouped by user story so each story can be implemented and tested independently.
## Format: `- [ ] T### [P?] [US#] Description (with file path)`
- **[P]**: Can run in parallel (different files, no dependencies)
- **[US#]**: User story mapping (US1/US2/US3)
## Path Conventions (Laravel)
- App code: `app/`
- DB: `database/migrations/`
- Filament admin: `app/Filament/Resources/`
- Livewire tables: `app/Livewire/`
- Tests (Pest): `tests/Feature/`, `tests/Unit/`
---
## Phase 1: Setup (Shared Infrastructure)
- [ ] T001 [P] Confirm existing run infra supports this operation (BulkOperationRun statuses + unique idempotency index) and document mapping queued↔pending, partial↔completed_with_errors in specs/052-async-add-policies/plan.md
- [ ] T002 [P] Decide and standardize taxonomy strings: operationType (`backup_set.add_policies`), resource (`backup_set`), action (`add_policies`) in specs/052-async-add-policies/spec.md and plan.md
---
## Phase 2: User Story 1 - Add selected policies without blocking (Priority: P1) 🎯 MVP
**Goal**: “Add selected” returns quickly and queues background work with an observable Run record.
**Independent Test**: Trigger the action and assert a job is queued and a run exists; no Graph/capture work occurs in-request.
### Tests (write first) ⚠️
- [ ] T010 [P] [US1] Update/replace sync-path assertions in tests/Feature/Filament/BackupSetPolicyPickerTableTest.php to assert job dispatch + run creation
- [ ] T011 [P] [US1] Add fail-hard guard test to ensure no Graph calls occur during the bulk action (mock `App\\Services\\Graph\\GraphClientInterface` and assert `->never()`)
### Implementation
- [ ] T020 [US1] Create queued job to add policies to a backup set in app/Jobs/AddPoliciesToBackupSetJob.php (uses run lifecycle + safe failures via BulkOperationService)
- [ ] T021 [US1] Update bulk action handler in app/Livewire/BackupSetPolicyPickerTable.php to create/reuse BulkOperationRun and dispatch AddPoliciesToBackupSetJob (no inline snapshot capture)
- [ ] T022 [US1] Emit “queued” notification with “View run” link on submit (Filament DB notification preferred) from app/Livewire/BackupSetPolicyPickerTable.php
- [ ] T023 [US1] Emit completion/failure DB notification from app/Jobs/AddPoliciesToBackupSetJob.php with a safe summary
**Checkpoint**: US1 complete — submit is non-blocking, job executes, run is visible.
---
## Phase 3: User Story 2 - Double click / repeated submissions are deduplicated (Priority: P2)
**Goal**: Matching queued/running operations reuse the same run and do not enqueue duplicates.
**Independent Test**: Call the action twice for the same tenant + backup set + selection and assert only one run and one job dispatch.
### Tests ⚠️
- [ ] T030 [P] [US2] Add idempotency test in tests/Feature/BackupSets/BackupSetAddPoliciesIdempotencyTest.php (or extend existing picker test) asserting one run + one queued job
### Implementation
- [ ] T031 [US2] Implement deterministic selection hashing and idempotency key creation using app/Support/RunIdempotency.php (sorted policy ids + option flags), and reuse active runs via findActiveBulkOperationRun
- [ ] T032 [US2] Handle race conditions safely (unique index collisions) by recovering the existing run rather than failing the request
- [ ] T033 [US2] Ensure the action is always async (no `dispatchSync` path for small selections) in app/Livewire/BackupSetPolicyPickerTable.php
**Checkpoint**: US2 complete — double clicks are safe and deduped.
---
## Phase 4: User Story 3 - Failures are visible and safe (Priority: P3)
**Goal**: Failures are persisted safely and tenant isolation is enforced for run visibility.
**Independent Test**: Force failure paths and confirm safe failures persisted; cross-tenant access is forbidden.
### Tests ⚠️
- [ ] T040 [P] [US3] Add tenant isolation test for the created run (403 cross-tenant) in tests/Feature/BackupSets/BackupSetAddPoliciesTenantIsolationTest.php (or extend tests/Feature/RunAuthorizationTenantIsolationTest.php)
- [ ] T041 [P] [US3] Add sanitization test: failure reason containing token-like content is stored as redacted (exercise BulkOperationService::sanitizeFailureReason)
### Implementation
- [ ] T042 [US3] Ensure job records per-item failures with sanitized reasons and does not store raw Graph payloads in run failures or notifications (app/Jobs/AddPoliciesToBackupSetJob.php)
- [ ] T043 [US3] Record stable failure `reason_code` values (per spec.md) alongside sanitized short text in run failures (app/Jobs/AddPoliciesToBackupSetJob.php and/or app/Services/BulkOperationService.php)
- [ ] T044 [US3] Record “already in backup set” as `skipped` (with reason_code `already_in_backup_set`) and ensure counts match spec.md (app/Jobs/AddPoliciesToBackupSetJob.php)
- [ ] T045 [US3] Ensure job processes all items (no circuit breaker abort) and run status reflects partial completion (app/Jobs/AddPoliciesToBackupSetJob.php)
**Checkpoint**: US3 complete — failures are safe and observable; tenant isolation holds.
---
## Phase 5: Polish & Validation
- [ ] T050 [P] Run formatting on changed files with ./vendor/bin/pint --dirty
- [ ] T051 Run targeted tests: ./vendor/bin/sail artisan test --filter=BackupSetAddPolicies