TenantAtlas/specs/052-async-add-policies/spec.md
ahmido c60d16ffba feat/052-async-add-policies (#59)
Status Update

Committed the async “Add selected” flow: job-only handler, deterministic run reuse, sanitized failure tracking, observation updates, and the new BulkOperationService/Progress test coverage.
All relevant tasks in tasks.md are marked done, and the checklist under requirements.md is fully satisfied (PASS).
Ran ./vendor/bin/pint --dirty and BackupSetPolicyPickerTableTest.php; all green.

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #59
2026-01-15 22:20:16 +00:00

Feature Specification: Async “Add Policies” to Backup Set (052)

Feature Branch: feat/052-async-add-policies
Created: 2026-01-14
Status: Draft (implementation-ready)
Input: Make Backup Sets → “Add Policies” (Add selected) non-blocking by moving all Graph/snapshot work into a queued job. The UI action only creates/reuses a Run record and dispatches the job.

Purpose

Make “Add selected” in the Backup Set → Add Policies flow reliable and fast by ensuring heavy work (Graph reads, snapshot capture, per-policy loops) never runs inline in an interactive request.

Primary outcomes:

  • Prevent request timeouts and long waits.
  • Prevent duplicate work from double-clicks/retries.
  • Provide observable progress and safe failure visibility via an existing Run record type.

Clarifications

Session 2026-01-15

  • Q: Who may start “Add selected”? → A: Roles Owner, Manager, Operator may start; Readonly may not.
  • Q: Is idempotency/dedupe per-user or tenant-wide? → A: Tenant-wide (dedupe key does not include user_id).
  • Q: Should execution ever run synchronously for small selections? → A: No; always async (never in-request).
  • Q: What happens for empty selection? → A: No run/job; show “No policies selected” (current behavior).
  • Q: How are run counts defined? → A: total_items equals the number of selected policies at submit time; “already in backup set” counts as skipped.
  • Q: Do we use a circuit breaker (e.g., abort if >50% fails)? → A: No; process all items and surface partial results.
  • Q: How should failures be persisted? → A: Stable reason_code + sanitized short text; never store secrets/tokens/raw payload dumps.

Pinned Decisions (052 defaults)

  • Authorization: start allowed for Owner, Manager, Operator; forbidden for Readonly.
  • Dedupe scope: tenant-wide idempotency (no user_id in dedupe key).
  • Execution: always async; action never performs Graph/snapshot work inline.
  • Empty selection: no run/job; show “No policies selected”.
  • Counts: total_items = selected policies; “already in backup set” = skipped.
  • Circuit breaker: none; job processes all items.
  • Failure format: reason_code + sanitized short text (no secrets).
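
The authorization and empty-selection decisions above can be sketched as a cheap pre-check that runs before any Run or job is created. This is a minimal illustration in Python, not the project's implementation; the `Role` enum, `START_ALLOWED`, and `precheck` are hypothetical names, and checking authorization before the empty-selection case is an assumption.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    MANAGER = "manager"
    OPERATOR = "operator"
    READONLY = "readonly"


# Roles allowed to start "Add selected" per the pinned decisions.
START_ALLOWED = frozenset({Role.OWNER, Role.MANAGER, Role.OPERATOR})


def precheck(role: Role, selected_policy_ids: list) -> str:
    """Return 'forbidden', 'empty', or 'ok' without touching Graph or the queue."""
    if role not in START_ALLOWED:
        return "forbidden"
    if not selected_policy_ids:
        # No run and no job are created; the UI shows "No policies selected".
        return "empty"
    return "ok"
```

Only the `ok` result would proceed to creating or reusing a Run.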

In Scope (052)

  • Convert the “Add selected” action (Backup Sets → Add Policies modal) to job-only execution.
  • Ensure an observable Run exists (status, counts, errors) using an existing run record type (prefer BulkOperationRun).
  • Ensure idempotency: repeated clicks while queued/running reuse the same run for the same tenant + backup set + selection.
  • Emit DB notification + optional existing progress widget integration (if already supported by the run type); no new UI framework work required.
  • Add guard tests ensuring no Graph calls occur in the action handler/request.

Out of Scope (052)

  • UI redesign/polish (layout, navigation reorg, fancy tables, etc.)
  • New Monitoring area or unified operations dashboard
  • Group browsing/typeahead improvements (directory cache; deferred to a later feature)
  • New run tables if an existing run record can be reused
  • Changing what a “policy version” means or how snapshots are stored

User Scenarios & Testing (mandatory)

User Story 1 - Add selected policies without blocking (Priority: P1)

As a tenant admin, I can add selected policies to a backup set without the UI hanging or timing out, because the system queues background work and gives me a Run record to monitor.

Why this priority: This is a common workflow and currently risks long requests/timeouts when Graph capture is slow or throttled.

Independent Test: Trigger “Add selected” and assert the request returns quickly, a Run exists, and a job is queued (no Graph work happens inline).

Acceptance Scenarios:

  1. Given I am on a Backup Set and select policies in “Add Policies”, When I click “Add selected”, Then the UI returns quickly and a queued Run is created (or reused) with a link to its detail view.
  2. Given the job runs, When it completes, Then the backup set contains the added policies and the Run shows final status + counts.
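
The independent test for this story can be sketched with a Graph test double that fails loudly on any call. The Python below is illustrative only; `ExplodingGraphClient`, `FakeQueue`, `RunStore`, `handle_add_selected`, and the `/runs/<id>` URL shape are all hypothetical, and run reuse is deliberately left out here to keep the guard idea isolated.

```python
from dataclasses import dataclass, field


class ExplodingGraphClient:
    """Test double: any attribute access (i.e. any Graph call) raises."""

    def __getattr__(self, name):
        raise AssertionError(f"Graph call '{name}' attempted inside the request")


@dataclass
class FakeQueue:
    jobs: list = field(default_factory=list)

    def dispatch(self, job: dict) -> None:
        self.jobs.append(job)


@dataclass
class RunStore:
    runs: list = field(default_factory=list)

    def create(self, tenant_id: int, backup_set_id: int, policy_ids: list) -> dict:
        run = {
            "id": len(self.runs) + 1,
            "tenant_id": tenant_id,
            "backup_set_id": backup_set_id,
            "status": "queued",
            "total_items": len(policy_ids),
        }
        self.runs.append(run)
        return run


def handle_add_selected(tenant_id, backup_set_id, policy_ids, runs, queue, graph):
    """Job-only handler: record a Run, dispatch a job, return a link.

    The graph client is available (as it would be from a service container)
    but is never used here; all Graph work belongs to the queued job.
    """
    run = runs.create(tenant_id, backup_set_id, policy_ids)
    queue.dispatch({"run_id": run["id"], "policy_ids": list(policy_ids)})
    return {"message": "Queued", "run_url": f"/runs/{run['id']}"}
```

If the handler ever touched the Graph client, the double would raise and the test would fail, which is the guard property SC-001 asks for.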

User Story 2 - Double click / repeated submissions are deduplicated (Priority: P2)

As a tenant admin, I can click “Add selected” repeatedly (or double click) without creating duplicate work, because identical operations reuse the same active Run.

Why this priority: Prevents accidental duplication, inconsistent outcomes, and unnecessary Graph load.

Independent Test: Call the action twice with the same selection while the first run is active and assert only one Run and one queued job.

Acceptance Scenarios:

  1. Given a matching Run for the same tenant + backup set + selection is queued or running, When I click “Add selected” again, Then the system reuses the existing Run and does not enqueue duplicate work.
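
One way to make "same tenant + backup set + selection" concrete is an order-insensitive selection hash combined into a dedupe key. The following Python sketch is an assumption about how that could look, not the project's code; `selection_hash`, `dedupe_key`, `find_or_create_run`, the key layout, and the in-memory `runs` list are hypothetical.

```python
import hashlib

ACTIVE_STATUSES = {"queued", "running"}


def selection_hash(policy_ids) -> str:
    """Hash the selection so that order and duplicates do not matter."""
    canonical = ",".join(str(i) for i in sorted(set(policy_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_key(tenant_id: int, backup_set_id: int, policy_ids) -> str:
    # Tenant-wide dedupe: user_id is deliberately not part of the key.
    return f"{tenant_id}:{backup_set_id}:add_policies:{selection_hash(policy_ids)}"


def find_or_create_run(runs: list, tenant_id, backup_set_id, policy_ids, user_id):
    """Return (run, created). Reuses a matching queued/running run if one exists."""
    key = dedupe_key(tenant_id, backup_set_id, policy_ids)
    for run in runs:
        if run["dedupe_key"] == key and run["status"] in ACTIVE_STATUSES:
            return run, False
    run = {
        "dedupe_key": key,
        "status": "queued",
        "initiator_user_id": user_id,
        "total_items": len(set(policy_ids)),
    }
    runs.append(run)
    return run, True
```

A job would be dispatched only when `created` is true. Against a real database, the lookup and insert would also need a lock or a unique index to close the race between two concurrent requests.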

User Story 3 - Failures are visible and safe (Priority: P3)

As a tenant admin, I can see safe failure summaries when some policies cannot be captured from Graph, without secrets or raw payloads being stored or displayed.

Why this priority: Operators must be able to triage issues safely; secret leakage is unacceptable.

Independent Test: Force a simulated Graph failure in the job and assert the Run ends as failed/partial with a sanitized reason code/summary (no secrets).

Acceptance Scenarios:

  1. Given Graph returns an error for some policies, When the job completes, Then the Run is marked partial/completed-with-errors and includes per-item failures with safe reason codes/summaries.
  2. Given a failure includes sensitive substrings (e.g., “Bearer ”), When it is persisted, Then stored reasons are redacted/sanitized.
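
The redaction in scenario 2 could be approximated with a small set of patterns applied before a failure message is persisted. This is a hedged Python sketch; the patterns, the secret field names, the `[REDACTED]` marker, and the 200-character cap are assumptions, and a real allowlist-based sanitizer would likely be stricter.

```python
import re

MAX_LEN = 200  # assumed cap on stored failure text

_PATTERNS = [
    # "Bearer <token>": drop the scheme word too, so no "Bearer " string is stored.
    (re.compile(r"Bearer\s+[A-Za-z0-9\-._~+/]+=*", re.IGNORECASE), "[REDACTED]"),
    # JWT-looking strings: three dot-separated base64url segments.
    (re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "[REDACTED]"),
    # key=value or key: value pairs for a few common secret field names.
    (
        re.compile(r"(?i)\b(access_token|client_secret|password)\b(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
]


def sanitize(message: str) -> str:
    """Redact secret-looking substrings, collapse whitespace, then truncate."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    message = " ".join(message.split())
    return message[:MAX_LEN]
```

Redaction runs before truncation so that a token cut off by the length cap cannot slip past the patterns.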

Edge Cases

  • Empty selection (no policies selected)
  • Backup set deleted or tenant context missing between queue and execution
  • Some selected policies are already in the backup set
  • Policies deleted/ignored locally between queue and execution
  • Graph throttling/transient failures (429/503) during capture
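
For the Graph failure edge case, the job needs a stable way to turn an HTTP status into one of the reason codes listed under FR-004. The Python mapping below is a sketch under assumptions: only 429 and 503 are named in this spec, so treating 403 as `graph_forbidden` and every 5xx as `graph_transient` is an illustrative choice, and `graph_reason_code` is a hypothetical helper.

```python
from typing import Optional


def graph_reason_code(http_status: Optional[int]) -> str:
    """Map a Graph HTTP status to a stable reason_code (assumed mapping)."""
    if http_status == 403:
        return "graph_forbidden"
    if http_status == 429:
        return "graph_throttled"
    if http_status is not None and 500 <= http_status <= 599:
        # Includes 503; other 5xx codes are treated as transient by assumption.
        return "graph_transient"
    # No status (e.g. a connection error) or anything unmapped.
    return "unknown"
```

Keeping the mapping in one function means the stored `reason_code` does not depend on the wording of the upstream error message.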

Requirements (mandatory)

Constitution alignment (required): This feature performs Graph reads and writes local DB state, so it MUST be idempotent & observable, tenant-scoped, and safe-loggable. Graph calls MUST go through GraphClientInterface and MUST NOT occur during UI render or action request handling.

Functional Requirements

  • FR-001 (Job-only action handler): The “Add selected” handler MUST:

    • validate input + authorization
    • create/reuse a Run record
    • dispatch a queued job
    • return immediately with a notification and a link to the Run

    It MUST NOT:

    • call GraphClientInterface
    • call snapshot capture services
    • loop over selected policies to do work inline
    • run a synchronous “small selection” shortcut
  • FR-002 (Run record + observability): The system MUST persist for each run:

    • tenant_id, initiator_user_id (or equivalent)
    • resource = backup_set, action = add_policies (or existing taxonomy)
    • status lifecycle: queued → running → succeeded|failed|partial (map to existing run status semantics where possible, e.g., queued=pending, succeeded=completed, partial=completed_with_errors)
    • counts: total_items = selected policies at submit time; processed_items = succeeded+failed+skipped; “already in backup set” increments skipped
    • safe error context (summary + per-item outcomes at least for failures)
  • FR-003 (Idempotency / dedupe): While a matching run is queued or running, the action MUST reuse it and MUST NOT enqueue duplicate work.

    Recommended dedupe key: tenant_id + backup_set_id + operation_type(add_policies) + selection_hash.

  • FR-004 (Job execution semantics): The queued job MUST:

    • load the selection deterministically (IDs or a stored selection payload)
    • process items sequentially or in safe batches
    • update run progress counts as it goes
    • process all items (no circuit-breaker abort)
    • record per-item outcomes (at minimum: failure entries with stable reason_code + sanitized short text)

    Reason codes (minimum set):

    • already_in_backup_set (skipped)
    • policy_not_found (failed)
    • policy_ignored (skipped)
    • backup_set_not_found (failed)
    • backup_set_archived (failed)
    • graph_forbidden (failed)
    • graph_throttled (failed)
    • graph_transient (failed)
    • unknown (failed)
  • FR-005 (User-visible feedback): On action submit, the UI MUST:

    • show “Queued” feedback
    • provide a “View run” link (DB notification preferred)
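
The count and status rules in FR-002 and FR-004 can be illustrated with a small processing loop. This Python sketch is not the project's job class; `Run`, `process_run`, and the `capture` callback are hypothetical, and the rule used to choose between `failed` and `partial` (failed only when nothing succeeded or was skipped) is an assumption the spec does not pin down.

```python
from dataclasses import dataclass, field

# Reason codes that FR-004 classifies as skipped; all others count as failed.
SKIPPED_REASONS = {"already_in_backup_set", "policy_ignored"}


@dataclass
class Run:
    total_items: int
    status: str = "queued"
    succeeded: int = 0
    failed: int = 0
    skipped: int = 0
    failures: list = field(default_factory=list)

    @property
    def processed_items(self) -> int:
        return self.succeeded + self.failed + self.skipped


def process_run(run: Run, policy_ids, capture) -> Run:
    """capture(policy_id) returns None on success or a reason_code string."""
    run.status = "running"
    for policy_id in policy_ids:
        reason = capture(policy_id)
        if reason is None:
            run.succeeded += 1
        elif reason in SKIPPED_REASONS:
            run.skipped += 1
        else:
            run.failed += 1
            run.failures.append({"policy_id": policy_id, "reason_code": reason})
        # No circuit breaker: the loop continues regardless of failures.
    if run.failed == 0:
        run.status = "succeeded"
    elif run.succeeded == 0 and run.skipped == 0:
        run.status = "failed"
    else:
        run.status = "partial"
    return run
```

With these rules, `processed_items` equals `total_items` once the loop finishes, because every selected policy lands in exactly one of the three counters.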

Non-Functional Requirements

  • NFR-001 (Tenant isolation and authorization):

    • Run list/view MUST be tenant-scoped
    • Cross-tenant run access MUST be denied (403)
    • Only authorized roles can start “Add Policies”
  • NFR-002 (Data minimization & safe logging):

    • No access tokens, “Bearer ” strings, or raw Graph payload dumps in notifications, run failures, or logs
    • Persist only sanitized/allowlisted error fields and stable identifiers
  • NFR-003 (Determinism): Given the same input selection, results are reproducible and the dedupe key is stable.

  • NFR-004 (No UI-time Graph calls): Rendering the Backup Set UI and Run detail pages MUST NOT require Graph calls.
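
The tenant isolation rule in NFR-001 amounts to filtering every run lookup by the current tenant and denying mismatches. The Python below is a minimal sketch with hypothetical names (`Forbidden`, `get_run_for_tenant`, `list_runs_for_tenant`) and an in-memory dictionary standing in for the run table.

```python
class Forbidden(Exception):
    """Raised when a run belongs to another tenant; maps to HTTP 403."""

    status_code = 403


def get_run_for_tenant(runs_by_id: dict, run_id: int, current_tenant_id: int) -> dict:
    run = runs_by_id.get(run_id)
    if run is None:
        raise KeyError(run_id)
    if run["tenant_id"] != current_tenant_id:
        # Cross-tenant access is denied rather than silently returned.
        raise Forbidden(f"run {run_id} is not visible to this tenant")
    return run


def list_runs_for_tenant(runs_by_id: dict, current_tenant_id: int) -> list:
    """Run lists only ever contain the current tenant's runs."""
    return [r for r in runs_by_id.values() if r["tenant_id"] == current_tenant_id]
```

Neither function needs a Graph client, which is consistent with NFR-004: run pages render from local state only.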

Key Entities (include if feature involves data)

  • BulkOperationRun: Existing run record used for long-running operations (status + counts + failures).
  • BackupSet: Target set to which policies are added.
  • BackupItem: Persisted item rows representing a policy in a backup set.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001 (Fast submit): Clicking “Add selected” returns without timing out and does not perform Graph work in-request (proved by guard tests).
  • SC-002 (Observable run): A Run is created/reused and is visible in the UI with status and progress counts.
  • SC-003 (No duplicates): Double-click does not create duplicate runs/jobs (proved by idempotency tests).
  • SC-004 (Safe failures): Failures show safe reason codes/summaries and do not store secrets/tokens (proved by sanitization tests).

Rollout / Migration

  • No destructive migrations required for 052.
  • If additional idempotency indexing is needed beyond existing run infrastructure, add a minimal migration (backwards compatible).

Open Questions (Optional)

  • Should progress appear in an existing global progress widget? (Only if the run type already supports it; new widget work is out of scope.)