# Feature Specification: Async “Add Policies” to Backup Set (052)

**Feature Branch**: `feat/052-async-add-policies`
**Created**: 2026-01-14
**Status**: Draft (implementation-ready)
**Input**: Make Backup Sets → “Add Policies” (Add selected) non-blocking by moving all Graph/snapshot work into a queued job. The UI action only creates/reuses a Run record and dispatches the job.

## Purpose

Make “Add selected” in the Backup Set → Add Policies flow reliable and fast by ensuring heavy work (Graph reads, snapshot capture, per-policy loops) never runs inline in an interactive request.

Primary outcomes:
- Prevent request timeouts and long waits.
- Prevent duplicate work from double-clicks/retries.
- Provide observable progress and safe failure visibility via an existing Run record type.

## Clarifications

### Session 2026-01-15

- Q: Who may start “Add selected”? → A: Roles `Owner`, `Manager`, `Operator` may start; `Readonly` may not.
- Q: Is idempotency/dedupe per-user or tenant-wide? → A: Tenant-wide (dedupe key does not include `user_id`).
- Q: Should execution ever run synchronously for small selections? → A: No; always async (never in-request).
- Q: What happens for empty selection? → A: No run/job; show “No policies selected” (current behavior).
- Q: How are run counts defined? → A: `total_items` equals the number of selected policies at submit time; “already in backup set” counts as `skipped`.
- Q: Do we use a circuit breaker (e.g., abort if >50% fails)? → A: No; process all items and surface partial results.
- Q: How should failures be persisted? → A: Stable `reason_code` + sanitized short text; never store secrets/tokens/raw payload dumps.

## Pinned Decisions (052 defaults)

- **Authorization**: start allowed for `Owner`, `Manager`, `Operator`; forbidden for `Readonly`.
- **Dedupe scope**: tenant-wide idempotency (no `user_id` in dedupe key).
- **Execution**: always async; action never performs Graph/snapshot work inline.
- **Empty selection**: no run/job; show “No policies selected”.
- **Counts**: `total_items` = selected policies; “already in backup set” = `skipped`.
- **Circuit breaker**: none; job processes all items.
- **Failure format**: `reason_code` + sanitized short text (no secrets).

## In Scope (052)

- Convert the “Add selected” action (Backup Sets → Add Policies modal) to job-only execution.
- Ensure an observable Run exists (status, counts, errors) using an existing run record type (prefer `BulkOperationRun`).
- Ensure idempotency: repeated clicks while queued/running reuse the same run for the same tenant + backup set + selection.
- Emit DB notification + optional existing progress widget integration (if already supported by the run type); no new UI framework work required.
- Add guard tests ensuring no Graph calls occur in the action handler/request.

## Out of Scope (052)

- UI redesign/polish (layout, navigation reorg, fancy tables, etc.)
- New Monitoring area or unified operations dashboard
- Group browsing/typeahead improvements (directory cache / later)
- New run tables if an existing run record can be reused
- Changing what a “policy version” means or how snapshots are stored

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Add selected policies without blocking (Priority: P1)

As a tenant admin, I can add selected policies to a backup set without the UI hanging or timing out, because the system queues background work and gives me a Run record to monitor.

**Why this priority**: This is a common workflow and currently risks long requests/timeouts when Graph capture is slow or throttled.

**Independent Test**: Trigger “Add selected” and assert the request returns quickly, a Run exists, and a job is queued (no Graph work happens inline).

**Acceptance Scenarios**:

1. **Given** I am on a Backup Set and select policies in “Add Policies”, **When** I click “Add selected”, **Then** the UI returns quickly and a queued Run is created (or reused) with a link to its detail view.
2. **Given** the job runs, **When** it completes, **Then** the backup set contains the added policies and the Run shows final status + counts.

---

### User Story 2 - Double click / repeated submissions are deduplicated (Priority: P2)

As a tenant admin, I can click “Add selected” repeatedly (or double click) without creating duplicate work, because identical operations reuse the same active Run.

**Why this priority**: Prevents accidental duplication, inconsistent outcomes, and unnecessary Graph load.

**Independent Test**: Call the action twice with the same selection while the first run is active and assert only one Run and one queued job.

**Acceptance Scenarios**:

1. **Given** a matching Run for the same tenant + backup set + selection is queued or running, **When** I click “Add selected” again, **Then** the system reuses the existing Run and does not enqueue duplicate work.

---

### User Story 3 - Failures are visible and safe (Priority: P3)

As a tenant admin, I can see safe failure summaries when some policies cannot be captured from Graph, without secrets or raw payloads being stored or displayed.

**Why this priority**: Operators must be able to triage issues safely; secret leakage is unacceptable.

**Independent Test**: Force a simulated Graph failure in the job and assert the Run ends as failed/partial with a sanitized reason code/summary (no secrets).

**Acceptance Scenarios**:

1. **Given** Graph returns an error for some policies, **When** the job completes, **Then** the Run is marked partial/completed-with-errors and includes per-item failures with safe reason codes/summaries.
2. **Given** a failure includes sensitive substrings (e.g., “Bearer ”), **When** it is persisted, **Then** stored reasons are redacted/sanitized.

---

### Edge Cases

- Empty selection (no policies selected)
- Backup set deleted or tenant context missing between queue and execution
- Some selected policies are already in the backup set
- Policies deleted/ignored locally between queue and execution
- Graph throttling/transient failures (429/503) during capture

## Requirements *(mandatory)*

**Constitution alignment (required):** This feature performs Graph reads and writes local DB state, so it MUST be idempotent & observable, tenant-scoped, and safe-loggable. Graph calls MUST go through `GraphClientInterface` and MUST NOT occur during UI render or action request handling.

### Functional Requirements

- **FR-001 (Job-only action handler)**: The “Add selected” handler MUST:
  - validate input + authorization
  - create/reuse a Run record
  - dispatch a queued job
  - return immediately with a notification and a link to the Run

  It MUST NOT:

  - call `GraphClientInterface`
  - call snapshot capture services
  - loop over selected policies to do work inline
  - run a synchronous “small selection” shortcut

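
The handler contract in FR-001 can be sketched as follows. This is a language-neutral illustration in Python, not the project's implementation: `FakeBackend`, `Run`, and `add_selected` are hypothetical names, and the in-memory lists stand in for the real run store and queue. The point it shows is that the handler only validates, authorizes, creates or reuses a Run, and enqueues; it never touches Graph.

```python
from dataclasses import dataclass, field

# Roles allowed to start the action, per the pinned decisions (Readonly is excluded).
ALLOWED_ROLES = {"Owner", "Manager", "Operator"}


@dataclass
class Run:
    dedupe_key: tuple
    total_items: int
    status: str = "queued"


@dataclass
class FakeBackend:
    """Hypothetical in-memory stand-in for the run store and the job queue."""
    runs: list = field(default_factory=list)
    queued_jobs: list = field(default_factory=list)

    def find_active(self, key):
        # Only queued/running runs count as reusable.
        return next(
            (r for r in self.runs
             if r.dedupe_key == key and r.status in ("queued", "running")),
            None,
        )


def add_selected(backend, role, tenant_id, backup_set_id, policy_ids):
    """Validate, authorize, create/reuse a Run, dispatch a job. No Graph work."""
    if role not in ALLOWED_ROLES:
        raise PermissionError("role may not start Add Policies")
    if not policy_ids:
        return None  # empty selection: no run, no job
    # Assumption: duplicate IDs in the selection are collapsed before counting.
    selection = tuple(sorted(set(policy_ids)))
    key = (tenant_id, backup_set_id, "add_policies", selection)  # no user_id
    run = backend.find_active(key)
    if run is None:
        run = Run(dedupe_key=key, total_items=len(selection))
        backend.runs.append(run)
        backend.queued_jobs.append(run)  # dispatch only; the job does the work
    return run
```

Because the key omits the user, a second submit by a different authorized user for the same tenant, backup set, and selection returns the existing Run instead of queuing another job.
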

- **FR-002 (Run record + observability)**: The system MUST persist for each run:
  - `tenant_id`, `initiator_user_id` (or equivalent)
  - `resource = backup_set`, `action = add_policies` (or existing taxonomy)
  - status lifecycle: queued → running → succeeded|failed|partial (map to existing run status semantics where possible, e.g., queued=`pending`, succeeded=`completed`, partial=`completed_with_errors`)
  - counts: `total_items` = selected policies at submit time; `processed_items` = succeeded+failed+skipped; “already in backup set” increments `skipped`
  - safe error context (summary + per-item outcomes at least for failures)
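
The count arithmetic in FR-002 is small enough to show directly. The sketch below is illustrative only: `RunCounts` and `final_status` are hypothetical names, and the rule separating `failed` from `partial` (failed only when every item failed) is an assumption, since the spec does not define that boundary.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    total_items: int   # selected policies at submit time
    succeeded: int = 0
    failed: int = 0
    skipped: int = 0   # includes "already in backup set"

    @property
    def processed_items(self) -> int:
        # processed_items = succeeded + failed + skipped, as defined in FR-002
        return self.succeeded + self.failed + self.skipped


def final_status(c: RunCounts) -> str:
    """Derive a lifecycle status from the counts."""
    if c.processed_items < c.total_items:
        return "running"
    if c.failed == 0:
        return "succeeded"
    # Assumption: "failed" means every item failed; anything else is partial.
    if c.failed == c.total_items:
        return "failed"
    return "partial"
```

For example, a run over three policies where one succeeds, one fails, and one is already in the backup set ends with `processed_items == 3` and status `partial`.
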

- **FR-003 (Idempotency / dedupe)**: While a matching run is queued or running, the action MUST reuse it and MUST NOT enqueue duplicate work.

  Recommended dedupe key: `tenant_id + backup_set_id + operation_type(add_policies) + selection_hash`.
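
One way to make `selection_hash` stable is to canonicalize the selection before hashing, so that click order and accidental duplicates do not change the key. The sketch below is an assumption about how that could look, not the project's code: the function names, the `:` separator, and the choice of SHA-256 are all illustrative, and it assumes policy IDs are integers or UUID-like strings that contain no commas.

```python
import hashlib


def selection_hash(policy_ids):
    """Order-insensitive digest of the selected policy IDs."""
    # Deduplicate, stringify, and sort so the same set always hashes the same.
    canonical = ",".join(sorted({str(p) for p in policy_ids}))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_key(tenant_id, backup_set_id, policy_ids, operation_type="add_policies"):
    """Tenant-wide key: deliberately contains no user_id."""
    return f"{tenant_id}:{backup_set_id}:{operation_type}:{selection_hash(policy_ids)}"
```

With this shape, `dedupe_key(1, 10, [3, 1, 2])` and `dedupe_key(1, 10, [2, 3, 1, 1])` are equal, while changing the tenant, the backup set, or the selected set produces a different key.
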

- **FR-004 (Job execution semantics)**: The queued job MUST:
  - load the selection deterministically (IDs or a stored selection payload)
  - process items sequentially or in safe batches
  - update run progress counts as it goes
  - process all items (no circuit-breaker abort)
  - record per-item outcomes (at minimum: failure entries with stable `reason_code` + sanitized short text)

  **Reason codes (minimum set):**
  - `already_in_backup_set` (skipped)
  - `policy_not_found` (failed)
  - `policy_ignored` (skipped)
  - `backup_set_not_found` (failed)
  - `backup_set_archived` (failed)
  - `graph_forbidden` (failed)
  - `graph_throttled` (failed)
  - `graph_transient` (failed)
  - `unknown` (failed)
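
The loop semantics in FR-004 (process everything, record a reason code per failure, never abort early) can be sketched as below. Everything here is hypothetical: `GraphError` is an invented exception type, `capture` stands in for the snapshot capture call, and the status-to-code mapping (403, 429, 503) is an assumption based on the edge cases listed earlier rather than a confirmed rule. Only a subset of the reason codes is exercised.

```python
class GraphError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed Graph call."""

    def __init__(self, status):
        super().__init__(f"graph error {status}")
        self.status = status


def reason_for(exc):
    """Map an exception to a stable reason code (mapping is an assumption)."""
    if isinstance(exc, GraphError):
        if exc.status == 403:
            return "graph_forbidden"
        if exc.status == 429:
            return "graph_throttled"
        if exc.status == 503:
            return "graph_transient"
    return "unknown"


def run_job(policy_ids, existing_ids, capture):
    """Process every item; a failure is recorded and the loop continues."""
    counts = {"succeeded": 0, "failed": 0, "skipped": 0}
    outcomes = []  # (policy_id, reason_code) for skipped and failed items
    for pid in policy_ids:
        if pid in existing_ids:
            counts["skipped"] += 1
            outcomes.append((pid, "already_in_backup_set"))
            continue
        try:
            capture(pid)  # stand-in for the real snapshot capture
            counts["succeeded"] += 1
        except Exception as exc:
            counts["failed"] += 1
            outcomes.append((pid, reason_for(exc)))
    return counts, outcomes
```

A real job would also persist progress after each item or batch and handle the backup-set-level codes (`backup_set_not_found`, `backup_set_archived`) before entering the loop; those steps are omitted to keep the sketch short.
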

- **FR-005 (User-visible feedback)**: On action submit, the UI MUST:
  - show “Queued” feedback
  - provide a “View run” link (DB notification preferred)

### Non-Functional Requirements

- **NFR-001 (Tenant isolation and authorization)**:
  - Run list/view MUST be tenant-scoped
  - Cross-tenant run access MUST be denied (403)
  - Only authorized roles can start “Add Policies”

- **NFR-002 (Data minimization & safe logging)**:
  - No access tokens, “Bearer ” strings, or raw Graph payload dumps in notifications, run failures, or logs
  - Persist only sanitized/allowlisted error fields and stable identifiers
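
As a backstop to allowlisting, failure text can be scrubbed before it is persisted. The sketch below is a minimal illustration under stated assumptions: the two patterns (bearer credentials and JWT-shaped strings), the `[REDACTED]` placeholder, and the 200-character cap are all invented for the example and are not an exhaustive secret detector.

```python
import re

# Illustrative patterns only; a real sanitizer should prefer allowlisted fields.
_BEARER = re.compile(r"Bearer\s*\S*", re.IGNORECASE)
_JWT = re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+")
MAX_LEN = 200  # hypothetical cap for the "sanitized short text"


def sanitize_reason(text: str) -> str:
    """Redact token-like substrings, flatten whitespace, and truncate."""
    text = _BEARER.sub("[REDACTED]", text)
    text = _JWT.sub("[REDACTED]", text)
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    return text[:MAX_LEN]
```

A sanitization test in the spirit of SC-004 would feed a message containing an `Authorization: Bearer ...` header through this function and assert that neither the word `Bearer` nor the token survives.
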

- **NFR-003 (Determinism)**: Given the same input selection, results are reproducible and the dedupe key is stable.

- **NFR-004 (No UI-time Graph calls)**: Rendering the Backup Set UI and Run detail pages MUST NOT require Graph calls.

### Key Entities *(include if feature involves data)*

- **BulkOperationRun**: Existing run record used for long-running operations (status + counts + failures).
- **BackupSet**: Target set to which policies are added.
- **BackupItem**: Persisted item rows representing a policy in a backup set.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Fast submit)**: Clicking “Add selected” returns without timing out and does not perform Graph work in-request (proved by guard tests).
- **SC-002 (Observable run)**: A Run is created/reused and is visible in the UI with status and progress counts.
- **SC-003 (No duplicates)**: Double-click does not create duplicate runs/jobs (proved by idempotency tests).
- **SC-004 (Safe failures)**: Failures show safe reason codes/summaries and do not store secrets/tokens (proved by sanitization tests).

## Rollout / Migration

- No destructive migrations required for 052.
- If additional idempotency indexing is needed beyond existing run infrastructure, add a minimal migration (backwards compatible).

## Open Questions (Optional)

- Should progress appear in an existing global progress widget (only if already supported; new widget work is out of scope)?