From 99eb45cbd5534826c476d86d14c24b3bc91157db Mon Sep 17 00:00:00 2001
From: Ahmed Darrazi
Date: Sun, 11 Jan 2026 01:19:59 +0100
Subject: [PATCH] spec: add 049 backup/restore job orchestration

---
 .../checklists/requirements.md                |  36 +++++
 .../spec.md                                   | 148 ++++++++++++++++++
 2 files changed, 184 insertions(+)
 create mode 100644 specs/049-backup-restore-job-orchestration/checklists/requirements.md
 create mode 100644 specs/049-backup-restore-job-orchestration/spec.md

diff --git a/specs/049-backup-restore-job-orchestration/checklists/requirements.md b/specs/049-backup-restore-job-orchestration/checklists/requirements.md
new file mode 100644
index 0000000..b579f09
--- /dev/null
+++ b/specs/049-backup-restore-job-orchestration/checklists/requirements.md
@@ -0,0 +1,36 @@
+# Specification Quality Checklist: Backup/Restore Job Orchestration (049)
+
+**Purpose**: Validate specification completeness and quality before proceeding to planning
+**Created**: 2026-01-11
+**Feature**: [specs/049-backup-restore-job-orchestration/spec.md](../spec.md)
+
+## Content Quality
+
+- [x] No implementation details (languages, frameworks, APIs)
+- [x] Focused on user value and business needs
+- [x] Written for non-technical stakeholders
+- [x] All mandatory sections completed
+
+## Requirement Completeness
+
+- [x] No [NEEDS CLARIFICATION] markers remain
+- [x] Requirements are testable and unambiguous
+- [x] Success criteria are measurable
+- [x] Success criteria are technology-agnostic (no implementation details)
+- [x] All acceptance scenarios are defined
+- [x] Edge cases are identified
+- [x] Scope is clearly bounded
+- [x] Dependencies and assumptions identified
+
+## Feature Readiness
+
+- [x] All functional requirements have clear acceptance criteria
+- [x] User scenarios cover primary flows
+- [x] Feature meets measurable outcomes defined in Success Criteria
+- [x] No implementation details leak into specification
+
+## Notes
+
+- Items marked incomplete require spec updates before `/speckit.clarify` or `/speckit.plan`
+
+- Constitution alignment text references Microsoft Graph; this is a process requirement and not an implementation constraint.
diff --git a/specs/049-backup-restore-job-orchestration/spec.md b/specs/049-backup-restore-job-orchestration/spec.md
new file mode 100644
index 0000000..8039727
--- /dev/null
+++ b/specs/049-backup-restore-job-orchestration/spec.md
@@ -0,0 +1,148 @@
+# Feature Specification: Backup/Restore Job Orchestration (049)
+
+**Feature Branch**: `feat/049-backup-restore-job-orchestration`
+**Created**: 2026-01-11
+**Status**: Draft
+**Input**: Ensure Backup/Restore “start/execute” actions never run inline in an interactive request; they run via background processing with run records and visible progress.
+
+## Purpose
+
+All Backup/Restore “Start/Execute” actions run exclusively via background processing with Run Records and visible progress. This prevents timeouts, double-click duplication, throttling issues, and improves reliability at MSP scale.
+
+## Non-Goals (Phase 1)
+
+- No new directory/group inventory or name resolution features (separate initiative)
+- No changes to external service contracts unless required for orchestration safety
+- No new promotion feature (e.g., DEV→PROD) (separate initiative)
+
+## User Scenarios & Testing *(mandatory)*
+
+### User Story 1 - Capture snapshot runs in background (Priority: P1)
+
+An admin can start a “capture snapshot” operation without the UI hanging or timing out, and can see progress plus the final result.
+
+**Why this priority**: Snapshot capture is a core workflow and a common source of long-running requests.
+
+**Independent Test**: Starting a snapshot capture immediately returns to the UI with a queued Run Record that later transitions to a terminal state (success/failed/partial) and can be inspected.
+
+**Acceptance Scenarios**:
+
+1. **Given** an admin has access to a tenant, **When** they start “capture snapshot”, **Then** the UI confirms it was queued and shows a link to the Run Record.
+2. **Given** a capture snapshot run is executing, **When** the admin views the run, **Then** they see progress (items done vs total) and any safe error summaries.
+
+---
+
+### User Story 2 - Restore runs in background with per-item results (Priority: P1)
+
+An admin can start a “restore to Intune” or “re-run restore” operation as a background run and later inspect item-level outcomes and errors.
+
+**Why this priority**: Restore is high-impact and must be resilient, observable, and safe under retries.
+
+**Independent Test**: Starting restore creates a Run Record and item results that remain accessible even if the external service is unavailable.
+
+**Acceptance Scenarios**:
+
+1. **Given** an admin starts a restore, **When** they confirm the action, **Then** the UI queues a run and returns immediately (no long-running request).
+2. **Given** a restore run finishes with mixed outcomes, **When** the admin views the run details, **Then** they see succeeded/failed counts and a safe error summary per failed item.
+
+---
+
+### User Story 3 - Backup set create/capture runs in background (Priority: P2)
+
+An admin can create a backup set and optionally start a capture/sync operation without the request doing heavy work.
+
+**Why this priority**: Creating backup sets is frequent and should not be coupled to long-running capture logic.
+
+**Independent Test**: Creating a backup set returns quickly and any capture/sync work appears as a run with progress.
+
+**Acceptance Scenarios**:
+
+1. **Given** an admin creates a backup set with capture enabled, **When** they submit, **Then** the backup set is created and a capture run is queued.
+
+---
+
+### User Story 4 - Dry-run/preview runs in background (Priority: P2)
+
+An admin can run a dry-run/preview without UI timeouts, and the preview results are persisted and shown in the UI.
+
+**Why this priority**: Preview supports safe change management and must remain usable even when the external service is slow or down.
+
+**Independent Test**: Starting preview immediately creates a run; once finished, preview outputs are visible and reusable.
+
+**Acceptance Scenarios**:
+
+1. **Given** an admin starts a preview run, **When** the run completes, **Then** the UI shows preview results without requiring re-execution.
+
+### Edge Cases
+
+- Double-clicking an action rapidly
+- Retrying while an identical run is already queued or running
+- External service is unavailable (e.g., throttling or outage)
+- A run gets stuck or exceeds expected duration
+- Permissions change after a run was queued
+
+## Requirements *(mandatory)*
+
+**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls or any write/change behavior,
+the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.
+
+### Functional Requirements
+
+- **FR-001 Job-only execution**: The system MUST execute the following operations via background processing and MUST NOT perform heavy work inline during the interactive request:
+  - Capture snapshot
+  - Backup set create with capture/sync (when capture is triggered)
+  - Restore to Intune
+  - Re-run restore
+  - Restore dry-run/preview
+
+- **FR-002 Run Records**: Each operation start MUST create (or deterministically re-use) a Run Record before the work begins, containing:
+  - Tenant identity
+  - Initiator identity (user reference or audit reference)
+  - Operation type and optional target object reference
+  - Status lifecycle: queued → running → (succeeded | failed | partial | canceled)
+  - Started/finished timestamps
+  - Item counts: total / succeeded / failed
+  - Safe error code and safe error context (no secrets)
+
+- **FR-003 Progress visibility**: While a run is executing, the system MUST provide visible progress in the admin UI and MUST emit in-app notifications for key state transitions (queued/running/completed/failed).
+
+- **FR-004 Idempotency & concurrency control**: The system MUST prevent uncontrolled duplicate execution due to double-clicks/retries by enforcing a deterministic de-duplication rule keyed by (tenant + operation type + target object) or (tenant + run id). When an identical run is already queued/running, the UI MUST show “already queued/running” and link to the existing run.
+
+- **FR-005 Deterministic outcome persistence**: The system MUST persist per-item outcomes for operations that act on multiple items, including status and a safe error summary, so results can be viewed later without relying on logs.
+
+- **FR-006 Tenant isolation & authorization**: Run visibility and execution MUST be tenant-scoped. Only authorized admins can start operations, and users MUST NOT be able to view or start runs across tenants.
+
+- **FR-007 Safety rules**: Preview/dry-run MUST be safe (no writes). Live restore MUST remain guarded with explicit confirmation and an auditable trail consistent with existing safety practices.
+
+- **FR-008 Resilience**: The system MUST handle external service throttling/outages gracefully, including retries with backoff when appropriate, and MUST end runs in a clear terminal state (failed/partial) rather than silently failing.
+
+- **FR-009 Safe logging & data minimization**: The system MUST NOT store secrets/tokens in Run Records, notifications, or error contexts. Error context MUST be limited to a defined, safe set of fields.
+
+### Acceptance Checks
+
+- Starting any in-scope operation returns quickly with a queued Run Record link.
+- A Run Record always exists before background work begins and reaches a terminal state.
+- Progress and state changes are visible in the UI via progress display and in-app notifications.
+- Duplicate start attempts for the same tenant + operation + target do not create uncontrolled duplicate execution.
+- Item-level outcomes and safe error summaries are viewable after completion.
+- Preview/dry-run never performs writes.
+
+### Key Entities *(include if feature involves data)*
+
+- **Run Record**: A tenant-scoped record representing one started operation and its lifecycle, progress, and summary outcome.
+- **Run Item Result**: A tenant-scoped record representing the outcome for a single item processed as part of a Run Record.
+- **Notification Event**: A tenant-scoped event surfaced to the admin UI to communicate run state changes.
+
+## Success Criteria *(mandatory)*
+
+### Measurable Outcomes
+
+- **SC-001**: For 95% of operation starts, the UI confirms “queued” within 2 seconds.
+- **SC-002**: Double-clicking an operation start results in at most one queued/running run for the same tenant + operation + target.
+- **SC-003**: 99% of runs end in a clear terminal state (succeeded/failed/partial/canceled) with a human-readable summary.
+- **SC-004**: Admins can locate the latest run status for an operation in under 30 seconds without requiring access to system logs.
+
+## Assumptions
+
+- This feature builds on the UI safety constraints from 048: admin pages must remain usable even when the external service API is unavailable.
+- Run Records and item results are retained long enough to support operational troubleshooting and audits, with retention managed as a separate policy.
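
Illustrative sketch, outside the patch and the spec (which stays technology-agnostic): one possible reading of the FR-002 status lifecycle and the FR-004 de-duplication rule keyed by tenant + operation type + target object. Every name here (`RunStatus`, `RunRecord`, `RunRegistry`, `start`, `transition`) is hypothetical, the store is an in-memory dict, and allowing `queued → canceled` is an assumption; a real system would likely need an atomic uniqueness guarantee in its data store rather than a linear scan.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class RunStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PARTIAL = "partial"
    CANCELED = "canceled"


# Statuses that block a duplicate start for the same key (FR-004).
ACTIVE = {RunStatus.QUEUED, RunStatus.RUNNING}

# Allowed lifecycle transitions (FR-002). Terminal states have no successors.
# Assumption: a queued run may be canceled before it starts running.
ALLOWED = {
    RunStatus.QUEUED: {RunStatus.RUNNING, RunStatus.CANCELED},
    RunStatus.RUNNING: {
        RunStatus.SUCCEEDED,
        RunStatus.FAILED,
        RunStatus.PARTIAL,
        RunStatus.CANCELED,
    },
}


@dataclass
class RunRecord:
    run_id: int
    tenant_id: str
    operation: str
    target: Optional[str]
    status: RunStatus = RunStatus.QUEUED

    def transition(self, new_status: RunStatus) -> None:
        """Move to new_status, rejecting anything outside the lifecycle."""
        if new_status not in ALLOWED.get(self.status, set()):
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.status = new_status


class RunRegistry:
    """Hypothetical in-memory stand-in for the Run Record store."""

    def __init__(self) -> None:
        self._runs: Dict[int, RunRecord] = {}
        self._next_id = 1

    def start(
        self, tenant_id: str, operation: str, target: Optional[str] = None
    ) -> Tuple[RunRecord, bool]:
        """Return (run, created). Re-use an active run with the same key."""
        key = (tenant_id, operation, target)
        for run in self._runs.values():
            same_key = (run.tenant_id, run.operation, run.target) == key
            if same_key and run.status in ACTIVE:
                return run, False  # caller shows "already queued/running"
        run = RunRecord(self._next_id, tenant_id, operation, target)
        self._runs[run.run_id] = run
        self._next_id += 1
        return run, True
```

Under these assumptions, a second `start("tenant-a", "restore", "backup-set-1")` issued while the first run is still queued or running returns the existing record with `created == False`, and a new run for the same key is only created once the earlier one has reached a terminal state.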