Ahmed Darrazi 6844bc1c17 spec: restore run wizard

2025-12-30 02:56:28 +01:00

Feature Specification: Restore Run Wizard (011)

Feature Branch: feat/011-restore-run-wizard
Created: 2025-12-30
Status: Draft
Input: Restore Run Wizard requirements (Safety First / Defensive Restore)

Overview

Implement Restore Runs as a multi-step Wizard (instead of a single “Create Restore Run” form) to enforce Safety First / Defensive Restore.

Restore is a high-risk workflow. The wizard must guide admins through explicit checkpoints: source selection → scoping → safety checks → preview → confirmation + execution.

Problem Statement

The current Restore Run creation is a single form that can lead to:

picking the wrong backup source
restoring too broad a scope unintentionally
executing without a structured “risk + preview + explicit confirmation” flow

Goals

Make restore a deliberate, stepwise process with strong defaults.
Make dry-run the default, and keep “Execute” disabled until all safety gates are satisfied.
Add server-side safety/conflict checks and persist results for auditability.
Provide a preview (diff summary at minimum) before allowing execution.

Non-Goals (v1)

Approval workflows / multi-person approvals (but design must not block future addition).
Perfect diff UX parity with Intune (basic normalized diff output is enough).
A generic wizard framework (restore-specific implementation is fine).

UX Principles

Dry-run default = ON
Wizard progression should slow the user down and force explicit decisions.
“Execute” stays disabled until:
- Preview has been completed
- No blocking checks exist
- “I reviewed the impact” checkbox is checked
- Tenant hard-confirm matches (Highlander principle)

Wizard Steps

Step 1 — Select Backup Set (Source of Truth)

Question: “What are we restoring from?”

Inputs

Backup Set (required)

Read-only

Snapshot timestamp
Tenant name
Count of policies/items
Types (Config / Security / Scripts …)

Validation

backup_set_id is required
Changing the backup set resets downstream state (scope, checks, preview, confirmation)
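
The reset rule can be sketched as follows. This is a minimal illustration in Python, not the Filament implementation; `WizardState` and `select_backup_set` are hypothetical names, and the fields are assumptions chosen to mirror the downstream steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WizardState:
    """Hypothetical in-memory view of the wizard's state."""
    backup_set_id: Optional[int] = None
    scope_mode: str = "all"
    requested_items: List[int] = field(default_factory=list)
    check_results: Optional[list] = None
    preview_summary: Optional[dict] = None
    acknowledged: bool = False
    tenant_confirm: str = ""


def select_backup_set(state: WizardState, backup_set_id: Optional[int]) -> WizardState:
    """Apply Step 1 validation and reset downstream state on change."""
    if backup_set_id is None:
        raise ValueError("backup_set_id is required")
    if backup_set_id == state.backup_set_id:
        # Same source: keep scope, checks, preview and confirmation.
        return state
    # Different source: start from defaults so nothing stale survives.
    return WizardState(backup_set_id=backup_set_id)
```

Returning a fresh state object, rather than clearing fields one by one, means a field added later cannot be forgotten in the reset.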

Step 2 — Define Restore Scope (Selectivity)

Question: “What exactly should be restored?”

Inputs

Scope mode: all (default) or selected
If selected: item multiselect with search + select all

Prefer grouped by type and platform
Mark “preview-only” types clearly
Foundations should be discoverable (scope tags, assignment filters, notification templates)

Notes

“Empty = all” only when scope mode is all (not when selected)
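
A minimal sketch of the "empty = all" rule, in Python for illustration only. `resolve_scope` is a hypothetical helper. Rejecting an empty or unknown selection in `selected` mode is an assumption: the spec only says that an empty selection must not be read as "all" there.

```python
from typing import List


def resolve_scope(scope_mode: str, requested_items: List[int],
                  all_item_ids: List[int]) -> List[int]:
    """Return the item IDs a restore run would cover."""
    if scope_mode == "all":
        # In "all" mode the selection is ignored; empty means everything.
        return list(all_item_ids)
    if scope_mode == "selected":
        if not requested_items:
            # Assumption: an empty selection is an error, never "all".
            raise ValueError("scope mode 'selected' requires at least one item")
        known = set(all_item_ids)
        unknown = [i for i in requested_items if i not in known]
        if unknown:
            raise ValueError(f"items not in backup set: {unknown}")
        return list(requested_items)
    raise ValueError(f"unknown scope mode: {scope_mode!r}")
```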

Step 3 — Safety & Conflict Checks (Defensive Layer)

Question: “Is this dangerous?”

Checks (server-side, persisted)

Target policy missing in target tenant?
Target policy newer than backup? (staleness / overwrite risk)
Assignment conflicts (e.g., mapping required / orphaned groups)
Scope tag conflicts (mapping required / missing)
Preview-only policies included in scope (should raise a warning and force dry-run)

Severity

❌ blocking
⚠️ warning
✅ safe

Rules

Blocking checks prevent execution.
The wizard may allow proceeding to preview, but must never allow execution while blockers exist.
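
One way to shape the check results and their summary is sketched below in Python. `Severity`, `CheckResult`, and `summarize_checks` are hypothetical names, and the fields on a result (`code`, `message`) are assumptions; only the three severity levels come from this spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Severity(str, Enum):
    BLOCKING = "blocking"
    WARNING = "warning"
    SAFE = "safe"


@dataclass(frozen=True)
class CheckResult:
    code: str          # hypothetical identifier, e.g. "target_policy_missing"
    severity: Severity
    message: str


def summarize_checks(results: Iterable[CheckResult]) -> Dict[str, object]:
    """Count results per severity and flag whether execution is blocked."""
    counts = {s.value: 0 for s in Severity}
    for result in results:
        counts[result.severity.value] += 1
    return {**counts, "has_blockers": counts["blocking"] > 0}
```

A summary with `has_blockers` set would still let the wizard move on to preview; only the Execute gate in Step 5 needs to consult it.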

Step 4 — Preview (Dry-Run Simulation)

Question: “What would happen?”

Outputs

Diff summary (at minimum):
- X policies changed
- Y assignments changed
- Z scope tags changed
Per-item normalized diff (nice-to-have for v1, but plan for it)
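
The diff summary could be derived from per-item comparisons roughly as follows. This is a Python sketch under stated assumptions: the facet keys (`settings`, `assignments`, `scope_tags`) and the function names are hypothetical, and each counter counts items whose facet differs, which is only one possible reading of "Y assignments changed".

```python
from typing import Any, Dict, Iterable, Mapping, Optional, Tuple

# Hypothetical mapping from summary counter to the item facet it compares.
FACETS = {
    "policies_changed": "settings",
    "assignments_changed": "assignments",
    "scope_tags_changed": "scope_tags",
}


def normalize(value: Any) -> Any:
    """Make comparison independent of key order and list order."""
    if isinstance(value, Mapping):
        return {k: normalize(v) for k, v in sorted(value.items())}
    if isinstance(value, (list, tuple)):
        return sorted((normalize(v) for v in value), key=repr)
    return value


def summarize_diff(
    pairs: Iterable[Tuple[Mapping[str, Any], Optional[Mapping[str, Any]]]]
) -> Dict[str, int]:
    """Count items whose backup state differs from the target state.

    Each pair is (backup_item, current_item); current_item is None when
    the policy does not exist in the target tenant.
    """
    summary = {name: 0 for name in FACETS}
    for backup, current in pairs:
        current = current or {}
        for name, facet in FACETS.items():
            # Treat a missing facet and an empty one as equal.
            before = normalize(backup.get(facet) or None)
            after = normalize(current.get(facet) or None)
            if before != after:
                summary[name] += 1
    return summary
```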

Defaults

“Preview only (Dry-run)” is ON by default

Step 5 — Confirm & Execute (Point of No Return)

Question: “Do you really want to do this?”

Confirmations

Checkbox: “I reviewed the impact”
Tenant hard-confirm input (must match tenant display identifier)
Environment badge (Prod/Test) highly visible (frozen at run start for audit)

Rules

Execute is disabled if any of the following holds:
- dry_run = true
- preview not completed
- blockers exist
- tenant confirm mismatch
- acknowledgement unchecked
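
The gate can be expressed as a single server-side function that returns the reasons Execute is disabled. The Python sketch below is illustrative: `ExecuteGate` and `reasons_execute_disabled` are hypothetical names, the preview condition is taken from the UX Principles section, and comparing the typed confirmation exactly (after trimming whitespace) against the frozen tenant label is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExecuteGate:
    dry_run: bool
    preview_completed: bool
    blocking_checks: int
    acknowledged: bool
    tenant_confirm: str    # what the admin typed
    highlander_label: str  # frozen tenant identifier


def reasons_execute_disabled(gate: ExecuteGate) -> List[str]:
    """Return every unmet gate; an empty list means Execute may be enabled."""
    reasons = []
    if gate.dry_run:
        reasons.append("dry-run is enabled")
    if not gate.preview_completed:
        reasons.append("preview has not been completed")
    if gate.blocking_checks > 0:
        reasons.append("blocking checks exist")
    # Assumption: exact, case-sensitive match after trimming whitespace.
    if gate.tenant_confirm.strip() != gate.highlander_label:
        reasons.append("tenant confirmation does not match")
    if not gate.acknowledged:
        reasons.append("impact has not been acknowledged")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, lets the UI explain why the button is disabled while the server enforces the same rule on submit.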

Domain Model (v1-aligned)

We already have a restore_runs aggregate (restore_runs table) with:

backup_set_id, requested_items, preview, results, status, metadata, timestamps, and group_mapping.

v1 approach

Keep the existing primary key type (bigint) to avoid a disruptive migration.
Extend the lifecycle/status semantics and persist wizard computations (checks + diff summaries) in structured fields:
- Use metadata for wizard state by default; add dedicated JSON columns only if needed.

RestoreRun Lifecycle (proposed statuses)

draft → scoped → checked → previewed → queued → running → completed|partial|failed|cancelled
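
The proposed lifecycle reads as a small state machine. The Python sketch below encodes only the transitions written in the chain above; `TRANSITIONS` and `transition` are hypothetical names. Back-navigation (for example, returning to draft after the backup set changes) is not modeled here and would need additional edges.

```python
from typing import Dict, Set

# Forward transitions exactly as listed in the proposed lifecycle.
TRANSITIONS: Dict[str, Set[str]] = {
    "draft": {"scoped"},
    "scoped": {"checked"},
    "checked": {"previewed"},
    "previewed": {"queued"},
    "queued": {"running"},
    "running": {"completed", "partial", "failed", "cancelled"},
    # Terminal statuses have no outgoing transitions.
    "completed": set(),
    "partial": set(),
    "failed": set(),
    "cancelled": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if current not in TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```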

Persisted Wizard State (minimum)

backup_set_id (existing)
requested_items (selected IDs, existing)
metadata.scope_mode (all|selected)
metadata.environment (prod|test)
metadata.highlander_label (tenant identifier string, frozen)
metadata.check_summary + metadata.check_results (Step 3)
metadata.preview_summary + metadata.preview_diffs (Step 4; diffs may be truncated/limited)
metadata.confirmed_at, metadata.confirmed_by (Step 5)
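
For orientation, the metadata payload might look like the Python dictionary below. The keys come from the list above; every value, and the inner structure of the summaries and results, is a placeholder assumption rather than a defined schema. `missing_wizard_keys` is a hypothetical helper.

```python
import json
from typing import Any, List, Mapping

WIZARD_KEYS = (
    "scope_mode", "environment", "highlander_label",
    "check_summary", "check_results",
    "preview_summary", "preview_diffs",
    "confirmed_at", "confirmed_by",
)

# Illustrative values only; the nested shapes are assumptions.
example_metadata = {
    "scope_mode": "selected",
    "environment": "prod",
    "highlander_label": "Example Tenant",
    "check_summary": {"blocking": 0, "warning": 1, "safe": 3},
    "check_results": [
        {"code": "target_newer_than_backup", "severity": "warning",
         "message": "Target policy was modified after the backup."},
    ],
    "preview_summary": {"policies_changed": 2, "assignments_changed": 1,
                        "scope_tags_changed": 0},
    "preview_diffs": [],     # may be truncated/limited
    "confirmed_at": None,    # set in Step 5
    "confirmed_by": None,
}


def missing_wizard_keys(metadata: Mapping[str, Any]) -> List[str]:
    """List wizard keys absent from a metadata payload."""
    return [key for key in WIZARD_KEYS if key not in metadata]


# The payload must survive a JSON round trip to fit a JSON column.
assert json.loads(json.dumps(example_metadata)) == example_metadata
```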

Services / Responsibilities

RestoreScopeBuilder: build selectable restore items (grouped, searchable), include foundations & mark preview-only.
RestoreRiskChecker: run safety checks, return structured results + summary.
RestoreDiffGenerator: generate diff summary (and optionally per-item diffs) for preview.
RestoreExecutor: execute restore (idempotent, tenant/run locking), write detailed outcomes.
RestoreRunPolicy: enforce invariants (no execution without preview + confirmations).

User Scenarios & Testing (mandatory)

User Story 1 — Wizard-driven Restore Run (Priority: P1)

As an admin, I can create a restore run via a 5-step wizard and I cannot accidentally execute without preview + explicit confirmations.

Why this priority: This is the safety foundation; without it, restore remains a risky workflow.

Independent Test: In Filament, create a restore run with dry-run enabled, view checks + preview, and confirm that Execute stays disabled until all gates are satisfied.

Acceptance Scenarios

Given I have selected a backup set and moved to a later step, When I change the backup set, Then scope, check, and preview state are reset.
Given I keep dry-run enabled, When I reach Step 5, Then Execute is disabled.
Given I disable dry-run, When I have not completed preview, Then Execute is disabled.

User Story 2 — Safety Checks block execution (Priority: P1)

As an admin, I see blocking vs warning checks, and execution is blocked when blockers exist.

Why this priority: Defensive restore requires an explicit risk layer.

Independent Test: Create a scope that triggers a blocking check and verify execution cannot proceed.

Acceptance Scenarios

Given a blocking check exists, When I reach Step 5, Then Execute remains disabled and blockers are visible.
Given only warnings exist and dry-run is off, When I acknowledge the impact and hard-confirm the tenant, Then I can execute.

User Story 3 — Preview diff summary (Priority: P2)

As an admin, I can preview what would change before executing restore.

Why this priority: A restore without preview is operationally unsafe.

Independent Test: Run Step 4 preview and verify diff summary is computed and persisted on the RestoreRun.

Acceptance Scenarios

Given I scoped items, When I run preview, Then I see a summary (changed policies count) and it persists on the restore run.

Edge Cases

Very large backup sets (hundreds/thousands of items): selection/search must remain responsive.
Switching backup set mid-flow resets downstream state safely.
Policies not present in target tenant: shown as warning/blocker depending on restore mode.
RBAC-limited tenant setup: checks must clearly show “inventory/restore may be partial”.

Functional Requirements

FR-011.1: System MUST implement Restore Run creation as a 5-step wizard in Filament.
FR-011.2: System MUST default dry_run = true and prevent execution while dry-run is enabled.
FR-011.3: System MUST run server-side safety checks and persist results (summary + details) for audit.
FR-011.4: System MUST generate at least a diff summary on preview and persist it.
FR-011.5: System MUST require explicit acknowledgement + tenant hard-confirm before allowing execution.
FR-011.6: System MUST freeze environment badge and tenant label for audit on run creation.
FR-011.7: System MUST keep execution disabled if any blocking checks exist.
FR-011.8: System MUST record execution outcomes and leave an auditable trail (existing audit log patterns).

Success Criteria

SC-011.1: Admins can only execute after preview + confirmations; no accidental execution path exists.
SC-011.2: Blocking checks reliably prevent execution.
SC-011.3: Preview produces a persisted summary for every run.
