TenantAtlas/specs/011-restore-run-wizard/spec.md
ahmido 43efd30922 spec: restore run wizard (#14)
Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #14
2025-12-30 02:00:00 +00:00

225 lines
9.0 KiB
Markdown

# Feature Specification: Restore Run Wizard (011)

**Feature Branch**: `feat/011-restore-run-wizard`

**Created**: 2025-12-30

**Status**: Draft

**Input**: Restore Run Wizard requirements (Safety First / Defensive Restore)

## Overview
Implement **Restore Runs** as a **multi-step Wizard** (instead of a single “Create Restore Run” form) to enforce **Safety First / Defensive Restore**.
Restore is a high-risk workflow. The wizard must guide admins through explicit checkpoints:
source selection → scoping → safety checks → preview → confirmation + execution.
## Problem Statement
The current Restore Run creation is a single form that can lead to:
- picking the wrong backup source
- restoring too broad a scope unintentionally
- executing without a structured “risk + preview + explicit confirmation” flow
## Goals
- Make restore a **deliberate, stepwise** process with strong defaults.
- Make **dry-run** the default, and keep “Execute” disabled until all safety gates are satisfied.
- Add **server-side safety/conflict checks** and persist results for auditability.
- Provide a **preview** (diff summary at minimum) before allowing execution.
## Non-Goals (v1)
- Approval workflows / multi-person approvals (the design must not preclude adding them later).
- Perfect diff UX parity with Intune (basic normalized diff output is enough).
- A generic wizard framework (restore-specific implementation is fine).
---
## UX Principles
- **Dry-run default = ON**
- Wizard progression should slow the user down and force explicit decisions.
- “Execute” stays disabled until:
  - Dry-run has been turned off
  - Preview has been completed
  - No blocking checks exist
  - “I reviewed the impact” checkbox is checked
  - Tenant hard-confirm matches (Highlander principle)
---
## Wizard Steps
### Step 1 — Select Backup Set (Source of Truth)
**Question:** “What are we restoring from?”
**Inputs**
- Backup Set (required)
**Read-only**
- Snapshot timestamp
- Tenant name
- Count of policies/items
- Types (Config / Security / Scripts …)
**Validation**
- `backup_set_id` is required
- Changing the backup set resets downstream state (scope, checks, preview, confirmation)
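The following is a minimal sketch of the reset rule, written in Python for illustration only (the application itself is built on Filament). `WizardState` and `select_backup_set` are hypothetical names, not existing classes. Re-selecting the same backup set keeps downstream work. Picking a different one discards scope, checks, preview, and confirmation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WizardState:
    """Hypothetical in-memory view of the restore wizard's form state."""

    backup_set_id: Optional[int] = None
    scope_mode: str = "all"
    requested_items: List[int] = field(default_factory=list)
    check_results: Optional[list] = None
    preview_summary: Optional[dict] = None
    acknowledged: bool = False
    tenant_confirm: str = ""


def select_backup_set(state: WizardState, backup_set_id: int) -> WizardState:
    """Apply a Step 1 selection.

    Re-selecting the current backup set keeps downstream work. Selecting a
    different one returns a fresh state, so scope, checks, preview and
    confirmation computed against the old source can never gate execution.
    """
    if state.backup_set_id == backup_set_id:
        return state
    return WizardState(backup_set_id=backup_set_id)
```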
### Step 2 — Define Restore Scope (Selectivity)
**Question:** “What exactly should be restored?”
**Inputs**
- Scope mode: `all` (default) or `selected`
- If `selected`: item multiselect with search + select all
**UI**
- Group items by **type** and **platform** where possible
- Mark “preview-only” types clearly
- Foundations should be discoverable (scope tags, assignment filters, notification templates)
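One way `RestoreScopeBuilder` could group items for the multiselect is sketched below in Python. The `RestoreItem` fields and the type/platform values are assumptions for illustration. The same applies to representing foundations as their own types, for example a hypothetical `scopeTag` type with platform `all`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RestoreItem:
    id: int
    name: str
    type: str          # illustrative, e.g. "deviceConfiguration" or "scopeTag"
    platform: str      # illustrative, e.g. "windows", "ios", or "all" for foundations
    preview_only: bool = False


def group_items(items: Iterable[RestoreItem]) -> Dict[Tuple[str, str], List[RestoreItem]]:
    """Group items by (type, platform); groups and members are sorted for stable display."""
    groups: Dict[Tuple[str, str], List[RestoreItem]] = defaultdict(list)
    for item in items:
        groups[(item.type, item.platform)].append(item)
    return {
        key: sorted(members, key=lambda i: i.name.lower())
        for key, members in sorted(groups.items(), key=lambda kv: kv[0])
    }


def option_label(item: RestoreItem) -> str:
    """Label shown in the multiselect; preview-only items are marked explicitly."""
    return f"{item.name} (preview-only)" if item.preview_only else item.name
```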
**Notes**
- An empty selection means “all” only when scope mode is `all`; in `selected` mode, an empty selection must not be treated as “all”.
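The following sketch resolves the effective item list from the scope mode. `resolve_scope` is a hypothetical helper. It treats an empty `selected` scope as a validation error. That is one reasonable reading of the note above, not something the spec mandates.

```python
from typing import List, Sequence


def resolve_scope(scope_mode: str, selected_ids: Sequence[int], available_ids: Sequence[int]) -> List[int]:
    """Return the item IDs a run will touch, in backup-set order."""
    if scope_mode == "all":
        # Only here does "empty" mean "everything": the selection is ignored.
        return list(available_ids)
    if scope_mode == "selected":
        if not selected_ids:
            # An empty selection is rejected rather than widened to "all".
            raise ValueError("scope mode 'selected' requires at least one item")
        wanted = set(selected_ids)
        unknown = wanted - set(available_ids)
        if unknown:
            raise ValueError(f"items not in backup set: {sorted(unknown)}")
        return [i for i in available_ids if i in wanted]
    raise ValueError(f"unknown scope mode: {scope_mode!r}")
```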
### Step 3 — Safety & Conflict Checks (Defensive Layer)
**Question:** “Is this dangerous?”
**Checks (server-side, persisted)**
- Target policy missing in target tenant?
- Target policy newer than backup? (staleness / overwrite risk)
- Assignment conflicts (e.g., mapping required / orphaned groups)
- Scope tag conflicts (mapping required / missing)
- Preview-only policies included in scope (should raise a warning and automatically be handled as dry-run)
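The first two checks reduce to comparing the backup snapshot timestamp with the target policy's current state. This sketch assumes each target policy exposes a timezone-aware last-modified timestamp, with `None` when the policy is absent. `compare_with_target` is hypothetical. It deliberately returns a status rather than a severity, because the spec leaves the severity of a missing policy dependent on restore mode.

```python
from datetime import datetime, timezone
from typing import Optional


def compare_with_target(snapshot_taken_at: datetime, target_modified_at: Optional[datetime]) -> str:
    """Classify one scoped item against the target tenant.

    Returns:
      "missing": the policy does not exist in the target tenant
      "stale":   the target was modified after the snapshot, so restoring
                 would overwrite newer changes
      "ok":      the target has not changed since the snapshot
    Both timestamps must be timezone-aware; Python raises TypeError when
    comparing naive and aware datetimes.
    """
    if target_modified_at is None:
        return "missing"
    if target_modified_at > snapshot_taken_at:
        return "stale"
    return "ok"


if __name__ == "__main__":
    snapshot = datetime(2025, 12, 1, tzinfo=timezone.utc)
    print(compare_with_target(snapshot, datetime(2025, 12, 2, tzinfo=timezone.utc)))  # stale
```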
**Severity**
- ❌ blocking
- ⚠️ warning
- ✅ safe
**Rules**
- Blocking checks prevent execution.
- The wizard may allow proceeding to preview, but must never allow execution while blockers exist.
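The sketch below shows one possible shape for check results and the persisted summary (`metadata.check_summary`). The `Severity` values mirror the three levels above. `CheckResult`, the check codes, and the summary keys are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional, Union


class Severity(str, Enum):
    BLOCKING = "blocking"
    WARNING = "warning"
    SAFE = "safe"


@dataclass(frozen=True)
class CheckResult:
    code: str                      # hypothetical, e.g. "target_newer_than_backup"
    severity: Severity
    message: str
    item_id: Optional[int] = None  # None for run-level checks


def summarize_checks(results: Iterable[CheckResult]) -> Dict[str, Union[int, bool]]:
    """Build the summary persisted alongside the detailed results."""
    counts = Counter(r.severity for r in results)
    return {
        "blocking": counts[Severity.BLOCKING],
        "warning": counts[Severity.WARNING],
        "safe": counts[Severity.SAFE],
        # Execution must stay disabled whenever this is True.
        "has_blockers": counts[Severity.BLOCKING] > 0,
    }
```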
### Step 4 — Preview (Dry-Run Simulation)
**Question:** “What would happen?”
**Outputs**
- Diff summary (at minimum):
  - X policies changed
  - Y assignments changed
  - Z scope tags changed
- Per-item normalized diff (nice-to-have for v1, but plan for it)
**Defaults**
- “Preview only (Dry-run)” is ON by default
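The sketch below computes the Step 4 summary from normalized per-item comparisons. It rests on three hypothetical assumptions:
- each item is a dict with `settings`, `assignments`, and `scope_tag_ids`;
- `VOLATILE_KEYS` lists fields ignored as noise;
- each counter counts *items* whose part differs, not individual assignments or tags.

```python
import json
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: fields that change on every read and should not count as a diff.
VOLATILE_KEYS = {"id", "createdDateTime", "lastModifiedDateTime", "version"}


def normalize(settings: dict) -> dict:
    """Drop volatile keys so only meaningful configuration is compared."""
    return {k: v for k, v in settings.items() if k not in VOLATILE_KEYS}


def canonical(assignments: list) -> List[str]:
    """Order-insensitive form of an assignment list."""
    return sorted(json.dumps(a, sort_keys=True) for a in assignments)


def preview(pairs: Iterable[Tuple[dict, Optional[dict]]]) -> Tuple[Dict[str, int], List[dict]]:
    """Compare (backup item, current target item) pairs without writing anything.

    A target of None means the policy is missing in the target tenant; it is
    compared against an empty item, so everything it would create counts.
    """
    summary = {"policies_changed": 0, "assignments_changed": 0, "scope_tags_changed": 0}
    diffs = []
    for backup, target in pairs:
        target = target or {}
        after = normalize(backup.get("settings", {}))
        before = normalize(target.get("settings", {}))
        changed = sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
        assignments_differ = canonical(target.get("assignments", [])) != canonical(backup.get("assignments", []))
        tags_differ = set(target.get("scope_tag_ids", [])) != set(backup.get("scope_tag_ids", []))
        if changed:
            summary["policies_changed"] += 1
        if assignments_differ:
            summary["assignments_changed"] += 1
        if tags_differ:
            summary["scope_tags_changed"] += 1
        if changed or assignments_differ or tags_differ:
            diffs.append({
                "item_id": backup.get("id"),
                "changed_settings": changed,
                "assignments_changed": assignments_differ,
                "scope_tags_changed": tags_differ,
            })
    return summary, diffs
```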
### Step 5 — Confirm & Execute (Point of No Return)
**Question:** “Do you really want to do this?”
**Confirmations**
- Checkbox: “I reviewed the impact”
- Tenant hard-confirm input (must match tenant display identifier)
- Environment badge (Prod/Test) highly visible (frozen at run start for audit)
**Rules**
- Execute is disabled if any of the following holds:
  - `dry_run = true`
  - preview has not been completed
  - blockers exist
  - tenant confirm mismatch
  - acknowledgement unchecked
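The execute gate can be expressed as a single predicate that returns the reasons Execute is disabled. The UI can display those reasons, and a server-side guard (for example `RestoreRunPolicy`) can re-check them before dispatching. Field names are hypothetical. The exact, case-sensitive tenant comparison is also an assumption, since the spec only says the input must match.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExecuteGate:
    dry_run: bool
    preview_completed: bool
    blocking_checks: int
    acknowledged: bool
    tenant_confirm: str      # what the admin typed
    highlander_label: str    # tenant identifier frozen on the run


def execute_blockers(gate: ExecuteGate) -> List[str]:
    """Return every reason Execute must stay disabled; an empty list means allowed."""
    reasons = []
    if gate.dry_run:
        reasons.append("dry_run_enabled")
    if not gate.preview_completed:
        reasons.append("preview_missing")
    if gate.blocking_checks > 0:
        reasons.append("blocking_checks")
    if not gate.acknowledged:
        reasons.append("not_acknowledged")
    # Assumption: exact, case-sensitive match; an empty label never matches.
    if not gate.highlander_label or gate.tenant_confirm != gate.highlander_label:
        reasons.append("tenant_mismatch")
    return reasons


def can_execute(gate: ExecuteGate) -> bool:
    return not execute_blockers(gate)
```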
---
## Domain Model (v1-aligned)
We already have a `RestoreRun` aggregate (`restore_runs` table) with:
- `backup_set_id`, `requested_items`, `preview`, `results`, `status`, `metadata`, timestamps, and `group_mapping`.
**v1 approach**
- Keep the existing primary key type (bigint) to avoid a disruptive migration.
- Extend the lifecycle/status semantics and persist wizard computations (checks + diff summaries) in structured fields:
  - Store wizard state in `metadata` by default; add dedicated JSON columns only if needed.
### RestoreRun Lifecycle (proposed statuses)
`draft → scoped → checked → previewed → queued → running → completed|partial|failed|cancelled`
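A minimal transition table for the proposed statuses is sketched below in Python. It encodes only the forward path shown above, so `queued` is unreachable without passing through `previewed`. Backward moves (for example after the backup set changes) and cancelling before `running` are not modelled, since the spec does not define them.

```python
from typing import Dict, FrozenSet

TERMINAL: FrozenSet[str] = frozenset({"completed", "partial", "failed", "cancelled"})

ALLOWED: Dict[str, FrozenSet[str]] = {
    "draft": frozenset({"scoped"}),
    "scoped": frozenset({"checked"}),
    "checked": frozenset({"previewed"}),
    "previewed": frozenset({"queued"}),
    "queued": frozenset({"running"}),
    "running": TERMINAL,
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in the transition table."""


def transition(current: str, target: str) -> str:
    """Validate and return the next status; terminal statuses have no successors."""
    if target not in ALLOWED.get(current, frozenset()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```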
### Persisted Wizard State (minimum)
- `backup_set_id` (existing)
- `requested_items` (selected IDs, existing)
- `metadata.scope_mode` (`all|selected`)
- `metadata.environment` (`prod|test`)
- `metadata.highlander_label` (tenant identifier string, frozen)
- `metadata.check_summary` + `metadata.check_results` (Step 3)
- `metadata.preview_summary` + `metadata.preview_diffs` (Step 4; diffs may be truncated/limited)
- `metadata.confirmed_at`, `metadata.confirmed_by` (Step 5)
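The sketch below models the `metadata` payload as a Python dataclass. The field names follow the list above. The types, the `MAX_PREVIEW_DIFFS` cap, and the `diffs_total`/`diffs_truncated` summary keys are assumptions, showing one way to keep truncation visible for audit.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

MAX_PREVIEW_DIFFS = 200  # assumption: cap so metadata stays small


@dataclass
class RestoreRunMetadata:
    scope_mode: str         # "all" | "selected"
    environment: str        # "prod" | "test", frozen when the run is created
    highlander_label: str   # tenant identifier, frozen when the run is created
    check_summary: Optional[dict] = None
    check_results: List[dict] = field(default_factory=list)
    preview_summary: Optional[dict] = None
    preview_diffs: List[dict] = field(default_factory=list)
    confirmed_at: Optional[str] = None  # ISO 8601, UTC
    confirmed_by: Optional[int] = None  # hypothetical user ID


def record_preview(meta: RestoreRunMetadata, summary: dict, diffs: List[dict],
                   limit: int = MAX_PREVIEW_DIFFS) -> None:
    """Store the Step 4 result, truncating per-item diffs but recording that it happened."""
    meta.preview_summary = {**summary, "diffs_total": len(diffs), "diffs_truncated": len(diffs) > limit}
    meta.preview_diffs = list(diffs[:limit])


def record_confirmation(meta: RestoreRunMetadata, user_id: int, now: Optional[datetime] = None) -> None:
    """Stamp Step 5 confirmation for the audit trail."""
    meta.confirmed_at = (now or datetime.now(timezone.utc)).isoformat()
    meta.confirmed_by = user_id


def to_metadata_json(meta: RestoreRunMetadata) -> str:
    """Serialize for the `metadata` JSON column."""
    return json.dumps(asdict(meta), sort_keys=True)
```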
---
## Services / Responsibilities
- **RestoreScopeBuilder**: build selectable restore items (grouped, searchable), include foundations & mark preview-only.
- **RestoreRiskChecker**: run safety checks, return structured results + summary.
- **RestoreDiffGenerator**: generate diff summary (and optionally per-item diffs) for preview.
- **RestoreExecutor**: execute restore (idempotent, tenant/run locking), write detailed outcomes.
- **RestoreRunPolicy**: enforce invariants (no execution without preview + confirmations).
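The sketch below illustrates the idempotency and locking contract for `RestoreExecutor`. It uses an in-process `threading.Lock` per tenant purely for illustration. A queued, multi-worker deployment would need a shared lock instead; Laravel's atomic cache locks are one option, if that is the stack. `apply_item` is a hypothetical callable that performs the actual write, and the outcome strings are illustrative.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict, Iterable


class TenantBusy(RuntimeError):
    """Another restore already holds this tenant's lock."""


class RestoreExecutor:
    """Sketch: one restore per tenant at a time; finished items are skipped on retry."""

    def __init__(self, apply_item: Callable[[str, int], None]):
        self._apply_item = apply_item            # hypothetical: performs the actual write
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()           # protects creation of per-tenant locks

    def _tenant_lock(self, tenant_id: str):
        with self._guard:
            return self._locks[tenant_id]

    def execute(self, tenant_id: str, item_ids: Iterable[int], results: Dict[int, str]) -> Dict[int, str]:
        """Apply items; `results` is the persisted outcome map (e.g. the run's `results`)."""
        lock = self._tenant_lock(tenant_id)
        if not lock.acquire(blocking=False):
            raise TenantBusy(f"a restore is already running for tenant {tenant_id}")
        try:
            for item_id in item_ids:
                if results.get(item_id) == "restored":
                    continue  # idempotent: never re-apply a finished item
                try:
                    self._apply_item(tenant_id, item_id)
                    results[item_id] = "restored"
                except Exception as exc:  # record the failure and continue with the next item
                    results[item_id] = f"failed: {exc}"
            return results
        finally:
            lock.release()


def run_status(results: Dict[int, str]) -> str:
    """Map item outcomes to a terminal status; an empty run counts as completed here."""
    restored = sum(1 for outcome in results.values() if outcome == "restored")
    if restored == len(results):
        return "completed"
    return "failed" if restored == 0 else "partial"
```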
---
## User Scenarios & Testing *(mandatory)*
### User Story 1 — Wizard-driven Restore Run (Priority: P1)
As an admin, I can create a restore run via a 5-step wizard and I cannot accidentally execute without preview + explicit confirmations.
**Why this priority**: This is the safety foundation; without it, restore remains risky UX.
**Independent Test**: In Filament, create a restore run with dry-run, see checks + preview, and confirm that Execute stays disabled until all gates are satisfied.
**Acceptance Scenarios**
1. **Given** I have selected a backup set and moved past Step 1, **When** I go back and change the backup set, **Then** scope, check, preview, and confirmation state is reset.
2. **Given** I keep dry-run enabled, **When** I reach Step 5, **Then** Execute is disabled.
3. **Given** I disable dry-run but have not completed the preview, **When** I reach Step 5, **Then** Execute is disabled.
---
### User Story 2 — Safety Checks block execution (Priority: P1)
As an admin, I see blocking vs warning checks, and execution is blocked when blockers exist.
**Why this priority**: Defensive restore requires an explicit risk layer.
**Independent Test**: Create a scope that triggers a blocking check and verify execution cannot proceed.
**Acceptance Scenarios**
1. **Given** a blocking check exists, **When** I reach Step 5, **Then** Execute remains disabled and blockers are visible.
2. **Given** only warnings exist and the preview is complete, **When** I turn off dry-run, acknowledge the impact, and hard-confirm the tenant, **Then** I can execute.
---
### User Story 3 — Preview diff summary (Priority: P2)
As an admin, I can preview what would change before executing restore.
**Why this priority**: A restore without preview is operationally unsafe.
**Independent Test**: Run Step 4 preview and verify diff summary is computed and persisted on the RestoreRun.
**Acceptance Scenarios**
1. **Given** I scoped items, **When** I run preview, **Then** I see a summary (changed policies count) and it persists on the restore run.
---
## Edge Cases
- Very large backup sets (hundreds/thousands of items): selection/search must remain responsive.
- Switching backup set mid-flow resets downstream state safely.
- Policies not present in target tenant: shown as warning/blocker depending on restore mode.
- RBAC-limited tenant setup: checks must clearly show “inventory/restore may be partial”.
---
## Functional Requirements
- **FR-011.1**: System MUST implement Restore Run creation as a 5-step wizard in Filament.
- **FR-011.2**: System MUST default `dry_run = true` and prevent execution while dry-run is enabled.
- **FR-011.3**: System MUST run server-side safety checks and persist results (summary + details) for audit.
- **FR-011.4**: System MUST generate at least a diff summary on preview and persist it.
- **FR-011.5**: System MUST require explicit acknowledgement + tenant hard-confirm before allowing execution.
- **FR-011.6**: System MUST freeze environment badge and tenant label for audit on run creation.
- **FR-011.7**: System MUST keep execution disabled if any blocking checks exist.
- **FR-011.8**: System MUST record execution outcomes and leave an auditable trail (following existing audit log patterns).
---
## Success Criteria
- **SC-011.1**: Admins can only execute after preview + confirmations; no accidental execution path exists.
- **SC-011.2**: Blocking checks reliably prevent execution.
- **SC-011.3**: Preview produces a persisted summary for every run.