TenantAtlas/specs/181-restore-safety-integrity/research.md

# Research: Restore Safety Integrity

## Decision 1: Derive a deterministic scope fingerprint from existing restore inputs instead of creating a new persisted scope entity

- Decision: Represent restore scope identity as a deterministic fingerprint derived from the existing restore inputs that materially change what is checked or written: `backup_set_id`, scope mode, sorted selected item IDs, and normalized group mapping values. Persist that fingerprint only inside existing `RestoreRun` metadata, and only when historical execution truth needs to be retained.
- Rationale: The current restore domain already has the raw inputs on `RestoreRun` and in the wizard state. The missing truth is not a new entity but a reliable way to say whether checks and preview still apply to the current selection. A derived fingerprint solves the mismatch problem without introducing a new table or second scope model.
- Alternatives considered:
  - Use only timestamps to decide whether preview or checks are current. Rejected because time alone cannot detect scope mismatch.
  - Create a dedicated persisted `restore_scope_snapshots` table. Rejected because the scope has no independent lifecycle outside the restore run and would violate the feature's proportionality goal.
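
The fingerprint derivation described above can be sketched as follows. This is a minimal illustration in Python, not the project's implementation: the function name, the normalization rules for the group mapping (trim whitespace, drop empty targets), and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json


def scope_fingerprint(backup_set_id, scope_mode, selected_item_ids, group_mapping):
    """Return a stable hex digest for the inputs that define restore scope.

    Hypothetical helper: the input names follow the decision text, the
    normalization rules are assumptions.
    """
    canonical = {
        "backup_set_id": backup_set_id,
        "scope_mode": scope_mode,
        # Sort so that selection order never changes the fingerprint.
        "selected_item_ids": sorted(str(i) for i in selected_item_ids),
        # Assumed normalization: string keys and values, trimmed,
        # with unmapped (empty or None) targets dropped.
        "group_mapping": {
            str(k).strip(): str(v).strip()
            for k, v in group_mapping.items()
            if v is not None and str(v).strip() != ""
        },
    }
    # sort_keys plus fixed separators yields byte-identical JSON for equal input.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two selections that differ only in ordering or in whitespace produce the same digest, while adding or removing an item produces a different one. That equality test is what lets checks and preview be compared against the current selection without a separate scope table.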

## Decision 2: Keep integrity states separate from risk severity

- Decision: Model preview integrity and checks integrity as derived state families separate from `RestoreRiskChecker` severities. The states `current`, `stale`, `invalidated`, and `not_generated` (preview) or `not_run` (checks) answer whether the basis is still trustworthy; blocking and warning counts continue to answer what the risk checker found.
- Rationale: Existing risk checks already classify blockers and warnings, but they do not answer whether the evaluated scope still matches the operator's current selection. Treating these as one concept would continue the current trust failure where `no blockers` can be misread as `safe`.
- Alternatives considered:
  - Reuse blocker or warning severity to encode staleness and mismatch. Rejected because severity and integrity have different operator consequences.
  - Collapse integrity into one generic `needs rerun` label. Rejected because the UI needs to distinguish `never run`, `stale`, and `invalidated` as different truths.
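
One way to keep the two concepts apart is sketched below in Python. The state names come from the decision text; everything else is an assumption for illustration: that `invalidated` means a fingerprint mismatch, that `stale` means an age threshold was exceeded, the 30-minute default, and the shape of the `basis` dictionary.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Integrity(Enum):
    NOT_RUN = "not_run"
    CURRENT = "current"
    STALE = "stale"
    INVALIDATED = "invalidated"


def resolve_integrity(basis, current_fingerprint, now, max_age=timedelta(minutes=30)):
    """Derive checks integrity. `basis` is None or a dict with
    'fingerprint' and 'ran_at' (hypothetical shape)."""
    if basis is None:
        return Integrity.NOT_RUN
    if basis["fingerprint"] != current_fingerprint:
        # Assumed meaning: the scope changed after the checks ran.
        return Integrity.INVALIDATED
    if now - basis["ran_at"] > max_age:
        # Assumed meaning: same scope, but the result is too old.
        return Integrity.STALE
    return Integrity.CURRENT


def describe(integrity, blocking, warning):
    """Report severity counts unchanged, next to whether they still apply."""
    return {
        "integrity": integrity.value,
        "blocking": blocking,
        "warning": warning,
        "counts_apply_to_current_scope": integrity is Integrity.CURRENT,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
basis = {"fingerprint": "abc", "ran_at": now - timedelta(minutes=5)}
# The scope fingerprint changed after the checks ran, so the zero blocker
# count says nothing about the current selection.
report = describe(resolve_integrity(basis, "xyz", now), blocking=0, warning=1)
```

Because the counts and the integrity state travel as separate fields, a surface can show "0 blockers" and "invalidated" together instead of letting the first imply safety.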

## Decision 3: Preserve invalidation evidence in wizard state instead of silently clearing prior work

- Decision: Replace the current silent reset behavior for preview and checks with explicit invalidation evidence in the wizard state. The last generated basis may still be cleared so that it no longer counts as executable truth, but the operator should see that a prior preview or check existed and no longer applies.
- Rationale: The current wizard already resets `check_summary`, `check_results`, `checks_ran_at`, `preview_summary`, `preview_diffs`, and `preview_ran_at` when scope-affecting inputs change. That preserves safety mechanically, but it does not preserve the operator truth that the prior work was invalidated by a change they just made.
- Alternatives considered:
  - Keep the current silent clearing behavior and add helper text only. Rejected because it still reads too much like `not generated` instead of `invalidated by your change`.
  - Keep the old values visible without marking them invalid. Rejected because it risks making stale truth look reusable.
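
A rough Python sketch of the difference between a silent reset and a reset that leaves evidence follows. The six cleared keys are the ones named in the rationale; the `checks_invalidation` and `preview_invalidation` keys, their fields, and the function itself are hypothetical.

```python
def invalidate_on_scope_change(state, reason, now_iso):
    """Clear preview and check results, but record that earlier work existed.

    Returns a new state dict; the input is left untouched.
    """
    new_state = dict(state)
    families = (
        ("checks", "checks_ran_at", ("check_summary", "check_results")),
        ("preview", "preview_ran_at", ("preview_summary", "preview_diffs")),
    )
    for kind, ran_key, data_keys in families:
        previous_ran_at = state.get(ran_key)
        if previous_ran_at is not None:
            # Evidence only: enough to say "this was invalidated by your
            # change", never enough to execute against.
            new_state[f"{kind}_invalidation"] = {
                "invalidated_at": now_iso,
                "reason": reason,
                "previous_ran_at": previous_ran_at,
            }
        # The results themselves are still cleared, as they are today.
        for key in (*data_keys, ran_key):
            new_state[key] = None
    return new_state
```

With this shape, a family that was never run gets no invalidation record, so the UI can still tell `not_run` apart from `invalidated`.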

## Decision 4: Persist only a narrow execution-time safety snapshot on the existing restore run

- Decision: When a real restore is queued, persist a compact execution-time safety snapshot inside existing `RestoreRun` metadata. The snapshot should capture the scope fingerprint, preview basis, checks basis, derived safety state, and primary blocker or warning context that justified or constrained execution.
- Rationale: The result and detail surfaces need historical truth about what basis was used at confirmation time. Re-deriving that later from mutable thresholds or current UI logic risks rewriting history. A narrow metadata snapshot keeps the audit-relevant truth on the existing restore record without creating a second persisted model.
- Alternatives considered:
  - Recompute execution-time safety state dynamically from the current code and current timestamps. Rejected because historical truth can drift as code or thresholds change.
  - Persist a full recovery-health document. Rejected because this feature does not claim tenant-wide recovery truth.
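
The snapshot could look roughly like the Python sketch below. The five captured elements follow the decision text; the key names, the `execution_safety_snapshot` metadata key, the `captured_at` field, and the example state label are assumptions.

```python
import copy


def build_safety_snapshot(scope_fingerprint, preview_basis, checks_basis,
                          safety_state, primary_context, captured_at):
    """Assemble the compact snapshot taken when a real restore is queued."""
    return {
        "scope_fingerprint": scope_fingerprint,
        # Deep copies so later changes to wizard state cannot alter history.
        "preview_basis": copy.deepcopy(preview_basis),
        "checks_basis": copy.deepcopy(checks_basis),
        "safety_state": safety_state,
        "primary_context": primary_context,
        "captured_at": captured_at,
    }


def with_safety_snapshot(metadata, snapshot):
    """Return new run metadata carrying the snapshot; existing keys are kept."""
    merged = dict(metadata or {})
    merged["execution_safety_snapshot"] = snapshot  # hypothetical key name
    return merged
```

Detail surfaces would then read the stored snapshot rather than re-deriving the state, so a later change to thresholds or resolver logic does not rewrite what the operator confirmed at the time.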

## Decision 5: Derive result follow-up truth from existing restore results and operation outcomes instead of adding a recovery entity

- Decision: Compute `completed`, `partial`, `failed`, and `completed_with_follow_up` from existing restore results, assignment outcomes, metadata, and linked `OperationRun` outcome. Treat cause families and next actions as derived read-model fields for the detail surfaces.
- Rationale: `RestoreRun.results`, assignment outcomes, and operation-run linkage already contain enough signal to decide whether operator follow-up remains. The product problem is weak surfacing of that truth, not missing domain storage.
- Alternatives considered:
  - Add a dedicated persisted recovery status column or table. Rejected because the feature does not need a second source of truth.
  - Use only `RestoreRun.status` as the result meaning. Rejected because `completed` does not mean `recovered` and `partial` does not explain the operator consequence on its own.
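
As an illustration of deriving the result meaning rather than storing it, consider the Python sketch below. The four labels come from the decision text. The precedence rules, the input shapes, and the status strings `succeeded` and `failed` are assumptions, since the actual rules are not spelled out here.

```python
def derive_result_attention(item_results, assignment_outcomes, operation_outcome):
    """Derive a result label from existing signals (hypothetical rules).

    item_results and assignment_outcomes are lists of dicts with a 'status'
    key; operation_outcome is the linked operation run's outcome string.
    """
    statuses = [r["status"] for r in item_results]
    failed_items = sum(1 for s in statuses if s == "failed")

    # Assumed rule: a failed operation, or every item failing, is a failure.
    if operation_outcome == "failed" or (statuses and failed_items == len(statuses)):
        return "failed"
    # Assumed rule: some items failed, others did not.
    if failed_items > 0:
        return "partial"
    # Assumed rule: items restored, but assignments still need attention.
    if any(a["status"] != "succeeded" for a in assignment_outcomes):
        return "completed_with_follow_up"
    return "completed"
```

Nothing in this function writes anything back, which is the point of the decision: the label is a read-model field computed from records that already exist.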

## Decision 6: Keep restore-specific follow-up truth visible on the canonical operation detail through enrichment or a safe deep link

- Decision: Reuse the existing restore-to-operation linkage and enrich the canonical operation detail for `restore.execute` runs with restore follow-up truth or a single safe route into the restore detail page. Do not add new `OperationRun` persistence for restore-specific state.
- Rationale: Canonical monitoring is already the shared destination for operational truth. The feature must keep restore meaning visible there, but the restore-specific source of truth still belongs to `RestoreRun`.
- Alternatives considered:
  - Persist restore-follow-up labels directly on `OperationRun`. Rejected because it duplicates restore truth into the monitoring record.
  - Leave canonical operation detail generic and rely entirely on restore detail for follow-up truth. Rejected because it breaks continuity from monitoring.
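
The enrichment can be pictured as a read-time view function, sketched below in Python. The `restore.execute` type comes from the decision text; the dictionary shapes, the `can_view_restore` flag, the `follow_up` field, and the URL pattern are hypothetical.

```python
def enrich_operation_detail(operation, restore_run, can_view_restore):
    """Build view data for the canonical operation detail.

    Nothing is written to the operation record; restore-specific fields are
    added to the returned view data only.
    """
    detail = {
        "id": operation["id"],
        "type": operation["type"],
        "outcome": operation["outcome"],
    }
    if operation["type"] != "restore.execute" or restore_run is None:
        return detail
    if not can_view_restore:
        # Assumed degradation: stay generic rather than leak restore detail.
        return detail
    detail["restore_follow_up"] = {
        "label": restore_run["follow_up"],
        # Hypothetical route; one safe link into the restore detail page.
        "url": f"/restore-runs/{restore_run['id']}",
    }
    return detail
```

The restore run stays the source of truth for follow-up, and the operation detail only borrows it at render time.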

## Decision 7: Reuse Filament wizard, action, and infolist seams already present in the codebase

- Decision: Implement the feature inside the existing `RestoreRunResource::getWizardSteps()`, `CreateRestoreRun`, restore form component views, restore infolist entry views, and the existing canonical operation detail seams. Rely on Filament wizard lifecycle hooks and action testing patterns rather than inventing a new UI shell.
- Rationale: Filament v5 already supports wizard step validation hooks, confirmation modals for actions, and direct action testing. Existing restore surfaces are already built on these seams, so a narrow hardening slice should stay inside them.
- Alternatives considered:
  - Rebuild restore safety as a custom standalone screen outside Filament. Rejected because it would duplicate current routing, RBAC, and UI patterns.
  - Push interactivity into custom infolist entry classes. Rejected because Filament custom infolist entries are display-oriented, not Livewire components, and the current restore detail need is presentation hardening rather than a new client-side interaction model.

## Decision 8: Extend the existing Pest and Livewire test surface instead of creating a new browser-first harness

- Decision: Add focused unit and feature coverage around the new integrity resolvers, wizard invalidation, confirmation hardening, result attention, canonical operation continuity, and RBAC-safe degradation by extending the existing restore-related Pest and Livewire tests.
- Rationale: The repository already has strong restore wizard, preview, execution, hardening, RBAC, and ops-UX regression coverage. Filament's testing guidance supports direct action invocation and visibility assertions, which fit this feature precisely.
- Alternatives considered:
  - Rely only on manual UI validation. Rejected because this slice is specifically about preventing subtle trust regressions.
  - Add a large browser-only suite as the primary guard. Rejected because the critical assertions are server-driven state and action consequences that fit existing Pest and Livewire tests better.