TenantAtlas/specs/364-restore-high-risk-operation-reconciliation/spec.md
ahmido 3ce1cae71e feat: implement restore high risk operation reconciliation (#435)
Implemented restore high risk operation reconciliation.

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #435
2026-06-07 14:10:34 +00:00

Feature Specification: Spec 364 - Restore and High-Risk Operation Reconciliation

  • Feature Branch: 364-restore-high-risk-operation-reconciliation
  • Created: 2026-06-07
  • Status: Draft
  • Type: Restore execution truth hardening / OperationRun reconciliation follow-up / no new persistence
  • Runtime posture: Tighten restore.execute reconciliation so real restore execution cannot be marked successful from weak terminal signals alone. Reuse existing RestoreRun, OperationRun, audit, proof, and evidence surfaces. Do not create a new restore engine, new operation family, or new persisted verification model.
  • Input: User-provided Spec 364 draft in /Users/ahmeddarrazi/.codex/attachments/fe416f8b-141a-44eb-ae89-ab62a4691bed/pasted-text.txt, reconciled against current repo truth after Specs 358-363.

Dependencies And Historical Context

This package continues the current OperationRun truth line:

  • Spec 358 - OperationRun Queue Truth Foundation established honest queued/running stale handling and explicitly deferred business-success reconciliation.
  • Spec 359 - OperationRun Reconciliation Adapter Framework & Review Compose Adapter added a bounded adapter path and treated restore as an existing adapter precedent.
  • Spec 360 - OperationRun Canonical Cutover Cleanup canonicalized the adapter registry, context.reconciliation, and dispatch/correlation semantics over the real restore and review-compose cases.
  • Spec 361 - Report and Evidence Reconciliation Adapters added artifact-backed reconciliation for evidence snapshot and review-pack generation while keeping restore expansion out of scope.
  • Spec 362 - Sync, Capture, and Backup Operation Semantics added selected-family proof semantics for sync, baseline capture, and backup schedule runs while deferring restore.
  • Spec 363 - Explicit UiActionContext Contract is already-implemented action-context hardening and serves as dependency context only.

Restore-specific productization already exists and must not be reopened:

  • Spec 333 - Restore Create UX Final Productization productized the pre-execution restore wizard.
  • Spec 335 - Restore Run Detail / Post-Execution Proof Productization productized the restore detail proof surface.
  • Spec 181 - Restore Safety Integrity defines earlier restore safety requirements and remains historical context.

Current repo truth already contains:

  • App\Support\Operations\Reconciliation\RestoreExecuteReconciliationAdapter
  • App\Services\AdapterRunReconciler
  • App\Services\OperationRunService
  • App\Support\Operations\Reconciliation\ReconciliationResult
  • App\Support\OperationRunOutcome values succeeded, partially_succeeded, blocked, and failed
  • App\Support\RestoreRunStatus
  • App\Jobs\ExecuteRestoreRunJob
  • App\Listeners\SyncRestoreRunToOperationRun
  • restore proof/result presentation in RestoreRunDetailPresenter, RestoreSafetyResolver, and restore infolist Blade entries

The user draft's verification_required concept is valid product truth, but repo truth does not justify a new OperationRun outcome or new restore.verify operation type in this slice. Spec 364 represents verification gaps through existing outcomes plus restore-specific reason and evidence metadata.

Spec Candidate Check (mandatory - SPEC-GATE-001)

  • Problem: restore.execute can still be reconciled from terminal RestoreRun status in a way that risks overclaiming success for high-risk tenant-changing work. The current adapter maps previewed and completed to succeeded without requiring a complete proof bundle that distinguishes execution proof, provider acceptance, item-level result truth, verification evidence, scope safety, and audit continuity.
  • Today's failure: A stale or interrupted restore operation can become a calm successful OperationRun when the related restore record is terminal, even if post-run evidence is unavailable, item outcomes are partial, provider proof is incomplete, or the restore status represents preview-only or pre-execution truth.
  • User-visible improvement: Operators inspecting Operations or Restore Run detail will see restore execution truth that is honest by default: succeeded only when execution and proof are complete, partially succeeded when mutation occurred but verification or item truth is incomplete, blocked when safety gates prevent execution, and failed when no safe proof exists.
  • Smallest enterprise-capable version: Harden exactly restore.execute reconciliation and visible fallout over current repo-real proof paths. Reuse existing OperationRunOutcome, RestoreRunStatus, RestoreRun.results, RestoreRun.metadata, RestoreRun.operation_run_id, audit links, and existing Restore Run detail proof presentation. Do not add new operation types, persistence, Graph contracts, or restore wizard behavior.
  • Explicit non-goals: No new restore wizard, no new restore engine, no new rollback system, no restore.preview / restore.validate / restore.verify operation types, no new OperationRun outcome, no new table, no new Graph provider, no new Backup model, no new diff algorithm, no UI redesign of Operations or Restore detail, no retry console, no destructive cleanup action, no compatibility shims for pre-production historical rows.
  • Permanent complexity imported: One bounded restore-proof decision path inside or beside the existing restore reconciliation adapter, focused Unit/Feature/Browser coverage, and small copy/metadata adjustments on existing Operations and Restore detail surfaces. No new persisted truth or cross-domain framework.
  • Why now: Specs 358-362 intentionally matured OperationRun truth family by family. Restore is the highest-risk remaining adapter family because it can mutate real tenant configuration, so a false success here is more dangerous than false calm in read-only report or backup work.
  • Why not local: A Restore Run detail copy-only fix would not stop stale adapter reconciliation from writing overclaiming OperationRun truth. A job-only fix would not cover late/stale reconciliation. The proof boundary belongs in the shared adapter/service path and must remain visible in current run surfaces.
  • Approval class: Core Enterprise.
  • Red flags triggered: high-risk domain semantics, adapter proof hardening, restore-specific reason codes. Defense: this slice narrows the existing restore adapter rather than adding a framework; it uses existing outcomes, existing records, existing surfaces, and fail-closed rules.
  • Score: Benefit: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 2 | Reuse: 2 | Total: 11/12
  • Decision: approve.

Candidate Source And Completed-Spec Guardrail

  • Candidate source:
    • direct user-provided Spec 364 draft in pasted-text.txt
    • repo-real follow-through after Specs 358-363
    • roadmap relationship: Golden Master Governance and restore safety/read-write separation; this is execution-truth hardening rather than a customer portal or productization backlog item
  • Queue boundary: docs/product/spec-candidates.md records no safe automatic next-best-prep target for auto-selection. This package is an intentional manual promotion from direct user input, not an automatic backlog pick.
  • Completed-spec check result:
    • no specs/364-* package existed before this prep
    • no local or remote 364-* branch existed before this prep
    • Specs 333, 335, and 358-363 are completed or implementation-context packages and must not be rewritten, normalized, unchecked, or reopened
    • completed close-out, validation, smoke, browser, and checked task markers in related specs remain historical evidence
  • Close alternatives deferred:
    • broader restore workflow redesign is deferred because Spec 333 already covers create UX and Spec 335 already covers detail proof productization
    • new restore.verify or rollback operation families are deferred until there is an explicit product/runtime source of truth for verification execution
    • generic high-risk operation framework is deferred; promotion.execute and AI or operational-control high-risk flows need separate product decisions
    • support-desk, customer-review, provider-scope, and canonical-link productization lanes are already specced/active/completed or manual-promotion only
  • Smallest viable implementation slice: harden restore.execute adapter proof and related Operations/Restore detail presentation only.

Summary

This feature prevents restore execution from being treated as successful merely because a related restore record reached a terminal status.

For restore.execute, the system must prove the intended mutation was accepted, item results are interpretable, verification or post-run evidence is available when success is claimed, scope remained safe, and audit continuity exists. If those signals are incomplete, the operation must finalize as partial, blocked, failed, or not reconciled instead of succeeding.

Success Proof Bundle Matrix

The implementation must use this matrix as the narrow repo-real proof boundary before writing OperationRunOutcome::Succeeded.

| Proof element | Repo-real source to inspect | Success threshold | Missing or invalid proof result | Reason code guidance |
| --- | --- | --- | --- | --- |
| Same-scope restore linkage | OperationRun.context.restore_run_id, RestoreRun.operation_run_id, RestoreRun.workspace_id, RestoreRun.managed_environment_id, default non-trashed RestoreRun query | A non-trashed RestoreRun belongs to the same workspace and managed environment as the OperationRun and links back through context or operation_run_id without contradiction | Return non-final not_reconciled when no same-scope RestoreRun can be safely identified; never use a wrong-scope or trashed RestoreRun as proof | restore.proof_missing, restore.scope_mismatch, restore.run_deleted |
| Execution proof | RestoreRun.is_dry_run, RestoreRun.status, RestoreRun.started_at, RestoreRun.completed_at, linked OperationRun status/outcome | Real execution, not preview/dry-run, reached an execution terminal status with timestamps consistent enough for audit and operator review | previewed, dry-run, missing start/completion proof, or terminal status alone cannot produce success | restore.preview_only, restore.execution_proof_missing |
| Provider acceptance / mutation proof | Existing restore service output persisted in RestoreRun.results, RestoreRun.failure_reason, and safe aggregate metadata | Results or metadata show the provider accepted or safely attempted the requested mutation and no provider-level rejection remains | Missing or rejected provider proof with a same-scope RestoreRun finalizes as failed or blocked; raw provider payloads stay out of reconciliation metadata | restore.provider_proof_missing, restore.provider_rejected |
| Item or aggregate result truth | RestoreRun.results.items, RestoreRun.results.assignment_outcomes, RestoreRun.metadata.total_items, processed_items, succeeded_items, failed_items, skipped_items, plus existing RestoreRun item helpers | Item or aggregate counts are interpretable, flat numeric where written to OperationRun, and show all required work succeeded | Mixed results produce partially_succeeded; absent result truth after execution withholds success | restore.results_mixed, restore.results_missing |
| Post-run evidence or explicit proof availability | Existing Restore Run detail evidence path, linked OperationRun, existing EvidenceSnapshot context when available, and safe metadata flags already persisted by the restore flow | Evidence is available or the existing restore result explicitly proves why recovery proof is available without a new persisted verification model | Missing evidence after mutation produces existing partial/blocked/failed truth with a visible proof-gap reason, not success | restore.verification_required, restore.evidence_missing |
| Audit continuity | AuditLog rows in the same workspace and managed environment, preferably linked through operation_run_id or stable restore action metadata; existing operation terminal audit remains service-owned | Same-scope audit trail can explain start/failure/completion or reconciliation without exposing secrets | Otherwise-complete proof without audit continuity must withhold success and record a safe proof-gap reason | restore.audit_missing |

not_reconciled is a ReconciliationResult decision, not an OperationRunOutcome. It is valid only when the adapter cannot safely identify enough same-scope restore proof to finalize the run. If a same-scope RestoreRun exists and proves execution reached an unsafe, partial, blocked, failed, or verification-gap state, the adapter must finalize with an existing outcome instead of using not_reconciled to hide operator-visible truth.
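
The matrix and the not_reconciled rule can be read as one ordered decision. The sketch below is a language-neutral model in Python, not the PHP adapter itself. ProofBundle, Decision, and evaluate are hypothetical names. The check order, and the choice of failed for preview-only or missing execution proof and partially_succeeded for results, evidence, or audit gaps, are assumptions drawn from the "failed when no safe proof exists" and "partially succeeded when mutation occurred" wording. The matrix names no reason code for safety-gate blocks, so none is invented here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    # Mirrors the four existing OperationRunOutcome values named in this spec.
    SUCCEEDED = "succeeded"
    PARTIALLY_SUCCEEDED = "partially_succeeded"
    BLOCKED = "blocked"
    FAILED = "failed"


@dataclass(frozen=True)
class ProofBundle:
    # Hypothetical flattened view of the matrix rows; not a repo type.
    same_scope_restore_found: bool
    real_execution: bool                # False for previewed / dry-run
    execution_timestamps_present: bool
    blocked_by_safety_gate: bool
    provider_accepted: Optional[bool]   # None means no provider proof recorded
    failed_items: Optional[int]         # None means no item or aggregate truth
    evidence_available: bool
    audit_trail_present: bool


@dataclass(frozen=True)
class Decision:
    finalize: bool                      # False models the non-final not_reconciled decision
    outcome: Optional[Outcome]
    reason: Optional[str]


def evaluate(p: ProofBundle) -> Decision:
    # No same-scope RestoreRun: do not finalize at all.
    if not p.same_scope_restore_found:
        return Decision(False, None, "restore.proof_missing")
    # From here on a same-scope RestoreRun exists, so every branch finalizes.
    if p.blocked_by_safety_gate:
        return Decision(True, Outcome.BLOCKED, None)
    if not p.real_execution:
        return Decision(True, Outcome.FAILED, "restore.preview_only")
    if not p.execution_timestamps_present:
        return Decision(True, Outcome.FAILED, "restore.execution_proof_missing")
    if p.provider_accepted is None:
        return Decision(True, Outcome.FAILED, "restore.provider_proof_missing")
    if p.provider_accepted is False:
        return Decision(True, Outcome.FAILED, "restore.provider_rejected")
    # Mutation was accepted; remaining gaps withhold success but stay visible.
    if p.failed_items is None:
        return Decision(True, Outcome.PARTIALLY_SUCCEEDED, "restore.results_missing")
    if p.failed_items > 0:
        return Decision(True, Outcome.PARTIALLY_SUCCEEDED, "restore.results_mixed")
    if not p.evidence_available:
        return Decision(True, Outcome.PARTIALLY_SUCCEEDED, "restore.verification_required")
    if not p.audit_trail_present:
        return Decision(True, Outcome.PARTIALLY_SUCCEEDED, "restore.audit_missing")
    return Decision(True, Outcome.SUCCEEDED, None)
```

Only the last line produces succeeded, which keeps the fail-closed property easy to review: every earlier branch either declines to finalize or finalizes with an existing non-success outcome.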

Business/Product Value

  • Reduces the risk of false recovery claims after tenant-changing operations.
  • Makes restore monitoring and restore detail consistent with TenantPilot's read/write separation and audit-first posture.
  • Keeps the platform sellable as Governance-of-Record by making high-risk mutation truth stricter than report, evidence, sync, or backup truth.

Primary Users / Operators

  • Tenant/MSP operators who start and review restore execution.
  • Workspace owners/managers who approve or supervise high-risk changes.
  • Support/platform operators who troubleshoot restore outcomes through Operations and audit evidence.

Roadmap Relationship

Spec 364 belongs to the OperationRun execution-truth maturity line and the restore safety lane. It is not a new customer-facing workspace, not a new restore product surface, and not a generic high-risk operation framework.

Spec Scope Fields (mandatory)

  • Scope: canonical-view plus environment-bound restore execution truth
  • Primary Routes:
    • /admin/workspaces/{workspace}/operations
    • /admin/workspaces/{workspace}/operations/{run}
    • existing environment-scoped Restore Run list/create/detail routes via App\Filament\Resources\RestoreRunResource
  • Data Ownership:
    • operation_runs remain the only execution and reconciliation truth
    • restore_runs remain the restore request/result truth
    • audit_logs remain audit trail truth
    • existing evidence snapshot links remain optional post-run evidence context only when already repo-backed
    • no new persistence is introduced
  • RBAC:
    • existing workspace-first OperationRun access remains authoritative
    • existing Restore Run policy/resource access remains authoritative
    • non-members and wrong-scope actors remain 404
    • members missing restore capability remain 403 for restore execution
    • no new capability strings are introduced

For canonical-view specs:

  • Default filter behavior when tenant-context is active: the Operations hub remains workspace-scoped with explicit environment filters. Reconciliation must use the run's stored workspace and managed environment scope, not remembered environment state, current page filters, or Filament tenant fallback.
  • Explicit entitlement checks preventing cross-tenant leakage: no adapter may reconcile to a restore run outside the operation's workspace and managed environment, and no related link may bypass current scope-safe routes.
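
The scope rule above can be sketched as a lookup that reads only the run's stored workspace and managed environment, and never takes page filters or remembered tenant state as input. This is an illustrative Python model, not repo code. The dataclasses, resolve_restore_run, and the treatment of contradictory links as restore.proof_missing are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class OperationRun:
    id: int
    workspace_id: int
    managed_environment_id: int
    context_restore_run_id: Optional[int]  # models context.restore_run_id


@dataclass(frozen=True)
class RestoreRun:
    id: int
    workspace_id: int
    managed_environment_id: int
    operation_run_id: Optional[int]
    trashed: bool = False


def resolve_restore_run(
    run: OperationRun, candidates: Sequence[RestoreRun]
) -> Tuple[Optional[RestoreRun], Optional[str]]:
    # Collect every record that claims a link in either direction.
    linked = [
        r for r in candidates
        if r.id == run.context_restore_run_id or r.operation_run_id == run.id
    ]
    if not linked:
        return None, "restore.proof_missing"
    # Two different records claiming the same run is a contradiction.
    if len({r.id for r in linked}) > 1:
        return None, "restore.proof_missing"
    restore = linked[0]
    # A back-link pointing at a different run also contradicts the context link.
    if restore.operation_run_id not in (None, run.id):
        return None, "restore.proof_missing"
    if restore.trashed:
        return None, "restore.run_deleted"
    if (restore.workspace_id, restore.managed_environment_id) != (
        run.workspace_id, run.managed_environment_id
    ):
        return None, "restore.scope_mismatch"
    return restore, None
```

Because the function signature has no parameter for the current user, page, or filter state, a caller cannot accidentally widen the scope beyond what the run itself recorded.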

UI Surface Impact (mandatory - UI-COV-001)

  • No UI surface impact
  • Existing page changed
  • New page/route added
  • Navigation changed
  • Filament panel/provider surface changed
  • New modal/drawer/wizard/action added
  • New table/form/state added
  • Customer-facing surface changed
  • Dangerous action changed
  • Status/evidence/review presentation changed
  • Workspace/environment context presentation changed

UI/Productization Coverage (mandatory when UI Surface Impact is not "No UI surface impact")

  • Route/page/surface:
    • App\Filament\Pages\Monitoring\Operations
    • App\Filament\Pages\Operations\TenantlessOperationRunViewer
    • App\Filament\Resources\OperationRunResource as shared implementation seam
    • App\Filament\Resources\RestoreRunResource detail proof surface
    • restore execution start/confirmation path only where proof or queued feedback reflects restore.execute
  • Current or new page archetype: existing Operations monitoring/detail family plus existing environment-bound dangerous restore workflow/detail surfaces
  • Design depth: Domain Pattern Surface / Manual Review Required for restore proof and dangerous-action truth
  • Repo-truth level: repo-verified
  • Existing pattern reused: current OperationRun monitoring family, current Restore Run detail proof model, current Restore Create safety/proof model, current OperationRunLinks, current BADGE-001 status badge semantics
  • New pattern required: none; the change narrows existing restore proof/reconciliation behavior
  • Screenshot required: one bounded browser smoke screenshot only if implementation materially changes visible hierarchy; otherwise existing Spec 333/335 screenshot anchors remain sufficient
  • Page audit required: no new page-report identity unless implementation introduces a materially new visible hierarchy
  • Customer-safe review required: no customer-facing surface; copy must still avoid false recovery claims
  • Dangerous-action review required: yes; success wording and execute/verification claims must not overstate tenant recovery
  • Coverage files updated or explicitly not needed:
    • docs/ui-ux-enterprise-audit/route-inventory.md
    • docs/ui-ux-enterprise-audit/design-coverage-matrix.md
    • docs/ui-ux-enterprise-audit/page-reports/...
    • docs/ui-ux-enterprise-audit/strategic-surfaces.md
    • docs/ui-ux-enterprise-audit/grouped-follow-up-candidates.md
    • docs/ui-ux-enterprise-audit/unresolved-pages.md
    • N/A - existing Operations and Restore Run page families already cover these reachable surfaces unless implementation proves visible hierarchy drift
  • No-impact rationale when applicable: N/A

Cross-Cutting / Shared Pattern Reuse (mandatory)

  • Cross-cutting feature?: yes
  • Interaction class(es): status messaging, action links, dangerous-action proof wording, OperationRun reconciliation diagnostics, restore result/proof viewers
  • Systems touched:
    • OperationRunReconciliationRegistry
    • RestoreExecuteReconciliationAdapter
    • AdapterRunReconciler
    • OperationRunService
    • ReconciliationResult
    • RestoreSafetyResolver
    • RestoreRunDetailPresenter
    • current Operations and Restore detail renderers
  • Existing pattern(s) to extend: current adapter reconciliation path, current OperationRun lifecycle service ownership, current restore proof/detail model, current audit trail
  • Shared contract / presenter / builder / renderer to reuse: OperationRunService::applyReconciliationResult(), OperationRun::reconciliation(), OperationRunLinks, RestoreRunDetailPresenter, RestoreSafetyResolver, BadgeCatalog / BadgeRenderer
  • Why the existing shared path is sufficient or insufficient: the shared path exists, but the restore adapter's proof bar is too weak for tenant-changing work. It needs stricter restore-specific decision rules, not a new framework.
  • Allowed deviation and why: a small restore proof evaluator is allowed only if it keeps adapter logic reviewable and stays derived-only over existing records.
  • Consistency impact: restore success, partial, blocked, failed, and verification-gap wording must match across Operations, run detail, restore detail, notifications where existing, and audit-safe metadata.
  • Review focus: no new outcome family, no success from previewed, no success from terminal status alone, no raw provider payload in default UI, no bypass of policies or GraphClientInterface.

OperationRun UX Impact (mandatory)

  • Touches OperationRun start/completion/link UX?: yes, completion/reconciliation and link presentation only
  • Shared OperationRun UX contract/layer reused: OperationRunService, OperationRunLinks, OperationUxPresenter, current Operations hub/detail surfaces
  • Delegated start/completion UX behaviors:
    • existing restore queued feedback and run links remain on the shared path
    • reconciliation finalization remains service-owned
    • terminal notifications remain on the current central lifecycle path
  • Local surface-owned behavior that remains: restore initiation inputs, preview/dry-run controls, confirmation copy, and restore-specific proof detail
  • Queued DB-notification policy: unchanged; no new queued DB notification policy
  • Terminal notification path: unchanged central lifecycle mechanism
  • Exception required?: none

Provider Boundary / Platform Core Check (mandatory)

  • Shared provider/platform boundary touched?: yes
  • Boundary classification: mixed
  • Seams affected: provider-backed restore.execute, write gate, provider-operation start checks, restore result metadata, OperationRun reconciliation metadata
  • Neutral platform terms preserved or introduced: operation, execution proof, provider acceptance, verification evidence, scope safety, audit trail, managed environment
  • Provider-specific semantics retained and why: Microsoft/Intune restore behavior remains provider-owned because the current runtime has only Microsoft restore execution. Provider-specific payloads stay inside existing restore/provider services and are not promoted to platform-core taxonomy.
  • Why this does not deepen provider coupling accidentally: the spec tightens proof criteria around existing restore.execute; it does not create provider-neutral restore abstractions, provider registries, or Graph contract expansion.
  • Follow-up path: follow-up-spec only if future restore verification becomes a distinct queued operation with repo-real execution and artifact truth.

UI / Surface Guardrail Impact (mandatory)

| Surface / Change | Operator-facing surface change? | Native vs Custom | Shared-Family Relevance | State Layers Touched | Exception Needed? | Low-Impact / N/A Note |
| --- | --- | --- | --- | --- | --- | --- |
| Operations hub restore outcome wording | yes | Native Filament page | shared monitoring family | page, table row | no | existing surface only |
| Tenantless run detail restore reconciliation explanation | yes | Native Filament page | shared monitoring detail family | detail | no | explanation and proof metadata only |
| Restore Run detail proof state | yes | Filament infolist plus existing custom Blade entry | restore proof/detail family | detail | no | proof-safe presentation over existing state |
| Restore execute confirmation/start feedback | yes | Filament action/wizard | dangerous action family | wizard/action | no | no new action; proof semantics may be tightened |

Decision-First Surface Role (mandatory)

| Surface | Decision Role | Human-in-the-loop Moment | Immediately Visible for First Decision | On-Demand Detail / Evidence | Why This Is Primary or Why Not | Workflow Alignment | Attention-load Reduction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Operations hub | Primary Decision Surface | Decide whether a restore run needs follow-up | lifecycle, outcome, proof gap, one safe next action | full run detail, restore detail, diagnostics | primary because it is the canonical monitoring queue | aligns with operations triage | removes false-success row reading |
| Tenantless run detail | Tertiary Evidence / Diagnostics Surface | Confirm why restore reconciliation finalized a run | one restore-specific explanation and related restore link | raw context and support diagnostics | tertiary because the run is selected | preserves current detail role | keeps proof reason above raw context |
| Restore Run detail | Primary Decision Surface | Decide whether recovery proof is available or follow-up is required | result state, reason, impact, proof availability, one primary next action | item outcomes, raw result payload, evidence diagnostics | primary for restore result truth | follows post-execution restore review | separates completion from recovery proof |
| Restore execute confirmation | Primary Decision Surface | Decide whether real tenant mutation may start | safety gates, preview/check currentness, mutation scope, confirmation | preview details and diagnostics | primary because mutation can alter tenant configuration | follows safe restore execution flow | prevents action before proof review |

Audience-Aware Disclosure (mandatory)

| Surface | Audience Modes In Scope | Decision-First Default-Visible Content | Operator Diagnostics | Support / Raw Evidence | One Dominant Next Action | Hidden / Gated By Default | Duplicate-Truth Prevention |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Operations hub | operator-MSP, support-platform | outcome, proof gap, related restore target | reconciliation reason, summary counts | raw context only in run detail | open restore run or inspect run | raw provider payloads | one row outcome plus one link |
| Run detail | operator-MSP, support-platform | restore-specific reconciliation explanation | context.reconciliation and related records | raw context and failures secondary | open restore run / inspect proof | raw provider payloads and IDs | explanation references one proof source |
| Restore detail | operator-MSP, support-platform | recovery proof question, result summary, evidence state | item outcomes, failure family, audit links | raw results collapsed | open operation proof / open evidence / review gap | raw JSON, internal reason ownership | presenter owns result decision once |

UI/UX Surface Classification (mandatory)

| Surface | Action Surface Class | Surface Type | Likely Next Operator Action | Primary Inspect/Open Model | Row Click | Secondary Actions Placement | Destructive Actions Placement | Canonical Collection Route | Canonical Detail Route | Scope Signals | Canonical Noun | Critical Truth Visible by Default | Exception Type / Justification |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Operations hub | List / Workbench | Monitoring queue | inspect a restore needing follow-up | row/detail route | allowed | row/detail secondary links | none introduced | /admin/workspaces/{workspace}/operations | /admin/workspaces/{workspace}/operations/{run} | workspace and environment | Operations / Operation | restore outcome and proof gap | none |
| Restore Run detail | Detail / Evidence | Dangerous workflow result | review restore result proof | detail page | N/A | diagnostics and related links after summary | none introduced | Restore Runs list | Restore Run detail | workspace and environment | Restore Run | completion vs recovery proof | none |
| Restore execute confirmation | Workflow / Dangerous Action | Restore execution gate | confirm or stop restore | wizard step | N/A | proof/diagnostics panels | final confirm step only | Restore Runs list | Restore Run detail after creation | workspace, environment, mutation scope | Restore Run | safety and proof readiness | none |

Operator Surface Contract (mandatory)

| Surface | Primary Persona | Decision / Operator Action Supported | Surface Type | Primary Operator Question | Default-visible Information | Diagnostics-only Information | Status Dimensions Used | Mutation Scope | Primary Actions | Dangerous Actions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Restore Run detail | Tenant operator / MSP operator | Decide whether recovery proof is available or follow-up is required | Restore result detail | Was this restore executed safely, and is recovery proof available? | result state, reason, impact, operation proof, evidence state, summary counts | item JSON, raw provider diagnostics, raw context | execution outcome, provider acceptance, verification evidence, recovery proof, lifecycle | read-only detail over a prior Microsoft tenant mutation | open operation proof, open evidence, review proof gap | none introduced |
| Operations run detail | Workspace operator / support operator | Inspect restore-linked operation proof | Operation diagnostics detail | Why did this restore operation finish this way? | lifecycle, outcome, restore reconciliation reason, related restore link | raw run context, failures, support evidence | lifecycle, execution outcome, reconciliation proof | read-only monitoring | open restore run | none introduced |
| Restore execute confirmation | Tenant operator / MSP operator | Confirm whether real restore execution may start | Dangerous workflow wizard | Can this restore mutate the Microsoft tenant now? | safety gates, preview/check currentness, mutation scope, typed confirmation, proof limits | preview detail, mapping detail, raw diff | readiness, mutation scope, evidence availability | Microsoft tenant when execution proceeds | execute restore after confirmation | execute restore |

Proportionality Review (mandatory when structural complexity is introduced)

  • New source of truth?: no
  • New persisted entity/table/artifact?: no
  • New abstraction?: maybe; a small restore proof evaluator may be introduced only if it replaces duplicated adapter/detail proof logic and stays local to restore execution truth
  • New enum/state/reason family?: no new OperationRun outcome or persisted status family; small restore-specific reason codes such as restore.verification_required may be derived metadata only if they change operator next action
  • New cross-domain UI framework/taxonomy?: no
  • Current operator problem: false successful reconciliation for tenant-changing restore execution can make operators believe recovery is proven when only terminal status or partial execution exists.
  • Existing structure is insufficient because: the current restore adapter maps terminal restore status directly to OperationRun outcome without a strict proof bundle; restore detail proof surfaces already distinguish evidence but the run lifecycle can still overclaim.
  • Narrowest correct implementation: harden the existing RestoreExecuteReconciliationAdapter and existing presenters to require proof for success and fail closed otherwise.
  • Ownership cost: a small set of restore proof rules and focused tests that future restore changes must honor.
  • Alternative intentionally rejected: adding verification_required as a new OperationRun outcome or building a new restore verification operation family is rejected because the current repo has no corresponding execution truth.
  • Release truth: current-release truth; restore execution exists and is high-risk now.

Compatibility Posture

This feature assumes a pre-production environment.

Backward compatibility, legacy aliases, migration shims, historical fixtures, and compatibility-specific tests are out of scope unless explicitly required by this spec.

Canonical replacement is preferred over preservation. Existing RestoreRunStatus::Aborted and RestoreRunStatus::CompletedWithErrors may remain as current housekeeping semantics, but Spec 364 must not add new compatibility-only restore status aliases.

Testing / Lane / Runtime Impact (mandatory for runtime behavior changes)

  • Test purpose / classification: Unit + Feature/Livewire; Browser only if visible hierarchy changes
  • Validation lane(s): fast-feedback + confidence; browser only if visible hierarchy changes; PostgreSQL only if implementation touches query/index/lock behavior, which is not expected
  • Why this classification and these lanes are sufficient: Unit tests prove proof mapping and fail-closed adapter decisions; Feature tests prove Operations/Restore detail and authorization-safe fallout; one Browser smoke is justified only for changed high-risk visible proof hierarchy.
  • New or expanded test families: focused Spec 364 restore reconciliation tests; no heavy-governance family
  • Fixture / helper cost impact: reuse existing restore, backup set, operation run, evidence, and workspace fixtures; do not widen defaults
  • Heavy-family visibility / justification: no heavy-governance family; browser smoke only if visible hierarchy changes
  • Special surface test profile: shared-detail-family + monitoring-state-page + dangerous-workflow
  • Standard-native relief or required special coverage: not standard relief; restore is high-risk and needs focused proof
  • Reviewer handoff: reviewers must verify no success outcome is produced from preview-only, incomplete, wrong-scope, or missing-verification restore truth
  • Budget / baseline / trend impact: low; bounded Unit/Feature tests and optional one browser smoke
  • Escalation needed: document-in-feature
  • Active feature PR close-out entry: Guardrail / Exception / Smoke Coverage
  • Planned validation commands:
    • cd apps/platform && ./vendor/bin/sail artisan test --compact --filter=Spec364
    • cd apps/platform && ./vendor/bin/sail artisan test --compact --filter=RestoreRun
    • cd apps/platform && ./vendor/bin/sail artisan test --compact --filter=OperationRun
    • cd apps/platform && ./vendor/bin/sail php vendor/bin/pest tests/Browser/Spec364RestoreHighRiskOperationReconciliationSmokeTest.php --compact (run only if browser coverage is added)
    • cd apps/platform && ./vendor/bin/sail pint --dirty

User Scenarios & Testing (mandatory)

User Story 1 - Reconcile Restore Success Only With Complete Proof (Priority: P1)

As an MSP operator reviewing a restore-linked operation, I want restore.execute to become successful only when execution proof, provider acceptance, item result truth, post-run evidence, and audit continuity are present, so that I do not mistake a terminal record for verified recovery.

Why this priority: This prevents the most dangerous false success claim in a tenant-changing flow.

Independent Test: Create restore runs with completed, previewed, partial, failed, and incomplete-proof states; run adapter reconciliation; verify only complete proof produces succeeded.

Acceptance Scenarios:

  1. Given a restore.execute OperationRun linked to a completed RestoreRun with execution proof, provider acceptance, item counts, and post-run evidence, When adapter reconciliation runs, Then the OperationRun is completed with succeeded and proof metadata is stored in context.reconciliation.
  2. Given a linked RestoreRun is only previewed, When adapter reconciliation runs, Then the OperationRun is not marked as successful execution.
  3. Given provider acceptance or item result proof is missing, When reconciliation runs, Then success is withheld and the decision becomes partial, failed, blocked, or not reconciled according to the available proof.
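
The success rule in these scenarios can be sketched as a pure check over proof flags. This is a minimal illustration, not the adapter implementation: the `RestoreProof` type, its field names, and the `"completed"` / `"previewed"` status strings are assumptions chosen for this example, and the sketch assumes same-scope linkage was already verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreProof:
    """Hypothetical proof flags read from a same-scope RestoreRun."""
    restore_status: str
    execution_proof: bool
    provider_acceptance: bool
    item_result_truth: bool
    post_run_evidence: bool
    audit_continuity: bool


def missing_proof(proof: RestoreProof) -> list[str]:
    """Return the names of absent proof elements, sorted for stable output."""
    required = {
        "execution_proof": proof.execution_proof,
        "provider_acceptance": proof.provider_acceptance,
        "item_result_truth": proof.item_result_truth,
        "post_run_evidence": proof.post_run_evidence,
        "audit_continuity": proof.audit_continuity,
    }
    return sorted(name for name, present in required.items() if not present)


def may_mark_succeeded(proof: RestoreProof) -> bool:
    """Success needs a completed (not previewed) restore and every proof element."""
    return proof.restore_status == "completed" and not missing_proof(proof)
```

Returning the list of missing elements, rather than a bare boolean, would let the caller record an exact proof-gap reason when success is withheld.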

User Story 2 - Surface Partial And Verification-Gap Truth Without New Outcomes (Priority: P1)

As an operator reviewing a restore result, I want verification gaps and mixed item outcomes to be visible without creating a misleading new run state, so that I know the next safe action.

Why this priority: Verification gaps are real operator truth, but a new persisted outcome would create avoidable platform complexity.

Independent Test: Reconcile completed-but-unverified and mixed-outcome restore runs and assert existing partially_succeeded, blocked, or failed outcomes plus restore-specific reason metadata and visible copy.

Acceptance Scenarios:

  1. Given a restore mutates tenant state but post-run evidence is unavailable, When reconciliation finalizes, Then the OperationRun outcome is not succeeded; it carries a restore verification-gap reason and a primary next action to review or generate evidence.
  2. Given some restore items succeed and others fail or are skipped, When reconciliation finalizes, Then the outcome is partially_succeeded and summary counts remain flat numeric values.
  3. Given a write gate, provider capability, backup availability, or scope safety blocker prevents meaningful execution, When reconciliation finalizes or fails closed, Then the outcome is blocked or failed with safe reason metadata.
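
One possible mapping from item counts and evidence availability to existing outcomes is sketched below. The reason codes (`restore_mixed_item_results`, `restore_verification_gap`, `restore_no_items_applied`), the count keys, and the choice of `blocked` for blockers and `partially_succeeded` for a verification gap are assumptions made for illustration; the spec allows other existing outcomes for these branches. The final branch assumes the rest of the success proof bundle was checked elsewhere.

```python
def flat_summary(counts):
    """Keep only non-negative integer counts so the summary stays flat and numeric."""
    return {
        key: value
        for key, value in counts.items()
        if isinstance(value, int) and not isinstance(value, bool) and value >= 0
    }


def classify_restore(counts, evidence_available, blocker=None):
    """Map restore truth onto existing outcomes without adding a new one."""
    summary = flat_summary(counts)
    if blocker is not None:
        # A blocker prevented meaningful execution (assumed to map to "blocked").
        return {"outcome": "blocked", "reason": blocker, "summary": summary}

    succeeded = summary.get("succeeded", 0)
    unfinished = summary.get("failed", 0) + summary.get("skipped", 0)

    if succeeded == 0:
        return {"outcome": "failed", "reason": "restore_no_items_applied", "summary": summary}
    if unfinished > 0:
        return {"outcome": "partially_succeeded", "reason": "restore_mixed_item_results", "summary": summary}
    if not evidence_available:
        # Mutation happened but cannot be verified: withhold success.
        return {"outcome": "partially_succeeded", "reason": "restore_verification_gap", "summary": summary}
    return {"outcome": "succeeded", "reason": None, "summary": summary}
```

Nested or non-numeric values are dropped from the summary instead of being flattened, which keeps the sketch from guessing at how detail payloads should be counted.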

User Story 3 - Preserve Scope Safety And Audit Continuity (Priority: P2)

As a platform/support operator, I want restore reconciliation to prove it matched the correct workspace and managed environment and to preserve audit references, so that troubleshooting does not expose or conflate tenant data.

Why this priority: Restore proof is only trustworthy if it is scope-safe and audit-backed.

Independent Test: Attempt reconciliation across wrong workspace/environment restore records and verify no reconciliation occurs; verify same-scope runs include safe audit/proof identifiers only.

Acceptance Scenarios:

  1. Given a restore run from another managed environment has the same ID-like context shape, When reconciliation evaluates the OperationRun, Then the adapter refuses to reconcile it.
  2. Given audit continuity is missing for a high-risk restore execution, When success would otherwise be possible, Then success is withheld or an explicit proof-gap reason is recorded.
  3. Given a user lacks access to the related restore detail, When Operations renders, Then no hidden restore metadata or tenant existence leaks.
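
The wrong-scope refusal in scenario 1 could look roughly like the guard below. The dictionary shape and the `workspace_id` / `managed_environment_id` key names are hypothetical stand-ins for whatever scope identifiers the real records carry.

```python
# Hypothetical scope identifiers; the real column names may differ.
SCOPE_KEYS = ("workspace_id", "managed_environment_id")


def same_scope(operation_run, restore_run):
    """Return True only when both records carry identical, non-null scope identifiers."""
    if restore_run is None:
        return False
    for key in SCOPE_KEYS:
        expected = operation_run.get(key)
        # A missing scope value on either side fails closed.
        if expected is None or restore_run.get(key) != expected:
            return False
    return True
```

A matching record ID alone never passes this guard, which is the property the wrong-scope tests would need to assert.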

User Story 4 - Keep Unsupported High-Risk Restore Families Out Of Scope (Priority: P3)

As a reviewer, I want the spec to explicitly reject new restore operation families and generic high-risk operation machinery, so that implementation stays bounded.

Why this priority: The user draft contains valid future language, but widening this slice would collide with the constitution's anti-bloat rules.

Independent Test: Static or feature assertions prove only restore.execute is registered for Spec 364 restore reconciliation hardening and no restore.verify, restore.rollback.*, or generic high-risk registry is introduced.

Acceptance Scenarios:

  1. Given a run type such as restore.verify or restore.rollback.execute, When Spec 364 reconciliation support is inspected, Then it is unsupported unless a future spec creates repo-real execution truth for it.
  2. Given promotion.execute or AI execution is high-risk, When this implementation is reviewed, Then it remains out of scope and no generic high-risk framework appears.
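
A static assertion of this boundary might be as small as the following sketch. The `SUPPORTED_RUN_TYPES` constant and `supports_reconciliation` helper are hypothetical names, not the existing adapter registry API.

```python
# Only the canonical restore execution type is supported in this slice.
SUPPORTED_RUN_TYPES = frozenset({"restore.execute"})


def supports_reconciliation(run_type):
    """Fail closed: any type not explicitly listed is unsupported."""
    return run_type in SUPPORTED_RUN_TYPES
```

An exact-match allow-list, as opposed to a `restore.*` prefix check, keeps future types such as `restore.verify` unsupported by default.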

Edge Cases

  • A RestoreRun status is previewed; this is pre-execution truth and must not mark restore.execute as successful.
  • A RestoreRun is completed but item outcomes or summary counts are absent; success must be withheld unless proof is sufficient.
  • A restore job writes a terminal failure after a provider exception; failure reason must be sanitized and no raw provider payload may appear in default UI or audit metadata.
  • A restore is blocked by write gate or provider capability; no new execution success may be inferred from a related existing record.
  • A restore has post-run evidence available but it belongs to a different workspace or managed environment; reconciliation must fail closed.
  • A system-run or initiator-null restore context must follow existing OperationRun notification rules and avoid initiator-only terminal DB notifications.

Requirements (mandatory)

Functional Requirements

  • FR-364-001: The system MUST support Spec 364 hardening only for canonical restore.execute in this slice.
  • FR-364-002: The system MUST NOT mark restore.execute as succeeded from RestoreRunStatus::Previewed.
  • FR-364-003: The system MUST NOT mark restore.execute as succeeded from terminal RestoreRun status alone.
  • FR-364-004: The system MUST require a complete success proof bundle before writing OperationRunOutcome::Succeeded for restore execution.
  • FR-364-005: The success proof bundle MUST follow the Success Proof Bundle Matrix and include same-scope RestoreRun linkage, execution proof, provider acceptance or equivalent safe mutation proof, interpretable item or aggregate result truth, post-run evidence or explicit proof availability, and audit continuity.
  • FR-364-006: Missing verification or post-run evidence after mutation MUST NOT produce succeeded; it MUST produce an existing outcome such as partially_succeeded, blocked, or failed with restore-specific reason metadata.
  • FR-364-007: Mixed item results MUST produce partially_succeeded and flat numeric summary counts where counts are available.
  • FR-364-008: Write-gate, provider capability, backup availability, or scope-safety blockers MUST produce blocked or failed when same-scope restore proof exists; they may produce a non-final not_reconciled decision only when the adapter cannot safely identify same-scope proof.
  • FR-364-009: Reconciliation MUST fail closed when the linked RestoreRun is missing, wrong-scope, soft-deleted in a way that invalidates proof, lacks required proof metadata, or lacks required audit continuity.
  • FR-364-010: Reconciliation metadata MUST be safe for audit and operator display: no secrets, no raw provider payloads, no raw credential payloads, and no hidden tenant hints.
  • FR-364-011: The Operations hub and run detail MUST show restore-specific success, partial, blocked, failed, or proof-gap meaning using existing shared OperationRun presentation paths.
  • FR-364-012: Restore Run detail MUST continue to distinguish operation proof from post-run evidence and MUST NOT claim verified recovery when evidence is absent.
  • FR-364-013: Implementation MUST NOT introduce a new OperationRunOutcome, new OperationRunStatus, new persisted restore verification table, or new restore operation type.
  • FR-364-014: Unsupported future restore or high-risk operation types MUST remain unsupported and fail closed unless a future spec provides repo-real execution truth.
  • FR-364-015: Tests MUST prove wrong-workspace and wrong-managed-environment restore records cannot reconcile a run.
  • FR-364-016: Tests MUST prove success, partial, blocked, failed, preview-only, missing-proof, missing-audit, soft-deleted RestoreRun, and wrong-scope branches.

Non-Functional Requirements

  • NFR-364-001: Reconciliation must remain DB-local and must not call Microsoft Graph or any provider API.
  • NFR-364-002: Reconciliation must remain idempotent and service-owned through current OperationRunService paths.
  • NFR-364-003: Default-visible UI must remain calm but not falsely reassuring.
  • NFR-364-004: Summary counts must use existing flat numeric OperationRun summary rules.
  • NFR-364-005: No migration, env var, scheduler, queue family, package, panel provider, or asset registration is expected.
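
The idempotency expectation in NFR-364-002 can be illustrated with an in-memory sketch. It does not model OperationRunService; the `apply_decision` helper and the dictionary shape are assumptions, and only the outcome names and `context.reconciliation` come from this spec.

```python
TERMINAL_OUTCOMES = frozenset({"succeeded", "partially_succeeded", "blocked", "failed"})


def apply_decision(run, decision):
    """Write a terminal decision at most once; return True if the run changed."""
    if run.get("outcome") in TERMINAL_OUTCOMES:
        # Already reconciled: a repeat call is a no-op.
        return False
    if decision["outcome"] not in TERMINAL_OUTCOMES:
        # Non-final decisions such as not_reconciled leave the run untouched.
        return False
    run["outcome"] = decision["outcome"]
    run.setdefault("context", {})["reconciliation"] = {"reason": decision.get("reason")}
    return True
```

Because the function only reads and writes the passed-in record, it also stays DB-local in spirit: nothing in the decision path reaches a provider API.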

Key Entities (include if feature involves data)

  • OperationRun: existing execution and reconciliation truth for restore.execute.
  • RestoreRun: existing restore request/result truth with status, preview, results, metadata, and optional operation link.
  • AuditLog: existing audit trail truth for restore started/failed/executed events.
  • EvidenceSnapshot: optional post-run evidence context where existing links already prove scope-safe evidence.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-364-001: restore.execute reconciliation produces succeeded only for complete-proof fixtures and never for preview-only, missing-proof, missing-audit, soft-deleted, wrong-scope, or mixed-result fixtures.
  • SC-364-002: Focused Spec 364 Unit/Feature tests cover all primary outcome branches and pass in the narrow validation lane.
  • SC-364-003: Existing Operations and Restore detail surfaces present proof gaps without introducing a new route, page family, or customer-facing surface.
  • SC-364-004: No new database table, migration, OperationRun outcome/status, restore operation type, package, or Graph contract is introduced.
  • SC-364-005: Audit/proof metadata contains only safe identifiers, counts, reason codes, and links; no secrets or raw provider payloads appear.
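
SC-364-005 suggests an allow-list approach to reconciliation metadata. The sketch below assumes a hypothetical set of safe keys; the actual key names would come from the existing RestoreRun and OperationRun payloads.

```python
# Hypothetical allow-list of safe identifiers, counts, reason codes, and links.
SAFE_METADATA_KEYS = frozenset(
    {"restore_run_id", "audit_log_id", "reason_code", "summary_counts", "evidence_link"}
)


def safe_metadata(raw):
    """Keep only allow-listed keys so unknown fields never reach audit or UI."""
    return {key: value for key, value in raw.items() if key in SAFE_METADATA_KEYS}
```

An allow-list drops anything unrecognized, so a new provider field is excluded by default instead of needing a matching deny rule.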

Assumptions

  • Existing Restore Run result and metadata payloads already contain enough safe aggregate or item-level truth to distinguish success, partial, blocked, failed, and proof-gap cases; if implementation proves otherwise, success must be withheld rather than inventing a new persisted proof model.
  • Existing post-run evidence links may be unavailable for many restores; this is a partial/proof-gap state, not a success state.
  • Current pre-production posture allows canonical cleanup without historical compatibility shims.

Risks

  • Risk 1 - Existing restore data lacks enough proof for success: mitigate by failing closed and documenting exact missing proof rather than loosening success criteria.
  • Risk 2 - Over-widening into a new restore verification operation: mitigate by forbidding new operation types in this spec and deferring verification execution to a future spec.
  • Risk 3 - UI repeats proof truth in multiple places: mitigate by keeping Restore Run detail presenter and OperationRun presenter aligned and avoiding duplicate default-visible summaries.
  • Risk 4 - Test fixtures become broad and expensive: mitigate by using focused factories and existing helpers without widening global defaults.

Open Questions

No open question blocks preparation. Implementation must verify the exact available RestoreRun metadata keys before deciding whether any small derived helper is needed.

Follow-Up Spec Candidates

  • Restore verification operation family v1, only if a future product decision creates repo-real queued verification and evidence truth.
  • Restore rollback execution truth, only after restore verification semantics exist.
  • Cross-domain high-risk operation framework, only if at least two additional high-risk operation families need the same proof boundary and cannot be handled locally.
  • Customer-safe restore recovery report, only after internal proof semantics are stable.