ahmido 3ce1cae71e feat: implement restore high risk operation reconciliation (#435)

Implemented restore high risk operation reconciliation.

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #435

2026-06-07 14:10:34 +00:00


Feature Specification: Spec 364 - Restore and High-Risk Operation Reconciliation

Feature Branch: 364-restore-high-risk-operation-reconciliation
Created: 2026-06-07
Status: Draft
Type: Restore execution truth hardening / OperationRun reconciliation follow-up / no new persistence
Runtime posture: Tighten restore.execute reconciliation so real restore execution cannot be marked successful from weak terminal signals alone. Reuse existing RestoreRun, OperationRun, audit, proof, and evidence surfaces. Do not create a new restore engine, new operation family, or new persisted verification model.
Input: User-provided Spec 364 draft in /Users/ahmeddarrazi/.codex/attachments/fe416f8b-141a-44eb-ae89-ab62a4691bed/pasted-text.txt, reconciled against current repo truth after Specs 358-363.

Dependencies And Historical Context

This package continues the current OperationRun truth line:

Spec 358 - OperationRun Queue Truth Foundation established honest queued/running stale handling and explicitly deferred business-success reconciliation.
Spec 359 - OperationRun Reconciliation Adapter Framework & Review Compose Adapter added a bounded adapter path and treated restore as an existing adapter precedent.
Spec 360 - OperationRun Canonical Cutover Cleanup canonicalized the adapter registry, context.reconciliation, and dispatch/correlation semantics over the real restore and review-compose cases.
Spec 361 - Report and Evidence Reconciliation Adapters added artifact-backed reconciliation for evidence snapshot and review-pack generation while keeping restore expansion out of scope.
Spec 362 - Sync, Capture, and Backup Operation Semantics added selected-family proof semantics for sync, baseline capture, and backup schedule runs while deferring restore.
Spec 363 - Explicit UiActionContext Contract is already-implemented action-context hardening; it is listed here as dependency context only.

Restore-specific productization already exists and must not be reopened:

Spec 333 - Restore Create UX Final Productization productized the pre-execution restore wizard.
Spec 335 - Restore Run Detail / Post-Execution Proof Productization productized the restore detail proof surface.
Spec 181 - Restore Safety Integrity defines earlier restore safety requirements and remains historical context.

Current repo truth already contains:

App\Support\Operations\Reconciliation\RestoreExecuteReconciliationAdapter
App\Services\AdapterRunReconciler
App\Services\OperationRunService
App\Support\Operations\Reconciliation\ReconciliationResult
App\Support\OperationRunOutcome values succeeded, partially_succeeded, blocked, and failed
App\Support\RestoreRunStatus
App\Jobs\ExecuteRestoreRunJob
App\Listeners\SyncRestoreRunToOperationRun
restore proof/result presentation in RestoreRunDetailPresenter, RestoreSafetyResolver, and restore infolist Blade entries

The user draft's verification_required concept is valid product truth, but repo truth does not justify a new OperationRun outcome or new restore.verify operation type in this slice. Spec 364 represents verification gaps through existing outcomes plus restore-specific reason and evidence metadata.

Spec Candidate Check (mandatory - SPEC-GATE-001)

Problem: restore.execute can still be reconciled from terminal RestoreRun status in a way that risks overclaiming success for high-risk tenant-changing work. The current adapter maps previewed and completed to succeeded without requiring a complete proof bundle that distinguishes execution proof, provider acceptance, item-level result truth, verification evidence, scope safety, and audit continuity.
Today's failure: A stale or interrupted restore operation can become a calm successful OperationRun when the related restore record is terminal, even if post-run evidence is unavailable, item outcomes are partial, provider proof is incomplete, or the restore status represents preview-only or pre-execution truth.
User-visible improvement: Operators inspecting Operations or Restore Run detail will see restore execution truth that is honest by default: succeeded only when execution and proof are complete, partially succeeded when mutation occurred but verification or item truth is incomplete, blocked when safety gates prevent execution, and failed when no safe proof exists.
Smallest enterprise-capable version: Harden exactly restore.execute reconciliation and visible fallout over current repo-real proof paths. Reuse existing OperationRunOutcome, RestoreRunStatus, RestoreRun.results, RestoreRun.metadata, RestoreRun.operation_run_id, audit links, and existing Restore Run detail proof presentation. Do not add new operation types, persistence, Graph contracts, or restore wizard behavior.
Explicit non-goals: No new restore wizard, no new restore engine, no new rollback system, no restore.preview / restore.validate / restore.verify operation types, no new OperationRun outcome, no new table, no new Graph provider, no new Backup model, no new diff algorithm, no UI redesign of Operations or Restore detail, no retry console, no destructive cleanup action, no compatibility shims for pre-production historical rows.
Permanent complexity imported: One bounded restore-proof decision path inside or beside the existing restore reconciliation adapter, focused Unit/Feature/Browser coverage, and small copy/metadata adjustments on existing Operations and Restore detail surfaces. No new persisted truth or cross-domain framework.
Why now: Specs 358-362 intentionally matured OperationRun truth family by family. Restore is the highest-risk remaining adapter family because it can mutate real tenant configuration, so a false success here is more dangerous than a false calm state on read-only report or backup work.
Why not local: A Restore Run detail copy-only fix would not stop stale adapter reconciliation from writing overclaiming OperationRun truth. A job-only fix would not cover late/stale reconciliation. The proof boundary belongs in the shared adapter/service path and must remain visible in current run surfaces.
Approval class: Core Enterprise.
Red flags triggered: high-risk domain semantics, adapter proof hardening, restore-specific reason codes. Defense: this slice narrows the existing restore adapter rather than adding a framework; it uses existing outcomes, existing records, existing surfaces, and fail-closed rules.
Score: Benefit: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 2 | Reuse: 2 | Total: 11/12
Decision: approve.

Candidate Source And Completed-Spec Guardrail

Candidate source:
- direct user-provided Spec 364 draft in pasted-text.txt
- repo-real follow-through after Specs 358-363
- roadmap relationship: Golden Master Governance and restore safety/read-write separation; this is execution-truth hardening rather than a customer portal or productization backlog item
Queue boundary: docs/product/spec-candidates.md records no safe next-best-prep target for automatic selection. This package is an intentional manual promotion from direct user input, not an automatic backlog pick.
Completed-spec check result:
- no specs/364-* package existed before this prep
- no local or remote 364-* branch existed before this prep
- Specs 333, 335, and 358-363 are completed or implementation-context packages and must not be rewritten, normalized, unchecked, or reopened
- completed close-out, validation, smoke, browser, and checked task markers in related specs remain historical evidence
Close alternatives deferred:
- broader restore workflow redesign is deferred because Spec 333 already covers create UX and Spec 335 already covers detail proof productization
- new restore.verify or rollback operation families are deferred until there is an explicit product/runtime source of truth for verification execution
- generic high-risk operation framework is deferred; promotion.execute and AI or operational-control high-risk flows need separate product decisions
- support-desk, customer-review, provider-scope, and canonical-link productization lanes are already specced/active/completed or manual-promotion only
Smallest viable implementation slice: harden restore.execute adapter proof and related Operations/Restore detail presentation only.

Summary

This feature prevents restore execution from being treated as successful merely because a related restore record reached a terminal status.

For restore.execute, the system must prove the intended mutation was accepted, item results are interpretable, verification or post-run evidence is available when success is claimed, scope remained safe, and audit continuity exists. If those signals are incomplete, the operation must finalize as partial, blocked, failed, or not reconciled instead of succeeding.

Success Proof Bundle Matrix

The implementation must use this matrix as the narrow repo-real proof boundary before writing OperationRunOutcome::Succeeded.

Proof element	Repo-real source to inspect	Success threshold	Missing or invalid proof result	Reason code guidance
Same-scope restore linkage	`OperationRun.context.restore_run_id`, `RestoreRun.operation_run_id`, `RestoreRun.workspace_id`, `RestoreRun.managed_environment_id`, default non-trashed `RestoreRun` query	A non-trashed RestoreRun belongs to the same workspace and managed environment as the OperationRun and links back through context or `operation_run_id` without contradiction	Return non-final `not_reconciled` when no same-scope RestoreRun can be safely identified; never use a wrong-scope or trashed RestoreRun as proof	`restore.proof_missing`, `restore.scope_mismatch`, `restore.run_deleted`
Execution proof	`RestoreRun.is_dry_run`, `RestoreRun.status`, `RestoreRun.started_at`, `RestoreRun.completed_at`, linked OperationRun status/outcome	Real execution, not preview/dry-run, reached an execution terminal status with timestamps consistent enough for audit and operator review	`previewed`, dry-run, missing start/completion proof, or terminal status alone cannot produce success	`restore.preview_only`, `restore.execution_proof_missing`
Provider acceptance / mutation proof	Existing restore service output persisted in `RestoreRun.results`, `RestoreRun.failure_reason`, and safe aggregate metadata	Results or metadata show the provider accepted or safely attempted the requested mutation and no provider-level rejection remains	Missing or rejected provider proof with a same-scope RestoreRun finalizes as `failed` or `blocked`; raw provider payloads stay out of reconciliation metadata	`restore.provider_proof_missing`, `restore.provider_rejected`
Item or aggregate result truth	`RestoreRun.results.items`, `RestoreRun.results.assignment_outcomes`, `RestoreRun.metadata.total_items`, `processed_items`, `succeeded_items`, `failed_items`, `skipped_items`, plus existing RestoreRun item helpers	Item or aggregate counts are interpretable, flat numeric where written to OperationRun, and show all required work succeeded	Mixed results produce `partially_succeeded`; absent result truth after execution withholds success	`restore.results_mixed`, `restore.results_missing`
Post-run evidence or explicit proof availability	Existing Restore Run detail evidence path, linked `OperationRun`, existing `EvidenceSnapshot` context when available, and safe metadata flags already persisted by the restore flow	Evidence is available or the existing restore result explicitly proves why recovery proof is available without a new persisted verification model	Missing evidence after mutation produces existing partial/blocked/failed truth with a visible proof-gap reason, not success	`restore.verification_required`, `restore.evidence_missing`
Audit continuity	`AuditLog` rows in the same workspace and managed environment, preferably linked through `operation_run_id` or stable restore action metadata; existing operation terminal audit remains service-owned	Same-scope audit trail can explain start/failure/completion or reconciliation without exposing secrets	Otherwise-complete proof without audit continuity must withhold success and record a safe proof-gap reason	`restore.audit_missing`

not_reconciled is a ReconciliationResult decision, not an OperationRunOutcome. It is valid only when the adapter cannot safely identify enough same-scope restore proof to finalize the run. If a same-scope RestoreRun exists and proves execution reached an unsafe, partial, blocked, failed, or verification-gap state, the adapter must finalize with an existing outcome instead of using not_reconciled to hide operator-visible truth.
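
The matrix and the not_reconciled rule above can be read as one fail-closed decision function. The following is a minimal illustrative sketch in Python, not the repo's PHP adapter. `ProofBundle`, `Decision`, and `decide` are hypothetical names, and the field layout is an assumed flattening of the matrix rows. Where the spec leaves the exact outcome open (for example preview-only or rejected provider proof), the chosen mapping is an assumption and is marked as such in the comments.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    SUCCEEDED = "succeeded"
    PARTIALLY_SUCCEEDED = "partially_succeeded"
    BLOCKED = "blocked"
    FAILED = "failed"
    NOT_RECONCILED = "not_reconciled"  # adapter decision only, never a run outcome


@dataclass(frozen=True)
class ProofBundle:
    """Hypothetical flattened view of the matrix rows; not a repo type."""
    same_scope_restore_found: bool
    is_dry_run: bool
    status: str                        # e.g. "completed", "previewed", "failed"
    has_timestamps: bool               # started_at and completed_at both present
    provider_accepted: Optional[bool]  # None means no provider proof persisted
    succeeded_items: Optional[int]
    failed_items: Optional[int]
    skipped_items: Optional[int]
    evidence_available: bool
    audit_trail_present: bool


def decide(p: ProofBundle) -> Tuple[Decision, Optional[str]]:
    """Return (decision, reason_code); success requires every proof element."""
    if not p.same_scope_restore_found:
        return Decision.NOT_RECONCILED, "restore.proof_missing"
    if p.is_dry_run or p.status == "previewed":
        # Assumption: the spec only says success is withheld; 'blocked' is a guess.
        return Decision.BLOCKED, "restore.preview_only"
    if p.status != "completed" or not p.has_timestamps:
        return Decision.FAILED, "restore.execution_proof_missing"
    if p.provider_accepted is None:
        return Decision.FAILED, "restore.provider_proof_missing"
    if p.provider_accepted is False:
        # The spec allows failed or blocked here; failed is chosen arbitrarily.
        return Decision.FAILED, "restore.provider_rejected"
    # From here on a mutation was accepted, so remaining gaps are partial truth.
    counts = (p.succeeded_items, p.failed_items, p.skipped_items)
    if any(c is None for c in counts) or sum(counts) == 0:
        return Decision.PARTIALLY_SUCCEEDED, "restore.results_missing"
    if p.failed_items or p.skipped_items:
        if p.succeeded_items == 0:
            # The spec names no reason code for this case, so none is invented.
            return Decision.FAILED, None
        return Decision.PARTIALLY_SUCCEEDED, "restore.results_mixed"
    if not p.evidence_available:
        return Decision.PARTIALLY_SUCCEEDED, "restore.evidence_missing"
    if not p.audit_trail_present:
        return Decision.PARTIALLY_SUCCEEDED, "restore.audit_missing"
    return Decision.SUCCEEDED, None


complete = ProofBundle(
    same_scope_restore_found=True, is_dry_run=False, status="completed",
    has_timestamps=True, provider_accepted=True, succeeded_items=3,
    failed_items=0, skipped_items=0, evidence_available=True,
    audit_trail_present=True,
)
print(decide(complete))
print(decide(replace(complete, status="previewed")))
print(decide(replace(complete, evidence_available=False)))
```

The ordering is the design point: scope linkage is checked first so that not_reconciled is only reachable when no same-scope RestoreRun exists, and every later gap finalizes with an existing outcome plus a reason code.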

Business/Product Value

Reduces the risk of false recovery claims after tenant-changing operations.
Makes restore monitoring and restore detail consistent with TenantPilot's read/write separation and audit-first posture.
Keeps the platform sellable as Governance-of-Record by making high-risk mutation truth stricter than report, evidence, sync, or backup truth.

Primary Users / Operators

Tenant/MSP operators who start and review restore execution.
Workspace owners/managers who approve or supervise high-risk changes.
Support/platform operators who troubleshoot restore outcomes through Operations and audit evidence.

Roadmap Relationship

Spec 364 belongs to the OperationRun execution-truth maturity line and the restore safety lane. It is not a new customer-facing workspace, not a new restore product surface, and not a generic high-risk operation framework.

Spec Scope Fields (mandatory)

Scope: canonical-view plus environment-bound restore execution truth
Primary Routes:
- /admin/workspaces/{workspace}/operations
- /admin/workspaces/{workspace}/operations/{run}
- existing environment-scoped Restore Run list/create/detail routes via App\Filament\Resources\RestoreRunResource
Data Ownership:
- operation_runs remain the only execution and reconciliation truth
- restore_runs remain the restore request/result truth
- audit_logs remain audit trail truth
- existing evidence snapshot links remain optional post-run evidence context only when already repo-backed
- no new persistence is introduced
RBAC:
- existing workspace-first OperationRun access remains authoritative
- existing Restore Run policy/resource access remains authoritative
- non-members and wrong-scope actors remain 404
- members missing restore capability remain 403 for restore execution
- no new capability strings are introduced
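
The 404/403 split in the RBAC rules above can be illustrated with a small Python sketch. The function name and boolean inputs are hypothetical; in the repo these decisions belong to the existing policies, which this sketch does not reproduce.

```python
def restore_execute_response(is_member: bool, in_scope: bool, can_restore: bool) -> int:
    """Return the HTTP status an actor would see when starting restore execution.

    Membership and scope are checked before capability, so a non-member or
    wrong-scope actor gets 404 regardless of any capability they hold.
    """
    if not is_member or not in_scope:
        return 404  # deny-as-not-found: do not reveal that the record exists
    if not can_restore:
        return 403  # member in scope, but missing the restore capability
    return 200


print(restore_execute_response(is_member=False, in_scope=True, can_restore=True))
print(restore_execute_response(is_member=True, in_scope=True, can_restore=False))
```

Checking membership and scope first keeps the existence of a restore run hidden from actors outside the workspace or managed environment.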

For canonical-view specs:

Default filter behavior when tenant-context is active: the Operations hub remains workspace-scoped with explicit environment filters. Reconciliation must use the run's stored workspace and managed environment scope, not remembered environment state, current page filters, or Filament tenant fallback.
Explicit entitlement checks preventing cross-tenant leakage: no adapter may reconcile to a restore run outside the operation's workspace and managed environment, and no related link may bypass current scope-safe routes.
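
As a rough illustration of the scope rule above, the lookup below uses only the scope stored on the operation run and never a page filter or remembered environment. It is a Python sketch with hypothetical row types (`OperationRunRow`, `RestoreRunRow`) and a hypothetical helper name; the real adapter would query the existing models instead.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OperationRunRow:
    id: int
    workspace_id: int
    managed_environment_id: int
    context_restore_run_id: Optional[int]  # stands in for context.restore_run_id


@dataclass(frozen=True)
class RestoreRunRow:
    id: int
    workspace_id: int
    managed_environment_id: int
    operation_run_id: Optional[int]
    trashed: bool = False


def find_same_scope_restore(
    run: OperationRunRow, candidates: Iterable[RestoreRunRow]
) -> Optional[RestoreRunRow]:
    """Return the single linked, non-trashed, same-scope restore run, else None."""
    same_scope = [
        r for r in candidates
        if not r.trashed
        and r.workspace_id == run.workspace_id
        and r.managed_environment_id == run.managed_environment_id
    ]
    linked = [
        r for r in same_scope
        if r.id == run.context_restore_run_id or r.operation_run_id == run.id
    ]
    if len(linked) != 1:
        return None  # nothing linked, or context and back-link point at different rows
    match = linked[0]
    if match.operation_run_id not in (None, run.id):
        return None  # the restore run claims a different operation run
    return match


run = OperationRunRow(id=10, workspace_id=1, managed_environment_id=7, context_restore_run_id=55)
ok = RestoreRunRow(id=55, workspace_id=1, managed_environment_id=7, operation_run_id=10)
print(find_same_scope_restore(run, [ok]))
```

Returning None on any contradiction keeps the caller on the not_reconciled path instead of guessing which record is the proof source.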

UI Surface Impact (mandatory - UI-COV-001)

No UI surface impact
Existing page changed
New page/route added
Navigation changed
Filament panel/provider surface changed
New modal/drawer/wizard/action added
New table/form/state added
Customer-facing surface changed
Dangerous action changed
Status/evidence/review presentation changed
Workspace/environment context presentation changed

UI/Productization Coverage (mandatory when UI Surface Impact is not "No UI surface impact")

Route/page/surface:
- App\Filament\Pages\Monitoring\Operations
- App\Filament\Pages\Operations\TenantlessOperationRunViewer
- App\Filament\Resources\OperationRunResource as shared implementation seam
- App\Filament\Resources\RestoreRunResource detail proof surface
- restore execution start/confirmation path only where proof or queued feedback reflects restore.execute
Current or new page archetype: existing Operations monitoring/detail family plus existing environment-bound dangerous restore workflow/detail surfaces
Design depth: Domain Pattern Surface / Manual Review Required for restore proof and dangerous-action truth
Repo-truth level: repo-verified
Existing pattern reused: current OperationRun monitoring family, current Restore Run detail proof model, current Restore Create safety/proof model, current OperationRunLinks, current BADGE-001 status badge semantics
New pattern required: none; the change narrows existing restore proof/reconciliation behavior
Screenshot required: one bounded browser smoke screenshot only if implementation materially changes visible hierarchy; otherwise existing Spec 333/335 screenshot anchors remain sufficient
Page audit required: no new page-report identity unless implementation introduces a materially new visible hierarchy
Customer-safe review required: no customer-facing surface; copy must still avoid false recovery claims
Dangerous-action review required: yes; success wording and execute/verification claims must not overstate tenant recovery
Coverage files updated or explicitly not needed:
- docs/ui-ux-enterprise-audit/route-inventory.md
- docs/ui-ux-enterprise-audit/design-coverage-matrix.md
- docs/ui-ux-enterprise-audit/page-reports/...
- docs/ui-ux-enterprise-audit/strategic-surfaces.md
- docs/ui-ux-enterprise-audit/grouped-follow-up-candidates.md
- docs/ui-ux-enterprise-audit/unresolved-pages.md
- N/A - existing Operations and Restore Run page families already cover these reachable surfaces unless implementation proves visible hierarchy drift
No-impact rationale when applicable: N/A

Cross-Cutting / Shared Pattern Reuse (mandatory)

Cross-cutting feature?: yes
Interaction class(es): status messaging, action links, dangerous-action proof wording, OperationRun reconciliation diagnostics, restore result/proof viewers
Systems touched:
- OperationRunReconciliationRegistry
- RestoreExecuteReconciliationAdapter
- AdapterRunReconciler
- OperationRunService
- ReconciliationResult
- RestoreSafetyResolver
- RestoreRunDetailPresenter
- current Operations and Restore detail renderers
Existing pattern(s) to extend: current adapter reconciliation path, current OperationRun lifecycle service ownership, current restore proof/detail model, current audit trail
Shared contract / presenter / builder / renderer to reuse: OperationRunService::applyReconciliationResult(), OperationRun::reconciliation(), OperationRunLinks, RestoreRunDetailPresenter, RestoreSafetyResolver, BadgeCatalog / BadgeRenderer
Why the existing shared path is sufficient or insufficient: the shared path exists, but the restore adapter's proof bar is too weak for tenant-changing work. It needs stricter restore-specific decision rules, not a new framework.
Allowed deviation and why: a small restore proof evaluator is allowed only if it keeps adapter logic reviewable and stays derived-only over existing records.
Consistency impact: restore success, partial, blocked, failed, and verification-gap wording must match across Operations, run detail, restore detail, notifications where existing, and audit-safe metadata.
Review focus: no new outcome family, no success from previewed, no success from terminal status alone, no raw provider payload in default UI, no bypass of policies or GraphClientInterface.

OperationRun UX Impact (mandatory)

Touches OperationRun start/completion/link UX?: yes, completion/reconciliation and link presentation only
Shared OperationRun UX contract/layer reused: OperationRunService, OperationRunLinks, OperationUxPresenter, current Operations hub/detail surfaces
Delegated start/completion UX behaviors:
- existing restore queued feedback and run links remain on the shared path
- reconciliation finalization remains service-owned
- terminal notifications remain on the current central lifecycle path
Local surface-owned behavior that remains: restore initiation inputs, preview/dry-run controls, confirmation copy, and restore-specific proof detail
Queued DB-notification policy: unchanged; no new queued DB notification policy
Terminal notification path: unchanged central lifecycle mechanism
Exception required?: none

Provider Boundary / Platform Core Check (mandatory)

Shared provider/platform boundary touched?: yes
Boundary classification: mixed
Seams affected: provider-backed restore.execute, write gate, provider-operation start checks, restore result metadata, OperationRun reconciliation metadata
Neutral platform terms preserved or introduced: operation, execution proof, provider acceptance, verification evidence, scope safety, audit trail, managed environment
Provider-specific semantics retained and why: Microsoft/Intune restore behavior remains provider-owned because the current runtime has only Microsoft restore execution. Provider-specific payloads stay inside existing restore/provider services and are not promoted to platform-core taxonomy.
Why this does not deepen provider coupling accidentally: the spec tightens proof criteria around existing restore.execute; it does not create provider-neutral restore abstractions, provider registries, or Graph contract expansion.
Follow-up path: follow-up-spec only if future restore verification becomes a distinct queued operation with repo-real execution and artifact truth.

UI / Surface Guardrail Impact (mandatory)

Surface / Change	Operator-facing surface change?	Native vs Custom	Shared-Family Relevance	State Layers Touched	Exception Needed?	Low-Impact / `N/A` Note
Operations hub restore outcome wording	yes	Native Filament page	shared monitoring family	page, table row	no	existing surface only
Tenantless run detail restore reconciliation explanation	yes	Native Filament page	shared monitoring detail family	detail	no	explanation and proof metadata only
Restore Run detail proof state	yes	Filament infolist plus existing custom Blade entry	restore proof/detail family	detail	no	proof-safe presentation over existing state
Restore execute confirmation/start feedback	yes	Filament action/wizard	dangerous action family	wizard/action	no	no new action; proof semantics may be tightened

Decision-First Surface Role (mandatory)

Surface	Decision Role	Human-in-the-loop Moment	Immediately Visible for First Decision	On-Demand Detail / Evidence	Why This Is Primary or Why Not	Workflow Alignment	Attention-load Reduction
Operations hub	Primary Decision Surface	Decide whether a restore run needs follow-up	lifecycle, outcome, proof gap, one safe next action	full run detail, restore detail, diagnostics	primary because it is the canonical monitoring queue	aligns with operations triage	removes false-success row reading
Tenantless run detail	Tertiary Evidence / Diagnostics Surface	Confirm why restore reconciliation finalized a run	one restore-specific explanation and related restore link	raw context and support diagnostics	tertiary because the run is selected	preserves current detail role	keeps proof reason above raw context
Restore Run detail	Primary Decision Surface	Decide whether recovery proof is available or follow-up is required	result state, reason, impact, proof availability, one primary next action	item outcomes, raw result payload, evidence diagnostics	primary for restore result truth	follows post-execution restore review	separates completion from recovery proof
Restore execute confirmation	Primary Decision Surface	Decide whether real tenant mutation may start	safety gates, preview/check currentness, mutation scope, confirmation	preview details and diagnostics	primary because mutation can alter tenant configuration	follows safe restore execution flow	prevents action before proof review

Audience-Aware Disclosure (mandatory)

Surface	Audience Modes In Scope	Decision-First Default-Visible Content	Operator Diagnostics	Support / Raw Evidence	One Dominant Next Action	Hidden / Gated By Default	Duplicate-Truth Prevention
Operations hub	operator-MSP, support-platform	outcome, proof gap, related restore target	reconciliation reason, summary counts	raw context only in run detail	open restore run or inspect run	raw provider payloads	one row outcome plus one link
Run detail	operator-MSP, support-platform	restore-specific reconciliation explanation	context.reconciliation and related records	raw context and failures secondary	open restore run / inspect proof	raw provider payloads and IDs	explanation references one proof source
Restore detail	operator-MSP, support-platform	recovery proof question, result summary, evidence state	item outcomes, failure family, audit links	raw results collapsed	open operation proof / open evidence / review gap	raw JSON, internal reason ownership	presenter owns result decision once

UI/UX Surface Classification (mandatory)

Surface	Action Surface Class	Surface Type	Likely Next Operator Action	Primary Inspect/Open Model	Row Click	Secondary Actions Placement	Destructive Actions Placement	Canonical Collection Route	Canonical Detail Route	Scope Signals	Canonical Noun	Critical Truth Visible by Default	Exception Type / Justification
Operations hub	List / Workbench	Monitoring queue	inspect a restore needing follow-up	row/detail route	allowed	row/detail secondary links	none introduced	`/admin/workspaces/{workspace}/operations`	`/admin/workspaces/{workspace}/operations/{run}`	workspace and environment	Operations / Operation	restore outcome and proof gap	none
Restore Run detail	Detail / Evidence	Dangerous workflow result	review restore result proof	detail page	N/A	diagnostics and related links after summary	none introduced	Restore Runs list	Restore Run detail	workspace and environment	Restore Run	completion vs recovery proof	none
Restore execute confirmation	Workflow / Dangerous Action	Restore execution gate	confirm or stop restore	wizard step	N/A	proof/diagnostics panels	final confirm step only	Restore Runs list	Restore Run detail after creation	workspace, environment, mutation scope	Restore Run	safety and proof readiness	none

Operator Surface Contract (mandatory)

Surface	Primary Persona	Decision / Operator Action Supported	Surface Type	Primary Operator Question	Default-visible Information	Diagnostics-only Information	Status Dimensions Used	Mutation Scope	Primary Actions	Dangerous Actions
Restore Run detail	Tenant operator / MSP operator	Decide whether recovery proof is available or follow-up is required	Restore result detail	Was this restore executed safely, and is recovery proof available?	result state, reason, impact, operation proof, evidence state, summary counts	item JSON, raw provider diagnostics, raw context	execution outcome, provider acceptance, verification evidence, recovery proof, lifecycle	read-only detail over a prior Microsoft tenant mutation	open operation proof, open evidence, review proof gap	none introduced
Operations run detail	Workspace operator / support operator	Inspect restore-linked operation proof	Operation diagnostics detail	Why did this restore operation finish this way?	lifecycle, outcome, restore reconciliation reason, related restore link	raw run context, failures, support evidence	lifecycle, execution outcome, reconciliation proof	read-only monitoring	open restore run	none introduced
Restore execute confirmation	Tenant operator / MSP operator	Confirm whether real restore execution may start	Dangerous workflow wizard	Can this restore mutate the Microsoft tenant now?	safety gates, preview/check currentness, mutation scope, typed confirmation, proof limits	preview detail, mapping detail, raw diff	readiness, mutation scope, evidence availability	Microsoft tenant when execution proceeds	execute restore after confirmation	execute restore

Proportionality Review (mandatory when structural complexity is introduced)

New source of truth?: no
New persisted entity/table/artifact?: no
New abstraction?: maybe; a small restore proof evaluator may be introduced only if it replaces duplicated adapter/detail proof logic and stays local to restore execution truth
New enum/state/reason family?: no new OperationRun outcome or persisted status family; small restore-specific reason codes such as restore.verification_required may be derived metadata only if they change operator next action
New cross-domain UI framework/taxonomy?: no
Current operator problem: false successful reconciliation for tenant-changing restore execution can make operators believe recovery is proven when only terminal status or partial execution exists.
Existing structure is insufficient because: the current restore adapter maps terminal restore status directly to OperationRun outcome without a strict proof bundle; restore detail proof surfaces already distinguish evidence but the run lifecycle can still overclaim.
Narrowest correct implementation: harden the existing RestoreExecuteReconciliationAdapter and existing presenters to require proof for success and fail closed otherwise.
Ownership cost: a small set of restore proof rules and focused tests that future restore changes must honor.
Alternative intentionally rejected: adding verification_required as a new OperationRun outcome or building a new restore verification operation family is rejected because the current repo has no corresponding execution truth.
Release truth: current-release truth; restore execution exists and is high-risk now.

Compatibility Posture

This feature assumes a pre-production environment.

Backward compatibility, legacy aliases, migration shims, historical fixtures, and compatibility-specific tests are out of scope unless explicitly required by this spec.

Canonical replacement is preferred over preservation. Existing RestoreRunStatus::Aborted and RestoreRunStatus::CompletedWithErrors may remain as current housekeeping semantics, but Spec 364 must not add new compatibility-only restore status aliases.

Testing / Lane / Runtime Impact (mandatory for runtime behavior changes)

Test purpose / classification: Unit + Feature/Livewire; Browser only if visible hierarchy changes
Validation lane(s): fast-feedback + confidence; browser only if visible hierarchy changes; PostgreSQL only if implementation touches query/index/lock behavior, which is not expected
Why this classification and these lanes are sufficient: Unit tests prove proof mapping and fail-closed adapter decisions; Feature tests prove Operations/Restore detail and authorization-safe fallout; one Browser smoke is justified only for changed high-risk visible proof hierarchy.
New or expanded test families: focused Spec 364 restore reconciliation tests; no heavy-governance family
Fixture / helper cost impact: reuse existing restore, backup set, operation run, evidence, and workspace fixtures; do not widen defaults
Heavy-family visibility / justification: no heavy-governance family; browser smoke only if visible hierarchy changes
Special surface test profile: shared-detail-family + monitoring-state-page + dangerous-workflow
Standard-native relief or required special coverage: not standard relief; restore is high-risk and needs focused proof
Reviewer handoff: reviewers must verify no success outcome is produced from preview-only, incomplete, wrong-scope, or missing-verification restore truth
Budget / baseline / trend impact: low; bounded Unit/Feature tests and optional one browser smoke
Escalation needed: document-in-feature
Active feature PR close-out entry: Guardrail / Exception / Smoke Coverage
Planned validation commands:
- cd apps/platform && ./vendor/bin/sail artisan test --compact --filter=Spec364
- cd apps/platform && ./vendor/bin/sail artisan test --compact --filter=RestoreRun
- cd apps/platform && ./vendor/bin/sail artisan test --compact --filter=OperationRun
- cd apps/platform && ./vendor/bin/sail php vendor/bin/pest tests/Browser/Spec364RestoreHighRiskOperationReconciliationSmokeTest.php --compact (only if browser coverage is added)
- cd apps/platform && ./vendor/bin/sail pint --dirty

User Scenarios & Testing (mandatory)

User Story 1 - Reconcile Restore Success Only With Complete Proof (Priority: P1)

As an MSP operator reviewing a restore-linked operation, I want restore.execute to become successful only when execution proof, provider acceptance, item result truth, post-run evidence, and audit continuity are present, so that I do not mistake a terminal record for verified recovery.

Why this priority: This prevents the most dangerous false success claim in a tenant-changing flow.

Independent Test: Create restore runs with completed, previewed, partial, failed, and incomplete-proof states; run adapter reconciliation; verify only complete proof produces succeeded.

Acceptance Scenarios:

Given a restore.execute OperationRun linked to a completed RestoreRun with execution proof, provider acceptance, item counts, and post-run evidence, When adapter reconciliation runs, Then the OperationRun is completed with succeeded and proof metadata is stored in context.reconciliation.
Given a linked RestoreRun is only previewed, When adapter reconciliation runs, Then the OperationRun is not marked as successful execution.
Given provider acceptance or item result proof is missing, When reconciliation runs, Then success is withheld and the decision becomes partial, failed, blocked, or not reconciled according to the available proof.
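
The success gate in these scenarios can be sketched as a small decision helper. This is a minimal illustration in Python, not the platform's implementation: the class, field, and function names are hypothetical, and the flags are assumed to be derived from existing RestoreRun, audit, and evidence records. It only answers whether succeeded is permitted; choosing between partial, failed, blocked, or not reconciled is left to the adapter.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RestoreProofBundle:
    # Hypothetical flags, assumed to come from DB-local records only.
    same_scope_restore_run: bool
    execution_proof: bool
    provider_acceptance: bool
    item_result_truth: bool
    post_run_evidence: bool
    audit_continuity: bool


def missing_proof(bundle):
    """Return the names of absent proof elements, in field order."""
    return [f.name for f in fields(bundle) if not getattr(bundle, f.name)]


def may_mark_succeeded(restore_status, bundle):
    """Success needs a completed RestoreRun and a gap-free proof bundle."""
    if restore_status != "completed":
        return False  # "previewed" is pre-execution truth, never success
    return not missing_proof(bundle)
```

A caller could record the result of missing_proof as reason metadata when success is withheld, so the operator sees which proof element is absent.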

User Story 2 - Surface Partial And Verification-Gap Truth Without New Outcomes (Priority: P1)

As an operator reviewing a restore result, I want verification gaps and mixed item outcomes to be visible without creating a misleading new run state, so that I know the next safe action.

Why this priority: Verification gaps are real operator truth, but a new persisted outcome would create avoidable platform complexity.

Independent Test: Reconcile completed-but-unverified and mixed-outcome restore runs and assert existing partially_succeeded, blocked, or failed outcomes plus restore-specific reason metadata and visible copy.

Acceptance Scenarios:

Given a restore mutates tenant state but post-run evidence is unavailable, When reconciliation finalizes, Then the OperationRun outcome is not succeeded; it carries a restore verification-gap reason and a primary next action to review or generate evidence.
Given some restore items succeed and others fail or are skipped, When reconciliation finalizes, Then the outcome is partially_succeeded and summary counts remain flat numeric values.
Given a write gate, provider capability, backup availability, or scope safety blocker prevents meaningful execution, When reconciliation finalizes or fails closed, Then the outcome is blocked or failed with safe reason metadata.
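
The item-outcome scenarios above can be illustrated with a short Python sketch. It is written under stated assumptions: the status strings, reason codes, and function names are hypothetical; the choice of partially_succeeded for a verification gap is one of the outcomes the spec allows, not a confirmed decision; and every other element of the proof bundle is assumed to have passed already.

```python
from collections import Counter


def summarize_items(item_statuses):
    """Collapse per-item statuses into flat numeric summary counts."""
    counts = Counter(item_statuses)
    return {
        "total": len(item_statuses),
        "succeeded": counts.get("succeeded", 0),
        "failed": counts.get("failed", 0),
        "skipped": counts.get("skipped", 0),
    }


def classify(item_statuses, has_post_run_evidence):
    """Return (outcome, reason_code, counts) using existing outcomes only."""
    counts = summarize_items(item_statuses)
    if counts["total"] == 0:
        # No interpretable item truth: withhold any final outcome.
        return "not_reconciled", "missing_item_results", counts
    if counts["succeeded"] == counts["total"]:
        if has_post_run_evidence:
            return "succeeded", None, counts
        # Assumption: a verification gap maps to partially_succeeded.
        return "partially_succeeded", "restore_verification_gap", counts
    if counts["succeeded"] == 0:
        # Assumption: nothing restored is treated as failed.
        return "failed", "no_items_restored", counts
    return "partially_succeeded", "mixed_item_results", counts
```

The counts dictionary stays flat and numeric, so it can be stored as the run summary without nesting per-item detail.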

User Story 3 - Preserve Scope Safety And Audit Continuity (Priority: P2)

As a platform/support operator, I want restore reconciliation to prove it matched the correct workspace and managed environment and to preserve audit references, so that troubleshooting does not expose or conflate tenant data.

Why this priority: Restore proof is only trustworthy if it is scope-safe and audit-backed.

Independent Test: Attempt reconciliation across wrong workspace/environment restore records and verify no reconciliation occurs; verify same-scope runs include safe audit/proof identifiers only.

Acceptance Scenarios:

Given a restore run from another managed environment has the same ID-like context shape, When reconciliation evaluates the OperationRun, Then the adapter refuses to reconcile it.
Given audit continuity is missing for a high-risk restore execution, When success would otherwise be possible, Then success is withheld or an explicit proof-gap reason is recorded.
Given a user lacks access to the related restore detail, When Operations renders, Then no hidden restore metadata or tenant existence leaks.
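
A scope-safe lookup for the scenarios above might look like the following Python sketch. The record shape and function name are hypothetical, and treating every soft-deleted record as invalid proof is a simplifying assumption. The point is that an ID match alone never resolves a restore record; workspace and managed environment must match as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RestoreRecord:
    # Hypothetical minimal shape of a restore record.
    id: int
    workspace_id: int
    managed_environment_id: int
    deleted: bool = False


def find_same_scope_restore(
    records, restore_run_id, workspace_id, managed_environment_id
) -> Optional[RestoreRecord]:
    """Return the restore record only when ID and full scope both match."""
    for record in records:
        if (
            record.id == restore_run_id
            and record.workspace_id == workspace_id
            and record.managed_environment_id == managed_environment_id
            and not record.deleted  # assumption: soft-deleted proof is invalid
        ):
            return record
    return None  # fail closed: the caller must not reconcile
```

Returning None for both "not found" and "wrong scope" keeps the caller from learning whether a record exists in another tenant.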

User Story 4 - Keep Unsupported High-Risk Restore Families Out Of Scope (Priority: P3)

As a reviewer, I want the spec to explicitly reject new restore operation families and generic high-risk operation machinery, so that implementation stays bounded.

Why this priority: The user draft contains valid future language, but widening this slice would collide with the constitution's anti-bloat rules.

Independent Test: Static or feature assertions prove only restore.execute is registered for Spec 364 restore reconciliation hardening and no restore.verify, restore.rollback.*, or generic high-risk registry is introduced.

Acceptance Scenarios:

Given a run type such as restore.verify or restore.rollback.execute, When Spec 364 reconciliation support is inspected, Then it is unsupported unless a future spec creates repo-real execution truth for it.
Given promotion.execute or AI execution is high-risk, When this implementation is reviewed, Then it remains out of scope and no generic high-risk framework appears.
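
The static assertion described in the Independent Test for this story could be approximated as follows. This is a Python sketch under assumptions: the list of registered types is passed in by the caller, and how the real adapter registry exposes its types is not specified here.

```python
ALLOWED_RESTORE_TYPES = frozenset({"restore.execute"})


def unexpected_restore_types(registered_types):
    """List registered restore.* types that Spec 364 does not allow."""
    return sorted(
        t
        for t in registered_types
        if t.startswith("restore.") and t not in ALLOWED_RESTORE_TYPES
    )


def is_supported(operation_type):
    """Fail closed: anything outside the allow-list is unsupported."""
    return operation_type in ALLOWED_RESTORE_TYPES
```

A guard test would assert that unexpected_restore_types returns an empty list for the real registry, so a later restore.verify or restore.rollback.* registration fails the suite instead of slipping in silently.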

Edge Cases

A RestoreRun status is previewed; this is pre-execution truth and must not mark restore.execute as successful.
A RestoreRun is completed but item outcomes or summary counts are absent; success must be withheld unless proof is sufficient.
A restore job writes a terminal failure after a provider exception; failure reason must be sanitized and no raw provider payload may appear in default UI or audit metadata.
A restore is blocked by write gate or provider capability; no new execution success may be inferred from a related existing record.
A restore has post-run evidence available but it belongs to a different workspace or managed environment; reconciliation must fail closed.
A system-run or initiator-null restore context must follow existing OperationRun notification rules and avoid initiator-only terminal DB notifications.

Requirements (mandatory)

Functional Requirements

FR-364-001: The system MUST support Spec 364 hardening only for canonical restore.execute in this slice.
FR-364-002: The system MUST NOT mark restore.execute as succeeded from RestoreRunStatus::Previewed.
FR-364-003: The system MUST NOT mark restore.execute as succeeded from terminal RestoreRun status alone.
FR-364-004: The system MUST require a complete success proof bundle before writing OperationRunOutcome::Succeeded for restore execution.
FR-364-005: The success proof bundle MUST follow the Success Proof Bundle Matrix and include same-scope RestoreRun linkage, execution proof, provider acceptance or equivalent safe mutation proof, interpretable item or aggregate result truth, post-run evidence or explicit proof availability, and audit continuity.
FR-364-006: Missing verification or post-run evidence after mutation MUST NOT produce succeeded; it MUST produce an existing outcome such as partially_succeeded, blocked, or failed with restore-specific reason metadata.
FR-364-007: Mixed item results MUST produce partially_succeeded and flat numeric summary counts where counts are available.
FR-364-008: Write-gate, provider capability, backup availability, or scope-safety blockers MUST produce blocked or failed when same-scope restore proof exists; they MAY produce a non-final not_reconciled decision only when the adapter cannot safely identify same-scope proof.
FR-364-009: Reconciliation MUST fail closed when the linked RestoreRun is missing, wrong-scope, soft-deleted in a way that invalidates proof, lacks required proof metadata, or lacks required audit continuity.
FR-364-010: Reconciliation metadata MUST be safe for audit and operator display: no secrets, no raw provider payloads, no raw credential payloads, and no hidden tenant hints.
FR-364-011: The Operations hub and run detail MUST show restore-specific success, partial, blocked, failed, or proof-gap meaning using existing shared OperationRun presentation paths.
FR-364-012: Restore Run detail MUST continue to distinguish operation proof from post-run evidence and MUST NOT claim verified recovery when evidence is absent.
FR-364-013: Implementation MUST NOT introduce a new OperationRunOutcome, new OperationRunStatus, new persisted restore verification table, or new restore operation type.
FR-364-014: Unsupported future restore or high-risk operation types MUST remain unsupported and fail closed unless a future spec provides repo-real execution truth.
FR-364-015: Tests MUST prove wrong-workspace and wrong-managed-environment restore records cannot reconcile a run.
FR-364-016: Tests MUST prove success, partial, blocked, failed, preview-only, missing-proof, missing-audit, soft-deleted RestoreRun, and wrong-scope branches.
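
One way to approach FR-364-010 is an allow-list filter applied before reconciliation metadata is stored or displayed. The Python sketch below is illustrative only: the key names, the length limit, and the function names are assumptions, not the platform's actual metadata contract.

```python
# Hypothetical allow-list; real keys depend on the existing metadata contract.
SAFE_KEYS = frozenset(
    {"reason_code", "restore_run_id", "audit_log_id", "summary_counts"}
)
MAX_TEXT_LENGTH = 64  # assumed cap to keep raw payloads out of text fields


def _is_count(value):
    """True for non-negative integers, excluding booleans."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def safe_reconciliation_metadata(raw):
    """Keep only allow-listed identifiers, reason codes, and flat counts."""
    safe = {}
    for key, value in raw.items():
        if key not in SAFE_KEYS:
            continue  # unknown keys (e.g. provider payloads) are dropped
        if key == "summary_counts":
            if isinstance(value, dict):
                safe[key] = {k: v for k, v in value.items() if _is_count(v)}
        elif _is_count(value) or (
            isinstance(value, str) and len(value) <= MAX_TEXT_LENGTH
        ):
            safe[key] = value
    return safe
```

An allow-list is used rather than a deny-list so that a new, unreviewed key is excluded by default.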

Non-Functional Requirements

NFR-364-001: Reconciliation MUST remain DB-local and MUST NOT call Microsoft Graph or any provider API.
NFR-364-002: Reconciliation MUST remain idempotent and service-owned through current OperationRunService paths.
NFR-364-003: Default-visible UI MUST remain calm but not falsely reassuring.
NFR-364-004: Summary counts MUST use existing flat numeric OperationRun summary rules.
NFR-364-005: No migration, env var, scheduler, queue family, package, panel provider, or asset registration is expected.
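
The idempotency expectation in NFR-364-002 can be illustrated with a small Python sketch. It assumes a run is represented as a plain dictionary and that an already-finalized run is never rewritten; the function name and the keys stored under context.reconciliation are hypothetical, and the real behavior is owned by OperationRunService.

```python
def apply_reconciliation(run, outcome, reason_code):
    """Finalize a run once; repeat calls leave it unchanged.

    Returns True when the run was finalized by this call.
    """
    if run.get("outcome") is not None:
        return False  # already finalized: a repeat call is a no-op
    run["status"] = "completed"
    run["outcome"] = outcome
    context = dict(run.get("context") or {})
    context["reconciliation"] = {"reason_code": reason_code}
    run["context"] = context
    return True
```

Because the function works only on data already in hand, it also stays DB-local in the sense of NFR-364-001: no provider call is needed to decide whether to write.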

Key Entities (include if feature involves data)

OperationRun: existing execution and reconciliation truth for restore.execute.
RestoreRun: existing restore request/result truth with status, preview, results, metadata, and optional operation link.
AuditLog: existing audit trail truth for restore started/failed/executed events.
EvidenceSnapshot: optional post-run evidence context where existing links already prove scope-safe evidence.

Success Criteria (mandatory)

Measurable Outcomes

SC-364-001: restore.execute reconciliation produces succeeded only for complete-proof fixtures and never for preview-only, missing-proof, missing-audit, soft-deleted, wrong-scope, or mixed-result fixtures.
SC-364-002: Focused Spec 364 Unit/Feature tests cover all primary outcome branches and pass in the narrow validation lane.
SC-364-003: Existing Operations and Restore detail surfaces present proof gaps without introducing a new route, page family, or customer-facing surface.
SC-364-004: No new database table, migration, OperationRun outcome/status, restore operation type, package, or Graph contract is introduced.
SC-364-005: Audit/proof metadata contains only safe identifiers, counts, reason codes, and links; no secrets or raw provider payloads appear.

Assumptions

Existing Restore Run result and metadata payloads already contain enough safe aggregate or item-level truth to distinguish success, partial, blocked, failed, and proof-gap cases; if implementation proves otherwise, success must be withheld rather than inventing a new persisted proof model.
Existing post-run evidence links may be unavailable for many restores; this is a partial/proof-gap state, not a success state.
Current pre-production posture allows canonical cleanup without historical compatibility shims.

Risks

Risk 1 - Existing restore data lacks enough proof for success: mitigate by failing closed and documenting exact missing proof rather than loosening success criteria.
Risk 2 - Over-widening into a new restore verification operation: mitigate by forbidding new operation types in this spec and deferring verification execution to a future spec.
Risk 3 - UI repeats proof truth in multiple places: mitigate by keeping Restore Run detail presenter and OperationRun presenter aligned and avoiding duplicate default-visible summaries.
Risk 4 - Test fixtures become broad and expensive: mitigate by using focused factories and existing helpers without widening global defaults.

Open Questions

No open question blocks preparation. Implementation must verify the exact available RestoreRun metadata keys before deciding whether any small derived helper is needed.

Follow-Up Spec Candidates

Restore verification operation family v1, only if a future product decision creates repo-real queued verification and evidence truth.
Restore rollback execution truth, only after restore verification semantics exist.
Cross-domain high-risk operation framework, only if at least two additional high-risk operation families need the same proof boundary and cannot be handled locally.
Customer-safe restore recovery report, only after internal proof semantics are stable.
