Phase 0 Research: Queued Execution Reauthorization and Scope Continuity
Decision: Extend the existing OperationRun and queue middleware seam instead of creating a second execution framework
Rationale: The repo already has the right core primitives for observability and queue orchestration: OperationRunService, TrackOperationRun, ProviderOperationStartGate, blocked outcome semantics, and sanitized terminal audit handling. The missing piece is not an entirely new framework but a canonical execution-legitimacy check that runs before jobs start doing real work.
Alternatives considered:
- Create a separate execution-orchestration subsystem just for reauthorization: rejected because it would duplicate OperationRun lifecycle ownership and make the queue path harder to reason about.
- Keep adding local checks inside individual jobs: rejected because that is the exact drift pattern this feature is supposed to eliminate.
Decision: Reuse OperationRunOutcome::Blocked as the canonical execution-denial outcome
Rationale: OperationRunOutcome already includes blocked, and OperationRunService::finalizeBlockedRun() already writes sanitized blocked outcomes with reason codes, next steps, terminal audit, and normal terminal notification behavior. Reusing that vocabulary keeps Monitoring and operator language consistent.
Alternatives considered:
- Add a new denied terminal outcome: rejected because it would fork existing outcome semantics and badge behavior without a strong product need.
- Represent execution-time denial as failed: rejected because the spec explicitly requires a clear distinction between intentional trust-policy refusal and ordinary runtime failure.
Decision: Human-initiated queued runs remain actor-bound; scheduled runs remain system-authority runs
Rationale: The architecture audit raised the core identity question directly. For human-initiated work, the safest and most comprehensible rule is that authority must still belong to the initiating actor when the job begins. For scheduled or initiator-null work, the system must act under explicit system authority and current tenant operability rather than pretending a user still owns the action.
Alternatives considered:
- Convert all queued jobs into system-owned authority after dispatch: rejected because that would silently broaden authority and weaken audit meaning.
- Freeze the dispatch-time actor snapshot as permanent authority: rejected because that preserves the stale-legitimacy gap this spec is trying to close.
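The actor-bound versus system-authority rule above can be sketched as a small start-time resolver. This is an illustrative, language-neutral sketch, not the repo's implementation: `ExecutionAuthority`, `resolve_authority`, and the `actor_still_authorized` callback are hypothetical names, and applying the tenant-operability gate to both run types is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ExecutionAuthority:
    """Hypothetical result type: whose authority the run executes under."""
    mode: str                # "actor" or "system"
    actor_id: Optional[int]  # None for system-authority runs
    allowed: bool


def resolve_authority(
    initiator_id: Optional[int],
    actor_still_authorized: Callable[[int], bool],
    tenant_operable: bool,
) -> ExecutionAuthority:
    """Re-evaluate authority when the job starts, not when it was dispatched."""
    if initiator_id is None:
        # Scheduled or initiator-null run: explicit system authority,
        # gated only on current tenant operability.
        return ExecutionAuthority("system", None, tenant_operable)

    # Human-initiated run: the initiating actor must still hold authority now.
    # (Also requiring tenant operability here is an assumption of this sketch.)
    allowed = tenant_operable and actor_still_authorized(initiator_id)
    return ExecutionAuthority("actor", initiator_id, allowed)
```

Note that a human-initiated run whose actor has lost access is denied rather than silently promoted to system authority, which is the behavior the first rejected alternative would have allowed.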
Decision: Put the legitimacy check before TrackOperationRun marks runs as running
Rationale: TrackOperationRun currently transitions the run to running before the job body executes. For this feature, that is too late and too optimistic. A blocked-at-execution job should fail closed before side effects and before Monitoring treats it as an active operation.
Alternatives considered:
- Leave TrackOperationRun as-is and block inside each job body: rejected because jobs would already look like active execution and the ordering would vary by job.
- Mark the run running first and immediately block it afterward: rejected because it creates misleading transient truth in Monitoring and leaves room for side effects to start too early.
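The ordering this decision asks for, gate first and track second, can be illustrated with a minimal middleware-style sketch. The `Run` class, its status strings, and the `handle` function are hypothetical stand-ins for illustration only; they do not mirror the actual TrackOperationRun or OperationRunService signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Run:
    """Hypothetical stand-in for a tracked operation run."""
    status: str = "queued"
    outcome: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def transition(self, status: str) -> None:
        self.status = status
        self.history.append(status)


def handle(
    run: Run,
    is_legitimate: Callable[[], bool],
    job_body: Callable[[], None],
) -> None:
    """Run the legitimacy check before the run is ever marked running."""
    if not is_legitimate():
        # Fail closed: no "running" state is recorded and the job body never executes.
        run.outcome = "blocked"
        run.transition("completed")
        return

    run.transition("running")
    job_body()
    run.outcome = "succeeded"
    run.transition("completed")
```

In this sketch a denied run goes straight from queued to a terminal blocked outcome, so its history never contains a running entry that Monitoring could observe.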
Decision: Reuse TenantOperabilityService for tenant-state truth, but add an execution-oriented decision seam
Rationale: Tenant operability is already centralized for selector, route, and lifecycle-safe action semantics. The queue execution path should not reintroduce raw lifecycle checks. At the same time, the existing lanes and questions do not directly represent queued execution, so the plan should extend the central seam with an execution-oriented question or adjacent support primitive.
Alternatives considered:
- Hardcode tenant lifecycle checks inside jobs: rejected because it recreates the same drift pattern that Specs 143, 144, and 148 are reducing elsewhere.
- Ignore tenant operability and only re-check capability: rejected because archived, discarded, or otherwise non-operable tenants are a distinct class of invalid execution.
Decision: Treat execution-prerequisite failures separately from capability or membership loss, but still fail closed before work
Rationale: The feature needs structured denial reasons, not just a boolean. Existing code already distinguishes provider-configuration blocks and write-gate failures. The execution contract should preserve that distinction so operators can tell the difference between authorization loss, tenant non-operability, and prerequisite invalidity.
Alternatives considered:
- Collapse every denied start into one generic blocked reason: rejected because the spec requires operator and audit clarity.
- Treat prerequisite failures as retryable by default: rejected because some prerequisite failures are deterministic policy blocks and should be terminal until state changes.
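One way to picture a structured denial, as opposed to a boolean, is a decision value that carries a denial class and a reason code. The sketch below is an assumption-laden illustration: `DenialClass`, `ExecutionDecision`, `decide`, the reason-code strings, and the order of the checks are all hypothetical and not taken from the repo.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DenialClass(Enum):
    """The three denial families the rationale distinguishes."""
    AUTHORIZATION_LOST = "authorization_lost"
    TENANT_NOT_OPERABLE = "tenant_not_operable"
    PREREQUISITE_INVALID = "prerequisite_invalid"


@dataclass(frozen=True)
class ExecutionDecision:
    allowed: bool
    denial_class: Optional[DenialClass] = None
    reason_code: Optional[str] = None  # hypothetical codes, for illustration
    retryable: bool = False            # denials are terminal unless marked otherwise


def decide(
    actor_authorized: bool,
    tenant_operable: bool,
    prerequisites_ok: bool,
) -> ExecutionDecision:
    """Return the first applicable denial; the check order is an assumption."""
    if not actor_authorized:
        return ExecutionDecision(False, DenialClass.AUTHORIZATION_LOST, "actor_not_authorized")
    if not tenant_operable:
        return ExecutionDecision(False, DenialClass.TENANT_NOT_OPERABLE, "tenant_not_operable")
    if not prerequisites_ok:
        return ExecutionDecision(False, DenialClass.PREREQUISITE_INVALID, "prerequisite_invalid")
    return ExecutionDecision(True)
```

Because `retryable` defaults to `False`, a prerequisite failure stays terminal unless a caller explicitly opts in, matching the rejection of retry-by-default above.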
Decision: Scope the first implementation slice to representative queued job families, not every queued job in the repo
Rationale: The repo has dozens of ShouldQueue jobs. Planning all of them as day-one adopters would produce a vague plan and stall execution. The feature needs one shared contract plus enough representative adoption to prove it works across provider-backed operations, restore or write jobs, inventory or sync jobs, and bulk orchestrators.
Alternatives considered:
- Attempt repo-wide queue adoption in one slice: rejected because it is too large for a focused hardening feature.
- Apply the contract to one provider job only: rejected because that would leave the architecture mostly unchanged while claiming closure.
Decision: Preserve existing external routes and keep the first slice schema-free
Rationale: This feature is an internal execution-hardening change. Existing Filament and Monitoring routes remain the same, and the required new metadata can live in OperationRun.context and failure payloads. That keeps the first slice focused on behavior, not API or persistence churn.
Alternatives considered:
- Introduce new routes or a separate operations API just for execution legitimacy: rejected because the feature does not require a new operator flow.
- Add dedicated persistence tables for denial state: rejected because existing OperationRun and AuditLog structures already provide the right observability foundation.
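As a rough illustration of the schema-free approach, denial metadata could sit under a single key in the run's existing context payload. The helper name `with_execution_denial`, the `execution_legitimacy` key, and its fields are hypothetical; the real OperationRun.context shape is not described in this document.

```python
import json
from typing import Any, Dict


def with_execution_denial(
    context: Dict[str, Any],
    denial_class: str,
    reason_code: str,
) -> Dict[str, Any]:
    """Return a copy of the context with denial metadata under one key."""
    updated = dict(context)  # shallow copy; the caller's dict is left untouched
    updated["execution_legitimacy"] = {  # hypothetical key name
        "allowed": False,
        "denial_class": denial_class,
        "reason_code": reason_code,
    }
    return updated


if __name__ == "__main__":
    ctx = with_execution_denial(
        {"operation": "sync"}, "tenant_not_operable", "tenant_archived"
    )
    # The payload stays plain JSON, so no new column or table is needed.
    print(json.dumps(ctx, indent=2))
```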