Research — 094 Assignment Operations Observability Hardening
This document resolves implementation-relevant questions raised by the spec and records the key decisions.
Decision 1 — Operation run identity + dedupe
- Decision: Deduplicate runs per tenant + operation type + operation target/scope.
  - Fetch target/scope: backup item identifier (or equivalent policy-version identifier).
  - Restore target/scope: restore run identifier.
- Rationale: Prevents operator confusion while allowing independent operations to proceed in parallel.
- Alternatives considered:
  - Dedupe per tenant only → collapses unrelated runs and obscures what is actually running.
  - No dedupe → increases confusion and makes tests non-deterministic.
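The identity rule in Decision 1 can be sketched as a composite key. This is a minimal illustration only, not the project's implementation: `RunIdentity`, `RunRegistry`, and the `"assignments.fetch"` type string are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunIdentity:
    """Hypothetical dedupe key: tenant + operation type + target/scope."""
    tenant_id: str
    operation_type: str  # e.g. "assignments.fetch" (assumed naming)
    target: str          # backup item id for fetch, restore run id for restore


class RunRegistry:
    """In-memory stand-in for whatever store holds active runs."""

    def __init__(self) -> None:
        self._active: dict[RunIdentity, int] = {}
        self._next_id = 1

    def start_or_reuse(self, identity: RunIdentity) -> int:
        # Same identity -> reuse the active run; different identity -> new run.
        if identity not in self._active:
            self._active[identity] = self._next_id
            self._next_id += 1
        return self._active[identity]


registry = RunRegistry()
a = registry.start_or_reuse(RunIdentity("tenant-1", "assignments.fetch", "backup-item-10"))
b = registry.start_or_reuse(RunIdentity("tenant-1", "assignments.fetch", "backup-item-10"))
c = registry.start_or_reuse(RunIdentity("tenant-1", "assignments.fetch", "backup-item-11"))
```

Here `a` and `b` refer to the same run because all three key parts match, while `c` is an independent run that can proceed in parallel.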
Decision 2 — How jobs become OperationRun-tracked
- Decision: Prefer passing an existing operation run identifier into queued jobs when a user-triggered start surface exists; otherwise the job must create/reuse a canonical run using the standard service.
- Rationale: Start surfaces should remain enqueue-only while ensuring a single umbrella run record exists for Monitoring.
- Alternatives considered:
  - Create runs only inside jobs → makes it harder to link UI initiation to the run and can lead to duplicate runs.
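A rough sketch of the two paths in Decision 2, assuming a hypothetical `RunService.ensure_run` helper in place of the standard service (whose real API is not described here). The job uses the run id it was handed if one exists, and otherwise creates or reuses the canonical run itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunService:
    """Hypothetical stand-in for the standard run service."""
    runs: dict = field(default_factory=dict)  # identity tuple -> run id

    def ensure_run(self, tenant_id: str, op_type: str, target: str) -> int:
        key = (tenant_id, op_type, target)
        # Reuse the canonical run for this identity, or create it once.
        return self.runs.setdefault(key, len(self.runs) + 1)


def handle_fetch_job(service: RunService, tenant_id: str, backup_item_id: str,
                     operation_run_id: Optional[int] = None) -> int:
    """Return the run id the job reports progress against."""
    if operation_run_id is not None:
        # A start surface already created the run and passed its id along.
        return operation_run_id
    # No start surface (e.g. scheduled work): create or reuse inside the job.
    return service.ensure_run(tenant_id, "assignments.fetch", backup_item_id)


service = RunService()
ui_run = service.ensure_run("tenant-1", "assignments.fetch", "item-10")
from_ui = handle_fetch_job(service, "tenant-1", "item-10", operation_run_id=ui_run)
scheduled = handle_fetch_job(service, "tenant-1", "item-11")
```

In this sketch the start surface only creates the run and enqueues; the job never creates a second run for an identity that already has one.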
Decision 3 — Counters semantics
- Decision: `total` = items attempted, `processed` = succeeded, `failed` = failed.
- Rationale: Works for both read-only fetch and destructive restore, and is easy to interpret in Monitoring.
- Alternatives considered:
  - “total discovered” semantics → ambiguous when discovery and execution are separate steps.
  - Different semantics per operation type → harder to reason about and to test.
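The counter semantics in Decision 3 could look like the following sketch. `RunCounters` is a hypothetical name, and the example assumes every attempted item ends as either succeeded or failed.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    total: int = 0      # items attempted
    processed: int = 0  # items that succeeded
    failed: int = 0     # items that failed

    def record(self, succeeded: bool) -> None:
        # Every attempt counts toward total, whatever its outcome.
        self.total += 1
        if succeeded:
            self.processed += 1
        else:
            self.failed += 1


counters = RunCounters()
for outcome in (True, True, False):
    counters.record(outcome)
```

Under that assumption `total == processed + failed` holds for fetch and restore alike, which gives tests a simple invariant to check.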
Decision 4 — Failure structure: stable code + normalized reason_code
- Decision: Use operation-specific failure `code` namespaces (e.g., `assignments.fetch_failed`, `assignments.restore_failed`) and store the underlying cause as a normalized `reason_code`.
- Rationale: Operators can identify what failed (operation) and why (normalized cause) consistently.
- Alternatives considered:
  - Generic failure codes only → loses context; Monitoring becomes less actionable.
  - Reusing unrelated restore codes for assignment operations → conflates domains and increases ambiguity.
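One way to picture the failure structure in Decision 4 is shown below. The two `code` values come from this document; the `REASON_CODES` table and its values (`throttled`, `permission_denied`, `not_found`, `unknown`) are invented for illustration and are not the project's actual reason codes.

```python
from dataclasses import dataclass

# Hypothetical mapping from a raw cause (here an HTTP status) to a reason code.
REASON_CODES = {
    429: "throttled",
    403: "permission_denied",
    404: "not_found",
}


@dataclass(frozen=True)
class Failure:
    code: str         # what failed, namespaced per operation
    reason_code: str  # why it failed, normalized


def build_failure(operation: str, http_status: int) -> Failure:
    # operation is "fetch" or "restore" in this sketch.
    return Failure(
        code=f"assignments.{operation}_failed",
        reason_code=REASON_CODES.get(http_status, "unknown"),
    )


fetch_failure = build_failure("fetch", 429)
restore_failure = build_failure("restore", 500)
```

The operation-specific `code` stays stable for Monitoring, while `reason_code` varies with the underlying cause.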
Decision 5 — Audit log granularity for assignment restores
- Decision: Write exactly one audit log entry per assignment restore execution (per restore run).
- Rationale: Satisfies auditability while keeping log volume predictable and reviewable.
- Alternatives considered:
  - Per-item audit logs → noisy for large restores.
  - Both summary + per-item → overkill for this hardening scope.
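The granularity chosen in Decision 5 can be sketched as follows. The function, the `"assignments.restore"` action name, and the entry fields are assumptions for the example; only the "one entry per restore run" rule comes from the decision.

```python
def restore_assignments(restore_run_id: str, items: list, apply_item, audit_log: list) -> dict:
    """Apply every item, then write one summary audit entry for the run."""
    succeeded = failed = 0
    for item in items:
        try:
            apply_item(item)
            succeeded += 1
        except Exception:
            failed += 1  # item failures feed the summary, not separate entries
    entry = {
        "action": "assignments.restore",  # hypothetical action name
        "restore_run_id": restore_run_id,
        "total": len(items),
        "succeeded": succeeded,
        "failed": failed,
    }
    audit_log.append(entry)  # exactly one entry per execution
    return entry


def apply_item(item: str) -> None:
    # Toy applier: the item named "bad" simulates a failed restore.
    if item == "bad":
        raise RuntimeError("simulated failure")


log: list = []
restore_assignments("run-1", ["a", "bad", "c"], apply_item, log)
```

Log volume then grows with the number of restore runs rather than with the number of restored items.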
Decision 6 — Graph client abstraction
- Decision: Ensure assignment-related services depend on the Graph client interface, not a concrete implementation.
- Rationale: Enables deterministic non-production tests and allows safe stub/null implementations.
- Alternatives considered:
  - Concrete client type-hints → blocks mocking and makes tests fragile.
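The dependency direction in Decision 6 is illustrated below with Python's `typing.Protocol`. The `GraphClient` interface, its `list_assignments` method, and the two classes are hypothetical; the real interface's methods are not given in this document.

```python
from typing import Protocol


class GraphClient(Protocol):
    """Hypothetical interface that assignment services depend on."""

    def list_assignments(self, policy_id: str) -> list: ...


class StubGraphClient:
    """Deterministic stub for non-production tests."""

    def __init__(self, canned: dict) -> None:
        self._canned = canned

    def list_assignments(self, policy_id: str) -> list:
        return self._canned.get(policy_id, [])


class AssignmentFetcher:
    def __init__(self, client: GraphClient) -> None:
        # Depends on the interface only, so any conforming client can be injected.
        self._client = client

    def count(self, policy_id: str) -> int:
        return len(self._client.list_assignments(policy_id))


fetcher = AssignmentFetcher(StubGraphClient({"policy-1": [{"id": "a"}, {"id": "b"}]}))
```

Because `AssignmentFetcher` never names a concrete client, a test can hand it the stub and get the same result on every run.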
Decision 7 — Authorization semantics & cross-plane safety
- Decision: Enforce deny-as-not-found (404) for non-members/cross-plane access and forbidden (403) for members lacking capability.
- Rationale: Matches constitution RBAC-UX rules and prevents route/resource enumeration.
- Alternatives considered:
  - Using 403 for non-members → leaks existence and violates deny-as-not-found standard.
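The ordering of checks in Decision 7 matters: membership and plane are checked before capability, so a non-member never receives a response that confirms the resource exists. A minimal sketch, with a hypothetical `authorize` function and boolean inputs standing in for the real policy checks:

```python
def authorize(is_member: bool, same_plane: bool, has_capability: bool) -> int:
    """Return the HTTP status for a request under the rules above."""
    if not is_member or not same_plane:
        return 404  # deny-as-not-found: do not reveal that the resource exists
    if not has_capability:
        return 403  # member, but lacks the required capability
    return 200


non_member = authorize(is_member=False, same_plane=True, has_capability=True)
cross_plane = authorize(is_member=True, same_plane=False, has_capability=True)
no_capability = authorize(is_member=True, same_plane=True, has_capability=False)
allowed = authorize(is_member=True, same_plane=True, has_capability=True)
```

A non-member who would also lack the capability still gets 404, not 403, because the membership check runs first.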