# Research — 094 Assignment Operations Observability Hardening

This document resolves implementation-relevant questions raised by the spec and records the key decisions.

## Decision 1 — Operation run identity + dedupe

- Decision: Use **dedupe per tenant + operation type + operation target/scope**.
  - Fetch target/scope: backup item identifier (or equivalent policy-version identifier).
  - Restore target/scope: restore run identifier.
- Rationale: Prevents operator confusion while allowing independent operations to proceed in parallel.
- Alternatives considered:
  - Dedupe per tenant only → collapses unrelated runs and obscures what is actually running.
  - No dedupe → increases confusion and makes tests non-deterministic.

## Decision 2 — How jobs become OperationRun-tracked

- Decision: Prefer passing an existing operation run identifier into queued jobs when a user-triggered start surface exists; otherwise the job must create/reuse a canonical run using the standard service.
- Rationale: Start surfaces should remain enqueue-only while ensuring a single umbrella run record exists for Monitoring.
- Alternatives considered:
  - Create runs only inside jobs → makes it harder to link UI initiation to the run and can lead to duplicate runs.

## Decision 3 — Counters semantics

- Decision: `total` = items attempted, `processed` = succeeded, `failed` = failed.
- Rationale: Works for both read-only fetch and destructive restore, and is easy to interpret in Monitoring.
- Alternatives considered:
  - “total discovered” semantics → ambiguous when discovery and execution are separate steps.
  - Different semantics per operation type → harder to reason about and to test.

## Decision 4 — Failure structure: stable code + normalized reason_code

- Decision: Use operation-specific failure `code` namespaces (e.g., `assignments.fetch_failed`, `assignments.restore_failed`) and store the underlying cause as a normalized `reason_code`.
- Rationale: Operators can identify what failed (operation) and why (normalized cause) consistently.
- Alternatives considered:
  - Generic failure codes only → loses context; Monitoring becomes less actionable.
  - Reusing unrelated restore codes for assignment operations → conflates domains and increases ambiguity.

## Decision 5 — Audit log granularity for assignment restores

- Decision: Write **exactly one audit log entry per assignment restore execution** (per restore run).
- Rationale: Satisfies auditability while keeping log volume predictable and reviewable.
- Alternatives considered:
  - Per-item audit logs → noisy for large restores.
  - Both summary + per-item → overkill for this hardening scope.

## Decision 6 — Graph client abstraction

- Decision: Ensure assignment-related services depend on the **Graph client interface**, not a concrete implementation.
- Rationale: Enables deterministic non-production tests and allows safe stub/null implementations.
- Alternatives considered:
  - Concrete client type-hints → blocks mocking and makes tests fragile.

## Decision 7 — Authorization semantics & cross-plane safety

- Decision: Enforce deny-as-not-found (404) for non-members/cross-plane access and forbidden (403) for members lacking capability.
- Rationale: Matches constitution RBAC-UX rules and prevents route/resource enumeration.
- Alternatives considered:
  - Using 403 for non-members → leaks existence and violates deny-as-not-found standard.
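
The ordering of checks in Decision 7 can be illustrated with a minimal Python sketch. It is not the project's implementation: `Actor`, `AccessStatus`, `authorize`, and the capability strings are hypothetical names chosen for illustration, and the cross-plane check is omitted because the spec excerpt does not define how planes are modelled. The point it shows is that membership is tested first, so a non-member never learns whether the resource exists.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class AccessStatus(IntEnum):
    """HTTP-style outcomes used by this sketch (hypothetical type)."""

    OK = 200
    FORBIDDEN = 403
    NOT_FOUND = 404


@dataclass(frozen=True)
class Actor:
    # Hypothetical actor model: tenant id -> capabilities granted in that tenant.
    memberships: dict[str, frozenset[str]] = field(default_factory=dict)


def authorize(actor: Actor, tenant_id: str, capability: str) -> AccessStatus:
    """Apply deny-as-not-found before the capability check."""
    capabilities = actor.memberships.get(tenant_id)

    # Non-member: answer 404 so the tenant's existence is not revealed.
    if capabilities is None:
        return AccessStatus.NOT_FOUND

    # Member without the required capability: 403 is safe here, because
    # a member already knows the tenant exists.
    if capability not in capabilities:
        return AccessStatus.FORBIDDEN

    return AccessStatus.OK


if __name__ == "__main__":
    viewer = Actor(memberships={"tenant-a": frozenset({"assignments.view"})})
    print(authorize(viewer, "tenant-b", "assignments.restore").name)
    print(authorize(viewer, "tenant-a", "assignments.restore").name)
    print(authorize(viewer, "tenant-a", "assignments.view").name)
```

Returning 403 to a non-member would confirm that the tenant exists, which is the enumeration leak the rejected alternative describes.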