TenantAtlas/specs/094-assignment-ops-observability-hardening/research.md
ahmido bda1d90fc4 Spec 094: Assignment ops observability hardening (#113)
Implements spec 094 (assignment fetch/restore observability hardening):

- Adds OperationRun tracking for assignment fetch (during backup) and assignment restore (during restore execution)
- Normalizes failure codes/reason_code and sanitizes failure messages
- Ensures exactly one audit log entry per assignment restore execution
- Enforces correct guard/membership vs capability semantics on affected admin surfaces
- Switches assignment Graph services to depend on GraphClientInterface

Also includes Postgres-only FK defense-in-depth check and a discoverable `composer test:pgsql` runner (scoped to the FK constraint test).

Tests:
- `vendor/bin/sail artisan test --compact` (passed)
- `vendor/bin/sail composer test:pgsql` (passed)

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #113
2026-02-15 14:08:14 +00:00

Research — 094 Assignment Operations Observability Hardening

This document resolves implementation-relevant questions raised by the spec and records the key decisions.

Decision 1 — Operation run identity + dedupe

  • Decision: Deduplicate runs by tenant + operation type + operation target/scope.
    • Fetch target/scope: backup item identifier (or equivalent policy-version identifier).
    • Restore target/scope: restore run identifier.
  • Rationale: Prevents operator confusion while allowing independent operations to proceed in parallel.
  • Alternatives considered:
    • Dedupe per tenant only → collapses unrelated runs and obscures what is actually running.
    • No dedupe → increases confusion and makes tests non-deterministic.
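
The dedupe identity above can be sketched as a composite key. This is a minimal, language-neutral illustration in Python, not the project's implementation; `RunIdentity`, `RunRegistry`, and the operation type string `assignments.fetch` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunIdentity:
    """Hypothetical dedupe key: tenant + operation type + target/scope."""
    tenant_id: str
    operation_type: str  # assumed naming, e.g. "assignments.fetch"
    target: str          # backup item id (fetch) or restore run id (restore)


class RunRegistry:
    """In-memory stand-in for whatever stores active operation runs."""

    def __init__(self):
        self._active = {}
        self._next_id = 1

    def ensure_run(self, identity: RunIdentity) -> int:
        # Reuse the active run for an identical identity; otherwise create one.
        if identity not in self._active:
            self._active[identity] = self._next_id
            self._next_id += 1
        return self._active[identity]


registry = RunRegistry()
first = registry.ensure_run(RunIdentity("tenant-1", "assignments.fetch", "backup-item-10"))
again = registry.ensure_run(RunIdentity("tenant-1", "assignments.fetch", "backup-item-10"))
other = registry.ensure_run(RunIdentity("tenant-1", "assignments.fetch", "backup-item-11"))
```

Here `first` and `again` resolve to the same run, while `other` gets its own run because its target differs, which is the "independent operations proceed in parallel" property from the rationale.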

Decision 2 — How jobs become OperationRun-tracked

  • Decision: Prefer passing an existing operation run identifier into queued jobs when a user-triggered start surface exists; otherwise the job must create/reuse a canonical run using the standard service.
  • Rationale: Start surfaces should remain enqueue-only while ensuring a single umbrella run record exists for Monitoring.
  • Alternatives considered:
    • Create runs only inside jobs → makes it harder to link UI initiation to the run and can lead to duplicate runs.
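
A rough sketch of that preference order, assuming a hypothetical `RunService` in place of the standard service and a plain function in place of a queued job (illustrative Python, not the project's code):

```python
from typing import Optional


class RunService:
    """Hypothetical stand-in for the standard create-or-reuse run service."""

    def __init__(self):
        self._runs = {}

    def ensure_run(self, key: tuple) -> int:
        # Return the existing run id for this key, or allocate a new one.
        return self._runs.setdefault(key, len(self._runs) + 1)


def resolve_run_id(service: RunService, key: tuple,
                   operation_run_id: Optional[int] = None) -> int:
    # Prefer the run that the start surface already created and passed in.
    if operation_run_id is not None:
        return operation_run_id
    # No start surface: the job creates or reuses the canonical run itself.
    return service.ensure_run(key)


service = RunService()
key = ("tenant-1", "assignments.restore", "restore-run-7")
from_ui = resolve_run_id(service, key, operation_run_id=42)
from_job = resolve_run_id(service, key)
from_retry = resolve_run_id(service, key)
```

In this sketch a retried job without a passed identifier lands on the same run as its first attempt, so Monitoring still sees one umbrella record.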

Decision 3 — Counters semantics

  • Decision: total = items attempted, processed = items that succeeded, failed = items that failed.
  • Rationale: Works for both read-only fetch and destructive restore, and is easy to interpret in Monitoring.
  • Alternatives considered:
    • “total discovered” semantics → ambiguous when discovery and execution are separate steps.
    • Different semantics per operation type → harder to reason about and to test.
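
Under these semantics, `total` always equals `processed + failed`. A minimal Python sketch of the counting loop, where `Counters` and `run_items` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Counters:
    total: int = 0      # items attempted
    processed: int = 0  # items that succeeded
    failed: int = 0     # items that failed


def run_items(items, action) -> Counters:
    counters = Counters()
    for item in items:
        counters.total += 1  # counted when attempted, not when discovered
        try:
            action(item)
            counters.processed += 1
        except Exception:
            counters.failed += 1
    return counters


def flaky(item):
    # Example action that fails for one specific item.
    if item == "bad":
        raise RuntimeError("simulated failure")


result = run_items(["a", "bad", "c"], flaky)
```

The same loop shape serves a read-only fetch and a destructive restore; only `action` changes.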

Decision 4 — Failure structure: stable code + normalized reason_code

  • Decision: Use operation-specific failure code namespaces (e.g., assignments.fetch_failed, assignments.restore_failed) and store the underlying cause as a normalized reason_code.
  • Rationale: Operators can identify what failed (operation) and why (normalized cause) consistently.
  • Alternatives considered:
    • Generic failure codes only → loses context; Monitoring becomes less actionable.
    • Reusing unrelated restore codes for assignment operations → conflates domains and increases ambiguity.
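
One way to picture the two-level structure is shown below. The failure code namespaces come from the decision above; the reason codes (`throttled`, `permission_denied`, `timeout`, `unknown`), the matching rules, and the token redaction are assumptions made for this Python sketch and are not taken from the codebase.

```python
import re

# Hypothetical rules mapping raw error text to a normalized reason_code.
REASON_RULES = [
    (re.compile(r"\b429\b"), "throttled"),
    (re.compile(r"\b403\b"), "permission_denied"),
    (re.compile(r"timed? ?out", re.IGNORECASE), "timeout"),
]


def normalize_reason(raw: str) -> str:
    for pattern, reason in REASON_RULES:
        if pattern.search(raw):
            return reason
    return "unknown"


def sanitize(raw: str) -> str:
    # Redact anything that looks like a bearer token before storing the message.
    return re.sub(r"Bearer\s+\S+", "Bearer [redacted]", raw)


def build_failure(operation: str, raw: str) -> dict:
    return {
        "code": f"assignments.{operation}_failed",  # what failed
        "reason_code": normalize_reason(raw),        # why it failed
        "message": sanitize(raw),
    }


failure = build_failure("restore", "HTTP 429 for request with Bearer abc.def.ghi")
```

The `code` stays stable per operation, so Monitoring can group by operation first and by cause second.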

Decision 5 — Audit log granularity for assignment restores

  • Decision: Write exactly one audit log entry per assignment restore execution (per restore run).
  • Rationale: Satisfies auditability while keeping log volume predictable and reviewable.
  • Alternatives considered:
    • Per-item audit logs → noisy for large restores.
    • Both summary + per-item → overkill for this hardening scope.
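
The one-entry-per-run rule might look like the following sketch, which writes a single summary entry after the item loop, whatever the individual outcomes. `AuditLog`, `restore_assignments`, and the action name `assignments.restored` are hypothetical, and Python is used only for illustration.

```python
class AuditLog:
    """In-memory stand-in for the audit log store."""

    def __init__(self):
        self.entries = []

    def record(self, action: str, context: dict) -> None:
        self.entries.append({"action": action, "context": context})


def restore_assignments(run_id: str, items, restore_one, audit: AuditLog) -> None:
    succeeded = 0
    failed = 0
    for item in items:
        try:
            restore_one(item)
            succeeded += 1
        except Exception:
            failed += 1
    # Exactly one summary entry per restore run; no per-item entries.
    audit.record("assignments.restored", {
        "restore_run_id": run_id,
        "succeeded": succeeded,
        "failed": failed,
    })


def restore_one(item):
    # Example restore step that fails for one item.
    if item == "broken":
        raise RuntimeError("simulated failure")


audit = AuditLog()
restore_assignments("restore-run-7", ["a", "broken", "c"], restore_one, audit)
```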

Decision 6 — Graph client abstraction

  • Decision: Ensure assignment-related services depend on the Graph client interface, not a concrete implementation.
  • Rationale: Enables deterministic non-production tests and allows safe stub/null implementations.
  • Alternatives considered:
    • Concrete client type-hints → blocks mocking and makes tests fragile.
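
To illustrate why the interface dependency helps testing, here is a small Python sketch. `GraphClientInterface` is the name used by the spec; the `request` method, the placeholder path, `NullGraphClient`, and `AssignmentFetcher` are assumptions for the example and do not reflect the real interface or any real Graph endpoint.

```python
from typing import Protocol


class GraphClientInterface(Protocol):
    # Assumed method shape; the real interface may differ.
    def request(self, method: str, path: str) -> dict: ...


class NullGraphClient:
    """Safe stub: never touches the network and returns an empty result."""

    def request(self, method: str, path: str) -> dict:
        return {"value": []}


class AssignmentFetcher:
    # Depends on the interface, so any conforming client can be injected.
    def __init__(self, graph: GraphClientInterface):
        self._graph = graph

    def fetch(self, policy_id: str) -> list:
        # Placeholder path for illustration only.
        response = self._graph.request("GET", f"/example/{policy_id}/assignments")
        return response["value"]


assignments = AssignmentFetcher(NullGraphClient()).fetch("policy-1")
```

Because `AssignmentFetcher` never names a concrete client, a test can inject the stub and get a deterministic result.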

Decision 7 — Authorization semantics & cross-plane safety

  • Decision: Enforce deny-as-not-found (404) for non-members/cross-plane access and forbidden (403) for members lacking capability.
  • Rationale: Matches constitution RBAC-UX rules and prevents route/resource enumeration.
  • Alternatives considered:
    • Using 403 for non-members → leaks existence and violates deny-as-not-found standard.
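
The ordering of the two checks matters: membership is evaluated first, so a non-member never learns whether the resource exists. A minimal Python sketch, assuming a hypothetical `memberships` mapping from tenant id to the capabilities a user holds there (the capability names are invented for the example):

```python
def authorize(memberships: dict, tenant_id: str, capability: str) -> int:
    capabilities = memberships.get(tenant_id)
    if capabilities is None:
        return 404  # non-member or cross-plane: deny as not found
    if capability not in capabilities:
        return 403  # member without the required capability
    return 200


# Hypothetical user: member of tenant-1 with a single capability.
memberships = {"tenant-1": {"restore.view"}}

non_member = authorize(memberships, "tenant-2", "restore.view")
lacks_capability = authorize(memberships, "tenant-1", "restore.execute")
allowed = authorize(memberships, "tenant-1", "restore.view")
```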