ahmido bda1d90fc4 Spec 094: Assignment ops observability hardening (#113)

Implements spec 094 (assignment fetch/restore observability hardening):

- Adds OperationRun tracking for assignment fetch (during backup) and assignment restore (during restore execution)
- Normalizes failure codes/reason_code and sanitizes failure messages
- Ensures exactly one audit log entry per assignment restore execution
- Enforces correct guard/membership vs capability semantics on affected admin surfaces
- Switches assignment Graph services to depend on GraphClientInterface

Also includes a Postgres-only FK defense-in-depth check and a discoverable `composer test:pgsql` runner (scoped to the FK constraint test).

Tests:
- `vendor/bin/sail artisan test --compact` (passed)
- `vendor/bin/sail composer test:pgsql` (passed)

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #113

2026-02-15 14:08:14 +00:00

Feature Specification: Assignment Operations Observability Hardening

Feature Branch: 094-assignment-ops-observability-hardening
Created: 2026-02-14
Status: Draft
Input: User description: "Harden assignment operation observability and close remaining authorization inconsistencies so operations are fully traceable, diagnosable, and correctly access-controlled."

Spec Scope Fields (mandatory)

Scope: canonical-view
Primary Routes:
- Admin Monitoring → Operations (list + detail views)
- Provider Connections list
- Backup Sets list
- Restore Runs list
- Backup Items relationship view under a backup set
Data Ownership: operational activity records must be scoped to the correct workspace and tenant context where applicable; no new domain entities are introduced.
RBAC: all affected admin surfaces require workspace membership; actions that change configuration or trigger restores require explicit permissions as defined in the central capability registry.

Canonical-view Constraints

Default behavior when tenant-context is active: Monitoring → Operations defaults to showing operational records for the currently selected tenant context (if one is selected).
Entitlement checks: Monitoring → Operations MUST NOT reveal tenant-owned operational records unless the actor is entitled to that tenant scope within the active workspace context.

Clarifications

Session 2026-02-14

Q: For FR-005 / SC-004, what should “identity scope” mean for deduping “active” runs? → A: Dedupe per tenant and operation target/scope.
Q: For FR-003 (“summary counters”) what exact semantics should total / processed / failed follow? → A: total = items attempted; processed = succeeded; failed = failed.
Q: For FR-004 (“stable failure code”), which convention should we standardize on for assignment fetch/restore runs? → A: Operation-specific code namespaces + normalized reason_code for the cause.
Q: For FR-006 (identity scope = tenant + operation type + target/scope), which identifiers should define “target/scope” for these two operations? → A: Fetch targets the backup item (or policy version) identifier; restore targets the restore run identifier.
Q: For US1 / FR-002 (“restore is observable/auditable”), what audit log granularity do you want for assignment restore? → A: One audit log entry per assignment restore execution (per restore run).
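
The identity-scope answers above can be sketched as a small key-building helper. This is an illustrative Python sketch only, not the project's implementation: the names `RunIdentity`, `fetch_identity`, `restore_identity`, and the operation type strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunIdentity:
    """Identity scope of a run: tenant + operation type + target/scope."""
    tenant_id: str
    operation_type: str
    target_id: str

    def key(self) -> str:
        # A flat string key is convenient for a unique index or a lock name.
        return f"{self.tenant_id}:{self.operation_type}:{self.target_id}"


def fetch_identity(tenant_id: str, backup_item_id: str) -> RunIdentity:
    # Fetch targets the backup item (or policy version) identifier.
    return RunIdentity(tenant_id, "assignment.fetch", backup_item_id)


def restore_identity(tenant_id: str, restore_run_id: str) -> RunIdentity:
    # Restore targets the restore run identifier.
    return RunIdentity(tenant_id, "assignment.restore", restore_run_id)
```

Two triggers with the same tenant, operation type, and target produce equal identities, which is the property the deduplication requirement relies on.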

User Scenarios & Testing (mandatory)

User Story 1 — Observe assignment operations end-to-end (Priority: P1)

Workspace administrators need to see assignment-related operations (both read-only fetch and destructive restore) in Monitoring so they can confirm what ran, what changed, and why something failed, without relying on server logs.

Why this priority: Assignment restore is high-risk; missing visibility creates operational and audit gaps.

Independent Test: Trigger both an assignment fetch and an assignment restore; verify each produces a monitoring-visible run record with correct lifecycle and failure details.

Acceptance Scenarios:

Given an administrator triggers an assignment fetch, When the operation starts and completes, Then Monitoring shows a run record with start/end timestamps, final outcome, and summary counters.
Given an administrator triggers an assignment restore, When the operation starts and completes, Then Monitoring shows a run record including a clear indication that it was a change-making operation, And exactly one audit log entry is written for the restore execution.
Given an assignment fetch or restore fails due to an external dependency error, When the run completes, Then Monitoring shows a stable failure code and a sanitized, user-readable message.
Given the same assignment operation is triggered multiple times concurrently for the same tenant and scope, When the system creates tracking records, Then the admin sees a single “active” run per identity (or an equivalent deduped representation).
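
The last scenario (a single "active" run per identity) could be approximated as follows. This is a minimal in-memory sketch under stated assumptions: `RunRegistry` and `ensure_run` are hypothetical names, the registry stands in for persisted storage, and it is not safe under real concurrency, where uniqueness would more plausibly be enforced by the persistence layer.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# States in which a run still counts as "active" (assumed from the
# queued/running/completed lifecycle in this spec).
ACTIVE_STATES = {"queued", "running"}

Identity = Tuple[str, str, str]  # (tenant_id, operation_type, target_id)


@dataclass
class Run:
    run_id: int
    identity: Identity
    state: str = "queued"


@dataclass
class RunRegistry:
    """In-memory stand-in for the persisted run records (hypothetical)."""
    runs: Dict[int, Run] = field(default_factory=dict)
    next_id: int = 1

    def ensure_run(self, identity: Identity) -> Run:
        # Reuse the existing active run for this identity, if there is one.
        for run in self.runs.values():
            if run.identity == identity and run.state in ACTIVE_STATES:
                return run
        run = Run(self.next_id, identity)
        self.runs[run.run_id] = run
        self.next_id += 1
        return run

    def complete(self, run_id: int) -> None:
        self.runs[run_id].state = "completed"
```

Once a run completes, a later trigger for the same identity creates a new record, so history is preserved while the active view stays deduplicated.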

User Story 2 — Enforce correct access control semantics on affected admin surfaces (Priority: P2)

Workspace administrators and platform operators must not be able to cross authentication “planes” accidentally, and the admin UI must not expose bypasses that let users initiate sensitive actions without authorization.

Why this priority: Prevents cross-plane leakage and closes known authorization inconsistencies.

Independent Test: Attempt access with the wrong authentication plane and with insufficient permissions; verify outcomes are deny-as-not-found (404) or forbidden (403) per policy.

Acceptance Scenarios:

Given a user is authenticated in a different auth plane, When they attempt to access workspace-scoped admin routes, Then the response is deny-as-not-found (404).
Given a user is not a member of the workspace, When they attempt to view backup items under a backup set, Then the response is deny-as-not-found (404) and does not reveal record existence.
Given a user is a workspace member but lacks the required permission, When they attempt a protected action (such as managing provider connections), Then the response is forbidden (403).
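
The three scenarios above imply an ordering of checks: plane, then membership, then permission. The sketch below illustrates that ordering in Python. The `Actor` shape, the `"workspace"` plane label, and the capability string are assumptions made for the example and do not come from the capability registry.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Actor:
    auth_plane: str                 # e.g. "workspace" or "platform" (hypothetical labels)
    workspace_ids: FrozenSet[str]   # workspaces the actor is a member of
    capabilities: FrozenSet[str]    # permissions granted to the actor


def authorize(actor: Actor, workspace_id: str, required_capability: str) -> int:
    """Return the HTTP status the request should receive."""
    # 1. Wrong authentication plane: deny-as-not-found.
    if actor.auth_plane != "workspace":
        return 404
    # 2. Not a member: deny-as-not-found, so record existence is not revealed.
    if workspace_id not in actor.workspace_ids:
        return 404
    # 3. Member without the required permission: forbidden.
    if required_capability not in actor.capabilities:
        return 403
    return 200
```

Running the membership check before the capability check matters: a non-member who also lacks the permission receives 404, not 403, so the response never confirms that the record exists.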

User Story 3 — Validate assignment operations safely in non-production contexts (Priority: P3)

Platform engineers need to validate that assignment operations behave correctly without requiring live external dependencies, so regressions can be caught early.

Why this priority: Improves reliability and reduces the risk of shipping changes that only fail when external services are slow/unavailable.

Independent Test: Run automated tests that simulate both successful and failing external interactions; verify monitoring records and authorization behaviors.

Acceptance Scenarios:

Given external interactions are simulated as “successful”, When the operation runs, Then the run is marked successful and includes expected summary counters.
Given external interactions are simulated as “failing”, When the operation runs, Then the run is marked failed with a stable failure code and sanitized message.
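
One way to simulate external interactions is a test double that sits behind the same interface as the real client. The Python sketch below is illustrative: `GraphClient`, `list_assignments`, `run_fetch`, and the failure code string are hypothetical and are not the project's actual interface.

```python
from typing import List, Optional, Protocol


class GraphClient(Protocol):
    """Hypothetical stand-in for the client interface the services depend on."""
    def list_assignments(self, policy_id: str) -> List[dict]: ...


class FakeGraphClient:
    """Test double: returns canned assignments or raises a canned error."""

    def __init__(self, assignments: Optional[List[dict]] = None,
                 error: Optional[Exception] = None) -> None:
        self._assignments = assignments or []
        self._error = error

    def list_assignments(self, policy_id: str) -> List[dict]:
        if self._error is not None:
            raise self._error
        return list(self._assignments)


def run_fetch(client: GraphClient, policy_id: str) -> dict:
    """Run one fetch and return a run record as a plain dict."""
    try:
        items = client.list_assignments(policy_id)
    except Exception:
        # The raw exception text is deliberately not copied into the record.
        return {
            "outcome": "failed",
            "failure_code": "assignment_fetch.failed",  # hypothetical code
            "message": "Assignment fetch failed due to an external dependency error.",
        }
    return {"outcome": "succeeded", "total": len(items),
            "processed": len(items), "failed": 0}
```

Because the operation depends only on the interface, the same code path can be exercised with "successful" and "failing" doubles without any live dependency.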

Edge Cases

External dependency timeouts, throttling, or transient failures.
Retries and duplicate dispatches (ensure tracking remains coherent and non-spammy).
Missing or inconsistent tenant/workspace context (must fail safely, not leak).
Partial completion (some items processed, some failed): counters and failure details must remain interpretable.
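
For the partial-completion case, the counter semantics from the clarifications (total = attempted, processed = succeeded, failed = failed) can be sketched like this. The helper name `process_items` and the handler signature are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Counters:
    total: int = 0      # items attempted
    processed: int = 0  # items that succeeded
    failed: int = 0     # items that failed


def process_items(items: Iterable[str],
                  handler: Callable[[str], None]) -> Tuple[Counters, List[str]]:
    """Apply handler to each item, continuing past individual failures."""
    counters = Counters()
    failed_items: List[str] = []
    for item in items:
        counters.total += 1
        try:
            handler(item)
            counters.processed += 1
        except Exception:
            # One failing item must not abort the rest of the batch.
            counters.failed += 1
            failed_items.append(item)
    return counters, failed_items
```

Under these semantics `total == processed + failed` always holds, which keeps the counters interpretable when only some items succeed.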

Requirements (mandatory)

Functional Requirements

FR-001: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment fetch operation execution.
FR-002: The system MUST create and maintain an operational tracking record (OperationRun) for every assignment restore operation execution.
FR-003: Tracking records MUST include lifecycle state (queued/running/completed), timestamps, outcome, and summary counters sufficient to understand progress and results.
- Counter semantics: total = items attempted, processed = succeeded, failed = failed.
FR-004: Failed runs MUST include a stable failure code and a sanitized, user-readable failure message.
- Failure convention: use operation-specific code namespaces for the run, and store the underlying cause as a normalized reason_code.
FR-005: The system MUST prevent duplicate “active” runs for the same identity scope (tenant + operation type + operation target/scope), or otherwise present a deduped representation that avoids operator confusion.
FR-006: Identity scope MUST be defined as:
- For assignment fetch operations: tenant + operation type + backup item identifier (or equivalent policy-version identifier).
- For assignment restore operations: tenant + operation type + restore run identifier.
FR-007: Monitoring pages MUST render using persisted operational data only (no outbound calls during page render).
FR-008: Cross-plane access MUST be deny-as-not-found (404) on affected routes.
FR-009: Authorization MUST follow consistent semantics:
- non-member / not entitled → 404 (deny-as-not-found)
- member without required permission → 403
FR-010: Any action that can change configuration or trigger a restore MUST be server-authorized; UI visibility MUST NOT be treated as authorization.
FR-011: Sensitive or destructive-like actions MUST require explicit confirmation.
FR-012: Each assignment restore execution MUST write exactly one audit log entry for the restore run.
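
FR-012 (exactly one audit log entry per restore run) can be made robust against retries by keying the write on the restore run identifier. The Python sketch below assumes a "first write wins" policy and uses an in-memory store. `AuditLog`, `record_restore`, and the action string are hypothetical.

```python
from typing import Dict, List


class AuditLog:
    """In-memory stand-in for the audit log store (hypothetical)."""

    def __init__(self) -> None:
        self.entries: List[dict] = []
        self._by_restore_run: Dict[str, dict] = {}

    def record_restore(self, restore_run_id: str, outcome: str) -> dict:
        # Keyed on the restore run, so retries and duplicate dispatches
        # return the existing entry instead of writing a second one.
        existing = self._by_restore_run.get(restore_run_id)
        if existing is not None:
            return existing
        entry = {
            "action": "assignment.restore",  # hypothetical action name
            "restore_run_id": restore_run_id,
            "outcome": outcome,
        }
        self.entries.append(entry)
        self._by_restore_run[restore_run_id] = entry
        return entry
```

In a persisted implementation the same effect would more likely come from a uniqueness guarantee in storage. The sketch only shows the intended idempotent behaviour.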

Dependencies & Assumptions

Assignment fetch and assignment restore operations can be triggered by administrators or scheduled/queued execution paths.
Monitoring users have access to the Operations area only when they have appropriate workspace membership and permissions.
External dependency failures may occur and must be represented consistently (stable failure codes + sanitized messages).
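
A consistent failure representation, as described in the last assumption and in FR-004, might look like the following Python sketch. The reason codes, the redaction patterns, and the `<operation>.failed` naming are assumptions made for illustration and are not the project's actual conventions.

```python
import re

# Patterns for values that should never reach a user-facing message.
_BEARER = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+")
_SECRET_KV = re.compile(r"(?i)\b(token|secret|password)=\S+")


def reason_code_for(exc: Exception) -> str:
    """Map a raw exception to a normalized reason_code (hypothetical set)."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, PermissionError):
        return "forbidden"
    if isinstance(exc, ConnectionError):
        return "dependency_unavailable"
    return "unknown"


def sanitize(message: str, max_length: int = 200) -> str:
    """Redact credentials, collapse whitespace, and cap the length."""
    message = _BEARER.sub("Bearer [redacted]", message)
    message = _SECRET_KV.sub(r"\1=[redacted]", message)
    message = " ".join(message.split())
    return message[:max_length]


def failure_details(operation_type: str, exc: Exception) -> dict:
    return {
        # Operation-specific namespace for the run-level code.
        "failure_code": f"{operation_type}.failed",
        # Normalized cause, independent of the operation.
        "reason_code": reason_code_for(exc),
        "message": sanitize(str(exc)),
    }
```

Keeping the run-level code and the cause separate lets operators filter by operation and by cause independently.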

UI Action Matrix (mandatory when the admin UI is changed)

Surface	Location	Header Actions	Inspect Affordance (List/Table)	Row Actions (max 2 visible)	Bulk Actions (grouped)	Empty-State CTA(s)	View Header Actions	Create/Edit Save+Cancel	Audit log?	Notes / Exemptions
Provider Connections	Admin → Provider Connections	Create connection (permission-gated)	View connection details	Edit / Delete (permission-gated; destructive confirms)	None	Create connection	N/A	Save + Cancel	Yes	Ensures no authorization bypass exists on header/empty-state CTAs.
Backup Sets	Admin → Backups	None (unchanged)	View backup set	Actions unchanged	None	None	N/A	N/A	N/A	Only enforcement helper consistency is affected.
Restore Runs	Admin → Restores	None (unchanged)	View restore run	Actions unchanged	None	None	N/A	N/A	Yes	Restore operations must be observable via Monitoring.
Backup Items	Under a backup set	None	View backup item details	Actions unchanged	None	None	N/A	N/A	N/A	Membership/404 checks must occur before capability/403 checks.
Monitoring → Operations	Admin → Monitoring	None	View operation run details	None	None	None	N/A	N/A	Yes	Read-only view; must not call external services during render.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: 100% of assignment fetch and assignment restore executions appear in Monitoring with a completed outcome (success or failure) and timestamps.
SC-002: For failed runs, operators can identify a stable failure code and a readable failure message from Monitoring within 60 seconds, without checking server logs.
SC-003: Automated tests verify deny-as-not-found (404) vs forbidden (403) semantics for the affected surfaces.
SC-004: Duplicate active-run confusion is eliminated: repeated triggers produce a single active run per identity scope (tenant + operation type + operation target/scope) or equivalent deduped visibility.
