Ahmed Darrazi 48b558db93 docs: unified operations runs specs and plan (054)

2026-01-16 19:06:30 +01:00


Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)

Feature Branch: feat/054-unify-operations-runs-suitewide
Created: 2026-01-16
Status: Draft
Input: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."

Clarifications

Session 2026-01-16

Q: Which default retention should 054 set for canonical Operation Runs? → A: 90 days
Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or replace them immediately? → A: Parallel write (canonical + legacy)
Q: For restore.execute, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in operation_runs that points to the restore record).
Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on tenant_id, run_identity_hash where outcome is queued or running).
Q: How should the initiator be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (user_id nullable FK + required initiator_name string).

User Scenarios & Testing (mandatory)

User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)

As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.

Why this priority: This is the core value: a single, tenant-scoped source of truth for operational visibility.

Independent Test: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.

Acceptance Scenarios:

Given I am signed into tenant A, When I open Monitoring → Operations, Then I see only tenant A runs and can filter by run type, outcome bucket, time range, and initiator.
Given multiple run types exist, When I filter to inventory.sync, Then only inventory sync runs are shown.
Given a run exists, When I open its detail view, Then I can see initiator, run type, outcome bucket, timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
Given restore execution exists, When I open Monitoring → Operations, Then I can see a restore.execute entry that links to the existing restore record (restore history remains owned by the restore domain record).
Given I am a Readonly user in tenant A, When I view Monitoring → Operations, Then I can view runs and details but I do not see any start/rerun/cancel/delete controls.
Given I attempt to access a run from another tenant (direct link or list), When I request it, Then access is denied and no run details are disclosed.
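
The tenant-scoping scenarios above can be illustrated with a minimal Python sketch. The `Run` shape, the `RunNotFound` error, and both function names are hypothetical and not part of this spec. Answering a cross-tenant request with the same error as a missing run is one possible way to deny access without disclosing details, not a mandated behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical minimal shape; the real record carries more fields (see FR-001).
    id: int
    tenant_id: str
    run_type: str
    outcome: str


class RunNotFound(Exception):
    """Raised for missing runs and for runs owned by another tenant alike."""


def get_run_for_tenant(runs: dict[int, Run], active_tenant_id: str, run_id: int) -> Run:
    run = runs.get(run_id)
    # Same error for "does not exist" and "belongs to another tenant",
    # so a caller cannot probe for the existence of foreign runs.
    if run is None or run.tenant_id != active_tenant_id:
        raise RunNotFound(f"run {run_id} not found")
    return run


def list_runs(runs: dict[int, Run], active_tenant_id: str,
              run_type: Optional[str] = None) -> list[Run]:
    # The tenant filter is always applied; the run type filter is optional.
    return [
        run for run in runs.values()
        if run.tenant_id == active_tenant_id
        and (run_type is None or run.run_type == run_type)
    ]
```

With one run per tenant in `runs`, `list_runs(runs, "tenant-a")` returns only tenant A's run, and requesting tenant B's run id from tenant A raises `RunNotFound`.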

User Story 2 - Start Operations Without Blocking (Priority: P2)

As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.

Why this priority: Removes long-running requests/timeouts and standardizes how operations are started and observed.

Independent Test: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.

Acceptance Scenarios:

Given I have permission to start a Phase 1 operation in tenant A, When I start it, Then I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
Given I am a Readonly user in tenant A, When I attempt to start any Phase 1 operation, Then the system denies the request and does not create a new run.
Given the run reaches a terminal outcome, When that occurs, Then the initiating user receives an in-app notification including a short summary and a “View run” link.
Given background processing is unavailable, When I attempt to start an operation, Then I receive a clear message and the system MUST NOT claim it was queued.
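
As a rough sketch of the start flow in these scenarios (authorize, create the run, dispatch, return immediately), the Python below uses an in-memory stand-in for the background queue. `InMemoryQueue`, `QueueUnavailable`, `start_operation`, and the `/operations/<id>` link format are all assumptions made for illustration; deduplication is left out to keep the example short.

```python
START_ROLES = {"owner", "manager", "operator"}  # Readonly may view but not start


class QueueUnavailable(Exception):
    pass


class InMemoryQueue:
    """Hypothetical stand-in for the real background execution mechanism."""

    def __init__(self, available=True):
        self.available = available
        self.jobs = []

    def dispatch(self, run_id):
        if not self.available:
            raise QueueUnavailable("background processing is unavailable")
        self.jobs.append(run_id)


def start_operation(role, run_type, runs, queue):
    # 1. Authorize before anything is created.
    if role.lower() not in START_ROLES:
        raise PermissionError(f"role {role!r} may not start {run_type}")
    # 2. Fail fast if background execution cannot accept work.
    if not queue.available:
        raise QueueUnavailable("Background processing is unavailable; nothing was queued.")
    # 3. Create the run, then dispatch; no remote work happens inline.
    run = {"id": len(runs) + 1, "run_type": run_type, "outcome": "queued"}
    runs.append(run)
    try:
        queue.dispatch(run["id"])
    except QueueUnavailable:
        runs.remove(run)  # never leave a misleading "queued" run behind
        raise
    # 4. Return immediately with confirmation and a "View run" link.
    return {"message": f"{run_type} queued", "view_run": f"/operations/{run['id']}"}
```

The availability check runs before the run is created, and the rollback in the `except` branch covers the case where the queue goes away between the check and the dispatch.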

User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)

As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.

Why this priority: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.

Independent Test: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.

Acceptance Scenarios:

Given an identical run is queued/running for a tenant, When another start request is made with the same effective inputs, Then the system reuses the existing run and does not start a second one.
Given two starts happen at nearly the same time, When the system resolves the race, Then at most one active run exists for that identity and both users are directed to it.

Edge Cases

Background execution unavailable: start fails fast with a clear message; the system MUST NOT create misleading “queued” runs.
Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.
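
The partial-processing rule can be written down as a small function. This is a sketch: the `terminal_outcome` name and the `could_proceed` flag are hypothetical, and treating a run with zero processed items as "failed" follows a literal reading of "zero successes", which this spec does not address explicitly.

```python
def terminal_outcome(succeeded: int, failed: int, could_proceed: bool = True) -> str:
    """Map per-item counts to a terminal outcome bucket."""
    if not could_proceed or succeeded == 0:
        # Zero successes, or the run could not proceed at all.
        return "failed"
    if failed > 0:
        # At least one success and at least one failure.
        return "partially succeeded"
    return "succeeded"
```

For example, `terminal_outcome(2, 1)` yields "partially succeeded", while `terminal_outcome(0, 4)` yields "failed".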

Requirements (mandatory)

Constitution alignment (required): If this feature introduces any external tenant API calls or any write/change behavior, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

Scope & Assumptions

Phase 1 adoption set (must be implemented):

inventory.sync (Inventory “Sync now”)
policy.sync (Policies “Sync now”)
directory_groups.sync (Directory → Groups “Sync groups”)
drift.generate (Drift “Generate drift now” / auto-on-open when eligible)
backup_set.add_policies (Backup Sets “Add selected” / “Add policies”)

Restore visibility (adapter only):

restore.execute appears as a canonical run entry that links to an existing restore domain record.
Restore execution history remains owned by the restore domain record (not replaced in Phase 1).

Out of scope for 054 (explicit):

Cross-tenant compare/promotion
UI redesign/styling polish (separate UI polish work)
Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
Replacing restore domain records with canonical runs
A full settings UI for retention/notifications/etc.

Assumptions (defaults to remove ambiguity in Phase 1):

Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
System-initiated runs (if any) do not notify users by default in Phase 1.
Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.

Functional Requirements

FR-001 Canonical Operation Run: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable user_id FK + initiator_name string), run type, lifecycle status/timestamps, outcome bucket, summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
FR-002 Run taxonomy: Run type MUST be stable and follow "<resource>.<action>".
FR-003 Phase 1 run types: Phase 1 run types MUST include inventory.sync, policy.sync, directory_groups.sync, drift.generate, backup_set.add_policies, plus restore.execute implemented as a physical operation_runs record (adapter) pointing to the domain entity.
FR-004 Monitoring lists all canonical runs: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, outcome bucket, time range, and initiator; default sort is most recent first; default time window is last 30 days.
FR-005 Run detail: Run detail MUST show initiator, run type, outcome bucket, timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
FR-006 View-only hub: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
FR-007 Start surfaces always enqueue: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
FR-008 No remote work in interactive request: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
FR-009 Deterministic idempotency: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. Enforcement: Uniqueness MUST be enforced via a partial unique index on (tenant_id, run_identity_hash) where outcome is queued or running.
FR-010 Phase 1 identity rules: Identity rules MUST be defined at least as follows:
- inventory.sync: tenant + selection scope
- policy.sync: tenant + effective policy scope
- directory_groups.sync: tenant + selection (Phase 1 default: “all groups”)
- backup_set.add_policies: tenant + backup set + selected policies + option flags (if exposed)
- drift.generate: tenant + scope key + baseline/current comparison inputs
FR-011 Outcome buckets: Monitoring MUST present consistent outcome buckets: queued, running, succeeded, partially succeeded, failed.
FR-012 Partial vs failed: “Partially succeeded” means at least one success and at least one failure; “Failed” means zero successes or cannot proceed.
FR-013 Failure details are safe + useful: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
FR-014 Related links: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for restore.execute).
FR-015 Notifications: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
FR-016 Tenant isolation: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
FR-017 No render-time remote calls: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
FR-018 Roles & permissions: Roles Owner, Manager, Operator, and Readonly MUST be able to view runs; only Owner, Manager, Operator may start operations; Readonly is strictly view-only.
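
One way to derive the deterministic identity required by FR-009 and FR-010 is to hash a canonical serialization of tenant, run type, and effective inputs. The sketch below makes several assumptions that the spec does not state: SHA-256 over sorted-key JSON, list-valued inputs treated as unordered selections, and the run type folded into the hash (because the unique index covers only `tenant_id` and `run_identity_hash`). The function and input key names are hypothetical.

```python
import hashlib
import json


def _normalize(value):
    """Make inputs order-insensitive so equivalent selections hash identically."""
    if isinstance(value, dict):
        return {key: _normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        items = [_normalize(item) for item in value]
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def run_identity_hash(tenant_id: str, run_type: str, effective_inputs: dict) -> str:
    # The initiator is deliberately not a parameter: it must not affect identity.
    payload = {
        "tenant_id": tenant_id,
        "run_type": run_type,
        "inputs": _normalize(effective_inputs),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under these assumptions, two `backup_set.add_policies` starts that select the same policies in a different order produce the same hash, while the same inputs in another tenant produce a different one.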

Key Entities (include if feature involves data)

Canonical Operation Run: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable user_id FK + initiator_name string), lifecycle state/timestamps, outcome bucket, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
Restore domain record (exception): Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical operation_runs row (adapter) that links back to the restore record, without replacing it.
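
To make the entity description more concrete, here is an illustrative Python dataclass. Apart from `tenant_id`, `user_id`, `initiator_name`, and `run_identity_hash`, every field name is an assumption, as is the `restore_id` context key used for the adapter row; the validation rules mirror FR-001, FR-002, and FR-011.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

OUTCOME_BUCKETS = {"queued", "running", "succeeded", "partially succeeded", "failed"}


@dataclass
class OperationRun:
    tenant_id: str
    run_type: str                      # "<resource>.<action>", e.g. "inventory.sync"
    run_identity_hash: str
    initiator_name: str                # required name snapshot
    user_id: Optional[int] = None      # None for system-initiated runs
    outcome: str = "queued"
    created_at: Optional[datetime] = None
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    summary_counts: dict = field(default_factory=dict)
    failures: list = field(default_factory=list)
    context: dict = field(default_factory=dict)  # safe references only

    def __post_init__(self):
        if not self.initiator_name.strip():
            raise ValueError("initiator_name is required, even without a user_id")
        resource, dot, action = self.run_type.partition(".")
        if not (resource and dot and action):
            raise ValueError("run_type must follow '<resource>.<action>'")
        if self.outcome not in OUTCOME_BUCKETS:
            raise ValueError(f"unknown outcome bucket: {self.outcome!r}")


# Adapter row for restore: a normal run whose context points at the restore record.
restore_adapter = OperationRun(
    tenant_id="tenant-a",
    run_type="restore.execute",
    run_identity_hash="example-hash",
    initiator_name="Alice",
    user_id=1,
    context={"restore_id": 42},  # hypothetical key linking to the domain record
)
```

Keeping the name snapshot separate from the nullable foreign key means a run still shows who started it after the user is deleted, and system processes can be recorded without a user row.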

Success Criteria (mandatory)

Measurable Outcomes

SC-001: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
SC-002: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
SC-003: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
SC-004: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
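
A test for SC-004 needs something to assert against. The sketch below shows one possible shape for a failure sanitizer in Python; the redaction patterns, the 200-character limit, and the `sanitize_failure` name are illustrative assumptions, and the patterns are far from an exhaustive secret or PII detector.

```python
import re

MAX_MESSAGE_LENGTH = 200  # assumed limit for a "short" message

# Illustrative patterns only; a real implementation would need a reviewed list.
_REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [redacted]"),
    (re.compile(r"(?i)\b(password|secret|token|api[_-]?key)\s*[=:]\s*\S+"), r"\1=[redacted]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[email redacted]"),
]


def sanitize_failure(reason_code: str, message: str) -> dict:
    """Return a stable reason code plus a short, redacted message."""
    cleaned = " ".join(message.split())  # collapse whitespace and newlines
    for pattern, replacement in _REDACTIONS:
        cleaned = pattern.sub(replacement, cleaned)
    if len(cleaned) > MAX_MESSAGE_LENGTH:
        # Truncate after redaction so a cut can never expose part of a secret.
        cleaned = cleaned[: MAX_MESSAGE_LENGTH - 3].rstrip() + "..."
    return {"reason_code": reason_code, "message": cleaned}
```

A test can then feed known secrets through `sanitize_failure` and assert that none of them appear in the stored message.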
