TenantAtlas/specs/054-unify-runs-suitewide/spec.md


Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)

Feature Branch: feat/054-unify-operations-runs-suitewide
Created: 2026-01-16
Status: Draft
Input: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."

Clarifications

Session 2026-01-16

  • Q: Which default retention should 054 set for canonical Operation Runs? → A: 90 days
  • Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or replace them immediately? → A: Parallel write (canonical + legacy)
  • Q: For restore.execute, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in operation_runs that points to the restore record).
  • Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on tenant_id, run_identity_hash where status is queued or running).
  • Q: How should the initiator be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (user_id nullable FK + required initiator_name string).

Session 2026-01-17

  • Q: Should backup_schedule.run_now and backup_schedule.retry be part of the Phase 1 adoption set (must be implemented) in 054? → A: Yes, both are Phase 1 in 054 (OperationRun producers + worker tracking).
  • Q: If queue dispatch fails (background processing unavailable), should we still create an OperationRun and immediately complete it as failed? → A: Yes, create an OperationRun and immediately complete it as failed (e.g., failure code queue.dispatch_failed); show a clear error and MAY include a “View run” link.
  • Q: When a start is deduped (the run is reused), who should receive the in-app notifications (“queued” + terminal outcome)? → A: Only the original initiator (operation_runs.user_id); no additional notifications are sent to the second starter on reuse.
  • Q: For restore.execute: in which RestoreRunStatus phases should an OperationRun adapter row be created/shown at all? → A: From previewed onwards (previewed + execution statuses); no adapter row for draft/scoped/checked.
  • Q: If the restore.execute adapter is already visible from RestoreRunStatus=previewed, which OperationRun state should we set for this phase? → A: status=queued, outcome=pending (until running, then completed + terminal outcome).
  • Q: RBAC Wizard (TenantResource): how does Group Search work? → A: Group search is delegated-Graph-based and the picker MUST be disabled without delegated auth.
  • Q: Restore Wizard (RestoreRunResource) Group Mapping phase: Graph or DB-only? → A: DB-only via Directory Cache (entra_groups), no Graph calls during mapping; helper text is always shown (fallback included).

User Scenarios & Testing (mandatory)

User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)

As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.

Why this priority: This is the core value: a single, tenant-scoped source of truth for operational visibility.

Independent Test: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.

Acceptance Scenarios:

  1. Given I am signed into tenant A, When I open Monitoring → Operations, Then I see only tenant A runs and can filter by run type, run state (queued/running/terminal outcome), time range, and initiator.
  2. Given multiple run types exist, When I filter to inventory.sync, Then only inventory sync runs are shown.
  3. Given a run exists, When I open its detail view, Then I can see initiator, run type, run state (queued/running/terminal outcome), timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
  4. Given a restore run has reached previewed or later, When I open Monitoring → Operations, Then I can see a restore.execute entry that links to the existing restore record (restore history remains owned by the restore domain record).
  5. Given I am a Readonly user in tenant A, When I view Monitoring → Operations, Then I can view runs and details but I do not see any start/rerun/cancel/delete controls.
  6. Given I attempt to access a run from another tenant (direct link or list), When I request it, Then access is denied and no run details are disclosed.
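
The list and detail scenarios above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Run` dataclass, `list_runs`, and `get_run` are hypothetical names, runs are held in memory rather than queried from `operation_runs`, and the 30-day default window is taken from FR-004.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Run:  # hypothetical in-memory stand-in for an operation_runs row
    id: int
    tenant_id: int
    run_type: str
    status: str    # queued | running | completed
    outcome: str   # pending | succeeded | partially_succeeded | failed
    initiator_name: str
    created_at: datetime


def run_state(run: Run) -> str:
    """Single display bucket: lifecycle status while active, else the outcome."""
    return run.status if run.status in ("queued", "running") else run.outcome


def list_runs(runs, tenant_id, *, now, run_type=None, state=None,
              initiator=None, window_days=30):
    """Return the tenant's runs inside the time window, most recent first."""
    since = now - timedelta(days=window_days)
    matches = [
        r for r in runs
        if r.tenant_id == tenant_id            # tenant isolation comes first
        and r.created_at >= since
        and (run_type is None or r.run_type == run_type)
        and (state is None or run_state(r) == state)
        and (initiator is None or r.initiator_name == initiator)
    ]
    return sorted(matches, key=lambda r: r.created_at, reverse=True)


def get_run(runs, tenant_id, run_id) -> Optional[Run]:
    """Detail lookup: a run owned by another tenant is treated as not found."""
    for r in runs:
        if r.id == run_id and r.tenant_id == tenant_id:
            return r
    return None
```

Returning `None` for a foreign run lets the caller answer a cross-tenant request exactly as it would answer an unknown ID, so no run details leak through the error path.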

User Story 2 - Start Operations Without Blocking (Priority: P2)

As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.

Why this priority: Removes long-running requests/timeouts and standardizes how operations are started and observed.

Independent Test: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.

Acceptance Scenarios:

  1. Given I have permission to start a Phase 1 operation in tenant A, When I start it, Then I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
  2. Given I am a Readonly user in tenant A, When I attempt to start any Phase 1 operation, Then the system denies the request and does not create a new run.
  3. Given the run reaches a terminal outcome, When that occurs, Then the initiating user receives an in-app notification including a short summary and a “View run” link.
  4. Given background processing is unavailable, When I attempt to start an operation, Then I receive a clear message and the system MUST NOT claim it was queued.
    • If an OperationRun record was created during the attempt, it MUST be completed immediately with outcome failed (never left queued) and MAY be linked via “View run”.
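
A minimal sketch of scenario 4, assuming a hypothetical `dispatch` callable that raises `DispatchError` when the job cannot be enqueued. The `OperationRun` class here is a simplified stand-in, not the real model; only the failure code `queue.dispatch_failed` comes from this spec.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class OperationRun:  # simplified stand-in for the canonical run record
    run_type: str
    status: str = "queued"
    outcome: str = "pending"
    failure_code: Optional[str] = None


class DispatchError(Exception):
    """Raised by the (hypothetical) queue when a job cannot be enqueued."""


def start_operation(
    run_type: str, dispatch: Callable[[OperationRun], None]
) -> Tuple[OperationRun, str]:
    """Create a run, try to enqueue it, and never leave it queued on failure."""
    run = OperationRun(run_type=run_type)
    try:
        dispatch(run)
    except DispatchError:
        # Complete immediately as failed so the run is never stuck in "queued".
        run.status = "completed"
        run.outcome = "failed"
        run.failure_code = "queue.dispatch_failed"
        return run, "Background processing is unavailable; the operation was not queued."
    return run, "Operation queued."
```

The returned run can back an optional “View run” link in both branches; in the failure branch the message avoids any claim that work was queued.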

User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)

As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.

Why this priority: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.

Independent Test: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.

Acceptance Scenarios:

  1. Given an identical run is queued/running for a tenant, When another start request is made with the same effective inputs, Then the system reuses the existing run and does not start a second one.
  2. Given two starts happen at nearly the same time, When the system resolves the race, Then at most one active run exists for that identity and both users are directed to it.
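
The race in scenario 2 is resolved by the partial unique index described in the clarifications. The sketch below uses SQLite purely so that it runs standalone; the production database and migration syntax may differ, and the reduced column set, the index name, and `start_or_reuse` are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE operation_runs (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    run_type TEXT NOT NULL,
    run_identity_hash TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
);
-- Only active rows take part in the uniqueness check.
CREATE UNIQUE INDEX operation_runs_active_identity
    ON operation_runs (tenant_id, run_identity_hash)
    WHERE status IN ('queued', 'running');
"""


def start_or_reuse(conn, tenant_id, run_type, identity_hash):
    """Return (run_id, created); a losing racer receives the winner's run id."""
    while True:
        try:
            cur = conn.execute(
                "INSERT INTO operation_runs (tenant_id, run_type, run_identity_hash)"
                " VALUES (?, ?, ?)",
                (tenant_id, run_type, identity_hash),
            )
            conn.commit()
            return cur.lastrowid, True
        except sqlite3.IntegrityError:
            conn.rollback()
            row = conn.execute(
                "SELECT id FROM operation_runs"
                " WHERE tenant_id = ? AND run_identity_hash = ?"
                " AND status IN ('queued', 'running')",
                (tenant_id, identity_hash),
            ).fetchone()
            if row is not None:
                return row[0], False
            # The active run finished between the two statements; insert again.
```

Because the constraint lives in the database, two concurrent requests cannot both insert an active row for the same identity; once the run completes, the same identity may start a fresh run.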

Edge Cases

  • Background execution unavailable: start fails fast with a clear message; if an OperationRun record was created, it MUST be immediately completed as failed (e.g., queue.dispatch_failed) and MUST NOT be left queued.
  • Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
  • Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
  • Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.
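
The partial-processing rule can be expressed as a small function over per-item counts. This is a sketch: `terminal_outcome` and the `could_proceed` flag are hypothetical, and a run with zero processed items is classified as failed only because the written rule says “zero successes”; the spec does not address that case explicitly.

```python
def terminal_outcome(succeeded: int, failed: int, could_proceed: bool = True) -> str:
    """Map per-item counts to a stored outcome token."""
    if not could_proceed or succeeded == 0:
        return "failed"  # zero successes, or the run could not proceed at all
    if failed > 0:
        return "partially_succeeded"  # at least one success and one failure
    return "succeeded"
```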

Requirements (mandatory)

Constitution alignment (required): If this feature introduces any external tenant API calls or any write/change behavior, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

Scope & Assumptions

Phase 1 adoption set (must be implemented):

  • inventory.sync (Inventory “Sync now”)
  • policy.sync (Policies “Sync now”)
  • directory_groups.sync (Directory → Groups “Sync groups”)
  • drift.generate (Drift “Generate drift now” / auto-on-open when eligible)
  • backup_set.add_policies (Backup Sets “Add selected” / “Add policies”)
  • backup_schedule.run_now (Backup Schedules “Run now”)
  • backup_schedule.retry (Backup Schedules “Retry”)

Restore visibility (adapter only):

  • restore.execute appears as a canonical run entry that links to an existing restore domain record.
  • The adapter row MUST be created/visible only once a restore run reaches previewed (or later) and MUST NOT be created for draft, scoped, or checked.
  • When the restore run is previewed, the adapter OperationRun MUST use status=queued and outcome=pending.
  • Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
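
One way to read the adapter rules is as a mapping from the restore status to an optional (status, outcome) pair. In the sketch below, only draft, scoped, checked, and previewed are named in this spec; the execution-phase names in `EXECUTION_STATES` are hypothetical placeholders, and the real RestoreRunStatus values may differ.

```python
from typing import Optional, Tuple

# Phases for which no adapter row is created.
NO_ADAPTER = {"draft", "scoped", "checked"}

# Hypothetical execution-phase names; the real RestoreRunStatus values may differ.
EXECUTION_STATES = {
    "running": ("running", "pending"),
    "completed": ("completed", "succeeded"),
    "partial": ("completed", "partially_succeeded"),
    "failed": ("completed", "failed"),
}


def adapter_state(restore_status: str) -> Optional[Tuple[str, str]]:
    """Return (status, outcome) for the adapter row, or None if no row exists yet."""
    if restore_status in NO_ADAPTER:
        return None
    if restore_status == "previewed":
        return ("queued", "pending")
    # Unknown statuses raise KeyError rather than guessing a state.
    return EXECUTION_STATES[restore_status]
```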

Out of scope for 054 (explicit):

  • Cross-tenant compare/promotion
  • UI redesign/styling polish (separate UI polish work)
  • Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
  • Replacing restore domain records with canonical runs
  • A full settings UI for retention/notifications/etc.
  • Implementing or validating AuditLog behavior for audit-only actions (FR-019) beyond actions explicitly changed by 054

Assumptions (defaults to remove ambiguity in Phase 1):

  • Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
  • System-initiated runs (if any) do not notify users by default in Phase 1.
  • Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.
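
As an illustration of the retention default, a purge job could decide eligibility as sketched here. The spec fixes only the 90-day value; measuring from the finish timestamp and never purging active runs are assumptions, and `is_purgeable` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Optional

RETENTION_DAYS = 90  # Phase 1 default; not user-configurable in 054


def is_purgeable(status: str, finished_at: Optional[datetime], now: datetime,
                 retention_days: int = RETENTION_DAYS) -> bool:
    """True when a completed run finished more than `retention_days` ago."""
    if status != "completed" or finished_at is None:
        return False  # assumption: active runs are never purged
    return finished_at < now - timedelta(days=retention_days)
```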

Run vs Audit-only Adoption Matrix (Phase 1):

| Feature Area | Action | Tracking | run_type / audit action |
| --- | --- | --- | --- |
| Policies | Sync now | OperationRun | policy.sync |
| Policies | Ignore policy | Audit-only | policy.ignore |
| Policies | Export to backup | OperationRun (queued) | policy.export_backup |
| Policy Versions | Capture snapshot | OperationRun | policy.capture_snapshot |
| Policy Versions | Prune versions | Audit-only | policy_versions.prune |
| Policy Versions | Archive versions | Audit-only | policy_versions.archive |
| Inventory | Sync now | OperationRun | inventory.sync |
| Directory Groups | Sync groups | OperationRun | directory_groups.sync |
| Drift | Generate drift | OperationRun | drift.generate |
| Backup Sets | Add policies | OperationRun | backup_set.add_policies |
| Backup Sets | Archive | Audit-only (DB-only) | backup_set.archive |
| Backup Sets | Restore (bulk) | OperationRun | backup_set.restore |
| Backup Sets | Force delete | Audit-only (admin-only) | backup_set.force_delete |
| Backup Schedules | Run now | OperationRun | backup_schedule.run_now |
| Backup Schedules | Retry | OperationRun | backup_schedule.retry |
| Backup Schedules | Edit | Audit-only | backup_schedule.edit |
| Backup Schedules | Delete | Audit-only | backup_schedule.delete |
| Tenants | Sync tenant | OperationRun | tenant.sync |
| Tenants | Admin consent | Audit-only | tenant.admin_consent |
| Tenants | Verify configuration | Audit-only | tenant.verify_config |
| Tenants | Setup Intune RBAC | Audit-only | tenant.setup_rbac |
| Tenants | Deactivate | Audit-only | tenant.deactivate |
| Restore | Execute restore | OperationRun (adapter) | restore.execute (context → restore_run_id) |

Rule: If an action is queued/background, long-running, or requires remote/external calls (e.g., Microsoft Graph), it MUST be tracked as an OperationRun. Only fast DB-only changes MAY be Audit-only.
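
The rule can be checked mechanically during review. The helper below is a hypothetical sketch: the three flags are reviewer-supplied judgments, not fields that exist anywhere in the product.

```python
def tracking_for(*, queued: bool, long_running: bool, remote_calls: bool) -> str:
    """Classify an action according to the adoption rule."""
    if queued or long_running or remote_calls:
        return "OperationRun"  # MUST be tracked as a canonical run
    return "Audit-only"        # fast DB-only change; Audit-only is permitted
```

For example, a sync action that calls Microsoft Graph would be classified with `remote_calls=True` and therefore tracked as an OperationRun, regardless of how quickly it usually completes.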

Functional Requirements

  • FR-001 Canonical Operation Run: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable user_id FK + initiator_name string), run type, lifecycle status/timestamps, terminal outcome (pending while active), summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
    • Status semantics: status represents lifecycle stage (queued → running → completed).
    • Outcome semantics (stored tokens): outcome stores machine tokens: pending while active, otherwise succeeded / partially_succeeded / failed.
      • UI labels: Monitoring displays human labels derived from stored tokens (e.g., partially_succeeded → “Partially succeeded”).
      • Reserved: cancelled is reserved for future use and MUST NOT be produced by 054 (Monitoring hub has no cancel controls).
    • Context safety: context MUST be sanitized and MUST include only safe references (e.g., stable IDs, selection scope keys, correlation IDs). It MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
  • FR-002 Run taxonomy: Run type MUST be stable and follow "<resource>.<action>".
  • FR-003 Phase 1 run types: Phase 1 run types MUST include inventory.sync, policy.sync, directory_groups.sync, drift.generate, backup_set.add_policies, backup_schedule.run_now, backup_schedule.retry, plus restore.execute implemented as a physical operation_runs record (adapter) pointing to the domain entity.
  • FR-004 Monitoring lists all canonical runs: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, run state (queued/running/terminal outcome), time range, and initiator; default sort is most recent first; default time window is last 30 days.
  • FR-005 Run detail: Run detail MUST show initiator, run type, run state (queued/running/terminal outcome), timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
  • FR-006 View-only hub: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
  • FR-007 Start surfaces always enqueue: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
  • FR-008 No remote work in interactive request: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
  • FR-009 Deterministic idempotency: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. Enforcement: Uniqueness MUST be enforced via a partial unique index on (tenant_id, run_identity_hash) where status is queued or running.
  • FR-010 Phase 1 identity rules: Identity rules MUST be defined at least as follows:
    • inventory.sync: tenant + selection scope
    • policy.sync: tenant + effective policy scope
    • directory_groups.sync: tenant + selection (Phase 1 default: “all groups”)
    • backup_set.add_policies: tenant + backup set + selected policies + option flags (if exposed)
    • backup_schedule.run_now: tenant + backup schedule id
    • backup_schedule.retry: tenant + backup schedule id
    • drift.generate: tenant + scope key + baseline/current comparison inputs
  • FR-011 Run state presentation: Monitoring MUST present a consistent run state using a single display bucket derived from lifecycle status and terminal outcome:
    • If status is queued or running, display that status.
    • If status is completed, display the terminal outcome derived from the stored token (succeeded, partially_succeeded, or failed) using the UI label mapping.
  • FR-012 Partial vs failed (terminal outcomes): “Partially succeeded” (partially_succeeded) means at least one success and at least one failure; “Failed” (failed) means zero successes or cannot proceed.
  • FR-013 Failure details are safe + useful: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
    • Reason codes MUST be stable, machine-readable identifiers (lowercase, dot-separated), e.g. graph.throttled, auth.forbidden, validation.invalid_input, unexpected.exception.
    • Messages MUST be short (≤ 200 characters), sanitized, and written for operators (no secrets/tokens/credentials/PII; no raw external payloads). If needed, messages MAY include a non-sensitive correlation identifier.
  • FR-014 Related links: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for restore.execute).
  • FR-015 Notifications: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
    • If a start request reuses an existing active run (dedupe), the run initiator (as stored on the OperationRun) remains the sole notification recipient; the second starter receives no additional notifications.
  • FR-016 Tenant isolation: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
  • FR-017 No render-time remote calls: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
  • FR-018 Roles & permissions: Roles Owner, Manager, Operator, and Readonly MUST be able to view runs; only Owner, Manager, Operator may start operations; Readonly is strictly view-only.
  • FR-019 Audit-only actions (no OperationRun): Actions that are DB-only and complete within ≤2 seconds under normal conditions MAY be executed without an OperationRun, as long as they do not start long-running background execution and do not require any remote/external calls.
    • 054 scope note: 054 does not implement or modify audit-only actions. If any audit-only action is touched as part of implementing 054 in the future, it MUST comply with this requirement and MUST be covered by tests. If such an action is security-relevant or changes operational behavior (e.g., “Ignore policy”, “Deactivate tenant”, “Admin consent”, “Prune versions”, “Force delete”), it MUST write exactly one tenant-scoped AuditLog entry with, at minimum:
      • tenant_id
      • actor_user_id
      • action (stable action identifier, e.g., policy.ignore)
      • target_type, target_id
      • before / after (sanitized JSON) or diff (sanitized JSON)
      • created_at
    • Trigger guidance (to make classification reviewable):
      • “Security-relevant” includes actions that grant/revoke access, change authorization posture, change admin consent, or otherwise modify who/what can read/write tenant data.
      • “Operational behavior change” includes actions that change what the system will do in future runs (e.g., ignore/exclude resources, enable/disable schedules, retention/prune/archive actions, force deletes).
      • If unclear whether an Audit-only action is security/ops-relevant, the default is to treat it as such and write an AuditLog entry.
    • Sanitization (AuditLog before/after/diff):
      • AuditLog payloads MUST include only the minimum fields needed to understand the change.
      • AuditLog payloads MUST NOT include secrets/tokens/credentials, personal data, or full external payload dumps.
      • If a field is sensitive, it MUST be omitted or replaced with a non-sensitive placeholder (e.g., "[REDACTED]").
    • Monitoring/Operations remains reserved for OperationRun-tracked long-running/queued operations.
    • Acceptance checks (testable):
      • Audit-only action creates no OperationRun.
      • Audit-only action creates exactly one AuditLog event containing the required fields.
      • Audit-only action is tenant-scoped; cross-tenant access is forbidden and MUST NOT create AuditLog entries.
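
The FR-019 minimum fields and the placeholder rule could be assembled roughly as follows. This is a sketch under assumptions: `SENSITIVE_KEYS` is a hypothetical deny-list, only top-level keys are inspected, and `build_audit_entry` returns a plain dict rather than persisting anything.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Hypothetical deny-list; a real implementation would define this per payload type.
SENSITIVE_KEYS = {"password", "secret", "client_secret", "token", "access_token"}


def redact(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Replace values of sensitive top-level keys with a placeholder."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }


def build_audit_entry(tenant_id: int, actor_user_id: int, action: str,
                      target_type: str, target_id: int,
                      before: Dict[str, Any], after: Dict[str, Any],
                      created_at: Optional[datetime] = None) -> Dict[str, Any]:
    """Assemble the minimum AuditLog fields listed in FR-019."""
    return {
        "tenant_id": tenant_id,
        "actor_user_id": actor_user_id,
        "action": action,
        "target_type": target_type,
        "target_id": target_id,
        "before": redact(before),
        "after": redact(after),
        "created_at": created_at or datetime.now(timezone.utc),
    }
```

A caller would pass only the fields needed to understand the change, since the redaction step is a safety net and not a substitute for minimizing the payload.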

Key Entities (include if feature involves data)

  • Canonical Operation Run: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable user_id FK + initiator_name string), lifecycle state/timestamps, terminal outcome, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
  • Restore domain record (exception): Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical operation_runs row (adapter) that links back to the restore record, without replacing it.
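
The idempotency identity on the canonical run could be derived as sketched below. The function name, the SHA-256 choice, and the JSON canonicalization are assumptions; the spec only requires a deterministic identity from tenant + effective inputs with the initiator excluded. Because the unique index covers (tenant_id, run_identity_hash) only, this sketch folds the run type into the hash so that, for example, run_now and retry on the same schedule do not collide.

```python
import hashlib
import json
from typing import Any, Dict


def run_identity_hash(run_type: str, effective_inputs: Dict[str, Any]) -> str:
    """Deterministic hash of run type + effective inputs (initiator excluded)."""
    canonical = json.dumps(
        {"run_type": run_type, "inputs": effective_inputs},
        sort_keys=True,             # key order must not change the identity
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Callers are expected to normalize order-insensitive inputs first (for instance, sorting the selected policy IDs), since list order is preserved in the canonical form.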

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
  • SC-002: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
  • SC-003: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
  • SC-003 Measurement scope (definition): An “attempt” counts when a start request is made for an operation with identical effective inputs while an identical run is already queued or running. The success condition is that the system reuses the existing active run reference rather than creating a second active run. “Normal conditions” exclude infrastructure outages (e.g., database unavailable) that prevent either run creation or dedupe evaluation.
  • SC-004: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
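
A test for SC-004 needs a sanitizer to exercise. The following is a minimal sketch, assuming hypothetical redaction patterns and a fallback to `unexpected.exception` for malformed reason codes; it illustrates the FR-013 constraints (lowercase dot-separated codes, messages of at most 200 characters) and is not an exhaustive secret detector.

```python
import re
from typing import Dict

REASON_CODE = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)+$")
MAX_MESSAGE_LENGTH = 200

# Hypothetical patterns; real coverage would need to be broader.
SECRET_PATTERNS = [
    re.compile(r"bearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
    re.compile(r"\b(password|secret|token)=\S+", re.IGNORECASE),
]


def sanitize_failure(code: str, message: str) -> Dict[str, str]:
    """Return a failure summary that is safe to persist and display."""
    if not REASON_CODE.match(code):
        code = "unexpected.exception"   # assumption: fallback for malformed codes
    for pattern in SECRET_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    message = " ".join(message.split())  # collapse whitespace and newlines
    if len(message) > MAX_MESSAGE_LENGTH:
        message = message[: MAX_MESSAGE_LENGTH - 1] + "…"
    return {"code": code, "message": message}
```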