TenantAtlas/specs/113-platform-ops-runbooks/research.md
2026-02-26 02:18:19 +01:00

Research — Spec 113: Platform Ops Runbooks

This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.

Decisions

1) Reuse existing backfill pipeline (Command + Job) via a single service

  • Decision: Extract a single “runbook service” that is called from:
    • /system runbook UI (preflight + start)
    • CLI command (tenantpilot:findings:backfill-lifecycle)
    • deploy-time hook
  • Rationale: The repo already contains a correct tenant-scoped implementation:
    • Command: app/Console/Commands/TenantpilotBackfillFindingLifecycle.php
    • Job: app/Jobs/BackfillFindingLifecycleJob.php
    • It uses OperationRunService for lifecycle transitions and idempotency, and a cache lock per tenant.
  • Alternatives considered:
    • Build a new pipeline from scratch → rejected as it duplicates proven behavior and increases drift risk.
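
The shape of this decision can be sketched as follows. This is a language-neutral model written in Python, not the repository's PHP code: the class name `BackfillRunbookService`, its methods, and the `initiated_via` labels are all hypothetical, and the in-memory set only stands in for the per-tenant cache lock mentioned above.

```python
class BackfillRunbookService:
    """Hypothetical single entry point shared by the UI, the CLI and the deploy hook."""

    def __init__(self):
        # Stand-in for the per-tenant cache lock; the real lock lives in the cache store.
        self._locked_tenants = set()

    def start(self, tenant_id, initiated_via):
        """Start a backfill for one tenant unless that tenant is already locked."""
        if tenant_id in self._locked_tenants:
            return {"tenant_id": tenant_id, "started": False, "reason": "locked"}
        self._locked_tenants.add(tenant_id)
        return {"tenant_id": tenant_id, "started": True, "initiated_via": initiated_via}

    def finish(self, tenant_id):
        """Release the tenant lock once the job has ended (success or failure)."""
        self._locked_tenants.discard(tenant_id)


service = BackfillRunbookService()
first = service.start(7, initiated_via="system_ui")
second = service.start(7, initiated_via="cli")  # same tenant, still locked
service.finish(7)
third = service.start(7, initiated_via="deploy_hook")
```

Because every caller goes through the same `start()` method, a second request for a locked tenant is refused in the same way regardless of which surface issued it.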

2) “All tenants” scope uses a single workspace run updated by many tenant jobs

  • Decision: Implement All-tenants as:
    1. one workspace-scoped OperationRun (tenant_id = null) created with OperationRunService::ensureWorkspaceRunWithIdentity()
    2. fan-out to many queued tenant jobs that all increment the same workspace run's summary_counts and contribute failures
    3. completion via OperationRunService::maybeCompleteBulkRun() when processed >= total (same pattern as workspace backfills)
  • Rationale:
    • This matches an existing proven pattern in the repo (tenantpilot:backfill-workspace-ids + BackfillWorkspaceIdsJob).
    • It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
    • Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
  • Alternatives considered:
    • Separate per-tenant OperationRun records + an umbrella run → rejected for v1 due to added coordination complexity.
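
The counting and completion rule in this decision can be sketched as follows. This is a minimal Python model, not the PHP implementation: `WorkspaceRun`, `record_tenant_result`, the field names and the terminal status values are assumptions made for illustration, and only the `processed >= total` condition comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceRun:
    """Hypothetical stand-in for the workspace-scoped OperationRun (tenant_id = null)."""
    total: int
    processed: int = 0
    succeeded: int = 0
    failed: int = 0
    failures: list = field(default_factory=list)
    status: str = "running"


def maybe_complete(run):
    """Make the run terminal once processed >= total; otherwise do nothing."""
    if run.status != "running" or run.processed < run.total:
        return
    # Assumed status names; the real outcome vocabulary may differ.
    run.status = "failed" if run.failed else "succeeded"


def record_tenant_result(run, tenant_id, error=None):
    """Called once by each tenant job to update the shared counters."""
    run.processed += 1
    if error is None:
        run.succeeded += 1
    else:
        run.failed += 1
        run.failures.append({"tenant_id": tenant_id, "message": error})
    maybe_complete(run)


run = WorkspaceRun(total=3)
record_tenant_result(run, tenant_id=1)
record_tenant_result(run, tenant_id=2, error="lock timeout")
record_tenant_result(run, tenant_id=3)
```

In this sketch the counters are plain attributes. With tenant jobs running concurrently on a queue, the increments would need to be atomic (for example, performed in the database) so that no update is lost and completion fires exactly once.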

3) Workspace scope for /system runbooks (v1)

  • Decision: v1 targets the default workspace (same workspace that owns the platform Tenant created by PlatformUserSeeder).
  • Rationale:
    • Platform identity currently has no explicit workspace selector in the System panel.
    • Existing seeder creates Workspace(slug=default) and a Tenant(external_id=platform) inside it.
  • Alternatives considered:
    • Multi-workspace operator selection in /system → deferred (not in spec, requires new UX + entitlement model).

4) Remove/disable /admin maintenance action (FR-001)

  • Decision: Remove or feature-flag off the existing /admin header action “Backfill findings lifecycle” currently present in app/Filament/Resources/FindingResource/Pages/ListFindings.php.
  • Rationale: Spec explicitly forbids customer-plane exposure in production-like environments.
  • Alternatives considered:
    • Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.

5) Session isolation for /system (SR-004)

  • Decision: Add a System-panel-only middleware that sets a dedicated session cookie name for /system/* before StartSession runs.
  • Rationale:
    • SystemPanelProvider defines its own middleware list; we can insert a middleware at the top.
    • Changing config(['session.cookie' => ...]) per request is sufficient for cookie separation without introducing a new domain.
  • Alternatives considered:
    • Separate subdomain → deferred (explicitly “later”).
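
Why the middleware has to run before StartSession can be shown with a small model of a middleware pipeline. This is a Python sketch, not Laravel code: the function names and the cookie names `app_session` and `system_session` are hypothetical.

```python
def use_system_session_cookie(request, config, next_handler):
    """Switch the session cookie name for System panel requests."""
    path = request["path"]
    if path == "/system" or path.startswith("/system/"):
        config["session.cookie"] = "system_session"  # hypothetical cookie name
    return next_handler(request, config)


def start_session(request, config, next_handler):
    """Model of session start: reads the cookie name that is configured right now."""
    request["session_cookie"] = config["session.cookie"]
    return next_handler(request, config)


def run_pipeline(middleware, request, config):
    """Run the middleware in order, with a per-request copy of the config."""
    def dispatch(index, req, cfg):
        if index == len(middleware):
            return req
        return middleware[index](req, cfg, lambda r, c: dispatch(index + 1, r, c))
    return dispatch(0, dict(request), dict(config))


base_config = {"session.cookie": "app_session"}
correct_order = [use_system_session_cookie, start_session]
wrong_order = [start_session, use_system_session_cookie]

isolated = run_pipeline(correct_order, {"path": "/system/runbooks"}, base_config)
shared = run_pipeline(wrong_order, {"path": "/system/runbooks"}, base_config)
```

With the correct order the System request ends up on its own cookie. With the order reversed, the session has already been started under the default cookie name by the time the override runs, so the two planes would still share a session cookie.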

6) /system/login rate limiting (SR-003)

  • Decision: Implement rate limiting inside app/Filament/System/Pages/Auth/Login.php (override authenticate()) using a combined key of ip + normalized(email), limited to 10 attempts per minute.
  • Rationale:
    • The System login already overrides authenticate() to add auditing.
    • Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
  • Alternatives considered:
    • Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.
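
The combined key and the 10-per-minute limit can be sketched as below. This is a Python model of the policy only, not the Filament login page: the normalization rule (trim and lowercase), the hashing of the key and the fixed-window counting are assumptions, since the decision above does not specify them.

```python
import hashlib

MAX_ATTEMPTS = 10      # attempts allowed per window
WINDOW_SECONDS = 60    # window length in seconds


def throttle_key(ip, email):
    """Build one key from the IP and a normalized email (assumed: trim + lowercase)."""
    normalized = email.strip().lower()
    return hashlib.sha256(f"{ip}|{normalized}".encode("utf-8")).hexdigest()


class LoginRateLimiter:
    """Fixed-window counter keyed by throttle_key(); `now` is passed in for testability."""

    def __init__(self):
        self._windows = {}  # key -> (window_start, attempts_in_window)

    def attempt(self, ip, email, now):
        """Return True if the attempt is allowed, False if it is throttled."""
        key = throttle_key(ip, email)
        start, count = self._windows.get(key, (now, 0))
        if now - start >= WINDOW_SECONDS:
            start, count = now, 0  # previous window expired, start a new one
        if count >= MAX_ATTEMPTS:
            return False
        self._windows[key] = (start, count + 1)
        return True


limiter = LoginRateLimiter()
results = [limiter.attempt("203.0.113.5", "ops@example.com", now=0) for _ in range(11)]
```

Keying on both values means one address cannot exhaust the budget for every account, and one account cannot be locked out from every address, at the cost of letting a distributed attacker spread attempts across IPs.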

7) 404 vs 403 semantics for platform capability checks (SR-002)

  • Decision: Keep cross-plane denial as 404 (existing EnsureCorrectGuard), but missing platform capability should return 403.
  • Rationale:
    • Spec requires: wrong plane → 404; platform lacking capability → 403.
    • EnsurePlatformCapability currently aborts with 404, which conflicts with the spec.
  • Alternatives considered:
    • Return 404 for missing platform capability → rejected because it contradicts the agreed spec.
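
The intended status mapping is small enough to state as a single function. The sketch below is Python rather than the PHP middleware, and the function name and the capability strings are hypothetical. Only the 404/403 split comes from the spec.

```python
def system_access_status(is_platform_guard, capabilities, required_capability):
    """Return the HTTP status a /system request should receive."""
    if not is_platform_guard:
        # Wrong plane: do not reveal that the route exists.
        return 404
    if required_capability not in capabilities:
        # Right plane, but the operator lacks the capability.
        return 403
    return 200


tenant_user = system_access_status(False, set(), "platform.runbooks.run")
limited_operator = system_access_status(True, {"platform.access"}, "platform.runbooks.run")
full_operator = system_access_status(True, {"platform.runbooks.run"}, "platform.runbooks.run")
```

The guard check comes first so that a customer-plane user never learns which capabilities exist. A 403 is only ever shown to someone already authenticated on the platform plane.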

8) Failure notifications (FR-009)

  • Decision: On run failure, emit:
    1. the canonical terminal DB notification (OperationRunCompleted) to the initiating platform operator (in-app)
    2. an Alerts event (Teams / Email) if alert routing is configured
  • Rationale:
    • Alerts system already exists (AlertDispatchService + queued deliveries). It can route to Teams webhook / Email.
    • OperationRunCompleted already formats the correct persistent DB notification payload via OperationUxPresenter.
  • Alternatives considered:
    • Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.

Notes for implementation

  • Platform capabilities must be defined in the registry (app/Support/Auth/PlatformCapabilities.php) and referenced via constants.
  • The System panel currently does not call ->databaseNotifications(). If we want in-app notifications for platform operators, add it.
  • OperationRun.user_id cannot point to platform_users; use context fields to record platform initiator metadata.