TenantAtlas/specs/391-operations-hub-stability-debug-safe-runtime/plan.md
Ahmed Darrazi 6918b8af5a
feat: add operations hub stability and safety runtime checks
2026-06-20 16:15:55 +02:00

Implementation Plan: Spec 391 - Operations Hub Stability and Debug-Safe Runtime

Branch: 391-operations-hub-stability-debug-safe-runtime | Date: 2026-06-20 | Spec: specs/391-operations-hub-stability-debug-safe-runtime/spec.md
Input: Feature specification from /specs/391-operations-hub-stability-debug-safe-runtime/spec.md

Summary

Stabilize the existing admin Operations hub so the environment-filtered route renders quickly and safely, then add focused productization browser-smoke guardrails for the exact debug/runtime leakage observed in BUG-001 and BUG-009. The work stays inside the Operations render/query/runtime-smoke surface and must not change Evidence, Provider, Review Pack, Restore, dashboard, provider mutation, export, or customer delivery semantics.

Technical Context

Language/Version: PHP 8.4.15, Laravel 12.52, Filament 5.2.1, Livewire 4.1.4.
Primary Dependencies: Filament v5, Livewire v4, Pest 4, PostgreSQL, existing browser smoke helpers.
Storage: Existing PostgreSQL operation_runs, workspaces, and managed_environments; no new storage expected.
Testing: Pest 4 feature/Livewire/browser tests.
Validation Lanes: fast-feedback/confidence for feature tests; browser for productization smoke; targeted formatting.
Target Platform: Laravel admin panel at /admin, local Sail/Dokploy-style container runtime.
Project Type: Laravel monolith under apps/platform.
Performance Goals: Operations route renders in under 3 seconds after authentication for the audited data shape; index render stays bounded and paginated.
Constraints: No migrations unless proven and spec/plan updated first; no seeders; no queues/jobs that mutate provider/customer state; no Graph/provider calls in render; do not increase PHP max execution time.
Scale/Scope: Existing Operations hub, environment-filtered route, runtime-smoke checks.

UI / Surface Guardrail Plan

  • Guardrail scope: changed existing operator-facing Operations surface plus workflow-only productization browser smoke guardrail.
  • Affected routes/pages/actions/states/navigation/panel/provider surfaces:
    • /admin/workspaces/{workspace}/operations
    • /admin/workspaces/{workspace}/operations?environment_id={managedEnvironment}
    • App\Filament\Pages\Monitoring\Operations
    • App\Filament\Resources\OperationRunResource
    • Existing dashboard/workspace drilldowns that link to Operations
    • Productization-smoke browser route checks
  • No-impact class, if applicable: N/A.
  • Native vs custom classification summary: Native Filament page/table/resource plus existing Operations Blade composition; no new visual system.
  • Shared-family relevance: OperationRun monitoring/detail family, action links, status badges, browser-smoke runtime guard.
  • State layers in scope: URL-query environment_id, page/table filters, session filter state where already used, browser console/network/DOM assertions.
  • Audience modes in scope: operator-MSP, manager, support-platform.
  • Decision/diagnostic/raw hierarchy plan: Operations default-visible list/workbench remains decision-first; raw context, stack traces, provider payloads, and source links remain diagnostic-only or absent from productization-smoke output.
  • Raw/support gating plan: no new raw/support exposure; smoke must fail if debug pages/source links/raw stack traces become visible.
  • One-primary-action / duplicate-truth control: preserve existing open/detail action as the dominant safe next step; do not add competing retry/export/destructive actions.
  • Handling modes by drift class or surface: review-mandatory for Operations render path and runtime-smoke guard; report-only for existing UI-016 coverage unless implementation materially changes route/archetype.
  • Repository-signal treatment: review-mandatory because this touches a strategic monitoring surface and adds Browser lane proof.
  • Special surface test profiles: monitoring-state-page and global-context-shell.
  • Required tests or manual smoke: Feature/Livewire render/scoping/bounded tests plus Browser productization smoke.
  • Exception path and spread control: none expected.
  • Active feature PR close-out entry: Guardrail / Exception / Smoke Coverage.
  • UI/Productization coverage decision: Existing UI-016 coverage remains valid; implementation must update audit registry only if visible archetype/route changes exceed stability-state changes.
  • Coverage artifacts to update: none by default; screenshots under the spec artifacts folder for final browser verification.
  • No-impact rationale: N/A.
  • Navigation / Filament provider-panel handling: no panel provider changes; provider registration remains apps/platform/bootstrap/providers.php.
  • Screenshot or page-report need: screenshot required for final smoke evidence; no full page report unless implementation changes the Operations page archetype.

Shared Pattern & System Fit

  • Cross-cutting feature marker: yes, bounded.
  • Systems touched: Operations hub, OperationRunResource table/list rendering, OperationRun links/presenters, productization browser smoke, Debugbar/Vite asset-smoke controls.
  • Shared abstractions reused: OperationRunLinks, OperationUxPresenter, BadgeCatalog, BadgeRenderer, TablePaginationProfiles, SuppressDebugbarForSmokeRequests, PanelThemeAsset, existing Pest Browser smoke patterns.
  • New abstraction introduced? why?: none expected. If needed, add only a small test/support helper for productization-smoke runtime assertions.
  • Why the existing abstraction was sufficient or insufficient: Existing OperationRun UI semantics are sufficient; existing smoke coverage missed BUG-001/BUG-009 under the audited route and runtime mode.
  • Bounded deviation / spread control: Any new smoke helper must be test/support-local, explicitly opt-in, and must not disable normal local Debugbar/Vite behavior.

OperationRun UX Impact

  • Touches OperationRun start/completion/link UX?: yes, link/render path only.
  • Central contract reused: OperationRunLinks, existing tenantless OperationRun detail viewer, OperationRunResource table conventions.
  • Delegated UX behaviors: Open operation / View run URL resolution stays delegated to existing helpers; no queued toast or terminal notification change.
  • Surface-owned behavior kept local: environment filter application, bounded list rendering, controlled empty/error/loading state, browser runtime assertions.
  • Queued DB-notification policy: N/A.
  • Terminal notification path: N/A.
  • Exception path: none.

Provider Boundary & Portability Fit

  • Shared provider/platform boundary touched?: no.
  • Provider-owned seams: none.
  • Platform-core seams: OperationRun execution truth and Operations monitoring view only.
  • Neutral platform terms / contracts preserved: workspace, managed environment, operation, OperationRun, execution truth.
  • Retained provider-specific semantics and why: none added.
  • Bounded extraction or follow-up path: none.

Constitution Check

  • Inventory-first: N/A, no inventory truth changes.
  • Read/write separation: read-only render/smoke work only; no provider/customer mutations.
  • Graph contract path: no Graph calls; render path must remain DB-only.
  • Deterministic capabilities: existing entitlement/capability paths retained.
  • RBAC-UX: admin plane route, workspace membership, environment entitlement, 404 not-found semantics for non-entitled scopes; UI visibility is not authorization.
  • Workspace isolation: Operations query and summary/filter options must scope by current workspace before rows render.
  • Tenant isolation: tenant-bound runs must be visible only when the actor is entitled to the referenced managed environment.
  • Run observability: no new OperationRun creation/status transition; existing OperationRun truth remains the source.
  • OperationRun start UX: no start UX change; links reuse central helpers.
  • Ops-UX lifecycle: no OperationRun.status / OperationRun.outcome transitions.
  • Ops-UX summary counts: no new keys; default list render must not parse large summary/context payloads unnecessarily.
  • Automation: no queues/jobs are triggered by this spec.
  • Data minimization: debug pages, stack traces, raw context, provider payloads, _debugbar, and source links must not appear in productization-smoke mode.
  • Test governance: Feature + Browser lanes are explicit and bounded.
  • Proportionality: no new persistence, domain abstraction, status family, taxonomy, or cross-domain framework.
  • Filament-native UI: preserve native Filament table/page/resource semantics; no new ad-hoc status styling.
  • UI/Productization coverage: existing UI-016 coverage remains valid unless implementation discovers material route/archetype change.
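
The RBAC-UX, workspace isolation, and tenant isolation items above imply that the environment_id query value is checked against the actor's entitlements before any rows render. A minimal sketch in plain PHP, under stated assumptions: the names resolveEnvironmentFilter and ScopeNotFound are hypothetical, managed environment keys are assumed to be positive integers, and the list of entitled IDs is assumed to be already resolved for the current workspace. The real page would delegate to the existing entitlement/capability paths.

```php
<?php

// Hypothetical exception: the caller would translate it into a 404 response.
final class ScopeNotFound extends RuntimeException
{
}

/**
 * Resolve the optional environment_id query value into a filter.
 *
 * Returns null when no filter was requested and the integer ID when the
 * actor is entitled to it. Throws ScopeNotFound otherwise, so a missing
 * environment and a non-entitled one look the same to the caller.
 *
 * @param list<int> $entitledEnvironmentIds IDs in the current workspace (assumed pre-resolved).
 */
function resolveEnvironmentFilter(?string $raw, array $entitledEnvironmentIds): ?int
{
    if ($raw === null || $raw === '') {
        return null; // No filter: the list stays workspace-scoped only.
    }

    // Accept only plain positive integers; anything else is treated as not found.
    if (preg_match('/^[1-9][0-9]*$/', $raw) !== 1) {
        throw new ScopeNotFound('Environment not found.');
    }

    $id = (int) $raw;

    if (! in_array($id, $entitledEnvironmentIds, true)) {
        throw new ScopeNotFound('Environment not found.');
    }

    return $id;
}
```

Throwing the same exception for malformed, unknown, and non-entitled values keeps the 404 not-found semantics uniform, so the response does not reveal whether an environment exists.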

Test Governance Check

  • Test purpose / classification by changed surface: Feature/Livewire for render/scoping/bounded query proof; Browser for runtime/debug leakage; Unit only if a helper is introduced.
  • Affected validation lanes: fast-feedback/confidence and browser.
  • Why this lane mix is the narrowest sufficient proof: Feature tests catch deterministic server render/scoping/performance issues; Browser test catches JS globals, Vite dev-client, Debugbar/source-link, and visible debug page regressions.
  • Narrowest proving command(s):
    • cd apps/platform && php vendor/bin/pest tests/Feature/Monitoring/Spec391OperationsHubRendersWithEnvironmentFilterTest.php
    • cd apps/platform && php vendor/bin/pest tests/Feature/Monitoring/Spec391OperationRunResourceIndexPerformanceTest.php
    • cd apps/platform && php artisan test --compact tests/Browser/Spec391OperationsHubProductizationSmokeTest.php
    • cd apps/platform && php vendor/bin/pint --test <touched php files>
    • git diff --check
  • Fixture / helper / factory / seed / context cost risks: Use factories and smoke-login helpers; no seeders; no provider setup; no real Graph; no queue mutation.
  • Expensive defaults or shared helper growth introduced?: no; any browser helper must be explicit and local.
  • Heavy-family additions, promotions, or visibility changes: one explicit browser smoke file.
  • Surface-class relief / special coverage rule: special monitoring-state-page / global-context-shell coverage required.
  • Closing validation and reviewer handoff: reviewers should check render timing/query bounds, runtime smoke assertions, and no unrelated semantic changes.
  • Budget / baseline / trend follow-up: document actual render timing and whether a lower-level guard substitutes for CI browser timing.
  • Review-stop questions: Did implementation fix the expensive path, or merely catch/mask it? Did any helper widen browser/default setup? Did any provider/evidence/review/restore semantics change?
  • Escalation path: document-in-feature.
  • Active feature PR close-out entry: Guardrail / Exception / Smoke Coverage.
  • Why no dedicated follow-up spec is needed: This is a direct audit-regression fix with bounded smoke guardrails; broader BUG-009/system branding follow-up remains separate if needed.
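
For the budget follow-up above, render timing can be recorded with a small monotonic-clock helper. This is only a sketch: measureMilliseconds and RENDER_BUDGET_MS are hypothetical names, the 3000 ms value comes from the performance goal in Technical Context, and the usleep call stands in for the real request.

```php
<?php

/**
 * Run a callable once and return its wall-clock duration in milliseconds.
 */
function measureMilliseconds(callable $work): float
{
    $start = hrtime(true); // Monotonic clock, in nanoseconds.
    $work();

    return (hrtime(true) - $start) / 1_000_000;
}

const RENDER_BUDGET_MS = 3000.0; // From the plan's performance goal.

// Stand-in for the request under test. In a feature test this would be the
// authenticated GET against the environment-filtered Operations route.
$elapsed = measureMilliseconds(static function (): void {
    usleep(10_000);
});

printf("render took %.1f ms (budget %.0f ms)\n", $elapsed, RENDER_BUDGET_MS);
```

A wall-clock assertion like this can be flaky on shared CI runners, which is one reason the plan asks whether a lower-level guard (for example a query-count or row-count bound) can stand in for browser timing.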

Project Structure

Documentation (this feature)

specs/391-operations-hub-stability-debug-safe-runtime/
├── spec.md
├── plan.md
├── tasks.md
├── checklists/
│   └── requirements.md
└── artifacts/
    ├── verification.md
    └── screenshots/

Source Code (repository root)

Implementation is expected to remain in existing Laravel app and test paths:

apps/platform/app/Filament/Pages/Monitoring/Operations.php
apps/platform/app/Filament/Resources/OperationRunResource.php
apps/platform/app/Models/OperationRun.php
apps/platform/app/Http/Middleware/SuppressDebugbarForSmokeRequests.php
apps/platform/app/Support/Filament/PanelThemeAsset.php
apps/platform/tests/Feature/Monitoring/
apps/platform/tests/Browser/
apps/platform/tests/Unit/Filament/

Structure Decision: Existing Laravel/Filament app structure under apps/platform; no new base folders and no migrations expected.

Complexity Tracking

| Violation | Why Needed | Simpler Alternative Rejected Because |
|---|---|---|
| N/A | No constitution violation planned | N/A |

Proportionality Review

  • Current operator problem: A common Operations drilldown fails with a timeout, a 500, or a debug page, and productization browser validation is polluted by debug/runtime leakage.
  • Existing structure is insufficient because: Existing route/tests did not catch environment-filtered render-path cost or productization-smoke runtime leakage.
  • Narrowest correct implementation: Stabilize existing query/render path and add focused runtime leak assertions.
  • Ownership cost created: Small targeted test/browser smoke upkeep.
  • Alternative intentionally rejected: Increase timeout, hide route, broad catch-all, broad UI redesign, broad productization infrastructure rewrite.
  • Release truth: Current productization blocker.

Technical Approach

  1. Reproduce or confirm BUG-001 in browser/Playwright or by targeted route render before editing.
  2. Inspect the current render path:
    • Operations::decisionWorkbench()
    • Operations::selectedWorkbenchOperation()
    • Operations::topOperationFromQuery()
    • Operations::summaryCount()
    • Operations::table()
    • OperationRunResource::table()
    • OperationRun accessors/casts used by status/outcome/next-action/scope columns.
  3. Identify the expensive path rather than masking it. Likely investigation areas:
    • dashboardNeedsFollowUp() and current terminal/actionability scopes.
    • topOperationFromQuery() fetching up to 50 full rows and sorting with requiresOperatorReview() / problemClass() in PHP.
    • Table columns invoking actionDecision(), primaryActionUrl(), targetScopeDisplay(), history*Description(), or badge renderers for every visible row.
    • context, failure_summary, and summary_counts JSON casts hydrated by select *.
    • Filter option queries for type/initiator scanning historical rows.
    • Relationship access for tenant/user/related artifacts.
  4. Fix by bounding and scoping:
    • Apply workspace/environment entitlement in base queries.
    • Keep pagination and page-size profile.
    • Use selective eager loading only for relationships actually displayed.
    • Avoid full JSON hydration on index rows where possible.
    • Move heavy proof/diagnostic work to detail or collapsed/support surfaces.
    • Replace PHP sorting over hydrated runs with query-level ordering or a smaller deterministic candidate set when possible.
  5. Add controlled states:
    • No-runs empty state for active scope.
    • Productization-safe non-debug failure assertions.
    • No false health claims.
  6. Add productization-smoke path:
    • Prefer existing smoke-login and SuppressDebugbarForSmokeRequests.
    • Prefer existing PanelThemeAsset / built asset fallback behavior.
    • Fail on the exact BUG-009 signatures in smoke mode only.

Data / Migration Implications

  • No migrations are expected.
  • If an index becomes necessary to meet the render budget, stop and update spec.md and plan.md with the proven query plan, migration safety, rollback/forward notes, and PostgreSQL lane coverage before implementing the migration.

Rollout Considerations

  • No environment variables are expected unless implementation proves a narrow productization-smoke-only flag is needed.
  • No queue, scheduler, storage, or provider credential changes.
  • Normal local Debugbar/Vite developer workflow must remain unchanged outside explicit productization-smoke sessions.
  • Deployment asset strategy remains normal Filament/Vite deployment; if assets are registered or changed, include cd apps/platform && php artisan filament:assets in deploy notes.

Risk Controls

  • Do not change OperationRun lifecycle/status/outcome semantics.
  • Do not add new operation types or summary-count keys.
  • Do not add unscoped cache.
  • Do not call Graph or remote provider clients from render.
  • Do not dispatch provider/restore/export jobs.
  • Do not rewrite completed Operations productization specs.
  • Use the browser as the final source of truth for route status and runtime leakage.

Implementation Phases

Phase 1 - Baseline and focused regression tests

Confirm current failure or relevant logs, then add failing feature/browser tests around environment-filtered render, scoping, bounded rows, and runtime leakage.

Phase 2 - Operations render-path stabilization

Optimize only the existing Operations query/table/workbench path. Preserve user-visible workbench semantics while eliminating unbounded scans, heavy per-row JSON/accessor work, and unrelated relationship traversal.

Ensure empty/error/loading states are clear and that safe OperationRun detail links still work for authorized records.

Phase 4 - Productization-smoke runtime guardrail

Make the browser smoke fail on BUG-009 signatures in productization-smoke mode without breaking normal local development.
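
One way to express this guardrail is a pure function that scans rendered page output for leak signatures and returns the labels it found; the browser smoke can then assert that the result is empty. The exact BUG-009 signatures are not listed in this plan, so the patterns below are assumptions drawn from the leakage classes named earlier (Debugbar assets, the Vite dev client, editor source links, raw stack traces), and findRuntimeLeaks is a hypothetical name.

```php
<?php

/**
 * Return the label of every leak signature found in the rendered HTML.
 *
 * @return list<string>
 */
function findRuntimeLeaks(string $html): array
{
    // Assumed signature set; replace with the exact BUG-009 signatures.
    $signatures = [
        'debugbar'    => '~/_debugbar/|phpdebugbar~i',
        'vite-dev'    => '~/@vite/client~',
        'source-link' => '~(?:vscode|phpstorm|idea)://~i',
        'stack-trace' => '~Stack trace:|#\d+ /\S+\.php\(\d+\)~',
    ];

    $found = [];

    foreach ($signatures as $label => $pattern) {
        if (preg_match($pattern, $html) === 1) {
            $found[] = $label;
        }
    }

    return $found;
}

$clean = '<html><body><h1>Operations</h1></body></html>';
$leaky = '<script src="/_debugbar/assets/javascript"></script>'
    . '<script type="module" src="http://localhost:5173/@vite/client"></script>';

var_dump(findRuntimeLeaks($clean)); // no labels
var_dump(findRuntimeLeaks($leaky)); // "debugbar" and "vite-dev"
```

Keeping the scanner as a test-local helper that is called only from the productization smoke matches the opt-in constraint: it checks output and does not alter Debugbar or Vite behavior in normal local development.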

Phase 5 - Verification and close-out

Run targeted tests, formatting checks, browser smoke, direct route verification, and complete artifacts/verification.md.