Ahmed Darrazi 75979e7995 chore(worker): add structured logging, job events, worker health endpoint and health-check script

2025-12-09 12:22:16 +01:00

Implementation Plan: Backend Architecture Pivot

Feature Branch: 005-backend-arch-pivot
Created: 2025-12-09
Spec: spec.md
Status: Ready for Implementation

Executive Summary

Goal: Migrate from the n8n low-code backend to a code-first TypeScript backend that uses a BullMQ job queue for policy synchronization.

Impact: Removes the external n8n dependency, improves maintainability, enables AI-assisted refactoring, and provides a foundation for future scheduled sync features.

Complexity: HIGH - Requires new infrastructure (Redis, BullMQ), worker process deployment, and careful porting of the data transformation logic.

Technical Context

Current Architecture (n8n-based)

User clicks "Sync Now"
    ↓
Server Action: triggerPolicySync()
    ↓
HTTP POST → n8n Webhook (N8N_SYNC_WEBHOOK_URL)
    ↓
n8n Workflow:
  1. Microsoft Graph Authentication
  2. Fetch Policies (4 endpoints with pagination)
  3. JavaScript Code Node: Deep Flattening Logic
  4. HTTP POST → TenantPilot Ingestion API
    ↓
API Route: /api/policy-settings (validates POLICY_API_SECRET)
    ↓
Drizzle ORM: Insert/Update policy_settings table

Problems:

External dependency (n8n instance required)
Complex transformation logic hidden in n8n Code Node
Hard to test, version control, and refactor
No AI assistance for n8n code
Additional API security layer needed (POLICY_API_SECRET)

Target Architecture (BullMQ-based)

User clicks "Sync Now"
    ↓
Server Action: triggerPolicySync()
    ↓
BullMQ: Add job to Redis queue "intune-sync-queue"
    ↓
Worker Process (TypeScript):
  1. Microsoft Graph Authentication (@azure/identity)
  2. Fetch Policies (4 endpoints with pagination)
  3. TypeScript: Deep Flattening Logic
  4. Drizzle ORM: Direct Insert/Update
    ↓
Database: policy_settings table

Benefits:

No external workflow service to operate (Redis is the only new infrastructure dependency)
All logic in TypeScript (version-controlled, testable)
AI-assisted refactoring possible
Simpler security model (no API bridge)
Foundation for scheduled syncs

Constitution Check (mandatory)

Compliance Verification

Principle	Status	Notes
I. Server-First Architecture	✅ COMPLIANT	Sync is triggered from a Server Action and processed by a background worker; no client-side fetches
II. TypeScript Strict Mode	✅ COMPLIANT	All worker code in TypeScript strict mode, fully typed Graph API responses
III. Drizzle ORM Integration	✅ COMPLIANT	Worker uses Drizzle for all DB operations, no raw SQL
IV. Shadcn UI Components	✅ COMPLIANT	No UI changes (frontend only triggers job, uses existing components)
V. Azure AD Multi-Tenancy	✅ COMPLIANT	Uses existing Azure AD Client Credentials for Graph API access

Risk Assessment

HIGH RISK: Worker deployment as separate process (requires a persistent background service: Docker container, PM2, or Systemd config)

MEDIUM RISK: Graph API rate limiting handling (needs robust retry logic)

LOW RISK: BullMQ integration (well-documented library, standard Redis setup)

Justification

Architecture pivot necessary to:

Remove external n8n dependency (reduces operational complexity)
Enable AI-assisted development (TypeScript vs. n8n visual flows)
Improve testability (unit/integration tests for worker logic)
Prepare for Phase 2 features (scheduled syncs, multi-tenant parallel processing)

Approved: Constitution compliance verified, complexity justified by maintainability gains.

File Tree & Changes

tenantpilot/
├── .env                             # [MODIFIED] Add REDIS_URL, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
├── (Redis provided by deployment)   # No `docker-compose.yml` required; ensure `REDIS_URL` is set by Dokploy
├── package.json                     # [MODIFIED] Add bullmq, ioredis, @azure/identity, tsx dependencies
│
├── lib/
│   ├── env.mjs                      # [MODIFIED] Add REDIS_URL validation, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
│   ├── queue/
│   │   ├── redis.ts                 # [NEW] Redis connection for BullMQ
│   │   └── syncQueue.ts             # [NEW] BullMQ Queue definition for "intune-sync-queue"
│   └── actions/
│       └── policySettings.ts        # [MODIFIED] Replace n8n webhook call with BullMQ job creation
│
├── worker/
│   ├── index.ts                     # [NEW] BullMQ Worker entry point
│   ├── jobs/
│   │   ├── syncPolicies.ts          # [NEW] Main sync orchestration logic
│   │   ├── graphAuth.ts             # [NEW] Azure AD token acquisition
│   │   ├── graphFetch.ts            # [NEW] Microsoft Graph API calls with pagination
│   │   ├── policyParser.ts          # [NEW] Deep flattening & transformation logic
│   │   └── dbUpsert.ts              # [NEW] Drizzle ORM upsert operations
│   └── utils/
│       ├── humanizer.ts             # [NEW] Setting ID humanization
│       └── retry.ts                 # [NEW] Exponential backoff retry logic
│
├── app/api/
│   ├── policy-settings/
│   │   └── route.ts                 # [DELETED] n8n ingestion API no longer needed
│   └── admin/
│       └── tenants/
│           └── route.ts             # [DELETED] n8n polling API no longer needed
│
└── specs/005-backend-arch-pivot/
    ├── spec.md                      # ✅ Complete
    ├── plan.md                      # 📝 This file
    ├── technical-notes.md           # ✅ Complete (implementation reference)
    └── tasks.md                     # 🔜 Generated next

Phase Breakdown

Phase 1: Setup & Infrastructure (T001-T008)

Goal: Prepare environment, install dependencies, and wire the app to the provisioned Redis instance

Tasks:

T001: Confirm REDIS_URL is provided by Dokploy and obtain connection details
T002-T004: Add REDIS_URL to local .env (for development) and to lib/env.mjs runtime validation
T005: Install npm packages: bullmq, ioredis, @azure/identity, tsx
T006-T007: Create Redis connection and BullMQ Queue
T008: Test infrastructure (connect to provided Redis from local/dev environment)

Deliverables:

Connection details for Redis from Dokploy documented
Environment variables validated (local + deploy)
Dependencies in package.json
Queue operational using the provided Redis

Exit Criteria: npm run dev starts without env validation errors and the queue accepts jobs against the provided Redis
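
The REDIS_URL validation from T002-T004 could check the URL scheme as well as the URL syntax. The following is a minimal sketch under that assumption; `parseRedisUrl` and its return shape are hypothetical and not part of the existing codebase.

```typescript
// Hypothetical helper: validates REDIS_URL and extracts the basic connection parts.
export interface RedisTarget {
  host: string;
  port: number;
  tls: boolean; // true when the rediss:// scheme is used
}

export function parseRedisUrl(raw: string): RedisTarget {
  const url = new URL(raw); // throws a TypeError on malformed input
  if (url.protocol !== "redis:" && url.protocol !== "rediss:") {
    throw new Error(`REDIS_URL must use redis:// or rediss://, got ${url.protocol}`);
  }
  return {
    host: url.hostname,
    // Fall back to the default Redis port when the URL omits it.
    port: url.port ? Number(url.port) : 6379,
    tls: url.protocol === "rediss:",
  };
}
```

When the connection is created for BullMQ, note that BullMQ workers expect the ioredis option `maxRetriesPerRequest` to be `null`; the URL string itself can be passed to ioredis unchanged.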

Phase 2: Worker Process Skeleton (T009-T014)

Goal: Set up the worker entry point and basic job processing infrastructure

Tasks:

T009: Create worker/index.ts - BullMQ Worker entry point
T010-T012: Add npm script, event handlers, structured logging
T013: Create sync orchestration skeleton
T014: Test worker startup

Deliverables:

Worker process can be started via npm run worker:start
Jobs flow from queue → worker
Event logging operational

Exit Criteria: Worker logs "Processing job X" when job is added to queue
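
The sync orchestration skeleton from T013 might look roughly like the sketch below. It wires the four steps of the target architecture together through injected functions so the flow can be tested before the real Graph and database code exists. `runSyncJob`, `SyncDeps`, and the field names are illustrative assumptions.

```typescript
// Hypothetical orchestration skeleton; every dependency is injected.
export interface FlattenedSetting {
  settingName: string;
  settingValue: string;
}

export interface SyncDeps {
  getToken(tenantId: string): Promise<string>;
  fetchAllPolicies(token: string): Promise<unknown[]>;
  parsePolicySettings(policy: unknown): FlattenedSetting[];
  upsertPolicySettings(tenantId: string, settings: FlattenedSetting[]): Promise<void>;
  log(entry: Record<string, unknown>): void;
}

export async function runSyncJob(jobId: string, tenantId: string, deps: SyncDeps): Promise<number> {
  const startedAt = Date.now();
  deps.log({ event: "job_start", jobId, tenantId, timestamp: new Date(startedAt).toISOString() });

  const token = await deps.getToken(tenantId);          // 1. authenticate
  const policies = await deps.fetchAllPolicies(token);  // 2. fetch
  const settings: FlattenedSetting[] = [];
  for (const policy of policies) {
    settings.push(...deps.parsePolicySettings(policy)); // 3. flatten
  }
  await deps.upsertPolicySettings(tenantId, settings);  // 4. persist

  deps.log({ event: "job_complete", jobId, duration: Date.now() - startedAt, settingsCount: settings.length });
  return settings.length;
}
```

In worker/index.ts, the BullMQ processor would then only need to call this function with the job ID and the job's tenant ID.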

Phase 3: Microsoft Graph Integration (T015-T023)

Goal: Implement Azure AD authentication and Microsoft Graph API data fetching with pagination

Tasks:

T015-T017: Create worker/jobs/graphAuth.ts - Azure AD token acquisition
T018-T021: Create worker/jobs/graphFetch.ts - Fetch from 4 endpoints with pagination
T022: Create worker/utils/retry.ts - Exponential backoff
T023: Test with real tenant data
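
As an illustration of T022, a generic exponential backoff helper could be sketched as follows. The option names and the `shouldRetry` hook are assumptions. The sketch does not read the `Retry-After` header that Microsoft Graph sends with HTTP 429 responses, which a production version would probably want to honor.

```typescript
// Hypothetical retry helper for worker/utils/retry.ts.
export interface RetryOptions {
  retries: number;     // maximum number of retries after the first attempt
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound for any single delay
}

// Delay doubles with each attempt (0-based) and is capped at maxDelayMs.
export function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
  shouldRetry: (err: unknown) => boolean = () => true,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up when the retry budget is spent or the error is not retryable.
      if (attempt >= opts.retries || !shouldRetry(err)) throw err;
      const delay = backoffDelay(attempt, opts);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A caller could pass a `shouldRetry` that returns true only for throttling or transient server errors, so that authentication failures fail fast.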

Deliverables:

getGraphAccessToken() returns valid token
fetchAllPolicies() returns all policies from 4 endpoints
Pagination handled correctly (follows @odata.nextLink)
Rate limiting handled with retry
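
The pagination deliverable can be illustrated with a small loop that follows `@odata.nextLink` until no further page is advertised. In this sketch the HTTP call is injected as `getPage`, so the function name and signature are assumptions; the real graphFetch.ts would supply a function that issues the authenticated request.

```typescript
// Shape of a Microsoft Graph collection response, reduced to the paging fields.
export interface GraphPage<T> {
  value: T[];
  "@odata.nextLink"?: string;
}

// Hypothetical helper: collects all items across pages.
export async function fetchAllPages<T>(
  firstUrl: string,
  getPage: (url: string) => Promise<GraphPage<T>>,
): Promise<T[]> {
  const items: T[] = [];
  let url: string | undefined = firstUrl;
  while (url) {
    const page: GraphPage<T> = await getPage(url);
    items.push(...page.value);
    // The next link is absent on the last page, which ends the loop.
    url = page["@odata.nextLink"];
  }
  return items;
}
```

Calling this once per endpoint and concatenating the results would give the combined policy list that `fetchAllPolicies()` is expected to return.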

Exit Criteria: Worker successfully fetches >50 policies for test tenant

Phase 4: Data Transformation (T024-T035)

Goal: Port n8n flattening logic to TypeScript

Tasks:

Create worker/jobs/policyParser.ts - Policy type detection & routing
Implement Settings Catalog parser (settings[] → flat key-value)
Implement OMA-URI parser (omaSettings[] → flat key-value)
Create worker/utils/humanizer.ts - Setting ID humanization
Handle empty policies (default placeholder setting)
Test: Parse sample policies, verify output structure

Deliverables:

parsePolicySettings() converts Graph response → FlattenedSetting[]
Humanizer converts technical IDs → readable names
Empty policies get "(No settings configured)" entry

Exit Criteria: 95%+ of sample settings are correctly extracted and formatted
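
To make the expected output shape concrete, here is a deliberately simplified sketch of `parsePolicySettings()`. It is not the n8n logic that this phase ports 1:1; the input shapes are reduced assumptions about Settings Catalog and OMA-URI payloads, and nested child settings are not handled.

```typescript
export interface FlattenedSetting {
  settingName: string;
  settingValue: string;
}

// Simplified, assumed input shapes (real Graph payloads carry more fields).
interface CatalogSetting {
  settingInstance: {
    settingDefinitionId: string;
    simpleSettingValue?: { value: unknown };
    choiceSettingValue?: { value: string };
  };
}
interface OmaSetting {
  displayName?: string;
  omaUri: string;
  value?: unknown;
}
export interface RawPolicy {
  settings?: CatalogSetting[];
  omaSettings?: OmaSetting[];
}

export function parsePolicySettings(policy: RawPolicy): FlattenedSetting[] {
  const out: FlattenedSetting[] = [];
  // Settings Catalog: one entry per setting instance.
  for (const s of policy.settings ?? []) {
    const inst = s.settingInstance;
    const value = inst.simpleSettingValue?.value ?? inst.choiceSettingValue?.value ?? "";
    out.push({ settingName: inst.settingDefinitionId, settingValue: String(value) });
  }
  // OMA-URI: prefer the display name, fall back to the URI itself.
  for (const o of policy.omaSettings ?? []) {
    out.push({ settingName: o.displayName ?? o.omaUri, settingValue: String(o.value ?? "") });
  }
  // Empty policies still produce one placeholder row.
  if (out.length === 0) {
    out.push({ settingName: "(No settings configured)", settingValue: "" });
  }
  return out;
}
```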

Phase 5: Database Persistence (T036-T043)

Goal: Implement Drizzle ORM upsert logic

Tasks:

Create worker/jobs/dbUpsert.ts - Batch upsert with conflict resolution
Use existing policy_settings table schema
Leverage policy_settings_upsert_unique constraint (tenantId + graphPolicyId + settingName)
Update lastSyncedAt on every sync
Test: Run full sync, verify data in DB

Deliverables:

upsertPolicySettings() inserts new & updates existing settings
No duplicate settings created
lastSyncedAt updated correctly

Exit Criteria: Full sync for test tenant completes successfully, data visible in DB
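
One detail worth sketching for the batch upsert: PostgreSQL rejects a single INSERT ... ON CONFLICT DO UPDATE statement that touches the same conflict key twice, so rows may need de-duplicating on (tenantId, graphPolicyId, settingName) before they are batched. The helpers below are hypothetical and independent of Drizzle; the row shape is an assumption based on the constraint columns named above.

```typescript
export interface SettingRow {
  tenantId: string;
  graphPolicyId: string;
  settingName: string;
  settingValue: string;
}

// Keeps one row per upsert key; a later duplicate overwrites the earlier value.
export function dedupeByUpsertKey(rows: SettingRow[]): SettingRow[] {
  const byKey = new Map<string, SettingRow>();
  for (const row of rows) {
    const key = JSON.stringify([row.tenantId, row.graphPolicyId, row.settingName]);
    byKey.set(key, row);
  }
  return Array.from(byKey.values());
}

// Splits rows into batches so that each INSERT statement stays a manageable size.
export function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch would then be passed to the Drizzle insert with conflict handling on the policy_settings_upsert_unique constraint.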

Phase 6: Frontend Integration (T044-T051)

Goal: Replace n8n webhook with BullMQ job creation

Tasks:

Modify lib/actions/policySettings.ts → triggerPolicySync()
Remove n8n webhook call (fetch(env.N8N_SYNC_WEBHOOK_URL))
Replace with BullMQ job creation (syncQueue.add(...))
Return job ID to frontend
Test: Click "Sync Now", verify job created & processed

Deliverables:

"Sync Now" button triggers BullMQ job
User sees immediate feedback (no blocking)
Worker processes job in background

Exit Criteria: End-to-end sync works from UI → Queue → Worker → DB
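
The change to `triggerPolicySync()` could be sketched as below. To keep the example self-contained, the queue is passed in through a small interface that mirrors the `add(name, data, opts)` call of a BullMQ Queue. `enqueuePolicySync`, the result type, and the retry settings are assumptions; the job name "sync-tenant" is taken from the target flow diagram.

```typescript
// Minimal structural view of the queue; a BullMQ Queue instance is expected to satisfy it.
export interface SyncQueueLike {
  add(
    name: string,
    data: { tenantId: string },
    opts?: { attempts?: number; backoff?: { type: "exponential" | "fixed"; delay: number } },
  ): Promise<{ id?: string }>;
}

export type TriggerResult =
  | { success: true; jobId: string }
  | { success: false; error: string };

// Hypothetical core of the Server Action: enqueue and return immediately.
export async function enqueuePolicySync(queue: SyncQueueLike, tenantId: string): Promise<TriggerResult> {
  if (!tenantId) return { success: false, error: "Missing tenantId" };
  try {
    const job = await queue.add(
      "sync-tenant",
      { tenantId },
      { attempts: 3, backoff: { type: "exponential", delay: 5000 } }, // assumed values
    );
    return job.id
      ? { success: true, jobId: job.id }
      : { success: false, error: "Queue returned no job id" };
  } catch (err) {
    // Redis being unreachable surfaces here instead of crashing the action.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a result object rather than throwing lets the UI show a toast for both outcomes without blocking on the sync itself.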

Phase 7: Legacy Cleanup (T052-T056)

Goal: Remove all n8n-related code and configuration

Tasks:

Delete app/api/policy-settings/route.ts (n8n ingestion API)
Delete app/api/admin/tenants/route.ts (n8n polling API)
Remove POLICY_API_SECRET from .env and lib/env.mjs
Remove N8N_SYNC_WEBHOOK_URL from .env and lib/env.mjs
Grep search for remaining references (should be 0)
Update documentation (remove n8n setup instructions)

Deliverables:

No n8n-related files in codebase
No n8n-related env vars
Clean grep search results

Exit Criteria: grep -r "POLICY_API_SECRET\|N8N_SYNC_WEBHOOK_URL" . --exclude-dir=specs --exclude-dir=node_modules --exclude-dir=.git returns no matches

Phase 8: Testing & Validation (T057-T061)

Goal: Comprehensive testing of new architecture

Tasks:

Unit tests for flattening logic
Integration tests for worker jobs
End-to-end test: UI → Queue → Worker → DB
Load test: 100+ policies sync
Error handling test: Graph API failures, Redis unavailable
Memory leak test: Worker runs 1+ hour with 10+ jobs

Deliverables:

Test suite with >80% coverage for worker code
All edge cases verified
Performance benchmarks met (SC-001 to SC-008)

Exit Criteria: All tests pass, no regressions in existing features

Phase 9: Deployment (T062-T066)

Goal: Deploy worker process to production

Tasks:

Confirm REDIS_URL is set in the production environment (provided by Dokploy; no Docker Compose Redis required)
Configure worker as background service (PM2, Systemd, or Docker)
Monitor worker logs for first production sync
Verify sync completes successfully
Document worker deployment process

Deliverables:

Worker running as persistent service
Redis accessible from worker
Production sync successful

Exit Criteria: Production sync works end-to-end, no errors in logs

Key Technical Decisions

1. BullMQ vs. Other Queue Libraries

Decision: Use BullMQ

Rationale:

Modern, actively maintained (vs. Kue, Bull)
TypeScript-first design
Built-in retry, rate limiting, priority queues
Excellent documentation
Redis-based (simpler than RabbitMQ/Kafka)

Alternatives Considered:

Bee-Queue: Lighter but fewer features
Agenda: MongoDB-based (adds extra dependency)
AWS SQS: Vendor lock-in, requires AWS setup

2. Worker Process Architecture

Decision: Single worker process, sequential job processing (concurrency: 1)

Rationale:

Simpler implementation (no race conditions)
Microsoft Graph applies rate limits per tenant, so sequential processing keeps request volume predictable
Database upsert logic easier without concurrency
Can scale later if needed (multiple workers)

Alternatives Considered:

Parallel Processing: Higher complexity, potential conflicts
Lambda/Serverless: Cold starts, harder debugging

3. Token Acquisition Strategy

Decision: Use @azure/identity ClientSecretCredential

Rationale:

Official Microsoft library
Handles token refresh automatically
TypeScript support
Simpler than manual OAuth flow

Alternatives Considered:

Manual fetch(): More code, error-prone
MSAL Node: Overkill for server-side client credentials

4. Flattening Algorithm

Decision: Port n8n logic 1:1 initially, refactor later

Rationale:

Minimize risk (proven logic)
Faster migration (no re-design needed)
Can optimize in Phase 2 after validation

Alternatives Considered:

Re-design from scratch: Higher risk, longer timeline

5. Database Schema Changes

Decision: No schema changes needed

Rationale:

Existing policy_settings table has required fields
UNIQUE constraint already supports upsert logic
lastSyncedAt field exists for tracking

Alternatives Considered:

Add job tracking table: Overkill for MVP (BullMQ handles this)

Data Flow Diagrams

Current Flow (n8n)

sequenceDiagram
    participant User
    participant UI as Next.js UI
    participant SA as Server Action
    participant n8n as n8n Webhook
    participant API as Ingestion API
    participant DB as PostgreSQL

    User->>UI: Click "Sync Now"
    UI->>SA: triggerPolicySync(tenantId)
    SA->>n8n: POST /webhook
    n8n->>n8n: Fetch Graph API
    n8n->>n8n: Transform Data
    n8n->>API: POST /api/policy-settings
    API->>API: Validate API Secret
    API->>DB: Insert/Update
    DB-->>API: Success
    API-->>n8n: 200 OK
    n8n-->>SA: 200 OK
    SA-->>UI: Success
    UI-->>User: Toast "Sync started"

Target Flow (BullMQ)

sequenceDiagram
    participant User
    participant UI as Next.js UI
    participant SA as Server Action
    participant Queue as Redis Queue
    participant Worker as Worker Process
    participant Graph as MS Graph API
    participant DB as PostgreSQL

    User->>UI: Click "Sync Now"
    UI->>SA: triggerPolicySync(tenantId)
    SA->>Queue: Add job "sync-tenant"
    Queue-->>SA: Job ID
    SA-->>UI: Success (immediate)
    UI-->>User: Toast "Sync started"
    
    Note over Worker: Background Processing
    Worker->>Queue: Pick job
    Worker->>Graph: Fetch policies
    Graph-->>Worker: Policy data
    Worker->>Worker: Transform data
    Worker->>DB: Upsert settings
    DB-->>Worker: Success
    Worker->>Queue: Mark job complete

Environment Variables

Changes Required

Add:

REDIS_URL=redis://localhost:6379

Remove:

# Delete these lines:
POLICY_API_SECRET=...
N8N_SYNC_WEBHOOK_URL=...

Updated `lib/env.mjs`

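// Imports inferred from the createEnv/z usage (t3-env + zod); adjust to match the existing file
import { createEnv } from "@t3-oss/env-nextjs";
import { z } from "zod";

export const env = createEnv({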
  server: {
    DATABASE_URL: z.string().url(),
    NEXTAUTH_SECRET: z.string().min(1),
    NEXTAUTH_URL: z.string().url(),
    AZURE_AD_CLIENT_ID: z.string().min(1),
    AZURE_AD_CLIENT_SECRET: z.string().min(1),
    REDIS_URL: z.string().url(), // ADD THIS
    RESEND_API_KEY: z.string().optional(),
    STRIPE_SECRET_KEY: z.string().optional(),
    // ... other Stripe vars
    // REMOVE: POLICY_API_SECRET
    // REMOVE: N8N_SYNC_WEBHOOK_URL
  },
  client: {},
  runtimeEnv: {
    DATABASE_URL: process.env.DATABASE_URL,
    NEXTAUTH_SECRET: process.env.NEXTAUTH_SECRET,
    NEXTAUTH_URL: process.env.NEXTAUTH_URL,
    AZURE_AD_CLIENT_ID: process.env.AZURE_AD_CLIENT_ID,
    AZURE_AD_CLIENT_SECRET: process.env.AZURE_AD_CLIENT_SECRET,
    REDIS_URL: process.env.REDIS_URL, // ADD THIS
    RESEND_API_KEY: process.env.RESEND_API_KEY,
    STRIPE_SECRET_KEY: process.env.STRIPE_SECRET_KEY,
    // ... other vars
  },
});

Testing Strategy

Unit Tests

Target Coverage: 80%+ for worker code

Files to Test:

worker/utils/humanizer.ts - Setting ID transformation
worker/jobs/policyParser.ts - Flattening logic
worker/utils/retry.ts - Backoff algorithm

Example:

describe('humanizeSettingId', () => {
  it('removes vendor prefix', () => {
    expect(humanizeSettingId('device_vendor_msft_policy_config_wifi'))
      .toBe('Wifi');
  });
});
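
For orientation, one implementation that would satisfy the test above is sketched here. The prefix list and the title-casing rule are assumptions; the actual humanizer is meant to follow the n8n logic being ported.

```typescript
// Assumed vendor prefixes to strip from Settings Catalog definition IDs.
const KNOWN_PREFIXES = [
  "device_vendor_msft_policy_config_",
  "user_vendor_msft_policy_config_",
];

// Hypothetical sketch of worker/utils/humanizer.ts.
export function humanizeSettingId(id: string): string {
  let rest = id;
  for (const prefix of KNOWN_PREFIXES) {
    if (rest.startsWith(prefix)) {
      rest = rest.slice(prefix.length);
      break;
    }
  }
  // Turn the remaining snake_case segments into capitalized words.
  return rest
    .split("_")
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}
```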

Integration Tests

Target: Full worker job processing

Scenario:

Mock Microsoft Graph API responses
Add job to queue
Verify worker processes job
Check database for inserted settings

Example:

describe('syncPolicies', () => {
  it('fetches and stores policies', async () => {
    await syncPolicies('test-tenant-123');
    const settings = await db.query.policySettings.findMany({
      where: eq(policySettings.tenantId, 'test-tenant-123'),
    });
    expect(settings.length).toBeGreaterThan(0);
  });
});

End-to-End Test

Scenario:

Start Redis + Worker
Login to UI
Navigate to /search
Click "Sync Now"
Verify:
- Job created in Redis
- Worker picks up job
- Database updated
- UI shows success message

Rollback Plan

If migration fails in production:

Immediate: Revert to previous Docker image (with n8n integration)
Restore env vars: Re-add POLICY_API_SECRET and N8N_SYNC_WEBHOOK_URL
Verify: n8n webhook accessible, sync works
Post-mortem: Document failure reason, plan fixes

Data Safety: No data loss risk (upsert logic preserves existing data)

Performance Targets

Based on Success Criteria (SC-001 to SC-008):

Metric	Target	Measurement
Job Creation	<200ms	Server Action response time
Sync Duration (50 policies)	<30s	Worker job duration
Setting Extraction	>95%	Manual validation with sample data
Worker Stability	1+ hour, 10+ jobs	Memory profiling
Pagination	100%	Test with 100+ policies tenant

Dependencies

npm Packages

{
  "dependencies": {
    "bullmq": "^5.0.0",
    "ioredis": "^5.3.0",
    "@azure/identity": "^4.0.0"
  },
  "devDependencies": {
    "tsx": "^4.0.0"
  }
}

Infrastructure

Redis: 7.x (provided by Dokploy in deployment; Docker or an external service for local development)
Node.js: 20+ (for worker process)

Monitoring & Observability

Worker Logs

Format: Structured JSON logs

Key Events:

Job started: { event: "job_start", jobId, tenantId, timestamp }
Job completed: { event: "job_complete", jobId, duration, settingsCount }
Job failed: { event: "job_failed", jobId, error, stack }

Storage: Write to file or stdout (captured by Docker/PM2)
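
The key events above could be typed and serialized along these lines. This is a sketch: `formatLogLine` and `logEvent` are hypothetical names, and stamping every event with a timestamp (not only job_start) is an assumption.

```typescript
// One variant per key event listed above.
export type WorkerEvent =
  | { event: "job_start"; jobId: string; tenantId: string }
  | { event: "job_complete"; jobId: string; duration: number; settingsCount: number }
  | { event: "job_failed"; jobId: string; error: string; stack?: string };

// Produces a single JSON line; `now` is injectable to keep tests deterministic.
export function formatLogLine(entry: WorkerEvent, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...entry });
}

// Writes to stdout, where Docker or PM2 can capture it.
export function logEvent(entry: WorkerEvent): void {
  console.log(formatLogLine(entry));
}
```

One JSON object per line keeps the output easy to filter by `event` or `jobId` with standard log tooling.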

Health Check Endpoint

Path: /api/worker-health

Response:

{
  "status": "healthy",
  "queue": {
    "waiting": 2,
    "active": 1,
    "completed": 45,
    "failed": 3
  }
}

Use Case: Monitoring dashboard, uptime checks
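
A handler for this endpoint might assemble the response as sketched below. The queue is abstracted behind a small interface modeled on BullMQ's `getJobCounts(...types)` call. `buildHealthPayload` and the "unhealthy" fallback are assumptions, since the plan only shows the healthy response.

```typescript
export interface QueueCounts {
  waiting: number;
  active: number;
  completed: number;
  failed: number;
}

export interface HealthPayload {
  status: "healthy" | "unhealthy";
  queue: QueueCounts | null;
}

// Minimal structural view of the queue used for counting jobs.
export interface CountSource {
  getJobCounts(...types: string[]): Promise<Record<string, number>>;
}

export async function buildHealthPayload(queue: CountSource): Promise<HealthPayload> {
  try {
    const counts = await queue.getJobCounts("waiting", "active", "completed", "failed");
    return {
      status: "healthy",
      queue: {
        waiting: counts.waiting ?? 0,
        active: counts.active ?? 0,
        completed: counts.completed ?? 0,
        failed: counts.failed ?? 0,
      },
    };
  } catch {
    // Assumed behavior: report unhealthy when Redis cannot be queried.
    return { status: "unhealthy", queue: null };
  }
}
```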

Documentation Updates

Files to Update:

README.md - Add worker deployment instructions
DEPLOYMENT.md - Document Redis setup, worker config
specs/002-manual-policy-sync/ - Mark as superseded by 005

New Documentation:

docs/worker-deployment.md - Step-by-step worker setup
docs/troubleshooting.md - Common worker issues & fixes

Open Questions & Risks

Q1: Redis Hosting Strategy

Question: Self-hosted Redis or managed service (e.g., Upstash, Redis Cloud)?

Options:

Docker Compose (simple, dev-friendly)
Upstash (serverless, paid but simple)
Self-hosted on VPS (more control, more ops)

Recommendation: Start with Docker Compose, migrate to managed service if scaling needed

Q2: Worker Deployment Method

Question: How to deploy worker in production?

Options:

PM2 (Node process manager)
Systemd (Linux service)
Docker container (consistent with app)

Recommendation: Docker container (matches Next.js deployment strategy)

Q3: Job Failure Notifications

Question: How to notify admins when sync jobs fail?

Options:

Email via Resend (already integrated)
In-app notification system (Phase 2)
External monitoring (e.g., Sentry)

Recommendation: Start with logs only, add notifications in Phase 2

Success Metrics

Metric	Target	Status
n8n dependency removed	Yes	🔜
All tests passing	100%	🔜
Production sync successful	Yes	🔜
Worker uptime	>99%	🔜
Zero data loss	Yes	🔜

Timeline Estimate

Phase	Duration	Dependencies
0. Pre-Implementation	1h	None
1. Queue Infrastructure	2h	Phase 0
2. Graph Integration	4h	Phase 1
3. Data Transformation	6h	Phase 2
4. Database Persistence	3h	Phase 3
5. Frontend Integration	2h	Phase 4
6. Legacy Cleanup	2h	Phase 5
7. Testing & Validation	4h	Phases 1-6
8. Deployment	3h	Phase 7
Total	~27h	~3-4 days

Next Steps

✅ Generate tasks.md with detailed task breakdown
🔜 Start Phase 0: Install Redis, update env vars
🔜 Implement Phase 1: Queue infrastructure
🔜 Continue through Phase 8: Deployment

Plan Status: ✅ Ready for Task Generation
Approved by: Technical Lead (pending)
Last Updated: 2025-12-09
