# PRD — SLA Temporal Workflow Architecture (CE/EE Split)
- Slug: `sla-temporal-workflow-architecture`
- Date: `2026-02-03`
- Status: Draft
## Summary
Move SLA time calculation and monitoring from the current pgboss-only polling approach to a dual-backend architecture where Enterprise Edition (EE) uses Temporal workflows for durable, event-driven SLA tracking while Community Edition (CE) continues using pgboss for periodic polling.
## Problem
The current SLA implementation uses a pgboss job (`sla-timer`) that polls every 5 minutes to check all active tickets for threshold crossings. This approach has limitations:
1. **Polling inefficiency**: Checking all tickets every 5 minutes regardless of their actual SLA deadlines
2. **Resolution granularity**: Can only detect breaches within 5-minute windows
3. **No per-ticket orchestration**: Cannot wake up precisely when a specific ticket's threshold is about to be crossed
4. **Limited durability**: Job failures require re-scanning all tickets rather than resuming per-ticket workflows
5. **Business hours complexity**: Minute-by-minute iteration for business hours calculations is inefficient for long periods
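
To make the granularity limitation concrete, the sketch below computes how long a deadline can go unobserved under fixed-interval polling. It assumes polls fire at exact multiples of the interval; `detectionDelayMinutes` is a hypothetical helper for illustration, not a function in the codebase.

```typescript
// Hypothetical helper (not from the codebase): minutes between an SLA deadline
// and the first poll that can observe it, assuming polls fire at exact
// multiples of the interval, measured from the same origin as the deadline.
function detectionDelayMinutes(deadlineMinute: number, pollIntervalMinutes: number): number {
  const firstPollAtOrAfter =
    Math.ceil(deadlineMinute / pollIntervalMinutes) * pollIntervalMinutes;
  return firstPollAtOrAfter - deadlineMinute;
}

// A deadline at minute 61 is first seen by the poll at minute 65.
const delayForMinute61 = detectionDelayMinutes(61, 5); // 4
// A deadline that coincides with a poll is seen immediately.
const delayForMinute60 = detectionDelayMinutes(60, 5); // 0
```
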
For EE customers with large ticket volumes and strict SLA requirements, a Temporal-based approach provides:
- Per-ticket workflow orchestration with durable timers
- Precise threshold notifications timed to exact business-hours-adjusted deadlines
- Workflow state queries for real-time SLA status
- Automatic retry and recovery
## Goals
1. **EE: Temporal-based SLA workflows** — Create a dedicated Temporal workflow per ticket that sleeps until each SLA threshold and wakes to send notifications or escalate
2. **CE: Keep pgboss polling** — Maintain the existing `sla-timer` job for CE edition with no changes to functionality
3. **Unified service interface** — Abstract the backend choice behind a common interface so SLA services remain edition-agnostic
4. **Business hours-aware timers** — Both backends must correctly calculate wake times respecting business hours, holidays, and pause states
5. **Graceful degradation** — EE falls back to pgboss if Temporal is unavailable
## Non-goals
- Changing the SLA data model or database schema
- Modifying the existing SLA policy configuration UI
- Adding new notification channels beyond what exists
- Real-time WebSocket push for SLA status updates
- Historical SLA analytics or reporting changes
- Migration tooling for existing tickets (new tickets get workflows; existing tickets continue with polling)
## Users and Primary Flows
### Primary Users
- **MSP technicians**: See accurate SLA status badges and receive timely threshold notifications
- **Board managers**: Receive escalation notifications at configured thresholds
- **System administrators**: Configure SLA policies, business hours, and escalation rules
### Primary Flows
#### Flow 1: Ticket Created (EE with Temporal)
1. Ticket is created with SLA policy assigned
2. `slaService.startSlaForTicket()` is called
3. Service detects EE edition and starts a `SlaTicketWorkflow` via Temporal
4. Workflow calculates response/resolution deadlines considering business hours
5. Workflow sleeps until first threshold (e.g., 50% of response time)
6. On wake, sends notification and sleeps until next threshold
7. Continues until ticket is resolved or SLA is breached
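
The sleep-and-wake loop in steps 5 to 7 amounts to walking an ordered list of threshold offsets. The sketch below shows that bookkeeping in plain TypeScript, without the Temporal SDK. The type and function names are hypothetical, and offsets are expressed in business minutes, so converting them to wall-clock times is left to the business-hours calculation.

```typescript
// Hypothetical shapes for illustration; these names are not from the codebase.
interface ThresholdWake {
  percent: number;       // e.g. 50 means 50% of the SLA target
  offsetMinutes: number; // business minutes after SLA start
}

// Convert percentage thresholds into business-minute offsets, ascending.
function thresholdOffsets(targetMinutes: number, percents: number[]): ThresholdWake[] {
  return [...percents]
    .sort((a, b) => a - b)
    .map((percent) => ({
      percent,
      offsetMinutes: Math.round((targetMinutes * percent) / 100),
    }));
}

// Pick the next threshold to sleep until, or null when every threshold
// at or before the elapsed time has already been reached.
function nextWake(wakes: ThresholdWake[], elapsedMinutes: number): ThresholdWake | null {
  return wakes.find((w) => w.offsetMinutes > elapsedMinutes) ?? null;
}

// A 4-hour response target with four thresholds.
const responseWakes = thresholdOffsets(240, [50, 75, 90, 100]);
// offsets in minutes: 120, 180, 216, 240
const afterFirstNotification = nextWake(responseWakes, 120);
// the 75% threshold, 180 minutes after start
```
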
#### Flow 2: Ticket Created (CE with pgboss)
1. Ticket is created with SLA policy assigned
2. `slaService.startSlaForTicket()` is called
3. Service detects CE edition and records deadlines in ticket record
4. Existing `sla-timer` pgboss job polls every 5 minutes
5. Job calculates elapsed percentage and sends notifications for crossed thresholds
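
Step 5 of the polling flow can be sketched as follows. This is a simplified illustration, not the existing `sla-timer` implementation: it measures elapsed time in wall-clock minutes minus accumulated pause time and ignores business hours. `PolledTicket` and both functions are hypothetical names.

```typescript
// Hypothetical, simplified view of the ticket fields a polling pass needs.
interface PolledTicket {
  ticketId: string;
  slaStartedAtMs: number;     // epoch milliseconds when SLA tracking began
  targetMinutes: number;      // SLA target for the current phase
  totalPauseMinutes: number;  // accumulated pause time
  pausedAtMs: number | null;  // non-null while the SLA is paused
  notifiedPercents: number[]; // thresholds already notified
}

// Percentage of the SLA target consumed so far (business hours ignored here).
function elapsedPercent(ticket: PolledTicket, nowMs: number): number {
  const wallMinutes = (nowMs - ticket.slaStartedAtMs) / 60000;
  const activeMinutes = Math.max(0, wallMinutes - ticket.totalPauseMinutes);
  return (activeMinutes / ticket.targetMinutes) * 100;
}

// Thresholds crossed since the last poll that have not been notified yet.
// Paused tickets are skipped entirely.
function thresholdsToNotify(ticket: PolledTicket, nowMs: number, thresholds: number[]): number[] {
  if (ticket.pausedAtMs !== null) return [];
  const percent = elapsedPercent(ticket, nowMs);
  return thresholds.filter(
    (t) => percent >= t && !ticket.notifiedPercents.includes(t),
  );
}
```

Because a single poll can observe several newly crossed thresholds at once, the function returns a list instead of a single value.
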
#### Flow 3: SLA Paused/Resumed
1. Ticket status changes to one configured to pause SLA (or to `awaiting_client`)
2. `slaPauseService.pauseSla()` is called
3. **EE**: Signal sent to workflow to pause; workflow records pause start and cancels pending timers
4. **CE**: `sla_paused_at` timestamp set; polling job skips paused tickets
5. On resume:
   - **EE**: Signal sent; workflow recalculates remaining time and sets new timers
   - **CE**: `sla_total_pause_minutes` incremented; polling resumes
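
The CE pause bookkeeping in this flow can be sketched as two pure functions. The property names mirror the `sla_paused_at` and `sla_total_pause_minutes` columns mentioned above; the interface, the function names, and the choice to round pause time down to whole minutes are assumptions made for illustration.

```typescript
// Property names mirror the sla_paused_at / sla_total_pause_minutes columns;
// the interface and functions themselves are hypothetical.
interface SlaPauseFields {
  slaPausedAt: Date | null;
  slaTotalPauseMinutes: number;
}

// Record the pause start. A second pause while already paused is a no-op,
// so the original pause start is preserved.
function pauseFields(fields: SlaPauseFields, now: Date): SlaPauseFields {
  if (fields.slaPausedAt !== null) return fields;
  return { ...fields, slaPausedAt: now };
}

// Clear the pause start and add the paused duration to the running total.
function resumeFields(fields: SlaPauseFields, now: Date): SlaPauseFields {
  if (fields.slaPausedAt === null) return fields;
  const pausedMinutes = Math.floor(
    (now.getTime() - fields.slaPausedAt.getTime()) / 60000,
  );
  return {
    slaPausedAt: null,
    slaTotalPauseMinutes: fields.slaTotalPauseMinutes + Math.max(0, pausedMinutes),
  };
}
```
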
#### Flow 4: Ticket Resolved
1. Ticket is closed/resolved
2. `slaService.recordResolution()` is called
3. **EE**: Signal sent to workflow to complete; workflow records met/breached status and terminates
4. **CE**: SLA fields updated; polling job naturally excludes resolved tickets
## UX / UI Notes
No UI changes required. The SLA status badges and notifications work identically regardless of backend. The only user-facing difference is potentially more precise notification timing in EE.
## Requirements
### Functional Requirements
#### FR-1: Edition Detection and Routing
- [ ] Create `SlaBackendFactory` that returns appropriate backend based on edition
- [ ] Factory checks `isEnterprise` from `server/src/lib/features.ts`
- [ ] Returns `TemporalSlaBackend` for EE, `PgBossSlaBackend` for CE
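
A minimal sketch of the factory follows. To stay self-contained, it takes `isEnterprise` as a parameter instead of importing it from `server/src/lib/features.ts`, and it uses stub classes in place of the real backends. The `kind` property and the `*Stub` class names are hypothetical.

```typescript
// Stub backends for illustration only; the real classes would implement
// the full backend interface.
interface SlaBackendLike {
  readonly kind: 'temporal' | 'pgboss';
}

class TemporalSlaBackendStub implements SlaBackendLike {
  readonly kind = 'temporal' as const;
}

class PgBossSlaBackendStub implements SlaBackendLike {
  readonly kind = 'pgboss' as const;
}

// Hypothetical factory: the edition flag is passed in so the sketch has no
// dependency on the application's feature module.
class SlaBackendFactory {
  static create(isEnterprise: boolean): SlaBackendLike {
    return isEnterprise ? new TemporalSlaBackendStub() : new PgBossSlaBackendStub();
  }
}

const eeBackend = SlaBackendFactory.create(true);  // kind: 'temporal'
const ceBackend = SlaBackendFactory.create(false); // kind: 'pgboss'
```
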
#### FR-2: Common SLA Backend Interface
- [ ] Define `ISlaBackend` interface with methods:
  - `startSlaTracking(ticketId, policyId, targets, schedule)`: Start tracking for a ticket
  - `pauseSla(ticketId, reason)`: Pause SLA timer
  - `resumeSla(ticketId)`: Resume SLA timer
  - `completeSla(ticketId, type: 'response' | 'resolution', met: boolean)`: Complete response or resolution
  - `cancelSla(ticketId)`: Cancel SLA tracking (ticket deleted or policy removed)
  - `getSlaStatus(ticketId)`: Get current SLA status
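
One possible TypeScript rendering of this interface is sketched below. The method names and the `completeSla` parameters come from the list above. Everything else is an assumption: the parameter and return types (`SlaTargets`, `BusinessHoursScheduleRef`, `SlaStatusSnapshot`), the use of `Promise` return values, and the `RecordingSlaBackend` fake, which is included only to show how callers could be unit tested against the interface.

```typescript
// Hypothetical supporting types; the PRD does not define their shapes.
type SlaCompletionType = 'response' | 'resolution';
interface SlaTargets { responseMinutes: number; resolutionMinutes: number; }
interface BusinessHoursScheduleRef { scheduleId: string; }
interface SlaStatusSnapshot { ticketId: string; paused: boolean; }

// Method names follow FR-2; async return types are an assumption.
interface ISlaBackend {
  startSlaTracking(
    ticketId: string,
    policyId: string,
    targets: SlaTargets,
    schedule: BusinessHoursScheduleRef,
  ): Promise<void>;
  pauseSla(ticketId: string, reason: string): Promise<void>;
  resumeSla(ticketId: string): Promise<void>;
  completeSla(ticketId: string, type: SlaCompletionType, met: boolean): Promise<void>;
  cancelSla(ticketId: string): Promise<void>;
  getSlaStatus(ticketId: string): Promise<SlaStatusSnapshot | null>;
}

// Hypothetical test double that records calls in order.
class RecordingSlaBackend implements ISlaBackend {
  readonly calls: string[] = [];
  async startSlaTracking(ticketId: string): Promise<void> {
    this.calls.push(`start:${ticketId}`);
  }
  async pauseSla(ticketId: string, reason: string): Promise<void> {
    this.calls.push(`pause:${ticketId}:${reason}`);
  }
  async resumeSla(ticketId: string): Promise<void> {
    this.calls.push(`resume:${ticketId}`);
  }
  async completeSla(ticketId: string, type: SlaCompletionType, met: boolean): Promise<void> {
    this.calls.push(`complete:${ticketId}:${type}:${met}`);
  }
  async cancelSla(ticketId: string): Promise<void> {
    this.calls.push(`cancel:${ticketId}`);
  }
  async getSlaStatus(): Promise<SlaStatusSnapshot | null> {
    return null; // the fake tracks calls only, not status
  }
}
```
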
#### FR-3: Temporal Workflow for EE (SlaTicketWorkflow)
- [ ] Create `sla-ticket-workflow.ts` in `ee/temporal-workflows/src/workflows/`
- [ ] Workflow input: ticket ID, tenant ID, policy targets, business hours schedule
- [ ] Workflow maintains state: current phase (response/resolution), pause state, notified thresholds
- [ ] Workflow uses Temporal timers (`sleep()`) to wake at threshold times
- [ ] On wake: send notification via activity, check for escalation, sleep until next threshold
- [ ] Support signals: `pause`, `resume`, `completeResponse`, `completeResolution`, `cancel`
- [ ] Support queries: `getState` (returns current status, remaining time, etc.)
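
The state the workflow maintains and its reaction to the five signals can be modelled as a pure reducer, sketched below without the Temporal SDK. In an actual workflow each case would presumably live in a signal handler and `getState` would return the current state object. The type names and the `lifecycle` field are hypothetical, as is the decision to reset `notifiedThresholds` when the response phase completes.

```typescript
type SlaPhase = 'response' | 'resolution';
type SlaLifecycle = 'running' | 'paused' | 'completed' | 'cancelled';
type SlaSignal = 'pause' | 'resume' | 'completeResponse' | 'completeResolution' | 'cancel';

// Hypothetical workflow state, following the fields listed in FR-3.
interface SlaWorkflowState {
  phase: SlaPhase;
  lifecycle: SlaLifecycle;
  notifiedThresholds: number[];
}

function initialSlaState(): SlaWorkflowState {
  return { phase: 'response', lifecycle: 'running', notifiedThresholds: [] };
}

// Apply one signal and return the next state. Signals that arrive after the
// workflow has completed or been cancelled are ignored.
function applySignal(state: SlaWorkflowState, signal: SlaSignal): SlaWorkflowState {
  if (state.lifecycle === 'completed' || state.lifecycle === 'cancelled') {
    return state;
  }
  switch (signal) {
    case 'pause':
      return { ...state, lifecycle: 'paused' };
    case 'resume':
      return { ...state, lifecycle: 'running' };
    case 'completeResponse':
      // Assumption: thresholds are tracked per phase, so they reset here.
      return state.phase === 'response'
        ? { ...state, phase: 'resolution', notifiedThresholds: [] }
        : state;
    case 'completeResolution':
      return { ...state, lifecycle: 'completed' };
    case 'cancel':
      return { ...state, lifecycle: 'cancelled' };
  }
}
```

Keeping the transitions in a pure function makes them testable without a Temporal test environment and keeps the workflow body deterministic.
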
#### FR-4: Business Hours Timer Calculation
- [ ] Create activity `calculateNextWakeTime(currentTime, targetTime, schedule, pauseMinutes)`
- [ ] Activity uses existing `businessHoursCalculator` logic
- [ ] Returns actual wall-clock time to sleep until, accounting for business hours
- [ ] Handles edge cases: start outside business hours, holidays, timezone changes
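
The core of such a calculation is converting a number of remaining business minutes into a wall-clock time. The sketch below does this by jumping from one business window to the next instead of iterating minute by minute. It is a deliberate simplification and not the existing `businessHoursCalculator`: it assumes one identical window per working day, works entirely in UTC, and does not handle pauses. `SimpleSchedule` and `addBusinessMinutes` are hypothetical names.

```typescript
// Hypothetical, simplified schedule: one identical window per working day, in UTC.
interface SimpleSchedule {
  openMinute: number;    // minutes after 00:00 UTC, e.g. 540 for 09:00
  closeMinute: number;   // e.g. 1020 for 17:00
  workingDays: number[]; // 0 = Sunday ... 6 = Saturday
  holidays: string[];    // 'YYYY-MM-DD' dates in UTC
}

const MS_PER_MINUTE = 60000;
const MS_PER_DAY = 86400000;

function isWorkingDay(dayStartMs: number, schedule: SimpleSchedule): boolean {
  const day = new Date(dayStartMs);
  const isoDate = day.toISOString().slice(0, 10);
  return schedule.workingDays.includes(day.getUTCDay()) && !schedule.holidays.includes(isoDate);
}

// Return the wall-clock time at which `businessMinutes` of business time
// have elapsed after `start`, consuming one day's window per iteration.
function addBusinessMinutes(start: Date, businessMinutes: number, schedule: SimpleSchedule): Date {
  let remaining = businessMinutes;
  let cursorMs = start.getTime();
  // Bounded search so a schedule with no working time cannot loop forever.
  for (let days = 0; days < 3660; days++) {
    const dayStartMs = Math.floor(cursorMs / MS_PER_DAY) * MS_PER_DAY;
    const openMs = dayStartMs + schedule.openMinute * MS_PER_MINUTE;
    const closeMs = dayStartMs + schedule.closeMinute * MS_PER_MINUTE;
    if (isWorkingDay(dayStartMs, schedule) && cursorMs < closeMs) {
      const fromMs = Math.max(cursorMs, openMs); // start outside hours snaps to opening
      const availableMinutes = (closeMs - fromMs) / MS_PER_MINUTE;
      if (remaining <= availableMinutes) {
        return new Date(fromMs + remaining * MS_PER_MINUTE);
      }
      remaining -= availableMinutes;
    }
    cursorMs = dayStartMs + MS_PER_DAY; // jump to the next calendar day
  }
  throw new Error('no business hours found within the search horizon');
}

// Friday 16:00 UTC plus 120 business minutes, Monday to Friday 09:00-17:00:
// 60 minutes are used on Friday, the remaining 60 end on Monday at 10:00 UTC.
const weekdaySchedule: SimpleSchedule = {
  openMinute: 540, closeMinute: 1020, workingDays: [1, 2, 3, 4, 5], holidays: [],
};
const wakeAt = addBusinessMinutes(new Date(Date.UTC(2026, 1, 6, 16, 0)), 120, weekdaySchedule);
// wakeAt.toISOString() === '2026-02-09T10:00:00.000Z'
```
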
#### FR-5: Temporal Activities for SLA Operations
- [ ] `sendSlaNotification`: Sends threshold notification (reuses existing notification service)
- [ ] `checkAndEscalate`: Checks escalation thresholds and escalates if needed
- [ ] `updateSlaStatus`: Updates ticket SLA fields in database
- [ ] `recordSlaAuditLog`: Writes to sla_audit_log table
#### FR-6: pgboss Backend for CE
- [ ] Create `PgBossSlaBackend` implementing `ISlaBackend`
- [ ] `startSlaTracking`: No-op (deadlines already stored by slaService)
- [ ] `pauseSla`/`resumeSla`: Delegate to existing `slaPauseService`
- [ ] `completeSla`: Delegate to existing `slaService` methods
- [ ] Existing `sla-timer` job continues polling for threshold notifications
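
The delegation described above might look like the following sketch. The service interfaces are hypothetical minimal shapes, and the real `slaPauseService` and `slaService` signatures may differ. The sketch also assumes the delegated methods are the underlying database-updating operations, so they do not call back into the backend.

```typescript
// Hypothetical minimal shapes of the existing services.
interface PauseServiceLike {
  pauseSla(ticketId: string, reason: string): Promise<void>;
  resumeSla(ticketId: string): Promise<void>;
}
interface SlaServiceLike {
  recordFirstResponse(ticketId: string, met: boolean): Promise<void>;
  recordResolution(ticketId: string, met: boolean): Promise<void>;
}

// Sketch of a CE backend that only forwards to the injected services.
class PgBossSlaBackendSketch {
  private readonly pauseService: PauseServiceLike;
  private readonly slaService: SlaServiceLike;

  constructor(pauseService: PauseServiceLike, slaService: SlaServiceLike) {
    this.pauseService = pauseService;
    this.slaService = slaService;
  }

  // No-op: deadlines are already stored when the SLA starts, and the
  // sla-timer job picks the ticket up on its next poll.
  async startSlaTracking(): Promise<void> {}

  async pauseSla(ticketId: string, reason: string): Promise<void> {
    await this.pauseService.pauseSla(ticketId, reason);
  }

  async resumeSla(ticketId: string): Promise<void> {
    await this.pauseService.resumeSla(ticketId);
  }

  async completeSla(ticketId: string, type: 'response' | 'resolution', met: boolean): Promise<void> {
    if (type === 'response') {
      await this.slaService.recordFirstResponse(ticketId, met);
    } else {
      await this.slaService.recordResolution(ticketId, met);
    }
  }
}
```
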
#### FR-7: Integration with Existing SLA Services
- [ ] Modify `slaService.startSlaForTicket()` to call backend's `startSlaTracking()`
- [ ] Modify `slaPauseService.pauseSla()`/`resumeSla()` to call backend methods
- [ ] Modify `slaService.recordFirstResponse()`/`recordResolution()` to signal backend
#### FR-8: Workflow Lifecycle Management
- [ ] Workflow ID format: `sla-ticket-{tenantId}-{ticketId}`
- [ ] On ticket deletion: cancel workflow if running
- [ ] On SLA policy change: cancel existing workflow and start new one
- [ ] Workflow gracefully handles duplicate starts (idempotent)
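
The ID format and the idempotent-start requirement can be illustrated with an in-memory stand-in for the workflow service. `slaWorkflowId` follows the format above; `InMemoryWorkflowRegistry` is a hypothetical test double, not a Temporal API. With a real Temporal client, the deterministic workflow ID is what would let a duplicate start be detected.

```typescript
// Build the workflow ID in the format given by FR-8.
function slaWorkflowId(tenantId: string, ticketId: string): string {
  return `sla-ticket-${tenantId}-${ticketId}`;
}

// Hypothetical in-memory stand-in for the set of running workflows.
class InMemoryWorkflowRegistry {
  private readonly running = new Set<string>();

  // Starting an ID that is already running is a no-op, not an error.
  startIfAbsent(workflowId: string): 'started' | 'already-running' {
    if (this.running.has(workflowId)) return 'already-running';
    this.running.add(workflowId);
    return 'started';
  }

  // Returns true when a running workflow was cancelled.
  cancel(workflowId: string): boolean {
    return this.running.delete(workflowId);
  }
}

// Policy change: cancel the existing workflow, then start a new one
// under the same ID.
function restartForPolicyChange(registry: InMemoryWorkflowRegistry, workflowId: string): void {
  registry.cancel(workflowId);
  registry.startIfAbsent(workflowId);
}
```
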
### Non-functional Requirements
#### NFR-1: Fallback Behavior
- [ ] EE with Temporal unavailable falls back to pgboss polling
- [ ] Log warning when fallback occurs
- [ ] Existing tickets continue with polling; new tickets also use polling until Temporal recovers
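
The fallback decision reduces to a small selection function, sketched below. How Temporal availability is detected is not specified here, so the sketch takes it as a boolean input. `selectSlaBackend` and the `warn` callback are hypothetical names.

```typescript
type SlaBackendKind = 'temporal' | 'pgboss';

// Choose the backend for a new ticket. The availability flag and the warn
// callback are injected so the decision can be tested in isolation.
function selectSlaBackend(
  isEnterprise: boolean,
  temporalAvailable: boolean,
  warn: (message: string) => void,
): SlaBackendKind {
  if (!isEnterprise) return 'pgboss';
  if (temporalAvailable) return 'temporal';
  warn('Temporal unavailable; falling back to pgboss polling for SLA tracking');
  return 'pgboss';
}

const warnings: string[] = [];
const degraded = selectSlaBackend(true, false, (m) => warnings.push(m));
// degraded === 'pgboss', and one warning has been recorded
```
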
#### NFR-2: Workflow Durability
- [ ] Workflows survive Temporal worker restarts
- [ ] Workflow state reconstructed from Temporal history on replay
## Data / API / Integrations
### Database
No schema changes. Uses existing:
- `tickets` table: `sla_*` fields for tracking
- `sla_audit_log`: Audit trail
- `sla_policies`, `sla_policy_targets`: Policy configuration
- `business_hours_schedules`, `business_hours_entries`, `holidays`: Business hours
### Temporal Task Queue
- Queue name: `sla-workflows` (or reuse `alga-jobs` with workflow routing)
- Workflows: `SlaTicketWorkflow`
- Activities: `sla-activities` (notification, escalation, DB updates)
### Integration Points
- `@alga-psa/notifications`: For sending threshold notifications
- `@alga-psa/sla`: For business hours calculations
- `JobRunnerFactory`: For fallback to pgboss when Temporal unavailable
## Security / Permissions
No changes. Existing tenant isolation via `tenant` column and `runWithTenant()` context.
## Observability
- Temporal UI provides workflow visibility for debugging
- Existing logger calls in activities for audit trail
- Workflow queries allow real-time status inspection
## Rollout / Migration
### Phase 1: Backend Interface
1. Define `ISlaBackend` interface
2. Implement `PgBossSlaBackend` wrapping existing behavior
3. Wire up factory with CE-only support
4. Verify no behavioral changes
### Phase 2: Temporal Workflow Implementation
1. Create `SlaTicketWorkflow` and activities
2. Implement `TemporalSlaBackend`
3. Add EE branch to factory
4. Test in EE environment
### Phase 3: Integration
1. Integrate backend calls into existing SLA services
2. Handle pause/resume signals
3. Handle ticket resolution signals
4. Test full lifecycle
### Migration Strategy
- New tickets get workflows (EE) or polling (CE)
- Existing active tickets continue with polling (no migration needed)
- Workflows are started only for tickets created after deployment
## Open Questions
1. **Q: Should we migrate existing active tickets to workflows?**
A: No - too complex, let them naturally resolve via polling. Only new tickets get workflows.
2. **Q: How long should workflows remain after ticket resolution?**
A: Workflow completes immediately on resolution. Temporal history retained per server config.
3. **Q: Should workflow query replace database reads for SLA status?**
A: No - database remains source of truth. Workflow is for timer orchestration only.
## Acceptance Criteria (Definition of Done)
1. [ ] EE starts a Temporal workflow for each new ticket with an SLA policy
2. [ ] Workflow wakes at correct business-hours-adjusted times for each threshold
3. [ ] Notifications sent at configured thresholds (50%, 75%, 90%, 100%)
4. [ ] Escalations triggered when thresholds are crossed
5. [ ] Pause signal stops workflow timers; resume recalculates remaining time
6. [ ] Resolution signal completes workflow and records met/breached status
7. [ ] CE continues using pgboss polling with no changes
8. [ ] Fallback to pgboss works when Temporal is unavailable in EE
9. [ ] Unit tests for backend interface and workflow logic
10. [ ] Integration test for full ticket SLA lifecycle in EE