PRD — SLA Temporal Workflow Architecture (CE/EE Split)
- Slug: sla-temporal-workflow-architecture
- Date: 2026-02-03
- Status: Draft
Summary
Move SLA time calculation and monitoring from the current pgboss-only polling approach to a dual-backend architecture where Enterprise Edition (EE) uses Temporal workflows for durable, event-driven SLA tracking while Community Edition (CE) continues using pgboss for periodic polling.
Problem
The current SLA implementation uses a pgboss job (sla-timer) that polls every 5 minutes to check all active tickets for threshold crossings. This approach has limitations:
- Polling inefficiency: Checking all tickets every 5 minutes regardless of their actual SLA deadlines
- Resolution granularity: Threshold crossings and breaches are only observed at poll time, so detection can lag by up to 5 minutes
- No per-ticket orchestration: Cannot wake up precisely when a specific ticket's threshold is about to be crossed
- Limited durability: Job failures require re-scanning all tickets rather than resuming per-ticket workflows
- Business hours complexity: Minute-by-minute iteration for business hours calculations is inefficient for long periods
For EE customers with large ticket volumes and strict SLA requirements, a Temporal-based approach provides:
- Per-ticket workflow orchestration with durable timers
- Precise threshold notifications timed to exact business-hours-adjusted deadlines
- Workflow state queries for real-time SLA status
- Automatic retry and recovery
Goals
- EE: Temporal-based SLA workflows - Create a dedicated Temporal workflow per ticket that sleeps until each SLA threshold and wakes to send notifications or escalate
- CE: Keep pgboss polling - Maintain the existing sla-timer job for CE with no changes to functionality
- Unified service interface - Abstract the backend choice behind a common interface so SLA services remain edition-agnostic
- Business hours-aware timers - Both backends must correctly calculate wake times respecting business hours, holidays, and pause states
- Graceful degradation - EE falls back to pgboss if Temporal is unavailable
Non-goals
- Changing the SLA data model or database schema
- Modifying the existing SLA policy configuration UI
- Adding new notification channels beyond what exists
- Real-time WebSocket push for SLA status updates
- Historical SLA analytics or reporting changes
- Migration tooling for existing tickets (new tickets get workflows; existing tickets continue with polling)
Users and Primary Flows
Primary Users
- MSP technicians: See accurate SLA status badges and receive timely threshold notifications
- Board managers: Receive escalation notifications at configured thresholds
- System administrators: Configure SLA policies, business hours, and escalation rules
Primary Flows
Flow 1: Ticket Created (EE with Temporal)
- Ticket is created with SLA policy assigned
- slaService.startSlaForTicket() is called
- Service detects EE edition and starts a SlaTicketWorkflow via Temporal
- Workflow calculates response/resolution deadlines considering business hours
- Workflow sleeps until first threshold (e.g., 50% of response time)
- On wake, sends notification and sleeps until next threshold
- Continues until ticket is resolved or SLA is breached
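The sleep-and-wake loop in Flow 1 can be pictured as a precomputed list of wake times. The sketch below is illustrative only: `thresholdWakeTimes` is a hypothetical helper, the thresholds default to the 50/75/90/100% values listed in the acceptance criteria, and the target is treated as plain wall-clock minutes with no business-hours or pause adjustment.

```typescript
// Hypothetical helper: wall-clock wake time for each SLA threshold.
// Business hours and pauses are deliberately ignored in this sketch.
function thresholdWakeTimes(
  startedAt: Date,
  targetMinutes: number,
  thresholdPercents: number[] = [50, 75, 90, 100],
): { percent: number; wakeAt: Date }[] {
  return thresholdPercents
    .slice()
    .sort((a, b) => a - b)
    .map((percent) => ({
      percent,
      wakeAt: new Date(startedAt.getTime() + ((targetMinutes * percent) / 100) * 60000),
    }));
}

// A 60-minute response target that started at 09:00 UTC.
const wakeTimes = thresholdWakeTimes(new Date("2026-02-03T09:00:00Z"), 60);
```

In the real workflow each wake time would additionally pass through the business-hours calculation before the timer is set.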
Flow 2: Ticket Created (CE with pgboss)
- Ticket is created with SLA policy assigned
- slaService.startSlaForTicket() is called
- Service detects CE edition and records deadlines in ticket record
- Existing sla-timer pgboss job polls every 5 minutes
- Job calculates elapsed percentage and sends notifications for crossed thresholds
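The per-poll check in Flow 2 amounts to computing the percentage of the target consumed and filtering out thresholds that were already notified. The following is a minimal sketch under assumed names: `PolledTicket`, `elapsedPercent`, and `newlyCrossedThresholds` are hypothetical, and elapsed time is measured in wall-clock minutes rather than business minutes.

```typescript
// Hypothetical shape of the per-ticket fields the polling job reads.
interface PolledTicket {
  startedAt: Date;
  targetMinutes: number;
  totalPauseMinutes: number;
  notifiedThresholds: number[]; // thresholds already notified
}

// Percentage of the SLA target consumed at `now`, excluding paused time.
function elapsedPercent(ticket: PolledTicket, now: Date): number {
  const elapsedMinutes =
    (now.getTime() - ticket.startedAt.getTime()) / 60000 - ticket.totalPauseMinutes;
  return Math.max(0, (elapsedMinutes / ticket.targetMinutes) * 100);
}

// Thresholds that are crossed at `now` and have not been notified yet.
function newlyCrossedThresholds(
  ticket: PolledTicket,
  now: Date,
  thresholds: number[] = [50, 75, 90, 100],
): number[] {
  const percent = elapsedPercent(ticket, now);
  return thresholds.filter(
    (t) => percent >= t && ticket.notifiedThresholds.indexOf(t) === -1,
  );
}
```

Because the check only runs at poll time, a threshold crossed just after one poll is reported at the next one, which is the granularity limit described in the Problem section.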
Flow 3: SLA Paused/Resumed
- Ticket status changes to one configured to pause SLA (or awaiting_client)
- slaPauseService.pauseSla() is called
- EE: Signal sent to workflow to pause; workflow records pause start and cancels pending timers
- CE: sla_paused_at timestamp set; polling job skips paused tickets
- On resume:
  - EE: Signal sent; workflow recalculates remaining time and sets new timers
  - CE: sla_total_pause_minutes incremented; polling resumes
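The CE pause bookkeeping in Flow 3 can be sketched as two pure functions over the pause fields. This is an illustration under assumptions: `PauseState`, `pauseSlaClock`, and `resumeSlaClock` are hypothetical names, the two properties mirror the sla_paused_at and sla_total_pause_minutes columns, and rounding partial minutes down is a choice made here, not something the PRD specifies.

```typescript
// Hypothetical in-memory mirror of the ticket's pause columns.
interface PauseState {
  pausedAt: Date | null; // mirrors sla_paused_at
  totalPauseMinutes: number; // mirrors sla_total_pause_minutes
}

// Record the pause start; pausing an already paused clock is a no-op.
function pauseSlaClock(state: PauseState, now: Date): PauseState {
  if (state.pausedAt !== null) return state;
  return { pausedAt: now, totalPauseMinutes: state.totalPauseMinutes };
}

// Add the paused duration to the running total and clear the pause marker.
function resumeSlaClock(state: PauseState, now: Date): PauseState {
  if (state.pausedAt === null) return state;
  const pausedMinutes = Math.floor((now.getTime() - state.pausedAt.getTime()) / 60000);
  return { pausedAt: null, totalPauseMinutes: state.totalPauseMinutes + pausedMinutes };
}
```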
Flow 4: Ticket Resolved
- Ticket is closed/resolved
- slaService.recordResolution() is called
- EE: Signal sent to workflow to complete; workflow records met/breached status and terminates
- CE: SLA fields updated; polling job naturally excludes resolved tickets
UX / UI Notes
No UI changes required. The SLA status badges and notifications work identically regardless of backend. The only user-facing difference is potentially more precise notification timing in EE.
Requirements
Functional Requirements
FR-1: Edition Detection and Routing
- Create SlaBackendFactory that returns appropriate backend based on edition
- Factory checks isEnterprise from server/src/lib/features.ts
- Returns TemporalSlaBackend for EE, PgBossSlaBackend for CE
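A minimal sketch of the routing in FR-1 follows. The class names come from this PRD, but their bodies are stubs, the `kind` discriminator is an assumption added for illustration, and `isEnterprise` is passed in as a parameter here rather than imported from server/src/lib/features.ts so that the example stays self-contained.

```typescript
type SlaBackendKind = "temporal" | "pgboss";

// Stub interface: only a discriminator, so the routing can be observed.
interface SlaBackend {
  readonly kind: SlaBackendKind;
}

class TemporalSlaBackend implements SlaBackend {
  readonly kind = "temporal" as const;
}

class PgBossSlaBackend implements SlaBackend {
  readonly kind = "pgboss" as const;
}

class SlaBackendFactory {
  // Assumption: the edition flag is injected; the PRD reads it from features.ts.
  static create(isEnterprise: boolean): SlaBackend {
    return isEnterprise ? new TemporalSlaBackend() : new PgBossSlaBackend();
  }
}
```

Injecting the flag also makes the factory easy to unit test for both editions without touching build configuration.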
FR-2: Common SLA Backend Interface
- Define ISlaBackend interface with methods:
  - startSlaTracking(ticketId, policyId, targets, schedule): Start tracking for a ticket
  - pauseSla(ticketId, reason): Pause SLA timer
  - resumeSla(ticketId): Resume SLA timer
  - completeSla(ticketId, type: 'response' | 'resolution', met: boolean): Complete response or resolution
  - cancelSla(ticketId): Cancel SLA tracking (ticket deleted or policy removed)
  - getSlaStatus(ticketId): Get current SLA status
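One possible TypeScript rendering of the interface in FR-2 is sketched here. The method names and parameters follow the list above; everything else is an assumption: the Promise return types, the `SlaTargets`, `BusinessHoursSchedule`, and `SlaStatus` placeholder shapes, and the `InMemorySlaBackend` class, which exists only to show the contract being exercised.

```typescript
type SlaCompletionType = "response" | "resolution";

// Placeholder shapes: the PRD does not define these types.
type SlaTargets = { responseMinutes: number; resolutionMinutes: number };
type BusinessHoursSchedule = unknown;
type SlaStatus = {
  ticketId: string;
  paused: boolean;
  responseMet?: boolean;
  resolutionMet?: boolean;
};

interface ISlaBackend {
  startSlaTracking(ticketId: string, policyId: string, targets: SlaTargets, schedule: BusinessHoursSchedule): Promise<void>;
  pauseSla(ticketId: string, reason: string): Promise<void>;
  resumeSla(ticketId: string): Promise<void>;
  completeSla(ticketId: string, type: SlaCompletionType, met: boolean): Promise<void>;
  cancelSla(ticketId: string): Promise<void>;
  getSlaStatus(ticketId: string): Promise<SlaStatus | null>;
}

// Hypothetical in-memory implementation, useful as a test double.
class InMemorySlaBackend implements ISlaBackend {
  private statuses: { [ticketId: string]: SlaStatus } = {};

  async startSlaTracking(ticketId: string, _policyId: string, _targets: SlaTargets, _schedule: BusinessHoursSchedule): Promise<void> {
    // Idempotent: a second start for the same ticket keeps the existing entry.
    if (!this.statuses[ticketId]) this.statuses[ticketId] = { ticketId, paused: false };
  }

  async pauseSla(ticketId: string, _reason: string): Promise<void> {
    const status = this.statuses[ticketId];
    if (status) status.paused = true;
  }

  async resumeSla(ticketId: string): Promise<void> {
    const status = this.statuses[ticketId];
    if (status) status.paused = false;
  }

  async completeSla(ticketId: string, type: SlaCompletionType, met: boolean): Promise<void> {
    const status = this.statuses[ticketId];
    if (!status) return;
    if (type === "response") status.responseMet = met;
    else status.resolutionMet = met;
  }

  async cancelSla(ticketId: string): Promise<void> {
    delete this.statuses[ticketId];
  }

  async getSlaStatus(ticketId: string): Promise<SlaStatus | null> {
    return this.statuses[ticketId] || null;
  }
}
```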
FR-3: Temporal Workflow for EE (SlaTicketWorkflow)
- Create sla-ticket-workflow.ts in ee/temporal-workflows/src/workflows/
- Workflow input: ticket ID, tenant ID, policy targets, business hours schedule
- Workflow maintains state: current phase (response/resolution), pause state, notified thresholds
- Workflow uses Temporal timers (sleep()) to wake at threshold times
- On wake: send notification via activity, check for escalation, sleep until next threshold
- Support signals: pause, resume, completeResponse, completeResolution, cancel
- Support queries: getState (returns current status, remaining time, etc.)
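The state handling in FR-3 can be modelled as a pure reducer over the five signals, independent of the Temporal SDK. The sketch below does not use any Temporal API; it only illustrates the transitions. The `met` payload on the completion signals, the terminal `completed`/`cancelled` phases, and the reset of notified thresholds when the response phase ends are all assumptions made for this example.

```typescript
type SlaPhase = "response" | "resolution" | "completed" | "cancelled";

// Signal names follow FR-3; the `met` payload is an assumption.
type SlaSignal =
  | { type: "pause" }
  | { type: "resume" }
  | { type: "completeResponse"; met: boolean }
  | { type: "completeResolution"; met: boolean }
  | { type: "cancel" };

interface SlaWorkflowState {
  phase: SlaPhase;
  paused: boolean;
  notifiedThresholds: number[];
  responseMet: boolean | null;
  resolutionMet: boolean | null;
}

const initialSlaState: SlaWorkflowState = {
  phase: "response",
  paused: false,
  notifiedThresholds: [],
  responseMet: null,
  resolutionMet: null,
};

// Pure transition function; a getState query would simply return the state.
function applySignal(state: SlaWorkflowState, signal: SlaSignal): SlaWorkflowState {
  // Terminal phases ignore every further signal.
  if (state.phase === "completed" || state.phase === "cancelled") return state;
  switch (signal.type) {
    case "pause":
      return { ...state, paused: true };
    case "resume":
      return { ...state, paused: false };
    case "completeResponse":
      // Only meaningful in the response phase; thresholds restart for resolution.
      if (state.phase !== "response") return state;
      return { ...state, phase: "resolution", responseMet: signal.met, notifiedThresholds: [] };
    case "completeResolution":
      return { ...state, phase: "completed", resolutionMet: signal.met };
    case "cancel":
      return { ...state, phase: "cancelled" };
  }
}
```

Keeping the transitions in a pure function like this would let the workflow logic be unit tested without a Temporal test environment; the signal handlers in the actual workflow would then delegate to it.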
FR-4: Business Hours Timer Calculation
- Create activity calculateNextWakeTime(currentTime, targetTime, schedule, pauseMinutes)
- Activity uses existing businessHoursCalculator logic
- Returns actual wall-clock time to sleep until, accounting for business hours
- Handles edge cases: start outside business hours, holidays, timezone changes
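To illustrate the kind of calculation FR-4 describes, here is a sketch that converts a number of business minutes into a wall-clock instant by jumping from one working window to the next instead of iterating minute by minute. It does not reproduce the existing businessHoursCalculator and does not match the calculateNextWakeTime signature: `SimpleSchedule`, `DayWindow`, and `addBusinessMinutes` are hypothetical, all times are UTC, and each weekday has at most one window, so timezone changes are out of scope.

```typescript
// Hypothetical schedule shape: at most one window per UTC weekday
// (0 = Sunday), expressed in minutes after midnight UTC.
interface DayWindow {
  openMinute: number;
  closeMinute: number;
}

interface SimpleSchedule {
  windows: { [weekday: number]: DayWindow | undefined };
  holidays: string[]; // "YYYY-MM-DD" (UTC) dates with no business hours
}

const MS_PER_MINUTE = 60000;
const MS_PER_DAY = 24 * 60 * MS_PER_MINUTE;

// Returns the instant at which `businessMinutes` of in-hours time have
// elapsed since `from`, consuming whole windows at a time.
function addBusinessMinutes(from: Date, businessMinutes: number, schedule: SimpleSchedule): Date {
  let remaining = businessMinutes;
  // Midnight UTC of the starting day.
  let dayStart = Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate());
  // Bounded search so a schedule without any working window cannot loop forever.
  for (let i = 0; i < 3660; i++, dayStart += MS_PER_DAY) {
    const day = new Date(dayStart);
    const hours = schedule.windows[day.getUTCDay()];
    const isHoliday = schedule.holidays.indexOf(day.toISOString().slice(0, 10)) !== -1;
    if (!hours || isHoliday) continue;
    const open = dayStart + hours.openMinute * MS_PER_MINUTE;
    const close = dayStart + hours.closeMinute * MS_PER_MINUTE;
    // On the first day, start from `from` if it falls after opening time.
    const start = Math.max(open, from.getTime());
    if (start >= close) continue; // today's window is already over
    const available = (close - start) / MS_PER_MINUTE;
    if (remaining <= available) return new Date(start + remaining * MS_PER_MINUTE);
    remaining -= available;
  }
  throw new Error("No business hours found within the search horizon");
}
```

For example, with a Monday to Friday 09:00-17:00 UTC schedule, 120 business minutes starting on Friday 2026-02-06 at 16:00 UTC use the last hour of Friday and the first hour of Monday, giving 2026-02-09 at 10:00 UTC.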
FR-5: Temporal Activities for SLA Operations
- sendSlaNotification: Sends threshold notification (reuses existing notification service)
- checkAndEscalate: Checks escalation thresholds and escalates if needed
- updateSlaStatus: Updates ticket SLA fields in database
- recordSlaAuditLog: Writes to sla_audit_log table
FR-6: pgboss Backend for CE
- Create PgBossSlaBackend implementing ISlaBackend
- startSlaTracking: No-op (deadlines already stored by slaService)
- pauseSla/resumeSla: Delegate to existing slaPauseService
- completeSla: Delegate to existing slaService methods
- Existing sla-timer job continues polling for threshold notifications
FR-7: Integration with Existing SLA Services
- Modify slaService.startSlaForTicket() to call backend's startSlaTracking()
- Modify slaPauseService.pauseSla()/resumeSla() to call backend methods
- Modify slaService.recordFirstResponse()/recordResolution() to signal backend
FR-8: Workflow Lifecycle Management
- Workflow ID format: sla-ticket-{tenantId}-{ticketId}
- On ticket deletion: cancel workflow if running
- On SLA policy change: cancel existing workflow and start new one
- Workflow gracefully handles duplicate starts (idempotent)
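The deterministic workflow ID is what makes duplicate starts detectable. The sketch below shows the ID format from FR-8 together with an in-memory stand-in for the "already running" check; `slaWorkflowId` and `WorkflowRegistry` are hypothetical names, and in a real deployment the uniqueness check would be delegated to Temporal's handling of workflow IDs rather than tracked in process memory.

```typescript
// Builds the workflow ID in the format given in FR-8.
function slaWorkflowId(tenantId: string, ticketId: string): string {
  return `sla-ticket-${tenantId}-${ticketId}`;
}

// Hypothetical in-memory stand-in for the set of running workflows.
class WorkflowRegistry {
  private running: { [workflowId: string]: boolean } = {};

  // Returns true if a workflow was started, false if one was already running.
  startOnce(tenantId: string, ticketId: string): boolean {
    const id = slaWorkflowId(tenantId, ticketId);
    if (this.running[id]) return false;
    this.running[id] = true;
    return true;
  }

  // Cancelling is a no-op when nothing is running for the ticket.
  cancel(tenantId: string, ticketId: string): void {
    delete this.running[slaWorkflowId(tenantId, ticketId)];
  }

  // Policy change: cancel the existing workflow, then start a fresh one.
  restart(tenantId: string, ticketId: string): boolean {
    this.cancel(tenantId, ticketId);
    return this.startOnce(tenantId, ticketId);
  }
}
```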
Non-functional Requirements
NFR-1: Fallback Behavior
- EE with Temporal unavailable falls back to pgboss polling
- Log warning when fallback occurs
- Existing tickets continue with polling; new tickets also use polling until Temporal recovers
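The fallback in NFR-1 can be sketched as a wrapper that tries the Temporal path first and, on failure, logs a warning and uses pgboss. This is an illustration only: `startWithFallback` and the `StartFn` type are hypothetical, the two start functions and the logger are injected so the example has no external dependencies, and treating any thrown error as "Temporal unavailable" is a simplification.

```typescript
// Hypothetical signature for "start SLA tracking for this ticket".
type StartFn = (ticketId: string) => Promise<void>;

// Tries Temporal first; on any error, warns and falls back to pgboss.
// Returns which backend ended up handling the ticket.
async function startWithFallback(
  ticketId: string,
  startTemporal: StartFn,
  startPgBoss: StartFn,
  warn: (message: string) => void,
): Promise<"temporal" | "pgboss"> {
  try {
    await startTemporal(ticketId);
    return "temporal";
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`Temporal unavailable for ticket ${ticketId}, falling back to pgboss: ${reason}`);
    await startPgBoss(ticketId);
    return "pgboss";
  }
}
```

Returning the backend that was actually used gives callers a simple way to record which path a ticket took, which may help when reasoning about tickets created while Temporal was down.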
NFR-2: Workflow Durability
- Workflows survive Temporal worker restarts
- Workflow state reconstructed from Temporal history on replay
Data / API / Integrations
Database
No schema changes. Uses existing:
- tickets table: sla_* fields for tracking
- sla_audit_log: Audit trail
- sla_policies, sla_policy_targets: Policy configuration
- business_hours_schedules, business_hours_entries, holidays: Business hours
Temporal Task Queue
- Queue name: sla-workflows (or reuse alga-jobs with workflow routing)
- Workflows: SlaTicketWorkflow
- Activities: sla-activities (notification, escalation, DB updates)
Integration Points
- @alga-psa/notifications: For sending threshold notifications
- @alga-psa/sla: For business hours calculations
- JobRunnerFactory: For fallback to pgboss when Temporal unavailable
Security / Permissions
No changes. Tenant isolation continues to rely on the existing tenant column and the runWithTenant() context.
Observability
- Temporal UI provides workflow visibility for debugging
- Existing logger calls in activities for audit trail
- Workflow queries allow real-time status inspection
Rollout / Migration
Phase 1: Backend Interface
- Define ISlaBackend interface
- Implement PgBossSlaBackend wrapping existing behavior
- Wire up factory with CE-only support
- Verify no behavioral changes
Phase 2: Temporal Workflow Implementation
- Create SlaTicketWorkflow and activities
- Implement TemporalSlaBackend
- Add EE branch to factory
- Test in EE environment
Phase 3: Integration
- Integrate backend calls into existing SLA services
- Handle pause/resume signals
- Handle ticket resolution signals
- Test full lifecycle
Migration Strategy
- New tickets get workflows (EE) or polling (CE)
- Existing active tickets continue with polling (no migration needed)
- Workflows only started for tickets created after deployment
Open Questions
- Q: Should we migrate existing active tickets to workflows?
  A: No - too complex, let them naturally resolve via polling. Only new tickets get workflows.
- Q: How long should workflows remain after ticket resolution?
  A: Workflow completes immediately on resolution. Temporal history retained per server config.
- Q: Should workflow query replace database reads for SLA status?
  A: No - database remains source of truth. Workflow is for timer orchestration only.
Acceptance Criteria (Definition of Done)
- EE edition starts Temporal workflow for new tickets with SLA policies
- Workflow wakes at correct business-hours-adjusted times for each threshold
- Notifications sent at configured thresholds (50%, 75%, 90%, 100%)
- Escalations triggered when thresholds crossed
- Pause signal stops workflow timers; resume recalculates remaining time
- Resolution signal completes workflow and records met/breached status
- CE edition continues using pgboss polling with no changes
- Fallback to pgboss works when Temporal unavailable in EE
- Unit tests for backend interface and workflow logic
- Integration test for full ticket SLA lifecycle in EE