Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
231 lines
11 KiB
Markdown
231 lines
11 KiB
Markdown
# PRD — SLA Temporal Workflow Architecture (CE/EE Split)
|
|
|
|
- Slug: `sla-temporal-workflow-architecture`
|
|
- Date: `2026-02-03`
|
|
- Status: Draft
|
|
|
|
## Summary
|
|
|
|
Move SLA time calculation and monitoring from the current pgboss-only polling approach to a dual-backend architecture where Enterprise Edition (EE) uses Temporal workflows for durable, event-driven SLA tracking while Community Edition (CE) continues using pgboss for periodic polling.
|
|
|
|
## Problem
|
|
|
|
The current SLA implementation uses a pgboss job (`sla-timer`) that polls every 5 minutes to check all active tickets for threshold crossings. This approach has limitations:
|
|
|
|
1. **Polling inefficiency**: Checking all tickets every 5 minutes regardless of their actual SLA deadlines
|
|
2. **Resolution granularity**: Can only detect breaches within 5-minute windows
|
|
3. **No per-ticket orchestration**: Cannot wake up precisely when a specific ticket's threshold is about to be crossed
|
|
4. **Limited durability**: Job failures require re-scanning all tickets rather than resuming per-ticket workflows
|
|
5. **Business hours complexity**: Minute-by-minute iteration for business hours calculations is inefficient for long periods
|
|
|
|
For EE customers with large ticket volumes and strict SLA requirements, a Temporal-based approach provides:
|
|
- Per-ticket workflow orchestration with durable timers
|
|
- Precise threshold notifications timed to exact business-hours-adjusted deadlines
|
|
- Workflow state queries for real-time SLA status
|
|
- Automatic retry and recovery
|
|
|
|
## Goals
|
|
|
|
1. **EE: Temporal-based SLA workflows** — Create a dedicated Temporal workflow per ticket that sleeps until each SLA threshold and wakes to send notifications or escalate
|
|
2. **CE: Keep pgboss polling** — Maintain the existing `sla-timer` job for CE edition with no changes to functionality
|
|
3. **Unified service interface** — Abstract the backend choice behind a common interface so SLA services remain edition-agnostic
|
|
4. **Business hours-aware timers** — Both backends must correctly calculate wake times respecting business hours, holidays, and pause states
|
|
5. **Graceful degradation** — EE falls back to pgboss if Temporal is unavailable
|
|
|
|
## Non-goals
|
|
|
|
- Changing the SLA data model or database schema
|
|
- Modifying the existing SLA policy configuration UI
|
|
- Adding new notification channels beyond what exists
|
|
- Real-time WebSocket push for SLA status updates
|
|
- Historical SLA analytics or reporting changes
|
|
- Migration tooling for existing tickets (new tickets get workflows; existing tickets continue with polling)
|
|
|
|
## Users and Primary Flows
|
|
|
|
### Primary Users
|
|
- **MSP technicians**: See accurate SLA status badges and receive timely threshold notifications
|
|
- **Board managers**: Receive escalation notifications at configured thresholds
|
|
- **System administrators**: Configure SLA policies, business hours, and escalation rules
|
|
|
|
### Primary Flows
|
|
|
|
#### Flow 1: Ticket Created (EE with Temporal)
|
|
1. Ticket is created with SLA policy assigned
|
|
2. `slaService.startSlaForTicket()` is called
|
|
3. Service detects EE edition and starts a `SlaTicketWorkflow` via Temporal
|
|
4. Workflow calculates response/resolution deadlines considering business hours
|
|
5. Workflow sleeps until first threshold (e.g., 50% of response time)
|
|
6. On wake, sends notification and sleeps until next threshold
|
|
7. Continues until ticket is resolved or SLA is breached
|
|
|
|
#### Flow 2: Ticket Created (CE with pgboss)
|
|
1. Ticket is created with SLA policy assigned
|
|
2. `slaService.startSlaForTicket()` is called
|
|
3. Service detects CE edition and records deadlines in ticket record
|
|
4. Existing `sla-timer` pgboss job polls every 5 minutes
|
|
5. Job calculates elapsed percentage and sends notifications for crossed thresholds
|
|
|
|
#### Flow 3: SLA Paused/Resumed
|
|
1. Ticket status changes to one configured to pause SLA (or awaiting_client)
|
|
2. `slaPauseService.pauseSla()` is called
|
|
3. **EE**: Signal sent to workflow to pause; workflow records pause start and cancels pending timers
|
|
4. **CE**: `sla_paused_at` timestamp set; polling job skips paused tickets
|
|
5. On resume: **EE**: Signal sent; workflow recalculates remaining time and sets new timers. **CE**: `sla_total_pause_minutes` incremented; polling resumes
|
|
|
|
#### Flow 4: Ticket Resolved
|
|
1. Ticket is closed/resolved
|
|
2. `slaService.recordResolution()` is called
|
|
3. **EE**: Signal sent to workflow to complete; workflow records met/breached status and terminates
|
|
4. **CE**: SLA fields updated; polling job naturally excludes resolved tickets
|
|
|
|
## UX / UI Notes
|
|
|
|
No UI changes required. The SLA status badges and notifications work identically regardless of backend. The only user-facing difference is potentially more precise notification timing in EE.
|
|
|
|
## Requirements
|
|
|
|
### Functional Requirements
|
|
|
|
#### FR-1: Edition Detection and Routing
|
|
- [ ] Create `SlaBackendFactory` that returns appropriate backend based on edition
|
|
- [ ] Factory checks `isEnterprise` from `server/src/lib/features.ts`
|
|
- [ ] Returns `TemporalSlaBackend` for EE, `PgBossSlaBackend` for CE
|
|
|
|
#### FR-2: Common SLA Backend Interface
|
|
- [ ] Define `ISlaBackend` interface with methods:
|
|
- `startSlaTracking(ticketId, policyId, targets, schedule)`: Start tracking for a ticket
|
|
- `pauseSla(ticketId, reason)`: Pause SLA timer
|
|
- `resumeSla(ticketId)`: Resume SLA timer
|
|
- `completeSla(ticketId, type: 'response' | 'resolution', met: boolean)`: Complete response or resolution
|
|
- `cancelSla(ticketId)`: Cancel SLA tracking (ticket deleted or policy removed)
|
|
- `getSlaStatus(ticketId)`: Get current SLA status
|
|
|
|
#### FR-3: Temporal Workflow for EE (SlaTicketWorkflow)
|
|
- [ ] Create `sla-ticket-workflow.ts` in `ee/temporal-workflows/src/workflows/`
|
|
- [ ] Workflow input: ticket ID, tenant ID, policy targets, business hours schedule
|
|
- [ ] Workflow maintains state: current phase (response/resolution), pause state, notified thresholds
|
|
- [ ] Workflow uses Temporal timers (`sleep()`) to wake at threshold times
|
|
- [ ] On wake: send notification via activity, check for escalation, sleep until next threshold
|
|
- [ ] Support signals: `pause`, `resume`, `completeResponse`, `completeResolution`, `cancel`
|
|
- [ ] Support queries: `getState` (returns current status, remaining time, etc.)
|
|
|
|
#### FR-4: Business Hours Timer Calculation
|
|
- [ ] Create activity `calculateNextWakeTime(currentTime, targetTime, schedule, pauseMinutes)`
|
|
- [ ] Activity uses existing `businessHoursCalculator` logic
|
|
- [ ] Returns actual wall-clock time to sleep until, accounting for business hours
|
|
- [ ] Handles edge cases: start outside business hours, holidays, timezone changes
|
|
|
|
#### FR-5: Temporal Activities for SLA Operations
|
|
- [ ] `sendSlaNotification`: Sends threshold notification (reuses existing notification service)
|
|
- [ ] `checkAndEscalate`: Checks escalation thresholds and escalates if needed
|
|
- [ ] `updateSlaStatus`: Updates ticket SLA fields in database
|
|
- [ ] `recordSlaAuditLog`: Writes to sla_audit_log table
|
|
|
|
#### FR-6: pgboss Backend for CE
|
|
- [ ] Create `PgBossSlaBackend` implementing `ISlaBackend`
|
|
- [ ] `startSlaTracking`: No-op (deadlines already stored by slaService)
|
|
- [ ] `pauseSla`/`resumeSla`: Delegate to existing `slaPauseService`
|
|
- [ ] `completeSla`: Delegate to existing `slaService` methods
|
|
- [ ] Existing `sla-timer` job continues polling for threshold notifications
|
|
|
|
#### FR-7: Integration with Existing SLA Services
|
|
- [ ] Modify `slaService.startSlaForTicket()` to call backend's `startSlaTracking()`
|
|
- [ ] Modify `slaPauseService.pauseSla()`/`resumeSla()` to call backend methods
|
|
- [ ] Modify `slaService.recordFirstResponse()`/`recordResolution()` to signal backend
|
|
|
|
#### FR-8: Workflow Lifecycle Management
|
|
- [ ] Workflow ID format: `sla-ticket-{tenantId}-{ticketId}`
|
|
- [ ] On ticket deletion: cancel workflow if running
|
|
- [ ] On SLA policy change: cancel existing workflow and start new one
|
|
- [ ] Workflow gracefully handles duplicate starts (idempotent)
|
|
|
|
### Non-functional Requirements
|
|
|
|
#### NFR-1: Fallback Behavior
|
|
- [ ] EE with Temporal unavailable falls back to pgboss polling
|
|
- [ ] Log warning when fallback occurs
|
|
- [ ] Existing tickets continue with polling; new tickets also use polling until Temporal recovers
|
|
|
|
#### NFR-2: Workflow Durability
|
|
- [ ] Workflows survive Temporal worker restarts
|
|
- [ ] Workflow state reconstructed from Temporal history on replay
|
|
|
|
## Data / API / Integrations
|
|
|
|
### Database
|
|
No schema changes. Uses existing:
|
|
- `tickets` table: `sla_*` fields for tracking
|
|
- `sla_audit_log`: Audit trail
|
|
- `sla_policies`, `sla_policy_targets`: Policy configuration
|
|
- `business_hours_schedules`, `business_hours_entries`, `holidays`: Business hours
|
|
|
|
### Temporal Task Queue
|
|
- Queue name: `sla-workflows` (or reuse `alga-jobs` with workflow routing)
|
|
- Workflows: `SlaTicketWorkflow`
|
|
- Activities: `sla-activities` (notification, escalation, DB updates)
|
|
|
|
### Integration Points
|
|
- `@alga-psa/notifications`: For sending threshold notifications
|
|
- `@alga-psa/sla`: For business hours calculations
|
|
- `JobRunnerFactory`: For fallback to pgboss when Temporal unavailable
|
|
|
|
## Security / Permissions
|
|
|
|
No changes. Existing tenant isolation via `tenant` column and `runWithTenant()` context.
|
|
|
|
## Observability
|
|
|
|
- Temporal UI provides workflow visibility for debugging
|
|
- Existing logger calls in activities for audit trail
|
|
- Workflow queries allow real-time status inspection
|
|
|
|
## Rollout / Migration
|
|
|
|
### Phase 1: Backend Interface
|
|
1. Define `ISlaBackend` interface
|
|
2. Implement `PgBossSlaBackend` wrapping existing behavior
|
|
3. Wire up factory with CE-only support
|
|
4. Verify no behavioral changes
|
|
|
|
### Phase 2: Temporal Workflow Implementation
|
|
1. Create `SlaTicketWorkflow` and activities
|
|
2. Implement `TemporalSlaBackend`
|
|
3. Add EE branch to factory
|
|
4. Test in EE environment
|
|
|
|
### Phase 3: Integration
|
|
1. Integrate backend calls into existing SLA services
|
|
2. Handle pause/resume signals
|
|
3. Handle ticket resolution signals
|
|
4. Test full lifecycle
|
|
|
|
### Migration Strategy
|
|
- New tickets get workflows (EE) or polling (CE)
|
|
- Existing active tickets continue with polling (no migration needed)
|
|
- Workflows only started for tickets created after deployment
|
|
|
|
## Open Questions
|
|
|
|
1. **Q: Should we migrate existing active tickets to workflows?**
|
|
A: No - too complex, let them naturally resolve via polling. Only new tickets get workflows.
|
|
|
|
2. **Q: How long should workflows remain after ticket resolution?**
|
|
A: Workflow completes immediately on resolution. Temporal history retained per server config.
|
|
|
|
3. **Q: Should workflow query replace database reads for SLA status?**
|
|
A: No - database remains source of truth. Workflow is for timer orchestration only.
|
|
|
|
## Acceptance Criteria (Definition of Done)
|
|
|
|
1. [ ] EE edition starts Temporal workflow for new tickets with SLA policies
|
|
2. [ ] Workflow wakes at correct business-hours-adjusted times for each threshold
|
|
3. [ ] Notifications sent at configured thresholds (50%, 75%, 90%, 100%)
|
|
4. [ ] Escalations triggered when thresholds crossed
|
|
5. [ ] Pause signal stops workflow timers; resume recalculates remaining time
|
|
6. [ ] Resolution signal completes workflow and records met/breached status
|
|
7. [ ] CE edition continues using pgboss polling with no changes
|
|
8. [ ] Fallback to pgboss works when Temporal unavailable in EE
|
|
9. [ ] Unit tests for backend interface and workflow logic
|
|
10. [ ] Integration test for full ticket SLA lifecycle in EE
|