Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

# Scratchpad — SLA Temporal Workflow Architecture (CE/EE Split)
- Plan slug: `sla-temporal-workflow-architecture`
- Created: `2026-02-03`
## What This Is
Keep a lightweight, continuously-updated log of discoveries and decisions made while implementing this plan.
## Decisions
- (2026-02-03) Use a separate `ISlaBackend` interface rather than extending `IJobRunner`: SLA tracking has different semantics (signals, queries, per-ticket workflows) than generic job execution
- (2026-02-03) Don't migrate existing tickets: new tickets get workflows (EE) or polling (CE); existing tickets continue with polling
- (2026-02-03) Database remains source of truth for SLA status - workflow is for timer orchestration only
- (2026-02-03) Workflow ID format `sla-ticket-{tenantId}-{ticketId}` allows easy lookup and ensures uniqueness
- (2026-02-03) Added `ISlaBackend` interface in `packages/sla/src/services/backends/ISlaBackend.ts` and exported via services index
- (2026-02-03) Added `SlaBackendFactory` in `packages/sla/src/services/backends/SlaBackendFactory.ts` to select EE vs CE backend using `server/src/lib/features.ts`
- (2026-02-03) Implemented `PgBossSlaBackend` to delegate pause/resume/complete/status operations via existing SLA services using tenant resolution helpers
- (2026-02-03) Added `sla-ticket-workflow.ts` with response/resolution phase tracking and threshold orchestration (Temporal)
- (2026-02-03) SLA workflow input includes ticketId, tenantId, policyTargets, and businessHoursSchedule
- (2026-02-03) SLA workflow state tracks phase, pause state, notified thresholds, and response/resolution deadlines
- (2026-02-03) SLA workflow uses Temporal sleep + condition race to wake at threshold times
- (2026-02-03) Pause signal sets pauseStartedAt and pauses timers via condition wake-up
- (2026-02-03) Resume signal clears pauseStartedAt, increments totalPauseMinutes, and triggers recalculation on next loop
- (2026-02-03) Complete response signal sets response complete and transitions workflow to resolution phase
- (2026-02-03) Complete resolution signal marks workflow completed and allows termination
- (2026-02-03) Cancel signal marks workflow cancelled and stops further threshold handling
- (2026-02-03) Added getState query returning status, remaining time minutes, and pause state
- (2026-02-03) Added SLA activities including calculateNextWakeTime using business hours calculator
- (2026-02-03) calculateNextWakeTime leverages calculateDeadline to advance into next business period when starting outside hours
- (2026-02-03) calculateNextWakeTime adds accumulated pause minutes to computed deadline
- (2026-02-03) sendSlaNotification activity delegates to slaNotificationService.sendSlaNotification
- (2026-02-03) checkAndEscalate activity calls the escalationService check and escalate paths
- (2026-02-03) updateSlaStatus activity updates ticket SLA met fields on breach
- (2026-02-03) recordSlaAuditLog activity writes entries to sla_audit_log
- (2026-02-03) Implemented EE TemporalSlaBackend to start SLA ticket workflows via Temporal client
- (2026-02-03) Temporal workflow IDs follow `sla-ticket-{tenantId}-{ticketId}` format in TemporalSlaBackend
- (2026-02-03) TemporalSlaBackend pause/resume/complete/cancel methods signal workflows
- (2026-02-03) TemporalSlaBackend.resumeSla sends the resume signal to the workflow
- (2026-02-03) TemporalSlaBackend completes SLA phases via completeResponse/completeResolution signals
- (2026-02-03) TemporalSlaBackend cancelSla sends cancel signal to workflow
- (2026-02-03) TemporalSlaBackend getSlaStatus queries workflow state via getState
- (2026-02-03) slaService.startSlaForTicket now triggers backend.startSlaTracking after recording SLA start
- (2026-02-03) slaPauseService pause/resume now signal SLA backend unless skipBackend is set
- (2026-02-03) slaPauseService.resumeSla now signals backend resume by default
- (2026-02-03) slaService.recordFirstResponse now signals backend completeSla('response') unless skipped
- (2026-02-03) slaService.recordResolution now signals backend completeSla('resolution') unless skipped
- (2026-02-03) SlaBackendFactory falls back to PgBoss backend when Temporal backend load fails, with warning log
- (2026-02-03) Fallback path logs warning via core logger
- (2026-02-03) TemporalSlaBackend start is idempotent by ignoring duplicate workflow start errors
- (2026-02-03) Ticket deletion actions now cancel SLA backend workflows via SlaBackendFactory
- (2026-02-03) SLA policy change handling restarts backend workflows via handlePolicyChange in slaService and slaSubscriber
- (2026-02-03) SLA workflow sends threshold notifications at 50/75/90% and breaches at 100%, with escalation checks
- (2026-02-03) Temporal worker now includes `sla-workflows` task queue by default
- (2026-02-03) SLA ticket workflow exported in Temporal workflows index for worker registration
- (2026-02-03) SLA activities exported in Temporal activities index for worker registration
- (2026-02-03) Added CE stub TemporalSlaBackend that throws Enterprise-only error
- (2026-02-03) PgBossSlaBackend startSlaTracking remains a no-op for CE polling
- (2026-02-03) PgBossSlaBackend cancelSla remains a no-op for CE polling
- (2026-02-03) PgBossSlaBackend getSlaStatus delegates to slaService.getSlaStatus
- (2026-02-03) SLA workflow respects 24x7 schedules via business hours calculator
- (2026-02-03) Tests: added ISlaBackend interface signature test (T001)
- (2026-02-03) Tests: SlaBackendFactory returns PgBoss in CE (T002)
- (2026-02-03) Tests: SlaBackendFactory returns Temporal backend in EE when available (T003)
- (2026-02-03) Tests: SlaBackendFactory falls back to PgBoss when Temporal unavailable (T004)
- (2026-02-03) Tests: PgBossSlaBackend startSlaTracking no-op (T005)
- (2026-02-03) Tests: PgBossSlaBackend.pauseSla delegates to slaPauseService (T006)
- (2026-02-03) Tests: PgBossSlaBackend.resumeSla delegates to slaPauseService (T007)
- (2026-02-03) Tests: PgBossSlaBackend.completeSla(response) delegates to slaService.recordFirstResponse (T008)
- (2026-02-03) Tests: PgBossSlaBackend.completeSla(resolution) delegates to slaService.recordResolution (T009)
- (2026-02-03) Tests: PgBossSlaBackend.cancelSla no-op (T010)
- (2026-02-03) Tests: PgBossSlaBackend.getSlaStatus delegates to slaService (T011)
- (2026-02-03) Tests: SLA ticket workflow initialization and input coverage (T012-T024)
- (2026-02-03) Tests: SLA workflow initializes with input parameters (T012)
- (2026-02-03) Tests: SLA workflow initial state phase and thresholds (T013)
- (2026-02-03) Tests: SLA workflow threshold calculations include 50% (T014)
- (2026-02-03) Tests: SLA workflow threshold calculations include 75% (T015)
- (2026-02-03) Tests: SLA workflow threshold calculations include 90% (T016)
- (2026-02-03) Tests: SLA workflow threshold calculations include 100% (T017)
- (2026-02-03) Tests: SLA workflow pause signal sets pauseStartedAt (T018)
- (2026-02-03) Tests: SLA workflow resume clears pauseStartedAt and increments total pause minutes (T019)
- (2026-02-03) Tests: SLA workflow recalculates wake time after resume with pause minutes (T020)
- (2026-02-03) Tests: SLA workflow transitions to resolution on completeResponse (T021)
- (2026-02-03) Tests: SLA workflow terminates on completeResolution (T022)
- (2026-02-03) Tests: SLA workflow cancel signal terminates workflow (T023)
- (2026-02-03) Tests: SLA workflow getState query returns status/remaining time (T024)
- (2026-02-03) Tests: calculateNextWakeTime weekday schedule (T025)
- (2026-02-03) Tests: calculateNextWakeTime advances across weekend (T026)
- (2026-02-03) Tests: calculateNextWakeTime skips holidays (T027)
- (2026-02-03) Tests: calculateNextWakeTime handles recurring holidays (T028)
- (2026-02-03) Tests: calculateNextWakeTime accounts for pause minutes (T029)
- (2026-02-03) Tests: calculateNextWakeTime for 24x7 schedule (T030)
- (2026-02-03) Tests: sendSlaNotification activity calls notification service (T031)
- (2026-02-03) Tests: checkAndEscalate activity calls escalation check (T032)
- (2026-02-03) Tests: checkAndEscalate triggers escalation when needed (T033)
- (2026-02-03) Tests: updateSlaStatus updates response met field (T034)
- (2026-02-03) Tests: updateSlaStatus updates resolution met field (T035)
- (2026-02-03) Tests: recordSlaAuditLog inserts audit entry (T036)
- (2026-02-03) Tests: TemporalSlaBackend starts SLA workflow with correct ID (T037)
- (2026-02-03) Tests: TemporalSlaBackend workflow ID format validated (T038)
- (2026-02-03) Tests: TemporalSlaBackend.pauseSla sends pause signal (T039)
- (2026-02-03) Tests: TemporalSlaBackend.resumeSla sends resume signal (T040)
- (2026-02-03) Tests: TemporalSlaBackend.completeSla(response) signals workflow (T041)
- (2026-02-03) Tests: TemporalSlaBackend.completeSla(resolution) signals workflow (T042)
- (2026-02-03) Tests: TemporalSlaBackend.cancelSla sends cancel signal (T043)
- (2026-02-03) Tests: TemporalSlaBackend.getSlaStatus queries workflow state (T044)
- (2026-02-03) Tests: slaService.startSlaForTicket calls backend.startSlaTracking (T045)
- (2026-02-03) Tests: slaPauseService.pauseSla calls backend.pauseSla (T046)
- (2026-02-03) Tests: slaPauseService.resumeSla calls backend.resumeSla (T047)
- (2026-02-03) Tests: slaService.recordFirstResponse calls backend.completeSla(response) (T048)
- (2026-02-03) Tests: slaService.recordResolution calls backend.completeSla(resolution) (T049)
- (2026-02-03) Tests: TemporalSlaBackend handles duplicate workflow ID (T050)
- (2026-02-03) Tests: CE TemporalSlaBackend stub throws Enterprise-only error (T051)
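
The signal semantics recorded above (pause, resume, completeResponse, completeResolution, cancel) can be modelled as a pure state transition, which is handy for reasoning about replay without a Temporal test environment. This is only a sketch: `SlaWorkflowState`, `SlaSignal`, `buildSlaWorkflowId`, `initialSlaState`, and `applySlaSignal` are hypothetical names, not the contents of `sla-ticket-workflow.ts`. Flooring pause time to whole minutes and ignoring signals after a terminal phase are assumptions.

```typescript
// Hypothetical model of the SLA workflow state; the real field names may differ.
type SlaPhase = "response" | "resolution" | "completed" | "cancelled";

interface SlaWorkflowState {
  phase: SlaPhase;
  pauseStartedAt: number | null; // epoch ms, null when not paused
  totalPauseMinutes: number;
}

type SlaSignal =
  | { type: "pause"; at: number }
  | { type: "resume"; at: number }
  | { type: "completeResponse" }
  | { type: "completeResolution" }
  | { type: "cancel" };

// Workflow ID format from the decision log: sla-ticket-{tenantId}-{ticketId}
function buildSlaWorkflowId(tenantId: string, ticketId: string): string {
  return `sla-ticket-${tenantId}-${ticketId}`;
}

function initialSlaState(): SlaWorkflowState {
  return { phase: "response", pauseStartedAt: null, totalPauseMinutes: 0 };
}

function applySlaSignal(state: SlaWorkflowState, signal: SlaSignal): SlaWorkflowState {
  // Assumption: once completed or cancelled, later signals are ignored.
  if (state.phase === "completed" || state.phase === "cancelled") {
    return state;
  }
  switch (signal.type) {
    case "pause":
      // A second pause while already paused keeps the original start time.
      return state.pauseStartedAt === null
        ? { ...state, pauseStartedAt: signal.at }
        : state;
    case "resume": {
      if (state.pauseStartedAt === null) {
        return state;
      }
      // Assumption: pause duration is floored to whole minutes.
      const pausedMinutes = Math.max(
        0,
        Math.floor((signal.at - state.pauseStartedAt) / 60000)
      );
      return {
        ...state,
        pauseStartedAt: null,
        totalPauseMinutes: state.totalPauseMinutes + pausedMinutes,
      };
    }
    case "completeResponse":
      return state.phase === "response" ? { ...state, phase: "resolution" } : state;
    case "completeResolution":
      return { ...state, phase: "completed" };
    case "cancel":
      return { ...state, phase: "cancelled" };
  }
}
```

Keeping the transition pure means the same function could back both the signal handlers and the `getState` query, though whether the real workflow is structured this way is not recorded here.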
## Discoveries / Constraints
- (2026-02-03) Current `sla-timer` pgboss job runs every 5 minutes per tenant
- (2026-02-03) Business hours calculator uses minute-by-minute iteration - inefficient for long periods
- (2026-02-03) Existing `JobRunnerFactory` pattern can be reused for `SlaBackendFactory`
- (2026-02-03) `isEnterprise` check in `server/src/lib/features.ts` is the edition detection source
- (2026-02-03) Temporal worker already supports multiple task queues - can add `sla-workflows` queue
- (2026-02-03) Generic job workflow pattern shows how to structure activities and signals
- (2026-02-03) TIMEZONE BUG INVESTIGATION NEEDED - User reported mismatch in time calculations
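
The edition check plus fallback behaviour noted for `SlaBackendFactory` can be sketched roughly as follows. Everything named here (`SlaBackendLike`, `selectSlaBackend`, the loader callbacks, the `warn` hook) is hypothetical and stands in for the real factory, `isEnterprise` flag, and core logger. Loaders are synchronous only to keep the sketch short.

```typescript
// Hypothetical minimal backend shape; the real ISlaBackend has more methods.
interface SlaBackendLike {
  readonly name: string;
}

interface SelectSlaBackendOptions {
  isEnterprise: boolean; // stand-in for the edition check in features.ts
  loadTemporal: () => SlaBackendLike; // may throw, e.g. the CE stub
  loadPgBoss: () => SlaBackendLike;
  warn: (message: string) => void; // stand-in for the core logger
}

function selectSlaBackend(options: SelectSlaBackendOptions): SlaBackendLike {
  // CE never attempts to load the Temporal backend.
  if (!options.isEnterprise) {
    return options.loadPgBoss();
  }
  try {
    return options.loadTemporal();
  } catch (err) {
    // EE falls back to polling and logs a warning instead of failing.
    const reason = err instanceof Error ? err.message : String(err);
    options.warn(`Temporal SLA backend unavailable, falling back to PgBoss: ${reason}`);
    return options.loadPgBoss();
  }
}
```

Injecting the loaders and logger keeps the selection logic testable without a Temporal client, which is roughly what T002 through T004 exercise.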
## Commands / Runbooks
- Run SLA tests: `cd packages/sla && npm test`
- Run Temporal worker locally: `docker compose -f docker-compose.temporal.ee.yaml up`
- Check Temporal UI: http://localhost:8088
## Links / References
- Current SLA services: `packages/sla/src/services/`
- Business hours calculator: `packages/sla/src/services/businessHoursCalculator.ts`
- SLA timer handler: `server/src/lib/jobs/handlers/slaTimerHandler.ts`
- Temporal workflows: `ee/temporal-workflows/src/workflows/`
- Job runner factory: `server/src/lib/jobs/JobRunnerFactory.ts`
- Edition detection: `server/src/lib/features.ts`
- Generic job workflow pattern: `ee/temporal-workflows/src/workflows/generic-job-workflow.ts`
## Open Questions
- What specific timezone issues have been observed? These still need to be investigated and documented.
- Should workflow query replace database reads for real-time SLA status display?
- How to handle Temporal worker scaling for high-volume tenants?
## Progress Log
- (2026-02-03) Added integration coverage for EE ticket start triggering Temporal workflow in `packages/sla/src/services/__tests__/slaBackendIntegration.test.ts` (T052).
- (2026-02-03) Added Temporal workflow integration test covering 50% notification threshold in `ee/temporal-workflows/src/workflows/__tests__/sla-ticket-workflow.integration.test.ts` (T053).
- (2026-02-03) Validated 75% SLA threshold notification via workflow integration test (T054).
- (2026-02-03) Validated 90% SLA threshold notification via workflow integration test (T055).
- (2026-02-03) Confirmed 100% threshold breach update via workflow integration test (T056).
- (2026-02-03) Verified escalation checks at each SLA threshold via workflow integration test (T057).
- (2026-02-03) Added integration coverage confirming pause stops SLA notifications until resume (T058).
- (2026-02-03) Verified resume recalculates timers with pause minutes in workflow integration test (T059).
- (2026-02-03) Confirmed completeResponse transitions workflow into resolution phase in integration lifecycle test (T060).
- (2026-02-03) Verified completeResolution terminates the workflow in integration lifecycle test (T061).
- (2026-02-03) Added ticket deletion test ensuring SLA workflow cancellation in `packages/tickets/src/actions/__tests__/ticketActions.sla.test.ts` (T062).
- (2026-02-03) Confirmed policy change cancels and restarts SLA backend in integration test (T063).
- (2026-02-03) Verified CE ticket start avoids Temporal backend in integration test (T064).
- (2026-02-03) Added slaTimerHandler integration test confirming polling path still processes tickets (T065).
- (2026-02-03) Added worker registration test validating workflow export for Temporal startup (T066).
- (2026-02-03) Confirmed SLA activities are exported for worker startup (T067).
- (2026-02-03) Added full EE SLA lifecycle workflow integration test with pause/resume and resolution notifications (T068).
- (2026-02-03) Added CE lifecycle test covering create, poll notification, pause/resume, response, and resolution in `packages/sla/src/services/__tests__/slaCeLifecycle.test.ts` (T069).
- (2026-02-03) Verified EE fallback to PgBoss backend when Temporal is unavailable in integration test (T070).
- (2026-02-03) Marked America/New_York timezone calculation test for calculateNextWakeTime (T071).
- (2026-02-03) Marked Europe/London timezone calculation test for calculateNextWakeTime (T072).
- (2026-02-03) Marked DST transition coverage for calculateNextWakeTime (T073).
- (2026-02-03) Added worker restart integration test to validate workflow replay determinism (T074).
- (2026-02-03) Verified workflow continues after worker restart in integration test (T075).
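
As a worked example of what the threshold tests (T053 to T056, T059) check, the wake times at 50/75/90/100% can be computed as below. This sketch assumes a 24x7 schedule, so elapsed time is plain wall-clock minutes; the real `calculateNextWakeTime` goes through the business hours calculator. `thresholdWakeTimes` and `nextThreshold` are hypothetical helper names.

```typescript
// Threshold percentages from the decision log: notify at 50/75/90, breach at 100.
const SLA_THRESHOLDS: readonly number[] = [50, 75, 90, 100];

interface ThresholdWake {
  percent: number;
  wakeAt: Date;
}

// Assumes a 24x7 schedule: no business-hours or holiday adjustment is applied.
function thresholdWakeTimes(
  startedAt: Date,
  targetMinutes: number,
  totalPauseMinutes: number
): ThresholdWake[] {
  return SLA_THRESHOLDS.map((percent) => {
    const elapsedMinutes = (targetMinutes * percent) / 100;
    // Accumulated pause minutes push every threshold later by the same amount.
    const wakeMs = startedAt.getTime() + (elapsedMinutes + totalPauseMinutes) * 60000;
    return { percent, wakeAt: new Date(wakeMs) };
  });
}

// First threshold that has not been notified yet, or undefined when all are done.
function nextThreshold(
  wakes: ThresholdWake[],
  notified: ReadonlySet<number>
): ThresholdWake | undefined {
  return wakes.find((wake) => !notified.has(wake.percent));
}
```

For a 120-minute target started at 00:00 UTC with 30 accumulated pause minutes, the 50% threshold lands at 01:30 UTC and the breach at 02:30 UTC.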