# PRD: Workflow V2 Temporal-Native Runtime

- Slug: `workflow-v2-temporal-native-runtime`
- Date: `2026-04-08`
- Status: Draft

## Summary

Replace the current database-driven Workflow Runtime V2 execution engine with a full Temporal-native interpreter for Enterprise Edition workflows. The workflow designer and authored DSL remain declarative and approachable for non-technical users, while Temporal becomes the authoritative execution runtime for all workflow runs in both hosted EE and appliance EE. Database tables remain product-facing read models and idempotency ledgers, not scheduler or resume authority.

## Problem

Workflow Runtime V2 currently executes through database-backed run state, wait rows, snapshots, and worker polling. That design was sufficient for early iteration, but it creates structural limitations:

1. **Polling-based waits are operationally weak**: `time.wait` requires a background polling worker instead of durable native timers.
2. **Execution authority is split across rows, snapshots, and worker leases**: the run model is harder to reason about and harder to recover cleanly.
3. **Long-running workflows are awkward**: event waits, retries, and child workflow semantics rely on custom scheduler logic rather than a durable orchestration engine.
4. **Runtime semantics are difficult to evolve safely**: replay, determinism, and versioning are implicit instead of explicit.
5. **Hosted and appliance need one durable runtime model**: workflows are an EE feature, and appliance includes Temporal, so diverging engines would create unnecessary complexity.

A Temporal-native interpreter solves the hardest workflow-runtime concerns directly:
- durable timers for `time.wait`
- signal-based wakeups for `event.wait` and `human.task`
- native child workflow orchestration
- cleaner cancellation and recovery semantics
- a single durable execution authority

## Goals

1. **Temporal-native execution authority**: every new Workflow V2 run executes as a Temporal workflow, and Temporal is the source of truth for execution state.
2. **Single EE runtime across hosted and appliance**: both hosted EE and appliance EE use the same interpreter model and semantics.
3. **Preserve intuitive workflow authoring**: the workflow DSL and designer remain declarative and user-friendly; Temporal concepts stay behind the runtime boundary.
4. **Hard cutover with minimal migration burden**: the current DB-backed runtime can be abandoned for new runs; no active-run migration is required.
5. **Durable, deterministic interpreter**: pinned definitions, explicit runtime semantics versioning, deterministic expression evaluation, and activity-only side effects.
6. **Keep the database as a first-class product read model**: run details, step timelines, waits, action invocations, and event audit remain queryable in the DB.
7. **Support all current workflow categories**: manual runs, event-triggered runs, one-time schedules, recurring schedules, waits, retries, child workflows, and human tasks.
8. **Establish a clean future architecture**: retire lease-based polling execution and DB-driven wait resolution so future workflow features build on a durable engine.

## Non-goals

- Redesigning the workflow designer around imperative Temporal concepts.
- Preserving exact internal runtime table semantics from the DB-backed engine.
- Migrating in-flight DB-runtime runs into Temporal.
- Maintaining a long-term hybrid runtime where DB execution and Temporal execution coexist as equal authorities.
- Building a broad new observability platform beyond the minimum run/query/projection surfaces needed for workflow support.
- Introducing parallel `forEach` execution in the first Temporal-native release.
- Replatforming Community Edition workflows onto Temporal.

## Users and Primary Flows

### Primary users

- **Workflow authors** creating automations in the visual workflow designer.
- **Operators and support engineers** inspecting workflow runs, waits, failures, and event routing.
- **Platform engineers** evolving the runtime, actions, and trigger orchestration safely.

### Primary flows

#### Flow 1: Manual or API run
1. A published workflow is started manually or via API.
2. The system allocates an Alga run ID and starts a Temporal interpreter workflow for that run.
3. The interpreter loads the pinned published definition version.
4. Steps execute through the interpreter until completion, wait, failure, or cancellation.
5. The database projection is updated so the run appears in run-detail UI and APIs.

#### Flow 2: Event-triggered workflow
1. A domain event enters the workflow event ingress path.
2. The event is validated, recorded for audit, and matched to published workflow triggers.
3. Matching workflows start new Temporal runs directly.
4. If an existing run is waiting on an `event.wait`, candidate runs are signaled.
5. Each signaled run decides inside Temporal whether the signal matches its active wait.

#### Flow 3: Time wait inside a workflow
1. A workflow reaches `time.wait`.
2. The interpreter computes `dueAt` deterministically.
3. The workflow projects a wait row for product visibility and then sleeps with a native Temporal timer.
4. On wake, the interpreter resolves the projected wait and continues execution.

#### Flow 4: Child workflow orchestration
1. A workflow reaches `control.callWorkflow`.
2. The parent interpreter starts a Temporal child interpreter workflow with mapped input.
3. The child completes or fails.
4. The parent maps outputs or handles the child failure using retry or catch semantics.

#### Flow 5: Schedule-triggered workflow
1. A published workflow is configured with a one-time or recurring schedule trigger.
2. The system reconciles schedule state to Temporal Schedules (or equivalent Temporal-native scheduling authority).
3. When the schedule fires, Temporal starts a workflow run.
4. The run appears in the database projection like any other run.

## UX / UI Notes

- The workflow designer should continue to present a declarative workflow model. Authors should never need to understand task queues, Temporal histories, activity retries, or workflow replay.
- Existing workflow step concepts remain product-facing: `action.call`, `control.if`, `control.forEach`, `control.tryCatch`, `control.callWorkflow`, `event.wait`, `time.wait`, `human.task`, and `control.return`.
- It is acceptable for some internal runtime semantics to evolve if the designer remains intuitive and authored workflows stay understandable.
- Run detail UIs should continue to show:
  - current status
  - current step
  - current wait
  - step timeline
  - action invocation history
  - terminal error context
- Replay/re-run in the UI should mean “start a new run from the pinned definition and input,” not “resume a DB-backed snapshot.”

## Requirements

### Functional Requirements

#### FR-1: Temporal as execution authority
- All new Workflow V2 runs must execute as Temporal workflows.
- Temporal must be the sole execution authority for run progression, waits, retries, and child workflow orchestration.
- The same Temporal-native runtime must be used for hosted EE and appliance EE.
- New runs must no longer depend on DB lease ownership, runnable-run polling, or DB snapshot resume logic.

#### FR-2: Stable run identity and pinned definition loading
- Every run must have a stable Alga `run_id` that is created before the Temporal workflow starts.
- Each Temporal execution must map deterministically to that `run_id`.
- Runs must pin `workflow_id`, `published_version`, `definition_hash`, and `runtime_semantics_version` at start.
- The interpreter must load the pinned published definition through an activity and execute only that version for the life of the run.

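One way to satisfy the deterministic mapping in FR-2 is to derive the Temporal workflow ID purely from the Alga `run_id`. The sketch below is illustrative only: the `workflow-runtime-v2:run:` prefix, the `PinnedRunStart` shape, and the helper names are assumptions, not part of the existing codebase.

```typescript
// Hypothetical shape of the values pinned at run start (field list follows FR-2).
interface PinnedRunStart {
  runId: string;
  workflowId: string;
  publishedVersion: number;
  definitionHash: string;
  runtimeSemanticsVersion: string;
}

// Assumed naming convention; FR-2 only requires the mapping to be deterministic.
const RUN_WORKFLOW_ID_PREFIX = "workflow-runtime-v2:run:";

// The same run_id always yields the same Temporal workflow ID.
function temporalWorkflowIdFor(runId: string): string {
  return RUN_WORKFLOW_ID_PREFIX + runId;
}

// Inverse mapping, so a Temporal execution can be traced back to its run_id.
function runIdFromTemporalWorkflowId(temporalWorkflowId: string): string | null {
  return temporalWorkflowId.indexOf(RUN_WORKFLOW_ID_PREFIX) === 0
    ? temporalWorkflowId.slice(RUN_WORKFLOW_ID_PREFIX.length)
    : null;
}

// Bundles the workflow ID with the pinned values passed as workflow input.
function startRequestFor(pinned: PinnedRunStart): { workflowId: string; input: PinnedRunStart } {
  return { workflowId: temporalWorkflowIdFor(pinned.runId), input: pinned };
}
```

Because the ID is a pure function of `run_id`, a retried start request for the same run targets the same Temporal workflow ID instead of creating a second, unrelated execution.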
#### FR-3: Explicit interpreter state machine
- The Temporal-native interpreter must use an explicit frame-based execution model rather than DB `node_path` as execution authority.
- The interpreter must maintain serializable state for:
  - execution frames
  - workflow scope
  - local lexical scopes
  - current step
  - pending wait descriptor
  - terminal result or error
- The runtime must support safe continue-as-new checkpoints for long-lived executions.

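The state listed in FR-3 could be modeled as a plain JSON-serializable object, which is what makes a continue-as-new checkpoint straightforward. The following is a minimal sketch under that assumption; every type and function name here is hypothetical, and the history length is passed in as a plain number rather than read from the Temporal SDK.

```typescript
// Hypothetical frame kinds; a real interpreter would likely need more.
type Frame =
  | { kind: "sequence"; path: string; nextIndex: number }
  | { kind: "forEach"; path: string; itemIndex: number; itemCount: number };

// One field per state category named in FR-3.
interface InterpreterState {
  frames: Frame[];
  workflowScope: Record<string, unknown>;
  lexicalScopes: Array<Record<string, unknown>>;
  currentStepId: string | null;
  pendingWait: { type: "time" | "event" | "human"; key: string } | null;
  result: { status: "succeeded" | "failed"; error?: string } | null;
}

// A checkpoint is the serialized state carried into the next execution.
function checkpoint(state: InterpreterState): string {
  return JSON.stringify(state);
}

function restore(serialized: string): InterpreterState {
  return JSON.parse(serialized) as InterpreterState;
}

// Assumed policy: roll over once history passes a threshold, but never after
// the run has already reached a terminal result.
function shouldContinueAsNew(
  historyLength: number,
  state: InterpreterState,
  maxHistoryLength: number
): boolean {
  return state.result === null && historyLength >= maxHistoryLength;
}
```

Keeping the state free of functions, class instances, and other non-JSON values is the design constraint that matters here; the threshold policy itself is a placeholder.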
#### FR-4: Deterministic expression and runtime semantics
- Workflow control decisions must be deterministic under Temporal replay.
- Expressions may read from workflow scopes and pinned metadata only.
- Workflow code must not directly read mutable DB state, secrets, or network resources.
- Runtime semantics must be versioned explicitly so support and future migrations can identify the interpreter contract used by a run.

#### FR-5: `action.call` activity execution and idempotency
- `action.call` must execute through a dedicated Temporal activity boundary.
- The runtime must compute deterministic idempotency keys for action executions.
- A durable action invocation ledger must suppress duplicate side effects across retries, activity re-execution, and worker restarts.
- Action outputs must be available for assignment/save behavior and run-detail projection.
- User-authored retry policy must remain visible and owned by the interpreter, not hidden inside uncontrolled Temporal activity retry behavior.

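The idempotency requirement in FR-5 can be illustrated with a key derived only from values that are stable under replay, plus a ledger lookup before the side effect runs. This is a sketch, not the actual ledger: the key format, the in-memory `Ledger` object (standing in for `workflow_action_invocations`), and the choice to record only successful invocations are all assumptions.

```typescript
// Key built only from replay-stable inputs: run, step position, loop indexes.
function actionIdempotencyKey(runId: string, stepPath: string, loopIndexes: number[]): string {
  return [runId, stepPath, loopIndexes.join(".")].join("|");
}

interface LedgerEntry {
  key: string;
  output: unknown;
}

// In-memory stand-in for the durable ledger table.
type Ledger = Record<string, LedgerEntry>;

// Runs the side effect at most once per key; later calls reuse the stored output.
// A throwing effect records nothing, so an interpreter-owned retry can run it again.
function executeOnce(ledger: Ledger, key: string, effect: () => unknown): unknown {
  if (Object.prototype.hasOwnProperty.call(ledger, key)) {
    return ledger[key].output;
  }
  const output = effect();
  ledger[key] = { key: key, output: output };
  return output;
}
```

A durable implementation would also need to handle the window between performing the side effect and recording it, which this in-memory version does not address.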
#### FR-6: Control-flow semantics inside the interpreter
- `control.if` must evaluate conditions deterministically and route execution to the correct branch.
- `control.tryCatch` must support catch-branch routing and `captureErrorAs` behavior for catchable runtime failures.
- `control.return` must terminate the current workflow successfully.
- `control.forEach` must be supported as sequential iteration in the first Temporal-native release.
- `control.forEach` loop bodies must support waits, actions, branching, and `onItemError` semantics.
- First-release publish validation or designer constraints must reject or hide `forEach.concurrency > 1`.

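If the publish-time option is chosen for the last bullet of FR-6, the check amounts to a recursive walk over the definition. The sketch below assumes a simplified step shape with a single `body` array for nested steps; the real DSL has more branch containers, and `StepNode` and `findParallelForEach` are hypothetical names.

```typescript
// Simplified, hypothetical step shape; nested steps live in `body`.
interface StepNode {
  id: string;
  type: string;
  concurrency?: number;
  body?: StepNode[];
}

// Returns the IDs of every control.forEach step that asks for concurrency > 1.
function findParallelForEach(steps: StepNode[]): string[] {
  let offending: string[] = [];
  steps.forEach((step) => {
    // A missing concurrency value is treated as sequential (1).
    if (step.type === "control.forEach" && (step.concurrency ?? 1) > 1) {
      offending.push(step.id);
    }
    if (step.body) {
      offending = offending.concat(findParallelForEach(step.body));
    }
  });
  return offending;
}
```

Publish validation could then fail with the returned step IDs so the author sees exactly which loops need to be changed.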
#### FR-7: Child workflow execution
- `control.callWorkflow` must execute as a Temporal child workflow, not inline in the parent interpreter.
- Child workflows must receive their own run IDs plus root/parent linkage metadata.
- Parent workflows must be able to map child outputs and handle child failures through retry or catch semantics.

#### FR-8: Native wait semantics
- `time.wait` must execute using native Temporal timers.
- `time.wait` must fast-path when `dueAt <= now` so already-due waits do not suspend unnecessarily.
- `event.wait` must execute using Temporal signal handling, not DB wait resolution.
- `event.wait` must continue to support event name, correlation key, payload filters, and timeout behavior.
- `human.task` must continue to behave as a signal-backed wait with response validation before resume.

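The `dueAt <= now` fast path in FR-8 is a small pure decision that can be separated from the timer itself. A minimal sketch, assuming `dueAt` is an ISO-8601 timestamp and that the caller supplies the workflow's current time in milliseconds; `planTimeWait` and `TimeWaitDecision` are hypothetical names.

```typescript
// Either continue immediately or sleep for a computed duration.
type TimeWaitDecision =
  | { kind: "continue" }
  | { kind: "sleep"; sleepMs: number };

// `nowMs` is passed in so the decision depends only on its inputs.
function planTimeWait(dueAtIso: string, nowMs: number): TimeWaitDecision {
  const dueAtMs = Date.parse(dueAtIso);
  if (isNaN(dueAtMs)) {
    throw new Error("Invalid dueAt: " + dueAtIso);
  }
  // Fast path: an already-due wait does not suspend the run.
  if (dueAtMs <= nowMs) {
    return { kind: "continue" };
  }
  return { kind: "sleep", sleepMs: dueAtMs - nowMs };
}
```

When the decision is `sleep`, the interpreter would project the wait row and then start the native timer for `sleepMs`; when it is `continue`, neither step is needed.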
#### FR-9: Event ingress and candidate signaling
- Incoming workflow events must be recorded in the database for audit/debugging.
- Event ingress must identify candidate waiting runs using database projection indexes, including tenant, event name, and correlation key.
- Event ingress must signal all candidate waiting runs rather than resolving a single DB wait row as the execution authority.
- Each Temporal workflow must decide whether the signaled event matches its active wait.
- External event delivery must be idempotent by `event_id`.

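The per-run decision in FR-9 ("does this signal match my active wait?") can be expressed as a pure predicate evaluated inside the workflow. The sketch below makes several simplifying assumptions: payload filters are reduced to top-level equality checks, previously delivered event IDs are kept in a plain array, and all type and function names are hypothetical.

```typescript
// Hypothetical descriptor for the wait a run is currently parked on.
interface ActiveEventWait {
  eventName: string;
  correlationKey: string;
  payloadEquals: Record<string, unknown>;
}

// Hypothetical signal payload sent by event ingress to each candidate run.
interface InboundEventSignal {
  eventId: string;
  eventName: string;
  correlationKey: string;
  payload: Record<string, unknown>;
}

// True only when the run is waiting, the event is new, and every criterion matches.
function signalMatchesWait(
  wait: ActiveEventWait | null,
  signal: InboundEventSignal,
  seenEventIds: string[]
): boolean {
  if (wait === null) return false;
  // Idempotency by event_id: a redelivered event never resumes the run twice.
  if (seenEventIds.indexOf(signal.eventId) !== -1) return false;
  if (wait.eventName !== signal.eventName) return false;
  if (wait.correlationKey !== signal.correlationKey) return false;
  return Object.keys(wait.payloadEquals).every(
    (field) => signal.payload[field] === wait.payloadEquals[field]
  );
}
```

Because ingress signals every candidate, a non-matching signal is an expected case: the run ignores it and stays parked on its wait.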
#### FR-10: Trigger execution model
- Manual and API-triggered runs must start Temporal workflow executions directly.
- Event-triggered workflows must start Temporal workflow executions directly from published trigger definitions.
- One-time schedule triggers must use a Temporal-native scheduling authority.
- Recurring schedule triggers must use Temporal Schedules or an equivalent Temporal-native scheduler model.
- Publish/unpublish/update of scheduled workflows must reconcile DB schedule state to Temporal schedule state.

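The reconciliation bullet in FR-10 can be thought of as a diff between the schedules the database says should exist and the schedules Temporal currently holds. The sketch below computes only the plan; applying it through the Temporal client is left out. `ScheduleRecord`, the opaque `spec` string, and `planScheduleReconcile` are hypothetical.

```typescript
// Hypothetical minimal view of a schedule on either side of the comparison.
interface ScheduleRecord {
  scheduleId: string;
  spec: string; // opaque serialized schedule spec, compared by equality
}

interface ReconcilePlan {
  create: string[];
  update: string[];
  remove: string[];
}

// Diffs desired (DB) state against actual (Temporal) state by schedule ID.
function planScheduleReconcile(desired: ScheduleRecord[], actual: ScheduleRecord[]): ReconcilePlan {
  const actualSpecById: Record<string, string> = {};
  actual.forEach((s) => { actualSpecById[s.scheduleId] = s.spec; });

  const desiredIds: Record<string, boolean> = {};
  const plan: ReconcilePlan = { create: [], update: [], remove: [] };

  desired.forEach((s) => {
    desiredIds[s.scheduleId] = true;
    if (!Object.prototype.hasOwnProperty.call(actualSpecById, s.scheduleId)) {
      plan.create.push(s.scheduleId);
    } else if (actualSpecById[s.scheduleId] !== s.spec) {
      plan.update.push(s.scheduleId);
    }
  });

  // Anything Temporal holds that the DB no longer wants is removed.
  actual.forEach((s) => {
    if (!Object.prototype.hasOwnProperty.call(desiredIds, s.scheduleId)) {
      plan.remove.push(s.scheduleId);
    }
  });
  return plan;
}
```

Running the same diff after publish, unpublish, and update keeps the three lifecycle events on one code path, and re-running it against an already reconciled state yields an empty plan.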
#### FR-11: Database projection and product APIs
- `workflow_runs` must become a Temporal-backed run summary projection.
- `workflow_run_steps` must represent actual step execution attempts/timeline.
- `workflow_run_waits` must represent wait projection and event-routing index state.
- `workflow_action_invocations` must remain the durable side-effect idempotency ledger and product-facing action timeline.
- `workflow_runtime_events` must remain the event audit surface.
- Existing run-detail and listing APIs should continue to work through the projection model wherever practical.

#### FR-12: Queries, cancellation, and operator controls
- The runtime must support run cancellation with correct propagation to active child workflows.
- Cancellation must not be swallowed by normal workflow catch semantics.
- Replay/re-run must start a fresh Temporal-native run from the pinned definition and input.
- Operational debug queries should expose at least current step, current wait, and interpreter summary for support/debug tooling.

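The rule in FR-12 that cancellation must not be swallowed by catch semantics can be made explicit by classifying failures before `control.tryCatch` routes them. The following sketch models step outcomes as plain tagged values rather than thrown exceptions; the `RuntimeFailure` kinds and `runTryCatch` are hypothetical and stand in for whatever normalized error model the interpreter adopts.

```typescript
// Hypothetical normalized failure kinds; only "actionError" is catchable here.
type RuntimeFailure =
  | { kind: "cancelled" }
  | { kind: "actionError"; message: string };

type StepOutcome =
  | { ok: true; value: unknown }
  | { ok: false; failure: RuntimeFailure };

// Routes catchable failures to the catch branch and lets cancellation propagate.
function runTryCatch(
  tryBody: () => StepOutcome,
  catchBody: (failure: RuntimeFailure) => StepOutcome
): StepOutcome {
  const outcome = tryBody();
  if (outcome.ok) return outcome;
  // Cancellation bypasses the catch branch entirely.
  if (outcome.failure.kind === "cancelled") return outcome;
  return catchBody(outcome.failure);
}
```

The same classification point is where `captureErrorAs` would bind the failure into scope for the catch branch.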
#### FR-13: Hard cutover and runtime retirement
- New Workflow V2 runs must hard-cut to the Temporal-native engine.
- No active-run migration from the DB runtime is required.
- The current DB-backed runnable-run acquisition, due-wait polling, and DB wait-resolution paths must be retired for new runs.
- Obsolete execution-authority fields and tables may remain temporarily for compatibility, but they must no longer drive execution.

### Non-functional Requirements

- The Temporal-native runtime must be deterministic under replay.
- The runtime must preserve tenant scoping for triggers, events, waits, and actions.
- Side effects must be at-least-once safe through durable idempotency design.
- The same authored workflow definition must execute with the same semantics in hosted EE and appliance EE.
- The first Temporal-native release must prefer correctness and clarity over speculative parallelism or optimization.
- The design must support long-running workflows through continue-as-new rather than unbounded history growth.

## Data / API / Integrations

### Existing relevant surfaces

- DB-backed runtime engine: [shared/workflow/runtime/runtime/workflowRuntimeV2.ts](/Users/roberisaacs/alga-psa/shared/workflow/runtime/runtime/workflowRuntimeV2.ts)
- DB polling worker: [shared/workflow/workers/WorkflowRuntimeV2Worker.ts](/Users/roberisaacs/alga-psa/shared/workflow/workers/WorkflowRuntimeV2Worker.ts)
- Workflow worker bootstrap: [services/workflow-worker/src/index.ts](/Users/roberisaacs/alga-psa/services/workflow-worker/src/index.ts)
- Event ingress worker: [services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.ts](/Users/roberisaacs/alga-psa/services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.ts)
- Workflow runtime actions: [ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts](/Users/roberisaacs/alga-psa/ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts)
- Existing workflow runtime tables: [server/migrations/20251221090000_create_workflow_runtime_v2_tables.cjs](/Users/roberisaacs/alga-psa/server/migrations/20251221090000_create_workflow_runtime_v2_tables.cjs)
- Temporal workflow package: [ee/temporal-workflows/README.md](/Users/roberisaacs/alga-psa/ee/temporal-workflows/README.md)
- Existing Temporal worker/client entrypoints:
  - [ee/temporal-workflows/src/worker.ts](/Users/roberisaacs/alga-psa/ee/temporal-workflows/src/worker.ts)
  - [ee/temporal-workflows/src/client.ts](/Users/roberisaacs/alga-psa/ee/temporal-workflows/src/client.ts)

### Recommended projection model

Keep these tables, but change their role:

- `workflow_runs`
  - run summary projection
  - Temporal workflow/run IDs
  - pinned definition/version/hash
  - current step and wait summary
  - terminal status/error summary
- `workflow_run_steps`
  - step execution timeline
  - attempts
  - durations
  - failures
- `workflow_run_waits`
  - historical and current wait projection
  - event-routing index for active waits
- `workflow_action_invocations`
  - action timeline projection
  - durable side-effect idempotency ledger
- `workflow_runtime_events`
  - inbound event audit trail
  - delivery/routing trace metadata

`workflow_run_snapshots` should no longer be the resume authority. It may remain only as an optional debug checkpoint surface if still useful.

### Recommended new projection fields

Likely additions include:
- `engine = 'temporal'`
- `temporal_workflow_id`
- `temporal_run_id`
- `definition_hash`
- `runtime_semantics_version`
- `parent_run_id`
- `root_run_id`
- wait correlation tokens for idempotent wait-projection updates

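To make the proposed fields concrete, the sketch below expresses them as a TypeScript type and shows one possible way to populate the parent/root linkage for child runs. It assumes that a root run uses its own `run_id` as `root_run_id` and that `temporal_run_id` is unknown until Temporal accepts the start; both conventions, the `run:` workflow ID prefix, and the helper names are hypothetical.

```typescript
// Field names follow the list above; nullability is an assumption.
interface RunProjectionIdentity {
  run_id: string;
  engine: "temporal";
  temporal_workflow_id: string;
  temporal_run_id: string | null; // filled in once Temporal has accepted the start
  definition_hash: string;
  runtime_semantics_version: string;
  parent_run_id: string | null;
  root_run_id: string;
}

// A root run has no parent and is its own root (assumed convention).
function rootRunIdentity(
  runId: string,
  definitionHash: string,
  semanticsVersion: string
): RunProjectionIdentity {
  return {
    run_id: runId,
    engine: "temporal",
    temporal_workflow_id: "run:" + runId,
    temporal_run_id: null,
    definition_hash: definitionHash,
    runtime_semantics_version: semanticsVersion,
    parent_run_id: null,
    root_run_id: runId,
  };
}

// A child run points at its direct parent and inherits the parent's root.
function childRunIdentity(
  parent: RunProjectionIdentity,
  childRunId: string,
  childDefinitionHash: string,
  semanticsVersion: string
): RunProjectionIdentity {
  return {
    run_id: childRunId,
    engine: "temporal",
    temporal_workflow_id: "run:" + childRunId,
    temporal_run_id: null,
    definition_hash: childDefinitionHash,
    runtime_semantics_version: semanticsVersion,
    parent_run_id: parent.run_id,
    root_run_id: parent.root_run_id,
  };
}
```

Carrying `root_run_id` unchanged through every level lets run-detail queries fetch a whole call tree with a single indexed lookup instead of walking parent links.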
### Temporal integration model

- Use a dedicated workflow runtime interpreter workflow type and task queue.
- Use activities for:
  - loading pinned definitions
  - executing actions
  - writing projections
  - validating human-task form responses when runtime metadata requires DB access
  - reconciling Temporal Schedules
- Use child workflows for `control.callWorkflow`.
- Use workflow signals for:
  - external events
  - human task completion/admin resume
  - cancellation and future runtime controls if needed
- Use Temporal Schedules for one-time and recurring workflow triggers.

## Security / Permissions

- Continue to use existing workflow permissions for authoring, publish, run, and inspect operations.
- Maintain tenant scoping across event routing, action execution, and read-model projections.
- Secret access must remain activity-only and must never happen directly in deterministic workflow code.
- Projected logs and errors must continue to honor redaction behavior for sensitive fields.

## Observability

This plan deliberately avoids a large new observability program. Minimum required visibility is:
- run summary and status
- current step
- current wait
- step execution timeline
- action invocation timeline
- inbound event audit
- current interpreter summary available via Temporal query for operator debugging

Temporal UI should be considered a support/engineering surface, not the primary end-user workflow run UI.

## Rollout / Migration

### Migration posture

- Hard cutover.
- All new Workflow V2 runs use the Temporal-native runtime.
- No migration of in-flight DB-runtime runs is required.
- Old DB-backed run records may remain for historical/debug reasons only.

### Recommended implementation phases

#### Phase 1: Runtime skeleton
- Start runs directly in Temporal.
- Load pinned definitions through activities.
- Support straight-line execution (`action.call`, `control.if`, `control.return`).
- Write run and step projections.

#### Phase 2: Core control flow
- Add sequential `control.forEach`.
- Add `control.tryCatch` and normalized runtime errors.
- Add `control.callWorkflow` as child workflows.
- Stabilize action idempotency ledger behavior.

#### Phase 3: Waits and signals
- Add `time.wait` using native Temporal timers.
- Add `event.wait` using signal handling and candidate signal fan-out.
- Add `human.task` resume signaling and validation.
- Stabilize wait projections and event-routing indexes.

#### Phase 4: Trigger platform
- Move event-triggered run start fully onto Temporal-native execution.
- Move one-time schedules to Temporal-native scheduling.
- Move recurring schedules to Temporal Schedules with reconciliation.

#### Phase 5: Cutover cleanup
- Disable DB polling/runtime execution authority for new runs.
- Remove or deprecate lease-based execution assumptions.
- Reduce `workflow_run_snapshots` to optional debug-only use if retained.
- Plan follow-on schema cleanup for obsolete execution-authority fields.

## Open Questions

1. Should `workflow_run_snapshots` survive as an operator debug surface, or should that move entirely to Temporal queries/history plus targeted redacted checkpoints?
2. Should the first Temporal-native release reject `forEach.concurrency > 1` at publish time, or hide that capability in the designer and reserve schema cleanup for follow-up?
3. For one-time schedule triggers, should the implementation use Temporal Schedules uniformly, or a simpler single-fire start mechanism wrapped in the same schedule reconciliation layer?

## Acceptance Criteria (Definition of Done)

1. [ ] A newly started Workflow V2 run executes as a Temporal workflow and is no longer scheduled by the DB polling worker.
2. [ ] Hosted EE and appliance EE use the same Temporal-native interpreter semantics.
3. [ ] Each run pins workflow definition ID, version, hash, and runtime semantics version at start.
4. [ ] `action.call` executes through activities with durable idempotency so duplicate side effects are suppressed across retries.
5. [ ] `control.if`, `control.tryCatch`, sequential `control.forEach`, and `control.return` execute correctly inside the interpreter.
6. [ ] `control.callWorkflow` executes as a Temporal child workflow with parent/root linkage and output mapping.
7. [ ] `time.wait` uses native Temporal timers and resumes without DB polling.
8. [ ] `event.wait` uses signal handling and resumes only when the active wait matches event name, correlation key, and filters.
9. [ ] Event ingress records inbound events, identifies candidate waiting runs, and signals those runs idempotently.
10. [ ] One-time and recurring schedule triggers are reconciled to Temporal-native scheduling authority.
11. [ ] Database run, step, wait, action, and event surfaces remain usable as product-facing read models.
12. [ ] Run cancellation works correctly and propagates to child workflows.
13. [ ] Replay/re-run starts a fresh Temporal-native run from pinned definition/input rather than DB snapshot resume.
14. [ ] DB-backed wait polling and lease-based execution are no longer required for new runs.
15. [ ] High-signal Temporal and DB-backed integration tests cover interpreter semantics, waits, event routing, schedules, and idempotency.