# Temporal-only workflow engine
Workflow runtime V2 currently supports two execution engines: Temporal (the
default) and a legacy DB-polling engine. The split is selected per-process by
environment flags, which lets the API server and the workflow worker disagree
about who executes a run. When they disagree — or when the producer and worker
resolve different Temporal task queues — a click on **Run** or **Replay**
creates a `workflow_runs` row that nothing ever executes. The run sits in
`RUNNING` with zero steps, and to the operator the button "did nothing."
This change removes the DB engine entirely. Temporal becomes the only engine,
on one task queue, with no per-process configuration to get wrong.
## Failure modes this eliminates
1. **Engine flag mismatch.** The server defaults new runs to
`engine='temporal'` unless `WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING` is
false (`ee/packages/workflows/src/lib/workflowRunLauncher.ts`), while the
worker only starts the DB poller when
`WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING=true`
(`services/workflow-worker/src/index.ts`). Set the first flag to false on
the server without setting the second on the worker and every new run is
stranded.
2. **Task-queue split brain.** The worker honors a
`WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE` override
(`services/workflow-worker/src/v2/WorkflowRuntimeV2TemporalWorker.ts`), but
the producer hardcodes the contract constant
(`ee/packages/workflows/src/lib/workflowRuntimeV2Temporal.ts`). Any
environment that sets the variable starts runs on a queue nobody polls.
3. **Legacy run controls that 409 on Temporal runs.** Retry, Resume, and
Requeue Event only work for DB-engine runs and hard-fail for Temporal runs
via `assertLegacyRunControlSupported`
(`ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts`), yet
the Run Studio shows Retry on any FAILED run.
4. **Replay submits redacted payloads.** The Replay dialog pre-fills its
payload editor from `run.input_json` as returned by `getWorkflowRunAction`,
which applies `applyRunStudioRedactions`. Submitting unedited sends the
redaction placeholders as an explicit payload override.
5. **Silent stuck runs.** A run accepted by Temporal but never picked up (a
down worker, a queue backlog) shows as `RUNNING` with no steps and no
warning.
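
Failure mode 1 can be modeled in a few lines. This is a minimal sketch, not the real code: the function names are hypothetical, and the exact string comparison used to parse each flag is an assumption. Only the two environment variable names come from the description above.

```typescript
// Hypothetical model of the pre-change flag handling (names are illustrative).
type Env = Record<string, string | undefined>;

// Server side: new runs use 'temporal' unless the Temporal flag is "false".
function serverEngineForNewRun(env: Env): "temporal" | "db" {
  return env.WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING === "false" ? "db" : "temporal";
}

// Worker side: the DB poller starts only when its own flag is "true".
function workerStartsDbPoller(env: Env): boolean {
  return env.WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING === "true";
}

// A run is stranded when the server picks the DB engine but no DB poller runs.
function isRunStranded(serverEnv: Env, workerEnv: Env): boolean {
  return serverEngineForNewRun(serverEnv) === "db" && !workerStartsDbPoller(workerEnv);
}
```

Because each process reads its own environment, nothing forces the two answers to agree; removing both flags removes the disagreement.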
## Design
### Launch path and task queue
`launchPublishedWorkflowRun` always writes `engine='temporal'` and always
starts the Temporal workflow; `isTemporalPollingEnabled()` is deleted.
`StartRunParams` loses its `engine` parameter — `startRun` writes
`'temporal'` unconditionally.

The task queue becomes the single constant in
`workflowRuntimeV2TemporalContract.ts`, used verbatim by both producer and
worker. The env override in `WorkflowRuntimeV2TemporalWorker` and the
`WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE` compose plumbing are removed.
Environment isolation remains the job of `TEMPORAL_ADDRESS` and
`TEMPORAL_NAMESPACE`.
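
The shared-constant arrangement might look roughly like the sketch below. The constant's name and value, the option shapes, the workflow ID format, and the fallback address and namespace are all assumptions for illustration; the real definitions live in `workflowRuntimeV2TemporalContract.ts` and the Temporal SDK.

```typescript
// Stand-in for the contract module. Name and value are placeholders.
const WORKFLOW_RUNTIME_V2_TASK_QUEUE = "workflow-runtime-v2";

interface StartOptions {
  taskQueue: string;
  workflowId: string;
}

interface WorkerOptions {
  taskQueue: string;
  namespace: string;
  address: string;
}

// Producer: the queue comes from the constant, never from the environment.
function buildStartOptions(runId: string): StartOptions {
  return { taskQueue: WORKFLOW_RUNTIME_V2_TASK_QUEUE, workflowId: `workflow-run-${runId}` };
}

// Worker: same constant. Only address and namespace vary per environment.
function buildWorkerOptions(env: Record<string, string | undefined>): WorkerOptions {
  return {
    taskQueue: WORKFLOW_RUNTIME_V2_TASK_QUEUE,
    namespace: env.TEMPORAL_NAMESPACE ?? "default", // assumed fallback
    address: env.TEMPORAL_ADDRESS ?? "localhost:7233", // assumed fallback
  };
}
```

With this shape, a leftover `WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE` in some environment is simply ignored, so producer and worker cannot drift apart.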
### Worker service
`services/workflow-worker/src/index.ts` starts the Temporal worker and the
event-stream worker unconditionally. Both `WORKFLOW_RUNTIME_V2_ENABLE_*`
flags disappear. Both DB-poller classes are deleted:
- `services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts`
- `shared/workflow/workers/WorkflowRuntimeV2Worker.ts`

The `workflow_data_store` expiry sweep currently embedded in the shared
poller is not engine work; it moves to a small standalone interval module in
the worker service.
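
One possible shape for that standalone interval module is sketched below. Everything here is hypothetical: `sweepExpired` stands in for whatever deletes expired `workflow_data_store` rows, and its signature, the handle shape, and the overlap guard are assumptions rather than the existing poller's behavior.

```typescript
// Hypothetical sweep callback: resolves with the number of rows removed.
type SweepFn = () => Promise<number>;

interface SweepHandle {
  runOnce(): Promise<number | null>; // null when a sweep is already in flight
  stop(): void;
}

function startDataStoreExpirySweep(sweepExpired: SweepFn, intervalMs: number): SweepHandle {
  let inFlight = false;

  const runOnce = async (): Promise<number | null> => {
    if (inFlight) return null; // skip ticks that overlap a slow sweep
    inFlight = true;
    try {
      return await sweepExpired();
    } finally {
      inFlight = false;
    }
  };

  const timer = setInterval(() => {
    // A failed sweep is logged and retried on the next tick.
    runOnce().catch((err) => console.error("workflow_data_store sweep failed", err));
  }, intervalMs);

  return { runOnce, stop: () => clearInterval(timer) };
}
```

The worker entry point would call `startDataStoreExpirySweep` once at startup and `stop()` during shutdown, independent of the Temporal worker's lifecycle.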
### Server actions and event routing
Deleted, including API routes and UI entry points:
- `retryWorkflowRunAction` (`POST /api/workflow-runs/[runId]/retry`)
- `resumeWorkflowRunAction` (`POST /api/workflow-runs/[runId]/resume`)
- `requeueWorkflowRunEventWaitAction` (`POST /api/workflow-runs/[runId]/requeue`)
- `assertLegacyRunControlSupported` / `throwUnsupportedTemporalRunControlAction`

Simplified to the Temporal path only (engine branches removed):
- `cancelWorkflowRunAction`
- `resumeWorkflowRunFromQuotaPauseAction` (always signals quota resume)
- `submitWorkflowEventAction`, along with the event-stream worker, which
loses its `engine !== 'temporal'` skip
- `server/src/lib/jobs/handlers/workflowQuotaResumeScanHandler.ts`

Replay (`replayWorkflowRunAction`) remains the operator recovery path for
failed runs: a fresh Temporal run with the original payload.
### DB interpreter deletion
The live Temporal path executes runs through the interpreter in
`ee/temporal-workflows/src/workflows/` plus per-step activities; it never
calls `WorkflowRuntimeV2.executeRun`. The only `executeRun` callers are the
legacy actions, the DB pollers, the quota handler's else-branch, and the
exported-but-never-invoked `executeWorkflowRuntimeV2Run` activity. With those
gone, the following are deleted from
`shared/workflow/runtime/runtime/workflowRuntimeV2.ts`:
- `executeRun`, `acquireRunnableRun`, `resumeRunFromEvent`,
`resumeRunFromTimeout`
- the private step/action executor loop, `loadEnvelope`, `persistSnapshot`
- `executeWorkflowRuntimeV2Run` in
`ee/temporal-workflows/src/activities/workflow-runtime-v2-activities.ts`

`startRun` stays: the launcher and the child-run activity
(`startWorkflowRuntimeV2ChildRun`) both use it.

Kept as read-only history: the `engine` column and its `'db'` value, the
`workflow_run_snapshots` table and `WorkflowRunSnapshotModelV2` (Run Studio
still renders snapshots on pre-cutover runs; bulk-delete and tenant-deletion
cleanup still reference the table). Nothing writes new snapshots.
### Run Studio UI
`WorkflowRunDetailsPanel` drops the Retry, Resume, and Requeue Event buttons
and handlers; Replay, Cancel, and Export remain. Replay changes in two ways:
1. The payload is submitted only when the operator edits the pre-filled JSON.
Unedited replays send no payload, and the server falls back to the
original, unredacted `input_json`.
2. On success the UI navigates to the new run instead of staying on the old
one.
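
The payload-dirty rule in point 1 could be sketched as follows. The request shape and function name are hypothetical, and comparing parsed JSON (so that whitespace-only changes do not count as edits) is one possible definition of "edited", not necessarily the one the panel will use.

```typescript
// Hypothetical request shape: payload is omitted for unedited replays.
interface ReplayRequest {
  runId: string;
  payload?: unknown;
}

function buildReplayRequest(runId: string, prefilledJson: string, editorJson: string): ReplayRequest {
  // JSON.parse throws on invalid input; a real dialog would validate first.
  const prefilled: unknown = JSON.parse(prefilledJson);
  const edited: unknown = JSON.parse(editorJson);

  // Compare normalized forms so reformatting alone is not treated as an edit.
  const dirty = JSON.stringify(prefilled) !== JSON.stringify(edited);

  // Unedited: send no payload, so the server uses the unredacted input_json.
  return dirty ? { runId, payload: edited } : { runId };
}
```

Note that `JSON.stringify` preserves key order, so reordering keys counts as an edit under this sketch. That errs on the side of sending what the operator typed.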
### Stuck-run visibility
The run details panel shows a warning banner when a run is `RUNNING` with
zero steps and `started_at` is more than ~60 seconds old: the run is queued
and no worker has picked it up. This uses data the panel already fetches.
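
The banner condition reduces to a small predicate. In this sketch the `RunSummary` shape, the function name, and the exact 60-second constant are assumptions; the text above only fixes the three inputs (status, step count, `started_at`) and an approximate threshold.

```typescript
// Hypothetical subset of the run data the panel already fetches.
interface RunSummary {
  status: string;
  started_at: string; // ISO-8601 timestamp
}

const STUCK_RUN_THRESHOLD_MS = 60000; // "~60 seconds" in the design text

function isRunLikelyStuck(run: RunSummary, stepCount: number, now: Date = new Date()): boolean {
  if (run.status !== "RUNNING" || stepCount > 0) return false;

  const startedAt = Date.parse(run.started_at);
  if (Number.isNaN(startedAt)) return false; // unparseable timestamp: no banner

  return now.getTime() - startedAt > STUCK_RUN_THRESHOLD_MS;
}
```

Passing `now` as a parameter keeps the predicate deterministic in unit tests for the banner rendering.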
### Data migration
A migration finalizes stranded non-Temporal runs: rows with `engine` null or
`'db'` still in `RUNNING`/`WAITING` get `status='CANCELED'`,
`completed_at=now()`, and an `error_json` note that the DB execution engine
was removed; their open `workflow_run_waits` are resolved. Terminal
historical rows are untouched.
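
The row-level rule the migration applies can be expressed as a pure function. This is a model only: the real migration would run as SQL against `workflow_runs`, the `error_json` shape shown is an assumption, and resolving open `workflow_run_waits` is not modeled here.

```typescript
// Hypothetical subset of a workflow_runs row.
interface WorkflowRunRow {
  engine: string | null;
  status: string;
  completed_at: Date | null;
  error_json: unknown;
}

function finalizeIfStranded(row: WorkflowRunRow, now: Date): WorkflowRunRow {
  const nonTemporal = row.engine === null || row.engine === "db";
  const nonTerminal = row.status === "RUNNING" || row.status === "WAITING";

  // Temporal runs and already-terminal historical rows are left untouched.
  if (!nonTemporal || !nonTerminal) return row;

  return {
    ...row,
    status: "CANCELED",
    completed_at: now,
    // Assumed shape for the note; the design only requires that a note exists.
    error_json: { message: "Canceled: the DB execution engine was removed." },
  };
}
```

The migration coverage listed under Testing could assert exactly these cases: stranded rows are finalized, everything else comes back unchanged.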
### Deployment
Temporal services merge from `docker-compose.temporal.ee.yaml` into
`docker-compose.ee.yaml` and the overlay file is deleted — an EE stack
without Temporal is not a valid deployment. The playwright workflow-deps
compose moves from DB polling to the Temporal stack. All engine env flags
disappear from compose files.
### Testing
- Delete `workflowEngineReferenceWorkflows.db.test.ts`; its coverage lives in
the Temporal interpreter suite
(`workflow-runtime-v2-run-workflow.test.ts`).
- Rewrite or delete the e2e/integration tests that drive
`WorkflowRuntimeV2Worker`/`executeRun` directly
(`server/src/test/e2e/workflowRuntimeV2.e2e.test.ts`,
`server/src/test/integration/workflowRuntimeV2.control.integration.test.ts`,
`server/src/test/integration/workflowRuntimeV2.publish.integration.test.ts`),
roughly 60 to 80 cases. Temporal-specific cases in those files survive.
- Remove flag-toggling cases from the launcher and worker-startup unit tests.
- Add: launcher always-temporal assertions, migration coverage, replay
payload-dirty logic, stuck-run banner rendering.