Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
4.8 KiB
PRD: Temporal-only workflow engine
Design: ../2026-06-09-temporal-only-workflow-engine-design.md
Branch: fix/workflow-replay-noop
Problem
Workflow runtime V2 ships two execution engines — Temporal and a legacy
DB-polling engine — selected per-process by environment flags. The API server
and the workflow worker can disagree about which engine is live, and the
Temporal producer and worker can resolve different task queues. Either
mismatch strands new runs: a workflow_runs row is created, stays RUNNING
with zero steps, and the operator experiences the Run/Replay button as doing
nothing. A customer reported exactly this (with video); internal testing on a
correctly-configured stack could not reproduce it.
Adjacent defects in the same surface: the Run Studio shows Retry/Resume/
Requeue buttons that hard-fail with 409 on Temporal runs; Replay submits the
redacted input_json as an explicit payload override when redaction is
configured; and a queued-but-unworked run gives no visual indication that
anything is wrong.
User value
Operators get a run/replay surface where every visible control works on every run, misconfiguration can no longer silently strand runs, and a genuinely stuck run announces itself instead of impersonating a healthy one.
Goals
- Temporal is the only execution engine; no flag combination can produce a run that nothing executes.
- One task queue constant shared by producer and worker; no env override.
- Legacy DB-only run controls (retry/resume/requeue) and the dead DB interpreter are deleted, not stranded as dead code.
- Replay never submits redacted payloads and lands the operator on the new run.
- Runs that are queued but unworked for >60s are visibly flagged.
- Stranded non-Temporal runs in existing databases are finalized honestly (CANCELED with an explanatory error).
- Every EE compose stack includes Temporal; the overlay file disappears.
Non-goals
- Temporal-native retry-from-failed-step (Replay is the recovery path).
- Re-launching historical stranded runs on Temporal.
- Deleting the
enginecolumn,'db'historical values, or theworkflow_run_snapshotstable (kept for reading pre-cutover runs). - Refactoring the Temporal interpreter/activities themselves.
- New monitoring/alerting beyond the in-app stuck-run banner.
Primary flows
- Run now: designer or runs-list → Run dialog →
startWorkflowRunAction→launchPublishedWorkflowRun→ alwaysengine='temporal', alwaysclient.workflow.starton the contract task queue → navigate to run page. - Replay: run details → Replay → unedited payload sends nothing (server
uses original unredacted
input_json); edited payload sends the edit → new Temporal run → UI navigates to it. - Cancel / quota resume: always the Temporal signal path.
- Stuck run: run page shows "queued, waiting for worker" banner when
RUNNING+ zero steps + age > 60s.
Data / migration
One Knex migration: workflow_runs where engine is null or 'db' and
status in (RUNNING,WAITING) → status='CANCELED',
completed_at=now(), error_json noting the temporal-only cutover; resolve
their open workflow_run_waits. Terminal rows untouched.
Risks
- Test rewrite breadth: ~60–80 e2e/integration cases drive the DB executor directly; rewrites must preserve coverage of run semantics, not just delete it. Mitigation: the ~50-case Temporal interpreter suite already covers step semantics; migrate only control-surface tests.
- Hidden DB-engine dependents: any deployment or script still setting
WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING=falsewill now run Temporal regardless. Compose defaults are updated in this change; external/infra repos must be swept separately. - Migration on large tables: status update is bounded by an index on status; affected rows expected to be few.
Acceptance criteria
grep -r WORKFLOW_RUNTIME_V2_ENABLEover the repo returns nothing (code, compose, tests).- No code path writes
engineother than'temporal'. - Producer and worker both compile against the single task-queue constant; the env var is gone.
- Retry/Resume/Requeue buttons, actions, and routes are gone; Replay, Cancel, Export remain and work on Temporal runs.
- Unedited Replay of a redaction-configured run executes with the original payload (verified by run input of the new run).
- Replay navigates to the new run.
- A run with zero steps older than 60s renders the queued-warning banner.
- Migration cancels a seeded stranded run and resolves its waits; leaves terminal rows and Temporal runs untouched.
docker compose -f docker-compose.ee.yamlbrings up a stack where a manually started workflow executes end to end.- Full test suite green.