Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
102 lines
4.8 KiB
Markdown
102 lines
4.8 KiB
Markdown
# PRD: Temporal-only workflow engine
|
||
|
||
Design: [`../2026-06-09-temporal-only-workflow-engine-design.md`](../2026-06-09-temporal-only-workflow-engine-design.md)
|
||
Branch: `fix/workflow-replay-noop`
|
||
|
||
## Problem
|
||
|
||
Workflow runtime V2 ships two execution engines — Temporal and a legacy
|
||
DB-polling engine — selected per-process by environment flags. The API server
|
||
and the workflow worker can disagree about which engine is live, and the
|
||
Temporal producer and worker can resolve different task queues. Either
|
||
mismatch strands new runs: a `workflow_runs` row is created, stays `RUNNING`
|
||
with zero steps, and the operator experiences the Run/Replay button as doing
|
||
nothing. A customer reported exactly this (with video); internal testing on a
|
||
correctly-configured stack could not reproduce it.
|
||
|
||
Adjacent defects in the same surface: the Run Studio shows Retry/Resume/
|
||
Requeue buttons that hard-fail with 409 on Temporal runs; Replay submits the
|
||
redacted `input_json` as an explicit payload override when redaction is
|
||
configured; and a queued-but-unworked run gives no visual indication that
|
||
anything is wrong.
|
||
|
||
## User value
|
||
|
||
Operators get a run/replay surface where every visible control works on every
|
||
run, misconfiguration can no longer silently strand runs, and a genuinely
|
||
stuck run announces itself instead of impersonating a healthy one.
|
||
|
||
## Goals
|
||
|
||
1. Temporal is the only execution engine; no flag combination can produce a
|
||
run that nothing executes.
|
||
2. One task queue constant shared by producer and worker; no env override.
|
||
3. Legacy DB-only run controls (retry/resume/requeue) and the dead DB
|
||
interpreter are deleted, not stranded as dead code.
|
||
4. Replay never submits redacted payloads and lands the operator on the new
|
||
run.
|
||
5. Runs that are queued but unworked for >60s are visibly flagged.
|
||
6. Stranded non-Temporal runs in existing databases are finalized honestly
|
||
(CANCELED with an explanatory error).
|
||
7. Every EE compose stack includes Temporal; the overlay file disappears.
|
||
|
||
## Non-goals
|
||
|
||
- Temporal-native retry-from-failed-step (Replay is the recovery path).
|
||
- Re-launching historical stranded runs on Temporal.
|
||
- Deleting the `engine` column, `'db'` historical values, or the
|
||
`workflow_run_snapshots` table (kept for reading pre-cutover runs).
|
||
- Refactoring the Temporal interpreter/activities themselves.
|
||
- New monitoring/alerting beyond the in-app stuck-run banner.
|
||
|
||
## Primary flows
|
||
|
||
1. **Run now**: designer or runs-list → Run dialog → `startWorkflowRunAction`
|
||
→ `launchPublishedWorkflowRun` → always `engine='temporal'`, always
|
||
`client.workflow.start` on the contract task queue → navigate to run page.
|
||
2. **Replay**: run details → Replay → unedited payload sends nothing (server
|
||
uses original unredacted `input_json`); edited payload sends the edit →
|
||
new Temporal run → UI navigates to it.
|
||
3. **Cancel / quota resume**: always the Temporal signal path.
|
||
4. **Stuck run**: run page shows "queued, waiting for worker" banner when
|
||
`RUNNING` + zero steps + age > 60s.
|
||
|
||
## Data / migration
|
||
|
||
One Knex migration: `workflow_runs` where `engine` is null or `'db'` and
|
||
`status` in (`RUNNING`,`WAITING`) → `status='CANCELED'`,
|
||
`completed_at=now()`, `error_json` noting the temporal-only cutover; resolve
|
||
their open `workflow_run_waits`. Terminal rows untouched.
|
||
|
||
## Risks
|
||
|
||
- **Test rewrite breadth**: ~60–80 e2e/integration cases drive the DB
|
||
executor directly; rewrites must preserve coverage of run semantics, not
|
||
just delete it. Mitigation: the ~50-case Temporal interpreter suite already
|
||
covers step semantics; migrate only control-surface tests.
|
||
- **Hidden DB-engine dependents**: any deployment or script still setting
|
||
`WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING=false` will now run Temporal
|
||
regardless. Compose defaults are updated in this change; external/infra
|
||
repos must be swept separately.
|
||
- **Migration on large tables**: status update is bounded by an index on
|
||
status; affected rows expected to be few.
|
||
|
||
## Acceptance criteria
|
||
|
||
- `grep -r WORKFLOW_RUNTIME_V2_ENABLE` over the repo returns nothing
|
||
(code, compose, tests).
|
||
- No code path writes `engine` other than `'temporal'`.
|
||
- Producer and worker both compile against the single task-queue constant;
|
||
the env var is gone.
|
||
- Retry/Resume/Requeue buttons, actions, and routes are gone; Replay, Cancel,
|
||
Export remain and work on Temporal runs.
|
||
- Unedited Replay of a redaction-configured run executes with the original
|
||
payload (verified by run input of the new run).
|
||
- Replay navigates to the new run.
|
||
- A run with zero steps older than 60s renders the queued-warning banner.
|
||
- Migration cancels a seeded stranded run and resolves its waits; leaves
|
||
terminal rows and Temporal runs untouched.
|
||
- `docker compose -f docker-compose.ee.yaml` brings up a stack where a
|
||
manually started workflow executes end to end.
|
||
- Full test suite green.
|