Hermes 284313f908
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

# PRD: Temporal-only workflow engine
Design: [`../2026-06-09-temporal-only-workflow-engine-design.md`](../2026-06-09-temporal-only-workflow-engine-design.md)

Branch: `fix/workflow-replay-noop`
## Problem
Workflow runtime V2 ships two execution engines — Temporal and a legacy
DB-polling engine — selected per-process by environment flags. The API server
and the workflow worker can disagree about which engine is live, and the
Temporal producer and worker can resolve different task queues. Either
mismatch strands new runs: a `workflow_runs` row is created, stays `RUNNING`
with zero steps, and the operator experiences the Run/Replay button as doing
nothing. A customer reported exactly this (with video); internal testing on a
correctly configured stack could not reproduce it.
Adjacent defects in the same surface: the Run Studio shows Retry/Resume/
Requeue buttons that hard-fail with 409 on Temporal runs; Replay submits the
redacted `input_json` as an explicit payload override when redaction is
configured; and a queued-but-unworked run gives no visual indication that
anything is wrong.
## User value
Operators get a run/replay surface where every visible control works on every
run, misconfiguration can no longer silently strand runs, and a genuinely
stuck run announces itself instead of impersonating a healthy one.
## Goals
1. Temporal is the only execution engine; no flag combination can produce a
run that nothing executes.
2. One task queue constant shared by producer and worker; no env override.
3. Legacy DB-only run controls (retry/resume/requeue) and the dead DB
interpreter are deleted, not stranded as dead code.
4. Replay never submits redacted payloads and lands the operator on the new
run.
5. Runs that are queued but unworked for >60s are visibly flagged.
6. Stranded non-Temporal runs in existing databases are finalized honestly
(CANCELED with an explanatory error).
7. Every EE compose stack includes Temporal; the overlay file disappears.
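
Goal 2 can be illustrated with a minimal sketch in which producer and worker both derive their options from one constant. The module layout, the constant's name and value, and the `workflow-run-` ID prefix are assumptions made for illustration, not names taken from the codebase.

```typescript
// Hypothetical shared contract module. In a real repo this constant would be
// exported from one package that both the API server and the worker import.
const WORKFLOW_TASK_QUEUE = "workflow-runtime-v2"; // assumed value

// Minimal shapes standing in for the producer's start options and the
// worker's options; only the fields relevant to queue selection are shown.
interface StartOptions {
  taskQueue: string;
  workflowId: string;
}

interface WorkerOptions {
  taskQueue: string;
}

// Producer side: every start request targets the shared constant.
function buildStartOptions(runId: string): StartOptions {
  return {
    taskQueue: WORKFLOW_TASK_QUEUE,
    workflowId: `workflow-run-${runId}`, // assumed ID scheme
  };
}

// Worker side: polls the same constant. There is deliberately no
// process.env lookup, so the two sides cannot drift apart.
function buildWorkerOptions(): WorkerOptions {
  return { taskQueue: WORKFLOW_TASK_QUEUE };
}
```

Because neither function reads the environment, a queue mismatch would require a code change rather than a deployment setting.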
## Non-goals
- Temporal-native retry-from-failed-step (Replay is the recovery path).
- Re-launching historical stranded runs on Temporal.
- Deleting the `engine` column, `'db'` historical values, or the
`workflow_run_snapshots` table (kept for reading pre-cutover runs).
- Refactoring the Temporal interpreter/activities themselves.
- New monitoring/alerting beyond the in-app stuck-run banner.
## Primary flows
1. **Run now**: designer or runs-list → Run dialog → `startWorkflowRunAction`
→ `launchPublishedWorkflowRun` → always `engine='temporal'`, always
`client.workflow.start` on the contract task queue → navigate to run page.
2. **Replay**: run details → Replay → unedited payload sends nothing (server
uses original unredacted `input_json`); edited payload sends the edit →
new Temporal run → UI navigates to it.
3. **Cancel / quota resume**: always the Temporal signal path.
4. **Stuck run**: run page shows "queued, waiting for worker" banner when
`RUNNING` + zero steps + age > 60s.
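
The Replay flow above can be sketched as a pair of pure functions: the client attaches a payload only when the operator changed it, and the server falls back to the stored unredacted input otherwise. All type and function names here (`ReplayRequest`, `buildReplayRequest`, `resolveReplayInput`) are hypothetical, and the edit detection by JSON comparison is a simplification.

```typescript
type Json = Record<string, unknown>;

interface ReplayRequest {
  // Present only when the operator edited the payload in the Replay dialog.
  editedPayload?: Json;
}

interface OriginalRun {
  // Unredacted input as stored server-side (the original `input_json`).
  inputJson: Json;
}

// Client side: compare what was displayed (possibly redacted) with what the
// operator is submitting. An unchanged payload is not sent at all.
// Note: JSON.stringify comparison is key-order sensitive; a real
// implementation might track a "dirty" flag in the editor instead.
function buildReplayRequest(displayed: Json, submitted: Json): ReplayRequest {
  const wasEdited = JSON.stringify(displayed) !== JSON.stringify(submitted);
  return wasEdited ? { editedPayload: submitted } : {};
}

// Server side: an explicit edit wins; otherwise reuse the original input,
// so redacted placeholders never become the new run's payload.
function resolveReplayInput(request: ReplayRequest, original: OriginalRun): Json {
  return request.editedPayload !== undefined
    ? request.editedPayload
    : original.inputJson;
}
```

The design choice is that redaction stays a display concern: the redacted view never travels back to the server unless the operator deliberately edits and submits it.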
## Data / migration
One Knex migration: `workflow_runs` where `engine` is null or `'db'` and
`status` in (`RUNNING`,`WAITING`) → `status='CANCELED'`,
`completed_at=now()`, `error_json` noting the temporal-only cutover; resolve
their open `workflow_run_waits`. Terminal rows untouched.
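
The selection rule and the update it applies can be sketched as pure functions over a row shape. This is not the Knex migration itself: the row interface, the `run_id` column, and the `error_json` contents are assumptions, and resolving `workflow_run_waits` is left out.

```typescript
// Assumed subset of the `workflow_runs` columns relevant to the cutover.
interface WorkflowRunRow {
  run_id: string; // hypothetical primary key name
  engine: string | null;
  status: string;
  completed_at: Date | null;
  error_json: unknown;
}

// A run is stranded when no Temporal worker will ever pick it up:
// engine is null or 'db', and the status is still non-terminal.
function isStrandedRun(row: WorkflowRunRow): boolean {
  const legacyEngine = row.engine === null || row.engine === "db";
  const nonTerminal = row.status === "RUNNING" || row.status === "WAITING";
  return legacyEngine && nonTerminal;
}

// Returns the finalized row for stranded runs and the same object,
// untouched, for terminal rows and Temporal runs.
function finalizeStrandedRun(row: WorkflowRunRow, now: Date): WorkflowRunRow {
  if (!isStrandedRun(row)) {
    return row;
  }
  return {
    ...row,
    status: "CANCELED",
    completed_at: now,
    // Assumed error shape; the source only says it notes the cutover.
    error_json: {
      message: "Run canceled by the temporal-only cutover: the legacy DB engine was removed.",
    },
  };
}
```

In the real migration the same predicate would be expressed as a `WHERE` clause so the update runs in the database rather than row by row in application code.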
## Risks
- **Test rewrite breadth**: ~60 to 80 e2e/integration cases drive the DB
executor directly; rewrites must preserve coverage of run semantics, not
just delete it. Mitigation: the ~50-case Temporal interpreter suite already
covers step semantics; migrate only control-surface tests.
- **Hidden DB-engine dependents**: any deployment or script still setting
`WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING=false` will now run Temporal
regardless. Compose defaults are updated in this change; external/infra
repos must be swept separately.
- **Migration on large tables**: status update is bounded by an index on
status; affected rows expected to be few.
## Acceptance criteria
- `grep -r WORKFLOW_RUNTIME_V2_ENABLE` over the repo returns nothing
(code, compose, tests).
- No code path writes `engine` other than `'temporal'`.
- Producer and worker both compile against the single task-queue constant;
the env var is gone.
- Retry/Resume/Requeue buttons, actions, and routes are gone; Replay, Cancel,
Export remain and work on Temporal runs.
- Unedited Replay of a redaction-configured run executes with the original
payload (verified by run input of the new run).
- Replay navigates to the new run.
- A run with zero steps older than 60s renders the queued-warning banner.
- Migration cancels a seeded stranded run and resolves its waits; leaves
terminal rows and Temporal runs untouched.
- `docker compose -f docker-compose.ee.yaml up` brings up a stack in which a
manually started workflow executes end to end.
- Full test suite green.
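
The queued-warning criterion above (a `RUNNING` run with zero steps and an age over 60s) reduces to a small predicate. The sketch below assumes a hypothetical `RunSummary` shape and that age is measured from the run's start time; the actual field names are not given in this document.

```typescript
// 60 seconds, per the "queued but unworked for >60s" goal.
const STUCK_THRESHOLD_MS = 60 * 1000;

// Hypothetical view model for the run page.
interface RunSummary {
  status: string;
  stepCount: number;
  startedAt: Date;
}

// True when the run page should show the "queued, waiting for worker" banner.
// The comparison is strict: a run exactly 60s old is not yet flagged.
function isQueuedButUnworked(summary: RunSummary, now: Date): boolean {
  const ageMs = now.getTime() - summary.startedAt.getTime();
  return (
    summary.status === "RUNNING" &&
    summary.stepCount === 0 &&
    ageMs > STUCK_THRESHOLD_MS
  );
}
```

Passing `now` in explicitly keeps the predicate deterministic, which makes the acceptance criterion straightforward to cover in a unit test.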