# Scratchpad: temporal-only workflow engine

## Origin

Customer video: Run button "did nothing"; internal testing couldn't reproduce.
Investigation concluded the UI path always toasts/navigates/disables — the
silent failure modes are downstream (run created, nothing executes it).
Design: `../2026-06-09-temporal-only-workflow-engine-design.md`.

## Key discoveries (verified against code, 2026-06-09)

- Producer hardcodes the task queue (`workflowRuntimeV2Temporal.ts:42-46`);
  worker honors `WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE`
  (`WorkflowRuntimeV2TemporalWorker.ts:61`). Override ⇒ split brain.
- Server defaults engine to temporal unless
  `WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING` is falsy
  (`workflowRunLauncher.ts:20-22`); worker DB poller is opt-in via
  `WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING` (`index.ts:62-92`). Mismatch ⇒
  stranded `engine='db'` runs.
- `executeWorkflowRuntimeV2Run` (activities:36) is exported but never invoked
  by any workflow or worker registration — the live Temporal path is the
  interpreter + per-step activities. Hence the whole DB interpreter
  (`executeRun` & co.) is orphaned once legacy actions/workers go.
- Temporal activities never write `workflow_run_snapshots` (grep: zero
  snapshot references in activities). Snapshots are DB-interpreter artifacts;
  keep tables/models for historical reads only.
- TWO DB poller copies: `services/workflow-worker/src/v2/` (stripped) and
  `shared/workflow/workers/` (canonical; also sweeps `workflow_data_store`
  expiry — must be relocated, it is not engine work).
- Legacy controls hard-409 on temporal runs via
  `assertLegacyRunControlSupported` (actions:1259), but the panel shows
  Retry for ANY FAILED run (`canRetry`, panel:906) — guaranteed 409 trap.
- Replay pre-fills payload from `getWorkflowRunAction`'s **redacted**
  `input_json` (actions:2326-2332; panel:517-525) and always submits it
  (`hasExplicitReplayPayload`, actions:3253) ⇒ replays run with `[REDACTED]`
  placeholders when redaction configured.
- Replay success only toasts and refreshes the OLD run (panel:1002-1004) —
  no navigation to the new run. Big contributor to "did nothing" perception.
- `docker-compose.ee.yaml:240` sets TEMPORAL_POLLING=false — base EE compose
  relies on the DB engine unless `docker-compose.temporal.ee.yaml` overlay is
  applied. Appliance flux profile sets neither flag (defaults = temporal).
- Toasts are fine: `ThemedToaster` mounted in root layout at zIndex 999999,
  above dialog z-70. UI can't silently swallow errors.

## Decisions (Robert, 2026-06-09)

1. Scope: full engine removal (not config-only, not interpreter rewrite).
2. Retry/Resume/Requeue deleted; Replay is the operator recovery story.
3. Migration cancels stranded non-temporal RUNNING/WAITING runs.
4. Bundle the redacted-replay-payload fix and the stuck-run banner.
5. Merge Temporal services into base `docker-compose.ee.yaml`; delete overlay.

## Open questions / watch-outs for implementation

- `launchPublishedWorkflowRun`'s `execute?: boolean` param: check remaining
  callers (schedules?) before assuming always-start; `execute:false` + db
  engine used to mean "poller will pick it up" — that semantic dies with the
  poller.
- `workflowRunStartLimiter` and concurrency checks are duplicated between
  `startWorkflowRunAction` and `launchPublishedWorkflowRun` — possible
  follow-up simplification, out of scope here.
- i18n: removing panel buttons orphans `runDetails.actions.retry/resume/
  requeueEvent` + dialog keys across locale files — sweep them.
- e2e rewrite needs a Temporal test target: check what
  `WorkflowRuntimeV2TemporalWorker.integration.test.ts` uses (likely
  TestWorkflowEnvironment) and reuse the harness.
- External infra repos may still set the deleted env flags — harmless after
  removal (code ignores them), but sweep separately.
- Stranded-run migration: also check `workflow_run_waits` rows whose run is
  being canceled — resolve with a status that the run studio renders sanely.

## Implementation notes (2026-06-09)

- `launchPublishedWorkflowRun`'s `execute` flag was only ever passed as `true`
  (4 call sites) — removed along with the engine ternary. `executionKey` is
  genuinely used (schedules, webhooks, event launch) and stays.
- `WorkflowRuntimeV2` is now only the run-row projection writer (`startRun`);
  the whole DB interpreter (~1,300 lines) was deleted after confirming the
  Temporal interpreter never called it (`executeWorkflowRuntimeV2Run` was
  exported but never registered/invoked).
- `services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts` was already
  unreferenced (index.ts imported the shared copy) — both deleted.
- Bulk Resume in `WorkflowRunList` used the legacy resume action — removed
  with the per-run buttons; bulk Cancel stays.
- Replay payload-dirty detection compares the textarea string against the
  pre-filled pristine string (ref) — exact-match is sufficient because
  untouched textareas don't reformat.
- `server/src/test/unit/workflowRunLauncher.unit.test.ts` fails to LOAD even
  unmodified (pre-existing: `@alga-psa/db/workDate` unresolvable through
  `importOriginal` of the runtime in this worktree). Updated for the new API
  anyway; failure is environmental, also see quota test's stale `tenant_id`
  field which suggests these server unit tests aren't in the active CI gate.
- Playwright workflow suites: host-run server needs `TEMPORAL_ADDRESS`
  pointing at the new `temporal-playwright` service (host port 17233 via
  `PLAYWRIGHT_TEMPORAL_PORT`) for replay/run-start flows to function.
- ee/server full typecheck needs `NODE_OPTIONS=--max-old-space-size=12288`.

## Remaining follow-ups (tests.json items still false)

- Migration coverage (T046–T049): the stranded-run migration has no automated
  test; needs the DB-backed integration harness.
- Stuck-run banner (T042–T045) and removed-button absence (T033–T035): no
  component tests written; behavior is hand-verifiable in the run studio.
- Replay UI dirty-detection (T036/T039): server-side contract is covered;
  the client-side "send nothing when unedited" path is not unit-tested.
- Full-stack runs (T030/T050/T052): need a live compose stack with Temporal;
  `docker compose -f docker-compose.ee.yaml up` then start a manual run.
- DB-backed integration suites (control/publish/e2e) were verified by
  typecheck + vitest collection only in this worktree (no DB available);
  CI run pending.
- EventStreamWorker vitest suite (7 cases) fails in this worktree on real
  Redis/mock-resolution grounds — confirmed identical failure on pre-change
  code; not a regression.

## Commands

- Find engine references: `grep -rn "engine.*'db'\|'db'.*engine" --include="*.ts" shared ee services server | grep -v node_modules`
- Flag sweep: `grep -rn "WORKFLOW_RUNTIME_V2_ENABLE" --include="*" . | grep -v node_modules | grep -v docs/plans`
- Stranded runs (prod triage): `select run_id, status, engine, started_at from workflow_runs where (engine is null or engine='db') and status in ('RUNNING','WAITING');`