Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
120 lines
6.9 KiB
Markdown
120 lines
6.9 KiB
Markdown
# Scratchpad: temporal-only workflow engine
|
||
|
||
## Origin
|
||
|
||
Customer video: Run button "did nothing"; internal testing couldn't reproduce.
|
||
Investigation concluded the UI path always toasts/navigates/disables — the
|
||
silent failure modes are downstream (run created, nothing executes it).
|
||
Design: `../2026-06-09-temporal-only-workflow-engine-design.md`.
|
||
|
||
## Key discoveries (verified against code, 2026-06-09)
|
||
|
||
- Producer hardcodes the task queue (`workflowRuntimeV2Temporal.ts:42-46`);
|
||
worker honors `WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE`
|
||
(`WorkflowRuntimeV2TemporalWorker.ts:61`). Override ⇒ split brain.
|
||
- Server defaults engine to temporal unless
|
||
`WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING` is falsy
|
||
(`workflowRunLauncher.ts:20-22`); worker DB poller is opt-in via
|
||
`WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING` (`index.ts:62-92`). Mismatch ⇒
|
||
stranded `engine='db'` runs.
|
||
- `executeWorkflowRuntimeV2Run` (activities:36) is exported but never invoked
|
||
by any workflow or worker registration — the live Temporal path is the
|
||
interpreter + per-step activities. Hence the whole DB interpreter
|
||
(`executeRun` & co.) is orphaned once legacy actions/workers go.
|
||
- Temporal activities never write `workflow_run_snapshots` (grep: zero
|
||
snapshot references in activities). Snapshots are DB-interpreter artifacts;
|
||
keep tables/models for historical reads only.
|
||
- TWO DB poller copies: `services/workflow-worker/src/v2/` (stripped) and
|
||
`shared/workflow/workers/` (canonical; also sweeps `workflow_data_store`
|
||
expiry — must be relocated, it is not engine work).
|
||
- Legacy controls hard-409 on temporal runs via
|
||
`assertLegacyRunControlSupported` (actions:1259), but the panel shows
|
||
Retry for ANY FAILED run (`canRetry`, panel:906) — guaranteed 409 trap.
|
||
- Replay pre-fills payload from `getWorkflowRunAction`'s **redacted**
|
||
`input_json` (actions:2326-2332; panel:517-525) and always submits it
|
||
(`hasExplicitReplayPayload`, actions:3253) ⇒ replays run with `[REDACTED]`
|
||
placeholders when redaction configured.
|
||
- Replay success only toasts and refreshes the OLD run (panel:1002-1004) —
|
||
no navigation to the new run. Big contributor to "did nothing" perception.
|
||
- `docker-compose.ee.yaml:240` sets TEMPORAL_POLLING=false — base EE compose
|
||
relies on the DB engine unless `docker-compose.temporal.ee.yaml` overlay is
|
||
applied. Appliance flux profile sets neither flag (defaults = temporal).
|
||
- Toasts are fine: `ThemedToaster` mounted in root layout at zIndex 999999,
|
||
above dialog z-70. UI can't silently swallow errors.
|
||
|
||
## Decisions (Robert, 2026-06-09)
|
||
|
||
1. Scope: full engine removal (not config-only, not interpreter rewrite).
|
||
2. Retry/Resume/Requeue deleted; Replay is the operator recovery story.
|
||
3. Migration cancels stranded non-temporal RUNNING/WAITING runs.
|
||
4. Bundle the redacted-replay-payload fix and the stuck-run banner.
|
||
5. Merge Temporal services into base `docker-compose.ee.yaml`; delete overlay.
|
||
|
||
## Open questions / watch-outs for implementation
|
||
|
||
- `launchPublishedWorkflowRun`'s `execute?: boolean` param: check remaining
|
||
callers (schedules?) before assuming always-start; `execute:false` + db
|
||
engine used to mean "poller will pick it up" — that semantic dies with the
|
||
poller.
|
||
- `workflowRunStartLimiter` and concurrency checks are duplicated between
|
||
`startWorkflowRunAction` and `launchPublishedWorkflowRun` — possible
|
||
follow-up simplification, out of scope here.
|
||
- i18n: removing panel buttons orphans `runDetails.actions.retry/resume/
|
||
requeueEvent` + dialog keys across locale files — sweep them.
|
||
- e2e rewrite needs a Temporal test target: check what
|
||
`WorkflowRuntimeV2TemporalWorker.integration.test.ts` uses (likely
|
||
TestWorkflowEnvironment) and reuse the harness.
|
||
- External infra repos may still set the deleted env flags — harmless after
|
||
removal (code ignores them), but sweep separately.
|
||
- Stranded-run migration: also check `workflow_run_waits` rows whose run is
|
||
being canceled — resolve with a status that the run studio renders sanely.
|
||
|
||
## Implementation notes (2026-06-09)
|
||
|
||
- `launchPublishedWorkflowRun`'s `execute` flag was only ever passed as `true`
|
||
(4 call sites) — removed along with the engine ternary. `executionKey` is
|
||
genuinely used (schedules, webhooks, event launch) and stays.
|
||
- `WorkflowRuntimeV2` is now only the run-row projection writer (`startRun`);
|
||
the whole DB interpreter (~1,300 lines) was deleted after confirming the
|
||
Temporal interpreter never called it (`executeWorkflowRuntimeV2Run` was
|
||
exported but never registered/invoked).
|
||
- `services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts` was already
|
||
unreferenced (index.ts imported the shared copy) — both deleted.
|
||
- Bulk Resume in `WorkflowRunList` used the legacy resume action — removed
|
||
with the per-run buttons; bulk Cancel stays.
|
||
- Replay payload-dirty detection compares the textarea string against the
|
||
pre-filled pristine string (ref) — exact-match is sufficient because
|
||
untouched textareas don't reformat.
|
||
- `server/src/test/unit/workflowRunLauncher.unit.test.ts` fails to LOAD even
|
||
unmodified (pre-existing: `@alga-psa/db/workDate` unresolvable through
|
||
`importOriginal` of the runtime in this worktree). Updated for the new API
|
||
anyway; failure is environmental, also see quota test's stale `tenant_id`
|
||
field which suggests these server unit tests aren't in the active CI gate.
|
||
- Playwright workflow suites: host-run server needs `TEMPORAL_ADDRESS`
|
||
pointing at the new `temporal-playwright` service (host port 17233 via
|
||
`PLAYWRIGHT_TEMPORAL_PORT`) for replay/run-start flows to function.
|
||
- ee/server full typecheck needs `NODE_OPTIONS=--max-old-space-size=12288`.
|
||
|
||
## Remaining follow-ups (tests.json items still false)
|
||
|
||
- Migration coverage (T046–T049): the stranded-run migration has no automated
|
||
test; needs the DB-backed integration harness.
|
||
- Stuck-run banner (T042–T045) and removed-button absence (T033–T035): no
|
||
component tests written; behavior is hand-verifiable in the run studio.
|
||
- Replay UI dirty-detection (T036/T039): server-side contract is covered;
|
||
the client-side "send nothing when unedited" path is not unit-tested.
|
||
- Full-stack runs (T030/T050/T052): need a live compose stack with Temporal;
|
||
`docker compose -f docker-compose.ee.yaml up` then start a manual run.
|
||
- DB-backed integration suites (control/publish/e2e) were verified by
|
||
typecheck + vitest collection only in this worktree (no DB available);
|
||
CI run pending.
|
||
- EventStreamWorker vitest suite (7 cases) fails in this worktree on real
|
||
Redis/mock-resolution grounds — confirmed identical failure on pre-change
|
||
code; not a regression.
|
||
|
||
## Commands
|
||
|
||
- Find engine references: `grep -rn "engine.*'db'\|'db'.*engine" --include="*.ts" shared ee services server | grep -v node_modules`
|
||
- Flag sweep: `grep -rn "WORKFLOW_RUNTIME_V2_ENABLE" --include="*" . | grep -v node_modules | grep -v docs/plans`
|
||
- Stranded runs (prod triage): `select run_id, status, engine, started_at from workflow_runs where (engine is null or engine='db') and status in ('RUNNING','WAITING');`
|