Scratchpad: temporal-only workflow engine
Origin
Customer video: Run button "did nothing"; internal testing couldn't reproduce.
Investigation concluded the UI path always toasts/navigates/disables — the
silent failure modes are downstream (run created, nothing executes it).
Design: ../2026-06-09-temporal-only-workflow-engine-design.md.
Key discoveries (verified against code, 2026-06-09)
- Producer hardcodes the task queue (`workflowRuntimeV2Temporal.ts:42-46`); the worker honors `WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE` (`WorkflowRuntimeV2TemporalWorker.ts:61`). Override ⇒ split brain: the producer enqueues on one queue while the worker polls another.
- Server defaults the engine to `temporal` unless `WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING` is falsy (`workflowRunLauncher.ts:20-22`); the worker DB poller is opt-in via `WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING` (`index.ts:62-92`). Mismatch ⇒ stranded `engine='db'` runs.
- `executeWorkflowRuntimeV2Run` (`activities:36`) is exported but never invoked by any workflow or worker registration; the live Temporal path is the interpreter + per-step activities. Hence the whole DB interpreter (`executeRun` & co.) is orphaned once legacy actions/workers go.
- Temporal activities never write `workflow_run_snapshots` (grep: zero snapshot references in activities). Snapshots are DB-interpreter artifacts; keep tables/models for historical reads only.
- TWO DB poller copies: `services/workflow-worker/src/v2/` (stripped) and `shared/workflow/workers/` (canonical; also sweeps `workflow_data_store` expiry, which must be relocated because it is not engine work).
- Legacy controls hard-409 on temporal runs via `assertLegacyRunControlSupported` (`actions:1259`), but the panel shows Retry for ANY FAILED run (`canRetry`, `panel:906`): a guaranteed 409 trap.
- Replay pre-fills the payload from `getWorkflowRunAction`'s redacted `input_json` (`actions:2326-2332`; `panel:517-525`) and always submits it (`hasExplicitReplayPayload`, `actions:3253`) ⇒ replays run with `[REDACTED]` placeholders when redaction is configured.
- Replay success only toasts and refreshes the OLD run (`panel:1002-1004`); there is no navigation to the new run. Big contributor to the "did nothing" perception.
- `docker-compose.ee.yaml:240` sets `TEMPORAL_POLLING=false`, so base EE compose relies on the DB engine unless the `docker-compose.temporal.ee.yaml` overlay is applied. The appliance flux profile sets neither flag (defaults = temporal).
- Toasts are fine: `ThemedToaster` is mounted in the root layout at zIndex 999999, above the dialog's z-70. The UI can't silently swallow errors.
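Before the engine removal, both configuration mismatches above (task-queue split brain, stranded `engine='db'` runs) could be caught by a startup check. A minimal sketch, assuming plain env maps like `process.env`; the value of `PRODUCER_TASK_QUEUE`, the flag-parsing rules, and the names `readFlag`, `serverEngine` and `diagnoseRuntimeConfig` are hypothetical, not taken from the codebase.

```typescript
// Hypothetical startup check: env flag names come from the notes, everything else is assumed.
type Env = Record<string, string | undefined>;

// Assumption: placeholder for the queue hardcoded in workflowRuntimeV2Temporal.ts:42-46.
const PRODUCER_TASK_QUEUE = "workflow-runtime-v2";

// Assumption: an unset flag takes the default; these strings count as falsy.
function readFlag(env: Env, name: string, defaultValue: boolean): boolean {
  const raw = env[name];
  if (raw === undefined) return defaultValue;
  return !["", "false", "0", "no", "off"].includes(raw.trim().toLowerCase());
}

// Server side: the engine defaults to temporal unless the temporal-polling flag is falsy.
function serverEngine(env: Env): "temporal" | "db" {
  return readFlag(env, "WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING", true) ? "temporal" : "db";
}

function diagnoseRuntimeConfig(serverEnv: Env, workerEnv: Env): string[] {
  const problems: string[] = [];

  // The worker honors the queue override; the producer ignores it.
  const workerQueue = workerEnv["WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE"] ?? PRODUCER_TASK_QUEUE;
  if (workerQueue !== PRODUCER_TASK_QUEUE) {
    problems.push(
      `split brain: producer enqueues on "${PRODUCER_TASK_QUEUE}", worker polls "${workerQueue}"`,
    );
  }

  // The worker DB poller is opt-in, so a db-engine server needs it explicitly enabled.
  const dbPolling = readFlag(workerEnv, "WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING", false);
  if (serverEngine(serverEnv) === "db" && !dbPolling) {
    problems.push("stranded runs: server creates engine='db' runs but no worker polls the DB");
  }

  return problems;
}

// Server with temporal polling disabled, worker without DB polling: reports stranded runs.
console.log(diagnoseRuntimeConfig({ WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING: "false" }, {}));
```

With neither flag set (the appliance profile's situation per the notes) the function returns an empty list. Once the engine is removed the flags are ignored, so a check like this only matters for pre-removal deployments.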
Decisions (Robert, 2026-06-09)
- Scope: full engine removal (not config-only, not interpreter rewrite).
- Retry/Resume/Requeue deleted; Replay is the operator recovery story.
- Migration cancels stranded non-temporal RUNNING/WAITING runs.
- Bundle the redacted-replay-payload fix and the stuck-run banner.
- Merge Temporal services into the base `docker-compose.ee.yaml`; delete the overlay.
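The "cancel stranded non-temporal RUNNING/WAITING runs" decision reduces to the same predicate as the prod triage query in the Commands section. A minimal TypeScript sketch of that predicate; the `WorkflowRunRow` shape (limited to columns from that query) and the function names are hypothetical, and the real migration presumably expresses this in SQL rather than application code.

```typescript
// Hypothetical row shape, limited to columns used by the triage query.
interface WorkflowRunRow {
  run_id: string;
  status: string;
  engine: string | null;
}

const STRANDABLE_STATUSES = new Set(["RUNNING", "WAITING"]);

// A run is stranded when nothing can execute it any more: it targets the DB engine
// (or has no engine recorded) and has not reached a terminal status.
function isStrandedRun(row: WorkflowRunRow): boolean {
  const nonTemporal = row.engine === null || row.engine === "db";
  return nonTemporal && STRANDABLE_STATUSES.has(row.status);
}

// Returns the run ids the migration would cancel.
function selectRunsToCancel(rows: WorkflowRunRow[]): string[] {
  return rows.filter(isStrandedRun).map((row) => row.run_id);
}
```

Temporal runs and already-terminal DB runs are left alone, which matches the `where` clause of the triage query.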
Open questions / watch-outs for implementation
- `launchPublishedWorkflowRun`'s `execute?: boolean` param: check remaining callers (schedules?) before assuming always-start; `execute: false` + db engine used to mean "poller will pick it up", and that semantic dies with the poller.
- `workflowRunStartLimiter` and concurrency checks are duplicated between `startWorkflowRunAction` and `launchPublishedWorkflowRun`: possible follow-up simplification, out of scope here.
- i18n: removing panel buttons orphans `runDetails.actions.retry` / `resume` / `requeueEvent` + dialog keys across locale files; sweep them.
- e2e rewrite needs a Temporal test target: check what `WorkflowRuntimeV2TemporalWorker.integration.test.ts` uses (likely `TestWorkflowEnvironment`) and reuse the harness.
- External infra repos may still set the deleted env flags; harmless after removal (code ignores them), but sweep separately.
- Stranded-run migration: also check `workflow_run_waits` rows whose run is being canceled, and resolve them with a status that the run studio renders sanely.
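For the i18n sweep, a hypothetical helper that deletes dotted key paths from a parsed locale file. It assumes locale files are nested JSON objects; `removeLocaleKeys` and `ORPHANED_KEYS` are illustrative names, and only the three action keys named above are listed because the dialog keys still have to be identified.

```typescript
// Assumption: a locale file parses to nested objects with string leaves.
type LocaleTree = { [key: string]: string | LocaleTree };

// Deletes each dotted path (e.g. "runDetails.actions.retry") from the tree in place
// and returns the paths that were actually present.
function removeLocaleKeys(tree: LocaleTree, paths: string[]): string[] {
  const removed: string[] = [];
  for (const path of paths) {
    const segments = path.split(".");
    const leaf = segments.pop()!; // split() always yields at least one segment
    let node: LocaleTree | undefined = tree;
    for (const segment of segments) {
      const next: string | LocaleTree | undefined = node[segment];
      node = typeof next === "object" ? next : undefined;
      if (!node) break; // path does not exist in this locale
    }
    if (node && leaf in node) {
      delete node[leaf];
      removed.push(path);
    }
  }
  return removed;
}

// The three action keys named in the note; dialog keys are not listed here.
const ORPHANED_KEYS = [
  "runDetails.actions.retry",
  "runDetails.actions.resume",
  "runDetails.actions.requeueEvent",
];
```

Running it per locale file and comparing the returned lists shows which locales were already missing a key. Empty parent objects are left in place; pruning them is a separate choice.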
Implementation notes (2026-06-09)
- `launchPublishedWorkflowRun`'s `execute` flag was only ever passed as `true` (4 call sites); removed along with the engine ternary. `executionKey` is genuinely used (schedules, webhooks, event launch) and stays.
- `WorkflowRuntimeV2` is now only the run-row projection writer (`startRun`); the whole DB interpreter (~1,300 lines) was deleted after confirming the Temporal interpreter never called it (`executeWorkflowRuntimeV2Run` was exported but never registered/invoked).
- `services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts` was already unreferenced (`index.ts` imported the shared copy); both deleted.
- Bulk Resume in `WorkflowRunList` used the legacy resume action; removed with the per-run buttons. Bulk Cancel stays.
- Replay payload-dirty detection compares the textarea string against the pre-filled pristine string (held in a ref); exact match is sufficient because untouched textareas don't reformat.
- `server/src/test/unit/workflowRunLauncher.unit.test.ts` fails to LOAD even unmodified (pre-existing: `@alga-psa/db/workDate` is unresolvable through `importOriginal` of the runtime in this worktree). Updated for the new API anyway; the failure is environmental. See also the quota test's stale `tenant_id` field, which suggests these server unit tests aren't in the active CI gate.
- Playwright workflow suites: the host-run server needs `TEMPORAL_ADDRESS` pointing at the new `temporal-playwright` service (host port 17233 via `PLAYWRIGHT_TEMPORAL_PORT`) for replay/run-start flows to function.
- `ee/server` full typecheck needs `NODE_OPTIONS=--max-old-space-size=12288`.
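The replay dirty-detection rule can be sketched as a pure function that decides what the Replay dialog submits. The names are hypothetical, and the sketch assumes that omitting the payload makes the server fall back to the stored, unredacted input (the "send nothing when unedited" path).

```typescript
// Hypothetical decision type for the Replay dialog's submit handler.
type ReplayPayloadDecision =
  | { kind: "omit" } // untouched: send no payload, server reuses the stored input
  | { kind: "send"; payload: unknown } // operator edited the JSON
  | { kind: "invalid"; error: string }; // edited, but not parseable JSON

function decideReplayPayload(current: string, pristine: string): ReplayPayloadDecision {
  // Exact string comparison: an untouched textarea keeps the pre-filled text verbatim.
  if (current === pristine) return { kind: "omit" };
  try {
    return { kind: "send", payload: JSON.parse(current) };
  } catch (err) {
    return { kind: "invalid", error: err instanceof Error ? err.message : String(err) };
  }
}
```

Because the comparison is exact, reverting an edit back to the original text counts as clean, while a whitespace-only edit counts as dirty and would submit the pre-filled text, `[REDACTED]` placeholders included.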
Remaining follow-ups (tests.json items still false)
- Migration coverage (T046–T049): the stranded-run migration has no automated test; needs the DB-backed integration harness.
- Stuck-run banner (T042–T045) and removed-button absence (T033–T035): no component tests written; behavior is hand-verifiable in the run studio.
- Replay UI dirty-detection (T036/T039): server-side contract is covered; the client-side "send nothing when unedited" path is not unit-tested.
- Full-stack runs (T030/T050/T052): need a live compose stack with Temporal; `docker compose -f docker-compose.ee.yaml up`, then start a manual run.
- DB-backed integration suites (control/publish/e2e) were verified by typecheck + vitest collection only in this worktree (no DB available); CI run pending.
- EventStreamWorker vitest suite (7 cases) fails in this worktree on real Redis/mock-resolution grounds — confirmed identical failure on pre-change code; not a regression.
Commands
- Find engine references: `grep -rn "engine.*'db'\|'db'.*engine" --include="*.ts" shared ee services server | grep -v node_modules`
- Flag sweep: `grep -rn "WORKFLOW_RUNTIME_V2_ENABLE" --include="*" . | grep -v node_modules | grep -v docs/plans`
- Stranded runs (prod triage): `select run_id, status, engine, started_at from workflow_runs where (engine is null or engine='db') and status in ('RUNNING','WAITING');`