Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

6.9 KiB
Raw Permalink Blame History

Scratchpad: temporal-only workflow engine

Origin

Customer video: Run button "did nothing"; internal testing couldn't reproduce. Investigation concluded the UI path always toasts/navigates/disables — the silent failure modes are downstream (run created, nothing executes it). Design: ../2026-06-09-temporal-only-workflow-engine-design.md.

Key discoveries (verified against code, 2026-06-09)

  • Producer hardcodes the task queue (workflowRuntimeV2Temporal.ts:42-46); worker honors WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE (WorkflowRuntimeV2TemporalWorker.ts:61). Override ⇒ split brain.
  • Server defaults engine to temporal unless WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING is falsy (workflowRunLauncher.ts:20-22); worker DB poller is opt-in via WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING (index.ts:62-92). Mismatch ⇒ stranded engine='db' runs.
  • executeWorkflowRuntimeV2Run (activities:36) is exported but never invoked by any workflow or worker registration — the live Temporal path is the interpreter + per-step activities. Hence the whole DB interpreter (executeRun & co.) is orphaned once legacy actions/workers go.
  • Temporal activities never write workflow_run_snapshots (grep: zero snapshot references in activities). Snapshots are DB-interpreter artifacts; keep tables/models for historical reads only.
  • TWO DB poller copies: services/workflow-worker/src/v2/ (stripped) and shared/workflow/workers/ (canonical; also sweeps workflow_data_store expiry — must be relocated, it is not engine work).
  • Legacy controls hard-409 on temporal runs via assertLegacyRunControlSupported (actions:1259), but the panel shows Retry for ANY FAILED run (canRetry, panel:906) — guaranteed 409 trap.
  • Replay pre-fills payload from getWorkflowRunAction's redacted input_json (actions:2326-2332; panel:517-525) and always submits it (hasExplicitReplayPayload, actions:3253) ⇒ replays run with [REDACTED] placeholders when redaction configured.
  • Replay success only toasts and refreshes the OLD run (panel:1002-1004) — no navigation to the new run. Big contributor to "did nothing" perception.
  • docker-compose.ee.yaml:240 sets TEMPORAL_POLLING=false — base EE compose relies on the DB engine unless docker-compose.temporal.ee.yaml overlay is applied. Appliance flux profile sets neither flag (defaults = temporal).
  • Toasts are fine: ThemedToaster mounted in root layout at zIndex 999999, above dialog z-70. UI can't silently swallow errors.

Decisions (Robert, 2026-06-09)

  1. Scope: full engine removal (not config-only, not interpreter rewrite).
  2. Retry/Resume/Requeue deleted; Replay is the operator recovery story.
  3. Migration cancels stranded non-temporal RUNNING/WAITING runs.
  4. Bundle the redacted-replay-payload fix and the stuck-run banner.
  5. Merge Temporal services into base docker-compose.ee.yaml; delete overlay.

Open questions / watch-outs for implementation

  • launchPublishedWorkflowRun's execute?: boolean param: check remaining callers (schedules?) before assuming always-start; execute:false + db engine used to mean "poller will pick it up" — that semantic dies with the poller.
  • workflowRunStartLimiter and concurrency checks are duplicated between startWorkflowRunAction and launchPublishedWorkflowRun — possible follow-up simplification, out of scope here.
  • i18n: removing panel buttons orphans runDetails.actions.retry/resume/ requeueEvent + dialog keys across locale files — sweep them.
  • e2e rewrite needs a Temporal test target: check what WorkflowRuntimeV2TemporalWorker.integration.test.ts uses (likely TestWorkflowEnvironment) and reuse the harness.
  • External infra repos may still set the deleted env flags — harmless after removal (code ignores them), but sweep separately.
  • Stranded-run migration: also check workflow_run_waits rows whose run is being canceled — resolve with a status that the run studio renders sanely.

Implementation notes (2026-06-09)

  • launchPublishedWorkflowRun's execute flag was only ever passed as true (4 call sites) — removed along with the engine ternary. executionKey is genuinely used (schedules, webhooks, event launch) and stays.
  • WorkflowRuntimeV2 is now only the run-row projection writer (startRun); the whole DB interpreter (~1,300 lines) was deleted after confirming the Temporal interpreter never called it (executeWorkflowRuntimeV2Run was exported but never registered/invoked).
  • services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts was already unreferenced (index.ts imported the shared copy) — both deleted.
  • Bulk Resume in WorkflowRunList used the legacy resume action — removed with the per-run buttons; bulk Cancel stays.
  • Replay payload-dirty detection compares the textarea string against the pre-filled pristine string (ref) — exact-match is sufficient because untouched textareas don't reformat.
  • server/src/test/unit/workflowRunLauncher.unit.test.ts fails to LOAD even unmodified (pre-existing: @alga-psa/db/workDate unresolvable through importOriginal of the runtime in this worktree). Updated for the new API anyway; failure is environmental, also see quota test's stale tenant_id field which suggests these server unit tests aren't in the active CI gate.
  • Playwright workflow suites: host-run server needs TEMPORAL_ADDRESS pointing at the new temporal-playwright service (host port 17233 via PLAYWRIGHT_TEMPORAL_PORT) for replay/run-start flows to function.
  • ee/server full typecheck needs NODE_OPTIONS=--max-old-space-size=12288.

Remaining follow-ups (tests.json items still false)

  • Migration coverage (T046T049): the stranded-run migration has no automated test; needs the DB-backed integration harness.
  • Stuck-run banner (T042T045) and removed-button absence (T033T035): no component tests written; behavior is hand-verifiable in the run studio.
  • Replay UI dirty-detection (T036/T039): server-side contract is covered; the client-side "send nothing when unedited" path is not unit-tested.
  • Full-stack runs (T030/T050/T052): need a live compose stack with Temporal; docker compose -f docker-compose.ee.yaml up then start a manual run.
  • DB-backed integration suites (control/publish/e2e) were verified by typecheck + vitest collection only in this worktree (no DB available); CI run pending.
  • EventStreamWorker vitest suite (7 cases) fails in this worktree on real Redis/mock-resolution grounds — confirmed identical failure on pre-change code; not a regression.

Commands

  • Find engine references: grep -rn "engine.*'db'\|'db'.*engine" --include="*.ts" shared ee services server | grep -v node_modules
  • Flag sweep: grep -rn "WORKFLOW_RUNTIME_V2_ENABLE" --include="*" . | grep -v node_modules | grep -v docs/plans
  • Stranded runs (prod triage): select run_id, status, engine, started_at from workflow_runs where (engine is null or engine='db') and status in ('RUNNING','WAITING');