# Scratchpad: temporal-only workflow engine
## Origin
Customer video: Run button "did nothing"; internal testing couldn't reproduce.
Investigation concluded the UI path always toasts/navigates/disables — the
silent failure modes are downstream (run created, nothing executes it).
Design: `../2026-06-09-temporal-only-workflow-engine-design.md`.
## Key discoveries (verified against code, 2026-06-09)
- Producer hardcodes the task queue (`workflowRuntimeV2Temporal.ts:42-46`);
worker honors `WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE`
(`WorkflowRuntimeV2TemporalWorker.ts:61`). Override ⇒ split brain.
- Server defaults engine to temporal unless
`WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING` is falsy
(`workflowRunLauncher.ts:20-22`); worker DB poller is opt-in via
`WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING` (`index.ts:62-92`). Mismatch ⇒
stranded `engine='db'` runs.
- `executeWorkflowRuntimeV2Run` (activities:36) is exported but never invoked
by any workflow or worker registration — the live Temporal path is the
interpreter + per-step activities. Hence the whole DB interpreter
(`executeRun` & co.) is orphaned once legacy actions/workers go.
- Temporal activities never write `workflow_run_snapshots` (grep: zero
snapshot references in activities). Snapshots are DB-interpreter artifacts;
keep tables/models for historical reads only.
- TWO DB poller copies: `services/workflow-worker/src/v2/` (stripped) and
`shared/workflow/workers/` (canonical; also sweeps `workflow_data_store`
expiry — must be relocated, it is not engine work).
- Legacy controls hard-409 on temporal runs via
`assertLegacyRunControlSupported` (actions:1259), but the panel shows
Retry for ANY FAILED run (`canRetry`, panel:906) — guaranteed 409 trap.
- Replay pre-fills payload from `getWorkflowRunAction`'s **redacted**
`input_json` (actions:2326-2332; panel:517-525) and always submits it
(`hasExplicitReplayPayload`, actions:3253) ⇒ replays run with `[REDACTED]`
placeholders when redaction configured.
- Replay success only toasts and refreshes the OLD run (panel:1002-1004) —
no navigation to the new run. Big contributor to "did nothing" perception.
- `docker-compose.ee.yaml:240` sets
`WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING=false`, so base EE compose
relies on the DB engine unless the `docker-compose.temporal.ee.yaml` overlay
is applied. The appliance flux profile sets neither flag (defaults = temporal).
- Toasts are fine: `ThemedToaster` mounted in root layout at zIndex 999999,
above dialog z-70. UI can't silently swallow errors.
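
The two configuration mismatches noted above (the task-queue override and the engine/poller disagreement) can be expressed as a small consistency check. This is a sketch under stated assumptions, not code from the repository: `findEngineMismatches`, the `Mismatch` labels, the set of strings treated as an explicit "off", and the `producerTaskQueue` parameter are all hypothetical. Only the three env var names come from the notes.

```typescript
// Sketch only: helper names and the falsy-string parsing are assumptions.
type Env = Record<string, string | undefined>;

type Mismatch = "stranded-db-runs" | "task-queue-split-brain";

// Assumption: these spellings count as an explicit "off".
const FALSY_VALUES = ["false", "0", "no", "off"];

function isExplicitlyOff(value: string | undefined): boolean {
  return value !== undefined && FALSY_VALUES.indexOf(value.trim().toLowerCase()) !== -1;
}

function isOn(value: string | undefined): boolean {
  return value !== undefined && value.trim() !== "" && !isExplicitlyOff(value);
}

function findEngineMismatches(
  serverEnv: Env,
  workerEnv: Env,
  producerTaskQueue: string // stand-in for the queue name the producer hardcodes
): Mismatch[] {
  const problems: Mismatch[] = [];

  // Server side: temporal is the default; only an explicit "off" selects db.
  const serverEngine = isExplicitlyOff(serverEnv["WORKFLOW_RUNTIME_V2_ENABLE_TEMPORAL_POLLING"])
    ? "db"
    : "temporal";

  // Worker side: the DB poller is opt-in, so unset means "not polling".
  const workerPollsDb = isOn(workerEnv["WORKFLOW_RUNTIME_V2_ENABLE_DB_POLLING"]);

  // Server writes engine='db' runs that no worker will ever pick up.
  if (serverEngine === "db" && !workerPollsDb) {
    problems.push("stranded-db-runs");
  }

  // Worker listens on a queue the producer never writes to.
  const override = workerEnv["WORKFLOW_RUNTIME_V2_TEMPORAL_TASK_QUEUE"];
  if (override !== undefined && override !== "" && override !== producerTaskQueue) {
    problems.push("task-queue-split-brain");
  }

  return problems;
}
```

A check like this could run at worker startup or in a deploy-time lint. After the engine removal, only the task-queue comparison would stay relevant.
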
## Decisions (Robert, 2026-06-09)
1. Scope: full engine removal (not config-only, not interpreter rewrite).
2. Retry/Resume/Requeue deleted; Replay is the operator recovery story.
3. Migration cancels stranded non-temporal RUNNING/WAITING runs.
4. Bundle the redacted-replay-payload fix and the stuck-run banner.
5. Merge Temporal services into base `docker-compose.ee.yaml`; delete overlay.
## Open questions / watch-outs for implementation
- `launchPublishedWorkflowRun`'s `execute?: boolean` param: check remaining
callers (schedules?) before assuming always-start; `execute:false` + db
engine used to mean "poller will pick it up" — that semantic dies with the
poller.
- `workflowRunStartLimiter` and concurrency checks are duplicated between
`startWorkflowRunAction` and `launchPublishedWorkflowRun` — possible
follow-up simplification, out of scope here.
- i18n: removing panel buttons orphans
`runDetails.actions.retry/resume/requeueEvent` + dialog keys across locale
files; sweep them.
- e2e rewrite needs a Temporal test target: check what
`WorkflowRuntimeV2TemporalWorker.integration.test.ts` uses (likely
TestWorkflowEnvironment) and reuse the harness.
- External infra repos may still set the deleted env flags — harmless after
removal (code ignores them), but sweep separately.
- Stranded-run migration: also check `workflow_run_waits` rows whose run is
being canceled — resolve with a status that the run studio renders sanely.
## Implementation notes (2026-06-09)
- `launchPublishedWorkflowRun`'s `execute` flag was only ever passed as `true`
(4 call sites) — removed along with the engine ternary. `executionKey` is
genuinely used (schedules, webhooks, event launch) and stays.
- `WorkflowRuntimeV2` is now only the run-row projection writer (`startRun`);
the whole DB interpreter (~1,300 lines) was deleted after confirming the
Temporal interpreter never called it (`executeWorkflowRuntimeV2Run` was
exported but never registered/invoked).
- `services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts` was already
unreferenced (index.ts imported the shared copy) — both deleted.
- Bulk Resume in `WorkflowRunList` used the legacy resume action — removed
with the per-run buttons; bulk Cancel stays.
- Replay payload-dirty detection compares the textarea string against the
pre-filled pristine string (ref) — exact-match is sufficient because
untouched textareas don't reformat.
- `server/src/test/unit/workflowRunLauncher.unit.test.ts` fails to LOAD even
unmodified (pre-existing: `@alga-psa/db/workDate` is unresolvable through
`importOriginal` of the runtime in this worktree). Updated for the new API
anyway; the failure is environmental. The quota test's stale `tenant_id`
field also suggests these server unit tests aren't in the active CI gate.
- Playwright workflow suites: host-run server needs `TEMPORAL_ADDRESS`
pointing at the new `temporal-playwright` service (host port 17233 via
`PLAYWRIGHT_TEMPORAL_PORT`) for replay/run-start flows to function.
- ee/server full typecheck needs `NODE_OPTIONS=--max-old-space-size=12288`.
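
The replay dirty-detection note above can be sketched as a pure function. This is a minimal illustration, assuming a hypothetical `replaySubmission` helper and `ReplaySubmission` type; the actual panel code and request shape may differ. The idea is that an untouched textarea sends no explicit payload, so the pre-filled redacted text is never submitted back.

```typescript
// Hypothetical result type: either "no explicit payload" or the edited payload.
type ReplaySubmission =
  | { explicit: false }
  | { explicit: true; payload: unknown };

// Compares the current textarea text against the pristine pre-filled text.
// Exact string equality is enough because an untouched textarea is not reformatted.
function replaySubmission(pristineText: string, currentText: string): ReplaySubmission {
  if (currentText === pristineText) {
    // Operator did not edit: send nothing.
    return { explicit: false };
  }
  // Operator edited: parse and send. JSON.parse throws on invalid input,
  // so a real caller would validate and surface the error before submitting.
  return { explicit: true, payload: JSON.parse(currentText) };
}

// Usage: the pre-filled text may contain redaction placeholders.
const prefilled = '{"apiKey":"[REDACTED]"}';
const untouched = replaySubmission(prefilled, prefilled);
const edited = replaySubmission(prefilled, '{"apiKey":"real-value"}');
```

Returning a tagged result rather than `undefined` keeps "no payload" distinct from a payload that happens to parse to `null`.
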
## Remaining follow-ups (tests.json items still false)
- Migration coverage (T046-T049): the stranded-run migration has no automated
test; needs the DB-backed integration harness.
- Stuck-run banner (T042-T045) and removed-button absence (T033-T035): no
component tests written; behavior is hand-verifiable in the run studio.
- Replay UI dirty-detection (T036/T039): server-side contract is covered;
the client-side "send nothing when unedited" path is not unit-tested.
- Full-stack runs (T030/T050/T052): need a live compose stack with Temporal;
`docker compose -f docker-compose.ee.yaml up` then start a manual run.
- DB-backed integration suites (control/publish/e2e) were verified by
typecheck + vitest collection only in this worktree (no DB available);
CI run pending.
- EventStreamWorker vitest suite (7 cases) fails in this worktree because of
real-Redis and mock-resolution issues. The identical failure was confirmed on
pre-change code, so it is not a regression.
## Commands
- Find engine references: `grep -rn "engine.*'db'\|'db'.*engine" --include="*.ts" shared ee services server | grep -v node_modules`
- Flag sweep: `grep -rn "WORKFLOW_RUNTIME_V2_ENABLE" --include="*" . | grep -v node_modules | grep -v docs/plans`
- Stranded runs (prod triage): `select run_id, status, engine, started_at from workflow_runs where (engine is null or engine='db') and status in ('RUNNING','WAITING');`
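
The stranded-run triage query can also be mirrored as a TypeScript predicate, which may help when unit-testing the migration's selection logic without a database. This is a sketch: the `WorkflowRunRow` interface and the `isStrandedRun` name are assumptions. Only the column names and the filter condition are taken from the SQL above.

```typescript
// Hypothetical row shape; columns follow the triage query above.
interface WorkflowRunRow {
  run_id: string;
  status: string;
  engine: string | null;
  started_at: string;
}

// Statuses that imply something should still be executing the run.
const ACTIVE_STATUSES = ["RUNNING", "WAITING"];

// Mirrors: (engine is null or engine='db') and status in ('RUNNING','WAITING')
function isStrandedRun(run: WorkflowRunRow): boolean {
  const nonTemporal = run.engine === null || run.engine === "db";
  return nonTemporal && ACTIVE_STATUSES.indexOf(run.status) !== -1;
}

// Usage: pick the run ids a cancel migration would target.
const sampleRuns: WorkflowRunRow[] = [
  { run_id: "a", status: "RUNNING", engine: "db", started_at: "2026-06-01T00:00:00Z" },
  { run_id: "b", status: "WAITING", engine: null, started_at: "2026-06-02T00:00:00Z" },
  { run_id: "c", status: "RUNNING", engine: "temporal", started_at: "2026-06-03T00:00:00Z" },
  { run_id: "d", status: "FAILED", engine: "db", started_at: "2026-06-04T00:00:00Z" },
];
const strandedIds = sampleRuns.filter(isStrandedRun).map((r) => r.run_id);
```
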