Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

18 KiB

SCRATCHPAD — Workflow V2 Temporal Hard-Cutover Remediation

Purpose

Focused follow-up plan for the remaining blockers after the broader Temporal-native runtime plan.

Parent plan:

  • ee/docs/plans/2026-04-08-workflow-v2-temporal-native-runtime/

User-approved scope decisions

  • Scope choice: B
    • include the 3 remaining review items plus adjacent operator/runtime cleanup needed to make hard cutover shippable
  • Legacy operator behavior for Temporal runs: A — hard fail
    • do not translate old DB-runtime-style actions for Temporal-backed runs
  • Event correlation source: C — support both
    • explicit correlation if present
    • configured derivation if explicit correlation is absent
  • Expression parity posture: A — exact parity
    • Temporal must use the same expression contract as the canonical workflow runtime
  • Cutover posture: hard cutover
    • shared / legacy DB-runtime authority can be removed for Workflow Runtime V2

Review findings this plan addresses

  1. Expression engine parity gap

    • ee/temporal-workflows/src/workflows/workflow-runtime-v2-run-workflow.ts
    • custom JSONata evaluator diverges from canonical expression contract in shared/workflow/runtime/expressionEngine.ts
    • known gap: missing nowIso() support
    • likely additional gap: skipped guardrails/allowed-function enforcement
  2. Remaining DB/legacy authority over Temporal runs

    • ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts
    • resumeWorkflowRunAction() still mutates waits/runs and calls WorkflowRuntimeV2.executeRun(...)
    • retryWorkflowRunAction() still calls WorkflowRuntimeV2.executeRun(...)
    • submitWorkflowEventAction() still resolves waits through DB authority patterns
    • cancelWorkflowRunAction() still projects CANCELED even if Temporal cancel fails
  3. Event ingress correlation bug

    • services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.ts
    • currently uses event.event_id as correlation key for audit, candidate lookup, and Temporal signal payload
    • real event.wait steps typically correlate on business keys like ticket/project/account identifiers

Constraints / implementation guardrails

  • Prefer explicit failure over silent fallback.
  • Do not preserve hybrid authority between Temporal and DB runtime for V2.
  • DB tables remain projection/index/audit surfaces only.
  • Candidate lookup may use DB indexes, but Temporal remains the only resume authority.
  • Unsupported Temporal actions should fail loudly and clearly for operators/support.

Likely affected files

  • ee/temporal-workflows/src/workflows/workflow-runtime-v2-run-workflow.ts
  • shared/workflow/runtime/expressionEngine.ts
  • ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts
  • ee/packages/workflows/src/lib/workflowRuntimeV2Temporal.ts
  • services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.ts
  • packages/event-bus/src/schemas/workflowEventSchema.ts
  • shared/workflow/persistence/workflowRunWaitModelV2.ts

Expected test areas

  • Temporal workflow/runtime unit tests
  • server/API unit tests for run-control actions
  • workflow worker event-ingress tests
  • possibly one DB-backed integration path for correlation routing/projection sanity

Notes

  • Existing broader runtime plan already covers the end-state architecture.
  • This focused plan is meant to track the remaining hard-cutover blockers as a shippable remediation slice.

Progress — 2026-04-09 (Run-control hard cutover)

Completed scope in this pass

  • Implemented hard-fail guardrails for legacy run-control actions on Temporal runs in:
    • ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts
  • Added explicit unsupported-action boundary for Temporal runs:
    • resumeWorkflowRunAction now fails with 409 + code WORKFLOW_TEMPORAL_ACTION_UNSUPPORTED
    • retryWorkflowRunAction now fails with 409 + code WORKFLOW_TEMPORAL_ACTION_UNSUPPORTED
    • requeueWorkflowRunEventWaitAction now fails with 409 + code WORKFLOW_TEMPORAL_ACTION_UNSUPPORTED
  • Fixed Temporal cancel authority semantics:
    • cancelWorkflowRunAction now calls Temporal cancel first and does not project DB CANCELED state if Temporal cancel fails.
    • Failure path now returns explicit 409 + code WORKFLOW_TEMPORAL_CANCEL_FAILED with actionable hint.

Key decisions and rationale

  • Decision: treat legacy admin controls (resume, retry, requeue_event_wait) as unsupported for engine='temporal'.
    • Rationale: enforces single execution authority (Temporal) and prevents split-brain DB projection mutations.
  • Decision: cancel remains supported for Temporal runs, but only when the Temporal cancel request succeeds.
    • Rationale: prevents false projection of cancellation when workflow execution is still active.

Tests added/updated

  • Updated server/src/test/integration/workflowRuntimeV2.control.integration.test.ts:
    • New: Temporal resume hard-fails with explicit unsupported-action details and no wait/run mutation.
    • New: Temporal retry hard-fails with explicit unsupported-action details and keeps failed run projection unchanged.
    • New: Temporal cancel failure leaves run/waits in WAITING state and returns explicit error details.
  • Added test mocking for Temporal cancel client call via:
    • vi.mock('@alga-psa/workflows/lib/workflowRuntimeV2Temporal', ...)

Commands / runbook used

  • Focused integration run (full file):
    • cd server && npm run -s test -- src/test/integration/workflowRuntimeV2.control.integration.test.ts
    • Result: suite has a pre-existing unrelated failure (STARTED invocation with stale lease is treated as TransientError) not introduced by these changes.
  • Focused validation for new Temporal tests only:
    • cd server && npm run -s test -- src/test/integration/workflowRuntimeV2.control.integration.test.ts -t "Temporal runs reject legacy|Temporal cancel failure"
    • Result: passed.

Current gaps / next work

  • Event-ingress correlation/authority items (F018+) remain open.
  • Expression parity items referenced here were addressed in the next section (Expression parity cutover).

Progress — 2026-04-09 (Expression parity cutover)

Completed scope in this pass

  • Replaced Temporal workflow-local expression evaluation with canonical workflow expression contract in:
    • ee/temporal-workflows/src/workflows/workflow-runtime-v2-run-workflow.ts
  • Removed local Temporal-only JSONata helpers from workflow execution path:
    • removed local evaluator/normalizer and now uses canonical compileExpression from @alga-psa/workflows/runtime/expressionEngine
  • Removed Temporal-only nowIso() control-flow ban.
    • Temporal workflow now accepts nowIso() with canonical function set/validation behavior.

Key decisions and rationale

  • Decision: import expression contract from @alga-psa/workflows/runtime/expressionEngine directly instead of @alga-psa/workflows/runtime barrel.
    • Rationale: avoids pulling unrelated transitive runtime package entry dependencies into temporal workflow unit test bundle (@alga-psa/storage resolution issue).
  • Decision: cache compiled expressions per source string in-workflow (Map<string, CompiledExpression>).
    • Rationale: preserves behavior and avoids repeated compile cost while remaining deterministic for replayed source strings.

Tests added/updated

  • Updated ee/temporal-workflows/src/workflows/__tests__/workflow-runtime-v2-run-workflow.test.ts:
    • nowIso parity: control.if accepts nowIso() via canonical engine.
    • normalization parity: == source normalization path works in Temporal execution.
    • disallowed-function parity: rejects unsupported functions with canonical validation.
    • output guardrail parity: expression result over max size fails with canonical guardrail error.

Commands / runbook used

  • Temporal workflow unit suite (targeted):
    • cd ee/temporal-workflows && npx vitest run src/workflows/__tests__/workflow-runtime-v2-run-workflow.test.ts
    • Result: passed (30 tests).

Current gaps / next work

  • Determinism replay proof item (F009 / T003) is still open:
    • need a dedicated replay-focused test that demonstrates no expression-contract drift across replay.
  • Event ingress correlation and API/stream alignment (F018+) remain open.

Progress — 2026-04-09 (Event-ingress authority guard for Temporal)

Completed scope in this pass

  • Added Temporal guard in API event ingestion path (submitWorkflowEventAction) to prevent legacy DB-authoritative resume for Temporal runs:
    • file: ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts
    • behavior: if a matched candidate wait belongs to engine='temporal', ingestion fails explicitly (409) and records an audit error message; no wait/run projection mutation is performed for that Temporal run.

Key decisions and rationale

  • Decision: hard-fail API resume attempts for Temporal candidate runs in the legacy DB event-resume code path.
    • Rationale: prevents WorkflowRuntimeV2.executeRun(...) from being an authority path for Temporal runs in operator/API control surfaces.

Tests added/updated

  • Updated server/src/test/integration/workflowRuntimeV2.control.integration.test.ts:
    • New: Submit workflow event rejects legacy DB-authoritative resume for Temporal runs
    • Asserts:
      • explicit 409 failure
      • wait remains WAITING
      • run remains WAITING
      • runtime event audit row includes Temporal-unsupported error metadata

Commands / runbook used

  • Focused server integration run for Temporal guard tests:
    • cd server && npm run -s test -- src/test/integration/workflowRuntimeV2.control.integration.test.ts -t "Temporal runs reject legacy|Temporal cancel failure|Submit workflow event rejects legacy DB-authoritative resume for Temporal runs"
    • Result: passed.

Progress — 2026-04-09 (Stream correlation contract)

Completed scope in this pass

  • Extended workflow event stream schema/contracts to carry explicit workflow correlation:
    • packages/event-schemas/src/schemas/workflowEventSchema.ts
    • packages/event-bus/src/schemas/workflowEventSchema.ts
    • packages/event-schemas/src/schemas/eventBusSchema.ts (WorkflowPublishHooks.correlationKey, convertToWorkflowEvent mapping)
    • packages/event-bus/src/eventBus.ts (publishes workflow_correlation_key in stream payload)
  • Updated stream ingestion worker correlation logic:
    • services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.ts
    • correlation resolution order:
      1. explicit (event.workflow_correlation_key, payload.workflowCorrelationKey, payload.correlationKey)
      2. configured derivation paths from WORKFLOW_RUNTIME_V2_EVENT_CORRELATION_PATHS_JSON (event-specific and * wildcard)
    • no fallback to event_id for wait routing
    • resolved key persisted in workflow_runtime_events.correlation_key
    • wait candidate lookup uses resolved key (tenant + event_name + resolved correlation)
    • Temporal signal payload carries resolved correlation key
    • unresolved correlation writes audit error metadata and skips wait routing/signaling

Key decisions and rationale

  • Decision: use env-configured derivation paths as the immediate configurable source (WORKFLOW_RUNTIME_V2_EVENT_CORRELATION_PATHS_JSON).
    • Rationale: provides explicit, event-type-specific derivation without introducing schema migrations in this remediation slice.
  • Decision: fail open for event-triggered workflow launches but fail closed for wait routing when correlation is missing.
    • Rationale: preserves event-triggered starts while preventing false-positive wait matches.

Tests added/updated

  • Updated services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.test.ts:
    • explicit correlation routes candidate waits/signals by resolved key (not event_id)
    • configured derivation from payload path routes correctly
    • unresolved correlation writes audit error and skips wait-routing/signaling

Commands / runbook used

  • Stream worker unit tests:
    • cd services/workflow-worker && npx vitest run src/v2/WorkflowRuntimeV2EventStreamWorker.test.ts
    • Result: passed (4 tests).

Current gaps / next work

  • API ingress still needs full correlation-derivation parity and Temporal-signal-only resume path alignment (F024, F025, F026, F028, T010).
  • Determinism replay test (F009 / T003) and cutover regression/projection tests (T011, T012) remain open.

Progress — 2026-04-09 (Support posture documentation)

Completed scope in this pass

  • Added hard-cutover support posture document:
    • ee/docs/plans/2026-04-08-workflow-v2-temporal-hard-cutover-remediation/HARD_CUTOVER_SUPPORT_POSTURE.md
  • Document includes:
    • Temporal authority model
    • operator/API run-control support matrix (cancel supported, resume/retry/requeue_event_wait unsupported)
    • event-ingress correlation posture
    • support/debugging guidance for projection fields

Progress — 2026-04-09 (API/stream correlation + Temporal signal-only ingress)

Completed scope in this pass

  • Aligned API event ingress with stream correlation contract using a shared resolver:
    • Added ee/packages/workflows/src/lib/workflowEventCorrelation.ts
    • Updated stream worker (services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.ts) to use shared resolver
    • Updated API action schema to accept explicit workflow correlation while supporting derivation fallback:
      • ee/packages/workflows/src/actions/workflow-runtime-v2-schemas.ts
  • Reworked submitWorkflowEventAction to keep Temporal authority in Temporal:
    • file: ee/packages/workflows/src/actions/workflow-runtime-v2-actions.ts
    • Temporal candidate waits are no longer DB-resolved/advanced.
    • Temporal event waits are routed via signalWorkflowRuntimeV2Event(...).
    • Temporal human waits are routed via new signalWorkflowRuntimeV2HumanTask(...).
    • DB candidate lookup remains index/routing aid; payload-filter matching is no longer used as final authority for Temporal waits.
    • Missing correlation writes clear runtime-event audit metadata and skips wait routing.
  • Added Temporal human-task signal helper:
    • ee/packages/workflows/src/lib/workflowRuntimeV2Temporal.ts

Determinism / replay test scope

  • Added replay/checkpoint determinism coverage for canonical expressions in Temporal workflow unit suite:
    • ee/temporal-workflows/src/workflows/__tests__/workflow-runtime-v2-run-workflow.test.ts
    • New test validates continue-as-new checkpoint replay path with canonical control.if expression semantics.

Tests added/updated

  • server/src/test/integration/workflowRuntimeV2.control.integration.test.ts
    • Temporal API event ingress signals event waits without DB-authoritative mutation.
    • Derived correlation (WORKFLOW_RUNTIME_V2_EVENT_CORRELATION_PATHS_JSON) routes Temporal waits and avoids executeRun.
    • Temporal payload-filter matching stays inside Temporal wait contract (API still signals candidate wait).
    • Temporal human waits route through Temporal human-task signal.
    • Resume/retry unsupported Temporal actions assert no WorkflowRuntimeV2.executeRun(...) call.
  • services/workflow-worker/src/v2/WorkflowRuntimeV2EventStreamWorker.test.ts
    • remains passing with shared correlation helper integration.

Commands / runbook used

  • cd services/workflow-worker && npx vitest run src/v2/WorkflowRuntimeV2EventStreamWorker.test.ts
  • cd ee/temporal-workflows && npx vitest run src/workflows/__tests__/workflow-runtime-v2-run-workflow.test.ts
  • cd server && mkdir -p coverage/.tmp && npm run -s test -- --coverage.enabled=false src/test/integration/workflowRuntimeV2.control.integration.test.ts -t "Submit workflow event|Temporal runs reject legacy admin resume|Temporal runs reject legacy retry"

Plan checklist updates

  • Marked implemented: F009, F024, F025, F026, F028, F036
  • Marked implemented: T003, T010, T011, T012

Remaining open items after this pass

  • Features: F029, F030, F031, F032
  • Tests: no remaining unchecked tests in this plan file.

Progress — 2026-04-09 (Legacy worker/runtime gating for Temporal runs)

Completed scope in this pass

  • Gated legacy scheduler/runtime execution paths from affecting Temporal runs:
    • shared/workflow/workers/WorkflowRuntimeV2Worker.ts
      • due retry/time/event-time waits are now explicitly skipped when run.engine === 'temporal'
    • shared/workflow/runtime/runtime/workflowRuntimeV2.ts
      • acquireRunnableRun(...) excludes Temporal runs from runnable acquisition
      • executeRun(...) now hard-fails for Temporal runs with explicit unsupported-runtime error
    • services/workflow-worker/src/v2/WorkflowRuntimeV2Worker.ts
      • mirrored Temporal wait skip guardrails for service worker code path

Projection/supportability impact

  • Temporal projections are no longer mutated by legacy worker timeout/retry polling paths.
  • Runtime event projections continue to persist resolved correlation and matched run metadata for Temporal signal routing.
  • Unsupported Temporal run-control actions continue returning actionable operator metadata (code/action/engine/runId).

Tests added/updated

  • Updated server/src/test/integration/workflowRuntimeV2.control.integration.test.ts:
    • Added worker timeout polling does not execute Temporal runs through legacy runtime authority
    • Validates temporal wait remains WAITING, run remains WAITING, and WorkflowRuntimeV2.executeRun(...) is not invoked.

Commands / runbook used

  • cd server && mkdir -p coverage/.tmp && npm run -s test -- --coverage.enabled=false src/test/integration/workflowRuntimeV2.control.integration.test.ts -t "Submit workflow event|Temporal runs reject legacy admin resume|Temporal runs reject legacy retry|timeout polling"
  • cd services/workflow-worker && npx vitest run src/v2/WorkflowRuntimeV2EventStreamWorker.test.ts src/index.startup.test.ts

Plan checklist updates

  • Marked implemented: F029, F030, F031, F032

Remaining open items after this pass

  • Features: none
  • Tests: none