PSA/docs/architecture/db-transaction-guardrails.md
Hermes 284313f908
2026-06-22 16:12:17 -05:00

DB transaction guardrails and after-commit work

Rules and safety nets introduced by the SLA close/reopen deadlock fix (.ai/sla_close_deadlock_proper_fix_plan.md has the full investigation).

Rules for transactional code

  1. One DB writer per ticket row per logical operation. SLA column mutations happen exactly once, in the caller's transaction. The SLA "backend" (ISlaBackend) schedules external side effects only — it never re-does a DB write. The CE PgBossSlaBackend mutation hooks are no-ops.
  2. No network or cross-connection work inside an open transaction. Event publishing and backend scheduling run after commit:
    • registerAfterCommit(trx, hook, label?) (@alga-psa/db) queues work that the transaction-owning withTransaction frame flushes after a successful commit, in registration order. Hooks are dropped on rollback. Nested withTransaction frames share the owner's trx, so their hooks flush once, at the outer commit. Pass a label (e.g. "TICKET_CLOSED ticket=<id>") so a failed hook is traceable in logs. Hook failures are logged and swallowed, so events are at-most-once: there is no outbox, a publish that fails after commit is lost, and the committed write stands.
    • SLA write functions return backendActions; callers dispatch them with dispatchSlaBackendActions() (@alga-psa/sla) after their transaction resolves.
  3. SLA writes are serialized per ticket. Every SLA write entry point takes pg_advisory_xact_lock(hashtext('sla:<tenant>:<ticket>')) first (acquireTicketSlaLock). The lock is transaction-scoped, so it is safe under pgbouncer transaction pooling and releases itself at commit or rollback.
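
The after-commit contract in rule 2 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the @alga-psa/db source: it is synchronous for brevity, it tracks the open transaction in a module-level variable, and the Trx shape and the log array are hypothetical stand-ins for the real transaction object and logger. Only the names registerAfterCommit and withTransaction come from the text above.

```typescript
// Hypothetical model of the after-commit contract; the real helpers are async
// and wrap a database transaction.
type Hook = () => void;

interface Trx {
  hooks: { hook: Hook; label?: string }[];
}

const log: string[] = [];       // stand-in for the real logger
let current: Trx | null = null; // the open transaction, if any

function registerAfterCommit(trx: Trx, hook: Hook, label?: string): void {
  trx.hooks.push({ hook, label });
}

function withTransaction<T>(fn: (trx: Trx) => T): T {
  if (current) {
    // Nested frame: share the owner's trx and leave the flush to the owner.
    return fn(current);
  }
  const trx: Trx = { hooks: [] };
  current = trx;
  let result: T;
  try {
    result = fn(trx); // a normal return models a successful commit
  } catch (err) {
    current = null;
    throw err; // models rollback: queued hooks are dropped, never run
  }
  current = null;
  // After commit: flush in registration order; log and swallow failures.
  for (const { hook, label } of trx.hooks) {
    try {
      hook();
    } catch (err) {
      log.push(`after-commit hook failed [${label ?? "unlabeled"}]: ${String(err)}`);
    }
  }
  return result;
}

// Usage: hooks from the outer and the nested frame flush once, after commit.
const order: string[] = [];
withTransaction((trx) => {
  registerAfterCommit(trx, () => order.push("outer-1"));
  withTransaction((inner) => {
    registerAfterCommit(inner, () => order.push("nested"));
  });
  registerAfterCommit(trx, () => {
    throw new Error("publish failed");
  }, "TICKET_CLOSED ticket=42");
  registerAfterCommit(trx, () => order.push("outer-2"));
});
```

In this model the failing hook leaves one entry in log and does not stop the hooks registered after it, which mirrors the at-most-once behavior described above: the write has already committed, and the failed publish is not retried.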

Event bus poison resistance

  • Handler success is tracked per (event, handler) (Redis set processed_event_handlers:<tenant>), so one failing handler's redelivery never re-runs co-subscribers that already succeeded (e.g. outbound webhooks on the shared default-channel streams). Subscribers that share a stream with same-named handler functions must pass a distinct subscriberId to subscribe().
  • Messages delivered more than eventBus.maxDeliveries times (default 10, env REDIS_STREAM_MAX_DELIVERIES) are moved to <stream>:dead-letter and acked. Dead-letter entries keep the original payload plus sourceStream/sourceMessageId/deliveries/deadLetteredAt for inspection and replay. The write is idempotent (marker set dead_lettered_messages:<stream>, 3-day TTL), so a retry after xAdd succeeded but xAck failed does not duplicate the entry. Monitor dead-letter volume.
  • A handler that throws gets a bounded retry (redelivery up to the cap), not an infinite storm.
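
The three properties above can be sketched with an in-memory model. Everything here is an assumption made for illustration: the StreamModel class, the Message shape, and the use of plain Set objects in place of the Redis sets processed_event_handlers:<tenant> and dead_lettered_messages:<stream>. TTLs and real stream semantics are not modeled, and the actual event bus code may be structured differently.

```typescript
type Handler = (payload: unknown) => void;

interface Message {
  id: string;
  payload: unknown;
  deliveries: number;
}

interface DeadLetter {
  payload: unknown;
  sourceStream: string;
  sourceMessageId: string;
  deliveries: number;
  deadLetteredAt: string;
}

class StreamModel {
  readonly processed = new Set<string>();    // per (event, handler) success markers
  readonly deadLettered = new Set<string>(); // idempotency markers for dead-letter writes
  readonly deadLetters: DeadLetter[] = [];   // models <stream>:dead-letter
  private readonly subscribers: { subscriberId: string; handler: Handler }[] = [];

  constructor(readonly stream: string, readonly maxDeliveries: number = 10) {}

  subscribe(subscriberId: string, handler: Handler): void {
    this.subscribers.push({ subscriberId, handler });
  }

  /** One delivery attempt. Returns true when the message can be acked. */
  deliver(msg: Message): boolean {
    msg.deliveries += 1;
    if (msg.deliveries > this.maxDeliveries) {
      // Over the cap: dead-letter once, then ack so redelivery stops.
      if (!this.deadLettered.has(msg.id)) {
        this.deadLettered.add(msg.id);
        this.deadLetters.push({
          payload: msg.payload,
          sourceStream: this.stream,
          sourceMessageId: msg.id,
          deliveries: msg.deliveries,
          deadLetteredAt: new Date().toISOString(),
        });
      }
      return true;
    }
    let allSucceeded = true;
    for (const { subscriberId, handler } of this.subscribers) {
      const key = `${msg.id}:${subscriberId}`;
      if (this.processed.has(key)) continue; // already succeeded: never re-run
      try {
        handler(msg.payload);
        this.processed.add(key);
      } catch {
        allSucceeded = false; // leave unacked so the message is redelivered
      }
    }
    return allSucceeded;
  }
}

// Usage: a healthy subscriber shares a stream with one that always throws.
const bus = new StreamModel("default", 3);
let webhookCalls = 0;
bus.subscribe("webhooks", () => {
  webhookCalls += 1;
});
bus.subscribe("flaky", () => {
  throw new Error("always fails");
});

const msg: Message = { id: "1-0", payload: { type: "TICKET_CLOSED" }, deliveries: 0 };
let failedAttempts = 0;
while (!bus.deliver(msg)) failedAttempts += 1;
```

With a cap of 3, the loop ends on the fourth delivery, when the message is dead-lettered and acked. The healthy subscriber runs only once across all redeliveries, and delivering the same message again (as after a failed ack) does not add a second dead-letter entry.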

Postgres timeouts (defense in depth)

Migration 20260609120000_set_app_role_db_guardrail_timeouts.cjs sets on the app role (DB_USER_SERVER, default app_user):

  • idle_in_transaction_session_timeout = 60s — a session idle mid-transaction is aborted and releases its locks. This fires on a single continuous 60s idle gap between statements, not on total transaction duration; steady statement loops are unaffected. 60s (not lower) leaves headroom for a slow external call awaited between statements — waiters are already protected by lock_timeout regardless of how long the holder sits.
  • lock_timeout = 8s — statements fail fast instead of queueing behind a stuck lock holder.

These are role-level GUCs (not pool afterCreate SETs) because pgbouncer runs pool_mode = transaction: session-level SETs issued at connection creation do not reliably follow a client across backend remapping, while role GUCs resolve server-side at backend session start. The admin/migration role is deliberately excluded so long-running DDL stays legal.
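
As an illustration of what role-level GUCs look like, the sketch below builds the two ALTER ROLE ... SET statements for a given role. It is not the content of the migration file: the helper names guardrailStatements and quoteIdent are hypothetical, and how the real migration resolves the role name and executes SQL is not shown here.

```typescript
/** Quote a Postgres identifier by wrapping it in double quotes and doubling embedded ones. */
function quoteIdent(identifier: string): string {
  return `"${identifier.replace(/"/g, '""')}"`;
}

/**
 * Build the role-level settings described above. The caller supplies the app
 * role (DB_USER_SERVER in the deployment, "app_user" by default).
 */
function guardrailStatements(appRole: string = "app_user"): string[] {
  const ident = quoteIdent(appRole);
  return [
    `ALTER ROLE ${ident} SET idle_in_transaction_session_timeout = '60s'`,
    `ALTER ROLE ${ident} SET lock_timeout = '8s'`,
  ];
}

// Usage: print the statements a migration would run as the admin role.
const guardrailSql = guardrailStatements();
guardrailSql.forEach((statement) => console.log(statement + ";"));
```

Settings applied with ALTER ROLE take effect for sessions started after the change, so existing backend connections keep their old values until they are recycled.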

pgbouncer/pgbouncer.ini.template keeps idle_transaction_timeout = 120 (seconds) as a last-resort reaper for whatever the role GUCs don't cover. It must stay above the 60s idle_in_transaction_session_timeout role GUC so the gentler server-side abort fires before pgbouncer kills the connection.

Verify on a deployment:

-- as the app user, through pgbouncer
SHOW idle_in_transaction_session_timeout;  -- 1min (Postgres displays 60s as 1min)
SHOW lock_timeout;                         -- 8s
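
A deployment check can compare these values by duration rather than by string, since SHOW prints a timeout in the largest unit that divides it evenly (60s is displayed as 1min). The sketch below is a hypothetical helper, not part of the codebase: it takes the two SHOW results and the pgbouncer idle_transaction_timeout (in seconds) as plain inputs and returns a list of problems, including the ordering requirement between pgbouncer and the role GUC.

```typescript
// Milliseconds per unit for the duration units Postgres prints for these settings.
const UNIT_MS: Record<string, number> = { ms: 1, s: 1000, min: 60000, h: 3600000, d: 86400000 };

/** Parse SHOW output such as "1min", "8s", or "0" (bare numbers are milliseconds). */
function parsePgDurationMs(text: string): number {
  const match = /^(\d+)\s*(ms|s|min|h|d)?$/.exec(text.trim());
  if (!match) throw new Error(`unrecognized duration: ${text}`);
  return Number(match[1]) * UNIT_MS[match[2] ?? "ms"];
}

interface ObservedSettings {
  idleInTransactionSessionTimeout: string; // SHOW output, as the app user via pgbouncer
  lockTimeout: string;                     // SHOW output, same session
  pgbouncerIdleTransactionTimeoutSeconds: number; // from pgbouncer.ini
}

function checkGuardrails(observed: ObservedSettings): string[] {
  const problems: string[] = [];
  const idleMs = parsePgDurationMs(observed.idleInTransactionSessionTimeout);
  const lockMs = parsePgDurationMs(observed.lockTimeout);
  if (idleMs !== 60000) {
    problems.push(`idle_in_transaction_session_timeout is ${idleMs}ms, expected 60000ms`);
  }
  if (lockMs !== 8000) {
    problems.push(`lock_timeout is ${lockMs}ms, expected 8000ms`);
  }
  // pgbouncer must fire later than the server-side abort.
  if (observed.pgbouncerIdleTransactionTimeoutSeconds * 1000 <= idleMs) {
    problems.push("pgbouncer idle_transaction_timeout must be above the role GUC");
  }
  return problems;
}

// Usage: values as they would appear on a correctly configured deployment.
const guardrailProblems = checkGuardrails({
  idleInTransactionSessionTimeout: "1min",
  lockTimeout: "8s",
  pgbouncerIdleTransactionTimeoutSeconds: 120,
});
```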