PSA/docs/architecture/db-transaction-guardrails.md
Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

83 lines
4.1 KiB
Markdown

# DB transaction guardrails and after-commit work
Rules and safety nets introduced by the SLA close/reopen deadlock fix
(`.ai/sla_close_deadlock_proper_fix_plan.md` has the full investigation).
## Rules for transactional code
1. **One DB writer per ticket row per logical operation.** SLA column
mutations happen exactly once, in the caller's transaction. The SLA
"backend" (`ISlaBackend`) schedules external side effects only — it never
re-does a DB write. The CE `PgBossSlaBackend` mutation hooks are no-ops.
2. **No network or cross-connection work inside an open transaction.** Event
publishing and backend scheduling run after commit:
- `registerAfterCommit(trx, hook, label?)` (`@alga-psa/db`) queues work
that the transaction-owning `withTransaction` frame flushes after a
successful commit, in registration order. Hooks are dropped on rollback.
Nested `withTransaction` frames share the owner's `trx`, so their hooks
flush once, at the outer commit. Pass a `label` (e.g.
`"TICKET_CLOSED ticket=<id>"`) so a failed hook is traceable in logs.
Hook failures are logged and swallowed: events are at-most-once — a
publish that fails after commit is lost (no outbox), the committed write
stands.
- SLA write functions return `backendActions`; callers dispatch them with
`dispatchSlaBackendActions()` (`@alga-psa/sla`) after their transaction
resolves.
3. **SLA writes are serialized per ticket.** Every SLA write entry point
takes `pg_advisory_xact_lock(hashtext('sla:<tenant>:<ticket>'))` first
(`acquireTicketSlaLock`). Transaction-scoped, so it is safe under
pgbouncer transaction pooling and self-releases at commit/rollback.
## Event bus poison resistance
- Handler success is tracked per `(event, handler)` (Redis set
`processed_event_handlers:<tenant>`), so one failing handler's redelivery
never re-runs co-subscribers that already succeeded (e.g. outbound
webhooks on the shared default-channel streams). Subscribers that share a
stream with same-named handler functions must pass a distinct
`subscriberId` to `subscribe()`.
- Messages delivered more than `eventBus.maxDeliveries` times (default 10,
env `REDIS_STREAM_MAX_DELIVERIES`) are moved to `<stream>:dead-letter`
and acked. Dead-letter entries keep the original payload plus
`sourceStream`/`sourceMessageId`/`deliveries`/`deadLetteredAt` for
inspection and replay. The write is idempotent (marker set
`dead_lettered_messages:<stream>`, 3-day TTL), so an xAdd-succeeded /
xAck-failed retry does not duplicate the entry. Monitor dead-letter
volume.
- A handler that throws gets a bounded retry (redelivery up to the cap),
not an infinite storm.
## Postgres timeouts (defense in depth)
Migration `20260609120000_set_app_role_db_guardrail_timeouts.cjs` sets on
the app role (`DB_USER_SERVER`, default `app_user`):
- `idle_in_transaction_session_timeout = 60s` — a session idle
mid-transaction is aborted and releases its locks. This fires on a single
continuous 60s idle gap between statements, not on total transaction
duration; steady statement loops are unaffected. 60s (not lower) leaves
headroom for a slow external call awaited between statements — waiters
are already protected by lock_timeout regardless of how long the holder
sits.
- `lock_timeout = 8s` — statements fail fast instead of queueing behind a
stuck lock holder.
These are role-level GUCs (not pool `afterCreate` SETs) because pgbouncer
runs `pool_mode = transaction`: session-level SETs issued at connection
creation do not reliably follow a client across backend remapping, while
role GUCs resolve server-side at backend session start. The admin/migration
role is deliberately excluded so long-running DDL stays legal.
`pgbouncer/pgbouncer.ini.template` keeps `idle_transaction_timeout = 120`
as a last-resort reaper for whatever the role GUCs don't cover. It must
stay above the role GUC so the gentler server-side abort fires before
pgbouncer kills the connection.
Verify on a deployment:
```sql
-- as the app user, through pgbouncer
SHOW idle_in_transaction_session_timeout; -- 60s
SHOW lock_timeout; -- 8s
```