Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

14 KiB
Raw Permalink Blame History

PRD — API Rate Limiting and Outbound Ticket Webhooks

  • Slug: api-rate-limiting-and-ticket-webhooks
  • Date: 2026-05-05
  • Status: Draft
  • Source plans:
    • /Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/api-rate-limiting-plan.md
    • /Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/ticket-webhooks-plan.md

Summary

Two complementary protections against "noisy poller" pressure on the public REST API and the underlying Citus cluster:

  1. API rate limiting (guardrail) — a per-(tenant, api_key_id) token-bucket rate limit on every authenticated /api/v1/* request, fail-open on Redis outage, configurable per tenant.
  2. Outbound ticket webhooks (cure) — finish the partially-scaffolded webhook delivery pipeline so well-behaved customers can subscribe to ticket events instead of polling. Includes signed HTTP delivery, retries with backoff, a minimal admin UI, and a curated payload shape.

The two ship as one plan because they share infrastructure (the TokenBucketRateLimiter namespace work is required by both — the rate limiter uses namespace 'api', the webhook delivery worker uses namespace 'webhook-out' for per-webhook outbound caps).

Problem

Production telemetry on 2026-05-04 traced intermittent Citus "remaining connection slots are reserved for non-replication superuser connections" errors to a single external integration polling GET /api/v1/tickets/<id> for 6 specific ticket IDs once per minute from 52.53.71.0. When the 6-call burst lands at a minute boundary alongside other work, ~18% of the calls fail with HTTP 500.

Today there is no rate limit on the public REST API and no working outbound webhooks (the system is scaffolded but the delivery method is mocked and the data tables don't exist), so:

  • Customers have no choice but to poll if they need near-real-time data.
  • A single key can pressure the cluster regardless of other work.

Goals

  • Stop a single tenant or API key from monopolizing Citus worker connections via runaway polling.
  • Give well-behaved customers a way to subscribe to ticket lifecycle events (ticket.created, ticket.updated, ticket.assigned, ticket.status_changed, ticket.closed, ticket.comment.added) and receive signed HTTP POSTs instead of polling.
  • Migrate the noisy customer to webhooks once shipped.
  • Ship the rate-limit guardrail first (smaller, unblocks the immediate problem); ship webhook delivery next (larger, removes the cause).

Non-goals

  • Not implementing custom payload templates / Handlebars rendering for webhooks in v1.
  • Not implementing webhook subscriptions for non-ticket entities in v1 (projects / clients / contacts / invoices come later — pattern is the same).
  • Not changing internal email/notification subscribers — webhooks are an additional output, not a replacement.
  • Not changing UI / Server-Action traffic — rate limiting only protects the external API surface (x-api-key-authenticated endpoints).
  • Not building a customer-facing "webhook explorer" UI in v1; admin can manage via API and a basic settings page.
  • Not changing Citus connection-pool config (separate but complementary work).
  • Not introducing new queue/runtime dependencies — reuse the existing Redis client and the DelayedEmailQueue ZSET pattern; do not add BullMQ.

Users and Primary Flows

Tenant administrator (target persona)

  1. Set or override an API rate limit: Settings → API Keys → "Rate Limit" → set per-key max_tokens and refill_per_min, or set a tenant-wide default.
  2. Create a webhook subscription: Settings → Webhooks → "New Webhook" → name, URL, event types (multi-select), custom headers, retry config → save → copy plaintext signing secret (shown once).
  3. Verify a webhook: "Send Test" delivers a synthetic payload to the configured URL using the live signing secret; result shown inline.
  4. Inspect deliveries: Webhook detail → paginated history with status, response body, retry button.
  5. Rotate signing secret: one click → new plaintext returned once.

External integration (consumer of webhooks and rate limits)

  1. Hit a 429: receives Retry-After and X-RateLimit-* headers; backs off and retries.
  2. Receive webhook: HTTP POST with X-Alga-Signature: t=<ts>,v1=<hex>; verifies HMAC; idempotently processes by event_id.

Internal noisy poller (the 52.53.71.0 integration)

  • Continues working under sane defaults (6 calls/min is well under the limit).
  • Receives a heads-up + migration guide pointing at the new webhook flow.

UX / UI Notes

  • API Keys settings: add a "Rate Limit" column to the existing AdminApiKeysSetup table, plus an inline edit form. Surface the current remaining tokens (read via getState).
  • Webhooks settings: new section under Settings, modeled on AdminApiKeysSetup. Must include create form, list view (status, last delivery, success rate), delivery history, secret reveal/rotate, pause/resume.
  • Neither feature requires a new top-level navigation entry — both live under Settings → Security or Settings → Integrations (TBD with design; tracked in Open Questions).

Requirements

Functional Requirements

API rate limiting

  • Every authenticated /api/v1/* request consumes 1 token from the (tenant, api_key_id) bucket in namespace 'api'.
  • 429 response on denial includes Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
  • Successful responses include X-RateLimit-Limit and X-RateLimit-Remaining.
  • Defaults: max_tokens = 120, refill_per_min = 60 (1 RPS sustained, 120 burst).
  • Per-(tenant, api_key_id) overrides via api_rate_limit_settings. Falls back to (tenant, NULL), then to hard-coded defaults.
  • Bypass list: health/version endpoints, internal-runner endpoints, mobile auth.
  • NM Store global-key path uses sentinel subjectId 'nm_store' so its traffic shares one tenant-scoped bucket.
  • Observation mode: RATE_LIMIT_ENFORCE=false logs denials instead of throwing. Default false until Stage 3 of rollout.
  • Fail-open if Redis is unavailable.

Outbound webhooks

  • Tenant admins create webhook subscriptions for ticket lifecycle events.
  • Internal events publish via the existing event bus (publishEvent({ eventType: 'TICKET_*', payload })); a new subscriber fans out to active webhooks.
  • Delivery is queued, not inline (Redis ZSET, mirrored on DelayedEmailQueue).
  • Each delivery is HMAC-SHA256 signed: X-Alga-Signature: t=<unix>,v1=<hex> over t + "." + body.
  • Per-webhook rate limit using the TokenBucketRateLimiter namespace 'webhook-out', dimension (tenant, webhookId). Default rate_limit_per_min = 100.
  • Retry policy: 1m → 5m → 30m → 2h → 12h, then abandon. 5 attempts total. Configurable per webhook via retry_config.
  • Auto-disable a webhook after 24h of all-failure deliveries; email the owning user.
  • Curated payload shape (see source plan §3.3) — stable subset of ticket fields plus changes diff for ticket.updated events. Comments include text/author/timestamp/internal flag, never attachments.
  • POST /api/v1/webhooks/[id]/test sends a synthetic webhook.test payload, recorded with is_test = true.
  • SSRF guard rejects RFC1918, loopback, link-local, CGNAT, and non-http(s) schemes before delivery (real and test).
  • Signing secret stored via the existing secret provider; column holds signing_secret_vault_path, never plaintext or hash.

Non-functional Requirements

  • Rate limiter adds ≤2 ms p99 to authenticated API requests when Redis is healthy.
  • Webhook delivery latency floor ~2 s (ZSET poll interval); acceptable.
  • At-least-once delivery semantics; idempotency key is event_id.
  • Tenant isolation is preserved end-to-end (subscriber filters by tenant, payload builder uses tenant-scoped getConnection).

Data / API / Integrations

New tables

  • api_rate_limit_settings (tenant-distributed): (tenant, api_key_id NULL, max_tokens, refill_per_min) with UNIQUE (tenant, api_key_id).
  • webhooks (tenant-distributed): subscription rows with signing_secret_vault_path, event_types text[], retry/rate-limit config, rolling stats columns, auto_disabled_at.
  • webhook_deliveries (tenant-distributed): one row per attempt; is_test boolean, next_retry_at, response capture.

New / modified APIs

  • TokenBucketRateLimiter: signature change — namespace as required first parameter on tryConsume / getState / getBucketKey / getBucketConfig. BucketConfigGetter widened to (tenantId, subjectId?) => BucketConfig. initialize() accepts Record<namespace, BucketConfigGetter>.
  • ApiError interface: optional headers?: Record<string, string>; handleApiError merges them into the NextResponse.
  • createSuccessResponse / createPaginatedResponse: optional extraHeaders parameter.
  • New helper enforceApiRateLimit(req, context) called from ApiBaseController.authenticate, withApiKeyAuth, and withAuth.
  • webhookEventTypeSchema extended with 'ticket.comment.added'.
  • Webhook controller stubs implemented (or removed): getDeliveryDetails, getWebhookHealth, getWebhookSubscriptions (read-only), rotateWebhookSecret, verifyWebhookSignature, listAvailableEvents. Deferred stubs (bulk, templates, transformations, etc.) have their routes deleted, not left as 501s.

Reused infrastructure (no new dependency)

  • getRedisClient for both buckets and the webhook ZSET.
  • DelayedEmailQueue pattern for the webhook poller class.
  • getSecretProviderInstance for signing-secret storage.
  • Existing event bus (publishEvent / getEventBus().subscribe) and the TICKET_* schemas in packages/event-schemas.

Security / Permissions

  • Webhook signing secrets never leave the secret provider once written; GET responses for webhook rows must omit the column entirely (covered by an integration assertion).
  • SSRF protection enforced server-side in both real and test delivery, with an env-var escape hatch (WEBHOOK_SSRF_ALLOW_PRIVATE) for staging/local only.
  • Rate-limit configuration writes scoped to tenant admins (RBAC: existing api_keys.update permission applies; webhook CRUD uses webhook.* — added if not present).
  • signing_secret_vault_path column never exposed in the API response shape.

Observability

  • Rate-limit metrics: api_rate_limit_consumed_total{tenant,api_key_id,outcome}, api_rate_limit_remaining{tenant,api_key_id}, api_rate_limit_redis_unavailable_total.
  • Webhook metrics: webhook_deliveries_total{tenant,webhook_id,outcome}, webhook_delivery_duration_ms histogram, webhook_queue_depth gauge, webhook_auto_disabled_total{tenant,webhook_id}.
  • Structured WARN log on every throttle and on every Redis fail-open.
  • Grafana panel: top throttled (tenant, api_key_id) and top failing webhooks.

Rollout / Migration

  1. Citus worker max_connections bump — out of band, ~1 hour, eliminates the immediate 500s. Not part of this plan.
  2. Rate limiter MVP — Stages 13:
    • Stage 1 (observation): RATE_LIMIT_ENFORCE=false. Measure for one week.
    • Stage 2 (notify outliers): email tenants whose keys would have been throttled.
    • Stage 3 (enforce): flip RATE_LIMIT_ENFORCE=true.
    • Stage 4: remove the env-var bypass after ~2 weeks stable.
  3. Webhook MVP — Stages 14:
    • Stage 1 (dark launch): ship behind a feature flag; internal testing against webhook.site.
    • Stage 2 (invite-only beta): enable for a handful of API-heavy tenants, including the noisy poller.
    • Stage 3 (GA): open to all tenants; publish docs.
    • Stage 4: tighten polled REST rate limits once webhook adoption is healthy.
  4. The noisy poller (52.53.71.0) gets a personal note + migration guide.

Open Questions

  1. Should we expose ticket.deleted? Internal TICKET_DELETED exists. Defer to v2 unless the noisy poller specifically asks.
  2. Per-tenant webhook count limit? Cap at 50 per tenant default.
  3. Should ticket.status_changed include previous_status_id/ previous_status_name? Yes — captured as a feature.
  4. Webhook-side filtering by entity_ids? Implement in v1 — directly addresses the noisy poller's "tell me about these 6 tickets" pattern.
  5. IA placement of the new settings UI sections — Settings → Security or Settings → Integrations? Confirm with design.
  6. Per-tenant rate-limit cap on top of per-key buckets? Defer until data shows we need it.
  7. Per-endpoint cost weights (/search costs more than /get)? Defer until observation data shows pressure differences.

Acceptance Criteria (Definition of Done)

Rate limiter

  • A test API key making 121 requests in 60 seconds receives 429 on the 121st with the documented headers, and a different key in the same tenant is unaffected.
  • The email path's existing rate-limit behavior is unchanged on a baseline notification_settings.rate_limit_per_minute value.
  • RATE_LIMIT_ENFORCE=false lets denials through but emits the same metrics and headers as enforce mode.
  • An api_rate_limit_settings row with (tenant, api_key_id) overrides the tenant default; clearForKey returns to the tenant default within the cache TTL.

Webhooks

  • TICKET_ASSIGNED published in tenant A enqueues a delivery job for an active webhook in tenant A subscribed to ticket.assigned, and does not enqueue a job for any webhook in tenant B.
  • The WebhookDeliveryQueue poller successfully delivers to a stubbed HTTP server, persists a row in webhook_deliveries, and updates webhook stats columns.
  • A webhook URL pointing at 127.0.0.1 or 10.0.0.5 is rejected before delivery (production mode), and accepted with WEBHOOK_SSRF_ALLOW_PRIVATE=true.
  • HMAC verification with the documented recipe matches the server signature byte-for-byte.
  • Signing secret never appears in any GET webhook response body.
  • 5 failed attempts mark the delivery abandoned; 24h of all-failure deliveries auto-disable the webhook.
  • Webhooks settings UI: an admin can create, view deliveries, send a test, rotate the secret, and pause/resume a webhook.