# PRD: API Rate Limiting and Outbound Ticket Webhooks

- Slug: `api-rate-limiting-and-ticket-webhooks`
- Date: `2026-05-05`
- Status: Draft
- Source plans:
  - `/Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/api-rate-limiting-plan.md`
  - `/Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/ticket-webhooks-plan.md`

## Summary

Two complementary protections against "noisy poller" pressure on the public REST API and the underlying Citus cluster:

1. **API rate limiting** (guardrail): a per-`(tenant, api_key_id)` token-bucket rate limit on every authenticated `/api/v1/*` request, fail-open on Redis outage, configurable per tenant.
2. **Outbound ticket webhooks** (cure): finish the partially-scaffolded webhook delivery pipeline so well-behaved customers can subscribe to ticket events instead of polling. Includes signed HTTP delivery, retries with backoff, a minimal admin UI, and a curated payload shape.

The two ship as one plan because they share infrastructure: the `TokenBucketRateLimiter` namespace work is required by both. The rate limiter uses namespace `'api'`; the webhook delivery worker uses namespace `'webhook-out'` for per-webhook outbound caps.

## Problem

Production telemetry on 2026-05-04 traced intermittent Citus "`remaining connection slots are reserved for non-replication superuser connections`" errors to a single external integration polling `GET /api/v1/tickets/` for 6 specific ticket IDs once per minute from `52.53.71.0`. When the 6-call burst lands at a minute boundary alongside other work, ~18% of the calls fail with HTTP 500.

Today there is **no rate limit** on the public REST API and **no working outbound webhooks** (the system is scaffolded but the delivery method is mocked and the data tables don't exist), so:

- Customers have no choice but to poll if they need near-real-time data.
- A single key can pressure the cluster regardless of other work.

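
To make the token-bucket guardrail named in the Summary concrete, the following is a minimal in-memory sketch, not the real `TokenBucketRateLimiter` (which is Redis-backed and whose exact signatures this PRD only outlines). The class name `InMemoryTokenBucket`, the result shape, and the key format are hypothetical; the namespace-first argument order follows the signature change described under Data / API / Integrations.

```typescript
// Hypothetical in-memory stand-in for a namespaced token bucket.
// The production limiter stores bucket state in Redis instead of a Map.
interface BucketState {
  tokens: number;
  lastRefillMs: number;
}

interface ConsumeResult {
  allowed: boolean;
  remaining: number; // whole tokens left after this call
  retryAfterSec: number; // 0 when the request is allowed
}

class InMemoryTokenBucket {
  private readonly buckets = new Map<string, BucketState>();
  private readonly maxTokens: number;
  private readonly refillPerMin: number;

  constructor(maxTokens: number, refillPerMin: number) {
    this.maxTokens = maxTokens;
    this.refillPerMin = refillPerMin;
  }

  tryConsume(namespace: string, tenant: string, subjectId: string, nowMs: number): ConsumeResult {
    // Assumed key layout: one bucket per (namespace, tenant, subject).
    const key = `${namespace}:${tenant}:${subjectId}`;
    const state = this.buckets.get(key) ?? { tokens: this.maxTokens, lastRefillMs: nowMs };

    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedMs = Math.max(0, nowMs - state.lastRefillMs);
    state.tokens = Math.min(this.maxTokens, state.tokens + (elapsedMs * this.refillPerMin) / 60_000);
    state.lastRefillMs = nowMs;

    let allowed = false;
    let retryAfterSec = 0;
    if (state.tokens >= 1) {
      state.tokens -= 1;
      allowed = true;
    } else {
      // Seconds until one whole token is available again.
      retryAfterSec = Math.ceil(((1 - state.tokens) * 60) / this.refillPerMin);
    }

    this.buckets.set(key, state);
    return { allowed, remaining: Math.floor(state.tokens), retryAfterSec };
  }
}
```

With a 120-token burst and a 60-per-minute refill, a 6-call burst once per minute (the pattern described above) never drains the bucket, while a runaway loop on one key is denied without affecting other keys or other namespaces.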
## Goals

- Stop a single tenant or API key from monopolizing Citus worker connections via runaway polling.
- Give well-behaved customers a way to subscribe to ticket lifecycle events (`ticket.created`, `ticket.updated`, `ticket.assigned`, `ticket.status_changed`, `ticket.closed`, `ticket.comment.added`) and receive signed HTTP POSTs instead of polling.
- Migrate the noisy customer to webhooks once shipped.
- Ship the rate-limit guardrail first (smaller, unblocks the immediate problem); ship webhook delivery next (larger, removes the cause).

## Non-goals

- Not implementing custom payload templates / Handlebars rendering for webhooks in v1.
- Not implementing webhook subscriptions for non-ticket entities in v1 (projects / clients / contacts / invoices come later; the pattern is the same).
- Not changing internal email/notification subscribers: webhooks are an additional output, not a replacement.
- Not changing UI / Server-Action traffic: rate limiting only protects the external API surface (`x-api-key`-authenticated endpoints).
- Not building a customer-facing "webhook explorer" UI in v1; admin can manage via API and a basic settings page.
- Not changing Citus connection-pool config (separate but complementary work).
- Not introducing new queue/runtime dependencies: reuse the existing Redis client and the `DelayedEmailQueue` ZSET pattern; do not add BullMQ.

## Users and Primary Flows

**Tenant administrator** (target persona)

1. *Set or override an API rate limit:* Settings → API Keys → "Rate Limit" → set per-key `max_tokens` and `refill_per_min`, or set a tenant-wide default.
2. *Create a webhook subscription:* Settings → Webhooks → "New Webhook" → name, URL, event types (multi-select), custom headers, retry config → save → copy plaintext signing secret (shown once).
3. *Verify a webhook:* "Send Test" delivers a synthetic payload to the configured URL using the live signing secret; result shown inline.
4. *Inspect deliveries:* Webhook detail → paginated history with status, response body, retry button.
5. *Rotate signing secret:* one click → new plaintext returned once.

**External integration** (consumer of webhooks and rate limits)

1. *Hit a 429:* receives `Retry-After` and `X-RateLimit-*` headers; backs off and retries.
2. *Receive webhook:* HTTP POST with `X-Alga-Signature: t=<timestamp>,v1=<signature>`; verifies HMAC; idempotently processes by `event_id`.

**Internal noisy poller** (the `52.53.71.0` integration)

- Continues working under sane defaults (6 calls/min is well under the limit).
- Receives a heads-up + migration guide pointing at the new webhook flow.

## UX / UI Notes

- **API Keys settings:** add a "Rate Limit" column to the existing `AdminApiKeysSetup` table, plus an inline edit form. Surface the current remaining tokens (read via `getState`).
- **Webhooks settings:** new section under Settings, modeled on `AdminApiKeysSetup`. Must include create form, list view (status, last delivery, success rate), delivery history, secret reveal/rotate, pause/resume.
- Neither feature requires a new top-level navigation entry; both live under Settings → Security or Settings → Integrations (TBD with design; tracked in Open Questions).

## Requirements

### Functional Requirements

**API rate limiting**

- Every authenticated `/api/v1/*` request consumes 1 token from the `(tenant, api_key_id)` bucket in namespace `'api'`.
- 429 response on denial includes `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`.
- Successful responses include `X-RateLimit-Limit` and `X-RateLimit-Remaining`.
- Defaults: `max_tokens = 120`, `refill_per_min = 60` (1 RPS sustained, 120 burst).
- Per-`(tenant, api_key_id)` overrides via `api_rate_limit_settings`. Falls back to `(tenant, NULL)`, then to hard-coded defaults.
- Bypass list: health/version endpoints, internal-runner endpoints, mobile auth.
- NM Store global-key path uses sentinel subjectId `'nm_store'` so its traffic shares one tenant-scoped bucket.
- Observation mode: `RATE_LIMIT_ENFORCE=false` logs denials instead of throwing. Default `false` until Stage 3 of rollout.
- Fail-open if Redis is unavailable.

**Outbound webhooks**

- Tenant admins create webhook subscriptions for ticket lifecycle events.
- Internal events publish via the existing event bus (`publishEvent({ eventType: 'TICKET_*', payload })`); a new subscriber fans out to active webhooks.
- Delivery is queued, not inline (Redis ZSET, modeled on `DelayedEmailQueue`).
- Each delivery is HMAC-SHA256 signed: `X-Alga-Signature: t=<timestamp>,v1=<signature>`, where the signature is computed over `t + "." + body`.
- Per-webhook rate limit using the `TokenBucketRateLimiter` namespace `'webhook-out'`, dimension `(tenant, webhookId)`. Default `rate_limit_per_min = 100`.
- Retry policy: 1m → 5m → 30m → 2h → 12h, then abandon. 5 attempts total. Configurable per webhook via `retry_config`.
- Auto-disable a webhook after 24h of all-failure deliveries; email the owning user.
- Curated payload shape (see source plan §3.3): stable subset of ticket fields plus `changes` diff for `ticket.updated` events. Comments include text/author/timestamp/internal flag, never attachments.
- `POST /api/v1/webhooks/[id]/test` sends a synthetic `webhook.test` payload, recorded with `is_test = true`.
- SSRF guard rejects RFC1918, loopback, link-local, CGNAT, and non-`http(s)` schemes before delivery (real and test).
- Signing secret stored via the existing secret provider; column holds `signing_secret_vault_path`, never plaintext or hash.

### Non-functional Requirements

- Rate limiter adds ≤2 ms p99 to authenticated API requests when Redis is healthy.
- Webhook delivery latency floor ~2 s (ZSET poll interval); acceptable.
- At-least-once delivery semantics; idempotency key is `event_id`.
- Tenant isolation is preserved end-to-end (subscriber filters by `tenant`, payload builder uses tenant-scoped `getConnection`).
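
The signing requirement above (HMAC-SHA256 over `t + "." + body`) can be sketched from both the sender and receiver side. This is an illustration under stated assumptions, not the documented recipe: the PRD does not fix the timestamp unit, the digest encoding, or a replay window, so the sketch assumes Unix seconds, lowercase hex, and a 5-minute tolerance. The function names `signPayload` and `verifySignature` are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: receivers reject signatures older (or newer) than 5 minutes.
const TOLERANCE_SEC = 300;

// Sender side: build the X-Alga-Signature header value.
// Assumptions: `t` is Unix seconds and `v1` is a lowercase hex digest.
function signPayload(secret: string, timestampSec: number, body: string): string {
  const digest = createHmac("sha256", secret).update(`${timestampSec}.${body}`).digest("hex");
  return `t=${timestampSec},v1=${digest}`;
}

// Receiver side: parse the header, check freshness, compare in constant time.
function verifySignature(secret: string, header: string, body: string, nowSec: number): boolean {
  const parts = new Map<string, string>();
  for (const piece of header.split(",")) {
    const idx = piece.indexOf("=");
    if (idx > 0) parts.set(piece.slice(0, idx).trim(), piece.slice(idx + 1).trim());
  }

  const t = Number(parts.get("t"));
  const v1 = parts.get("v1");
  if (!Number.isInteger(t) || v1 === undefined) return false;
  if (Math.abs(nowSec - t) > TOLERANCE_SEC) return false; // stale or future-dated

  const expected = createHmac("sha256", secret).update(`${t}.${body}`).digest();
  const received = Buffer.from(v1, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The receiver must verify against the raw request body bytes as received, before any JSON parsing or re-serialization, otherwise the digest will not match. Because delivery is at-least-once, a receiver would still deduplicate on `event_id` after the signature check passes.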
## Data / API / Integrations

### New tables

- `api_rate_limit_settings` (tenant-distributed): `(tenant, api_key_id NULL, max_tokens, refill_per_min)` with `UNIQUE (tenant, api_key_id)`.
- `webhooks` (tenant-distributed): subscription rows with `signing_secret_vault_path`, `event_types text[]`, retry/rate-limit config, rolling stats columns, `auto_disabled_at`.
- `webhook_deliveries` (tenant-distributed): one row per attempt; `is_test boolean`, `next_retry_at`, response capture.

### New / modified APIs

- `TokenBucketRateLimiter`: signature change, with namespace as required first parameter on `tryConsume` / `getState` / `getBucketKey` / `getBucketConfig`. `BucketConfigGetter` widened to `(tenantId, subjectId?) => BucketConfig`. `initialize()` accepts `Record`.
- `ApiError` interface: optional `headers?: Record`; `handleApiError` merges them into the `NextResponse`.
- `createSuccessResponse` / `createPaginatedResponse`: optional `extraHeaders` parameter.
- New helper `enforceApiRateLimit(req, context)` called from `ApiBaseController.authenticate`, `withApiKeyAuth`, and `withAuth`.
- `webhookEventTypeSchema` extended with `'ticket.comment.added'`.
- Webhook controller stubs implemented (or removed): `getDeliveryDetails`, `getWebhookHealth`, `getWebhookSubscriptions` (read-only), `rotateWebhookSecret`, `verifyWebhookSignature`, `listAvailableEvents`. Deferred stubs (bulk, templates, transformations, etc.) have their routes deleted, not left as 501s.

### Reused infrastructure (no new dependency)

- `getRedisClient` for both buckets and the webhook ZSET.
- `DelayedEmailQueue` pattern for the webhook poller class.
- `getSecretProviderInstance` for signing-secret storage.
- Existing event bus (`publishEvent` / `getEventBus().subscribe`) and the `TICKET_*` schemas in `packages/event-schemas`.
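
The widened `BucketConfigGetter` shape, `(tenantId, subjectId?) => BucketConfig`, combined with the fallback order from the functional requirements (per-key row, then `(tenant, NULL)`, then hard-coded defaults), might resolve as in the sketch below. The field names on `BucketConfig`, the `RateLimitRow` type, and `resolveBucketConfig` are assumptions for illustration; an in-memory array stands in for the `api_rate_limit_settings` query and any caching layer.

```typescript
// Assumed shape; the real BucketConfig fields may be named differently.
interface BucketConfig {
  maxTokens: number;
  refillPerMin: number;
}

// Mirrors the api_rate_limit_settings columns listed above.
interface RateLimitRow {
  tenant: string;
  api_key_id: string | null; // null marks the tenant-wide default row
  max_tokens: number;
  refill_per_min: number;
}

// Hard-coded defaults from the functional requirements.
const DEFAULT_API_BUCKET: BucketConfig = { maxTokens: 120, refillPerMin: 60 };

function resolveBucketConfig(
  rows: RateLimitRow[],
  tenantId: string,
  subjectId?: string,
): BucketConfig {
  const tenantRows = rows.filter((r) => r.tenant === tenantId);

  // 1. Exact (tenant, api_key_id) override, if a subject was supplied.
  const perKey =
    subjectId === undefined ? undefined : tenantRows.find((r) => r.api_key_id === subjectId);
  // 2. Tenant-wide default row, stored with api_key_id = NULL.
  const tenantDefault = tenantRows.find((r) => r.api_key_id === null);

  const row = perKey ?? tenantDefault;
  // 3. Otherwise fall back to the hard-coded defaults.
  return row ? { maxTokens: row.max_tokens, refillPerMin: row.refill_per_min } : DEFAULT_API_BUCKET;
}
```

One point worth confirming during schema review: in PostgreSQL a plain `UNIQUE` constraint treats NULLs as distinct, so `UNIQUE (tenant, api_key_id)` on its own would not prevent two `(tenant, NULL)` rows. The sketch simply takes the first such row it finds.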
## Security / Permissions

- Webhook signing secrets never leave the secret provider once written; GET responses for webhook rows must omit the column entirely (covered by an integration assertion).
- SSRF protection enforced server-side in both real and test delivery, with an env-var escape hatch (`WEBHOOK_SSRF_ALLOW_PRIVATE`) for staging/local only.
- Rate-limit configuration writes scoped to tenant admins (RBAC: existing `api_keys.update` permission applies; webhook CRUD uses `webhook.*`, added if not present).
- `signing_secret_vault_path` column never exposed in the API response shape.

## Observability

- Rate-limit metrics: `api_rate_limit_consumed_total{tenant,api_key_id,outcome}`, `api_rate_limit_remaining{tenant,api_key_id}`, `api_rate_limit_redis_unavailable_total`.
- Webhook metrics: `webhook_deliveries_total{tenant,webhook_id,outcome}`, `webhook_delivery_duration_ms` histogram, `webhook_queue_depth` gauge, `webhook_auto_disabled_total{tenant,webhook_id}`.
- Structured WARN log on every throttle and on every Redis fail-open.
- Grafana panel: top throttled `(tenant, api_key_id)` and top failing webhooks.

## Rollout / Migration

1. **Citus worker `max_connections` bump**: out of band, ~1 hour, eliminates the immediate 500s. Not part of this plan.
2. **Rate limiter MVP**, Stages 1–4:
   - *Stage 1 (observation):* `RATE_LIMIT_ENFORCE=false`. Measure for one week.
   - *Stage 2 (notify outliers):* email tenants whose keys would have been throttled.
   - *Stage 3 (enforce):* flip `RATE_LIMIT_ENFORCE=true`.
   - *Stage 4:* remove the env-var bypass after ~2 weeks stable.
3. **Webhook MVP**, Stages 1–4:
   - *Stage 1 (dark launch):* ship behind a feature flag; internal testing against `webhook.site`.
   - *Stage 2 (invite-only beta):* enable for a handful of API-heavy tenants, including the noisy poller.
   - *Stage 3 (GA):* open to all tenants; publish docs.
   - *Stage 4:* tighten polled REST rate limits once webhook adoption is healthy.
4. The noisy poller (`52.53.71.0`) gets a personal note + migration guide.

## Open Questions

1. Should we expose `ticket.deleted`? Internal `TICKET_DELETED` exists. Defer to v2 unless the noisy poller specifically asks.
2. Per-tenant webhook count limit? Cap at 50 per tenant default.
3. Should `ticket.status_changed` include `previous_status_id` / `previous_status_name`? **Yes**, captured as a feature.
4. Webhook-side filtering by `entity_ids`? **Implement in v1**: it directly addresses the noisy poller's "tell me about these 6 tickets" pattern.
5. IA placement of the new settings UI sections: Settings → Security or Settings → Integrations? Confirm with design.
6. Per-tenant rate-limit cap on top of per-key buckets? Defer until data shows we need it.
7. Per-endpoint cost weights (`/search` costs more than `/get`)? Defer until observation data shows pressure differences.

## Acceptance Criteria (Definition of Done)

**Rate limiter**

- A test API key sending 121 requests in a single burst (fast enough that no token refills; under the defaults the bucket regains one token per second) receives 429 on the 121st with the documented headers, and a different key in the same tenant is unaffected.
- The email path's existing rate-limit behavior is unchanged on a baseline `notification_settings.rate_limit_per_minute` value.
- `RATE_LIMIT_ENFORCE=false` lets requests that would have been denied through, but emits the same metrics and headers as enforce mode.
- An `api_rate_limit_settings` row with `(tenant, api_key_id)` overrides the tenant default; `clearForKey` returns to the tenant default within the cache TTL.

**Webhooks**

- `TICKET_ASSIGNED` published in tenant A enqueues a delivery job for an active webhook in tenant A subscribed to `ticket.assigned`, and does **not** enqueue a job for any webhook in tenant B.
- The `WebhookDeliveryQueue` poller successfully delivers to a stubbed HTTP server, persists a row in `webhook_deliveries`, and updates webhook stats columns.
- A webhook URL pointing at `127.0.0.1` or `10.0.0.5` is rejected before delivery (production mode), and accepted with `WEBHOOK_SSRF_ALLOW_PRIVATE=true`.
- HMAC verification with the documented recipe matches the server signature byte-for-byte.
- Signing secret never appears in any GET webhook response body.
- 5 failed attempts mark the delivery `abandoned`; 24h of all-failure deliveries auto-disable the webhook.
- Webhooks settings UI: an admin can create, view deliveries, send a test, rotate the secret, and pause/resume a webhook.
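
The SSRF acceptance criterion above can be illustrated with a small URL check. This is a partial sketch, not the planned guard: it inspects literal IP hosts only, so a real implementation would also need to resolve hostnames and check every resolved address (and handle IPv4-mapped IPv6 forms). The names `checkWebhookUrl`, `isPrivateIPv4`, and `isBlockedIPv6` are hypothetical, and the sketch assumes the `WEBHOOK_SSRF_ALLOW_PRIVATE` escape hatch relaxes only the address checks, not the scheme check.

```typescript
import { isIP } from "node:net";

// Ranges named in the requirements: RFC1918, loopback, link-local, CGNAT.
function isPrivateIPv4(ip: string): boolean {
  const octets = ip.split(".").map(Number);
  const a = octets[0] ?? -1;
  const b = octets[1] ?? -1;
  return (
    a === 10 || // RFC1918 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // RFC1918 172.16.0.0/12
    (a === 192 && b === 168) || // RFC1918 192.168.0.0/16
    a === 127 || // loopback 127.0.0.0/8
    (a === 169 && b === 254) || // link-local 169.254.0.0/16
    (a === 100 && b >= 64 && b <= 127) // CGNAT 100.64.0.0/10
  );
}

// IPv6 loopback (::1) and link-local (fe80::/10) only; not exhaustive.
function isBlockedIPv6(ip: string): boolean {
  const lower = ip.toLowerCase();
  return lower === "::1" || /^fe[89ab][0-9a-f]:/.test(lower);
}

function checkWebhookUrl(rawUrl: string, allowPrivate: boolean): { ok: boolean; reason?: string } {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: "invalid URL" };
  }
  // Assumption: the scheme check applies even when private addresses are allowed.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "scheme not allowed" };
  }
  if (allowPrivate) return { ok: true };

  // URL.hostname keeps the brackets around IPv6 literals; strip them for isIP.
  const host = url.hostname.replace(/^\[|\]$/g, "");
  if (host === "localhost") return { ok: false, reason: "loopback host" };

  const family = isIP(host);
  if (family === 4 && isPrivateIPv4(host)) return { ok: false, reason: "private IPv4 address" };
  if (family === 6 && isBlockedIPv6(host)) return { ok: false, reason: "blocked IPv6 address" };
  return { ok: true };
}
```

A caller would pass something like `process.env.WEBHOOK_SSRF_ALLOW_PRIVATE === "true"` as `allowPrivate`, and run the same check on both the real delivery path and the test endpoint.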