PRD — API Rate Limiting and Outbound Ticket Webhooks
- Slug: api-rate-limiting-and-ticket-webhooks
- Date: 2026-05-05
- Status: Draft
- Source plans:
  - /Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/api-rate-limiting-plan.md
  - /Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/ticket-webhooks-plan.md
Summary
Two complementary protections against "noisy poller" pressure on the public REST API and the underlying Citus cluster:
- API rate limiting (guardrail): a per-(tenant, api_key_id) token-bucket rate limit on every authenticated /api/v1/* request, fail-open on Redis outage, configurable per tenant.
- Outbound ticket webhooks (cure): finish the partially scaffolded webhook delivery pipeline so well-behaved customers can subscribe to ticket events instead of polling. Includes signed HTTP delivery, retries with backoff, a minimal admin UI, and a curated payload shape.
The two ship as one plan because they share infrastructure (the
TokenBucketRateLimiter namespace work is required by both — the rate limiter
uses namespace 'api', the webhook delivery worker uses namespace
'webhook-out' for per-webhook outbound caps).
Problem
Production telemetry on 2026-05-04 traced intermittent Citus
"remaining connection slots are reserved for non-replication superuser connections" errors to a single external integration polling
GET /api/v1/tickets/<id> for 6 specific ticket IDs once per minute from
52.53.71.0. When the 6-call burst lands at a minute boundary alongside other
work, ~18% of the calls fail with HTTP 500.
Today there is no rate limit on the public REST API and no working outbound webhooks (the system is scaffolded but the delivery method is mocked and the data tables don't exist), so:
- Customers have no choice but to poll if they need near-real-time data.
- A single key can pressure the cluster regardless of other work.
Goals
- Stop a single tenant or API key from monopolizing Citus worker connections via runaway polling.
- Give well-behaved customers a way to subscribe to ticket lifecycle events (ticket.created, ticket.updated, ticket.assigned, ticket.status_changed, ticket.closed, ticket.comment.added) and receive signed HTTP POSTs instead of polling.
- Migrate the noisy customer to webhooks once shipped.
- Ship the rate-limit guardrail first (smaller, unblocks the immediate problem); ship webhook delivery next (larger, removes the cause).
Non-goals
- Not implementing custom payload templates / Handlebars rendering for webhooks in v1.
- Not implementing webhook subscriptions for non-ticket entities in v1 (projects / clients / contacts / invoices come later — pattern is the same).
- Not changing internal email/notification subscribers — webhooks are an additional output, not a replacement.
- Not changing UI / Server-Action traffic: rate limiting only protects the external API surface (x-api-key-authenticated endpoints).
- Not building a customer-facing "webhook explorer" UI in v1; admin can manage via API and a basic settings page.
- Not changing Citus connection-pool config (separate but complementary work).
- Not introducing new queue/runtime dependencies: reuse the existing Redis client and the DelayedEmailQueue ZSET pattern; do not add BullMQ.
Users and Primary Flows
Tenant administrator (target persona)
- Set or override an API rate limit: Settings → API Keys → "Rate Limit" → set per-key max_tokens and refill_per_min, or set a tenant-wide default.
- Create a webhook subscription: Settings → Webhooks → "New Webhook" → name, URL, event types (multi-select), custom headers, retry config → save → copy plaintext signing secret (shown once).
- Verify a webhook: "Send Test" delivers a synthetic payload to the configured URL using the live signing secret; result shown inline.
- Inspect deliveries: Webhook detail → paginated history with status, response body, retry button.
- Rotate signing secret: one click → new plaintext returned once.
External integration (consumer of webhooks and rate limits)
- Hit a 429: receives Retry-After and X-RateLimit-* headers; backs off and retries.
- Receive webhook: HTTP POST with X-Alga-Signature: t=<ts>,v1=<hex>; verifies HMAC; idempotently processes by event_id.
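On the consumer side, the "backs off and retries" step might be sketched as follows. This is only an illustration of how an integration could interpret Retry-After, which per the HTTP specification may carry either a number of seconds or an HTTP date; the function name retryDelayMs and the 1-second fallback are assumptions, not part of the plan.

```typescript
// Hypothetical client-side helper: how long to wait before retrying after a 429.
function retryDelayMs(retryAfter: string | null, nowMs: number, fallbackMs = 1000): number {
  if (retryAfter === null) return fallbackMs;
  const trimmed = retryAfter.trim();

  // Form 1: delay in whole seconds, e.g. "30".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? fallbackMs : Math.max(0, dateMs - nowMs);
}
```

A caller would typically read the header with response.headers.get("Retry-After") and sleep for the returned number of milliseconds before retrying.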
Internal noisy poller (the 52.53.71.0 integration)
- Continues working under sane defaults (6 calls/min is well under the limit).
- Receives a heads-up + migration guide pointing at the new webhook flow.
UX / UI Notes
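- API Keys settings: add a "Rate Limit" column to the existing AdminApiKeysSetup table, plus an inline edit form. Surface the current remaining tokens (read via getState).
- Webhooks settings: new section under Settings, modeled on AdminApiKeysSetup. Must include create form, list view (status, last delivery, success rate), delivery history, secret reveal/rotate, pause/resume.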
AdminApiKeysSetup. Must include create form, list view (status, last delivery, success rate), delivery history, secret reveal/rotate, pause/resume. - Neither feature requires a new top-level navigation entry — both live under Settings → Security or Settings → Integrations (TBD with design; tracked in Open Questions).
Requirements
Functional Requirements
API rate limiting
- Every authenticated /api/v1/* request consumes 1 token from the (tenant, api_key_id) bucket in namespace 'api'.
- 429 response on denial includes Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
- Successful responses include X-RateLimit-Limit and X-RateLimit-Remaining.
- Defaults: max_tokens = 120, refill_per_min = 60 (1 RPS sustained, 120 burst).
- Per-(tenant, api_key_id) overrides via api_rate_limit_settings. Falls back to (tenant, NULL), then to hard-coded defaults.
- Bypass list: health/version endpoints, internal-runner endpoints, mobile auth.
- NM Store global-key path uses sentinel subjectId 'nm_store' so its traffic shares one tenant-scoped bucket.
- Observation mode: RATE_LIMIT_ENFORCE=false logs denials instead of throwing. Default false until Stage 3 of rollout.
- Fail-open if Redis is unavailable.
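The bucket arithmetic behind these requirements can be illustrated with a minimal in-memory sketch. It is not the Redis-backed TokenBucketRateLimiter: the function names, signatures, and in-process Map below are assumptions made for the example, and X-RateLimit-Reset is left out because its format is not specified in this section.

```typescript
// Illustrative in-memory token bucket; the production limiter is Redis-backed.
interface BucketConfig { maxTokens: number; refillPerMin: number }
interface BucketState { tokens: number; lastRefillMs: number }
interface ConsumeResult { allowed: boolean; remaining: number; retryAfterSec: number }

const DEFAULT_CONFIG: BucketConfig = { maxTokens: 120, refillPerMin: 60 };
const buckets = new Map<string, BucketState>();

function tryConsume(
  namespace: string,
  tenant: string,
  subjectId: string,
  nowMs: number,
  config: BucketConfig = DEFAULT_CONFIG,
): ConsumeResult {
  const key = `${namespace}:${tenant}:${subjectId}`;
  const state = buckets.get(key) ?? { tokens: config.maxTokens, lastRefillMs: nowMs };

  // Refill in proportion to elapsed time, capped at the burst size.
  const elapsedMs = Math.max(0, nowMs - state.lastRefillMs);
  state.tokens = Math.min(config.maxTokens, state.tokens + (elapsedMs * config.refillPerMin) / 60_000);
  state.lastRefillMs = nowMs;

  const allowed = state.tokens >= 1;
  if (allowed) state.tokens -= 1;
  buckets.set(key, state);

  // Seconds until one whole token is available again.
  const retryAfterSec = allowed ? 0 : Math.ceil(((1 - state.tokens) * 60) / config.refillPerMin);
  return { allowed, remaining: Math.floor(state.tokens), retryAfterSec };
}

function rateLimitHeaders(result: ConsumeResult, config: BucketConfig = DEFAULT_CONFIG): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(config.maxTokens),
    "X-RateLimit-Remaining": String(result.remaining),
  };
  if (!result.allowed) headers["Retry-After"] = String(result.retryAfterSec);
  return headers;
}

// Observation mode: a denial only becomes a 429 when enforcement is on.
function shouldReject(result: ConsumeResult, enforce: boolean): boolean {
  return !result.allowed && enforce;
}
```

With the defaults, 120 back-to-back calls on one key pass and the next one is denied, while a second key in the same tenant draws from its own bucket. A real implementation would also wrap the Redis call so that an outage results in the request being allowed (fail-open).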
Outbound webhooks
- Tenant admins create webhook subscriptions for ticket lifecycle events.
- Internal events publish via the existing event bus (publishEvent({ eventType: 'TICKET_*', payload })); a new subscriber fans out to active webhooks.
- Delivery is queued, not inline (Redis ZSET, mirrored on DelayedEmailQueue).
- Each delivery is HMAC-SHA256 signed: X-Alga-Signature: t=<unix>,v1=<hex> over t + "." + body.
- Per-webhook rate limit using the TokenBucketRateLimiter namespace 'webhook-out', dimension (tenant, webhookId). Default rate_limit_per_min = 100.
- Retry policy: 1m → 5m → 30m → 2h → 12h, then abandon. 5 attempts total. Configurable per webhook via retry_config.
- Auto-disable a webhook after 24h of all-failure deliveries; email the owning user.
- Curated payload shape (see source plan §3.3): stable subset of ticket fields plus changes diff for ticket.updated events. Comments include text/author/timestamp/internal flag, never attachments.
- POST /api/v1/webhooks/[id]/test sends a synthetic webhook.test payload, recorded with is_test = true.
- SSRF guard rejects RFC1918, loopback, link-local, CGNAT, and non-http(s) schemes before delivery (real and test).
- Signing secret stored via the existing secret provider; column holds signing_secret_vault_path, never plaintext or hash.
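The signing recipe above (HMAC-SHA256 over t + "." + body, sent as t=<unix>,v1=<hex>) could be exercised with a sketch like the one below, using Node's built-in crypto module. The function names and the 300-second replay tolerance on the verifying side are assumptions for illustration; the plan itself does not specify a tolerance window.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Builds the X-Alga-Signature header value for a raw request body.
function signPayload(secret: string, body: string, timestampSec: number): string {
  const mac = createHmac("sha256", secret).update(`${timestampSec}.${body}`).digest("hex");
  return `t=${timestampSec},v1=${mac}`;
}

// Receiver-side check. toleranceSec is a hypothetical replay window, not from the plan.
function verifySignature(
  secret: string,
  body: string,
  header: string,
  nowSec: number,
  toleranceSec = 300,
): boolean {
  const parts = new Map(
    header.split(",").map((p) => {
      const i = p.indexOf("=");
      return [p.slice(0, i), p.slice(i + 1)] as [string, string];
    }),
  );
  const t = Number(parts.get("t"));
  const v1 = parts.get("v1");
  if (!Number.isInteger(t) || !v1) return false;
  if (Math.abs(nowSec - t) > toleranceSec) return false;

  const expected = createHmac("sha256", secret).update(`${t}.${body}`).digest();
  const received = Buffer.from(v1, "hex");
  // Compare in constant time; lengths must match before timingSafeEqual is called.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The signature must be computed over the exact bytes of the body as sent, so a receiver should verify against the raw request body rather than a re-serialized JSON object.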
Non-functional Requirements
- Rate limiter adds ≤2 ms p99 to authenticated API requests when Redis is healthy.
- Webhook delivery latency floor ~2 s (ZSET poll interval); acceptable.
- At-least-once delivery semantics; idempotency key is event_id.
- Tenant isolation is preserved end-to-end (subscriber filters by tenant, payload builder uses tenant-scoped getConnection).
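The subscriber-side tenant filter could look roughly like the following. The row and event shapes and the function name are hypothetical; only the TICKET_ASSIGNED to ticket.assigned mapping is taken from this document, and other event types would presumably be added in the same way.

```typescript
// Hypothetical shapes for illustration; real rows come from the webhooks table.
interface WebhookRow { webhookId: string; tenant: string; eventTypes: string[]; isActive: boolean }
interface InternalEvent { eventId: string; tenant: string; eventType: string }

// Internal event name -> public webhook event type. Only this entry is from the document.
const PUBLIC_EVENT_TYPE: Record<string, string> = {
  TICKET_ASSIGNED: "ticket.assigned",
};

// Returns the webhooks that should receive a delivery job for this event.
function selectTargetWebhooks(event: InternalEvent, webhooks: WebhookRow[]): WebhookRow[] {
  const publicType = PUBLIC_EVENT_TYPE[event.eventType];
  if (publicType === undefined) return [];
  return webhooks.filter(
    (w) => w.tenant === event.tenant && w.isActive && w.eventTypes.includes(publicType),
  );
}
```

Because delivery is at-least-once, each enqueued job would carry the event's event_id so that receivers can discard duplicates.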
Data / API / Integrations
New tables
- api_rate_limit_settings (tenant-distributed): (tenant, api_key_id NULL, max_tokens, refill_per_min) with UNIQUE (tenant, api_key_id).
- webhooks (tenant-distributed): subscription rows with signing_secret_vault_path, event_types text[], retry/rate-limit config, rolling stats columns, auto_disabled_at.
- webhook_deliveries (tenant-distributed): one row per attempt; is_test boolean, next_retry_at, response capture.
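To make the role of the nullable api_key_id concrete, here is a hedged sketch of how a lookup against api_rate_limit_settings rows might resolve a bucket configuration: per-key row first, then the tenant default row (api_key_id NULL), then hard-coded defaults. The row type and function name are illustrative, and rows are passed in as an array rather than queried.

```typescript
interface BucketConfig { maxTokens: number; refillPerMin: number }

// Hypothetical in-memory view of api_rate_limit_settings rows.
interface RateLimitSettingRow {
  tenant: string;
  apiKeyId: string | null; // null marks the tenant-wide default
  maxTokens: number;
  refillPerMin: number;
}

const HARD_CODED_DEFAULT: BucketConfig = { maxTokens: 120, refillPerMin: 60 };

function resolveBucketConfig(
  rows: RateLimitSettingRow[],
  tenant: string,
  apiKeyId?: string,
): BucketConfig {
  const tenantRows = rows.filter((r) => r.tenant === tenant);
  const perKey = apiKeyId === undefined ? undefined : tenantRows.find((r) => r.apiKeyId === apiKeyId);
  const tenantDefault = tenantRows.find((r) => r.apiKeyId === null);
  const row = perKey ?? tenantDefault;
  return row ? { maxTokens: row.maxTokens, refillPerMin: row.refillPerMin } : HARD_CODED_DEFAULT;
}
```

One detail worth checking during migration design: in PostgreSQL a plain UNIQUE constraint treats NULL values as distinct, so keeping a single (tenant, NULL) row may need a partial unique index or NULLS NOT DISTINCT (PostgreSQL 15 and later).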
New / modified APIs
- TokenBucketRateLimiter: signature change; namespace as required first parameter on tryConsume/getState/getBucketKey/getBucketConfig. BucketConfigGetter widened to (tenantId, subjectId?) => BucketConfig. initialize() accepts Record<namespace, BucketConfigGetter>.
- ApiError interface: optional headers?: Record<string, string>; handleApiError merges them into the NextResponse.
- createSuccessResponse / createPaginatedResponse: optional extraHeaders parameter.
- New helper enforceApiRateLimit(req, context) called from ApiBaseController.authenticate, withApiKeyAuth, and withAuth.
- webhookEventTypeSchema extended with 'ticket.comment.added'.
- Webhook controller stubs implemented (or removed): getDeliveryDetails, getWebhookHealth, getWebhookSubscriptions (read-only), rotateWebhookSecret, verifyWebhookSignature, listAvailableEvents. Deferred stubs (bulk, templates, transformations, etc.) have their routes deleted, not left as 501s.
Reused infrastructure (no new dependency)
- getRedisClient for both buckets and the webhook ZSET.
- DelayedEmailQueue pattern for the webhook poller class.
- getSecretProviderInstance for signing-secret storage.
- Existing event bus (publishEvent / getEventBus().subscribe) and the TICKET_* schemas in packages/event-schemas.
Security / Permissions
- Webhook signing secrets never leave the secret provider once written; GET responses for webhook rows must omit the column entirely (covered by an integration assertion).
- SSRF protection enforced server-side in both real and test delivery, with an env-var escape hatch (WEBHOOK_SSRF_ALLOW_PRIVATE) for staging/local only.
- Rate-limit configuration writes scoped to tenant admins (RBAC: existing api_keys.update permission applies; webhook CRUD uses webhook.*, added if not present).
- signing_secret_vault_path column never exposed in the API response shape.
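A simplified version of the SSRF check might look like the sketch below. It covers only URL scheme validation and literal IPv4 addresses in the blocked ranges (RFC1918, loopback, link-local, CGNAT); a production guard would also need to resolve hostnames and re-check the resolved addresses, and handle IPv6 beyond the loopback literal. The function names and the allowPrivate parameter (standing in for WEBHOOK_SSRF_ALLOW_PRIVATE) are assumptions.

```typescript
// Blocked IPv4 ranges as [base address, prefix length].
const BLOCKED_RANGES: Array<[string, number]> = [
  ["10.0.0.0", 8],     // RFC1918
  ["172.16.0.0", 12],  // RFC1918
  ["192.168.0.0", 16], // RFC1918
  ["127.0.0.0", 8],    // loopback
  ["169.254.0.0", 16], // link-local
  ["100.64.0.0", 10],  // CGNAT
];

function ipv4ToInt(ip: string): number | null {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
  if (!m) return null;
  const [a, b, c, d] = m.slice(1).map(Number);
  if ([a, b, c, d].some((o) => o > 255)) return null;
  return ((a << 24) >>> 0) + (b << 16) + (c << 8) + d;
}

function inCidr(ip: number, base: string, bits: number): boolean {
  const baseInt = ipv4ToInt(base)!;
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ip & mask) >>> 0) === ((baseInt & mask) >>> 0);
}

// allowPrivate mirrors the staging/local escape hatch; it never relaxes the scheme check.
function isWebhookUrlAllowed(rawUrl: string, allowPrivate = false): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (allowPrivate) return true;

  const host = url.hostname;
  if (host === "localhost" || host === "[::1]") return false;
  const ip = ipv4ToInt(host);
  if (ip === null) return true; // hostname: a real guard must resolve DNS and re-check
  return !BLOCKED_RANGES.some(([base, bits]) => inCidr(ip, base, bits));
}
```

Running the same function for both real and test deliveries keeps the two paths from drifting apart.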
Observability
- Rate-limit metrics: api_rate_limit_consumed_total{tenant,api_key_id,outcome}, api_rate_limit_remaining{tenant,api_key_id}, api_rate_limit_redis_unavailable_total.
- Webhook metrics: webhook_deliveries_total{tenant,webhook_id,outcome}, webhook_delivery_duration_ms histogram, webhook_queue_depth gauge, webhook_auto_disabled_total{tenant,webhook_id}.
- Structured WARN log on every throttle and on every Redis fail-open.
- Grafana panel: top throttled (tenant, api_key_id) and top failing webhooks.
Rollout / Migration
- Citus worker max_connections bump: out of band, ~1 hour, eliminates the immediate 500s. Not part of this plan.
- Rate limiter MVP, Stages 1–3:
  - Stage 1 (observation): RATE_LIMIT_ENFORCE=false. Measure for one week.
  - Stage 2 (notify outliers): email tenants whose keys would have been throttled.
  - Stage 3 (enforce): flip RATE_LIMIT_ENFORCE=true.
  - Stage 4: remove the env-var bypass after ~2 weeks stable.
- Webhook MVP, Stages 1–4:
  - Stage 1 (dark launch): ship behind a feature flag; internal testing against webhook.site.
  - Stage 2 (invite-only beta): enable for a handful of API-heavy tenants, including the noisy poller.
  - Stage 3 (GA): open to all tenants; publish docs.
  - Stage 4: tighten polled REST rate limits once webhook adoption is healthy.
- The noisy poller (52.53.71.0) gets a personal note + migration guide.
Open Questions
- Should we expose ticket.deleted? Internal TICKET_DELETED exists. Defer to v2 unless the noisy poller specifically asks.
- Per-tenant webhook count limit? Cap at 50 per tenant default.
- Should ticket.status_changed include previous_status_id / previous_status_name? Yes, captured as a feature.
- Webhook-side filtering by entity_ids? Implement in v1; it directly addresses the noisy poller's "tell me about these 6 tickets" pattern.
- IA placement of the new settings UI sections: Settings → Security or Settings → Integrations? Confirm with design.
- Per-tenant rate-limit cap on top of per-key buckets? Defer until data shows we need it.
- Per-endpoint cost weights (/search costs more than /get)? Defer until observation data shows pressure differences.
Acceptance Criteria (Definition of Done)
Rate limiter
- A test API key making 121 requests in 60 seconds receives 429 on the 121st with the documented headers, and a different key in the same tenant is unaffected.
- The email path's existing rate-limit behavior is unchanged on a baseline notification_settings.rate_limit_per_minute value.
- RATE_LIMIT_ENFORCE=false lets denials through but emits the same metrics and headers as enforce mode.
- An api_rate_limit_settings row with (tenant, api_key_id) overrides the tenant default; clearForKey returns to the tenant default within the cache TTL.
Webhooks
- TICKET_ASSIGNED published in tenant A enqueues a delivery job for an active webhook in tenant A subscribed to ticket.assigned, and does not enqueue a job for any webhook in tenant B.
- The WebhookDeliveryQueue poller successfully delivers to a stubbed HTTP server, persists a row in webhook_deliveries, and updates webhook stats columns.
- A webhook URL pointing at 127.0.0.1 or 10.0.0.5 is rejected before delivery (production mode), and accepted with WEBHOOK_SSRF_ALLOW_PRIVATE=true.
- HMAC verification with the documented recipe matches the server signature byte-for-byte.
- Signing secret never appears in any GET webhook response body.
- 5 failed attempts mark the delivery abandoned; 24h of all-failure deliveries auto-disable the webhook.
- Webhooks settings UI: an admin can create, view deliveries, send a test, rotate the secret, and pause/resume a webhook.