Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
306 lines
14 KiB
Markdown
306 lines
14 KiB
Markdown
# PRD — API Rate Limiting and Outbound Ticket Webhooks
|
||
|
||
- Slug: `api-rate-limiting-and-ticket-webhooks`
|
||
- Date: `2026-05-05`
|
||
- Status: Draft
|
||
- Source plans:
|
||
- `/Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/api-rate-limiting-plan.md`
|
||
- `/Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/ticket-webhooks-plan.md`
|
||
|
||
## Summary
|
||
|
||
Two complementary protections against "noisy poller" pressure on the public
|
||
REST API and the underlying Citus cluster:
|
||
|
||
1. **API rate limiting** (guardrail) — a per-`(tenant, api_key_id)` token-bucket
|
||
rate limit on every authenticated `/api/v1/*` request, fail-open on Redis
|
||
outage, configurable per tenant.
|
||
2. **Outbound ticket webhooks** (cure) — finish the partially-scaffolded webhook
|
||
delivery pipeline so well-behaved customers can subscribe to ticket events
|
||
instead of polling. Includes signed HTTP delivery, retries with backoff, a
|
||
minimal admin UI, and a curated payload shape.
|
||
|
||
The two ship as one plan because they share infrastructure (the
|
||
`TokenBucketRateLimiter` namespace work is required by both — the rate limiter
|
||
uses namespace `'api'`, the webhook delivery worker uses namespace
|
||
`'webhook-out'` for per-webhook outbound caps).
|
||
|
||
## Problem
|
||
|
||
Production telemetry on 2026-05-04 traced intermittent Citus
|
||
"`remaining connection slots are reserved for non-replication superuser
|
||
connections`" errors to a single external integration polling
|
||
`GET /api/v1/tickets/<id>` for 6 specific ticket IDs once per minute from
|
||
`52.53.71.0`. When the 6-call burst lands at a minute boundary alongside other
|
||
work, ~18% of the calls fail with HTTP 500.
|
||
|
||
Today there is **no rate limit** on the public REST API and **no working
|
||
outbound webhooks** (the system is scaffolded but the delivery method is
|
||
mocked and the data tables don't exist), so:
|
||
|
||
- Customers have no choice but to poll if they need near-real-time data.
|
||
- A single key can pressure the cluster regardless of other work.
|
||
|
||
## Goals
|
||
|
||
- Stop a single tenant or API key from monopolizing Citus worker connections
|
||
via runaway polling.
|
||
- Give well-behaved customers a way to subscribe to ticket lifecycle events
|
||
(`ticket.created`, `ticket.updated`, `ticket.assigned`,
|
||
`ticket.status_changed`, `ticket.closed`, `ticket.comment.added`) and
|
||
receive signed HTTP POSTs instead of polling.
|
||
- Migrate the noisy customer to webhooks once shipped.
|
||
- Ship the rate-limit guardrail first (smaller, unblocks the immediate
|
||
problem); ship webhook delivery next (larger, removes the cause).
|
||
|
||
## Non-goals
|
||
|
||
- Not implementing custom payload templates / Handlebars rendering for
|
||
webhooks in v1.
|
||
- Not implementing webhook subscriptions for non-ticket entities in v1
|
||
(projects / clients / contacts / invoices come later — pattern is the same).
|
||
- Not changing internal email/notification subscribers — webhooks are an
|
||
additional output, not a replacement.
|
||
- Not changing UI / Server-Action traffic — rate limiting only protects the
|
||
external API surface (`x-api-key`-authenticated endpoints).
|
||
- Not building a customer-facing "webhook explorer" UI in v1; admin can manage
|
||
via API and a basic settings page.
|
||
- Not changing Citus connection-pool config (separate but complementary work).
|
||
- Not introducing new queue/runtime dependencies — reuse the existing Redis
|
||
client and the `DelayedEmailQueue` ZSET pattern; do not add BullMQ.
|
||
|
||
## Users and Primary Flows
|
||
|
||
**Tenant administrator** (target persona)
|
||
|
||
1. *Set or override an API rate limit:*
|
||
Settings → API Keys → "Rate Limit" → set per-key `max_tokens` and
|
||
`refill_per_min`, or set a tenant-wide default.
|
||
2. *Create a webhook subscription:*
|
||
Settings → Webhooks → "New Webhook" → name, URL, event types
|
||
(multi-select), custom headers, retry config → save → copy plaintext
|
||
signing secret (shown once).
|
||
3. *Verify a webhook:* "Send Test" delivers a synthetic payload to the
|
||
configured URL using the live signing secret; result shown inline.
|
||
4. *Inspect deliveries:* Webhook detail → paginated history with status,
|
||
response body, retry button.
|
||
5. *Rotate signing secret:* one click → new plaintext returned once.
|
||
|
||
**External integration** (consumer of webhooks and rate limits)
|
||
|
||
1. *Hit a 429:* receives `Retry-After` and `X-RateLimit-*` headers; backs
|
||
off and retries.
|
||
2. *Receive webhook:* HTTP POST with `X-Alga-Signature: t=<ts>,v1=<hex>`;
|
||
verifies HMAC; idempotently processes by `event_id`.
|
||
|
||
**Internal noisy poller** (the `52.53.71.0` integration)
|
||
|
||
- Continues working under sane defaults (6 calls/min is well under the limit).
|
||
- Receives a heads-up + migration guide pointing at the new webhook flow.
|
||
|
||
## UX / UI Notes
|
||
|
||
- **API Keys settings:** add a "Rate Limit" column to the existing
|
||
`AdminApiKeysSetup` table, plus an inline edit form. Surface the current
|
||
remaining tokens (read via `getState`).
|
||
- **Webhooks settings:** new section under Settings, modeled on
|
||
`AdminApiKeysSetup`. Must include create form, list view (status, last
|
||
delivery, success rate), delivery history, secret reveal/rotate, pause/resume.
|
||
- Neither feature requires a new top-level navigation entry — both live under
|
||
Settings → Security or Settings → Integrations (TBD with design; tracked in
|
||
Open Questions).
|
||
|
||
## Requirements
|
||
|
||
### Functional Requirements
|
||
|
||
**API rate limiting**
|
||
|
||
- Every authenticated `/api/v1/*` request consumes 1 token from the
|
||
`(tenant, api_key_id)` bucket in namespace `'api'`.
|
||
- 429 response on denial includes `Retry-After`, `X-RateLimit-Limit`,
|
||
`X-RateLimit-Remaining`, `X-RateLimit-Reset`.
|
||
- Successful responses include `X-RateLimit-Limit` and `X-RateLimit-Remaining`.
|
||
- Defaults: `max_tokens = 120`, `refill_per_min = 60` (1 RPS sustained, 120
|
||
burst).
|
||
- Per-`(tenant, api_key_id)` overrides via `api_rate_limit_settings`. Falls
|
||
back to `(tenant, NULL)`, then to hard-coded defaults.
|
||
- Bypass list: health/version endpoints, internal-runner endpoints, mobile
|
||
auth.
|
||
- NM Store global-key path uses sentinel subjectId `'nm_store'` so its
|
||
traffic shares one tenant-scoped bucket.
|
||
- Observation mode: `RATE_LIMIT_ENFORCE=false` logs denials instead of
|
||
throwing. Default `false` until Stage 3 of rollout.
|
||
- Fail-open if Redis is unavailable.
|
||
|
||
**Outbound webhooks**
|
||
|
||
- Tenant admins create webhook subscriptions for ticket lifecycle events.
|
||
- Internal events publish via the existing event bus
|
||
(`publishEvent({ eventType: 'TICKET_*', payload })`); a new subscriber
|
||
fans out to active webhooks.
|
||
- Delivery is queued, not inline (Redis ZSET, mirrored on `DelayedEmailQueue`).
|
||
- Each delivery is HMAC-SHA256 signed:
|
||
`X-Alga-Signature: t=<unix>,v1=<hex>` over `t + "." + body`.
|
||
- Per-webhook rate limit using the `TokenBucketRateLimiter` namespace
|
||
`'webhook-out'`, dimension `(tenant, webhookId)`. Default
|
||
`rate_limit_per_min = 100`.
|
||
- Retry policy: 1m → 5m → 30m → 2h → 12h, then abandon. 5 attempts total.
|
||
Configurable per webhook via `retry_config`.
|
||
- Auto-disable a webhook after 24h of all-failure deliveries; email the
|
||
owning user.
|
||
- Curated payload shape (see source plan §3.3) — stable subset of ticket
|
||
fields plus `changes` diff for `ticket.updated` events. Comments include
|
||
text/author/timestamp/internal flag, never attachments.
|
||
- `POST /api/v1/webhooks/[id]/test` sends a synthetic `webhook.test`
|
||
payload, recorded with `is_test = true`.
|
||
- SSRF guard rejects RFC1918, loopback, link-local, CGNAT, and non-`http(s)`
|
||
schemes before delivery (real and test).
|
||
- Signing secret stored via the existing secret provider; column holds
|
||
`signing_secret_vault_path`, never plaintext or hash.
|
||
|
||
### Non-functional Requirements
|
||
|
||
- Rate limiter adds ≤2 ms p99 to authenticated API requests when Redis is
|
||
healthy.
|
||
- Webhook delivery latency floor ~2 s (ZSET poll interval); acceptable.
|
||
- At-least-once delivery semantics; idempotency key is `event_id`.
|
||
- Tenant isolation is preserved end-to-end (subscriber filters by `tenant`,
|
||
payload builder uses tenant-scoped `getConnection`).
|
||
|
||
## Data / API / Integrations
|
||
|
||
### New tables
|
||
|
||
- `api_rate_limit_settings` (tenant-distributed):
|
||
`(tenant, api_key_id NULL, max_tokens, refill_per_min)` with
|
||
`UNIQUE (tenant, api_key_id)`.
|
||
- `webhooks` (tenant-distributed): subscription rows with
|
||
`signing_secret_vault_path`, `event_types text[]`, retry/rate-limit config,
|
||
rolling stats columns, `auto_disabled_at`.
|
||
- `webhook_deliveries` (tenant-distributed): one row per attempt;
|
||
`is_test boolean`, `next_retry_at`, response capture.
|
||
|
||
### New / modified APIs
|
||
|
||
- `TokenBucketRateLimiter`: signature change — namespace as required first
|
||
parameter on `tryConsume` / `getState` / `getBucketKey` / `getBucketConfig`.
|
||
`BucketConfigGetter` widened to `(tenantId, subjectId?) => BucketConfig`.
|
||
`initialize()` accepts `Record<namespace, BucketConfigGetter>`.
|
||
- `ApiError` interface: optional `headers?: Record<string, string>`;
|
||
`handleApiError` merges them into the `NextResponse`.
|
||
- `createSuccessResponse` / `createPaginatedResponse`: optional
|
||
`extraHeaders` parameter.
|
||
- New helper `enforceApiRateLimit(req, context)` called from
|
||
`ApiBaseController.authenticate`, `withApiKeyAuth`, and `withAuth`.
|
||
- `webhookEventTypeSchema` extended with `'ticket.comment.added'`.
|
||
- Webhook controller stubs implemented (or removed): `getDeliveryDetails`,
|
||
`getWebhookHealth`, `getWebhookSubscriptions` (read-only), `rotateWebhookSecret`,
|
||
`verifyWebhookSignature`, `listAvailableEvents`. Deferred stubs (bulk,
|
||
templates, transformations, etc.) have their routes deleted, not left as 501s.
|
||
|
||
### Reused infrastructure (no new dependency)
|
||
|
||
- `getRedisClient` for both buckets and the webhook ZSET.
|
||
- `DelayedEmailQueue` pattern for the webhook poller class.
|
||
- `getSecretProviderInstance` for signing-secret storage.
|
||
- Existing event bus (`publishEvent` / `getEventBus().subscribe`) and the
|
||
`TICKET_*` schemas in `packages/event-schemas`.
|
||
|
||
## Security / Permissions
|
||
|
||
- Webhook signing secrets never leave the secret provider once written;
|
||
GET responses for webhook rows must omit the column entirely (covered by
|
||
an integration assertion).
|
||
- SSRF protection enforced server-side in both real and test delivery, with
|
||
an env-var escape hatch (`WEBHOOK_SSRF_ALLOW_PRIVATE`) for staging/local
|
||
only.
|
||
- Rate-limit configuration writes scoped to tenant admins (RBAC: existing
|
||
`api_keys.update` permission applies; webhook CRUD uses `webhook.*` —
|
||
added if not present).
|
||
- `signing_secret_vault_path` column never exposed in the API response shape.
|
||
|
||
## Observability
|
||
|
||
- Rate-limit metrics: `api_rate_limit_consumed_total{tenant,api_key_id,outcome}`,
|
||
`api_rate_limit_remaining{tenant,api_key_id}`,
|
||
`api_rate_limit_redis_unavailable_total`.
|
||
- Webhook metrics:
|
||
`webhook_deliveries_total{tenant,webhook_id,outcome}`,
|
||
`webhook_delivery_duration_ms` histogram, `webhook_queue_depth` gauge,
|
||
`webhook_auto_disabled_total{tenant,webhook_id}`.
|
||
- Structured WARN log on every throttle and on every Redis fail-open.
|
||
- Grafana panel: top throttled `(tenant, api_key_id)` and top failing
|
||
webhooks.
|
||
|
||
## Rollout / Migration
|
||
|
||
1. **Citus worker `max_connections` bump** — out of band, ~1 hour, eliminates
|
||
the immediate 500s. Not part of this plan.
|
||
2. **Rate limiter MVP** — Stages 1–3:
|
||
- *Stage 1 (observation):* `RATE_LIMIT_ENFORCE=false`. Measure for one
|
||
week.
|
||
- *Stage 2 (notify outliers):* email tenants whose keys would have been
|
||
throttled.
|
||
- *Stage 3 (enforce):* flip `RATE_LIMIT_ENFORCE=true`.
|
||
- *Stage 4:* remove the env-var bypass after ~2 weeks stable.
|
||
3. **Webhook MVP** — Stages 1–4:
|
||
- *Stage 1 (dark launch):* ship behind a feature flag; internal testing
|
||
against `webhook.site`.
|
||
- *Stage 2 (invite-only beta):* enable for a handful of API-heavy
|
||
tenants, including the noisy poller.
|
||
- *Stage 3 (GA):* open to all tenants; publish docs.
|
||
- *Stage 4:* tighten polled REST rate limits once webhook adoption is
|
||
healthy.
|
||
4. The noisy poller (`52.53.71.0`) gets a personal note + migration guide.
|
||
|
||
## Open Questions
|
||
|
||
1. Should we expose `ticket.deleted`? Internal `TICKET_DELETED` exists.
|
||
Defer to v2 unless the noisy poller specifically asks.
|
||
2. Per-tenant webhook count limit? Cap at 50 per tenant default.
|
||
3. Should `ticket.status_changed` include `previous_status_id`/
|
||
`previous_status_name`? **Yes** — captured as a feature.
|
||
4. Webhook-side filtering by `entity_ids`? **Implement in v1** — directly
|
||
addresses the noisy poller's "tell me about these 6 tickets" pattern.
|
||
5. IA placement of the new settings UI sections — Settings → Security or
|
||
Settings → Integrations? Confirm with design.
|
||
6. Per-tenant rate-limit cap on top of per-key buckets? Defer until data
|
||
shows we need it.
|
||
7. Per-endpoint cost weights (`/search` costs more than `/get`)? Defer until
|
||
observation data shows pressure differences.
|
||
|
||
## Acceptance Criteria (Definition of Done)
|
||
|
||
**Rate limiter**
|
||
|
||
- A test API key making 121 requests in 60 seconds receives 429 on the 121st
|
||
with the documented headers, and a different key in the same tenant is
|
||
unaffected.
|
||
- The email path's existing rate-limit behavior is unchanged on a baseline
|
||
`notification_settings.rate_limit_per_minute` value.
|
||
- `RATE_LIMIT_ENFORCE=false` lets denials through but emits the same
|
||
metrics and headers as enforce mode.
|
||
- An `api_rate_limit_settings` row with `(tenant, api_key_id)` overrides
|
||
the tenant default; `clearForKey` returns to the tenant default within
|
||
the cache TTL.
|
||
|
||
**Webhooks**
|
||
|
||
- `TICKET_ASSIGNED` published in tenant A enqueues a delivery job for an
|
||
active webhook in tenant A subscribed to `ticket.assigned`, and does
|
||
**not** enqueue a job for any webhook in tenant B.
|
||
- The `WebhookDeliveryQueue` poller successfully delivers to a stubbed
|
||
HTTP server, persists a row in `webhook_deliveries`, and updates webhook
|
||
stats columns.
|
||
- A webhook URL pointing at `127.0.0.1` or `10.0.0.5` is rejected before
|
||
delivery (production mode), and accepted with
|
||
`WEBHOOK_SSRF_ALLOW_PRIVATE=true`.
|
||
- HMAC verification with the documented recipe matches the server signature
|
||
byte-for-byte.
|
||
- Signing secret never appears in any GET webhook response body.
|
||
- 5 failed attempts mark the delivery `abandoned`; 24h of all-failure
|
||
deliveries auto-disable the webhook.
|
||
- Webhooks settings UI: an admin can create, view deliveries, send a test,
|
||
rotate the secret, and pause/resume a webhook.
|