# PRD — API Rate Limiting and Outbound Ticket Webhooks
- Slug: `api-rate-limiting-and-ticket-webhooks`
- Date: `2026-05-05`
- Status: Draft
- Source plans:
  - `/Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/api-rate-limiting-plan.md`
  - `/Users/natalliabukhtsik/Desktop/projects/alga-psa/.ai/ticket-webhooks-plan.md`
## Summary
Two complementary protections against "noisy poller" pressure on the public
REST API and the underlying Citus cluster:
1. **API rate limiting** (guardrail) — a per-`(tenant, api_key_id)` token-bucket
rate limit on every authenticated `/api/v1/*` request, fail-open on Redis
outage, configurable per tenant.
2. **Outbound ticket webhooks** (cure) — finish the partially-scaffolded webhook
delivery pipeline so well-behaved customers can subscribe to ticket events
instead of polling. Includes signed HTTP delivery, retries with backoff, a
minimal admin UI, and a curated payload shape.
The two ship as one plan because they share infrastructure (the
`TokenBucketRateLimiter` namespace work is required by both — the rate limiter
uses namespace `'api'`, the webhook delivery worker uses namespace
`'webhook-out'` for per-webhook outbound caps).
## Problem
Production telemetry on 2026-05-04 traced intermittent Citus
"`remaining connection slots are reserved for non-replication superuser
connections`" errors to a single external integration polling
`GET /api/v1/tickets/<id>` for 6 specific ticket IDs once per minute from
`52.53.71.0`. When the 6-call burst lands at a minute boundary alongside other
work, ~18% of the calls fail with HTTP 500.
Today there is **no rate limit** on the public REST API and **no working
outbound webhooks** (the system is scaffolded but the delivery method is
mocked and the data tables don't exist), so:
- Customers have no choice but to poll if they need near-real-time data.
- A single key can pressure the cluster regardless of other work.
## Goals
- Stop a single tenant or API key from monopolizing Citus worker connections
via runaway polling.
- Give well-behaved customers a way to subscribe to ticket lifecycle events
(`ticket.created`, `ticket.updated`, `ticket.assigned`,
`ticket.status_changed`, `ticket.closed`, `ticket.comment.added`) and
receive signed HTTP POSTs instead of polling.
- Migrate the noisy customer to webhooks once shipped.
- Ship the rate-limit guardrail first (smaller, unblocks the immediate
problem); ship webhook delivery next (larger, removes the cause).
## Non-goals
- Not implementing custom payload templates / Handlebars rendering for
webhooks in v1.
- Not implementing webhook subscriptions for non-ticket entities in v1
(projects / clients / contacts / invoices come later — pattern is the same).
- Not changing internal email/notification subscribers — webhooks are an
additional output, not a replacement.
- Not changing UI / Server-Action traffic — rate limiting only protects the
external API surface (`x-api-key`-authenticated endpoints).
- Not building a customer-facing "webhook explorer" UI in v1; admin can manage
via API and a basic settings page.
- Not changing Citus connection-pool config (separate but complementary work).
- Not introducing new queue/runtime dependencies — reuse the existing Redis
client and the `DelayedEmailQueue` ZSET pattern; do not add BullMQ.
## Users and Primary Flows
**Tenant administrator** (target persona)
1. *Set or override an API rate limit:*
Settings → API Keys → "Rate Limit" → set per-key `max_tokens` and
`refill_per_min`, or set a tenant-wide default.
2. *Create a webhook subscription:*
Settings → Webhooks → "New Webhook" → name, URL, event types
(multi-select), custom headers, retry config → save → copy plaintext
signing secret (shown once).
3. *Verify a webhook:* "Send Test" delivers a synthetic payload to the
configured URL using the live signing secret; result shown inline.
4. *Inspect deliveries:* Webhook detail → paginated history with status,
response body, retry button.
5. *Rotate signing secret:* one click → new plaintext returned once.
**External integration** (consumer of webhooks and rate limits)
1. *Hit a 429:* receives `Retry-After` and `X-RateLimit-*` headers; backs
off and retries.
2. *Receive webhook:* HTTP POST with `X-Alga-Signature: t=<ts>,v1=<hex>`;
verifies HMAC; idempotently processes by `event_id`.
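
From the consumer's side, verifying `X-Alga-Signature` could look roughly like the sketch below. It assumes the signed string is `t + "." + body` as stated in the requirements. The function names and the 300-second replay window (`toleranceSeconds`) are hypothetical and not part of the PRD.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hex HMAC-SHA256 over `t + "." + body`, matching the documented recipe.
function computeSignature(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// Parse "t=<unix>,v1=<hex>"; returns null when the header is malformed.
function parseSignatureHeader(header: string): { t: number; v1: string } | null {
  const parts = new Map<string, string>();
  for (const piece of header.split(",")) {
    const idx = piece.indexOf("=");
    if (idx > 0) parts.set(piece.slice(0, idx).trim(), piece.slice(idx + 1).trim());
  }
  const t = Number(parts.get("t"));
  const v1 = parts.get("v1");
  if (!Number.isInteger(t) || !v1 || !/^[0-9a-f]{64}$/i.test(v1)) return null;
  return { t, v1 };
}

// toleranceSeconds is an assumed replay window; the PRD does not define one.
function verifyWebhook(
  secret: string,
  header: string,
  rawBody: string,
  nowSeconds: number,
  toleranceSeconds = 300,
): boolean {
  const parsed = parseSignatureHeader(header);
  if (!parsed) return false;
  if (Math.abs(nowSeconds - parsed.t) > toleranceSeconds) return false;
  const expected = Buffer.from(computeSignature(secret, parsed.t, rawBody), "hex");
  const received = Buffer.from(parsed.v1, "hex");
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```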
**Existing noisy poller** (the `52.53.71.0` integration)
- Continues working under sane defaults (6 calls/min is well under the limit).
- Receives a heads-up + migration guide pointing at the new webhook flow.
## UX / UI Notes
- **API Keys settings:** add a "Rate Limit" column to the existing
`AdminApiKeysSetup` table, plus an inline edit form. Surface the current
remaining tokens (read via `getState`).
- **Webhooks settings:** new section under Settings, modeled on
`AdminApiKeysSetup`. Must include create form, list view (status, last
delivery, success rate), delivery history, secret reveal/rotate, pause/resume.
- Neither feature requires a new top-level navigation entry — both live under
Settings → Security or Settings → Integrations (TBD with design; tracked in
Open Questions).
## Requirements
### Functional Requirements
**API rate limiting**
- Every authenticated `/api/v1/*` request consumes 1 token from the
`(tenant, api_key_id)` bucket in namespace `'api'`.
- 429 response on denial includes `Retry-After`, `X-RateLimit-Limit`,
`X-RateLimit-Remaining`, `X-RateLimit-Reset`.
- Successful responses include `X-RateLimit-Limit` and `X-RateLimit-Remaining`.
- Defaults: `max_tokens = 120`, `refill_per_min = 60` (1 RPS sustained, 120
burst).
- Per-`(tenant, api_key_id)` overrides via `api_rate_limit_settings`. Falls
back to `(tenant, NULL)`, then to hard-coded defaults.
- Bypass list: health/version endpoints, internal-runner endpoints, mobile
auth.
- NM Store global-key path uses sentinel subjectId `'nm_store'` so its
traffic shares one tenant-scoped bucket.
- Observation mode: `RATE_LIMIT_ENFORCE=false` logs denials instead of
throwing. Default `false` until Stage 3 of rollout.
- Fail-open if Redis is unavailable.
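
The bucket semantics above can be sketched in memory as follows. This is only an illustration under stated assumptions: the real `TokenBucketRateLimiter` is Redis-backed, and the class name `InMemoryTokenBucket`, the key layout, and the `consumeFailOpen` helper are hypothetical. The defaults (120 burst, 60 tokens per minute) come from the requirements.

```typescript
interface BucketConfig { maxTokens: number; refillPerMin: number; }
interface BucketState { tokens: number; lastRefillMs: number; }
interface ConsumeResult { allowed: boolean; remaining: number; retryAfterSeconds: number; }

// Defaults from the requirements: 120 burst, 60 tokens/min (1 RPS sustained).
const DEFAULT_API_BUCKET: BucketConfig = { maxTokens: 120, refillPerMin: 60 };

class InMemoryTokenBucket {
  private buckets = new Map<string, BucketState>();

  // Key layout is an assumption made for this sketch.
  private key(namespace: string, tenant: string, subjectId: string): string {
    return `${namespace}:${tenant}:${subjectId}`;
  }

  tryConsume(
    namespace: string,
    tenant: string,
    subjectId: string,
    config: BucketConfig,
    nowMs: number,
  ): ConsumeResult {
    const key = this.key(namespace, tenant, subjectId);
    const prev = this.buckets.get(key) ?? { tokens: config.maxTokens, lastRefillMs: nowMs };
    const elapsedMs = Math.max(0, nowMs - prev.lastRefillMs);
    // Refill continuously, capped at the burst size.
    const refilled = Math.min(
      config.maxTokens,
      prev.tokens + (elapsedMs * config.refillPerMin) / 60_000,
    );
    if (refilled >= 1) {
      this.buckets.set(key, { tokens: refilled - 1, lastRefillMs: nowMs });
      return { allowed: true, remaining: Math.floor(refilled - 1), retryAfterSeconds: 0 };
    }
    this.buckets.set(key, { tokens: refilled, lastRefillMs: nowMs });
    const secondsPerToken = 60 / config.refillPerMin;
    return {
      allowed: false,
      remaining: 0,
      retryAfterSeconds: Math.ceil((1 - refilled) * secondsPerToken),
    };
  }
}

// Fail-open wrapper: if the backing store throws (for example Redis is down),
// the request is allowed rather than rejected.
function consumeFailOpen(attempt: () => ConsumeResult, fallbackRemaining: number): ConsumeResult {
  try {
    return attempt();
  } catch {
    return { allowed: true, remaining: fallbackRemaining, retryAfterSeconds: 0 };
  }
}
```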
**Outbound webhooks**
- Tenant admins create webhook subscriptions for ticket lifecycle events.
- Internal events publish via the existing event bus
(`publishEvent({ eventType: 'TICKET_*', payload })`); a new subscriber
fans out to active webhooks.
- Delivery is queued, not inline (Redis ZSET, modeled on `DelayedEmailQueue`).
- Each delivery is HMAC-SHA256 signed:
`X-Alga-Signature: t=<unix>,v1=<hex>` over `t + "." + body`.
- Per-webhook rate limit using the `TokenBucketRateLimiter` namespace
`'webhook-out'`, dimension `(tenant, webhookId)`. Default
`rate_limit_per_min = 100`.
- Retry policy: 1m → 5m → 30m → 2h → 12h, then abandon. 5 attempts total.
Configurable per webhook via `retry_config`.
- Auto-disable a webhook after 24h of all-failure deliveries; email the
owning user.
- Curated payload shape (see source plan §3.3) — stable subset of ticket
fields plus `changes` diff for `ticket.updated` events. Comments include
text/author/timestamp/internal flag, never attachments.
- `POST /api/v1/webhooks/[id]/test` sends a synthetic `webhook.test`
payload, recorded with `is_test = true`.
- SSRF guard rejects RFC1918, loopback, link-local, CGNAT, and non-`http(s)`
schemes before delivery (real and test).
- Signing secret stored via the existing secret provider; column holds
`signing_secret_vault_path`, never plaintext or hash.
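
The retry policy above could be planned with a small helper like the one below. It is a sketch, not the delivery worker's code: `planRetry`, `RetryConfig`, and its field names are hypothetical. The policy lists five delays and five attempts total; this sketch assumes the attempt cap is authoritative, so with the defaults the 12h delay is only reached if a webhook's `retry_config` raises the cap.

```typescript
// Delays between attempts from the retry policy: 1m, 5m, 30m, 2h, 12h.
const DEFAULT_RETRY_DELAYS_MS: readonly number[] = [1, 5, 30, 120, 720].map((m) => m * 60_000);

// Hypothetical shape for the per-webhook retry_config.
interface RetryConfig { maxAttempts: number; delaysMs: readonly number[]; }

const DEFAULT_RETRY_CONFIG: RetryConfig = { maxAttempts: 5, delaysMs: DEFAULT_RETRY_DELAYS_MS };

type RetryDecision =
  | { action: "retry"; nextRetryAtMs: number }
  | { action: "abandon" };

// failedAttempts counts attempts made so far, including the one that just failed.
function planRetry(
  failedAttempts: number,
  nowMs: number,
  config: RetryConfig = DEFAULT_RETRY_CONFIG,
): RetryDecision {
  if (failedAttempts >= config.maxAttempts) return { action: "abandon" };
  // Use the delay matching this failure; reuse the last delay if the list is shorter.
  const index = Math.max(0, Math.min(failedAttempts - 1, config.delaysMs.length - 1));
  const delay = config.delaysMs[index];
  if (delay === undefined) return { action: "abandon" }; // empty delay list
  return { action: "retry", nextRetryAtMs: nowMs + delay };
}
```

The returned `nextRetryAtMs` maps naturally onto a ZSET score, so the poller only picks up deliveries whose retry time has passed.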
### Non-functional Requirements
- Rate limiter adds ≤2 ms p99 to authenticated API requests when Redis is
healthy.
- Webhook delivery latency floor ~2 s (ZSET poll interval); acceptable.
- At-least-once delivery semantics; idempotency key is `event_id`.
- Tenant isolation is preserved end-to-end (subscriber filters by `tenant`,
payload builder uses tenant-scoped `getConnection`).
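
Because delivery is at-least-once, a consumer may see the same event more than once. A minimal consumer-side sketch of deduplication by `event_id` is shown below. The `IdempotentHandler` class and the `event_type` field name are assumptions for illustration, and a real consumer would likely keep seen IDs in a durable store with a TTL rather than an in-memory set.

```typescript
// Only event_id is named by the PRD; event_type is an assumed field name.
interface WebhookEnvelope { event_id: string; event_type: string; }

class IdempotentHandler<T extends WebhookEnvelope> {
  private seen = new Set<string>();
  private handle: (evt: T) => void;

  constructor(handle: (evt: T) => void) {
    this.handle = handle;
  }

  // Returns true when the event was processed, false when skipped as a duplicate.
  process(evt: T): boolean {
    if (this.seen.has(evt.event_id)) return false;
    this.handle(evt);
    // Record the ID only after the handler succeeds, so a failed event
    // is processed again when the sender retries it.
    this.seen.add(evt.event_id);
    return true;
  }
}
```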
## Data / API / Integrations
### New tables
- `api_rate_limit_settings` (tenant-distributed):
`(tenant, api_key_id NULL, max_tokens, refill_per_min)` with
`UNIQUE (tenant, api_key_id)`.
- `webhooks` (tenant-distributed): subscription rows with
`signing_secret_vault_path`, `event_types text[]`, retry/rate-limit config,
rolling stats columns, `auto_disabled_at`.
- `webhook_deliveries` (tenant-distributed): one row per attempt;
`is_test boolean`, `next_retry_at`, response capture.
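
The lookup order for `api_rate_limit_settings` (per-key row, then the tenant row with a NULL `api_key_id`, then hard-coded defaults) might be resolved along these lines. The function and type names are hypothetical, and the rows are passed in as an array purely to keep the sketch self-contained; the real code would query the table through a tenant-scoped connection.

```typescript
interface RateLimitRow {
  tenant: string;
  api_key_id: string | null; // null marks the tenant-wide default row
  max_tokens: number;
  refill_per_min: number;
}

interface ResolvedLimit {
  maxTokens: number;
  refillPerMin: number;
  source: "key" | "tenant" | "default";
}

// Hard-coded defaults from the functional requirements.
const HARD_CODED_DEFAULT = { maxTokens: 120, refillPerMin: 60 };

function resolveRateLimit(
  rows: readonly RateLimitRow[],
  tenant: string,
  apiKeyId: string,
): ResolvedLimit {
  const forTenant = rows.filter((r) => r.tenant === tenant);
  const keyRow = forTenant.find((r) => r.api_key_id === apiKeyId);
  if (keyRow) {
    return { maxTokens: keyRow.max_tokens, refillPerMin: keyRow.refill_per_min, source: "key" };
  }
  const tenantRow = forTenant.find((r) => r.api_key_id === null);
  if (tenantRow) {
    return { maxTokens: tenantRow.max_tokens, refillPerMin: tenantRow.refill_per_min, source: "tenant" };
  }
  return { ...HARD_CODED_DEFAULT, source: "default" };
}
```

One detail worth checking in the migration: in PostgreSQL a plain `UNIQUE (tenant, api_key_id)` constraint treats NULLs as distinct, so it does not by itself prevent two tenant-default rows for the same tenant.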
### New / modified APIs
- `TokenBucketRateLimiter`: signature change — namespace as required first
parameter on `tryConsume` / `getState` / `getBucketKey` / `getBucketConfig`.
`BucketConfigGetter` widened to `(tenantId, subjectId?) => BucketConfig`.
`initialize()` accepts `Record<namespace, BucketConfigGetter>`.
- `ApiError` interface: optional `headers?: Record<string, string>`;
`handleApiError` merges them into the `NextResponse`.
- `createSuccessResponse` / `createPaginatedResponse`: optional
`extraHeaders` parameter.
- New helper `enforceApiRateLimit(req, context)` called from
`ApiBaseController.authenticate`, `withApiKeyAuth`, and `withAuth`.
- `webhookEventTypeSchema` extended with `'ticket.comment.added'`.
- Webhook controller stubs implemented (or removed): `getDeliveryDetails`,
`getWebhookHealth`, `getWebhookSubscriptions` (read-only), `rotateWebhookSecret`,
`verifyWebhookSignature`, `listAvailableEvents`. Deferred stubs (bulk,
templates, transformations, etc.) have their routes deleted, not left as 501s.
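
As an illustration of how the optional `headers` on `ApiError` and the `extraHeaders` parameter could be populated, the sketch below builds the rate-limit headers from a bucket snapshot. `buildRateLimitHeaders` and `RateLimitSnapshot` are hypothetical names, and treating `X-RateLimit-Reset` as a Unix timestamp in seconds is an assumption, since the PRD does not fix its format.

```typescript
interface RateLimitSnapshot {
  limit: number;             // bucket capacity (max_tokens)
  remaining: number;         // tokens left after this request
  resetEpochSeconds: number; // assumed: when the next token becomes available
}

function buildRateLimitHeaders(
  snapshot: RateLimitSnapshot,
  denied: boolean,
  nowEpochSeconds: number,
): Record<string, string> {
  // Sent on every response, successful or not.
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(snapshot.limit),
    "X-RateLimit-Remaining": String(Math.max(0, Math.floor(snapshot.remaining))),
  };
  if (denied) {
    // Only 429 responses carry the reset time and Retry-After.
    headers["X-RateLimit-Reset"] = String(snapshot.resetEpochSeconds);
    headers["Retry-After"] = String(Math.max(1, snapshot.resetEpochSeconds - nowEpochSeconds));
  }
  return headers;
}
```

The same record can be passed as `extraHeaders` on success and attached to the `ApiError` on a 429, so both paths share one header builder.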
### Reused infrastructure (no new dependency)
- `getRedisClient` for both buckets and the webhook ZSET.
- `DelayedEmailQueue` pattern for the webhook poller class.
- `getSecretProviderInstance` for signing-secret storage.
- Existing event bus (`publishEvent` / `getEventBus().subscribe`) and the
`TICKET_*` schemas in `packages/event-schemas`.
## Security / Permissions
- Webhook signing secrets never leave the secret provider once written;
GET responses for webhook rows must omit the column entirely (covered by
an integration assertion).
- SSRF protection enforced server-side in both real and test delivery, with
an env-var escape hatch (`WEBHOOK_SSRF_ALLOW_PRIVATE`) for staging/local
only.
- Rate-limit configuration writes are scoped to tenant admins (RBAC: the
existing `api_keys.update` permission applies). Webhook CRUD uses `webhook.*`
permissions, added if not already present.
- `signing_secret_vault_path` column never exposed in the API response shape.
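
A minimal sketch of the SSRF check for URL literals is shown below. It covers the scheme check and the IPv4 ranges named in the requirements (RFC 1918, loopback, link-local, CGNAT) plus `localhost` and the IPv6 loopback literal. The function names are hypothetical, and the sketch deliberately leaves out DNS resolution: a hostname that resolves to a private address would need to be checked against the resolved IP at delivery time.

```typescript
// Parse a dotted-quad IPv4 literal into an unsigned 32-bit number, or null.
function ipv4ToInt(host: string): number | null {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return null;
  const octets = m.slice(1).map(Number);
  if (octets.some((o) => o > 255)) return null;
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

function cidr(base: string, bits: number): { base: number; bits: number } {
  const n = ipv4ToInt(base);
  if (n === null) throw new Error(`invalid CIDR base: ${base}`);
  return { base: n, bits };
}

const BLOCKED_V4_RANGES = [
  cidr("10.0.0.0", 8),     // RFC 1918
  cidr("172.16.0.0", 12),  // RFC 1918
  cidr("192.168.0.0", 16), // RFC 1918
  cidr("127.0.0.0", 8),    // loopback
  cidr("169.254.0.0", 16), // link-local
  cidr("100.64.0.0", 10),  // CGNAT shared address space
];

function inCidr(ip: number, range: { base: number; bits: number }): boolean {
  const divisor = 2 ** (32 - range.bits);
  return Math.floor(ip / divisor) === Math.floor(range.base / divisor);
}

// allowPrivate models the WEBHOOK_SSRF_ALLOW_PRIVATE escape hatch;
// the scheme check still applies when it is set.
function isBlockedWebhookUrl(rawUrl: string, allowPrivate = false): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable URLs are rejected
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  if (allowPrivate) return false;
  const host = url.hostname;
  if (host === "localhost" || host === "[::1]") return true;
  const ip = ipv4ToInt(host);
  if (ip === null) return false; // a hostname: resolved-IP check not shown here
  return BLOCKED_V4_RANGES.some((range) => inCidr(ip, range));
}
```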
## Observability
- Rate-limit metrics: `api_rate_limit_consumed_total{tenant,api_key_id,outcome}`,
`api_rate_limit_remaining{tenant,api_key_id}`,
`api_rate_limit_redis_unavailable_total`.
- Webhook metrics:
`webhook_deliveries_total{tenant,webhook_id,outcome}`,
`webhook_delivery_duration_ms` histogram, `webhook_queue_depth` gauge,
`webhook_auto_disabled_total{tenant,webhook_id}`.
- Structured WARN log on every throttle and on every Redis fail-open.
- Grafana panel: top throttled `(tenant, api_key_id)` and top failing
webhooks.
## Rollout / Migration
1. **Citus worker `max_connections` bump** — out of band, ~1 hour, eliminates
the immediate 500s. Not part of this plan.
2. **Rate limiter MVP** (Stages 1–3):
   - *Stage 1 (observation):* `RATE_LIMIT_ENFORCE=false`. Measure for one
     week.
   - *Stage 2 (notify outliers):* email tenants whose keys would have been
     throttled.
   - *Stage 3 (enforce):* flip `RATE_LIMIT_ENFORCE=true`.
   - *Stage 4:* remove the env-var bypass after ~2 weeks stable.
3. **Webhook MVP** (Stages 1–4):
   - *Stage 1 (dark launch):* ship behind a feature flag; internal testing
     against `webhook.site`.
   - *Stage 2 (invite-only beta):* enable for a handful of API-heavy
     tenants, including the noisy poller.
   - *Stage 3 (GA):* open to all tenants; publish docs.
   - *Stage 4:* tighten polled REST rate limits once webhook adoption is
     healthy.
4. The noisy poller (`52.53.71.0`) gets a personal note + migration guide.
## Open Questions
1. Should we expose `ticket.deleted`? Internal `TICKET_DELETED` exists.
Defer to v2 unless the noisy poller specifically asks.
2. Per-tenant webhook count limit? Cap at 50 per tenant by default.
3. Should `ticket.status_changed` include `previous_status_id`/
`previous_status_name`? **Yes** — captured as a feature.
4. Webhook-side filtering by `entity_ids`? **Implement in v1** — directly
addresses the noisy poller's "tell me about these 6 tickets" pattern.
5. IA placement of the new settings UI sections — Settings → Security or
Settings → Integrations? Confirm with design.
6. Per-tenant rate-limit cap on top of per-key buckets? Defer until data
shows we need it.
7. Per-endpoint cost weights (`/search` costs more than `/get`)? Defer until
observation data shows pressure differences.
## Acceptance Criteria (Definition of Done)
**Rate limiter**
- A test API key making 121 requests in a single burst (faster than the
bucket can refill a token, so within one second at the defaults) receives 429
on the 121st with the documented headers, and a different key in the same
tenant is unaffected.
- The email path's existing rate-limit behavior is unchanged on a baseline
`notification_settings.rate_limit_per_minute` value.
- With `RATE_LIMIT_ENFORCE=false`, requests that would have been denied are
allowed through, but the same metrics and headers are emitted as in enforce
mode.
- An `api_rate_limit_settings` row with `(tenant, api_key_id)` overrides
the tenant default; `clearForKey` returns to the tenant default within
the cache TTL.
**Webhooks**
- `TICKET_ASSIGNED` published in tenant A enqueues a delivery job for an
active webhook in tenant A subscribed to `ticket.assigned`, and does
**not** enqueue a job for any webhook in tenant B.
- The `WebhookDeliveryQueue` poller successfully delivers to a stubbed
HTTP server, persists a row in `webhook_deliveries`, and updates webhook
stats columns.
- A webhook URL pointing at `127.0.0.1` or `10.0.0.5` is rejected before
delivery (production mode), and accepted with
`WEBHOOK_SSRF_ALLOW_PRIVATE=true`.
- HMAC verification with the documented recipe matches the server signature
byte-for-byte.
- Signing secret never appears in any GET webhook response body.
- 5 failed attempts mark the delivery `abandoned`; 24h of all-failure
deliveries auto-disable the webhook.
- Webhooks settings UI: an admin can create, view deliveries, send a test,
rotate the secret, and pause/resume a webhook.