# Design: RMM alert handling and ticket lifecycle sync

- **Status:** Approved
- **Created:** 2026-06-12
- **Branch:** `feature/rmm-alerts-sync`

## Problem

RMM alert ingestion is scaffolded but broken and disconnected:

- The NinjaOne webhook (`ee/server/src/app/api/webhooks/ninjaone/route.ts`)
  inserts `activity_type` and `source_data` into `rmm_alerts`
  (`webhookHandler.ts:671-674`). Neither column exists: the only migration that
  creates these tables,
  `server/migrations/20251124000001_create_rmm_integration_tables.cjs`, defines
  neither of them, so the basic alert insert fails at runtime.
- The rules engine
  (`ee/server/src/lib/integrations/ninjaone/alerts/alertProcessor.ts`) reads
  JSONB `conditions` / `actions` columns from `rmm_alert_rules`. The migration
  created flat `text[]` filter columns instead. The processor is also never
  called from the webhook path; only tests import it.
- The ticket creator
  (`ee/server/src/lib/integrations/ninjaone/alerts/ticketCreator.ts`) works but
  is reachable only from the manual button in `AssetAlertsSection.tsx`.
- Nothing closes a ticket when an alert resets, nothing resets a NinjaOne alert
  when a ticket closes, repeat alerts have no dedup, and there is no UI or CRUD
  for alert rules.

The `rmm_alerts` / `rmm_alert_rules` tables are deployed but hold no data worth
preserving, so corrective schema work can be additive without backfill.

## Decisions

| Question | Decision |
| --- | --- |
| Scope | Full pipeline + lifecycle sync + rules CRUD/UI + workflow v2 events + notifications |
| Dedup | One open ticket per (device, condition). Repeats append a comment and bump a counter |
| Auto-close on alert reset | Always comment; close the ticket only if no human has touched it |
| Outbound reset on ticket close | Per-rule flag `resetAlertOnTicketClose`, default true |
| Provider scope | Provider-generic pipeline; NinjaOne and TacticalRMM both wired |
| Processing model | Synchronous in the webhook request (local DB work only) |
| Migration strategy | One additive corrective migration; never rewrite the deployed one |
| Maintenance windows | Suppress before rule matching; alert stored as `suppressed`; the poller processes still-active alerts after the window ends |
| Alert polling | Per-integration Temporal schedule (default on, every 15 min) reconciles missed triggers and missed resets through the same pipeline; Huntress incident polling migrated to the same model |

## Architecture

New shared module `shared/rmm/alerts/`, following the
`shared/rmm/sharedAssetIngestionService.ts` precedent so both `ee/server`
(NinjaOne) and `server` (TacticalRMM) can import it:

- `contracts.ts` defines `NormalizedRmmAlertEvent`:
  `{ tenantId, integrationId, provider, kind: 'triggered' | 'reset' |
  'acknowledged', externalAlertId, externalDeviceId, activityType, alertClass,
  sourceType, severity, message, deviceName, externalOrganizationId,
  occurredAt, raw }`. It also defines `RmmAlertOutboundAdapter` with
  `resetAlert(externalAlertId)`, optional per provider.
- `processRmmAlertEvent.ts` is the single pipeline entry point.
- `alertRuleEvaluator.ts`, `alertTicketCreator.ts`, `alertLifecycle.ts` hold the
  logic moved out of `ee/server/src/lib/integrations/ninjaone/alerts/` and made
  provider-agnostic.

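As an illustration only, the contract could be typed roughly as follows. The
field names come from the list above; the field types, which fields are
optional, and the `ExampleWebhookPayload` shape with its mapper are assumptions
made for this sketch and are not the real NinjaOne or Tactical payload schemas.

```ts
type RmmAlertKind = 'triggered' | 'reset' | 'acknowledged';

interface NormalizedRmmAlertEvent {
  tenantId: string;
  integrationId: string;
  provider: string; // assumed to be a short provider slug
  kind: RmmAlertKind;
  externalAlertId: string;
  externalDeviceId: string;
  activityType: string;
  alertClass?: string; // optionality of these fields is an assumption
  sourceType?: string;
  severity?: string;
  message?: string;
  deviceName?: string;
  externalOrganizationId?: string;
  occurredAt: Date;
  raw: unknown; // original provider payload, stored in the `metadata` column
}

// A provider without a reset API simply has no adapter registered.
interface RmmAlertOutboundAdapter {
  resetAlert(externalAlertId: string): Promise<void>;
}

// Hypothetical provider payload, used only to show the mapping step.
interface ExampleWebhookPayload {
  id: string;
  deviceId: string;
  activityType: string;
  status: 'TRIGGERED' | 'RESET' | 'ACKNOWLEDGED';
  message?: string;
  timestamp: number; // epoch seconds (assumed)
}

const KIND_BY_STATUS: Record<ExampleWebhookPayload['status'], RmmAlertKind> = {
  TRIGGERED: 'triggered',
  RESET: 'reset',
  ACKNOWLEDGED: 'acknowledged',
};

function mapExampleWebhookToAlertEvent(
  payload: ExampleWebhookPayload,
  ctx: { tenantId: string; integrationId: string; provider: string },
): NormalizedRmmAlertEvent {
  return {
    tenantId: ctx.tenantId,
    integrationId: ctx.integrationId,
    provider: ctx.provider,
    kind: KIND_BY_STATUS[payload.status],
    externalAlertId: payload.id,
    externalDeviceId: payload.deviceId,
    activityType: payload.activityType,
    message: payload.message,
    occurredAt: new Date(payload.timestamp * 1000),
    raw: payload,
  };
}
```

Keeping each mapper this thin means provider-specific knowledge stays in the
route, and everything after the mapping step sees one shape.
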
Webhook routes keep their existing auth, tenant resolution, and tier gating.
Each route maps its payload to a `NormalizedRmmAlertEvent` (a thin
`mapNinjaOneWebhookToAlertEvent()` in ee; a Tactical equivalent in
`server/src/app/api/webhooks/tacticalrmm/route.ts`, replacing its direct
`rmm_alerts` writes) and calls the pipeline.

**Triggered:** upsert `rmm_alerts` on `(tenant, integration_id,
external_alert_id)` → compute `dedup_key` (device + condition identity; for
NinjaOne, `statusCode` falling back to `activityType`) → maintenance-window
check (a match stores the alert as `suppressed` and stops) → evaluate rules
(first match by `priority_order`; a rule with no conditions is a catch-all) →
dedup check → create a ticket or append an occurrence comment to the existing
open ticket → publish events. The matched rule's ID is stored on the alert row
so later lifecycle steps do not re-evaluate rules.

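The ordering of the triggered path can be sketched with injected steps. This is
a minimal, synchronous illustration (the real pipeline would be async and
transactional). The `PipelineSteps` interface, the `deviceId:condition` key
format, and the choice to still publish an event when no rule creates a ticket
are all assumptions of the sketch.

```ts
type Outcome = 'suppressed' | 'no_ticket' | 'ticket_created' | 'occurrence_appended';

interface TriggeredEvent {
  externalDeviceId: string;
  activityType: string;
  statusCode?: string; // present for NinjaOne only
}

// Hypothetical key format: device identity plus condition identity.
function computeDedupKey(event: TriggeredEvent): string {
  return `${event.externalDeviceId}:${event.statusCode ?? event.activityType}`;
}

// Each step stands in for a DB or bus operation; all names are illustrative.
interface PipelineSteps {
  upsertAlert(event: TriggeredEvent, dedupKey: string): string; // returns alert ID
  findSuppressingWindow(event: TriggeredEvent): string | null; // returns window ID
  markSuppressed(alertId: string, windowId: string): void;
  matchRule(event: TriggeredEvent): { ruleId: string; createTicket: boolean } | null;
  findOpenTicket(dedupKey: string): string | null; // returns ticket ID
  createTicket(alertId: string, ruleId: string): string;
  appendOccurrence(ticketId: string, alertId: string): void;
  publish(eventName: string, alertId: string): void;
}

function processTriggered(event: TriggeredEvent, steps: PipelineSteps): Outcome {
  const dedupKey = computeDedupKey(event);
  const alertId = steps.upsertAlert(event, dedupKey);

  // Maintenance windows are checked before any rule is evaluated.
  const windowId = steps.findSuppressingWindow(event);
  if (windowId !== null) {
    steps.markSuppressed(alertId, windowId);
    return 'suppressed'; // no rules, no ticket, no events
  }

  const rule = steps.matchRule(event);
  let outcome: Outcome = 'no_ticket';
  if (rule !== null && rule.createTicket) {
    const openTicketId = steps.findOpenTicket(dedupKey);
    if (openTicketId !== null) {
      steps.appendOccurrence(openTicketId, alertId);
      outcome = 'occurrence_appended';
    } else {
      steps.createTicket(alertId, rule.ruleId);
      outcome = 'ticket_created';
    }
  }

  steps.publish('RMM_ALERT_TRIGGERED', alertId);
  return outcome;
}
```
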
**Reset:** mark the alert resolved. If a ticket is linked and the matched rule
has `autoResolveTicket`: always add a comment; close the ticket only if it is
untouched. Publish `RMM_ALERT_RESOLVED`.

**Outbound (ticket close → RMM):** an event-bus subscriber on ticket-closed
events looks up unresolved `rmm_alerts` by `ticket_id`, checks the matched
rule's `resetAlertOnTicketClose`, and calls the provider's outbound adapter
(`NinjaOneClient.resetAlert()` at
`ee/server/src/lib/integrations/ninjaone/ninjaOneClient.ts`; Tactical's
resolve-alert API if supported, otherwise the step is skipped). This is the
only external call in the design and it already runs async on the bus.

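A minimal sketch of the outbound subscriber's core loop, assuming the alerts
linked to the closed ticket have already been loaded together with their
matched rule's flag. The `LinkedAlert` shape, the adapter registry keyed by
provider name, and the `outboundResetError` metadata key are hypothetical.

```ts
interface RmmAlertOutboundAdapter {
  resetAlert(externalAlertId: string): Promise<void>;
}

interface LinkedAlert {
  externalAlertId: string;
  provider: string;
  resetAlertOnTicketClose: boolean; // copied from the matched rule's actions
  metadata: Record<string, unknown>;
}

// Providers without a reset API have no entry in the registry.
type AdapterRegistry = Record<string, RmmAlertOutboundAdapter | undefined>;

async function resetAlertsForClosedTicket(
  alerts: LinkedAlert[],
  adapters: AdapterRegistry,
): Promise<number> {
  let resetCount = 0;
  for (const alert of alerts) {
    if (!alert.resetAlertOnTicketClose) continue; // rule opted out
    const adapter = adapters[alert.provider];
    if (adapter === undefined) continue; // provider cannot reset: skip the step
    try {
      await adapter.resetAlert(alert.externalAlertId);
      resetCount += 1;
    } catch (err) {
      // A failed reset is recorded on the alert and never blocks the close.
      alert.metadata['outboundResetError'] = String(err);
    }
  }
  return resetCount;
}
```

Because the ticket is already closed by the time the subscriber runs, a failure
here can only leave the RMM alert open; the reconciliation poller does not
depend on this call succeeding.
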
The org-mapping flag `rmm_organization_mappings.auto_create_tickets` is
deprecated. Rules, with their organization filter, are the single source of
truth for what creates tickets.

Idempotency: NinjaOne retries on 5xx. The external-ID upsert plus the dedup
check make a replayed webhook a no-op, so at-least-once delivery is safe.

## Schema changes

One additive migration.

Add to `rmm_alerts`:

- `activity_type` varchar(100), `acknowledged_at` timestamptz,
  `acknowledged_by` uuid (columns the code already writes)
- `dedup_key` varchar(255), with an index on
  `(tenant, integration_id, dedup_key)`
- `occurrence_count` int default 1, `last_occurrence_at` timestamptz
- `matched_rule_id` uuid null
- `auto_ticket_created` boolean default false
- `suppressed_by_window_id` uuid null (alert `status` gains a `suppressed`
  value)

New table `rmm_maintenance_windows`: `tenant`, `window_id`, optional scoping
columns `integration_id`, `client_id`, `asset_id` (null = applies to all of
that dimension), `name`, `is_active`, `starts_at`/`ends_at` for one-off
windows, and a `recurrence` jsonb
(`{ type: 'weekly', days, startTime, endTime, timezone }`) for recurring ones.

The raw payload standardizes on the existing `metadata` jsonb column. Code that
writes `source_data` changes to `metadata`.

In `rmm_alert_rules`, add `conditions` jsonb and `actions` jsonb, and drop the
eleven flat columns (`severity_filter`, `source_type_filter`,
`alert_class_filter`, `organization_filter`, `message_pattern`,
`create_ticket`, `ticket_channel_id`, `ticket_priority`, `assigned_user_id`,
`ticket_template`, `auto_resolve_ticket`). `name`, `description`, `is_active`,
and `priority_order` stay as real columns.

### `conditions` shape

All fields optional; every present field must match; an empty object matches
every alert.

```ts
{
  severities?: string[],
  activityTypes?: string[],
  alertClasses?: string[],
  sourceTypes?: string[],
  organizationIds?: string[],   // external org IDs
  messagePattern?: string,      // regex, validated at save time
  keywords?: string[]           // substring match on message
}
```

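One way the matching semantics above could be evaluated is sketched below. It
also folds in first-match ordering by `priority_order` and the rule that a
condition which fails to evaluate is logged and skipped. Several details are
assumptions of the sketch: an empty array is treated like an absent field,
`keywords` matches when any keyword occurs, matching is case-sensitive, and
lower `priorityOrder` values are evaluated first.

```ts
interface RuleConditions {
  severities?: string[];
  activityTypes?: string[];
  alertClasses?: string[];
  sourceTypes?: string[];
  organizationIds?: string[];
  messagePattern?: string;
  keywords?: string[];
}

interface AlertFields {
  severity?: string;
  activityType?: string;
  alertClass?: string;
  sourceType?: string;
  externalOrganizationId?: string;
  message?: string;
}

// An absent (or empty) list places no constraint on the alert.
function inList(list: string[] | undefined, value: string | undefined): boolean {
  if (list === undefined || list.length === 0) return true;
  return value !== undefined && list.includes(value);
}

function matchesConditions(c: RuleConditions, alert: AlertFields): boolean {
  if (!inList(c.severities, alert.severity)) return false;
  if (!inList(c.activityTypes, alert.activityType)) return false;
  if (!inList(c.alertClasses, alert.alertClass)) return false;
  if (!inList(c.sourceTypes, alert.sourceType)) return false;
  if (!inList(c.organizationIds, alert.externalOrganizationId)) return false;

  const message = alert.message ?? '';
  // new RegExp throws SyntaxError on an invalid pattern; the caller handles it.
  if (c.messagePattern !== undefined && !new RegExp(c.messagePattern).test(message)) {
    return false;
  }
  if (c.keywords !== undefined && c.keywords.length > 0) {
    if (!c.keywords.some((keyword) => message.includes(keyword))) return false;
  }
  return true;
}

interface Rule {
  ruleId: string;
  priorityOrder: number;
  isActive: boolean;
  conditions: RuleConditions;
}

function firstMatchingRule(rules: Rule[], alert: AlertFields): Rule | null {
  const ordered = rules
    .filter((rule) => rule.isActive)
    .sort((a, b) => a.priorityOrder - b.priorityOrder);
  for (const rule of ordered) {
    try {
      if (matchesConditions(rule.conditions, alert)) return rule;
    } catch (err) {
      // A rule that cannot be evaluated is skipped, never fatal.
      console.warn(`Skipping rule ${rule.ruleId}: ${String(err)}`);
    }
  }
  return null;
}
```
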
### `actions` shape

```ts
{
  createTicket: boolean,
  boardId?: string,
  priorityOverride?: string,
  assignToUserId?: string,
  ticketTemplate?: { titleTemplate?: string, descriptionTemplate?: string },
  autoResolveTicket: boolean,
  autoResolveStatusId?: string,      // fallback: tenant's first is_closed status
  resetAlertOnTicketClose: boolean,  // default true
  notifyUserIds?: string[]
}
```

Ticket templates support `{{device}}`, `{{message}}`, `{{severity}}`, and
`{{organization}}` placeholders. Zod schemas validate both shapes at the
server-action boundary.

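Placeholder substitution for the ticket templates could look like the sketch
below. The `renderTemplate` helper is hypothetical, and leaving unknown
placeholders untouched (rather than blanking them) is an assumption.

```ts
interface TemplateVars {
  device: string;
  message: string;
  severity: string;
  organization: string;
}

function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, name: string) =>
    // Only the four documented placeholders are substituted.
    Object.prototype.hasOwnProperty.call(vars, name)
      ? vars[name as keyof TemplateVars]
      : whole,
  );
}

const title = renderTemplate('[{{severity}}] {{device}}: {{message}}', {
  device: 'srv-db-01',
  message: 'Disk usage above 90%',
  severity: 'CRITICAL',
  organization: 'Acme',
});
// title === '[CRITICAL] srv-db-01: Disk usage above 90%'
```
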
## Lifecycle semantics

**Dedup.** On a triggered event whose matched rule creates tickets, look for an
alert row with the same `dedup_key` whose linked ticket is still open. If
found: point the new alert row at that ticket, increment `occurrence_count`,
and add an internal comment ("Alert re-triggered — Nth occurrence"). If not:
create a ticket. Dedup is always on; it is not per-rule configurable.

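The dedup decision can be illustrated with in-memory stand-ins for the tables.
This is a sketch under assumptions: the counter is bumped on the earlier alert
row that already links to the open ticket, the comment wording is paraphrased,
and `applyDedup` with its row shapes is hypothetical.

```ts
interface AlertRow {
  alertId: string;
  dedupKey: string;
  ticketId: string | null;
  occurrenceCount: number;
}

interface Ticket {
  ticketId: string;
  isOpen: boolean;
  internalComments: string[];
}

function applyDedup(
  incoming: AlertRow,
  existingAlerts: AlertRow[],
  tickets: Map<string, Ticket>,
  createTicket: () => Ticket,
): Ticket {
  for (const prior of existingAlerts) {
    if (prior.dedupKey !== incoming.dedupKey || prior.ticketId === null) continue;
    const ticket = tickets.get(prior.ticketId);
    if (ticket === undefined || !ticket.isOpen) continue;

    // Same device and condition, ticket still open: reuse it.
    prior.occurrenceCount += 1;
    incoming.ticketId = ticket.ticketId;
    ticket.internalComments.push(
      `Alert re-triggered (occurrence ${prior.occurrenceCount})`,
    );
    return ticket;
  }

  // No open ticket for this device and condition: create one.
  const ticket = createTicket();
  tickets.set(ticket.ticketId, ticket);
  incoming.ticketId = ticket.ticketId;
  return ticket;
}
```

A closed ticket never absorbs a repeat: once the linked ticket is closed, the
next occurrence of the same condition opens a fresh ticket.
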
**Untouched.** A ticket is untouched when it has no human-authored comments, no
time entries, and no manual status change since creation. Rule-driven
auto-assignment does not count as touched.

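The untouched check and its use on alert reset can be sketched as follows. The
`TicketState` shape and the `applyReset` helper are hypothetical; only the
three criteria come from the definition above.

```ts
interface TicketState {
  comments: { authorType: 'human' | 'system' }[];
  timeEntryCount: number;
  manualStatusChangeCount: number; // rule-driven auto-assignment is not counted
  isClosed: boolean;
}

function isUntouched(ticket: TicketState): boolean {
  return (
    !ticket.comments.some((comment) => comment.authorType === 'human') &&
    ticket.timeEntryCount === 0 &&
    ticket.manualStatusChangeCount === 0
  );
}

type ResetResult = 'skipped' | 'commented' | 'closed';

function applyReset(ticket: TicketState, autoResolveTicket: boolean): ResetResult {
  if (!autoResolveTicket) return 'skipped';
  const untouched = isUntouched(ticket);
  // The reset comment is always added and is system-authored.
  ticket.comments.push({ authorType: 'system' });
  if (!untouched) return 'commented';
  ticket.isClosed = true;
  return 'closed';
}
```
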
## Maintenance windows

The pipeline checks windows before rule matching. An alert is suppressed when
an active window matches all of its non-null scopes (integration, client,
asset) at the alert's `occurredAt`: one-off windows by `starts_at`/`ends_at`,
weekly recurring windows by day and time range in the window's timezone.

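A sketch of the window-matching rule, using the standard
`Intl.DateTimeFormat` API to read the weekday and wall-clock time in the
window's timezone. The encoding of `days` as numbers (0 = Sunday), the `HH:mm`
time format, the half-open `[start, end)` interval, and the treatment of a
range whose end is earlier than its start as crossing midnight are assumptions;
the source only names the `recurrence` fields.

```ts
interface WeeklyRecurrence {
  type: 'weekly';
  days: number[]; // 0 = Sunday ... 6 = Saturday (assumed encoding)
  startTime: string; // 'HH:mm' (assumed format)
  endTime: string;
  timezone: string; // IANA name, e.g. 'America/Chicago'
}

const WEEKDAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];

// Weekday index and minutes since midnight, as seen in the given timezone.
function localDayAndMinutes(at: Date, timezone: string): { day: number; minutes: number } {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: timezone,
    weekday: 'short',
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23',
  }).formatToParts(at);
  const get = (type: string): string =>
    parts.find((part) => part.type === type)?.value ?? '';
  return {
    day: WEEKDAYS.indexOf(get('weekday')),
    minutes: Number(get('hour')) * 60 + Number(get('minute')),
  };
}

function toMinutes(hhmm: string): number {
  const pieces = hhmm.split(':');
  return Number(pieces[0]) * 60 + Number(pieces[1]);
}

function weeklyWindowCovers(recurrence: WeeklyRecurrence, at: Date): boolean {
  const { day, minutes } = localDayAndMinutes(at, recurrence.timezone);
  const start = toMinutes(recurrence.startTime);
  const end = toMinutes(recurrence.endTime);
  if (start <= end) {
    return recurrence.days.includes(day) && minutes >= start && minutes < end;
  }
  // Midnight-crossing window: late part on the listed day, early part on the next.
  const previousDay = (day + 6) % 7;
  return (
    (recurrence.days.includes(day) && minutes >= start) ||
    (recurrence.days.includes(previousDay) && minutes < end)
  );
}

interface Scope {
  integrationId: string | null;
  clientId: string | null;
  assetId: string | null;
}

// A null scope column on the window applies to every value of that dimension.
function scopeMatches(window: Scope, alert: Scope): boolean {
  return (
    (window.integrationId === null || window.integrationId === alert.integrationId) &&
    (window.clientId === null || window.clientId === alert.clientId) &&
    (window.assetId === null || window.assetId === alert.assetId)
  );
}
```
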
A suppressed alert is stored with `status = 'suppressed'` and
`suppressed_by_window_id`. It creates no ticket, sends no notifications, and
publishes no workflow events. A reset arriving for a suppressed alert resolves
it quietly. When a window ends, the reconciliation poller processes
still-active suppressed alerts through the normal rules path, so a condition
that fired during maintenance and is still firing afterward becomes a ticket.

Windows have their own CRUD server actions (admin-gated, Zod-validated) and a
"Maintenance Windows" subsection beside Alert Rules in RMM settings: a list
plus an editor with client/asset scope pickers and a one-off or weekly
recurring schedule.

## Alert polling (reconciliation)

Polling rides Alga's job-runner abstraction (`IJobRunner`,
`JobRunnerFactory`): the same handlers
(`server/src/lib/jobs/handlers/rmmAlertPollingHandlers.ts`, registered in the
central `JobHandlerRegistry` for the server and in
`initializeJobHandlersForWorker` for the temporal worker) run on whichever
backend the edition selects (**Temporal Schedules in EE, pg-boss cron in
CE**), so TacticalRMM keeps polling in community deployments. Each
integration gets one recurring job keyed by singletonKey
`<job>:<tenant>:<integration>`: `rmm-alert-reconciliation` (interval from
`settings.alertPolling`, 5–60 min, default 15, on by default) and
`huntress-incident-poll` (from `pollIntervalMinutes`, default 5), which
replaced Huntress's pg-boss dispatcher.

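For illustration, the job key and interval handling could look like this. Both
helper names are hypothetical, and clamping an out-of-range interval into the
5–60 minute range (rather than rejecting it at save time) is an assumption.

```ts
type PollingJob = 'rmm-alert-reconciliation' | 'huntress-incident-poll';

// One recurring job per integration: <job>:<tenant>:<integration>.
function pollingSingletonKey(job: PollingJob, tenant: string, integrationId: string): string {
  return `${job}:${tenant}:${integrationId}`;
}

const MIN_INTERVAL_MINUTES = 5;
const MAX_INTERVAL_MINUTES = 60;
const DEFAULT_INTERVAL_MINUTES = 15;

function reconciliationIntervalMinutes(configured?: number): number {
  if (configured === undefined || !Number.isFinite(configured)) {
    return DEFAULT_INTERVAL_MINUTES;
  }
  return Math.min(MAX_INTERVAL_MINUTES, Math.max(MIN_INTERVAL_MINUTES, Math.round(configured)));
}
```

A deterministic key is what lets a scheduler treat "create" as "create or
update": scheduling the same integration twice addresses the same job instead
of producing a duplicate.
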
A small control loop, `reconcileRmmPollingSchedules()`, converges the
recurring jobs onto `rmm_integrations` state: creating jobs for new
connections, recreating them when intervals change, and cancelling them on
disconnect or polling-disable. It runs every 5 minutes from `initializeApp`,
once at server boot, and immediately from the NinjaOne connect/disconnect
flows, so settings changes take effect without operator intervention.
Handlers also re-check integration eligibility per run, making any schedule
that briefly outlives its integration a harmless no-op.
`TemporalJobRunner.scheduleRecurringJob` was upgraded to update an existing
schedule's spec instead of silently keeping the old one. Each cycle works
through the same normalized pipeline:

1. Fetch active alerts from the RMM (`NinjaOneClient.getAlerts()`; Tactical's
   alerts API) and upsert any the webhooks missed as `triggered` events, so
   rules, dedup, and tickets apply identically.
2. Synthesize `reset` events for local active alerts no longer active in the
   RMM, catching missed reset webhooks (the main source of stale tickets).
3. Process expired-window suppressed alerts that are still active.

Webhooks remain the primary low-latency path; polling is the backstop. Ingest
idempotency makes the overlap harmless.

One id-space caveat: NinjaOne webhooks carry activity ids while its alerts API
returns condition uids, so stale-alert detection only trusts poller-ingested
ids (marked in alert metadata) and never closes a webhook-created alert on the
poller's say-so. Per-device+condition dedup absorbs cross-source duplicates.
NinjaOne and TacticalRMM ship fetchers (Tactical's reuses the alerts endpoint
its manual backfill verified); Level has no list-alerts surface and stays
webhook-only.

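Steps 1 and 2 of the cycle, together with the id-space caveat, amount to a set
difference in each direction. The sketch below plans which events to
synthesize; the `LocalAlert` shape and the `ingestedByPoller` flag (standing in
for the marker kept in alert metadata) are hypothetical.

```ts
interface LocalAlert {
  externalAlertId: string;
  status: 'active' | 'resolved' | 'suppressed';
  ingestedByPoller: boolean; // stands in for the marker in alert metadata
}

interface ReconciliationPlan {
  toTrigger: string[]; // remote alerts the webhooks missed
  toReset: string[]; // local alerts that are no longer active remotely
}

function planReconciliation(
  remoteActiveIds: string[],
  localAlerts: LocalAlert[],
): ReconciliationPlan {
  const remote = new Set(remoteActiveIds);
  const known = new Set(localAlerts.map((alert) => alert.externalAlertId));

  const toTrigger = remoteActiveIds.filter((id) => !known.has(id));

  const toReset = localAlerts
    .filter(
      (alert) =>
        alert.status === 'active' &&
        // Webhook-created ids live in a different id space, so their absence
        // from the poll result proves nothing.
        alert.ingestedByPoller &&
        !remote.has(alert.externalAlertId),
    )
    .map((alert) => alert.externalAlertId);

  return { toTrigger, toReset };
}
```

Each id in `toTrigger` and `toReset` would then be fed through the same
pipeline entry point as a webhook event, so rules, dedup, and lifecycle
handling stay in one place.
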
## Rules CRUD and UI

Server actions in `packages/integrations` (alongside `tacticalRmmActions.ts`):
list, create, update, delete, reorder. Admin-gated, Zod-validated.

UI: an "Alert Rules" section inside the RMM integration settings, rendered
per-integration next to the existing org-mapping manager, shared by NinjaOne
and Tactical. A priority-ordered rules list with active toggles and reorder
controls. The rule editor dialog mirrors the two JSONB shapes: a Match group
(severity multi-select, activity types, alert classes, org picker fed from
`rmm_organization_mappings`, keywords, regex with save-time validation) and an
Actions group (create-ticket toggle, board picker, priority override, assignee,
title/description templates with placeholder hints, auto-resolve toggle,
reset-on-close toggle, notify-users picker).

The per-asset `AssetAlertsSection` remains the alert-viewing surface. Tickets
remain the primary work queue.

## Workflow v2 events and notifications

Register `RMM_ALERT_TRIGGERED` and `RMM_ALERT_RESOLVED` as native workflow v2
catalog events with a provider-generic payload (alert ID, external IDs,
severity, asset/client IDs, ticket ID if any). The pipeline publishes them,
replacing today's orphaned legacy-bus publishes. Add one generic workflow
action, `rmm.alerts.create_ticket`, which invokes the shared ticket creator by
alert ID.

Notifications ride the existing notification infrastructure: a new `rmm-alert`
category honoring per-user preferences (in-app + email), fired when a matched
rule has `notifyUserIds`.

## Error handling

- A rule that fails to evaluate (for example, a bad regex) is logged and
  skipped. It never aborts the pipeline.
- Webhook responses are unchanged: 200 for unmapped orgs, 200 on success, 500
  on unexpected errors so the RMM retries. Ingest is idempotent, so retries are
  safe.
- Outbound reset failures log, stamp the alert's `metadata`, and never block
  the ticket close.
- Two hardening items in adjacent code: implement the CSRF validation TODO in
  the NinjaOne OAuth callback (the state payload already carries `csrf` and
  `timestamp`), and remove the superseded `resetInNinjaOne` TODO in
  `alertProcessor.ts`'s `resolveAlert()`.

## Testing

80/20 split. An automated core targets the logic where permutations hide bugs
and regressions are expensive: the rule-evaluation matrix, dedup, the
untouched-ticket check, window matching (timezones, midnight-crossing
recurrence, scope combinations), idempotency (webhook replay, webhook+poller
overlap), lifecycle in both directions, tenant isolation and admin gating on
CRUD, and one end-to-end webhook→ticket integration test per direction. The
existing `alertProcessor` tests move to the new shared module.

Everything UI-, infra-, or delivery-shaped is a manual smoke pass instead:
settings screens, live NinjaOne/Tactical round-trips, Temporal schedule
lifecycle on connect/disconnect, email delivery, and migrations against a real
stack. The risk-driven checklist lives at
`docs/plans/2026-06-12-rmm-alert-handling/SMOKE_TESTS.md`.