Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
14 KiB
Design: RMM alert handling and ticket lifecycle sync
- Status: Approved
- Created: 2026-06-12
- Branch:
feature/rmm-alerts-sync
Problem
RMM alert ingestion is scaffolded but broken and disconnected:
- The NinjaOne webhook (
ee/server/src/app/api/webhooks/ninjaone/route.ts) insertsactivity_typeandsource_dataintormm_alerts(webhookHandler.ts:671-674). Neither column exists. The only migration that creates these tables isserver/migrations/20251124000001_create_rmm_integration_tables.cjs, so the basic alert insert fails at runtime. - The rules engine
(
ee/server/src/lib/integrations/ninjaone/alerts/alertProcessor.ts) reads JSONBconditions/actionscolumns fromrmm_alert_rules. The migration created flattext[]filter columns instead. The processor is also never called from the webhook path; only tests import it. - The ticket creator
(
ee/server/src/lib/integrations/ninjaone/alerts/ticketCreator.ts) works but is reachable only from the manual button inAssetAlertsSection.tsx. - Nothing closes a ticket when an alert resets, nothing resets a NinjaOne alert when a ticket closes, repeat alerts have no dedup, and there is no UI or CRUD for alert rules.
The rmm_alerts / rmm_alert_rules tables are deployed but hold no data worth
preserving, so corrective schema work can be additive without backfill.
Decisions
| Question | Decision |
|---|---|
| Scope | Full pipeline + lifecycle sync + rules CRUD/UI + workflow v2 events + notifications |
| Dedup | One open ticket per (device, condition). Repeats append a comment and bump a counter |
| Auto-close on alert reset | Always comment; close the ticket only if no human has touched it |
| Outbound reset on ticket close | Per-rule flag resetAlertOnTicketClose, default true |
| Provider scope | Provider-generic pipeline; NinjaOne and TacticalRMM both wired |
| Processing model | Synchronous in the webhook request (local DB work only) |
| Migration strategy | One additive corrective migration; never rewrite the deployed one |
| Maintenance windows | Suppress before rule matching; alert stored as suppressed; the poller processes still-active alerts after the window ends |
| Alert polling | Per-integration Temporal schedule (default on, every 15 min) reconciles missed triggers and missed resets through the same pipeline; Huntress incident polling migrated to the same model |
Architecture
New shared module shared/rmm/alerts/, following the
shared/rmm/sharedAssetIngestionService.ts precedent so both ee/server
(NinjaOne) and server (TacticalRMM) can import it:
contracts.tsdefinesNormalizedRmmAlertEvent:{ tenantId, integrationId, provider, kind: 'triggered' | 'reset' | 'acknowledged', externalAlertId, externalDeviceId, activityType, alertClass, sourceType, severity, message, deviceName, externalOrganizationId, occurredAt, raw }— andRmmAlertOutboundAdapterwithresetAlert(externalAlertId), optional per provider.processRmmAlertEvent.tsis the single pipeline entry point.alertRuleEvaluator.ts,alertTicketCreator.ts,alertLifecycle.tshold the logic moved out ofee/server/src/lib/integrations/ninjaone/alerts/and made provider-agnostic.
Webhook routes keep their existing auth, tenant resolution, and tier gating.
Each route maps its payload to a NormalizedRmmAlertEvent (a thin
mapNinjaOneWebhookToAlertEvent() in ee; a Tactical equivalent in
server/src/app/api/webhooks/tacticalrmm/route.ts, replacing its direct
rmm_alerts writes) and calls the pipeline.
Triggered: upsert rmm_alerts on (tenant, integration_id, external_alert_id) → compute dedup_key (device + condition identity; for
NinjaOne, statusCode falling back to activityType) → maintenance-window
check (a match stores the alert as suppressed and stops) → evaluate rules
(first match by priority_order; a rule with no conditions is a catch-all) →
dedup check → create a ticket or append an occurrence comment to the existing
open ticket → publish events. The matched rule's ID is stored on the alert row
so later lifecycle steps do not re-evaluate rules.
Reset: mark the alert resolved. If a ticket is linked and the matched rule
has autoResolveTicket: always add a comment; close the ticket only if it is
untouched. Publish RMM_ALERT_RESOLVED.
Outbound (ticket close → RMM): an event-bus subscriber on ticket-closed
events looks up unresolved rmm_alerts by ticket_id, checks the matched
rule's resetAlertOnTicketClose, and calls the provider's outbound adapter
(NinjaOneClient.resetAlert() at
ee/server/src/lib/integrations/ninjaone/ninjaOneClient.ts; Tactical's
resolve-alert API if supported, otherwise the step is skipped). This is the
only external call in the design and it already runs async on the bus.
The org-mapping flag rmm_organization_mappings.auto_create_tickets is
deprecated. Rules, with their organization filter, are the single source of
truth for what creates tickets.
Idempotency: NinjaOne retries on 5xx. The external-ID upsert plus the dedup check make a replayed webhook a no-op, so at-least-once delivery is safe.
Schema changes
One additive migration.
rmm_alerts — add:
activity_typevarchar(100),acknowledged_attimestamptz,acknowledged_byuuid (columns the code already writes)dedup_keyvarchar(255), with an index on(tenant, integration_id, dedup_key)occurrence_countint default 1,last_occurrence_attimestamptzmatched_rule_iduuid nullauto_ticket_createdboolean default falsesuppressed_by_window_iduuid null (alertstatusgains asuppressedvalue)
New table rmm_maintenance_windows: tenant, window_id, optional scoping
columns integration_id, client_id, asset_id (null = applies to all of
that dimension), name, is_active, starts_at/ends_at for one-off
windows, and a recurrence jsonb
({ type: 'weekly', days, startTime, endTime, timezone }) for recurring ones.
The raw payload standardizes on the existing metadata jsonb column. Code that
writes source_data changes to metadata.
rmm_alert_rules — add conditions jsonb and actions jsonb; drop the eleven
flat columns (severity_filter, source_type_filter, alert_class_filter,
organization_filter, message_pattern, create_ticket,
ticket_channel_id, ticket_priority, assigned_user_id, ticket_template,
auto_resolve_ticket). name, description, is_active, and
priority_order stay as real columns.
conditions shape
All fields optional; every present field must match; an empty object matches every alert.
{
severities?: string[],
activityTypes?: string[],
alertClasses?: string[],
sourceTypes?: string[],
organizationIds?: string[], // external org IDs
messagePattern?: string, // regex, validated at save time
keywords?: string[] // substring match on message
}
actions shape
{
createTicket: boolean,
boardId?: string,
priorityOverride?: string,
assignToUserId?: string,
ticketTemplate?: { titleTemplate?: string, descriptionTemplate?: string },
autoResolveTicket: boolean,
autoResolveStatusId?: string, // fallback: tenant's first is_closed status
resetAlertOnTicketClose: boolean, // default true
notifyUserIds?: string[]
}
Ticket templates support {{device}}, {{message}}, {{severity}}, and
{{organization}} placeholders. Zod schemas validate both shapes at the
server-action boundary.
Lifecycle semantics
Dedup. On a triggered event whose matched rule creates tickets, look for an
alert row with the same dedup_key whose linked ticket is still open. If
found: point the new alert row at that ticket, increment occurrence_count,
and add an internal comment ("Alert re-triggered — Nth occurrence"). If not:
create a ticket. Dedup is always on; it is not per-rule configurable.
Untouched. A ticket is untouched when it has no human-authored comments, no time entries, and no manual status change since creation. Rule-driven auto-assignment does not count as touched.
Maintenance windows
The pipeline checks windows before rule matching. An alert is suppressed when
an active window matches all of its non-null scopes (integration, client,
asset) at the alert's occurredAt — one-off windows by starts_at/ends_at,
weekly recurring windows by day and time range in the window's timezone.
A suppressed alert is stored with status = 'suppressed' and
suppressed_by_window_id. It creates no ticket, sends no notifications, and
publishes no workflow events. A reset arriving for a suppressed alert resolves
it quietly. When a window ends, the reconciliation poller processes
still-active suppressed alerts through the normal rules path, so a condition
that fired during maintenance and is still firing afterward becomes a ticket.
Windows have their own CRUD server actions (admin-gated, Zod-validated) and a "Maintenance Windows" subsection beside Alert Rules in RMM settings: a list plus an editor with client/asset scope pickers and a one-off or weekly recurring schedule.
Alert polling (reconciliation)
Polling rides Alga's job-runner abstraction (IJobRunner,
JobRunnerFactory): the same handlers
(server/src/lib/jobs/handlers/rmmAlertPollingHandlers.ts, registered in the
central JobHandlerRegistry for the server and in
initializeJobHandlersForWorker for the temporal worker) run on whichever
backend the edition selects — Temporal Schedules in EE, pg-boss cron in
CE — so TacticalRMM keeps polling in community deployments. Each
integration gets one recurring job keyed by singletonKey
<job>:<tenant>:<integration>: rmm-alert-reconciliation (interval from
settings.alertPolling, 5–60 min, default 15, on by default) and
huntress-incident-poll (from pollIntervalMinutes, default 5), which
replaced Huntress's pg-boss dispatcher.
A small control loop, reconcileRmmPollingSchedules(), converges the
recurring jobs onto rmm_integrations state — creating jobs for new
connections, recreating them when intervals change, cancelling them on
disconnect or polling-disable. It runs every 5 minutes from initializeApp,
once at server boot, and immediately from the NinjaOne connect/disconnect
flows, so settings changes take effect without operator intervention.
Handlers also re-check integration eligibility per run, making any schedule
that briefly outlives its integration a harmless no-op.
TemporalJobRunner.scheduleRecurringJob was upgraded to update an existing
schedule's spec instead of silently keeping the old one. Each cycle works
through the same normalized pipeline:
- Fetch active alerts from the RMM (
NinjaOneClient.getAlerts(); Tactical's alerts API) and upsert any the webhooks missed astriggeredevents, so rules, dedup, and tickets apply identically. - Synthesize
resetevents for local active alerts no longer active in the RMM, catching missed reset webhooks (the main source of stale tickets). - Process expired-window suppressed alerts that are still active.
Webhooks remain the primary low-latency path; polling is the backstop. Ingest idempotency makes the overlap harmless.
One id-space caveat: NinjaOne webhooks carry activity ids while its alerts API returns condition uids, so stale-alert detection only trusts poller-ingested ids (marked in alert metadata) and never closes a webhook-created alert on the poller's say-so. Per-device+condition dedup absorbs cross-source duplicates. NinjaOne and TacticalRMM ship fetchers (Tactical's reuses the alerts endpoint its manual backfill verified); Level has no list-alerts surface and stays webhook-only.
Rules CRUD and UI
Server actions in packages/integrations (alongside tacticalRmmActions.ts):
list, create, update, delete, reorder. Admin-gated, Zod-validated.
UI: an "Alert Rules" section inside the RMM integration settings, rendered
per-integration next to the existing org-mapping manager, shared by NinjaOne
and Tactical. A priority-ordered rules list with active toggles and reorder
controls. The rule editor dialog mirrors the two JSONB shapes: a Match group
(severity multi-select, activity types, alert classes, org picker fed from
rmm_organization_mappings, keywords, regex with save-time validation) and an
Actions group (create-ticket toggle, board picker, priority override, assignee,
title/description templates with placeholder hints, auto-resolve toggle,
reset-on-close toggle, notify-users picker).
The per-asset AssetAlertsSection remains the alert-viewing surface. Tickets
remain the primary work queue.
Workflow v2 events and notifications
Register RMM_ALERT_TRIGGERED and RMM_ALERT_RESOLVED as native workflow v2
catalog events with a provider-generic payload (alert ID, external IDs,
severity, asset/client IDs, ticket ID if any). The pipeline publishes them,
replacing today's orphaned legacy-bus publishes. Add one generic workflow
action, rmm.alerts.create_ticket, which invokes the shared ticket creator by
alert ID.
Notifications ride the existing notification infrastructure: a new rmm-alert
category honoring per-user preferences (in-app + email), fired when a matched
rule has notifyUserIds.
Error handling
- A rule that fails to evaluate (for example, a bad regex) is logged and skipped. It never aborts the pipeline.
- Webhook responses are unchanged: 200 for unmapped orgs, 200 on success, 500 on unexpected errors so the RMM retries. Ingest is idempotent, so retries are safe.
- Outbound reset failures log, stamp the alert's
metadata, and never block the ticket close. - Two hardening items in adjacent code: implement the CSRF validation TODO in
the NinjaOne OAuth callback (the state payload already carries
csrfandtimestamp), and remove the supersededresetInNinjaOneTODO inalertProcessor.ts'sresolveAlert().
Testing
80/20 split. An automated core targets the logic where permutations hide bugs
and regressions are expensive: the rule-evaluation matrix, dedup, the
untouched-ticket check, window matching (timezones, midnight-crossing
recurrence, scope combinations), idempotency (webhook replay, webhook+poller
overlap), lifecycle in both directions, tenant isolation and admin gating on
CRUD, and one end-to-end webhook→ticket integration test per direction. The
existing alertProcessor tests move to the new shared module.
Everything UI-, infra-, or delivery-shaped is a manual smoke pass instead:
settings screens, live NinjaOne/Tactical round-trips, Temporal schedule
lifecycle on connect/disconnect, email delivery, and migrations against a real
stack. The risk-driven checklist lives at
docs/plans/2026-06-12-rmm-alert-handling/SMOKE_TESTS.md.