PSA/docs/plans/2026-06-12-rmm-alert-handling-design.md
Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

14 KiB
Raw Blame History

Design: RMM alert handling and ticket lifecycle sync

  • Status: Approved
  • Created: 2026-06-12
  • Branch: feature/rmm-alerts-sync

Problem

RMM alert ingestion is scaffolded but broken and disconnected:

  • The NinjaOne webhook (ee/server/src/app/api/webhooks/ninjaone/route.ts) inserts activity_type and source_data into rmm_alerts (webhookHandler.ts:671-674). Neither column exists. The only migration that creates these tables is server/migrations/20251124000001_create_rmm_integration_tables.cjs, so the basic alert insert fails at runtime.
  • The rules engine (ee/server/src/lib/integrations/ninjaone/alerts/alertProcessor.ts) reads JSONB conditions / actions columns from rmm_alert_rules. The migration created flat text[] filter columns instead. The processor is also never called from the webhook path; only tests import it.
  • The ticket creator (ee/server/src/lib/integrations/ninjaone/alerts/ticketCreator.ts) works but is reachable only from the manual button in AssetAlertsSection.tsx.
  • Nothing closes a ticket when an alert resets, nothing resets a NinjaOne alert when a ticket closes, repeat alerts have no dedup, and there is no UI or CRUD for alert rules.

The rmm_alerts / rmm_alert_rules tables are deployed but hold no data worth preserving, so corrective schema work can be additive without backfill.

Decisions

Question Decision
Scope Full pipeline + lifecycle sync + rules CRUD/UI + workflow v2 events + notifications
Dedup One open ticket per (device, condition). Repeats append a comment and bump a counter
Auto-close on alert reset Always comment; close the ticket only if no human has touched it
Outbound reset on ticket close Per-rule flag resetAlertOnTicketClose, default true
Provider scope Provider-generic pipeline; NinjaOne and TacticalRMM both wired
Processing model Synchronous in the webhook request (local DB work only)
Migration strategy One additive corrective migration; never rewrite the deployed one
Maintenance windows Suppress before rule matching; alert stored as suppressed; the poller processes still-active alerts after the window ends
Alert polling Per-integration Temporal schedule (default on, every 15 min) reconciles missed triggers and missed resets through the same pipeline; Huntress incident polling migrated to the same model

Architecture

New shared module shared/rmm/alerts/, following the shared/rmm/sharedAssetIngestionService.ts precedent so both ee/server (NinjaOne) and server (TacticalRMM) can import it:

  • contracts.ts defines NormalizedRmmAlertEvent: { tenantId, integrationId, provider, kind: 'triggered' | 'reset' | 'acknowledged', externalAlertId, externalDeviceId, activityType, alertClass, sourceType, severity, message, deviceName, externalOrganizationId, occurredAt, raw } — and RmmAlertOutboundAdapter with resetAlert(externalAlertId), optional per provider.
  • processRmmAlertEvent.ts is the single pipeline entry point.
  • alertRuleEvaluator.ts, alertTicketCreator.ts, alertLifecycle.ts hold the logic moved out of ee/server/src/lib/integrations/ninjaone/alerts/ and made provider-agnostic.

Webhook routes keep their existing auth, tenant resolution, and tier gating. Each route maps its payload to a NormalizedRmmAlertEvent (a thin mapNinjaOneWebhookToAlertEvent() in ee; a Tactical equivalent in server/src/app/api/webhooks/tacticalrmm/route.ts, replacing its direct rmm_alerts writes) and calls the pipeline.

Triggered: upsert rmm_alerts on (tenant, integration_id, external_alert_id) → compute dedup_key (device + condition identity; for NinjaOne, statusCode falling back to activityType) → maintenance-window check (a match stores the alert as suppressed and stops) → evaluate rules (first match by priority_order; a rule with no conditions is a catch-all) → dedup check → create a ticket or append an occurrence comment to the existing open ticket → publish events. The matched rule's ID is stored on the alert row so later lifecycle steps do not re-evaluate rules.

Reset: mark the alert resolved. If a ticket is linked and the matched rule has autoResolveTicket: always add a comment; close the ticket only if it is untouched. Publish RMM_ALERT_RESOLVED.

Outbound (ticket close → RMM): an event-bus subscriber on ticket-closed events looks up unresolved rmm_alerts by ticket_id, checks the matched rule's resetAlertOnTicketClose, and calls the provider's outbound adapter (NinjaOneClient.resetAlert() at ee/server/src/lib/integrations/ninjaone/ninjaOneClient.ts; Tactical's resolve-alert API if supported, otherwise the step is skipped). This is the only external call in the design and it already runs async on the bus.

The org-mapping flag rmm_organization_mappings.auto_create_tickets is deprecated. Rules, with their organization filter, are the single source of truth for what creates tickets.

Idempotency: NinjaOne retries on 5xx. The external-ID upsert plus the dedup check make a replayed webhook a no-op, so at-least-once delivery is safe.

Schema changes

One additive migration.

rmm_alerts — add:

  • activity_type varchar(100), acknowledged_at timestamptz, acknowledged_by uuid (columns the code already writes)
  • dedup_key varchar(255), with an index on (tenant, integration_id, dedup_key)
  • occurrence_count int default 1, last_occurrence_at timestamptz
  • matched_rule_id uuid null
  • auto_ticket_created boolean default false
  • suppressed_by_window_id uuid null (alert status gains a suppressed value)

New table rmm_maintenance_windows: tenant, window_id, optional scoping columns integration_id, client_id, asset_id (null = applies to all of that dimension), name, is_active, starts_at/ends_at for one-off windows, and a recurrence jsonb ({ type: 'weekly', days, startTime, endTime, timezone }) for recurring ones.

The raw payload standardizes on the existing metadata jsonb column. Code that writes source_data changes to metadata.

rmm_alert_rules — add conditions jsonb and actions jsonb; drop the eleven flat columns (severity_filter, source_type_filter, alert_class_filter, organization_filter, message_pattern, create_ticket, ticket_channel_id, ticket_priority, assigned_user_id, ticket_template, auto_resolve_ticket). name, description, is_active, and priority_order stay as real columns.

conditions shape

All fields optional; every present field must match; an empty object matches every alert.

{
  severities?: string[],
  activityTypes?: string[],
  alertClasses?: string[],
  sourceTypes?: string[],
  organizationIds?: string[],   // external org IDs
  messagePattern?: string,      // regex, validated at save time
  keywords?: string[]           // substring match on message
}

actions shape

{
  createTicket: boolean,
  boardId?: string,
  priorityOverride?: string,
  assignToUserId?: string,
  ticketTemplate?: { titleTemplate?: string, descriptionTemplate?: string },
  autoResolveTicket: boolean,
  autoResolveStatusId?: string,      // fallback: tenant's first is_closed status
  resetAlertOnTicketClose: boolean,  // default true
  notifyUserIds?: string[]
}

Ticket templates support {{device}}, {{message}}, {{severity}}, and {{organization}} placeholders. Zod schemas validate both shapes at the server-action boundary.

Lifecycle semantics

Dedup. On a triggered event whose matched rule creates tickets, look for an alert row with the same dedup_key whose linked ticket is still open. If found: point the new alert row at that ticket, increment occurrence_count, and add an internal comment ("Alert re-triggered — Nth occurrence"). If not: create a ticket. Dedup is always on; it is not per-rule configurable.

Untouched. A ticket is untouched when it has no human-authored comments, no time entries, and no manual status change since creation. Rule-driven auto-assignment does not count as touched.

Maintenance windows

The pipeline checks windows before rule matching. An alert is suppressed when an active window matches all of its non-null scopes (integration, client, asset) at the alert's occurredAt — one-off windows by starts_at/ends_at, weekly recurring windows by day and time range in the window's timezone.

A suppressed alert is stored with status = 'suppressed' and suppressed_by_window_id. It creates no ticket, sends no notifications, and publishes no workflow events. A reset arriving for a suppressed alert resolves it quietly. When a window ends, the reconciliation poller processes still-active suppressed alerts through the normal rules path, so a condition that fired during maintenance and is still firing afterward becomes a ticket.

Windows have their own CRUD server actions (admin-gated, Zod-validated) and a "Maintenance Windows" subsection beside Alert Rules in RMM settings: a list plus an editor with client/asset scope pickers and a one-off or weekly recurring schedule.

Alert polling (reconciliation)

Polling rides Alga's job-runner abstraction (IJobRunner, JobRunnerFactory): the same handlers (server/src/lib/jobs/handlers/rmmAlertPollingHandlers.ts, registered in the central JobHandlerRegistry for the server and in initializeJobHandlersForWorker for the temporal worker) run on whichever backend the edition selects — Temporal Schedules in EE, pg-boss cron in CE — so TacticalRMM keeps polling in community deployments. Each integration gets one recurring job keyed by singletonKey <job>:<tenant>:<integration>: rmm-alert-reconciliation (interval from settings.alertPolling, 560 min, default 15, on by default) and huntress-incident-poll (from pollIntervalMinutes, default 5), which replaced Huntress's pg-boss dispatcher.

A small control loop, reconcileRmmPollingSchedules(), converges the recurring jobs onto rmm_integrations state — creating jobs for new connections, recreating them when intervals change, cancelling them on disconnect or polling-disable. It runs every 5 minutes from initializeApp, once at server boot, and immediately from the NinjaOne connect/disconnect flows, so settings changes take effect without operator intervention. Handlers also re-check integration eligibility per run, making any schedule that briefly outlives its integration a harmless no-op. TemporalJobRunner.scheduleRecurringJob was upgraded to update an existing schedule's spec instead of silently keeping the old one. Each cycle works through the same normalized pipeline:

  1. Fetch active alerts from the RMM (NinjaOneClient.getAlerts(); Tactical's alerts API) and upsert any the webhooks missed as triggered events, so rules, dedup, and tickets apply identically.
  2. Synthesize reset events for local active alerts no longer active in the RMM, catching missed reset webhooks (the main source of stale tickets).
  3. Process expired-window suppressed alerts that are still active.

Webhooks remain the primary low-latency path; polling is the backstop. Ingest idempotency makes the overlap harmless.

One id-space caveat: NinjaOne webhooks carry activity ids while its alerts API returns condition uids, so stale-alert detection only trusts poller-ingested ids (marked in alert metadata) and never closes a webhook-created alert on the poller's say-so. Per-device+condition dedup absorbs cross-source duplicates. NinjaOne and TacticalRMM ship fetchers (Tactical's reuses the alerts endpoint its manual backfill verified); Level has no list-alerts surface and stays webhook-only.

Rules CRUD and UI

Server actions in packages/integrations (alongside tacticalRmmActions.ts): list, create, update, delete, reorder. Admin-gated, Zod-validated.

UI: an "Alert Rules" section inside the RMM integration settings, rendered per-integration next to the existing org-mapping manager, shared by NinjaOne and Tactical. A priority-ordered rules list with active toggles and reorder controls. The rule editor dialog mirrors the two JSONB shapes: a Match group (severity multi-select, activity types, alert classes, org picker fed from rmm_organization_mappings, keywords, regex with save-time validation) and an Actions group (create-ticket toggle, board picker, priority override, assignee, title/description templates with placeholder hints, auto-resolve toggle, reset-on-close toggle, notify-users picker).

The per-asset AssetAlertsSection remains the alert-viewing surface. Tickets remain the primary work queue.

Workflow v2 events and notifications

Register RMM_ALERT_TRIGGERED and RMM_ALERT_RESOLVED as native workflow v2 catalog events with a provider-generic payload (alert ID, external IDs, severity, asset/client IDs, ticket ID if any). The pipeline publishes them, replacing today's orphaned legacy-bus publishes. Add one generic workflow action, rmm.alerts.create_ticket, which invokes the shared ticket creator by alert ID.

Notifications ride the existing notification infrastructure: a new rmm-alert category honoring per-user preferences (in-app + email), fired when a matched rule has notifyUserIds.

Error handling

  • A rule that fails to evaluate (for example, a bad regex) is logged and skipped. It never aborts the pipeline.
  • Webhook responses are unchanged: 200 for unmapped orgs, 200 on success, 500 on unexpected errors so the RMM retries. Ingest is idempotent, so retries are safe.
  • Outbound reset failures log, stamp the alert's metadata, and never block the ticket close.
  • Two hardening items in adjacent code: implement the CSRF validation TODO in the NinjaOne OAuth callback (the state payload already carries csrf and timestamp), and remove the superseded resetInNinjaOne TODO in alertProcessor.ts's resolveAlert().

Testing

80/20 split. An automated core targets the logic where permutations hide bugs and regressions are expensive: the rule-evaluation matrix, dedup, the untouched-ticket check, window matching (timezones, midnight-crossing recurrence, scope combinations), idempotency (webhook replay, webhook+poller overlap), lifecycle in both directions, tenant isolation and admin gating on CRUD, and one end-to-end webhook→ticket integration test per direction. The existing alertProcessor tests move to the new shared module.

Everything UI-, infra-, or delivery-shaped is a manual smoke pass instead: settings screens, live NinjaOne/Tactical round-trips, Temporal schedule lifecycle on connect/disconnect, email delivery, and migrations against a real stack. The risk-driven checklist lives at docs/plans/2026-06-12-rmm-alert-handling/SMOKE_TESTS.md.