# Design: RMM alert handling and ticket lifecycle sync

- **Status:** Approved
- **Created:** 2026-06-12
- **Branch:** `feature/rmm-alerts-sync`

## Problem

RMM alert ingestion is scaffolded but broken and disconnected:

- The NinjaOne webhook (`ee/server/src/app/api/webhooks/ninjaone/route.ts`) inserts `activity_type` and `source_data` into `rmm_alerts` (`webhookHandler.ts:671-674`). Neither column exists. The only migration that creates these tables is `server/migrations/20251124000001_create_rmm_integration_tables.cjs`, so the basic alert insert fails at runtime.
- The rules engine (`ee/server/src/lib/integrations/ninjaone/alerts/alertProcessor.ts`) reads JSONB `conditions` / `actions` columns from `rmm_alert_rules`. The migration created flat `text[]` filter columns instead. The processor is also never called from the webhook path; only tests import it.
- The ticket creator (`ee/server/src/lib/integrations/ninjaone/alerts/ticketCreator.ts`) works but is reachable only from the manual button in `AssetAlertsSection.tsx`.
- Nothing closes a ticket when an alert resets, nothing resets a NinjaOne alert when a ticket closes, repeat alerts have no dedup, and there is no UI or CRUD for alert rules.

The `rmm_alerts` / `rmm_alert_rules` tables are deployed but hold no data worth preserving, so corrective schema work can be additive without backfill.

## Decisions

| Question | Decision |
| --- | --- |
| Scope | Full pipeline + lifecycle sync + rules CRUD/UI + workflow v2 events + notifications |
| Dedup | One open ticket per (device, condition). Repeats append a comment and bump a counter |
| Auto-close on alert reset | Always comment; close the ticket only if no human has touched it |
| Outbound reset on ticket close | Per-rule flag `resetAlertOnTicketClose`, default true |
| Provider scope | Provider-generic pipeline; NinjaOne and TacticalRMM both wired |
| Processing model | Synchronous in the webhook request (local DB work only) |
| Migration strategy | One additive corrective migration; never rewrite the deployed one |
| Maintenance windows | Suppress before rule matching; alert stored as `suppressed`; the poller processes still-active alerts after the window ends |
| Alert polling | Per-integration Temporal schedule (default on, every 15 min) reconciles missed triggers and missed resets through the same pipeline; Huntress incident polling migrated to the same model |

## Architecture

New shared module `shared/rmm/alerts/`, following the `shared/rmm/sharedAssetIngestionService.ts` precedent so both `ee/server` (NinjaOne) and `server` (TacticalRMM) can import it:

- `contracts.ts` defines `NormalizedRmmAlertEvent`: `{ tenantId, integrationId, provider, kind: 'triggered' | 'reset' | 'acknowledged', externalAlertId, externalDeviceId, activityType, alertClass, sourceType, severity, message, deviceName, externalOrganizationId, occurredAt, raw }`, and `RmmAlertOutboundAdapter` with `resetAlert(externalAlertId)`, optional per provider.
- `processRmmAlertEvent.ts` is the single pipeline entry point.
- `alertRuleEvaluator.ts`, `alertTicketCreator.ts`, `alertLifecycle.ts` hold the logic moved out of `ee/server/src/lib/integrations/ninjaone/alerts/` and made provider-agnostic.

Webhook routes keep their existing auth, tenant resolution, and tier gating. Each route maps its payload to a `NormalizedRmmAlertEvent` (a thin `mapNinjaOneWebhookToAlertEvent()` in ee; a Tactical equivalent in `server/src/app/api/webhooks/tacticalrmm/route.ts`, replacing its direct `rmm_alerts` writes) and calls the pipeline.

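
To make the contract above more concrete, the sketch below writes `NormalizedRmmAlertEvent` as a TypeScript interface and maps a webhook payload onto it. The design lists the field names but not their types, so the types here are assumptions. `ExampleWebhookPayload` is a hypothetical stand-in, not the real NinjaOne webhook schema, and `mapExampleWebhookToAlertEvent()` only illustrates what a thin mapper such as `mapNinjaOneWebhookToAlertEvent()` might do.

```typescript
type RmmAlertKind = 'triggered' | 'reset' | 'acknowledged';

// Field names come from the design; the field types are assumed.
interface NormalizedRmmAlertEvent {
  tenantId: string;
  integrationId: string;
  provider: string;
  kind: RmmAlertKind;
  externalAlertId: string;
  externalDeviceId: string;
  activityType: string;
  alertClass: string | null;
  sourceType: string | null;
  severity: string;
  message: string;
  deviceName: string | null;
  externalOrganizationId: string | null;
  occurredAt: Date;
  raw: unknown;
}

// Hypothetical payload used only for this sketch.
interface ExampleWebhookPayload {
  id: number;
  deviceId: number;
  event: 'TRIGGERED' | 'RESET' | 'ACKNOWLEDGED';
  activityType: string;
  severity?: string;
  message?: string;
  organizationId?: number;
  timestampSeconds: number;
}

interface MappingContext {
  tenantId: string;
  integrationId: string;
}

const KIND_BY_EVENT: Record<ExampleWebhookPayload['event'], RmmAlertKind> = {
  TRIGGERED: 'triggered',
  RESET: 'reset',
  ACKNOWLEDGED: 'acknowledged',
};

// The route has already resolved tenant and integration; the mapper only reshapes the payload.
function mapExampleWebhookToAlertEvent(
  ctx: MappingContext,
  payload: ExampleWebhookPayload,
): NormalizedRmmAlertEvent {
  return {
    tenantId: ctx.tenantId,
    integrationId: ctx.integrationId,
    provider: 'ninjaone',
    kind: KIND_BY_EVENT[payload.event],
    externalAlertId: String(payload.id),
    externalDeviceId: String(payload.deviceId),
    activityType: payload.activityType,
    alertClass: null,
    sourceType: null,
    severity: payload.severity ?? 'unknown', // fallback value is an assumption
    message: payload.message ?? '',
    deviceName: null,
    externalOrganizationId:
      payload.organizationId === undefined ? null : String(payload.organizationId),
    occurredAt: new Date(payload.timestampSeconds * 1000),
    raw: payload, // kept so the pipeline can store it in `metadata`
  };
}
```

Keeping the mapper free of database or network calls means each provider's route can be tested with plain payload fixtures, and everything downstream only ever sees the normalized shape.
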

**Triggered:** upsert `rmm_alerts` on `(tenant, integration_id, external_alert_id)` → compute `dedup_key` (device + condition identity; for NinjaOne, `statusCode` falling back to `activityType`) → maintenance-window check (a match stores the alert as `suppressed` and stops) → evaluate rules (first match by `priority_order`; a rule with no conditions is a catch-all) → dedup check → create a ticket or append an occurrence comment to the existing open ticket → publish events. The matched rule's ID is stored on the alert row so later lifecycle steps do not re-evaluate rules.

**Reset:** mark the alert resolved. If a ticket is linked and the matched rule has `autoResolveTicket`: always add a comment; close the ticket only if it is untouched. Publish `RMM_ALERT_RESOLVED`.

**Outbound (ticket close → RMM):** an event-bus subscriber on ticket-closed events looks up unresolved `rmm_alerts` by `ticket_id`, checks the matched rule's `resetAlertOnTicketClose`, and calls the provider's outbound adapter (`NinjaOneClient.resetAlert()` at `ee/server/src/lib/integrations/ninjaone/ninjaOneClient.ts`; Tactical's resolve-alert API if supported, otherwise the step is skipped). This is the only external call in the design and it already runs async on the bus.

The org-mapping flag `rmm_organization_mappings.auto_create_tickets` is deprecated. Rules, with their organization filter, are the single source of truth for what creates tickets.

Idempotency: NinjaOne retries on 5xx. The external-ID upsert plus the dedup check make a replayed webhook a no-op, so at-least-once delivery is safe.

## Schema changes

One additive migration.


`rmm_alerts`, add:

- `activity_type` varchar(100), `acknowledged_at` timestamptz, `acknowledged_by` uuid (columns the code already writes)
- `dedup_key` varchar(255), with an index on `(tenant, integration_id, dedup_key)`
- `occurrence_count` int default 1, `last_occurrence_at` timestamptz
- `matched_rule_id` uuid null
- `auto_ticket_created` boolean default false
- `suppressed_by_window_id` uuid null (alert `status` gains a `suppressed` value)

New table `rmm_maintenance_windows`: `tenant`, `window_id`, optional scoping columns `integration_id`, `client_id`, `asset_id` (null = applies to all of that dimension), `name`, `is_active`, `starts_at`/`ends_at` for one-off windows, and a `recurrence` jsonb (`{ type: 'weekly', days, startTime, endTime, timezone }`) for recurring ones.

The raw payload standardizes on the existing `metadata` jsonb column. Code that writes `source_data` changes to `metadata`.

`rmm_alert_rules`: add `conditions` jsonb and `actions` jsonb; drop the eleven flat columns (`severity_filter`, `source_type_filter`, `alert_class_filter`, `organization_filter`, `message_pattern`, `create_ticket`, `ticket_channel_id`, `ticket_priority`, `assigned_user_id`, `ticket_template`, `auto_resolve_ticket`). `name`, `description`, `is_active`, and `priority_order` stay as real columns.

### `conditions` shape

All fields optional; every present field must match; an empty object matches every alert.


```ts
{
  severities?: string[],
  activityTypes?: string[],
  alertClasses?: string[],
  sourceTypes?: string[],
  organizationIds?: string[],   // external org IDs
  messagePattern?: string,      // regex, validated at save time
  keywords?: string[]           // substring match on message
}
```

### `actions` shape

```ts
{
  createTicket: boolean,
  boardId?: string,
  priorityOverride?: string,
  assignToUserId?: string,
  ticketTemplate?: { titleTemplate?: string, descriptionTemplate?: string },
  autoResolveTicket: boolean,
  autoResolveStatusId?: string,      // fallback: tenant's first is_closed status
  resetAlertOnTicketClose: boolean,  // default true
  notifyUserIds?: string[]
}
```

Ticket templates support `{{device}}`, `{{message}}`, `{{severity}}`, and `{{organization}}` placeholders. Zod schemas validate both shapes at the server-action boundary.

## Lifecycle semantics

**Dedup.** On a triggered event whose matched rule creates tickets, look for an alert row with the same `dedup_key` whose linked ticket is still open. If found: point the new alert row at that ticket, increment `occurrence_count`, and add an internal comment ("Alert re-triggered — Nth occurrence"). If not: create a ticket. Dedup is always on; it is not per-rule configurable.

**Untouched.** A ticket is untouched when it has no human-authored comments, no time entries, and no manual status change since creation. Rule-driven auto-assignment does not count as touched.

## Maintenance windows

The pipeline checks windows before rule matching. An alert is suppressed when an active window matches all of its non-null scopes (integration, client, asset) at the alert's `occurredAt`: one-off windows by `starts_at`/`ends_at`, weekly recurring windows by day and time range in the window's timezone.

A suppressed alert is stored with `status = 'suppressed'` and `suppressed_by_window_id`. It creates no ticket, sends no notifications, and publishes no workflow events. A reset arriving for a suppressed alert resolves it quietly.
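
The window-matching rule described above can be sketched roughly as follows. Several details are assumptions, since the design does not pin them down: `days` uses 0 = Sunday through 6 = Saturday, `startTime`/`endTime` are `'HH:mm'` strings, window ends are exclusive, and a weekly window whose end is not after its start is treated as crossing midnight. The type and function names are illustrative, not the module's actual API.

```typescript
interface WeeklyRecurrence {
  type: 'weekly';
  days: number[];     // assumed: 0 = Sunday ... 6 = Saturday
  startTime: string;  // assumed 'HH:mm'
  endTime: string;    // assumed 'HH:mm'
  timezone: string;   // IANA zone name, e.g. 'America/New_York'
}

interface MaintenanceWindow {
  windowId: string;
  isActive: boolean;
  integrationId: string | null; // null = applies to all of that dimension
  clientId: string | null;
  assetId: string | null;
  startsAt: Date | null;
  endsAt: Date | null;
  recurrence: WeeklyRecurrence | null;
}

interface AlertScope {
  integrationId: string;
  clientId: string | null;
  assetId: string | null;
  occurredAt: Date;
}

const WEEKDAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];

// Weekday index and minutes since midnight, as seen in the window's timezone.
function localDayAndMinutes(at: Date, timeZone: string): { day: number; minutes: number } {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone, weekday: 'short', hour: '2-digit', minute: '2-digit', hour12: false,
  }).formatToParts(at);
  const get = (type: string): string => parts.find((p) => p.type === type)?.value ?? '';
  const hour = Number(get('hour')) % 24; // some engines print midnight as "24"
  return { day: WEEKDAYS.indexOf(get('weekday')), minutes: hour * 60 + Number(get('minute')) };
}

function toMinutes(hhmm: string): number {
  const [h = 0, m = 0] = hhmm.split(':').map(Number);
  return h * 60 + m;
}

function inWeeklyWindow(r: WeeklyRecurrence, at: Date): boolean {
  const { day, minutes } = localDayAndMinutes(at, r.timezone);
  const start = toMinutes(r.startTime);
  const end = toMinutes(r.endTime);
  if (start < end) return r.days.includes(day) && minutes >= start && minutes < end;
  // Midnight-crossing: the evening half belongs to `day`, the morning half to the day before.
  const previousDay = (day + 6) % 7;
  return (r.days.includes(day) && minutes >= start)
    || (r.days.includes(previousDay) && minutes < end);
}

// Every non-null scope on the window must equal the alert's value for that dimension.
function scopeMatches(w: MaintenanceWindow, a: AlertScope): boolean {
  return (w.integrationId === null || w.integrationId === a.integrationId)
    && (w.clientId === null || w.clientId === a.clientId)
    && (w.assetId === null || w.assetId === a.assetId);
}

function findSuppressingWindow(
  windows: MaintenanceWindow[],
  alert: AlertScope,
): MaintenanceWindow | undefined {
  const t = alert.occurredAt.getTime();
  return windows.find((w) => {
    if (!w.isActive || !scopeMatches(w, alert)) return false;
    if (w.recurrence) return inWeeklyWindow(w.recurrence, alert.occurredAt);
    return w.startsAt !== null && w.endsAt !== null
      && t >= w.startsAt.getTime() && t < w.endsAt.getTime();
  });
}
```

The returned window's `windowId` is what would be written to `suppressed_by_window_id`. Converting the alert time into the window's timezone, instead of converting the window into UTC, keeps recurring windows stable across daylight-saving changes.
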

When a window ends, the reconciliation poller processes still-active suppressed alerts through the normal rules path, so a condition that fired during maintenance and is still firing afterward becomes a ticket.

Windows have their own CRUD server actions (admin-gated, Zod-validated) and a "Maintenance Windows" subsection beside Alert Rules in RMM settings: a list plus an editor with client/asset scope pickers and a one-off or weekly recurring schedule.

## Alert polling (reconciliation)

Polling rides Alga's job-runner abstraction (`IJobRunner`, `JobRunnerFactory`): the same handlers (`server/src/lib/jobs/handlers/rmmAlertPollingHandlers.ts`, registered in the central `JobHandlerRegistry` for the server and in `initializeJobHandlersForWorker` for the temporal worker) run on whichever backend the edition selects (**Temporal Schedules in EE, pg-boss cron in CE**), so TacticalRMM keeps polling in community deployments.

Each integration gets one recurring job keyed by singletonKey `::`: `rmm-alert-reconciliation` (interval from `settings.alertPolling`, 5–60 min, default 15, on by default) and `huntress-incident-poll` (from `pollIntervalMinutes`, default 5), which replaced Huntress's pg-boss dispatcher.

A small control loop, `reconcileRmmPollingSchedules()`, converges the recurring jobs onto `rmm_integrations` state: creating jobs for new connections, recreating them when intervals change, cancelling them on disconnect or polling-disable. It runs every 5 minutes from `initializeApp`, once at server boot, and immediately from the NinjaOne connect/disconnect flows, so settings changes take effect without operator intervention. Handlers also re-check integration eligibility per run, making any schedule that briefly outlives its integration a harmless no-op. `TemporalJobRunner.scheduleRecurringJob` was upgraded to update an existing schedule's spec instead of silently keeping the old one.

Each cycle works through the same normalized pipeline:

1. Fetch active alerts from the RMM (`NinjaOneClient.getAlerts()`; Tactical's alerts API) and upsert any the webhooks missed as `triggered` events, so rules, dedup, and tickets apply identically.
2. Synthesize `reset` events for local active alerts no longer active in the RMM, catching missed reset webhooks (the main source of stale tickets).
3. Process expired-window suppressed alerts that are still active.

Webhooks remain the primary low-latency path; polling is the backstop. Ingest idempotency makes the overlap harmless. One id-space caveat: NinjaOne webhooks carry activity ids while its alerts API returns condition uids, so stale-alert detection only trusts poller-ingested ids (marked in alert metadata) and never closes a webhook-created alert on the poller's say-so. Per-device+condition dedup absorbs cross-source duplicates.

NinjaOne and TacticalRMM ship fetchers (Tactical's reuses the alerts endpoint its manual backfill verified); Level has no list-alerts surface and stays webhook-only.

## Rules CRUD and UI

Server actions in `packages/integrations` (alongside `tacticalRmmActions.ts`): list, create, update, delete, reorder. Admin-gated, Zod-validated.

UI: an "Alert Rules" section inside the RMM integration settings, rendered per-integration next to the existing org-mapping manager, shared by NinjaOne and Tactical. A priority-ordered rules list with active toggles and reorder controls. The rule editor dialog mirrors the two JSONB shapes: a Match group (severity multi-select, activity types, alert classes, org picker fed from `rmm_organization_mappings`, keywords, regex with save-time validation) and an Actions group (create-ticket toggle, board picker, priority override, assignee, title/description templates with placeholder hints, auto-resolve toggle, reset-on-close toggle, notify-users picker).

The per-asset `AssetAlertsSection` remains the alert-viewing surface. Tickets remain the primary work queue.

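
The Match group in the rule editor corresponds to the `conditions` shape, and first-match evaluation under its semantics (every present field must match, an empty object is a catch-all) might look like the sketch below. The following are assumptions the design does not settle: a lower `priority_order` is evaluated first, an empty array is treated like an absent field, and keywords match case-insensitively when any one of them appears in the message. The names `AlertFacts`, `validateMessagePattern`, and `findMatchingRule` are illustrative only.

```typescript
interface RuleConditions {
  severities?: string[];
  activityTypes?: string[];
  alertClasses?: string[];
  sourceTypes?: string[];
  organizationIds?: string[];
  messagePattern?: string;
  keywords?: string[];
}

interface AlertRule {
  ruleId: string;
  isActive: boolean;
  priorityOrder: number;
  conditions: RuleConditions;
}

interface AlertFacts {
  severity: string;
  activityType: string;
  alertClass: string;
  sourceType: string;
  externalOrganizationId: string;
  message: string;
}

// Save-time validation: returns an error message for a pattern that does not compile.
function validateMessagePattern(pattern: string): string | null {
  try {
    new RegExp(pattern);
    return null;
  } catch (err) {
    return err instanceof Error ? err.message : 'invalid pattern';
  }
}

// Assumption: an absent or empty list places no constraint on the field.
function listAllows(allowed: string[] | undefined, value: string): boolean {
  return allowed === undefined || allowed.length === 0 || allowed.includes(value);
}

function conditionsMatch(c: RuleConditions, a: AlertFacts): boolean {
  if (!listAllows(c.severities, a.severity)) return false;
  if (!listAllows(c.activityTypes, a.activityType)) return false;
  if (!listAllows(c.alertClasses, a.alertClass)) return false;
  if (!listAllows(c.sourceTypes, a.sourceType)) return false;
  if (!listAllows(c.organizationIds, a.externalOrganizationId)) return false;
  // Throws on a pattern that slipped past save-time validation; the caller handles it.
  if (c.messagePattern !== undefined && !new RegExp(c.messagePattern).test(a.message)) return false;
  if (c.keywords !== undefined && c.keywords.length > 0) {
    const message = a.message.toLowerCase();
    if (!c.keywords.some((k) => message.includes(k.toLowerCase()))) return false;
  }
  return true;
}

// First active rule that matches, in ascending priorityOrder (assumed direction).
function findMatchingRule(
  rules: AlertRule[],
  alert: AlertFacts,
  logError: (...args: unknown[]) => void = console.error,
): AlertRule | undefined {
  const ordered = rules.filter((r) => r.isActive).sort((x, y) => x.priorityOrder - y.priorityOrder);
  for (const rule of ordered) {
    try {
      if (conditionsMatch(rule.conditions, alert)) return rule;
    } catch (err) {
      // A rule that fails to evaluate is logged and skipped; it never aborts the pipeline.
      logError(`rule ${rule.ruleId} failed to evaluate`, err);
    }
  }
  return undefined;
}
```

Running `validateMessagePattern()` in the create and update actions catches most bad patterns before they are stored, while the `try`/`catch` in `findMatchingRule()` covers rows written by other means.
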

## Workflow v2 events and notifications

Register `RMM_ALERT_TRIGGERED` and `RMM_ALERT_RESOLVED` as native workflow v2 catalog events with a provider-generic payload (alert ID, external IDs, severity, asset/client IDs, ticket ID if any). The pipeline publishes them, replacing today's orphaned legacy-bus publishes. Add one generic workflow action, `rmm.alerts.create_ticket`, which invokes the shared ticket creator by alert ID.

Notifications ride the existing notification infrastructure: a new `rmm-alert` category honoring per-user preferences (in-app + email), fired when a matched rule has `notifyUserIds`.

## Error handling

- A rule that fails to evaluate (for example, a bad regex) is logged and skipped. It never aborts the pipeline.
- Webhook responses are unchanged: 200 for unmapped orgs, 200 on success, 500 on unexpected errors so the RMM retries. Ingest is idempotent, so retries are safe.
- Outbound reset failures log, stamp the alert's `metadata`, and never block the ticket close.
- Two hardening items in adjacent code: implement the CSRF validation TODO in the NinjaOne OAuth callback (the state payload already carries `csrf` and `timestamp`), and remove the superseded `resetInNinjaOne` TODO in `alertProcessor.ts`'s `resolveAlert()`.

## Testing

80/20 split. An automated core targets the logic where permutations hide bugs and regressions are expensive: the rule-evaluation matrix, dedup, the untouched-ticket check, window matching (timezones, midnight-crossing recurrence, scope combinations), idempotency (webhook replay, webhook+poller overlap), lifecycle in both directions, tenant isolation and admin gating on CRUD, and one end-to-end webhook→ticket integration test per direction. The existing `alertProcessor` tests move to the new shared module.

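
As one illustration of that automated core, the untouched-ticket check and the reset decision can be reduced to pure functions and exercised with a small table of cases. The sketch below is framework-free and all of its names are hypothetical: it assumes the lifecycle code first gathers a `TicketActivity` summary (human-authored comments, time entries, manual status changes since creation) and then decides what a reset should do.

```typescript
// Hypothetical summary of ticket activity, gathered before the decision is made.
interface TicketActivity {
  humanCommentCount: number;
  timeEntryCount: number;
  manualStatusChangeCount: number;
}

interface ResetDecision {
  addComment: boolean;
  closeTicket: boolean;
}

// Untouched: no human-authored comments, no time entries, no manual status change.
// Rule-driven auto-assignment is not tracked here, so it never counts as a touch.
function isTicketUntouched(activity: TicketActivity): boolean {
  return activity.humanCommentCount === 0
    && activity.timeEntryCount === 0
    && activity.manualStatusChangeCount === 0;
}

// Reset semantics: with a linked ticket and autoResolveTicket, always comment,
// and close only when the ticket is untouched.
function decideOnReset(
  hasLinkedTicket: boolean,
  autoResolveTicket: boolean,
  activity: TicketActivity,
): ResetDecision {
  if (!hasLinkedTicket || !autoResolveTicket) return { addComment: false, closeTicket: false };
  return { addComment: true, closeTicket: isTicketUntouched(activity) };
}

const UNTOUCHED: TicketActivity = { humanCommentCount: 0, timeEntryCount: 0, manualStatusChangeCount: 0 };

interface ResetCase {
  name: string;
  linked: boolean;
  autoResolve: boolean;
  activity: TicketActivity;
  expected: ResetDecision;
}

const RESET_CASES: ResetCase[] = [
  { name: 'untouched ticket closes', linked: true, autoResolve: true, activity: UNTOUCHED,
    expected: { addComment: true, closeTicket: true } },
  { name: 'human comment keeps it open', linked: true, autoResolve: true,
    activity: { ...UNTOUCHED, humanCommentCount: 1 },
    expected: { addComment: true, closeTicket: false } },
  { name: 'time entry keeps it open', linked: true, autoResolve: true,
    activity: { ...UNTOUCHED, timeEntryCount: 2 },
    expected: { addComment: true, closeTicket: false } },
  { name: 'manual status change keeps it open', linked: true, autoResolve: true,
    activity: { ...UNTOUCHED, manualStatusChangeCount: 1 },
    expected: { addComment: true, closeTicket: false } },
  { name: 'rule without autoResolveTicket does nothing', linked: true, autoResolve: false,
    activity: UNTOUCHED, expected: { addComment: false, closeTicket: false } },
  { name: 'no linked ticket does nothing', linked: false, autoResolve: true,
    activity: UNTOUCHED, expected: { addComment: false, closeTicket: false } },
];

// Returns the names of failing cases so any test runner can assert on an empty list.
function runResetCases(cases: ResetCase[]): string[] {
  return cases
    .filter((c) => {
      const actual = decideOnReset(c.linked, c.autoResolve, c.activity);
      return actual.addComment !== c.expected.addComment
        || actual.closeTicket !== c.expected.closeTicket;
    })
    .map((c) => c.name);
}
```

Keeping the decision separate from the database reads makes each permutation a single table row, which is where this kind of lifecycle logic tends to regress.
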
Everything UI-, infra-, or delivery-shaped is a manual smoke pass instead: settings screens, live NinjaOne/Tactical round-trips, Temporal schedule lifecycle on connect/disconnect, email delivery, and migrations against a real stack. The risk-driven checklist lives at `docs/plans/2026-06-12-rmm-alert-handling/SMOKE_TESTS.md`.