Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

12 KiB
Raw Permalink Blame History

PRD — RMM Alert Handling

  • Slug: 2026-06-12-rmm-alert-handling
  • Date: 2026-06-12
  • Status: Approved
  • Branch: feature/rmm-alerts-sync
  • Design doc: docs/plans/2026-06-12-rmm-alert-handling-design.md

Summary

Turn RMM alerts into tickets automatically. A provider-generic pipeline in shared/rmm/alerts/ ingests normalized alert events from the NinjaOne and TacticalRMM webhooks, evaluates tenant-defined rules, creates or updates tickets with dedup, and keeps alert and ticket lifecycles in sync in both directions. Tenants manage rules from the RMM integration settings UI.

Problem

Alert ingestion is scaffolded but broken on main: the webhook writes columns that don't exist in rmm_alerts, the rules engine expects a JSONB schema the migration never created, and nothing connects alerts to tickets except a manual button. There is no dedup, no auto-close, no outbound reset, and no rules UI. Competing PSAs (ConnectWise, Autotask, Halo) treat all of this as table stakes.

Goals

  • Webhook-delivered alerts create tickets automatically per tenant-defined rules.
  • Repeat firings of the same condition on the same device land on the existing open ticket instead of creating ticket storms.
  • Alert resets close untouched tickets and annotate touched ones.
  • Closing an alert-linked ticket resets the alert in the RMM (per-rule opt-out).
  • Rules are manageable from the integration settings UI by admins.
  • Alert events are first-class workflow v2 triggers; matched rules can notify users.
  • Maintenance windows suppress alert ticketing for a client, asset, or integration during planned work, without losing the alerts.
  • Scheduled polling reconciles missed webhooks: missed triggers become tickets, missed resets close stale tickets, and post-window still-active alerts get processed.
  • One pipeline serves NinjaOne and TacticalRMM; a third provider only needs a normalizer and an optional outbound adapter.

Non-goals

  • RMM device-count billing integration, scheduled device sync, or org auto-matching.
  • Per-rule dedup configuration (dedup behavior is fixed).
  • Migrating NinjaOne device sync onto sharedAssetIngestionService (separate effort).

Users and Primary Flows

  • MSP admin configures alert rules per RMM integration: match conditions, ticket routing, lifecycle flags, notifications.
  • Dispatcher/tech works alert tickets like any other ticket: sees occurrence comments on flapping conditions, sees resolution comments when alerts reset, and closing a ticket clears the alert in the RMM.
  • Automation builder uses RMM_ALERT_TRIGGERED/RMM_ALERT_RESOLVED workflow triggers and the rmm.alerts.create_ticket action for custom flows.

UX / UI Notes

  • New "Alert Rules" section in RMM integration settings (next to the org-mapping manager), rendered for NinjaOne and TacticalRMM.
  • Priority-ordered rules list: active toggle, reorder controls, edit/delete.
  • Rule editor dialog with a Match group (severities, activity types, alert classes, source types, organization picker fed from rmm_organization_mappings, keywords, message regex) and an Actions group (create-ticket toggle, board picker, priority override, assignee, title/description templates with placeholder hints, auto-resolve toggle, reset-on-close toggle, notify-users picker).
  • Save-time validation errors (e.g., bad regex) shown inline in the dialog.
  • "Maintenance Windows" subsection beside Alert Rules: a list plus an editor with client/asset scope pickers and a one-off or weekly recurring schedule (with timezone).
  • Alert polling enable/disable and interval (560 minutes) in integration settings.
  • Existing per-asset AssetAlertsSection remains the alert-viewing surface.

Requirements

Functional Requirements

FR-1 Schema

One additive migration. rmm_alerts gains activity_type, acknowledged_at, acknowledged_by, dedup_key (indexed with tenant + integration), occurrence_count (default 1), last_occurrence_at, matched_rule_id, auto_ticket_created, and suppressed_by_window_id (status gains a suppressed value). Raw payloads standardize on the existing metadata jsonb. rmm_alert_rules gains conditions and actions jsonb and drops the eleven flat filter/action columns. New rmm_maintenance_windows table (FR-10). The deployed 20251124000001 migration is not rewritten; no backfill.

FR-2 Contracts and normalizers

shared/rmm/alerts/contracts.ts defines NormalizedRmmAlertEvent (kind: triggered | reset | acknowledged) and the optional per-provider RmmAlertOutboundAdapter (resetAlert). Shared Zod schemas define the rule conditions/actions shapes (see design doc for exact fields). NinjaOne and TacticalRMM webhook routes map their payloads to the contract; existing webhook auth, tenant resolution, and tier gating are unchanged.

FR-3 Ingest pipeline

processRmmAlertEvent() runs synchronously in the webhook request. Triggered: upsert rmm_alerts on (tenant, integration_id, external_alert_id), compute and store dedup_key (device + condition identity; NinjaOne: statusCode falling back to activityType), evaluate active rules first-match by priority_order (empty conditions = catch-all; a rule that fails to evaluate is logged and skipped), store matched_rule_id, then act. Replayed webhooks are no-ops (idempotent ingest).

FR-4 Ticketing and dedup

If the matched rule creates tickets: an alert whose dedup_key matches an alert with a still-open linked ticket joins that ticket (link, increment occurrence_count, internal "re-triggered — Nth occurrence" comment). Otherwise create a ticket honoring boardId, priorityOverride (else severity→priority mapping), assignToUserId, and the title/description templates with {{device}}/{{message}}/{{severity}}/{{organization}} placeholders. Created tickets get source + source_reference, an asset association, an initial internal comment, and client resolution from the asset or the org mapping.

FR-5 Lifecycle

Reset marks the alert resolved. With a linked ticket and autoResolveTicket: always comment; close (via autoResolveStatusId, else the tenant's first is_closed status) only if the ticket is untouched — no human comments, no time entries, no manual status change; rule auto-assignment doesn't count. Acknowledged events stamp acknowledged_at/status. A ticket-closed event-bus subscriber resets still-active linked alerts in the RMM via the provider's outbound adapter when the matched rule's resetAlertOnTicketClose is true (default). Outbound failures log and stamp alert metadata; they never block the close. Providers without an adapter are skipped.

FR-6 Rules CRUD

List/create/update/delete/reorder server actions in packages/integrations, admin-gated, Zod-validated, regex validated at save time.

FR-7 Rules UI

The settings section and editor described in UX notes, shared across providers.

FR-8 Workflow v2 and notifications

RMM_ALERT_TRIGGERED and RMM_ALERT_RESOLVED registered in the workflow v2 catalog with provider-generic payloads and published by the pipeline (replacing the orphaned legacy-bus publishes). New rmm.alerts.create_ticket workflow action invokes the shared ticket creator by alert ID. New rmm-alert notification category delivers in-app + email to a matched rule's notifyUserIds, honoring per-user preferences.

FR-9 Hardening and cleanup

Implement CSRF validation in the NinjaOne OAuth callback. Move ninjaone/alerts/* logic into the shared module and remove the superseded resetInNinjaOne TODO. Deprecate rmm_organization_mappings.auto_create_tickets (no read paths remain).

FR-10 Maintenance windows

New rmm_maintenance_windows table: optional integration_id/client_id/ asset_id scopes (null = all of that dimension), one-off starts_at/ends_at or weekly recurrence jsonb (days, time range, timezone), name, is_active. The pipeline checks windows before rule matching: an alert matching all non-null scopes of an active window at its occurredAt is stored with status = 'suppressed' and suppressed_by_window_id — no ticket, no notifications, no workflow events. A reset for a suppressed alert resolves it quietly. Window CRUD server actions are admin-gated and Zod-validated, with the settings UI described in UX notes.

FR-11 Alert polling (reconciliation)

A per-integration Temporal scheduled workflow (Entra per-tenant schedule pattern), default on for connected integrations, every 15 minutes (configurable 560), created on connect and removed on disconnect. Each cycle, through the same pipeline: (1) upsert RMM-active alerts missing locally as triggered events; (2) synthesize reset events for local active alerts no longer active in the RMM; (3) process still-active suppressed alerts whose window ended through the normal rules path. Webhooks stay primary; ingest idempotency makes overlap harmless.

Non-functional Requirements

  • Webhook ingest path makes no external API calls; webhook latency stays bounded. RMM API calls happen only in the poller and the ticket-close subscriber, both off the request path.
  • All queries tenant-scoped (CitusDB composite keys: tenant + entity id).
  • Webhook response semantics preserved: 200 unmapped org, 200 success, 500 unexpected error (RMM retries; ingest idempotency makes this safe).

Data / API / Integrations

See FR-1 for schema and the design doc for exact JSONB shapes. External APIs: NinjaOneClient.resetAlert() (exists); TacticalRMM alert resolution if its API supports it (open question in SCRATCHPAD — adapter is optional by design).

Security / Permissions

  • Rule CRUD requires admin permission; all actions tenant-scoped.
  • Webhook auth unchanged (HMAC signature / shared-secret header).
  • OAuth callback CSRF validation (FR-9).

Observability

Pipeline logs rule-evaluation skips and outbound reset failures. No new metrics/monitoring infrastructure.

Rollout / Migration

Single additive migration; deployed tables hold negligible data, so no backfill. No feature flag: with zero rules configured, the pipeline stores alerts without creating tickets, which matches today's effective behavior.

Open Questions

Tracked in SCRATCHPAD.md (Tactical outbound capability; exact ticket-closed event name).

Acceptance Criteria (Definition of Done)

  • A NinjaOne CONDITION TRIGGERED webhook for a mapped org creates an rmm_alerts row and, when a rule matches, a correctly-routed ticket.
  • The same condition re-firing while that ticket is open adds an occurrence comment and creates no new ticket; after the ticket closes, a new firing creates a new ticket.
  • CONDITION RESET resolves the alert, comments the ticket, and closes it only if untouched.
  • Closing an alert-linked ticket resets the alert in NinjaOne unless the rule opted out.
  • A TacticalRMM alert webhook flows through the same pipeline end to end.
  • Admins manage rules entirely from the settings UI; invalid rules are rejected at save time.
  • Alert workflows can trigger on RMM_ALERT_TRIGGERED/RMM_ALERT_RESOLVED and call rmm.alerts.create_ticket.
  • Matched rules with notifyUserIds produce in-app and email notifications per user preference.
  • An alert firing inside a matching maintenance window creates no ticket and no notifications; the same alert outside the window processes normally; a condition still firing after its window ends becomes a ticket via the poller.
  • With webhooks disabled, a poll cycle turns RMM-active alerts into tickets per the rules and closes stale tickets whose alerts cleared in the RMM.
  • All features in features.json implemented; the automated core in tests.json passes; the SMOKE_TESTS.md checklist has been executed against a live stack.