# PRD: RMM Alert Handling

- Slug: `2026-06-12-rmm-alert-handling`
- Date: 2026-06-12
- Status: Approved
- Branch: `feature/rmm-alerts-sync`
- Design doc: `docs/plans/2026-06-12-rmm-alert-handling-design.md`

## Summary

Turn RMM alerts into tickets automatically. A provider-generic pipeline in
`shared/rmm/alerts/` ingests normalized alert events from the NinjaOne and
TacticalRMM webhooks, evaluates tenant-defined rules, creates or updates
tickets with dedup, and keeps alert and ticket lifecycles in sync in both
directions. Tenants manage rules from the RMM integration settings UI.

## Problem

Alert ingestion is scaffolded but broken on `main`: the webhook writes columns
that don't exist in `rmm_alerts`, the rules engine expects a JSONB schema the
migration never created, and nothing connects alerts to tickets except a manual
button. There is no dedup, no auto-close, no outbound reset, and no rules UI.
Competing PSAs (ConnectWise, Autotask, Halo) treat all of this as table stakes.

## Goals

- Webhook-delivered alerts create tickets automatically per tenant-defined rules.
- Repeat firings of the same condition on the same device land on the existing
  open ticket instead of creating ticket storms.
- Alert resets close untouched tickets and annotate touched ones.
- Closing an alert-linked ticket resets the alert in the RMM (per-rule opt-out).
- Rules are manageable from the integration settings UI by admins.
- Alert events are first-class workflow v2 triggers; matched rules can notify users.
- Maintenance windows suppress alert ticketing for a client, asset, or
  integration during planned work, without losing the alerts.
- Scheduled polling reconciles missed webhooks: missed triggers become tickets,
  missed resets close stale tickets, and post-window still-active alerts get
  processed.
- One pipeline serves NinjaOne and TacticalRMM; a third provider only needs a
  normalizer and an optional outbound adapter.

## Non-goals

- RMM device-count billing integration, scheduled device sync, or org auto-matching.
- Per-rule dedup configuration (dedup behavior is fixed).
- Migrating NinjaOne device sync onto `sharedAssetIngestionService` (separate effort).

## Users and Primary Flows

- **MSP admin** configures alert rules per RMM integration: match conditions,
  ticket routing, lifecycle flags, notifications.
- **Dispatcher/tech** works alert tickets like any other ticket: sees occurrence
  comments on flapping conditions, sees resolution comments when alerts reset,
  and closing a ticket clears the alert in the RMM.
- **Automation builder** uses `RMM_ALERT_TRIGGERED`/`RMM_ALERT_RESOLVED`
  workflow triggers and the `rmm.alerts.create_ticket` action for custom flows.

## UX / UI Notes

- New "Alert Rules" section in RMM integration settings (next to the org-mapping
  manager), rendered for NinjaOne and TacticalRMM.
- Priority-ordered rules list: active toggle, reorder controls, edit/delete.
- Rule editor dialog with a Match group (severities, activity types, alert
  classes, source types, organization picker fed from
  `rmm_organization_mappings`, keywords, message regex) and an Actions group
  (create-ticket toggle, board picker, priority override, assignee,
  title/description templates with placeholder hints, auto-resolve toggle,
  reset-on-close toggle, notify-users picker).
- Save-time validation errors (e.g., bad regex) shown inline in the dialog.
- "Maintenance Windows" subsection beside Alert Rules: a list plus an editor
  with client/asset scope pickers and a one-off or weekly recurring schedule
  (with timezone).
- Alert polling enable/disable and interval (5–60 minutes) in integration
  settings.
- Existing per-asset `AssetAlertsSection` remains the alert-viewing surface.

## Requirements

### Functional Requirements

#### FR-1 Schema

One additive migration. `rmm_alerts` gains `activity_type`, `acknowledged_at`,
`acknowledged_by`, `dedup_key` (indexed with tenant + integration), `occurrence_count`
(default 1), `last_occurrence_at`, `matched_rule_id`, `auto_ticket_created`,
and `suppressed_by_window_id` (status gains a `suppressed` value).
Raw payloads standardize on the existing `metadata` jsonb. `rmm_alert_rules`
gains `conditions` and `actions` jsonb and drops the eleven flat filter/action
columns. New `rmm_maintenance_windows` table (FR-10). The deployed
`20251124000001` migration is not rewritten; no backfill.

#### FR-2 Contracts and normalizers

`shared/rmm/alerts/contracts.ts` defines `NormalizedRmmAlertEvent`
(`kind`: `triggered` | `reset` | `acknowledged`) and the optional per-provider
`RmmAlertOutboundAdapter` (`resetAlert`). Shared Zod schemas define the rule
`conditions`/`actions` shapes (see design doc for exact fields). NinjaOne and
TacticalRMM webhook routes map their payloads to the contract; existing webhook
auth, tenant resolution, and tier gating are unchanged.
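
To make the contract concrete, here is a minimal sketch of a normalized event and a NinjaOne normalizer. Only the type name `NormalizedRmmAlertEvent` and its three `kind` values come from this PRD. Every other field name, the webhook body shape, and the status-code strings are assumptions for illustration; the design doc holds the real shapes.

```typescript
// Sketch only: all field names below are assumptions, not the real contract.
type RmmAlertEventKind = "triggered" | "reset" | "acknowledged";

interface NormalizedRmmAlertEvent {
  kind: RmmAlertEventKind;
  provider: string;
  externalAlertId: string;
  externalDeviceId: string;
  activityType: string;
  severity: string;
  message: string;
  occurredAt: string; // ISO-8601 timestamp
  metadata: { [key: string]: unknown }; // raw provider payload
}

// Hypothetical subset of a NinjaOne webhook body.
interface NinjaOneWebhookBody {
  id: string;
  deviceId: string;
  activityType: string;
  statusCode: string;
  severity?: string;
  message?: string;
  activityTime: string;
}

// Assumed mapping from provider status codes to contract kinds.
function kindFromStatusCode(code: string): RmmAlertEventKind | null {
  switch (code) {
    case "TRIGGERED":
      return "triggered";
    case "RESET":
      return "reset";
    case "ACKNOWLEDGED":
      return "acknowledged";
    default:
      return null; // not an alert lifecycle event
  }
}

function normalizeNinjaOne(body: NinjaOneWebhookBody): NormalizedRmmAlertEvent | null {
  const kind = kindFromStatusCode(body.statusCode);
  if (kind === null) return null;
  return {
    kind: kind,
    provider: "ninjaone",
    externalAlertId: body.id,
    externalDeviceId: body.deviceId,
    activityType: body.activityType,
    severity: body.severity || "unknown",
    message: body.message || "",
    occurredAt: body.activityTime,
    metadata: { raw: body },
  };
}
```

A TacticalRMM normalizer would return the same type, so everything downstream of the webhook route stays provider-generic.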

#### FR-3 Ingest pipeline

`processRmmAlertEvent()` runs synchronously in the webhook request. For
`triggered` events: upsert `rmm_alerts` on `(tenant, integration_id,
external_alert_id)`, compute and store `dedup_key` (device + condition
identity; NinjaOne: `statusCode` falling back to `activityType`), evaluate
active rules first-match by `priority_order` (empty conditions = catch-all; a
rule that fails to evaluate is logged and skipped), store `matched_rule_id`,
then act. Replayed webhooks are no-ops (idempotent ingest).
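
The dedup key and first-match evaluation described above can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the `RuleConditions` fields are a hypothetical subset of the real JSONB shape, and `computeDedupKey`, `ruleMatches`, and `findFirstMatch` are invented names.

```typescript
// Sketch only: condition fields and helper names are assumptions.
interface AlertForMatching {
  deviceId: string;
  statusCode?: string;
  activityType: string;
  severity: string;
  message: string;
}

interface RuleConditions {
  severities?: string[];
  activityTypes?: string[];
  messageRegex?: string;
}

interface AlertRule {
  id: string;
  priorityOrder: number;
  isActive: boolean;
  conditions: RuleConditions;
}

// Device + condition identity; statusCode falls back to activityType.
function computeDedupKey(alert: AlertForMatching): string {
  return alert.deviceId + ":" + (alert.statusCode || alert.activityType);
}

// An absent or empty list places no constraint on the alert.
function listAllows(list: string[] | undefined, value: string): boolean {
  return !list || list.length === 0 || list.indexOf(value) !== -1;
}

// May throw, for example when messageRegex is not a valid pattern.
function ruleMatches(conditions: RuleConditions, alert: AlertForMatching): boolean {
  if (!listAllows(conditions.severities, alert.severity)) return false;
  if (!listAllows(conditions.activityTypes, alert.activityType)) return false;
  if (conditions.messageRegex && !new RegExp(conditions.messageRegex).test(alert.message)) {
    return false;
  }
  return true; // empty conditions fall through here: catch-all
}

function findFirstMatch(
  rules: AlertRule[],
  alert: AlertForMatching,
  log: (message: string, err: unknown) => void
): AlertRule | null {
  const ordered = rules
    .filter((rule) => rule.isActive)
    .sort((a, b) => a.priorityOrder - b.priorityOrder);
  for (const rule of ordered) {
    try {
      if (ruleMatches(rule.conditions, alert)) return rule;
    } catch (err) {
      // A rule that fails to evaluate is logged and skipped, never fatal.
      log("rule " + rule.id + " failed to evaluate; skipping", err);
    }
  }
  return null;
}
```

Catching per rule, rather than around the whole loop, is what lets one malformed rule be skipped while later rules still get their turn.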

#### FR-4 Ticketing and dedup

If the matched rule creates tickets: an alert whose `dedup_key` matches an
alert with a still-open linked ticket joins that ticket (link, increment
`occurrence_count`, internal "re-triggered - Nth occurrence" comment).
Otherwise create a ticket honoring `boardId`, `priorityOverride` (else
severity→priority mapping), `assignToUserId`, and the title/description
templates with `{{device}}`/`{{message}}`/`{{severity}}`/`{{organization}}`
placeholders. Created tickets get `source` + `source_reference`, an asset
association, an initial internal comment, and client resolution from the asset
or the org mapping.
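
A rough sketch of the join-or-create decision and template rendering, under stated assumptions: the priority names, the fallback priority, the comment wording, and all helper names (`renderTemplate`, `resolvePriority`, `planTicket`) are hypothetical. Leaving unknown placeholders untouched is also an assumption, not something this PRD specifies.

```typescript
// Sketch only: priorities, comment wording, and helper names are assumptions.
interface TemplateVars {
  device: string;
  message: string;
  severity: string;
  organization: string;
}

// Replaces {{name}} placeholders; unknown names are left as written.
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name)
      ? vars[name as keyof TemplateVars]
      : whole
  );
}

// Hypothetical severity -> priority mapping with an assumed fallback.
const SEVERITY_TO_PRIORITY: { [severity: string]: string } = {
  critical: "P1",
  major: "P2",
  moderate: "P3",
  minor: "P4",
};

function resolvePriority(severity: string, priorityOverride?: string): string {
  if (priorityOverride) return priorityOverride;
  return Object.prototype.hasOwnProperty.call(SEVERITY_TO_PRIORITY, severity)
    ? SEVERITY_TO_PRIORITY[severity]
    : "P3";
}

interface OpenLinkedAlert {
  dedupKey: string;
  ticketId: string;
  occurrenceCount: number;
}

interface RuleActions {
  titleTemplate: string;
  priorityOverride?: string;
}

type TicketPlan =
  | { action: "join"; ticketId: string; occurrenceCount: number; comment: string }
  | { action: "create"; title: string; priority: string };

// openLinked: alerts in this tenant + integration whose linked ticket is still open.
function planTicket(
  dedupKey: string,
  vars: TemplateVars,
  actions: RuleActions,
  openLinked: OpenLinkedAlert[]
): TicketPlan {
  for (const existing of openLinked) {
    if (existing.dedupKey === dedupKey) {
      const count = existing.occurrenceCount + 1;
      return {
        action: "join",
        ticketId: existing.ticketId,
        occurrenceCount: count,
        comment: "Alert re-triggered (occurrence " + count + ")",
      };
    }
  }
  return {
    action: "create",
    title: renderTemplate(actions.titleTemplate, vars),
    priority: resolvePriority(vars.severity, actions.priorityOverride),
  };
}
```

Because only alerts with a still-open ticket are passed in, a firing after the ticket closes naturally takes the `create` branch.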

#### FR-5 Lifecycle

Reset marks the alert resolved. With a linked ticket and `autoResolveTicket`:
always comment; close (via `autoResolveStatusId`, else the tenant's first
`is_closed` status) only if the ticket is untouched: no human comments, no time
entries, no manual status change; rule auto-assignment doesn't count.
Acknowledged events stamp `acknowledged_at`/status. A ticket-closed event-bus
subscriber resets still-active linked alerts in the RMM via the provider's
outbound adapter when the matched rule's `resetAlertOnTicketClose` is true
(default). Outbound failures log and stamp alert `metadata`; they never block
the close. Providers without an adapter are skipped.
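
The reset handling for a linked ticket reduces to a small decision function. The sketch below assumes the activity counts are gathered elsewhere; `TicketActivity`, `ResetContext`, and `planReset` are hypothetical names. Leaving the ticket alone when `autoResolveTicket` is off is one reading of the paragraph above and should be treated as an assumption.

```typescript
// Sketch only: type and function names are assumptions.
interface TicketActivity {
  humanCommentCount: number;
  timeEntryCount: number;
  manualStatusChangeCount: number;
}

interface ResetContext {
  autoResolveTicket: boolean;
  autoResolveStatusId: string | null;
  tenantClosedStatusIds: string[]; // tenant's is_closed statuses, in order
  activity: TicketActivity;
}

interface ResetPlan {
  addComment: boolean;
  closeWithStatusId: string | null; // null means leave the ticket open
}

// Rule auto-assignment is deliberately not part of this check.
function isUntouched(activity: TicketActivity): boolean {
  return (
    activity.humanCommentCount === 0 &&
    activity.timeEntryCount === 0 &&
    activity.manualStatusChangeCount === 0
  );
}

function planReset(ctx: ResetContext): ResetPlan {
  if (!ctx.autoResolveTicket) {
    // Assumption: without the flag, the ticket is not modified at all.
    return { addComment: false, closeWithStatusId: null };
  }
  const fallback = ctx.tenantClosedStatusIds.length > 0 ? ctx.tenantClosedStatusIds[0] : null;
  const statusId = ctx.autoResolveStatusId !== null ? ctx.autoResolveStatusId : fallback;
  return {
    addComment: true, // always comment when auto-resolve is on
    closeWithStatusId: isUntouched(ctx.activity) ? statusId : null,
  };
}
```

Keeping the decision pure makes the "touched versus untouched" rule easy to unit test without a database.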

#### FR-6 Rules CRUD

List/create/update/delete/reorder server actions in `packages/integrations`,
admin-gated, Zod-validated, regex validated at save time.

#### FR-7 Rules UI

The settings section and editor described in UX notes, shared across providers.

#### FR-8 Workflow v2 and notifications

`RMM_ALERT_TRIGGERED` and `RMM_ALERT_RESOLVED` registered in the workflow v2
catalog with provider-generic payloads and published by the pipeline (replacing
the orphaned legacy-bus publishes). New `rmm.alerts.create_ticket` workflow
action invokes the shared ticket creator by alert ID. New `rmm-alert`
notification category delivers in-app + email to a matched rule's
`notifyUserIds`, honoring per-user preferences.

#### FR-9 Hardening and cleanup

Implement CSRF validation in the NinjaOne OAuth callback. Move
`ninjaone/alerts/*` logic into the shared module and remove the superseded
`resetInNinjaOne` TODO. Deprecate `rmm_organization_mappings.auto_create_tickets`
(no read paths remain).

#### FR-10 Maintenance windows

New `rmm_maintenance_windows` table: optional `integration_id`/`client_id`/
`asset_id` scopes (null = all of that dimension), one-off `starts_at`/`ends_at`
or weekly `recurrence` jsonb (days, time range, timezone), `name`, `is_active`.
The pipeline checks windows before rule matching: an alert matching all
non-null scopes of an active window at its `occurredAt` is stored with
`status = 'suppressed'` and `suppressed_by_window_id`, with no ticket, no
notifications, and no workflow events. A reset for a suppressed alert resolves
it quietly. Window CRUD server actions are admin-gated and Zod-validated, with
the settings UI described in UX notes.
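
A minimal sketch of the scope check for one-off windows follows. Weekly `recurrence` evaluation is left out because it needs timezone-aware date handling. The camelCase field names, the `findSuppressingWindow` helper, and the choice to treat `endsAt` as exclusive are all assumptions.

```typescript
// Sketch only: covers one-off windows; field and helper names are assumptions.
interface MaintenanceWindow {
  id: string;
  isActive: boolean;
  integrationId: string | null; // null = all integrations
  clientId: string | null; // null = all clients
  assetId: string | null; // null = all assets
  startsAt: string; // ISO-8601
  endsAt: string; // ISO-8601, treated as exclusive here
}

interface AlertScope {
  integrationId: string;
  clientId: string | null;
  assetId: string | null;
  occurredAt: string; // ISO-8601
}

// A null window scope matches anything; otherwise values must be equal.
function scopeMatches(windowValue: string | null, alertValue: string | null): boolean {
  return windowValue === null || windowValue === alertValue;
}

function findSuppressingWindow(
  windows: MaintenanceWindow[],
  alert: AlertScope
): MaintenanceWindow | null {
  const occurred = Date.parse(alert.occurredAt);
  for (const win of windows) {
    if (!win.isActive) continue;
    if (!scopeMatches(win.integrationId, alert.integrationId)) continue;
    if (!scopeMatches(win.clientId, alert.clientId)) continue;
    if (!scopeMatches(win.assetId, alert.assetId)) continue;
    if (occurred >= Date.parse(win.startsAt) && occurred < Date.parse(win.endsAt)) {
      return win; // caller stores status 'suppressed' plus this window's id
    }
  }
  return null;
}
```

Running this check before rule matching means a suppressed alert never reaches ticketing, notifications, or workflow publishing.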

#### FR-11 Alert polling (reconciliation)

A per-integration Temporal scheduled workflow (following the Entra per-tenant
schedule pattern), default on for connected integrations, every 15 minutes
(configurable 5–60), created on connect and removed on disconnect. Each cycle,
through the same pipeline: (1) upsert RMM-active alerts missing locally as
`triggered` events; (2) synthesize `reset` events for local active alerts no
longer active in the RMM; (3) process still-active suppressed alerts whose
window ended through the normal rules path. Webhooks stay primary; ingest
idempotency makes overlap harmless.
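
The three reconciliation steps amount to a set difference between what the RMM reports as active and what is stored locally. A sketch under assumptions: the `LocalAlert` shape, the `"active"` status literal, the precomputed `windowEnded` flag, and the `planReconciliation` helper are all hypothetical.

```typescript
// Sketch only: the local alert shape and helper name are assumptions.
interface LocalAlert {
  externalAlertId: string;
  status: "active" | "suppressed" | "resolved";
  windowEnded: boolean; // only meaningful for suppressed alerts
}

interface ReconcilePlan {
  triggered: string[]; // step 1: active in the RMM, missing locally
  reset: string[]; // step 2: active locally, no longer active in the RMM
  reprocess: string[]; // step 3: suppressed, window over, still active in the RMM
}

function planReconciliation(rmmActiveIds: string[], local: LocalAlert[]): ReconcilePlan {
  const knownLocally: { [id: string]: boolean } = {};
  for (const alert of local) knownLocally[alert.externalAlertId] = true;

  const activeInRmm: { [id: string]: boolean } = {};
  for (const id of rmmActiveIds) activeInRmm[id] = true;

  return {
    triggered: rmmActiveIds.filter((id) => knownLocally[id] !== true),
    reset: local
      .filter((a) => a.status === "active" && activeInRmm[a.externalAlertId] !== true)
      .map((a) => a.externalAlertId),
    reprocess: local
      .filter(
        (a) =>
          a.status === "suppressed" &&
          a.windowEnded &&
          activeInRmm[a.externalAlertId] === true
      )
      .map((a) => a.externalAlertId),
  };
}
```

Each ID in the plan would then be fed through the same ingest pipeline as a webhook event, which is why overlap with a late-arriving webhook is harmless.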

### Non-functional Requirements

- Webhook ingest path makes no external API calls; webhook latency stays
  bounded. RMM API calls happen only in the poller and the ticket-close
  subscriber, both off the request path.
- All queries tenant-scoped (CitusDB composite keys: `tenant` + entity id).
- Webhook response semantics preserved: 200 for an unmapped org, 200 on
  success, 500 on an unexpected error (the RMM retries; ingest idempotency
  makes this safe).
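
The response semantics in the last bullet can be sketched as a thin wrapper around the pipeline call. The wrapper is shown synchronous for brevity, and `IngestResult`, `WebhookResponse`, and `respondToWebhook` are hypothetical names; a real route handler would most likely be async.

```typescript
// Sketch only: names are assumptions and the wrapper is synchronous for brevity.
type IngestResult = "processed" | "unmapped_org";

interface WebhookResponse {
  status: number;
  outcome: string;
}

function respondToWebhook(runPipeline: () => IngestResult): WebhookResponse {
  try {
    // Both outcomes are 200: an unmapped org is not something a retry can fix.
    return { status: 200, outcome: runPipeline() };
  } catch (err) {
    // 500 asks the RMM to retry; idempotent ingest makes the replay a no-op.
    return { status: 500, outcome: "error" };
  }
}
```

Returning 200 for an unmapped org avoids a retry loop for a condition only an admin can resolve.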

## Data / API / Integrations

See FR-1 for schema and the design doc for exact JSONB shapes. External APIs:
`NinjaOneClient.resetAlert()` (exists); TacticalRMM alert resolution if its API
supports it (open question in `SCRATCHPAD.md`; the adapter is optional by design).

## Security / Permissions

- Rule CRUD requires admin permission; all actions tenant-scoped.
- Webhook auth unchanged (HMAC signature / shared-secret header).
- OAuth callback CSRF validation (FR-9).

## Observability

Pipeline logs rule-evaluation skips and outbound reset failures. No new
metrics/monitoring infrastructure.

## Rollout / Migration

Single additive migration; deployed tables hold negligible data, so no
backfill. No feature flag: with zero rules configured, the pipeline stores
alerts without creating tickets, which matches today's effective behavior.

## Open Questions

Tracked in `SCRATCHPAD.md` (Tactical outbound capability; exact ticket-closed
event name).

## Acceptance Criteria (Definition of Done)

- A NinjaOne CONDITION TRIGGERED webhook for a mapped org creates an
  `rmm_alerts` row and, when a rule matches, a correctly-routed ticket.
- The same condition re-firing while that ticket is open adds an occurrence
  comment and creates no new ticket; after the ticket closes, a new firing
  creates a new ticket.
- CONDITION RESET resolves the alert, comments the ticket, and closes it only
  if untouched.
- Closing an alert-linked ticket resets the alert in NinjaOne unless the rule
  opted out.
- A TacticalRMM alert webhook flows through the same pipeline end to end.
- Admins manage rules entirely from the settings UI; invalid rules are rejected
  at save time.
- Alert workflows can trigger on `RMM_ALERT_TRIGGERED`/`RMM_ALERT_RESOLVED`
  and call `rmm.alerts.create_ticket`.
- Matched rules with `notifyUserIds` produce in-app and email notifications per
  user preference.
- An alert firing inside a matching maintenance window creates no ticket and no
  notifications; the same alert outside the window processes normally; a
  condition still firing after its window ends becomes a ticket via the poller.
- With webhooks disabled, a poll cycle turns RMM-active alerts into tickets per
  the rules and closes stale tickets whose alerts cleared in the RMM.
- All features in `features.json` implemented; the automated core in
  `tests.json` passes; the `SMOKE_TESTS.md` checklist has been executed against
  a live stack.