Hermes 284313f908
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

# PRD — RMM Alert Handling
- Slug: `2026-06-12-rmm-alert-handling`
- Date: 2026-06-12
- Status: Approved
- Branch: `feature/rmm-alerts-sync`
- Design doc: `docs/plans/2026-06-12-rmm-alert-handling-design.md`
## Summary
Turn RMM alerts into tickets automatically. A provider-generic pipeline in
`shared/rmm/alerts/` ingests normalized alert events from the NinjaOne and
TacticalRMM webhooks, evaluates tenant-defined rules, creates or updates
tickets with dedup, and keeps alert and ticket lifecycles in sync in both
directions. Tenants manage rules from the RMM integration settings UI.
## Problem
Alert ingestion is scaffolded but broken on main: the webhook writes columns
that don't exist in `rmm_alerts`, the rules engine expects a JSONB schema the
migration never created, and nothing connects alerts to tickets except a manual
button. There is no dedup, no auto-close, no outbound reset, and no rules UI.
Competing PSAs (ConnectWise, Autotask, Halo) treat all of this as table stakes.
## Goals
- Webhook-delivered alerts create tickets automatically per tenant-defined rules.
- Repeat firings of the same condition on the same device land on the existing
open ticket instead of creating ticket storms.
- Alert resets close untouched tickets and annotate touched ones.
- Closing an alert-linked ticket resets the alert in the RMM (per-rule opt-out).
- Rules are manageable from the integration settings UI by admins.
- Alert events are first-class workflow v2 triggers; matched rules can notify users.
- Maintenance windows suppress alert ticketing for a client, asset, or
integration during planned work, without losing the alerts.
- Scheduled polling reconciles missed webhooks: missed triggers become tickets,
missed resets close stale tickets, and post-window still-active alerts get
processed.
- One pipeline serves NinjaOne and TacticalRMM; a third provider only needs a
normalizer and an optional outbound adapter.
## Non-goals
- RMM device-count billing integration, scheduled device sync, or org auto-matching.
- Per-rule dedup configuration (dedup behavior is fixed).
- Migrating NinjaOne device sync onto `sharedAssetIngestionService` (separate effort).
## Users and Primary Flows
- **MSP admin** configures alert rules per RMM integration: match conditions,
ticket routing, lifecycle flags, notifications.
- **Dispatcher/tech** works alert tickets like any other ticket: sees occurrence
comments on flapping conditions, sees resolution comments when alerts reset,
and closing a ticket clears the alert in the RMM.
- **Automation builder** uses `RMM_ALERT_TRIGGERED`/`RMM_ALERT_RESOLVED`
workflow triggers and the `rmm.alerts.create_ticket` action for custom flows.
## UX / UI Notes
- New "Alert Rules" section in RMM integration settings (next to the org-mapping
manager), rendered for NinjaOne and TacticalRMM.
- Priority-ordered rules list: active toggle, reorder controls, edit/delete.
- Rule editor dialog with a Match group (severities, activity types, alert
classes, source types, organization picker fed from
`rmm_organization_mappings`, keywords, message regex) and an Actions group
(create-ticket toggle, board picker, priority override, assignee,
title/description templates with placeholder hints, auto-resolve toggle,
reset-on-close toggle, notify-users picker).
- Save-time validation errors (e.g., bad regex) shown inline in the dialog.
- "Maintenance Windows" subsection beside Alert Rules: a list plus an editor
with client/asset scope pickers and a one-off or weekly recurring schedule
(with timezone).
- Alert polling enable/disable and interval (5-60 minutes) in integration
settings.
- Existing per-asset `AssetAlertsSection` remains the alert-viewing surface.
## Requirements
### Functional Requirements
#### FR-1 Schema
One additive migration. `rmm_alerts` gains `activity_type`, `acknowledged_at`,
`acknowledged_by`, `dedup_key` (indexed with tenant + integration), `occurrence_count`
(default 1), `last_occurrence_at`, `matched_rule_id`, `auto_ticket_created`,
and `suppressed_by_window_id` (status gains a `suppressed` value).
Raw payloads standardize on the existing `metadata` jsonb. `rmm_alert_rules`
gains `conditions` and `actions` jsonb and drops the eleven flat filter/action
columns. New `rmm_maintenance_windows` table (FR-10). The deployed
`20251124000001` migration is not rewritten; no backfill.
#### FR-2 Contracts and normalizers
`shared/rmm/alerts/contracts.ts` defines `NormalizedRmmAlertEvent`
(kind: triggered | reset | acknowledged) and the optional per-provider
`RmmAlertOutboundAdapter` (`resetAlert`). Shared Zod schemas define the rule
`conditions`/`actions` shapes (see design doc for exact fields). NinjaOne and
TacticalRMM webhook routes map their payloads to the contract; existing webhook
auth, tenant resolution, and tier gating are unchanged.
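
A minimal sketch of how the contract and a normalizer might look. Only `NormalizedRmmAlertEvent`, its three `kind` values, `RmmAlertOutboundAdapter`, `resetAlert`, and `occurredAt` come from this PRD. Every other field name, the `resetAlert` parameter, and the `ExamplePayload` / `normalizeExamplePayload` names are assumptions for illustration; the design doc remains the source of the exact fields.

```typescript
type RmmAlertEventKind = "triggered" | "reset" | "acknowledged";

// Hypothetical field set; the design doc defines the real one.
interface NormalizedRmmAlertEvent {
  kind: RmmAlertEventKind;
  provider: string;         // assumed, e.g. "ninjaone" or "tacticalrmm"
  externalAlertId: string;  // assumed to map to rmm_alerts.external_alert_id
  externalDeviceId: string; // assumed field name
  message: string;          // assumed field name
  occurredAt: string;       // ISO 8601 timestamp
  raw: unknown;             // assumed: original payload, destined for metadata
}

// Optional per provider. The parameter list is an assumption.
interface RmmAlertOutboundAdapter {
  resetAlert(externalAlertId: string): Promise<void>;
}

// Hypothetical inbound payload, NOT a real NinjaOne or TacticalRMM shape.
interface ExamplePayload {
  id: string;
  deviceId: string;
  state: "TRIGGERED" | "RESET" | "ACKNOWLEDGED";
  text: string;
  timestamp: number; // assumed: epoch seconds
}

// Maps the provider-specific payload onto the shared contract.
function normalizeExamplePayload(p: ExamplePayload): NormalizedRmmAlertEvent {
  const kindByState: Record<ExamplePayload["state"], RmmAlertEventKind> = {
    TRIGGERED: "triggered",
    RESET: "reset",
    ACKNOWLEDGED: "acknowledged",
  };
  return {
    kind: kindByState[p.state],
    provider: "example",
    externalAlertId: p.id,
    externalDeviceId: p.deviceId,
    message: p.text,
    occurredAt: new Date(p.timestamp * 1000).toISOString(),
    raw: p,
  };
}
```

Keeping the adapter as a separate optional interface lets a provider ship inbound normalization first and add outbound reset later.
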
#### FR-3 Ingest pipeline
`processRmmAlertEvent()` runs synchronously in the webhook request. Triggered:
upsert `rmm_alerts` on `(tenant, integration_id, external_alert_id)`, compute
and store `dedup_key` (device + condition identity; NinjaOne: `statusCode`
falling back to `activityType`), evaluate active rules first-match by
`priority_order` (empty conditions = catch-all; a rule that fails to evaluate
is logged and skipped), store `matched_rule_id`, then act. Replayed webhooks
are no-ops (idempotent ingest).
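
The dedup key and first-match evaluation described above could be sketched as follows. The `device:condition` key format, the `"unknown"` fallback, the `AlertFacts` shape, and the two-field `RuleConditions` subset are assumptions; the real conditions schema lives in the design doc.

```typescript
interface AlertFacts {
  deviceId: string;
  severity: string;
  message: string;
  statusCode?: string;
  activityType?: string;
}

// Assumed subset of the conditions shape.
interface RuleConditions {
  severities?: string[];
  messageRegex?: string;
}

interface AlertRule {
  id: string;
  priorityOrder: number;
  isActive: boolean;
  conditions: RuleConditions;
}

// Device + condition identity: statusCode, falling back to activityType.
// The key format and the "unknown" fallback are assumptions.
function computeDedupKey(a: AlertFacts): string {
  const condition = a.statusCode ?? a.activityType ?? "unknown";
  return `${a.deviceId}:${condition}`;
}

function ruleMatches(c: RuleConditions, a: AlertFacts): boolean {
  if (c.severities && c.severities.length > 0 && !c.severities.includes(a.severity)) {
    return false;
  }
  // new RegExp throws on an invalid pattern; the caller treats that as a skip.
  if (c.messageRegex && !new RegExp(c.messageRegex).test(a.message)) {
    return false;
  }
  return true; // empty conditions match everything (catch-all)
}

// First match wins, in ascending priorityOrder; a rule that throws is logged and skipped.
function findFirstMatch(
  rules: AlertRule[],
  a: AlertFacts,
  log: (msg: string) => void = console.warn,
): AlertRule | undefined {
  const ordered = rules
    .filter((r) => r.isActive)
    .sort((x, y) => x.priorityOrder - y.priorityOrder);
  for (const rule of ordered) {
    try {
      if (ruleMatches(rule.conditions, a)) return rule;
    } catch (err) {
      log(`rule ${rule.id} failed to evaluate, skipping: ${String(err)}`);
    }
  }
  return undefined;
}
```

Wrapping each rule in its own `try` keeps one malformed rule from blocking the rules behind it.
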
#### FR-4 Ticketing and dedup
If the matched rule creates tickets: an alert whose `dedup_key` matches an
alert with a still-open linked ticket joins that ticket (link, increment
`occurrence_count`, internal "re-triggered — Nth occurrence" comment).
Otherwise create a ticket honoring `boardId`, `priorityOverride` (else
severity→priority mapping), `assignToUserId`, and the title/description
templates with `{{device}}`/`{{message}}`/`{{severity}}`/`{{organization}}`
placeholders. Created tickets get source + source_reference, an asset
association, an initial internal comment, and client resolution from the asset
or the org mapping.
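
As an illustration of the join-or-create decision and the placeholder templates, here is a small sketch. The `PriorAlert` shape, the `decideTicketAction` and `renderTemplate` names, how the occurrence number is derived, and the choice to leave unknown placeholders untouched are all assumptions rather than confirmed behavior.

```typescript
type TemplateVars = Record<"device" | "message" | "severity" | "organization", string>;

// Replaces {{name}} placeholders; unknown names are left as written (assumed).
function renderTemplate(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (whole, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name)
      ? vars[name as keyof TemplateVars]
      : whole,
  );
}

// Hypothetical view of earlier alerts for the same tenant and integration.
interface PriorAlert {
  dedupKey: string;
  ticketId: string | null;
  ticketIsOpen: boolean;
  occurrenceCount: number;
}

type TicketDecision =
  | { action: "join"; ticketId: string; occurrence: number }
  | { action: "create" };

// Join the still-open ticket for the same dedup key, otherwise create a new one.
function decideTicketAction(dedupKey: string, prior: PriorAlert[]): TicketDecision {
  const open = prior.find(
    (p) => p.dedupKey === dedupKey && p.ticketId !== null && p.ticketIsOpen,
  );
  if (open && open.ticketId !== null) {
    return { action: "join", ticketId: open.ticketId, occurrence: open.occurrenceCount + 1 };
  }
  return { action: "create" };
}
```

Because only open tickets qualify for a join, a firing after the ticket closes falls through to `create`, which is the behavior the acceptance criteria ask for.
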
#### FR-5 Lifecycle
Reset marks the alert resolved. With a linked ticket and `autoResolveTicket`:
always comment; close (via `autoResolveStatusId`, else the tenant's first
`is_closed` status) only if the ticket is untouched: no human comments, no time
entries, no manual status change. Rule auto-assignment doesn't count as a touch.
Acknowledged events stamp `acknowledged_at`/status. A ticket-closed event-bus
subscriber resets still-active linked alerts in the RMM via the provider's
outbound adapter when the matched rule's `resetAlertOnTicketClose` is true
(default). Outbound failures log and stamp alert `metadata`; they never block
the close. Providers without an adapter are skipped.
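
The reset handling for a linked ticket can be sketched as a pure decision function. The `TicketActivity` and `ResetContext` shapes and the `planReset` name are hypothetical; the sketch also assumes the activity counters already exclude system-generated activity such as rule auto-assignment.

```typescript
// Assumed inputs: counts exclude system activity such as rule auto-assignment.
interface TicketActivity {
  humanCommentCount: number;
  timeEntryCount: number;
  statusChangedManually: boolean;
}

interface ResetContext {
  autoResolveTicket: boolean;
  autoResolveStatusId?: string;
  tenantClosedStatusIds: string[]; // assumed: is_closed statuses in tenant order
}

interface ResetPlan {
  comment: boolean;
  closeWithStatusId: string | null; // null = leave the ticket open
}

function isUntouched(a: TicketActivity): boolean {
  return a.humanCommentCount === 0 && a.timeEntryCount === 0 && !a.statusChangedManually;
}

function planReset(ctx: ResetContext, activity: TicketActivity): ResetPlan {
  // Without autoResolveTicket the ticket is left alone.
  if (!ctx.autoResolveTicket) return { comment: false, closeWithStatusId: null };
  // Touched tickets are annotated but never closed.
  if (!isUntouched(activity)) return { comment: true, closeWithStatusId: null };
  // Prefer the rule's status, else the tenant's first closed status.
  const statusId = ctx.autoResolveStatusId ?? ctx.tenantClosedStatusIds[0] ?? null;
  return { comment: true, closeWithStatusId: statusId };
}
```

Separating the decision from the side effects makes the "untouched" rule easy to unit test without a database.
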
#### FR-6 Rules CRUD
List/create/update/delete/reorder server actions in `packages/integrations`,
admin-gated, Zod-validated, regex validated at save time.
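
Save-time regex validation might look like the sketch below. The PRD specifies Zod for schema validation; this example uses plain TypeScript to stay dependency-free, and the `RuleDraft` shape, the `validateRuleDraft` name, and the `conditions.messageRegex` field path are assumptions.

```typescript
// Hypothetical draft shape; the real schema is defined with Zod in the design doc.
interface RuleDraft {
  conditions: { messageRegex?: string };
}

interface FieldError {
  field: string;
  message: string;
}

// Compiles the pattern once at save time so a bad regex never reaches the pipeline.
function validateRuleDraft(draft: RuleDraft): FieldError[] {
  const errors: FieldError[] = [];
  const pattern = draft.conditions.messageRegex;
  if (pattern !== undefined && pattern !== "") {
    try {
      new RegExp(pattern);
    } catch (err) {
      errors.push({
        field: "conditions.messageRegex",
        message: err instanceof Error ? err.message : "Invalid regular expression",
      });
    }
  }
  return errors;
}
```

Returning field-scoped errors lets the rule editor show them inline, as the UX notes require.
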
#### FR-7 Rules UI
The settings section and editor described in UX notes, shared across providers.
#### FR-8 Workflow v2 and notifications
`RMM_ALERT_TRIGGERED` and `RMM_ALERT_RESOLVED` registered in the workflow v2
catalog with provider-generic payloads and published by the pipeline (replacing
the orphaned legacy-bus publishes). New `rmm.alerts.create_ticket` workflow
action invokes the shared ticket creator by alert ID. New `rmm-alert`
notification category delivers in-app + email to a matched rule's
`notifyUserIds`, honoring per-user preferences.
#### FR-9 Hardening and cleanup
Implement CSRF validation in the NinjaOne OAuth callback. Move
`ninjaone/alerts/*` logic into the shared module and remove the superseded
`resetInNinjaOne` TODO. Deprecate `rmm_organization_mappings.auto_create_tickets`
(no read paths remain).
#### FR-10 Maintenance windows
New `rmm_maintenance_windows` table: optional `integration_id`/`client_id`/
`asset_id` scopes (null = all of that dimension), one-off `starts_at`/`ends_at`
or weekly `recurrence` jsonb (days, time range, timezone), `name`, `is_active`.
The pipeline checks windows before rule matching: an alert matching all
non-null scopes of an active window at its `occurredAt` is stored with
`status = 'suppressed'` and `suppressed_by_window_id` — no ticket, no
notifications, no workflow events. A reset for a suppressed alert resolves it
quietly. Window CRUD server actions are admin-gated and Zod-validated, with the
settings UI described in UX notes.
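
A minimal sketch of the window check for one-off windows. Weekly `recurrence` is deliberately left out, and the camelCase field names, the `AlertScope` shape, and the end-exclusive interval are assumptions.

```typescript
interface MaintenanceWindow {
  id: string;
  isActive: boolean;
  integrationId: string | null; // null = all integrations
  clientId: string | null;      // null = all clients
  assetId: string | null;       // null = all assets
  startsAt: string;             // ISO 8601
  endsAt: string;               // ISO 8601
}

interface AlertScope {
  integrationId: string;
  clientId: string | null; // assumed: may be unresolved
  assetId: string | null;
  occurredAt: string;      // ISO 8601
}

// A null window scope matches everything in that dimension.
function scopeMatches(windowValue: string | null, alertValue: string | null): boolean {
  return windowValue === null || windowValue === alertValue;
}

// Returns the first active window whose non-null scopes all match and whose
// one-off interval contains occurredAt (end-exclusive, an assumption).
function findSuppressingWindow(
  windows: MaintenanceWindow[],
  alert: AlertScope,
): MaintenanceWindow | undefined {
  const at = Date.parse(alert.occurredAt);
  return windows.find(
    (w) =>
      w.isActive &&
      scopeMatches(w.integrationId, alert.integrationId) &&
      scopeMatches(w.clientId, alert.clientId) &&
      scopeMatches(w.assetId, alert.assetId) &&
      Date.parse(w.startsAt) <= at &&
      at < Date.parse(w.endsAt),
  );
}
```

A match would translate to `status = 'suppressed'` with `suppressed_by_window_id` set to the returned window's `id`.
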
#### FR-11 Alert polling (reconciliation)
A per-integration Temporal scheduled workflow (Entra per-tenant schedule
pattern), default on for connected integrations, runs every 15 minutes
(configurable 5-60) and is created on connect and removed on disconnect. Each cycle, through the
same pipeline: (1) upsert RMM-active alerts missing locally as `triggered`
events; (2) synthesize `reset` events for local active alerts no longer active
in the RMM; (3) process still-active suppressed alerts whose window ended
through the normal rules path. Webhooks stay primary; ingest idempotency makes
overlap harmless.
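
The three reconciliation steps can be sketched as a diff between the RMM's active alerts and local state. The `RemoteAlert` / `LocalAlert` shapes, the `"active"` status value, the `windowEnded` flag, and the `"reprocess"` action name are assumptions; in the real pipeline steps (1) and (2) would be emitted as normalized `triggered` and `reset` events.

```typescript
interface RemoteAlert {
  externalAlertId: string;
}

interface LocalAlert {
  externalAlertId: string;
  status: "active" | "resolved" | "suppressed"; // "active" is an assumed value
  windowEnded?: boolean; // assumed: true once the suppressing window is over
}

interface ReconcileAction {
  kind: "triggered" | "reset" | "reprocess";
  externalAlertId: string;
}

function reconcile(remoteActive: RemoteAlert[], local: LocalAlert[]): ReconcileAction[] {
  const remoteIds = new Set(remoteActive.map((r) => r.externalAlertId));
  const localIds = new Set(local.map((l) => l.externalAlertId));
  const actions: ReconcileAction[] = [];

  // (1) Active in the RMM but unknown locally: a missed trigger.
  for (const r of remoteActive) {
    if (!localIds.has(r.externalAlertId)) {
      actions.push({ kind: "triggered", externalAlertId: r.externalAlertId });
    }
  }
  // (2) Active locally but gone from the RMM: a missed reset.
  for (const l of local) {
    if (l.status === "active" && !remoteIds.has(l.externalAlertId)) {
      actions.push({ kind: "reset", externalAlertId: l.externalAlertId });
    }
  }
  // (3) Suppressed, window over, still active in the RMM: run the rules path.
  for (const l of local) {
    if (l.status === "suppressed" && l.windowEnded === true && remoteIds.has(l.externalAlertId)) {
      actions.push({ kind: "reprocess", externalAlertId: l.externalAlertId });
    }
  }
  return actions;
}
```

Since ingest is idempotent, an action that races with a late webhook for the same alert should be harmless.
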
### Non-functional Requirements
- Webhook ingest path makes no external API calls; webhook latency stays
bounded. RMM API calls happen only in the poller and the ticket-close
subscriber, both off the request path.
- All queries tenant-scoped (CitusDB composite keys: `tenant` + entity id).
- Webhook response semantics preserved: 200 unmapped org, 200 success, 500
unexpected error (RMM retries; ingest idempotency makes this safe).
## Data / API / Integrations
See FR-1 for schema and the design doc for exact JSONB shapes. External APIs:
`NinjaOneClient.resetAlert()` (exists); TacticalRMM alert resolution if its API
supports it (open question in SCRATCHPAD — adapter is optional by design).
## Security / Permissions
- Rule CRUD requires admin permission; all actions tenant-scoped.
- Webhook auth unchanged (HMAC signature / shared-secret header).
- OAuth callback CSRF validation (FR-9).
## Observability
Pipeline logs rule-evaluation skips and outbound reset failures. No new
metrics/monitoring infrastructure.
## Rollout / Migration
Single additive migration; deployed tables hold negligible data, so no
backfill. No feature flag: with zero rules configured, the pipeline stores
alerts without creating tickets, which matches today's effective behavior.
## Open Questions
Tracked in `SCRATCHPAD.md` (Tactical outbound capability; exact ticket-closed
event name).
## Acceptance Criteria (Definition of Done)
- A NinjaOne CONDITION TRIGGERED webhook for a mapped org creates an
`rmm_alerts` row and, when a rule matches, a correctly-routed ticket.
- The same condition re-firing while that ticket is open adds an occurrence
comment and creates no new ticket; after the ticket closes, a new firing
creates a new ticket.
- CONDITION RESET resolves the alert, comments the ticket, and closes it only
if untouched.
- Closing an alert-linked ticket resets the alert in NinjaOne unless the rule
opted out.
- A TacticalRMM alert webhook flows through the same pipeline end to end.
- Admins manage rules entirely from the settings UI; invalid rules are rejected
at save time.
- Alert workflows can trigger on `RMM_ALERT_TRIGGERED`/`RMM_ALERT_RESOLVED`
and call `rmm.alerts.create_ticket`.
- Matched rules with `notifyUserIds` produce in-app and email notifications per
user preference.
- An alert firing inside a matching maintenance window creates no ticket and no
notifications; the same alert outside the window processes normally; a
condition still firing after its window ends becomes a ticket via the poller.
- With webhooks disabled, a poll cycle turns RMM-active alerts into tickets per
the rules and closes stale tickets whose alerts cleared in the RMM.
- All features in `features.json` implemented; the automated core in
`tests.json` passes; the `SMOKE_TESTS.md` checklist has been executed against
a live stack.