# Scratchpad: RMM Alert Handling

- Plan slug: `2026-06-12-rmm-alert-handling`
- Created: 2026-06-12

## What This Is

Working memory for the RMM alert handling effort. The approved design lives at
`docs/plans/2026-06-12-rmm-alert-handling-design.md` (commit `c717a4fd50`).

## Decisions

- (2026-06-11) Dedup is one open ticket per (device, condition). Repeats append a comment and bump `occurrence_count`. Always on, not per-rule configurable.
- (2026-06-11) Auto-close on alert reset: always comment; close only if no human touched the ticket (no human comments, no time entries, no manual status change). Rule-driven auto-assignment does not count as touched.
- (2026-06-11) Outbound reset on ticket close is a per-rule `actions.resetAlertOnTicketClose` flag, default true.
- (2026-06-11) Pipeline is provider-generic in `shared/rmm/alerts/`; NinjaOne and TacticalRMM both wired this branch.
- (2026-06-11) Processing runs synchronously in the webhook request (Approach A). All ingest work is local DB; the only external call (outbound reset) happens on the ticket-close bus subscriber.
- (2026-06-11) Schema direction: rules use JSONB `conditions`/`actions` (the alertProcessor model wins). One additive corrective migration; the deployed `20251124000001` migration is never rewritten. Deployed data is negligible — no backfill.
- (2026-06-11) Raw alert payload standardizes on the existing `metadata` jsonb column; code that wrote `source_data` changes to `metadata`.
- (2026-06-11) `rmm_organization_mappings.auto_create_tickets` is deprecated; rules with `organizationIds` conditions are the single source of truth.
- (2026-06-12) Maintenance windows and alert polling added to scope (originally non-goals) for competitor parity. Windows suppress before rule matching; suppressed alerts are stored but produce no ticket/notification/workflow event. The reconciliation poller owns window-end processing of still-active suppressed alerts.
- (2026-06-12) Polling is a per-integration Temporal schedule (Entra per-tenant pattern): default on, 15-minute default interval, 5–60 configurable, created on connect / removed on disconnect. Cycles upsert missed triggers and synthesize resets for stale active alerts, all through the same pipeline.
- (2026-06-12) Merged origin/main (443 commits) before implementing. The snapshot was stale: main added Huntress + Level.io providers, redesigned the RMM settings page to master-detail, and fixed Tactical/Level alert writers to be (mostly) schema-compliant. Local `.env.localtest` (wired into the alga-psa-local-test stack, port 5472) preserved through the merge.
- (2026-06-12) There are now FOUR rmm_alerts writers: NinjaOne (still writes nonexistent source_data), Tactical (still writes source_data), Level.io (compliant), Huntress (compliant but raw severity/status values). Huntress has a complete incident→ticket path (incidentPlan/incidentProcessor/ticketCreator, pg-boss polling) configured via rmm_integrations.settings — leave it intact this branch; the shared pipeline wires NinjaOne + Tactical (+ Level.io if cheap). Folding Huntress in is a follow-up.
- (2026-06-12) Ticket facts from main: tickets.source exists; source_reference does NOT — store in attributes JSONB (Huntress convention). Ticket creator should take the caller's trx (Huntress ticketCreator pattern, not NinjaOne's self-managed one). Internal notes: comment_threads + comments with is_internal/is_system_generated (helper addTicketInternalNote). Closed statuses: statuses where item_type='ticket' and is_closed=true. TICKET_CLOSED is published via publishWorkflowEvent at packages/tickets/src/actions/ticketActions.ts:1062 (also optimizedTicketActions.ts:2508).
- (2026-06-12) RMM_ALERT_TRIGGERED/RESOLVED Zod schemas already exist (packages/event-schemas/src/schemas/eventBusSchema.ts:966-981) but are not in system_event_catalog (needs a registration migration, pattern: 20250130201000_register_email_system_events.cjs) and not published anywhere via publishWorkflowEvent.
- (2026-06-12) Shared import alias: @alga-psa/shared/* → shared/* (server + ee/server tsconfigs). Settings UI: providerSettingsComponents map in packages/integrations/src/components/settings/integrations/RmmIntegrationsSetup.tsx; per-provider detail panes are where Alert Rules / Maintenance Windows sections plug in.
- (2026-06-12) Testing strategy is 80/20: tests.json holds a 32-test automated core (logic permutations, idempotency, lifecycle, tenant isolation, one E2E per direction); UI, live RMM round-trips, Temporal schedule lifecycle, email, and migrations are manual flows in SMOKE_TESTS.md, each tied to a named business risk. The old 114-test list was consolidated, not expanded — table-driven tests absorb the per-permutation entries.
- (2026-06-12, Robert's call) Reversed the pg-boss deviation: both RMM alert reconciliation AND Huntress incident polling now run as per-integration Temporal schedules (Entra pattern). Key enabler discovered during investigation: ee/temporal-workflows imports the full ee/server tree via its `@ee`/`@/` tsconfig aliases (the ninjaone-token-refresh activity precedent), so the activities are thin wrappers over the existing poll logic — no duplication. setupSchedules.ts reconciles schedules at worker boot; activities re-check is_active/polling-enabled per run so stale schedules no-op; NinjaOne connect/disconnect ensure/remove dynamically (ee/server/src/lib/integrations/rmm/alertPollingSchedule.ts). Caveats: interval changes saved via updateRmmAlertPollingSettings apply at next worker boot (the action lives in packages/ and can't reach the Temporal client); Tactical/Huntress schedule creation after a fresh connect also waits for worker boot (manual "poll now"/backfill covers the gap). CE deployments without Temporal workers get no polling — webhooks remain primary.
- (2026-06-12, Robert's catch) Final polling architecture: the per-integration polls ride the **IJobRunner abstraction** (packages/jobs interfaces; PgBossJobRunner CE / TemporalJobRunner EE, selected by JOB_RUNNER_TYPE else isEnterprise) — which I'd initially missed because the older IJobScheduler (pg-boss-only) looked like "the" scheduler abstraction. Same handler code both editions (server/src/lib/jobs/handlers/rmmAlertPollingHandlers.ts, registered in registerAllHandlers + initializeJobHandlersForWorker); cron intervals; reconcileRmmPollingSchedules() control loop (5-min tick on the legacy scheduler in initializeApp + boot pass + NinjaOne connect/disconnect hooks) converges jobs onto rmm_integrations state, so interval/toggle changes apply within minutes hands-off. TemporalJobRunner.scheduleRecurringJob fixed to update existing schedule specs. The bespoke temporal workflows/setupSchedules blocks and the ee alertPollingSchedule helper from the interim iteration are deleted; CE stubs added for the ninjaone fetcher + huntress poller dynamic @enterprise imports. CE Tactical polling restored.
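The auto-close decision above (always comment on reset; close only if no human touched the ticket) can be sketched as a pure function. This is a minimal illustration, not the code in `untouched.ts`: the `TicketActivity` shape and every field name in it are hypothetical.

```typescript
// Hypothetical activity summary for a ticket; field names are illustrative
// and do not come from the real schema.
interface TicketActivity {
  humanCommentCount: number;
  timeEntryCount: number;
  manualStatusChangeCount: number;
  // Rule-driven auto-assignment is tracked but does not count as "touched".
  autoAssignedByRule: boolean;
}

interface ResetDecision {
  addComment: true; // a reset always leaves a comment
  closeTicket: boolean; // close only when no human has touched the ticket
}

// A ticket is untouched when it has no human comments, no time entries,
// and no manual status changes.
function isUntouched(activity: TicketActivity): boolean {
  return (
    activity.humanCommentCount === 0 &&
    activity.timeEntryCount === 0 &&
    activity.manualStatusChangeCount === 0
  );
}

function decideOnAlertReset(activity: TicketActivity): ResetDecision {
  return { addComment: true, closeTicket: isUntouched(activity) };
}
```

Keeping the check free of I/O makes the "untouched" permutations easy to cover with table-driven tests.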

## Discoveries / Constraints

- (2026-06-11) Alert ingestion is broken on main today: `webhookHandler.ts:671-674` inserts `activity_type` and `source_data` into `rmm_alerts`, but the only migration creating that table (`server/migrations/20251124000001_create_rmm_integration_tables.cjs`) has neither column.
- (2026-06-11) `alertProcessor.ts` reads JSONB `conditions`/`actions` from `rmm_alert_rules`; the migration created flat `text[]`/scalar columns instead. The processor is imported only by tests — never called from the webhook path.
- (2026-06-11) `ticketCreator.ts` works (manual button in `AssetAlertsSection.tsx` uses it) and is the basis for the shared creator.
- (2026-06-11) TacticalRMM's webhook (`server/src/app/api/webhooks/tacticalrmm/route.ts`) writes `rmm_alerts` directly, including the nonexistent columns — same schema bug.
- (2026-06-11) NinjaOne condition identity for dedup: `statusCode`, falling back to `activityType`. NinjaOne sends a fresh CONDITION TRIGGERED per firing, so a flapping check fires many times a day.
- (2026-06-11) NinjaOne webhook returns 200 for unmapped orgs (suppresses retries), 500 for unexpected errors (NinjaOne retries). Keep this; idempotent ingest makes at-least-once safe.
- (2026-06-11) `NinjaOneClient.resetAlert()` exists (`POST /api/v2/alert/{uid}/reset`); the only caller today is the `ninjaone.alerts.reset` workflow action.
- (2026-06-11) Known TODOs in adjacent code: CSRF validation in the NinjaOne OAuth callback; `resetInNinjaOne` in `alertProcessor.resolveAlert()` (superseded by the outbound adapter).
- (2026-06-11) Precedent for shared provider-agnostic RMM code: `shared/rmm/contracts.ts` + `shared/rmm/sharedAssetIngestionService.ts` (used by Tanium and Tactical device sync). The alert pipeline mirrors this layout.
- (2026-06-11) Legacy-bus events `RMM_ALERT_TRIGGERED`/`RMM_ALERT_RESOLVED` are published today but have no subscribers and are not in the workflow v2 catalog. Workflow v2 currently only sees `INTEGRATION_WEBHOOK_RECEIVED`.
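The NinjaOne dedup identity noted above (`statusCode`, falling back to `activityType`, one open ticket per device and condition) might look like the following in-memory sketch. The class, the key format, and the `"UNKNOWN"` fallback are assumptions made for illustration; the real pipeline works against `rmm_alerts` rows, not a `Map`.

```typescript
// Illustrative event shape; only the three fields used for dedup are modeled.
interface NinjaOneAlertEvent {
  deviceId: string;
  statusCode?: string;
  activityType?: string;
}

interface OpenTicket {
  ticketId: number;
  occurrenceCount: number;
  comments: string[];
}

// Condition identity: statusCode, falling back to activityType.
// The "UNKNOWN" default is an assumption for events that carry neither.
function conditionIdentity(event: NinjaOneAlertEvent): string {
  return event.statusCode || event.activityType || "UNKNOWN";
}

// Hypothetical key format: one key per (device, condition) pair.
function dedupKey(event: NinjaOneAlertEvent): string {
  return `${event.deviceId}::${conditionIdentity(event)}`;
}

class InMemoryDedup {
  private openTickets = new Map<string, OpenTicket>();
  private nextTicketId = 1;

  // The first firing opens a ticket; repeats append a comment and bump the count.
  handleTrigger(event: NinjaOneAlertEvent): OpenTicket {
    const key = dedupKey(event);
    const existing = this.openTickets.get(key);
    if (existing) {
      existing.occurrenceCount += 1;
      existing.comments.push(`occurrence ${existing.occurrenceCount}`);
      return existing;
    }
    const created: OpenTicket = {
      ticketId: this.nextTicketId++,
      occurrenceCount: 1,
      comments: [],
    };
    this.openTickets.set(key, created);
    return created;
  }

  // Closing the ticket frees the key so the next firing opens a new ticket.
  handleClose(event: NinjaOneAlertEvent): void {
    this.openTickets.delete(dedupKey(event));
  }
}
```

With this shape, a flapping check that fires many times a day keeps landing on the same ticket until that ticket closes.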

## Implementation status (2026-06-12, mid-flight)

Done and committed:

- Corrective migration (20260612090000), plus event-catalog (…0100) and notification (…0200) migrations.
- Shared pipeline in shared/rmm/alerts (contracts, evaluator, windowMatcher, dedup, ticketCreator, untouched, processRmmAlertEvent, reconciliation, createTicketForAlertId, outboundRegistry).
- NinjaOne/Tactical/Level webhooks normalized into it.
- TICKET_CLOSED outbound-reset subscriber + NinjaOne adapter (CE stub in packages/ee).
- rmmAlertNotificationSubscriber (in-app + email).
- Rules + windows CRUD actions (packages/integrations/src/actions/integrations/rmmAlertRuleActions.ts).
- rmm.alerts.create_ticket workflow action.
- pg-boss reconciliation dispatcher (EE init, NinjaOne fetcher). **Deviation: pg-boss instead of design-doc Temporal** (Huntress precedent, CE compatibility; F083 connect/disconnect becomes "dispatcher polls only active integrations").
- OAuth CSRF fix (ninjaone_oauth_state tenant secret, one-time use).
- Legacy ee ninjaone alertProcessor/ticketCreator deleted; manual button + workflow action use shared createTicketForAlertId.

Update (2026-06-12, later): FR-7 UI shipped (RmmAlertAutomationSettings in packages/integrations, rendered in Tactical/NinjaOne/Level panes; asset-scope picker for windows deferred, F076 false). Tactical reconciliation fetcher shipped (reuses the backfill-verified alerts endpoint; backfill now runs a reconciliation cycle). Test status: unit suites (rule matrix, window matcher, dedup, schemas) + adapted tactical webhook/backfill tests green (70/71 in tactical+rmmalerts; the 1 failure, tacticalDeviceSync.fullSync, is pre-existing on merged main and fails without this branch's changes); DB-backed integration suite (rmmAlertPipeline, 10/10) proves the migration + pipeline spine. tests.json: 27/32 automated (subscriber, reconciliation, untouched branches, normalizer, events, severity fallback all landed); the remaining 5 (T026/T027 CRUD action tests, T029 workflow action, T030 OAuth CSRF route, T031 notification subscriber) need bespoke harnesses and are the only un-automated items. Integration tests need env: `DB_HOST=localhost DB_PORT=5472 DB_USER_ADMIN=postgres DB_PASSWORD_ADMIN=$(cat secrets/postgres_password)`.

Previously remaining: FR-7 settings UI (Alert Rules + Maintenance Windows + polling settings section in provider panes; providerSettingsComponents map in packages/integrations RmmIntegrationsSetup.tsx); test core T001–T032 (old tactical webhook unit tests will need adapting to the pipeline; old ee alertProcessor tests reference the deleted module, so replace them with shared-module tests); flip features.json/tests.json implemented flags; update design doc for the pg-boss deviation + id-space caveat.

Verification caveats for smoke testing: NinjaOne webhook external ids are activity ids while the alerts API returns uids. Reconciliation only trusts poller-ingested ids for staleness (RECONCILIATION_INGEST_MARKER in metadata) and dedup absorbs cross-source duplicates; verify uid/id behavior against a live sandbox. Tactical/Level reconciliation fetchers intentionally deferred until their list-alerts API shapes are verified live (F082 open).

## Commands / Runbooks

- Run server migrations: from `server/`, `npx knex migrate:latest` (see existing env scripts; use the worktree's compose stack via the alga-env-manager skill).
- Webhook entry for local testing: `POST /api/webhooks/ninjaone?tenant=<tenantId>` with `X-Alga-Webhook-Secret` header (secret in `rmm_integrations.settings.webhookSecret`).
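For local testing, the webhook call above could be assembled with a small helper like the one below. The route, the `tenant` query parameter, and the `X-Alga-Webhook-Secret` header come from the runbook entry; the helper name, the base URL, and the payload are placeholders, since the NinjaOne payload shape is not documented here.

```typescript
// Minimal request description that can be passed to fetch(url, init).
interface WebhookRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Hypothetical helper: builds the local NinjaOne webhook request.
// `payload` is a placeholder; substitute a captured NinjaOne activity body.
function buildNinjaOneWebhookRequest(
  baseUrl: string,
  tenantId: string,
  webhookSecret: string,
  payload: unknown,
): WebhookRequest {
  const trimmedBase = baseUrl.replace(/\/+$/, "");
  return {
    url: `${trimmedBase}/api/webhooks/ninjaone?tenant=${encodeURIComponent(tenantId)}`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Secret is read from rmm_integrations.settings.webhookSecret.
        "X-Alga-Webhook-Secret": webhookSecret,
      },
      body: JSON.stringify(payload),
    },
  };
}
```

Usage, assuming a dev server on a local port of your choosing: `const req = buildNinjaOneWebhookRequest(baseUrl, tenantId, secret, body); await fetch(req.url, req.init);`.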

## Links / References

- Design doc: `docs/plans/2026-06-12-rmm-alert-handling-design.md`
- Migration with current (broken) schema: `server/migrations/20251124000001_create_rmm_integration_tables.cjs`
- NinjaOne webhook: `ee/server/src/app/api/webhooks/ninjaone/route.ts`, handler `ee/server/src/lib/integrations/ninjaone/webhooks/webhookHandler.ts`
- Rules engine to move: `ee/server/src/lib/integrations/ninjaone/alerts/alertProcessor.ts`
- Ticket creator to move: `ee/server/src/lib/integrations/ninjaone/alerts/ticketCreator.ts`
- Tactical webhook: `server/src/app/api/webhooks/tacticalrmm/route.ts`
- Shared device-ingest precedent: `shared/rmm/sharedAssetIngestionService.ts`
- Provider registry (capability flags): `packages/integrations/src/lib/rmm/providerRegistry.ts`
- Asset alert UI: `ee/server/src/components/assets/AssetAlertsSection.tsx`
- NinjaOne client (resetAlert): `ee/server/src/lib/integrations/ninjaone/ninjaOneClient.ts`

## Open Questions

- Does TacticalRMM's API expose alert resolution for the outbound adapter? If not, ship the adapter for NinjaOne only and mark the capability off for Tactical (pipeline skips it cleanly).
- Exact event-bus event name for ticket closure (TICKET_UPDATED with closed status vs. a dedicated TICKET_CLOSED) — confirm against `server/src/lib/eventBus/` when wiring the subscriber.
- Confirm Tactical's alerts API supports listing active alerts for reconciliation (NinjaOne's `getAlerts()` already exists in the client).
- FR-11 scheduler choice: the design says Temporal (Entra pattern), but Huntress's incident poller uses pg-boss (server-side, CE-compatible — Tactical is a CE provider and CE deployments may not run Temporal workers). Decide at FR-11; pg-boss looks like the better precedent. Flag the deviation to Robert either way.
- Huntress provider registry flags are stale (deviceSync/events false despite reality) — out of scope here, but worth a drive-by fix or follow-up.
- Reconciliation poller and `resolveAlert` semantics: a poller-synthesized reset should be distinguishable in the ticket comment ("alert no longer active in RMM" vs. "alert reset received").

## Smoke run (2026-06-12): all 8 runbook flows PASS after fixes

Executed /tmp/rmm-alert-smoke-tests.md end-to-end (algadev, local-test stack,
EE + JOB_RUNNER_TYPE=pgboss). Bugs found and fixed on this branch:

- **statuses.item_type drift (4 call sites)**: main's board-scoped statuses
  migration left `item_type` NULL; live data uses `status_type` + `board_id`.
  Fixed ticketCreator (now delegates to `TicketModel.getDefaultStatusId`),
  `resolveCloseStatusId`, `untouched.ts`, the rule-form `closedStatuses`
  query, and the Huntress ticket creator. Symptoms: webhook 500 on first
  trigger; auto-close never closed; untouched check ignored status moves.
- **Dedup primary selection**: sibling lookup ordered `created_at desc`,
  bumping the newest absorbed row instead of the ticket-owning primary.
  Now `asc`; occurrence counts and "occurrence N" comments track the primary.
- **Severity→priority**: exact-name match missed "P1 - Critical"-style
  names; added substring fallback pass.
- **Polling settings first save was a no-op**: `jsonb_set` only creates the
  final path element, so with the `alertPolling` parent key missing the write
  silently did nothing; now merges `jsonb_build_object` into the parent.
- **Reconciler tick never recurred**: legacy `IJobScheduler.scheduleRecurringJob`
  is a one-shot delayed send (cron coerced to '24 hours', singletonHours 24,
  no re-fire). The tick is now a process-local 5-min `setInterval` in
  initializeApp (+ boot pass); reconcile is idempotent so multi-replica
  ticks are safe.
- **PgBossJobRunner recurring-record lifecycle**: every cron fire flipped the
  schedule's jobs row to `completed` (runs share `jobServiceId`), so the
  reconciler saw "no job", recreated rows while enabled, and leaked the
  pgboss schedule on disable. Fires now carry `jobRecurring: true` and the
  worker returns recurring rows to `queued`; `cancelJob` exempts recurring
  records from the completed/failed guard, unschedules, and nulls
  `external_id` (the live-schedule pointer the reconciler now also filters
  on). The ineligible branch cancels via the newest record of any status.
- **Rule dialog stale patternError**: dialog component stays mounted; a
  cancelled invalid-regex edit blocked the next Add rule. Reset on `isOpen`.
- **Button ids**: 14 missing required `id` props in RmmAlertAutomationSettings.
- **Integration test seed**: statuses now board-scoped; 20/20 green again.

Environment notes for future smoke runs: `DB_NAME_SERVER` is required by
`migrate:ee` (else knex hits the default `postgres` DB; an accidental full
migration run was left there, and that DB's public schema is safe to drop);
Tactical row needs `NEXT_PUBLIC_FORCE_FEATURE_FLAGS=tactical-rmm-integration:true`;
EE without Temporal needs `JOB_RUNNER_TYPE=pgboss`; seed tenant had no
closed ticket status (marked "Enchanted Closure" is_closed=true); Test
Connection (GET /api/beta/v1/client/) is what activates the integration.
Tactical backfill button label is "Sync Alerts".

Validation: rmmalerts unit 40/40; tactical unit 36/37 (1 pre-existing main
failure); rmmAlertPipeline integration 20/20; tsc clean on touched packages.
Flow 6 observed live: interval change converged ~3 min, disable unscheduled,
pre-fix leaked schedule self-healed on first tick.
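The severity-to-priority fix listed among the smoke-run bugs (exact-name match first, then a substring fallback for names like "P1 - Critical") can be sketched as a two-pass lookup. The `Priority` shape, the function name, and the case-insensitive comparison are assumptions for illustration, not the shipped implementation.

```typescript
// Illustrative priority row; only id and name are modeled.
interface Priority {
  id: string;
  name: string;
}

// Two-pass lookup: exact name match wins; otherwise fall back to the first
// priority whose name contains the severity as a substring.
// Case-insensitive matching is an assumption of this sketch.
function resolvePriorityForSeverity(
  severity: string,
  priorities: Priority[],
): Priority | undefined {
  const wanted = severity.trim().toLowerCase();
  if (wanted === "") return undefined;

  const exact = priorities.find((p) => p.name.trim().toLowerCase() === wanted);
  if (exact) return exact;

  return priorities.find((p) => p.name.toLowerCase().includes(wanted));
}
```

Running the exact pass first keeps a tenant with both "High" and "Very High" from resolving to the wrong row.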

## Deferred: "run workflow" as an alert-rule action (2026-06-12)

Considered and deliberately held off. Coverage today: workflows can already
trigger on RMM_ALERT_TRIGGERED/RESOLVED (system event catalog) and on ticket
creation, and can call rmm.alerts.create_ticket; the rule action would mainly
add rule-level filtering as the trigger plus rule context in the payload.

If revisited, the design direction we converged on: run-workflow as an
independent action toggle (composes with create-ticket; "replace" = create
ticket off + workflow on), launched fire-and-forget via
launchPublishedWorkflowRun after the processing transaction commits. Two
gaps must be closed for replace mode to be safe: (1) storm protection
without a ticket (dedup should treat an active same-dedup-key alert as an
occurrence even before any ticket exists, else flap storms launch one
workflow per firing during the async gap); (2) lifecycle hooks
(workflow-created tickets keep auto-resolve/reset-on-close only when created
through rmm.alerts.create_ticket, which maintains rmm_alerts.ticket_id). The
transactional-vs-eventual ticket guarantee is the documented trade-off.
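
Gap (1) above, storm protection without a ticket, could be approached by tracking active alerts by dedup key regardless of ticket state. The sketch below is one possible shape under that assumption; `ActiveAlertIndex` and its methods are hypothetical names, and the real state would live in `rmm_alerts` rather than in memory.

```typescript
// Illustrative record of an alert that is currently active in the RMM.
interface ActiveAlert {
  dedupKey: string;
  ticketId: number | null; // null while no ticket exists (replace mode)
  occurrenceCount: number;
}

class ActiveAlertIndex {
  private active = new Map<string, ActiveAlert>();

  // Returns true only for the first firing of a dedup key, which is the
  // only firing that should launch a workflow. Later firings count as
  // occurrences even though no ticket exists yet.
  recordFiring(dedupKey: string): boolean {
    const existing = this.active.get(dedupKey);
    if (existing) {
      existing.occurrenceCount += 1;
      return false;
    }
    this.active.set(dedupKey, { dedupKey, ticketId: null, occurrenceCount: 1 });
    return true;
  }

  // A reset clears the key, so the next firing is treated as new.
  recordReset(dedupKey: string): void {
    this.active.delete(dedupKey);
  }

  occurrences(dedupKey: string): number {
    return this.active.get(dedupKey)?.occurrenceCount ?? 0;
  }
}
```

Under this scheme a flap storm launches one workflow per active condition instead of one per firing.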