Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
6.3 KiB
Smoke Tests — RMM Alert Handling
Manual validation for everything the automated core (tests.json) doesn't cover: UI surfaces, live RMM round-trips, Temporal schedule lifecycle, email delivery, and migrations against a real stack.
These steps are derived from the design (docs/plans/2026-06-12-rmm-alert-handling-design.md),
not a running build — the feature is not implemented yet. Re-ground exact
labels and routes against the real screens when executing, and update this file
where they differ.
Preflight
- A worktree dev stack with migrations run clean (
npx knex migrate:latestfromserver/exits 0; spot-check thatrmm_alertshasdedup_keyandrmm_maintenance_windowsexists). - A NinjaOne sandbox connected at Settings → Integrations → RMM with at least one organization mapped to a test client and one device synced as an asset.
- A catch-all alert rule: create it in the new Alert Rules section (create-ticket on, a board picked, notify yourself). Creating, editing, reordering, and toggling this rule during preflight doubles as the rules-UI compatibility check.
- A test condition you can flip on demand on the sandbox device (e.g., a disk-space or stopped-service condition you control).
Risks this smoke is defending
- A real client outage produces no ticket — SLA breach nobody notices.
- A flapping condition floods the board — techs drown and miss real alerts.
- Tickets close out from under a working tech, or stale alert tickets never close — lost work either way.
- A maintenance window swallows alerts forever — "suppressed" silently becomes "lost".
- Alga writes wrong state into the customer's RMM, or cleared tickets leave stale alerts piling up in the NinjaOne console.
- Alert tickets land on the wrong board/client/priority — misdispatched work and a polluted audit trail.
- Notify/automation wiring is dead — teams that rely on alert workflows and emails find out during an incident.
- A disconnected integration keeps polling — zombie schedule hammering a revoked credential.
Flows
Flow 1 — a real alert becomes a correctly-routed ticket (risks 1, 6)
Trigger the test condition on the sandbox device. Within a minute, a ticket appears on the board your rule chose, with the rule's priority and assignee, linked to the correct client and asset, titled from your template with the device name filled in, and carrying an initial internal comment with the alert details. Check the asset's drawer: the alert shows in the alerts section with the linked ticket.
Flow 2 — a flapping condition doesn't storm the board (risk 2)
With Flow 1's ticket still open, clear and re-trigger the same condition twice. The board's ticket count for that client does not change; the existing ticket gains "re-triggered — 2nd occurrence" / "3rd occurrence" internal comments.
Flow 3 — alert clears: untouched closes, touched survives (risk 3)
Clear the condition in NinjaOne. The untouched ticket gains an "alert resolved" comment and moves to the closed status automatically. Then re-trigger (new ticket), add a tech comment to it, and clear the condition again: this ticket gains the resolution comment but stays open.
Flow 4 — closing a ticket clears the alert in the RMM (risk 5)
Re-trigger to get an open alert ticket, then close the ticket in Alga. In the NinjaOne console's alerts view, the alert is reset/gone within a minute. Then edit the rule to turn off "reset alert in RMM on ticket close", repeat, and confirm the alert stays active in NinjaOne (the opt-out is honored).
Flow 5 — maintenance window suppresses during, surfaces after (risk 4)
In Maintenance Windows, create a one-off window for the test client covering the next 30 minutes. Trigger the condition: no ticket appears, no notification fires, and the alert shows as suppressed on the asset. Edit the window to end now (or wait it out) while the condition is still active. Within one poll cycle (≤15 min) the alert is processed and the ticket appears. This is the highest-value silent-failure check in the set.
Flow 6 — the poller heals missed webhooks (risks 1, 3)
Break webhook delivery deliberately (delete the webhook registration in NinjaOne, or point it at a dead URL). Trigger one condition and clear a different, previously-ticketed one. Within one poll cycle, the new alert has a ticket and the cleared alert's untouched ticket is closed — same outcomes as webhooks, just slower. Restore the webhook afterward.
Flow 7 — notify and automation wiring is live (risk 7)
With your rule's notify-user set to yourself: trigger an alert and confirm both
the in-app notification and the email arrive (and that opting out of the
rmm-alert category in your notification preferences suppresses the email on a
repeat). In the workflow designer, build a trivial workflow triggered by
RMM_ALERT_TRIGGERED (e.g., add a ticket comment), trigger an alert, and confirm
the workflow execution log shows a run.
Flow 8 — disconnect tears the poller down (risk 8)
Disconnect the NinjaOne integration in settings. The reconciliation schedule is gone from the Temporal UI and no polling errors accrue in server logs afterward. Reconnect and confirm the schedule reappears.
Variant — TacticalRMM parity (risks 1, 2)
Repeat Flows 1 and 2 against a TacticalRMM instance: webhook alert → routed ticket, repeat → occurrence comment. Confirm a reconciliation cycle runs without errors (outbound reset is expected to be skipped if Tactical ships without an adapter).
Pass criteria
- Risks 1/6: every triggered test alert produced exactly one ticket, on the right board, client, priority, and asset — on both providers.
- Risk 2: repeated firings never created a second open ticket.
- Risk 3: untouched tickets closed themselves on alert clear; the touched ticket survived with its history intact.
- Risk 5: NinjaOne's console agreed with Alga after every ticket close — cleared when the rule says reset, untouched when opted out.
- Risk 4: nothing fired during the window, and the still-active alert became a ticket after it — suppressed never meant lost.
- Risk 7: notification arrived on both channels, preferences were honored, and the alert-triggered workflow ran.
- Risk 8: disconnect left no schedule and no errors; reconnect restored it.