# PRD — NinjaOne Proactive Token Refresh
- Slug: `ninjaone-proactive-token-refresh`
- Date: `2026-03-26`
- Status: Draft
## Summary
Schedule proactive, per-integration NinjaOne OAuth token refreshes through Temporal so that connected integrations rotate their access and refresh tokens before expiry, instead of waiting for a user-triggered sync or webhook processing path to hit an expired token.
## Problem
NinjaOne credentials are currently refreshed lazily inside the API client when a request notices the token is near expiry or when a request receives a `401`. This means:
- the first user-visible action after expiry pays the refresh cost;
- refresh-token failures surface during organization/device syncs instead of being handled as background maintenance;
- there is no dedicated lifecycle owner for NinjaOne token refreshes in Temporal;
- failures are hard to distinguish from sync failures until worker logs are inspected.
Recent production evidence showed a Temporal organization sync reaching the worker successfully, then failing while refreshing the NinjaOne token at `https://ca.ninjarmm.com/oauth/token` with `400 Bad Request` and `error: invalid_token`. That proves the current path does attempt refresh, but only on demand and too late for a good operator or user experience.
## Goals
- Refresh NinjaOne OAuth credentials proactively before `expires_at` using Temporal worker-owned execution.
- Schedule refreshes per integration, not via a global polling scanner.
- Persist newly rotated access tokens and refresh tokens after each successful refresh.
- Reschedule the next refresh automatically after each successful refresh.
- Keep current lazy refresh logic as a fallback path if a scheduled run is missed.
- Make refresh failure state explicit enough that operators and future code can distinguish reconnect-required credentials from ordinary sync failures.
## Non-goals
- Replacing the existing lazy refresh logic in the NinjaOne client.
- Building a generic cross-provider token lifecycle framework in this scope.
- Adding a full user-facing token-health dashboard.
- Introducing a broad periodic scanner over all integrations.
- Auto-reconnecting or auto-reauthorizing NinjaOne after a terminal refresh-token failure.
## Users and Primary Flows
1. Connected tenant with active NinjaOne integration
   - OAuth callback stores credentials and marks the integration active.
   - The system schedules one delayed Temporal refresh workflow for that integration before token expiry.
2. Background refresh lifecycle
   - The delayed workflow wakes up before expiry.
   - The worker loads current NinjaOne credentials, refreshes them through the NinjaOne OAuth token endpoint, persists the rotated tokens, and computes the next refresh time.
   - The worker schedules the next one-off refresh workflow for the same integration.
3. Failure and reconnect flow
   - If refresh fails with a retryable infrastructure error, the workflow retries according to Temporal activity/workflow policy.
   - If refresh fails with a non-retryable token/provider error such as `invalid_token`, the integration is marked as requiring reconnect and no further refresh is scheduled until a reconnect or manual recovery path resets the lifecycle.
   - User-triggered syncs still use the lazy refresh fallback, but should usually find a fresh token already present.
4. Disconnect / reconnect flow
   - Disconnecting NinjaOne cancels or invalidates future scheduled refreshes for that integration.
   - Reconnecting NinjaOne creates a new valid credential set and seeds a new proactive refresh schedule.
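
The failure and reconnect flow above depends on telling retryable infrastructure errors apart from terminal token errors. The following is a minimal sketch of that decision, not the actual implementation. The names `classifyRefreshFailure`, `RefreshFailure`, and `RefreshFailureKind` are hypothetical. Treating `invalid_grant` as terminal alongside `invalid_token`, and treating other 4xx responses as reconnect-required, are assumptions that would need to be checked against real NinjaOne responses.

```typescript
// Hypothetical classification of a failed NinjaOne token refresh attempt.
type RefreshFailureKind = "retryable" | "reconnect_required";

interface RefreshFailure {
  status?: number;    // HTTP status; undefined when no response was received
  errorCode?: string; // provider `error` field, e.g. "invalid_token"
}

// Assumption: these OAuth-style error codes mean the stored refresh token is unusable.
const TERMINAL_ERROR_CODES: string[] = ["invalid_token", "invalid_grant"];

function classifyRefreshFailure(failure: RefreshFailure): RefreshFailureKind {
  // A known terminal error code always wins, regardless of HTTP status.
  if (
    failure.errorCode !== undefined &&
    TERMINAL_ERROR_CODES.indexOf(failure.errorCode) !== -1
  ) {
    return "reconnect_required";
  }
  // No response at all, rate limiting, or a server-side error: worth retrying.
  if (
    failure.status === undefined ||
    failure.status === 429 ||
    failure.status >= 500
  ) {
    return "retryable";
  }
  // Assumption: any other 4xx will not succeed on retry with the same credentials.
  return "reconnect_required";
}
```

In a Temporal activity, the `reconnect_required` case would typically be raised as a non-retryable failure (for example with `ApplicationFailure.nonRetryable` in the TypeScript SDK) so that the retry policy does not keep replaying a request that cannot succeed.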
## UX / UI Notes
- No new user-facing page is required in this scope.
- Existing sync flows should fail less often for expired tokens because refresh should already have happened in the background.
- When a refresh token is invalid and the integration needs reconnect, server actions should, where practical, continue to return a clear reconnect-style error rather than a generic sync failure. A broader UI redesign is not part of this scope.
## Requirements
### Functional Requirements
- Introduce a dedicated NinjaOne token refresh workflow/activity in Temporal.
- Schedule one delayed refresh workflow per active NinjaOne integration using the credential `expires_at` value and a configurable safety buffer.
- Seed or reschedule that delayed workflow when:
  - OAuth callback stores fresh credentials,
  - a proactive refresh succeeds,
  - a lazy refresh succeeds in the client.
- Ensure only one future proactive refresh is considered active for a given integration at a time.
- Refresh logic must reload the latest stored credential set at execution time rather than trusting stale workflow input.
- On successful refresh, persist:
  - new access token,
  - new refresh token,
  - new expiry timestamp,
  - unchanged instance URL unless the provider response or current stored credentials require otherwise.
- On terminal provider/token failure, record reconnect-required state in integration-owned metadata and stop automatic rescheduling until the integration is reconnected or explicitly reset.
- Disconnecting NinjaOne must cancel, invalidate, or safely no-op any pending or in-flight refresh workflow for that integration.
- Reconnecting NinjaOne must replace stale lifecycle state and create a fresh future refresh schedule.
- Existing organization/device sync and webhook-triggered client calls must keep the current lazy refresh fallback path.
- Refresh scheduling and execution must emit structured logs with tenant, integration, workflow identity, schedule target time, attempt outcome, and provider error payload details where safe.
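
The scheduling requirement above (refresh at `expires_at` minus a configurable safety buffer) can be sketched as a small pure function. This is an illustration under stated assumptions: `planNextRefresh` and `RefreshPlan` are hypothetical names, and `expires_at` is assumed to be available as an epoch timestamp in milliseconds, which the stored credential format may not match.

```typescript
// Hypothetical result of planning the next proactive refresh.
interface RefreshPlan {
  refreshAtMs: number; // absolute time the refresh should run (epoch ms)
  delayMs: number;     // how long to wait from `nowMs` before refreshing
}

// Computes when to refresh: `bufferMs` before expiry, but never in the past.
// Assumption: all arguments are epoch milliseconds or millisecond durations.
function planNextRefresh(
  expiresAtMs: number,
  nowMs: number,
  bufferMs: number
): RefreshPlan {
  // If the buffered target has already passed, refresh immediately.
  const refreshAtMs = Math.max(nowMs, expiresAtMs - bufferMs);
  return { refreshAtMs, delayMs: refreshAtMs - nowMs };
}

// Example: token expires at t=1,000,000 ms, now is t=400,000 ms, buffer is 300,000 ms.
const examplePlan = planNextRefresh(1000000, 400000, 300000);
// examplePlan.refreshAtMs is 700000 and examplePlan.delayMs is 300000
```

The resulting `delayMs` is the kind of value that could feed a delayed workflow start or a workflow timer. Because the workflow must reload credentials at execution time, the plan is only a wake-up hint, not a source of truth for the token state.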
### Non-functional Requirements
- Scheduling must be precise enough that refresh normally occurs before expiry with reasonable clock skew tolerance.
- The design must avoid a global high-frequency poller over all NinjaOne integrations.
- The implementation must be idempotent under duplicate workflow starts, repeated reconnects, or retries.
- Workflow ownership and cleanup semantics must survive worker restarts and deploys without orphaning endless refresh loops.
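
One common way to meet the idempotency requirement above is to derive a deterministic identifier per integration, so that a duplicate start or repeated reconnect replaces the pending refresh instead of adding a second one. The sketch below models that with an in-memory registry. It is not Temporal code. `refreshWorkflowId`, `RefreshRegistry`, and the ID format are hypothetical.

```typescript
// Hypothetical deterministic identifier: one per tenant + integration pair.
function refreshWorkflowId(tenantId: string, integrationId: string): string {
  return `ninjaone-token-refresh:${tenantId}:${integrationId}`;
}

interface PendingRefresh {
  workflowId: string;
  refreshAtMs: number;
}

// In-memory stand-in for "at most one pending refresh per integration".
class RefreshRegistry {
  private pending: Record<string, PendingRefresh> = {};

  // Scheduling again for the same integration replaces the earlier entry.
  schedule(tenantId: string, integrationId: string, refreshAtMs: number): PendingRefresh {
    const workflowId = refreshWorkflowId(tenantId, integrationId);
    const entry: PendingRefresh = { workflowId, refreshAtMs };
    this.pending[workflowId] = entry;
    return entry;
  }

  get(tenantId: string, integrationId: string): PendingRefresh | undefined {
    return this.pending[refreshWorkflowId(tenantId, integrationId)];
  }

  // Returns true if something was cancelled; a repeat cancel is a safe no-op.
  cancel(tenantId: string, integrationId: string): boolean {
    const workflowId = refreshWorkflowId(tenantId, integrationId);
    const existed = Object.prototype.hasOwnProperty.call(this.pending, workflowId);
    delete this.pending[workflowId];
    return existed;
  }

  count(): number {
    return Object.keys(this.pending).length;
  }
}
```

With Temporal, a deterministic workflow ID plays the role of the registry key, since the service does not allow two open executions with the same workflow ID in a namespace. Whether a second start should fail, be ignored, or replace the first is a policy choice this PRD leaves open.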
## Data / API / Integrations
- Current NinjaOne credentials live in the tenant secret `ninjaone_credentials` and contain:
  - `access_token`
  - `refresh_token`
  - `expires_at`
  - `instance_url`
- Current `rmm_integrations` rows do not store OAuth expiry directly. This plan should store schedule/lifecycle metadata in provider settings or another integration-owned persistence field that is available without reading secrets for every UI/status read.
- The proactive refresh workflow should use the same NinjaOne OAuth refresh contract already used by the client:
  - `POST {instanceUrl}/oauth/token`
  - `grant_type=refresh_token`
  - `refresh_token`
  - `client_id`
  - `client_secret`
- The workflow should run on the existing app Temporal worker/task queue used for NinjaOne sync workflows unless a more specific queue is already required by runtime conventions.
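
The refresh contract listed above can be sketched as a request builder. This is a hedged illustration: `buildRefreshRequest` and `RefreshRequest` are hypothetical names, and the `application/x-www-form-urlencoded` content type is an assumption based on standard OAuth 2.0 token endpoint behavior rather than something this document confirms about the existing client.

```typescript
// Hypothetical description of the HTTP request for a token refresh.
interface RefreshRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildRefreshRequest(
  instanceUrl: string,
  refreshToken: string,
  clientId: string,
  clientSecret: string
): RefreshRequest {
  // Field names come from the contract listed in this section.
  const fields: Array<[string, string]> = [
    ["grant_type", "refresh_token"],
    ["refresh_token", refreshToken],
    ["client_id", clientId],
    ["client_secret", clientSecret],
  ];
  // Percent-encode each pair so tokens containing reserved characters stay intact.
  const body = fields
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // Strip trailing slashes so the path is not doubled up.
  const url = instanceUrl.replace(/\/+$/, "") + "/oauth/token";
  return {
    url,
    // Assumption: the token endpoint expects a form-encoded body.
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body,
  };
}
```

Keeping request construction separate from the HTTP call makes it straightforward for the proactive workflow and the lazy client path to share one contract, and lets the body be excluded from logs as a unit.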
## Security / Permissions
- Do not duplicate raw tokens into `rmm_integrations` or other broadly-readable tables.
- Any status or lifecycle metadata persisted outside secrets must exclude access tokens and refresh tokens.
- Failure logs should capture provider error codes and safe response body fragments, but must not log secret values or full request bodies containing credentials.
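
The logging rule above can be enforced with a small redaction step applied before any provider payload reaches a logger. The sketch below is illustrative only: `redactForLog` and the `SECRET_KEYS` list are hypothetical, and the redaction is shallow, so nested objects would need additional handling.

```typescript
// Assumption: these top-level keys are the ones that can carry secret values.
const SECRET_KEYS: string[] = ["access_token", "refresh_token", "client_secret"];

// Returns a shallow copy of `payload` with secret-bearing fields masked.
function redactForLog(payload: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const key of Object.keys(payload)) {
    safe[key] = SECRET_KEYS.indexOf(key) !== -1 ? "[REDACTED]" : payload[key];
  }
  return safe;
}

// Example: a provider error payload that happens to echo a token back.
const safePayload = redactForLog({
  error: "invalid_token",
  refresh_token: "example-secret-value",
});
// safePayload.error is "invalid_token"; safePayload.refresh_token is "[REDACTED]"
```

Masking values rather than dropping keys keeps it visible in logs that a field was present, which can help when diagnosing malformed provider responses.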
## Observability
- Log schedule creation/reschedule/cancel decisions with tenant and integration IDs.
- Log workflow execution start with tenant, integration, scheduled refresh target, and current token expiry.
- Log successful refresh completion with old/new expiry timestamps and next scheduled refresh time.
- Log terminal failure with provider status, provider error body, and whether the integration was marked reconnect-required.
- Reuse existing integration token lifecycle events where they fit, and add a NinjaOne-specific refresh-scheduled/refreshed signal only if needed for implementation clarity.
## Rollout / Migration
- Implement the workflow and scheduling path without removing lazy refresh.
- Backfill existing active NinjaOne integrations by seeding a future refresh workflow from their currently stored secret expiry.
- Treat integrations missing credentials or missing expiry as unschedulable and surface that as reconnect-required or a configuration error rather than silently looping forever.
- Deploy with a conservative scheduling buffer and validate on one integration before broad production reliance.
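
The backfill rules above can be sketched as a decision function that either yields a refresh time or an explicit unschedulable reason. This is a sketch under assumptions: `planBackfill`, `StoredCredentials`, `BackfillDecision`, and the reason strings are hypothetical, and `expires_at` is assumed to be an epoch timestamp in milliseconds.

```typescript
// Shape of the stored secret, with every field optional to model bad data.
interface StoredCredentials {
  access_token?: string;
  refresh_token?: string;
  expires_at?: number; // assumption: epoch milliseconds
  instance_url?: string;
}

type BackfillDecision =
  | { kind: "schedule"; refreshAtMs: number }
  | { kind: "unschedulable"; reason: "missing_credentials" | "missing_expiry" };

function planBackfill(
  creds: StoredCredentials | undefined,
  nowMs: number,
  bufferMs: number
): BackfillDecision {
  // Without a refresh token there is nothing to refresh with.
  if (!creds || !creds.refresh_token) {
    return { kind: "unschedulable", reason: "missing_credentials" };
  }
  // A missing or non-finite expiry cannot be turned into a schedule.
  if (typeof creds.expires_at !== "number" || !isFinite(creds.expires_at)) {
    return { kind: "unschedulable", reason: "missing_expiry" };
  }
  // Already-expired or nearly-expired tokens are refreshed immediately.
  return {
    kind: "schedule",
    refreshAtMs: Math.max(nowMs, creds.expires_at - bufferMs),
  };
}
```

Returning a tagged result instead of throwing lets the backfill job record a reconnect-required or configuration-error state per integration and continue with the rest of the batch.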
## Open Questions
- Whether schedule/lifecycle metadata should live in `rmm_integrations.settings` or in a dedicated table for token lifecycle state.
- Whether a terminal `invalid_token` refresh error should update `sync_error`, a new reconnect-required field in settings, or both.
- Whether disconnected integrations should actively cancel existing Temporal handles or rely on workflow/activity guards plus idempotent no-op behavior.
## Acceptance Criteria (Definition of Done)
- A newly connected NinjaOne integration automatically gets a future proactive refresh workflow scheduled before token expiry.
- A successful proactive refresh rotates and persists credentials, then schedules the next future refresh without user action.
- Existing active integrations can be seeded into the proactive schedule lifecycle after rollout.
- Lazy refresh remains available and continues to work as a fallback for missed schedules.
- A terminal refresh-token failure is recorded as reconnect-required state and no longer appears as an opaque sync-only failure.
- Disconnect and reconnect flows do not leave duplicate or stale future refresh executions for the integration.