Microsoft 365 Inbound Email Subscription Renewal – Implementation Plan
Date: November 18, 2025
Authors: Codex (draft)
Status: Draft
Edition: Applies to Community & Enterprise (EE layers add telemetry, but renewal logic lives in shared server code)
Implementation Todo List
Phase 1: Core Service & Logic
- Create `EmailWebhookMaintenanceService` class structure
- Implement `findRenewalCandidates` DB query (join `email_providers` & `microsoft_email_provider_config`)
- Implement `renewMicrosoftWebhooks` main orchestration method
- Add `renewWebhookSubscription` logic using `MicrosoftGraphAdapter`
- Implement fallback to `registerWebhookSubscription` on 404/ResourceNotFound
- Persist new subscription ID and expiration to `microsoft_email_provider_config`
- Update `email_provider_health` with renewal result (status, failure reason)
Phase 2: Scheduling (EE & CE)
- Create `email-webhook-maintenance-workflow.ts` Temporal workflow (EE)
- Create workflow activities wrapping `EmailWebhookMaintenanceService`
- Register workflow in `ee/temporal-workflows/src/workflows/index.ts`
- Add client helper in `shared/workflow/init/registerWorkflowActions.ts` (handled in `ee/temporal-workflows/src/client.ts` and `server/src/lib/jobs/index.ts`)
- Create pg-boss job handler for CE daily renewal
Phase 3: UI & Manual Controls
- Create server action `retryMicrosoftSubscriptionRenewal`
- Update `EmailSettings` UI to show "Subscription expires in..." column
- Add "Retry Renewal" button to `EmailSettings` provider table
1. Problem Statement
Inbound Microsoft 365 mailboxes rely on Microsoft Graph change notifications to trigger the INBOUND_EMAIL_RECEIVED event stream that ultimately feeds the Temporal system-email-processing-workflow (shared/workflow/workflows/system-email-processing-workflow.ts). Each webhook subscription expires after ~72 hours and must be renewed before expiration. Today:
- `MicrosoftGraphAdapter.renewWebhookSubscription` exists (`server/src/services/email/providers/MicrosoftGraphAdapter.ts:417`) but nothing schedules or invokes it for email providers.
- After a subscription expires, `server/src/app/api/email/webhooks/microsoft/route.ts` never receives notifications; `EmailWebhookService` (`server/src/services/email/EmailWebhookService.ts`) therefore stops enqueueing jobs, Redis never emits `INBOUND_EMAIL_RECEIVED`, and tickets/comments are no longer created by `system-email-processing-workflow`.
- Operators must manually delete/recreate providers to regain coverage, which is error-prone and risky for hosted tenants.
We need an automatic renewal + recovery loop so Microsoft 365 inbound email continues to flow without manual work and we can detect/report failures quickly.
2. Goals
- Automatically renew every active Microsoft inbound email subscription at least 24 hours before it expires, per tenant.
- Automatically recreate subscriptions that are missing or rejected during renewal (404 → new `POST /subscriptions`), and persist the new IDs/expiration metadata in `microsoft_email_provider_config`.
- Surface renewal health in `email_provider_health` and the Email Settings UI so operators can see when the last renewal ran, its result, and the next expiration.
- Emit actionable alerts/events (PostHog + structured logs) when renewals fail repeatedly so support teams intervene before inbound mail stops flowing.
- Keep CE and EE parity: CE handles renewal and queue continuity; EE additionally forwards failures into workflow/analytics stacks without forking code paths.
3. Non-Goals
- Gmail Pub/Sub watch automation (already handled separately via topic refresh).
- Changing ticket creation logic in `system-email-processing-workflow` beyond consuming steady events.
- Overhauling OAuth onboarding or adding delegated mailbox discovery (tracked elsewhere).
- Outbound mail renewals (Resend / Managed Domains) — only inbound Microsoft Graph change notifications are in scope here.
4. Current State Summary
- Webhook ingestion: Microsoft Graph webhooks hit `server/src/app/api/email/webhooks/microsoft/route.ts`, which loads provider metadata from `microsoft_email_provider_config`, validates `clientState`, and enqueues jobs through `EmailWebhookService` → `EmailQueueService` (Redis) → `EmailProcessor` (`server/src/services/email/EmailProcessor.ts`).
- Event flow: `EmailProcessor` emits `INBOUND_EMAIL_RECEIVED` via the Redis event bus, which the Temporal worker consumes to run `system-email-processing-workflow.ts` (see `docs/inbound-email/architecture/workflow.md`).
- Provider storage: `email_providers` holds common metadata; `microsoft_email_provider_config` stores tokens + webhook fields (migration `server/migrations/20250714081528_create_vendor_email_config_tables.cjs`). `EmailProviderService` maps these rows to `EmailProviderConfig` (`server/src/interfaces/email.interfaces.ts`).
- Existing renewal logic: `MicrosoftGraphAdapter` can patch `/subscriptions/{id}` to extend expiration and updates `webhook_expires_at`, but nothing schedules this call. Calendar webhooks already have a `renew-microsoft-calendar-webhooks` pg-boss job (`server/src/lib/jobs/handlers/calendarWebhookMaintenanceHandler.ts`), which demonstrates the renewal query pattern even though EE email providers will rely on Temporal for orchestration.
- Health visibility: `email_provider_health` exists (`server/migrations/20250601000000_create_email_system_tables.cjs`) but is unused for inbound webhooks, so we cannot alert on renewal failures. UI under `/server/src/components/admin/EmailSettings.tsx` shows limited status, without subscription metadata.
- Docs: `docs/inbound-email/architecture/overall.md` is still a placeholder and does not mention webhook lifecycle, making on-call handoffs harder.
5. Solution Overview
We will introduce an Email Webhook Maintenance Service that discovers Microsoft providers needing attention, renews or recreates their subscriptions via the existing adapter, and records health metrics. On Enterprise Edition we will schedule that service via a dedicated Temporal Cron workflow (15-minute cadence by default) so we gain end-to-end visibility in the Temporal UI, native retries, and workflow history. Community Edition — which does not run Temporal — will rely on a lightweight pg-boss job that runs once per day (sufficient for CE’s smaller footprint) plus an optional manual CLI trigger. Renewal results flow into email_provider_health, logs, and (for EE) PostHog/Temporal observability hooks. The Email Settings UI gains a “Subscription” column and manual “Retry renewal” action that calls the same service.
5.1 Renewal Candidate Discovery
- Query `email_providers` ⨝ `microsoft_email_provider_config` for active (`is_active = true`) Microsoft rows with any of:
  - `webhook_subscription_id` null/empty (never initialized).
  - `webhook_expires_at` null or within `lookAheadMinutes` (default 1440 = 24h).
  - `last_subscription_renewal` older than `renewalIntervalMinutes` (safety net).
- Guard with `FOR UPDATE SKIP LOCKED` (or update-returning) scoped to the tenant to prevent duplicate renewals if two workers overlap.
- Convert each row into an `EmailProviderConfig` instance (reuse `EmailProviderService.getProvider` to hydrate OAuth secrets).
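The three selection rules can also be written as a pure predicate, which is convenient for unit tests even if production filtering happens in SQL. This is a minimal sketch, not the service's actual code: the `CandidateRow` and `DiscoveryOptions` shapes are assumptions, and only the column names and the 1440-minute look-ahead come from this plan. Treating a null `last_subscription_renewal` as "not older" mirrors SQL NULL comparison semantics and is also an assumption.

```typescript
// Hypothetical row shape: column names follow this plan, the type itself is an assumption.
interface CandidateRow {
  is_active: boolean;
  webhook_subscription_id: string | null;
  webhook_expires_at: Date | null;
  last_subscription_renewal: Date | null;
}

interface DiscoveryOptions {
  lookAheadMinutes: number; // plan default: 1440 (24h)
  renewalIntervalMinutes: number; // safety net; no default is given in the plan
}

const MS_PER_MINUTE = 60000;

// Returns true when an active row matches any of the three discovery rules.
function isRenewalCandidate(row: CandidateRow, now: Date, opts: DiscoveryOptions): boolean {
  if (!row.is_active) return false;

  // Rule 1: subscription ID is null or empty (never initialized).
  if (!row.webhook_subscription_id) return true;

  // Rule 2: expiration is unknown or falls inside the look-ahead window.
  if (row.webhook_expires_at === null) return true;
  const lookAheadCutoff = now.getTime() + opts.lookAheadMinutes * MS_PER_MINUTE;
  if (row.webhook_expires_at.getTime() <= lookAheadCutoff) return true;

  // Rule 3: the last renewal is older than the safety-net interval.
  const last = row.last_subscription_renewal;
  if (last !== null) {
    const ageMs = now.getTime() - last.getTime();
    if (ageMs > opts.renewalIntervalMinutes * MS_PER_MINUTE) return true;
  }
  return false;
}
```

With a 1440-minute look-ahead, a subscription expiring in 12 hours is selected, while one expiring in 60 hours is skipped unless the safety-net rule applies.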
5.2 Renewal / Recovery Flow
For each candidate provider:
- Instantiate `MicrosoftGraphAdapter` with provider config. The adapter already loads tokens from the vendor config or secrets provider.
- If `webhook_subscription_id` exists, call `renewWebhookSubscription()`.
  - If Graph returns 404/ResourceNotFound, fall back to `registerWebhookSubscription()` to fully recreate the subscription (using stored folder filters + derived webhook URL).
- Persist new `webhook_subscription_id`, `webhook_expires_at`, and `last_subscription_renewal` in `microsoft_email_provider_config`.
- Update `email_provider_health` row for the provider with fields like:
  - `subscription_status` (enum: healthy, renewing, error)
  - `subscription_expires_at`, `last_renewal_attempt_at`, `last_renewal_result`, `failure_reason`
  - `last_notification_received_at` (optional future enhancement by backfilling from webhook route).
- Emit structured log + PostHog event (EE) for success/failure. On >3 consecutive failures mark the provider `connection_status = 'error'` in `email_providers` and surface to UI.
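The renew-then-recreate decision could look roughly like the sketch below. The `WebhookAdapter` interface is a hypothetical, simplified stand-in: the real `MicrosoftGraphAdapter` signatures and error shapes are not shown in this plan and may differ. Registering a new subscription when no ID is stored is inferred from the "never initialized" rule in 5.1.

```typescript
// Hypothetical, simplified view of the adapter; real signatures may differ.
interface SubscriptionInfo {
  subscriptionId: string;
  expiresAt: Date;
}

interface WebhookAdapter {
  renewWebhookSubscription(): Promise<SubscriptionInfo>;
  registerWebhookSubscription(): Promise<SubscriptionInfo>;
}

type RenewalOutcome =
  | { result: "renewed" | "recreated"; subscription: SubscriptionInfo }
  | { result: "failed"; failureReason: string };

// Assumption: adapter errors expose an HTTP status and/or a Graph error code.
function isNotFound(err: unknown): boolean {
  const e = err as { status?: number; code?: string } | null;
  return e?.status === 404 || e?.code === "ResourceNotFound";
}

async function renewOrRecreate(
  adapter: WebhookAdapter,
  existingSubscriptionId: string | null,
): Promise<RenewalOutcome> {
  try {
    if (existingSubscriptionId) {
      try {
        return { result: "renewed", subscription: await adapter.renewWebhookSubscription() };
      } catch (err) {
        // Only "subscription is gone" triggers the fallback; other errors are real failures.
        if (!isNotFound(err)) throw err;
      }
    }
    // No stored subscription, or Graph no longer knows it: create a fresh one.
    return { result: "recreated", subscription: await adapter.registerWebhookSubscription() };
  } catch (err) {
    const failureReason = err instanceof Error ? err.message : String(err);
    return { result: "failed", failureReason };
  }
}
```

Returning an outcome object instead of throwing keeps the persistence and health-update steps in one place, since the caller handles success and failure through the same path.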
5.3 Observability & Operations
- Temporal scheduling (EE): Add an EE-only workflow (e.g., `ee/temporal-workflows/src/workflows/email-webhook-maintenance-workflow.ts`) that invokes the maintenance service as an activity on a 15-minute Cron schedule per tenant. Wire it through `ee/temporal-workflows/src/workflows/index.ts` and expose a workflow client helper so server actions and the UI can trigger ad-hoc runs or retry signals.
- pg-boss scheduling (CE): Register a CE-only pg-boss recurring job (24-hour cadence) that calls the same maintenance service. The daily interval keeps load minimal while ensuring expired subscriptions are caught within a day.
- Manual controls:
  - Server action `retryMicrosoftSubscriptionRenewal(providerId)` under `server/src/lib/actions/email-actions`.
  - CLI / script (optional) for on-call to run `node scripts/email/renew-microsoft-webhook.cjs --tenant ... --provider ...`.
- UI: Extend Email Settings table to show “Subscription expires in Xh” and add a “Retry renewal” button that calls the action. Gate the button per-provider to avoid duplicates (disable while job is running).
- Docs/runbook: Update `docs/inbound-email/architecture/overall.md` with the webhook lifecycle + renewal loop. Add an operations runbook under `docs/inbound-email/operations/microsoft-renewal.md` capturing alerts, manual commands, and expectations.
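For the “Subscription expires in Xh” column, a small formatting helper might look like the following. This is only a sketch: the function name and the wording used for missing or already-expired subscriptions are assumptions, not part of the existing UI.

```typescript
const MS_PER_HOUR = 3600000;

// Hypothetical helper for the Email Settings table. The "expires in Xh" wording
// follows this plan; the other labels are assumptions.
function formatSubscriptionExpiry(expiresAt: Date | null, now: Date): string {
  if (expiresAt === null) return "No active subscription";

  const msRemaining = expiresAt.getTime() - now.getTime();
  if (msRemaining <= 0) return "Subscription expired";

  // Round down so the label never overstates the remaining time.
  const hours = Math.floor(msRemaining / MS_PER_HOUR);
  if (hours < 1) return "Subscription expires in <1h";
  return `Subscription expires in ${hours}h`;
}
```

Passing `now` explicitly keeps the helper deterministic in tests and avoids mismatches between server-rendered and client-rendered output.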
6. Phased Work Breakdown
Phase 0 – Data & Observability Foundations (0.5 sprint)
- Add missing columns to `email_provider_health` (`subscription_status`, `subscription_expires_at`, `last_renewal_attempt_at`, `last_renewal_result`, `failure_reason`) via a migration.
- Backfill `microsoft_email_provider_config.webhook_expires_at` and `webhook_subscription_id` for any providers missing values (call Graph `GET /subscriptions/{id}` where possible, else schedule re-registration).
- Instrument `server/src/app/api/email/webhooks/microsoft/route.ts` to write `last_notification_received_at` to `email_provider_health` so we can detect silent failures independent of renewals.
- Add structured logging helpers (shared logger wrapper) for all webhook lifecycle operations (register, renew, delete) with tenant/provider context.
Phase 1 – Renewal Service & Scheduler (1 sprint)
- Create `EmailWebhookMaintenanceService` (`server/src/services/email/EmailWebhookMaintenanceService.ts`) that exposes `renewMicrosoftWebhooks({ tenantId, lookAheadMinutes })`.
- Implement SQL queries using the admin connection (`@alga-psa/db/admin`) to fetch candidate providers with locking semantics.
- Within the service, instantiate `MicrosoftGraphAdapter` and call `renewWebhookSubscription` or `registerWebhookSubscription` per provider, handling 404/410 gracefully.
- Centralize persistence updates (webhook fields + `email_provider_health`) so both the job and manual UI reuse the same code.
- Implement `ee/temporal-workflows/src/workflows/email-webhook-maintenance-workflow.ts` plus activities that call `EmailWebhookMaintenanceService`, emit telemetry, and accept signals to force immediate renewal.
- Register the workflow in the EE worker (`ee/temporal-workflows/src/workflows/index.ts`), add a helper client in `shared/workflow/init/registerWorkflowActions.ts`, and schedule a Cron run per tenant via Temporal APIs (storing schedule metadata per tenant/edition). For CE, wire up a pg-boss recurring job on a daily schedule (every 1440 minutes, e.g. cron `0 0 * * *`) that invokes the same service without Temporal.
- Provide configuration knobs via env vars: `MICROSOFT_EMAIL_RENEWAL_LOOKAHEAD_MINUTES`, `MICROSOFT_EMAIL_RENEWAL_FAILURE_THRESHOLD`, `MICROSOFT_EMAIL_RENEWAL_BATCH_SIZE`.
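One way to read the three env vars with safe fallbacks is sketched below. The `RenewalConfig` type and both function names are hypothetical. The defaults of 1440 and 3 mirror numbers mentioned elsewhere in this plan, while the batch size of 50 is purely an assumption.

```typescript
interface RenewalConfig {
  lookAheadMinutes: number;
  failureThreshold: number;
  batchSize: number;
}

// 1440 and 3 echo values from this plan; 50 is an assumed placeholder.
const RENEWAL_DEFAULTS: RenewalConfig = {
  lookAheadMinutes: 1440,
  failureThreshold: 3,
  batchSize: 50,
};

// Accepts only positive integers; anything else falls back to the default.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

function loadRenewalConfig(env: Record<string, string | undefined>): RenewalConfig {
  return {
    lookAheadMinutes: parsePositiveInt(
      env["MICROSOFT_EMAIL_RENEWAL_LOOKAHEAD_MINUTES"],
      RENEWAL_DEFAULTS.lookAheadMinutes,
    ),
    failureThreshold: parsePositiveInt(
      env["MICROSOFT_EMAIL_RENEWAL_FAILURE_THRESHOLD"],
      RENEWAL_DEFAULTS.failureThreshold,
    ),
    batchSize: parsePositiveInt(
      env["MICROSOFT_EMAIL_RENEWAL_BATCH_SIZE"],
      RENEWAL_DEFAULTS.batchSize,
    ),
  };
}
```

In the service this would typically be called once at startup as `loadRenewalConfig(process.env)`. Falling back on invalid input, rather than throwing, keeps a typo in one variable from stopping renewals entirely.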
Phase 2 – Failure Handling, UI, and Alerts (1 sprint)
- Extend `EmailProviderService.updateProviderStatus` to mark providers `error` after `n` consecutive renewal failures and optionally disable inbound processing to avoid silent drops.
- Implement manual “Retry renewal” action + button in the Email Settings UI (likely in `server/src/components/admin/EmailSettings.tsx` or the newer settings tabs). Ensure dialog IDs satisfy `docs/AI_coding_standards.md`.
- Fire PostHog events (`email_provider.subscription_renewal_success` / `_failure`) inside EE builds (use `ee/server/src/lib/analytics/posthog.ts`).
- Add optional Slack/Email notifications via `SharedNotificationService` when a provider remains in error for >1 hour.
- Update docs: finish `docs/inbound-email/architecture/overall.md` with diagrams connecting webhooks → Redis → Temporal and highlight the renewal scheduler.
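The consecutive-failure rule can be modeled as a pure state transition, sketched here under several assumptions: `RenewalHealth` is a hypothetical in-memory view of an `email_provider_health` row, the `consecutiveFailures` counter is not yet part of the schema, and keeping the previous status while below the threshold is one possible policy rather than a decided one.

```typescript
type SubscriptionStatus = "healthy" | "renewing" | "error";

// Hypothetical view of a health row; field names loosely follow the Phase 0 columns.
interface RenewalHealth {
  subscriptionStatus: SubscriptionStatus;
  consecutiveFailures: number; // assumed counter, storage is still undecided
  lastRenewalAttemptAt: Date | null;
  lastRenewalResult: "success" | "failure" | null;
  failureReason: string | null;
}

type AttemptResult = { kind: "success" } | { kind: "failure"; reason: string };

// Returns the next health state after one renewal attempt.
function applyRenewalAttempt(
  prev: RenewalHealth,
  attempt: AttemptResult,
  now: Date,
  failureThreshold: number,
): RenewalHealth {
  if (attempt.kind === "success") {
    // Any success resets the counter and clears the previous failure reason.
    return {
      subscriptionStatus: "healthy",
      consecutiveFailures: 0,
      lastRenewalAttemptAt: now,
      lastRenewalResult: "success",
      failureReason: null,
    };
  }

  const consecutiveFailures = prev.consecutiveFailures + 1;
  return {
    // Escalate only once failures exceed the threshold (the plan says ">3 consecutive failures").
    subscriptionStatus: consecutiveFailures > failureThreshold ? "error" : prev.subscriptionStatus,
    consecutiveFailures,
    lastRenewalAttemptAt: now,
    lastRenewalResult: "failure",
    failureReason: attempt.reason,
  };
}
```

With a threshold of 3, the provider stays in its previous status through the third failure and flips to `error` on the fourth, which is also the natural point to emit the alert.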
Phase 3 – Testing, Migration & Rollout (0.5 sprint)
- Expand WireMock fixtures under `test-config/wiremock-oauth/microsoft-oauth.json` to cover renewal success, 404, and throttling responses.
- Add unit tests for `EmailWebhookMaintenanceService` (using mocked adapters) and integration tests that simulate expiring subscriptions to ensure the job renews them and records health rows.
- Add regression tests ensuring `EmailProcessor` still emits `INBOUND_EMAIL_RECEIVED` when jobs fire concurrently (queue + workflow).
- Create a rollout checklist: enable the Temporal schedule in staging, monitor `email_provider_health` metrics + workflow histories, then promote to production tenants gradually (toggle lookAhead from 720 → 1440 once stable).
- Provide a backfill job to enqueue an immediate renewal for all tenants after deployment to ensure consistent state.
7. Data & Schema Considerations
- `microsoft_email_provider_config` already stores `webhook_subscription_id`, `webhook_expires_at`, `webhook_verification_token`, and OAuth secrets. No schema changes needed beyond ensuring indexes exist (`server/migrations/20250818014500_index_ms_webhook_columns.cjs`).
- `email_provider_health` needs new columns enumerated in Phase 0; distribute the table in Citus migrations (`ee/server/migrations/citus/...`).
- Consider materializing a view (or updating `email_provider_configs`) that exposes `subscription_expires_at` so APIs/UI don’t need to read vendor tables directly.
- Store consecutive failure count either inside `email_provider_health` (simple integer) or as JSON metadata.
8. Testing Strategy
- Unit tests:
  - Mock Microsoft Graph responses (renew success, 404, throttling) to ensure the service retries/re-registers correctly.
  - Verify DB persistence logic updates both `microsoft_email_provider_config` and `email_provider_health`.
- Integration tests (`server/src/test/integration/...`):
  - Spin up WireMock (existing `test-config/wiremock-oauth/microsoft-oauth.json`) to simulate expiring subscriptions, then run the maintenance handler and assert webhook expiration extends.
  - Feed a fake webhook notification after renewal to ensure `EmailWebhookService` processes it end-to-end (Redis queue + Temporal stub).
- End-to-end smoke:
  - In staging, configure a Microsoft test tenant, wait for expiration threshold, verify job renews automatically (monitor logs + DB).
  - Validate UI shows updates and manual “Retry renewal” triggers the same service.
- Chaos testing:
  - Revoke Graph subscriptions manually and confirm the job recreates them.
  - Force invalid tokens to ensure renewal surfaces an actionable error and does not loop infinitely.
9. Risks & Mitigations
- Token expiration / revoked consent: Renewal will fail until OAuth tokens are refreshed. Mitigation: detect `invalid_grant` responses, mark provider as `error`, and notify admins to reauthorize.
- Schedule overlap causing double renewals: Use DB-level locking and per-provider idempotency (compare `webhook_expires_at` before updating) to avoid patch storms, and keep Temporal schedule concurrency at 1.
- Graph throttling: Batch renewals with exponential backoff + jitter; respect 429 retry-after headers and rely on Temporal activity retries (EE) or pg-boss retry policies (CE fallback) to spread load.
- Partial migrations (old vs. new tables): Ensure the service reads from canonical vendor tables but writes back to any aggregated view (`email_provider_configs`) used by other services to avoid stale data.
- Alert fatigue: Only alert after repeated failures and include remediation instructions (re-auth, contact Microsoft, etc.).
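To make the throttling mitigation concrete, the delay calculation for exponential backoff with jitter, floored by a 429 `Retry-After` value, can be sketched as below. The function name, option names, and any base/cap values are assumptions. In EE most of this would normally be delegated to Temporal activity retry policies rather than hand-rolled.

```typescript
interface BackoffOptions {
  baseMs: number; // assumed delay ceiling for the first retry
  maxMs: number; // assumed cap on the exponential term
}

// attempt is 1-based. retryAfterSeconds is the parsed Retry-After header from a
// 429 response, or null when absent. random is injectable for deterministic tests.
function computeRetryDelayMs(
  attempt: number,
  retryAfterSeconds: number | null,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  // Exponential growth: base, 2x base, 4x base, ... capped at maxMs.
  const exponential = Math.min(opts.maxMs, opts.baseMs * 2 ** (attempt - 1));

  // Full jitter: choose a delay uniformly in [0, exponential).
  const jittered = Math.floor(random() * exponential);

  // Never retry sooner than the server asked for.
  const serverFloorMs = retryAfterSeconds !== null ? retryAfterSeconds * 1000 : 0;
  return Math.max(jittered, serverFloorMs);
}
```

Jitter spreads retries from many providers so they do not hit Graph in lockstep, while the `Retry-After` floor ensures a throttled tenant is not retried earlier than Graph requested.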
10. Open Questions / Follow-Ups
- Should we migrate `email_provider_configs` (new unified table) to be the single source of truth before building the service, or can we rely on the vendor tables short-term?
- Do we need to renew subscriptions per mailbox folder (multiple `folder_filters`), or is the first entry sufficient? If multiple, we may need to store multiple subscription IDs per provider.
- How should we handle tenants with thousands of shared mailboxes? Do we throttle per tenant or globally?
- Should we raise a Temporal workflow signal/event when a provider enters `error` so downstream automations pause/resume gracefully?
- Do we also want to auto-delete orphaned subscriptions (Graph still has them, but provider removed)? Possibly add cleanup later.
Next Steps: get architecture/product sign-off on this plan, then create engineering tickets by phase (Phase 0 foundations first). Ensure the on-call runbook reflects the new job before rollout.