PSA/ee/docs/plans/2025-11-18-calendar-webhook-renewal-improvements.md

Microsoft Calendar Webhook Renewal Improvements

Date: November 18, 2025
Status: In Progress - Phases 1, 2, 3.1, 4.1, 4.2 Complete
Related: Email Subscription Renewal Plan, Calendar Integrations Completion Plan


Executive Summary

The Microsoft calendar webhook renewal implementation (calendarWebhookMaintenanceHandler.ts) is functional but less robust and less observable than the email webhook renewal system. Calendar renewals run every 30 minutes, which is frequent enough, but they lack the fallback recovery, health tracking, and structured error handling needed to prevent silent failures in production.


Current State Analysis

What Exists

  • Scheduled renewal job: Runs every 30 minutes via pg-boss (*/30 * * * *)
  • Basic renewal logic: MicrosoftCalendarAdapter.renewWebhookSubscription() successfully renews active subscriptions
  • Tenant scoping: Properly wrapped with runWithTenant()
  • Logging: Errors are logged with tenant/provider context

What's Missing - Mostly Fixed

1. No Fallback to Re-register on 404 - FIXED

Original behavior:

  • If renewWebhookSubscription() throws a 404 (subscription deleted/expired), the handler logs an error and moves on
  • The provider remains broken until manual intervention

Fixed Implementation:

  • Detects 404/ResourceNotFound errors via isResourceNotFoundError()
  • Automatically calls registerWebhookSubscription() to recreate the subscription
  • Updates the stored subscription ID and expiration

Status: Implemented in CalendarWebhookMaintenanceService.processCandidate()
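
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `WebhookAdapter` and `Subscription` shapes, the `describeError` helper, and the error fields checked inside `isResourceNotFoundError` (`statusCode`, `code`) are assumptions. Only the method names `renewWebhookSubscription`, `registerWebhookSubscription`, `isResourceNotFoundError`, and `processCandidate` come from the plan.

```typescript
// Hypothetical shapes for illustration; the real adapter and config types may differ.
interface Subscription {
  subscriptionId: string;
  expiresAt: Date;
}

interface WebhookAdapter {
  renewWebhookSubscription(): Promise<Subscription>;
  registerWebhookSubscription(): Promise<Subscription>;
}

interface CandidateOutcome {
  action: "renewed" | "recreated" | "failed";
  subscription?: Subscription;
  error?: string;
}

// Assumed error shape: an HTTP status of 404 or a "ResourceNotFound" code.
function isResourceNotFoundError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { statusCode?: unknown; code?: unknown };
  return e.statusCode === 404 || e.code === "ResourceNotFound";
}

// Hypothetical helper that turns any thrown value into a message.
function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

async function processCandidate(adapter: WebhookAdapter): Promise<CandidateOutcome> {
  try {
    // Normal path: extend the existing subscription.
    return { action: "renewed", subscription: await adapter.renewWebhookSubscription() };
  } catch (err) {
    if (!isResourceNotFoundError(err)) {
      // Anything other than "not found" is reported, not recreated.
      return { action: "failed", error: describeError(err) };
    }
  }
  try {
    // Fallback: the subscription is gone, so register a new one.
    return { action: "recreated", subscription: await adapter.registerWebhookSubscription() };
  } catch (err) {
    return { action: "failed", error: describeError(err) };
  }
}
```

In this sketch the caller would persist `subscription.subscriptionId` and `subscription.expiresAt` whenever the action is `renewed` or `recreated`.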

2. No Handling for Missing Subscriptions - FIXED

Original behavior:

  • Skips providers without webhookExpiresAt
  • No attempt to register a subscription if one doesn't exist

Fixed Implementation:

  • Checks for missing webhook_subscription_id in findRenewalCandidates()
  • Automatically registers a new subscription if missing via recreateSubscription()

Status: Implemented in CalendarWebhookMaintenanceService.findRenewalCandidates() and processCandidate()
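
As a rough illustration of the selection rule, the sketch below filters an in-memory list instead of querying the database. The `ProviderConfigRow` shape, the `classifyCandidate` helper, and the `lookaheadMs` parameter are assumptions; the plan does not state how far ahead of expiry a subscription becomes a renewal candidate.

```typescript
// Hypothetical row shape; field names mirror the plan's columns in camelCase.
interface ProviderConfigRow {
  providerId: string;
  webhookSubscriptionId: string | null;
  webhookExpiresAt: Date | null;
}

type CandidateReason = "missing_subscription" | "expiring";

// lookaheadMs is an assumed tuning knob, not a value taken from the plan.
function classifyCandidate(
  row: ProviderConfigRow,
  now: Date,
  lookaheadMs: number
): CandidateReason | null {
  if (!row.webhookSubscriptionId || row.webhookExpiresAt === null) {
    return "missing_subscription"; // needs registration, not renewal
  }
  const msLeft = row.webhookExpiresAt.getTime() - now.getTime();
  return msLeft <= lookaheadMs ? "expiring" : null;
}

// In-memory stand-in for the DB query; the real method also takes row locks.
function findRenewalCandidates(
  rows: ProviderConfigRow[],
  now: Date,
  lookaheadMs: number
): ProviderConfigRow[] {
  return rows.filter((row) => classifyCandidate(row, now, lookaheadMs) !== null);
}
```

Rows classified as `missing_subscription` would go straight to registration, while `expiring` rows take the normal renewal path.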

3. No Health Status Tracking - FIXED

Original behavior:

  • No equivalent to email_provider_health table
  • Renewal success/failure is only in logs
  • No way to query "which providers have failing renewals?"

Fixed Implementation:

  • calendar_provider_health table created with migration 20251118120000_create_calendar_provider_health.cjs
  • Tracks:
    • subscription_status (healthy, renewing, error)
    • subscription_expires_at
    • last_renewal_attempt_at
    • last_renewal_result (success/failure)
    • failure_reason
    • last_webhook_received_at
    • consecutive_failure_count
  • Enables UI dashboards and alerting

Status: Fully implemented

4. No Service Layer Abstraction - FIXED

Original behavior:

  • Handler function directly calls adapter methods
  • Logic is tightly coupled to the job handler

Fixed Implementation:

  • CalendarWebhookMaintenanceService class created
  • Encapsulates:
    • Candidate discovery with DB queries
    • Renewal/re-registration orchestration
    • Health status updates
    • Error classification (404 detection)
  • Reusable by UI actions, CLI tools, and scheduled jobs

Status: Fully implemented and handler updated to use service

5. Limited Error Classification - FIXED

Original behavior:

  • All errors are treated the same
  • No distinction between recoverable (404) vs. permanent (invalid token) failures

Fixed Implementation:

  • isResourceNotFoundError() helper detects 404/ResourceNotFound
  • Differentiates between recoverable and permanent failures
  • Marks providers as error only after 3+ consecutive failures

Status: Fully implemented

6. No Structured Renewal Results - FIXED

Original behavior:

  • Handler returns void
  • No way to track which providers were processed or their outcomes

Fixed Implementation:

  • Returns RenewalResult[] with:
    • providerId, tenant, success, action (renewed/recreated/failed)
    • newExpiration, error (if failed)
  • Enables batch reporting and UI feedback

Status: Fully implemented
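
To make the batch-reporting use concrete, here is one possible shape for the result type and a summary helper. The field names follow the list above; the exact TypeScript types, the `RenewalSummary` shape, and `summarizeResults` are assumptions added for illustration.

```typescript
// Field names follow the plan; the exact types are assumptions.
interface RenewalResult {
  providerId: string;
  tenant: string;
  success: boolean;
  action: "renewed" | "recreated" | "failed";
  newExpiration?: Date;
  error?: string;
}

// Hypothetical roll-up for logs or UI feedback.
interface RenewalSummary {
  renewed: number;
  recreated: number;
  failed: number;
  failedProviderIds: string[];
}

function summarizeResults(results: RenewalResult[]): RenewalSummary {
  const summary: RenewalSummary = { renewed: 0, recreated: 0, failed: 0, failedProviderIds: [] };
  for (const r of results) {
    summary[r.action] += 1; // each action name doubles as a counter key
    if (r.action === "failed") summary.failedProviderIds.push(r.providerId);
  }
  return summary;
}
```

A scheduled job could log the summary once per run, while a manual retry could show the single result directly.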

7. No Manual Renewal Action - FIXED

Fixed Implementation:

  • retryMicrosoftCalendarSubscriptionRenewal() server action created
  • Includes RBAC permission checks
  • Returns structured results for UI feedback

Status: Server action complete, UI integration pending

8. No PostHog Telemetry - FIXED

Fixed Implementation:

  • PostHog events emitted (EE only): calendar_provider.subscription_renewal_success / _failure
  • Includes tenant/provider dimensions for dashboards

Status: Fully implemented (EE edition only)


Phase 1: Service Layer & Fallback Recovery (High Priority) - COMPLETE

1.1 Create CalendarWebhookMaintenanceService

  • Mirror EmailWebhookMaintenanceService structure
  • Location: server/src/services/calendar/CalendarWebhookMaintenanceService.ts
  • Methods:
    • renewMicrosoftWebhooks(options) - Main entry point
    • findRenewalCandidates() - Query with DB locking
    • processCandidate() - Renew or re-register per provider
    • recreateSubscription() - Fallback registration
    • isResourceNotFoundError() - Error classification
    • updateProviderStatus() - Update calendar_providers.status on failures

1.2 Add 404 Fallback Logic

  • In processCandidate(), catch 404 errors from renewWebhookSubscription()
  • Call adapter.registerWebhookSubscription() to recreate
  • Update microsoft_calendar_provider_config with new subscription ID

1.3 Handle Missing Subscriptions

  • In findRenewalCandidates(), include providers with:
    • webhook_subscription_id null/empty
    • webhook_expires_at null
  • Attempt registration during processCandidate()

Deliverables:

  • Service class created
  • Updated handler to use service (calendarWebhookMaintenanceHandler.ts)
  • Integration tests for 404 recovery and missing subscription registration (pending)

Phase 2: Health Tracking & Observability (Medium Priority) - COMPLETE

2.1 Create calendar_provider_health Table

  • Migration: server/migrations/20251118120000_create_calendar_provider_health.cjs
  • Columns:
    • calendar_provider_id (UUID, FK to calendar_providers.id)
    • tenant (UUID, FK to tenants.tenant)
    • subscription_status (enum: healthy, renewing, error)
    • subscription_expires_at (timestamp)
    • last_renewal_attempt_at (timestamp)
    • last_renewal_result (string: success/failure)
    • failure_reason (text)
    • last_webhook_received_at (timestamp)
    • consecutive_failure_count (integer) - for threshold tracking
  • Indexes: (tenant, subscription_status), (calendar_provider_id, tenant), (subscription_expires_at)

2.2 Update Service to Track Health

  • updateHealthStatus() method writes to calendar_provider_health
  • Called after each renewal attempt (success or failure)
  • Upsert pattern (insert or update)
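
The upsert can be pictured with an in-memory stand-in for the table. This is a sketch under assumptions: the column names come from the plan, but the TypeScript types, the `RenewalAttempt` input, the `HealthStore` class, and the choice to set `subscription_status` to `error` on the first failed attempt are illustrative, not confirmed behavior.

```typescript
type SubscriptionStatus = "healthy" | "renewing" | "error";

// Column names come from the plan; the TypeScript types are assumptions.
interface CalendarProviderHealth {
  calendar_provider_id: string;
  tenant: string;
  subscription_status: SubscriptionStatus;
  subscription_expires_at: Date | null;
  last_renewal_attempt_at: Date | null;
  last_renewal_result: "success" | "failure" | null;
  failure_reason: string | null;
  last_webhook_received_at: Date | null;
  consecutive_failure_count: number;
}

// Hypothetical input describing one renewal attempt.
interface RenewalAttempt {
  calendar_provider_id: string;
  tenant: string;
  attemptedAt: Date;
  succeeded: boolean;
  newExpiration?: Date;
  failureReason?: string;
}

// In-memory stand-in for the table, keyed by tenant and provider.
class HealthStore {
  private rows = new Map<string, CalendarProviderHealth>();

  updateHealthStatus(attempt: RenewalAttempt): CalendarProviderHealth {
    const key = `${attempt.tenant}:${attempt.calendar_provider_id}`;
    const prev = this.rows.get(key);
    const next: CalendarProviderHealth = {
      calendar_provider_id: attempt.calendar_provider_id,
      tenant: attempt.tenant,
      // Assumption: a failed attempt marks the health row as "error" right away.
      subscription_status: attempt.succeeded ? "healthy" : "error",
      subscription_expires_at: attempt.newExpiration ?? prev?.subscription_expires_at ?? null,
      last_renewal_attempt_at: attempt.attemptedAt,
      last_renewal_result: attempt.succeeded ? "success" : "failure",
      failure_reason: attempt.succeeded ? null : (attempt.failureReason ?? "unknown"),
      // Preserved across renewals; only the webhook route would change it.
      last_webhook_received_at: prev?.last_webhook_received_at ?? null,
      // Reset on success, otherwise count up from the previous row.
      consecutive_failure_count: attempt.succeeded ? 0 : (prev?.consecutive_failure_count ?? 0) + 1,
    };
    this.rows.set(key, next); // insert or overwrite, i.e. the upsert
    return next;
  }
}
```

Against a real database the same merge would typically be a single insert-or-update statement on the (calendar_provider_id, tenant) key.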

2.3 Instrument Webhook Route

  • Update server/src/app/api/calendar/webhooks/microsoft/route.ts
  • Write last_webhook_received_at to health table on successful webhook receipt
  • Enables detection of silent failures (subscription exists but no notifications)
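
One way the recorded timestamp could be used to flag silent failures is sketched below. The `HealthSnapshot` shape, the `isSilentlyFailing` function, and the `maxSilenceMs` threshold are hypothetical; the plan only says that `last_webhook_received_at` makes this kind of check possible.

```typescript
// Hypothetical subset of the health row needed for the check.
interface HealthSnapshot {
  subscription_expires_at: Date | null;
  last_webhook_received_at: Date | null;
}

// maxSilenceMs is an assumed threshold; the plan does not specify one.
function isSilentlyFailing(health: HealthSnapshot, now: Date, maxSilenceMs: number): boolean {
  const expiresAt = health.subscription_expires_at;
  const active = expiresAt !== null && expiresAt.getTime() > now.getTime();
  if (!active) return false; // expired or missing subscriptions are handled by renewal
  if (health.last_webhook_received_at === null) return true; // active but never heard from
  return now.getTime() - health.last_webhook_received_at.getTime() > maxSilenceMs;
}
```

A calendar with no changes also produces no notifications, so a result of `true` is a signal worth investigating rather than proof of a broken subscription.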

Deliverables:

  • Migration with health table
  • Service updates health on every renewal
  • Webhook route instrumentation

Phase 3: UI & Manual Controls (Medium Priority) 🔄 PARTIAL

3.1 Server Action for Manual Renewal

  • server/src/lib/actions/calendarActions.ts
  • retryMicrosoftCalendarSubscriptionRenewal(providerId: string)
  • Calls CalendarWebhookMaintenanceService.renewMicrosoftWebhooks({ providerId })
  • Returns structured result for UI feedback
  • Includes RBAC permission checks

3.2 UI Updates

  • CalendarIntegrationsSettings.tsx or related component
  • Show "Subscription expires in Xh" column (from health table)
  • Add "Retry Renewal" button per provider
  • Display last renewal result and failure reason if error
  • Disable button while renewal is in progress
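
The expiry column text could be derived with a small pure function like the one below. The function name and the fallback strings for missing or expired subscriptions are assumptions; only the "Subscription expires in Xh" wording comes from the plan.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Hypothetical formatter; only the "expires in Xh" wording is from the plan.
function formatExpiryLabel(expiresAt: Date | null, now: Date): string {
  if (expiresAt === null) return "No active subscription";
  const msLeft = expiresAt.getTime() - now.getTime();
  if (msLeft <= 0) return "Subscription expired";
  const hours = Math.floor(msLeft / MS_PER_HOUR);
  // Avoid showing "0h" for subscriptions that expire within the hour.
  return hours < 1 ? "Subscription expires in <1h" : `Subscription expires in ${hours}h`;
}
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to unit test.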

Deliverables:

  • Server action with error handling
  • UI components showing renewal status (pending)
  • Manual retry button with feedback (pending)

Phase 4: Error Handling & Alerting (Low Priority) 🔄 PARTIAL

4.1 Mark Providers as Error After Repeated Failures

  • Track consecutive failure count in health table (consecutive_failure_count)
  • After 3+ consecutive failures, set calendar_providers.status = 'error'
  • Update error_message with actionable guidance
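
The threshold rule can be sketched as a pure function that decides whether a provider status update is needed. The function name, the `ProviderStatusUpdate` shape, and the message wording are assumptions; the threshold of 3 and the remediation hints (re-authorize OAuth, check the webhook URL) are taken from this plan.

```typescript
// From the plan: mark the provider as error after 3+ consecutive failures.
const FAILURE_THRESHOLD = 3;

// Hypothetical update payload for calendar_providers.
interface ProviderStatusUpdate {
  status: "error";
  error_message: string;
}

function statusUpdateForFailures(
  consecutiveFailureCount: number,
  lastFailureReason: string
): ProviderStatusUpdate | null {
  if (consecutiveFailureCount < FAILURE_THRESHOLD) return null; // still below threshold
  return {
    status: "error",
    error_message:
      `Webhook renewal failed ${consecutiveFailureCount} times in a row (${lastFailureReason}). ` +
      "Re-authorize the Microsoft connection or check the webhook URL.",
  };
}
```

Returning `null` below the threshold lets the caller skip the write entirely instead of rewriting an unchanged status.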

4.2 Structured Logging & Events

  • Emit PostHog events (EE): calendar_provider.subscription_renewal_success / _failure
  • Include tenant/provider dimensions for dashboards
  • Log renewal attempts with structured context (expiry time, action taken)
  • Only enabled when EDITION === 'enterprise'
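
The edition gate and event naming can be illustrated without depending on a specific analytics client. In this sketch the `CaptureFn` callback, the `RenewalOutcome` shape, and the property names are hypothetical stand-ins for whatever client and schema the codebase uses; the event names and the `enterprise` check come from the plan.

```typescript
// Hypothetical stand-in for the analytics client's capture call.
type CaptureFn = (event: string, properties: Record<string, string>) => void;

// Hypothetical outcome shape; property names are illustrative.
interface RenewalOutcome {
  tenant: string;
  providerId: string;
  success: boolean;
  action: string;
}

// Returns true when an event was emitted, false when the edition gate blocked it.
function emitRenewalEvent(
  edition: string | undefined,
  capture: CaptureFn,
  outcome: RenewalOutcome
): boolean {
  if (edition !== "enterprise") return false; // telemetry is EE only
  const suffix = outcome.success ? "success" : "failure";
  capture(`calendar_provider.subscription_renewal_${suffix}`, {
    tenant: outcome.tenant,
    provider_id: outcome.providerId,
    action: outcome.action,
  });
  return true;
}
```

Injecting the capture function keeps the gate testable and keeps non-enterprise builds free of any analytics call.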

4.3 Alerting Integration

  • Hook into existing notification system for repeated failures
  • Alert operators when provider enters error state
  • Include remediation steps (re-authorize OAuth, check webhook URL)

Deliverables:

  • Failure threshold logic
  • PostHog instrumentation (EE)
  • Alert integration (pending)

Comparison Table

| Feature | Email Implementation | Calendar Implementation | Gap |
|---|---|---|---|
| Scheduled renewal | Daily (pg-boss) | Every 30 min (pg-boss) | None |
| 404 fallback | Auto re-register | Auto re-register | Fixed |
| Missing subscription handling | Auto register | Auto register | Fixed |
| Health tracking table | email_provider_health | calendar_provider_health | Fixed |
| Service layer | EmailWebhookMaintenanceService | CalendarWebhookMaintenanceService | Fixed |
| Manual renewal action | retryMicrosoftSubscriptionRenewal | retryMicrosoftCalendarSubscriptionRenewal | Fixed |
| UI status display | Subscription expiry column | Pending | Medium |
| Error classification | 404 vs. permanent | 404 vs. permanent | Fixed |
| Structured results | RenewalResult[] | RenewalResult[] | Fixed |
| Failure threshold | 3+ failures → error | 3+ failures → error | Fixed |
| PostHog events | EE telemetry | EE telemetry | Fixed |

Implementation Priority

  1. Phase 1 (Critical): Service layer + 404 fallback + missing subscription handling

    • Prevents silent failures
    • Enables automatic recovery
    • Estimated effort: 1 sprint
  2. Phase 2 (High): Health tracking table + service updates

    • Enables observability
    • Foundation for UI/alerting
    • Estimated effort: 0.5 sprint
  3. Phase 3 (Medium): UI + manual controls

    • Operator self-service
    • Better UX
    • Estimated effort: 0.5 sprint
  4. Phase 4 (Low): Error thresholds + alerting

    • Production hardening
    • Proactive incident response
    • Estimated effort: 0.5 sprint

Testing Strategy

Unit Tests

  • Mock MicrosoftCalendarAdapter responses (success, 404, permanent error)
  • Verify service handles all cases correctly
  • Test error classification logic

Integration Tests

  • WireMock fixtures for Microsoft Graph (renew success, 404, throttling)
  • Simulate expired/missing subscriptions
  • Verify DB updates (health table, provider config)

End-to-End Smoke

  • Configure test tenant with Microsoft calendar
  • Wait for renewal window
  • Verify automatic renewal + health tracking
  • Manually trigger renewal via UI action

Migration Considerations

  • Backfill health table: For existing providers, create initial health rows with current expiry times
  • Gradual rollout: Enable service layer first, then add health tracking, then UI
  • Monitoring: Watch renewal success rates before/after changes to validate improvements

Open Questions

  1. Should calendar providers also support Temporal workflows (EE) like email, or is pg-boss sufficient?

Answer: Use Temporal.

  2. Do we need a separate health table, or can we extend calendar_providers with renewal fields?

Answer: Left to the implementer. Phase 2 created a separate calendar_provider_health table.

  3. Should we track webhook receipt timestamps in the health table (like email) to detect silent failures?

Answer: Yes.

  4. What is the desired failure threshold before marking a provider as error? (Email uses 3+ consecutive failures.)

Answer: Match email (3+ consecutive failures).


Success Criteria

  • Calendar webhook renewals automatically recover from 404 errors
  • Providers with missing subscriptions are automatically registered
  • Operators can see renewal status and last renewal time in UI (pending UI work)
  • Manual renewal action available from settings page (server action ready)
  • Health table enables alerting on repeated failures
  • Integration tests cover all renewal scenarios (pending)

Phase Status:

  • Phase 1 Complete - Service layer + 404 fallback + missing subscription handling
  • Phase 2 Complete - Health tracking table + service updates + webhook instrumentation
  • Phase 3 Partial - Server action complete, UI updates pending
  • Phase 4 Partial - Failure thresholds + PostHog events complete, alerting integration pending

Remaining Work:

  • UI components for displaying renewal status and manual retry button (Phase 3.2)
  • Alert integration for repeated failures (Phase 4.3)
  • Integration tests for renewal scenarios
  • Temporal workflow support for EE (per the answer to Open Question 1)