Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
17 KiB
PRD: Teams Notification Delivery and Action Audit Observability
Plan slug: 2026-05-24-teams-observability-loop
Owning area: EE / Microsoft Teams addon (ee/packages/microsoft-teams)
Related plan: .ai/teams_improvements/microsoft-teams-addon-competitive-parity-plan.md ("Best One-PR Agent Loop")
Problem Statement
The Microsoft Teams addon (ADD_ONS.TEAMS) ships functional bot, message-extension, quick-action, and meeting features, but has no persisted record of what notifications it attempted, what mutations it performed, or why anything failed. This is the blocker for every Phase 2+ Teams feature (health dashboard, diagnostics, channel routing) because none of them have data to read.
Additionally, proactive channel messaging (Phase 2) requires Bot Framework ConversationReference + serviceUrl to be captured on first contact. Capturing now (even if not yet consumed) means Phase 2 starts with a populated table instead of waiting for users to interact again.
User Value
- Admins can diagnose failed Teams notifications without reading server logs.
- Engineers can answer "did this notification go out?" deterministically.
- Security/compliance has an attributable audit trail for every Teams-originated PSA mutation.
- Phase 2+ features are unblocked with real production data on Day 1.
Goals
- Every Teams notification attempt (skipped, sent, delivered, failed) creates a row in
teams_notification_deliveries. - Every Teams-originated PSA mutation creates a row in
teams_audit_events. - Bot
ConversationReference+serviceUrlare captured on first inbound activity per (tenant, microsoft_user_id, conversation_id). - All new tables follow Citus multi-tenant rules:
tenantin PK, every WHERE, every join; distributed ontenant. - All new tables are registered in the tenant deletion temporal workflow so tenant offboarding is complete.
- Read-only server actions (
listTeamsDeliveries,listTeamsAuditEvents) exist behindwithAuth+ permission check.
Non-Goals (Explicit)
- No UI changes. No health dashboard. No admin diagnostics page.
- No channel mapping table. No setup wizard. No configurable channel tab.
- No 402/403 entitlement response standardization.
- No trial flow, per-seat metering, billing changes.
- No changes to existing notification/mutation behavior — pure instrumentation.
- No LLM/fuzzy intent matching.
- No bot SSO token exchange.
- No new metrics export, no Prometheus, no log shipping changes.
Target Users
- MSP admins querying the new tables via server action (for now — UI in Phase 2).
- Engineers debugging Teams delivery in staging/production.
- Future Phase 2 health dashboard as a data consumer.
Primary Flows
Flow A: Notification delivery (instrumented)
deliverTeamsNotificationImpl()is invoked from notification pipeline.- Function computes idempotency key = SHA-256(
internal_notification_id+tenant+destination_type+destination_id+attempt_number). - Before returning, function writes one row to
teams_notification_deliverieswithstatus ∈ {skipped, sent, delivered, failed}and structurederror_codeenum value when applicable. - If a row with the same
idempotency_keyalready exists, the insert is a no-op (ON CONFLICT DO NOTHING).
Flow B: Teams mutation audit
teamsActionRegistrydispatches one of:assign_ticket,add_note,reply_to_contact,log_time,approval_response,create_ticket_from_message,update_from_message.- After dispatch (success or caught error), one row is written to
teams_audit_events. - Payload is not stored — only SHA-256 hash of canonicalized JSON payload plus safe metadata (actor, target, surface, action, result, error code).
Flow C: ConversationReference capture
- Bot receives any inbound activity (message, invoke, conversationUpdate).
- Handler upserts
teams_conversation_referenceskeyed on (tenant,microsoft_user_id,conversation_id) with currentservice_url,conversation_type,tenant_id_aad,updated_at. - Phase 2 proactive messaging consumes this table; this PR only writes.
Flow D: Admin read
- Admin calls server action
listTeamsDeliveries({ filter, limit, cursor })orlistTeamsAuditEvents({ filter, limit, cursor }). withAuthresolves tenant from session;hasPermission(user, 'teams_integration', 'read')gates access.- Action returns paginated rows scoped to the authenticated tenant.
Data Model
Table: teams_notification_deliveries
| Column | Type | Notes |
|---|---|---|
tenant |
uuid NOT NULL | Citus distribution key; in PK |
delivery_id |
uuid NOT NULL | In PK |
internal_notification_id |
uuid | FK to internal_notifications.id (tenant, id) — soft FK if cross-table cycles |
category |
text | Enum: assignment / customer_reply / approval_request / escalation / sla_risk |
destination_type |
text NOT NULL | Enum: user_activity / chat / channel |
destination_id |
text NOT NULL | Microsoft user id / chat id / channel id |
attempt_number |
int NOT NULL DEFAULT 1 | |
idempotency_key |
text NOT NULL | UNIQUE per (tenant, idempotency_key) |
provider_message_id |
text | Activity feed/chat/channel message id from Graph |
status |
text NOT NULL | Enum: skipped / sent / delivered / failed |
error_code |
text | Enum: see below |
error_message |
text | Free text, truncated to 1KB |
retryable |
boolean | |
provider_request_id |
text | Graph request-id header when available |
sent_at |
timestamptz | |
delivered_at |
timestamptz | |
responded_at |
timestamptz | |
created_at |
timestamptz NOT NULL DEFAULT now() |
Primary key: (tenant, delivery_id).
Unique: (tenant, idempotency_key) — supports INSERT … ON CONFLICT DO NOTHING.
Index: (tenant, internal_notification_id), (tenant, status, created_at DESC).
Distribution: create_distributed_table('teams_notification_deliveries', 'tenant', colocate_with => 'teams_integrations').
Table: teams_audit_events
| Column | Type | Notes |
|---|---|---|
tenant |
uuid NOT NULL | Citus distribution key; in PK |
event_id |
uuid NOT NULL | In PK |
actor_user_id |
uuid | PSA user id resolved from token |
microsoft_user_id |
text | Teams aadObjectId if available |
surface |
text NOT NULL | Enum: bot / message_extension / quick_action / tab |
action_id |
text NOT NULL | Enum: 7 mutation actions |
target_type |
text | e.g., ticket, time_entry, approval |
target_id |
text | |
idempotency_key |
text | Same key Teams sent (if any), used for retried invokes |
payload_hash |
text | SHA-256 of canonicalized JSON input |
result_status |
text NOT NULL | Enum: success / failure |
error_code |
text | Action-error taxonomy enum |
created_at |
timestamptz NOT NULL DEFAULT now() |
Primary key: (tenant, event_id).
Index: (tenant, actor_user_id, created_at DESC), (tenant, target_type, target_id).
Distribution: create_distributed_table('teams_audit_events', 'tenant', colocate_with => 'teams_integrations').
Table: teams_conversation_references
| Column | Type | Notes |
|---|---|---|
tenant |
uuid NOT NULL | Citus distribution key; in PK |
microsoft_user_id |
text NOT NULL | aadObjectId from activity |
conversation_id |
text NOT NULL | Bot Framework conversation id |
conversation_type |
text NOT NULL | personal / groupChat / channel |
service_url |
text NOT NULL | Bot Framework service URL — required for proactive messages |
tenant_id_aad |
text | AAD tenant id (Microsoft side) |
channel_id_bot_framework |
text | Always msteams for our use, kept for forward-compat |
last_activity_at |
timestamptz NOT NULL | |
created_at |
timestamptz NOT NULL DEFAULT now() | |
updated_at |
timestamptz NOT NULL DEFAULT now() |
Primary key: (tenant, microsoft_user_id, conversation_id).
Distribution: create_distributed_table('teams_conversation_references', 'tenant', colocate_with => 'teams_integrations').
Error code taxonomy (delivery)
Single text enum constraint, no separate type:
graph_throttled(429)graph_unauthorized(401/403)graph_not_found(404)graph_server_error(5xx)user_not_mapped(no Microsoft user link)addon_inactive(entitlement gate hit)integration_inactive(teams_integrations.install_status != 'active')package_misconfigured(missing base URL / app id)transient(network / timeout)unknown
The migration enforces the value with a CHECK (error_code IS NULL OR error_code IN (...)) constraint to allow forward-compat additions without an enum type rebuild.
Retention Strategy
Decision: Range-partition teams_notification_deliveries by created_at (monthly).
- Migration creates the parent table partitioned by
RANGE (created_at)plus the next 3 months of child partitions. - A separate small migration creates
cleanup_teams_notification_deliveries(retention_interval)PL/pgSQL function that drops partitions older thannow() - retention_interval. Default:90 days. teams_audit_eventsis not partitioned in this PR. Audit retention is typically longer (1y+) and partitioning can be added later without backfill. Add acleanup_teams_audit_events(interval)function with default365 daysso the knob exists.teams_conversation_referencesis not partitioned — it's an upsert-keyed table, expected to stay small (one row per (user, conversation)).
Citus note: Partitioned distributed tables in Citus require each partition to also be distributed. The migration uses create_distributed_table on the parent and Citus auto-distributes children. We verify this with a smoke query in the migration.
Tenant Deletion Integration (CRITICAL)
File: ee/temporal-workflows/src/activities/tenant-deletion-activities.ts
Constant: TENANT_TABLES_DELETION_ORDER (currently line 36+).
Required edits
Add the three new tables before teams_integrations (line 94) so FK ordering holds. Group them with the existing Microsoft profile bindings comment block.
// Microsoft profile bindings (dependents before profile definitions)
- 'microsoft_profile_consumer_bindings', 'teams_integrations', 'microsoft_profiles',
+ 'microsoft_profile_consumer_bindings',
+ 'teams_notification_deliveries', 'teams_audit_events', 'teams_conversation_references',
+ 'teams_integrations', 'microsoft_profiles',
Why this placement
- Partitioned table deletion: when the parent (
teams_notification_deliveries) is deleted withWHERE tenant=?, partition pruning handles children. No need to enumerate each monthly partition. - All three new tables reference
(tenant, …)fromteams_integrationsvia tenant column only (no hard FK to integration row), so they can be deleted beforeteams_integrationswithout FK violation regardless. - Listing them explicitly (not relying on CASCADE) matches the existing pattern in the deletion order constant.
Verification step in the plan
A dedicated feature/test confirms a tenant-deletion dry run on a seeded tenant removes rows from all three new tables and that order conflicts are absent.
API Surface
Server actions (new)
Both live in ee/packages/microsoft-teams/src/lib/actions/integrations/teamsObservabilityActions.ts:
export const listTeamsDeliveries = withAuth(async (user, { tenant }, params: {
status?: 'skipped' | 'sent' | 'delivered' | 'failed';
category?: string;
since?: string; // ISO datetime
limit?: number; // default 50, max 200
cursor?: string; // opaque pagination cursor
}) => { /* … */ });
export const listTeamsAuditEvents = withAuth(async (user, { tenant }, params: {
surface?: 'bot' | 'message_extension' | 'quick_action' | 'tab';
action_id?: string;
actor_user_id?: string;
result_status?: 'success' | 'failure';
since?: string;
limit?: number;
cursor?: string;
}) => { /* … */ });
Both require hasPermission(user, 'teams_integration', 'read').
No new HTTP routes
These server actions are callable from the existing settings/admin context. No /api/teams/* additions in this PR.
Multi-Tenant / Citus Compliance
- All three tables include
tenantin the primary key. - All inserts/queries include
tenantin WHERE. - All tables distributed on
tenant, colocated withteams_integrations. - No cross-tenant unique constraints.
idempotency_keyuniqueness is scoped per-tenant viaUNIQUE (tenant, idempotency_key).- Reads in server actions use
withAuth→runWithTenant→createTenantKnex()to set tenant context automatically.
Privacy / Security
- No raw payloads stored.
payload_hashis SHA-256 of canonicalized JSON; original payload is GC'd by the action handler. error_messagetruncated to 1KB to avoid leaking large customer text.- Server actions gated by
hasPermission(user, 'teams_integration', 'read'). - Bot Framework JWT validation is not part of this PR (separate audit); assumed already enforced by existing bot handler.
Risks
- Citus partitioned distributed tables. Combining
PARTITION BY RANGEwithcreate_distributed_tablehas known sharp edges. Mitigation: smoke-test in CE migration first (without Citus), then verify on a Citus-enabled staging environment. - Insert overhead on hot notification path. ~1 extra insert per notification. Mitigation: insert is fire-and-forget within the same transaction; failures logged but do not break delivery.
- Idempotency-key collisions. SHA-256 across 5 components → collision risk negligible, but
ON CONFLICT DO NOTHINGensures a duplicate write is safe. - Tenant deletion ordering. Documented above; verified by feature F-TD-01.
- Worker rebuild scope creep.
deliverTeamsNotificationImplmay be called fromservices/workflow-worker. PR must verify and rebuild worker if so.
Rollout / Migration
- Migrations are forward-only.
down()drops tables (safe — no data depends on them in CE/prod yet). - Deploy order: run migrations → deploy
@alga-psa/microsoft-teamsrebuild → restart server → restart workflow-worker (if it imports the package). - No feature flag needed; instrumentation is internal.
- No user-visible changes — no docs/release-notes entry required beyond a one-line internal changelog.
Acceptance Criteria
- Three migrations land in
ee/server/migrations/(CE-compatible) andee/server/migrations/citus/(where Citus-specific logic differs). deliverTeamsNotificationImpl()writes exactly one row per terminal outcome with correctstatusanderror_code.- Each of the 7 mutation actions in
teamsActionRegistry.tswrites exactly one audit row on success and one on failure. - First inbound bot activity from a (tenant, user, conversation) tuple writes a
teams_conversation_referencesrow; subsequent activities upsert. tenant-deletion-activities.tslists all three new tables inTENANT_TABLES_DELETION_ORDER, beforeteams_integrations.- Tenant deletion test (existing integration test) removes all rows from new tables.
listTeamsDeliveries/listTeamsAuditEventsreturn only rows for the authenticated tenant; cross-tenant attempt is rejected by query construction.- All new queries include
tenantin WHERE; verified by grep + code review. - No test mocks the database — integration tests hit a real DB per project convention.
- No raw payload text is persisted; verified by grepping inserts.
Open Questions
- Should
cleanup_*functions be invoked by anything in this PR? Default: no — just create the functions, leave invocation to an operator/cron in a follow-up. Confirmed in scope as "knob exists" only. - Permission key
teams_integration:read— does it exist? If not, the PR adds it to the permission seeder. Action: grep first; add only if missing. - Conversation reference persistence — table vs JSON on
teams_integrations? Decided: separate table (Phase 2 will index by user/conversation and a JSON column on a single-row-per-tenant table is the wrong shape). - Should
internal_notification_idbe a hard FK? Decided: no.internal_notificationsis large and FK across distributed shards is expensive; rely on tenant-scoped soft reference.