Hermes 284313f908
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00


# PRD: Teams Notification Delivery and Action Audit Observability
**Plan slug:** `2026-05-24-teams-observability-loop`
**Owning area:** EE / Microsoft Teams addon (`ee/packages/microsoft-teams`)
**Related plan:** `.ai/teams_improvements/microsoft-teams-addon-competitive-parity-plan.md` ("Best One-PR Agent Loop")
## Problem Statement
The Microsoft Teams addon (`ADD_ONS.TEAMS`) ships functional bot, message-extension, quick-action, and meeting features, but has **no persisted record of what notifications it attempted, what mutations it performed, or why anything failed.** This is the blocker for every Phase 2+ Teams feature (health dashboard, diagnostics, channel routing) because none of them have data to read.
Additionally, proactive channel messaging (Phase 2) requires Bot Framework `ConversationReference` + `serviceUrl` to be captured on first contact. Capturing now (even if not yet consumed) means Phase 2 starts with a populated table instead of waiting for users to interact again.
## User Value
- **Admins** can diagnose failed Teams notifications without reading server logs.
- **Engineers** can answer "did this notification go out?" deterministically.
- **Security/compliance** has an attributable audit trail for every Teams-originated PSA mutation.
- **Phase 2+** features are unblocked with real production data on Day 1.
## Goals
1. Every Teams notification attempt (skipped, sent, delivered, failed) creates a row in `teams_notification_deliveries`.
2. Every Teams-originated PSA mutation creates a row in `teams_audit_events`.
3. Bot `ConversationReference` + `serviceUrl` are captured on first inbound activity per (tenant, microsoft_user_id, conversation_id).
4. All new tables follow Citus multi-tenant rules: `tenant` in PK, every WHERE, every join; distributed on `tenant`.
5. **All new tables are registered in the tenant deletion temporal workflow** so tenant offboarding is complete.
6. Read-only server actions (`listTeamsDeliveries`, `listTeamsAuditEvents`) exist behind `withAuth` + permission check.
## Non-Goals (Explicit)
- No UI changes. No health dashboard. No admin diagnostics page.
- No channel mapping table. No setup wizard. No configurable channel tab.
- No 402/403 entitlement response standardization.
- No trial flow, per-seat metering, billing changes.
- No changes to existing notification/mutation behavior — pure instrumentation.
- No LLM/fuzzy intent matching.
- No bot SSO token exchange.
- No new metrics export, no Prometheus, no log shipping changes.
## Target Users
- **MSP admins** querying the new tables via server action (for now — UI in Phase 2).
- **Engineers** debugging Teams delivery in staging/production.
- **Future Phase 2 health dashboard** as a data consumer.
## Primary Flows
### Flow A: Notification delivery (instrumented)
1. `deliverTeamsNotificationImpl()` is invoked from notification pipeline.
2. Function computes idempotency key = SHA-256(`internal_notification_id`+`tenant`+`destination_type`+`destination_id`+`attempt_number`).
3. Before returning, function writes one row to `teams_notification_deliveries` with `status ∈ {skipped, sent, delivered, failed}` and structured `error_code` enum value when applicable.
4. If a row with the same `idempotency_key` already exists, the insert is a no-op (`ON CONFLICT DO NOTHING`).
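
The key derivation in step 2 can be sketched as follows. This is a minimal illustration, not the addon's actual implementation: the interface and function names are hypothetical, and the separator character is an assumption added here so that adjacent components cannot blur into each other when concatenated.

```ts
import { createHash } from "node:crypto";

// Components named in Flow A step 2. The interface name is hypothetical.
interface DeliveryKeyParts {
  internalNotificationId: string;
  tenant: string;
  destinationType: "user_activity" | "chat" | "channel";
  destinationId: string;
  attemptNumber: number;
}

// Assumption: join with a unit separator so ("ab", "c") and ("a", "bc")
// never produce the same hash input.
const SEPARATOR = "\u001f";

function computeDeliveryIdempotencyKey(parts: DeliveryKeyParts): string {
  const input = [
    parts.internalNotificationId,
    parts.tenant,
    parts.destinationType,
    parts.destinationId,
    String(parts.attemptNumber),
  ].join(SEPARATOR);
  // Hex-encoded SHA-256 digest: always 64 characters.
  return createHash("sha256").update(input, "utf8").digest("hex");
}
```

Because `attempt_number` is part of the input, a retry with an incremented attempt number yields a new key and therefore a new row, while a replay of the same attempt collapses into the existing row through `ON CONFLICT DO NOTHING`.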
### Flow B: Teams mutation audit
1. `teamsActionRegistry` dispatches one of: `assign_ticket`, `add_note`, `reply_to_contact`, `log_time`, `approval_response`, `create_ticket_from_message`, `update_from_message`.
2. After dispatch (success or caught error), one row is written to `teams_audit_events`.
3. Payload is **not** stored — only SHA-256 hash of canonicalized JSON payload plus safe metadata (actor, target, surface, action, result, error code).
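
One way to compute the `payload_hash` from step 3 is shown below. The PRD does not define "canonicalized", so this sketch assumes it means serializing with object keys sorted recursively; the helper names are hypothetical.

```ts
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Assumption: canonical form = JSON with object keys sorted recursively,
// so key order in the incoming payload never changes the hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`);
  return `{${entries.join(",")}}`;
}

// Only this digest is persisted; the payload itself is never written.
function hashPayload(payload: Json): string {
  return createHash("sha256").update(canonicalize(payload), "utf8").digest("hex");
}
```

Two payloads that differ only in key order hash identically, which keeps the audit row stable across retried invokes.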
### Flow C: ConversationReference capture
1. Bot receives any inbound activity (message, invoke, conversationUpdate).
2. Handler upserts `teams_conversation_references` keyed on (`tenant`, `microsoft_user_id`, `conversation_id`) with current `service_url`, `conversation_type`, `tenant_id_aad`, `last_activity_at`, and `updated_at`.
3. Phase 2 proactive messaging consumes this table; this PR only writes.
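
The mapping from an inbound activity to the row upserted in step 2 might look like this. It is a sketch under stated assumptions: the `InboundActivity` shape lists only the fields read here and is assumed to match the Bot Framework activity schema, the function name is hypothetical, and skipping activities that lack a conversation type is an assumption made here, not something the PRD specifies.

```ts
// Minimal subset of an inbound activity; assumed field names.
interface InboundActivity {
  serviceUrl: string;
  channelId: string;
  from: { aadObjectId?: string };
  conversation: { id: string; conversationType?: string; tenantId?: string };
}

// Columns of teams_conversation_references written on upsert.
interface ConversationReferenceRow {
  tenant: string;
  microsoft_user_id: string;
  conversation_id: string;
  conversation_type: string;
  service_url: string;
  tenant_id_aad: string | null;
  channel_id_bot_framework: string;
  last_activity_at: Date;
  updated_at: Date;
}

// Returns null when the activity lacks a value for a NOT NULL column,
// in which case the handler would skip the upsert.
function toConversationReferenceRow(
  tenant: string,
  activity: InboundActivity,
  now: Date = new Date(),
): ConversationReferenceRow | null {
  const userId = activity.from.aadObjectId;
  const conversationType = activity.conversation.conversationType;
  if (!userId || !activity.conversation.id || !conversationType || !activity.serviceUrl) {
    return null;
  }
  return {
    tenant,
    microsoft_user_id: userId,
    conversation_id: activity.conversation.id,
    conversation_type: conversationType,
    service_url: activity.serviceUrl,
    tenant_id_aad: activity.conversation.tenantId ?? null,
    channel_id_bot_framework: activity.channelId,
    last_activity_at: now,
    updated_at: now,
  };
}
```

The resulting row would then be written with an upsert on `(tenant, microsoft_user_id, conversation_id)` that refreshes `service_url`, `last_activity_at`, and `updated_at` on conflict.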
### Flow D: Admin read
1. Admin calls server action `listTeamsDeliveries({ filter, limit, cursor })` or `listTeamsAuditEvents({ filter, limit, cursor })`.
2. `withAuth` resolves tenant from session; `hasPermission(user, 'teams_integration', 'read')` gates access.
3. Action returns paginated rows scoped to the authenticated tenant.
## Data Model
### Table: `teams_notification_deliveries`
| Column | Type | Notes |
|-----------------------|-------------------------------|-------|
| `tenant` | uuid NOT NULL | Citus distribution key; in PK |
| `delivery_id` | uuid NOT NULL | In PK |
| `internal_notification_id` | uuid | Soft reference to `internal_notifications` `(tenant, id)`; no hard FK (see Open Questions) |
| `category` | text | Enum: `assignment` / `customer_reply` / `approval_request` / `escalation` / `sla_risk` |
| `destination_type` | text NOT NULL | Enum: `user_activity` / `chat` / `channel` |
| `destination_id` | text NOT NULL | Microsoft user id / chat id / channel id |
| `attempt_number` | int NOT NULL DEFAULT 1 | |
| `idempotency_key` | text NOT NULL | UNIQUE per (tenant, idempotency_key) |
| `provider_message_id` | text | Activity feed/chat/channel message id from Graph |
| `status` | text NOT NULL | Enum: `skipped` / `sent` / `delivered` / `failed` |
| `error_code` | text | Enum: see below |
| `error_message` | text | Free text, truncated to 1KB |
| `retryable` | boolean | |
| `provider_request_id` | text | Graph `request-id` header when available |
| `sent_at` | timestamptz | |
| `delivered_at` | timestamptz | |
| `responded_at` | timestamptz | |
| `created_at` | timestamptz NOT NULL DEFAULT now() | |
**Primary key:** `(tenant, delivery_id)`.
**Unique:** `(tenant, idempotency_key)` — supports `INSERT … ON CONFLICT DO NOTHING`.
**Index:** `(tenant, internal_notification_id)`, `(tenant, status, created_at DESC)`.
**Distribution:** `create_distributed_table('teams_notification_deliveries', 'tenant', colocate_with => 'teams_integrations')`.
### Table: `teams_audit_events`
| Column | Type | Notes |
|-------------------|-------------------------------|-------|
| `tenant` | uuid NOT NULL | Citus distribution key; in PK |
| `event_id` | uuid NOT NULL | In PK |
| `actor_user_id` | uuid | PSA user id resolved from token |
| `microsoft_user_id` | text | Teams `aadObjectId` if available |
| `surface` | text NOT NULL | Enum: `bot` / `message_extension` / `quick_action` / `tab` |
| `action_id` | text NOT NULL | Enum: the 7 mutation actions listed in Flow B |
| `target_type` | text | e.g., `ticket`, `time_entry`, `approval` |
| `target_id` | text | |
| `idempotency_key` | text | Same key Teams sent (if any), used for retried invokes |
| `payload_hash` | text | SHA-256 of canonicalized JSON input |
| `result_status` | text NOT NULL | Enum: `success` / `failure` |
| `error_code` | text | Action-error taxonomy enum |
| `created_at` | timestamptz NOT NULL DEFAULT now() | |
**Primary key:** `(tenant, event_id)`.
**Index:** `(tenant, actor_user_id, created_at DESC)`, `(tenant, target_type, target_id)`.
**Distribution:** `create_distributed_table('teams_audit_events', 'tenant', colocate_with => 'teams_integrations')`.
### Table: `teams_conversation_references`
| Column | Type | Notes |
|-----------------------|-------------------------------|-------|
| `tenant` | uuid NOT NULL | Citus distribution key; in PK |
| `microsoft_user_id` | text NOT NULL | `aadObjectId` from activity |
| `conversation_id` | text NOT NULL | Bot Framework conversation id |
| `conversation_type` | text NOT NULL | `personal` / `groupChat` / `channel` |
| `service_url` | text NOT NULL | Bot Framework service URL — required for proactive messages |
| `tenant_id_aad` | text | AAD tenant id (Microsoft side) |
| `channel_id_bot_framework` | text | Always `msteams` for our use, kept for forward-compat |
| `last_activity_at` | timestamptz NOT NULL | |
| `created_at` | timestamptz NOT NULL DEFAULT now() | |
| `updated_at` | timestamptz NOT NULL DEFAULT now() | |
**Primary key:** `(tenant, microsoft_user_id, conversation_id)`.
**Distribution:** `create_distributed_table('teams_conversation_references', 'tenant', colocate_with => 'teams_integrations')`.
### Error code taxonomy (delivery)
Single text enum constraint, no separate type:
- `graph_throttled` (429)
- `graph_unauthorized` (401/403)
- `graph_not_found` (404)
- `graph_server_error` (5xx)
- `user_not_mapped` (no Microsoft user link)
- `addon_inactive` (entitlement gate hit)
- `integration_inactive` (`teams_integrations.install_status != 'active'`)
- `package_misconfigured` (missing base URL / app id)
- `transient` (network / timeout)
- `unknown`
The migration enforces the value with a `CHECK (error_code IS NULL OR error_code IN (...))` constraint to allow forward-compat additions without an enum type rebuild.
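
The Graph-related part of the taxonomy above maps directly from HTTP status codes. A minimal sketch, with a hypothetical function name; the non-Graph codes (`user_not_mapped`, `addon_inactive`, `integration_inactive`, `package_misconfigured`) are assumed to be set by the caller before any request is made.

```ts
type DeliveryErrorCode =
  | "graph_throttled"
  | "graph_unauthorized"
  | "graph_not_found"
  | "graph_server_error"
  | "user_not_mapped"
  | "addon_inactive"
  | "integration_inactive"
  | "package_misconfigured"
  | "transient"
  | "unknown";

// Classify a failed Graph call. `status` is undefined when no HTTP response
// was received at all (network error or timeout).
function classifyGraphFailure(status: number | undefined): DeliveryErrorCode {
  if (status === undefined) return "transient";
  if (status === 429) return "graph_throttled";
  if (status === 401 || status === 403) return "graph_unauthorized";
  if (status === 404) return "graph_not_found";
  if (status >= 500 && status <= 599) return "graph_server_error";
  return "unknown";
}
```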
## Retention Strategy
**Decision:** Range-partition `teams_notification_deliveries` by `created_at` (monthly).
- Migration creates the parent table partitioned by `RANGE (created_at)` plus the next 3 months of child partitions.
- A separate small migration creates `cleanup_teams_notification_deliveries(retention_interval)` PL/pgSQL function that drops partitions older than `now() - retention_interval`. Default: `90 days`.
- `teams_audit_events` is **not** partitioned in this PR. Audit retention is typically longer (1y+) and partitioning can be added later without backfill. Add a `cleanup_teams_audit_events(interval)` function with default `365 days` so the knob exists.
- `teams_conversation_references` is **not** partitioned — it's an upsert-keyed table, expected to stay small (one row per (user, conversation)).
**Citus note:** Partitioned distributed tables in Citus require each partition to also be distributed. The migration uses `create_distributed_table` on the parent and Citus auto-distributes children. We verify this with a smoke query in the migration.
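
To make the monthly partition layout concrete, the helper below generates the child-partition DDL a migration could execute. It is a sketch only: the function name and the `<parent>_YYYY_MM` naming scheme are assumptions, and month boundaries are computed in UTC.

```ts
interface MonthlyPartition {
  name: string;
  from: string; // inclusive lower bound, YYYY-MM-DD
  to: string;   // exclusive upper bound, YYYY-MM-DD
  ddl: string;
}

// Build `count` consecutive monthly partitions starting at the month of `start`.
function monthlyPartitions(parent: string, start: Date, count: number): MonthlyPartition[] {
  const partitions: MonthlyPartition[] = [];
  const year = start.getUTCFullYear();
  const month = start.getUTCMonth();
  for (let i = 0; i < count; i++) {
    // Date.UTC normalizes month overflow, so month 12 becomes January of the next year.
    const from = new Date(Date.UTC(year, month + i, 1)).toISOString().slice(0, 10);
    const to = new Date(Date.UTC(year, month + i + 1, 1)).toISOString().slice(0, 10);
    const name = `${parent}_${from.slice(0, 4)}_${from.slice(5, 7)}`;
    const ddl =
      `CREATE TABLE IF NOT EXISTS ${name} PARTITION OF ${parent} ` +
      `FOR VALUES FROM ('${from}') TO ('${to}');`;
    partitions.push({ name, from, to, ddl });
  }
  return partitions;
}
```

For example, `monthlyPartitions('teams_notification_deliveries', new Date(), 3)` yields three statements whose ranges are contiguous, because each partition's exclusive upper bound equals the next partition's inclusive lower bound.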
## Tenant Deletion Integration (CRITICAL)
**File:** `ee/temporal-workflows/src/activities/tenant-deletion-activities.ts`
**Constant:** `TENANT_TABLES_DELETION_ORDER` (currently line 36+).
### Required edits
Add the three new tables **before** `teams_integrations` (line 94) so FK ordering holds. Group them with the existing Microsoft profile bindings comment block.
```diff
// Microsoft profile bindings (dependents before profile definitions)
- 'microsoft_profile_consumer_bindings', 'teams_integrations', 'microsoft_profiles',
+ 'microsoft_profile_consumer_bindings',
+ 'teams_notification_deliveries', 'teams_audit_events', 'teams_conversation_references',
+ 'teams_integrations', 'microsoft_profiles',
```
### Why this placement
- Partitioned table deletion: a `DELETE … WHERE tenant = ?` issued against the parent (`teams_notification_deliveries`) is applied by PostgreSQL to every child partition, so there is no need to enumerate each monthly partition. The filter is on `tenant`, not on the partition key `created_at`, so no partitions are pruned; every partition is visited.
- The three new tables relate to `teams_integrations` only through the shared `tenant` column (there is no hard FK to the integration row), so deleting them before `teams_integrations` cannot cause an FK violation.
- Listing them explicitly (not relying on CASCADE) matches the existing pattern in the deletion order constant.
### Verification step in the plan
A dedicated feature/test confirms a tenant-deletion dry run on a seeded tenant removes rows from all three new tables and that order conflicts are absent.
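
Alongside the dry run, the ordering itself can be checked statically. The sketch below uses a hypothetical helper and an excerpt of the deletion-order list taken from the diff above, not the full constant.

```ts
// Returns the dependents that are missing from `order` or listed after `parent`.
function findOrderingViolations(
  order: readonly string[],
  dependents: readonly string[],
  parent: string,
): string[] {
  const parentIndex = order.indexOf(parent);
  return dependents.filter((table) => {
    const index = order.indexOf(table);
    return index === -1 || parentIndex === -1 || index > parentIndex;
  });
}

const NEW_TEAMS_TABLES = [
  "teams_notification_deliveries",
  "teams_audit_events",
  "teams_conversation_references",
] as const;

// Excerpt of TENANT_TABLES_DELETION_ORDER after the edit shown in the diff.
const orderExcerpt = [
  "microsoft_profile_consumer_bindings",
  "teams_notification_deliveries",
  "teams_audit_events",
  "teams_conversation_references",
  "teams_integrations",
  "microsoft_profiles",
];

const violations = findOrderingViolations(orderExcerpt, NEW_TEAMS_TABLES, "teams_integrations");
```

An empty `violations` array means every new table is present and appears before `teams_integrations`.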
## API Surface
### Server actions (new)
Both live in `ee/packages/microsoft-teams/src/lib/actions/integrations/teamsObservabilityActions.ts`:
```ts
export const listTeamsDeliveries = withAuth(async (user, { tenant }, params: {
  status?: 'skipped' | 'sent' | 'delivered' | 'failed';
  category?: string;
  since?: string;   // ISO datetime
  limit?: number;   // default 50, max 200
  cursor?: string;  // opaque pagination cursor
}) => { /* … */ });

export const listTeamsAuditEvents = withAuth(async (user, { tenant }, params: {
  surface?: 'bot' | 'message_extension' | 'quick_action' | 'tab';
  action_id?: string;
  actor_user_id?: string;
  result_status?: 'success' | 'failure';
  since?: string;   // ISO datetime
  limit?: number;   // default 50, max 200
  cursor?: string;  // opaque pagination cursor
}) => { /* … */ });
```
Both require `hasPermission(user, 'teams_integration', 'read')`.
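
The `limit` and `cursor` parameters could be handled along these lines. The default of 50 and the maximum of 200 come from the signature above; everything else is an assumption for illustration, including the helper names and the choice to encode `(created_at, id)` of the last returned row as base64url JSON.

```ts
// Hypothetical cursor contents: sort position of the last row on the previous page.
interface PageCursor {
  createdAt: string; // ISO datetime
  id: string;
}

// Apply the documented default (50) and maximum (200); floor at 1.
function clampLimit(limit?: number): number {
  if (limit === undefined || !Number.isFinite(limit)) return 50;
  return Math.min(200, Math.max(1, Math.floor(limit)));
}

// Opaque to callers: base64url-encoded JSON.
function encodeCursor(cursor: PageCursor): string {
  return Buffer.from(JSON.stringify(cursor), "utf8").toString("base64url");
}

// Returns null for anything that does not decode to a well-formed cursor.
function decodeCursor(raw: string): PageCursor | null {
  try {
    const parsed = JSON.parse(Buffer.from(raw, "base64url").toString("utf8"));
    if (typeof parsed?.createdAt === "string" && typeof parsed?.id === "string") {
      return { createdAt: parsed.createdAt, id: parsed.id };
    }
    return null;
  } catch {
    return null;
  }
}
```

Treating a malformed cursor as `null` lets the action fall back to the first page or reject the request, rather than passing unvalidated input into the query.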
### No new HTTP routes
These server actions are callable from the existing settings/admin context. No `/api/teams/*` additions in this PR.
## Multi-Tenant / Citus Compliance
- All three tables include `tenant` in the primary key.
- All inserts/queries include `tenant` in WHERE.
- All tables distributed on `tenant`, colocated with `teams_integrations`.
- No cross-tenant unique constraints.
- `idempotency_key` uniqueness is scoped per-tenant via `UNIQUE (tenant, idempotency_key)`.
- Reads in server actions use `withAuth` → `runWithTenant` → `createTenantKnex()` to set tenant context automatically.
## Privacy / Security
- **No raw payloads stored.** `payload_hash` is SHA-256 of canonicalized JSON; original payload is GC'd by the action handler.
- `error_message` truncated to 1KB to avoid leaking large customer text.
- Server actions gated by `hasPermission(user, 'teams_integration', 'read')`.
- Bot Framework JWT validation is **not** part of this PR (separate audit); assumed already enforced by existing bot handler.
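
The 1KB cap on `error_message` is easiest to get right when applied in bytes rather than characters. A minimal sketch, assuming 1KB means 1024 bytes of UTF-8; the function name is hypothetical.

```ts
// Truncate to at most `maxBytes` of UTF-8 without splitting a multi-byte character.
function truncateUtf8(message: string, maxBytes: number = 1024): string {
  const bytes = Buffer.from(message, "utf8");
  if (bytes.length <= maxBytes) return message;
  let end = maxBytes;
  // bytes[end] is the first excluded byte. If it is a continuation byte
  // (10xxxxxx), the cut falls inside a character, so back up to its lead byte.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return bytes.subarray(0, end).toString("utf8");
}
```

Cutting on a character boundary avoids storing a trailing replacement character when the message contains non-ASCII text.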
## Risks
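1. **Citus partitioned distributed tables.** Combining `PARTITION BY RANGE` with `create_distributed_table` has known sharp edges. One concrete example: PostgreSQL requires every primary key and unique constraint on a partitioned table to include the partition key, so `(tenant, delivery_id)` and `(tenant, idempotency_key)` on `teams_notification_deliveries` would need `created_at` added, which in turn changes the `ON CONFLICT` target. Mitigation: smoke-test in CE migration first (without Citus), then verify on a Citus-enabled staging environment.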
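2. **Insert overhead on hot notification path.** ~1 extra insert per notification. Mitigation: the insert is best-effort; failures are logged but do not break delivery. If the insert shares a transaction with other delivery work, it needs a savepoint (or its own connection), because a failed statement otherwise aborts the enclosing PostgreSQL transaction.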
3. **Idempotency-key collisions.** SHA-256 across 5 components → collision risk negligible, but `ON CONFLICT DO NOTHING` ensures a duplicate write is safe.
4. **Tenant deletion ordering.** Documented above; verified by feature F-TD-01.
5. **Worker rebuild scope creep.** `deliverTeamsNotificationImpl` may be called from `services/workflow-worker`. PR must verify and rebuild worker if so.
## Rollout / Migration
- Migrations are intended to run forward only. A `down()` is still provided and drops the tables (safe: no data depends on them in CE/prod yet).
- Deploy order: run migrations → deploy `@alga-psa/microsoft-teams` rebuild → restart server → restart workflow-worker (if it imports the package).
- No feature flag needed; instrumentation is internal.
- No user-visible changes — no docs/release-notes entry required beyond a one-line internal changelog.
## Acceptance Criteria
- [ ] Three migrations land in `ee/server/migrations/` (CE-compatible) and `ee/server/migrations/citus/` (where Citus-specific logic differs).
- [ ] `deliverTeamsNotificationImpl()` writes exactly one row per terminal outcome with correct `status` and `error_code`.
- [ ] Each of the 7 mutation actions in `teamsActionRegistry.ts` writes exactly one audit row per invocation, whether the action succeeds or fails.
- [ ] First inbound bot activity from a (tenant, user, conversation) tuple writes a `teams_conversation_references` row; subsequent activities upsert.
- [ ] `tenant-deletion-activities.ts` lists all three new tables in `TENANT_TABLES_DELETION_ORDER`, before `teams_integrations`.
- [ ] Tenant deletion test (existing integration test) removes all rows from new tables.
- [ ] `listTeamsDeliveries` / `listTeamsAuditEvents` return only rows for the authenticated tenant; cross-tenant attempt is rejected by query construction.
- [ ] All new queries include `tenant` in WHERE; verified by grep + code review.
- [ ] No test mocks the database — integration tests hit a real DB per project convention.
- [ ] No raw payload text is persisted; verified by grepping inserts.
## Open Questions
1. **Should `cleanup_*` functions be invoked by anything in this PR?** Default: no — just create the functions, leave invocation to an operator/cron in a follow-up. Confirmed in scope as "knob exists" only.
2. **Permission key `teams_integration:read` — does it exist?** If not, the PR adds it to the permission seeder. Action: grep first; add only if missing.
3. **Conversation reference persistence — table vs JSON on `teams_integrations`?** Decided: separate table (Phase 2 will index by user/conversation and a JSON column on a single-row-per-tenant table is the wrong shape).
4. **Should `internal_notification_id` be a hard FK?** Decided: no. `internal_notifications` is large and FK across distributed shards is expensive; rely on tenant-scoped soft reference.