PSA/ee/docs/plans/2025-09-19-custom-portal-domains-plan.md
Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

164 lines
18 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Title: Customer Portal Custom Domains Design & Implementation Plan
Date: 2025-09-19
Overview / Rationale
- Goal: Enable enterprise tenants to surface portal traffic on a vanity hostname while preserving seamless operation for CE tenants who remain on the default fleet domains.
- Approach: Tenants create a DNS CNAME that targets a canonical host we control (`<tenant7>.portal.algapsa.com`). We manage lifecycle state, DNS verification, and Istio configuration via Temporal so the portal can respond on both the canonical host and any approved vanity CNAMEs.
- Success Criteria: Exactly one active domain per tenant, accurate status visibility in the Client Portal settings UI, automated Istio/cert-manager reconciliation, and resilient observability hooks (OpenTelemetry traces, PostHog metrics) around the workflow.
- Non-goals: Supporting multiple domains per tenant, delegating DNS into customer zones, or replacing cert-manager.
Key Decisions & Clarifications
- Canonical target: Store `<first 7 chars of tenant_id>.portal.algapsa.com` in the database so DNS guidance remains stable even if tenant metadata shifts.
- Certificates: Re-use the existing wildcard `*.portal.algapsa.com` certificate for the canonical ingress path. Vanity domains CNAME to the canonical host; cert-manager issues per-domain certificates through ACME using HTTP-01, leveraging our ingress path. No new Route53 zones required.
- Reconciliation: Temporal activity generates desired manifests, writes them into the `nm-kube-config/alga-psa` repo, applies via `kubectl`/Helm, and commits back to Git for auditability (no standalone operator).
- Istio config source: `nm-kube-config/alga-psa` is the single source of truth; the activity rewrites YAML based on DB state, ensuring every change is versioned prior to applying in-cluster.
- Observability: Use our existing OpenTelemetry setup for workflow/activity spans and PostHog for counters and timings.
- Security: Gate all settings actions with RBAC (tenant admins only) and audit via existing logging helpers.
Current State Snapshot
- UI: CE build now shows the canonical hosted domain with an Enterprise badge; EE build ships a fully wired domain form (status, refresh/disable controls, DNS instructions) backed by server actions.
- CE vs EE: Webpack alias continues to map `@ee/*` to CE stubs, with an enterprise override providing the rich settings panel dynamically.
- Persistence & actions: `portal_domains` table, model helpers, and CE/EE server actions are implemented and committed.
- Temporal scaffolding: Workflow client, DNS verification activity, and reconciliation handle manifest generation; next iteration moves reconciliation to GitOps by rewriting `nm-kube-config` manifests, applying them, and committing the diff.
- Istio & cert-manager: `algapsa-gateway`, `apps-gateway`, and `apps-gateway-auto` continue to terminate traffic with cert-manager issued secrets. Per-tenant `Gateway` + `VirtualService` resources are generated automatically; HTTP-01 challenge routing can be enabled via `PORTAL_DOMAIN_CHALLENGE_*` env configuration.
Target Tenant Experience
1. Admin (EE tenant) opens Settings → Client Portal.
2. Page displays current domain state (none, pending, active, failed) plus canonical target instructions.
3. Admin submits a vanity host (single domain). The backend persists the request, kicks off Temporal, and shows `pending_dns`.
4. Temporal verifies the CNAME points at the canonical target; success transitions to certificate provisioning and reconciliation.
5. Once cert-manager reports Ready and Istio resources are synced, status flips to `active`. Failures surface actionable `status_message` strings.
6. Admin can trigger a refresh or remove the domain; removal re-runs reconciliation to prune Kubernetes resources.
7. CE tenants see only the default portal domain card with no editable controls.
Architecture & Components
Database Schema: `portal_domains`
- Migration via Knex (follow `docs/AI_coding_standards.md`). Lives in CE repo so schema exists everywhere; EE code governs usage.
- Columns:
- `id` UUID PK
- `tenant_id` FK → `tenant.tenant_id`
- `domain` CITEXT unique (vanity hostname requested by tenant)
- `canonical_host` CITEXT (stored `<tenant7>.portal.algapsa.com`), unique per tenant
- `status` ENUM: `pending_dns`, `verifying_dns`, `dns_failed`, `pending_certificate`, `certificate_issuing`, `certificate_failed`, `deploying`, `active`, `disabled`
- `status_message` TEXT (human-friendly guidance)
- `last_checked_at` timestamptz
- `verification_method` ENUM default `cname`
- `verification_details` JSONB (e.g. `{ "expected_cname": "abc1234.portal.algapsa.com" }`)
- `certificate_secret_name` TEXT (`portal-domain-{tenant_id}`)
- `last_synced_resource_version` TEXT (for VirtualService/Gateway tracking)
- Timestamps: `created_at`, `updated_at`
- Unique constraint `(tenant_id)`
Server Layer (RBAC-aware)
- Actions exposed under `server/src/lib/actions/tenant-actions/portalDomainActions.ts`:
- `getPortalDomainStatus` (read current row + derived hints). Always safe for CE but returns read-only stub when edition ≠ EE.
- `requestPortalDomainRegistration(domain)` (validate, ensure Admin RBAC, persist row, enqueue workflow).
- `refreshPortalDomainStatus()` (poll DB + optionally trigger Temporal query to tighten freshness).
- `disablePortalDomain()` (mark disabled, signal workflow, enqueue reconciliation).
- REST endpoints (optional) under `/api/settings/client-portal/domain` to support CLI/automation. Wrap each handler with RBAC guard + audit trail entry.
- Edition gating: CE build exports stubs returning the default hosted domain.
Temporal Workflow & Activities
- Workflow: `PortalDomainRegistrationWorkflow` located at `ee/temporal-workflows/src/workflows/portal-domains/registration.workflow.ts`.
- Activities (all in `ee/temporal-workflows/src/activities/portal-domains`):
1. `recordStatus(domainId, status, message?)` updates DB row via shared repository helper.
2. `verifyCname(domain, canonicalHost)` performs repeated DNS lookups until two consecutive matches (5 min backoff). On timeout → `dns_failed` with guidance.
3. `renderAndApplyKubernetesState()` fetches **all** managed rows, renders Istio Gateway/VirtualService and cert-manager `Certificate` manifests, writes optional JSON snapshots to the GitOps worktree (`PORTAL_DOMAIN_MANIFEST_DIR`), applies updates via the Kubernetes API client, records resource versions, and prunes anything no longer present.
4. `waitForCertificateReady(namespace, certificateName)` polls cert-manager status via K8s API, emits PostHog metrics (issuance duration) and OpenTelemetry spans, and confirms HTTP-01 challenges are reachable (failing fast with actionable messaging when challenge pods lack ingress).
5. `waitForIstioSync(domainId)` checks VirtualService/Gateway `resourceVersion` matches recorded value; updates DB.
6. `finalizeActivation(domainId)` optional HTTP probe to `https://domain` to confirm 200 status, then mark `active`.
7. Shared `handleFailure(domainId, error, stage)` surfaces sanitized details to `status_message` and emits PostHog failure metric.
- Signals:
- `removeDomain` invoked when tenant disables domain to short-circuit and move to tear-down path.
- Observability: Wrap workflow + activities with OpenTelemetry instrumentation (`workflow.logger` + `trace`) and PostHog event names (`portal_domain.dns_verified`, `portal_domain.cert_ready`, etc.).
Kubernetes & TLS Strategy
- Canonical host: Each tenant routes through `<tenant7>.portal.algapsa.com`. This stays backed by the wildcard certificate we already manage (`*.portal.algapsa.com`).
- Vanity host: When the tenants vanity domain CNAMEs to the canonical host, cert-manager issues an individual certificate using the HTTP-01 solver. We must ensure Istio routes `/.well-known/acme-challenge/*` to cert-managers challenge service; if that path is unavailable we will have the Temporal workflow render the required challenge response assets until issuance completes. Secrets named `portal-domain-{tenant}` live in namespace `msp`.
- Resources:
- `Certificate` per vanity domain (namespace `msp`, `issuerRef` from `PORTAL_DOMAIN_CERT_*` env, default `letsencrypt-dns`) with deterministic secret name `portal-domain-<tenant7>`.
- `Gateway` per domain (namespace `istio-system` by default) exposes HTTP→HTTPS redirect and SNI-bound TLS server using the generated secret.
- `VirtualService` per domain (namespace `msp`) routes vanity traffic to `sebastian.msp.svc.cluster.local:3000`; optional `/.well-known/acme-challenge/*` route forwards to `PORTAL_DOMAIN_CHALLENGE_HOST` when enabled.
- Reconciliation flushes the desired resource set on every run and prunes labelled resources that no longer map to active domains.
GitOps Workflow (nm-kube-config)
- Production manifests live under `~/nm-kube-config/alga-psa/portal-domains/<tenantSlug>.yaml`. Each file contains the rendered `Certificate`, `Gateway`, and `VirtualService` separated by `---` so kubectl/Helm can apply them directly.
- Staging (hv-dev2) mirrors the layout at `~/nm-kube-config/argo-workflow/alga-psa-dev/portal-domains/<tenantSlug>.yaml` to keep dev/test traffic isolated. The workflow picks the target root based on environment.
- `PORTAL_DOMAIN_MANIFEST_DIR` points at the appropriate root folder; on every reconciliation the activity rewrites the per-tenant YAML from database state (sorted keys, deterministic ordering) so Git diffs stay readable.
- After files are updated the activity runs `kubectl apply -f <tenantSlug>.yaml` (or batched apply) against the cluster, stages the changes with `git add portal-domains`, commits with a message like `chore(portal-domains): sync <tenantSlug>`, and pushes to the shared repo.
- A helper CLI (`pnpm nm-kube-sync`) will encapsulate diff detection, safe commit messages, optional PR creation, and fall back to printing `kubectl` commands for manual review when auto-apply is disabled.
- Operational playbook: review the generated Git diff, merge/push (or approve the automations push), verify Argo/Flux sync health, then trigger the Temporal refresh action so DB status aligns with the cluster.
UI Updates (CE vs EE)
- CE (`server/src/components/settings/general/ClientPortalSettings.tsx`): Replace “Coming Soon” with a read-only card highlighting default portal address (`<tenant7>.portal.algapsa.com`) and note that custom domains require Enterprise.
- EE (`ee/server/src/components/settings/general/ClientPortalSettings.tsx`):
- Status banner showing `domain`, `canonical_host`, `status`, `last_checked_at`, and `status_message`.
- Form with input `id="client-portal-domain-input"`, submit button `id="client-portal-domain-submit"`, and optional `id="client-portal-domain-refresh"` button to poll immediately.
- DNS instructions card (canonical host, sample CNAME record) and hints for propagation delays.
- Error panel listing actionable steps (derived from `status_message`).
- Remove button (`id="client-portal-domain-remove"`) once status is non-pending.
- Poll status with SWR (15s) while in non-terminal state; respect AI coding standards for naming, instrumentation IDs, and toast usage.
Security & Permissions
- RBAC: Reuse existing tenant admin role checks (`requireTenantPermission('settings.manage_portal')` or similar). Deny actions for non-admins with clear error.
- Audit logging: Insert entries via existing helper for every create, refresh, remove action.
- Validation: Enforce ASCII/LDH host rules, reject apex domains without CNAME capability, normalise to lowercase.
Observability
- Temporal: Use OpenTelemetry tracing already wired in `ee/temporal-workflows` to emit spans for each activity. Tag spans with `tenant_id`, `domain`, `stage` (avoid PII beyond hostnames).
- Metrics: Emit PostHog events (counts, durations, failure reasons). Add dashboards to monitor time-to-activate, failure rate, and number of active domains.
- Logging: Standardise structured logs with correlation IDs (workflow run ID).
Progress Update 2025-09-19
-`portal_domains` migration, enums, and model helpers landed with canonical-host storage and normalization utilities.
- ✅ CE/EE server actions implemented with RBAC enforcement, Temporal enqueue hooks, and PostHog capture; CE build returns read-only status.
- ✅ CE UI now surfaces the canonical host; EE UI delivers status badges, form submission, refresh/disable actions, and onboarding guidance.
- ✅ Workflow client + Temporal workflow now render and apply Istio Gateway/VirtualService + Certificate resources via the Kubernetes API, prune stale objects, and capture manifest snapshots when configured.
- ✅ Support documentation added (`ee/docs/guides/portal-domain-runbook.md`) describing operational flows and observability.
- ✅ Optional GitOps export writes JSON manifests for each tenant into `PORTAL_DOMAIN_MANIFEST_DIR`, ready for nm-kube-config commits.
- 🟡 Automated tests remain sparse (initial manifest-render unit test added); workflow + UI coverage still outstanding.
Implementation Plan
1. **Schema & Models** ✅ Completed (2025-09-19)
- Migration and model helpers merged; canonical host stored per tenant.
2. **Server Actions & API** ✅ Completed (2025-09-19)
- RBAC-guarded EE actions and CE stubs implemented; PostHog instrumentation in place. REST surface still optional (not started).
3. **Temporal Workflow & Activities** 🟡 In progress
- DNS verification + reconciliation activities now apply/prune Kubernetes resources and write manifest snapshots; remaining work covers certificate readiness polling, richer status messaging, and workflow tests.
4. **GitOps & Reconciliation Tooling** 🟡 In progress
- Per-tenant YAML written to `nm-kube-config/{alga-psa|argo-workflow/alga-psa-dev}/portal-domains`; still need the CLI to diff/commit/apply and docs for the automation flow.
5. **Kubernetes Reconciliation Templates & HTTP-01 Pathing** 🟡 In progress
- Gateway/VirtualService templates implemented with optional HTTP-01 routing via `PORTAL_DOMAIN_CHALLENGE_*`; still need to standardise the challenge-serving workload and productionize readiness probes.
6. **UI (CE + EE)** ✅ Completed (2025-09-19)
- CE shows default host; EE form with status badges, refresh/disable flows, and guidance shipped. Cypress/Playwright coverage outstanding.
7. **Testing & Verification** 🔄 Not started
- Unit/integration tests, workflow harness, and staging checklist to be added.
8. **Rollout** 🔄 Not started
- Migration deployment, worker release, and manual onboarding plan remain after backend completion.
Terminal Status UX Matrix
| Status | UI Treatment | Available Actions | Next-Step Guidance |
|----------------------|-------------------------------------------------------------------------------|-------------------------------------------------|-------------------------------------------------------------------------------------|
| `active` | Success banner with vanity + canonical host, "Active" badge, last verified time | Refresh, Remove | Inform user domain is live; suggest confirming CNAME remains pointed correctly. |
| `disabled` | Neutral banner noting custom domain disabled and default host in use | Submit new domain, Refresh | Explain portal serves canonical host; advise submitting a new domain when ready. |
| `dns_failed` | Error banner showing last resolved target + suggested TTL wait | Refresh, Remove, Retry Registration | Prompt user to correct DNS record to the canonical host, then retry once propagated.|
| `certificate_failed` | Error banner with cert-manager status message and timestamp | Refresh, Remove, Retry Registration | Direct user to ensure HTTP-01 path is reachable; recommend contacting support if blocked. |
Resolved Questions
- Canonical target storage: store in schema (`canonical_host`).
- DNS / Certificate scope: rely on wildcard `*.portal.algapsa.com` for canonical ingress; vanity domains CNAME into it and are issued certs without extending Route53 zones.
- ACME challenge method: enforce HTTP-01 challenges by exposing `/.well-known/acme-challenge/*` through Istio or temporary challenge-serving workloads managed by the workflow.
- Reconciliation location: handled entirely inside the Temporal activity suite; no separate Kubernetes operator required.
- Resource teardown: reconciliation activity renders the full desired list and prunes anything missing, ensuring deleted domains remove all K8s resources automatically.
Remaining Work & Follow-ups
- Add cert-manager readiness polling + HTTP probes before marking domains `active`; emit PostHog timings + richer status messages for failure cases.
- Finalise HTTP-01 challenge serving (shared solver service or on-demand pod) and bake the required `PORTAL_DOMAIN_CHALLENGE_*` defaults into staging/production.
- Build the GitOps helper CLI (`pnpm nm-kube-sync`) to diff manifests, open PRs, and optionally apply changes; update runbook once available.
- Expand automated coverage: workflow unit/integration tests, CE/EE action tests, and mocked EE UI e2e flows plus a staging validation checklist.
- Validate the new base VirtualService redirect management in staging once rolled out; regression coverage lives in `ee/temporal-workflows/src/activities/__tests__/portal-domain-activities.git.test.ts` to guard the `/client-portal/dashboard` default route.
- Provide operational tooling (Temporal signal CLI/script) for forced reconciliation and document the procedure in the runbook.
- Plan rollout sequencing: migration deployment order, Temporal worker release, customer enablement messaging, and nm-kube-config PR cadence.