Title: Customer Portal Custom Domains – Design & Implementation Plan
Date: 2025-09-19

Overview / Rationale
- Goal: Enable enterprise tenants to serve portal traffic on a vanity hostname while CE tenants keep operating unchanged on the default fleet domains.
- Approach: Each tenant creates a DNS CNAME that targets a canonical host we control (`<tenant7>.portal.algapsa.com`). We manage lifecycle state, DNS verification, and Istio configuration via Temporal so the portal responds on both the canonical host and the approved vanity hostname.
- Success Criteria: Exactly one active domain per tenant, accurate status visibility in the Client Portal settings UI, automated Istio/cert-manager reconciliation, and resilient observability hooks (OpenTelemetry traces, PostHog metrics) around the workflow.
- Non-goals: Supporting multiple domains per tenant, delegating DNS into customer zones, or replacing cert-manager.

Key Decisions & Clarifications
- Canonical target: Store `<first 7 chars of tenant_id>.portal.algapsa.com` in the database so DNS guidance remains stable even if tenant metadata shifts.
- Certificates: Re-use the existing wildcard `*.portal.algapsa.com` certificate for the canonical ingress path. Vanity domains CNAME to the canonical host; cert-manager issues per-domain certificates through ACME using HTTP-01, leveraging our ingress path. No new Route53 zones required.
- Reconciliation: Temporal activity generates desired manifests, writes them into the `nm-kube-config/alga-psa` repo, applies via `kubectl`/Helm, and commits back to Git for auditability (no standalone operator).
- Istio config source: `nm-kube-config/alga-psa` is the single source of truth; the activity rewrites YAML based on DB state, ensuring every change is versioned prior to applying in-cluster.
- Observability: Use our existing OpenTelemetry setup for workflow/activity spans and PostHog for counters and timings.
- Security: Gate all settings actions with RBAC (tenant admins only) and audit via existing logging helpers.

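The canonical target rule above is simple enough to sketch. The following is a minimal illustration, assuming `tenant_id` is a UUID-like string; the function name, the constant, and the input checks are hypothetical and not taken from the codebase.

```typescript
// Hypothetical helper: only the "<first 7 chars of tenant_id>.portal.algapsa.com"
// shape comes from the plan; the name and the validation are assumptions.
const CANONICAL_SUFFIX = "portal.algapsa.com";

function canonicalHostForTenant(tenantId: string): string {
  const prefix = tenantId.trim().toLowerCase().slice(0, 7);
  // A DNS label built from the prefix should contain only letters and digits.
  if (prefix.length < 7 || !/^[a-z0-9]+$/.test(prefix)) {
    throw new Error("tenant_id must start with at least 7 alphanumeric characters");
  }
  return `${prefix}.${CANONICAL_SUFFIX}`;
}
```

Because the value is derived once and then stored, later changes to tenant metadata would not alter the DNS instructions shown to the tenant.
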
Current State Snapshot
- UI: CE build now shows the canonical hosted domain with an Enterprise badge; EE build ships a fully wired domain form (status, refresh/disable controls, DNS instructions) backed by server actions.
- CE vs EE: Webpack alias continues to map `@ee/*` to CE stubs, with an enterprise override providing the rich settings panel dynamically.
- Persistence & actions: `portal_domains` table, model helpers, and CE/EE server actions are implemented and committed.
- Temporal scaffolding: The workflow client, DNS verification activity, and reconciliation activity are in place, and reconciliation already handles manifest generation; the next iteration moves reconciliation to GitOps by rewriting `nm-kube-config` manifests, applying them, and committing the diff.
- Istio & cert-manager: `algapsa-gateway`, `apps-gateway`, and `apps-gateway-auto` continue to terminate traffic with cert-manager issued secrets. Per-tenant `Gateway` + `VirtualService` resources are generated automatically; HTTP-01 challenge routing can be enabled via `PORTAL_DOMAIN_CHALLENGE_*` env configuration.

Target Tenant Experience
1. Admin (EE tenant) opens Settings → Client Portal.
2. Page displays current domain state (none, pending, active, failed) plus canonical target instructions.
3. Admin submits a vanity host (single domain). The backend persists the request, kicks off Temporal, and shows `pending_dns`.
4. Temporal verifies the CNAME points at the canonical target; success transitions to certificate provisioning and reconciliation.
5. Once cert-manager reports Ready and Istio resources are synced, status flips to `active`. Failures surface actionable `status_message` strings.
6. Admin can trigger a refresh or remove the domain; removal re-runs reconciliation to prune Kubernetes resources.
7. CE tenants see only the default portal domain card with no editable controls.

Architecture & Components

Database Schema: `portal_domains`
- Migration via Knex (follow `docs/AI_coding_standards.md`). Lives in CE repo so schema exists everywhere; EE code governs usage.
- Columns:
  - `id` UUID PK
  - `tenant_id` FK → `tenant.tenant_id`
  - `domain` CITEXT unique (vanity hostname requested by tenant)
  - `canonical_host` CITEXT (stored `<tenant7>.portal.algapsa.com`), unique per tenant
  - `status` ENUM: `pending_dns`, `verifying_dns`, `dns_failed`, `pending_certificate`, `certificate_issuing`, `certificate_failed`, `deploying`, `active`, `disabled`
  - `status_message` TEXT (human-friendly guidance)
  - `last_checked_at` timestamptz
  - `verification_method` ENUM default `cname`
  - `verification_details` JSONB (e.g. `{ "expected_cname": "abc1234.portal.algapsa.com" }`)
  - `certificate_secret_name` TEXT (`portal-domain-{tenant_id}`)
  - `last_synced_resource_version` TEXT (for VirtualService/Gateway tracking)
  - Timestamps: `created_at`, `updated_at`
- Unique constraint `(tenant_id)`

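To make the `status` enum easier to reason about, the sketch below models it as a TypeScript union with a transition table. The status names come from the column list above; the transition table itself is an assumption inferred from the tenant experience steps, not a confirmed part of the design.

```typescript
type PortalDomainStatus =
  | "pending_dns"
  | "verifying_dns"
  | "dns_failed"
  | "pending_certificate"
  | "certificate_issuing"
  | "certificate_failed"
  | "deploying"
  | "active"
  | "disabled";

// Assumed transitions: every stage may be disabled by the tenant, and failed
// or disabled domains restart from pending_dns when registration is retried.
const NEXT_STATUSES: Record<PortalDomainStatus, PortalDomainStatus[]> = {
  pending_dns: ["verifying_dns", "disabled"],
  verifying_dns: ["pending_certificate", "dns_failed", "disabled"],
  dns_failed: ["pending_dns", "disabled"],
  pending_certificate: ["certificate_issuing", "certificate_failed", "disabled"],
  certificate_issuing: ["deploying", "certificate_failed", "disabled"],
  certificate_failed: ["pending_dns", "disabled"],
  deploying: ["active", "disabled"],
  active: ["disabled"],
  disabled: ["pending_dns"],
};

// Returns true when moving from one status to another is allowed by the table.
function canTransition(from: PortalDomainStatus, to: PortalDomainStatus): boolean {
  return NEXT_STATUSES[from].indexOf(to) !== -1;
}
```

A guard like this could sit in the shared repository helper so that an out-of-order activity retry cannot move a row backwards.
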
Server Layer (RBAC-aware)
- Actions exposed under `server/src/lib/actions/tenant-actions/portalDomainActions.ts`:
  - `getPortalDomainStatus` (read current row + derived hints). Safe to call in CE, where it returns a read-only stub because the edition is not EE.
  - `requestPortalDomainRegistration(domain)` (validate, ensure Admin RBAC, persist row, enqueue workflow).
  - `refreshPortalDomainStatus()` (poll DB + optionally trigger Temporal query to tighten freshness).
  - `disablePortalDomain()` (mark disabled, signal workflow, enqueue reconciliation).
- REST endpoints (optional) under `/api/settings/client-portal/domain` to support CLI/automation. Wrap each handler with RBAC guard + audit trail entry.
- Edition gating: CE build exports stubs returning the default hosted domain.

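The edition gating described above could look roughly like the following. This is a sketch only: the type names, the view shape, and the `toStatusView` helper are hypothetical, and the row fields simply mirror the column names from the schema section.

```typescript
type Edition = "ce" | "ee";

// Hypothetical subset of a portal_domains row.
interface PortalDomainRow {
  domain: string;
  canonical_host: string;
  status: string;
  status_message: string | null;
}

// Hypothetical shape returned to the settings UI.
interface PortalDomainStatusView {
  canonicalHost: string;
  domain: string | null;
  status: string | null;
  statusMessage: string | null;
  editable: boolean;
}

function toStatusView(
  edition: Edition,
  canonicalHost: string,
  row: PortalDomainRow | null
): PortalDomainStatusView {
  // CE, or EE without a registered domain: only the default host is shown.
  if (edition !== "ee" || row === null) {
    return {
      canonicalHost: canonicalHost,
      domain: null,
      status: null,
      statusMessage: null,
      editable: edition === "ee",
    };
  }
  return {
    canonicalHost: row.canonical_host,
    domain: row.domain,
    status: row.status,
    statusMessage: row.status_message,
    editable: true,
  };
}
```

Keeping the CE branch free of database or Temporal calls is one way to keep the action safe to call in either edition.
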
Temporal Workflow & Activities
- Workflow: `PortalDomainRegistrationWorkflow` located at `ee/temporal-workflows/src/workflows/portal-domains/registration.workflow.ts`.
- Activities (all in `ee/temporal-workflows/src/activities/portal-domains`):
  1. `recordStatus(domainId, status, message?)` – updates DB row via shared repository helper.
  2. `verifyCname(domain, canonicalHost)` – performs repeated DNS lookups until two consecutive matches (5 min backoff). On timeout → `dns_failed` with guidance.
  3. `renderAndApplyKubernetesState()` – fetches **all** managed rows, renders Istio Gateway/VirtualService and cert-manager `Certificate` manifests, writes optional JSON snapshots to the GitOps worktree (`PORTAL_DOMAIN_MANIFEST_DIR`), applies updates via the Kubernetes API client, records resource versions, and prunes anything no longer present.
  4. `waitForCertificateReady(namespace, certificateName)` – polls cert-manager status via K8s API, emits PostHog metrics (issuance duration) and OpenTelemetry spans, and confirms HTTP-01 challenges are reachable (failing fast with actionable messaging when challenge pods lack ingress).
  5. `waitForIstioSync(domainId)` – checks VirtualService/Gateway `resourceVersion` matches recorded value; updates DB.
  6. `finalizeActivation(domainId)` – optional HTTP probe to `https://domain` to confirm 200 status, then mark `active`.
  7. Shared `handleFailure(domainId, error, stage)` – surfaces sanitized details to `status_message` and emits PostHog failure metric.
- Signals:
  - `removeDomain` – invoked when tenant disables domain to short-circuit and move to tear-down path.
- Observability: Wrap workflow + activities with OpenTelemetry instrumentation (`workflow.logger` + `trace`) and PostHog event names (`portal_domain.dns_verified`, `portal_domain.cert_ready`, etc.).

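The "two consecutive matches" rule in `verifyCname` can be expressed as a small pure function that the activity would call after each DNS lookup. The sketch below is illustrative: `CnameCheckState`, `recordLookup`, and the `maxAttempts` parameter are hypothetical names, and the actual DNS lookup and the 5 minute wait between attempts are left to the caller.

```typescript
interface CnameCheckState {
  consecutiveMatches: number;
  attempts: number;
}

type CnameVerdict = "verified" | "retry" | "timed_out";

// Normalise a DNS name for comparison: lowercase, no trailing root dot.
function normalizeDnsName(name: string): string {
  return name.trim().toLowerCase().replace(/\.$/, "");
}

// Folds one lookup result into the running state and decides what to do next.
function recordLookup(
  state: CnameCheckState,
  resolvedTargets: string[],
  canonicalHost: string,
  maxAttempts: number
): { state: CnameCheckState; verdict: CnameVerdict } {
  const expected = normalizeDnsName(canonicalHost);
  const matched = resolvedTargets.some(function (target) {
    return normalizeDnsName(target) === expected;
  });
  const next: CnameCheckState = {
    attempts: state.attempts + 1,
    // A single miss resets the streak, so matches must be consecutive.
    consecutiveMatches: matched ? state.consecutiveMatches + 1 : 0,
  };
  if (next.consecutiveMatches >= 2) {
    return { state: next, verdict: "verified" };
  }
  if (next.attempts >= maxAttempts) {
    return { state: next, verdict: "timed_out" };
  }
  return { state: next, verdict: "retry" };
}
```

Under this sketch a `timed_out` verdict would map to the `dns_failed` status, and `retry` would mean sleeping for the backoff interval before the next lookup.
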
Kubernetes & TLS Strategy
- Canonical host: Each tenant routes through `<tenant7>.portal.algapsa.com`. This stays backed by the wildcard certificate we already manage (`*.portal.algapsa.com`).
- Vanity host: When the tenant’s vanity domain CNAMEs to the canonical host, cert-manager issues an individual certificate using the HTTP-01 solver. We must ensure Istio routes `/.well-known/acme-challenge/*` to cert-manager’s challenge service; if that path is unavailable we will have the Temporal workflow render the required challenge response assets until issuance completes. Secrets named `portal-domain-{tenant}` live in namespace `msp`.
- Resources:
  - `Certificate` per vanity domain (namespace `msp`, `issuerRef` from `PORTAL_DOMAIN_CERT_*` env, default `letsencrypt-dns`) with deterministic secret name `portal-domain-<tenant7>`.
  - `Gateway` per domain (namespace `istio-system` by default) exposes HTTP→HTTPS redirect and SNI-bound TLS server using the generated secret.
  - `VirtualService` per domain (namespace `msp`) routes vanity traffic to `sebastian.msp.svc.cluster.local:3000`; optional `/.well-known/acme-challenge/*` route forwards to `PORTAL_DOMAIN_CHALLENGE_HOST` when enabled.
- Reconciliation re-applies the full desired resource set on every run and prunes labelled resources that no longer map to active domains.

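As an illustration of the three resources listed above, the sketch below renders them as plain objects for one domain. Namespaces, the backend host and port, the issuer default, and the secret name pattern follow the plan; the issuer kind, the gateway selector label, the `portal.algapsa.com/managed` label, and all function and type names are assumptions made for the example. The optional ACME challenge route is omitted.

```typescript
interface DomainInput {
  tenant7: string; // first 7 characters of tenant_id
  domain: string; // vanity hostname, already validated
}

interface RenderOptions {
  issuerName: string; // plan default: letsencrypt-dns
  issuerKind: string; // assumed: ClusterIssuer
  appNamespace: string; // plan: msp
  gatewayNamespace: string; // plan default: istio-system
  gatewaySelector: { [label: string]: string }; // assumed ingress gateway label
  backendHost: string; // plan: sebastian.msp.svc.cluster.local
  backendPort: number; // plan: 3000
}

const DEFAULT_RENDER_OPTIONS: RenderOptions = {
  issuerName: "letsencrypt-dns",
  issuerKind: "ClusterIssuer",
  appNamespace: "msp",
  gatewayNamespace: "istio-system",
  gatewaySelector: { istio: "ingressgateway" },
  backendHost: "sebastian.msp.svc.cluster.local",
  backendPort: 3000,
};

interface GatewayServer {
  port: { number: number; name: string; protocol: string };
  hosts: string[];
  tls: { httpsRedirect?: boolean; mode?: string; credentialName?: string };
}

function renderPortalDomainManifests(input: DomainInput, opts: RenderOptions) {
  const name = `portal-domain-${input.tenant7}`;
  // Hypothetical label used to find managed resources when pruning.
  const labels = { "portal.algapsa.com/managed": "true" };
  const servers: GatewayServer[] = [
    // Port 80 only redirects to HTTPS.
    { port: { number: 80, name: "http", protocol: "HTTP" }, hosts: [input.domain], tls: { httpsRedirect: true } },
    // Port 443 terminates TLS with the per-domain secret.
    { port: { number: 443, name: "https", protocol: "HTTPS" }, hosts: [input.domain], tls: { mode: "SIMPLE", credentialName: name } },
  ];
  return {
    certificate: {
      apiVersion: "cert-manager.io/v1",
      kind: "Certificate",
      metadata: { name: name, namespace: opts.appNamespace, labels: labels },
      spec: {
        secretName: name,
        dnsNames: [input.domain],
        issuerRef: { name: opts.issuerName, kind: opts.issuerKind },
      },
    },
    gateway: {
      apiVersion: "networking.istio.io/v1beta1",
      kind: "Gateway",
      metadata: { name: name, namespace: opts.gatewayNamespace, labels: labels },
      spec: { selector: opts.gatewaySelector, servers: servers },
    },
    virtualService: {
      apiVersion: "networking.istio.io/v1beta1",
      kind: "VirtualService",
      metadata: { name: name, namespace: opts.appNamespace, labels: labels },
      spec: {
        hosts: [input.domain],
        // A gateway in another namespace is referenced as "<namespace>/<name>".
        gateways: [`${opts.gatewayNamespace}/${name}`],
        http: [{ route: [{ destination: { host: opts.backendHost, port: { number: opts.backendPort } } }] }],
      },
    },
  };
}
```

Deriving every resource name from the same `portal-domain-<tenant7>` string keeps the render deterministic, which is what makes repeated reconciliation runs idempotent.
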
GitOps Workflow (nm-kube-config)
- Production manifests live under `~/nm-kube-config/alga-psa/portal-domains/<tenantSlug>.yaml`. Each file contains the rendered `Certificate`, `Gateway`, and `VirtualService` separated by `---` so kubectl/Helm can apply them directly.
- Staging (hv-dev2) mirrors the layout at `~/nm-kube-config/argo-workflow/alga-psa-dev/portal-domains/<tenantSlug>.yaml` to keep dev/test traffic isolated. The workflow picks the target root based on environment.
- `PORTAL_DOMAIN_MANIFEST_DIR` points at the appropriate root folder; on every reconciliation the activity rewrites the per-tenant YAML from database state (sorted keys, deterministic ordering) so Git diffs stay readable.
- After files are updated the activity runs `kubectl apply -f <tenantSlug>.yaml` (or batched apply) against the cluster, stages the changes with `git add portal-domains`, commits with a message like `chore(portal-domains): sync <tenantSlug>`, and pushes to the shared repo.
- A helper CLI (`pnpm nm-kube-sync`) will encapsulate diff detection, safe commit messages, optional PR creation, and fall back to printing `kubectl` commands for manual review when auto-apply is disabled.
- Operational playbook: review the generated Git diff, merge/push (or approve the automation’s push), verify Argo/Flux sync health, then trigger the Temporal refresh action so DB status aligns with the cluster.

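The "sorted keys, deterministic ordering" requirement above can be met by normalising key order before serialising. The sketch below shows the idea using JSON output (matching the JSON snapshots mentioned for `PORTAL_DOMAIN_MANIFEST_DIR`); a YAML writer would need the same normalisation. The helper names are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Returns a copy of the value with object keys sorted at every nesting level.
// Array order is preserved because it is usually significant in manifests.
function sortKeys(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(function (item) {
      return sortKeys(item);
    });
  }
  if (value !== null && typeof value === "object") {
    const source: { [key: string]: Json } = value;
    const sorted: { [key: string]: Json } = {};
    const keys = Object.keys(source).sort();
    for (let i = 0; i < keys.length; i++) {
      sorted[keys[i]] = sortKeys(source[keys[i]]);
    }
    return sorted;
  }
  return value;
}

// Serialises with stable key order and a trailing newline for clean Git diffs.
function stableStringify(value: Json): string {
  return JSON.stringify(sortKeys(value), null, 2) + "\n";
}

// Commit message in the format quoted in the plan.
function syncCommitMessage(tenantSlug: string): string {
  return `chore(portal-domains): sync ${tenantSlug}`;
}
```

With this normalisation, two renders of the same database state produce byte-identical files, so a reconciliation run that changes nothing leaves the Git worktree clean.
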
UI Updates (CE vs EE)
- CE (`server/src/components/settings/general/ClientPortalSettings.tsx`): Replace “Coming Soon” with a read-only card highlighting default portal address (`<tenant7>.portal.algapsa.com`) and note that custom domains require Enterprise.
- EE (`ee/server/src/components/settings/general/ClientPortalSettings.tsx`):
  - Status banner showing `domain`, `canonical_host`, `status`, `last_checked_at`, and `status_message`.
  - Form with input `id="client-portal-domain-input"`, submit button `id="client-portal-domain-submit"`, and optional `id="client-portal-domain-refresh"` button to poll immediately.
  - DNS instructions card (canonical host, sample CNAME record) and hints for propagation delays.
  - Error panel listing actionable steps (derived from `status_message`).
  - Remove button (`id="client-portal-domain-remove"`) once status is non-pending.
  - Poll status with SWR (15s) while in non-terminal state; respect AI coding standards for naming, instrumentation IDs, and toast usage.

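The polling rule for the EE panel (15 seconds while the status is non-terminal) can be captured in a small helper. This is a sketch: `pollIntervalMs` is a hypothetical name, and the set of terminal statuses is taken from the Terminal Status UX Matrix in this plan.

```typescript
// Statuses the plan treats as terminal.
const TERMINAL_STATUSES: string[] = ["active", "disabled", "dns_failed", "certificate_failed"];

// Returns the polling interval in milliseconds; 0 means "do not poll".
function pollIntervalMs(status: string | null): number {
  // No registered domain yet: nothing is in flight, so polling is unnecessary.
  if (status === null) {
    return 0;
  }
  return TERMINAL_STATUSES.indexOf(status) === -1 ? 15000 : 0;
}
```

The result could be supplied as the refresh interval of the data-fetching hook; in SWR a refresh interval of 0 disables polling.
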
Security & Permissions
- RBAC: Reuse existing tenant admin role checks (`requireTenantPermission('settings.manage_portal')` or similar). Deny actions for non-admins with clear error.
- Audit logging: Insert entries via existing helper for every create, refresh, remove action.
- Validation: Enforce ASCII/LDH host rules, reject apex domains without CNAME capability, normalise to lowercase.

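A minimal sketch of the validation rule follows, covering lowercase normalisation and the ASCII letter-digit-hyphen (LDH) label rules. The function name is hypothetical, and apex detection is deliberately left out because it generally requires a public suffix list.

```typescript
// Normalises a submitted hostname and rejects anything that is not a plain
// LDH hostname. Apex-domain detection is not covered by this sketch.
function normalizeVanityHost(input: string): string {
  const host = input.trim().toLowerCase().replace(/\.$/, "");
  if (host.length === 0 || host.length > 253) {
    throw new Error("hostname must be between 1 and 253 characters");
  }
  const labels = host.split(".");
  if (labels.length < 2) {
    throw new Error("hostname must contain at least two labels");
  }
  // Each label: 1 to 63 characters, letters/digits/hyphens, no leading or trailing hyphen.
  const ldhLabel = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;
  for (let i = 0; i < labels.length; i++) {
    if (!ldhLabel.test(labels[i])) {
      throw new Error("invalid hostname label: " + labels[i]);
    }
  }
  return host;
}
```

For example, `normalizeVanityHost(" Portal.Example.COM. ")` returns `portal.example.com`, while `bad_host.example.com` is rejected because of the underscore.
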
Observability
- Temporal: Use OpenTelemetry tracing already wired in `ee/temporal-workflows` to emit spans for each activity. Tag spans with `tenant_id`, `domain`, `stage` (avoid PII beyond hostnames).
- Metrics: Emit PostHog events (counts, durations, failure reasons). Add dashboards to monitor time-to-activate, failure rate, and number of active domains.
- Logging: Standardise structured logs with correlation IDs (workflow run ID).

Progress Update – 2025-09-19
- ✅ `portal_domains` migration, enums, and model helpers landed with canonical-host storage and normalization utilities.
- ✅ CE/EE server actions implemented with RBAC enforcement, Temporal enqueue hooks, and PostHog capture; CE build returns read-only status.
- ✅ CE UI now surfaces the canonical host; EE UI delivers status badges, form submission, refresh/disable actions, and onboarding guidance.
- ✅ Workflow client + Temporal workflow now render and apply Istio Gateway/VirtualService + Certificate resources via the Kubernetes API, prune stale objects, and capture manifest snapshots when configured.
- ✅ Support documentation added (`ee/docs/guides/portal-domain-runbook.md`) describing operational flows and observability.
- ✅ Optional GitOps export writes JSON manifests for each tenant into `PORTAL_DOMAIN_MANIFEST_DIR`, ready for nm-kube-config commits.
- 🟡 Automated tests remain sparse (initial manifest-render unit test added); workflow + UI coverage still outstanding.

Implementation Plan
1. **Schema & Models** – ✅ Completed (2025-09-19)
   - Migration and model helpers merged; canonical host stored per tenant.
2. **Server Actions & API** – ✅ Completed (2025-09-19)
   - RBAC-guarded EE actions and CE stubs implemented; PostHog instrumentation in place. REST surface still optional (not started).
3. **Temporal Workflow & Activities** – 🟡 In progress
   - DNS verification + reconciliation activities now apply/prune Kubernetes resources and write manifest snapshots; remaining work covers certificate readiness polling, richer status messaging, and workflow tests.
4. **GitOps & Reconciliation Tooling** – 🟡 In progress
   - Per-tenant YAML written to `nm-kube-config/{alga-psa|argo-workflow/alga-psa-dev}/portal-domains`; still need the CLI to diff/commit/apply and docs for the automation flow.
5. **Kubernetes Reconciliation Templates & HTTP-01 Pathing** – 🟡 In progress
   - Gateway/VirtualService templates implemented with optional HTTP-01 routing via `PORTAL_DOMAIN_CHALLENGE_*`; still need to standardise the challenge-serving workload and productionize readiness probes.
6. **UI (CE + EE)** – ✅ Completed (2025-09-19)
   - CE shows default host; EE form with status badges, refresh/disable flows, and guidance shipped. Cypress/Playwright coverage outstanding.
7. **Testing & Verification** – 🔄 Not started
   - Unit/integration tests, workflow harness, and staging checklist to be added.
8. **Rollout** – 🔄 Not started
   - Migration deployment, worker release, and manual onboarding plan remain after backend completion.

Terminal Status UX Matrix
| Status | UI Treatment | Available Actions | Next-Step Guidance |
|--------|--------------|-------------------|--------------------|
| `active` | Success banner with vanity + canonical host, "Active" badge, last verified time | Refresh, Remove | Inform user domain is live; suggest confirming CNAME remains pointed correctly. |
| `disabled` | Neutral banner noting custom domain disabled and default host in use | Submit new domain, Refresh | Explain portal serves canonical host; advise submitting a new domain when ready. |
| `dns_failed` | Error banner showing last resolved target + suggested TTL wait | Refresh, Remove, Retry Registration | Prompt user to correct DNS record to the canonical host, then retry once propagated. |
| `certificate_failed` | Error banner with cert-manager status message and timestamp | Refresh, Remove, Retry Registration | Direct user to ensure HTTP-01 path is reachable; recommend contacting support if blocked. |

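The "Available Actions" column of the matrix translates directly into a lookup table that the EE panel could use to decide which buttons to render. The type and function names below are hypothetical; the status-to-action mapping is copied from the matrix.

```typescript
type TerminalStatus = "active" | "disabled" | "dns_failed" | "certificate_failed";
type DomainAction = "refresh" | "remove" | "retry_registration" | "submit_new_domain";

// Mirrors the "Available Actions" column of the matrix.
const ACTIONS_BY_STATUS: Record<TerminalStatus, DomainAction[]> = {
  active: ["refresh", "remove"],
  disabled: ["submit_new_domain", "refresh"],
  dns_failed: ["refresh", "remove", "retry_registration"],
  certificate_failed: ["refresh", "remove", "retry_registration"],
};

// Returns true when the UI should offer the action for the given status.
function isActionAllowed(status: TerminalStatus, action: DomainAction): boolean {
  return ACTIONS_BY_STATUS[status].indexOf(action) !== -1;
}
```

Keeping the mapping in one table means the server actions and the UI can share the same rules instead of duplicating status checks.
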
Resolved Questions
- Canonical target storage: store in schema (`canonical_host`).
- DNS / Certificate scope: rely on wildcard `*.portal.algapsa.com` for canonical ingress; vanity domains CNAME into it and are issued certs without extending Route53 zones.
- ACME challenge method: enforce HTTP-01 challenges by exposing `/.well-known/acme-challenge/*` through Istio or temporary challenge-serving workloads managed by the workflow.
- Reconciliation location: handled entirely inside the Temporal activity suite; no separate Kubernetes operator required.
- Resource teardown: reconciliation activity renders the full desired list and prunes anything missing, ensuring deleted domains remove all K8s resources automatically.

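The teardown rule above amounts to a set difference between what exists in the cluster and what the database says should exist. The following sketch assumes resources are identified by name; `resourcesToPrune` is a hypothetical helper, not the activity's real implementation.

```typescript
// Returns the names of managed resources that exist in the cluster but are
// absent from the desired list, sorted so the prune order is deterministic.
function resourcesToPrune(existingNames: string[], desiredNames: string[]): string[] {
  return existingNames
    .filter(function (name) {
      return desiredNames.indexOf(name) === -1;
    })
    .sort();
}
```

Because the desired list is rebuilt from all managed rows on each run, a removed domain simply stops appearing in it and its resources fall out through this difference.
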
Remaining Work & Follow-ups
- Add cert-manager readiness polling + HTTP probes before marking domains `active`; emit PostHog timings + richer status messages for failure cases.
- Finalise HTTP-01 challenge serving (shared solver service or on-demand pod) and bake the required `PORTAL_DOMAIN_CHALLENGE_*` defaults into staging/production.
- Build the GitOps helper CLI (`pnpm nm-kube-sync`) to diff manifests, open PRs, and optionally apply changes; update runbook once available.
- Expand automated coverage: workflow unit/integration tests, CE/EE action tests, and mocked EE UI e2e flows plus a staging validation checklist.
- Validate the new base VirtualService redirect management in staging once rolled out; regression coverage lives in `ee/temporal-workflows/src/activities/__tests__/portal-domain-activities.git.test.ts` to guard the `/client-portal/dashboard` default route.
- Provide operational tooling (Temporal signal CLI/script) for forced reconciliation and document the procedure in the runbook.
- Plan rollout sequencing: migration deployment order, Temporal worker release, customer enablement messaging, and nm-kube-config PR cadence.