Title: Customer Portal Custom Domains – Design & Implementation Plan
Date: 2025-09-19
Overview / Rationale
- Goal: Enable enterprise tenants to surface portal traffic on a vanity hostname while preserving seamless operation for CE tenants who remain on the default fleet domains.
- Approach: Tenants create a DNS CNAME that targets a canonical host we control (<tenant7>.portal.algapsa.com). We manage lifecycle state, DNS verification, and Istio configuration via Temporal so the portal can respond on both the canonical host and any approved vanity CNAMEs.
- Success Criteria: Exactly one active domain per tenant, accurate status visibility in the Client Portal settings UI, automated Istio/cert-manager reconciliation, and resilient observability hooks (OpenTelemetry traces, PostHog metrics) around the workflow.
- Non-goals: Supporting multiple domains per tenant, delegating DNS into customer zones, or replacing cert-manager.
Key Decisions & Clarifications
- Canonical target: Store <first 7 chars of tenant_id>.portal.algapsa.com in the database so DNS guidance remains stable even if tenant metadata shifts.
- Certificates: Re-use the existing wildcard *.portal.algapsa.com certificate for the canonical ingress path. Vanity domains CNAME to the canonical host; cert-manager issues per-domain certificates through ACME using HTTP-01, leveraging our ingress path. No new Route53 zones required.
- Reconciliation: Temporal activity generates desired manifests, writes them into the nm-kube-config/alga-psa repo, applies via kubectl/Helm, and commits back to Git for auditability (no standalone operator).
- Istio config source: nm-kube-config/alga-psa is the single source of truth; the activity rewrites YAML based on DB state, ensuring every change is versioned prior to applying in-cluster.
- Observability: Use our existing OpenTelemetry setup for workflow/activity spans and PostHog for counters and timings.
- Security: Gate all settings actions with RBAC (tenant admins only) and audit via existing logging helpers.
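The canonical-target rule above can be sketched as a small pure function. This is an illustrative sketch only: the helper name canonicalHostFor and the lowercasing step are assumptions, not part of the committed code.

```typescript
// Hypothetical helper: derives the canonical portal host from a tenant ID.
const PORTAL_BASE_DOMAIN = "portal.algapsa.com";
const TENANT_PREFIX_LENGTH = 7;

function canonicalHostFor(tenantId: string): string {
  // Assumption: hostnames are stored lowercase, so the prefix is lowercased too.
  const prefix = tenantId.trim().toLowerCase().slice(0, TENANT_PREFIX_LENGTH);
  if (prefix.length < TENANT_PREFIX_LENGTH) {
    throw new Error(`tenant_id too short to derive a canonical host: "${tenantId}"`);
  }
  return `${prefix}.${PORTAL_BASE_DOMAIN}`;
}
```

Persisting the result rather than recomputing it on read is what keeps the DNS guidance stable if tenant metadata changes later.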
Current State Snapshot
- UI: CE build now shows the canonical hosted domain with an Enterprise badge; EE build ships a fully wired domain form (status, refresh/disable controls, DNS instructions) backed by server actions.
- CE vs EE: Webpack alias continues to map @ee/* to CE stubs, with an enterprise override providing the rich settings panel dynamically.
- Persistence & actions: portal_domains table, model helpers, and CE/EE server actions are implemented and committed.
- Temporal scaffolding: Workflow client, DNS verification activity, and reconciliation handle manifest generation; next iteration moves reconciliation to GitOps by rewriting nm-kube-config manifests, applying them, and committing the diff.
- Istio & cert-manager: algapsa-gateway, apps-gateway, and apps-gateway-auto continue to terminate traffic with cert-manager issued secrets. Per-tenant Gateway + VirtualService resources are generated automatically; HTTP-01 challenge routing can be enabled via PORTAL_DOMAIN_CHALLENGE_* env configuration.
Target Tenant Experience
- Admin (EE tenant) opens Settings → Client Portal.
- Page displays current domain state (none, pending, active, failed) plus canonical target instructions.
- Admin submits a vanity host (single domain). The backend persists the request, kicks off Temporal, and shows pending_dns.
- Temporal verifies the CNAME points at the canonical target; success transitions to certificate provisioning and reconciliation.
- Once cert-manager reports Ready and Istio resources are synced, status flips to active. Failures surface actionable status_message strings.
- Admin can trigger a refresh or remove the domain; removal re-runs reconciliation to prune Kubernetes resources.
- CE tenants see only the default portal domain card with no editable controls.
Architecture & Components
Database Schema: portal_domains
- Migration via Knex (follow docs/AI_coding_standards.md). Lives in CE repo so schema exists everywhere; EE code governs usage.
- Columns:
  - id UUID PK
  - tenant_id FK → tenant.tenant_id
  - domain CITEXT unique (vanity hostname requested by tenant)
  - canonical_host CITEXT (stored <tenant7>.portal.algapsa.com), unique per tenant
  - status ENUM: pending_dns, verifying_dns, dns_failed, pending_certificate, certificate_issuing, certificate_failed, deploying, active, disabled
  - status_message TEXT (human-friendly guidance)
  - last_checked_at timestamptz
  - verification_method ENUM default cname
  - verification_details JSONB (e.g. { "expected_cname": "abc1234.portal.algapsa.com" })
  - certificate_secret_name TEXT (portal-domain-{tenant_id})
  - last_synced_resource_version TEXT (for VirtualService/Gateway tracking)
  - Timestamps: created_at, updated_at
  - Unique constraint (tenant_id)
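To make the status enum easier to reason about, the sketch below encodes one possible transition map. The status names come from the column list above; the allowed transitions are inferred from the tenant experience described earlier and are an assumption, as are the names NEXT_STATUSES and canTransition.

```typescript
type PortalDomainStatus =
  | "pending_dns"
  | "verifying_dns"
  | "dns_failed"
  | "pending_certificate"
  | "certificate_issuing"
  | "certificate_failed"
  | "deploying"
  | "active"
  | "disabled";

// Assumed transitions: the happy path moves forward, failures allow a retry
// from pending_dns, and any state can be disabled by the tenant.
const NEXT_STATUSES: Record<PortalDomainStatus, readonly PortalDomainStatus[]> = {
  pending_dns: ["verifying_dns", "disabled"],
  verifying_dns: ["pending_certificate", "dns_failed", "disabled"],
  dns_failed: ["pending_dns", "disabled"],
  pending_certificate: ["certificate_issuing", "disabled"],
  certificate_issuing: ["deploying", "certificate_failed", "disabled"],
  certificate_failed: ["pending_dns", "disabled"],
  deploying: ["active", "disabled"],
  active: ["pending_dns", "disabled"],
  disabled: ["pending_dns"],
};

// Returns true when moving from `from` to `to` is allowed by the map above.
function canTransition(from: PortalDomainStatus, to: PortalDomainStatus): boolean {
  return NEXT_STATUSES[from].indexOf(to) !== -1;
}
```

A guard like this could sit in the shared repository helper so that no caller can skip straight from pending_dns to active.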
Server Layer (RBAC-aware)
- Actions exposed under server/src/lib/actions/tenant-actions/portalDomainActions.ts:
  - getPortalDomainStatus (read current row + derived hints). Always safe for CE but returns read-only stub when edition ≠ EE.
  - requestPortalDomainRegistration(domain) (validate, ensure Admin RBAC, persist row, enqueue workflow).
  - refreshPortalDomainStatus() (poll DB + optionally trigger Temporal query to tighten freshness).
  - disablePortalDomain() (mark disabled, signal workflow, enqueue reconciliation).
- REST endpoints (optional) under /api/settings/client-portal/domain to support CLI/automation. Wrap each handler with RBAC guard + audit trail entry.
- Edition gating: CE build exports stubs returning the default hosted domain.
Temporal Workflow & Activities
- Workflow: PortalDomainRegistrationWorkflow located at ee/temporal-workflows/src/workflows/portal-domains/registration.workflow.ts.
- Activities (all in ee/temporal-workflows/src/activities/portal-domains):
  - recordStatus(domainId, status, message?) – updates DB row via shared repository helper.
  - verifyCname(domain, canonicalHost) – performs repeated DNS lookups until two consecutive matches (5 min backoff). On timeout → dns_failed with guidance.
  - renderAndApplyKubernetesState() – fetches all managed rows, renders Istio Gateway/VirtualService and cert-manager Certificate manifests, writes optional JSON snapshots to the GitOps worktree (PORTAL_DOMAIN_MANIFEST_DIR), applies updates via the Kubernetes API client, records resource versions, and prunes anything no longer present.
  - waitForCertificateReady(namespace, certificateName) – polls cert-manager status via K8s API, emits PostHog metrics (issuance duration) and OpenTelemetry spans, and confirms HTTP-01 challenges are reachable (failing fast with actionable messaging when challenge pods lack ingress).
  - waitForIstioSync(domainId) – checks VirtualService/Gateway resourceVersion matches recorded value; updates DB.
  - finalizeActivation(domainId) – optional HTTP probe to https://domain to confirm 200 status, then mark active.
  - Shared handleFailure(domainId, error, stage) – surfaces sanitized details to status_message and emits PostHog failure metric.
- Signals:
  - removeDomain – invoked when tenant disables domain to short-circuit and move to tear-down path.
- Observability: Wrap workflow + activities with OpenTelemetry instrumentation (workflow.logger + trace) and PostHog event names (portal_domain.dns_verified, portal_domain.cert_ready, etc.).
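The "two consecutive matches" rule in verifyCname can be isolated as a pure decision function, which keeps it easy to unit test without real DNS. The sketch below is illustrative: normalizeHost and hasConsecutiveMatches are hypothetical names, and treating a lookup as a match when any returned record equals the expected target is an assumption. In an activity, each inner array would hold the result of one lookup (for example from dns.promises.resolveCname in Node's dns module), taken roughly five minutes apart.

```typescript
// Normalise a DNS name for comparison: lowercase, without a trailing dot.
function normalizeHost(name: string): string {
  return name.trim().toLowerCase().replace(/\.$/, "");
}

// Returns true once `required` consecutive lookups each contain the expected
// CNAME target. A non-matching lookup resets the streak to zero.
function hasConsecutiveMatches(
  lookups: readonly (readonly string[])[],
  expectedCname: string,
  required: number = 2,
): boolean {
  const expected = normalizeHost(expectedCname);
  let streak = 0;
  for (const answers of lookups) {
    const matched = answers.some((answer) => normalizeHost(answer) === expected);
    streak = matched ? streak + 1 : 0;
    if (streak >= required) {
      return true;
    }
  }
  return false;
}
```

Requiring a streak rather than a single hit guards against a resolver briefly returning a cached or partially propagated record.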
Kubernetes & TLS Strategy
- Canonical host: Each tenant routes through <tenant7>.portal.algapsa.com. This stays backed by the wildcard certificate we already manage (*.portal.algapsa.com).
- Vanity host: When the tenant’s vanity domain CNAMEs to the canonical host, cert-manager issues an individual certificate using the HTTP-01 solver. We must ensure Istio routes /.well-known/acme-challenge/* to cert-manager’s challenge service; if that path is unavailable we will have the Temporal workflow render the required challenge response assets until issuance completes. Secrets named portal-domain-{tenant} live in namespace msp.
- Resources:
  - Certificate per vanity domain (namespace msp, issuerRef from PORTAL_DOMAIN_CERT_* env, default letsencrypt-dns) with deterministic secret name portal-domain-<tenant7>.
  - Gateway per domain (namespace istio-system by default) exposes HTTP→HTTPS redirect and SNI-bound TLS server using the generated secret.
  - VirtualService per domain (namespace msp) routes vanity traffic to sebastian.msp.svc.cluster.local:3000; optional /.well-known/acme-challenge/* route forwards to PORTAL_DOMAIN_CHALLENGE_HOST when enabled.
  - Reconciliation flushes the desired resource set on every run and prunes labelled resources that no longer map to active domains.
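As a rough illustration of the three resources listed above, the sketch below renders them as plain objects for one vanity domain. It is not the committed renderer: the function and option names, the ClusterIssuer kind, the ingress gateway selector, and the managed-by label are all assumptions. The optional ACME challenge route is omitted for brevity.

```typescript
interface PortalDomainInput {
  tenantId: string; // assumed to be a UUID string
  domain: string; // vanity hostname, already validated and lowercased
}

interface RenderOptions {
  workloadNamespace: string;
  gatewayNamespace: string;
  issuerName: string;
  issuerKind: "Issuer" | "ClusterIssuer";
  serviceHost: string;
  servicePort: number;
}

const DEFAULT_OPTIONS: RenderOptions = {
  workloadNamespace: "msp",
  gatewayNamespace: "istio-system",
  issuerName: "letsencrypt-dns",
  issuerKind: "ClusterIssuer", // assumption: the issuer kind is not stated in this plan
  serviceHost: "sebastian.msp.svc.cluster.local",
  servicePort: 3000,
};

function renderPortalDomainManifests(
  input: PortalDomainInput,
  options: RenderOptions = DEFAULT_OPTIONS,
) {
  const name = `portal-domain-${input.tenantId.slice(0, 7)}`;
  // Hypothetical label used to find managed resources when pruning.
  const labels = { "app.kubernetes.io/managed-by": "portal-domains" };

  const certificate = {
    apiVersion: "cert-manager.io/v1",
    kind: "Certificate",
    metadata: { name, namespace: options.workloadNamespace, labels },
    spec: {
      secretName: name,
      dnsNames: [input.domain],
      issuerRef: { name: options.issuerName, kind: options.issuerKind },
    },
  };

  const gateway = {
    apiVersion: "networking.istio.io/v1beta1",
    kind: "Gateway",
    metadata: { name, namespace: options.gatewayNamespace, labels },
    spec: {
      selector: { istio: "ingressgateway" }, // assumption: default ingress gateway labels
      servers: [
        {
          port: { number: 80, name: "http", protocol: "HTTP" },
          hosts: [input.domain],
          tls: { httpsRedirect: true },
        },
        {
          port: { number: 443, name: "https", protocol: "HTTPS" },
          hosts: [input.domain],
          tls: { mode: "SIMPLE", credentialName: name },
        },
      ],
    },
  };

  const virtualService = {
    apiVersion: "networking.istio.io/v1beta1",
    kind: "VirtualService",
    metadata: { name, namespace: options.workloadNamespace, labels },
    spec: {
      hosts: [input.domain],
      gateways: [`${options.gatewayNamespace}/${name}`],
      http: [
        {
          route: [
            { destination: { host: options.serviceHost, port: { number: options.servicePort } } },
          ],
        },
      ],
    },
  };

  return { certificate, gateway, virtualService };
}
```

One point worth checking during implementation: Istio resolves credentialName against secrets in the ingress gateway workload's namespace, so how the secret issued in msp becomes visible to the gateway is outside this sketch.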
GitOps Workflow (nm-kube-config)
- Production manifests live under ~/nm-kube-config/alga-psa/portal-domains/<tenantSlug>.yaml. Each file contains the rendered Certificate, Gateway, and VirtualService separated by --- so kubectl/Helm can apply them directly.
- Staging (hv-dev2) mirrors the layout at ~/nm-kube-config/argo-workflow/alga-psa-dev/portal-domains/<tenantSlug>.yaml to keep dev/test traffic isolated. The workflow picks the target root based on environment.
- PORTAL_DOMAIN_MANIFEST_DIR points at the appropriate root folder; on every reconciliation the activity rewrites the per-tenant YAML from database state (sorted keys, deterministic ordering) so Git diffs stay readable.
- After files are updated the activity runs kubectl apply -f <tenantSlug>.yaml (or batched apply) against the cluster, stages the changes with git add portal-domains, commits with a message like chore(portal-domains): sync <tenantSlug>, and pushes to the shared repo.
- A helper CLI (pnpm nm-kube-sync) will encapsulate diff detection, safe commit messages, optional PR creation, and fall back to printing kubectl commands for manual review when auto-apply is disabled.
- Operational playbook: review the generated Git diff, merge/push (or approve the automation’s push), verify Argo/Flux sync health, then trigger the Temporal refresh action so DB status aligns with the cluster.
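The "sorted keys, deterministic ordering" requirement is what keeps Git diffs readable. A minimal sketch of the idea follows, using JSON output for simplicity; a YAML serializer would apply the same key ordering before dumping. The names Json, sortKeysDeep, and stableStringify are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively rebuilds objects with keys inserted in sorted order.
// Array element order is preserved because it is semantically meaningful.
function sortKeysDeep(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map((item) => sortKeysDeep(item));
  }
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeysDeep(value[key]);
    }
    return sorted;
  }
  return value;
}

// Serialises with stable key order and a trailing newline, so that the same
// database state always produces byte-identical file contents.
function stableStringify(value: Json): string {
  return JSON.stringify(sortKeysDeep(value), null, 2) + "\n";
}
```

With this in place, a reconciliation run that changes nothing in the database produces no diff, and the commit step can be skipped.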
UI Updates (CE vs EE)
- CE (server/src/components/settings/general/ClientPortalSettings.tsx): Replace “Coming Soon” with a read-only card highlighting default portal address (<tenant7>.portal.algapsa.com) and note that custom domains require Enterprise.
- EE (ee/server/src/components/settings/general/ClientPortalSettings.tsx):
  - Status banner showing domain, canonical_host, status, last_checked_at, and status_message.
  - Form with input id="client-portal-domain-input", submit button id="client-portal-domain-submit", and optional id="client-portal-domain-refresh" button to poll immediately.
  - DNS instructions card (canonical host, sample CNAME record) and hints for propagation delays.
  - Error panel listing actionable steps (derived from status_message).
  - Remove button (id="client-portal-domain-remove") once status is non-pending.
  - Poll status with SWR (15s) while in non-terminal state; respect AI coding standards for naming, instrumentation IDs, and toast usage.
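The polling rule for the EE panel can be expressed as a small function that SWR calls with the latest data. The sketch below is illustrative; the function name is hypothetical, and the terminal set is taken from the Terminal Status UX Matrix later in this plan.

```typescript
// Statuses after which the workflow will not change state on its own.
const TERMINAL_STATUSES: readonly string[] = [
  "active",
  "disabled",
  "dns_failed",
  "certificate_failed",
];

const POLL_INTERVAL_MS = 15000;

// Returns the polling interval in milliseconds, or 0 to stop polling.
// `status` is undefined when no domain row exists or data has not loaded yet.
function portalDomainRefreshInterval(status: string | undefined): number {
  if (status === undefined) {
    return 0;
  }
  return TERMINAL_STATUSES.indexOf(status) === -1 ? POLL_INTERVAL_MS : 0;
}
```

SWR's refreshInterval option accepts either a number or a function of the latest data, so the hook could be configured along the lines of refreshInterval: (data) => portalDomainRefreshInterval(data?.status), where the shape of data is an assumption.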
Security & Permissions
- RBAC: Reuse existing tenant admin role checks (requireTenantPermission('settings.manage_portal') or similar). Deny actions for non-admins with clear error.
- Audit logging: Insert entries via existing helper for every create, refresh, remove action.
- Validation: Enforce ASCII/LDH host rules, reject apex domains without CNAME capability, normalise to lowercase.
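The validation rules above might look roughly like the following. This is a sketch under stated assumptions: validateVanityDomain is a hypothetical name, and the apex check is a crude label-count heuristic (a production check would more likely consult the Public Suffix List, since names such as example.co.uk are apex domains with three labels).

```typescript
// One DNS label under LDH rules: letters, digits, hyphens, 1-63 characters,
// and no leading or trailing hyphen.
const LDH_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

type DomainValidation =
  | { ok: true; domain: string }
  | { ok: false; reason: string };

function validateVanityDomain(input: string): DomainValidation {
  // Normalise first: trim, lowercase, and drop a trailing root dot.
  const domain = input.trim().toLowerCase().replace(/\.$/, "");
  if (domain.length === 0 || domain.length > 253) {
    return { ok: false, reason: "Hostname must be between 1 and 253 characters." };
  }
  const labels = domain.split(".");
  if (labels.length < 3) {
    // Heuristic: fewer than three labels is treated as an apex domain,
    // which generally cannot carry a CNAME record.
    return { ok: false, reason: "Use a subdomain such as portal.example.com; apex domains cannot use a CNAME." };
  }
  for (const label of labels) {
    if (!LDH_LABEL.test(label)) {
      return { ok: false, reason: `Invalid hostname label: "${label}".` };
    }
  }
  return { ok: true, domain };
}
```

Returning a reason string alongside the failure lets the server action surface it directly in status_message or a form error.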
Observability
- Temporal: Use OpenTelemetry tracing already wired in ee/temporal-workflows to emit spans for each activity. Tag spans with tenant_id, domain, stage (avoid PII beyond hostnames).
- Metrics: Emit PostHog events (counts, durations, failure reasons). Add dashboards to monitor time-to-activate, failure rate, and number of active domains.
- Logging: Standardise structured logs with correlation IDs (workflow run ID).
Progress Update – 2025-09-19
- ✅ portal_domains migration, enums, and model helpers landed with canonical-host storage and normalization utilities.
- ✅ CE/EE server actions implemented with RBAC enforcement, Temporal enqueue hooks, and PostHog capture; CE build returns read-only status.
- ✅ CE UI now surfaces the canonical host; EE UI delivers status badges, form submission, refresh/disable actions, and onboarding guidance.
- ✅ Workflow client + Temporal workflow now render and apply Istio Gateway/VirtualService + Certificate resources via the Kubernetes API, prune stale objects, and capture manifest snapshots when configured.
- ✅ Support documentation added (ee/docs/guides/portal-domain-runbook.md) describing operational flows and observability.
- ✅ Optional GitOps export writes JSON manifests for each tenant into PORTAL_DOMAIN_MANIFEST_DIR, ready for nm-kube-config commits.
- 🟡 Automated tests remain sparse (initial manifest-render unit test added); workflow + UI coverage still outstanding.
Implementation Plan
- Schema & Models – ✅ Completed (2025-09-19)
  - Migration and model helpers merged; canonical host stored per tenant.
- Server Actions & API – ✅ Completed (2025-09-19)
  - RBAC-guarded EE actions and CE stubs implemented; PostHog instrumentation in place. REST surface still optional (not started).
- Temporal Workflow & Activities – 🟡 In progress
  - DNS verification + reconciliation activities now apply/prune Kubernetes resources and write manifest snapshots; remaining work covers certificate readiness polling, richer status messaging, and workflow tests.
- GitOps & Reconciliation Tooling – 🟡 In progress
  - Per-tenant YAML written to nm-kube-config/{alga-psa|argo-workflow/alga-psa-dev}/portal-domains; still need the CLI to diff/commit/apply and docs for the automation flow.
- Kubernetes Reconciliation Templates & HTTP-01 Pathing – 🟡 In progress
  - Gateway/VirtualService templates implemented with optional HTTP-01 routing via PORTAL_DOMAIN_CHALLENGE_*; still need to standardise the challenge-serving workload and productionize readiness probes.
- UI (CE + EE) – ✅ Completed (2025-09-19)
  - CE shows default host; EE form with status badges, refresh/disable flows, and guidance shipped. Cypress/Playwright coverage outstanding.
- Testing & Verification – 🔄 Not started
  - Unit/integration tests, workflow harness, and staging checklist to be added.
- Rollout – 🔄 Not started
  - Migration deployment, worker release, and manual onboarding plan remain after backend completion.
Terminal Status UX Matrix
| Status | UI Treatment | Available Actions | Next-Step Guidance |
|---|---|---|---|
| active | Success banner with vanity + canonical host, "Active" badge, last verified time | Refresh, Remove | Inform user domain is live; suggest confirming CNAME remains pointed correctly. |
| disabled | Neutral banner noting custom domain disabled and default host in use | Submit new domain, Refresh | Explain portal serves canonical host; advise submitting a new domain when ready. |
| dns_failed | Error banner showing last resolved target + suggested TTL wait | Refresh, Remove, Retry Registration | Prompt user to correct DNS record to the canonical host, then retry once propagated. |
| certificate_failed | Error banner with cert-manager status message and timestamp | Refresh, Remove, Retry Registration | Direct user to ensure HTTP-01 path is reachable; recommend contacting support if blocked. |
Resolved Questions
- Canonical target storage: store in schema (canonical_host).
- DNS / Certificate scope: rely on wildcard *.portal.algapsa.com for canonical ingress; vanity domains CNAME into it and are issued certs without extending Route53 zones.
- ACME challenge method: enforce HTTP-01 challenges by exposing /.well-known/acme-challenge/* through Istio or temporary challenge-serving workloads managed by the workflow.
- Reconciliation location: handled entirely inside the Temporal activity suite; no separate Kubernetes operator required.
- Resource teardown: reconciliation activity renders the full desired list and prunes anything missing, ensuring deleted domains remove all K8s resources automatically.
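The teardown rule amounts to a set difference between what the cluster currently holds and what the database says should exist. A minimal sketch, with hypothetical names ResourceRef, resourceKey, and planPrune, and assuming the existing list has already been narrowed to resources carrying the managed label:

```typescript
interface ResourceRef {
  kind: string;
  namespace: string;
  name: string;
}

// Builds a stable identity string for a resource.
function resourceKey(ref: ResourceRef): string {
  return `${ref.kind}/${ref.namespace}/${ref.name}`;
}

// Returns the existing resources that are absent from the desired set and
// should therefore be deleted.
function planPrune(
  desired: readonly ResourceRef[],
  existing: readonly ResourceRef[],
): ResourceRef[] {
  const keep: { [key: string]: boolean } = {};
  for (const ref of desired) {
    keep[resourceKey(ref)] = true;
  }
  return existing.filter((ref) => keep[resourceKey(ref)] !== true);
}
```

Because the desired list is rendered in full on every run, a disabled or deleted domain simply stops appearing in it, and its Certificate, Gateway, and VirtualService fall into the prune list.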
Remaining Work & Follow-ups
- Add cert-manager readiness polling + HTTP probes before marking domains active; emit PostHog timings + richer status messages for failure cases.
- Finalise HTTP-01 challenge serving (shared solver service or on-demand pod) and bake the required PORTAL_DOMAIN_CHALLENGE_* defaults into staging/production.
- Build the GitOps helper CLI (pnpm nm-kube-sync) to diff manifests, open PRs, and optionally apply changes; update runbook once available.
- Expand automated coverage: workflow unit/integration tests, CE/EE action tests, and mocked EE UI e2e flows plus a staging validation checklist.
- Validate the new base VirtualService redirect management in staging once rolled out; regression coverage lives in ee/temporal-workflows/src/activities/__tests__/portal-domain-activities.git.test.ts to guard the /client-portal/dashboard default route.
- Provide operational tooling (Temporal signal CLI/script) for forced reconciliation and document the procedure in the runbook.
- Plan rollout sequencing: migration deployment order, Temporal worker release, customer enablement messaging, and nm-kube-config PR cadence.