alldigital/PSA

Fork 0

Hermes 284313f908

Bidi Control Character Guard / bidi-control-guard (push) Waiting to run

Details

Circular Dependency Check / Check for new circular dependencies (push) Waiting to run

Details

Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run

Details

E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run

Details

ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run

Details

Integration Tests / Check for relevant changes (push) Waiting to run

Details

Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions

Details

Mobile checks / Mobile lint + typecheck (push) Waiting to run

Details

Mobile checks / Mobile unit tests (push) Waiting to run

Details

Mobile checks / Mobile dependency audit (report) (push) Waiting to run

Details

Mobile checks / Mobile reproducibility checks (push) Waiting to run

Details

Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run

Details

Temporal Readiness / fast-readiness (push) Waiting to run

Details

Temporal Readiness / docker-parity (push) Waiting to run

Details

TypeScript Type Check / Nx affected typecheck (push) Waiting to run

Details

Unit Tests / Skipped-test budget (push) Waiting to run

Details

Unit Tests / Nx affected unit tests (push) Waiting to run

Details

Unit Tests / Server unit coverage (informational) (push) Waiting to run

Details

Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run

Details

Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions

Details

EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run

Details

Initial import of AlgaPSA codebase from PSA server

Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech

2026-06-22 16:12:17 -05:00

13 KiB

Raw Permalink Blame History

PRD: Tiered Appliance Bootstrap Status

Summary

Fresh on-premise appliance bootstrap currently feels opaque and overly fragile. A single long-running bootstrap can spend tens of minutes pulling large images, waiting on Helm timeouts, or hiding lower-level blockers behind generic context deadline exceeded messages. Non-login-critical services can also make the whole appliance appear unavailable even when the Alga core application is already usable.

Introduce a tiered appliance bootstrap model with an early token-protected status web UI. The appliance should clearly distinguish platform readiness, core business readiness, login readiness, background-service readiness, and full health. The first customer-visible milestone is Ready to log in, not “all optional/background services healthy.”

Problem Statement

During a local Talos/UTM appliance bootstrap, the install took far longer than expected and required manual investigation. Time was spent on:

Talos installer image pull blocked by DNS (192.168.64.1:53 refused queries).
A fresh reset helper failure (target: unbound variable).
Large alga-psa-ee image pull taking roughly 16 minutes.
alga-core Helm install waiting until a 30-minute timeout while Postgres was blocked by a PVC subPath error.
Temporal Helm install timing out because the live deployment did not run autosetup.
Temporal UI failing from Kubernetes service-link environment variable collision.
Background worker releases blocked by missing image tags.

The operator had to inspect Talos console output, Kubernetes events, HelmRelease conditions, pod descriptions, and logs manually to understand whether the install was progressing, blocked, login-ready, or only background-degraded.

Goals

Provide an early status web UI during bootstrap at a predictable node URL, protected by a generated token.
Define tiered readiness so LOGIN_READY is distinct from FULLY_HEALTHY.
Make the bootstrap CLI print the status URL/token and emit richer phase progress.
Split or organize appliance Flux/Helm rollout so background services do not block core login readiness.
Surface specific, actionable blockers instead of generic Helm timeouts.
Preserve a stable status data model that can later be backed by a controller/CRD without rewriting the UI.
Fold the concrete issues discovered during the local UTM/Talos bootstrap into durable chart/operator fixes.

Non-goals

Building a full replacement for Flux or Helm.
Making every background service optional for all production deployments.
Implementing a complete appliance management portal beyond bootstrap/status/support diagnostics.
Exposing secret values or privileged Kubernetes mutation capabilities in the first status UI.
Solving image publishing/release automation comprehensively beyond validating referenced tags before install.

Users and Personas

Customer/Admin Installer

Needs to know whether the appliance is installing, ready to log in, degraded, or requiring support. Should not need Kubernetes knowledge.

Alga Support / Operator

Needs exact technical blockers, relevant events, component health, image pull failures, HelmRelease states, bootstrap job summaries, and support-bundle entry points.

Developer / Release Engineer

Needs release validation feedback when a manifest references missing image tags or chart wiring causes predictable bootstrap failure.

Primary User Flow

Admin starts appliance bootstrap.
CLI completes Talos/Kubernetes/platform prerequisites.
CLI installs the early appliance-status service.

CLI prints:

Appliance status UI:
  URL:   http://<node-ip>:8080
  Token: <generated-token>

Admin opens the status UI with the token.
Status page shows current phase and whether login is available.
Core Alga reaches LOGIN_READY; UI shows login URL.
Background services continue installing.
If background services fail, UI shows Ready with background issues and specific remediation guidance.

Readiness Model

Readiness is tiered, not binary.

PLATFORM_READY

Required:

Talos node installed and booted from disk.
Kubernetes API reachable.
Node Ready and schedulable.
CoreDNS healthy.
DNS/outbound HTTPS checks pass.
local-path storage installed.
storage smoke test passes.
Flux controllers running.
appliance-status reachable.

CORE_READY

Required:

Postgres ready.
Redis ready.
PgBouncer ready if the app runtime uses PgBouncer.
DB and Redis credentials exist.
required PVCs bound.
no login-critical pod in CreateContainerConfigError, ImagePullBackOff, or CrashLoopBackOff.

BOOTSTRAP_READY

Required:

bootstrap job completed.
database exists.
migrations completed.
seed data exists.
representative seed query passes, such as server.users count greater than zero.
bootstrap mode is no longer destructive fresh semantics after successful first run.

LOGIN_READY

Definition selected for this plan: core business ready.

Required:

CORE_READY.
BOOTSTRAP_READY.
Alga web deployment ready.
app URL responds.
dashboard/login redirect works.
public login URL known.

Not required:

email service.
Temporal.
workflow worker.
temporal worker.
optional integrations.

BACKGROUND_READY

Required background services healthy for the selected release/profile:

email-service.
Temporal and Temporal UI if enabled.
workflow-worker.
temporal-worker.
future integration/background services.

FULLY_HEALTHY

Required:

LOGIN_READY.
BACKGROUND_READY.
all selected HelmReleases ready.
no unacknowledged critical warnings.
no missing image tags.
no image drift.

User-facing Rollup States

State	Meaning
Installing	Platform/core/bootstrap/login not complete and no hard blocker yet.
Ready to log in	`LOGIN_READY=true`; background services may still be installing.
Ready with background issues	`LOGIN_READY=true`; at least one background component failed or is blocked.
Fully healthy	`FULLY_HEALTHY=true`.
Failed / action required	A platform/core/bootstrap/login blocker prevents use.

Proposed Architecture

Early Status Service

Add a new early-installed chart or manifest set:

appliance-status

Expose it on a predictable node URL:

http://<node-ip>:8080

The service is token-protected. The bootstrap script/operator generates a token, stores it locally and in-cluster, and prints it to the admin.

Local path:

~/.alga-psa-appliance/<site-id>/status-token

In-cluster Secret:

namespace: appliance-system
secret: appliance-status-auth
key: token

The status service uses read-only Kubernetes RBAC and reads:

Nodes.
Pods.
Jobs.
PVCs.
Events.
Flux GitRepository.
Flux Kustomization.
Flux HelmRelease.
selected ConfigMaps/Secrets metadata, not secret values.

Hybrid Collector Model

First version may be a small web service that reads Kubernetes directly. Its internal status schema must remain clean and stable so a future controller can publish the same model via CRD or ConfigMap.

UI Layers

Overview

Simple customer/admin status:

Status: Ready to log in
Login URL: http://<node-ip>:3000
Background: 2 services need attention
Top issue: workflow-worker image tag not found

Advanced Diagnostics

Support/operator diagnostics:

readiness tiers.
component table.
top blockers.
recent Kubernetes events.
HelmRelease conditions.
image pull state.
bootstrap job summary.
support bundle action in a later milestone.

Status API

Expose one canonical status document:

GET /api/status

Representative shape:

{
  "siteId": "appliance-single-node",
  "timestamp": "2026-04-30T02:08:10Z",
  "release": {
    "selectedReleaseVersion": "1.0-rc5",
    "appVersion": "1.0-rc5",
    "channel": "candidate",
    "gitRevision": "release/1.0-rc5@sha1:979e2079..."
  },
  "urls": {
    "statusUrl": "http://192.168.64.8:8080",
    "loginUrl": "http://192.168.64.8:3000"
  },
  "rollup": {
    "state": "ready_with_background_issues",
    "message": "Alga is ready to log in. Some background services need attention.",
    "nextAction": "Log in to Alga, then review background service issues."
  },
  "tiers": {
    "platform": { "ready": true, "status": "healthy" },
    "core": { "ready": true, "status": "healthy" },
    "bootstrap": { "ready": true, "status": "healthy" },
    "login": { "ready": true, "status": "healthy" },
    "background": { "ready": false, "status": "degraded" },
    "fullHealth": { "ready": false, "status": "degraded" }
  },
  "topBlockers": [
    {
      "severity": "background",
      "component": "workflow-worker",
      "layer": "image",
      "reason": "Image tag not found: ghcr.io/nine-minds/workflow-worker:61e4a00e",
      "nextAction": "Publish the missing image tag or update the appliance release manifest.",
      "loginBlocking": false
    }
  ]
}

Blocker Detection Requirements

The collector must translate low-level conditions into actionable messages.

Examples from the observed bootstrap:

Low-level signal	User-facing blocker
`lookup factory.talos.dev on 192.168.64.1:53: connection refused`	DNS resolver failure; configure explicit DNS servers and retry.
`failed to create subPath directory for volumeMount "db-data"`	Postgres PVC initialization failed; repair/recreate PVC subPath.
`ImagePullBackOff` and `not found`	image tag not found; publish missing tag or update release manifest.
`context canceled` during image pull	image pull interrupted; wait for retry or restart pod.
Helm install `context deadline exceeded` plus pod-level DB failure	report DB/PVC blocker, not generic Helm timeout.
Temporal `sql schema version compatibility check failed`	Temporal schema not initialized; verify autosetup or schema job.
Temporal UI `cannot unmarshal tcp://... into int`	Kubernetes service-link environment collision; disable service links.

Bootstrap and Flux Flow

Phase 0: Host/Talos bootstrap

Talos maintenance API reachable.
disk detected.
machine config applied.
installer image pulled.
Kubernetes bootstrapped.
kubeconfig retrieved.
node Ready.

Phase 1: Platform prerequisites

CoreDNS resolver config.
local-path storage.
storage smoke test.
Flux controllers.

Phase 2: Status service

Generate token.
Create appliance-system namespace and status auth Secret.
Install appliance-status.
Print status URL and token.

Phase 3: Core app

Install login-critical services.
Wait for CORE_READY, BOOTSTRAP_READY, and LOGIN_READY.
Print login URL as soon as ready.

Phase 4: Background services

Install email-service, Temporal, workflow-worker, temporal-worker, and future background services.
Failures set Ready with background issues rather than blocking login.

Phase 5: Full health/support

Report FULLY_HEALTHY when all selected services are healthy.
Provide support bundle and remediation controls in later iterations.

Flux/Helm Organization

Restructure appliance Flux resources into tiered groups:

ee/appliance/flux/base/
  platform/
    appliance-status.yaml
  core/
    alga-core.yaml
    pgbouncer.yaml
  background/
    temporal.yaml
    email-service.yaml
    workflow-worker.yaml
    temporal-worker.yaml

Prefer separate Flux Kustomizations:

alga-platform
alga-core
alga-background

Dependencies:

alga-core depends on alga-platform
alga-background depends on alga-core

alga-background failure must not change LOGIN_READY=false.

Implementation Milestones

Status schema and CLI visibility.
Early appliance-status chart/service.
Core/background tier split.
Durable chart/operator fixes from the observed run.
Support bundle and guided remediation actions.

Durable Fixes from Observed Run

Explicit DNS support and visibility in Talos bootstrap.
Fix reset-appliance-data.sh target: unbound variable failure.
Prevent operator wrapper from overwriting valid Talos credentials when explicit kubeconfig/talosconfig are supplied.
Fix or avoid Postgres PVC subPath initialization failure.
Ensure Temporal chart runs autosetup correctly.
Disable service links for Temporal UI and any other pod vulnerable to service-env collisions.
Validate release manifest image tags before applying background releases.
Classify large image pulls separately from stalled installs.

Acceptance Criteria

Fresh appliance bootstrap prints a status URL and token within the first few minutes after Kubernetes is ready.
Status UI is reachable at http://<node-ip>:8080 and requires the generated token.
Status UI reports tiered readiness and top blockers.
Alga core reaching LOGIN_READY is reported independently of background services.
A missing workflow-worker image tag produces Ready with background issues when login is otherwise available.
Generic Helm timeout is not the top blocker when a lower-level pod/event cause is known.
Release manifest validation catches missing background image tags before or during background install.
A local UTM/Talos smoke run can reproduce: status UI early, login-ready core, and background-degraded reporting.

13 KiB Raw Permalink Blame History

PRD: Tiered Appliance Bootstrap Status

Summary

Problem Statement

Goals

Non-goals

Users and Personas

Customer/Admin Installer

Alga Support / Operator

Developer / Release Engineer

Primary User Flow

Readiness Model

PLATFORM_READY

CORE_READY

BOOTSTRAP_READY

LOGIN_READY

BACKGROUND_READY

FULLY_HEALTHY

User-facing Rollup States

Proposed Architecture

Early Status Service

Hybrid Collector Model

UI Layers

Overview

Advanced Diagnostics

Status API

Blocker Detection Requirements

Bootstrap and Flux Flow

Phase 0: Host/Talos bootstrap

Phase 1: Platform prerequisites

Phase 2: Status service

Phase 3: Core app

Phase 4: Background services

Phase 5: Full health/support

Flux/Helm Organization

Implementation Milestones

Durable Fixes from Observed Run

Acceptance Criteria

13 KiB

Raw Permalink Blame History