PRD: Tiered Appliance Bootstrap Status

Summary

Fresh on-premises appliance bootstrap is currently opaque and fragile. A single long-running bootstrap can spend tens of minutes pulling large images, waiting on Helm timeouts, or hiding lower-level blockers behind generic "context deadline exceeded" messages. Services that are not login-critical can also make the whole appliance appear unavailable even when the Alga core application is already usable.

Introduce a tiered appliance bootstrap model with an early token-protected status web UI. The appliance should clearly distinguish platform readiness, core business readiness, login readiness, background-service readiness, and full health. The first customer-visible milestone is Ready to log in, not “all optional/background services healthy.”

Problem Statement

During a local Talos/UTM appliance bootstrap, the install took far longer than expected and required manual investigation. Time was spent on:

  • Talos installer image pull blocked by DNS (192.168.64.1:53 refused queries).
  • A fresh reset helper failure (target: unbound variable).
  • Large alga-psa-ee image pull taking roughly 16 minutes.
  • alga-core Helm install waiting until a 30-minute timeout while Postgres was blocked by a PVC subPath error.
  • Temporal Helm install timing out because the live deployment did not run autosetup.
  • Temporal UI failing from Kubernetes service-link environment variable collision.
  • Background worker releases blocked by missing image tags.

The operator had to inspect Talos console output, Kubernetes events, HelmRelease conditions, pod descriptions, and logs manually to understand whether the install was progressing, blocked, login-ready, or only background-degraded.

Goals

  1. Provide an early status web UI during bootstrap at a predictable node URL, protected by a generated token.
  2. Define tiered readiness so LOGIN_READY is distinct from FULLY_HEALTHY.
  3. Make the bootstrap CLI print the status URL/token and emit richer phase progress.
  4. Split or organize appliance Flux/Helm rollout so background services do not block core login readiness.
  5. Surface specific, actionable blockers instead of generic Helm timeouts.
  6. Preserve a stable status data model that can later be backed by a controller/CRD without rewriting the UI.
  7. Fold the concrete issues discovered during the local UTM/Talos bootstrap into durable chart/operator fixes.

Non-goals

  • Building a full replacement for Flux or Helm.
  • Making every background service optional for all production deployments.
  • Implementing a complete appliance management portal beyond bootstrap/status/support diagnostics.
  • Exposing secret values or privileged Kubernetes mutation capabilities in the first status UI.
  • Solving image publishing/release automation comprehensively beyond validating referenced tags before install.

Users and Personas

Customer/Admin Installer

Needs to know whether the appliance is installing, ready to log in, degraded, or requiring support. Should not need Kubernetes knowledge.

Alga Support / Operator

Needs exact technical blockers, relevant events, component health, image pull failures, HelmRelease states, bootstrap job summaries, and support-bundle entry points.

Developer / Release Engineer

Needs release validation feedback when a manifest references missing image tags or chart wiring causes predictable bootstrap failure.

Primary User Flow

  1. Admin starts appliance bootstrap.

  2. CLI completes Talos/Kubernetes/platform prerequisites.

  3. CLI installs the early appliance-status service.

  4. CLI prints:

    Appliance status UI:
      URL:   http://<node-ip>:8080
      Token: <generated-token>
    
  5. Admin opens the status UI with the token.

  6. Status page shows current phase and whether login is available.

  7. Core Alga reaches LOGIN_READY; UI shows login URL.

  8. Background services continue installing.

  9. If background services fail, UI shows Ready with background issues and specific remediation guidance.

Readiness Model

Readiness is tiered, not binary.

PLATFORM_READY

Required:

  • Talos node installed and booted from disk.
  • Kubernetes API reachable.
  • Node Ready and schedulable.
  • CoreDNS healthy.
  • DNS/outbound HTTPS checks pass.
  • local-path storage installed.
  • storage smoke test passes.
  • Flux controllers running.
  • appliance-status reachable.

CORE_READY

Required:

  • Postgres ready.
  • Redis ready.
  • PgBouncer ready if the app runtime uses PgBouncer.
  • DB and Redis credentials exist.
  • required PVCs bound.
  • no login-critical pod in CreateContainerConfigError, ImagePullBackOff, or CrashLoopBackOff.

BOOTSTRAP_READY

Required:

  • bootstrap job completed.
  • database exists.
  • migrations completed.
  • seed data exists.
  • representative seed query passes, such as server.users count greater than zero.
  • bootstrap mode no longer uses destructive fresh-install semantics after a successful first run.

LOGIN_READY

Definition selected for this plan: core business ready.

Required:

  • CORE_READY.
  • BOOTSTRAP_READY.
  • Alga web deployment ready.
  • app URL responds.
  • dashboard/login redirect works.
  • public login URL known.

Not required:

  • email service.
  • Temporal.
  • workflow worker.
  • temporal worker.
  • optional integrations.

BACKGROUND_READY

Required background services healthy for the selected release/profile:

  • email-service.
  • Temporal and Temporal UI if enabled.
  • workflow-worker.
  • temporal-worker.
  • future integration/background services.

FULLY_HEALTHY

Required:

  • LOGIN_READY.
  • BACKGROUND_READY.
  • all selected HelmReleases ready.
  • no unacknowledged critical warnings.
  • no missing image tags.
  • no image drift.
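
The tier composition above can be sketched as a pure function over check results. This is a minimal sketch, not the planned implementation: `TierInputs`, `Tiers`, `deriveTiers`, and every field name are hypothetical, and each boolean is assumed to be produced by a separate collector check.

```typescript
// Hypothetical aggregated check results; all field names are illustrative.
interface TierInputs {
  platformReady: boolean;
  coreReady: boolean;
  bootstrapReady: boolean;
  webReady: boolean; // web deployment ready, app URL responds, login redirect works
  loginUrlKnown: boolean;
  backgroundReady: boolean;
  allHelmReleasesReady: boolean;
  unacknowledgedCriticalWarnings: number;
  missingImageTags: number;
  imageDrift: boolean;
}

interface Tiers {
  platform: boolean;
  core: boolean;
  bootstrap: boolean;
  login: boolean;
  background: boolean;
  fullHealth: boolean;
}

function deriveTiers(i: TierInputs): Tiers {
  // LOGIN_READY depends only on the tiers the readiness model lists for it;
  // background services are deliberately excluded.
  const login =
    i.coreReady && i.bootstrapReady && i.webReady && i.loginUrlKnown;

  // FULLY_HEALTHY layers the background and release-hygiene checks on top.
  const fullHealth =
    login &&
    i.backgroundReady &&
    i.allHelmReleasesReady &&
    i.unacknowledgedCriticalWarnings === 0 &&
    i.missingImageTags === 0 &&
    !i.imageDrift;

  return {
    platform: i.platformReady,
    core: i.coreReady,
    bootstrap: i.bootstrapReady,
    login,
    background: i.backgroundReady,
    fullHealth,
  };
}
```

Keeping the derivation free of Kubernetes calls means the same function could later run inside a controller without changes.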

User-facing Rollup States

State | Meaning
--- | ---
Installing | Platform/core/bootstrap/login not complete and no hard blocker yet.
Ready to log in | LOGIN_READY=true; background services may still be installing.
Ready with background issues | LOGIN_READY=true; at least one background component failed or is blocked.
Fully healthy | FULLY_HEALTHY=true.
Failed / action required | A platform/core/bootstrap/login blocker prevents use.
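
One way to compute the rollup state is a short decision function. The sketch below is illustrative only: `RollupInputs` and `rollupState` are hypothetical names, and apart from `ready_with_background_issues` the state identifiers are assumed by analogy.

```typescript
// State identifiers are assumptions except "ready_with_background_issues".
type RollupState =
  | "installing"
  | "ready_to_log_in"
  | "ready_with_background_issues"
  | "fully_healthy"
  | "failed_action_required";

// Hypothetical inputs, assumed to be derived from the readiness tiers.
interface RollupInputs {
  loginReady: boolean;
  fullyHealthy: boolean;
  hasLoginBlocker: boolean; // hard platform/core/bootstrap/login blocker
  hasBackgroundFailure: boolean; // a background component failed or is blocked
}

function rollupState(i: RollupInputs): RollupState {
  if (i.fullyHealthy) return "fully_healthy";
  if (i.loginReady) {
    // Background problems never downgrade a login-ready appliance to failed.
    return i.hasBackgroundFailure
      ? "ready_with_background_issues"
      : "ready_to_log_in";
  }
  return i.hasLoginBlocker ? "failed_action_required" : "installing";
}
```

Checking `loginReady` before any failure flag keeps a background failure from ever reporting the appliance as unusable.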

Proposed Architecture

Early Status Service

Add a new early-installed chart or manifest set:

appliance-status

Expose it on a predictable node URL:

http://<node-ip>:8080

The service is token-protected. The bootstrap script/operator generates a token, stores it locally and in-cluster, and prints it to the admin.

Local path:

~/.alga-psa-appliance/<site-id>/status-token

In-cluster Secret:

namespace: appliance-system
secret: appliance-status-auth
key: token

The status service uses read-only Kubernetes RBAC and reads:

  • Nodes.
  • Pods.
  • Jobs.
  • PVCs.
  • Events.
  • Flux GitRepository.
  • Flux Kustomization.
  • Flux HelmRelease.
  • selected ConfigMaps/Secrets metadata, not secret values.

Hybrid Collector Model

The first version may be a small web service that reads the Kubernetes API directly. Its internal status schema must remain clean and stable so a future controller can publish the same model via CRD or ConfigMap.

UI Layers

Overview

Simple customer/admin status:

Status: Ready to log in
Login URL: http://<node-ip>:3000
Background: 2 services need attention
Top issue: workflow-worker image tag not found

Advanced Diagnostics

Support/operator diagnostics:

  • readiness tiers.
  • component table.
  • top blockers.
  • recent Kubernetes events.
  • HelmRelease conditions.
  • image pull state.
  • bootstrap job summary.
  • support bundle action in a later milestone.

Status API

Expose one canonical status document:

GET /api/status

Representative shape:

{
  "siteId": "appliance-single-node",
  "timestamp": "2026-04-30T02:08:10Z",
  "release": {
    "selectedReleaseVersion": "1.0-rc5",
    "appVersion": "1.0-rc5",
    "channel": "candidate",
    "gitRevision": "release/1.0-rc5@sha1:979e2079..."
  },
  "urls": {
    "statusUrl": "http://192.168.64.8:8080",
    "loginUrl": "http://192.168.64.8:3000"
  },
  "rollup": {
    "state": "ready_with_background_issues",
    "message": "Alga is ready to log in. Some background services need attention.",
    "nextAction": "Log in to Alga, then review background service issues."
  },
  "tiers": {
    "platform": { "ready": true, "status": "healthy" },
    "core": { "ready": true, "status": "healthy" },
    "bootstrap": { "ready": true, "status": "healthy" },
    "login": { "ready": true, "status": "healthy" },
    "background": { "ready": false, "status": "degraded" },
    "fullHealth": { "ready": false, "status": "degraded" }
  },
  "topBlockers": [
    {
      "severity": "background",
      "component": "workflow-worker",
      "layer": "image",
      "reason": "Image tag not found: ghcr.io/nine-minds/workflow-worker:61e4a00e",
      "nextAction": "Publish the missing image tag or update the appliance release manifest.",
      "loginBlocking": false
    }
  ]
}
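
For consumers of the status document, the representative shape could be typed as below. The field names come from the JSON above; the type names and the `summarize` helper are hypothetical, and `status`, `state`, and `severity` are left as plain strings because the full set of allowed values is not defined here.

```typescript
interface TierState {
  ready: boolean;
  status: string;
}

interface Blocker {
  severity: string;
  component: string;
  layer: string;
  reason: string;
  nextAction: string;
  loginBlocking: boolean;
}

interface StatusDocument {
  siteId: string;
  timestamp: string;
  release: {
    selectedReleaseVersion: string;
    appVersion: string;
    channel: string;
    gitRevision: string;
  };
  urls: { statusUrl: string; loginUrl: string };
  rollup: { state: string; message: string; nextAction: string };
  tiers: Record<
    "platform" | "core" | "bootstrap" | "login" | "background" | "fullHealth",
    TierState
  >;
  topBlockers: Blocker[];
}

// Hypothetical helper: produce a one-line summary for the overview page.
function summarize(doc: StatusDocument): string {
  const blocking = doc.topBlockers.filter((b) => b.loginBlocking);
  const first = blocking[0];
  if (first) {
    return `Action required: ${first.component}: ${first.reason}`;
  }
  if (doc.tiers.login.ready) {
    const background = doc.topBlockers.length - blocking.length;
    return `Login available at ${doc.urls.loginUrl}; ${background} background issue(s)`;
  }
  return "Installing";
}
```

Because `loginBlocking` is carried on each blocker, the overview can separate "cannot log in" from "background degraded" without re-deriving it from component names.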

Blocker Detection Requirements

The collector must translate low-level conditions into actionable messages.

Examples from the observed bootstrap:

Low-level signal | User-facing blocker
--- | ---
lookup factory.talos.dev on 192.168.64.1:53: connection refused | DNS resolver failure; configure explicit DNS servers and retry.
failed to create subPath directory for volumeMount "db-data" | Postgres PVC initialization failed; repair/recreate PVC subPath.
ImagePullBackOff and not found | image tag not found; publish missing tag or update release manifest.
context canceled during image pull | image pull interrupted; wait for retry or restart pod.
Helm install context deadline exceeded plus pod-level DB failure | report DB/PVC blocker, not generic Helm timeout.
Temporal sql schema version compatibility check failed | Temporal schema not initialized; verify autosetup or schema job.
Temporal UI cannot unmarshal tcp://... into int | Kubernetes service-link environment collision; disable service links.
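
The translation above can be sketched as an ordered rule list where the first matching rule wins, so a known pod-level cause outranks a generic Helm timeout. The names `BlockerRule`, `RULES`, and `classifyBlocker` are hypothetical, the regular expressions are guesses at the log text, and the final fallback rule is an assumption not listed among the observed blockers.

```typescript
interface BlockerRule {
  all: RegExp[]; // every pattern must match somewhere in the collected signals
  blocker: string;
}

// Ordered by specificity; the generic Helm timeout is intentionally last.
const RULES: BlockerRule[] = [
  { all: [/lookup \S+ on \S+:53: connection refused/],
    blocker: "DNS resolver failure; configure explicit DNS servers and retry." },
  { all: [/failed to create subPath directory/],
    blocker: "Postgres PVC initialization failed; repair/recreate PVC subPath." },
  { all: [/ImagePullBackOff/, /not found/],
    blocker: "image tag not found; publish missing tag or update release manifest." },
  { all: [/context canceled/, /pull/i],
    blocker: "image pull interrupted; wait for retry or restart pod." },
  { all: [/sql schema version compatibility check failed/],
    blocker: "Temporal schema not initialized; verify autosetup or schema job." },
  { all: [/cannot unmarshal tcp:\/\//],
    blocker: "Kubernetes service-link environment collision; disable service links." },
  // Assumed fallback, used only when no lower-level cause is known.
  { all: [/context deadline exceeded/],
    blocker: "Helm install timed out; no lower-level cause identified yet." },
];

// signals: raw event messages, pod statuses, and log lines for one component.
function classifyBlocker(signals: string[]): string | undefined {
  const text = signals.join("\n");
  const rule = RULES.find((r) => r.all.every((re) => re.test(text)));
  return rule?.blocker;
}
```

For example, passing both a Helm timeout message and the `db-data` subPath event yields the Postgres PVC blocker rather than the timeout.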

Bootstrap and Flux Flow

Phase 0: Host/Talos bootstrap

  • Talos maintenance API reachable.
  • disk detected.
  • machine config applied.
  • installer image pulled.
  • Kubernetes bootstrapped.
  • kubeconfig retrieved.
  • node Ready.

Phase 1: Platform prerequisites

  • CoreDNS resolver config.
  • local-path storage.
  • storage smoke test.
  • Flux controllers.

Phase 2: Status service

  • Generate token.
  • Create appliance-system namespace and status auth Secret.
  • Install appliance-status.
  • Print status URL and token.

Phase 3: Core app

  • Install login-critical services.
  • Wait for CORE_READY, BOOTSTRAP_READY, and LOGIN_READY.
  • Print login URL as soon as ready.

Phase 4: Background services

  • Install email-service, Temporal, workflow-worker, temporal-worker, and future background services.
  • Failures set Ready with background issues rather than blocking login.

Phase 5: Full health/support

  • Report FULLY_HEALTHY when all selected services are healthy.
  • Provide support bundle and remediation controls in later iterations.

Flux/Helm Organization

Restructure appliance Flux resources into tiered groups:

ee/appliance/flux/base/
  platform/
    appliance-status.yaml
  core/
    alga-core.yaml
    pgbouncer.yaml
  background/
    temporal.yaml
    email-service.yaml
    workflow-worker.yaml
    temporal-worker.yaml

Prefer separate Flux Kustomizations:

alga-platform
alga-core
alga-background

Dependencies:

alga-core depends on alga-platform
alga-background depends on alga-core

A failure in alga-background must not set LOGIN_READY to false.

Implementation Milestones

  1. Status schema and CLI visibility.
  2. Early appliance-status chart/service.
  3. Core/background tier split.
  4. Durable chart/operator fixes from the observed run.
  5. Support bundle and guided remediation actions.

Durable Fixes from Observed Run

  • Explicit DNS support and visibility in Talos bootstrap.
  • Fix reset-appliance-data.sh target: unbound variable failure.
  • Prevent operator wrapper from overwriting valid Talos credentials when explicit kubeconfig/talosconfig are supplied.
  • Fix or avoid Postgres PVC subPath initialization failure.
  • Ensure Temporal chart runs autosetup correctly.
  • Disable service links for Temporal UI and any other pod vulnerable to service-env collisions.
  • Validate release manifest image tags before applying background releases.
  • Classify large image pulls separately from stalled installs.

Acceptance Criteria

  1. Fresh appliance bootstrap prints a status URL and token within the first few minutes after Kubernetes is ready.
  2. Status UI is reachable at http://<node-ip>:8080 and requires the generated token.
  3. Status UI reports tiered readiness and top blockers.
  4. Alga core reaching LOGIN_READY is reported independently of background services.
  5. A missing workflow-worker image tag produces Ready with background issues when login is otherwise available.
  6. Generic Helm timeout is not the top blocker when a lower-level pod/event cause is known.
  7. Release manifest validation catches missing background image tags before or during background install.
  8. A local UTM/Talos smoke run can reproduce: status UI early, login-ready core, and background-degraded reporting.