Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
13 KiB
PRD: Tiered Appliance Bootstrap Status
Summary
Fresh on-premise appliance bootstrap currently feels opaque and overly fragile. A single long-running bootstrap can spend tens of minutes pulling large images, waiting on Helm timeouts, or hiding lower-level blockers behind generic context deadline exceeded messages. Non-login-critical services can also make the whole appliance appear unavailable even when the Alga core application is already usable.
Introduce a tiered appliance bootstrap model with an early token-protected status web UI. The appliance should clearly distinguish platform readiness, core business readiness, login readiness, background-service readiness, and full health. The first customer-visible milestone is Ready to log in, not “all optional/background services healthy.”
Problem Statement
During a local Talos/UTM appliance bootstrap, the install took far longer than expected and required manual investigation. Time was spent on:
- Talos installer image pull blocked by DNS (
192.168.64.1:53refused queries). - A fresh reset helper failure (
target: unbound variable). - Large
alga-psa-eeimage pull taking roughly 16 minutes. alga-coreHelm install waiting until a 30-minute timeout while Postgres was blocked by a PVC subPath error.- Temporal Helm install timing out because the live deployment did not run autosetup.
- Temporal UI failing from Kubernetes service-link environment variable collision.
- Background worker releases blocked by missing image tags.
The operator had to inspect Talos console output, Kubernetes events, HelmRelease conditions, pod descriptions, and logs manually to understand whether the install was progressing, blocked, login-ready, or only background-degraded.
Goals
- Provide an early status web UI during bootstrap at a predictable node URL, protected by a generated token.
- Define tiered readiness so
LOGIN_READYis distinct fromFULLY_HEALTHY. - Make the bootstrap CLI print the status URL/token and emit richer phase progress.
- Split or organize appliance Flux/Helm rollout so background services do not block core login readiness.
- Surface specific, actionable blockers instead of generic Helm timeouts.
- Preserve a stable status data model that can later be backed by a controller/CRD without rewriting the UI.
- Fold the concrete issues discovered during the local UTM/Talos bootstrap into durable chart/operator fixes.
Non-goals
- Building a full replacement for Flux or Helm.
- Making every background service optional for all production deployments.
- Implementing a complete appliance management portal beyond bootstrap/status/support diagnostics.
- Exposing secret values or privileged Kubernetes mutation capabilities in the first status UI.
- Solving image publishing/release automation comprehensively beyond validating referenced tags before install.
Users and Personas
Customer/Admin Installer
Needs to know whether the appliance is installing, ready to log in, degraded, or requiring support. Should not need Kubernetes knowledge.
Alga Support / Operator
Needs exact technical blockers, relevant events, component health, image pull failures, HelmRelease states, bootstrap job summaries, and support-bundle entry points.
Developer / Release Engineer
Needs release validation feedback when a manifest references missing image tags or chart wiring causes predictable bootstrap failure.
Primary User Flow
-
Admin starts appliance bootstrap.
-
CLI completes Talos/Kubernetes/platform prerequisites.
-
CLI installs the early
appliance-statusservice. -
CLI prints:
Appliance status UI: URL: http://<node-ip>:8080 Token: <generated-token> -
Admin opens the status UI with the token.
-
Status page shows current phase and whether login is available.
-
Core Alga reaches
LOGIN_READY; UI shows login URL. -
Background services continue installing.
-
If background services fail, UI shows
Ready with background issuesand specific remediation guidance.
Readiness Model
Readiness is tiered, not binary.
PLATFORM_READY
Required:
- Talos node installed and booted from disk.
- Kubernetes API reachable.
- Node
Readyand schedulable. - CoreDNS healthy.
- DNS/outbound HTTPS checks pass.
- local-path storage installed.
- storage smoke test passes.
- Flux controllers running.
appliance-statusreachable.
CORE_READY
Required:
- Postgres ready.
- Redis ready.
- PgBouncer ready if the app runtime uses PgBouncer.
- DB and Redis credentials exist.
- required PVCs bound.
- no login-critical pod in
CreateContainerConfigError,ImagePullBackOff, orCrashLoopBackOff.
BOOTSTRAP_READY
Required:
- bootstrap job completed.
- database exists.
- migrations completed.
- seed data exists.
- representative seed query passes, such as
server.userscount greater than zero. - bootstrap mode is no longer destructive
freshsemantics after successful first run.
LOGIN_READY
Definition selected for this plan: core business ready.
Required:
CORE_READY.BOOTSTRAP_READY.- Alga web deployment ready.
- app URL responds.
- dashboard/login redirect works.
- public login URL known.
Not required:
- email service.
- Temporal.
- workflow worker.
- temporal worker.
- optional integrations.
BACKGROUND_READY
Required background services healthy for the selected release/profile:
- email-service.
- Temporal and Temporal UI if enabled.
- workflow-worker.
- temporal-worker.
- future integration/background services.
FULLY_HEALTHY
Required:
LOGIN_READY.BACKGROUND_READY.- all selected HelmReleases ready.
- no unacknowledged critical warnings.
- no missing image tags.
- no image drift.
User-facing Rollup States
| State | Meaning |
|---|---|
| Installing | Platform/core/bootstrap/login not complete and no hard blocker yet. |
| Ready to log in | LOGIN_READY=true; background services may still be installing. |
| Ready with background issues | LOGIN_READY=true; at least one background component failed or is blocked. |
| Fully healthy | FULLY_HEALTHY=true. |
| Failed / action required | A platform/core/bootstrap/login blocker prevents use. |
Proposed Architecture
Early Status Service
Add a new early-installed chart or manifest set:
appliance-status
Expose it on a predictable node URL:
http://<node-ip>:8080
The service is token-protected. The bootstrap script/operator generates a token, stores it locally and in-cluster, and prints it to the admin.
Local path:
~/.alga-psa-appliance/<site-id>/status-token
In-cluster Secret:
namespace: appliance-system
secret: appliance-status-auth
key: token
The status service uses read-only Kubernetes RBAC and reads:
- Nodes.
- Pods.
- Jobs.
- PVCs.
- Events.
- Flux
GitRepository. - Flux
Kustomization. - Flux
HelmRelease. - selected ConfigMaps/Secrets metadata, not secret values.
Hybrid Collector Model
First version may be a small web service that reads Kubernetes directly. Its internal status schema must remain clean and stable so a future controller can publish the same model via CRD or ConfigMap.
UI Layers
Overview
Simple customer/admin status:
Status: Ready to log in
Login URL: http://<node-ip>:3000
Background: 2 services need attention
Top issue: workflow-worker image tag not found
Advanced Diagnostics
Support/operator diagnostics:
- readiness tiers.
- component table.
- top blockers.
- recent Kubernetes events.
- HelmRelease conditions.
- image pull state.
- bootstrap job summary.
- support bundle action in a later milestone.
Status API
Expose one canonical status document:
GET /api/status
Representative shape:
{
"siteId": "appliance-single-node",
"timestamp": "2026-04-30T02:08:10Z",
"release": {
"selectedReleaseVersion": "1.0-rc5",
"appVersion": "1.0-rc5",
"channel": "candidate",
"gitRevision": "release/1.0-rc5@sha1:979e2079..."
},
"urls": {
"statusUrl": "http://192.168.64.8:8080",
"loginUrl": "http://192.168.64.8:3000"
},
"rollup": {
"state": "ready_with_background_issues",
"message": "Alga is ready to log in. Some background services need attention.",
"nextAction": "Log in to Alga, then review background service issues."
},
"tiers": {
"platform": { "ready": true, "status": "healthy" },
"core": { "ready": true, "status": "healthy" },
"bootstrap": { "ready": true, "status": "healthy" },
"login": { "ready": true, "status": "healthy" },
"background": { "ready": false, "status": "degraded" },
"fullHealth": { "ready": false, "status": "degraded" }
},
"topBlockers": [
{
"severity": "background",
"component": "workflow-worker",
"layer": "image",
"reason": "Image tag not found: ghcr.io/nine-minds/workflow-worker:61e4a00e",
"nextAction": "Publish the missing image tag or update the appliance release manifest.",
"loginBlocking": false
}
]
}
Blocker Detection Requirements
The collector must translate low-level conditions into actionable messages.
Examples from the observed bootstrap:
| Low-level signal | User-facing blocker |
|---|---|
lookup factory.talos.dev on 192.168.64.1:53: connection refused |
DNS resolver failure; configure explicit DNS servers and retry. |
failed to create subPath directory for volumeMount "db-data" |
Postgres PVC initialization failed; repair/recreate PVC subPath. |
ImagePullBackOff and not found |
image tag not found; publish missing tag or update release manifest. |
context canceled during image pull |
image pull interrupted; wait for retry or restart pod. |
Helm install context deadline exceeded plus pod-level DB failure |
report DB/PVC blocker, not generic Helm timeout. |
Temporal sql schema version compatibility check failed |
Temporal schema not initialized; verify autosetup or schema job. |
Temporal UI cannot unmarshal tcp://... into int |
Kubernetes service-link environment collision; disable service links. |
Bootstrap and Flux Flow
Phase 0: Host/Talos bootstrap
- Talos maintenance API reachable.
- disk detected.
- machine config applied.
- installer image pulled.
- Kubernetes bootstrapped.
- kubeconfig retrieved.
- node Ready.
Phase 1: Platform prerequisites
- CoreDNS resolver config.
- local-path storage.
- storage smoke test.
- Flux controllers.
Phase 2: Status service
- Generate token.
- Create
appliance-systemnamespace and status auth Secret. - Install
appliance-status. - Print status URL and token.
Phase 3: Core app
- Install login-critical services.
- Wait for
CORE_READY,BOOTSTRAP_READY, andLOGIN_READY. - Print login URL as soon as ready.
Phase 4: Background services
- Install email-service, Temporal, workflow-worker, temporal-worker, and future background services.
- Failures set
Ready with background issuesrather than blocking login.
Phase 5: Full health/support
- Report
FULLY_HEALTHYwhen all selected services are healthy. - Provide support bundle and remediation controls in later iterations.
Flux/Helm Organization
Restructure appliance Flux resources into tiered groups:
ee/appliance/flux/base/
platform/
appliance-status.yaml
core/
alga-core.yaml
pgbouncer.yaml
background/
temporal.yaml
email-service.yaml
workflow-worker.yaml
temporal-worker.yaml
Prefer separate Flux Kustomizations:
alga-platform
alga-core
alga-background
Dependencies:
alga-core depends on alga-platform
alga-background depends on alga-core
alga-background failure must not change LOGIN_READY=false.
Implementation Milestones
- Status schema and CLI visibility.
- Early
appliance-statuschart/service. - Core/background tier split.
- Durable chart/operator fixes from the observed run.
- Support bundle and guided remediation actions.
Durable Fixes from Observed Run
- Explicit DNS support and visibility in Talos bootstrap.
- Fix
reset-appliance-data.shtarget: unbound variablefailure. - Prevent operator wrapper from overwriting valid Talos credentials when explicit kubeconfig/talosconfig are supplied.
- Fix or avoid Postgres PVC subPath initialization failure.
- Ensure Temporal chart runs autosetup correctly.
- Disable service links for Temporal UI and any other pod vulnerable to service-env collisions.
- Validate release manifest image tags before applying background releases.
- Classify large image pulls separately from stalled installs.
Acceptance Criteria
- Fresh appliance bootstrap prints a status URL and token within the first few minutes after Kubernetes is ready.
- Status UI is reachable at
http://<node-ip>:8080and requires the generated token. - Status UI reports tiered readiness and top blockers.
- Alga core reaching
LOGIN_READYis reported independently of background services. - A missing workflow-worker image tag produces
Ready with background issueswhen login is otherwise available. - Generic Helm timeout is not the top blocker when a lower-level pod/event cause is known.
- Release manifest validation catches missing background image tags before or during background install.
- A local UTM/Talos smoke run can reproduce: status UI early, login-ready core, and background-degraded reporting.