PSA/ee/docs/premise/talos-operations-and-troubleshooting.md
Hermes 284313f908
2026-06-22 16:12:17 -05:00


Talos Operations And Troubleshooting

Purpose

Talos appliance failures are easier to recover from when they are classified by layer. Most wasted time comes from debugging the wrong layer first.

Support should start by collecting a support bundle whenever possible. The layered checks below define what that bundle needs to capture and how to interpret it.

Use this order:

  1. host and Talos reachability
  2. Kubernetes node health and storage
  3. Flux source and reconcile state
  4. application bootstrap and runtime health
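
The ordering above can be sketched as a small triage helper. This is a minimal illustration, not part of any Alga or Talos tooling: the function name `first_failing_layer` and the idea of passing one callable per layer are assumptions made for the example. Real checks would probe the appliance.

```python
from typing import Callable, Optional, Sequence

# Layer names follow the order listed in this document.
LAYERS = [
    "host and Talos reachability",
    "Kubernetes node health and storage",
    "Flux source and reconcile state",
    "application bootstrap and runtime health",
]


def first_failing_layer(checks: Sequence[Callable[[], bool]]) -> Optional[str]:
    """Run one check per layer, in order, and stop at the first failure.

    `checks` is a hypothetical list of zero-argument callables that return
    True when the layer looks healthy. Later layers are not evaluated once
    an earlier layer fails, which mirrors the "debug the lowest broken
    layer first" rule.
    """
    if len(checks) != len(LAYERS):
        raise ValueError("expected exactly one check per layer")
    for name, check in zip(LAYERS, checks):
        if not check():
            return name
    return None
```

Stopping at the first failure is deliberate: a result from layer 3 or 4 is not trustworthy while layer 1 or 2 is still broken.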

Layer 1: Host And Talos

Check this layer first when:

  • the node disappears from the network
  • kubectl stops answering
  • talosctl cannot reach the API

Common issues:

  • the VM booted from installer media instead of disk
  • network changes were never persisted into machine config
  • the machine came back with the wrong NIC or resolver configuration

Typical interpretation:

  • if ICMP and Talos API are both gone, start at the console
  • if the console says Talos is installed but the machine booted from other media, remove the ISO and boot from disk
  • if networking only works after manual intervention, write the fix into machine configuration
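
A rough sketch of the Layer 1 decision, assuming the Talos API listens on its default port 50000 and the Kubernetes API on 6443. The helper names are hypothetical. The ICMP result is passed in by the caller because sending pings from Python normally needs raw-socket privileges. Only the first branch comes directly from the guidance above; the other messages are illustrative assumptions.

```python
import socket

TALOS_API_PORT = 50000  # default Talos API (apid) port
KUBE_API_PORT = 6443    # default Kubernetes API port


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def layer1_next_step(ping_ok: bool, talos_api_ok: bool, kube_api_ok: bool) -> str:
    """Map reachability results to a suggested next step."""
    if not ping_ok and not talos_api_ok:
        # Stated in the guidance: both gone means the console is the entry point.
        return "start at the console"
    if not talos_api_ok:
        # Assumption: host answers but Talos does not, so boot media and
        # network configuration are the first suspects.
        return "check the console for boot media and network configuration"
    if not kube_api_ok:
        # Assumption: Talos is reachable, so the fault is above the host.
        return "Talos is reachable; continue with layer 2"
    return "layer 1 looks healthy"
```

A caller might combine the two, for example `layer1_next_step(ping_ok, tcp_reachable(ip, TALOS_API_PORT), tcp_reachable(ip, KUBE_API_PORT))`, where `ip` and `ping_ok` come from the operator's own tooling.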

Layer 2: Kubernetes Node And Storage

Check this layer after Talos is healthy.

Focus on:

  • node conditions
  • taints
  • schedulability
  • storage classes
  • PVC binding

Common issues:

  • single-node control-plane taint prevents workload scheduling
  • no persistent volume provisioner exists yet
  • PVCs are pending, so Postgres and Redis never become healthy

Steady-state rule:

  • single-node appliance clusters should persist cluster.allowSchedulingOnControlPlanes: true in the Talos machine config

That is better than recovering by removing the control-plane taint manually after every reboot.
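
The focus items for this layer can be checked mechanically from Kubernetes API objects. The sketch below assumes the input dictionaries are parsed from `kubectl get node <name> -o json` and `kubectl get pvc -A -o json`; the function names are hypothetical and the checks are a starting point, not a complete health model.

```python
from typing import List

CONTROL_PLANE_TAINT = "node-role.kubernetes.io/control-plane"


def node_findings(node: dict) -> List[str]:
    """Report readiness, cordon, and control-plane taint problems for one node."""
    findings = []
    conditions = node.get("status", {}).get("conditions", [])
    ready = next((c for c in conditions if c.get("type") == "Ready"), None)
    if ready is None or ready.get("status") != "True":
        findings.append("node is not Ready")
    spec = node.get("spec", {})
    if spec.get("unschedulable"):
        findings.append("node is cordoned")
    for taint in spec.get("taints") or []:
        if taint.get("key") == CONTROL_PLANE_TAINT and taint.get("effect") == "NoSchedule":
            findings.append("control-plane NoSchedule taint is present")
    return findings


def pending_pvcs(pvc_list: dict) -> List[str]:
    """Return namespace/name for every PVC whose phase is Pending."""
    return [
        f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
        for p in pvc_list.get("items", [])
        if p.get("status", {}).get("phase") == "Pending"
    ]
```

On a single-node appliance, a reported control-plane taint together with pending workloads usually points back to the steady-state rule above rather than to the workloads themselves.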

Layer 3: Flux Source And GitOps

Check this layer when:

  • Flux controllers are running but releases do not progress
  • the cluster still seems to be reconciling an old OCI config-bundle artifact
  • HelmRelease objects exist but do not pick up new changes

Common issues:

  • source-controller cannot fetch the OCIRepository because cluster egress is broken
  • the source artifact is stale even though the channel changed
  • the wrong config-bundle digest or path is configured for the appliance profile

Operational rule:

  • verify OCIRepository readiness and artifact revision before assuming a chart fix is in-cluster

Do not keep debugging a Helm failure if Flux is still serving an older artifact.
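
One way to apply the operational rule is to compare the artifact revision reported by the OCIRepository against the digest you expect. The sketch assumes the object was fetched with something like `kubectl get ocirepository <name> -n <namespace> -o json` and that the expected digest appears as a substring of `status.artifact.revision`; the exact revision format depends on the Flux version, and the function name is hypothetical.

```python
def oci_source_status(repo: dict, expected_digest: str) -> str:
    """Classify an OCIRepository object as 'not ready', 'stale', or 'current'."""
    status = repo.get("status", {})
    ready = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Ready"),
        None,
    )
    if ready is None:
        return "not ready: no Ready condition reported"
    if ready.get("status") != "True":
        return f"not ready: {ready.get('message', 'no message')}"
    revision = (status.get("artifact") or {}).get("revision", "")
    # Substring match so that both tag-prefixed and bare digest forms compare.
    if not revision or expected_digest not in revision:
        return f"stale: serving {revision or 'no artifact'}"
    return "current"
```

Anything other than `current` means chart-level debugging is premature, because the cluster is not yet reconciling the artifact that contains the fix.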

Layer 4: Application Bootstrap

Check this layer when:

  • Postgres and Redis are healthy
  • Flux is synced
  • alga-core is still not usable

Focus on:

  • bootstrap job existence and logs
  • db-credentials availability
  • whether the database was actually initialized
  • whether the app pod is waiting on bootstrap or failing after it

Common issues:

  • bootstrap job lifecycle is wrong for the install path
  • database credentials rotated against an existing Postgres volume
  • the runtime image does not contain the setup path the job expects
  • the server started before bootstrap completed
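
The question "is the app waiting on bootstrap or failing after it" can be answered from the Job status counters. This sketch assumes the bootstrap runs as a Kubernetes Job whose JSON is passed in, or `None` when the Job does not exist. The function names and the wording of the returned hints are illustrative assumptions, not the chart's actual behavior.

```python
from typing import Optional


def bootstrap_state(job: Optional[dict]) -> str:
    """Reduce a Job object to one of: missing, completed, running, failed, pending."""
    if job is None:
        return "missing"
    status = job.get("status", {})
    if (status.get("succeeded") or 0) >= 1:
        return "completed"
    if (status.get("active") or 0) >= 1:
        # A retrying Job can have failed pods and an active one at once;
        # it is treated as running here.
        return "running"
    if (status.get("failed") or 0) >= 1:
        return "failed"
    return "pending"


def app_diagnosis(job: Optional[dict], app_ready: bool) -> str:
    """Suggest where to look next, based on bootstrap state and app readiness."""
    state = bootstrap_state(job)
    if state == "missing":
        return "bootstrap job does not exist: check the job lifecycle for this install path"
    if state == "failed":
        return "bootstrap failed: read the job logs and check db-credentials"
    if state in ("running", "pending"):
        return "app is waiting on bootstrap"
    if not app_ready:
        return "app is failing after bootstrap: inspect the app pod"
    return "bootstrap completed and app is ready"
```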

Resource Pressure In Local Hypervisors

Local Talos appliance testing under emulation or desktop virtualization can look like a network or image issue when it is mostly a CPU issue.

Practical signs:

  • image pulls are extremely slow but eventually succeed
  • ContainerCreating lasts a long time without OOM evidence
  • node CPU is pegged while memory pressure remains false

Implications:

  • do not assume ErrImagePull or long pulls are only registry problems in local lab runs
  • check node CPU count and allocatable resources before over-diagnosing memory
  • if the environment is a single emulated VM, increasing visible vCPUs may materially improve bring-up reliability
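
The signs above can be combined into a rough heuristic. The sketch reads allocatable CPU and the MemoryPressure condition from a node object (as returned by `kubectl get node <name> -o json`). Current CPU usage in cores is supplied by the caller from whatever metrics source is available, and the 0.9 threshold is an arbitrary assumption for illustration.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '4' or '3950m' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def looks_cpu_starved(node: dict, cpu_usage_cores: float, threshold: float = 0.9) -> bool:
    """True when CPU is near its allocatable limit and memory pressure is absent."""
    status = node.get("status", {})
    allocatable = parse_cpu(status.get("allocatable", {}).get("cpu", "0"))
    if allocatable <= 0:
        return False  # cannot judge without an allocatable CPU figure
    memory_pressure = any(
        c.get("type") == "MemoryPressure" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    return not memory_pressure and cpu_usage_cores / allocatable >= threshold
```

A `True` result on a single emulated VM is a hint to raise the vCPU count before spending time on registry or memory diagnosis.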

Distinguishing Network From Runtime Failures

Not every slow start is a network problem.

Useful distinctions:

  • if a small public image can be pulled, basic egress likely works
  • if node conditions show no memory pressure, OOM is less likely to be the root cause
  • if container runtime services are timing out or in unknown state, runtime instability may be the real blocker

The layered approach matters:

  • fix host reachability first
  • then cluster egress
  • then runtime scheduling and pull behavior
  • then app bootstrap

Fresh-Install Validation Checklist

A generic fresh-install validation should confirm:

  1. Talos API is reachable and the node is healthy.
  2. The Kubernetes node is Ready.
  3. Single-node scheduling is enabled.
  4. A working storage class exists for PVC-backed workloads.
  5. Flux source is synced to the intended artifact revision.
  6. alga-core bootstrap job runs.
  7. Postgres databases and roles are created.
  8. Migrations complete.
  9. Seeds run once.
  10. The server and dependent services become ready.

Recovery Guidance

Prefer durable fixes over repeated hand-applied recovery steps.

Examples:

  • persist NIC and DNS changes in machine config instead of retyping them after reboot
  • persist control-plane scheduling in Talos config instead of repeatedly removing taints
  • fix Helm bootstrap ordering in the chart instead of deleting failed jobs forever
  • pin explicit image tags in the bootstrap path instead of relying on moving tags

That discipline is what turns a fragile lab sequence into an appliance model.