PSA/ee/docs/premise/talos-operations-and-troubleshooting.md

# Talos Operations And Troubleshooting

## Purpose

Talos appliance failures are easier to recover when they are classified by layer. Most wasted time comes from debugging the wrong layer first.

Support should start by collecting a support bundle whenever possible. The layered checks below define what that bundle needs to capture and how to interpret it.

Use this order:

1. host and Talos reachability
2. Kubernetes node health and storage
3. Flux source and reconcile state
4. application bootstrap and runtime health

## Layer 1: Host And Talos

Check this layer first when:

- the node disappears from the network
- `kubectl` stops answering
- `talosctl` cannot reach the API

Common issues:

- the VM booted from installer media instead of disk
- network changes were never persisted into machine config
- the machine came back with the wrong NIC or resolver configuration

Typical interpretation:

- if ICMP and Talos API are both gone, start at the console
- if the console says Talos is installed but booted from another media, remove the ISO and boot from disk
- if networking only works after manual intervention, write the fix into machine configuration

## Layer 2: Kubernetes Node And Storage

Check this layer after Talos is healthy.

Focus on:

- node conditions
- taints
- schedulability
- storage classes
- PVC binding

Common issues:

- single-node control-plane taint prevents workload scheduling
- no persistent volume provisioner exists yet
- PVCs are pending, so Postgres and Redis never become healthy

Steady-state rule:

- single-node appliance clusters should persist `allowSchedulingOnControlPlanes: true`

That is better than recovering by removing the control-plane taint manually after every reboot.

## Layer 3: Flux Source And GitOps

Check this layer when:

- Flux controllers are running but releases do not progress
- the cluster still seems to be reconciling an old OCI config-bundle artifact
- `HelmRelease` objects exist but do not pick up new changes

Common issues:

- `source-controller` cannot fetch the OCIRepository because cluster egress is broken
- the source artifact is stale even though the channel changed
- the wrong config-bundle digest or path is configured for the appliance profile

Operational rule:

- verify `OCIRepository` readiness and artifact revision before assuming a chart fix is in-cluster

Do not keep debugging a Helm failure if Flux is still serving an older artifact.

## Layer 4: Application Bootstrap

Check this layer when:

- Postgres and Redis are healthy
- Flux is synced
- `alga-core` is still not usable

Focus on:

- bootstrap job existence and logs
- `db-credentials` availability
- whether the database was actually initialized
- whether the app pod is waiting on bootstrap or failing after it

Common issues:

- bootstrap job lifecycle is wrong for the install path
- database credentials rotated against an existing Postgres volume
- the runtime image does not contain the setup path the job expects
- the server started before bootstrap completed

## Resource Pressure In Local Hypervisors

Local Talos appliance testing under emulation or desktop virtualization can look like a network or image issue when it is mostly a CPU issue.

Practical signs:

- image pulls are extremely slow but eventually succeed
- `ContainerCreating` lasts a long time without OOM evidence
- node CPU is pegged while memory pressure remains false

Implications:

- do not assume `ErrImagePull` or long pulls are only registry problems in local lab runs
- check node CPU count and allocatable resources before over-diagnosing memory
- if the environment is a single emulated VM, increasing visible vCPUs may materially improve bring-up reliability

## Distinguishing Network From Runtime Failures

Not every slow start is a network problem.

Useful distinctions:

- if a small public image can be pulled, basic egress likely works
- if node conditions show no memory pressure, OOM is less likely to be the root cause
- if container runtime services are timing out or in unknown state, runtime instability may be the real blocker

The layered approach matters:

- fix host reachability first
- then cluster egress
- then runtime scheduling and pull behavior
- then app bootstrap

## Fresh-Install Validation Checklist

A generic fresh-install validation should confirm:

1. Talos API is reachable and the node is healthy.
2. The Kubernetes node is `Ready`.
3. Single-node scheduling is enabled.
4. A working storage class exists for PVC-backed workloads.
5. Flux source is synced to the intended repo revision.
6. `alga-core` bootstrap job runs.
7. Postgres databases and roles are created.
8. Migrations complete.
9. Seeds run once.
10. The server and dependent services become ready.

## Recovery Guidance

Prefer durable fixes over repeated hand-applied recovery steps.

Examples:

- persist NIC and DNS changes in machine config instead of retyping them after reboot
- persist control-plane scheduling in Talos config instead of repeatedly removing taints
- fix Helm bootstrap ordering in the chart instead of deleting failed jobs forever
- pin explicit image tags in the bootstrap path instead of relying on moving tags

That discipline is what turns a fragile lab sequence into an appliance model.