PSA/ee/docs/premise/talos-operations-and-troubleshooting.md
Hermes 284313f908
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00


# Talos Operations And Troubleshooting
## Purpose
Talos appliance failures are easier to recover when they are classified by layer. Most wasted time comes from debugging the wrong layer first.
Support should start by collecting a support bundle whenever possible. The layered checks below define what that bundle needs to capture and how to interpret it.
Use this order:
1. host and Talos reachability
2. Kubernetes node health and storage
3. Flux source and reconcile state
4. application bootstrap and runtime health
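The ordering above can be sketched as a small runner that stops at the first failing layer. This is an illustration only: the `first_failing_layer` helper and the idea of passing one callable per layer are assumptions, not part of any existing support-bundle tooling.

```python
from typing import Callable, Dict, Optional

# Layer names, in the order this document prescribes.
LAYERS = [
    "host and Talos reachability",
    "Kubernetes node health and storage",
    "Flux source and reconcile state",
    "application bootstrap and runtime health",
]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks in layer order and return the first layer that fails.

    Higher layers are not evaluated once a lower layer fails, because
    their results are unreliable while the layer beneath them is broken.
    A layer with no registered check is treated as unverified and is
    returned as if it had failed.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is None or not check():
            return layer
    return None  # every layer passed
```

Returning the first failing layer, rather than a full report, mirrors the point of the ordering: evidence from layers above a failure is not yet worth interpreting.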
## Layer 1: Host And Talos
Check this layer first when:
- the node disappears from the network
- `kubectl` stops answering
- `talosctl` cannot reach the API
Common issues:
- the VM booted from installer media instead of disk
- network changes were never persisted into machine config
- the machine came back with the wrong NIC or resolver configuration
Typical interpretation:
- if ICMP and Talos API are both gone, start at the console
- if the console says Talos is installed but the machine booted from other media, remove the ISO and boot from disk
- if networking only works after manual intervention, write the fix into machine configuration
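A minimal sketch of the first interpretation rule, assuming the Talos API listens on its default port 50000 and that the ICMP result is collected separately (for example from a `ping` run). The function names and return labels are hypothetical.

```python
import socket

# Assumption: Talos apid default port. Adjust if the appliance differs.
TALOS_API_PORT = 50000


def tcp_reachable(host: str, port: int = TALOS_API_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def layer1_next_step(icmp_ok: bool, talos_api_ok: bool) -> str:
    """Map the two reachability observations to where to look next."""
    if not icmp_ok and not talos_api_ok:
        return "console"      # both gone: start at the VM console
    if not talos_api_ok:
        return "talos-api"    # host answers, API does not: stay in Layer 1
    return "next-layer"       # Talos is reachable: move on to Layer 2
```

A TCP probe only shows that something accepts connections on the port; it does not replace an authenticated `talosctl` call.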
## Layer 2: Kubernetes Node And Storage
Check this layer after Talos is healthy.
Focus on:
- node conditions
- taints
- schedulability
- storage classes
- PVC binding
Common issues:
- single-node control-plane taint prevents workload scheduling
- no persistent volume provisioner exists yet
- PVCs are pending, so Postgres and Redis never become healthy
Steady-state rule:
- single-node appliance clusters should persist `allowSchedulingOnControlPlanes: true`
That is better than recovering by removing the control-plane taint manually after every reboot.
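The node and PVC checks in this layer could be expressed roughly as follows. The sketch assumes its inputs are the parsed JSON of a single Node object and of a PVC list, as produced by `kubectl get ... -o json`; the helper names are illustrative.

```python
from typing import List

CONTROL_PLANE_TAINT = "node-role.kubernetes.io/control-plane"


def node_findings(node: dict) -> List[str]:
    """Return human-readable problems found on one Node object."""
    findings = []
    conditions = node.get("status", {}).get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
    if not ready:
        findings.append("node is not Ready")
    spec = node.get("spec", {})
    if spec.get("unschedulable"):
        findings.append("node is cordoned")
    for taint in spec.get("taints") or []:
        if taint.get("key") == CONTROL_PLANE_TAINT and taint.get("effect") == "NoSchedule":
            findings.append("control-plane taint blocks workload scheduling")
    return findings


def pending_pvcs(pvc_list: dict) -> List[str]:
    """Return namespace/name for every PVC still in the Pending phase."""
    return [
        f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
        for p in pvc_list.get("items", [])
        if p.get("status", {}).get("phase") == "Pending"
    ]
```

On a single-node appliance, a non-empty control-plane taint finding suggests the persisted scheduling setting is missing from machine config rather than that the taint needs manual removal.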
## Layer 3: Flux Source And GitOps
Check this layer when:
- Flux controllers are running but releases do not progress
- the cluster still seems to be reconciling an old OCI config-bundle artifact
- `HelmRelease` objects exist but do not pick up new changes
Common issues:
- `source-controller` cannot fetch the OCIRepository because cluster egress is broken
- the source artifact is stale even though the channel changed
- the wrong config-bundle digest or path is configured for the appliance profile
Operational rule:
- verify `OCIRepository` readiness and artifact revision before assuming a chart fix is in-cluster
Do not keep debugging a Helm failure if Flux is still serving an older artifact.
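As a sketch of the operational rule, the check below takes the parsed JSON of an `OCIRepository` object and the digest the operator expects to be deployed. It assumes the served revision is reported under `status.artifact.revision` and contains the digest as a substring; the function name and return labels are hypothetical.

```python
def oci_source_state(oci_repo: dict, expected_digest: str) -> str:
    """Classify an OCIRepository as 'not-ready', 'stale', or 'current'."""
    if not expected_digest:
        raise ValueError("expected_digest must be a non-empty string")
    status = oci_repo.get("status", {})
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    if not ready:
        return "not-ready"
    # The revision string embeds the artifact digest, so a substring
    # match avoids depending on the exact tag/digest separator.
    revision = status.get("artifact", {}).get("revision", "")
    if expected_digest not in revision:
        return "stale"
    return "current"
```

Anything other than `current` means Helm-level debugging is premature, because the cluster is not yet reconciling the artifact that contains the fix.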
## Layer 4: Application Bootstrap
Check this layer when:
- Postgres and Redis are healthy
- Flux is synced
- `alga-core` is still not usable
Focus on:
- bootstrap job existence and logs
- `db-credentials` availability
- whether the database was actually initialized
- whether the app pod is waiting on bootstrap or failing after it
Common issues:
- bootstrap job lifecycle is wrong for the install path
- database credentials rotated against an existing Postgres volume
- the runtime image does not contain the setup path the job expects
- the server started before bootstrap completed
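One way to separate "waiting on bootstrap" from "failing after bootstrap" is to classify the bootstrap Job first and only then interpret the app pod. The sketch assumes the Job is passed as parsed `kubectl get job -o json` output, or `None` if it does not exist; the state labels and hint texts are illustrative.

```python
from typing import Optional


def bootstrap_job_state(job: Optional[dict]) -> str:
    """Classify a Kubernetes Job from its status counters."""
    if job is None:
        return "missing"
    status = job.get("status", {})
    if (status.get("succeeded") or 0) >= 1:
        return "succeeded"
    if (status.get("active") or 0) >= 1:
        return "running"
    if (status.get("failed") or 0) >= 1:
        return "failed"
    return "pending"


def layer4_hint(job_state: str, app_ready: bool) -> str:
    """Suggest where to look next, given the job state and app readiness."""
    if job_state == "missing":
        return "bootstrap job does not exist: check the job lifecycle for this install path"
    if job_state == "failed":
        return "read bootstrap job logs: check db-credentials and the setup path in the image"
    if job_state in ("running", "pending"):
        return "bootstrap has not completed: the app is likely waiting on it"
    if not app_ready:
        return "bootstrap succeeded but the app is not ready: read the app pod logs"
    return "bootstrap and app look healthy"
```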
## Resource Pressure In Local Hypervisors
Local Talos appliance testing under emulation or desktop virtualization can look like a network or image issue when it is mostly a CPU issue.
Practical signs:
- image pulls are extremely slow but eventually succeed
- `ContainerCreating` lasts a long time without OOM evidence
- node CPU is pegged while memory pressure remains false
Implications:
- do not assume `ErrImagePull` or long pulls are only registry problems in local lab runs
- check node CPU count and allocatable resources before over-diagnosing memory
- if the environment is a single emulated VM, increasing visible vCPUs may materially improve bring-up reliability
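The CPU-versus-memory check can be approximated from a Node object alone. This sketch assumes parsed `kubectl get node -o json` input; the `min_cores` threshold of 2.0 is an arbitrary illustrative value, not a documented requirement.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "2" or "1500m" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cpu_starvation_suspected(node: dict, min_cores: float = 2.0) -> bool:
    """True when the node reports no memory pressure but has few allocatable cores."""
    status = node.get("status", {})
    memory_pressure = any(
        c.get("type") == "MemoryPressure" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    cores = parse_cpu(status.get("allocatable", {}).get("cpu", "0"))
    return (not memory_pressure) and cores < min_cores
```

A `True` result is only a prompt to look at live CPU usage and to consider giving the VM more vCPUs; allocatable cores say nothing about how slow each emulated core is.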
## Distinguishing Network From Runtime Failures
Not every slow start is a network problem.
Useful distinctions:
- if a small public image can be pulled, basic egress likely works
- if node conditions show no memory pressure, OOM is less likely to be the root cause
- if container runtime services are timing out or in unknown state, runtime instability may be the real blocker
The layered approach matters:
- fix host reachability first
- then cluster egress
- then runtime scheduling and pull behavior
- then app bootstrap
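The distinctions above amount to a short decision sequence. The sketch below assumes host reachability is already confirmed and that the three observations have been gathered by hand; the function name and return labels are hypothetical.

```python
def likely_blocker(small_image_pull_ok: bool, memory_pressure: bool, runtime_healthy: bool) -> str:
    """Pick the most likely blocker, checking lower layers first."""
    if not small_image_pull_ok:
        return "cluster egress"        # even a small public image cannot be pulled
    if not runtime_healthy:
        return "container runtime"     # runtime services timing out or unknown
    if memory_pressure:
        return "memory"                # node reports memory pressure
    return "scheduling or app bootstrap"
```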
## Fresh-Install Validation Checklist
A generic fresh-install validation should confirm:
1. Talos API is reachable and the node is healthy.
2. The Kubernetes node is `Ready`.
3. Single-node scheduling is enabled.
4. A working storage class exists for PVC-backed workloads.
5. Flux source is synced to the intended artifact revision.
6. `alga-core` bootstrap job runs.
7. Postgres databases and roles are created.
8. Migrations complete.
9. Seeds run once.
10. The server and dependent services become ready.
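A checklist like this is easy to track as data. The sketch below assumes results are recorded by step number as they are verified; the short labels and the `checklist_report` helper are illustrative and not part of any existing validation tool.

```python
from typing import Dict, List

# Short labels for the ten steps above, in order.
CHECKLIST = [
    "Talos API reachable",
    "Kubernetes node Ready",
    "single-node scheduling enabled",
    "working storage class",
    "Flux source synced",
    "bootstrap job ran",
    "databases and roles created",
    "migrations complete",
    "seeds ran once",
    "server and dependencies ready",
]


def checklist_report(results: Dict[int, bool]) -> List[str]:
    """Render one line per step: PASS, FAIL, or SKIP if never verified."""
    lines = []
    for number, label in enumerate(CHECKLIST, start=1):
        state = results.get(number)
        mark = "SKIP" if state is None else ("PASS" if state else "FAIL")
        lines.append(f"{number:>2}. [{mark}] {label}")
    return lines
```

Reporting unverified steps as `SKIP` rather than `PASS` keeps a partial validation run from reading as a successful one.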
## Recovery Guidance
Prefer durable fixes over repeated hand-applied recovery steps.
Examples:
- persist NIC and DNS changes in machine config instead of retyping them after reboot
- persist control-plane scheduling in Talos config instead of repeatedly removing taints
- fix Helm bootstrap ordering in the chart instead of repeatedly deleting failed jobs
- pin explicit image tags in the bootstrap path instead of relying on moving tags
That discipline is what turns a fragile lab sequence into an appliance model.
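The image-pinning example can be checked mechanically. This is a rough sketch: the set of "moving" tag names is an assumption to adapt per project, and the image references in the usage lines are hypothetical.

```python
# Assumption: tag names treated as moving targets. Adjust to local convention.
MOVING_TAGS = {"latest", "main", "stable", "edge"}


def is_pinned(image: str) -> bool:
    """Return True if the image reference uses a digest or a non-moving tag."""
    if "@sha256:" in image:
        return True
    # A tag follows a ":" in the final path segment only; a ":" earlier
    # in the reference belongs to the registry host:port.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in MOVING_TAGS


# Hypothetical references, for illustration only.
print(is_pinned("registry.example:5000/alga-core:1.2.3"))
print(is_pinned("registry.example:5000/alga-core"))
```

A non-moving tag can still be overwritten in a registry, so a digest reference is the stricter form of pinning.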