PSA/ee/docs/premise/talos-host-configuration.md
Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

88 lines
3.5 KiB
Markdown

# Talos Host Configuration
## Purpose
Talos should be treated as an immutable OS with one durable configuration boundary: machine configuration. If a change must survive reboot, it belongs there.
## Persistence Boundary
Temporary fixes are not enough for appliance behavior. The following must be expressed in machine configuration rather than one-off boot changes:
- network interface selection
- DHCP or static addressing
- DNS resolver selection
- host naming behavior
- single-node control-plane scheduling policy
- installer image selection
If a setting only exists in maintenance mode, boot media, or an ad hoc runtime patch and is not written into machine configuration, assume it can be lost on reboot or reinstall.
## Single-Node Appliance Scheduling
For a single-node appliance, workloads must be allowed to run on the control-plane node. The durable setting is:
```yaml
cluster:
allowSchedulingOnControlPlanes: true
```
This is preferable to repeatedly removing the `node-role.kubernetes.io/control-plane:NoSchedule` taint by hand. Manual untainting is a recovery step, not the desired steady state.
## Network Configuration
Persistent networking should be expressed with Talos network config documents rather than relying on ephemeral interface choices.
Typical durable pieces are:
- `DHCPv4Config` or static address config for the intended NIC
- `ResolverConfig` for non-DHCP resolvers when appliance DNS must be fixed
- `HostnameConfig` when a stable Talos hostname policy is desired
Prefer selectors or deterministic device identification over brittle assumptions when possible. If the appliance depends on a specific interface, make that explicit in machine configuration.
## Installer Configuration
This Talos-era guidance is historical. Supported Ubuntu/k3s appliances no longer use Talos installer image selection, and release metadata is resolved from OCI artifacts rather than local repository files.
Do not hand-edit installer image references without also re-establishing which published appliance release the node now represents.
## Boot Media Rule
After Talos is installed to disk, subsequent boots must come from the installed disk, not the installer ISO.
If the machine is started from installer media again, Talos may halt with the equivalent of:
- Talos is already installed to disk
- the machine booted from another media
- reboot from disk
That is expected behavior. For appliance operations, the steady-state rule is:
1. boot from ISO for installation
2. install Talos to disk
3. remove or detach the ISO
4. boot from disk from then on
## Storage Assumption
The current single-node appliance profile assumes node-local persistent storage. In practice, that means the Kubernetes cluster must have a working `StorageClass` suitable for local PVCs before the application stack is expected to settle.
This is a cluster-level dependency, but on a Talos appliance it is effectively part of host bring-up because the application stack relies on it for:
- Postgres
- Redis
- local file storage
- optionally Temporal persistence
## Operational Guidance
When recovering a Talos appliance node, follow this order:
1. confirm the node is booting from disk rather than installer media
2. confirm the intended machine configuration is still applied
3. confirm network and resolver configuration are present in machine config
4. confirm single-node scheduling is enabled in config
5. only then move up to Kubernetes and Flux diagnosis
That order avoids spending time on higher-level symptoms caused by a lost host configuration.