alldigital/PSA

Hermes 284313f908

Initial import of AlgaPSA codebase from PSA server

Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech

2026-06-22 16:12:17 -05:00

Talos Operations And Troubleshooting

Purpose

Talos appliance failures are easier to recover when they are classified by layer. Most wasted time comes from debugging the wrong layer first.

Support should start by collecting a support bundle whenever possible. The layered checks below define what that bundle needs to capture and how to interpret it.

Use this order:

1. host and Talos reachability
2. Kubernetes node health and storage
3. Flux source and reconcile state
4. application bootstrap and runtime health
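
The ordering above can be sketched as a small triage helper. This is a minimal illustration, not part of any appliance tooling: the layer names come from this document, while `first_failing_layer` and its boolean inputs are hypothetical.

```python
from typing import Dict, Optional

# Layers in the order this document recommends checking them.
LAYERS = [
    "host and Talos reachability",
    "Kubernetes node health and storage",
    "Flux source and reconcile state",
    "application bootstrap and runtime health",
]


def first_failing_layer(healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first layer that is not known to be healthy, or None.

    A layer missing from `healthy` counts as unverified, so it is
    reported before any later layer is considered.
    """
    for layer in LAYERS:
        if not healthy.get(layer, False):
            return layer
    return None
```

Treating an unverified layer as failing is a deliberate choice here: it keeps attention on the lowest layer that has not yet been confirmed healthy.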

Layer 1: Host And Talos

Check this layer first when:

the node disappears from the network
kubectl stops answering
talosctl cannot reach the API

Common issues:

the VM booted from installer media instead of disk
network changes were never persisted into machine config
the machine came back with the wrong NIC or resolver configuration

Typical interpretation:

if ICMP and Talos API are both gone, start at the console
if the console says Talos is installed but booted from other media, remove the ISO and boot from disk
if networking only works after manual intervention, write the fix into machine configuration
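
A minimal sketch of the interpretation rules above. It assumes the Talos API listens on its default port 50000; `tcp_reachable`, `layer1_next_step`, and their parameters are hypothetical names used only for illustration, and the ICMP result is expected to come from whatever ping tooling is already in use.

```python
import socket

TALOS_API_PORT = 50000  # default Talos API port; adjust if your setup differs


def tcp_reachable(host: str, port: int = TALOS_API_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def layer1_next_step(icmp_ok: bool, talos_api_ok: bool,
                     booted_from_installer: bool = False,
                     needed_manual_network_fix: bool = False) -> str:
    """Map Layer 1 observations to the next action described in this document."""
    if booted_from_installer:
        # Observed on the console: Talos is installed but booted from other media.
        return "remove the ISO and boot from disk"
    if not icmp_ok and not talos_api_ok:
        return "start at the console"
    if needed_manual_network_fix:
        return "write the fix into machine configuration"
    if not talos_api_ok:
        # Not covered by the rules above; staying on this layer is an assumption.
        return "Talos API unreachable: stay on layer 1"
    return "layer 1 looks healthy"
```

For example, `layer1_next_step(icmp_ok=False, talos_api_ok=tcp_reachable("192.0.2.10"))` uses a documentation-range address as a stand-in for the node IP.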

Layer 2: Kubernetes Node And Storage

Check this layer after Talos is healthy.

Focus on:

node conditions
taints
schedulability
storage classes
PVC binding

Common issues:

single-node control-plane taint prevents workload scheduling
no persistent volume provisioner exists yet
PVCs are pending, so Postgres and Redis never become healthy

Steady-state rule:

single-node appliance clusters should persist allowSchedulingOnControlPlanes: true

That is better than recovering by removing the control-plane taint manually after every reboot.
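
The Layer 2 checks can be sketched against parsed JSON, for example the output of `kubectl get node <name> -o json` and `kubectl get pvc -A -o json`, plus a machine config loaded into a dictionary. The function names are hypothetical; the field names follow the Kubernetes Node and PersistentVolumeClaim objects and the Talos `cluster.allowSchedulingOnControlPlanes` setting.

```python
from typing import Any, Dict, List

CONTROL_PLANE_TAINT = "node-role.kubernetes.io/control-plane"


def node_findings(node: Dict[str, Any]) -> List[str]:
    """List scheduling-related problems visible on a single Node object."""
    findings = []
    conditions = node.get("status", {}).get("conditions") or []
    ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                for c in conditions)
    if not ready:
        findings.append("node is not Ready")
    spec = node.get("spec", {})
    if spec.get("unschedulable"):
        findings.append("node is cordoned")
    for taint in spec.get("taints") or []:
        if taint.get("key") == CONTROL_PLANE_TAINT and taint.get("effect") == "NoSchedule":
            findings.append("control-plane taint blocks workload scheduling")
    return findings


def pending_pvcs(pvc_list: Dict[str, Any]) -> List[str]:
    """Return namespace/name for every PVC still in the Pending phase."""
    return [
        "{}/{}".format(p["metadata"]["namespace"], p["metadata"]["name"])
        for p in pvc_list.get("items", [])
        if p.get("status", {}).get("phase") == "Pending"
    ]


def scheduling_persisted(machine_config: Dict[str, Any]) -> bool:
    """True if the machine config persists control-plane scheduling."""
    return machine_config.get("cluster", {}).get("allowSchedulingOnControlPlanes") is True
```

If `node_findings` reports the taint while `scheduling_persisted` is false, the durable fix belongs in machine config rather than in a manual taint removal.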

Layer 3: Flux Source And GitOps

Check this layer when:

Flux controllers are running but releases do not progress
the cluster still seems to be reconciling an old OCI config-bundle artifact
HelmRelease objects exist but do not pick up new changes

Common issues:

source-controller cannot fetch the OCIRepository because cluster egress is broken
the source artifact is stale even though the channel changed
the wrong config-bundle digest or path is configured for the appliance profile

Operational rule:

verify OCIRepository readiness and artifact revision before assuming a chart fix is in-cluster

Do not keep debugging a Helm failure if Flux is still serving an older artifact.
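
As a rough sketch of the operational rule, the helper below inspects an OCIRepository object parsed from JSON. It assumes the object exposes a `Ready` condition and a `status.artifact.revision` string that embeds the artifact digest; `oci_source_state` and `expected_digest` are hypothetical names.

```python
from typing import Any, Dict, Optional


def oci_source_state(obj: Dict[str, Any], expected_digest: Optional[str] = None) -> str:
    """Classify an OCIRepository as 'not ready', 'no artifact', 'stale', or 'current'."""
    status = obj.get("status", {})
    ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                for c in status.get("conditions") or [])
    if not ready:
        return "not ready"
    revision = (status.get("artifact") or {}).get("revision", "")
    if not revision:
        return "no artifact"
    # Assumption: the revision string contains the digest being served.
    if expected_digest and expected_digest not in revision:
        return "stale"
    return "current"
```

Anything other than `current` suggests the problem is still in Layer 3, so a Helm failure seen at that point may simply reflect the older artifact.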

Layer 4: Application Bootstrap

Check this layer when:

Postgres and Redis are healthy
Flux is synced
alga-core is still not usable

Focus on:

bootstrap job existence and logs
db-credentials availability
whether the database was actually initialized
whether the app pod is waiting on bootstrap or failing after it

Common issues:

bootstrap job lifecycle is wrong for the install path
database credentials rotated against an existing Postgres volume
the runtime image does not contain the setup path the job expects
the server started before bootstrap completed
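
The Layer 4 questions can be sketched as a small classifier over a Kubernetes Job object parsed from JSON (or `None` when the job does not exist). The `succeeded`, `active`, and `failed` counters are standard Job status fields; the function names and the wording of the hints are illustrative assumptions.

```python
from typing import Any, Dict, Optional


def bootstrap_job_state(job: Optional[Dict[str, Any]]) -> str:
    """Classify the bootstrap job as missing, succeeded, running, failed, or not started."""
    if job is None:
        return "missing"
    status = job.get("status", {})
    if (status.get("succeeded") or 0) >= 1:
        return "succeeded"
    if (status.get("active") or 0) >= 1:
        return "running"
    if (status.get("failed") or 0) >= 1:
        return "failed"
    return "not started"


def layer4_hint(job_state: str, app_ready: bool) -> str:
    """Suggest where to look next, following the focus list above."""
    if job_state == "missing":
        return "check the bootstrap job lifecycle for this install path"
    if job_state == "failed":
        return "read the job logs; check db-credentials and the setup path in the image"
    if job_state in ("running", "not started"):
        return "the app is likely waiting on bootstrap; read the job logs"
    if not app_ready:
        return "the app is failing after bootstrap; check whether the database was initialized"
    return "layer 4 looks healthy"
```

Separating the job state from the app state mirrors the last focus item: an app pod that is waiting on bootstrap needs a different fix from one that fails after it.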

Resource Pressure In Local Hypervisors

Local Talos appliance testing under emulation or desktop virtualization can look like a network or image issue when it is mostly a CPU issue.

Practical signs:

image pulls are extremely slow but eventually succeed
ContainerCreating lasts a long time without OOM evidence
node CPU is pegged while memory pressure remains false

Implications:

do not assume ErrImagePull or long pulls are only registry problems in local lab runs
check node CPU count and allocatable resources before over-diagnosing memory
if the environment is a single emulated VM, increasing visible vCPUs may materially improve bring-up reliability
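
To make the CPU-versus-memory check concrete, the sketch below reads `status.allocatable.cpu` and the `MemoryPressure` condition from a Node object parsed from JSON. The `min_cores` threshold is an arbitrary assumption for illustration, not a documented requirement, and both function names are hypothetical.

```python
from typing import Any, Dict


def cpu_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "4" or "2000m" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def resource_hint(node: Dict[str, Any], min_cores: float = 2.0) -> str:
    """Give a first guess at whether CPU or memory deserves attention."""
    status = node.get("status", {})
    cores = cpu_cores(str(status.get("allocatable", {}).get("cpu", "0")))
    memory_pressure = any(
        c.get("type") == "MemoryPressure" and c.get("status") == "True"
        for c in status.get("conditions") or []
    )
    if memory_pressure:
        return "memory pressure reported: investigate memory"
    if cores < min_cores:
        return "low allocatable CPU: consider more vCPUs before blaming the registry"
    return "no obvious resource constraint"
```

Allocatable CPU only shows how much the node can offer; confirming that the CPU is actually pegged still requires live usage data from the hypervisor or node metrics.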

Distinguishing Network From Runtime Failures

Not every slow start is a network problem.

Useful distinctions:

if a small public image can be pulled, basic egress likely works
if node conditions show no memory pressure, OOM is less likely to be the root cause
if container runtime services are timing out or in an unknown state, runtime instability may be the real blocker

The layered approach matters:

fix host reachability first
then cluster egress
then runtime scheduling and pull behavior
then app bootstrap
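
The distinctions above can be combined into one rough decision helper. It is only a sketch: the three boolean inputs are observations gathered by hand, and `likely_blocker` is a hypothetical name.

```python
def likely_blocker(small_image_pull_ok: bool,
                   runtime_healthy: bool,
                   memory_pressure: bool) -> str:
    """Return the most likely blocker, checked in the layered order above.

    Host reachability is assumed to be confirmed already.
    """
    if not small_image_pull_ok:
        # A small public image cannot be pulled, so basic egress is suspect.
        return "cluster egress"
    if not runtime_healthy:
        # Runtime services are timing out or in an unknown state.
        return "container runtime"
    if memory_pressure:
        return "memory"
    return "scheduling, pull behavior, or app bootstrap"
```

The result is a starting point for investigation, not a diagnosis; each answer only names the next layer worth examining.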

Fresh-Install Validation Checklist

A generic fresh-install validation should confirm:

1. Talos API is reachable and the node is healthy.
2. The Kubernetes node is Ready.
3. Single-node scheduling is enabled.
4. A working storage class exists for PVC-backed workloads.
5. Flux source is synced to the intended repo revision.
6. alga-core bootstrap job runs.
7. Postgres databases and roles are created.
8. Migrations complete.
9. Seeds run once.
10. The server and dependent services become ready.

Recovery Guidance

Prefer durable fixes over repeated hand-applied recovery steps.

Examples:

persist NIC and DNS changes in machine config instead of retyping them after reboot
persist control-plane scheduling in Talos config instead of repeatedly removing taints
fix Helm bootstrap ordering in the chart instead of repeatedly deleting failed jobs
pin explicit image tags in the bootstrap path instead of relying on moving tags

That discipline is what turns a fragile lab sequence into an appliance model.
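
As one small example of enforcing a durable fix, the sketch below flags image references that are not pinned. Which tags count as "moving" is an assumption (only `latest` is listed here), and `is_pinned` is a hypothetical helper, not part of any chart or tool.

```python
MOVING_TAGS = {"latest"}  # assumption: add any other floating tags used in your registry


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference uses a digest or an explicit, non-moving tag."""
    if "@sha256:" in image_ref:
        return True
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as "registry:5000".
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in MOVING_TAGS
```

For instance, `is_pinned("registry.example.com:5000/app/setup")` is false because the reference has no tag, while the same reference with `:1.4.2` appended is treated as pinned. Both names are placeholders.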
