PSA/docs/architecture/citus-migration-best-practices.md
Hermes 284313f908
Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Initial import of AlgaPSA codebase from PSA server
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz

Source: /opt/alga-psa on psa.joliet.tech
2026-06-22 16:12:17 -05:00

9.4 KiB

CitusDB Migration Best Practices

This document outlines best practices for writing database migrations that work correctly with both standard PostgreSQL and CitusDB (distributed PostgreSQL).

Key Differences with CitusDB

CitusDB distributes tables across multiple worker nodes (shards). This creates unique challenges for migrations:

  1. Distributed Data: Table data is split across multiple physical tables (shards)
  2. Coordinator vs Workers: The coordinator node has metadata, workers have actual data
  3. Query Routing: Some queries run on coordinator, some on workers, some on both
  4. Schema Changes: DDL operations may need to be applied to both coordinator and shards

Common Issues

Issue 1: ALTER TABLE NOT NULL Fails Despite No NULL Values

Symptom: Migration fails with "column contains null values" even though SELECT queries show 0 NULL values.

Cause:

  • SELECT queries may only check coordinator or cached data
  • ALTER TABLE validates across ALL shards, which may find NULL values queries missed
  • Metadata sync issues between coordinator and workers

Solution: Use the Citus-aware approach shown in migration 20251113120000_add_project_number.cjs:

// Check if this is a Citus distributed table
const isCitus = await knex.raw(`
  SELECT EXISTS (
    SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table_name'::regclass
  ) as is_distributed
`);

if (isCitus.rows[0]?.is_distributed) {
  // Set NOT NULL on all shards first
  await knex.raw(`
    SELECT * FROM run_command_on_shards(
      'table_name',
      $$ALTER TABLE %s ALTER COLUMN column_name SET NOT NULL$$
    )
  `);

  // Update coordinator metadata
  await knex.raw(`
    UPDATE pg_attribute
    SET attnotnull = true
    WHERE attrelid = 'table_name'::regclass
    AND attname = 'column_name'
    AND attnotnull = false
  `);
} else {
  // Standard PostgreSQL
  await knex.raw(`ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL`);
}

Issue 2: Data Backfill Not Visible Immediately

Symptom: UPDATE statements complete successfully, but subsequent SELECT shows unchanged data.

Cause: Eventual consistency in distributed systems - changes take time to propagate across shards.

Solution:

  • Add delays after bulk updates (3-5 seconds)
  • Use raw SQL queries to force fresh reads
  • Query actual rows, not just COUNT(*), to force distributed query execution
// After backfill
await new Promise(resolve => setTimeout(resolve, 5000));

// Force distributed check by querying actual rows
const nullProjects = await knex.raw(`
  SELECT project_id, tenant, project_name
  FROM projects
  WHERE column_name IS NULL
  LIMIT 100
`);

Issue 3: Transaction Isolation Problems

Symptom: Changes made in same migration aren't visible to later queries in the migration.

Cause: Citus doesn't support all operations inside transactions.

Solution: Disable transactions for migrations with distributed operations:

// At end of migration file
exports.config = { transaction: false };

Issue 4: ALTER ... SET NOT NULL Fails After Distributing a Non-Empty Table

Symptom: ALTER TABLE x ALTER COLUMN c SET NOT NULL fails with column "c" of relation "x" contains null values, even though every SELECT ... WHERE c IS NULL (including per-shard checks and backfills) returns zero rows and the distributed data is fully populated.

Cause: create_distributed_table('x', ...) was run while x already contained rows, and the official follow-up was not run. Citus copies the existing rows into the shard tables (x_<shardid>) but leaves the originals in the coordinator's physical heap for the parent relation. Those shadow rows are unreachable by every Citus-routed statement (all DML is rewritten to the shards), so backfills and SELECT checks correctly see them as gone. But ALTER TABLE ... SET NOT NULL is core PostgreSQL DDL — ATRewriteTable scans the parent relation's physical heap directly and trips over the stranded rows' NULL values.

Solution: Run Citus's supported cleanup, truncate_local_data_after_distributing_table('x'), after distributing the table. It empties only the coordinator-local parent heap of distributed tables — never shard data. If a non-distributed table holds an FK referencing x, the function refuses to run (it will not implicitly TRUNCATE-cascade a local table). The correct fix is to make that referrer a proper distributed table co-located with x (not to drop/recreate the FK, and never to mutate pg_dist_* catalogs). See server/migrations/20260513100800_distribute_email_reply_tokens.cjs.

The rule: any migration that runs create_distributed_table() on a table that may already contain rows MUST immediately follow with truncate_local_data_after_distributing_table() on that table. Skipping it is invisible until the first parent-heap-scanning DDL (like SET NOT NULL) runs — possibly many migrations later.

Best Practices

1. Always Check for Citus (Safely)

Before performing DDL that might behave differently on Citus, check if the table is distributed. Important: Wrap the check in try-catch since pg_dist_partition doesn't exist in standard PostgreSQL:

let isCitusDistributed = false;
try {
  const citusCheck = await knex.raw(`
    SELECT EXISTS (
      SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table_name'::regclass
    ) as is_distributed
  `);
  isCitusDistributed = citusCheck.rows[0]?.is_distributed;
} catch (error) {
  // pg_dist_partition doesn't exist - standard PostgreSQL
  isCitusDistributed = false;
}

if (isCitusDistributed) {
  // Use Citus-specific approach
} else {
  // Use standard PostgreSQL approach
}

2. Use Raw SQL for DDL

Knex's schema builder may not handle Citus correctly. Use raw SQL:

// ❌ Avoid
await knex.schema.alterTable('table', (table) => {
  table.string('column').notNullable().alter();
});

// ✅ Prefer
await knex.raw(`ALTER TABLE table ALTER COLUMN column SET NOT NULL`);

3. Make Migrations Idempotent

Always check if changes already exist:

// Check if column exists
const columnExists = await knex.raw(`
  SELECT column_name
  FROM information_schema.columns
  WHERE table_name = 'table_name'
  AND column_name = 'column_name'
`);

if (columnExists.rows.length === 0) {
  // Add column
}

4. Verify Across All Shards

Don't just count - query actual data:

// ❌ May miss issues
const count = await knex('table').whereNull('column').count();

// ✅ Better - forces distributed query
const rows = await knex.raw(`
  SELECT id FROM table WHERE column IS NULL LIMIT 10
`);

5. Use Citus Helper Functions

Citus provides functions specifically for distributed operations:

  • run_command_on_shards(table, command) - Run SQL on all shards
  • run_command_on_all_nodes(command) - Run on coordinator and all workers
  • run_command_on_workers(command) - Run on all workers

Example:

await knex.raw(`
  SELECT * FROM run_command_on_shards(
    'projects',
    $$UPDATE %s SET column = value WHERE condition$$
  )
`);

6. Add Appropriate Delays

After bulk updates in Citus, add delays:

// After backfilling data
console.log('Waiting for distributed changes to propagate...');
await new Promise(resolve => setTimeout(resolve, 5000));

7. Truncate Local Data After Distributing a Non-Empty Table

Whenever a migration distributes a table that may already hold rows, follow it immediately with the official cleanup:

await knex.raw("SELECT create_distributed_table('x', 'tenant', colocate_with => 'y')");
// REQUIRED follow-up — strands no shadow rows in the coordinator parent heap:
await knex.raw('SELECT truncate_local_data_after_distributing_table(?::regclass)', ['x']);

Guard it so it stays a no-op on clean installs and re-runs: skip when Citus is absent, when the table is not distributed, or when pg_relation_size('x'::regclass) is already 0 (a cleanly-distributed table has a 0-byte parent heap). See Issue 4 for why omitting this surfaces as a confusing SET NOT NULL failure later.

Testing Migrations

Test on Both PostgreSQL and Citus

  1. Local development: Usually runs standard PostgreSQL
  2. Staging: Should mirror production (Citus if prod uses Citus)
  3. Production: May use Citus

Ensure migrations work in all environments:

// Good pattern - works everywhere
const isCitus = await knex.raw(`
  SELECT EXISTS (
    SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table'::regclass
  ) as is_distributed
`);

if (isCitus.rows[0]?.is_distributed) {
  // Citus-specific logic
} else {
  // Standard PostgreSQL logic
}

Migration Checklist

Before deploying a migration:

  • Migration is idempotent (can be run multiple times safely)
  • If it distributes a possibly-non-empty table, it calls truncate_local_data_after_distributing_table() right after (see Issue 4)
  • Tested on both PostgreSQL and CitusDB
  • Large backfills include delays for propagation
  • DDL changes check for Citus and handle appropriately
  • Uses exports.config = { transaction: false } if needed
  • Includes verification steps after data changes
  • Has proper rollback logic in exports.down
  • Logs progress for debugging

References