# CitusDB Migration Best Practices This document outlines best practices for writing database migrations that work correctly with both standard PostgreSQL and CitusDB (distributed PostgreSQL). ## Key Differences with CitusDB CitusDB distributes tables across multiple worker nodes (shards). This creates unique challenges for migrations: 1. **Distributed Data**: Table data is split across multiple physical tables (shards) 2. **Coordinator vs Workers**: The coordinator node has metadata, workers have actual data 3. **Query Routing**: Some queries run on coordinator, some on workers, some on both 4. **Schema Changes**: DDL operations may need to be applied to both coordinator and shards ## Common Issues ### Issue 1: ALTER TABLE NOT NULL Fails Despite No NULL Values **Symptom**: Migration fails with "column contains null values" even though SELECT queries show 0 NULL values. **Cause**: - SELECT queries may only check coordinator or cached data - ALTER TABLE validates across ALL shards, which may find NULL values queries missed - Metadata sync issues between coordinator and workers **Solution**: Use the Citus-aware approach shown in migration `20251113120000_add_project_number.cjs`: ```javascript // Check if this is a Citus distributed table const isCitus = await knex.raw(` SELECT EXISTS ( SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table_name'::regclass ) as is_distributed `); if (isCitus.rows[0]?.is_distributed) { // Set NOT NULL on all shards first await knex.raw(` SELECT * FROM run_command_on_shards( 'table_name', $$ALTER TABLE %s ALTER COLUMN column_name SET NOT NULL$$ ) `); // Update coordinator metadata await knex.raw(` UPDATE pg_attribute SET attnotnull = true WHERE attrelid = 'table_name'::regclass AND attname = 'column_name' AND attnotnull = false `); } else { // Standard PostgreSQL await knex.raw(`ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL`); } ``` ### Issue 2: Data Backfill Not Visible Immediately **Symptom**: UPDATE statements complete successfully, but subsequent SELECT shows unchanged data. **Cause**: Eventual consistency in distributed systems - changes take time to propagate across shards. **Solution**: - Add delays after bulk updates (3-5 seconds) - Use raw SQL queries to force fresh reads - Query actual rows, not just COUNT(*), to force distributed query execution ```javascript // After backfill await new Promise(resolve => setTimeout(resolve, 5000)); // Force distributed check by querying actual rows const nullProjects = await knex.raw(` SELECT project_id, tenant, project_name FROM projects WHERE column_name IS NULL LIMIT 100 `); ``` ### Issue 3: Transaction Isolation Problems **Symptom**: Changes made in same migration aren't visible to later queries in the migration. **Cause**: Citus doesn't support all operations inside transactions. **Solution**: Disable transactions for migrations with distributed operations: ```javascript // At end of migration file exports.config = { transaction: false }; ``` ### Issue 4: ALTER ... SET NOT NULL Fails After Distributing a Non-Empty Table **Symptom**: `ALTER TABLE x ALTER COLUMN c SET NOT NULL` fails with `column "c" of relation "x" contains null values`, even though every `SELECT ... WHERE c IS NULL` (including per-shard checks and backfills) returns zero rows and the distributed data is fully populated. **Cause**: `create_distributed_table('x', ...)` was run while `x` already contained rows, and the official follow-up was **not** run. Citus copies the existing rows into the shard tables (`x_`) but **leaves the originals in the coordinator's physical heap for the parent relation**. Those shadow rows are unreachable by every Citus-routed statement (all DML is rewritten to the shards), so backfills and `SELECT` checks correctly see them as gone. But `ALTER TABLE ... SET NOT NULL` is core PostgreSQL DDL — `ATRewriteTable` scans the parent relation's physical heap directly and trips over the stranded rows' NULL values. **Solution**: Run Citus's supported cleanup, `truncate_local_data_after_distributing_table('x')`, after distributing the table. It empties only the coordinator-local parent heap of distributed tables — never shard data. If a **non-distributed** table holds an FK referencing `x`, the function refuses to run (it will not implicitly TRUNCATE-cascade a local table). The correct fix is to make that referrer a proper distributed table co-located with `x` (not to drop/recreate the FK, and never to mutate `pg_dist_*` catalogs). See `server/migrations/20260513100800_distribute_email_reply_tokens.cjs`. **The rule**: any migration that runs `create_distributed_table()` on a table that may already contain rows MUST immediately follow with `truncate_local_data_after_distributing_table()` on that table. Skipping it is invisible until the first parent-heap-scanning DDL (like `SET NOT NULL`) runs — possibly many migrations later. ## Best Practices ### 1. Always Check for Citus (Safely) Before performing DDL that might behave differently on Citus, check if the table is distributed. **Important**: Wrap the check in try-catch since `pg_dist_partition` doesn't exist in standard PostgreSQL: ```javascript let isCitusDistributed = false; try { const citusCheck = await knex.raw(` SELECT EXISTS ( SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table_name'::regclass ) as is_distributed `); isCitusDistributed = citusCheck.rows[0]?.is_distributed; } catch (error) { // pg_dist_partition doesn't exist - standard PostgreSQL isCitusDistributed = false; } if (isCitusDistributed) { // Use Citus-specific approach } else { // Use standard PostgreSQL approach } ``` ### 2. Use Raw SQL for DDL Knex's schema builder may not handle Citus correctly. Use raw SQL: ```javascript // ❌ Avoid await knex.schema.alterTable('table', (table) => { table.string('column').notNullable().alter(); }); // ✅ Prefer await knex.raw(`ALTER TABLE table ALTER COLUMN column SET NOT NULL`); ``` ### 3. Make Migrations Idempotent Always check if changes already exist: ```javascript // Check if column exists const columnExists = await knex.raw(` SELECT column_name FROM information_schema.columns WHERE table_name = 'table_name' AND column_name = 'column_name' `); if (columnExists.rows.length === 0) { // Add column } ``` ### 4. Verify Across All Shards Don't just count - query actual data: ```javascript // ❌ May miss issues const count = await knex('table').whereNull('column').count(); // ✅ Better - forces distributed query const rows = await knex.raw(` SELECT id FROM table WHERE column IS NULL LIMIT 10 `); ``` ### 5. Use Citus Helper Functions Citus provides functions specifically for distributed operations: - `run_command_on_shards(table, command)` - Run SQL on all shards - `run_command_on_all_nodes(command)` - Run on coordinator and all workers - `run_command_on_workers(command)` - Run on all workers Example: ```javascript await knex.raw(` SELECT * FROM run_command_on_shards( 'projects', $$UPDATE %s SET column = value WHERE condition$$ ) `); ``` ### 6. Add Appropriate Delays After bulk updates in Citus, add delays: ```javascript // After backfilling data console.log('Waiting for distributed changes to propagate...'); await new Promise(resolve => setTimeout(resolve, 5000)); ``` ### 7. Truncate Local Data After Distributing a Non-Empty Table Whenever a migration distributes a table that may already hold rows, follow it immediately with the official cleanup: ```javascript await knex.raw("SELECT create_distributed_table('x', 'tenant', colocate_with => 'y')"); // REQUIRED follow-up — strands no shadow rows in the coordinator parent heap: await knex.raw('SELECT truncate_local_data_after_distributing_table(?::regclass)', ['x']); ``` Guard it so it stays a no-op on clean installs and re-runs: skip when Citus is absent, when the table is not distributed, or when `pg_relation_size('x'::regclass)` is already 0 (a cleanly-distributed table has a 0-byte parent heap). See Issue 4 for why omitting this surfaces as a confusing `SET NOT NULL` failure later. ## Testing Migrations ### Test on Both PostgreSQL and Citus 1. **Local development**: Usually runs standard PostgreSQL 2. **Staging**: Should mirror production (Citus if prod uses Citus) 3. **Production**: May use Citus Ensure migrations work in all environments: ```javascript // Good pattern - works everywhere const isCitus = await knex.raw(` SELECT EXISTS ( SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table'::regclass ) as is_distributed `); if (isCitus.rows[0]?.is_distributed) { // Citus-specific logic } else { // Standard PostgreSQL logic } ``` ## Migration Checklist Before deploying a migration: - [ ] Migration is idempotent (can be run multiple times safely) - [ ] If it distributes a possibly-non-empty table, it calls `truncate_local_data_after_distributing_table()` right after (see Issue 4) - [ ] Tested on both PostgreSQL and CitusDB - [ ] Large backfills include delays for propagation - [ ] DDL changes check for Citus and handle appropriately - [ ] Uses `exports.config = { transaction: false }` if needed - [ ] Includes verification steps after data changes - [ ] Has proper rollback logic in `exports.down` - [ ] Logs progress for debugging ## References - [Citus Documentation](https://docs.citusdata.com/) - [Citus Schema-Based Sharding](https://docs.citusdata.com/en/stable/sharding/data_modeling.html) - Migration example: `server/migrations/20251113120000_add_project_number.cjs`