Some checks are pending
Bidi Control Character Guard / bidi-control-guard (push) Waiting to run
Circular Dependency Check / Check for new circular dependencies (push) Waiting to run
Citus Migration Smoke / Combined migrations on single-node Citus (push) Waiting to run
E2E Fresh Install Tests / fresh-install-e2e (push) Waiting to run
ext-v2 guardrails / Run ext-v2 guard and ESLint (push) Waiting to run
Integration Tests / Check for relevant changes (push) Waiting to run
Integration Tests / ${{ (github.event_name == 'schedule' || github.event.inputs.suite == 'full') && 'Full integration suite' || 'Tier-1 integration subset' }} (push) Blocked by required conditions
Mobile checks / Mobile lint + typecheck (push) Waiting to run
Mobile checks / Mobile unit tests (push) Waiting to run
Mobile checks / Mobile dependency audit (report) (push) Waiting to run
Mobile checks / Mobile reproducibility checks (push) Waiting to run
Secrets guard (env backups) / Ensure no tracked env backup files (push) Waiting to run
Temporal Readiness / fast-readiness (push) Waiting to run
Temporal Readiness / docker-parity (push) Waiting to run
TypeScript Type Check / Nx affected typecheck (push) Waiting to run
Unit Tests / Skipped-test budget (push) Waiting to run
Unit Tests / Nx affected unit tests (push) Waiting to run
Unit Tests / Server unit coverage (informational) (push) Waiting to run
Validate Tenant Management Schema / Check for relevant changes (push) Waiting to run
Validate Tenant Management Schema / Validate Tenant Management Schema (push) Blocked by required conditions
EE Workflows Build Guard / ee-workflows-build-guard (push) Waiting to run
Excluded: .git, node_modules, secrets/, compose.env, assemblyscript tgz Source: /opt/alga-psa on psa.joliet.tech
262 lines
9.4 KiB
Markdown
262 lines
9.4 KiB
Markdown
# CitusDB Migration Best Practices
|
|
|
|
This document outlines best practices for writing database migrations that work correctly with both standard PostgreSQL and CitusDB (distributed PostgreSQL).
|
|
|
|
## Key Differences with CitusDB
|
|
|
|
CitusDB distributes tables across multiple worker nodes (shards). This creates unique challenges for migrations:
|
|
|
|
1. **Distributed Data**: Table data is split across multiple physical tables (shards)
|
|
2. **Coordinator vs Workers**: The coordinator node has metadata, workers have actual data
|
|
3. **Query Routing**: Some queries run on coordinator, some on workers, some on both
|
|
4. **Schema Changes**: DDL operations may need to be applied to both coordinator and shards
|
|
|
|
## Common Issues
|
|
|
|
### Issue 1: ALTER TABLE NOT NULL Fails Despite No NULL Values
|
|
|
|
**Symptom**: Migration fails with "column contains null values" even though SELECT queries show 0 NULL values.
|
|
|
|
**Cause**:
|
|
- SELECT queries may only check coordinator or cached data
|
|
- ALTER TABLE validates across ALL shards, which may find NULL values queries missed
|
|
- Metadata sync issues between coordinator and workers
|
|
|
|
**Solution**: Use the Citus-aware approach shown in migration `20251113120000_add_project_number.cjs`:
|
|
|
|
```javascript
|
|
// Check if this is a Citus distributed table
|
|
const isCitus = await knex.raw(`
|
|
SELECT EXISTS (
|
|
SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table_name'::regclass
|
|
) as is_distributed
|
|
`);
|
|
|
|
if (isCitus.rows[0]?.is_distributed) {
|
|
// Set NOT NULL on all shards first
|
|
await knex.raw(`
|
|
SELECT * FROM run_command_on_shards(
|
|
'table_name',
|
|
$$ALTER TABLE %s ALTER COLUMN column_name SET NOT NULL$$
|
|
)
|
|
`);
|
|
|
|
// Update coordinator metadata
|
|
await knex.raw(`
|
|
UPDATE pg_attribute
|
|
SET attnotnull = true
|
|
WHERE attrelid = 'table_name'::regclass
|
|
AND attname = 'column_name'
|
|
AND attnotnull = false
|
|
`);
|
|
} else {
|
|
// Standard PostgreSQL
|
|
await knex.raw(`ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL`);
|
|
}
|
|
```
|
|
|
|
### Issue 2: Data Backfill Not Visible Immediately
|
|
|
|
**Symptom**: UPDATE statements complete successfully, but subsequent SELECT shows unchanged data.
|
|
|
|
**Cause**: Eventual consistency in distributed systems - changes take time to propagate across shards.
|
|
|
|
**Solution**:
|
|
- Add delays after bulk updates (3-5 seconds)
|
|
- Use raw SQL queries to force fresh reads
|
|
- Query actual rows, not just COUNT(*), to force distributed query execution
|
|
|
|
```javascript
|
|
// After backfill
|
|
await new Promise(resolve => setTimeout(resolve, 5000));
|
|
|
|
// Force distributed check by querying actual rows
|
|
const nullProjects = await knex.raw(`
|
|
SELECT project_id, tenant, project_name
|
|
FROM projects
|
|
WHERE column_name IS NULL
|
|
LIMIT 100
|
|
`);
|
|
```
|
|
|
|
### Issue 3: Transaction Isolation Problems
|
|
|
|
**Symptom**: Changes made in same migration aren't visible to later queries in the migration.
|
|
|
|
**Cause**: Citus doesn't support all operations inside transactions.
|
|
|
|
**Solution**: Disable transactions for migrations with distributed operations:
|
|
|
|
```javascript
|
|
// At end of migration file
|
|
exports.config = { transaction: false };
|
|
```
|
|
|
|
### Issue 4: ALTER ... SET NOT NULL Fails After Distributing a Non-Empty Table
|
|
|
|
**Symptom**: `ALTER TABLE x ALTER COLUMN c SET NOT NULL` fails with `column "c" of relation "x" contains null values`, even though every `SELECT ... WHERE c IS NULL` (including per-shard checks and backfills) returns zero rows and the distributed data is fully populated.
|
|
|
|
**Cause**: `create_distributed_table('x', ...)` was run while `x` already contained rows, and the official follow-up was **not** run. Citus copies the existing rows into the shard tables (`x_<shardid>`) but **leaves the originals in the coordinator's physical heap for the parent relation**. Those shadow rows are unreachable by every Citus-routed statement (all DML is rewritten to the shards), so backfills and `SELECT` checks correctly see them as gone. But `ALTER TABLE ... SET NOT NULL` is core PostgreSQL DDL — `ATRewriteTable` scans the parent relation's physical heap directly and trips over the stranded rows' NULL values.
|
|
|
|
**Solution**: Run Citus's supported cleanup, `truncate_local_data_after_distributing_table('x')`, after distributing the table. It empties only the coordinator-local parent heap of distributed tables — never shard data. If a **non-distributed** table holds an FK referencing `x`, the function refuses to run (it will not implicitly TRUNCATE-cascade a local table). The correct fix is to make that referrer a proper distributed table co-located with `x` (not to drop/recreate the FK, and never to mutate `pg_dist_*` catalogs). See `server/migrations/20260513100800_distribute_email_reply_tokens.cjs`.
|
|
|
|
**The rule**: any migration that runs `create_distributed_table()` on a table that may already contain rows MUST immediately follow with `truncate_local_data_after_distributing_table()` on that table. Skipping it is invisible until the first parent-heap-scanning DDL (like `SET NOT NULL`) runs — possibly many migrations later.
|
|
|
|
## Best Practices
|
|
|
|
### 1. Always Check for Citus (Safely)
|
|
|
|
Before performing DDL that might behave differently on Citus, check if the table is distributed. **Important**: Wrap the check in try-catch since `pg_dist_partition` doesn't exist in standard PostgreSQL:
|
|
|
|
```javascript
|
|
let isCitusDistributed = false;
|
|
try {
|
|
const citusCheck = await knex.raw(`
|
|
SELECT EXISTS (
|
|
SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table_name'::regclass
|
|
) as is_distributed
|
|
`);
|
|
isCitusDistributed = citusCheck.rows[0]?.is_distributed;
|
|
} catch (error) {
|
|
// pg_dist_partition doesn't exist - standard PostgreSQL
|
|
isCitusDistributed = false;
|
|
}
|
|
|
|
if (isCitusDistributed) {
|
|
// Use Citus-specific approach
|
|
} else {
|
|
// Use standard PostgreSQL approach
|
|
}
|
|
```
|
|
|
|
### 2. Use Raw SQL for DDL
|
|
|
|
Knex's schema builder may not handle Citus correctly. Use raw SQL:
|
|
|
|
```javascript
|
|
// ❌ Avoid
|
|
await knex.schema.alterTable('table', (table) => {
|
|
table.string('column').notNullable().alter();
|
|
});
|
|
|
|
// ✅ Prefer
|
|
await knex.raw(`ALTER TABLE table ALTER COLUMN column SET NOT NULL`);
|
|
```
|
|
|
|
### 3. Make Migrations Idempotent
|
|
|
|
Always check if changes already exist:
|
|
|
|
```javascript
|
|
// Check if column exists
|
|
const columnExists = await knex.raw(`
|
|
SELECT column_name
|
|
FROM information_schema.columns
|
|
WHERE table_name = 'table_name'
|
|
AND column_name = 'column_name'
|
|
`);
|
|
|
|
if (columnExists.rows.length === 0) {
|
|
// Add column
|
|
}
|
|
```
|
|
|
|
### 4. Verify Across All Shards
|
|
|
|
Don't just count - query actual data:
|
|
|
|
```javascript
|
|
// ❌ May miss issues
|
|
const count = await knex('table').whereNull('column').count();
|
|
|
|
// ✅ Better - forces distributed query
|
|
const rows = await knex.raw(`
|
|
SELECT id FROM table WHERE column IS NULL LIMIT 10
|
|
`);
|
|
```
|
|
|
|
### 5. Use Citus Helper Functions
|
|
|
|
Citus provides functions specifically for distributed operations:
|
|
|
|
- `run_command_on_shards(table, command)` - Run SQL on all shards
|
|
- `run_command_on_all_nodes(command)` - Run on coordinator and all workers
|
|
- `run_command_on_workers(command)` - Run on all workers
|
|
|
|
Example:
|
|
```javascript
|
|
await knex.raw(`
|
|
SELECT * FROM run_command_on_shards(
|
|
'projects',
|
|
$$UPDATE %s SET column = value WHERE condition$$
|
|
)
|
|
`);
|
|
```
|
|
|
|
### 6. Add Appropriate Delays
|
|
|
|
After bulk updates in Citus, add delays:
|
|
|
|
```javascript
|
|
// After backfilling data
|
|
console.log('Waiting for distributed changes to propagate...');
|
|
await new Promise(resolve => setTimeout(resolve, 5000));
|
|
```
|
|
|
|
### 7. Truncate Local Data After Distributing a Non-Empty Table
|
|
|
|
Whenever a migration distributes a table that may already hold rows, follow it immediately with the official cleanup:
|
|
|
|
```javascript
|
|
await knex.raw("SELECT create_distributed_table('x', 'tenant', colocate_with => 'y')");
|
|
// REQUIRED follow-up — strands no shadow rows in the coordinator parent heap:
|
|
await knex.raw('SELECT truncate_local_data_after_distributing_table(?::regclass)', ['x']);
|
|
```
|
|
|
|
Guard it so it stays a no-op on clean installs and re-runs: skip when Citus is absent, when the table is not distributed, or when `pg_relation_size('x'::regclass)` is already 0 (a cleanly-distributed table has a 0-byte parent heap). See Issue 4 for why omitting this surfaces as a confusing `SET NOT NULL` failure later.
|
|
|
|
## Testing Migrations
|
|
|
|
### Test on Both PostgreSQL and Citus
|
|
|
|
1. **Local development**: Usually runs standard PostgreSQL
|
|
2. **Staging**: Should mirror production (Citus if prod uses Citus)
|
|
3. **Production**: May use Citus
|
|
|
|
Ensure migrations work in all environments:
|
|
|
|
```javascript
|
|
// Good pattern - works everywhere
|
|
const isCitus = await knex.raw(`
|
|
SELECT EXISTS (
|
|
SELECT 1 FROM pg_dist_partition WHERE logicalrelid = 'table'::regclass
|
|
) as is_distributed
|
|
`);
|
|
|
|
if (isCitus.rows[0]?.is_distributed) {
|
|
// Citus-specific logic
|
|
} else {
|
|
// Standard PostgreSQL logic
|
|
}
|
|
```
|
|
|
|
## Migration Checklist
|
|
|
|
Before deploying a migration:
|
|
|
|
- [ ] Migration is idempotent (can be run multiple times safely)
|
|
- [ ] If it distributes a possibly-non-empty table, it calls `truncate_local_data_after_distributing_table()` right after (see Issue 4)
|
|
- [ ] Tested on both PostgreSQL and CitusDB
|
|
- [ ] Large backfills include delays for propagation
|
|
- [ ] DDL changes check for Citus and handle appropriately
|
|
- [ ] Uses `exports.config = { transaction: false }` if needed
|
|
- [ ] Includes verification steps after data changes
|
|
- [ ] Has proper rollback logic in `exports.down`
|
|
- [ ] Logs progress for debugging
|
|
|
|
## References
|
|
|
|
- [Citus Documentation](https://docs.citusdata.com/)
|
|
- [Citus Schema-Based Sharding](https://docs.citusdata.com/en/stable/sharding/data_modeling.html)
|
|
- Migration example: `server/migrations/20251113120000_add_project_number.cjs`
|