Guide
Database migrations explained
Harbor Supply's checkout API ran fine in staging. Production deploy at
2 a.m. added a fulfillment_status column with
NOT NULL and no default — PostgreSQL rewrote the
entire orders table under an exclusive lock while eight API
pods still served traffic against the old schema. Checkout returned 500s
for forty minutes. The fix was not “run migrations faster”;
it was treating database migrations as a first-class
release artifact: versioned, reviewed, ordered with application deploys,
and designed for zero-downtime when tables are large.
Migrations are how teams evolve schema — new columns, indexes,
constraints, partitions — without losing data or taking unplanned
outages. They differ from ad-hoc SQL in production because every change
is tracked, replayable on fresh environments, and auditable. This guide
covers migration fundamentals, forward-only vs reversible scripts, the
expand-contract pattern, tool ecosystems, a Harbor Supply order-schema
worked example, an approach decision table, common pitfalls, and a
production checklist. Pair migrations with
transaction and isolation
knowledge and
PostgreSQL internals
when you need to reason about locks and MVCC during DDL.
What a database migration is
A migration is a numbered, ordered change to database structure or
reference data. Each step is typically a SQL file (or generated SQL
from an ORM) with an up direction that applies the change
and sometimes a down direction that reverses it. A
migration history table (e.g.
schema_migrations, _prisma_migrations,
flyway_schema_history) records which versions ran on
which database so the same sequence can be replayed on developer
laptops, CI databases, staging, and production.
Migrations solve three problems manual SQL does not. First, reproducibility: a new engineer clones the repo and gets an identical schema by running the chain from empty to current. Second, coordination: application code and schema version move together — deploy v2.4 only after migration 20240610_add_status ran. Third, reviewability: DDL diffs go through pull request review like application code, catching full-table locks and destructive drops before they hit production. Migrations are not a substitute for backups or disaster recovery; they are the controlled path for intentional schema evolution.
Versioning and ordering rules
Every migration needs a unique, monotonically increasing identifier.
Common schemes: timestamp prefixes (20240610120000_add_index),
sequential integers (V14__add_orders_index.sql), or
content hashes. Never reuse or reorder applied migrations — once
migration 14 ran in production, editing file 14 breaks every
environment that already recorded it. If you made a mistake, ship
migration 15 that corrects it.
Immutability is the golden rule: applied migrations are append-only history. Teams that rewrite old files eventually fork dev and prod histories, requiring painful manual repair. Squashing is allowed only before any shared environment has applied the squashed range — typically early in a greenfield project, not on a live orders table with five years of history.
Run migrations in CI against ephemeral databases on every pull request. Fail the build if migration N applies but N+1 fails, if generated SQL differs from committed files, or if a migration takes longer than your lock budget on a representative data volume.
Forward-only vs reversible migrations
Reversible (up/down) migrations let you roll back
schema by running down scripts. They work well in early
development and for additive changes (drop column in down after add
column in up). In production, true reversibility is rare: you cannot
un-drop a column if new rows wrote data into it, and dropping an
index that query plans now depend on may cause sudden full scans.
Mature teams default to forward-only migrations plus application rollback. If deploy v2.5 is bad, roll the app back to v2.4 — but leave schema at the v2.5 level if the new column is nullable and ignored by old code. Forward-only pairs naturally with expand-contract (below). Document “rollback plan” as “redeploy previous app binary” plus “feature flag off,” not “run down.sql on prod Friday night.”
Zero-downtime: the expand-contract pattern
Breaking schema changes — rename column, change type, add
NOT NULL, split table — need multiple deploy
steps so old and new application versions coexist during rolling
updates. The expand-contract (or expand-migrate-contract)
pattern splits work across releases:
- Expand. Add the new column/table/index in a backward-compatible way (nullable column, new table alongside old). Deploy migration; old code ignores the addition.
- Migrate data. Backfill in batches with
UPDATE ... WHERE id BETWEENor background jobs. Avoid one transaction that locks millions of rows. UseCONCURRENTLYfor index creation in PostgreSQL. - Dual-write / dual-read. Application writes to both old and new fields; reads prefer new with fallback. Verify consistency with reconciliation jobs.
- Contract. After all pods run new code, migration removes the old column, tightens constraints, or drops deprecated objects.
Renaming status to fulfillment_status without
downtime: add fulfillment_status, backfill from
status, deploy code that writes both, switch reads,
deploy code that writes only new column, drop status in a
final migration. Skipping steps causes the Harbor Supply outage pattern:
schema and code changed in a single atomic deploy on a locked table.
DDL operations and lock awareness
Not all DDL is equal. Adding a nullable column with
no default is fast on PostgreSQL 11+ (metadata-only). Adding
NOT NULL without a default forces a full table rewrite
and exclusive lock. Creating an index without
CONCURRENTLY blocks writes. Foreign keys validated
immediately lock both tables.
Before shipping migration SQL, ask: what lock mode does this take, how
long on our largest table, and can we run it online? Use
pg_stat_activity and
lock_timeout in maintenance windows for unavoidable
exclusive operations. For
index design,
plan indexes in migrations separate from column adds so you can create
them concurrently and monitor bloat.
Migration tools and ORM integration
Tools fall into SQL-first and schema-first families. Flyway and Liquibase run versioned SQL or XML changelogs; popular in Java and polyglot shops. Alembic pairs with SQLAlchemy in Python. golang-migrate and goose serve Go services. Prisma Migrate, Drizzle Kit, and Rails Active Record generate SQL from declarative schema files — convenient for TypeScript and Ruby teams but requiring discipline to hand-edit generated SQL for production-safe DDL.
ORM-generated migrations excel at developer velocity; SQL-first tools
excel when DBAs own review. Hybrid teams generate drafts, then curate
SQL for CONCURRENTLY, batch backfills, and comments.
Whichever tool you pick, store migrations in git beside application
code and run them from
CI/CD
with the same promotion gates as binaries.
Harbor Supply order schema evolution (worked example)
Harbor Supply needed to split a single status enum
(“pending”, “shipped”, “cancelled”)
into separate payment_status and
fulfillment_status fields for a new warehouse integration.
The wrong approach — one migration renaming and retyping under
lock — caused their outage. The corrected plan:
- Migration 041 (expand).
ALTER TABLE orders ADD COLUMN payment_status text, ADD COLUMN fulfillment_status text;Both nullable. Deploy; v3.2 app unchanged. - Backfill job. Batch update 10k rows at a time:
map old
statusto both new columns. Run off-peak; monitor replication lag. - Deploy v3.3. Write both old and new columns on
every order update; read from new columns with fallback to
status. - Deploy v3.4. Read/write only new columns.
- Migration 042 (contract). Drop
status; add check constraints on allowed enum values; add partial index onfulfillment_statuswhere not terminal.
Total calendar time: two weeks. Zero checkout downtime. The team added
a migration review checklist to every PR touching
prisma/migrations/ and required staging load tests on a
snapshot with production row counts.
Approach decision table
| Scenario | Recommended approach | Avoid |
|---|---|---|
| New nullable column, app can ignore it | Single forward migration; deploy app after | Blocking NOT NULL in same release |
| Rename or retype column on large table | Expand-contract across 2–4 releases | ALTER ... RENAME during traffic |
| New index on 100M-row table | CREATE INDEX CONCURRENTLY in dedicated migration |
Default index creation inside transaction |
| Seed reference data (countries, roles) | Idempotent INSERT ... ON CONFLICT in migration | Destructive DELETE then INSERT without guard |
| Destructive drop column/table | Contract phase after telemetry proves zero reads | Dropping before all app versions migrated |
| Multi-service shared database (legacy) | Coordinate migration order; feature flags per service | Assuming one deploy owns all readers/writers |
Common pitfalls
- Editing applied migrations. Forks history; always append a fix-forward migration.
- Schema ahead of code. New NOT NULL column deploys before app writes it — insert failures.
- Code ahead of schema. App queries new column before migration runs — column missing errors.
- Long transactions in migrations. Backfills in one TX hold locks and fill WAL; batch instead.
- Missing lock timeouts. A stuck migration blocks all writers until someone kills the session.
- Environment drift. Manual hotfix SQL in prod not captured in git — staging never matches prod.
- Data loss on down. Down migration drops column with production data; forward-only plus backups is safer.
- Ignoring replication. Heavy DDL on primary stalls replicas; watch lag during backfills.
Production checklist
- Every schema change lives in a versioned migration file committed to git.
- CI applies full migration chain on ephemeral DB for each PR.
- Review checklist covers lock mode, estimated duration, and rollback (app version, not down.sql).
- Large-table changes use expand-contract or online DDL (
CONCURRENTLY). - Backfills are batched with progress metrics and replication lag alerts.
- Migration runs before or during deploy per documented order; document which.
- Staging migration tested on snapshot with production-scale row counts.
- Backup or PITR verified before irreversible contract-phase migrations.
- Migration history table is never hand-edited except documented disaster recovery.
- Post-deploy monitors query errors and slow queries tied to new indexes or constraints.
Key takeaways
- Migrations are release artifacts. Schema and application version must advance in a planned sequence.
- Immutability preserves trust. Append fix-forward migrations; never rewrite applied history.
- Zero-downtime is a design choice. Expand-contract turns risky DDL into boring, incremental steps.
- Locks matter more than syntax. Understand what PostgreSQL does on your largest table before merge.
- Forward-only plus app rollback beats reversible down scripts in real production incidents.
Related reading
- Database transactions and isolation levels explained — ACID, MVCC, deadlocks, and why long migrations hurt concurrency
- PostgreSQL fundamentals explained — MVCC, WAL, VACUUM, and index internals
- Prisma fundamentals explained — schema-first modeling and Migrate workflows
- Database indexing explained — B-trees, compound keys, and EXPLAIN plans