Explainer · 7 June 2026

How database replication and read replicas work

A single database server eventually hits a ceiling — CPU for query execution, disk IOPS for random reads, or connection slots for concurrent clients. Replication keeps a copy of your data on one or more additional machines so you can spread read load, survive hardware failure, or place data closer to users. A read replica (also called a standby, follower, or secondary) applies the same ordered stream of changes as the primary but typically rejects client writes. The catch is that copies are almost never perfectly synchronized: replication lag means a user who updates their profile on the primary may not see the change on a replica for milliseconds — or seconds under load. This explainer walks through primary-follower topology, how changes propagate, sync vs async trade-offs, routing reads safely, and what happens when the primary dies.

Primary-follower topology

The most common pattern is primary-follower (leader-replica, master-slave in older docs). One node — the primary — accepts all writes: INSERT, UPDATE, DELETE, and DDL. It records each change in a durable log (Postgres WAL, MySQL binlog, MongoDB oplog). One or more followers pull or receive that log and replay it locally, updating their own copy of tables and indexes.
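
The log-then-replay loop described above can be sketched as a toy in-memory model. This is only an illustration: the class names, the integer lsn counter, and the pull method are hypothetical and stand in for a real WAL, binlog, or oplog.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    lsn: int      # position in the log, strictly increasing
    key: str
    value: str


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key: str, value: str) -> int:
        # Record the change in the log first, then apply it locally.
        entry = LogEntry(lsn=len(self.log) + 1, key=key, value=value)
        self.log.append(entry)
        self.data[key] = value
        return entry.lsn


@dataclass
class Follower:
    data: dict = field(default_factory=dict)
    applied_lsn: int = 0   # last log position replayed locally

    def pull(self, primary: Primary, max_entries: int = 100) -> int:
        # Fetch entries after our position and replay them in log order.
        batch = primary.log[self.applied_lsn:self.applied_lsn + max_entries]
        for entry in batch:
            self.data[entry.key] = entry.value
            self.applied_lsn = entry.lsn
        return len(batch)
```

Until the follower calls pull, its data is simply older than the primary's. The gap between len(primary.log) and follower.applied_lsn is the replication lag in this model.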

Followers are usually opened for read-only queries. Application code or a proxy routes SELECT traffic to replicas while writes stay on the primary. This is horizontal read scaling: three replicas can roughly triple read throughput if queries are embarrassingly parallel and replicas stay warm. It does not scale writes — every write still lands on one primary, and the log must be shipped to every follower.

Multi-primary (active-active) replication exists but is rarer for OLTP because conflicting concurrent writes on two primaries require conflict resolution. Most teams start with one writable primary and treat replicas as read-only extensions until they have a concrete need for multi-region writes.

How changes propagate: physical vs logical replication

Physical replication ships byte-level WAL records or block-level page changes. The follower is a near-identical clone: same major version, same extensions, same table OIDs. Postgres streaming replication works this way. MySQL binlog replication, even in row-based format, ships logical row changes rather than page images, so it belongs to the logical model described next. Physical replicas are fast and simple but replicate everything in the database cluster, including internal catalog changes.

Logical replication decodes the WAL into row-level change events (INSERT INTO users VALUES (...)) and applies them to subscriber tables. Postgres logical replication, MySQL replication filters, and change-data-capture (CDC) tools like Debezium use this model. Logical replication can subscribe to a subset of tables, fan out to analytics warehouses, or replicate across major versions — at the cost of more CPU on the publisher and occasional schema-migration footguns.

Both models preserve commit order within a replication stream: if T1 committed before T2 on the primary, a follower never exposes T2's changes before T1's. Transactions also stay atomic across tables. When a user row and a billing row update in one transaction, the replica must apply both or neither.
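
A minimal sketch of the "both or neither" rule, assuming a hypothetical event format of ("change", xid, table, key, row) and ("commit", xid) tuples. Real decoders emit richer messages; the point here is only that changes are buffered per transaction and become visible at commit, in commit order.

```python
from collections import defaultdict


def apply_stream(events, tables):
    """Apply change events so each transaction becomes visible all at once.

    events: iterable of ("change", xid, table, key, row) or ("commit", xid).
    tables: dict of table name -> dict of key -> row, mutated in place.
    Returns the set of transaction ids that were seen but never committed.
    """
    pending = defaultdict(list)   # xid -> buffered changes awaiting commit
    for event in events:
        if event[0] == "change":
            _, xid, table, key, row = event
            pending[xid].append((table, key, row))
        elif event[0] == "commit":
            _, xid = event
            # Apply every buffered change of this transaction together.
            for table, key, row in pending.pop(xid, []):
                tables.setdefault(table, {})[key] = row
    return set(pending)
```

A transaction whose commit record has not arrived leaves no trace in tables, so a reader on the follower never sees the user row without the matching billing row.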

Synchronous vs asynchronous replication

The durability-vs-latency knob is when the primary considers a transaction committed:

Asynchronous — The primary writes WAL locally, returns success to the client, and ships WAL to followers in the background. If the primary crashes before a follower catches up, the last few committed transactions may be lost unless another mechanism (consensus layer, cloud storage) holds them. Latency is minimal; RPO (recovery point objective) may be nonzero.
Synchronous — The primary waits until at least one follower acknowledges that it has persisted the WAL (in Postgres, synchronous_commit = on together with a standby listed in synchronous_standby_names). Commit latency includes a network round trip to the replica. If the sync standby dies, commits block until another standby takes over or the configuration changes: availability is traded for durability.
Quorum / semi-sync — Wait for k of n replicas (CockroachDB, Galera, some cloud multi-AZ primaries). This sits between full sync and fire-and-forget async; see our Raft consensus explainer for how quorum commit rules generalize beyond databases.
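
The three modes differ only in how many acknowledgements gate the commit. The sketch below is a simplified latency model, not a measurement of any real system: it assumes WAL is shipped after the local flush finishes, and the function and parameter names are hypothetical.

```python
def commit_latency_ms(local_flush_ms, replica_ack_ms, required_acks):
    """Estimate when the primary can report success to the client.

    replica_ack_ms: per-replica time to receive, persist, and acknowledge.
    required_acks:  0 for asynchronous, 1 for a single synchronous standby,
                    k for a k-of-n quorum.
    Assumption: shipping starts only after the local flush completes.
    """
    if required_acks == 0:
        return local_flush_ms
    if required_acks > len(replica_ack_ms):
        raise ValueError("not enough replicas to satisfy required_acks")
    # The k-th fastest acknowledgement is the one that unblocks the commit.
    kth_ack = sorted(replica_ack_ms)[required_acks - 1]
    return local_flush_ms + kth_ack
```

With acknowledgement times of 5, 40, and 2 ms, waiting for one replica costs the 2 ms of the fastest one, while waiting for all three costs the 40 ms of the slowest. A quorum of two sits in between and tolerates the slow replica without dropping to async.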

Managed services (RDS, Cloud SQL, Aurora) hide the knobs but the trade-off remains: multi-AZ with synchronous storage replication gives smaller RPO; read replicas in another region are usually async with measurable lag.

Replication lag and stale reads

Replication lag is the delay between commit on the primary and visibility on a follower. It shows up as:

Time lag — "this replica is 2.3 seconds behind primary" (pg_stat_replication.replay_lag in Postgres).
Byte lag — WAL bytes not yet applied; spikes during bulk loads or long-running replica queries that block replay.
Transaction ID lag — how far behind the replica's visible snapshot is in XID space.
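
Byte lag is the easiest of these to compute by hand. A Postgres LSN is printed as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit byte position), so subtracting two of them gives bytes behind. The helper names below are hypothetical; on the server, pg_wal_lsn_diff() performs the same subtraction.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def byte_lag(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed."""
    return parse_lsn(primary_lsn) - parse_lsn(replay_lsn)
```

For example, byte_lag("16/B374D848", "16/B374D000") is 0x848, or 2120 bytes. The inputs would typically come from pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica.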

Lag sources include network bandwidth, follower disk speed, single-threaded replay (one WAL applier per follower in classic Postgres), heavy read load on the replica contending for I/O, and large transactions that generate megabytes of WAL at once. A follower running a 30-second analytics scan can indirectly increase lag if replay and queries share disks.

Stale reads happen when an app reads from a lagging replica immediately after a write to the primary. Classic bug: user saves settings, page reloads, settings appear reverted because the read replica has not replayed the update. Mitigations:

Read-your-writes — Route the user's next reads to the primary for a short window after their write, or stick sessions to primary until lag drops below a threshold.
Monotonic reads — Pin a user to one replica for the session so they never time-travel backward across replicas at different positions.
Consistent read timestamp — Some systems expose "read at least as fresh as T" APIs; Spanner uses TrueTime; Postgres lacks this natively on async replicas.
Accept staleness — Dashboards, search indexes, and recommendation feeds often tolerate seconds of lag; payment balances usually do not.
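
The read-your-writes mitigation can be sketched as a small router that pins a user to the primary for a fixed window after each write. The class name, the 5-second default, and the "primary"/"replica" return values are illustrative assumptions; a production version would more likely compare the replica's replay position with the position of the user's last write.

```python
import time


class ReadYourWritesRouter:
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock          # injectable so tests can control time
        self._last_write = {}       # user_id -> time of most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        # Recent writers read from the primary; everyone else may use a replica.
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

The window should comfortably exceed the lag you normally observe. If lag regularly exceeds it, the settings-appear-reverted bug returns.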

The CAP theorem framing applies: async replication chooses availability and partition tolerance while allowing stale reads — not a bug, a documented trade-off.

Routing reads and connection pools

Read routing lives in one of three places:

Application driver — ORMs or drivers with read/write split endpoints (Rails multi-db, Prisma read replicas, JDBC failover URLs).
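Connection pooler / proxy — Pgpool-II or ProxySQL can inspect each statement and send SELECTs to replica pools. PgBouncer, HAProxy, and RDS Proxy do not parse SQL for routing; they expose separate read and write endpoints and the application picks one. See our connection pooling guide for why each replica needs its own pool: replicas do not share the primary's connection count.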
DNS / service discovery — read.example.com resolves to a load-balanced replica fleet; writes go to write.example.com.
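
As a rough illustration of statement-based splitting at the application layer, the sketch below sends only plain SELECTs outside a transaction to replicas and everything else to the primary. The class is hypothetical and the string matching is deliberately conservative. It does not detect SELECTs that call functions with side effects, so treat it as a starting point rather than a safe router.

```python
import itertools


class ReadWriteSplitter:
    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; None means every query uses the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql, in_transaction=False):
        statement = sql.lstrip().upper()
        is_plain_select = (
            statement.startswith("SELECT")
            and "FOR UPDATE" not in statement   # locking reads need the primary
            and "FOR SHARE" not in statement
        )
        if self._replicas is None or in_transaction or not is_plain_select:
            return self.primary
        return next(self._replicas)
```

Anything the check is unsure about falls through to the primary, which costs some read scaling but never sends a write to a read-only node.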

Load balancers in front of replicas should use health checks that include lag, not just TCP up: a replica 60 seconds behind may be "healthy" for uptime but poisonous for user-facing reads. Some teams remove outliers from the pool automatically when lag exceeds SLO.
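
A lag-aware health check can be as simple as filtering the pool before each routing decision. In this sketch the function name and the input shape (a dict of replica name to lag in seconds, with None meaning the lag could not be measured) are assumptions made for illustration.

```python
def routable_replicas(lag_seconds_by_replica, max_lag_seconds):
    """Return replicas whose lag is known and within the SLO, freshest first."""
    within_slo = [
        (lag, name)
        for name, lag in lag_seconds_by_replica.items()
        if lag is not None and lag <= max_lag_seconds
    ]
    return [name for lag, name in sorted(within_slo)]
```

An empty result is a signal in its own right: the caller has to decide whether to fall back to the primary or serve stale data knowingly.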

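Query performance can differ between primary and replica. A Postgres physical replica receives planner statistics through WAL, so its plans normally match the primary's, but its buffer cache holds only what the replica's own read workload has touched, and a cold replica can be much slower. A logical subscriber gathers its own statistics, so ANALYZE (or autovacuum) must run there. When reads are mysteriously slow on replicas alone, run EXPLAIN on the replica rather than on the primary.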

Failover and promotion

When the primary fails, a follower must be promoted to accept writes. Manual promotion: an operator picks the standby with the least lag, runs pg_promote() or an equivalent, repoints DNS, and waits for clients to reconnect. Automated failover: Patroni, Orchestrator, or RDS Multi-AZ detects the failure, promotes a standby, and updates a virtual IP or DNS record, typically with 30–120 seconds of write unavailability.
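
Picking the least-lagged standby amounts to choosing the one with the highest replayed log position. The sketch below assumes positions are already available as integers (for Postgres, a parsed LSN) and uses None for standbys that cannot be reached. The function name is hypothetical, and real failover tools weigh additional factors such as priority tags and replication health.

```python
def pick_promotion_candidate(standbys):
    """Choose the reachable standby that has replayed the most log.

    standbys: dict of name -> replayed position (int), or None if unreachable.
    """
    reachable = {name: pos for name, pos in standbys.items() if pos is not None}
    if not reachable:
        raise RuntimeError("no reachable standby to promote")
    # Highest position means least data loss; ties go to the first name
    # alphabetically so the choice is deterministic.
    return max(sorted(reachable), key=lambda name: reachable[name])
```

Any transactions the old primary committed beyond the chosen position are lost under async replication, which is the nonzero RPO discussed earlier.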

Split-brain is the nightmare: two nodes both believe they are primary and accept writes, so their data diverges. Prevention uses STONITH (shoot the other node in the head), consensus (etcd + Patroni), or cloud control planes that revoke the old primary's write access at the hypervisor. After failover, a former primary must rejoin as a follower and resync, sometimes from a full base backup if its WAL has diverged too far.
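
One common building block for split-brain prevention is a monotonically increasing epoch (Raft calls it a term) that is bumped on every promotion, so that followers can reject a deposed primary. The sketch below illustrates that general idea only. It is not how Patroni, etcd, or any specific tool implements fencing, and the class and method names are hypothetical.

```python
class FencedFollower:
    def __init__(self):
        self.epoch = 0   # highest primary epoch this follower has seen
        self.log = []

    def accept(self, sender_epoch, entry):
        # A sender with an older epoch was primary before the last failover.
        if sender_epoch < self.epoch:
            return False
        self.epoch = sender_epoch
        self.log.append(entry)
        return True
```

Once a follower has accepted an entry from the epoch-2 primary, a late write from the epoch-1 primary is refused instead of silently forking the log.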

Cascading replicas (replica of a replica) reduce primary WAL fan-out but add hop lag and complicate promotion: if the intermediate node dies, downstream replicas need re-parenting.
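
Re-parenting can be pictured on a small topology map in which each node names its upstream. The sketch assumes a hypothetical dict representation (node -> upstream, with None for the primary) and covers only the failure of an intermediate node. Losing the primary is a promotion problem, not a re-parenting one.

```python
def hops_from_primary(upstream_of, node):
    """Count replication hops between a node and the primary."""
    hops = 0
    while upstream_of[node] is not None:
        node = upstream_of[node]
        hops += 1
    return hops


def reparent(upstream_of, failed):
    """Return a new topology where children of `failed` follow its upstream."""
    new_upstream = upstream_of[failed]
    if new_upstream is None:
        raise ValueError("the primary failed; promote a standby instead")
    return {
        node: (new_upstream if parent == failed else parent)
        for node, parent in upstream_of.items()
        if node != failed
    }
```

With r2 and r3 replicating from r1, and r1 from the primary, removing r1 moves both downstream replicas from two hops to one, at the price of two more streams fanning out from the primary.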

When replicas help — and when they do not

Replicas excel at:

Read-heavy workloads (product catalogs, feeds, reporting) where stale data is acceptable or routed carefully.
Geographic read locality — EU users query an EU replica while writes go to a US primary (with lag accepted).
Backup and maintenance — take snapshots from a replica without pausing the primary; run pg_dump off the hot path.
Disaster recovery — warm standby ready to promote.

Replicas do not fix:

Write bottlenecks — shard, partition, or redesign hot keys instead.
Strong cross-row consistency on reads, unless those reads go to the primary or run in a single ACID transaction on one node.
Cache invalidation — replicas do not invalidate CDN or app caches when data changes; you still need event-driven invalidation or TTL discipline.

Common pitfalls

Assuming replicas are instant — load tests that write then read from replica in the same millisecond flake in production.
One pool for everything — long analytics queries on replicas starve replay or user-facing reads unless resource groups or separate replica tiers exist.
Ignoring replication slot bloat — logical slots that stall can prevent WAL recycling on the primary and fill the disk.
Failover without connection drain — apps holding stale connections to the old primary keep erroring until their pools refresh; use short DNS TTLs and pooler health checks.
Never testing failover — a promoted replica that is missing extensions, sequences, or logical subscriptions is discovered only during an outage.
Treating lag metrics as vanity — alert on p99 lag and bytes behind, not just average.
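
To see why the average hides trouble, compare it with a nearest-rank p99 over the same lag samples. The helper names and the sample values used in the note below are illustrative assumptions, not measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def lag_alert(samples, slo_seconds):
    """Fire when the 99th percentile of lag exceeds the SLO."""
    return percentile(samples, 99) > slo_seconds
```

Take 98 samples at 0.1 s plus two spikes of 30 s and 45 s. The mean is about 0.85 s and would pass a 1-second SLO, while the p99 is 30 s and correctly raises the alert.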

Practical checklist

Document RPO/RTO targets and whether replicas are async or sync.
Define which queries require primary reads vs replica-safe staleness.
Monitor lag per replica; auto-drain replicas that exceed SLO.
Size replica I/O for replay + read load combined, not reads alone.
Run failover game days: promote, fail back, measure client recovery time.
Pair replication with backups: a replica is not a backup, because an accidental DELETE or logical corruption replicates to every follower.
