Guide

Database sharding explained: horizontal partitioning, shard keys and hotspots

Database sharding splits one logical dataset across multiple physical databases so no single server must hold every row or absorb every write. When a monolithic PostgreSQL or MySQL instance hits CPU, disk I/O, or connection limits, teams reach for sharding — but it is one of the hardest scaling moves because it breaks assumptions about joins, transactions, and global uniqueness. This guide explains horizontal vs vertical partitioning, how to choose a shard key, why celebrity users create hotspots, and which simpler steps — indexes, read replicas, and caching — should come first.

What sharding is (and what it is not)

A shard is a subset of rows that lives on its own database node. The application (or a proxy like Vitess, Citus, or MongoDB’s mongos router) routes each query to the correct shard based on a shard key — typically a column such as user_id, tenant_id, or order_id. Rows with the same shard key value always land on the same shard, which preserves local ACID transactions for that key’s working set.

Sharding is not the same as:

  • Replication — copying the full dataset to read replicas; every replica still holds all rows.
  • Vertical partitioning — splitting columns or tables onto different servers (users table on DB A, orders on DB B) without splitting rows within a table.
  • Native table partitioning — PostgreSQL declarative partitions or MySQL RANGE/LIST partitions on one server; useful for pruning old data, but the hardware ceiling remains.

True horizontal sharding distributes rows of the same table across machines. The payoff is higher aggregate write throughput and storage headroom. The cost is application complexity: cross-shard joins become expensive or impossible, and two-phase commit across shards is rare in production OLTP systems.

Why teams shard: the scaling wall

A single primary database eventually bottlenecks on one or more dimensions:

  • Write throughput — WAL fsync rate, lock contention on hot rows, index maintenance on every INSERT.
  • Storage — disk size and backup/restore windows grow with total row count.
  • Memory — the working set of hot indexes may exceed RAM, causing cache thrashing.
  • Connections — each app instance opens pooled connections; thousands of clients exhaust max_connections even when queries are cheap.

Before sharding, exhaust cheaper levers. Profile slow queries and add covering indexes. Scale the primary vertically (more CPU/RAM/IOps). Add read replicas for read-heavy workloads. Introduce Redis or CDN caching for idempotent reads. Archive cold data to object storage. Many products never need sharding; those that do usually have sustained write rates in the tens of thousands per second per logical table or multi-terabyte active sets with strict latency SLOs.

Choosing a shard key

The shard key is the most important design decision. A good key satisfies three properties:

  1. High cardinality — many distinct values so load spreads evenly (user IDs, not country codes).
  2. Query locality — most queries filter on the shard key so the router can target one shard instead of broadcasting to all.
  3. Stability — the key should not change after insert; resharding because you picked the wrong column is painful.

Common patterns

Tenant-based sharding (B2B SaaS): shard on tenant_id. Each customer’s data is isolated; cross-tenant analytics runs on a warehouse, not OLTP shards. User-based sharding (consumer apps): shard on user_id so a user’s posts, settings, and sessions co-locate. Hash sharding: apply a hash function to the key modulo N shards for even distribution when natural keys are skewed.

Hotspots and the celebrity problem

Even with hashing, skew happens. A viral post, a flash-sale SKU, or a single tenant_id for your largest enterprise customer can concentrate writes on one shard while siblings sit idle. Mitigations include:

  • Sub-sharding hot tenants onto dedicated hardware.
  • Splitting write-heavy counters into separate tables with randomized suffix keys (write sharding).
  • Using time-based partition keys for append-only logs (events per day per shard).
  • Monitoring per-shard QPS and storage; alert when any shard exceeds 2× the fleet average.

Auto-increment primary keys are a poor shard key: new rows always hit the “latest” shard if you range-partition by ID. Prefer hash(user_id) or UUID v4 with hash routing.

Consistent hashing and rebalancing

When you add or remove shards, naive hash(key) % N remaps almost every row — a full data migration. Consistent hashing maps keys and shards onto a ring; adding a shard only moves keys in adjacent ranges. Managed systems (DynamoDB, Cassandra, Vitess) hide this behind virtual nodes and automatic rebalancing jobs.

Resharding in self-managed Postgres often means: dual-write to old and new shard maps, backfill historical rows, verify counts, flip read traffic, then retire old nodes. Plan maintenance windows or online migration tools (gh-ost, logical replication slots) because getting this wrong loses data or doubles it.

In microservices architectures, each service may own a small database that is already “sharded” by bounded context — orders DB, inventory DB, billing DB. That is vertical decomposition at the service level; intra-service sharding is the next step when one service’s database alone outgrows a machine.

Cross-shard queries and distributed transactions

SQL joins across shards do not work like local joins. The query planner must:

  • Scatter-gather — send the query to all shards, merge results in the coordinator (expensive for large scans).
  • Denormalize — duplicate lookup fields on each shard to avoid joins (same trade-off as denormalization).
  • Fan-out via search — index cross-shard data in Elasticsearch or a warehouse for analytics while OLTP stays shard-local.

Multi-shard ACID transactions (two-phase commit) add latency and failure modes; most teams avoid them. Instead, design workflows so a single request touches one shard: place an order and all its line items on the shard determined by order_id. Cross-shard side effects propagate through message queues with idempotent consumers — the same eventual-consistency patterns described in distributed systems consistency.

Global uniqueness (emails, invoice numbers) needs a central sequence service, UUIDs, or snowflake IDs generated outside any single shard’s auto-increment.

Sharding in practice: tools and trade-offs

Teams rarely build shard routers from scratch. Common options:

  • PostgreSQL + Citus — extension that distributes tables; co-located shards share a node for related rows.
  • MySQL + Vitess — YouTube-origin proxy layer used by PlanetScale; handles resharding workflows.
  • MongoDB — native sharded clusters with config servers and mongos routers.
  • AWS DynamoDB / Azure Cosmos DB — managed partition keys with per-partition throughput limits (hot partition throttling is the managed equivalent of hotspots).

Connection pooling becomes more complex: each app pod may need pools per shard or a single proxy (PgBouncer, Vitess vtgate) that multiplexes. See connection pooling guidance for sizing when shard count multiplies backend endpoints.

Observability must break out latency and error rates per shard. A healthy fleet average can hide one dying shard that serves 5% of users. Tag metrics with shard_id and run synthetic probes per shard.

When not to shard

Sharding is irreversible complexity. Delay it if:

  • Reads dominate and replicas plus cache fix the problem.
  • Your schema has frequent cross-table joins in the hot path.
  • You lack operational muscle for multi-node failover and backup per shard.
  • Data volume fits one modern NVMe instance with headroom for 12–18 months.

Event sourcing and append-only logs sometimes scale writes by partitioning events by aggregate ID without full relational sharding — but read models still need their own scaling plan. Splitting by event store partition is complementary, not a substitute for understanding relational shard trade-offs.

Practical checklist before you shard

  1. Profile the top 10 write queries — can indexes or query rewrites cut load 50%?
  2. Measure replica lag; if reads are the pain, add replicas and cache layers first.
  3. Pick a shard key with high cardinality and query locality; document hotspot mitigation for your largest tenants.
  4. Prototype scatter-gather cost for your three most common cross-shard reports.
  5. Define idempotent async flows for any operation that spans shards.
  6. Plan resharding before day one — how will you add shard #N+1 without downtime?
  7. Instrument per-shard metrics, backups, and restore drills independently.

Key takeaways

  • Sharding splits rows across databases to scale writes and storage beyond one machine.
  • The shard key drives even load, query locality, and migration cost — hash of user_id or tenant_id is the usual starting point.
  • Hotspots from skewed keys need monitoring and sub-sharding, not just more shards.
  • Cross-shard joins are expensive; denormalize, queue async updates, or query a warehouse instead.
  • Try indexes, vertical scale, replicas, and caching before accepting sharding’s operational tax.