
Load balancing explained

A load balancer sits between clients and a pool of backend servers, distributing incoming requests so no single machine bears the full traffic load. Without one, scaling means hoping users hit different servers by chance — or manually updating DNS every time you add capacity. Load balancers solve three problems at once: capacity (spread work across replicas), availability (route around failed nodes), and operability (drain or replace backends without downtime). This guide covers the two main balancer types (Layer 4 and Layer 7), the algorithms that decide which server gets each request, how health checks keep bad nodes out of rotation, and the production mistakes that turn a load balancer into a single point of failure.

Why load balancing exists

Modern web apps rarely run on a single server. Traffic spikes, zero-downtime deploys, and geographic latency all push teams toward horizontal scaling — running multiple identical copies of the same service behind a shared entry point. The load balancer is that entry point: clients connect to one stable hostname or IP, and the balancer forwards each connection or HTTP request to a healthy backend.

Vertical scaling (bigger CPU, more RAM on one box) hits hardware ceilings and creates a fragile single point of failure. Horizontal scaling trades that for coordination complexity — session state, cache coherence, database connections — but the load balancer is the piece that makes the trade worthwhile. In Kubernetes, the Service resource is effectively a built-in load balancer that maps a cluster IP to a set of pod endpoints, updated automatically as replicas scale up or down.

Load balancer vs reverse proxy

The terms overlap. A reverse proxy (nginx, HAProxy, Envoy) terminates client connections and forwards them upstream. When it distributes across multiple upstreams, it is acting as a load balancer. Cloud-managed products (AWS Application Load Balancer, Google Cloud Load Balancing) bundle distribution with TLS termination, WAF rules, and autoscaling integration. The distinction matters less than knowing what layer your balancer operates at and which algorithm it uses.

Layer 4 vs Layer 7 load balancing

Layer 4 (transport layer)

An L4 balancer routes based on IP address and TCP/UDP port — it does not inspect HTTP headers or URL paths. It is fast and protocol-agnostic: the same balancer can front MySQL replicas, gRPC services, or WebSocket servers. AWS Network Load Balancer (NLB) is a common L4 product. Because L4 balancers see only packets, they cannot route /api to one pool and /static to another; every connection to port 443 goes to the same backend group.

Layer 7 (application layer)

An L7 balancer understands HTTP semantics: Host header, URL path, cookies, and request method. That enables path-based routing (/v2/* to the new API fleet), header-based canary releases (requests carrying X-Beta: true go to the canary fleet), weighted canaries (5% of traffic to the new version), and content-type routing. L7 balancers can also cache responses, compress bodies, and inject security headers, features that overlap with a CDN at the edge. The tradeoff is higher per-request CPU cost and tighter coupling to HTTP; non-HTTP protocols need L4 or a dedicated gateway.
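The routing decisions described above can be sketched as a small lookup. This is only an illustration of the idea, not how any particular proxy implements it: the ROUTES table, the pool names, and choose_pool are hypothetical, and the X-Beta header is taken from the example in the text.

```python
# Hypothetical routing table: (path prefix, backend pool name).
ROUTES = [
    ("/v2/", "api-v2-pool"),
    ("/static/", "static-pool"),
    ("/", "default-pool"),
]


def choose_pool(path: str, headers: dict[str, str]) -> str:
    """Pick a backend pool from HTTP-level information (an L7 decision)."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-beta") == "true":
        return "canary-pool"
    # Longest prefix wins, so "/v2/" is tested before the catch-all "/".
    for prefix, pool in sorted(ROUTES, key=lambda r: len(r[0]), reverse=True):
        if path.startswith(prefix):
            return pool
    return "default-pool"


print(choose_pool("/v2/orders", {}))
print(choose_pool("/index.html", {"X-Beta": "true"}))
```

An L4 balancer cannot make either decision, because the path and headers are only visible after the HTTP request has been parsed.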

Many production stacks use both: an L4 balancer at the network edge for high-throughput TCP passthrough, and an L7 reverse proxy inside the cluster for routing and TLS termination close to the application.

Load balancing algorithms

The algorithm determines which backend receives the next request. Picking the wrong one causes uneven load, cache stampedes, or broken sessions.

Round robin

Requests cycle through backends in fixed order: A, B, C, A, B, C. Simple and stateless. Works well when every server has identical capacity and requests have similar cost. Fails when one server is slower (it still gets equal share) or when request sizes vary wildly (one heavy report skews a replica that also serves lightweight pings).
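A minimal sketch of the rotation, assuming a fixed list of backends and no health tracking. The RoundRobin class name is illustrative only.

```python
from itertools import cycle


class RoundRobin:
    """Hand out backends in a fixed, repeating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = cycle(backends)

    def pick(self) -> str:
        return next(self._cycle)


rr = RoundRobin(["A", "B", "C"])
order = [rr.pick() for _ in range(6)]
print(order)  # ['A', 'B', 'C', 'A', 'B', 'C']
```

Note that the picker knows nothing about request cost or backend speed, which is exactly why it struggles with uneven workloads.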

Weighted round robin

Assign each backend a weight proportional to its capacity. A machine with twice the CPU might get weight 2 and receive roughly twice the traffic. Useful during rolling deploys when the new version starts with weight 1 and ramps to weight 10 after soak time, or when mixing instance types in the same pool.
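One way to implement weighting is the "smooth" variant popularised by nginx, which interleaves picks instead of sending bursts to the heaviest backend. The sketch below assumes static positive integer weights; the class and backend names are hypothetical.

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round robin over a fixed set of backends."""

    def __init__(self, weights: dict[str, int]):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be positive integers")
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def pick(self) -> str:
        # Every backend earns credit equal to its weight on each round.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # The backend with the most credit is chosen and pays the total back.
        chosen = max(self._current, key=self._current.get)
        self._current[chosen] -= self._total
        return chosen


wrr = SmoothWeightedRoundRobin({"big": 2, "small": 1})
picks = [wrr.pick() for _ in range(6)]
print(picks)  # ['big', 'small', 'big', 'big', 'small', 'big']
```

Over any full cycle of three picks, "big" receives two requests and "small" one, matching the 2:1 weights.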

Least connections

Send the next request to the backend with the fewest active connections. Better than plain round robin when requests have variable duration — long-polling, WebSocket sessions, or slow database queries. Requires the balancer to track per-backend connection counts, which adds state but pays off under skewed workloads.
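The bookkeeping this requires can be sketched as follows. The sketch is single-threaded and assumes the caller reliably reports when a connection ends; a real balancer would guard the counters with a lock or use atomic operations. The class name is illustrative.

```python
class LeastConnections:
    """Route each new connection to the backend with the fewest active ones."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}
        if not self.active:
            raise ValueError("at least one backend is required")

    def acquire(self) -> str:
        # Ties go to the backend listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call when the connection closes; never let the count go negative.
        if self.active[backend] > 0:
            self.active[backend] -= 1


lc = LeastConnections(["A", "B"])
first = lc.acquire()   # "A" (tie, first listed)
second = lc.acquire()  # "B"
lc.release("B")        # B's short request finishes first
third = lc.acquire()   # "B" again, because A is still busy
print(first, second, third)
```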

IP hash and consistent hashing

IP hash maps each client IP to a fixed backend. The same user always hits the same server, a crude form of session affinity without cookies, but with plain modulo hashing a change in pool size remaps most clients. Consistent hashing (used in distributed caches and some service meshes) places both request keys and servers on a hash ring, so adding or removing a node reshuffles only the keys adjacent to that node (roughly 1/N of them with N nodes), not the entire mapping. This is valuable for cache-heavy architectures where cold backends after a scale event would otherwise trigger a cache stampede.
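A hash ring can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the HashRing class, the choice of SHA-256, and the 100 virtual nodes per server are all illustrative choices.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Several virtual points per node smooth out the distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First point at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_point(key),))
        return self._ring[idx % len(self._ring)][1]


keys = [f"client-{i}" for i in range(1000)]
ring = HashRing(["a", "b", "c"])
before = {k: ring.lookup(k) for k in keys}
ring.remove("c")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Only keys that lived on the removed node are reassigned; every other key keeps its backend, which is the property that protects warm caches.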

Random and power of two choices

Pure random selection needs no shared state and evens out across a large pool, although individual backends can still see short bursts. Power of two choices (pick two random backends and send to the less loaded one) gives near-optimal distribution with minimal balancer state. Envoy and many modern proxies use variants of this for large backend pools.
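The selection step is short enough to show in full. The sketch assumes the balancer has some per-backend load figure available (active requests here); the function name and the load dictionary are hypothetical.

```python
import random


def pick_two_choices(load: dict[str, int], rng=random) -> str:
    """Sample two distinct backends at random and return the less loaded one."""
    if not load:
        raise ValueError("at least one backend is required")
    if len(load) == 1:
        return next(iter(load))
    a, b = rng.sample(list(load), 2)
    return a if load[a] <= load[b] else b


active = {"a": 12, "b": 3, "c": 7, "d": 9}
print(pick_two_choices(active))
```

The balancer only compares two numbers per request, so it avoids scanning the whole pool the way least connections does, yet it never sends traffic to the worse of the two sampled backends.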

Health checks and failover

A load balancer is only as good as its picture of backend health. Active health checks periodically probe each server — typically an HTTP GET to /health or a TCP connect on the service port. Backends that fail consecutive probes are removed from rotation; recovered backends are added back after passing a success threshold.
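The consecutive-failure and success-threshold logic can be modelled as a small state machine. This sketch covers only the bookkeeping, not the probing itself; the BackendHealth class and the fall and rise parameter names are illustrative.

```python
class BackendHealth:
    """Track one backend's health from a stream of probe results."""

    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall      # consecutive failures before removal
        self.rise = rise      # consecutive successes before re-admission
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.rise:
                self.healthy = True
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fall:
                self.healthy = False
        return self.healthy


health = BackendHealth(fall=3, rise=2)
probes = [False, False, True, False, False, False, True, True]
states = [health.record(ok) for ok in probes]
print(states)
# [True, True, True, True, True, False, False, True]
```

Two isolated failures do not remove the backend because the success in between resets the counter; only three failures in a row do.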

Design health endpoints carefully. A /health that always returns 200 even when the database is down will keep sending traffic to a broken app. A deep health check verifies critical dependencies (DB ping, cache reachability); run it less frequently, or expose it on a separate /ready endpoint used only for readiness, so Kubernetes liveness probes do not restart pods during brief dependency blips. The split between liveness (process up?) and readiness (can serve traffic?) is standard in orchestrated environments.
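The split between the two checks might look like the following, framework aside. The function names and the dependency checks are hypothetical stand-ins; a real service would wire these into its HTTP handlers for /health and /ready.

```python
from typing import Callable


def liveness() -> int:
    """Shallow check: if this code runs at all, the process is up."""
    return 200


def readiness(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict[str, bool]]:
    """Deep check: 200 only if every critical dependency responds."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A dependency check that raises counts as a failure.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def db_ping() -> bool:       # hypothetical stand-in for a real DB ping
    return True


def cache_ping() -> bool:    # hypothetical stand-in that simulates an outage
    raise ConnectionError("cache unreachable")


print(liveness())
print(readiness({"db": db_ping, "cache": cache_ping}))
```

With the cache down, readiness reports 503 and the balancer stops routing to the pod, while liveness still reports 200 so the orchestrator does not restart a process that is otherwise fine.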

Configure check intervals and failure thresholds to match your SLO. Aggressive checks (every second, fail after one miss) cause flapping during deploys. Lenient checks (30-second interval, fail after five misses) keep dying servers in rotation for minutes. Pair health-check metrics with observability — alert when healthy backend count drops below your replica minimum.
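The arithmetic behind those two extremes is worth making explicit. A rough estimate, assuming the backend dies just after a successful probe and ignoring network jitter, is interval times failure threshold plus the timeout of the final probe. The function name is illustrative.

```python
def worst_case_detection_seconds(interval_s: float, fail_threshold: int,
                                 probe_timeout_s: float = 0.0) -> float:
    """Approximate time a dead backend stays in rotation."""
    return interval_s * fail_threshold + probe_timeout_s


aggressive = worst_case_detection_seconds(interval_s=1, fail_threshold=1)
lenient = worst_case_detection_seconds(interval_s=30, fail_threshold=5)
print(aggressive, lenient)  # 1.0 second versus 150.0 seconds
```

Comparing the result against the error budget in your SLO gives a concrete way to choose the two settings rather than guessing.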

Sticky sessions and stateful applications

Stateless apps — where any replica can handle any request — are the ideal load-balanced target. Session affinity (sticky sessions) pins a user to one backend, usually via a cookie the balancer sets (SERVERID=backend-3) or via consistent client-IP hashing. Sticky sessions are a crutch for apps that store session data in local memory instead of a shared store like Redis.

Sticky sessions complicate deploys and failover: draining a backend means waiting for sticky users to finish or forcibly reassigning them (and losing in-memory state). Prefer externalizing session state to Redis or the database, then run stateless replicas with any distribution algorithm. If you must use stickiness, set a short cookie TTL and plan for backend loss — users will re-authenticate or lose cart state when their pinned server disappears.
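Cookie-based stickiness with a fallback for lost backends can be sketched like this. The SERVERID cookie name follows the example above; the function name, the 15-minute Max-Age, and the pick_new callback are assumptions made for illustration.

```python
from typing import Callable, Optional


def route_sticky(cookies: dict[str, str], healthy: set[str],
                 pick_new: Callable[[], str]) -> tuple[str, Optional[str]]:
    """Return (backend, Set-Cookie header value or None)."""
    pinned = cookies.get("SERVERID")
    if pinned in healthy:
        # Existing pin is still valid: no new cookie needed.
        return pinned, None
    # No pin, or the pinned backend is gone: reassign and re-pin.
    backend = pick_new()
    return backend, f"SERVERID={backend}; Max-Age=900; Path=/; HttpOnly"


healthy_backends = {"backend-1", "backend-2"}
print(route_sticky({"SERVERID": "backend-2"}, healthy_backends, lambda: "backend-1"))
print(route_sticky({"SERVERID": "backend-3"}, healthy_backends, lambda: "backend-1"))
```

The second call shows the failure case: backend-3 has left the pool, so the user is silently moved to backend-1 and any session data held in backend-3's memory is lost.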

TLS termination and connection handling

Most production balancers terminate TLS at the edge: clients speak HTTPS to the balancer, and the balancer speaks plain HTTP (or re-encrypted HTTPS) to backends on a private network. Centralizing certificate management simplifies renewal and lets you enforce TLS versions and cipher suites in one place. See TLS and HTTPS explained for the handshake details.

Connection draining (graceful shutdown) tells the balancer to stop sending new requests to a backend while existing connections finish. Essential during deploys: without draining, in-flight requests get reset the moment a pod terminates. Set a drain timeout long enough for your slowest endpoint — payment webhooks, file uploads — or clients will see 502 errors mid-request.
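The draining sequence can be sketched as a flag plus an in-flight counter. This version is single-threaded for clarity and the DrainingBackend class is hypothetical; a concurrent balancer would need a lock around the counter.

```python
import time


class DrainingBackend:
    """Refuse new requests once draining, then wait for in-flight ones."""

    def __init__(self):
        self.draining = False
        self.in_flight = 0

    def try_begin(self) -> bool:
        if self.draining:
            return False  # balancer should route this request elsewhere
        self.in_flight += 1
        return True

    def end(self) -> None:
        self.in_flight -= 1

    def drain(self, timeout_s: float, poll_s: float = 0.05) -> bool:
        """Return True if all requests finished before the timeout."""
        self.draining = True
        deadline = time.monotonic() + timeout_s
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(poll_s)
        return self.in_flight == 0


backend = DrainingBackend()
backend.try_begin()
backend.end()
print(backend.drain(timeout_s=1.0))  # True: nothing was still in flight
print(backend.try_begin())           # False: no new work after draining
```

If drain returns False, the timeout expired with requests still running, and terminating the backend at that point is what produces the mid-request errors described above.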

Watch for connection limits. Each backend accepts a finite number of concurrent connections; the balancer spreads client connections across replicas but can itself become a bottleneck at extreme scale. HTTP/2 multiplexing reduces connection count but concentrates many streams on one TCP connection: packet loss on that connection stalls every stream it carries, and a balancer that distributes per connection rather than per request pins all of those streams to a single backend, however slow.

DNS, global load balancing, and autoscaling

A regional load balancer handles traffic inside one data center or cloud region. Global load balancing uses DNS or anycast to route users to the nearest healthy region. DNS-based GSLB (Route 53 latency routing, Cloudflare Load Balancing) returns different A records depending on where the query comes from, by geography or measured latency. Because resolvers and clients cache answers for the record's TTL (and sometimes longer), DNS failover is slower than regional balancer failover; plan for minutes, not seconds, when a whole region goes dark. See DNS explained for how TTL affects cutover speed.

Pair load balancers with autoscaling: when average CPU or request rate crosses a threshold, add replicas; the balancer discovers new endpoints via service registry or cloud API. Scale-in policies should respect connection draining and minimum healthy host counts so you never shrink to zero during a traffic lull that suddenly spikes.
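A threshold-driven scaling decision can be sketched with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas equal the current count scaled by the ratio of observed metric to target, rounded up. The function name and the clamping bounds are illustrative, and real autoscalers add tolerances and cooldowns that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Proportional scaling, clamped so the pool never shrinks below the floor."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% average CPU against a 60% target: scale out to 6.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
# During a lull at 10% CPU the raw answer is 1, but the floor keeps 2.
print(desired_replicas(4, 10, 60, min_replicas=2, max_replicas=10))
```

The min_replicas floor is what guards against scaling to zero, or to a single replica, just before traffic returns.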

Common production pitfalls

Health check lies. Returning 200 while dependencies are down keeps broken servers in rotation until users complain.
Ignoring the balancer's own limits. Connection table exhaustion, certificate expiry, and WAF rule misconfiguration make every backend unreachable at once: the servers are healthy, but traffic never reaches them.
Assuming even distribution. Round robin with heterogeneous instance types or request durations creates hot spots. Monitor per-backend CPU and latency, not just aggregate throughput.
Sticky sessions without a plan. Deploys that kill pinned backends without draining cause mysterious logouts and duplicate-submit bugs.
No rate limiting at the edge. A load balancer spreads attack traffic across all replicas instead of absorbing it. Pair with API rate limiting or WAF rules at the balancer layer.
WebSocket and long-poll neglect. Idle timeouts on the load balancer (self-hosted or cloud-managed) close connections the app still considers open. Set the idle timeout above your longest expected quiet period, or send application-level keepalive pings more often than the timeout.

Key takeaways

Load balancers distribute traffic for capacity, availability, and zero-downtime deploys.
L4 routes by IP/port (fast, protocol-agnostic); L7 routes by HTTP content (flexible, higher cost).
Algorithm choice matters: least connections beats round robin for variable-duration work; consistent hashing protects cache locality.
Health checks must reflect real readiness — split liveness from readiness in orchestrated stacks.
Prefer stateless replicas over sticky sessions; terminate TLS at the edge and drain connections during deploys.