Explainer · 7 June 2026

How the Raft consensus algorithm works

Run three copies of a database and a network glitch can leave each copy with a different value for the same key. Consensus is the family of protocols that keeps replicas agreeing on one ordered history of updates. Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed to be teachable: one elected leader, a replicated log of commands, and a majority quorum before anything is considered committed. That simplicity is why etcd, Consul, CockroachDB's early layers, and countless homework assignments use Raft instead of wrestling with Paxos on a whiteboard.

The problem consensus solves

A replicated state machine starts from the same initial state on every node. Clients send commands — "set balance to 42", "register hostname api.example" — and every correct replica must apply them in the same order so they reach the same final state. If replica A commits x=1 then x=2 while replica B commits x=2 then x=1, you no longer have a database; you have a coin flip.

Consensus guarantees safety (never commit conflicting histories) and liveness (eventually commit client commands) as long as a majority of nodes can talk to each other. That majority rule is the same quorum intuition behind the CAP theorem: during a partition, only one side can hold a quorum and make progress; the minority side stops accepting writes rather than fork the log.

Raft roles: leader, follower, candidate

At any moment each node is a follower, a candidate, or the leader. Followers are passive: they respond to RPCs from the leader and candidates, and they redirect client writes to the leader. The leader is the only node that accepts new client commands and replicates them to followers.

Leaders are elected through terms — monotonically increasing epoch numbers stored on every node. If followers stop receiving heartbeats from the leader within a randomized election timeout (typically 150–300 ms, tuned per deployment), they increment their term, vote for themselves, and become candidates. A candidate wins if it receives votes from a majority of the cluster in the same term. Ties split by randomized timeouts so another candidate usually wins the next round.

Each follower grants at most one vote per term and only votes for candidates whose log is at least as up-to-date as its own. That last rule is critical: it prevents a lagging node that just came back online from winning an election and overwriting committed entries with stale data.

Log replication: the heart of Raft

Client commands append to the leader's log as entries with a term number and index. The leader sends AppendEntries RPCs to followers. Each RPC carries the previous log index and term; followers reject mismatches and the leader decrements its next index until it finds a prefix both sides agree on — then it ships the missing suffix.

An entry is committed once the leader knows a majority of nodes have stored it. The leader then applies committed entries to its state machine and tells followers which index is safe to apply. Committed entries are permanent: Raft's safety proof shows no future leader can overwrite them, because any future leader must have been voted in by a majority that overlaps the prior commit quorum — and overlapping majorities share at least one node with the highest committed index.

The log is conceptually similar to a hash-linked chain of state transitions, though Raft entries are numbered tuples rather than Merkle nodes. Snapshots compact old log prefixes: the leader ships a snapshot plus a lastIncludedIndex so slow followers can catch up without replaying millions of entries.

What clients actually experience

A write hits the leader, waits for majority replication, then returns success. Latency is roughly one round-trip to the leader plus one to the slowest follower in the quorum — often 5–20 ms on a LAN, 50–200 ms cross-region. Reads can be served from the leader without extra round-trips (strong consistency) or from followers with monotonic reads if you track the leader's commit index — a trade-off familiar from load-balanced read replicas that may lag primaries.

If the leader dies mid-write, the client may get a timeout even though the entry later commits — your application must use idempotent command IDs or compare-and-set semantics, the same discipline as event-driven consumers that retry after ambiguous failures.

During leader election the cluster is briefly unavailable for writes — usually well under a second with tuned timeouts. That is why control planes (Kubernetes etcd, service discovery) colocate Raft peers in one region for low latency, while analytics replicas sit elsewhere with asynchronous replication.

Raft vs Paxos vs blockchain BFT

Paxos solves the same problem but describes a single-slot agreement primitive; building a full replicated log from multi-Paxos is notoriously subtle. Raft packages leader election + log replication into one narrative engineers can implement from the paper without hidden lemmas.

Byzantine fault tolerance (BFT) tolerates malicious nodes that lie about votes. Raft assumes crash faults only — nodes fail stop or return to honest behavior. Public blockchains need BFT or proof-of-stake economics because anyone can run a validator. Solana's Tower BFT and similar protocols add slashing and stake-weighted votes on top of partially synchronous assumptions; they optimize for thousands of validators and probabilistic finality, not a five-node etcd cluster.

When you verify an on-chain payment, you are trusting blockchain consensus — not Raft — to make a transfer irreversible after enough confirmations. Application databases behind your API still often use Raft (or cloud-managed equivalents) for their own metadata and session stores.

Membership changes and joint consensus

Adding or removing a node changes quorum size. Naively flipping from three to five nodes can create a window where two different majorities exist across old and new configs — a classic way to fork a log. Raft's joint consensus phase requires majorities in both the old and new configuration before dropping the old one. Operators see this as a two-step membership change in etcd: add new peer as learner, promote, remove old peer.

Learners (non-voting replicas) are a common production extension: they catch up without counting toward commit quorum, useful for cross-region observers and backup nodes that should not slow writes.

Failure modes operators should rehearse

  • Split brain avoided by quorum — two partitions cannot both commit; the minority stops serving writes. Monitor "no leader" alerts.
  • Clock skew — Raft does not need tight NTP for correctness, but wildly skewed logs complicate debugging; keep reasonable time sync.
  • Disk stalls — a follower with a full disk stops acknowledging entries; the leader retries forever. Disk pressure is a consensus outage.
  • Repeated elections — if election timeouts are too aggressive relative to network RTT, nodes flap candidate/leader and throughput collapses. Tune timeouts to measured p99 RTT.
  • Snapshot lag — a node offline for days may need a full snapshot transfer; plan bandwidth and snapshot frequency.

Practical checklist

  • Run an odd number of voters (3 or 5) across failure domains; avoid 2-node "HA" pairs — they cannot survive one loss with quorum.
  • Keep leader and voters in the same region for write latency; use learners for DR copies.
  • Expose metrics: current term, leader id, commit index, election count, append latency, snapshot duration.
  • Test leader kill, partition, and slow follower in staging monthly.
  • Document whether your product reads from followers and what staleness is acceptable.

Related on Solana Garden: CAP theorem explained, Merkle trees explained, Event-driven architecture explained, Verify Solana payments, Explainers hub.