Guide
Raft consensus algorithm explained
Running a database or configuration service on one machine is easy until that machine dies. Replicating state across nodes introduces a harder problem: how do independent servers agree on the same ordered history of operations when networks delay messages, nodes crash, and clocks disagree? Raft is a consensus algorithm designed to be understandable while providing the same safety guarantees as Paxos. It powers etcd (Kubernetes’ brain), HashiCorp Consul, CockroachDB’s replication layer, and many other systems. This guide walks through leader election, log replication, commit rules, split-vote recovery, membership changes, a three-node worked example, a protocol decision table, common pitfalls, and a production checklist — with links to broader distributed consistency and CAP trade-offs for context.
What Raft solves
Consensus means every non-faulty node eventually applies the same sequence of commands to a deterministic state machine. Clients can then read consistent configuration, locks, or metadata even if leaders fail. Raft decomposes the problem into three mostly independent sub-problems:
- Leader election — pick exactly one leader at a time.
- Log replication — the leader accepts client commands and replicates them to followers.
- Safety — committed entries are never lost or overwritten on a majority of nodes.
Unlike ad-hoc primary-backup failover, Raft’s rules prevent two leaders from committing conflicting writes in the same term — the classic split-brain scenario that corrupts replicated logs.
Node roles: follower, candidate, leader
At any moment each node is in one of three states:
- Follower — passive; responds to RPCs from leaders and candidates; does not initiate elections.
- Candidate — temporarily campaigning for leadership after an election timeout.
- Leader — handles all client writes; replicates log entries via
AppendEntriesRPCs.
Normal operation has one leader and N−1 followers. Followers start
election timers when they stop hearing from a leader. If a timer fires, the follower
increments its term (a logical clock), votes for itself, and
broadcasts RequestVote RPCs. A candidate that receives votes from a
majority of the cluster becomes leader and sends heartbeats to reset
follower timers. If another server wins first, the candidate returns to follower.
Election timeouts are randomized (typically 150–300 ms in etcd) so simultaneous candidates rarely split votes forever. Higher terms always supersede lower ones — stale leaders automatically step down when they discover a newer term.
Log replication and commit index
Client commands append to the leader’s replicated log as
entries with term numbers and command payloads. The leader sends
AppendEntries RPCs containing the previous log index/term (for
consistency checking), new entries, and its commit index — the
highest entry known to be safely applied.
Followers accept entries only if they match the prior log prefix; mismatches trigger log backtracking until alignment. Once an entry is replicated on a majority of nodes in the leader’s current term, the leader marks it committed and applies it to its state machine. Followers learn the commit index through subsequent heartbeats and apply committed entries in order.
This ordering guarantee is why Raft-backed stores like etcd expose linearizable reads when you use the right API flags — the log is a single authoritative timeline, similar in spirit to write-ahead log replication in database replication, but with explicit consensus rather than assuming a fixed primary.
Safety properties
Raft’s safety proofs rest on a few intuitive rules:
- Election safety — at most one leader per term.
- Leader completeness — a leader for term T contains all entries committed in terms < T.
- Log matching — if two logs share the same index and term, their prefixes are identical.
- State machine safety — if a node applies an entry at index i, no other node applies a different entry at i.
The voting restriction enforces leader completeness: a candidate can win only if its log is at least as up-to-date as every voter’s log (compared first by last term, then by last index). A lagging node cannot steal leadership and truncate committed history.
Split votes, partitions, and quorum math
A network partition can isolate minorities from majorities. Raft requires a majority quorum to elect leaders and commit entries — in a five-node cluster, three nodes must agree. A partitioned minority cannot elect a leader or commit new writes, so clients see unavailability rather than divergence. That is the CP side of the CAP theorem: consistency and partition tolerance at the cost of write availability in the minority partition.
Split votes occur when no candidate reaches a majority in one term. Raft resolves this by incrementing the term on the next election round with fresh randomized timeouts — eventually one candidate wins. Operators should run odd-sized clusters (3, 5, 7) so majority and full tolerance numbers are clear: a 3-node cluster tolerates 1 failure; 5-node tolerates 2.
Membership changes (joint consensus)
Adding or removing nodes naively can create two majorities that overlap partially — enough to elect two leaders. Raft’s joint consensus phase requires agreement from both the old and new configurations before switching fully. In practice, managed systems (etcd, Consul) expose safe member-add/remove APIs that implement this protocol; operators should never manually copy data directories and expect the cluster to self-heal.
For large fleets, learner/non-voting nodes catch up without participating in quorum — useful for read replicas or cross-region observers before promotion to voter.
Worked example: three-node etcd cluster
Imagine nodes A, B, C running etcd for Kubernetes API storage. All start as followers; A’s election timer fires first. A becomes candidate (term 2), votes for itself, receives yes from B — majority reached — and becomes leader.
A client sends PUT /registry/deploy. A appends entry (index 42, term 2)
locally, streams it to B and C. B acknowledges; C is slow but eventually acks. Two
of three replicas hold index 42 — A commits and responds success. C backfills later
if it was behind.
A crashes. B and C time out; B wins term 3 election because its log matches A’s committed prefix. B continues serving. When A returns, it sees term 3 heartbeats and reverts to follower — no split brain. If the network splits {A} from {B,C}, A cannot commit (no majority); {B,C} continue — clients routed to the majority partition stay consistent.
Protocol decision table
| Approach | Best for | Trade-offs |
|---|---|---|
| Raft | Config stores, coordination (etcd, Consul), understandable ops | Leader bottleneck; majority quorum required for writes |
| Multi-Paxos | High-throughput replicated logs where Raft variants insufficient | Harder to implement and reason about; fewer off-the-shelf tools |
| Primary-backup (manual failover) | Simple RDBMS replication with human promotion | Split-brain risk without fencing; slower failover |
| Leaderless (Dynamo-style quorum) | Multi-region KV with tunable R/W quorums | Conflict resolution, eventual consistency paths |
| Blockchain BFT (PBFT, Tendermint) | Adversarial validators, public chains | Higher latency and complexity; overkill for private control planes |
Common pitfalls
- Even-numbered clusters — 4 nodes tolerate only 1 failure like 3 nodes but cost more; prefer odd counts.
- Clock skew assumptions — Raft does not need synchronized clocks for safety, but operators still confuse wall-clock TTL with log terms.
- Disk latency ignored — fsync-heavy followers slow commits; SSD and dedicated WAL matter.
- Restoring stale snapshots — reintroducing an old node without
--force-new-clusteror proper join can disrupt quorum. - Cross-region voters — WAN latency on every write hurts; use observers locally and voters in one region.
- Treating reads as free — linearizable reads may require leader confirmation or read-index RPCs.
Production checklist
- Size cluster for fault tolerance target (3 for single failure, 5 for two).
- Place voters in independent failure domains (racks, AZs) with stable low-latency links.
- Monitor leader changes, election rates, commit latency, and disk fsync times.
- Automate backups of consistent snapshots; test restore on staging quarterly.
- Use TLS and auth between peers; Raft does not encrypt by default.
- Document safe member-add/remove runbooks; never clone data dirs blindly.
- Cap client request size; large values bloat logs and slow replication.
- Pair with health probes in orchestrators so traffic drains before maintenance.
Key takeaways
- Raft elects a single leader per term and replicates an ordered log to followers.
- Majority quorums gate elections and commits — minorities fail safe by stopping writes.
- Log matching and voting rules prevent committed entries from being erased.
- Joint consensus makes membership changes safe; use managed APIs.
- etcd, Consul, and similar tools embed Raft so platforms like Kubernetes get reliable coordination without implementing Paxos by hand.
Related reading
- Distributed systems consistency explained — CAP, eventual vs strong models, quorum reads
- CAP theorem and consistency models explained — partition tolerance and linearizability
- Database replication explained — primary-replica, WAL shipping, failover patterns
- Microservices architecture explained — why clusters need distributed coordination