Explainer · 7 June 2026
How Merkle trees and content-addressable storage work
You have a million records — transactions, files, database rows — and you need one short fingerprint everyone can agree on. A Merkle tree hashes them into a binary tree whose root hash summarizes the entire set. Change one leaf and the root changes. That property powers content-addressable storage, where the address of data is its hash: Git commits, IPFS files, and the account state roots behind chains like Solana all lean on the same idea.
Hashes as fixed-size fingerprints
A cryptographic hash function (SHA-256 is the common choice) maps input of any length to a fixed output — 256 bits for SHA-256. The mapping is one-way (you cannot reverse it to recover the input) and collision-resistant (finding two different inputs with the same hash is computationally infeasible).
Two properties matter for Merkle trees:
- Determinism — the same bytes always produce the same hash.
- Avalanche effect — flipping one bit in the input changes roughly half the output bits. Tampering is obvious.
A hash is not encryption. Publishing SHA-256("hello") does not
hide that you hashed "hello" if the word is guessable. Hashes are for
integrity and naming, not secrecy.
Building a Merkle tree
Start with leaf nodes: hash each item in your dataset individually. If you have eight transactions, you get eight leaf hashes.
Pair adjacent leaves and hash their concatenation to form parent nodes. Repeat until one node remains — the Merkle root. With eight leaves you need three levels above them; with a million leaves you need about twenty levels because each level halves the count.
Implementation details that matter in production:
- Odd counts — if a level has an odd number of nodes, the last hash is typically duplicated and paired with itself. Protocols must specify this rule; mismatches break compatibility.
- Leaf preprocessing — Bitcoin hashes transaction data twice; Ethereum prefixes leaves with a type byte. The exact serialization before hashing is part of the protocol.
- Order sensitivity — swapping two leaves changes the root. Sorted Merkle trees (where leaves are ordered by key) enable different proof types but cost reordering on insert.
The tree is rarely stored in full. Most systems store leaves plus maybe one level of internal nodes, recomputing paths on demand. The root alone — 32 bytes — is what gets signed, embedded in block headers, or pinned as "current state."
Inclusion proofs: trust without downloading everything
A Merkle proof (inclusion proof) demonstrates that one leaf belongs to a tree with a known root. The prover supplies the leaf hash plus one sibling hash per level — log2(n) hashes for n leaves.
The verifier recomputes upward: hash the leaf with its sibling to get a parent, hash that parent with the next sibling, and so on. If the final value equals the published root, the leaf is included. If any step fails, either the leaf was not in the set or someone lied about a sibling.
This is why light clients on blockchains work. Your phone does not download every transaction in history; it downloads a block header containing a Merkle root and asks a full node for a proof that your payment appears in that block. The math is cheap; bandwidth is not.
Exclusion proofs are harder. Proving "this key is not in the set" requires sorted trees or more advanced structures (Merkle Patricia tries, sparse Merkle trees). Ethereum's state trie is a Patricia trie whose nodes are hashed Merkle-style; Solana's account model stores state differently but still relies on hashed commitments for snapshots and verification paths.
Content-addressable storage: name equals content
In a traditional filesystem, you name a file report.pdf and the
operating system maps that string to disk sectors. Rename the file and the
content address changes even though the bytes did not.
In content-addressable storage (CAS), the identifier is
derived from the content: cid = hash(bytes). Store the bytes under
that hash; retrieve by hash. Identical content deduplicates automatically —
two uploads of the same image share one stored blob.
CAS does not imply a Merkle tree by itself. A flat hash table keyed by content hash is CAS. Merkle trees add structured aggregation: a directory of files can have a Merkle root over all file hashes, so you verify the whole directory with one root while still fetching individual files by their content addresses.
Where you already use this
Git
Git objects — blobs, trees, commits — are content-addressed by SHA-1 (legacy) or SHA-256 (newer repos). A tree object lists filenames pointing to blob hashes; a commit points to a tree hash plus parent commit hashes. The commit hash commits to the entire snapshot. That is a Merkle DAG, not a strict binary tree, but the trust model is identical: change one file and every descendant commit hash changes.
IPFS
IPFS chunks files, hashes each chunk, and builds a Merkle DAG (often balance-shaped for large files). The CID (Content Identifier) you share is the root hash plus codec metadata. Peers fetch only the chunks they need; integrity is verified at every step because a wrong chunk produces the wrong root.
Blockchains and Solana
Bitcoin puts the Merkle root of transactions in each block header. Ethereum puts state, transaction, and receipt roots in each block. Solana validators process slots full of signatures and account updates; snapshots and proofs lean on hashed state commitments so nodes can sync without replaying every historical byte naively.
When you verify a Solana payment, you are checking signatures and account balances against what validators agreed was canonical — downstream of the same commitment machinery that Merkle trees were invented to compress.
Certificate transparency and software supply chains
Certificate Transparency logs append TLS certificates to a Merkle tree and
publish signed tree heads. Browsers can demand proofs that a rogue cert was
logged. Container image registries and package mirrors increasingly publish
digest lists — content hashes — so docker pull at a specific SHA
is CAS in practice.
Tampering, second preimages, and trust anchors
Merkle roots detect tampering: alter a leaf and the root will not match unless you recompute the entire tree and convince everyone to accept a new root. They do not tell you who published the root. Trust always anchors somewhere — a block producer's signature, a Git tag signed with GPG, an HTTPS certificate on the IPFS gateway.
Second-preimage attacks (find different content with the same hash) would break CAS. SHA-256 remains safe at practical sizes; MD5 and SHA-1 are broken for collision attacks and should not be used for new integrity systems.
Reorg risk on blockchains is a social layer on top: the Merkle root in block N is mathematically consistent, but block N+1 might replace N on a fork. Finality rules (see slots and epochs on Solana) tell you when to stop waiting for a deeper root.
Trade-offs and alternatives
Merkle trees excel when you need incremental proofs over a large static or slowly changing set. They are less ideal when:
- Updates are constant and random — rebuilding subtrees on every insert can be expensive; B-trees or LSM-trees win for OLTP databases (see database indexing for query-path trade-offs).
- You need range queries — plain Merkle trees do not support "all keys between A and B" without extra structure.
- Proof size must be tiny on tiny sets — for three items, sending all three might be smaller than a proof.
Verkle trees (vector commitments instead of pairwise hashing) shrink proof sizes at the cost of heavier cryptography — an active research path for Ethereum scaling. Accumulator schemes and zero-knowledge proofs can prove membership without revealing which leaf, at even higher computational cost.
Practical checklist
When you design or audit a system that advertises Merkle roots or content hashes:
- Specify exact byte serialization before hashing — off-by-one metadata breaks cross-client roots.
- Document odd-leaf pairing and hash concatenation order (left||right vs right||left).
- Separate integrity (hash matches content) from authenticity (who signed the root).
- Pin dependencies by digest in CI —
sha256:abc…notlatest. - For on-chain apps, understand which root your client trusts and at what confirmation depth.
Related on Solana Garden: Solana account model, verify Solana payments, slots and epochs, Explainers hub.