Explainer · 7 June 2026

How Bloom filters and probabilistic data structures work

A Bloom filter is a compact data structure that answers one question about set membership: might this item be in the set? It can say definitely not or probably yes — never the reverse. That asymmetry sounds wrong until you realize how often production systems only need a cheap first pass before hitting disk, a remote index, or a Solana RPC call.

The membership problem at scale

Imagine you run a URL shortener with five billion links. A user requests /abc123. You could store every slug in a hash table — fast lookups, but gigabytes of RAM and painful resharding as traffic grows. Or you could query a database on every click — correct, but slow and expensive at peak.

Many workloads have a useful middle path: reject unknown keys instantly and only pay the full lookup cost when the filter says "maybe." False positives waste work; false negatives would serve 404s for real links. Bloom filters forbid false negatives by design — if an element was inserted, every bit it touched stays set until the filter is rebuilt.

The same pattern appears in database indexing (skip pages that cannot contain a key), rate limiters (track seen request IDs in bounded memory), CDN edge caches (avoid origin round-trips for cold objects), and blockchain light clients (compact proofs about which accounts changed). The structure is fifty years old; the trade-off is more relevant than ever.

How a classic Bloom filter is built

Burton Bloom described the structure in 1970. You start with a bit array of length m, all zeros. To insert an element, run it through k independent hash functions, each mapping to an index in [0, m), and set those k bits to 1.

To query, hash the candidate the same way. If any of the k positions is still 0, the element was never inserted — guaranteed. If all k bits are 1, the element might have been inserted — or those bits were set by other elements' hashes colliding in just the right pattern.

There is no deletion in the standard form: clearing a bit could break queries for other elements that shared it. Variants like counting Bloom filters replace bits with small counters (increment on insert, decrement on delete) at the cost of more memory. Cuckoo filters store fingerprints in a cuckoo hash table and support deletion with different false-positive math.

Hash quality matters. Cryptographic hashes (SHA-256) are overkill and slow; fast non-cryptographic hashes (MurmurHash, xxHash) are typical. The k functions are often derived from one seed hash by double-hashing: hi(x) = (h1(x) + i * h2(x)) mod m, which reduces CPU without sacrificing independence in practice.

False positives: tuning m, n, and k

Let n be the number of inserted elements. As the array fills, collisions rise. The approximate false positive rate for large m is:

p ≈ (1 - e-kn/m)k

For fixed n and target p, textbook formulas pick:

  • m ≈ -n ln(p) / (ln 2)2 bits of storage
  • k ≈ (m/n) ln 2 hash functions — often around 7–10 for 1% FP rate

Example: one million URLs at 1% false positive rate needs roughly 1.2 MB of bits plus hash CPU — far less than storing a million 64-bit pointers. At 0.1% FP, memory roughly doubles. Operators choose p by costing a false positive (wasted DB read, extra RPC) against RAM and latency.

Unlike a Merkle tree, which proves exact inclusion with logarithmic proof size, a Bloom filter cannot prove membership — only suggest it. Merkle structures anchor trust in tamper-evident hashes; Bloom filters optimize for space at the expense of certainty. They solve different problems and sometimes compose: a light client might use a Bloom summary to decide which Merkle proofs to fetch.

Where production systems use them

  • Database engines — Cassandra and PostgreSQL extensions use Bloom filters per SSTable to skip disk reads when a key cannot exist in a file. Wrong skip would be a false negative; the filter prevents that.
  • Web caches and CDNs — Negative caches ("this object does not exist on origin") implemented as Bloom filters at the edge reduce thundering herds without storing every 404 forever.
  • Network routers and intrusion detection — Compact blocklists for banned IPs or known-bad packet signatures at line rate.
  • Spell checkers and malware scanners — Dictionary membership in kilobytes instead of megabytes.
  • Distributed joins — MapReduce-style pipelines ship Bloom filters between stages so reducers skip keys no mapper could have emitted.
  • Blockchain light clients — Summarize which accounts or transactions a client might need to verify without downloading full state — false positives mean extra fetches, not wrong balances.

On Solana, validators and indexers juggle millions of accounts. While the chain's primary correctness model is full replication or Merkle-verified slices, adjacent infrastructure (RPC caches, gossip deduplication, indexer bloom partitions) often uses probabilistic filters to shed load before hitting storage. Understanding how accounts are addressed and stored clarifies why "maybe present" hints help but never replace on-chain verification.

Bloom filters vs hash tables vs sketches

A hash table gives exact answers with expected O(1) time but stores keys or pointers — memory proportional to n. A Bloom filter uses O(1) bits per element (with constant chosen by p) and O(k) time per operation, but cannot list members, count distinct items accurately, or return the stored value.

Related probabilistic sketches extend the idea:

  • HyperLogLog — estimates cardinality (how many unique users visited) in ~12 KB with ~2% error.
  • Count-Min Sketch — estimates frequency of heavy hitters in streams (which IP sent the most packets).
  • Quotient filters — cache-friendly alternative to Bloom with better locality and deletion support.

Pick the structure by the question you need answered. Membership with no false negatives → Bloom (or cuckoo filter if you need delete). Distinct count → HyperLogLog. Top-k frequency → Count-Min. Exact key-value → hash table or B-tree index.

Operational pitfalls

  • Cannot shrink or enumerate — Rebuild the filter when n grows past the design capacity or FP rate drifts unacceptable.
  • No deletion in standard Bloom — Use counting variants or rotate filters (maintain two filters, query both during migration).
  • Adversarial keys — If attackers choose inputs to maximize collisions, use keyed hashing (SipHash with secret seed) so they cannot craft false positives offline.
  • Serialization and endianness — Filters shipped between nodes must agree on bit order and hash seeds; version the on-wire format.
  • Assuming "probably yes" means yes — Always confirm with authoritative storage. Payment verification, for example, must still check the chain — a cache hint is not settlement.

When to reach for a Bloom filter

Reach for one when all of these hold:

  1. False negatives are unacceptable; occasional false positives are cheap to absorb.
  2. Memory or bandwidth for an exact set is prohibitive.
  3. You only need membership, not iteration or values.
  4. Insertions are mostly append-only, or you have a deletion-capable variant.

Skip them when you need exact counts, range queries, or cryptographic proofs of inclusion — use indexes, hash tables, or Merkle structures instead. The Bloom filter's genius is admitting imperfection in the right layer so the expensive layer runs less often.

Related on Solana Garden: Merkle trees and content-addressable storage, Database indexing explained, API rate limiting explained, Solana account model, Explainers hub.