Explainer · 7 June 2026
How HyperLogLog and cardinality estimation work
Your analytics dashboard wants one number: how many unique visitors hit the site today? Storing every session ID in a hash set gives an exact answer but costs memory proportional to distinct count — millions of users means hundreds of megabytes per counter. HyperLogLog (HLL) trades exactness for a fixed-size sketch: roughly 12 KB of state estimates cardinality within about 2% relative error, no matter whether you saw ten thousand or ten billion unique IDs.
The distinct-count problem
Cardinality estimation answers: how many different elements have I seen in
this stream? It is not the same as counting total events. A user who
refreshes fifty times contributes one to distinct visitors and fifty to page
views. CDN logs, ad impression trackers, database optimizers estimating
COUNT(DISTINCT user_id), and
observability
pipelines all need distinct counts at scale.
Naive approaches fail in predictable ways:
- Exact hash set — O(n) memory, must store every key. Works for thousands; painful for billions.
- Sort and dedupe — O(n log n) time and still O(n) space. Fine for batch jobs, not streaming edges.
- Bitmap of all possible IDs — Only viable when the universe is tiny (e.g. IPv4 without aggregation).
HyperLogLog belongs to a family of probabilistic sketches — compact summaries that sacrifice precision for bounded memory, like Bloom filters for membership. Where Bloom answers "maybe in set?", HLL answers "approximately how many unique items?"
Intuition: counting rare hash patterns
Hash every element with a uniform function (e.g. SHA-256 truncated to 64 bits).
Convert the hash to binary. Count leading zeros before the first
1 bit. A hash starting 0001011... has three leading zeros.
With a good hash, each extra leading zero is roughly half as likely. Seeing a
pattern with k leading zeros suggests you have observed on the order of
2k distinct items — the rarest pattern you have witnessed
sets a lower bound on how crowded the set must be. Flajolet and Martin formalized
this in 1985; HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007) refined it
into a production-ready algorithm with dramatically lower variance.
A single hash is noisy. HyperLogLog splits the hash: the first
p bits pick one of m = 2p registers;
the remaining bits supply the leading-zero count (plus one, capped). Each
register stores the maximum leading-zero count seen for its bucket. The final
estimate combines all registers via a harmonic mean, with bias correction for
small and large cardinalities.
Algorithm walkthrough
Standard parameters: p = 14 gives m = 16,384 registers
and about 1.6% standard error. Memory is m bytes if each register
holds one byte (enough for counts up to 255 leading zeros).
Add element x:
- Compute 64-bit hash
h(x). - Let
j= toppbits ofh(register index). - Let
w= remaining64 - pbits. - Compute
ρ(w)= number of leading zeros inw, plus 1. - Set
register[j] = max(register[j], ρ(w)).
Estimate cardinality: Let M be the harmonic mean of
2register[j] across all registers, adjusted by bias
correction constant αm. The raw estimate is
E = αm · m² / M. Apply small-range correction (linear
counting) when E is below 2.5m, and large-range
correction when hash space is nearly saturated.
HyperLogLog++ (Google, 2013) adds 64-bit hash preference,
improved bias correction tables for sparse registers, and better handling near
n ≈ 0 and n → 264. Redis, PostgreSQL
extensions, and most modern libraries implement HLL++ or close variants.
Why the error is ~1.04 / √m
Each register is an independent noisy observation of how "full" the hash space
is. Averaging m registers drives variance down like a sample mean:
standard error scales as 1/√m. With m = 16,384,
relative error is about 1.04 / 128 ≈ 0.81% in ideal conditions;
practical implementations quote ~1.6% for p = 14 after hash and
correction overhead.
You tune precision by choosing p:
p = 12→ 4 KB, ~2.3% error — edge counters, IoT.p = 14→ 16 KB, ~1.6% error — Redis default, analytics.p = 16→ 64 KB, ~0.8% error — when billing depends on the number.
Doubling precision quadruples memory. For "how many humans visited?" 2% error is usually fine. For fraud detection on exact duplicate device IDs, use an exact structure or a different sketch (Count-Min for frequencies, not cardinality).
Mergeable sketches and distributed counting
A killer property: two HyperLogLog sketches over disjoint streams merge by
per-register max. Shard A counts East Coast users, shard B counts
West Coast — union is pointwise max of registers, no re-scan of raw events.
That makes HLL natural for:
- MapReduce / Flink — partial aggregates per partition, one merge at reduce.
- Redis —
PFADD key elementupdates sketch;PFCOUNT keyreads estimate;PFMERGE dest src1 src2combines shards. - CDN log rollups — each edge node maintains a sketch; central collector merges hourly without shipping every IP.
- Database approximations — optimizers sample distinct counts for join cardinality estimates (though exact query plans still need fresh stats).
Sketches are not invertible. You cannot list members, delete one user, or check membership. If you need those operations, pair HLL with a Bloom filter or maintain a sampled exact set for debugging.
HLL vs exact sets vs Bloom filters
| Structure | Question answered | Memory | Mergeable |
|---|---|---|---|
| Hash set | Exact distinct count + membership | O(n) keys | Set union (expensive) |
| HyperLogLog | Approximate distinct count | O(2p) bytes fixed | Register max (cheap) |
| Bloom filter | Probably in set? (no count) | O(n) bits tuned by FP rate | Bitwise OR (increases FP) |
Use HyperLogLog when you only need how many unique, error of 1–3% is acceptable, memory must stay bounded, and sketches must merge across machines. Use exact sets when cardinality is small, legal/compliance requires precision, or you must enumerate members for export.
Operational pitfalls
- Correlated inputs — HLL assumes hashes look random. If IDs are sequential integers hashed poorly, bias spikes. Always use a strong hash (MurmurHash3, xxHash, SHA-256 truncated).
- Double-counting after merge overlap — Merging sketches assumes streams were disjoint. If the same user appears in both shards before merge, union is correct; if you merge overlapping windows without deduping at the event level, you overcount.
- Cardinality of intersections —
|A ∩ B|is notPFCOUNT(A) + PFCOUNT(B) - PFCOUNT(merge(A,B))with usable accuracy. Use dedicated intersection sketches (e.g. inclusion-exclusion with multiple HLLs, or Theta sketches in Druid). - Time decay — HLL has no TTL per element. Rolling "unique users last 24h" needs windowed sketches, key rotation, or a streaming structure like Count-Min with time buckets.
- Reporting exact numbers to executives — Display "≈ 4.2M (±2%)" or round sensibly. Presenting
4,198,337implies false precision.
When to reach for HyperLogLog
Reach for HLL when all of these hold:
- You need distinct count, not membership or per-key frequency.
- Approximate answer within ~1–3% is acceptable for the decision.
- Cardinality may reach millions or billions but sketch memory must stay fixed.
- Partial counts from shards must merge cheaply without a central dedupe store.
Skip it when you need exact billing, GDPR export of all identifiers, or sets small enough that a hash table fits comfortably in RAM. The algorithm's insight is statistical: rare hash patterns witness crowd size, and averaging many witnesses tames variance — a pattern that appears wherever exact distinct counting would drown in memory.
Related on Solana Garden: Bloom filters and probabilistic data structures, Hash tables and hash maps, Observability: metrics, logs, and tracing, Database query execution plans, Explainers hub.