Explainer · 7 June 2026

How HyperLogLog and cardinality estimation work

Your analytics dashboard wants one number: how many unique visitors hit the site today? Storing every session ID in a hash set gives an exact answer but costs memory proportional to distinct count — millions of users means hundreds of megabytes per counter. HyperLogLog (HLL) trades exactness for a fixed-size sketch: roughly 12 KB of state estimates cardinality within about 2% relative error, no matter whether you saw ten thousand or ten billion unique IDs.

The distinct-count problem

Cardinality estimation answers: how many different elements have I seen in this stream? It is not the same as counting total events. A user who refreshes fifty times contributes one to distinct visitors and fifty to page views. CDN logs, ad impression trackers, database optimizers estimating COUNT(DISTINCT user_id), and observability pipelines all need distinct counts at scale.

Naive approaches fail in predictable ways:

Exact hash set — O(n) memory, must store every key. Works for thousands; painful for billions.
Sort and dedupe — O(n log n) time and still O(n) space. Fine for batch jobs, not streaming edges.
Bitmap of all possible IDs — Only viable when the universe is tiny (e.g. IPv4 without aggregation).

HyperLogLog belongs to a family of probabilistic sketches — compact summaries that sacrifice precision for bounded memory, like Bloom filters for membership. Where Bloom answers "maybe in set?", HLL answers "approximately how many unique items?"

Intuition: counting rare hash patterns

Hash every element with a uniform function (e.g. SHA-256 truncated to 64 bits). Convert the hash to binary. Count leading zeros before the first 1 bit. A hash starting 0001011... has three leading zeros.

With a good hash, each extra leading zero is roughly half as likely. Seeing a pattern with k leading zeros suggests you have observed on the order of 2^k distinct items — the rarest pattern you have witnessed sets a lower bound on how crowded the set must be. Flajolet and Martin formalized this in 1985; HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007) refined it into a production-ready algorithm with dramatically lower variance.

A single hash is noisy. HyperLogLog splits the hash: the first p bits pick one of m = 2^p registers; the remaining bits supply the leading-zero count (plus one, capped). Each register stores the maximum leading-zero count seen for its bucket. The final estimate combines all registers via a harmonic mean, with bias correction for small and large cardinalities.

Algorithm walkthrough

Standard parameters: p = 14 gives m = 16,384 registers and about 1.6% standard error. Memory is m bytes if each register holds one byte (enough for counts up to 255 leading zeros).

Add element x:

Compute 64-bit hash h(x).
Let j = top p bits of h (register index).
Let w = remaining 64 - p bits.
Compute ρ(w) = number of leading zeros in w, plus 1.
Set register[j] = max(register[j], ρ(w)).

Estimate cardinality: Let M be the harmonic mean of 2^register[j] across all registers, adjusted by bias correction constant α_m. The raw estimate is E = α_m · m² / M. Apply small-range correction (linear counting) when E is below 2.5m, and large-range correction when hash space is nearly saturated.

HyperLogLog++ (Google, 2013) adds 64-bit hash preference, improved bias correction tables for sparse registers, and better handling near n ≈ 0 and n → 2⁶⁴. Redis, PostgreSQL extensions, and most modern libraries implement HLL++ or close variants.

Why the error is ~1.04 / √m

Each register is an independent noisy observation of how "full" the hash space is. Averaging m registers drives variance down like a sample mean: standard error scales as 1/√m. With m = 16,384, relative error is about 1.04 / 128 ≈ 0.81% in ideal conditions; practical implementations quote ~1.6% for p = 14 after hash and correction overhead.

You tune precision by choosing p:

p = 12 → 4 KB, ~2.3% error — edge counters, IoT.
p = 14 → 16 KB, ~1.6% error — Redis default, analytics.
p = 16 → 64 KB, ~0.8% error — when billing depends on the number.

Doubling precision quadruples memory. For "how many humans visited?" 2% error is usually fine. For fraud detection on exact duplicate device IDs, use an exact structure or a different sketch (Count-Min for frequencies, not cardinality).

Mergeable sketches and distributed counting

A killer property: two HyperLogLog sketches over disjoint streams merge by per-register max. Shard A counts East Coast users, shard B counts West Coast — union is pointwise max of registers, no re-scan of raw events.

That makes HLL natural for:

MapReduce / Flink — partial aggregates per partition, one merge at reduce.
Redis — PFADD key element updates sketch; PFCOUNT key reads estimate; PFMERGE dest src1 src2 combines shards.
CDN log rollups — each edge node maintains a sketch; central collector merges hourly without shipping every IP.
Database approximations — optimizers sample distinct counts for join cardinality estimates (though exact query plans still need fresh stats).

Sketches are not invertible. You cannot list members, delete one user, or check membership. If you need those operations, pair HLL with a Bloom filter or maintain a sampled exact set for debugging.

HLL vs exact sets vs Bloom filters

Structure	Question answered	Memory	Mergeable
Hash set	Exact distinct count + membership	O(n) keys	Set union (expensive)
HyperLogLog	Approximate distinct count	O(2^p) bytes fixed	Register max (cheap)
Bloom filter	Probably in set? (no count)	O(n) bits tuned by FP rate	Bitwise OR (increases FP)

Use HyperLogLog when you only need how many unique, error of 1–3% is acceptable, memory must stay bounded, and sketches must merge across machines. Use exact sets when cardinality is small, legal/compliance requires precision, or you must enumerate members for export.

Operational pitfalls

Correlated inputs — HLL assumes hashes look random. If IDs are sequential integers hashed poorly, bias spikes. Always use a strong hash (MurmurHash3, xxHash, SHA-256 truncated).
Double-counting after merge overlap — Merging sketches assumes streams were disjoint. If the same user appears in both shards before merge, union is correct; if you merge overlapping windows without deduping at the event level, you overcount.
Cardinality of intersections — |A ∩ B| is not PFCOUNT(A) + PFCOUNT(B) - PFCOUNT(merge(A,B)) with usable accuracy. Use dedicated intersection sketches (e.g. inclusion-exclusion with multiple HLLs, or Theta sketches in Druid).
Time decay — HLL has no TTL per element. Rolling "unique users last 24h" needs windowed sketches, key rotation, or a streaming structure like Count-Min with time buckets.
Reporting exact numbers to executives — Display "≈ 4.2M (±2%)" or round sensibly. Presenting 4,198,337 implies false precision.

When to reach for HyperLogLog

Reach for HLL when all of these hold:

You need distinct count, not membership or per-key frequency.
Approximate answer within ~1–3% is acceptable for the decision.
Cardinality may reach millions or billions but sketch memory must stay fixed.
Partial counts from shards must merge cheaply without a central dedupe store.

Skip it when you need exact billing, GDPR export of all identifiers, or sets small enough that a hash table fits comfortably in RAM. The algorithm's insight is statistical: rare hash patterns witness crowd size, and averaging many witnesses tames variance — a pattern that appears wherever exact distinct counting would drown in memory.