LLM reciprocal rank fusion explained
Harbor Support’s internal help-desk bot indexed 18,000 runbooks, API error
codes, and escalation playbooks. Agents asked acronym-heavy questions:
“What is the SSO timeout for the legacy billing shard?” The pipeline
ran BM25 on Elasticsearch and dense vectors on a separate ANN index, then merged
results with a fixed weighted sum (alpha = 0.7 on the dense score) after min-max
normalization. Documents ranked #1 on BM25 dropped out of the fused top-10
when vector scores dominated; semantic paraphrases won on vectors but lost exact
error-code matches. On 200 labeled tickets, recall@10 was 61% and
wrong-article escalations hit 29%.
Engineers replaced score fusion with reciprocal rank fusion (RRF):
each retriever returns a ranked list; RRF assigns a score from rank position only,
sums across lists, and re-sorts. No per-index score calibration. Recall@10 rose to
84%; wrong escalations fell to 9%. This guide covers the RRF formula and the
k constant, multi-list fusion with SPLADE sparse retrieval and dense bi-encoders,
the Harbor Support refactor, a technique decision table versus weighted linear
fusion and cross-encoder reranking, pitfalls, and a production checklist.
Why fusing raw scores breaks hybrid search
Hybrid RAG almost always combines at least two retrieval signals: lexical (BM25, Elasticsearch) and semantic (embedding cosine on a vector database). Optional third lists include learned sparse indexes, ColBERT late interaction, or metadata-filtered subsets.
Each scorer outputs values on incompatible scales. BM25 is unbounded and
query-dependent; cosine similarity sits in [0, 1] or [−1, 1] depending on
model; SPLADE dot products can spike on rare tokens. Min-max or z-score
normalization on a per-query top-100 slice helps but is fragile: one outlier
document compresses everyone else; short queries produce flat normalized scores;
the optimal blend weight alpha shifts by domain and query length.
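The outlier problem can be seen in a few lines. This is a minimal sketch with made-up BM25 scores (the numbers and the `min_max` helper are illustrative assumptions, not values from Harbor's index): one outlier at the top squeezes the normalized gap between the remaining hits, even though their rank order is unchanged.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; returns all zeros if every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical BM25 scores for one query's top-5 hits.
typical = [12.1, 11.4, 10.9, 10.2, 9.8]
# Same hits, but a rare-token match gives the first document an outlier score.
with_outlier = [48.0, 11.4, 10.9, 10.2, 9.8]

norm_typical = min_max(typical)
norm_outlier = min_max(with_outlier)

# Normalized gap between the 2nd and 5th hit in each case.
spread_typical = norm_typical[1] - norm_typical[4]
spread_outlier = norm_outlier[1] - norm_outlier[4]

print(f"spread without outlier: {spread_typical:.3f}")
print(f"spread with outlier:    {spread_outlier:.3f}")
```

The ordering of the five hits is identical in both cases, which is why a rank-based merge is unaffected by this kind of distortion.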
Rank fusion sidesteps calibration. You only need ordered lists.
Reciprocal rank fusion (Cormack, Clarke, and Buettcher, 2009) became the de facto
merge for modern hybrid stacks because it is simple, robust, and needs almost no
tuning beyond a single constant k.
The RRF formula and what k controls
For document $d$ and retrieval lists $L_1, \dots, L_m$, each list $L_i$ assigns $d$ a 1-based rank
$r_i(d)$ (or skips $d$ if absent). RRF score:
$$\mathrm{RRF}(d) = \sum_{i=1}^{m} \frac{1}{k + r_i(d)}$$
Documents appearing in multiple lists accumulate mass. A doc ranked #1 on BM25 and
#3 on vectors beats a doc ranked #2 on only one list. The constant
k dampens how much top ranks dominate:
- k = 60 (common default from the original paper and Elasticsearch hybrid templates) — smooth curve; rank-1 gets 1/61 ≈ 0.0164 per list.
- Lower k (20–40) — sharper preference for top ranks; useful when each list is high precision at rank 1–3.
- Higher k (80–100) — flatter; rank-10 documents still contribute meaningfully; better when lists are long-tailed recall channels.
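To make the effect of k concrete, the sketch below prints the per-list contribution at rank 1 and rank 10 for three values of k, plus the ratio between them. The helper name `rrf_term` is an assumption for illustration; the arithmetic is just the 1 / (k + rank) term from the formula above.

```python
def rrf_term(rank, k=60):
    """Contribution of one list to a document's RRF score (1-based rank)."""
    if rank < 1:
        raise ValueError("rank must be 1-based")
    return 1.0 / (k + rank)

for k in (20, 60, 100):
    top = rrf_term(1, k)
    tenth = rrf_term(10, k)
    # A ratio closer to 1 means rank 10 counts almost as much as rank 1.
    print(f"k={k:>3}  rank1={top:.4f}  rank10={tenth:.4f}  ratio={top / tenth:.2f}")

# A doc at #1 and #3 on two lists outscores a doc at #2 on a single list.
two_lists = rrf_term(1) + rrf_term(3)
one_list = rrf_term(2)
```

The ratio shrinks as k grows, which is the "flatter" behavior described in the list above.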
Optional per-list weights $w_i$ multiply each
term: $w_i / (k + r_i(d))$. Use weights sparingly:
one bad weight reintroduces the tuning problem RRF avoids. Harbor Support
uses equal weights across BM25, dense, and SPLADE; they only down-weight a
stale-archive list with w = 0.5.
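A weighted variant might look like the following sketch. The list names, document IDs, and the `weighted_rrf` function are hypothetical; only the w = 0.5 archive weight and k = 60 come from the text. Lists without an explicit weight default to 1.0.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights=None, k=60):
    """Fuse named ranked lists of doc IDs; weights maps list name -> w_i."""
    weights = weights or {}
    scores = defaultdict(float)
    for name, docs in ranked_lists.items():
        w = weights.get(name, 1.0)  # equal weight unless overridden
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by score descending; break ties by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical ranked lists for one query.
lists = {
    "bm25": ["runbook-7", "runbook-2", "runbook-9"],
    "dense": ["runbook-2", "runbook-4", "runbook-7"],
    "archive": ["old-1", "runbook-9"],
}
fused = weighted_rrf(lists, weights={"archive": 0.5})
for doc_id, score in fused:
    print(f"{doc_id:<10} {score:.5f}")
```

With the archive list halved, a document that appears only there ranks below documents found at similar positions in the primary lists.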
Worked micro-example
Query: SSO timeout billing shard. Doc A (exact playbook): BM25 rank 1,
dense rank 8. Doc B (generic SSO overview): BM25 rank 12, dense rank 1. With
k = 60:
- Doc A: 1/(60+1) + 1/(60+8) = 0.01639 + 0.01471 = 0.03110
- Doc B: 1/(60+12) + 1/(60+1) = 0.01389 + 0.01639 = 0.03028
Doc A wins: both documents hold one rank-1 placement, but Doc A's weaker
rank (dense #8) is still better than Doc B's (BM25 #12). That is exactly
the behavior acronym tickets need. Weighted score fusion with alpha = 0.7
had promoted Doc B because vector similarity on “SSO” overwhelmed the
lexical signal.
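The arithmetic in this micro-example can be checked directly. The sketch below assumes nothing beyond the ranks stated above; `rrf_from_ranks` is an illustrative helper name.

```python
def rrf_from_ranks(ranks, k=60):
    """Sum 1 / (k + rank) over the lists in which a document appears."""
    return sum(1.0 / (k + r) for r in ranks.values())

doc_a = {"bm25": 1, "dense": 8}   # exact playbook
doc_b = {"bm25": 12, "dense": 1}  # generic SSO overview

score_a = rrf_from_ranks(doc_a)
score_b = rrf_from_ranks(doc_b)

print(f"Doc A: {score_a:.5f}")  # Doc A: 0.03110
print(f"Doc B: {score_b:.5f}")  # Doc B: 0.03028
```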
Building a multi-list RRF pipeline
1. Retrieve wide, fuse narrow
Fetch top-50 to top-100 per list before fusion, merge to a fused top-20 or top-30, then use a cross-encoder reranker to cut that to the top-10 for the LLM context window. RRF is cheap CPU math and reranking is the latency bottleneck, so don’t rerank 200 candidates if 30 fused hits suffice.
2. Deduplicate by stable chunk ID
The same passage may appear under different index IDs after re-ingest. Map every hit
to a canonical chunk_id before summing RRF scores; otherwise duplicates
inflate scores artificially.
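One way to do this is sketched below. The alias table `CANONICAL` and the `es:`/`vec:` ID formats are hypothetical stand-ins for whatever mapping your ingest pipeline maintains. Each list is collapsed to canonical IDs first, keeping only the best-ranked copy of a chunk, so ranks are recomputed after deduplication.

```python
from collections import defaultdict

# Hypothetical alias table: index-specific hit IDs -> canonical chunk_id.
CANONICAL = {
    "es:881": "chunk-42", "es:1093": "chunk-42", "vec:17": "chunk-42",
    "es:12": "chunk-7", "vec:3": "chunk-7",
}

def canonicalize(hits):
    """Map hits to canonical IDs, keeping the first (best-ranked) copy of each."""
    seen, out = set(), []
    for hit in hits:
        chunk_id = CANONICAL.get(hit, hit)  # unknown IDs pass through unchanged
        if chunk_id not in seen:
            seen.add(chunk_id)
            out.append(chunk_id)
    return out

def rrf(ranked_lists, k=60):
    """Plain RRF over lists of already-canonical chunk IDs."""
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, chunk_id in enumerate(docs, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return dict(scores)

bm25_hits = ["es:881", "es:1093", "es:12"]  # first two are the same passage
dense_hits = ["vec:17", "vec:3"]
scores = rrf([canonicalize(bm25_hits), canonicalize(dense_hits)])
print(scores)
```

Each chunk now receives at most one term per list, so a re-ingested duplicate cannot add a second term from the same retriever.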
3. Handle missing documents
If $d$ is absent from list $L_i$, skip that term, so the list contributes 0 to
the sum (equivalent to an infinite rank). Some implementations instead assign a
finite penalty rank (e.g. 1000), which gives every absent document a small nonzero
contribution; Harbor found skip-absent cleaner for three-list fusion.
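The two policies differ only in how an absent document is scored for a list. The sketch below is an illustration under assumed names (`rrf_score`, `absent_rank`); it is not taken from any particular library.

```python
def rrf_score(ranks_by_list, list_names, k=60, absent_rank=None):
    """RRF score for one document.

    ranks_by_list: {list_name: 1-based rank} for lists that returned the doc.
    absent_rank:   None means skip absent lists; an int is used as a penalty rank.
    """
    total = 0.0
    for name in list_names:
        rank = ranks_by_list.get(name)
        if rank is None:
            if absent_rank is None:
                continue  # skip-absent: this list contributes nothing
            rank = absent_rank
        total += 1.0 / (k + rank)
    return total

LISTS = ("bm25", "dense", "splade")
doc = {"bm25": 2}  # found only by BM25

skip = rrf_score(doc, LISTS)
penalty = rrf_score(doc, LISTS, absent_rank=1000)
print(f"skip-absent:  {skip:.5f}")
print(f"penalty=1000: {penalty:.5f}")
```

With a penalty rank, every document receives the same small bonus for each list it is missing from, so the relative order changes little while the scores become harder to read when debugging.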
4. Pair with metadata pre-filters
Apply product-line or date filters inside each retriever before ranking, not after fusion. Post-fusion filtering can empty the fused list when filters disagree across indexes.
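A toy illustration of one way post-filtering can empty the result: the top-N cut happens before the filter sees anything. The corpus, scores, and `retrieve` function here are invented for the example and stand in for a real retriever's filter clause.

```python
# Hypothetical corpus: (chunk_id, product_line, score for one query).
CORPUS = [
    ("c1", "payments", 9.0), ("c2", "payments", 8.5), ("c3", "payments", 8.0),
    ("c4", "billing", 4.0), ("c5", "billing", 3.5),
]
PRODUCT = {chunk_id: line for chunk_id, line, _ in CORPUS}

def retrieve(top_n, product_line=None):
    """Toy retriever: optionally filter first, then rank by score and cut to top_n."""
    pool = [row for row in CORPUS if product_line in (None, row[1])]
    pool.sort(key=lambda row: -row[2])
    return [row[0] for row in pool[:top_n]]

# Filter inside the retriever: billing chunks compete only with each other.
pre_filtered = retrieve(top_n=3, product_line="billing")

# Filter after retrieval: the top-3 cut has already discarded every billing chunk.
post_filtered = [c for c in retrieve(top_n=3) if PRODUCT[c] == "billing"]

print("pre-filter: ", pre_filtered)
print("post-filter:", post_filtered)
```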
5. Three-list pattern: BM25 + dense + SPLADE
This stack covers exact tokens (error codes, SKUs), semantic paraphrase, and learned lexical expansion. RRF merges without tuning three alphas. See RAG retrieval fundamentals for where fusion sits in the full ingest-to-answer path.
Harbor Support refactor (worked example)
Before RRF: Elasticsearch BM25 + OpenSearch kNN, min-max normalize top-50 each,
score = 0.3 * norm_bm25 + 0.7 * norm_dense, top-10 to Cohere rerank,
top-4 to GPT-4-class synthesis.
- Recall@10 (human-labeled relevant doc in top 10): 61%
- Wrong-article escalations (agent disagreed with bot citation): 29%
- Acronym / error-code query slice recall@10: 48%
- p95 retrieval + rerank latency: 380 ms
After RRF: added SPLADE sparse list (same corpus), top-80 per list, RRF with
k = 60, fused top-30, same reranker to top-4:
- Recall@10: 61% → 84%
- Wrong escalations: 29% → 9%
- Acronym slice recall@10: 48% → 79%
- p95 latency: 380 ms → 410 ms (+30 ms for SPLADE lookup)
- Engineering time saved: ~2 weeks of per-tenant alpha sweeps abandoned
They tried k in {40, 60, 80} on a 500-query dev set; metrics varied
<1.5 pp — far less sensitive than alpha in weighted fusion (8 pp swing).
Production stayed at k = 60.
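A sweep like this takes only a few lines once per-query ranked lists and labels are available. The sketch below uses a two-query placeholder dev set with invented document IDs; a real sweep would load the labeled queries and use n = 10.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Return doc IDs ordered by RRF score (ties broken by ID)."""
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered]

def recall_at_n(dev_set, k, n=10):
    """Fraction of queries whose labeled relevant doc lands in the fused top-n."""
    hits = sum(1 for q in dev_set if q["relevant"] in rrf(q["lists"], k)[:n])
    return hits / len(dev_set)

# Placeholder dev set: each entry holds per-retriever ranked lists and one label.
DEV_SET = [
    {"lists": [["a", "b", "c"], ["b", "d", "a"]], "relevant": "a"},
    {"lists": [["x", "y", "z"], ["w", "v", "u"]], "relevant": "u"},
]

results = {k: recall_at_n(DEV_SET, k, n=2) for k in (40, 60, 80)}
for k, recall in results.items():
    print(f"k={k}: recall@2 = {recall:.2f}")
```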
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| RRF rank fusion | 2+ heterogeneous retrievers; scores on incompatible scales; want low tuning | Single retriever; need fine-grained score thresholds for filtering |
| Weighted linear fusion | Scores are calibrated and comparable; one dominant signal with a known blend | Mixed BM25 + vectors without stable normalization; multi-tenant corpora |
| CombSUM / CombMNZ | Normalized scores with research baselines; classic IR benchmarks | Production hybrid stacks where normalization drifts by query type |
| Cross-encoder only (no fusion) | Small corpus (<5k chunks); can afford full pairwise scoring | Large indexes; sub-200 ms retrieval SLA |
| Two-stage rerank after RRF | Need precision at ranks 1–4 after broad recall fusion | Extreme latency budgets; reranker API cost per query |
| Learned fusion (LTR) | Massive click logs; dedicated ML team for retraining | Cold start; sparse relevance labels; fast iteration cycles |
RRF is the default merge for hybrid RAG; add ColBERT or a cross-encoder as a second stage, not as a replacement for rank fusion across indexes.
RRF vs reranking: complementary stages
RRF answers: which candidates should enter the expensive rerank pool? Cross-encoders answer: among those candidates, what is the true relevance order? Skipping RRF and reranking only the vector top-20 misses lexical-only hits. Reranking 100+ raw candidates from every list without fusing them first wastes GPU/API budget.
Typical production stack:
1. Parallel retrieve top-N per list (BM25, dense, optional SPLADE).
2. RRF merge to top-30 fused.
3. Cross-encoder rerank to top-8.
4. LLM synthesis with citations.
For latency-sensitive paths, replace step 3 with a lightweight monoT5 or smaller reranker on the fused top-15 only.
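The retrieve, fuse, and rerank stages might be wired together roughly as follows. Everything here is a stand-in: `bm25_search`, `dense_search`, and `rerank` are stubs with invented document IDs, and the token-overlap "reranker" only mimics the interface of a cross-encoder. The depths are shrunk so the example fits on screen.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def bm25_search(query, top_n):
    """Stub lexical retriever returning a fixed ranked list."""
    return ["pb-sso-billing", "kb-timeouts", "kb-sso-overview"][:top_n]

def dense_search(query, top_n):
    """Stub vector retriever returning a fixed ranked list."""
    return ["kb-sso-overview", "pb-sso-billing", "kb-login-faq"][:top_n]

def rerank(query, doc_ids, top_n):
    """Stand-in for a cross-encoder: prefers IDs sharing more query tokens."""
    tokens = set(query.lower().split())
    def overlap(doc_id):
        return len(tokens & set(doc_id.split("-")))
    return sorted(doc_ids, key=lambda d: -overlap(d))[:top_n]

def retrieve_parallel(query, retrievers, top_n):
    """Run every retriever concurrently and return their ranked lists in order."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(fn, query, top_n) for fn in retrievers]
        return [f.result() for f in futures]

def rrf(ranked_lists, k=60):
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered]

def answer_candidates(query, per_list=80, fused_n=30, final_n=8):
    lists = retrieve_parallel(query, [bm25_search, dense_search], per_list)
    fused = rrf(lists)[:fused_n]           # cheap merge, wide recall
    return rerank(query, fused, final_n)   # expensive precision pass on a short list

context = answer_candidates("sso timeout billing shard", fused_n=3, final_n=2)
print(context)
```

Timing each of the three stages separately in a wrapper around `answer_candidates` makes it easier to see whether a latency regression comes from a retriever or from the reranker.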
Common pitfalls
- Fusing before deduplication — duplicate chunk IDs double-count RRF mass and push out diverse results.
- Tiny per-list depth — retrieving top-5 per list before RRF discards documents that rank #6 lexically but would win after fusion.
- Over-weighting one list — extreme $w_i$ values recreate alpha-tuning pain; start with equal weights.
- Skipping rerank after RRF — RRF improves recall; precision at rank 1 still benefits from cross-encoder scoring on a short list.
- Post-fusion metadata filter — can zero out results; filter inside each retriever query instead.
- Ignoring query-type slices — measure acronym, procedural, and paraphrase buckets separately; aggregate recall hides regressions.
- Assuming k is sacred — 60 is a good default but validate on your dev set; the right range is usually 40–80, not 5 vs 500.
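The query-type pitfall is easy to guard against with a per-slice breakdown. The sketch below uses invented evaluation rows (query type plus whether the relevant doc reached the top 10) to show how a healthy aggregate can hide a weak slice.

```python
from collections import defaultdict

def recall_by_slice(results):
    """results: iterable of (query_type, hit) pairs; returns recall per type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for query_type, hit in results:
        totals[query_type] += 1
        hits[query_type] += int(hit)
    return {t: hits[t] / totals[t] for t in totals}

# Hypothetical per-query outcomes from an evaluation run.
results = [
    ("acronym", False), ("acronym", True),
    ("procedural", True), ("procedural", True),
    ("paraphrase", True), ("paraphrase", True),
]

overall = sum(hit for _, hit in results) / len(results)
per_slice = recall_by_slice(results)

print(f"overall recall: {overall:.2f}")
for query_type, recall in per_slice.items():
    print(f"  {query_type:<11} {recall:.2f}")
```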
Production checklist
- Run 2+ retrievers in parallel with top-50–100 depth each.
- Map all hits to canonical chunk_id before RRF summation.
- Start with k = 60 and equal list weights; sweep k on labeled dev set.
- Fuse to top-20–30; rerank with cross-encoder to final top-k for the LLM.
- Log per-list ranks for fused winners (debug which signal contributed).
- Track recall@k and MRR before/after fusion on stratified query types.
- Measure p95 latency for retrieve, fuse, and rerank stages separately.
- Apply metadata filters inside each retriever, not after fusion.
- Version index builds so all lists cover the same corpus snapshot.
- Re-evaluate when swapping embedding models — RRF is stable but reranker training may need refresh.
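For the logging item in the checklist, one option is to have the fusion step return each document's per-list ranks alongside its score. The function name `fuse_with_provenance` and the sample lists are assumptions for illustration.

```python
from collections import defaultdict

def fuse_with_provenance(ranked_lists, k=60):
    """Return (doc_id, score, {list_name: rank}) tuples, best RRF score first."""
    ranks = defaultdict(dict)
    for name, docs in ranked_lists.items():
        for rank, doc_id in enumerate(docs, start=1):
            ranks[doc_id].setdefault(name, rank)  # keep the best rank per list
    fused = [
        (doc_id, sum(1.0 / (k + r) for r in per_list.values()), per_list)
        for doc_id, per_list in ranks.items()
    ]
    return sorted(fused, key=lambda row: (-row[1], row[0]))

# Hypothetical ranked lists for one query.
lists = {
    "bm25": ["pb-sso-billing", "kb-timeouts"],
    "dense": ["kb-sso-overview", "pb-sso-billing"],
    "splade": ["pb-sso-billing", "kb-sso-overview"],
}

for doc_id, score, per_list in fuse_with_provenance(lists)[:2]:
    # In production this line would go to a structured log rather than stdout.
    print(f"{doc_id}  score={score:.4f}  ranks={per_list}")
```

Seeing which lists placed a fused winner, and at what rank, shows at a glance whether a bad citation came from the lexical, dense, or sparse channel.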
Key takeaways
- RRF merges ranked lists without normalizing incompatible scores — ideal for BM25 + vector hybrid RAG.
- Score is Σ 1/(k + rank); documents strong on multiple lists rise to the top naturally.
- Default k = 60 is robust; tune on a dev set but expect smaller gains than alpha in weighted fusion.
- Harbor Support lifted recall@10 from 61% to 84% and cut wrong escalations from 29% to 9%.
- Pair RRF (recall) with cross-encoder reranking (precision) — they solve different problems.
Related reading
- SPLADE sparse retrieval explained — learned lexical expansion as a third RRF list
- LLM reranking explained — cross-encoders after fusion
- RAG retrieval augmented generation explained — full retrieve-to-answer pipeline
- ColBERT late interaction retrieval explained — token-level scoring as an alternative merge stage