LLM reciprocal rank fusion explained

Harbor Support’s internal help-desk bot indexed 18,000 runbooks, API error codes, and escalation playbooks. Agents asked acronym-heavy questions: “What is the SSO timeout for the legacy billing shard?” The pipeline ran BM25 on Elasticsearch and dense vectors on a separate ANN index, then merged results with a fixed alpha = 0.7 weighted sum after min-max normalization. Lexical hits ranked #1 on BM25 but dropped out of the fused top-10 when vector scores dominated; semantic paraphrases won on vectors but lost exact error-code matches. On 200 labeled tickets, recall@10 was 61% and wrong-article escalations hit 29%.

Engineers replaced score fusion with reciprocal rank fusion (RRF): each retriever returns a ranked list; RRF assigns a score from rank position only, sums across lists, and re-sorts. No per-index score calibration. Recall@10 rose to 84%; wrong escalations fell to 9%. This guide covers the RRF formula and k constant, multi-list fusion with SPLADE sparse retrieval and dense bi-encoders, the Harbor Support refactor, a technique decision table versus weighted linear fusion and cross-encoder reranking, pitfalls, and a production checklist.

Why fusing raw scores breaks hybrid search

Hybrid RAG combines at least two retrieval signals: lexical (BM25, for example on Elasticsearch) and semantic (embedding cosine similarity in a vector database). Optional third lists include learned sparse indexes, ColBERT late interaction, or metadata-filtered subsets.

Each scorer outputs values on incompatible scales. BM25 is unbounded and query-dependent; cosine similarity sits in [0, 1] or [−1, 1] depending on the model; SPLADE dot products can spike on rare tokens. Min-max or z-score normalization on a per-query top-100 slice helps but is fragile: one outlier document compresses every other score into a narrow band; short queries produce flat normalized scores; and the optimal blend weight alpha shifts by domain and query length.
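To make the outlier problem concrete, here is a minimal sketch that min-max normalizes one list of BM25 scores with and without a single outlier. The scores are invented for illustration and are not Harbor Support data.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a flat list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Illustrative BM25 scores for one query (invented numbers).
typical = [12.0, 11.0, 10.0, 9.0, 8.0]
with_outlier = [40.0, 12.0, 11.0, 10.0, 9.0, 8.0]

norm_typical = min_max(typical)
norm_outlier = min_max(with_outlier)

# The document scoring 12.0 normalizes to 1.0 in the first list,
# but only to 0.125 once a single 40.0 outlier is present.
print(norm_typical[0])   # 1.0
print(norm_outlier[1])   # 0.125
```

The same raw score of 12.0 carries very different weight into a linear blend depending on what else the query happened to retrieve, which is why a fixed alpha is hard to keep stable.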

Rank fusion sidesteps calibration. You only need ordered lists. Reciprocal rank fusion (Cormack, Clarke, and Buettcher, 2009) became the de facto merge for modern hybrid stacks because it is simple, robust, and needs almost no tuning beyond a single constant k.

The RRF formula and what k controls

For document d and retrieval lists L₁ … L_m, each list assigns d a 1-based rank r_i(d) (or skips d if absent). RRF score:

RRF(d) = Σ_i=1..m 1 / (k + r_i(d))

Documents appearing in multiple lists accumulate mass. A doc ranked #1 on BM25 and #3 on vectors beats a doc ranked #2 on only one list. The constant k dampens how much top ranks dominate:

k = 60 (common default from the original paper and Elasticsearch hybrid templates) — smooth curve; rank-1 gets 1/61 ≈ 0.0164 per list.
Lower k (20–40) — sharper preference for top ranks; useful when each list is high precision at rank 1–3.
Higher k (80–100) — flatter; rank-10 documents still contribute meaningfully; better when lists are long-tailed recall channels.
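The effect of k on the curve can be checked with a few lines of arithmetic. This sketch prints the per-list contribution at rank 1 and rank 10 for three values of k; the helper name rrf_term is an illustrative choice, not part of any library.

```python
def rrf_term(rank, k=60):
    """Contribution of one list to a document at a 1-based rank."""
    return 1.0 / (k + rank)

for k in (20, 60, 100):
    top = rrf_term(1, k)
    tenth = rrf_term(10, k)
    # A ratio above 1 means rank 1 outweighs rank 10; it shrinks as k grows.
    print(f"k={k:>3}  rank1={top:.4f}  rank10={tenth:.4f}  ratio={top / tenth:.2f}")
```

The rank-1 to rank-10 ratio falls from roughly 1.43 at k = 20 to about 1.09 at k = 100, which is the flattening described above.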

Optional per-list weights w_i multiply each term: w_i / (k + r_i(d)). Use weights sparingly — one bad weight reintroduces the tuning problem RRF avoids. Harbor Support uses equal weights across BM25, dense, and SPLADE; they only down-weight a stale-archive list with w = 0.5.
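A minimal sketch of weighted RRF follows, assuming each list contains a given document ID at most once. The function name rrf_fuse, the list names, and the document IDs are placeholders chosen for illustration; the 0.5 weight on an archive list mirrors the Harbor setup described above.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, weights=None):
    """Fuse ranked lists of doc IDs with (optionally weighted) RRF.

    ranked_lists: dict mapping list name -> doc IDs, best first.
    weights: optional dict mapping list name -> w_i (default 1.0).
    Returns (doc_id, score) pairs sorted by descending score.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, docs in ranked_lists.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by score, then doc ID, so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

lists = {
    "bm25": ["a", "b", "c"],
    "dense": ["c", "a", "d"],
    "archive": ["e", "a"],  # hypothetical stale-archive list
}
fused = rrf_fuse(lists, k=60, weights={"archive": 0.5})
print([doc for doc, _ in fused])  # ['a', 'c', 'b', 'd', 'e']
```

Document e tops the archive list but still lands last, because the halved weight keeps a stale source from outranking hits confirmed by the primary lists.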

Worked micro-example

Query: SSO timeout billing shard. Doc A (exact playbook): BM25 rank 1, dense rank 8. Doc B (generic SSO overview): BM25 rank 12, dense rank 1. With k = 60:

Doc A: 1/(60+1) + 1/(60+8) = 0.01639 + 0.01471 = 0.03110
Doc B: 1/(60+12) + 1/(60+1) = 0.01389 + 0.01639 = 0.03028

Doc A wins despite weak dense rank because it is strong lexically — exactly the behavior acronym tickets need. Weighted score fusion with alpha = 0.7 had promoted Doc B because vector similarity on “SSO” overwhelmed the lexical signal.
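The arithmetic above can be reproduced directly. This sketch only recomputes the two RRF scores from the ranks given in the example; it does not model the earlier weighted score fusion.

```python
K = 60

# 1-based ranks from the example above: (BM25 rank, dense rank).
ranks = {"Doc A": (1, 8), "Doc B": (12, 1)}

scores = {doc: sum(1.0 / (K + r) for r in rs) for doc, rs in ranks.items()}
for doc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{doc}: {score:.5f}")
# Doc A: 0.03110
# Doc B: 0.03028
```

The margin is small (about 0.0008), so a third list that agrees with either retriever would decide the order.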

Building a multi-list RRF pipeline

1. Retrieve wide, fuse narrow

Fetch top-50 to top-100 per list before fusion; merge to a fused top-20 or top-30, then use a cross-encoder reranker to cut that to a top-10 for the LLM context window. RRF is cheap CPU math, while reranking is the latency bottleneck: don’t rerank 200 candidates if 30 fused hits suffice.

2. Deduplicate by stable chunk ID

The same passage may appear under different index IDs after re-ingest. Map every hit to a canonical chunk_id before summing RRF scores; otherwise duplicates inflate scores artificially.
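One way to do this mapping is sketched below, assuming a lookup table from raw index IDs to canonical chunk IDs is available. The id_map contents and ID formats are hypothetical.

```python
def canonicalize(hits, id_map):
    """Map raw index IDs to canonical chunk IDs, keeping each chunk's best rank.

    hits: raw index IDs, best first.
    id_map: dict raw index ID -> canonical chunk_id (hypothetical lookup table).
    """
    seen = set()
    deduped = []
    for raw_id in hits:
        chunk_id = id_map.get(raw_id, raw_id)  # unmapped IDs pass through
        if chunk_id not in seen:
            seen.add(chunk_id)
            deduped.append(chunk_id)
    return deduped

id_map = {
    "es-101": "chunk-7",
    "es-245": "chunk-7",   # same passage, re-ingested under a new ID
    "vec-9": "chunk-7",
    "es-300": "chunk-12",
}
bm25_hits = ["es-101", "es-300", "es-245"]
print(canonicalize(bm25_hits, id_map))  # ['chunk-7', 'chunk-12']
```

Run this on every list before fusion so that each canonical chunk contributes at most one term per list.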

3. Handle missing documents

If d is absent from list L_i, skip that term; this is the same as giving it a contribution of 0 (rank → ∞). Some implementations instead assign absent docs a finite penalty rank (e.g. 1000), which adds a small nonzero term for every list a document misses; Harbor found skip-absent cleaner for three-list fusion.
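The two policies for absent documents can be compared with a small sketch. The function name rrf_score and the absent_rank parameter are illustrative, not taken from a specific library.

```python
def rrf_score(ranks, k=60, absent_rank=None):
    """RRF score for one document.

    ranks: one entry per list; a 1-based rank, or None if the doc is absent.
    absent_rank: None skips absent lists; an int substitutes a penalty rank.
    """
    total = 0.0
    for r in ranks:
        if r is None:
            if absent_rank is None:
                continue  # skip-absent: the list contributes nothing
            r = absent_rank
        total += 1.0 / (k + r)
    return total

in_two_lists = [3, None, 5]  # present in lists 1 and 3, absent from list 2
skip = rrf_score(in_two_lists)
penalized = rrf_score(in_two_lists, absent_rank=1000)
print(f"{skip:.5f} {penalized:.5f}")  # 0.03126 0.03220
```

With a penalty rank of 1000 and k = 60, each missed list adds 1/1060 to the score, so the two policies differ only by a small constant per missing list.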

4. Pair with metadata pre-filters

Apply product-line or date filters inside each retriever before ranking, not after fusion. Post-fusion filtering can shrink or even empty the fused list, because documents that fail the filter have already taken slots in each retriever's top-N.
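A toy comparison shows how the two filter placements diverge. The corpus, the product_line values, and the retrieve function are invented stand-ins for a real retriever query.

```python
# Toy corpus: (chunk_id, product_line, score), already sorted by retriever score.
# All values are invented for illustration.
corpus = [
    ("c1", "payments", 9.1), ("c2", "payments", 8.7), ("c3", "payments", 8.2),
    ("c4", "billing", 7.9), ("c5", "payments", 7.5), ("c6", "billing", 7.0),
]

def retrieve(top_n, product_line=None):
    """Stand-in retriever: the optional filter is applied before the top-n cut."""
    pool = [c for c in corpus if product_line in (None, c[1])]
    return [c[0] for c in pool[:top_n]]

# Pre-filter: the retriever ranks only billing chunks.
pre = retrieve(top_n=3, product_line="billing")

# Post-filter: take the unfiltered top-3, then drop non-billing chunks.
billing_ids = {c[0] for c in corpus if c[1] == "billing"}
post = [cid for cid in retrieve(top_n=3) if cid in billing_ids]

print(pre)   # ['c4', 'c6']
print(post)  # []
```

The post-filter path returns nothing because all three retrieved slots went to chunks from another product line.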

5. Three-list pattern: BM25 + dense + SPLADE

This stack covers exact tokens (error codes, SKUs), semantic paraphrase, and learned lexical expansion. RRF merges without tuning three alphas. See RAG retrieval fundamentals for where fusion sits in the full ingest-to-answer path.

Harbor Support refactor (worked example)

Before RRF: Elasticsearch BM25 + OpenSearch kNN, min-max normalize top-50 each, score = 0.3 * norm_bm25 + 0.7 * norm_dense, top-10 to Cohere rerank, top-4 to GPT-4-class synthesis.

Recall@10 (human-labeled relevant doc in top 10): 61%
Wrong-article escalations (agent disagreed with bot citation): 29%
Acronym / error-code query slice recall@10: 48%
p95 retrieval + rerank latency: 380 ms

After RRF: added SPLADE sparse list (same corpus), top-80 per list, RRF with k = 60, fused top-30, same reranker to top-4:

Recall@10: 61% → 84%
Wrong escalations: 29% → 9%
Acronym slice recall@10: 48% → 79%
p95 latency: 380 ms → 410 ms (+30 ms for SPLADE lookup)
Engineering time saved: ~2 weeks of per-tenant alpha sweeps abandoned

They tried k in {40, 60, 80} on a 500-query dev set; metrics varied <1.5 pp — far less sensitive than alpha in weighted fusion (8 pp swing). Production stayed at k = 60.
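A k sweep of this kind needs little more than a fusion function and a recall metric. The sketch below uses two toy queries with invented document IDs; it illustrates the harness only and does not reproduce Harbor's numbers.

```python
def rrf_fuse(ranked_lists, k=60):
    """Return doc IDs sorted by equal-weight RRF score (ties broken by ID)."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

def recall_at(n, queries, k):
    """Fraction of queries whose labeled relevant doc is in the fused top-n."""
    hits = sum(q["relevant"] in rrf_fuse(q["lists"], k)[:n] for q in queries)
    return hits / len(queries)

# Two toy labeled queries; a real dev set would hold hundreds.
dev_set = [
    {"relevant": "d3", "lists": [["d1", "d2", "d3"], ["d3", "d4", "d1"]]},
    {"relevant": "d9", "lists": [["d5", "d6", "d7"], ["d8", "d5", "d9"]]},
]

for k in (40, 60, 80):
    print(k, recall_at(2, dev_set, k))
```

On this toy set recall@2 is 0.5 for all three values of k. On a real dev set, report the same sweep per query-type slice as well as in aggregate.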

Technique decision table

Approach	Best when	Weak when
RRF rank fusion	2+ heterogeneous retrievers; scores on incompatible scales; want low tuning	Single retriever; need fine-grained score thresholds for filtering
Weighted linear fusion	Scores are calibrated and comparable; one dominant signal with a known blend	Mixed BM25 + vectors without stable normalization; multi-tenant corpora
CombSUM / CombMNZ	Normalized scores with research baselines; classic IR benchmarks	Production hybrid stacks where normalization drifts by query type
Cross-encoder only (no fusion)	Small corpus (<5k chunks); can afford full pairwise scoring	Large indexes; sub-200 ms retrieval SLA
Two-stage rerank after RRF	Need precision at ranks 1–4 after broad recall fusion	Extreme latency budgets; reranker API cost per query
Learned fusion (LTR)	Massive click logs; dedicated ML team for retraining	Cold start; sparse relevance labels; fast iteration cycles

RRF is the default merge for hybrid RAG; add ColBERT or a cross-encoder as a second stage, not as a replacement for rank fusion across indexes.

RRF vs reranking: complementary stages

RRF answers: which candidates should enter the expensive rerank pool? Cross-encoders answer: among those candidates, what is the true relevance order? Skipping RRF and reranking only the vector top-20 misses lexical-only hits. Reranking 100+ raw candidates without first fusing them down to a short list wastes GPU/API budget.

Typical production stack:

1. Parallel retrieve top-N per list (BM25, dense, optional SPLADE).
2. RRF merge to top-30 fused.
3. Cross-encoder rerank to top-8.
4. LLM synthesis with citations.

For latency-sensitive paths, replace step 3 with a lightweight monoT5 or smaller reranker on the fused top-15 only.
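The retrieval half of this stack can be sketched as follows, using Python's standard ThreadPoolExecutor for the parallel fetch. The functions bm25_search, dense_search, and rerank are hypothetical stubs with invented results; real code would call the search indexes and a cross-encoder in their place.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real retriever and reranker calls.
def bm25_search(query, top_n):
    return ["c1", "c2", "c3", "c4"][:top_n]

def dense_search(query, top_n):
    return ["c3", "c5", "c1", "c6"][:top_n]

def rerank(query, chunk_ids, top_n):
    # Placeholder: a real cross-encoder would rescore (query, chunk) pairs.
    return chunk_ids[:top_n]

def rrf_fuse(ranked_lists, k=60):
    """Return doc IDs sorted by equal-weight RRF score (ties broken by ID)."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

def retrieve_context(query, per_list=100, fused_n=30, final_n=8):
    retrievers = (bm25_search, dense_search)
    # Run the retrievers in parallel and collect results in a fixed order.
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(r, query, per_list) for r in retrievers]
        ranked_lists = [f.result() for f in futures]
    # Fuse, keep a short candidate list, then rerank only that list.
    fused = rrf_fuse(ranked_lists)[:fused_n]
    return rerank(query, fused, final_n)

print(retrieve_context("SSO timeout billing shard", final_n=3))
# ['c1', 'c3', 'c2']
```

Threads suit this step because retriever calls are typically network-bound; the fusion itself is a few dictionary updates per hit.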

Common pitfalls

Fusing before deduplication — duplicate chunk IDs double-count RRF mass and push out diverse results.
Tiny per-list depth — retrieving top-5 per list before RRF discards documents that rank #6 lexically but would win after fusion.
Over-weighting one list — extreme w_i values recreate alpha-tuning pain; start with equal weights.
Skipping rerank after RRF — RRF improves recall; precision at rank 1 still benefits from cross-encoder scoring on a short list.
Post-fusion metadata filter — can zero out results; filter inside each retriever query instead.
Ignoring query-type slices — measure acronym, procedural, and paraphrase buckets separately; aggregate recall hides regressions.
Assuming k is sacred — 60 is a good default but validate on your dev set; the right range is usually 40–80, not 5 vs 500.
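Slice-level measurement needs only a grouping step on top of the usual recall metric. In this sketch the evaluation rows are invented, and the slice labels follow the buckets named in the list above.

```python
from collections import defaultdict

def recall_by_slice(results, n=10):
    """results: dicts with 'slice', 'relevant' (doc ID), and 'fused' (ranked IDs)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        hits[r["slice"]] += r["relevant"] in r["fused"][:n]
    return {s: hits[s] / totals[s] for s in totals}

# Invented evaluation rows, one per labeled query.
results = [
    {"slice": "acronym", "relevant": "a", "fused": ["a", "b"]},
    {"slice": "acronym", "relevant": "c", "fused": ["d", "e"]},
    {"slice": "paraphrase", "relevant": "f", "fused": ["g", "f"]},
]
print(recall_by_slice(results))  # {'acronym': 0.5, 'paraphrase': 1.0}
```

Aggregate recall here would be about 0.67, which hides the fact that half of the acronym queries miss.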

Production checklist

Run 2+ retrievers in parallel with top-50–100 depth each.
Map all hits to canonical chunk_id before RRF summation.
Start with k = 60 and equal list weights; sweep k on labeled dev set.
Fuse to top-20–30; rerank with cross-encoder to final top-k for the LLM.
Log per-list ranks for fused winners (debug which signal contributed).
Track recall@k and MRR before/after fusion on stratified query types.
Measure p95 latency for retrieve, fuse, and rerank stages separately.
Apply metadata filters inside each retriever, not after fusion.
Version index builds so all lists cover the same corpus snapshot.
Re-evaluate when swapping embedding models — RRF is stable but reranker training may need refresh.
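Two of the checklist items, logging per-list ranks and tracking MRR, can be sketched together. The function names explain_fusion and mrr, the list names, and the document IDs are illustrative assumptions.

```python
def explain_fusion(ranked_lists, k=60, top_n=3):
    """Return fused winners with the 1-based rank each list gave them.

    ranked_lists: dict list name -> doc IDs, best first.
    """
    ranks = {}  # doc_id -> {list name: rank}
    for name, docs in ranked_lists.items():
        for rank, doc_id in enumerate(docs, start=1):
            ranks.setdefault(doc_id, {})[name] = rank
    scored = {d: sum(1.0 / (k + r) for r in per.values()) for d, per in ranks.items()}
    winners = sorted(scored, key=lambda d: (-scored[d], d))[:top_n]
    return [{"doc": d, "score": round(scored[d], 5), "ranks": ranks[d]} for d in winners]

def mrr(relevant_ids, fused_lists):
    """Mean reciprocal rank of each query's labeled doc (0 if missing)."""
    total = 0.0
    for rel, fused in zip(relevant_ids, fused_lists):
        total += 1.0 / (fused.index(rel) + 1) if rel in fused else 0.0
    return total / len(relevant_ids)

lists = {"bm25": ["a", "b"], "dense": ["b", "c"], "splade": ["b", "a"]}
for row in explain_fusion(lists):
    print(row)  # one log row per fused winner, with its per-list ranks

print(mrr(["b", "z"], [["a", "b"], ["x", "y"]]))  # 0.25
```

Keeping the per-list ranks next to each fused winner makes it possible to tell, for any bad citation, which retriever pulled the document in.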

Key takeaways

RRF merges ranked lists without normalizing incompatible scores — ideal for BM25 + vector hybrid RAG.
Score is Σ 1/(k + rank); documents strong on multiple lists rise to the top naturally.
Default k = 60 is robust; tune it on a dev set, but expect much smaller metric swings than tuning alpha produces in weighted fusion.
Harbor Support lifted recall@10 from 61% to 84% and cut wrong escalations from 29% to 9%.
Pair RRF (recall) with cross-encoder reranking (precision) — they solve different problems.