Guide
LLM reranking explained
Reranking is the second stage of a modern retrieval pipeline: after fast first-pass search returns dozens of candidate chunks, a more accurate scorer re-orders them so the best evidence reaches the LLM context window. Without reranking, RAG systems often inject vaguely related passages — the model then hallucinates details that were never in your corpus. Reranking is one of the highest-leverage quality upgrades in knowledge-grounded AI, and it sits between embedding search and retrieval evaluation in a production stack. This guide explains bi-encoder vs cross-encoder scoring, popular reranker models, fusion with keyword search, latency budgets, and when reranking pays off versus simply retrieving more chunks.
Why first-stage retrieval is not enough
First-stage retrieval, usually approximate nearest-neighbor (ANN) search over vector embeddings, optimizes for speed at scale. An embedding model encodes queries and documents independently (the bi-encoder architecture); the index then compares vectors by cosine similarity and returns top-k hits in milliseconds, even across millions of chunks.
That independence is also the weakness. Bi-encoders never see query and document tokens together at scoring time, so subtle mismatches slip through: a chunk about "Apple's supply chain" can rank high for "how do I bake apple pie" because both embeddings are pulled toward the shared word "apple." Negation and fine-grained qualifiers are handled poorly, since each text is compressed into a single vector before the two are compared. Metadata filters help, but they cannot fix semantic near-misses inside the candidate set.
Reranking closes the gap by scoring each query–document pair jointly: the model reads both texts together and outputs a relevance score. You pay more compute per candidate, so you only rerank the top 20–100 results from stage one, then pass the best 3–8 chunks to the generator.
Two-stage retrieval architecture
A typical production RAG retrieval stack looks like this:
- Indexing — chunk documents, embed with a bi-encoder, store vectors + metadata in a vector index.
- First retrieval — embed the user query, ANN search for top-N (often 50–200), optionally hybrid with BM25 keyword scores.
- Reranking — cross-encoder scores each (query, chunk) pair; sort by relevance score.
- Context assembly — take top-k after rerank (often 4–10), deduplicate overlapping chunks, inject into the prompt.
- Generation — LLM answers using only the selected context.
The key tuning knobs are N (how many candidates to rerank) and k (how many survive into the prompt). Increasing N raises the chance that the right chunk is in the candidate pool, which the reranker needs in order to surface it, but rerank latency grows roughly linearly with N. Increasing k gives the generator more evidence but consumes context window and can confuse the model with redundant passages.
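The effect of N and k can be sketched with a toy two-stage search. This is a minimal illustration, not a production pipeline: `two_stage_search`, `token_hits`, and `exact_phrase` are hypothetical names, and the two scorers are crude stand-ins for an embedding search and a cross-encoder.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,    # stand-in for embedding / ANN similarity
    precise_score: Scorer,  # stand-in for a cross-encoder
    n: int = 50,            # candidates handed to the reranker
    k: int = 5,             # chunks that survive into the prompt
) -> List[Tuple[str, float]]:
    # Stage 1: rank the whole corpus with the cheap scorer, keep the top N.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n]
    # Stage 2: run the expensive scorer on those N candidates only, keep the top k.
    rescored = [(d, precise_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]


def token_hits(query: str, doc: str) -> float:
    """Toy first-stage score: how many document tokens also occur in the query."""
    query_words = set(query.lower().split())
    return float(sum(1 for tok in doc.lower().split() if tok in query_words))


def exact_phrase(query: str, doc: str) -> float:
    """Toy rerank score: 1.0 if the query appears verbatim in the document."""
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "password policy and password rules, reset yearly",
    "reset router then reset wifi password",
    "to reset password open settings",
]
query = "reset password"

# With N=2 the best chunk never reaches the reranker; with N=3 it does.
too_small = two_stage_search(query, corpus, token_hits, exact_phrase, n=2, k=1)
big_enough = two_stage_search(query, corpus, token_hits, exact_phrase, n=3, k=1)
```

In this toy corpus the third chunk is the only one containing the exact phrase, but it ranks third in the first stage, so it is only recovered once N is at least 3.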
Bi-encoders vs cross-encoders
Bi-encoder (embedding) retrieval
Query and document are encoded separately into fixed-dimension vectors. Similarity is a dot product or cosine similarity. Encoding is cheap and cacheable: document vectors are precomputed at index time; only the query is encoded at search time. This is why bi-encoders power first-stage search.
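As a rough sketch of that scoring step, the snippet below ranks precomputed document vectors against a query vector by cosine similarity. It uses brute force rather than an ANN index, and the vectors are made-up two-dimensional values rather than real embeddings; `cosine` and `top_k` are illustrative names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    # Document vectors are computed once at index time; only the query
    # vector is new at search time.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-d "embeddings" purely for illustration.
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.1], doc_vecs, k=2)
hit_ids = [doc_id for doc_id, _ in hits]
```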
Cross-encoder reranking
Query and document are concatenated and fed through a transformer that outputs a single relevance score (often via a classification head). Full self-attention over the concatenated sequence lets every query token attend to every document token, which produces much sharper relevance judgments. The cost is that you cannot precompute document scores; each pair is scored at query time. That is why cross-encoders are reserved for reranking small candidate sets.
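A minimal sketch of the reranking call pattern follows. The scorer is injected as a `predict` callable that takes a list of (query, document) pairs and returns one score per pair; `rerank` and `toy_predict` are hypothetical names, and `toy_predict` is a word-overlap stand-in, not a real cross-encoder. If you use sentence-transformers, `CrossEncoder(...).predict` accepts a list of pairs in this shape and could be passed in instead, though that path is not exercised here.

```python
from typing import Callable, List, Sequence, Tuple

PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    docs: List[str],
    predict: PairScorer,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair in one batched call and sort by score."""
    if not docs:
        return []
    pairs = [(query, doc) for doc in docs]
    scores = predict(pairs)  # one batched call instead of a per-document loop
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]


def toy_predict(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer: fraction of query words that appear in the document."""
    scores = []
    for query, doc in pairs:
        query_words = query.lower().split()
        doc_words = set(doc.lower().split())
        matched = sum(1 for w in query_words if w in doc_words)
        scores.append(matched / max(len(query_words), 1))
    return scores


docs = ["apple supply chain report", "how to bake apple pie"]
best = rerank("bake apple pie", docs, toy_predict, top_k=1)
```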
Late interaction models (ColBERT-style)
ColBERT and similar architectures sit in between: documents are pre-encoded into per-token contextual embeddings at index time, queries are encoded into token embeddings at search time, and MaxSim aggregation compares token-level interactions without a full pairwise transformer pass on every document. They offer better quality than pure bi-encoders with lower latency than full cross-encoders, at the cost of more complex indexing and larger storage.
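The MaxSim step can be sketched in a few lines: for each query token embedding, take its best match among the document's token embeddings, then sum those maxima. The sketch assumes the token embeddings are already L2-normalized so that a dot product acts as cosine similarity; the vectors are made-up values and `maxsim` is an illustrative name.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(
    query_tokens: List[Sequence[float]],
    doc_tokens: List[Sequence[float]],
) -> float:
    """Sum over query tokens of the max similarity to any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query tokens pointing in orthogonal directions (unit vectors).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
```

Here `doc_a` scores higher because each query token finds a close match, while `doc_b` only matches one of them.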
Popular reranker models and APIs
Several families dominate production RAG stacks in 2026:
- Cohere Rerank — managed API, strong multilingual performance, simple integration (send query + document list, receive sorted scores). Popular when you want zero GPU ops.
- BGE Reranker (BAAI) — open-weight cross-encoders (e.g. bge-reranker-v2-m3) you can self-host with sentence-transformers or ONNX runtimes. Good cost control at moderate scale.
- monoT5 / RankT5 — T5-based rerankers. monoT5 is fine-tuned to emit "true" / "false" relevance tokens, while RankT5 is trained with ranking losses to output a score directly; both are older but still used in academic benchmarks.
- Jina Reranker — multilingual cross-encoder with competitive MTEB reranking scores; available as API or self-hosted weights.
- LLM-as-reranker — prompting a frontier model to score or rank passages. Flexible but expensive and slow; reserve for offline eval or tiny candidate sets.
Model choice matters less than pipeline design once you are above a reasonable quality floor. Benchmark on your domain: legal, medical, and code corpora behave differently from the MS MARCO web passages that many rerankers were trained on.
Hybrid retrieval and score fusion
Vector search alone misses exact keyword matches — SKUs, error codes, function names, and rare entities. Production systems often run BM25 (keyword) and vector search in parallel, then merge ranked lists before reranking.
Reciprocal Rank Fusion (RRF)
RRF combines multiple ranked lists without calibrating incompatible score scales. Each document gets a fusion score like:
RRF(d) = sum over lists of 1 / (k + rank(d))
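where k is a smoothing constant (often 60, and unrelated to the top-k prompt cutoff above) and rank(d) is the document's 1-based position in each list; a document missing from a list contributes nothing for that list. Documents that appear high in both BM25 and vector results rise to the top. RRF output becomes the candidate pool for the cross-encoder reranker.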
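The formula translates almost directly into code. The sketch below assumes 1-based ranks and that a document absent from a list simply adds nothing for that list; the document IDs are made up and `rrf` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf(ranked_lists: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]

fused = rrf([bm25_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Document "b" ends up first because it is near the top of both lists, while "c" and "d" each appear in only one.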
Weighted linear combination
Alternatively, normalize BM25 and cosine scores to [0, 1] and combine with weights (e.g. 0.3 keyword + 0.7 semantic). This requires score calibration per index version and is more brittle than RRF, but gives explicit control when you know your corpus is keyword-heavy (logs, tickets) vs semantic-heavy (policies, FAQs).
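One way to sketch this is min-max normalization per retriever followed by a weighted sum. Min-max is only one possible calibration and is an assumption here, as is treating a document missing from one retriever as scoring 0 for it; `min_max` and `weighted_fusion` are illustrative names and the scores are made up.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1]; all-equal inputs map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    bm25_scores: Dict[str, float],
    cosine_scores: Dict[str, float],
    w_keyword: float = 0.3,
    w_semantic: float = 0.7,
) -> List[Tuple[str, float]]:
    keyword = min_max(bm25_scores)
    semantic = min_max(cosine_scores)
    fused = {
        # A document missing from one retriever gets 0 from that retriever.
        doc_id: w_keyword * keyword.get(doc_id, 0.0) + w_semantic * semantic.get(doc_id, 0.0)
        for doc_id in set(keyword) | set(semantic)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


bm25_scores = {"a": 12.0, "b": 4.0, "c": 8.0}
cosine_scores = {"a": 0.2, "b": 0.9, "d": 0.55}

fused = weighted_fusion(bm25_scores, cosine_scores)
fused_ids = [doc_id for doc_id, _ in fused]
```

Because min-max depends on the best and worst score in each result set, the fused ordering can shift whenever the index or the retriever changes, which is the brittleness noted above.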
Latency, cost, and when reranking pays off
Cross-encoder latency scales roughly linearly with candidate count. Reranking 50 chunks of 512 tokens each on a single GPU might take 100–400 ms depending on model size; 200 chunks can blow your p95 budget. Rules of thumb:
- Start with N=50, k=5 — good default for support bots and internal knowledge bases.
- Batch rerank calls — most libraries batch (query, doc) pairs on GPU; do not score sequentially in a loop.
- Truncate document text — cross-encoders have a max input length (often 512 tokens, shared by the query and the document). Pass title + first paragraph, not the full 2,000-token chunk.
- Cache rerank scores — for repeated identical queries (common in eval suites), memoize results.
- Skip the first stage for tiny corpora — under ~500 chunks, running the cross-encoder over the whole index by brute force may be fine.
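Two of these rules, truncation and caching, can be sketched together. The snippet is a rough illustration: it counts whitespace-separated words as a proxy for model tokens (a real tokenizer will differ), and `truncate`, `make_cached_scorer`, and the fake scorer are hypothetical names.

```python
from functools import lru_cache
from typing import Callable, List

MAX_DOC_TOKENS = 512  # assumed budget; check your reranker's real limit


def truncate(text: str, max_tokens: int = MAX_DOC_TOKENS) -> str:
    """Keep the first max_tokens whitespace-separated words (a rough token proxy)."""
    return " ".join(text.split()[:max_tokens])


def make_cached_scorer(
    score_fn: Callable[[str, str], float],
    maxsize: int = 10_000,
) -> Callable[[str, str], float]:
    """Wrap a (query, doc) scorer so identical pairs are only scored once."""

    @lru_cache(maxsize=maxsize)
    def cached(query: str, doc: str) -> float:
        return score_fn(query, truncate(doc))

    return cached


calls: List[str] = []


def fake_scorer(query: str, doc: str) -> float:
    """Stand-in for an expensive reranker call; records each invocation."""
    calls.append(doc)
    return float(len(doc.split()))


scorer = make_cached_scorer(fake_scorer)
first = scorer("reset password", "to reset password open settings")
second = scorer("reset password", "to reset password open settings")  # cache hit
```

The cache key is the exact (query, document) string pair, so this mainly helps replayed eval suites and repeated queries rather than paraphrased ones.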
Reranking is usually worth it when users complain that answers are "close but wrong," when retrieval evals show the correct chunk is in the candidate pool but ranked too low to reach the prompt, or when hybrid search alone plateaus. It is often not worth it when your bottleneck is bad chunking, stale indexes, or missing documents, because reranking cannot surface what was never indexed or never retrieved.
Common failure modes
- Candidate pool too small — reranker never sees the right chunk because first-stage top-20 missed it. Fix recall (bigger N, better embeddings, hybrid BM25) before tuning reranker weights.
- Duplicate chunks flooding top-k — overlapping splits from the same parent document consume context. Deduplicate by source ID or apply maximal marginal relevance (MMR) after reranking.
- Score threshold absent — forcing the LLM to use three chunks when none are relevant causes confabulation. Add a minimum rerank score; return "I don't know" below threshold.
- Domain mismatch — general rerankers underperform on code, tables, and JSON logs. Fine-tune a small cross-encoder on 500–2,000 labeled query–doc pairs from your domain.
- Ignoring metadata — tenant ID, date, and ACL filters should run before reranking, not after. Never rerank documents the user is not allowed to see.
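Several of these fixes fit into one context-selection step: filter by tenant before scoring, drop chunks below a score threshold, and keep one chunk per source. The sketch below is a simplified illustration under those assumptions; the `Chunk` fields, `select_context`, the 0.5 threshold, and the overlap scorer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Chunk:
    chunk_id: str
    source_id: str
    tenant_id: str
    text: str


def select_context(
    query: str,
    chunks: List[Chunk],
    tenant_id: str,
    score_fn: Callable[[str, str], float],
    k: int = 5,
    min_score: float = 0.5,
) -> List[Chunk]:
    # Filter BEFORE scoring so disallowed chunks are never reranked.
    allowed = [c for c in chunks if c.tenant_id == tenant_id]
    scored = sorted(
        ((score_fn(query, c.text), c) for c in allowed),
        key=lambda pair: pair[0],
        reverse=True,
    )
    picked: List[Chunk] = []
    seen_sources: Set[str] = set()
    for score, chunk in scored:
        if score < min_score:
            break  # everything after this is below the threshold too
        if chunk.source_id in seen_sources:
            continue  # keep only the best chunk per source document
        seen_sources.add(chunk.source_id)
        picked.append(chunk)
        if len(picked) == k:
            break
    return picked  # an empty list means "no good match"


def overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query words present in the text."""
    words = query.lower().split()
    text_words = set(text.lower().split())
    return sum(1 for w in words if w in text_words) / max(len(words), 1)


chunks = [
    Chunk("c1", "s1", "t1", "reset password steps"),
    Chunk("c2", "s1", "t1", "reset password steps again"),
    Chunk("c3", "s2", "t2", "reset password secret"),
    Chunk("c4", "s3", "t1", "unrelated billing info"),
]
selected = select_context("reset password", chunks, "t1", overlap)
selected_ids = [c.chunk_id for c in selected]
nothing = select_context("quarterly forecast", chunks, "t1", overlap)
```

When the returned list is empty, the caller can answer "I don't know" instead of forcing weak context into the prompt.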
Production checklist
- Measure first-stage recall@50 and recall@100 on a golden query set before adding reranking.
- Run hybrid BM25 + vector with RRF; confirm keyword-only queries improve.
- Choose managed API vs self-hosted reranker based on query volume and data residency.
- Benchmark p50/p95 rerank latency at your target candidate count; set N accordingly.
- Truncate chunks to reranker max length; keep full text only for chunks that survive to generation.
- Deduplicate and apply MMR if multiple chunks from one source rank highly.
- Add a relevance score threshold with a graceful "no good match" fallback.
- Log query, retrieved IDs, rerank scores, and final context for offline replay and regression tests.
- Re-evaluate when you change embedding model, chunk size, or corpus — reranker tuning is not set-and-forget.
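The first checklist item can be sketched as a small helper over a golden query set. Recall@k has more than one definition in practice; this version averages, per query, the fraction of labeled relevant chunks that appear in the top k. The function name, query IDs, and chunk IDs are made up for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean over queries of |relevant chunks in top-k| / |relevant chunks|."""
    per_query = []
    for query_id, gold in relevant.items():
        if not gold:
            continue  # skip queries with no labeled relevant chunks
        top = set(retrieved.get(query_id, [])[:k])
        per_query.append(len(top & gold) / len(gold))
    return sum(per_query) / len(per_query) if per_query else 0.0


golden = {"q1": {"a"}, "q2": {"b", "c"}}
first_stage = {"q1": ["x", "a", "y"], "q2": ["b", "z", "c"]}

r1 = recall_at_k(first_stage, golden, k=1)
r2 = recall_at_k(first_stage, golden, k=2)
r3 = recall_at_k(first_stage, golden, k=3)
```

Running the same helper on first-stage output at k=50 and k=100 shows how much headroom the candidate pool gives the reranker before any rerank tuning starts.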
Key takeaways
- Reranking is a second-stage scorer that re-orders first-pass retrieval candidates for sharper relevance.
- Bi-encoders are fast but approximate; cross-encoders are slow but precise — use both in sequence.
- Hybrid retrieval (BM25 + vectors) with RRF feeds better candidates into the reranker.
- Tune N (candidates reranked) for recall and k (chunks in prompt) for context budget.
- Fix chunking and indexing before expecting reranking to rescue a weak corpus.
- Always filter by tenant and ACL metadata before scoring, and add score thresholds to reduce hallucinations.
Related reading
- RAG explained — end-to-end retrieval-augmented generation from chunking to generation
- LLM embeddings explained — bi-encoder models and first-stage vector search
- Vector databases explained — ANN indexes, metadata filters, and hybrid search infrastructure
- LLM evaluation and benchmarking — measuring retrieval recall and answer faithfulness