Guide
Hybrid search explained
Pure keyword search excels at exact tokens — product SKUs, error codes, legal citations — but fails when users paraphrase. Pure vector search captures semantic similarity but can drift toward vaguely related chunks and miss rare identifiers. Hybrid search runs both retrieval paths in parallel and fuses their ranked lists so RAG pipelines surface the right evidence more often. This guide explains sparse (BM25) vs dense retrieval, score fusion methods like reciprocal rank fusion (RRF), platform patterns in Elasticsearch and managed vector stores, when to add a cross-encoder reranker, and a production checklist for tuning recall without blowing latency budgets.
Two retrieval philosophies
Information retrieval has historically split into lexical and semantic camps. Lexical systems tokenize documents and queries, build inverted indexes, and score matches with algorithms like BM25 (Best Matching 25). They reward term frequency and document length normalization but treat "automobile" and "car" as unrelated unless you add synonym expansion.
Dense retrieval encodes text into fixed-length embedding vectors and finds nearest neighbors by cosine similarity or dot product. A query about "refunding a purchase" can match a chunk that says "return policy" even with zero shared tokens — powerful for natural-language questions over messy documentation.
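As a rough illustration of the nearest-neighbor step, the sketch below scores toy vectors with cosine similarity and returns the closest chunks by brute force. The function names and the `{ id, vector }` chunk shape are assumptions made for this example; a production system would use an ANN index rather than scanning every chunk.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention chosen here for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search over hypothetical { id, vector } chunks.
function nearestChunks(queryVector, chunks, k = 3) {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

If the embeddings are already normalized to unit length, the dot product alone gives the same ranking.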
Neither approach dominates every query type. Hybrid search acknowledges this and treats retrieval as an ensemble problem: run both, merge results, optionally rerank the union.
When keyword search wins
- Exact identifiers: CVE-2024-1234, 0xabc…, part numbers.
- Rare proper nouns and acronyms not well represented in embedding training.
- Boolean-style queries where users expect literal token presence.
- Structured fields (title, tags) where BM25 on a boosted field is cheap and precise.
When vector search wins
- Paraphrased questions: "How do I cancel?" vs docs titled "Subscription termination."
- Cross-lingual retrieval when embeddings are multilingual.
- Long-tail concepts with few overlapping keywords between query and chunk.
- Noisy user input — typos handled better by subword embeddings than raw BM25.
BM25 in one practical picture
BM25 scores each document for a query by summing per-term contributions. Terms that appear often in the document but rarely in the corpus get high weight; very long documents are penalized so a single mention in a 50-page PDF does not beat a focused paragraph. Parameters k1 (term frequency saturation) and b (length normalization) are tunable; most search engines ship sensible defaults (Elasticsearch uses BM25 by default).
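To make the scoring concrete, here is a minimal sketch of the BM25 formula over pre-tokenized documents. It assumes the Lucene-style IDF variant and the common defaults k1 = 1.2 and b = 0.75. The function name and the array-of-token-arrays corpus shape are illustrative choices, not any engine's API.

```javascript
// Score one document against a query with BM25.
// queryTerms and docTerms are arrays of tokens; corpus is an array of token arrays.
function bm25Score(queryTerms, docTerms, corpus, k1 = 1.2, b = 0.75) {
  const N = corpus.length;
  const avgLen = corpus.reduce((sum, d) => sum + d.length, 0) / N;
  let score = 0;
  for (const term of new Set(queryTerms)) {
    const tf = docTerms.filter((t) => t === term).length; // term frequency in this doc
    if (tf === 0) continue;
    const df = corpus.filter((d) => d.includes(term)).length; // docs containing the term
    const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5)); // rare terms weigh more
    const lengthNorm = 1 - b + b * (docTerms.length / avgLen); // long docs are penalized
    score += (idf * (tf * (k1 + 1))) / (tf + k1 * lengthNorm);
  }
  return score;
}
```

With the same single mention of a term, a short focused document scores higher than a long one because its length normalization factor is smaller.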
BM25 is sparse: vectors are high-dimensional but mostly zeros (one dimension per vocabulary term). Inverted indexes make sparse retrieval fast at scale: billions of documents with sub-100ms queries on well-sharded clusters. Dense retrieval stores every chunk as a float vector of roughly 384–3072 dimensions; approximate nearest neighbor (ANN) indexes (HNSW, IVF) trade recall for speed.
Storing both representations in a vector database or search platform that supports hybrid queries avoids maintaining two separate systems with divergent indexing pipelines.
Fusion: merging two ranked lists
After running BM25 and vector search independently (each returning top-k chunks, often k = 50–200 for later pruning), you must combine rankings. Naive approaches — averaging raw scores — fail because BM25 scores and cosine similarities live on incomparable scales. Production systems prefer rank-based fusion.
Reciprocal rank fusion (RRF)
RRF assigns each document a fusion score by summing 1 / (rank + c) across lists, where rank is the document's 1-based position in each list and c is a constant (commonly 60) that dampens the influence of top ranks vs deep ranks. A chunk ranked #1 in BM25 and #3 in vectors scores higher than one ranked #20 in both. RRF needs no score calibration, handles duplicate chunk IDs across lists, and is a common choice in many RAG stacks, including Elasticsearch hybrid queries and open-source recipes paired with rerankers.
function rrfScore(ranks, c = 60) {
  // ranks: the 1-based positions of one chunk in each ranked list
  return ranks.reduce((sum, rank) => sum + 1 / (rank + c), 0);
}
// chunk A: BM25 rank 2, vector rank 5 → 1/62 + 1/65 ≈ 0.0315
// chunk B: BM25 rank 40, vector rank 1 → 1/100 + 1/61 ≈ 0.0264
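The per-chunk formula extends naturally to whole result lists. The sketch below fuses any number of ranked ID lists. `fuseWithRrf` and the sample IDs are hypothetical, and it assumes each list contains a given chunk ID at most once.

```javascript
// Fuse ranked lists of chunk IDs with reciprocal rank fusion.
// Ranks are 1-based, so the first item in a list contributes 1 / (1 + c).
function fuseWithRrf(rankedLists, c = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (rank + c));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const bm25Hits = ["doc-7", "doc-2", "doc-9"];
const vectorHits = ["doc-2", "doc-4", "doc-7"];
const fused = fuseWithRrf([bm25Hits, vectorHits]);
// fused order: doc-2, doc-7, doc-4, doc-9
```

Chunks that appear in both lists rise to the top, while a chunk found by only one leg still survives into the candidate set.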
Weighted linear combination
Some platforms normalize scores to [0, 1] per query and compute α · dense + (1 − α) · sparse. This works when the engine exposes calibrated scores (Pinecone hybrid with sparse-dense vectors, Weaviate with relativeScoreFusion). Tuning α on a labeled eval set (50–200 query–document pairs) often beats guessing 0.5/0.5.
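A minimal sketch of the weighted blend follows, assuming per-query min-max normalization and a score of 0 for chunks missing from one leg. Both are choices made for this example; real engines may normalize differently. The function names and the `{ id: score }` input shape are hypothetical.

```javascript
// Min-max normalize a { id: score } map to [0, 1] for a single query.
function minMaxNormalize(scoresById) {
  const values = Object.values(scoresById);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const normalized = {};
  for (const [id, score] of Object.entries(scoresById)) {
    // If every score is identical, treat them all as the top score.
    normalized[id] = max === min ? 1 : (score - min) / (max - min);
  }
  return normalized;
}

// Blend the two legs: alpha * dense + (1 - alpha) * sparse.
function weightedFusion(denseScores, sparseScores, alpha = 0.5) {
  const dense = minMaxNormalize(denseScores);
  const sparse = minMaxNormalize(sparseScores);
  const ids = new Set([...Object.keys(dense), ...Object.keys(sparse)]);
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (dense[id] ?? 0) + (1 - alpha) * (sparse[id] ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Sweeping alpha from 0 to 1 over a labeled query set and plotting recall is one straightforward way to pick a value.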
Learned fusion and reranking
A two-stage pipeline treats fusion as recall expansion and a cross-encoder reranker as precision filtering: retrieve 100+ candidates via hybrid search, rerank down to 5–10 chunks fed to the LLM. Cross-encoders jointly encode query + document and are slower but far more accurate than bi-encoder cosine alone — the pattern described in depth in our reranking guide.
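The second stage can be sketched as a function that rescores fused candidates and keeps the top few. The `tokenOverlapScore` helper is only a stand-in so the example runs on its own; it is not a cross-encoder. In practice `scorePair` would call a real reranking model, and all names here are hypothetical.

```javascript
// Stand-in scorer: fraction of query tokens that appear in the document text.
// A real deployment would call a cross-encoder model here instead.
function tokenOverlapScore(query, text) {
  const queryTokens = query.toLowerCase().split(/\W+/).filter(Boolean);
  const docTokens = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  if (queryTokens.length === 0) return 0;
  return queryTokens.filter((t) => docTokens.has(t)).length / queryTokens.length;
}

// Rescore hybrid candidates ({ id, text }) and keep the best topN for the LLM.
function rerank(query, candidates, scorePair = tokenOverlapScore, topN = 5) {
  return candidates
    .map((c) => ({ ...c, rerankScore: scorePair(query, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topN);
}
```

Keeping the scorer injectable makes it easy to compare rerankers on the same candidate set.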
Architecture patterns
Single platform hybrid
Elasticsearch 8+, OpenSearch, Weaviate, and Qdrant support hybrid queries in one API call: BM25 on text fields plus kNN on dense_vector fields, fused server-side. Benefits: one index, one replication story, consistent ACL filters applied to both legs. Drawback: you are tied to that engine's fusion semantics and ANN tuning knobs.
Dual-index with application-side fusion
Legacy setups keep Solr/Elasticsearch for keyword and a dedicated vector store for embeddings. The application issues parallel queries, deduplicates by chunk ID, runs RRF in code, then optionally calls a reranker API (Cohere, Jina, BGE). More moving parts, but lets teams adopt vectors without migrating years of lexical tuning.
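The application-side flow might look like the sketch below. `searchKeyword` and `searchVector` are hypothetical async clients supplied by the caller, each assumed to return an ordered array of `{ id, text }` hits. Nothing here reflects a specific vendor SDK.

```javascript
// Query both stores in parallel, dedupe by chunk ID, and fuse with RRF.
async function hybridSearch(query, { searchKeyword, searchVector }, { k = 50, c = 60 } = {}) {
  const [keywordHits, vectorHits] = await Promise.all([
    searchKeyword(query, k),
    searchVector(query, k),
  ]);
  const fused = new Map();
  for (const hits of [keywordHits, vectorHits]) {
    const seen = new Set();
    let rank = 0;
    for (const hit of hits) {
      if (seen.has(hit.id)) continue; // skip duplicate IDs within one leg
      seen.add(hit.id);
      rank += 1;
      const entry = fused.get(hit.id) ?? { id: hit.id, text: hit.text, score: 0 };
      entry.score += 1 / (rank + c); // the same chunk from both legs accumulates
      fused.set(hit.id, entry);
    }
  }
  return [...fused.values()].sort((a, b) => b.score - a.score);
}
```

The fused list can then be passed to a reranker, or truncated directly if latency is tight.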
Sparse vectors (SPLADE and friends)
Learned sparse models output high-dimensional but mostly-zero vectors like BM25, yet capture semantic expansion ("car" activates "vehicle"). Some teams replace classic BM25 with SPLADE sparse vectors and still call it hybrid when combined with dense vectors — same fusion math, different sparse encoder. Worth A/B testing on your corpus before committing; SPLADE adds inference cost at index and query time.
Query routing: not every query needs hybrid
Running two retrieval paths doubles index I/O and ANN work. Smart routers save latency:
- Regex / heuristics: if the query matches ^[A-Z]{2,}-\d+ (SKU pattern), keyword-only; otherwise hybrid.
- Classifier: a tiny model or LLM labels the query "identifier lookup" vs "conceptual question"; cache labels for repeat queries.
- Fallback cascade: keyword first; if top score below a threshold or fewer than n hits, augment with vector search.
Log which path fired and downstream answer quality. Routers drift as user behavior changes; revisit quarterly.
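The regex and fallback-cascade routes above could be sketched as follows. The pattern is the one from the list. The `minScore` and `minHits` thresholds are placeholder values that would need tuning, and the function names are hypothetical.

```javascript
// Identifier-style queries (for example "CVE-2024-1234") go to keyword-only search.
const IDENTIFIER_PATTERN = /^[A-Z]{2,}-\d+/;

function chooseRoute(query) {
  return IDENTIFIER_PATTERN.test(query.trim()) ? "keyword" : "hybrid";
}

// Fallback cascade: decide whether keyword results are weak enough to add vector search.
// keywordHits is assumed to be sorted by descending score.
function needsVectorFallback(keywordHits, { minScore = 5, minHits = 3 } = {}) {
  if (keywordHits.length < minHits) return true;
  return keywordHits[0].score < minScore;
}
```

Raw BM25 scores are not comparable across corpora, so any score threshold should be calibrated per index.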
Chunking and metadata still matter
Hybrid search does not fix bad indexing. Chunks that split mid-sentence lose BM25 context; chunks that are too long dilute dense vectors. Store metadata (source URL, section title, product ID, freshness timestamp) alongside both sparse and dense representations so you can:
- Boost BM25 on title fields (title^3 in Elasticsearch).
- Pre-filter vectors by tenant, language, or date range before ANN.
- Display citations the LLM can quote in answers — critical for reducing hallucinations.
Re-embed when you change chunk boundaries or embedding models; stale vectors with fresh BM25 text is a subtle failure mode.
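As one possible shape for the title boost and pre-filter, the sketch below builds an Elasticsearch-style search body with a boosted lexical leg and a filtered kNN leg. The field names (title, body, embedding, tenant_id) and the `num_candidates` multiplier are hypothetical. How the two legs are fused depends on the engine and version and is not shown here.

```javascript
// Build a request body that applies the same tenant filter to both legs.
function buildHybridBody(queryText, queryVector, tenantId, k = 50) {
  const tenantFilter = { term: { tenant_id: tenantId } };
  return {
    size: k,
    query: {
      bool: {
        // title^3 weights matches in the title three times as heavily as body matches
        must: { multi_match: { query: queryText, fields: ["title^3", "body"] } },
        filter: [tenantFilter],
      },
    },
    knn: {
      field: "embedding",
      query_vector: queryVector,
      k,
      num_candidates: k * 4, // assumption: candidate pool size, tune per index
      filter: tenantFilter, // pre-filter applied before the ANN search
    },
  };
}
```

Building both legs from one filter object helps keep tenant and ACL restrictions from drifting apart.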
Evaluation without guessing
Measure retrieval with labeled data, not vibes:
- Recall@k: does the gold chunk appear in the top 10 fused results?
- MRR (mean reciprocal rank): how high does the first correct chunk rank?
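- nDCG (normalized discounted cumulative gain): rewards placing the most relevant chunks nearest the top, with a logarithmic discount for lower positions.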
Compare configurations: BM25-only, vector-only, RRF hybrid, hybrid + reranker. Track latency percentiles (p50, p95) alongside quality — a 200 ms retrieval budget is useless if fusion + rerank takes 800 ms. Our LLM evaluation guide covers end-to-end RAG eval including answer faithfulness once retrieval is fixed.
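The three metrics above are short enough to compute in-house. This sketch assumes ranked result IDs per query, a list of gold IDs for recall and MRR, and a hypothetical `{ id: gain }` map of graded relevance for nDCG.

```javascript
// Fraction of gold chunks that appear in the top k results.
function recallAtK(rankedIds, goldIds, k) {
  const topK = new Set(rankedIds.slice(0, k));
  const found = goldIds.filter((id) => topK.has(id)).length;
  return goldIds.length === 0 ? 0 : found / goldIds.length;
}

// 1 / position of the first correct chunk, or 0 if none was retrieved.
function reciprocalRank(rankedIds, goldIds) {
  const gold = new Set(goldIds);
  const index = rankedIds.findIndex((id) => gold.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Average reciprocal rank over runs shaped like { rankedIds, goldIds }.
function meanReciprocalRank(runs) {
  const total = runs.reduce((sum, r) => sum + reciprocalRank(r.rankedIds, r.goldIds), 0);
  return runs.length === 0 ? 0 : total / runs.length;
}

// nDCG@k with graded gains; positions are discounted by log2(position + 1).
function ndcgAtK(rankedIds, gainById, k) {
  const dcg = (gains) => gains.reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
  const actual = rankedIds.slice(0, k).map((id) => gainById[id] ?? 0);
  const ideal = Object.values(gainById).sort((a, b) => b - a).slice(0, k);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(actual) / idealDcg;
}
```

Running these functions over the same golden set for each configuration gives directly comparable numbers.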
Production checklist
- Index the same chunk ID in both sparse and dense legs; dedupe before fusion.
- Start with RRF (c = 60) before tuning weighted score blends.
- Retrieve generously (top 50–100 per leg); prune with a reranker to top 5–10 for the LLM.
- Apply identical ACL / tenant filters to both retrieval paths.
- Log query text, fusion method, candidate IDs, and final picks for debugging.
- Build a 100+ query golden set; run weekly regression when you change embeddings or chunking.
- Route identifier-heavy queries to keyword-only when latency or precision demands it.
- Monitor ANN recall settings: aggressive cuts to HNSW ef miss the rare exact matches hybrid was meant to save.
- Version embedding models in index metadata; plan reindex jobs before model swaps.
- Cap total retrieval latency; degrade to single-path search under timeout rather than failing the request.
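The last checklist item can be sketched with a timeout wrapper. This example assumes the keyword leg is the fast, reliable path and caps only the vector leg. `searchKeyword` and `searchVector` are hypothetical async clients, and the 200 ms budget is a placeholder.

```javascript
// Reject if the wrapped promise does not settle within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run both legs; if the vector leg times out or errors, return keyword results only.
async function searchWithBudget(query, { searchKeyword, searchVector }, budgetMs = 200) {
  const keyword = searchKeyword(query);
  const vector = withTimeout(searchVector(query), budgetMs).catch(() => null);
  const [keywordHits, vectorHits] = await Promise.all([keyword, vector]);
  return {
    keywordHits,
    vectorHits: vectorHits ?? [],
    degraded: vectorHits === null, // log this flag to track how often fallback fires
  };
}
```

Note that the timeout stops waiting for the vector query but does not cancel it; real clients may need an abort signal to free resources.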
Key takeaways
- Keyword and vector search fail in opposite ways — hybrid covers SKU lookups and paraphrased questions in one pipeline.
- Fuse ranks, not raw scores — RRF is simple, scale-free, and widely deployed.
- Hybrid is stage one — pair with cross-encoder reranking for precision-critical RAG.
- Route and filter — not every query needs both legs; metadata boosts beat bigger k.
- Evaluate on your corpus — default fusion weights rarely match your domain without tuning.
Related reading
- RAG explained — end-to-end retrieval-augmented generation architecture
- Vector databases explained — HNSW indexes, sharding, and hybrid-capable stores
- LLM embeddings explained — model choice and similarity metrics for dense retrieval
- LLM reranking explained — cross-encoders and two-stage retrieval after fusion