Information retrieval explained

A user types "refund policy international shipping" into your help center search box. Behind the scenes, the system must search thousands of support articles, rank the few that actually answer the question, and return them in under 50 milliseconds. That problem, finding the most relevant documents for a query in a large corpus, is information retrieval (IR). IR predates large language models by decades: Google, Elasticsearch, and Lucene are all built on the same core ideas. Today those ideas power the lexical half of hybrid search and the retrieval stage of retrieval-augmented generation (RAG) pipelines. This guide covers inverted indexes, TF-IDF and BM25 scoring, query processing, precision vs recall trade-offs, evaluation metrics, and how classical IR connects to modern embedding-based retrieval.

What information retrieval is

Information retrieval is the science and engineering of matching user information needs (queries) to documents in a collection (the corpus). Unlike a database lookup that expects an exact key, IR handles fuzzy, natural-language queries where relevance is graded, not binary.

A retrieval system has three stages:

Indexing — preprocess documents (tokenize, stem, remove stop words), build a searchable structure.
Querying — parse the user query, look up candidate documents, score each by relevance.
Ranking — sort candidates by score and return the top k results.

In production search and RAG, you often add a fourth optional stage — reranking — where a slower, more accurate model re-scores the top candidates before they reach the user or the LLM context window.

Inverted indexes: how search scales

Scanning every document for every query costs time proportional to the size of the corpus, which becomes too slow as the collection grows. The standard solution is an inverted index: a mapping (in practice a hash map or a sorted term dictionary) from each term (token) to a postings list of the document IDs where that term appears, often with term frequency and position metadata.

Example postings list for the term refund:

refund → [(doc_42, tf=3, positions=[12, 87, 204]),
          (doc_108, tf=1, positions=[5]),
          (doc_301, tf=2, positions=[33, 91])]

When a query arrives, the engine intersects or unions postings lists for each query term. Boolean retrieval (AND/OR/NOT) uses set operations on these lists. Ranked retrieval assigns a numeric score to each candidate document and sorts by that score.
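The lookup described above can be sketched in a few lines. This is a minimal in-memory illustration, not how Lucene or Elasticsearch store postings; the function names (build_index, boolean_and, boolean_or) and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}; tf is the length of the position list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Documents containing every term: intersect the postings lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, terms):
    """Documents containing at least one term: union the postings lists."""
    return set().union(*(index.get(t, {}) for t in terms))

# Hypothetical sample corpus.
docs = {
    "doc_1": "refund policy for international shipping",
    "doc_2": "shipping times and tracking",
    "doc_3": "how to request a refund",
}
index = build_index(docs)
and_hits = boolean_and(index, ["refund", "shipping"])  # only doc_1 has both terms
or_hits = boolean_or(index, ["refund", "shipping"])    # every document has at least one
```

Each query touches only the postings lists for its own terms, which is why the cost depends on how often those terms occur rather than on the size of the whole corpus.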

Elasticsearch, Apache Lucene, and OpenSearch all use inverted indexes under the hood. When you index chunks for RAG, you are building the same structure — often alongside a separate vector index for dense retrieval.

From TF-IDF to BM25

Early ranked retrieval used TF-IDF (term frequency times inverse document frequency). Term frequency rewards documents where a query word appears often; inverse document frequency down-weights common words such as "the" that appear in nearly every document.
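As a rough illustration, the sketch below uses one common TF-IDF variant (raw term count times log(N / df)); other weightings exist, and the function name tf_idf and the toy corpus are assumptions for the example.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Raw term count times log(N / df), one of several TF-IDF variants."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Hypothetical corpus, already tokenized.
corpus = [
    "the refund policy covers the order".split(),
    "the shipping policy".split(),
    "the refund was issued".split(),
]
score_the = tf_idf("the", corpus[0], corpus)        # appears in every document, so idf = log(1) = 0
score_refund = tf_idf("refund", corpus[0], corpus)  # appears in 2 of 3 documents
```

Even though "the" occurs twice in the first document, its score is zero because it carries no power to discriminate between documents.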

TF-IDF has known weaknesses: its score keeps growing linearly with term frequency (a word appearing 50 times is not 10 times more relevant than one appearing 5 times), and it has no principled length normalization, so long documents can score higher simply because they contain more words. BM25 (Best Matching 25) fixes both with two parameters:

k1 — controls term-frequency saturation (typical value 1.2–2.0). Higher k1 means repeated terms matter more.
b — controls document-length normalization (typical value 0.75). b=1 fully normalizes by length; b=0 ignores length.
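To make the roles of k1 and b concrete, here is a minimal sketch of a BM25 scorer. It assumes the non-negative IDF form, log(1 + (N - df + 0.5) / (df + 0.5)), which is one of several variants in use; the function name bm25_score and the synthetic documents are illustrative only.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # b blends between no length normalization (b=0) and full normalization (b=1).
        length_norm = 1 - b + b * len(doc) / avgdl
        # k1 sets how quickly extra occurrences of the term stop adding score.
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Three synthetic documents of equal length, so only term frequency differs.
doc_low_tf = ["refund"] * 5 + ["filler"] * 45
doc_high_tf = ["refund"] * 50
doc_none = ["filler"] * 50
corpus = [doc_low_tf, doc_high_tf, doc_none]

low = bm25_score(["refund"], doc_low_tf, corpus)
high = bm25_score(["refund"], doc_high_tf, corpus)
```

With these inputs the second document contains the term ten times as often as the first, yet its score is well under twice as large, which is the saturation behavior that k1 controls.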

BM25 is the default lexical scorer in Lucene, Elasticsearch, and most production search engines. In RAG pipelines, BM25 handles exact keyword matches — product SKUs, legal citations, error codes — that dense embeddings often miss. That is why hybrid search combines BM25 with vector similarity instead of relying on either alone.

Query processing: tokenization to expansion

Before scoring, the query passes through the same preprocessing pipeline as documents:

Tokenization — split text into words or subwords. Language matters: English tokenizes on whitespace and punctuation; Chinese and Japanese need specialized segmenters.
Normalization — lowercase, remove punctuation, apply stemming (reduce "running" to "run") or lemmatization (map to dictionary form).
Stop-word removal — drop high-frequency function words. Be careful: removing "not" changes meaning.
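The three steps above can be sketched as a single analyzer function that both the indexing path and the query path call. The stemmer here is a deliberately crude toy (real systems typically use Porter or Snowball stemmers), and the stop-word list and the names analyze and crude_stem are assumptions for illustration.

```python
import re

# Illustrative stop-word list; "not" is deliberately left out so negation survives.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "for"}

def crude_stem(token):
    """Toy suffix stripper for illustration only, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undouble a trailing consonant ("runn" -> "run"), except l, s, z.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return token

def analyze(text):
    """Tokenize on letters and digits, lowercase, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

doc_terms = analyze("Running refunds is NOT the policy")
query_terms = analyze("refund policy")
```

Because documents and queries go through the same function, a query for "refund" matches a document that says "refunds".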

Beyond basic preprocessing, production systems add:

Synonym expansion — expand a query for "laptop" so it also searches for "notebook".
Spell correction — suggest "refund" when the user types "refnd".
Query rewriting — transform verbose queries into keyword-focused forms, sometimes using an LLM.
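A minimal sketch of the first two additions, assuming a hand-written synonym table and a small known vocabulary. It uses difflib.get_close_matches from the Python standard library as a stand-in for a real spell corrector; SYNONYMS, VOCABULARY, correct, and expand are hypothetical names.

```python
import difflib

# Hypothetical synonym table and index vocabulary.
SYNONYMS = {"laptop": ["notebook"]}
VOCABULARY = ["refund", "policy", "laptop", "notebook", "shipping"]

def correct(token):
    """Replace an unknown token with its closest vocabulary entry, if one is close enough."""
    if token in VOCABULARY:
        return token
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token

def expand(tokens):
    """Spell-correct each token, then append any synonyms for it."""
    expanded = []
    for token in tokens:
        token = correct(token)
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded

expanded_query = expand(["refnd", "laptop"])
```

Expansion widens recall at some cost to precision, so it is usually worth checking its effect against an evaluation set before enabling it broadly.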

Query-document preprocessing must stay consistent. If documents are stemmed at index time but queries are not, term matches silently fail — a common source of "search works in staging but not production" bugs.

Precision, recall, and the k trade-off

Retrieval quality is measured along two axes familiar from classification metrics:

Precision — of the documents returned, how many are actually relevant?
Recall — of all relevant documents in the corpus, how many did we return?

Returning more results (a larger k) generally increases recall but may decrease precision. In RAG, the LLM context window caps k tightly: you might retrieve 20 chunks but pass only 5 to the model after reranking. Missing the one relevant chunk (low recall) is a common cause of hallucinations, because the model must answer without the supporting passage and may invent facts that are not in the corpus.
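The trade-off is easy to see on a small example. The ranked list and relevance labels below are made up for illustration, and the helper names are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

# Hypothetical ranking and relevance judgments.
ranked = ["d1", "d7", "d3", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3"}

for k in (1, 3, 6):
    print(k, precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k))
```

In this example precision falls from 1.0 at k=1 to 0.5 at k=6, while recall rises from one third to 1.0.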

The practical fix is a two-stage pipeline: a fast first stage retrieves a broad candidate set (high recall, lower precision), then a slower reranker or cross-encoder narrows to the best few (high precision).
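The shape of such a pipeline can be sketched with two pluggable scoring functions. In the example, shared_terms stands in for a fast first-stage scorer such as BM25, and jaccard stands in for a slower reranker such as a cross-encoder; both are toy substitutes chosen so the sketch runs without external models, and all names are hypothetical.

```python
def shared_terms(query, doc):
    """Cheap first-stage score: number of distinct query terms found in the document."""
    return len(set(query.split()) & set(doc.split()))

def jaccard(query, doc):
    """Stand-in for an expensive reranker: overlap relative to the combined vocabulary."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage_search(query, corpus, first_stage, reranker, n_candidates=20, k=5):
    """Take a broad candidate set with the cheap scorer, then rerank it and keep the top k."""
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc), reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)[:k]

# Hypothetical corpus.
corpus = [
    "refund policy for international orders",
    "international shipping rates",
    "refund refund refund",
    "office holiday schedule",
]
results = two_stage_search("international refund policy", corpus,
                           shared_terms, jaccard, n_candidates=3, k=2)
```

The reranker only ever sees n_candidates documents, so its cost stays fixed no matter how large the corpus grows.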

Evaluating retrieval systems

You cannot improve what you do not measure. IR evaluation uses labeled query-document relevance judgments — often on a graded scale (0 = irrelevant, 1 = partially relevant, 2 = highly relevant).

Key metrics

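Mean Average Precision (MAP) — the mean over queries of average precision, where average precision averages the precision at each rank that holds a relevant document; rewards ranking relevant docs high.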
Normalized Discounted Cumulative Gain (NDCG) — handles graded relevance; penalizes burying the best result at position 10.
Recall@k — fraction of relevant docs found in the top k results. Critical for RAG where k is small.
MRR (Mean Reciprocal Rank) — focuses on the rank of the first relevant result. Useful for navigational queries.
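The per-query building blocks of these metrics can be sketched as follows; MAP and MRR are then the means of average_precision and reciprocal_rank over all queries in the evaluation set. The NDCG version below uses the graded label directly as the gain, which is one common convention (another uses 2^grade - 1). Function names and sample data are assumptions.

```python
import math

def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked, grades, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [grades.get(doc, 0) for doc in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments for one query: d1 is highly relevant, d3 partially relevant.
ranked = ["d2", "d1", "d3"]
grades = {"d1": 2, "d3": 1}
relevant = set(grades)

ap = average_precision(ranked, relevant)   # (1/2 + 2/3) / 2
rr = reciprocal_rank(ranked, relevant)     # first relevant result is at rank 2
ndcg = ndcg_at_k(ranked, grades, k=3)
```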

Build an evaluation set of 50–200 real user queries with human-labeled relevant documents. Run your pipeline against this set weekly, and again whenever you change chunking, embedding models, or fusion weights. Offline metrics correlate imperfectly with end-user satisfaction, but they catch regressions before deploy.

Lexical vs semantic retrieval

Classical IR is lexical: it matches on shared tokens, not shared meaning. A query for "automobile repair" will not find a document that only says "car maintenance" unless you have synonym rules or query expansion that bridges the gap. Stemming does not help here, because it only conflates variants of the same word.

Semantic retrieval uses dense embeddings to match by meaning. It handles paraphrase and conceptual similarity but can miss exact identifiers — a query for error code ECONNREFUSED needs lexical matching, not semantic similarity to "connection refused."

Approach	Strengths	Weaknesses
BM25 (lexical)	Exact terms, SKUs, codes, rare words; fast; interpretable	Vocabulary mismatch; no paraphrase
Dense vectors (semantic)	Paraphrase, cross-language, conceptual match	Exact-match failures; index size; model dependency
Hybrid (BM25 + vectors)	Best of both via fusion or reranking	More infrastructure; tuning fusion weights

Hybrid retrieval, rather than pure vector search, is the usual default for production RAG systems in 2026.
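One widely used way to fuse the two rankings is reciprocal rank fusion (RRF), which needs only the rank positions and not the raw scores. The sketch below assumes the commonly used constant k=60; the document IDs and the two input rankings are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) for each document across all rankings, then sort by that sum."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a lexical and a dense retriever for the same query.
bm25_ranking = ["doc_econnrefused", "doc_timeouts", "doc_dns"]
vector_ranking = ["doc_connection_refused", "doc_econnrefused", "doc_timeouts"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still stays in the fused list. Because RRF ignores score magnitudes, it avoids having to normalize BM25 scores against cosine similarities.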

Chunking and indexing for RAG

IR assumes documents are atomic units. RAG corpora are often long PDFs, wikis, or codebases that must be chunked before indexing. Chunk size and overlap directly affect retrieval quality:

Too small — chunks lack context; the retriever returns sentence fragments the LLM cannot interpret.
Too large — one chunk covers multiple topics; BM25 scores dilute and embedding vectors average away specificity.
Overlap — 10–20% overlap between adjacent chunks prevents answers from being split across chunk boundaries.
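A sliding window over tokens is one simple way to produce overlapping chunks. The sketch below is a fixed-size chunker; production systems often split on headings or sentence boundaries instead. The name chunk_tokens and the default sizes (200 tokens with a 30-token overlap, which is 15%) are assumptions.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=30):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Small numbers make the overlap visible: each chunk repeats the last token of the previous one.
chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```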

Store metadata (source URL, section title, last-updated timestamp) alongside each chunk in the index. Filter by metadata at query time — "search only docs updated this quarter" — without re-embedding the corpus.
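As a sketch of what this looks like in application code, the example below attaches the three metadata fields to each chunk and filters before any scoring happens. Search engines and vector stores usually expose their own filter syntax, so the Chunk class, the filter_chunks helper, and the example.com URLs are purely hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    source_url: str
    section_title: str
    last_updated: date

def filter_chunks(chunks, updated_since=None, section_title=None):
    """Keep chunks that satisfy every filter that was supplied."""
    kept = []
    for chunk in chunks:
        if updated_since is not None and chunk.last_updated < updated_since:
            continue
        if section_title is not None and chunk.section_title != section_title:
            continue
        kept.append(chunk)
    return kept

# Hypothetical chunks with placeholder text and URLs.
chunks = [
    Chunk("Current refund policy text.", "https://example.com/refunds", "Refunds", date(2026, 2, 10)),
    Chunk("Superseded refund policy text.", "https://example.com/old-refunds", "Refunds", date(2024, 6, 1)),
]
recent = filter_chunks(chunks, updated_since=date(2026, 1, 1))
```

Filtering on stored fields changes which chunks are eligible for scoring without touching their embeddings.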

Common anti-patterns

Vector-only RAG — skipping BM25 loses exact-match retrieval for codes, names, and rare terms.
Mismatched preprocessing — different tokenization at index vs query time silently kills recall.
No evaluation set — tuning chunk size and fusion weights by gut feel instead of NDCG@10.
Ignoring document freshness — stale chunks outrank current policy docs because they have more inbound links or higher term frequency.
Retrieving into an oversized context — stuffing 30 chunks into the prompt adds noise; rerank down to 3–5 high-confidence passages.
Single global index — mixing legal, support, and marketing content without metadata filters causes cross-domain false positives.

Production checklist

Build an inverted index (Elasticsearch, OpenSearch, or SQLite FTS) for lexical retrieval.
Start with the default BM25 parameters; tune k1 and b only when the eval set shows a gain over the defaults.
Chunk long documents with 10–20% overlap; attach source metadata to every chunk.
Add a vector index for semantic retrieval; fuse with BM25 via reciprocal rank fusion.
Apply a cross-encoder reranker on the top 20–50 candidates before LLM injection.
Create a labeled eval set (50+ queries); track NDCG@10 and Recall@5 weekly.
Keep index and query preprocessing pipelines identical.
Log queries; track the zero-result rate and click-through on top results.
Version embedding models; re-index when you upgrade.
Monitor retrieval latency separately from LLM generation latency.

Key takeaways

Inverted indexes make lexical search fast at scale — they are the foundation of Elasticsearch and Lucene.
BM25 is the production standard for keyword scoring; it handles term saturation and document length better than TF-IDF.
Precision and recall trade off via result count; RAG needs high recall in stage one and high precision after reranking.
Hybrid retrieval combines BM25 and embeddings because neither alone covers all query types.
Evaluate with labeled queries — NDCG and Recall@k catch regressions before users do.