Guide

Semantic search explained

A customer types “laptop won’t turn on” into your help center. The best article is titled “Troubleshooting power issues on portable computers” — it shares zero keywords with the query. Classical keyword search (BM25, TF-IDF) often misses this match. Semantic search encodes both the query and every document as dense vectors that capture meaning, then ranks by similarity in embedding space. It powers modern enterprise search, e-commerce discovery, and the retrieval stage of RAG pipelines. This guide explains how semantic search differs from lexical retrieval, the embedding-and-index pipeline, approximate nearest-neighbor (ANN) lookup, re-ranking with cross-encoders, when to pair semantic with keyword search in hybrid fusion, a product-catalog worked example, a strategy decision table, common pitfalls, and a practitioner checklist.

Keyword search vs semantic search

Lexical (keyword) search tokenizes text, builds inverted indexes, and scores overlap between query terms and document terms. BM25 rewards term frequency while penalizing common words. It is fast, interpretable, and excellent for exact product SKUs, error codes, and proper nouns — but brittle to synonyms, paraphrases, and cross-language queries.
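To make the lexical scoring concrete, here is a minimal sketch of Okapi BM25 in plain Python. The whitespace tokenizer, the parameter defaults (k1=1.5, b=0.75), and the three sample documents are illustrative assumptions; production engines add stemming, stop-word handling, and an inverted index.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms receive a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 saturates term frequency; b normalizes for document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "troubleshooting power issues on portable computers",
    "laptop battery replacement guide",
    "how to reset your password",
]
scores = bm25_scores("laptop won't start", docs)
```

The first document scores zero because it shares no token with the query, which is the brittleness described above: only the document that literally contains "laptop" receives a positive score.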

Semantic search maps text into a fixed-dimensional vector space where proximity reflects conceptual similarity. “Refund policy” and “how to return an item” land near each other even without shared tokens. The trade-off: embeddings are opaque, require neural-model inference at both index and query time (often GPU-accelerated at scale), and can assign high similarity to only vaguely related topics unless combined with guardrails such as score thresholds and metadata filters.

Production systems rarely choose one exclusively. The winning pattern is hybrid search: run BM25 and dense retrieval in parallel, then fuse rankings. Semantic search handles meaning; keywords handle precision on rare tokens.
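One common way to fuse the two rankings is reciprocal rank fusion (RRF), which needs only rank positions, not comparable scores. The sketch below is a minimal version; the constant k=60 is the value commonly used in the literature, and the SKU identifiers are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); high ranks contribute most.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

bm25_hits = ["sku-17", "sku-03", "sku-42"]   # hypothetical keyword results
dense_hits = ["sku-42", "sku-17", "sku-08"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["sku-17", "sku-42", "sku-03", "sku-08"]
```

Documents that appear in both lists rise to the top, while documents found by only one retriever still remain in the candidate set.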

The semantic search pipeline

A typical semantic search stack has four stages:

Chunking — split long documents into passages (256–512 tokens is common for RAG). Store metadata: source URL, title, section, timestamp.
Embedding — pass each chunk through an embedding model (e.g. OpenAI text-embedding-3-small, Cohere Embed, BGE, E5). Output is a float vector — often 384, 768, or 1,536 dimensions. See LLM embeddings for model selection.
Indexing — insert vectors into an ANN index inside a vector database (Pinecone, Weaviate, Qdrant, pgvector, Elasticsearch dense_vector). Indexes trade recall for speed via HNSW, IVF, or product quantization.
Query — embed the user query with the same model, retrieve top-k neighbors by cosine similarity or inner product, optionally re-rank, return passages to the UI or downstream LLM.
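The four stages can be sketched end to end in a few lines. Everything here is a stand-in chosen so the example runs without dependencies: the embed function is a hashed bag-of-words (not a semantic model), chunking is by words rather than tokens, and the "index" is a plain list searched by brute force instead of an ANN structure.

```python
import math
import zlib

DIM = 64  # toy dimensionality; real models emit hundreds of dimensions

def chunk(text, size=8, overlap=2):
    """Split text into overlapping word windows (stand-in for token chunks)."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]

def embed(text):
    """Toy embedding: hashed bag-of-words, L2-normalized. Not semantic."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(docs):
    """Chunk and embed every document, keeping metadata next to each vector."""
    index = []
    for doc in docs:
        for passage in chunk(doc["text"]):
            index.append({"vector": embed(passage), "text": passage, "url": doc["url"]})
    return index

def search(index, query, k=3):
    """Embed the query with the SAME function and rank by cosine similarity."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, e["vector"])), e) for e in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

docs = [
    {"url": "/help/refunds", "text": "refund policy for returned items"},
    {"url": "/help/shipping", "text": "shipping times for international orders"},
]
index = build_index(docs)
hits = search(index, "refund policy", k=1)
```

Because vectors are normalized at both index and query time, the dot product equals cosine similarity. Swapping in a real embedding model would change only embed; the surrounding flow stays the same.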

Critical invariant: index-time and query-time must use identical embedding models, preprocessing, and normalization. Swapping models without re-embedding the entire corpus silently degrades recall.

Embedding models and domain fit

General-purpose embedding models trained on web text work well for broad FAQ and documentation search. Specialized domains — legal contracts, medical records, source code — often need fine-tuned or domain-specific models (CodeBERT, PubMedBERT, legal-bert) or contrastive fine-tuning on your own query–document pairs.

Model choice affects three dimensions:

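Dimensionality — higher dims can capture nuance but increase storage and latency. Matryoshka-trained models let you truncate vectors to a shorter prefix, provided index and query vectors are truncated to the same length and re-normalized.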
Context length — must exceed your chunk size; otherwise text is silently truncated and recall drops on long passages.
Instruction prefixes — models like E5 expect query: and passage: prefixes at inference. Omitting them is a common production bug.
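Two of these points are easy to get wrong in code, so a small sketch may help. The helper names with_prefix and truncate_embedding are hypothetical, and the truncation only makes sense for models trained to support it; the same truncation must be applied to both indexed and query vectors.

```python
import math

def with_prefix(text, role):
    """Prepend an E5-style role prefix; `role` is 'query' or 'passage'."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return f"{role}: {text}"

def truncate_embedding(vector, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(v * v for v in head)) or 1.0
    return [v / norm for v in head]

query_text = with_prefix("how to return an item", "query")
passage_text = with_prefix("Refund policy: items may be returned within 30 days.", "passage")
short_vec = truncate_embedding([3.0, 4.0, 12.0], dims=2)  # approximately [0.6, 0.8]
```

Routing every piece of text through one helper like with_prefix, at both index and query time, makes the prefix hard to forget.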

Evaluate candidates on your data with labeled query–relevant-document pairs. Offline NDCG@10 beats guessing from leaderboard benchmarks.

Approximate nearest neighbors (ANN)

Exact brute-force comparison of a query against millions of vectors is too slow at scale. ANN algorithms build graph or cluster structures that navigate embedding space heuristically, returning near-neighbors in milliseconds, often with recall above 95% relative to exact search when tuned well.

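HNSW (Hierarchical Navigable Small World) is the default in most vector DBs: layered graphs with greedy search. Tune efConstruction (index build quality) and efSearch (query-time recall/latency). Higher efConstruction improves graph quality at the cost of slower builds; higher efSearch improves recall at the cost of query latency. Memory is driven mainly by M, the number of links stored per node.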

IVF (Inverted File Index) clusters vectors and searches only the nearest clusters — faster builds, lower memory, slightly lower recall. Good for very large corpora with budget constraints.

Always measure recall@k against a brute-force baseline on a held-out sample before trusting ANN parameters in production.
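The measurement itself is simple: run the same queries through the approximate index and through exact search, then compare the result sets. The sketch below uses a toy IVF-style index (randomly sampled centroids, no k-means refinement) on synthetic unit vectors; the corpus size, dimensionality, and cluster count are arbitrary assumptions for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def brute_force(vectors, query, k):
    """Exact top-k by inner product: the ground truth for recall."""
    order = sorted(range(len(vectors)), key=lambda i: dot(vectors[i], query), reverse=True)
    return order[:k]

def build_ivf(vectors, n_clusters, rng):
    """Toy IVF: sample centroids, assign each vector to its closest centroid."""
    centroids = [vectors[i] for i in rng.sample(range(len(vectors)), n_clusters)]
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        best = max(range(n_clusters), key=lambda c: dot(centroids[c], v))
        lists[best].append(i)
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, k, nprobe):
    """Scan only the `nprobe` clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dot(centroids[c], query), reverse=True)[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: dot(vectors[i], query), reverse=True)
    return candidates[:k]

rng = random.Random(0)
vectors = [unit([rng.gauss(0, 1) for _ in range(16)]) for _ in range(500)]
queries = [unit([rng.gauss(0, 1) for _ in range(16)]) for _ in range(20)]
centroids, lists = build_ivf(vectors, n_clusters=10, rng=rng)

def mean_recall(nprobe, k=10):
    """Average overlap between approximate and exact top-k across queries."""
    total = 0.0
    for q in queries:
        exact = brute_force(vectors, q, k)
        approx = ivf_search(vectors, centroids, lists, q, k, nprobe)
        total += len(set(approx) & set(exact)) / k
    return total / len(queries)

for nprobe in (1, 3, 10):
    print(f"nprobe={nprobe}: recall@10={mean_recall(nprobe):.2f}")
```

Probing all ten clusters is equivalent to exact search, so recall reaches 1.0 there; smaller nprobe values show how much recall is traded for scanning fewer vectors. The same harness applies to HNSW by sweeping efSearch instead of nprobe.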

Re-ranking: bi-encoder vs cross-encoder

First-stage semantic retrieval uses a bi-encoder: query and document are embedded independently, then compared cheaply. This is fast but shallow — the model never sees query and document tokens together.

Cross-encoders (e.g. ms-marco-MiniLM-L-6-v2) feed query + candidate passage through a single transformer and output a relevance score. They are far more accurate but too slow to score every document — use them as a re-ranker on the top 20–100 bi-encoder hits.

Typical latency budget: bi-encoder retrieves 50 candidates in ~20 ms; cross-encoder re-ranks 50 pairs in ~200 ms. For sub-100 ms SLAs, skip re-ranking or use a distilled smaller cross-encoder.
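The two-stage pattern can be sketched with pluggable scoring functions. Both scorers below are deliberately crude stand-ins (word overlap and Jaccard similarity) so the example runs anywhere; in a real system first_stage_score would be a vector similarity and rerank_score would call a cross-encoder model.

```python
def retrieve_then_rerank(query, passages, first_stage_score, rerank_score,
                         n_candidates=50, top_k=8):
    """Cheap scorer selects candidates; expensive scorer orders only those."""
    candidates = sorted(passages, key=lambda p: first_stage_score(query, p),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda p: rerank_score(query, p),
                  reverse=True)[:top_k]

def overlap(query, passage):
    """Stand-in for bi-encoder similarity: count of shared words."""
    return len(set(query.split()) & set(passage.split()))

def jaccard(query, passage):
    """Stand-in for a cross-encoder: looks at query and passage jointly."""
    q, p = set(query.split()), set(passage.split())
    return len(q & p) / len(q | p)

passages = [
    "waterproof hiking jacket",
    "jacket for hiking in rain with waterproof shell and many extra pockets for long trips",
    "cotton t-shirt",
]
ranked = retrieve_then_rerank(
    "waterproof jacket for hiking in rain", passages,
    first_stage_score=overlap, rerank_score=jaccard,
    n_candidates=2, top_k=2,
)
```

The expensive scorer runs on n_candidates passages only, so its cost is bounded regardless of corpus size. That bound is what makes the latency budget above predictable.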

Worked example: e-commerce product discovery

An outdoor-gear retailer has 40,000 product descriptions. Users search “waterproof jacket for hiking in rain.”

Setup

Chunk each product description + bullet features into a single passage per SKU.
Embed with a general text model; store in pgvector with HNSW index.
Parallel BM25 index on title + brand + SKU for exact matches.

Query flow

Embed query → retrieve top 30 by cosine similarity.
BM25 retrieves top 30 on token overlap.
Reciprocal rank fusion merges both lists → top 20 candidates.
Cross-encoder re-ranks top 20 → return top 8 to the UI.

Outcome

Semantic retrieval surfaces a listing titled “Men’s Gore-Tex Shell — All-Weather Trail Protection” (no shared keywords with “waterproof jacket”). BM25 alone ranked it page three. Hybrid + re-rank places it first. NDCG@10 on a 200-query eval set rises from 0.61 (BM25 only) to 0.84 (hybrid + cross-encoder).

Evaluation metrics

Measure retrieval quality before shipping:

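Recall@k — fraction of a query’s relevant documents that appear in the top k results, averaged over queries. (The share of queries with at least one relevant document in the top k is the related hit rate@k.) Primary metric for RAG (downstream LLM cannot use what retrieval missed).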
NDCG@k (Normalized Discounted Cumulative Gain) — rewards placing highly relevant documents higher in the ranked list. Standard for search quality benchmarks.
MRR (Mean Reciprocal Rank) — average of 1/rank of the first relevant hit. Good when users need exactly one correct answer.
Precision@k — fraction of top-k results that are relevant. Useful when UI space is limited.
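These metrics are short enough to implement directly, which helps avoid ambiguity about definitions. The sketch below computes per-query values (average them over the eval set to report a single number). Here recall@k is the share of a query's relevant documents found in the top k, hit@k is 1 when at least one relevant document appears, and the document IDs and graded gains are invented for illustration.

```python
import math

def hit_at_k(ranked, relevant, k):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0

def recall_at_k(ranked, relevant, k):
    """Share of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit; MRR is the mean over queries."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]  # system output for one query
gains = {"d1": 3, "d2": 1}         # graded relevance labels (hypothetical)
relevant = set(gains)

print(hit_at_k(ranked, relevant, 3))        # 1.0
print(recall_at_k(ranked, relevant, 3))     # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5
print(round(ndcg_at_k(ranked, gains, 3), 3))
```

In this example the most relevant document sits at rank 2 and the other relevant document falls outside the top 3, so hit@3 is 1.0 while recall@3 is only 0.5.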

Build a golden set of 100–500 real user queries with human-labeled relevant documents. Re-run eval whenever you change embedding model, chunk size, or fusion weights.

Strategy decision table

Scenario	Recommended approach	Why
Exact SKU / error code lookup	BM25 keyword only	Semantic search adds latency without benefit on token-exact matches
FAQ / documentation with paraphrased questions	Semantic + BM25 hybrid	Meaning match for paraphrases; keywords catch rare technical terms
RAG over private knowledge base	Semantic bi-encoder + cross-encoder re-rank	Bi-encoder maximizes recall@k; re-ranker raises the precision of the passages handed to the LLM
Multilingual catalog (same embedding space)	Multilingual embedding model (e.g. multilingual-E5)	Single index serves cross-language queries without translation step
Sub-50 ms search SLA at 10M+ docs	IVF + product quantization; skip cross-encoder	Trade marginal recall for latency and memory
Highly specialized jargon (legal, medical)	Domain fine-tuned embeddings	General models underperform on out-of-distribution terminology

Common pitfalls

Model mismatch — indexing with model A and querying with model B destroys recall silently.
Chunks too large — 2,000-token passages dilute the embedding; relevant sentences get averaged away. Prefer 256–512 tokens with overlap.
No metadata filtering — returning archived or wrong-locale docs because vector similarity ignores business rules. Pre-filter by tenant, language, date before ANN search.
Stale index — documents updated in the source DB but embeddings not refreshed. Automate re-embedding on content change events.
Similarity threshold blindness — returning low-similarity hits when nothing is truly relevant. Set a minimum cosine score and fall back to “no results” or keyword search.
Ignoring negation — embeddings tend to place “products without gluten” close to products that contain gluten, so negated queries need structured filters, not pure semantic ranking.
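Two of the pitfalls above, missing metadata filters and threshold blindness, can be guarded against in the same search function. This is a minimal sketch over an in-memory list: the field names (locale, archived), the min_score value of 0.5, and the sample vectors are assumptions, and the vectors are taken to be unit-normalized so the dot product equals cosine similarity.

```python
def filtered_search(index, query_vec, allowed, min_score=0.5, k=5):
    """Pre-filter by metadata, rank by cosine, drop hits below min_score."""
    # Business rules first: only entries whose metadata matches every filter.
    candidates = [e for e in index
                  if all(e["meta"].get(field) == value for field, value in allowed.items())]
    scored = sorted(
        ((sum(a * b for a, b in zip(query_vec, e["vector"])), e["id"]) for e in candidates),
        reverse=True,
    )
    # An empty list tells the caller to show "no results" or fall back to keywords.
    return [(score, doc_id) for score, doc_id in scored[:k] if score >= min_score]

index = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"locale": "en", "archived": False}},
    {"id": "b", "vector": [0.8, 0.6], "meta": {"locale": "en", "archived": True}},
    {"id": "c", "vector": [0.0, 1.0], "meta": {"locale": "en", "archived": False}},
    {"id": "d", "vector": [1.0, 0.0], "meta": {"locale": "de", "archived": False}},
]
filters = {"locale": "en", "archived": False}

hits = filtered_search(index, [1.0, 0.0], filters)      # [(1.0, "a")]
no_hits = filtered_search(index, [-1.0, 0.0], filters)  # [] -> fall back
```

The threshold should be calibrated per embedding model on labeled queries, since similarity score distributions differ widely between models.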

Production checklist

Define chunk size, overlap, and metadata schema before indexing.
Pin embedding model version; document preprocessing (prefixes, normalization).
Build labeled eval set; baseline BM25-only NDCG@10 before adding vectors.
Tune ANN parameters against brute-force recall@k on a sample.
Implement hybrid fusion if corpus contains proper nouns, SKUs, or codes.
Add cross-encoder re-ranker if latency budget allows (>100 ms acceptable).
Enforce metadata filters (tenant, locale, visibility) before vector search.
Set minimum similarity threshold; handle empty-result UX gracefully.
Automate re-embedding pipeline on document create/update/delete.
Monitor query latency p95, recall@k on sampled queries, and index size growth.

Key takeaways

Semantic search ranks by meaning in embedding space — it solves synonym and paraphrase gaps that keyword search cannot.
The pipeline is chunk → embed → ANN index → query embed → retrieve → (re-rank); model consistency across index and query is non-negotiable.
Hybrid search combining BM25 and dense retrieval is the production default for most real corpora.
Cross-encoder re-ranking on top-k bi-encoder hits materially improves precision at modest latency cost.
Evaluate with recall@k and NDCG@10 on your own labeled queries — never ship on leaderboard faith alone.