Guide
Text embedding models explained
Harbor Support's first RAG stack indexed 42,000 help articles with BM25 and returned irrelevant chunks whenever customers used different words than the docs: “refund never arrived” missed pages titled “chargeback processing delays.” Swapping the retriever to a text embedding model (a bi-encoder that maps queries and passages into the same dense vector space) lifted first-hit recall@10 from 61% to 84% on their labeled ticket set, with no change to the downstream LLM.
Text embedding models are the workhorse of modern semantic search: they compress meaning into fixed-size vectors so cosine similarity approximates relevance at millisecond latency inside a vector database.
This guide covers bi-encoder architecture and pooling, contrastive training objectives, popular model families (E5, BGE, GTE, API embeddings), normalization and Matryoshka dimensions, domain fine-tuning, the Harbor Support retrieval refactor, a model decision table, pitfalls, and a production checklist.
From tokens to fixed-size sentence vectors
A text embedding model is a neural encoder — usually a
transformer — followed by a pooling step that
collapses variable-length token hidden states into one vector per input.
Given text $T$, the model outputs
$e(T) \in \mathbb{R}^d$, where $d$ is
the embedding dimension (384, 768, 1024, or 3072 depending on the
checkpoint).
Retrieval compares a query embedding $e(q)$ against precomputed
document embeddings $e(d_i)$. Because documents are
encoded independently, this bi-encoder pattern scales:
index millions of vectors offline, then run one forward pass per query at
request time. The trade-off is that query and document never attend to
each other during encoding — cross-encoders (which jointly score
query-document pairs) are more accurate but too slow for first-stage
retrieval over large corpora.
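The offline/online split described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the three-dimensional vectors stand in for real encoder outputs, the function names (`l2_normalize`, `top_k`) are made up for this example, and a brute-force scan takes the place of an ANN index. Plain Python lists keep it dependency-free; a real pipeline would likely use NumPy or a vector database.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, cosine_score) pairs, best first.

    Assumes doc_vecs are already unit length, so dot product == cosine.
    """
    q = l2_normalize(query_vec)
    scored = [(i, dot(q, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Offline: encode and normalize every document once (toy vectors here).
doc_index = [l2_normalize(v) for v in [[1.0, 0.0, 0.0],
                                       [0.6, 0.8, 0.0],
                                       [0.0, 0.0, 1.0]]]

# Online: one encoder pass per query (faked as a fixed vector), then a scan.
hits = top_k([0.9, 0.1, 0.0], doc_index, k=2)
print(hits)
```

The query never sees the documents during encoding; all interaction happens in the final dot product, which is what makes the index precomputable.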
Pooling strategies
- Mean pooling — average token hidden states (masking padding). Default for Sentence-BERT and most open models.
- CLS pooling — use the first token vector. Common in BERT-style checkpoints when the CLS was pre-trained for classification.
- Last-token pooling — take the final non-padding hidden state. Standard for decoder-only LLMs used as embedders (e.g. E5-mistral).
- Weighted mean / SPLADE-style sparse — hybrid extensions; dense pooling remains the default for general-purpose semantic search.
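To make the three dense pooling strategies concrete, here is a small sketch that operates on a list of per-token hidden states and an attention mask. The function names are illustrative, the hidden states are toy numbers rather than real model outputs, and last-token pooling assumes right padding (left-padded batches would need the final position instead).

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where mask == 1, ignoring padding positions."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / max(count, 1) for t in total]

def cls_pool(hidden_states, attention_mask):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return list(hidden_states[0])

def last_token_pool(hidden_states, attention_mask):
    """Use the last non-padding token's vector (assumes right padding)."""
    last = max(i for i, keep in enumerate(attention_mask) if keep)
    return list(hidden_states[last])

# Three real tokens followed by one padding position, hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(states, mask))        # padding row is excluded from the average
print(cls_pool(states, mask))
print(last_token_pool(states, mask))
```

Which function applies is a property of the checkpoint, so the pooling mode should be read from the model card rather than chosen freely.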
After pooling, most production pipelines L2-normalize vectors so dot product equals cosine similarity, enabling efficient inner-product indexes (HNSW, IVF-PQ) without storing norms separately.
How models learn: contrastive and dual-encoder training
Modern text embedders are trained with
contrastive learning:
pull similar pairs close, push dissimilar pairs apart in embedding space.
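Given a query $q$ and a relevant passage $p^+$,
with in-batch negatives $p_j^-$, the InfoNCE loss is:
$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, p^+) / \tau)}{\sum_j \exp(\mathrm{sim}(q, p_j) / \tau)}$$
where the sum over $p_j$ covers the positive $p^+$ and every in-batch negative $p_j^-$.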
Temperature τ controls gradient sharpness; large batches
(thousands of negatives via multi-GPU or memory banks) materially improve
retrieval quality. Training data mixes natural query-document pairs
(MS MARCO, NQ), synthetic pairs from LLM paraphrase, and hard-negative
mining where a BM25 top result is not the label.
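A minimal sketch of InfoNCE with in-batch negatives follows. It assumes the similarities have already been computed into a square matrix where row i holds the scores of query i against every passage in the batch, with the matching passage on the diagonal. The function name and the default temperature of 0.05 are choices made for this example, not values from any particular training recipe.

```python
import math

def info_nce(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] = sim(q_i, p_j); the positive for query i is p_i,
    and every other passage in the row acts as an in-batch negative.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)

separated = [[0.9, 0.1], [0.2, 0.8]]   # positives clearly score highest
confused = [[0.5, 0.5], [0.5, 0.5]]    # model cannot tell positives apart

print(info_nce(separated))
print(info_nce(confused))
```

When the model cannot distinguish the positive from the negatives, the loss equals the log of the batch size; training pushes it toward zero. A lower temperature sharpens the softmax, so small similarity gaps produce larger gradients.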
Instruction prefixes and asymmetric retrieval
Models like E5 and BGE expect task-specific prefixes on their inputs.
E5 uses “query: ” for questions and “passage: ” for documents, while
BGE v1.5 prepends a retrieval instruction to queries only. The prefix teaches the encoder that
queries and passages occupy the same space but with different roles,
which is critical for asymmetric retrieval where query length differs from document
length. Skipping prefixes on E5 checkpoints can drop nDCG by 10+ points.
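One way to keep prefixes from being forgotten is to route every string through a single helper before encoding. The sketch below is an assumption about how such a helper might look: the dictionary names and `with_prefix` are invented for illustration, the E5 strings follow the query/passage convention described above, and the BGE string is the instruction quoted in the Harbor Support section. Always confirm the exact strings against the model card of the checkpoint in use.

```python
# Prefix tables keyed by role. Names are illustrative, not a library API.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}
BGE_EN_V15_PREFIXES = {
    "query": "Represent this sentence for searching relevant passages: ",
    "passage": "",  # assumption: BGE v1.5 passages are encoded without a prefix
}

def with_prefix(text, role, prefixes):
    """Return text with the role-specific prefix the model was trained on."""
    if role not in prefixes:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(prefixes)}")
    return prefixes[role] + text.strip()

print(with_prefix("refund never arrived", "query", E5_PREFIXES))
print(with_prefix("Chargeback processing delays", "passage", E5_PREFIXES))
print(with_prefix("refund never arrived", "query", BGE_EN_V15_PREFIXES))
```

Raising on an unknown role is deliberate: silently encoding with no prefix is the failure mode this helper exists to prevent.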
Model families and when to pick each
- Small bi-encoders (MiniLM, all-MiniLM-L6-v2, bge-small) — 384-dim, <100 MB, CPU-friendly. Good for <500k docs and latency-sensitive edge routing.
- Mid-size open models (bge-base/large-en-v1.5, e5-base/large-v2, gte-large) — strong MTEB scores, fit on a single GPU for batch indexing. Default choice for self-hosted RAG.
- LLM-backed and long-context embedders (e5-mistral-7b-instruct, nomic-embed-text-v1.5) — higher quality on long context and nuanced paraphrase; the 7B-class models need a GPU and careful batching.
- API embeddings (OpenAI text-embedding-3, Cohere embed-v3, Voyage) — no ops burden, Matryoshka dimension truncation on some models (e.g. text-embedding-3), pay per token. Watch data residency and vendor lock-in.
- Multilingual checkpoints (multilingual-e5-large, bge-m3) — shared space across 100+ languages; bge-m3 adds sparse+dense hybrid retrieval in one model.
Matryoshka and dimension trade-offs
Matryoshka representation learning trains embeddings so
the first k dimensions remain useful when truncated. OpenAI
text-embedding-3-large can be cut from 3072 to 1024 or 256 dims, and
nomic-embed-text-v1.5 from 768 down to 64, with modest recall loss;
index RAM shrinks in proportion to the dimension, and ANN search speeds up.
Always re-evaluate truncation on your domain; legal and medical text often
needs full dimensionality.
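Truncating a Matryoshka embedding client-side amounts to keeping the leading dimensions and re-normalizing, since a prefix of a unit vector is generally no longer unit length. The sketch below shows that step plus a rough raw-storage estimate; both function names are hypothetical, and the estimate assumes float32 values and ignores ANN graph overhead.

```python
import math

def truncate_embedding(vec, k):
    """Keep the first k dimensions and re-normalize to unit length."""
    if not 0 < k <= len(vec):
        raise ValueError(f"k must be in 1..{len(vec)}, got {k}")
    head = list(vec[:k])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage only (float32 by default), excluding index overhead."""
    return num_vectors * dim * bytes_per_value

short = truncate_embedding([3.0, 4.0, 12.0], 2)
print(short)

# Storage scales linearly with dimension.
print(index_bytes(1_000_000, 3072), index_bytes(1_000_000, 1024))
```

This only works for checkpoints trained with a Matryoshka objective; truncating an ordinary embedding the same way usually costs far more recall.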
Harbor Support RAG retrieval refactor
Harbor Support replaced BM25-first retrieval with a two-stage pipeline:
- Bi-encoder recall — bge-base-en-v1.5 encodes all KB chunks (512-token windows, 64-token overlap) into 768-dim normalized vectors stored in Qdrant with HNSW (ef_construct=200, M=16).
- Hybrid rerank — top-50 cosine hits fused with BM25 via RRF (k=60), then a cross-encoder (bge-reranker-base) scores top-20 pairs for the final top-5 context window.
Queries prepend “Represent this sentence for searching relevant passages:”
per BGE instructions. Offline indexing runs at 1,800 chunks/sec on one
A10G; online query encoding adds ~18 ms p95. Domain fine-tuning on
12,000 historical ticket–resolution pairs (3 epochs, 32 in-batch
negatives) pushed recall@10 another 6 points on holdout tickets mentioning billing
synonyms. The LLM prompt and generation parameters stayed fixed;
retrieval quality was the lever.
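The fusion step in the pipeline above uses reciprocal rank fusion, which needs only the rank of each document in each list and so sidesteps the problem of mixing cosine scores with BM25 scores. A minimal sketch, assuming each retriever returns an ordered list of document IDs (the IDs and the `rrf_fuse` name are invented for this example; k=60 matches the value Harbor Support uses):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["a", "b", "c"]   # bi-encoder order
bm25_hits = ["b", "d", "a"]    # lexical order

fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Documents found by both retrievers rise to the top, while documents found by only one are still kept as candidates for the cross-encoder stage.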
Embedding model decision table
| Approach | Latency profile | Best when | Watch out for |
|---|---|---|---|
| Small bi-encoder (384-d) | <10 ms encode on CPU | Edge routing, <500k docs, tight RAM | Weak on long documents and rare jargon |
| Mid open bi-encoder (768–1024-d) | ~20–50 ms GPU / ~100 ms CPU | Default self-hosted RAG and catalog search | Needs instruction prefixes for E5/BGE |
| LLM embedder (7B class) | 100–300 ms unless batched | Hard paraphrase, long-context chunks | VRAM, batching complexity, cost at scale |
| API embeddings | Network-bound, ~50–200 ms | Fast MVP, no GPU ops, Matryoshka dims | Per-token cost, privacy, rate limits |
| Cross-encoder reranker | 10–30 ms per pair | Second stage after bi-encoder recall | Not a replacement for first-stage retrieval |
| BM25 / sparse only | Sub-ms inverted index | Exact SKU, error codes, rare tokens | Misses paraphrase and semantic overlap |
Common pitfalls
- Wrong instruction prefix — E5 and BGE expect different strings; copy from model cards, do not invent generic “embed this text” prompts.
- Encoding queries and documents identically when the model is asymmetric — use query vs passage prefixes even if both strings look like plain sentences.
- Chunking longer than the model context — silent truncation drops tail content; chunk to 75% of max tokens with overlap.
- Skipping normalization — unnormalized vectors make inner-product indexes rank by vector length as well as direction, and skew any score-based fusion with BM25.
- Evaluating on keyword overlap metrics only — paraphrase-heavy tasks need labeled relevance sets or LLM-judged nDCG, not BLEU.
- Re-embedding the corpus on every model tweak — version embedding model ID in index metadata; blue/green reindex for upgrades.
- Assuming bigger dimension always wins — Matryoshka truncation plus domain fine-tune often beats a larger general model.
- Ignoring hybrid retrieval — dense-only misses exact IDs, error codes, and product SKUs that BM25 catches in one hop.
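For the chunking pitfall in particular, a sliding window over token IDs avoids silent truncation. The sketch below is one possible implementation under stated assumptions: it operates on an already-tokenized list, the function names are hypothetical, the defaults mirror the 512-token window and 64-token overlap from the Harbor Support setup, and `safe_window` applies the 75% rule of thumb from the list above.

```python
def safe_window(max_tokens, fraction=0.75):
    """Window size that leaves headroom for prefixes and special tokens."""
    return int(max_tokens * fraction)

def chunk_tokens(tokens, window=512, overlap=64):
    """Split a token list into overlapping windows; the last chunk may be shorter."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

tokens = list(range(1000))            # stand-in for real token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(safe_window(512))
```

Because consecutive chunks share the overlap region, a sentence that straddles a boundary still appears whole in at least one chunk.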
Production checklist
- Pick bi-encoder vs API based on privacy, cost, and GPU availability; benchmark on held-out query-passage pairs from your domain.
- Apply correct query and passage prefixes from the model card before any evaluation.
- Define chunk size, overlap, and metadata (title, section, URL) stored alongside vectors.
- L2-normalize embeddings; confirm ANN index uses inner product or cosine consistently.
- Measure recall@k and MRR on a labeled set before tuning the LLM prompt.
- Add hybrid BM25 + dense fusion (RRF or weighted) when corpus mixes jargon and paraphrase.
- Plan cross-encoder reranking budget: top-20 pairs is a common sweet spot.
- Version index by model name + checkpoint; automate reindex on model upgrades.
- Test Matryoshka truncation if RAM or QPS is constrained; plot recall vs dimension.
- Consider domain fine-tune when open models lag on internal vocabulary by >10 points recall@10.
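The measurement step in the checklist can be sketched with two small functions. This is an illustrative evaluation harness, not a standard library API: the names are made up, recall@k is computed here as the fraction of labeled relevant documents found in the top k, and each labeled example is assumed to be a pair of (ranked result IDs, set of relevant IDs).

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=10):
    """Average recall@k and MRR over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return {"recall@k": 0.0, "mrr": 0.0}
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }

runs = [
    (["a", "b", "c"], {"b"}),   # relevant doc found at rank 2
    (["x", "y", "z"], {"q"}),   # relevant doc missed entirely
]
metrics = evaluate(runs, k=10)
print(metrics)
```

Running the same harness before and after each retrieval change (new model, new prefixes, truncated dimensions) keeps comparisons on a fixed labeled set.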
Key takeaways
- Text embedding models map variable-length text to fixed dense vectors so similarity search approximates semantic relevance.
- Bi-encoders encode queries and documents independently — fast at scale, but less accurate than cross-encoders that score pairs jointly.
- Contrastive training with hard negatives and instruction prefixes (E5, BGE) drives most of the retrieval quality gap over raw LLM hidden states.
- Hybrid dense + BM25 retrieval with optional cross-encoder reranking is the production default for RAG and support search.
- Domain fine-tuning and Matryoshka dimension tuning often beat blindly upgrading to a larger general embedding model.