Text embedding models explained

Harbor Support's first RAG stack indexed 42,000 help articles with BM25 and returned irrelevant chunks whenever customers used different words than the docs — “refund never arrived” missed pages titled “chargeback processing delays.” Swapping the retriever to a text embedding model (a bi-encoder that maps queries and passages into the same dense vector space) lifted first-hit recall@10 from 61% to 84% on their labeled ticket set, with no change to the downstream LLM. Text embedding models are the workhorse of modern semantic search: they compress meaning into fixed-size vectors so cosine similarity approximates relevance at millisecond latency inside a vector database. This guide covers bi-encoder architecture and pooling, contrastive training objectives, popular model families (E5, BGE, GTE, API embeddings), normalization and Matryoshka dimensions, domain fine-tuning, the Harbor Support retrieval refactor, a model decision table, pitfalls, and a production checklist.

From tokens to fixed-size sentence vectors

A text embedding model is a neural encoder — usually a transformer — followed by a pooling step that collapses variable-length token hidden states into one vector per input. Given text T, the model outputs e(T) ∈ ℝ^d where d is the embedding dimension (384, 768, 1024, or 3072 depending on the checkpoint).

Retrieval compares a query embedding e(q) against precomputed document embeddings e(d_i). Because documents are encoded independently, this bi-encoder pattern scales: index millions of vectors offline, then run one forward pass per query at request time. The trade-off is that query and document never attend to each other during encoding — cross-encoders (which jointly score query-document pairs) are more accurate but too slow for first-stage retrieval over large corpora.
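The bi-encoder flow above can be sketched with brute-force cosine scoring. This is a minimal illustration, not a production retriever: the three-dimensional vectors are made-up stand-ins for real encoder output, and the names `cosine`, `top_k`, and `doc_vecs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=3):
    """Rank precomputed document vectors against one query vector."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Toy vectors (assumed values); in practice these come from the encoder
# and the document side is computed once, offline.
doc_vecs = {
    "chargeback-delays": [0.9, 0.1, 0.0],
    "reset-password": [0.0, 0.2, 0.9],
    "refund-policy": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for e("refund never arrived")

results = top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in results])  # ['chargeback-delays', 'refund-policy']
```

A vector database replaces the linear scan in `top_k` with an approximate nearest-neighbor index, but the scoring idea is the same.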

Pooling strategies

Mean pooling — average token hidden states (masking padding). Default for Sentence-BERT and most open models.
CLS pooling — use the first token vector. Common in BERT-style checkpoints when the CLS was pre-trained for classification.
Last-token pooling — take the final non-padding hidden state. Standard for decoder-only LLMs used as embedders (for example, e5-mistral-7b-instruct).
Weighted mean / SPLADE-style sparse — hybrid extensions; dense pooling remains the default for general-purpose semantic search.
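As a rough sketch of the three dense pooling strategies listed above, the functions below operate on plain Python lists. The function names are hypothetical, and `last_token_pool` assumes right-padded input; real libraries apply the same logic to batched tensors.

```python
def mean_pool(hidden, mask):
    """Average the hidden states of real tokens (mask == 1), ignoring padding."""
    dim = len(hidden[0])
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]

def cls_pool(hidden, mask):
    """Use the first token's hidden state (the CLS position in BERT-style models)."""
    return list(hidden[0])

def last_token_pool(hidden, mask):
    """Use the last non-padding hidden state (assumes padding is on the right)."""
    last = max(i for i, m in enumerate(mask) if m == 1)
    return list(hidden[last])

# Four token positions with 2-dim hidden states; the last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(hidden, mask))        # [2.0, 2.0]
print(cls_pool(hidden, mask))         # [1.0, 0.0]
print(last_token_pool(hidden, mask))  # [2.0, 4.0]
```

Note that the padding row `[9.0, 9.0]` never reaches any output; forgetting the mask in mean pooling is a common source of quietly degraded embeddings.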

After pooling, most production pipelines L2-normalize vectors so dot product equals cosine similarity, enabling efficient inner-product indexes (HNSW, IVF-PQ) without storing norms separately.
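A small check of the normalization claim above: once both vectors have unit length, their dot product matches the cosine similarity of the originals. The helper names are illustrative only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

u = [3.0, 4.0]
v = [4.0, 3.0]

# Cosine similarity computed the long way, with explicit norms.
cosine_uv = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Plain dot product after normalizing both vectors once, up front.
dot_normalized = dot(l2_normalize(u), l2_normalize(v))

print(math.isclose(cosine_uv, dot_normalized))  # True (both are 0.96 up to rounding)
```

Because normalization happens once at index time, the index can use the cheaper inner-product metric at query time.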

How models learn: contrastive and dual-encoder training

Modern text embedders are trained with contrastive learning: pull similar pairs close and push dissimilar pairs apart in embedding space. Given a query q, a relevant passage p⁺, and in-batch negatives p⁻_j, the InfoNCE loss is:

L = -log [ exp(sim(q, p⁺) / τ) / ( exp(sim(q, p⁺) / τ) + ∑_j exp(sim(q, p⁻_j) / τ) ) ]

Temperature τ controls how sharply the softmax concentrates on the hardest negatives; large batches (thousands of negatives via multi-GPU or memory banks) materially improve retrieval quality. Training data mixes natural query-document pairs (MS MARCO, NQ), synthetic pairs from LLM paraphrase, and mined hard negatives: passages that BM25 ranks highly but that are not labeled relevant.
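The InfoNCE objective with in-batch negatives can be sketched in a few lines. This is an illustrative, framework-free version: the function names are hypothetical, the default temperature of 0.05 is an assumed value rather than one taken from any particular model, and real training code would compute the same quantity on tensors with automatic differentiation.

```python
import math

def info_nce(sim_row, positive_index, temperature=0.05):
    """InfoNCE loss for one query, given its similarity to every passage in the batch.

    The denominator covers the positive and all negatives.
    """
    logits = [s / temperature for s in sim_row]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denominator - logits[positive_index]

def batch_info_nce(sim_matrix, temperature=0.05):
    """Mean loss over a batch where passage i is the positive for query i."""
    losses = [info_nce(row, i, temperature) for i, row in enumerate(sim_matrix)]
    return sum(losses) / len(losses)

# Rows are queries, columns are passages; the diagonal holds the positives.
well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.5, 0.5], [0.5, 0.5]]

print(batch_info_nce(well_separated) < batch_info_nce(confused))  # True
```

With two equally similar passages the loss equals log 2 regardless of temperature, which is a handy sanity check when wiring up a training loop.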

Instruction prefixes and asymmetric retrieval

Models like E5 and BGE mark the role of each input with a text prefix. E5 prepends “query: ” to questions and “passage: ” to documents; BGE v1.5 English checkpoints prepend a retrieval instruction to queries only and encode passages as-is. The prefix teaches the encoder that queries and passages share one space but play different roles, which matters for asymmetric retrieval where a short query must match a much longer document. Skipping prefixes on E5 checkpoints can drop nDCG by 10+ points.
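One way to keep prefixes consistent is a single lookup keyed by model family and input role. The sketch below is hypothetical: the family labels `"e5"` and `"bge-en-v1.5"` and the helper `with_prefix` are invented for illustration, and the prefix strings should be checked against the model card of the exact checkpoint in use.

```python
# Prefix strings as commonly documented on the model cards; treat them as
# assumptions and confirm against the card for your checkpoint.
PREFIXES = {
    "e5": {"query": "query: ", "passage": "passage: "},
    "bge-en-v1.5": {
        "query": "Represent this sentence for searching relevant passages: ",
        "passage": "",  # passages are encoded without an instruction
    },
}

def with_prefix(text, role, family):
    """Return text with the prefix for the given family and role ('query' or 'passage')."""
    try:
        prefix = PREFIXES[family][role]
    except KeyError:
        raise ValueError(f"unknown family/role: {family!r}/{role!r}") from None
    return prefix + text

print(with_prefix("refund never arrived", "query", "e5"))
# query: refund never arrived
```

Routing every encode call through one helper like this makes it harder to index with one convention and query with another.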

Model families and when to pick each

Small bi-encoders (all-MiniLM-L6-v2, bge-small) — 384-dim, roughly 100 MB, CPU-friendly. Good for <500k docs and latency-sensitive edge routing.
Mid-size open models (bge-base/large-en-v1.5, e5-base/large-v2, gte-large) — strong MTEB scores, fit on a single GPU for batch indexing. Default choice for self-hosted RAG.
LLM-backed embedders (e5-mistral-7b-instruct) — higher quality on long context and nuanced paraphrase; need GPU and careful batching. nomic-embed-text-v1.5 also targets long inputs but is a much smaller BERT-style encoder, closer in cost to the mid-size tier.
API embeddings (OpenAI text-embedding-3, Cohere embed-v3, Voyage) — no ops burden, pay per token, and Matryoshka dimension truncation on some offerings (for example OpenAI text-embedding-3). Watch data residency and vendor lock-in.
Multilingual checkpoints (multilingual-e5-large, bge-m3) — shared space across 100+ languages; bge-m3 adds sparse+dense hybrid retrieval in one model.

Matryoshka and dimension trade-offs

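Matryoshka representation learning trains embeddings so that the first k dimensions remain useful when the rest are truncated. OpenAI text-embedding-3-large can be shortened from 3072 to 1024 or 256 dims, and nomic-embed-text-v1.5 from 768 to 512 or 256, with modest recall loss. Index RAM shrinks in proportion to the dimension, and ANN search gets faster. Re-normalize truncated vectors before indexing, and always re-evaluate truncation on your domain; specialized text such as legal or medical content may need full dimensionality.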
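Truncation itself is simple: keep the leading dimensions, then re-normalize so dot product still behaves like cosine. The sketch below also works through the raw storage arithmetic. The one-million-vector corpus is a hypothetical figure, float32 storage (4 bytes per value) is an assumption, and the estimate ignores index overhead such as HNSW graph links.

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` values of a Matryoshka embedding and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Storage for the raw vectors only (float32 assumed), excluding index overhead."""
    return num_vectors * dims * bytes_per_value

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector
short = truncate_and_renormalize(full, 2)

# Hypothetical corpus of one million vectors.
for dims in (3072, 1024, 256):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB")
```

Going from 3072 to 1024 dimensions cuts raw vector storage to one third, and 256 dimensions to one twelfth; whether recall holds up at those sizes is an empirical question for your own labeled set.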

Harbor Support RAG retrieval refactor

Harbor Support replaced BM25-first retrieval with a two-stage pipeline:

Bi-encoder recall — bge-base-en-v1.5 encodes all KB chunks (512-token windows, 64-token overlap) into 768-dim normalized vectors stored in Qdrant with HNSW (ef_construct=200, M=16).
Hybrid rerank — top-50 cosine hits fused with BM25 via RRF (k=60), then a cross-encoder (bge-reranker-base) scores top-20 pairs for the final top-5 context window.

Queries are prefixed with “Represent this sentence for searching relevant passages: ” per the BGE model card. Offline indexing runs at 1,800 chunks/sec on one A10G; online query encoding adds ~18 ms at p95. Domain fine-tuning on 12,000 historical ticket–resolution pairs (3 epochs, 32 in-batch negatives) pushed recall@10 up another 6 points on holdout tickets mentioning billing synonyms. The LLM prompt and generation parameters stayed fixed; retrieval quality was the lever.
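The fusion step in the hybrid stage can be illustrated with a small reciprocal rank fusion (RRF) function. This is a generic sketch of RRF with k=60, not Harbor Support's actual code; the function name and the toy document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_hits = ["a", "b", "c"]  # ranked by cosine similarity
bm25_hits = ["c", "a", "d"]   # ranked by BM25

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

RRF uses only rank positions, so it needs no calibration between cosine scores and BM25 scores. Documents that appear in both lists ("a" and "c" here) rise above those found by a single retriever.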

Embedding model decision table

Approach	Latency profile	Best when	Watch out for
Small bi-encoder (384-d)	<10 ms encode on CPU	Edge routing, <500k docs, tight RAM	Weak on long documents and rare jargon
Mid open bi-encoder (768–1024-d)	~20–50 ms GPU / ~100 ms CPU	Default self-hosted RAG and catalog search	Needs instruction prefixes for E5/BGE
LLM embedder (7B class)	100–300 ms unless batched	Hard paraphrase, long-context chunks	VRAM, batching complexity, cost at scale
API embeddings	Network-bound, ~50–200 ms	Fast MVP, no GPU ops, Matryoshka dims	Per-token cost, privacy, rate limits
Cross-encoder reranker	10–30 ms per pair	Second stage after bi-encoder recall	Not a replacement for first-stage retrieval
BM25 / sparse only	Sub-ms inverted index	Exact SKU, error codes, rare tokens	Misses paraphrase and semantic overlap

Common pitfalls

Wrong instruction prefix — E5 and BGE expect different strings; copy from model cards, do not invent generic “embed this text” prompts.
Encoding queries and documents identically when the model is asymmetric — use query vs passage prefixes even if both strings look like plain sentences.
Chunking longer than the model context — silent truncation drops tail content; chunk to 75% of max tokens with overlap.
Skipping normalization — on an inner-product index, unnormalized vectors make dot product diverge from cosine similarity, and they skew weighted score fusion with BM25.
Evaluating on keyword overlap metrics only — paraphrase-heavy tasks need labeled relevance sets or LLM-judged nDCG, not BLEU.
Re-embedding the corpus on every model tweak — version embedding model ID in index metadata; blue/green reindex for upgrades.
Assuming bigger dimension always wins — Matryoshka truncation plus domain fine-tune often beats a larger general model.
Ignoring hybrid retrieval — dense-only misses exact IDs, error codes, and product SKUs that BM25 catches in one hop.
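For the chunking pitfall above, a sliding window over token ids keeps every chunk under the model limit while preserving overlap. A minimal sketch, assuming the input is a list of tokens produced by the embedding model's own tokenizer (counting whitespace-separated words instead will underestimate length); the function name is hypothetical.

```python
def chunk_tokens(tokens, window, overlap):
    """Split a token list into windows of at most `window` tokens that share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

tokens = list(range(10))  # stand-in for real token ids
print(chunk_tokens(tokens, window=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

For a 512-token model, the 75% guideline gives a window of 384 tokens, which leaves headroom for special tokens and any instruction prefix.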

Production checklist

Pick bi-encoder vs API based on privacy, cost, and GPU availability; benchmark on held-out query-passage pairs from your domain.
Apply correct query and passage prefixes from the model card before any evaluation.
Define chunk size, overlap, and metadata (title, section, URL) stored alongside vectors.
L2-normalize embeddings; confirm ANN index uses inner product or cosine consistently.
Measure recall@k and MRR on a labeled set before tuning the LLM prompt.
Add hybrid BM25 + dense fusion (RRF or weighted) when corpus mixes jargon and paraphrase.
Plan cross-encoder reranking budget: top-20 pairs is a common sweet spot.
Version index by model name + checkpoint; automate reindex on model upgrades.
Test Matryoshka truncation if RAM or QPS is constrained; plot recall vs dimension.
Consider domain fine-tune when open models lag on internal vocabulary by >10 points recall@10.
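The retrieval metrics named in the checklist are cheap to compute once a labeled set exists. The sketch below uses common definitions of recall@k and mean reciprocal rank; definitions vary slightly between papers and toolkits, and the function names and toy labels are illustrative.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Two toy queries with labeled relevant documents.
runs = [
    (["d3", "d1", "d7"], {"d1", "d9"}),  # first relevant hit at rank 2
    (["d4", "d5", "d6"], {"d4"}),        # first relevant hit at rank 1
]

print(recall_at_k(*runs[0], k=3))  # 0.5
print(mean_reciprocal_rank(runs))  # 0.75
```

Tracking both is useful: recall@k shows whether the right passage reaches the reranker at all, while MRR shows how high it lands.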

Key takeaways

Text embedding models map variable-length text to fixed dense vectors so similarity search approximates semantic relevance.
Bi-encoders encode queries and documents independently — fast at scale, but less accurate than cross-encoders that score pairs jointly.
Contrastive training with hard negatives and instruction prefixes (E5, BGE) drives most of the retrieval quality gap over raw LLM hidden states.
Hybrid dense + BM25 retrieval with optional cross-encoder reranking is the production default for RAG and support search.
Domain fine-tuning and Matryoshka dimension tuning often beat blindly upgrading to a larger general embedding model.