Text embedding models explained

Harbor Support's first RAG stack indexed 42,000 help articles with BM25 and returned irrelevant chunks whenever customers used different words than the docs — “refund never arrived” missed pages titled “chargeback processing delays.” Swapping the retriever to a text embedding model (a bi-encoder that maps queries and passages into the same dense vector space) lifted first-hit recall@10 from 61% to 84% on their labeled ticket set, with no change to the downstream LLM. Text embedding models are the workhorse of modern semantic search: they compress meaning into fixed-size vectors so cosine similarity approximates relevance at millisecond latency inside a vector database. This guide covers bi-encoder architecture and pooling, contrastive training objectives, popular model families (E5, BGE, GTE, API embeddings), normalization and Matryoshka dimensions, domain fine-tuning, the Harbor Support retrieval refactor, a model decision table, pitfalls, and a production checklist.

From tokens to fixed-size sentence vectors

A text embedding model is a neural encoder, usually a transformer, followed by a pooling step that collapses variable-length token hidden states into one vector per input. Given text T, the model outputs e(T) ∈ ℝ^d, where d is the embedding dimension (384, 768, 1024, or 3072 depending on the checkpoint).

Retrieval compares a query embedding e(q) against precomputed document embeddings e(d_i). Because documents are encoded independently, this bi-encoder pattern scales: index millions of vectors offline, then run one forward pass per query at request time. The trade-off is that query and document never attend to each other during encoding. Cross-encoders, which jointly score query-document pairs, are more accurate but too slow for first-stage retrieval over large corpora.
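A minimal sketch of the request-time half of this pattern, assuming document vectors were already computed offline. The three-dimensional toy vectors and the document IDs are hypothetical stand-ins for real encoder output, and the brute-force scan stands in for an ANN index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical precomputed e(d_i); real vectors come from the encoder.
doc_vecs = {
    "chargeback-delays": [0.9, 0.1, 0.0],
    "reset-password": [0.0, 0.2, 0.9],
    "refund-policy": [0.7, 0.6, 0.1],
}
query_vec = [0.8, 0.3, 0.0]  # hypothetical e(q), one forward pass per request

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The document side never sees the query, which is why the vectors can be built once and reused for every request.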

Pooling strategies

  • Mean pooling — average token hidden states (masking padding). Default for Sentence-BERT and most open models.
  • CLS pooling — use the first token's vector. Common in BERT-style checkpoints trained to summarize the input in the [CLS] position; BGE checkpoints use it.
  • Last-token pooling — take the final non-padding hidden state. Standard for decoder-only LLMs used as embedders (such as e5-mistral-7b-instruct).
  • Weighted mean / SPLADE-style sparse — hybrid extensions; dense pooling remains the default for general-purpose semantic search.
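The three dense strategies above can be sketched on plain lists. This is an illustrative, dependency-free version; `hidden` (one vector per token) and `mask` (1 for real tokens, 0 for padding) are hypothetical inputs, and right padding is assumed.

```python
def mean_pool(hidden, mask):
    """Average the token vectors where mask == 1, ignoring padding."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cls_pool(hidden, mask):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return list(hidden[0])

def last_token_pool(hidden, mask):
    """Use the last non-padding token's vector (assumes right padding)."""
    last = max(i for i, m in enumerate(mask) if m == 1)
    return list(hidden[last])

# Four token positions with 2-dim hidden states; the last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(hidden, mask))        # padding row is excluded from the average
print(cls_pool(hidden, mask))
print(last_token_pool(hidden, mask))
```

Use whichever pooling the checkpoint was trained with; swapping strategies at inference time usually degrades quality.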

After pooling, most production pipelines L2-normalize vectors so dot product equals cosine similarity, enabling efficient inner-product indexes (HNSW, IVF-PQ) without storing norms separately.
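A short sketch of why normalization lets an inner-product index stand in for cosine similarity. The vectors are made up for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization the plain dot product matches cosine similarity.
print(dot(a_unit, b_unit), cosine(a, b))
```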

How models learn: contrastive and dual-encoder training

Modern text embedders are trained with contrastive learning: pull similar pairs close, push dissimilar pairs apart in embedding space. Given a query q, a relevant passage p^+, and in-batch negatives p_j^-, the InfoNCE loss is:

$$L = -\log \frac{\exp(\mathrm{sim}(q, p^+)/\tau)}{\sum_j \exp(\mathrm{sim}(q, p_j)/\tau)}$$

where the sum over p_j covers the positive p^+ and every negative p_j^-.

Temperature τ controls how sharply the softmax focuses on the hardest negatives (smaller τ, sharper gradients); large batches (thousands of negatives via multi-GPU gathering or memory banks) materially improve retrieval quality. Training data mixes natural query-document pairs (MS MARCO, NQ), synthetic pairs generated by LLM paraphrase, and mined hard negatives: passages that BM25 ranks highly for a query but that are not labeled relevant.
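To make the loss concrete, here is a minimal sketch that computes InfoNCE from a precomputed similarity matrix, assuming the positive for query i sits at column i and every other column in the row is an in-batch negative. The value τ = 0.05 and the toy matrices are illustrative assumptions, not settings taken from any particular model.

```python
import math

def info_nce(sim_matrix, tau=0.05):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] is sim(q_i, p_j); the positive for q_i is p_i.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / tau for s in row]
        # log-sum-exp with max subtraction for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim_matrix)

# Positives clearly closer than negatives: loss near zero.
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
# Every passage equally similar: loss equals log(batch size).
uninformative = [[0.5, 0.5, 0.5]] * 3

print(info_nce(well_separated), info_nce(uninformative))
```

With an uninformative encoder the loss sits at log of the batch size, so larger batches raise the ceiling and give the model more negatives to separate.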

Instruction prefixes and asymmetric retrieval

Models like E5 and BGE expect task-specific prefixes. E5 prepends query: to questions and passage: to documents; BGE v1.5 prepends an instruction string to queries only and encodes passages as-is. The prefixes teach the encoder that queries and passages occupy the same space but play different roles, which is critical for asymmetric retrieval where a short query must match a much longer document. Skipping prefixes on E5 checkpoints can drop nDCG by 10+ points.
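A small helper sketch for keeping prefixes consistent between indexing and querying. It assumes the E5 `query: ` / `passage: ` convention and the BGE query instruction quoted in the Harbor Support section of this guide, with no prefix on BGE passages; the function name and the `family` keys are hypothetical, and the strings should be checked against the model card for the exact checkpoint.

```python
# Prefix strings per model family; verify against the model card before use.
PREFIXES = {
    "e5": {"query": "query: ", "passage": "passage: "},
    "bge": {
        "query": "Represent this sentence for searching relevant passages: ",
        "passage": "",  # assumption: BGE passages are encoded without a prefix
    },
}

def format_for_embedding(text, role, family):
    """Prepend the prefix a model family expects for a query or a passage."""
    try:
        prefix = PREFIXES[family][role]
    except KeyError:
        raise ValueError(f"unknown family/role: {family!r}/{role!r}") from None
    return prefix + text

print(format_for_embedding("refund never arrived", "query", "e5"))
print(format_for_embedding("Chargeback processing delays", "passage", "e5"))
print(format_for_embedding("refund never arrived", "query", "bge"))
```

Routing every string through one function makes it harder to index with one prefix and query with another.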

Model families and when to pick each

  • Small bi-encoders (MiniLM, all-MiniLM-L6-v2, bge-small) — 384-dim, <100 MB, CPU-friendly. Good for <500k docs and latency-sensitive edge routing.
  • Mid-size open models (bge-base/large-en-v1.5, e5-base/large-v2, gte-large) — strong MTEB scores, fit on a single GPU for batch indexing. Default choice for self-hosted RAG.
  • LLM-backed embedders (e5-mistral-7b-instruct and similar 7B-class models) — higher quality on long context and nuanced paraphrase; need GPU and careful batching.
  • API embeddings (OpenAI text-embedding-3, Cohere embed-v3, Voyage) — no ops burden, pay per token, and Matryoshka dimension truncation on models that support it (such as text-embedding-3). Watch data residency and vendor lock-in.
  • Multilingual checkpoints (multilingual-e5-large, bge-m3) — shared space across 100+ languages; bge-m3 adds sparse+dense hybrid retrieval in one model.

Matryoshka and dimension trade-offs

Matryoshka representation learning trains embeddings so the first k dimensions remain useful when truncated. OpenAI text-embedding-3 and Nomic support this: text-embedding-3-large can be cut from 3072 to 1024 or 256 dims with modest recall loss, shrinking index RAM in proportion to the dimension and speeding ANN search. Re-normalize after truncating, and always re-evaluate truncation on your domain; legal and medical text often needs full dimensionality.
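A sketch of truncation and the storage arithmetic behind it. It assumes float32 storage (4 bytes per value) and counts raw vectors only, not ANN graph overhead; the one-million-vector corpus is a hypothetical figure.

```python
import math

def truncate_and_renormalize(vec, k):
    """Keep the first k dimensions, then rescale to unit length."""
    head = list(vec[:k])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dim * bytes_per_value

vec = [0.5, 0.5, 0.5, 0.5]           # toy unit vector standing in for a full embedding
short = truncate_and_renormalize(vec, 2)
print(short)                          # unit length again after truncation

for dim in (3072, 1024, 256):
    gib = index_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>4} dims: {gib:.2f} GiB for 1M vectors")
```

Truncated vectors are no longer unit length, so skipping the re-normalization step silently breaks the dot-product-equals-cosine assumption.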

Harbor Support RAG retrieval refactor

Harbor Support replaced BM25-first retrieval with a two-stage pipeline:

  1. Bi-encoder recall: bge-base-en-v1.5 encodes all KB chunks (512-token windows, 64-token overlap) into 768-dim normalized vectors stored in Qdrant with HNSW (ef_construct=200, M=16).
  2. Hybrid rerank: top-50 cosine hits fused with BM25 via RRF (k=60), then a cross-encoder (bge-reranker-base) scores top-20 pairs for the final top-5 context window.
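The fusion step in stage 2 can be sketched as reciprocal rank fusion (RRF), where each document scores 1 / (k + rank) in every list it appears in. This is a generic illustration with k=60 and made-up document IDs, not Harbor Support's actual code.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["a", "b", "c"]   # hypothetical cosine ranking
bm25_hits = ["c", "a", "d"]    # hypothetical BM25 ranking

fused = rrf_fuse([dense_hits, bm25_hits], k=60)
print(fused)
```

Because RRF uses ranks rather than raw scores, it needs no calibration between cosine similarities and BM25 scores. The fused list would then be cut to the top 20 and passed to the cross-encoder.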

Queries prepend “Represent this sentence for searching relevant passages: ” per the BGE instructions. Offline indexing runs at 1,800 chunks/sec on one A10G; online query encoding adds ~18 ms p95. Domain fine-tuning on 12,000 historical ticket–resolution pairs (3 epochs, 32 in-batch negatives) pushed recall@10 another 6 points on holdout tickets mentioning billing synonyms. The LLM prompt and generation parameters stayed fixed; retrieval quality was the lever.

Embedding model decision table

| Approach | Latency profile | Best when | Watch out for |
| Small bi-encoder (384-d) | <10 ms encode on CPU | Edge routing, <500k docs, tight RAM | Weak on long documents and rare jargon |
| Mid open bi-encoder (768–1024-d) | ~20–50 ms GPU / ~100 ms CPU | Default self-hosted RAG and catalog search | Needs instruction prefixes for E5/BGE |
| LLM embedder (7B class) | 100–300 ms unless batched | Hard paraphrase, long-context chunks | VRAM, batching complexity, cost at scale |
| API embeddings | Network-bound, ~50–200 ms | Fast MVP, no GPU ops, Matryoshka dims | Per-token cost, privacy, rate limits |
| Cross-encoder reranker | 10–30 ms per pair | Second stage after bi-encoder recall | Not a replacement for first-stage retrieval |
| BM25 / sparse only | Sub-ms inverted index | Exact SKU, error codes, rare tokens | Misses paraphrase and semantic overlap |

Common pitfalls

  • Wrong instruction prefix — E5 and BGE expect different strings; copy from model cards, do not invent generic “embed this text” prompts.
  • Encoding queries and documents identically when the model is asymmetric — use query vs passage prefixes even if both strings look like plain sentences.
  • Chunking longer than the model context — silent truncation drops tail content; chunk to 75% of max tokens with overlap.
  • Skipping normalization — unnormalized vectors make inner-product indexes rank by vector length as well as angle, and they skew weighted score fusion with BM25.
  • Evaluating on keyword overlap metrics only — paraphrase-heavy tasks need labeled relevance sets or LLM-judged nDCG, not BLEU.
  • Re-embedding the corpus on every model tweak — version embedding model ID in index metadata; blue/green reindex for upgrades.
  • Assuming bigger dimension always wins — Matryoshka truncation plus domain fine-tune often beats a larger general model.
  • Ignoring hybrid retrieval — dense-only misses exact IDs, error codes, and product SKUs that BM25 catches in one hop.
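For the chunking pitfall above, a sliding-window sketch that keeps each chunk under the model limit. It assumes a 512-token model context, so the default window is 384 tokens (75%) with a 64-token overlap; `tokens` is a hypothetical pre-tokenized list, and in practice the counts should come from the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, window=384, overlap=64):
    """Split a token list into overlapping windows of at most `window` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = list(range(1000))  # stand-in for 1,000 token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Leaving headroom below the context limit keeps room for the query or passage prefix and any special tokens the tokenizer adds.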

Production checklist

  • Pick bi-encoder vs API based on privacy, cost, and GPU availability; benchmark on held-out query-passage pairs from your domain.
  • Apply correct query and passage prefixes from the model card before any evaluation.
  • Define chunk size, overlap, and metadata (title, section, URL) stored alongside vectors.
  • L2-normalize embeddings; confirm ANN index uses inner product or cosine consistently.
  • Measure recall@k and MRR on a labeled set before tuning the LLM prompt.
  • Add hybrid BM25 + dense fusion (RRF or weighted) when corpus mixes jargon and paraphrase.
  • Plan cross-encoder reranking budget: top-20 pairs is a common sweet spot.
  • Version index by model name + checkpoint; automate reindex on model upgrades.
  • Test Matryoshka truncation if RAM or QPS is constrained; plot recall vs dimension.
  • Consider domain fine-tune when open models lag on internal vocabulary by >10 points recall@10.
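The recall@k and MRR items in the checklist can be measured with a few lines. This sketch treats recall@k as a hit rate (the share of queries with at least one relevant document in the top k), which is one common definition and an assumption here; the query IDs, rankings, and labels are made up.

```python
def recall_at_k(ranked, relevant, k):
    """Share of queries with at least one relevant doc in the top k results."""
    hits = sum(1 for q, docs in ranked.items() if relevant[q] & set(docs[:k]))
    return hits / len(ranked)

def mean_reciprocal_rank(ranked, relevant):
    """Average of 1 / rank of the first relevant doc (0 when none is retrieved)."""
    total = 0.0
    for q, docs in ranked.items():
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)

# Hypothetical retriever output and relevance labels.
ranked = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d0"}}

print(recall_at_k(ranked, relevant, k=2))
print(recall_at_k(ranked, relevant, k=3))
print(mean_reciprocal_rank(ranked, relevant))
```

Running the same script before and after a retriever change, on a fixed labeled set, isolates retrieval quality from anything the LLM prompt does downstream.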

Key takeaways

  • Text embedding models map variable-length text to fixed dense vectors so similarity search approximates semantic relevance.
  • Bi-encoders encode queries and documents independently — fast at scale, but less accurate than cross-encoders that score pairs jointly.
  • Contrastive training with hard negatives and instruction prefixes (E5, BGE) drives most of the retrieval quality gap over raw LLM hidden states.
  • Hybrid dense + BM25 retrieval with optional cross-encoder reranking is the production default for RAG and support search.
  • Domain fine-tuning and Matryoshka dimension tuning often beat blindly upgrading to a larger general embedding model.
