Guide
LLM embeddings explained
Large language models read and write text, but many production AI systems need something simpler first: a way to compare meaning. An embedding is a fixed-length list of numbers — a dense vector — that represents what a piece of text is about. Similar meanings land close together in vector space; unrelated text lands far apart. That geometry powers semantic search, retrieval-augmented generation (RAG), clustering, deduplication, and recommendation. This guide explains how embedding models produce those vectors, which similarity metrics to use, how to choose a model, and the mistakes that silently wreck retrieval quality in production.
What an embedding actually is
Take the sentence "How do I stake SOL on Solana?" and run it through an embedding model. The output might be a vector of 768 or 1,536 floating-point numbers — not human-readable, but mathematically structured. Another sentence with the same intent, like "Steps to stake Solana tokens," should produce a vector pointing in a similar direction. A sentence about baking bread should point somewhere else entirely.
Embeddings compress meaning into a coordinate system. The representation is lossy by design, so you cannot simply decode the original text from the vector, but you can compare vectors to rank candidates by relevance. That is the core contract: same model, same preprocessing, comparable vectors. Break any part of that contract and your vector index returns nonsense even when the underlying documents are perfect.
Embeddings differ from the token sequences a chat model consumes. Chat models predict the next token; embedding models emit a single summary vector per input (or one vector per token for specialized tasks). The training objective is different too — contrastive learning pulls paraphrases together and pushes unrelated pairs apart, rather than next-word prediction.
How embedding models are trained
Modern text embedding models usually start from a transformer encoder (BERT-class or a trimmed decoder) and are fine-tuned with a contrastive loss on two kinds of pairs:
- Positive pairs — query and relevant passage, two paraphrases, or anchor and hard positive from click logs.
- Negative pairs — random passages or "in-batch negatives" (other examples in the same training batch that are not matches).
The model learns to maximize similarity on positives and minimize it on negatives. Quality depends heavily on training data: web-scale paraphrase corpora, MS MARCO search pairs, domain-specific manuals for enterprise retrieval, or synthetic Q&A generated by a strong LLM. A model trained on general web text may underperform on legal, medical, or on-chain documentation unless you fine-tune or pick a domain-tuned variant.
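The objective described above can be sketched as an InfoNCE-style loss with in-batch negatives. This is a minimal illustration in plain Python, not any particular model's training code: the function names, the toy two-dimensional vectors, and the temperature of 0.05 are all assumptions chosen for readability.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is the positive for queries[i]; every other passage
    in the batch is treated as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, passage) / temperature for passage in passages]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching passage as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
# Positives line up with their queries: loss is close to zero.
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
# Positives point at the wrong query: loss is large.
swapped = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss is what pulls each query toward its own passage and away from the rest of the batch. Larger batches supply more negatives per step.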
Instruction-tuned embeddings accept a task prefix, e.g. "Represent this sentence for searching relevant passages:" placed before the query, so one model handles asymmetric retrieval (short query, long document) better than symmetric sentence encoders trained only on paraphrase pairs. Apply prefixes consistently: whatever convention you used when building the index must be used for every later insert and every query, because the prefix is part of the input, not decoration.
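One way to keep prefixes consistent is to route every input through a single pair of helpers. The sketch below is illustrative only. The query prefix is the one quoted above; the empty document prefix and the helper names are assumptions, so check your model card for the exact strings it expects.

```python
# Query prefix as quoted in the text above.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
# Assumption: this model family embeds documents without a prefix.
DOCUMENT_PREFIX = ""

def prepare_query(text: str) -> str:
    """Build the exact string sent to the embedding model at query time."""
    return QUERY_PREFIX + text.strip()

def prepare_document(text: str) -> str:
    """Build the exact string sent to the embedding model at index time."""
    return DOCUMENT_PREFIX + text.strip()

query_input = prepare_query("  How do I stake SOL on Solana? ")
document_input = prepare_document("Steps to stake Solana tokens. ")
```

Keeping the prefixes as named constants next to the index configuration makes it harder for an indexing job and a query service to drift apart.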
Similarity metrics: cosine, dot product, Euclidean
Given two vectors a and b, you need a score that says how close they are. The three common choices:
- Cosine similarity — measures the angle between vectors, ignoring magnitude. Range is −1 to 1, though scores for text embeddings mostly fall between 0 and 1 in practice. Robust when vector lengths vary. Default for most semantic search pipelines.
- Dot product — sum of element-wise products. Equivalent to cosine when vectors are L2-normalized (unit length). Many APIs return normalized embeddings so dot product and cosine coincide; check your provider's docs.
- Euclidean distance — straight-line distance in space. Works when magnitudes carry signal (rare for text). Most vector DBs expose cosine or inner product indexes instead.
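The three metrics above can be written out in a few lines of plain Python. This is a sketch for intuition on toy three-dimensional vectors; a vector database computes the same quantities with optimized kernels.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the magnitude
c = [-3.0, 0.0, 1.0]   # orthogonal to a

cos_ab = cosine_similarity(a, b)    # direction matches, so close to 1
cos_ac = cosine_similarity(a, c)    # orthogonal, so close to 0
dot_ab = dot(a, b)                  # grows with magnitude
dist_ab = euclidean_distance(a, b)  # nonzero despite identical direction
```

Vectors `a` and `b` show why the choice matters: cosine treats them as identical in meaning, while dot product and Euclidean distance both react to the difference in length.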
Before indexing, confirm whether your store expects normalized vectors. Mixing normalized queries against unnormalized index entries skews ranking. If you self-host open models, apply L2 normalization explicitly: v / ||v|| before insert and search.
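A minimal sketch of that normalization step, assuming vectors are plain Python lists of floats. The helper name `l2_normalize` is illustrative.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length: v / ||v||."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]

u = l2_normalize([3.0, 4.0])
w = l2_normalize([0.0, 2.0])

# Once both vectors are unit length, the dot product is the cosine similarity.
dot_uw = sum(x * y for x, y in zip(u, w))
unit_length = math.sqrt(sum(x * x for x in u))
```

Running the same function on both the insert path and the search path keeps the two sides of the index consistent.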
Popular embedding model families
Model choice is a product decision — latency, cost, language coverage, and retrieval quality trade off. Common options in 2026:
Hosted API models
- OpenAI text-embedding-3-small / large — strong general retrieval, variable dimensions (Matryoshka — see below), pay per token embedded. Good default when you already use OpenAI and want minimal ops.
- Cohere embed-v3 — asymmetric search modes (search_document vs search_query), solid multilingual support, compression options for storage.
- Voyage, Jina, Google Vertex — competitive benchmarks on MTEB-style tasks; worth A/B testing if retrieval quality is your bottleneck.
Open-weight models (self-host or run locally)
- nomic-embed-text, bge-large, e5-mistral — run via sentence-transformers or ONNX; no per-token API bill, full data residency. You pay GPU/CPU and maintenance.
- Smaller models (MiniLM, all-MiniLM-L6-v2) — fast and cheap for prototyping; quality gap shows on nuanced technical corpora.
Evaluate on your documents, not leaderboard averages. Build 50–200 labeled query–passage pairs from real user questions and measure recall@k before committing. Our LLM evaluation guide covers how to run those evals without fooling yourself.
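A recall@k check over labeled pairs can be as small as the sketch below. It assumes you already have, for each query, the ranked passage IDs your retriever returned and the IDs you labeled as relevant. The IDs and data here are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results, labels, k):
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(results[query], labels[query], k) for query in labels]
    return sum(scores) / len(scores)

# Hypothetical labels: query id -> relevant passage ids.
labels = {"q1": ["p3"], "q2": ["p7"], "q3": ["p1", "p9"]}
# Hypothetical retriever output: query id -> ranked passage ids.
results = {
    "q1": ["p3", "p4", "p5"],
    "q2": ["p2", "p8", "p7"],
    "q3": ["p9", "p6", "p1"],
}

recall_1 = mean_recall_at_k(results, labels, k=1)
recall_3 = mean_recall_at_k(results, labels, k=3)
```

Running the same labeled set against each candidate model gives a like-for-like comparison on your own corpus.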
Dimensionality and Matryoshka embeddings
Embedding dimension is the vector length — 384, 768, 1,024, 1,536, 3,072, etc. Higher dimensions can capture finer distinctions but cost more storage, memory bandwidth, and index build time. Rule of thumb: start with the provider's recommended default; only shrink after measuring recall loss on your eval set.
Matryoshka representation learning trains embeddings so the first d dimensions remain useful as you truncate, like nested dolls. OpenAI's text-embedding-3 models support reduced dimensions (e.g. 1,536 down to 256) with modest quality loss. If you keep the full-length vectors, you can later truncate and re-normalize them to trade index size for speed without re-embedding the entire corpus when requirements change.
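Truncating a Matryoshka-style embedding amounts to keeping the leading dimensions and restoring unit length. The sketch below assumes the vector came from a model trained for this. Applying it to an ordinary embedding will usually hurt quality far more. The function name and the toy four-dimensional vector are illustrative.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` values, then rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    length = math.sqrt(sum(x * x for x in head))
    if length == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / length for x in head]

full = [0.5, 0.5, 0.5, 0.5]                  # toy unit-length vector
short = truncate_and_renormalize(full, 2)    # half the storage
short_length = math.sqrt(sum(x * x for x in short))
```

The re-normalization matters because dropping dimensions shrinks the vector, and cosine or dot-product indexes that expect unit vectors would otherwise score it incorrectly.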
Storage math: 1 million vectors at 1,536 dimensions × 4 bytes (float32) ≈ 6 GB raw vectors alone, before HNSW graph overhead. Plan capacity when you size a vector database; int8 quantization stores one byte per dimension instead of four, cutting the raw footprint to roughly 1.5 GB if your store supports it.
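The arithmetic is worth keeping as a small helper so you can rerun it for other corpus sizes and dimensions. This is a back-of-envelope sketch for raw vectors only. It ignores index structures, metadata, and replication.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value):
    """Bytes needed for the vectors alone, with no index overhead."""
    return num_vectors * dimensions * bytes_per_value

float32_bytes = raw_vector_bytes(1_000_000, 1536, 4)
int8_bytes = raw_vector_bytes(1_000_000, 1536, 1)

float32_gb = float32_bytes / 1e9   # decimal gigabytes
int8_gb = int8_bytes / 1e9
```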
Where embeddings show up in LLM apps
RAG retrieval
Chunk documents, embed each chunk, store vectors with metadata (source URL, title, ACL). At query time, embed the user question, fetch top-k neighbors, inject passages into the prompt. Embedding quality sets the ceiling on answer faithfulness — a weak retriever cannot be fixed by a smarter generator.
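The retrieval flow above can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embed` counts a tiny hand-picked vocabulary instead of calling a real embedding model, the index is a list in memory instead of a vector database, and the chunk texts and source paths are invented.

```python
import math

VOCAB = ["stake", "sol", "validator", "bread", "flour", "oven"]

def toy_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def build_index(chunks):
    """Embed each chunk once and keep its metadata next to the vector."""
    return [dict(chunk, vector=toy_embed(chunk["text"])) for chunk in chunks]

def retrieve(index, question, k=2):
    """Embed the question and return the k most similar chunks."""
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda e: cosine(query_vector, e["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    context = "\n\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"

chunks = [
    {"text": "To stake SOL, pick a validator and delegate.", "source": "docs/staking"},
    {"text": "Bake bread in a hot oven with good flour.", "source": "blog/bread"},
    {"text": "A validator earns rewards on staked SOL.", "source": "docs/validators"},
]
index = build_index(chunks)
hits = retrieve(index, "How do I stake SOL?", k=2)
prompt = build_prompt("How do I stake SOL?", hits)
```

Swapping `toy_embed` for a real model and the list for a vector store leaves the shape of the pipeline unchanged.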
Semantic deduplication and clustering
Near-duplicate support tickets, forum posts, or scraped articles cluster by embedding similarity. Threshold tuning matters: 0.95 cosine might mean "duplicate," 0.85 "related topic."
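A greedy threshold pass is one simple way to drop near-duplicates. The sketch below compares each item against everything already kept, which is quadratic and only suitable for small sets. The 0.95 threshold comes from the example above and the toy vectors are invented.

```python
import math

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def deduplicate(vectors, threshold=0.95):
    """Return indices of items to keep.

    An item is dropped when its cosine similarity to any
    already-kept item reaches the threshold.
    """
    kept = []
    for i, vector in enumerate(vectors):
        if all(cosine(vector, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

vectors = [
    [1.0, 0.0],
    [0.99, 0.05],   # almost identical to the first item
    [0.0, 1.0],
    [0.7, 0.7],     # related to both, duplicate of neither
]
kept = deduplicate(vectors)
```

For large corpora the same idea is usually run on top of an approximate nearest-neighbor index so each item is only compared with its closest candidates.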
Classification and routing
Embed labeled examples once; classify new text by nearest centroid or a lightweight classifier on top of frozen embeddings. Cheaper than full LLM calls for intent routing ("billing vs technical vs sales").
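Nearest-centroid routing fits in a few lines. The sketch assumes the labeled examples have already been embedded. The three-dimensional vectors below are made-up stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Hypothetical pre-computed embeddings for labeled examples.
examples = {
    "billing": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "technical": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
    "sales": [[0.1, 0.1, 0.9], [0.2, 0.0, 0.8]],
}
centroids = {label: centroid(vectors) for label, vectors in examples.items()}

route_a = classify([0.7, 0.3, 0.1], centroids)
route_b = classify([0.0, 0.1, 0.9], centroids)
```

The centroids are computed once and reused, so routing a new message costs one embedding call plus a handful of dot products.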
Multimodal and code search
Image–text models embed both modalities into a shared space (CLIP-style). Code embedding models map functions and natural-language queries into comparable vectors for repo search. Same similarity machinery, different encoders.
Indexing pipeline best practices
- Chunk thoughtfully — 256–512 tokens with overlap (10–20%) preserves context across boundaries. Headers and section titles in the chunk text improve retrieval for structured docs.
- Same model, same settings — never mix embedding model versions in one index. Re-embed the full corpus when you upgrade models; partial upgrades poison ranking.
- Batch embed offline — index builds should batch requests (respect rate limits) and write idempotently. Store model name and dimension in index metadata.
- Hybrid retrieval — combine BM25 keyword hits with vector search; embeddings miss exact SKUs, error codes, and rare proper nouns that keyword search nails.
- Rerank top candidates — cross-encoder rerankers (ms-marco-MiniLM, Cohere rerank) score query–passage pairs jointly; use on top-20 vector hits for precision.
- Refresh on content change — stale embeddings for updated docs produce wrong answers confidently. Tie index updates to your CMS or git webhook.
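The chunking advice in the first item of the list above can be sketched as a sliding window over tokens. The function name and defaults are illustrative, and the tokens here are placeholder strings. In practice you would use the tokenizer that matches your embedding model.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each window starts on the last token of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.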
Query-time tips
- Query expansion — optionally rewrite vague user input with an LLM ("What does the user likely want?") before embedding; helps on typos and ultra-short queries at the cost of latency.
- Metadata filters — pre-filter by tenant, product, or date before vector search; reduces false positives in multi-tenant SaaS.
- Score thresholds — if top hit cosine similarity is below ~0.7 (model-dependent), return "I don't know" instead of hallucinating from irrelevant chunks.
- Cache frequent queries — embedding API calls add 50–200 ms; cache query vectors in Redis for hot questions.
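Two of the tips above, score thresholds and query caching, can be sketched together. The cache here is an in-process `functools.lru_cache` used as a stand-in for Redis, `fake_embed` is a hypothetical placeholder for a real embedding call, and the 0.7 cutoff is the model-dependent example value from the list.

```python
from functools import lru_cache

embed_calls = {"count": 0}

def fake_embed(text):
    """Hypothetical stand-in for a slow embedding API call."""
    embed_calls["count"] += 1
    return (float(len(text)), 1.0)

@lru_cache(maxsize=10_000)
def cached_query_vector(normalized_query):
    return fake_embed(normalized_query)

def embed_query(query):
    # Normalize case and whitespace so trivial variants share a cache entry.
    return cached_query_vector(" ".join(query.lower().split()))

def passages_or_abstain(hits, min_score=0.7):
    """hits: (score, passage) pairs sorted by descending score.

    Returns None when even the best hit is below the threshold,
    signalling the caller to answer "I don't know".
    """
    if not hits or hits[0][0] < min_score:
        return None
    return [passage for score, passage in hits if score >= min_score]

first = embed_query("How do I stake SOL?")
second = embed_query("  how do I stake   sol? ")   # served from the cache

weak = passages_or_abstain([(0.62, "unrelated chunk")])
strong = passages_or_abstain([(0.81, "a"), (0.74, "b"), (0.55, "c")])
```

The right threshold depends on the model and corpus, so it is worth tuning against labeled queries instead of reusing the example value.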
Common pitfalls
- Model mismatch — indexing with model A, querying with model B. Similarity scores become meaningless; retrieval looks random.
- Ignoring asymmetric modes — using document embeddings for queries on models that expect different input types (Cohere, E5 instruct variants).
- Chunks too large or too small — whole PDF pages dilute the vector; single sentences lose context. Tune on your eval set.
- Skipping normalization — a dot-product index ranks like cosine only when every vector is unit length; unnormalized inserts let vector magnitude skew results.
- English model on multilingual corpus — non-English queries retrieve poorly unless you pick a multilingual embedder and test per locale.
- No eval loop — shipping embedding upgrades without recall@k regression tests is how RAG quality silently rots.
- Security blind spot — retrieved text becomes prompt input; poisoned documents in the index are an indirect prompt injection vector. Sanitize sources and enforce ACL filters at retrieval time.
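Several of these pitfalls (model mismatch, wrong dimensions, inconsistent normalization) can be caught by a cheap guard that compares index metadata with the query-side configuration before searching. The sketch below is one possible shape for such a check. The class, field, and model names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    model_name: str
    dimensions: int
    normalized: bool

def find_mismatches(index_config, query_config, query_vector):
    """Return a list of human-readable problems; empty means compatible."""
    problems = []
    if index_config.model_name != query_config.model_name:
        problems.append(
            f"model mismatch: index={index_config.model_name}, "
            f"query={query_config.model_name}"
        )
    if len(query_vector) != index_config.dimensions:
        problems.append(
            f"dimension mismatch: index={index_config.dimensions}, "
            f"query={len(query_vector)}"
        )
    if index_config.normalized != query_config.normalized:
        problems.append("normalization mismatch between index and query")
    return problems

# Hypothetical configurations: the index was built with a different model.
index_config = EmbeddingConfig("model-a", 4, True)
query_config = EmbeddingConfig("model-b", 4, True)
problems = find_mismatches(index_config, query_config, [0.5, 0.5, 0.5, 0.5])
```

Failing the request when `problems` is non-empty turns a silent ranking bug into a visible error.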
Key takeaways
- Embeddings are dense vectors that encode semantic similarity — the foundation of semantic search and RAG retrieval.
- Use cosine similarity (or normalized dot product) for text; confirm normalization matches your vector store's expectations.
- Pick embedding models with domain evals, not leaderboard hype — hosted APIs vs open-weight is a cost, latency, and privacy tradeoff.
- Matryoshka dimensions let you shrink vectors after training; still re-embed when switching model families.
- Pair vector search with keyword hybrid retrieval and reranking for production precision.
- Treat the index as part of your security boundary — stale, mixed-model, or poisoned embeddings break trust fast.
Related reading
- Vector databases explained — HNSW indexes, ANN search, and hybrid retrieval
- RAG explained — chunking, retrieval pipelines, and grounding LLM answers
- Transformer architecture explained — encoders, attention, and how models represent language
- LLM tokenization explained — tokens, BPE, and sizing chunks for embedding