Vector databases explained: embeddings, ANN search, and RAG indexes
Keyword search finds documents that contain the words you typed. Semantic search finds documents that mean the same thing — even when the vocabulary differs. A vector database stores high-dimensional embeddings (dense numeric vectors produced by a model) and answers "what is closest to this query vector?" at scale. That capability powers retrieval-augmented generation (RAG), recommendation systems, image similarity, and deduplication pipelines. This guide explains how vectors are compared, why exact search breaks at millions of rows, how approximate nearest-neighbor (ANN) indexes work, and how to choose and size a store for production retrieval.
From text to vectors
An embedding model maps text, images, or audio into a fixed-length float array, often 384, 768, 1,024, or 1,536 dimensions. Similar items land near each other in that space. Encoder-only transformers (BERT-class models), embedding APIs (OpenAI, Cohere, Voyage), and open-weight models such as nomic-embed-text are the usual sources; see our transformer architecture guide for how encoders differ from chat LLMs.
Ingestion typically has three steps: split documents into chunks (paragraphs or fixed token windows; chunk sizing ties directly to how text is tokenized), run each chunk through the embedding model, and upsert the vector plus metadata (source URL, title, ACL group, timestamp) into the index. At query time you embed the user's question with the same model and search for nearest neighbors. Mixing embedding models between ingest and query destroys recall, because vectors from different models do not share a coordinate space.
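The chunk, embed, and upsert steps can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for a real embedding model, the index is a plain dict, and the word-based chunk sizes are arbitrary assumptions.

```python
import hashlib
import math

def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping fixed-size word windows."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def ingest(index, doc_id, text, metadata):
    """Upsert one record per chunk: vector plus metadata, keyed by a stable ID."""
    for i, chunk in enumerate(chunk_text(text)):
        index[f"{doc_id}#{i}"] = {"vector": toy_embed(chunk), "text": chunk, **metadata}

index = {}
ingest(index, "doc-1", "word " * 120, {"source_url": "https://example.com/doc-1"})
```

Keying each record by document ID plus chunk position means re-ingesting an edited document overwrites its old chunks instead of duplicating them, although chunks beyond the new length would still need an explicit delete.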
Distance metrics: cosine, dot product, L2
Most stores expose three related metrics:
- Cosine similarity — measures the angle between vectors, ignoring magnitude. Standard for text embeddings that are L2-normalized. Values range from -1 to 1; higher is more similar.
- Dot product — equivalent to cosine when vectors are unit-length, but faster on hardware that fuses multiply-add. Some models ship unnormalized vectors where dot product is the training objective (e.g. certain contrastive setups).
- Euclidean (L2) distance — straight-line distance. Works when magnitude carries signal (some image embeddings); for text, cosine is usually safer.
Pick one metric at index creation, ideally the one the embedding model was trained with, and keep it consistent. Rebuilding the index for an entire corpus because you switched from cosine to dot product is an expensive mistake.
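The relationship between the three metrics is easiest to see on a small example. The sketch below uses plain Python lists and made-up 2-D vectors; on unit-length vectors, dot product equals cosine similarity and squared L2 distance equals 2 − 2·cosine, so all three produce the same ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = l2_norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

cos_ab = cosine_similarity(a, b)        # 24 / 25, about 0.96
dot_unit = dot(ua, ub)                  # matches cos_ab on unit vectors
l2_unit_sq = l2_distance(ua, ub) ** 2   # matches 2 - 2 * cos_ab
```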
Why brute force fails — and what ANN does instead
Exact k-nearest-neighbor search compares the query against every vector. That is O(n·d) work per query for n vectors of dimension d: fine for ten thousand rows, painful at ten million. Production systems use approximate nearest neighbor (ANN) indexes that trade a small amount of recall for orders-of-magnitude speedup.
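For reference, exact search is only a few lines, which is why it makes a useful baseline for measuring ANN recall. A minimal sketch, assuming vectors are stored in a dict of ID to list of floats and scored by dot product:

```python
import heapq

def exact_knn(query, vectors, k):
    """Score every stored vector against the query and keep the k best.

    vectors: dict mapping ID -> list of floats.
    Returns (score, id) pairs, highest dot product first.
    """
    scored = (
        (sum(q * v for q, v in zip(query, vec)), vec_id)
        for vec_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored)

vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
hits = exact_knn([1.0, 0.0], vectors, k=2)
```

Every query touches every vector, so cost grows linearly with corpus size no matter how small k is.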
HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph: upper layers provide long jumps across the space; lower layers refine the search locally. Queries start at an entry point in the top layer and greedily walk toward the query vector, dropping down a layer whenever no closer neighbor is found. M (maximum neighbors per node) and efConstruction (candidate list size during build) set index build cost and graph quality; efSearch (candidate list size at query time) trades latency for recall. HNSW is the default or a first-class index type in Qdrant, Weaviate, Milvus, pgvector (the hnsw access method), and many managed offerings.
Strengths: excellent recall at low latency, no training phase. Weaknesses: memory-heavy (the graph lives in RAM for best performance), costly to update at very high write rates — deletes and heavy churn may require periodic rebuilds.
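The greedy walk can be illustrated on a single graph layer. The sketch below is a deliberate simplification, not the full algorithm: real HNSW adds the layer hierarchy, inserts nodes incrementally, and uses a neighbor-selection heuristic, whereas `build_graph` here links each point to its m nearest neighbors by brute force. The names `build_graph` and `search_layer` are illustrative.

```python
import heapq
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(points, m):
    """Link every point to its m nearest neighbors (undirected, brute force)."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        for j in sorted(others, key=lambda j: dist(p, points[j]))[:m]:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def search_layer(points, graph, query, entry, ef):
    """Best-first walk that keeps the ef closest nodes seen so far."""
    d0 = dist(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    best = [(-d0, entry)]        # max-heap (negated distance), at most ef entries
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # nearest unexplored node is worse than our worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, points[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-neg_d, n) for neg_d, n in best)

points = [[float(i)] for i in range(10)]   # ten points on a line
graph = build_graph(points, m=2)
nearest = search_layer(points, graph, [7.2], entry=0, ef=3)
```

Raising `ef` widens the set of nodes the walk keeps and explores, which mirrors how efSearch trades latency for recall.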
IVF (Inverted File Index)
IVF clusters vectors into nlist buckets (via k-means at build time). At query time only the buckets of the nearest few centroids are searched. This is fast, but recall drops if the true neighbor sits in a bucket you skipped. IVF pairs well with product quantization (PQ) to compress vectors and shrink RAM. FAISS popularized IVF-PQ; Milvus and some cloud services expose similar options.
Rule of thumb: HNSW for interactive RAG where recall matters; IVF-PQ for billion-scale offline batch jobs where you can tolerate tuning and occasional misses.
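A toy version of IVF shows where the recall loss comes from. This sketch assumes a naive k-means with deterministic initialization and no quantization; `nprobe` is the name FAISS uses for the number of buckets searched, and the other names are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(p, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

def kmeans(points, nlist, iters=10):
    """Naive Lloyd's algorithm; the first nlist points seed the centroids."""
    centroids = [list(p) for p in points[:nlist]]
    for _ in range(iters):
        buckets = [[] for _ in range(nlist)]
        for p in points:
            buckets[nearest_centroid(p, centroids)].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a bucket ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(points, nlist):
    """Train centroids, then file each point's index under its nearest centroid."""
    centroids = kmeans(points, nlist)
    lists = [[] for _ in range(nlist)]
    for idx, p in enumerate(points):
        lists[nearest_centroid(p, centroids)].append(idx)
    return centroids, lists

def ivf_search(query, points, centroids, lists, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, points[i]))[:k]

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, lists = build_ivf(points, nlist=2)
top = ivf_search([10.0, 10.4], points, centroids, lists, k=1, nprobe=1)
```

With `nprobe` equal to `nlist` the search degenerates to an exact scan; anything smaller skips buckets, and any true neighbor filed in a skipped bucket is missed.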
Metadata filtering and hybrid search
Pure vector search ignores structure. Real apps need filters: "only docs this user can read," "published after 2025," "product category = shoes." Most vector databases support pre-filtering (apply metadata constraints before ANN) or post-filtering (retrieve top-k then discard — risky if k is small and many neighbors fail the filter).
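The failure mode of post-filtering with a small k is easy to reproduce. The sketch below uses exact search over a handful of made-up records, with a hypothetical `acl` field; real stores apply the filter inside the ANN traversal, which this does not model.

```python
def top_k(query, records, k):
    """Exact top-k by dot product, best first."""
    def score(record):
        return sum(q * v for q, v in zip(query, record["vector"]))
    return sorted(records, key=score, reverse=True)[:k]

def post_filter(query, records, k, allowed):
    """Retrieve top-k first, then discard records the caller may not see."""
    return [r for r in top_k(query, records, k) if allowed(r)]

def pre_filter(query, records, k, allowed):
    """Restrict to permitted records first, then take top-k among them."""
    return top_k(query, [r for r in records if allowed(r)], k)

records = [
    {"id": "fin-1", "vector": [1.0, 0.0], "acl": "finance"},
    {"id": "fin-2", "vector": [0.9, 0.1], "acl": "finance"},
    {"id": "eng-1", "vector": [0.6, 0.8], "acl": "eng"},
]
query = [1.0, 0.0]

def is_eng(record):
    return record["acl"] == "eng"

post = post_filter(query, records, 2, is_eng)  # both top-2 hits are finance docs
pre = pre_filter(query, records, 2, is_eng)    # the eng doc survives
```

Here the post-filtered result is empty while the pre-filtered one still returns the only document the user can read.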
Hybrid retrieval combines dense vectors with sparse lexical scores
(BM25 / full-text). A query like error code 0xC0000005 or a rare SKU
often matches keywords better than semantics; "how do refunds work?" benefits from
embeddings. Common patterns:
- Run BM25 and vector search in parallel, merge with reciprocal rank fusion (RRF)
- Weighted linear combination after normalizing scores
- Two-stage: cheap hybrid recall of top 100, then a cross-encoder reranker on top 20
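Reciprocal rank fusion, the first pattern in the list, needs only the rank positions from each retriever, so no score normalization is required. A minimal sketch with made-up document IDs; the constant k=60 is the value commonly used in the literature and is treated here as an assumption rather than a tuned setting.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear in both lists (d1 and d3 here) rise above documents that only one retriever found.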
pgvector on PostgreSQL can pair built-in full-text search (a GIN index over a tsvector column) with an HNSW vector index in one database. That is attractive when you already run Postgres and want ACID transactions around document metadata. Dedicated stores (Pinecone, Qdrant, Weaviate) ship hybrid APIs and horizontal scaling as managed features.
Choosing a vector store
| Option | Best when | Watch out for |
|---|---|---|
| pgvector (Postgres extension) | Small to mid-size corpora (up to a few million vectors), strong consistency needs, team already on Postgres | Single-node RAM limits; tune maintenance_work_mem for HNSW builds; not a separate search cluster |
| Qdrant / Weaviate (self-hosted or cloud) | Dedicated RAG service, rich filtering, hybrid search, moderate ops appetite | Another system to monitor, backup, and version-upgrade |
| Pinecone / managed cloud | Fastest path to production, variable load, minimal infra team | Vendor lock-in, per-dimension pricing surprises at scale |
| OpenSearch / Elasticsearch kNN | Existing ES cluster, logs + docs in one place | ANN tuning is less ergonomic than purpose-built vector DBs |
| In-memory FAISS / hnswlib | Prototypes, batch offline jobs, embedded in a single process | No durability, no multi-tenant filtering — not a database |
Start simple: pgvector or a managed free tier until retrieval quality — not index throughput — becomes the bottleneck. Prematurely sharding a vector cluster before you have eval metrics is a common waste of engineering time.
Sizing and operating a RAG index
Back-of-envelope memory: vectors dominate. One million 768-dimensional float32 vectors take about 3 GB (1,000,000 × 768 × 4 bytes) before any index structure; HNSW graphs often add 1.5–2× overhead. Plan headroom for metadata, replicas, and OS page cache.
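The arithmetic is simple enough to keep as a helper when comparing dimensions or precisions. In this sketch the 1.5–2× figure is read as a multiplier on the raw vector size, which is an assumption about how the rule of thumb is meant; measure a real index before committing to hardware.

```python
def raw_vector_bytes(n_vectors, dim, bytes_per_value=4):
    """Bytes needed for the vectors alone (float32 uses 4 bytes per value)."""
    return n_vectors * dim * bytes_per_value

raw = raw_vector_bytes(1_000_000, 768)   # 3_072_000_000 bytes
raw_gb = raw / 1e9                       # about 3.07 GB

# Assumed planning range: 1.5x to 2x the raw size once the HNSW graph is included.
plan_low_gb = raw_gb * 1.5
plan_high_gb = raw_gb * 2.0

# Halving precision halves the raw footprint.
raw_fp16_gb = raw_vector_bytes(1_000_000, 768, bytes_per_value=2) / 1e9
```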
Ingest and freshness
Batch embed overnight for static docs; stream updates for wikis that change hourly. Version your index when you change chunking strategy or embedding model — run dual indexes during migration and cut over when retrieval eval passes on a golden question set.
Query parameters
- top_k — how many chunks return to the LLM (often 5–20). Too few misses context; too many dilutes the prompt and burns tokens.
- score threshold — drop neighbors below a similarity floor to reduce hallucination triggers from irrelevant chunks.
- efSearch / probes — raise ANN search breadth when recall@k drops in eval; costs latency.
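Two of these knobs are easy to express directly: applying top_k with a score floor, and measuring recall@k of an ANN result against an exact baseline. A minimal sketch; the function names and the default thresholds are illustrative assumptions, not values from any particular store.

```python
def select_context(hits, top_k=5, min_score=0.3):
    """Keep at most top_k hits whose similarity clears the floor.

    hits: list of (score, chunk) pairs where a higher score means more similar.
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    return [hit for hit in ranked if hit[0] >= min_score][:top_k]

def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the exact top-k that the ANN index also returned in its top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(ann_ids[:k])) / len(truth)

hits = [(0.2, "x"), (0.9, "a"), (0.5, "b"), (0.8, "c")]
context = select_context(hits, top_k=2, min_score=0.3)
recall = recall_at_k(["a", "b", "x"], ["a", "b", "c"], k=3)
```

Tracking `recall_at_k` on a fixed query set before and after changing efSearch or probes shows whether the extra latency buys anything.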
Observability
Log query embedding latency, ANN search ms, filter selectivity, and which document IDs were retrieved. When users report wrong answers, inspect retrieved chunks first — most RAG failures are retrieval failures, not generation failures. Pair index metrics with end-to-end answer grading on held-out questions.
Common pitfalls
- Chunk boundaries split answers — a table row cut in half embeds poorly; use structure-aware chunking for HTML/PDF
- Stale embeddings — updated docs without re-embed return old content to the model
- Duplicate chunks — near-identical paragraphs crowd out diverse context in top_k
- Ignoring ACL metadata — post-filtering a top-5 result set does not leak restricted chunks, but it can return empty context when every neighbor fails the filter; pre-filter or raise k
- Evaluating only on cosine scores — high similarity does not guarantee the chunk answers the question; measure downstream LLM accuracy
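The duplicate-chunk pitfall can be reduced with a greedy pass over the retrieved hits before they reach the prompt. This is one possible approach under stated assumptions: vectors are unit-length so dot product equals cosine similarity, hits arrive sorted best first, and the 0.95 cutoff is an arbitrary illustrative threshold.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dedupe_hits(hits, max_similarity=0.95):
    """Drop any hit that is nearly identical to a better-ranked hit already kept.

    hits: list of (score, unit_vector, text) tuples, sorted best first.
    """
    kept = []
    for hit in hits:
        if all(dot(hit[1], other[1]) < max_similarity for other in kept):
            kept.append(hit)
    return kept

hits = [
    (0.90, [1.0, 0.0], "Refunds take 5 days."),
    (0.89, [0.999, 0.0447], "Refunds take 5 days"),   # near-duplicate of the first
    (0.70, [0.0, 1.0], "Exchanges need a receipt."),
]
unique = dedupe_hits(hits)
```

The pass is quadratic in the number of hits, which is negligible at typical top_k sizes.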
Related reading
- RAG explained — full retrieval-augmented generation pipeline from chunking to reranking
- Transformer architecture explained — encoder vs decoder stacks and where embedding models fit
- LLM tokenization explained — token counts, chunk sizing, and multilingual embedding quirks
- LLM evaluation and benchmarking — measuring retrieval recall and answer quality on golden sets