Vector databases explained: embeddings, ANN search, and RAG indexes

Keyword search finds documents that contain the words you typed. Semantic search finds documents that mean the same thing — even when the vocabulary differs. A vector database stores high-dimensional embeddings (dense numeric vectors produced by a model) and answers "what is closest to this query vector?" at scale. That capability powers retrieval-augmented generation (RAG), recommendation systems, image similarity, and deduplication pipelines. This guide explains how vectors are compared, why exact search breaks at millions of rows, how approximate nearest-neighbor (ANN) indexes work, and how to choose and size a store for production retrieval.

From text to vectors

An embedding model maps text, images, or audio into a fixed-length float array, often 384, 768, 1,024, or 1,536 dimensions. Similar inputs land near each other in that space. Encoder-only transformers (BERT-class models) and modern embedding APIs (OpenAI, Cohere, Voyage, open-weight models like nomic-embed-text) are the usual sources; see our transformer architecture guide for how encoders differ from chat LLMs.

Ingestion typically has three steps: split documents into chunks (paragraphs or fixed token windows; chunk sizing ties directly to how text is tokenized), run each chunk through the embedding model, and upsert the vector plus metadata (source URL, title, ACL group, timestamp) into the index. At query time you embed the user's question with the same model and search for nearest neighbors. Mixing embedding models between ingest and query destroys recall, because vectors from different models do not share a coordinate space.
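A minimal sketch of that ingest-and-query loop, assuming a toy stand-in for the embedding model. The names `toy_embed`, `chunk_text`, `upsert`, `query`, and the in-memory `index` list are all hypothetical; a real pipeline would call an actual embedding model and write to a real vector store.

```python
import hashlib
import math

DIM = 256  # toy dimensionality chosen for the sketch

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk_text(text: str, max_words: int = 8) -> list[str]:
    """Split into fixed-size word windows (a crude proxy for token windows)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

index: list[dict] = []  # in-memory stand-in for a vector store

def upsert(doc_id: str, text: str, metadata: dict) -> None:
    """Chunk, embed, and store each chunk with its metadata."""
    for n, chunk in enumerate(chunk_text(text)):
        index.append({"id": f"{doc_id}#{n}", "vector": toy_embed(chunk),
                      "text": chunk, **metadata})

def query(question: str, k: int = 3) -> list[dict]:
    """Embed the question with the SAME model used at ingest, then rank by dot product."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda r: -sum(a * b for a, b in zip(q, r["vector"])))
    return ranked[:k]

upsert("refunds", "Refunds are issued within five business days of the return", {"source": "faq"})
upsert("shipping", "Orders ship from the warehouse every weekday morning", {"source": "faq"})
top = query("when are refunds issued", k=1)
print(top[0]["id"], "|", top[0]["text"])
```

The point of the sketch is the symmetry: `upsert` and `query` both go through `toy_embed`, so swapping the model on one side only would put the two sets of vectors in unrelated spaces.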

Distance metrics: cosine, dot product, L2

Most stores expose three related metrics:

Cosine similarity — measures the angle between vectors, ignoring magnitude. Standard for text embeddings that are L2-normalized. Values range from -1 to 1; higher is more similar.
Dot product — equivalent to cosine when vectors are unit-length, but faster on hardware that fuses multiply-add. Some models ship unnormalized vectors where dot product is the training objective (e.g. certain contrastive setups).
Euclidean (L2) distance — straight-line distance. Works when magnitude carries signal (some image embeddings); for text, cosine is usually safer.

Pick one metric at index creation and keep it consistent. Most stores bind the metric to the index, so switching from cosine to dot product later means rebuilding the index over the entire corpus, which is an expensive mistake.
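The three metrics listed above can be compared on a small example. This is a plain-Python sketch with hypothetical helper names (`dot`, `cosine`, `l2`, `normalize`); it also checks the identity that, for unit-length vectors, squared L2 distance equals 2 − 2·cosine, which is why all three metrics rank normalized vectors the same way.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Dot product divided by both magnitudes: angle only
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print("cosine(a, b):      ", cosine(a, b))
print("dot(a, b):         ", dot(a, b))      # grows with magnitude
print("dot(ua, ub):       ", dot(ua, ub))    # matches cosine once vectors are unit length
print("l2(ua, ub)^2:      ", l2(ua, ub) ** 2)
print("2 - 2*cosine(a, b):", 2 - 2 * cosine(a, b))
```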

Why brute force fails — and what ANN does instead

Exact k-nearest-neighbor search compares the query against every vector. That is O(n) distance computations per query, each costing O(d) for d-dimensional vectors: fine for ten thousand rows, painful at ten million. Production systems use approximate nearest neighbor (ANN) indexes that trade a small amount of recall for orders-of-magnitude speedup.
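For reference, exact search is only a few lines. The sketch below uses a hypothetical `exact_knn` helper over random data; every query touches every stored vector, which is the cost ANN indexes try to avoid.

```python
import heapq
import random

def exact_knn(query, vectors, k):
    """Exact k-NN by scanning every vector once; returns indices, nearest first."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    # nsmallest tracks only the k best candidates while it scans
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: sq_dist(vectors[i]))

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(16)] for _ in range(2000)]
query = vectors[42]                      # query with a stored vector
print(exact_knn(query, vectors, k=3))    # index 42 comes first, at distance zero
```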

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph: upper layers provide long jumps across the space; lower layers refine the search locally. Queries start at an entry point and greedily walk toward the query vector. Parameter M caps the number of neighbors linked per node, efConstruction sets how many candidates are considered while building, and efSearch sets how many are tracked at query time; raising any of them improves recall at the cost of memory, build time, or latency. HNSW is the default or a first-class index type in Qdrant, Weaviate, Milvus, pgvector (with the hnsw access method), and many managed offerings.

Strengths: excellent recall at low latency, no training phase. Weaknesses: memory-heavy (the graph lives in RAM for best performance), costly to update at very high write rates — deletes and heavy churn may require periodic rebuilds.
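The greedy walk can be illustrated with a deliberately simplified sketch: a single-layer neighbor graph searched best-first with an `ef`-sized result set. This is not full HNSW. It omits the upper layers and the incremental insert procedure, and it builds links by brute force, so treat `build_knn_graph` and `graph_search` as hypothetical teaching helpers rather than any library's implementation.

```python
import heapq
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_knn_graph(vectors, m):
    """Link each node to its m nearest neighbors (bidirectional), by brute force."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        near = heapq.nsmallest(m + 1, range(len(vectors)),
                               key=lambda j: sq_dist(v, vectors[j]))
        for j in near:
            if j != i:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def graph_search(query, vectors, graph, entry, k, ef):
    """Best-first walk from `entry`, keeping the ef closest nodes seen so far."""
    d0 = sq_dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    best = [(-d0, entry)]        # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # closest candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = sq_dist(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # evict the current worst
    ordered = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ordered][:k]

random.seed(1)
vectors = [[random.random() for _ in range(8)] for _ in range(500)]
graph = build_knn_graph(vectors, m=8)
query = [random.random() for _ in range(8)]

approx = graph_search(query, vectors, graph, entry=0, k=5, ef=40)
exact = heapq.nsmallest(5, range(len(vectors)), key=lambda j: sq_dist(query, vectors[j]))
print("recall@5:", len(set(approx) & set(exact)) / 5)
```

Raising `ef` widens the set of nodes the walk is willing to keep, which is the same recall-versus-latency trade that efSearch controls.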

IVF (Inverted File Index)

IVF clusters vectors into nlist buckets (via k-means at build time). At query time only the nearest few centroids are searched — fast, but recall drops if the true neighbor sits in a bucket you skipped. IVF pairs well with product quantization (PQ) to compress vectors and shrink RAM. FAISS popularized IVF-PQ; Milvus and some cloud services expose similar options.
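A small pure-Python sketch of the IVF idea, without product quantization. The helper names (`kmeans`, `build_ivf`, `ivf_search`) are hypothetical, and `nprobe` borrows FAISS's name for the number of buckets searched. The k-means here is a basic Lloyd's iteration, not a tuned implementation.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))

def kmeans(vectors, nlist, iters=10, seed=0):
    """Basic Lloyd's k-means; returns nlist centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        buckets = [[] for _ in range(nlist)]
        for v in vectors:
            buckets[nearest_centroid(v, centroids)].append(v)
        for i, members in enumerate(buckets):
            if members:  # keep the old centroid if a bucket went empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, nlist):
    """Assign every vector index to the inverted list of its nearest centroid."""
    centroids = kmeans(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for idx, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: sq_dist(query, centroids[i]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

random.seed(2)
vectors = [[random.random() for _ in range(8)] for _ in range(600)]
centroids, lists = build_ivf(vectors, nlist=12)
query = [random.random() for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:5]
for nprobe in (1, 3, 12):
    found = ivf_search(query, vectors, centroids, lists, k=5, nprobe=nprobe)
    print(f"nprobe={nprobe}: recall@5 = {len(set(found) & set(exact)) / 5}")
```

With `nprobe` equal to `nlist` every bucket is scanned and the result matches exact search; smaller values scan fewer vectors and can miss neighbors that fell into a skipped bucket.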

Rule of thumb: HNSW for interactive RAG where recall matters; IVF-PQ for billion-scale offline batch jobs where you can tolerate tuning and occasional misses.

Metadata filtering and hybrid search

Pure vector search ignores structure. Real apps need filters: "only docs this user can read," "published after 2025," "product category = shoes." Most vector databases support pre-filtering (apply metadata constraints before ANN) or post-filtering (retrieve top-k then discard — risky if k is small and many neighbors fail the filter).
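The post-filtering risk is easy to reproduce. In this sketch the documents, the `group` field, and both helper functions are hypothetical, and the pre-filter scans the permitted subset exactly; real stores usually apply the filter inside the ANN traversal instead.

```python
import heapq

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

docs = [
    {"id": "f1", "vec": [0.1, 0.0], "group": "finance"},
    {"id": "f2", "vec": [0.0, 0.2], "group": "finance"},
    {"id": "f3", "vec": [0.3, 0.0], "group": "finance"},
    {"id": "p1", "vec": [1.0, 0.0], "group": "public"},
    {"id": "p2", "vec": [0.0, 2.0], "group": "public"},
]

def post_filter(query, docs, k, allowed):
    """Take the top k first, then drop anything the caller may not read."""
    top = heapq.nsmallest(k, docs, key=lambda d: sq_dist(query, d["vec"]))
    return [d["id"] for d in top if d["group"] in allowed]

def pre_filter(query, docs, k, allowed):
    """Restrict to readable docs first, then take the top k among them."""
    visible = [d for d in docs if d["group"] in allowed]
    top = heapq.nsmallest(k, visible, key=lambda d: sq_dist(query, d["vec"]))
    return [d["id"] for d in top]

query = [0.0, 0.0]
print(post_filter(query, docs, k=3, allowed={"public"}))  # [] : all three nearest are finance docs
print(pre_filter(query, docs, k=3, allowed={"public"}))   # ['p1', 'p2']
```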

Hybrid retrieval combines dense vectors with sparse lexical scores (BM25 / full-text). A query like error code 0xC0000005 or a rare SKU often matches keywords better than semantics; "how do refunds work?" benefits from embeddings. Common patterns:

Run BM25 and vector search in parallel, merge with reciprocal rank fusion (RRF)
Weighted linear combination after normalizing scores
Two-stage: cheap hybrid recall of top 100, then a cross-encoder reranker on top 20
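Of these patterns, reciprocal rank fusion is the simplest to sketch because it needs only ranks, not comparable scores. Each document earns 1 / (k + rank) from every list it appears in. The function name and document IDs below are hypothetical, and k = 60 is a commonly used constant rather than a required value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: -scores[d])

bm25_hits = ["doc3", "doc1", "doc7"]      # lexical ranking, best first
vector_hits = ["doc1", "doc5", "doc3"]    # dense ranking, best first

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc1', 'doc3', 'doc5', 'doc7']
```

Documents that appear in both lists (doc1, doc3) rise above those found by only one retriever, without any score normalization step.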

pgvector on PostgreSQL can pair the tsvector full-text index with HNSW in one database — attractive when you already run Postgres and want ACID transactions around document metadata. Dedicated stores (Pinecone, Qdrant, Weaviate) ship hybrid APIs and horizontal scaling as managed features.

Choosing a vector store

Option	Best when	Watch out for
pgvector (Postgres extension)	Small to mid-size corpora (under a few million vectors), strong consistency needs, team already on Postgres	Single-node RAM limits; tune `maintenance_work_mem` for HNSW builds; not a separate search cluster
Qdrant / Weaviate (self-hosted or cloud)	Dedicated RAG service, rich filtering, hybrid search, moderate ops appetite	Another system to monitor, backup, and version-upgrade
Pinecone / managed cloud	Fastest path to production, variable load, minimal infra team	Vendor lock-in, per-dimension pricing surprises at scale
OpenSearch / Elasticsearch kNN	Existing ES cluster, logs + docs in one place	ANN tuning is less ergonomic than purpose-built vector DBs
In-memory FAISS / hnswlib	Prototypes, batch offline jobs, embedded in a single process	No durability, no multi-tenant filtering — not a database

Start simple: pgvector or a managed free tier until retrieval quality — not index throughput — becomes the bottleneck. Prematurely sharding a vector cluster before you have eval metrics is a common waste of engineering time.

Sizing and operating a RAG index

Back-of-envelope memory: vectors dominate. One million 768-dimensional float32 vectors take 1,000,000 × 768 × 4 bytes ≈ 3 GB for the raw vectors alone; an HNSW index is often budgeted at 1.5–2× that once graph links and bookkeeping are included. Plan headroom for metadata, replicas, and OS page cache.
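The arithmetic is worth scripting so it can be rerun for other corpus sizes. This sketch uses a hypothetical `raw_vector_bytes` helper and treats the 1.5–2× figure as a multiplier on raw size, which is an assumption about how to read that rule of thumb; actual overhead depends on the store and its HNSW parameters.

```python
def raw_vector_bytes(n_vectors, dim, bytes_per_value=4):
    """Bytes needed for the vectors alone (float32 = 4 bytes per value)."""
    return n_vectors * dim * bytes_per_value

raw = raw_vector_bytes(1_000_000, 768)
print(f"raw vectors: {raw / 1e9:.2f} GB")                 # raw vectors: 3.07 GB
for factor in (1.5, 2.0):
    print(f"budget at {factor}x: {raw * factor / 1e9:.2f} GB")
```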

Ingest and freshness

Batch embed overnight for static docs; stream updates for wikis that change hourly. Version your index when you change chunking strategy or embedding model — run dual indexes during migration and cut over when retrieval eval passes on a golden question set.
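A golden-set gate for the cutover can be very small. The sketch below assumes each golden question is paired with the ID of one chunk that should be retrieved; `hit_rate_at_k`, the two fake search functions, and all IDs are hypothetical stand-ins for calls against the old and new indexes.

```python
def hit_rate_at_k(search, golden, k=5):
    """Fraction of golden questions whose expected chunk ID appears in the top k."""
    hits = sum(1 for question, expected_id in golden
               if expected_id in search(question)[:k])
    return hits / len(golden)

golden = [
    ("how do refunds work", "refunds#0"),
    ("reset my password", "auth#2"),
    ("shipping time", "ship#1"),
]

# Canned results standing in for real index queries
old_results = {"how do refunds work": ["refunds#0", "faq#4"],
               "reset my password": ["faq#9"],
               "shipping time": ["ship#1"]}
new_results = {"how do refunds work": ["refunds#0"],
               "reset my password": ["auth#2", "faq#9"],
               "shipping time": ["ship#3", "ship#1"]}

def old_search(question):
    return old_results[question]

def new_search(question):
    return new_results[question]

old_score = hit_rate_at_k(old_search, golden)
new_score = hit_rate_at_k(new_search, golden)
print(f"old index: {old_score:.2f}, new index: {new_score:.2f}")
print("cut over" if new_score >= old_score else "keep old index")
```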

Query parameters

top_k — how many chunks are returned to the LLM (often 5–20). Too few misses context; too many dilutes the prompt and burns tokens.
score threshold — drop neighbors below a similarity floor to reduce hallucination triggers from irrelevant chunks.
efSearch / probes — raise ANN search breadth when recall@k drops in eval; costs latency.
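The first two parameters are often applied together after the index returns its candidates. A minimal sketch, assuming hits arrive as (chunk_id, similarity) pairs where higher is more similar; `select_context` and the 0.3 floor are illustrative choices, not recommended defaults.

```python
def select_context(hits, top_k=5, min_score=0.3):
    """Keep at most top_k hits whose similarity clears min_score, best first."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [h for h in ranked if h[1] >= min_score][:top_k]

hits = [("c1", 0.82), ("c2", 0.31), ("c3", 0.12), ("c4", 0.77), ("c5", 0.05)]
print(select_context(hits, top_k=3, min_score=0.3))
# [('c1', 0.82), ('c4', 0.77), ('c2', 0.31)]
```

A sensible floor depends on the embedding model and metric, so it is usually picked from the score distribution on an eval set rather than fixed up front.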

Observability

Log query embedding latency, ANN search ms, filter selectivity, and which document IDs were retrieved. When users report wrong answers, inspect retrieved chunks first — most RAG failures are retrieval failures, not generation failures. Pair index metrics with end-to-end answer grading on held-out questions.
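One way to capture those fields is a thin wrapper that times each stage and emits a structured log line per query. This is a sketch under assumptions: `logged_search`, `fake_embed`, and `fake_search` are hypothetical, hits are assumed to be dicts with an `id` key, and filter selectivity is left out because it depends on what the store reports.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("retrieval")

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def logged_search(question, embed, search, k=5):
    """Embed, search, and log latencies plus the retrieved document IDs."""
    vector, embed_ms = timed(embed, question)
    hits, search_ms = timed(search, vector, k)
    record = {
        "question": question,
        "embed_ms": round(embed_ms, 2),
        "search_ms": round(search_ms, 2),
        "retrieved_ids": [h["id"] for h in hits],
    }
    log.info(json.dumps(record))
    return hits, record

# Hypothetical stand-ins for a real embedding call and index query
def fake_embed(text):
    return [float(len(text))]

def fake_search(vector, k):
    return [{"id": "refunds#0"}, {"id": "faq#4"}][:k]

hits, record = logged_search("how do refunds work?", fake_embed, fake_search, k=2)
```

Keeping the retrieved IDs in the log is what makes the "inspect retrieved chunks first" step possible after a user reports a wrong answer.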

Common pitfalls

Chunk boundaries split answers — a table row cut in half embeds poorly; use structure-aware chunking for HTML/PDF
Stale embeddings — docs that are updated but not re-embedded return old content to the model
Duplicate chunks — near-identical paragraphs crowd out diverse context in top_k
Ignoring ACL metadata — post-filtering with k=5 leaks nothing but can return empty context when all five neighbors fail the access check; pre-filter or raise k
Evaluating only on cosine scores — high similarity does not guarantee the chunk answers the question; measure downstream LLM accuracy