
RAG explained: retrieval-augmented generation for LLM apps

A language model trained last year does not know your company's pricing page, yesterday's incident postmortem, or the Solana program you deployed this morning. You could paste everything into the prompt — until you hit the context window ceiling — or you can use retrieval-augmented generation (RAG): fetch the few paragraphs that matter, inject them into the prompt, and let the model answer from evidence instead of imagination. RAG is the default architecture for "chat with your docs" products, support bots, and internal knowledge assistants. This guide walks through how it works, where it breaks, and what to build first.

The problem RAG solves

Large language models are excellent at language, reasoning, and synthesis — but they are not live databases. Their weights freeze at training time. Fine-tuning on your corpus helps for stable knowledge, but it is expensive, slow to update, and risky when documents change weekly.

RAG splits the job into two steps:

  1. Retrieve — given a user question, find the most relevant chunks from an external knowledge base.
  2. Generate — pass those chunks plus the question to the LLM and ask it to answer using only the provided context.

When retrieval works, answers cite real sources, hallucinations drop, and you can refresh knowledge by re-indexing documents without retraining the model. When retrieval fails, the model still sounds confident — which is why evaluation and observability matter as much as the embedding model you pick.

The RAG pipeline, step by step

1. Ingest and chunk documents

Raw PDFs, HTML pages, markdown files, and database exports are too large to embed whole. You split them into chunks — typically 200–800 tokens with 10–20% overlap between adjacent chunks so sentences split across boundaries are not lost.
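A minimal sketch of fixed-size chunking with overlap, to make the windowing concrete. The function name chunk_tokens and its defaults (400 tokens, 60 tokens of overlap, i.e. 15%) are illustrative assumptions, and whitespace splitting stands in for whatever tokenizer your embedding model actually uses.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400, overlap: int = 60) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting as a stand-in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is long enough.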

Chunking strategy affects retrieval quality more than most teams expect:

  • Fixed-size splits — simple, but can cut tables and code blocks in half.
  • Structure-aware splits — respect headings, paragraphs, and list items; better for documentation.
  • Semantic chunking — break when embedding similarity between sentences drops; higher cost, often better recall on long prose.

Store metadata with every chunk: source URL, title, section heading, last-updated timestamp, and access-control tags. You will filter on these at query time.
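One way to combine structure-aware splitting with per-chunk metadata is sketched below. The Chunk dataclass, its field names, and split_by_headings are hypothetical; the splitter only understands markdown "#" headings and would mistake a "#" comment inside a code block for a heading, so treat it as a starting point rather than a parser.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_url: str
    title: str
    section: str          # nearest heading above the text
    updated_at: str       # e.g. an ISO 8601 timestamp
    acl_tags: frozenset[str] = frozenset()


def split_by_headings(markdown: str, source_url: str, title: str,
                      updated_at: str, acl_tags: frozenset[str] = frozenset()) -> list[Chunk]:
    """Emit one chunk per heading-delimited section, carrying page metadata."""
    chunks: list[Chunk] = []
    buffer: list[str] = []
    section = title  # text before the first heading is filed under the page title

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(text, source_url, title, section, updated_at, acl_tags))
        buffer.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section under its own heading
            section = line.lstrip("#").strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Sections that come out longer than your chunk budget can then be split again with a fixed-size window, keeping the same metadata on each piece.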

2. Embed chunks into vectors

An embedding model maps each chunk to a dense vector (often 384–3,072 dimensions). Chunks with similar meaning map to nearby vectors, with closeness usually measured by cosine similarity. Open models (e.g. BGE, E5) and hosted APIs (OpenAI, Cohere, Voyage) all work; pick one and embed both queries and documents with the same model and version, because vectors produced by different models are not comparable.
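For reference, cosine similarity is just a normalized dot product. The sketch below uses plain Python and made-up three-dimensional vectors; the variable names and values are purely illustrative, not output from any real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): real embeddings have hundreds of dimensions.
refund = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
shipping = [0.0, 0.2, 0.9]
print(cosine_similarity(refund, money_back) > cosine_similarity(refund, shipping))  # True
```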

Vectors live in a vector store — Pinecone, Weaviate, pgvector in Postgres, Qdrant, or Chroma for prototypes. The store supports approximate nearest neighbor (ANN) search: given a query vector, return the top-k closest chunk IDs in milliseconds even across millions of rows.
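To see what the vector store is approximating, here is an exact brute-force top-k search in plain Python. It scores every stored vector, which is fine for a few thousand chunks and useful as a correctness baseline; ANN indexes exist to avoid this full scan. The function name exact_top_k and the chunk IDs are hypothetical.

```python
import heapq
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def exact_top_k(query: list[float], index: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Score every stored vector and return the k best (chunk_id, score) pairs."""
    scored = ((_cosine(query, vec), chunk_id) for chunk_id, vec in index.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]


# Toy index (assumption): chunk ID -> embedding.
index = {
    "pricing#enterprise": [0.9, 0.1, 0.0],
    "pricing#refunds": [0.7, 0.3, 0.1],
    "ops#postmortem": [0.0, 0.2, 0.9],
}
print(exact_top_k([1.0, 0.0, 0.0], index, k=2))
```

Comparing an ANN index against this exact result on a sample of queries is a simple way to check how much recall the approximation costs.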

3. Retrieve at query time

When a user asks a question:

  1. Embed the query with the same model used at index time.
  2. Run ANN search to fetch top-k chunks (often k = 5–20).
  3. Optionally rerank candidates with a cross-encoder that scores query–passage pairs more accurately than cosine similarity alone.
  4. Trim to fit the context budget — see our context windows guide for token accounting.
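The trimming step can be sketched as a greedy pass over the ranked chunks. The name trim_to_budget is hypothetical, and the default token counter (whitespace word count) is a rough assumption; swap in the tokenizer of the model you call.

```python
from typing import Callable


def trim_to_budget(
    ranked_chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop here so a lower-ranked chunk never displaces a better one
        kept.append(chunk)
        used += cost
    return kept


print(trim_to_budget(["a b c", "d e", "f"], max_tokens=5))  # ['a b c', 'd e']
```

Stopping at the first chunk that does not fit keeps the ranking intact; skipping it and trying smaller chunks further down is a reasonable alternative if you prefer to fill the budget.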

Many production systems use hybrid retrieval: combine dense vector search with sparse keyword search (BM25). Vectors catch paraphrases ("refund policy" vs "money back guarantee"); keywords catch exact SKUs, error codes, and legal clause numbers that embeddings miss.
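The text above does not say how to combine the two result lists. One common choice, used here as an assumption, is reciprocal rank fusion (RRF): each chunk earns 1 / (k + rank) from every list it appears in, so no score normalization between BM25 and cosine similarity is needed. The constant k = 60 is the conventional default, and the chunk IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first ID lists into one list ordered by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense_hits = ["a", "b", "c"]   # from vector search (placeholder IDs)
sparse_hits = ["c", "a", "d"]  # from BM25 keyword search (placeholder IDs)
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # ['a', 'c', 'b', 'd']
```

Chunks found by both retrievers rise to the top, while a chunk found by only one (an exact SKU match, say) still makes the merged list.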

4. Augment the prompt and generate

The final prompt typically includes system instructions ("answer only from the context; say you don't know if evidence is missing"), the retrieved chunks with source labels, and the user question. The model generates an answer — ideally with inline citations you can map back to chunk metadata.
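A minimal sketch of the prompt assembly, assuming each retrieved chunk is a dict with "source" and "text" keys (a hypothetical shape). The instruction wording and the example passage are placeholders to adapt, not a recommended template.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble instructions, numbered context passages, and the question."""
    context = "\n\n".join(
        f"[{i}] source: {chunk['source']}\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer only from the context below. "
        "If the context does not contain the answer, say you don't know. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk for illustration only.
chunks = [{"source": "https://example.com/refunds",
           "text": "Refunds are accepted within 30 days."}]
print(build_prompt("What is the refund window?", chunks))
```

Numbering the passages gives the model a compact citation handle, and the number maps back to the chunk's metadata when you render the answer.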

Architecture choices that matter

When to re-index

Treat your vector index like a search index, not a cache. On document create, update, or delete, enqueue a job to re-chunk and re-embed the affected pages. A stale index produces confident wrong answers, which is worse than no RAG at all because users trust the citations.
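If your document source cannot emit change events, a periodic diff achieves a similar result. The sketch below is that fallback, not the event-driven queue itself: it assumes you stored a content hash per document at index time and compares it with the current text. The names content_hash and plan_reindex are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes: dict[str, str],
                 current_docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (doc IDs to re-chunk and re-embed, doc IDs to delete from the index).

    indexed_hashes: doc ID -> hash recorded when the doc was last indexed.
    current_docs:   doc ID -> current document text.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete


indexed = {"a": content_hash("x"), "b": content_hash("y"), "c": content_hash("z")}
current = {"a": "x", "b": "y (edited)", "d": "brand new page"}
print(plan_reindex(indexed, current))  # (['b', 'd'], ['c'])
```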

Access control

If some users may not see HR policies or unreleased specs, filter chunks before they reach the LLM. Embedding stores should carry tenant and permission metadata; never rely on the model to "forget" sensitive paragraphs it already read.
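A sketch of the permission check, assuming each chunk carries hypothetical "tenant_id" and "allowed_groups" metadata fields. It is written as a post-retrieval filter for clarity; in practice you would usually push the same condition into the vector store's metadata filter so the top-k is computed only over chunks the user may see.

```python
def filter_by_access(chunks: list[dict], tenant_id: str, user_groups: set[str]) -> list[dict]:
    """Drop every chunk the requesting user is not allowed to read."""
    return [
        chunk for chunk in chunks
        if chunk["tenant_id"] == tenant_id
        and user_groups & set(chunk["allowed_groups"])  # at least one shared group
    ]


# Placeholder chunks for illustration.
retrieved = [
    {"id": "handbook#pto", "tenant_id": "acme", "allowed_groups": ["everyone"]},
    {"id": "hr#salary-bands", "tenant_id": "acme", "allowed_groups": ["hr"]},
    {"id": "handbook#pto", "tenant_id": "globex", "allowed_groups": ["everyone"]},
]
visible = filter_by_access(retrieved, tenant_id="acme", user_groups={"everyone", "eng"})
print([chunk["id"] for chunk in visible])  # ['handbook#pto']
```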

Query transformation

Short follow-ups like "what about enterprise?" fail retrieval because, on their own, they carry neither the keywords nor the context established earlier in the conversation. Common fixes:

  • Query rewriting — use a small LLM call to expand the question using chat history.
  • HyDE (Hypothetical Document Embeddings) — generate a fake answer, embed it, search with that vector.
  • Multi-query — produce three paraphrased questions, merge retrieval results.
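The merge step of multi-query retrieval might look like the sketch below. Generating the paraphrases is an LLM call that is not shown; the retrieve callable and the function name multi_query_retrieve are assumptions, and round-robin interleaving is just one simple merge policy.

```python
from typing import Callable


def multi_query_retrieve(queries: list[str],
                         retrieve: Callable[[str], list[str]],
                         limit: int = 10) -> list[str]:
    """Run each paraphrase through `retrieve` and interleave the results.

    Takes every query's rank-1 hit first, then every rank-2 hit, and so on,
    skipping chunk IDs that were already collected.
    """
    results = [retrieve(query) for query in queries]
    merged: list[str] = []
    seen: set[str] = set()
    deepest = max((len(hits) for hits in results), default=0)
    for rank in range(deepest):
        for hits in results:
            if rank < len(hits) and hits[rank] not in seen:
                seen.add(hits[rank])
                merged.append(hits[rank])
    return merged[:limit]


# Stand-in retriever (assumption): a lookup table instead of a vector search.
fake_index = {
    "enterprise plan pricing": ["a", "b"],
    "cost of the enterprise tier": ["b", "c", "d"],
}
print(multi_query_retrieve(list(fake_index), fake_index.__getitem__))  # ['a', 'b', 'c', 'd']
```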

Agents vs single-shot RAG

Simple Q&A uses one retrieval pass. Agentic RAG lets the model decide when to search again, call tools, or browse — better for multi-hop questions ("compare Q3 revenue in the EU vs US subsidiaries") at the cost of latency and orchestration complexity.

Common failure modes

Symptom | Likely cause | Fix
Answer contradicts your docs | Wrong chunks retrieved; model ignores context | Improve chunking, add reranker, tighten system prompt, lower temperature
"I don't know" on easy questions | Chunks too small, k too low, or stale index | Increase overlap, raise k, run hybrid search, re-index
Slow responses | Large k, huge chunks, no ANN index | Pre-filter metadata, cache frequent queries, use HNSW/IVF indexes
Duplicate/conflicting chunks | Same doc indexed multiple times | Deduplicate by content hash; prefer canonical URLs
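Deduplicating by content hash can be as small as the sketch below. It assumes chunks are dicts with a "text" key and treats two chunks as duplicates when they match after collapsing whitespace and lowercasing; that normalization rule is an assumption, and near-duplicates with different wording will still slip through.

```python
import hashlib


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk for each distinct normalized text."""
    seen: set[str] = set()
    unique: list[dict] = []
    for chunk in chunks:
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Placeholder data: the same page indexed under two URLs.
chunks = [
    {"url": "https://example.com/docs/refunds", "text": "Refunds take 5 days."},
    {"url": "https://example.com/docs/refunds/", "text": "Refunds  take 5 days.\n"},
    {"url": "https://example.com/docs/shipping", "text": "Shipping is free."},
]
print(len(dedupe_chunks(chunks)))  # 2
```

Put the canonical URL first in the input so it is the copy that survives.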

Measure retrieval separately from generation. If the right chunk never appears in the top 10, no prompt engineering will save you — fix search first.

Evaluation without guesswork

Build a small golden dataset: 50–200 real user questions paired with the document IDs that should be retrieved and the facts the answer must include. Track:

  • Recall@k — did the correct chunk appear in the top k results?
  • MRR (mean reciprocal rank) — how high was the first relevant hit?
  • Answer faithfulness — does the generated text match the retrieved passages? (LLM-as-judge or human rubric)
  • Latency p95 — embed + search + rerank + generation end to end
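The two retrieval metrics are short enough to write by hand. In this sketch, recall_at_k returns the fraction of relevant chunks found in the top k; with a single correct chunk per question it reduces to the yes/no reading above. Function names and the sample data are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear among the first k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant hit, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


# Placeholder golden set: (retrieved IDs in rank order, IDs that should be found).
golden_runs = [
    (["a", "b"], {"b"}),   # first relevant hit at rank 2
    (["x"], {"x"}),        # first relevant hit at rank 1
    (["q"], {"z"}),        # miss
]
print(mean_reciprocal_rank(golden_runs))  # 0.5
```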

Log every query, retrieved chunk IDs, and final answer in production. When users thumbs-down a response, you can replay the retrieval trace and see whether search or generation failed.

RAG vs alternatives

Fine-tuning teaches style and task format; it is a poor substitute for volatile facts. Long-context stuffing works for small corpora that fit in one prompt but gets expensive and slow as documents grow. Tool use / APIs beat RAG when answers require live computation (balances, inventory counts) rather than prose in documents.

Most teams combine approaches: RAG for documentation, function calling for live data, and fine-tuning only for consistent tone or structured output formats.

Minimal viable RAG stack

For a first prototype you do not need a microservices maze:

  1. Parse markdown/HTML into heading-aware chunks.
  2. Embed with a hosted embedding API; store vectors in pgvector or a managed vector DB.
  3. On query: embed, top-5 search, stuff chunks into a GPT/Claude prompt with citation instructions.
  4. Ship to five internal users; collect failure cases; iterate chunk size and k before optimizing latency.

The hard part is not wiring the API calls — it is curating documents, chunking well, and measuring whether the right evidence shows up. Invest there first.
