LlamaIndex fundamentals explained
Most production LLM features eventually need a knowledge layer: PDFs, wikis, tickets, and database exports that users query in natural language. You can wire loaders, chunkers, embedders, and retrievers by hand — or use a framework built around that problem. LlamaIndex (formerly GPT Index) is the retrieval-first Python and TypeScript library for ingesting data into indexes, running query engines that retrieve and synthesize answers, and layering agents when multi-step reasoning is required. Where LangChain generalizes chains and tools across providers, LlamaIndex optimizes the load-chunk-embed-index-query loop with opinionated defaults, composable ingestion pipelines, and rich post-retrieval processing. This guide covers core data primitives, index types, query and chat engines, retrievers and rerankers, agents, observability, a Harbor Analytics policy knowledge base worked example, a framework decision table, common pitfalls, and a practitioner checklist.
What LlamaIndex is (and is not)
LlamaIndex is a data framework for LLM applications. It does not host
models; you plug in OpenAI, Anthropic, Cohere, Ollama, or any other embedding and completion
provider through Settings.llm and Settings.embed_model. Its
sweet spot is document-heavy workloads: internal wikis, support knowledge bases,
financial filings, and code repositories, where retrieval quality matters more than
latency tuning.
It is not a replacement for every LLM orchestration need. Simple chatbots without retrieval, rigid state machines, or teams standardized on MCP tool servers may need less framework. Reach for LlamaIndex when ingestion complexity, hybrid search, metadata filters, and answer synthesis over heterogeneous sources would otherwise sprawl across custom scripts. Pair it with RAG fundamentals and vector database choices — the framework glues pieces together; architecture still determines quality.
Core primitives
- Document — raw text plus metadata (source path, section, author).
- Node — a chunk derived from a document; the unit stored and retrieved.
- Index — data structure mapping nodes to retrieval strategies.
- Retriever — fetches relevant nodes for a query (vector, keyword, hybrid).
- Query engine — retriever + response synthesizer that returns a final answer.
- Chat engine — conversational wrapper with memory over the same index.
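To make the relationships concrete, here is a library-free sketch of how a document becomes nodes and how a retriever selects among them. The names Doc, Chunk, split_into_chunks, and keyword_retrieve are illustrative stand-ins, not LlamaIndex classes; real nodes also carry embeddings and links back to their source document.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a Document: raw text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Chunk:
    """Stand-in for a Node: a piece of a document that inherits its metadata."""
    text: str
    metadata: dict
    doc_index: int


def split_into_chunks(docs, max_words=8):
    """Split each document into fixed-size word windows (no overlap, for brevity)."""
    chunks = []
    for i, doc in enumerate(docs):
        words = doc.text.split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            chunks.append(Chunk(piece, dict(doc.metadata), i))
    return chunks


def keyword_retrieve(chunks, query, top_k=2):
    """Toy retriever: rank chunks by how many query words they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


docs = [
    Doc("Refunds are available within 30 days of purchase for annual plans.",
        {"source": "refunds.md"}),
    Doc("API rate limits reset every minute for all tiers.",
        {"source": "limits.md"}),
]
chunks = split_into_chunks(docs)
hits = keyword_retrieve(chunks, "refunds within 30 days")
print([(h.metadata["source"], h.text) for h in hits])
```

Because each chunk copies its parent's metadata, the retrieved text can always be traced back to a source file, which is what later makes citations and metadata filters possible.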
Ingestion: readers, transformations, and pipelines
LlamaIndex ships readers for PDF, HTML, Notion, Google Drive, databases, and more via
the llama-index-readers-* package family. A typical flow loads files into
Document objects, runs a transformation pipeline (sentence
splitting, metadata extraction, optional LLM summaries), and produces
TextNode instances ready for embedding.
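```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load every supported file under ./policies into Document objects
documents = SimpleDirectoryReader("./policies").load_data()
# Split into roughly 512-token nodes that share 64 tokens of overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
# Embed the nodes with Settings.embed_model and build an in-memory vector index
index = VectorStoreIndex(nodes)
```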
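The interplay between chunk size and overlap is easier to see on a toy input. The sketch below is a simplified model, assuming plain token windows; SentenceSplitter additionally tries to respect sentence boundaries, so its output will differ. The function name sliding_windows is hypothetical.

```python
def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Return windows of `chunk_size` tokens that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = sliding_windows(tokens, chunk_size=4, chunk_overlap=1)
for w in windows:
    print(w)
```

In this model each new window advances by chunk_size minus chunk_overlap tokens, so a 512/64 configuration would move forward 448 tokens per chunk.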
For production, use IngestionPipeline with cached transformations so
re-indexing skips unchanged files. Attach metadata — product line, effective date,
jurisdiction — at parse time; filters at query time depend on it. Follow our
chunking strategies guide
for overlap and structure-aware splits; LlamaIndex’s
SemanticSplitterNodeParser and HierarchicalNodeParser help
when fixed token windows break tables and headings.
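The idea behind skipping unchanged files can be sketched with content hashes. This is a minimal illustration under the assumption that a path-to-hash map is persisted between runs; it is not how IngestionPipeline implements its cache, and plan_ingestion is a hypothetical helper.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(files, seen_hashes):
    """Return the paths that need (re)processing and the updated hash cache.

    `files` maps path -> current text; `seen_hashes` maps path -> hash
    recorded by the previous run.
    """
    to_process = []
    updated = dict(seen_hashes)
    for path, text in files.items():
        digest = content_hash(text)
        if seen_hashes.get(path) != digest:
            to_process.append(path)
            updated[path] = digest
    return to_process, updated


first_run, hash_cache = plan_ingestion(
    {"refunds.md": "30 days", "limits.md": "100 rpm"}, seen_hashes={}
)
second_run, hash_cache = plan_ingestion(
    {"refunds.md": "45 days", "limits.md": "100 rpm"}, seen_hashes=hash_cache
)
print(first_run)
print(second_run)
```

Only refunds.md is reprocessed on the second run, so parse and embedding cost scales with what changed rather than with corpus size.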
LlamaParse (LlamaIndex Cloud) handles complex PDFs with tables and multi-column layouts that naive extractors garble. Budget for parse cost on large corpora; cache parsed Markdown locally and version it alongside source files.
Index types and storage backends
The default starting point is VectorStoreIndex: embed every node, store vectors in memory or an external store (Pinecone, Weaviate, pgvector, Qdrant, Chroma). At query time, embed the question, fetch top-k similar nodes, pass them to a response synthesizer.
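The query-time step (embed the question, fetch the top-k most similar nodes) reduces to a similarity ranking. The sketch below assumes cosine similarity and uses made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and a vector database replaces the linear scan with an approximate index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, chosen only to make the ranking obvious.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "rate-limits": [0.0, 0.2, 0.9],
    "billing-faq": [0.7, 0.6, 0.1],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
print(results)
```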
Other index patterns matter at scale:
- SummaryIndex — sequential summarization for “give me the gist of this repo” queries over entire documents.
- TreeIndex — hierarchical summaries for long narratives; higher build cost, useful for books and reports.
- KnowledgeGraphIndex — extracts subject-predicate-object triples for relationship queries; pairs with our knowledge graphs guide.
- Composable indices — route sub-questions to child indexes (per department, per product) and merge results.
Persist indexes to disk with storage_context or write directly into your
vector DB so multiple app instances share one corpus. Rebuild embeddings when you
change embedding models — dimensions and geometry shift; there is no safe
hot-swap.
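One way to guard against accidentally mixing embedding models is to record the model ID next to the collection and reject mismatched writes. The Collection class below is a hypothetical sketch of that guard, not a feature of any particular vector store.

```python
class Collection:
    """Toy vector collection that remembers which embedding model filled it."""

    def __init__(self, embed_model_id, dimension):
        self.embed_model_id = embed_model_id
        self.dimension = dimension
        self.vectors = {}

    def add(self, node_id, vector, embed_model_id):
        # Same dimension is not enough: two models can share a size
        # and still place texts in incompatible regions of the space.
        if embed_model_id != self.embed_model_id:
            raise ValueError(
                f"collection was built with {self.embed_model_id!r}, "
                f"got a vector from {embed_model_id!r}; rebuild instead of mixing"
            )
        if len(vector) != self.dimension:
            raise ValueError("vector dimension does not match the collection")
        self.vectors[node_id] = vector


collection = Collection("model-a", dimension=3)
collection.add("n1", [0.1, 0.2, 0.3], embed_model_id="model-a")
try:
    collection.add("n2", [0.1, 0.2, 0.3], embed_model_id="model-b")
except ValueError as err:
    print("rejected:", err)
```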
Query engines, synthesizers, and chat
A query engine is the production interface most teams expose:
```python
# `index` is the VectorStoreIndex built during ingestion
query_engine = index.as_query_engine(similarity_top_k=6)
response = query_engine.query("What is the refund window for enterprise plans?")
print(response.source_nodes)  # citations for audit
```
Under the hood: retrieve nodes, hand them to a response synthesizer
(for example TreeSummarize) that packs or condenses the text according to the
response mode, then call the LLM. Response modes trade latency for faithfulness:
- compact — stuff retrieved text into one prompt (fast, risks context overflow).
- tree_summarize — map-reduce summarization across chunks (slower, better for long evidence).
- refine — iterative answer refinement per chunk (highest quality, highest cost).
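The cost difference between these modes comes down to how many LLM calls each one makes. The sketch below counts calls with a stub model; it is a simplified picture (the library's compact mode, as I understand it, packs as many chunks as fit and falls back to refining across several prompts when they do not), and fake_llm plus the three functions are hypothetical stand-ins.

```python
def fake_llm(prompt):
    """Stand-in for a completion call; records the prompt and returns a label."""
    fake_llm.calls.append(prompt)
    return f"answer#{len(fake_llm.calls)}"


fake_llm.calls = []


def compact(question, chunks):
    """One call with every chunk stuffed into a single prompt."""
    context = "\n".join(chunks)
    return fake_llm(f"Context:\n{context}\nQuestion: {question}")


def refine(question, chunks):
    """One call per chunk; each call sees the previous draft answer."""
    answer = ""
    for chunk in chunks:
        answer = fake_llm(
            f"Draft: {answer}\nNew context: {chunk}\nQuestion: {question}"
        )
    return answer


def tree_summarize(question, chunks, fan_in=2):
    """Summarize groups of `fan_in` texts, then summarize the summaries."""
    level = list(chunks)
    while len(level) > 1:
        level = [
            fake_llm(f"Combine for '{question}':\n" + "\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
call_counts = {}
for mode in (compact, refine, tree_summarize):
    fake_llm.calls = []
    mode("What is the refund window?", chunks)
    call_counts[mode.__name__] = len(fake_llm.calls)
print(call_counts)
```

With four chunks the stub sees one call for compact, four for refine, and three for tree_summarize, which is why the latter two cost more as evidence grows.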
Chat engines add conversational memory: follow-up questions inherit
prior context while still retrieving fresh nodes. Use
CondensePlusContextChatEngine to rewrite ambiguous follow-ups (“What
about the EU?”) into standalone search queries before retrieval — critical
for multi-turn support bots.
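The condense step can be pictured as a rewrite that runs before retrieval. The sketch below uses stub functions in place of the model and retriever; the prompt wording is an assumption made for illustration, not the template the library ships.

```python
def build_condense_prompt(history, follow_up):
    """Assemble a rewrite prompt; the wording here is illustrative only."""
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the follow-up as a standalone question.\n"
        f"Conversation:\n{turns}\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


def condense_then_retrieve(history, follow_up, llm, retrieve):
    """Rewrite the follow-up first, then search with the rewritten query."""
    standalone = llm(build_condense_prompt(history, follow_up)).strip()
    return standalone, retrieve(standalone)


history = [
    ("user", "What is the data retention period in the US?"),
    ("assistant", "Audit logs are kept for 90 days."),
]
seen_queries = []


def stub_llm(prompt):
    # A real model would read the prompt; the stub returns a canned rewrite.
    return "What is the data retention period in the EU?\n"


def stub_retrieve(query):
    seen_queries.append(query)
    return ["eu-retention-policy"]


standalone, nodes = condense_then_retrieve(
    history, "What about the EU?", stub_llm, stub_retrieve
)
print(standalone)
```

The retriever never sees the ambiguous text "What about the EU?", only the rewritten question, which is the property that matters for multi-turn recall.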
Retrievers, postprocessors, and hybrid search
Swap the default retriever for finer control. VectorIndexRetriever accepts
similarity_top_k and metadata filters (MetadataFilters with
FilterOperator.EQ on doc_type or region).
Combine vector search with BM25 via QueryFusionRetriever for hybrid
recall — see our
hybrid search guide.
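A common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion. The sketch below shows the technique on its own, with the conventional constant k=60 and hypothetical node IDs; treat it as an illustration of the fusion idea rather than the library's exact code path.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids; each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda node_id: scores[node_id], reverse=True)


vector_hits = ["n3", "n1", "n7"]   # ranked by embedding similarity
bm25_hits = ["n1", "n9", "n3"]     # ranked by keyword score
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Nodes that appear in both lists rise to the top even when neither retriever ranked them first, and the method needs no score normalization because it only uses ranks.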
Postprocessors run after retrieval, before synthesis:
- SimilarityPostprocessor: drop nodes below a similarity threshold.
- CohereRerank / cross-encoder rerankers: reorder top-20 to best-5; often the highest ROI RAG upgrade.
- LongContextReorder: mitigate “lost in the middle” attention bias.
- SentenceEmbeddingOptimizer: trim redundant sentences from oversized chunks.
Wire postprocessors into the query engine:
```python
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve 20 candidates, then let the reranker keep the best 5
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[CohereRerank(top_n=5, model="rerank-english-v3.0")],
)
```
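Two of the simpler postprocessing ideas, a similarity cutoff and a reorder that keeps strong evidence away from the middle of the prompt, can be sketched without any library. The functions similarity_cutoff and edges_first are hypothetical illustrations of the concepts, not the implementations of SimilarityPostprocessor or LongContextReorder.

```python
def similarity_cutoff(scored_nodes, cutoff):
    """Keep (node_id, score) pairs whose score is at least `cutoff`."""
    return [pair for pair in scored_nodes if pair[1] >= cutoff]


def edges_first(scored_nodes):
    """Place the strongest nodes at the start and end, the weakest in the middle."""
    ranked = sorted(scored_nodes, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for i, pair in enumerate(ranked):
        (front if i % 2 == 0 else back).append(pair)
    return front + back[::-1]


retrieved = [("a", 0.91), ("b", 0.84), ("c", 0.77), ("d", 0.62), ("e", 0.30)]
kept = similarity_cutoff(retrieved, cutoff=0.5)
ordered = edges_first(kept)
print([node_id for node_id, _ in ordered])
```

After the cutoff drops node e, the two best nodes (a and b) sit at the edges of the context and the weaker ones (c and d) fill the middle.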
Evaluate changes with golden questions — our RAG evaluation guide covers hit rate, faithfulness, and citation accuracy metrics LlamaIndex callbacks can log.
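A golden-question check for retrieval can be as small as the sketch below. It computes hit rate, plus mean reciprocal rank as an additional metric that is my own suggestion rather than something the text above names; the question strings and node IDs are made up.

```python
def hit_rate_and_mrr(golden, retrieve, k=5):
    """Score a retriever against (question, expected_node_id) pairs."""
    if not golden:
        raise ValueError("golden set is empty")
    hits, reciprocal_ranks = 0, 0.0
    for question, expected_id in golden:
        ranked_ids = retrieve(question)[:k]
        if expected_id in ranked_ids:
            hits += 1
            reciprocal_ranks += 1.0 / (ranked_ids.index(expected_id) + 1)
    return hits / len(golden), reciprocal_ranks / len(golden)


# Canned retriever output standing in for a real index.
canned_results = {
    "refund window?": ["pol-3", "pol-8"],
    "retention period?": ["pol-1", "pol-7"],
    "rate limit?": ["pol-2", "pol-4"],
    "sso setup?": ["pol-5"],
}
golden = [
    ("refund window?", "pol-3"),     # found at rank 1
    ("retention period?", "pol-7"),  # found at rank 2
    ("rate limit?", "pol-9"),        # missed
    ("sso setup?", "pol-5"),         # found at rank 1
]
hit_rate, mrr = hit_rate_and_mrr(golden, canned_results.get)
print(hit_rate, mrr)
```

Running the same golden set before and after a chunking or reranking change turns "it feels better" into a comparable number.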
Agents, tools, and workflows
When a question requires multiple retrieval steps or external APIs, promote the index
to a tool inside a LlamaIndex agent or workflow. QueryEngineTool
wraps a query engine with a name and description the planner model reads. Agents use
ReAct-style loops similar to
tool-use patterns
but stay inside LlamaIndex’s event system.
Workflows (v0.10+) replace opaque agent executors with explicit
@step functions and typed events — closer to
LangGraph
for teams that need auditable control flow. Use workflows when you have branching logic
(classify intent, then route to billing vs. shipping index) rather than a single
catch-all retriever.
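The classify-then-route pattern can be sketched with plain functions and a typed event. Real workflows use decorated async steps and event classes from the library; the keyword classifier, the RoutedQuery event, and the lambda "engines" below are hypothetical stand-ins chosen so the control flow is visible.

```python
from dataclasses import dataclass


@dataclass
class RoutedQuery:
    """Event passed from the classify step to the answer step."""
    intent: str
    question: str


def classify(question):
    """Step 1: stand-in for an LLM classifier, using keywords."""
    text = question.lower()
    if any(word in text for word in ("invoice", "refund", "charge")):
        return RoutedQuery("billing", question)
    if any(word in text for word in ("delivery", "tracking", "shipment")):
        return RoutedQuery("shipping", question)
    return RoutedQuery("general", question)


def answer(event, engines):
    """Step 2: dispatch to the engine registered for the intent."""
    engine = engines.get(event.intent, engines["general"])
    return engine(event.question)


engines = {
    "billing": lambda q: f"[billing index] {q}",
    "shipping": lambda q: f"[shipping index] {q}",
    "general": lambda q: f"[general index] {q}",
}
intents = []
for question in ("Where is my refund?", "Tracking number for order 12?"):
    event = classify(question)
    intents.append(event.intent)
    print(answer(event, engines))
```

Because each step is an ordinary function with a typed input, the routing decision can be logged and tested separately from retrieval.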
Do not default to agents for every query. A plain query engine with good chunking and reranking beats an agent loop that burns tokens re-retrieving the same policy paragraph three times.
Observability, cost, and deployment
Enable LlamaDebugHandler or integrate OpenTelemetry / Langfuse /
Phoenix callbacks to trace retrieve and synthesize stages separately. Log embedding
token counts at ingest and completion tokens per query; spikes often mean chunk size
drift or a runaway refine synthesizer.
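Tracing the two stages separately does not require a particular vendor. The sketch below is a stdlib-only illustration of the idea, with a hypothetical StageTrace helper and word counts standing in for real token counts; a production setup would emit these records through the callback or tracing integration of your choice.

```python
import time
from contextlib import contextmanager


class StageTrace:
    """Collects one record per pipeline stage with timing and custom fields."""

    def __init__(self):
        self.records = []

    @contextmanager
    def stage(self, name, **fields):
        start = time.perf_counter()
        try:
            yield fields  # the caller can add fields while the stage runs
        finally:
            fields["stage"] = name
            fields["seconds"] = time.perf_counter() - start
            self.records.append(fields)


stage_trace = StageTrace()
with stage_trace.stage("retrieve") as info:
    nodes = ["n1", "n2", "n3"]  # pretend retrieval result
    info["node_count"] = len(nodes)
with stage_trace.stage("synthesize") as info:
    answer = "Refunds are available within 30 days."
    info["completion_words"] = len(answer.split())

for record in stage_trace.records:
    print(record["stage"], round(record["seconds"], 6), record)
```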
Deploy query engines behind FastAPI or as serverless functions with warm vector DB
connections. Stream responses for browser SSE by passing streaming=True to
as_query_engine, or by calling stream_chat on chat engines. Pin llama-index-core and integration package versions;
minor releases frequently move imports between llama_index subpackages.
For high QPS, cache frequent queries with our
semantic caching guide
patterns keyed on normalized question embeddings.
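A semantic cache returns a stored answer when a new question is close enough to an earlier one. The sketch below assumes a toy bag-of-words embedding and a 0.8 cosine threshold purely for illustration; a real cache would reuse the embedding model from the index and tune the threshold against false hits.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag of lowercase words (normalizes case and punctuation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, question):
        """Return the cached answer for the closest stored question, if close enough."""
        query = embed(question)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question, answer):
        self.entries.append((embed(question), answer))


answer_cache = SemanticCache()
answer_cache.put("What is the refund window for enterprise plans?", "30 days [doc-12]")
print(answer_cache.get("what is the refund window for enterprise plans"))
print(answer_cache.get("How do API rate limits work?"))
```

Cached answers go stale when the corpus changes, so in practice entries should be invalidated whenever the relevant documents are re-ingested.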
Worked example: Harbor Analytics policy knowledge base
Harbor Analytics sells compliance dashboards to fintech clients. Support engineers answer questions about SOC 2 controls, data retention, and API rate limits spread across 240 PDF policies, Confluence exports, and a Postgres changelog. They need cited answers, not guesses.
Architecture
A nightly IngestionPipeline run uses LlamaParse for PDFs, an HTML reader for wiki
dumps, and a SQL reader for changelog rows. SentenceSplitter chunks at 400 tokens with a
50-token overlap; metadata tags record product, effective_date, and
audience (internal or external, i.e. customer-facing). Nodes are embedded with
text-embedding-3-small and stored in pgvector. Two composable indexes:
customer_policies and internal_runbooks.
Query path
A classifier workflow (single LLM step) routes the question. Customer queries hit a
query engine with metadata filter audience=external,
similarity_top_k=15, Cohere rerank to top-5, and
compact synthesis with a system prompt requiring bracketed citations
[doc_id]. Internal escalation queries add the runbook index via
RouterQueryEngine. P95 latency: 4.2 seconds. Faithfulness eval on 150
golden questions holds above 0.91 after reranker addition (up from 0.78 vector-only).
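The customer query path (filter, retrieve, rerank, cite) can be sketched end to end with stubs. Everything below is hypothetical: the node dictionaries, the word-overlap scorer used for both stages, and the helper name answer_customer_query. In the architecture described above, the first stage would be vector search in pgvector and the second a Cohere reranker.

```python
import re


def overlap(question, node):
    """Toy relevance score: number of shared lowercase words."""
    q = set(re.findall(r"[a-z0-9]+", question.lower()))
    return len(q & set(re.findall(r"[a-z0-9]+", node["text"].lower())))


def answer_customer_query(question, nodes, score, rerank, top_k=15, top_n=5):
    """Filter by audience, keep top_k by first-stage score, rerank to top_n, cite."""
    allowed = [n for n in nodes if n["audience"] == "external"]
    candidates = sorted(allowed, key=lambda n: score(question, n), reverse=True)[:top_k]
    best = sorted(candidates, key=lambda n: rerank(question, n), reverse=True)[:top_n]
    citations = " ".join(f"[{n['doc_id']}]" for n in best)
    return best, citations


nodes = [
    {"doc_id": "pol-7", "audience": "external",
     "text": "Data retention is 90 days for audit logs."},
    {"doc_id": "run-2", "audience": "internal",
     "text": "Retention job runbook: purge audit logs nightly."},
    {"doc_id": "pol-9", "audience": "external",
     "text": "API rate limits are 600 requests per minute."},
]
best, citations = answer_customer_query(
    "How long is data retention for audit logs?", nodes, overlap, overlap, top_n=2
)
print(citations)
```

The internal runbook never reaches the ranking stages because the audience filter runs first, which is the behavior that keeps internal material out of customer answers.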
Operations
Re-ingestion triggers when the git SHA of a policy repo changes. Stale answers surface when
effective_date metadata predates the question context; to make that visible, the UI shows the
“policy version” taken from the top source node. Thumbs-down ratings from human reviewers go
into a spreadsheet that feeds weekly chunk and prompt tweaks, not automatic fine-tuning.
Framework decision table
| Need | Prefer | Why |
|---|---|---|
| Document-heavy Q&A, complex ingestion | LlamaIndex | Indexes, query engines, and postprocessors tuned for retrieval-first apps |
| Multi-provider chains, broad tool ecosystem | LangChain + LangGraph | LCEL composition, agent graphs, LangSmith tracing |
| Explicit stateful agent with checkpoints | LangGraph | Durable human-in-the-loop workflows over LlamaIndex tools |
| One static FAQ, under 50 pages | Raw SDK + single vector collection | Minimal dependencies; LlamaIndex overhead may not pay off |
| Graph traversal over entity relationships | KnowledgeGraphIndex or dedicated graph DB | Triple extraction and graph retrievers built in |
| Cross-product tool standardization | MCP server exposing retrieval | Tools portable to Claude, IDEs, and internal agents alike |
Common pitfalls
- Default chunk sizes: 1024-token splits across all corpora; tables and API docs need structure-aware parsers.
- No metadata filters: internal runbooks leak into customer answers because every node shares one flat index.
- Skipping reranking: vector top-5 alone misses nuance; a cross-encoder reranker is often the cheapest quality win.
- Agent overkill: ReAct loops for questions a single query engine answers in one retrieval pass.
- Embedding model changes without reindex: mixed vectors in one collection silently degrade recall.
- Ignoring source_nodes: shipping answers without citations; compliance teams cannot audit responses.
- Stuffing context: compact mode with top_k=30 overflows context and dilutes attention.
- Parse-on-every-request: re-parsing PDFs at query time instead of batch ingestion.
Practitioner checklist
- Pin llama-index-core, reader, and vector store integration versions.
- Run ingestion on a schedule; version parsed artifacts and embedding model IDs.
- Tag nodes with rich metadata for filterable retrieval at query time.
- Start with VectorStoreIndex; add composable or graph indexes only when metrics justify complexity.
- Enable hybrid retrieval or reranking before tuning prompt prose.
- Expose source_nodes or citations in every user-facing answer.
- Benchmark with golden questions after each ingest or chunking change.
- Stream chat responses; set timeouts on agent workflows.
- Redact PII at ingest; do not rely on the LLM to forget sensitive chunks.
- Revisit whether LangChain or raw SDK would simplify if retrieval is only 10% of your app.
Key takeaways
- LlamaIndex centers on ingestion, indexing, and query engines for retrieval-heavy LLM apps.
- Nodes and metadata are the contract between parse-time investment and query-time precision.
- Postprocessors and rerankers often beat prompt engineering for answer quality.
- Chat and agent layers sit on top of indexes; do not skip solid retrieval fundamentals.
- Pair LlamaIndex with LangGraph or MCP when orchestration complexity outgrows a single query engine.
Related reading
- LangChain fundamentals explained — LCEL chains and when to pair with LlamaIndex retrievers
- RAG explained — chunking, hybrid search, and retrieval architecture fundamentals
- LLM reranking explained — cross-encoders and post-retrieval reordering
- RAG evaluation explained — faithfulness, hit rate, and regression datasets