LlamaIndex fundamentals explained

Most production LLM features eventually need a knowledge layer: PDFs, wikis, tickets, and database exports that users query in natural language. You can wire loaders, chunkers, embedders, and retrievers by hand — or use a framework built around that problem. LlamaIndex (formerly GPT Index) is the retrieval-first Python and TypeScript library for ingesting data into indexes, running query engines that retrieve and synthesize answers, and layering agents when multi-step reasoning is required. Where LangChain generalizes chains and tools across providers, LlamaIndex optimizes the load-chunk-embed-index-query loop with opinionated defaults, composable ingestion pipelines, and rich post-retrieval processing. This guide covers core data primitives, index types, query and chat engines, retrievers and rerankers, agents, observability, a Harbor Analytics policy knowledge base worked example, a framework decision table, common pitfalls, and a practitioner checklist.

What LlamaIndex is (and is not)

LlamaIndex is a data framework for LLM applications. It does not host models; you plug in OpenAI, Anthropic, Cohere, Ollama, or any embedding and completion provider through Settings.llm and Settings.embed_model. Its sweet spot is document-heavy workloads: internal wikis, support knowledge bases, financial filings, and code repositories where retrieval quality dominates latency tuning.

It is not a replacement for every LLM orchestration need. Simple chatbots without retrieval, rigid state machines, or teams standardized on MCP tool servers may need less framework. Reach for LlamaIndex when ingestion complexity, hybrid search, metadata filters, and answer synthesis over heterogeneous sources would otherwise sprawl across custom scripts. Pair it with RAG fundamentals and vector database choices — the framework glues pieces together; architecture still determines quality.

Core primitives

  • Document — raw text plus metadata (source path, section, author).
  • Node — a chunk derived from a document; the unit stored and retrieved.
  • Index — data structure mapping nodes to retrieval strategies.
  • Retriever — fetches relevant nodes for a query (vector, keyword, hybrid).
  • Query engine — retriever + response synthesizer that returns a final answer.
  • Chat engine — conversational wrapper with memory over the same index.
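
To make the relationships between these primitives concrete, here is a toy model in plain Python. It is a sketch only: the Document and Node classes, split_into_nodes, and keyword_retrieve are hypothetical stand-ins written for this guide, not LlamaIndex's own classes, and the keyword overlap scoring replaces real embeddings so the example runs without any API key.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Node:
    text: str
    metadata: dict
    doc_index: int  # which source document this chunk came from


def split_into_nodes(docs, max_words=8):
    """Chunk each document into fixed-size word windows; nodes inherit metadata."""
    nodes = []
    for i, doc in enumerate(docs):
        words = doc.text.split()
        for start in range(0, len(words), max_words):
            chunk = " ".join(words[start:start + max_words])
            nodes.append(Node(chunk, dict(doc.metadata), i))
    return nodes


def keyword_retrieve(nodes, query, top_k=2):
    """Toy retriever: rank nodes by how many query words they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(n.text.lower().split())), n) for n in nodes]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [n for _, n in scored[:top_k]]


docs = [
    Document("Refunds are available within 30 days of purchase for all annual plans.",
             {"source": "refunds.md"}),
    Document("API rate limits reset every minute.", {"source": "api.md"}),
]
nodes = split_into_nodes(docs)
hits = keyword_retrieve(nodes, "refunds within 30 days")
```

A query engine would take the retrieved nodes one step further and hand them to an LLM for synthesis; the metadata carried on each node is what later makes citations and filters possible.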

Ingestion: readers, transformations, and pipelines

LlamaIndex ships readers for PDF, HTML, Notion, Google Drive, databases, and more via the llama-index-readers-* package family. A typical flow loads files into Document objects, runs a transformation pipeline (sentence splitting, metadata extraction, optional LLM summaries), and produces TextNode instances ready for embedding.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

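# Load every file under ./policies into Document objects
documents = SimpleDirectoryReader("./policies").load_data()
# Split into chunks of up to 512 tokens with 64 tokens of overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
# Embed the nodes with Settings.embed_model and build an in-memory vector index
index = VectorStoreIndex(nodes)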

For production, use IngestionPipeline with cached transformations so re-indexing skips unchanged files. Attach metadata — product line, effective date, jurisdiction — at parse time; filters at query time depend on it. Follow our chunking strategies guide for overlap and structure-aware splits; LlamaIndex’s SemanticSplitterNodeParser and HierarchicalNodeParser help when fixed token windows break tables and headings.
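
The idea behind cached re-indexing can be sketched with content hashes. This is an illustration of the principle, not IngestionPipeline's actual cache: plan_ingest and the in-memory cache dict are hypothetical names, and a real pipeline would persist the cache between runs.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingest(files: dict, cache: dict) -> list:
    """Return paths whose content changed since the last run and update the cache.

    files: path -> current text; cache: path -> hash recorded by the previous run.
    """
    changed = []
    for path, text in files.items():
        digest = content_hash(text)
        if cache.get(path) != digest:
            changed.append(path)
            cache[path] = digest
    return changed


cache = {}
first = plan_ingest({"a.md": "v1", "b.md": "v1"}, cache)   # everything is new
second = plan_ingest({"a.md": "v1", "b.md": "v2"}, cache)  # only b.md changed
```

Only the paths returned by plan_ingest need to be parsed, split, and embedded again, which is where most of the ingestion cost sits.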

LlamaParse (LlamaIndex Cloud) handles complex PDFs with tables and multi-column layouts that naive extractors garble. Budget for parse cost on large corpora; cache parsed Markdown locally and version it alongside source files.

Index types and storage backends

The default starting point is VectorStoreIndex: embed every node, store vectors in memory or an external store (Pinecone, Weaviate, pgvector, Qdrant, Chroma). At query time, embed the question, fetch top-k similar nodes, pass them to a response synthesizer.
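
The retrieval step described above reduces to a nearest-neighbor search. The sketch below uses hand-written three-dimensional vectors in place of real embeddings and a brute-force scan in place of a vector store; the node IDs and values are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), node_id) for node_id, vec in node_vecs.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Toy "embeddings"; a real index stores vectors produced by the embedding model.
node_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "rate_limits": [0.0, 0.2, 0.9],
    "retention": [0.5, 0.5, 0.1],
}
results = top_k([1.0, 0.0, 0.0], node_vecs, k=2)
```

External vector stores replace the brute-force scan with approximate nearest-neighbor indexes, but the contract stays the same: a query vector in, the k most similar nodes out.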

Other index patterns matter at scale:

  • SummaryIndex — sequential summarization for “give me the gist of this repo” queries over entire documents.
  • TreeIndex — hierarchical summaries for long narratives; higher build cost, useful for books and reports.
  • KnowledgeGraphIndex — extracts subject-predicate-object triples for relationship queries; pairs with our knowledge graphs guide.
  • Composable indices — route sub-questions to child indexes (per department, per product) and merge results.

Persist indexes to disk with storage_context or write directly into your vector DB so multiple app instances share one corpus. Rebuild embeddings when you change embedding models — dimensions and geometry shift; there is no safe hot-swap.

Query engines, synthesizers, and chat

A query engine is the production interface most teams expose:

# `index` is the VectorStoreIndex built during ingestion
query_engine = index.as_query_engine(similarity_top_k=6)
response = query_engine.query("What is the refund window for enterprise plans?")
print(response.source_nodes)  # retrieved nodes with scores; use them as citations for audit

Under the hood, the engine retrieves nodes, optionally filters or reranks them with node postprocessors, and then calls the LLM through a response synthesizer. The response mode decides how retrieved text is packed into prompts. Response modes trade latency for faithfulness:

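  • compact — pack retrieved chunks into as few prompts as fit the context window, refining across prompts when they spill over (fast; a large top-k adds LLM calls and dilutes attention).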
  • tree_summarize — map-reduce summarization across chunks (slower, better for long evidence).
  • refine — iterative answer refinement per chunk (highest quality, highest cost).
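
The cost difference between these modes is easiest to see by counting LLM calls. The sketch below is a simplified model of the three strategies, not LlamaIndex's synthesizer code: the stub LLM, the character budget standing in for a token limit, and the helper names are all assumptions made for illustration.

```python
def make_llm():
    """Stub LLM that records every prompt so calls can be counted."""
    calls = []

    def llm(prompt: str) -> str:
        calls.append(prompt)
        return f"answer#{len(calls)}"

    return llm, calls


def pack(chunks, budget):
    """Greedily concatenate chunks into groups of at most `budget` characters."""
    groups, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > budget:
            groups.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


def refine(llm, question, chunks):
    """One call per chunk, each call revising the previous draft."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Q: {question}\nDraft: {answer}\nContext: {chunk}")
    return answer


def compact(llm, question, chunks, budget):
    """Pack chunks first, then refine over the (fewer) packed prompts."""
    return refine(llm, question, pack(chunks, budget))


def tree_summarize(llm, question, chunks, budget):
    """Summarize packed groups, then summarize the summaries until one remains.

    Assumes each summary is shorter than its input, so the loop terminates.
    """
    level = pack(chunks, budget)
    while len(level) > 1:
        level = pack([llm(f"Q: {question}\nContext: {g}") for g in level], budget)
    return llm(f"Q: {question}\nContext: {level[0]}")


chunks = ["x" * 20] * 4  # four retrieved chunks of 20 characters each

llm, refine_calls = make_llm()
refine(llm, "q", chunks)

llm, compact_calls = make_llm()
compact(llm, "q", chunks, budget=45)

llm, tree_calls = make_llm()
tree_summarize(llm, "q", chunks, budget=45)
```

With four chunks and room for two per prompt, the stub records four calls for refine, two for compact, and three for tree_summarize. Real numbers depend on chunk size, top-k, and the model's context window.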

Chat engines add conversational memory: follow-up questions inherit prior context while still retrieving fresh nodes. Use CondensePlusContextChatEngine to rewrite ambiguous follow-ups (“What about the EU?”) into standalone search queries before retrieval — critical for multi-turn support bots.
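
The condense step can be sketched as a prompt built from chat history, followed by ordinary retrieval on the rewritten question. The template wording, function names, stubbed LLM, and sample conversation below are all hypothetical; LlamaIndex ships its own prompt and handles memory for you.

```python
CONDENSE_TEMPLATE = (
    "Given the conversation so far and a follow-up message, rewrite the follow-up "
    "as a standalone question.\n\nConversation:\n{history}\n\n"
    "Follow-up: {message}\nStandalone question:"
)


def build_condense_prompt(history, message):
    lines = [f"{role}: {text}" for role, text in history]
    return CONDENSE_TEMPLATE.format(history="\n".join(lines), message=message)


def condense_then_retrieve(llm, retriever, history, message):
    # With no history there is nothing to resolve, so skip the extra LLM call.
    query = llm(build_condense_prompt(history, message)) if history else message
    return query, retriever(query)


# Stubs keep the sketch runnable; the conversation content is invented.
history = [
    ("user", "What is the refund window for enterprise plans?"),
    ("assistant", "30 days from the invoice date."),
]
fake_llm = lambda prompt: "What is the refund window for enterprise plans in the EU?"
fake_retriever = lambda query: [f"node matching: {query}"]

query, nodes = condense_then_retrieve(fake_llm, fake_retriever, history, "What about the EU?")
```

The retriever never sees the bare follow-up, which on its own would embed close to nothing useful.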

Retrievers, postprocessors, and hybrid search

Swap the default retriever for finer control. VectorIndexRetriever accepts similarity_top_k and metadata filters (MetadataFilters with FilterOperator.EQ on doc_type or region). Combine vector search with BM25 via QueryFusionRetriever for hybrid recall — see our hybrid search guide.
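
One common way to merge vector and BM25 result lists is reciprocal rank fusion, which QueryFusionRetriever offers as one of its modes. The standalone sketch below shows the scoring rule only; the node IDs are made up, and k=60 is the conventional smoothing constant rather than a value taken from this guide.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list contributes 1 / (k + rank) per node."""
    scores = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda node_id: scores[node_id], reverse=True)


vector_hits = ["n3", "n1", "n7"]  # ranked by embedding similarity
bm25_hits = ["n1", "n9", "n3"]    # ranked by keyword match
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Nodes that appear in both lists (n1 and n3 here) rise to the top, and because only ranks are used, the two retrievers' incompatible score scales never need to be normalized.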

Postprocessors run after retrieval, before synthesis:

  • SimilarityPostprocessor — drop nodes below a similarity threshold.
  • CohereRerank / cross-encoder rerankers — reorder top-20 to best-5; often the highest ROI RAG upgrade.
  • LongContextReorder — mitigate “lost in the middle” attention bias.
  • SentenceEmbeddingOptimizer — trim redundant sentences from oversized chunks.
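
Two of these postprocessors are simple enough to sketch in plain Python. The functions below illustrate the ideas behind a similarity cutoff and long-context reordering; they are assumptions written for this guide rather than the library's code, and the scores are invented.

```python
def similarity_cutoff(scored_nodes, cutoff):
    """Drop nodes whose retrieval score falls below the cutoff."""
    return [(node, score) for node, score in scored_nodes if score >= cutoff]


def long_context_reorder(scored_nodes):
    """Place the strongest nodes at both ends and the weakest in the middle."""
    ordered = sorted(scored_nodes, key=lambda pair: pair[1])  # weakest first
    result = []
    for i, pair in enumerate(ordered):
        if i % 2 == 0:
            result.insert(0, pair)
        else:
            result.append(pair)
    return result


retrieved = [("a", 0.91), ("b", 0.84), ("c", 0.77), ("d", 0.52), ("e", 0.31)]
kept = similarity_cutoff(retrieved, cutoff=0.5)
reordered = long_context_reorder(kept)
```

After the cutoff removes e, the reorder yields b, d, c, a: the two highest-scoring nodes sit at the start and end of the prompt, where models tend to attend most reliably.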

Wire postprocessors into the query engine:

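# Requires the llama-index-postprocessor-cohere-rerank package and a Cohere API key
from llama_index.postprocessor.cohere_rerank import CohereRerank

query_engine = index.as_query_engine(
    similarity_top_k=20,  # retrieve a wide candidate set
    node_postprocessors=[CohereRerank(top_n=5, model="rerank-english-v3.0")],  # keep the best 5
)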

Evaluate changes with golden questions — our RAG evaluation guide covers hit rate, faithfulness, and citation accuracy metrics LlamaIndex callbacks can log.
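
Hit rate and mean reciprocal rank are straightforward to compute once you have golden questions paired with the node each one should retrieve. The sketch below assumes a retrieve function that returns ranked node IDs; the function name, questions, and IDs are hypothetical.

```python
def hit_rate_and_mrr(golden, retrieve, k=5):
    """golden: list of (question, expected_node_id); retrieve: question -> ranked ids."""
    hits, reciprocal_ranks = 0, []
    for question, expected_id in golden:
        ranked = retrieve(question)[:k]
        if expected_id in ranked:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(golden)
    return hits / n, sum(reciprocal_ranks) / n


# Canned retrieval results stand in for a real retriever.
canned = {
    "refund window?": ["p1", "p2"],
    "retention period?": ["p4", "p3"],
    "rate limit?": ["p9"],
}
golden = [("refund window?", "p1"), ("retention period?", "p3"), ("rate limit?", "p5")]
hit_rate, mrr = hit_rate_and_mrr(golden, lambda q: canned[q])
```

Run the same golden set before and after every chunking, retriever, or reranker change so that a regression shows up as a number rather than as an anecdote.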

Agents, tools, and workflows

When a question requires multiple retrieval steps or external APIs, promote the index to a tool inside a LlamaIndex agent or workflow. QueryEngineTool wraps a query engine with a name and description the planner model reads. Agents use ReAct-style loops similar to tool-use patterns but stay inside LlamaIndex’s event system.

Workflows (v0.10+) replace opaque agent executors with explicit @step functions and typed events — closer to LangGraph for teams that need auditable control flow. Use workflows when you have branching logic (classify intent, then route to billing vs. shipping index) rather than a single catch-all retriever.
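
The classify-then-route pattern can be sketched as explicit steps passing typed events. This is plain Python rather than the LlamaIndex Workflow API: the event classes, the keyword rule standing in for an LLM classifier, and the lambda engines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QuestionEvent:
    question: str


@dataclass
class RoutedEvent:
    question: str
    route: str


def classify(event: QuestionEvent) -> RoutedEvent:
    # Stand-in for the LLM intent classifier: a keyword rule keeps this runnable.
    billing_terms = ("refund", "invoice", "charge")
    is_billing = any(term in event.question.lower() for term in billing_terms)
    return RoutedEvent(event.question, "billing" if is_billing else "shipping")


def answer(event: RoutedEvent, engines: dict) -> str:
    # Each route maps to its own query engine over a separate index.
    return engines[event.route](event.question)


engines = {
    "billing": lambda q: f"[billing index] {q}",
    "shipping": lambda q: f"[shipping index] {q}",
}
routed = classify(QuestionEvent("Where is my refund?"))
reply = answer(routed, engines)
```

Because every step takes and returns a named event, each hop can be logged and tested on its own, which is the auditability argument for workflows over an opaque agent loop.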

Do not default to agents for every query. A plain query engine with good chunking and reranking beats an agent loop that burns tokens re-retrieving the same policy paragraph three times.

Observability, cost, and deployment

Enable LlamaDebugHandler or integrate OpenTelemetry / Langfuse / Phoenix callbacks to trace retrieve and synthesize stages separately. Log embedding token counts at ingest and completion tokens per query; spikes often mean chunk size drift or a runaway refine synthesizer.
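
If you want to see what separate stage traces look like before wiring in a tracing backend, a context manager is enough. The sketch below is a hypothetical stand-in for a callback handler; the retrieve and synthesize callables are placeholders for your own pipeline.

```python
import time
from contextlib import contextmanager


@contextmanager
def traced(stage, trace):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append({"stage": stage, "seconds": time.perf_counter() - start})


def answer_with_trace(question, retrieve, synthesize):
    trace = []
    with traced("retrieve", trace):
        nodes = retrieve(question)
    with traced("synthesize", trace):
        answer = synthesize(question, nodes)
    return answer, trace


answer, trace = answer_with_trace(
    "q",
    lambda q: ["n1"],                            # placeholder retriever
    lambda q, nodes: f"{len(nodes)} node(s)",    # placeholder synthesizer
)
```

Splitting the timings this way tells you whether a latency regression belongs to the vector store or to the LLM call, which a single end-to-end number hides.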

Deploy query engines behind FastAPI or as serverless functions with warm vector DB connections. For browser SSE, stream tokens by building the query engine with streaming=True or by calling stream_chat() on a chat engine. Pin llama-index-core and integration package versions; releases have repeatedly moved imports between llama_index subpackages. For high QPS, cache frequent queries using the patterns in our semantic caching guide, keyed on normalized question embeddings.
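
A cache keyed on question embeddings can be sketched in a few lines. Everything here is an assumption for illustration: the SemanticCache class, the 0.95 threshold, and especially toy_embed, which counts vocabulary words so the example runs without an embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, question):
        """Return a cached answer if a stored question is similar enough."""
        vec = self.embed(question)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question, answer):
        self.entries.append((self.embed(question), answer))


VOCAB = ["refund", "window", "enterprise", "rate", "limit"]


def toy_embed(text):
    # Toy embedding: lowercase, strip "?", count vocabulary words.
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("What is the refund window for enterprise plans?", "30 days")
hit = cache.get("Enterprise refund window?")
miss = cache.get("What is the API rate limit?")
```

Set the threshold conservatively: a false cache hit returns a confident answer to a different question, which is worse than paying for one more retrieval. Invalidate entries whenever the underlying corpus is re-ingested.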

Worked example: Harbor Analytics policy knowledge base

Harbor Analytics sells compliance dashboards to fintech clients. Support engineers answer questions about SOC 2 controls, data retention, and API rate limits spread across 240 PDF policies, Confluence exports, and a Postgres changelog. They need cited answers, not guesses.

Architecture

A nightly IngestionPipeline runs LlamaParse for PDFs, an HTML reader for wiki dumps, and a SQL reader for changelog rows. SentenceSplitter chunks at 400 tokens with 50 tokens of overlap; metadata tags cover product, effective_date, and audience (internal vs. external, meaning customer-facing). Nodes are embedded with text-embedding-3-small and stored in pgvector. Two composable indexes sit on top: customer_policies and internal_runbooks.

Query path

A classifier workflow (single LLM step) routes the question. Customer queries hit a query engine with metadata filter audience=external, similarity_top_k=15, Cohere rerank to top-5, and compact synthesis with a system prompt requiring bracketed citations [doc_id]. Internal escalation queries add the runbook index via RouterQueryEngine. P95 latency: 4.2 seconds. Faithfulness eval on 150 golden questions holds above 0.91 after reranker addition (up from 0.78 vector-only).
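
The customer query path boils down to filter, retrieve wide, rerank narrow, cite. The sketch below models that order with precomputed scores; the doc IDs, score values, and function names are hypothetical, and the "vec" and "rr" fields stand in for live vector-similarity and reranker scores.

```python
def query_path(nodes, audience, vector_score, rerank_score, top_k=15, top_n=5):
    # 1. The metadata filter runs first, so filtered-out nodes never become candidates.
    allowed = [n for n in nodes if n["audience"] == audience]
    # 2. Wide vector retrieval.
    candidates = sorted(allowed, key=vector_score, reverse=True)[:top_k]
    # 3. Narrow rerank.
    return sorted(candidates, key=rerank_score, reverse=True)[:top_n]


def cite(nodes):
    """Bracketed citations in the [doc_id] style the system prompt requires."""
    return " ".join(f"[{n['doc_id']}]" for n in nodes)


# Invented nodes with precomputed scores for illustration.
nodes = [
    {"doc_id": "POL-12", "audience": "external", "vec": 0.80, "rr": 0.95},
    {"doc_id": "RUN-03", "audience": "internal", "vec": 0.92, "rr": 0.99},
    {"doc_id": "POL-07", "audience": "external", "vec": 0.85, "rr": 0.40},
    {"doc_id": "POL-31", "audience": "external", "vec": 0.60, "rr": 0.70},
]
final = query_path(
    nodes, "external",
    vector_score=lambda n: n["vec"],
    rerank_score=lambda n: n["rr"],
    top_k=3, top_n=2,
)
citations = cite(final)
```

Note that RUN-03 has the highest score on both measures yet never reaches the answer, because the audience filter is applied before similarity is considered. That ordering is what keeps internal runbooks out of customer responses.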

Operations

Re-ingestion triggers on a git SHA change in the policy repos. Stale answers surface when effective_date metadata predates the question context; the UI shows the “policy version” from the top source node. Human reviewers log thumbs-down responses in a spreadsheet that feeds weekly chunking and prompt tweaks rather than automatic fine-tuning.

Framework decision table

Need | Prefer | Why
Document-heavy Q&A, complex ingestion | LlamaIndex | Indexes, query engines, and postprocessors tuned for retrieval-first apps
Multi-provider chains, broad tool ecosystem | LangChain + LangGraph | LCEL composition, agent graphs, LangSmith tracing
Explicit stateful agent with checkpoints | LangGraph | Durable human-in-the-loop workflows over LlamaIndex tools
One static FAQ, under 50 pages | Raw SDK + single vector collection | Minimal dependencies; LlamaIndex overhead may not pay off
Graph traversal over entity relationships | KnowledgeGraphIndex or dedicated graph DB | Triple extraction and graph retrievers built in
Cross-product tool standardization | MCP server exposing retrieval | Tools portable to Claude, IDEs, and internal agents alike

Common pitfalls

  • Default chunk sizes — 1024-token splits across all corpora; tables and API docs need structure-aware parsers.
  • No metadata filters — internal runbooks leak into customer answers because every node shares one flat index.
  • Skipping reranking — vector top-5 alone misses nuance; a cross-encoder reranker is often the cheapest quality win.
  • Agent overkill — ReAct loops for questions a single query engine answers in one retrieval pass.
  • Embedding model changes without reindex — mixed vectors in one collection silently degrade recall.
  • Ignoring source_nodes — shipping answers without citations; compliance teams cannot audit responses.
  • Stuffing context — compact mode with top_k=30 crowds the prompt and dilutes attention.
  • Parse-on-every-request — re-parsing PDFs at query time instead of batch ingestion.

Practitioner checklist

  • Pin llama-index-core, reader, and vector store integration versions.
  • Run ingestion on a schedule; version parsed artifacts and embedding model IDs.
  • Tag nodes with rich metadata for filterable retrieval at query time.
  • Start with VectorStoreIndex; add composable or graph indexes only when metrics justify complexity.
  • Enable hybrid retrieval or reranking before tuning prompt prose.
  • Expose source_nodes or citations in every user-facing answer.
  • Benchmark with golden questions after each ingest or chunking change.
  • Stream chat responses; set timeouts on agent workflows.
  • Redact PII at ingest; do not rely on the LLM to forget sensitive chunks.
  • Revisit whether LangChain or raw SDK would simplify if retrieval is only 10% of your app.

Key takeaways

  • LlamaIndex centers on ingestion, indexing, and query engines for retrieval-heavy LLM apps.
  • Nodes and metadata are the contract between parse-time investment and query-time precision.
  • Postprocessors and rerankers often beat prompt engineering for answer quality.
  • Chat and agent layers sit on top of indexes; do not skip solid retrieval fundamentals.
  • Pair LlamaIndex with LangGraph or MCP when orchestration complexity outgrows a single query engine.
