Graph RAG explained

Standard vector RAG retrieves chunks whose embeddings are close to the question. That works when the answer lives in one paragraph. It struggles when the user asks how entities relate across dozens of documents, for example “Which vendors supply both our EU and APAC warehouses?” or “What themes connect these three incident reports?” Graph RAG builds a knowledge graph of entities and relationships during indexing, then retrieves by graph traversal, community summaries, or hybrid fusion with vectors. Microsoft’s GraphRAG pipeline and lighter variants (LightRAG, Graphiti) made the pattern mainstream in 2024–2025. This guide explains how graph-augmented retrieval differs from chunk-only RAG, when the extra indexing cost pays off, and how it connects to classic RAG, knowledge graphs, vector stores, and semantic search. It then walks through a Harbor Archive worked example, an architecture decision table, common pitfalls, and a production checklist.

What graph RAG adds to vector retrieval

Chunk-and-embed RAG treats every document as a bag of isolated text spans. Similarity search returns the top-k nearest neighbors in embedding space. If the answer requires connecting facts from five different files, each individually a mediocre match to the query, none may rank high enough to enter the context window.

Graph RAG adds a structured layer:

  1. Extract entities (people, products, locations, concepts) and relationships (works_at, supplies, caused_by) from each chunk, usually with an LLM.
  2. Merge duplicate entities across documents into a unified graph with provenance links back to source chunks.
  3. Summarize communities — clusters of densely connected nodes — into hierarchical summaries at multiple scales.
  4. Retrieve at query time by local graph expansion (seed entities → neighbors), global community search (which summary theme fits?), or both.

The graph is not a replacement for vectors. Most production systems use hybrid graph + vector RAG: vectors find semantically similar passages; the graph stitches multi-hop connections and supports global questions over the whole corpus.

Local vs global queries

Local queries start from named entities (“What did Acme Corp ship in Q3?”) and walk one to three hops along typed edges. Global queries ask about themes spanning the dataset (“What are the top risks across all supplier contracts?”) and lean on precomputed community summaries rather than per-chunk similarity. Graph RAG’s headline win is global retrieval; local queries often still benefit from a good vector index.

The indexing pipeline

1. Chunk and extract

Documents are chunked as in standard RAG (structure-aware splits help). Each chunk is passed to an LLM with a schema: extract entities with types, relationships as (source, predicate, target) triples, and optional claims with confidence. Extraction quality dominates graph quality — small models hallucinate edges; larger models cost more at index time.
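A minimal sketch of what the extraction output might look like once parsed. The JSON shape, the field names, the `parse_extraction` helper, and the 0.5 confidence cutoff are all assumptions for illustration; real pipelines define their own schema.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    type: str

@dataclass(frozen=True)
class Triple:
    source: str
    predicate: str
    target: str
    chunk_id: str      # provenance back to the source chunk
    confidence: float

def parse_extraction(raw_json, chunk_id, min_confidence=0.5):
    """Parse one chunk's (hypothetical) LLM output into entities and triples."""
    data = json.loads(raw_json)
    entities = [Entity(e["name"], e["type"]) for e in data.get("entities", [])]
    known = {e.name for e in entities}
    triples = []
    for r in data.get("relationships", []):
        conf = float(r.get("confidence", 1.0))
        # Keep an edge only if both endpoints were extracted and confidence is high enough
        if r["source"] in known and r["target"] in known and conf >= min_confidence:
            triples.append(Triple(r["source"], r["predicate"], r["target"], chunk_id, conf))
    return entities, triples

# Hypothetical model output for one chunk
sample = json.dumps({
    "entities": [
        {"name": "AcmeCloud", "type": "Vendor"},
        {"name": "PII", "type": "DataType"},
    ],
    "relationships": [
        {"source": "AcmeCloud", "predicate": "processes", "target": "PII", "confidence": 0.9},
        {"source": "AcmeCloud", "predicate": "supplies", "target": "Globex", "confidence": 0.8},
        {"source": "AcmeCloud", "predicate": "owns", "target": "PII", "confidence": 0.2},
    ],
})
entities, triples = parse_extraction(sample, "chunk-001")
```

Here the edge to the unextracted entity and the low-confidence edge are both dropped, which is one cheap guard against hallucinated relationships.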

2. Entity resolution and graph merge

“IBM,” “International Business Machines,” and “Big Blue” must collapse to one node. Resolution uses string similarity, embedding clustering, or a second LLM pass. Each merged node keeps pointers to every supporting chunk ID. Without resolution, the graph fragments and traversal returns noise.
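As a rough illustration of the string-similarity route, the sketch below combines a hand-maintained alias table with fuzzy matching from the standard library's `difflib`. The alias table, the 0.85 threshold, the chunk IDs, and the `resolve` helper are assumptions; embedding clustering or an LLM pass would replace the fuzzy match in a heavier setup.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def resolve(mentions, aliases, threshold=0.85):
    """Map (name, chunk_id) mentions to canonical nodes with chunk provenance."""
    canonical = {}                  # normalized key -> canonical display name
    provenance = defaultdict(set)   # canonical name -> supporting chunk IDs
    for name, chunk_id in mentions:
        key = normalize(name)
        if key in aliases:          # known alias: rewrite to the canonical name
            name = aliases[key]
            key = normalize(name)
        # Fuzzy-match against canonical keys seen so far
        match = next(
            (k for k in canonical if SequenceMatcher(None, key, k).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical[key] = name
            match = key
        provenance[canonical[match]].add(chunk_id)
    return dict(provenance)

# Hypothetical alias table and mentions
aliases = {
    "ibm": "International Business Machines",
    "bigblue": "International Business Machines",
}
mentions = [
    ("IBM", "c1"),
    ("International Business Machines", "c2"),
    ("Big Blue", "c3"),
    ("AcmeCloud", "c4"),
    ("Acme Cloud, Inc.", "c5"),
]
nodes = resolve(mentions, aliases)
```

Pure string similarity cannot link “IBM” to “Big Blue”, which is why the alias table (or an embedding or LLM pass) is needed alongside it.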

3. Community detection and hierarchical summaries

Algorithms like Leiden clustering partition the graph into communities. An LLM writes a summary per community; communities are grouped into super-communities and summarized again, forming a tree. At query time, a global search selects relevant community summaries (map-reduce style) instead of scanning every chunk. This is the signature step in Microsoft GraphRAG and why index builds are expensive.
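The summary tree can be sketched independently of the clustering step. The example below assumes community assignments already exist (as if produced by Leiden) and uses a placeholder `summarize` function where a real pipeline would call an LLM. All names and data here are hypothetical.

```python
def summarize(texts):
    """Placeholder for an LLM call: here it simply joins the inputs."""
    return "; ".join(texts)

def build_summary_tree(leaf_members, parents, descriptions):
    """Return {community_id: summary} for leaf communities and their parents."""
    summaries = {}
    # Level 0: summarize each leaf community from its member entity descriptions
    for cid, members in leaf_members.items():
        summaries[cid] = summarize([descriptions[m] for m in members])
    # Level 1: summarize each super-community from its children's summaries
    for pid, children in parents.items():
        summaries[pid] = summarize([summaries[c] for c in children])
    return summaries

# Hypothetical clustering output: two leaf communities under one root
leaf_members = {
    "c0": ["AcmeCloud", "EU"],
    "c1": ["INC-2024-17"],
}
parents = {"root": ["c0", "c1"]}
descriptions = {
    "AcmeCloud": "AcmeCloud is a vendor",
    "EU": "EU is a region",
    "INC-2024-17": "INC-2024-17 is an incident",
}
tree = build_summary_tree(leaf_members, parents, descriptions)
```

Every entry in `tree` stands for one LLM call in a real build, which is where much of the index cost comes from.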

4. Dual indexes

Store chunk embeddings in a vector database, entity/relation embeddings (optional) for fuzzy node lookup, and the graph in Neo4j, NetworkX, or an in-memory adjacency structure. Metadata on every edge records source document, timestamp, and access-control tags for filtered retrieval.
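One way to picture the edge metadata is a small record type plus a filter applied before traversal. The field names, the document paths, and the rule that a user must hold every tag on an edge are assumptions made for this sketch, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Edge:
    source: str
    predicate: str
    target: str
    doc_id: str            # source document
    timestamp: date        # when the supporting document was written
    acl_tags: frozenset    # tags a user must hold to see this edge

def visible_edges(edges, user_tags, since=None):
    """Keep edges the user may see, optionally restricted to recent ones."""
    return [
        e for e in edges
        if e.acl_tags <= user_tags and (since is None or e.timestamp >= since)
    ]

# Hypothetical edges
edges = [
    Edge("Vendor:AcmeCloud", "processes", "DataType:PII",
         "contracts/acme.md", date(2024, 3, 1), frozenset({"legal"})),
    Edge("Vendor:AcmeCloud", "involved_in", "Incident:INC-2024-17",
         "incidents/inc-17.md", date(2024, 6, 10), frozenset({"legal", "security"})),
]
legal_only = visible_edges(edges, {"legal"})
recent_all = visible_edges(edges, {"legal", "security"}, since=date(2024, 5, 1))
```

Filtering edges before expansion keeps restricted facts out of the traversal itself rather than relying on a post-hoc check of the retrieved chunks.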

Query-time retrieval modes

Local graph search

Parse the question for entity mentions (NER or LLM). Seed the graph with matching nodes, expand along high-weight edges up to depth d, collect attached chunk texts, optionally rerank with a cross-encoder, and pass to the generator. Depth and degree caps prevent exponential blow-up on hub nodes like “United States.”
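The expansion step can be sketched as a breadth-first walk with both caps applied. The adjacency format (node mapped to a list of `(neighbor, weight, chunk_id)` tuples), the directed edges, and the sample graph are assumptions for illustration; entity linking and reranking are left out.

```python
from collections import deque

def local_search(graph, seeds, max_depth=2, max_fanout=3):
    """Collect chunk IDs reachable from seed nodes under depth and fan-out caps."""
    visited = set(seeds)
    chunks = []
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth cap
        # Follow only the highest-weight edges, capped at max_fanout per node
        edges = sorted(graph.get(node, []), key=lambda e: e[1], reverse=True)[:max_fanout]
        for neighbor, _weight, chunk_id in edges:
            if chunk_id not in chunks:
                chunks.append(chunk_id)
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return chunks

# Hypothetical graph with one hub node ("United States") that has 1,000 edges
graph = {
    "Acme Corp": [("Q3 Shipment", 0.9, "c1"), ("United States", 0.2, "c2"), ("Widget X", 0.8, "c3")],
    "Q3 Shipment": [("Widget X", 0.7, "c4")],
    "United States": [(f"Entity{i}", 0.1, f"hub{i}") for i in range(1000)],
    "Widget X": [("Supplier Z", 0.6, "c5")],
}
hits = local_search(graph, ["Acme Corp"], max_depth=2, max_fanout=2)
```

With a fan-out of 2, the low-weight edge to the hub is never followed, so none of its 1,000 neighbors are expanded.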

Global community search

Embed the question, compare against community summary vectors, take top communities, pull their member entities’ source chunks, and synthesize. Suited to thematic questions where no single entity is named. Latency is often higher than local search because multiple summary levels participate.
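A toy version of the summary-matching step, using hand-written three-dimensional vectors in place of real embeddings. The community names, vectors, chunk IDs, and helper functions are hypothetical, and the final synthesis call is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_communities(query_vec, summaries, k=2):
    """Rank communities by similarity between the query and each summary vector."""
    ranked = sorted(
        summaries.items(),
        key=lambda item: cosine(query_vec, item[1]["vector"]),
        reverse=True,
    )
    return [cid for cid, _ in ranked[:k]]

def collect_chunks(community_ids, summaries):
    """Gather source chunks attached to the selected communities, in order."""
    chunks = []
    for cid in community_ids:
        for chunk_id in summaries[cid]["chunks"]:
            if chunk_id not in chunks:
                chunks.append(chunk_id)
    return chunks

# Hypothetical community summaries with toy embedding vectors
summaries = {
    "supplier-risk": {"vector": [0.9, 0.1, 0.0], "chunks": ["c1", "c2"]},
    "incident-response": {"vector": [0.1, 0.9, 0.1], "chunks": ["c3"]},
    "hr-policy": {"vector": [0.0, 0.1, 0.9], "chunks": ["c4"]},
}
query_vec = [0.8, 0.3, 0.0]   # stands in for the embedded question
selected = top_communities(query_vec, summaries, k=2)
context = collect_chunks(selected, summaries)
```

The selected summaries and their chunks would then go to the generator; with several summary levels, this ranking runs once per level, which helps explain the higher latency.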

Hybrid fusion

Run vector top-k and graph local/global in parallel. Merge with reciprocal rank fusion (RRF) or learned weights. Hybrid typically beats either alone on benchmarks that mix factual lookup and cross-document synthesis. LightRAG-style systems emphasize lower index cost by updating the graph incrementally without full re-clustering on every ingest.
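Reciprocal rank fusion is short enough to show in full. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 is a commonly used default, and the chunk IDs below are made up for illustration.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c1", "c2", "c3"]   # hypothetical vector top-k
graph_hits = ["c3", "c1", "c4"]    # hypothetical graph retrieval results
fused = rrf([vector_hits, graph_hits])
```

Chunks found by both retrievers (c1 and c3) rise above chunks found by only one, without needing the two systems' scores to be on a comparable scale.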

Worked example: Harbor Archive compliance Q&A

Harbor Archive indexes 2,400 internal markdown pages: vendor contracts, SOC2 evidence, and incident postmortems. A compliance analyst asks: “Which third-party processors handle EU customer PII and appeared in incidents since 2024?”

Vector-only baseline: top-5 chunks mention “EU,” “PII,” or “incident” separately. None list processors satisfying all three constraints. The LLM guesses.

Graph RAG path:

  1. Extraction built nodes Vendor:AcmeCloud, Region:EU, DataType:PII, Incident:INC-2024-17 with edges processes, operates_in, involved_in.
  2. Query seeds Region:EU and DataType:PII, traverses operates_in and processes edges backward to vendors linked to both, then keeps vendors linked via involved_in to Incident nodes dated on or after 2024-01-01.
  3. Retrieved chunks: contract clause for AcmeCloud, incident timeline, DPA appendix. Generator cites all three.
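The traversal in the steps above can be approximated with set intersections over triples. Only AcmeCloud and INC-2024-17 come from the example; the other vendors, the second incident, and all incident dates are hypothetical filler added so the filter has something to reject.

```python
from datetime import date

# Triples as (source, predicate, target); ExampleStore and ExampleLogs are hypothetical
triples = [
    ("Vendor:AcmeCloud", "processes", "DataType:PII"),
    ("Vendor:AcmeCloud", "operates_in", "Region:EU"),
    ("Vendor:AcmeCloud", "involved_in", "Incident:INC-2024-17"),
    ("Vendor:ExampleStore", "processes", "DataType:PII"),
    ("Vendor:ExampleStore", "operates_in", "Region:APAC"),
    ("Vendor:ExampleLogs", "processes", "DataType:PII"),
    ("Vendor:ExampleLogs", "operates_in", "Region:EU"),
    ("Vendor:ExampleLogs", "involved_in", "Incident:INC-2023-04"),
]
# Hypothetical incident dates
incident_dates = {
    "Incident:INC-2024-17": date(2024, 5, 2),
    "Incident:INC-2023-04": date(2023, 3, 9),
}

def sources(predicate, target):
    """Walk an edge type backward: every node pointing at target via predicate."""
    return {s for s, p, t in triples if p == predicate and t == target}

def matching_vendors(since):
    """Vendors that process PII, operate in the EU, and appear in a recent incident."""
    candidates = sources("processes", "DataType:PII") & sources("operates_in", "Region:EU")
    hits = set()
    for s, p, t in triples:
        if s in candidates and p == "involved_in" and incident_dates[t] >= since:
            hits.add(s)
    return sorted(hits)

result = matching_vendors(date(2024, 1, 1))
```

In a production graph store the same constraint would be a single pattern-matching query; the point is that all three conditions are checked on structure rather than inferred from loosely related chunks.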

Index build took 18 hours on first run (LLM extraction + Leiden + summaries). Incremental nightly updates add ~12 minutes for changed pages. Query p95 latency: 2.1s hybrid vs 0.8s vector-only — acceptable for analyst workflows, not for sub-second chat widgets.

Architecture decision table

Scenario | Prefer | Avoid
Single-doc FAQ, answers in one page | Vector RAG only | Full GraphRAG index (overkill)
Multi-hop entity questions across corpus | Hybrid graph + vector | Vector-only with huge top-k
Global thematic synthesis (“main themes”) | Graph RAG community summaries | Embedding every chunk into one prompt
Corpus updates hourly | Incremental graph (LightRAG / Graphiti) | Full nightly Leiden re-cluster
Strict sub-second latency | Vector + precomputed entity lookup table | Deep graph traversal at query time
Small corpus (< 500 pages) | Long-context LLM or vector RAG | Graph index costing more than brute force
Regulated provenance requirements | Graph with edge-to-chunk provenance | Black-box embedding retrieval without citations

Common pitfalls

  • Extraction errors compound: one hallucinated edge pollutes traversals; validate with human spot checks on a golden doc set.
  • Hub node explosion: generic entities (company, user, data) need degree limits or stop lists.
  • Ignoring entity resolution: duplicate nodes split one entity's edges across several nodes, so real multi-hop paths go missing and traversal returns noise.
  • Rebuilding the full graph on every ingest: index cost dominates; use incremental updates.
  • Skipping vector baseline: graph retrieval alone often misses paraphrased local facts.
  • Underestimating index LLM spend: a first GraphRAG build can cost 10–50× a plain embedding job.
  • No evaluation split for multi-hop questions: standard RAG benchmarks under-report graph value; curate cross-doc QA pairs.
  • Stale community summaries: global search drifts when the graph changes but summaries do not refresh.

Production checklist

  • Define entity and relationship schemas before extraction; version them.
  • Benchmark vector-only, graph-only, and hybrid on local and global question sets.
  • Log extraction confidence; quarantine low-confidence edges.
  • Cap traversal depth and fan-out; profile worst-case hub nodes.
  • Store provenance: every triple links to chunk ID and character offset.
  • Schedule incremental graph updates; full re-cluster on major schema changes only.
  • Refresh community summaries when node/edge churn exceeds a threshold.
  • Fuse graph and vector results with RRF; tune weights on held-out queries.
  • Monitor index cost per document and query p95 separately.
  • Apply the same access-control filters on graph nodes as on vector metadata.
  • Run RAG evaluation with faithfulness checks on multi-source answers.
  • Document when to fall back to vector-only if graph service is degraded.

Key takeaways

  • Graph RAG models entities and relationships, not just similar text chunks.
  • Community summaries enable global questions that vector top-k cannot answer.
  • Hybrid retrieval is the practical default; neither layer alone covers all query types.
  • Index-time LLM cost and higher query latency are the main tradeoffs vs plain vector RAG.
  • Invest in extraction quality, entity resolution, and multi-hop evaluation early.
