LLM RAG multi-hop retrieval explained

Harbor Engineering shipped an internal onboarding assistant over 6,200 Confluence pages, runbooks, and service catalogs. Single-shot RAG scored 89% on golden FAQs where the answer lived in one chunk — “What port does Harbor Metrics expose?” But questions that required linking facts across documents failed badly: “Who owns the on-call rotation for the service that ingests Stripe webhooks, and what is the P1 escalation path?” requires finding the webhook ingest service name, mapping it to an owner team, then retrieving the escalation runbook. The bot hallucinated a pager alias 31% of the time because no single embedding query surfaced all three hops.

The team replaced one-shot top-k retrieval with a bounded multi-hop pipeline: decompose the question, retrieve hop-by-hop using bridge entities extracted from each pass, and synthesize only when evidence covered every sub-claim. Wrong-escalation answers fell from 31% to 8%; faithfulness on a 180-question multi-hop golden set rose from 54% to 83%. This guide explains what makes a question multi-hop, bridge-entity retrieval, iterative read-and-search loops, decomposed sub-question pipelines, when graph structure helps, the Harbor Engineering refactor, a technique decision table versus single-shot RAG and full agentic loops, pitfalls, and a production checklist.

Single-hop vs multi-hop questions

A single-hop question can be answered from one retrieved passage (or a few adjacent chunks from the same document). The embedding of the user query sits close to the embedding of the answer span. Examples: an API rate limit, the headline PTO policy, the meaning of an error code.

A multi-hop question needs evidence from two or more logically separate sources, often chained through an intermediate entity the user never names explicitly:

Bridge entity — “Which team maintains the service that writes to the ledger_events table?” Hop 1: find which service writes to that table. Hop 2: find the owning team for that service.
Comparison across docs — “Does Plan B cover the same regions as the legacy Enterprise addendum?” Hop 1: Plan B regions. Hop 2: addendum regions. Hop 3: diff.
Temporal chain — “What changed in the refund policy after the Q3 pricing update?” Hop 1: locate Q3 pricing doc and effective date. Hop 2: retrieve refund policy versions after that date.
Numeric aggregation — “Total committed spend across all vendors tagged harbor-ai in the FY26 budget sheet?” Multiple row lookups plus sum.

Single-shot RAG fails on multi-hop questions not because embeddings are weak but because a single query vector blends every hop into one representation and can only match terms the user actually wrote. The chunk that names the webhook service scores high; the escalation runbook scores low until the query contains the service name, which you only learn after hop 1.

Bridge-entity retrieval

The most common production pattern: treat each hop as a retrieval round where the query for hop n+1 is built from entities extracted from hop n results.

Decompose — LLM or rule-based splitter breaks the user question into ordered sub-questions (see query decomposition).
Retrieve hop 1 — hybrid search + rerank for sub-Q1.
Extract bridges — structured parse (JSON schema) pulls service names, IDs, dates, team slugs from top chunks.
Retrieve hop 2 — new query injects bridge entities: "on-call rotation team:{extracted_team} service:{extracted_service}".
Repeat until sub-questions are covered or hop budget expires.
Synthesize — final LLM call cites chunk IDs from all hops.
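
The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Harbor Engineering's implementation: `Chunk`, `HopResult`, `run_hops`, the toy index, and the names `stripe-ingest` and `payments-core` are all hypothetical, and the retriever and extractor are stand-ins for hybrid search plus rerank and a schema-constrained LLM parse.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    chunk_id: str
    text: str


@dataclass
class HopResult:
    sub_question: str
    query: str
    chunks: list[Chunk]
    bridges: dict[str, str] = field(default_factory=dict)


def run_hops(
    sub_questions: list[str],
    retrieve: Callable[[str], list[Chunk]],
    extract: Callable[[list[Chunk]], dict[str, str]],
    max_hops: int = 3,
) -> list[HopResult]:
    """Run ordered hops; later queries are templated with earlier bridges."""
    bridges: dict[str, str] = {}
    results: list[HopResult] = []
    for template in sub_questions[:max_hops]:
        try:
            query = template.format(**bridges)
        except KeyError:
            break  # a required bridge was never extracted; stop instead of guessing
        chunks = retrieve(query)
        found = extract(chunks)
        bridges.update(found)
        results.append(HopResult(template, query, chunks, found))
    return results


# Toy stand-ins (assumptions): a real system would call search + rerank
# and a structured extraction model here.
TOY_INDEX = {
    "which service ingests Stripe webhooks": [
        Chunk("c1", "service=stripe-ingest consumes Stripe webhooks")
    ],
    "owner team for stripe-ingest": [
        Chunk("c2", "team=payments-core owns stripe-ingest")
    ],
}


def toy_retrieve(query: str) -> list[Chunk]:
    return TOY_INDEX.get(query, [])


def toy_extract(chunks: list[Chunk]) -> dict[str, str]:
    found = {}
    for chunk in chunks:
        for token in chunk.text.split():
            if "=" in token:
                key, value = token.split("=", 1)
                found[key] = value
    return found


SUB_QUESTIONS = [
    "which service ingests Stripe webhooks",
    "owner team for {service}",
    "P1 escalation runbook for {team}",
]
hops = run_hops(SUB_QUESTIONS, toy_retrieve, toy_extract)
```

In this toy run the third hop issues a correctly bridged query but finds no chunks, which is the case a synthesis step should treat as missing evidence rather than answer anyway.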

Bridge extraction must be conservative: if hop 1 returns two plausible service names, branch retrieval (retrieve for both, let reranker or a lightweight verifier pick) rather than committing to the first JSON field the model emits. Harbor Engineering used a 3-hop cap and required every sub-question to map to at least one chunk ID before synthesis.
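
One way to branch on ambiguous bridges is to score every candidate with a follow-up retrieval and commit only when the winner is clearly ahead. The sketch below assumes a hypothetical `score_candidate` callable (for example, the top reranker score for a query built from that candidate) and a hypothetical `min_margin` threshold; both would need tuning on real data.

```python
from typing import Callable, Optional


def choose_bridge(
    candidates: list[str],
    score_candidate: Callable[[str], float],
    min_margin: float = 0.1,
) -> tuple[Optional[str], dict[str, float]]:
    """Return the winning candidate, or None when the top two are too close."""
    scores = {c: score_candidate(c) for c in candidates}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None, scores
    if len(ranked) == 1:
        return ranked[0][0], scores
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    # Ambiguous: keep both branches alive or hand off to a verifier.
    if top - runner_up < min_margin:
        return None, scores
    return best, scores
```

Returning `None` instead of the first candidate keeps the ambiguity visible to the caller, which can then retrieve for both branches or refuse.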

Metadata filters amplify bridge hops: once you extract service_id=metrics-ingest, filter the second search to doc_type=runbook AND service_id=metrics-ingest via vector metadata filtering instead of hoping semantic search alone disambiguates.
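
A minimal sketch of turning extracted bridges into a metadata filter follows. The filter is a plain dictionary applied to in-memory records; real vector stores each have their own filter syntax, so `build_filter`, `apply_filter`, and the record layout here are assumptions for illustration only.

```python
def build_filter(bridges: dict[str, str], doc_type: str) -> dict[str, str]:
    """Combine a fixed doc_type with any service_id bridge from earlier hops."""
    flt = {"doc_type": doc_type}
    if "service_id" in bridges:
        flt["service_id"] = bridges["service_id"]
    return flt


def apply_filter(records: list[dict], flt: dict[str, str]) -> list[dict]:
    """Keep records whose metadata matches every key in the filter (logical AND)."""
    return [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in flt.items())
    ]


# Toy records (hypothetical); in practice the filter is passed to the search call.
RECORDS = [
    {"id": "r1", "meta": {"doc_type": "runbook", "service_id": "metrics-ingest"}},
    {"id": "r2", "meta": {"doc_type": "runbook", "service_id": "billing"}},
    {"id": "r3", "meta": {"doc_type": "design", "service_id": "metrics-ingest"}},
]
hop2_filter = build_filter({"service_id": "metrics-ingest"}, "runbook")
hop2_candidates = apply_filter(RECORDS, hop2_filter)
```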

Iterative read-and-search loops

Bridge-entity pipelines assume you know the decomposition upfront. Some questions need iterative retrieval: read chunks, notice a gap, issue a follow-up search informed by what you learned. This is narrower than full agentic RAG: the loop is retrieve-read-retrieve with a fixed iteration cap (typically 2–4), not open-ended tool use.

Pattern:

Round 1: retrieve k=8 for the original question.
Reader model outputs: { "known": [...], "missing": "escalation SLA for team X", "next_query": "..." }.
Round 2: retrieve using next_query; merge chunk sets with dedup by chunk ID.
Stop when missing is empty or iteration budget hits.
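
The pattern above might look like the following bounded loop. It is a sketch under assumptions: `ReaderState` mirrors the reader JSON shown in the pattern, `retrieve` and `read` are hypothetical callables standing in for search and the reader model, and results are deduplicated on whatever ID the retriever returns as the first tuple element.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReaderState:
    known: list[str]
    missing: Optional[str]
    next_query: Optional[str]


def iterative_retrieve(
    question: str,
    retrieve: Callable[[str], list[tuple[str, str]]],  # returns (id, text) pairs
    read: Callable[[str, dict[str, str]], ReaderState],
    max_iters: int = 3,
    max_chunks: int = 24,
) -> tuple[dict[str, str], ReaderState]:
    """Retrieve, read, and re-query until nothing is missing or the budget ends."""
    seen: dict[str, str] = {}  # id -> text, kept in first-seen order
    state = ReaderState(known=[], missing=question, next_query=question)
    query = question
    for _ in range(max_iters):
        for item_id, text in retrieve(query):
            # Dedup across rounds and enforce the global chunk cap.
            if item_id not in seen and len(seen) < max_chunks:
                seen[item_id] = text
        state = read(question, seen)
        if not state.missing or not state.next_query:
            break
        query = state.next_query
    return seen, state
```

The caller can inspect the returned state: a non-empty `missing` after the loop means the budget ran out before the gap was closed.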

Iterative loops cost more tokens than planned decomposition but handle questions where you cannot predict sub-questions without seeing hop-1 evidence — exploratory queries like “Why did deploys fail last Tuesday?” where the root cause document type is unknown until you read the incident summary.

Cap total retrieved chunks across hops (Harbor used 24) to avoid recreating lost-in-the-middle failures in a mega-prompt.

Decomposed sub-question pipelines

When hop order is predictable, a pipeline DAG beats ad-hoc loops:

Parallel fan-out — independent sub-questions retrieve concurrently (Plan B regions + Enterprise addendum regions), then a merge node diffs results. Cuts latency versus serial hops.
Serial dependency — sub-Q2 cannot start until sub-Q1 bridge is extracted; use async queues but enforce ordering.
Shared context buffer — pass accumulated entities (dates, IDs) forward; do not re-retrieve chunks already in the buffer unless freshness flags say otherwise.
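
A dependency-ordered plan with parallel fan-out can be sketched with the standard library alone. The example below assumes sub-questions are identified by string keys and that `answer` is a hypothetical async callable that receives the shared context buffer; `plan_batches` and `run_plan` are illustrative names, not part of any framework.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Awaitable, Callable


def plan_batches(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group sub-questions into batches; each batch only needs earlier batches."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


async def run_plan(
    depends_on: dict[str, set[str]],
    answer: Callable[[str, dict[str, str]], Awaitable[str]],
) -> dict[str, str]:
    """Run independent sub-questions concurrently and dependent ones in order."""
    buffer: dict[str, str] = {}  # shared context passed forward between batches
    for batch in plan_batches(depends_on):
        outputs = await asyncio.gather(*(answer(q, buffer) for q in batch))
        buffer.update(zip(batch, outputs))
    return buffer


# Comparison-style plan: two independent retrievals, then a diff node.
PLAN = {
    "plan_b_regions": set(),
    "addendum_regions": set(),
    "diff": {"plan_b_regions", "addendum_regions"},
}
batches = plan_batches(PLAN)
```

The two region lookups land in the same batch and run under one `asyncio.gather`, while the diff node waits for both.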

Pre-compute decomposition with a small model or cached templates for recurring question shapes (“who owns X and what is the runbook”) instead of paying a frontier model to replan every query. Log decomposition outputs to refine templates from production failures.
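
Cached templates for recurring shapes can be as simple as a list of regular expressions paired with sub-question templates. The pattern and sub-question wording below are hypothetical examples for the "who owns X and what is the runbook" shape; a `None` result signals that the query should fall back to a model-based decomposer.

```python
import re
from typing import Optional

# Each entry: (compiled pattern with named groups, ordered sub-question templates).
TEMPLATES = [
    (
        re.compile(r"who owns (?P<x>.+?) and what is the runbook\??$", re.IGNORECASE),
        ["which team owns {x}", "runbook for {x}"],
    ),
]


def decompose(question: str) -> Optional[list[str]]:
    """Return templated sub-questions, or None when no cached shape matches."""
    text = question.strip()
    for pattern, sub_templates in TEMPLATES:
        match = pattern.search(text)
        if match:
            return [t.format(**match.groupdict()) for t in sub_templates]
    return None  # caller falls back to an LLM decomposer and logs the miss
```

Logging the misses gives a direct list of question shapes worth adding as new templates.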

For table-heavy corpora, hop 2 may need table-aware retrieval — the bridge entity is a row key, not prose.

When graph structure helps

Multi-hop RAG over flat chunk indexes works when hops are 2–3 deep and bridge entities appear in text. When relationships are dense — service depends-on graphs, policy exception hierarchies, org charts — graph RAG precomputes edges so hop 2 is a graph traversal instead of another embedding guess.

Hybrid approach Harbor Engineering kept in production: flat RAG for hop 1 (find the service mention), knowledge-graph lookup for hop 2 (owner team edge is structured), flat RAG for hop 3 (runbook prose). Graph-only search failed on novel phrasing (“the thing that eats Stripe events”); text-only failed on reliable ownership edges maintained in CMDB exports.
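
The flat, graph, flat sequence can be sketched as three calls where only the middle one is a structured lookup. Everything named here is an assumption for illustration: `OWNER_EDGES` stands in for a CMDB export, and `find_service` and `find_runbook` stand in for the two flat retrieval steps.

```python
from typing import Callable, Optional

# Toy ownership edges (hypothetical); in practice loaded from a CMDB export.
OWNER_EDGES = {"stripe-ingest": "payments-core"}


def hybrid_lookup(
    question: str,
    find_service: Callable[[str], Optional[str]],
    find_runbook: Callable[[str, str], list[str]],
    owner_edges: dict[str, str] = OWNER_EDGES,
) -> Optional[dict]:
    """Hop 1: text search. Hop 2: graph edge. Hop 3: filtered text search."""
    service = find_service(question)  # flat RAG tolerates paraphrased names
    if service is None:
        return None
    team = owner_edges.get(service)  # exact edge lookup, no embedding guess
    if team is None:
        return None
    return {
        "service": service,
        "team": team,
        "runbook_chunks": find_runbook(service, team),
    }
```

A missing edge returns `None` rather than falling back to a semantic guess, which keeps stale or absent ownership data from turning into a confident wrong answer.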

Invest in graphs when the same bridge relation is queried thousands of times per day and entity IDs are stable. Skip graph build for one-off document collections where extraction cost exceeds query volume.

Harbor Engineering refactor (worked example)

Before: single hybrid retrieval (BM25 + dense), k=12, cross-encoder rerank, one synthesis call. Multi-hop golden set: 54% faithfulness, 31% wrong escalation contact on owner+runbook questions.

After:

Classifier routes “multi-entity” questions (regex + lightweight intent model) to the hop pipeline; simple FAQs stay on single-shot for cost.
Decomposer emits 2–3 sub-questions with explicit depends_on tags.
Per-hop retrieve k=6, rerank to 3; bridge JSON schema validated against an allowlist of entity types (service, team, policy_id).
Metadata filters on service_id and doc_type for hops 2+.
Synthesis prompt lists sub-questions with assigned chunk citations; refuses if any sub-question has zero chunks.
NLI faithfulness check on final answer vs union of hop chunks; regenerate once on contradiction.
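
Two of the steps above, allowlist validation of bridge JSON and the refusal gate before synthesis, are small enough to sketch directly. The entity types come from the list above; the function names and data shapes are assumptions.

```python
ALLOWED_TYPES = {"service", "team", "policy_id"}


def validate_bridges(raw: dict) -> dict[str, str]:
    """Drop any extracted field that is not an allowlisted, non-empty string."""
    return {
        key: value.strip()
        for key, value in raw.items()
        if key in ALLOWED_TYPES and isinstance(value, str) and value.strip()
    }


def ready_to_synthesize(citations: dict[str, list[str]]) -> tuple[bool, list[str]]:
    """Return (ok, uncovered sub-questions); ok is False if any has zero chunks."""
    uncovered = [sub_q for sub_q, chunk_ids in citations.items() if not chunk_ids]
    return (not uncovered, uncovered)
```

The list of uncovered sub-questions doubles as the refusal message, so the user sees which part of the question lacked evidence.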

Results: multi-hop faithfulness 54% → 83%; wrong escalation 31% → 8%; p95 latency 2.1s → 4.8s (acceptable for internal tooling); cost per multi-hop query +2.3× versus single-shot, offset by routing 68% of traffic to the cheap path.
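
The routing offset can be checked with a short calculation. This sketch assumes "+2.3×" means a multi-hop query costs 2.3 times a single-shot query and that 68% of all queries take the single-shot path; if the figure instead means 2.3× on top of the baseline, substitute 3.3 for the multiplier.

```python
def blended_cost(cheap_share: float, multi_hop_multiplier: float) -> float:
    """Average cost per query, in units of one single-shot query."""
    return cheap_share * 1.0 + (1.0 - cheap_share) * multi_hop_multiplier


# 0.68 * 1.0 + 0.32 * 2.3 = 1.416 single-shot units per query on average
average = blended_cost(0.68, 2.3)
```

Under those assumptions the fleet-wide cost rises by roughly 42% rather than 130%, which is the practical argument for routing before retrieval.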

Technique decision table

Technique	Prefer when	Avoid when
Single-shot RAG	Answer in one doc; high query volume; strict latency SLA	Question mentions two+ entities or “which X for Y that Z”
Bridge-entity multi-hop	2–3 ordered hops; stable entity types; metadata filters available	Exploratory root-cause search with unknown doc types
Iterative read-and-search	Cannot decompose upfront; incident/debug queries	High QPS FAQ surface (cost explodes)
Graph-augmented hops	CMDB/org/service graphs maintained; same edges queried repeatedly	Small static corpus; graph build cost > query savings
Full agentic RAG	4+ hops; web + internal tools; ambiguous planning	2-hop owner+runbook with known schema (overkill)
Query expansion only	Vocabulary mismatch on single doc	True cross-document chaining (expansion does not extract bridges)

Default for enterprise knowledge bases: route by question shape. Single-shot for 70–80% of queries; bounded multi-hop for the rest. Reserve agentic loops for support tiers that already tolerate 10+ second latency.

Common pitfalls

Multi-hop everything — running 3-hop pipelines on single-doc FAQs burns budget; classify first.
Trusting bridge extraction blindly — one wrong entity poisons hop 2; branch or verify with a second retrieval candidate.
Unbounded hops — agents loop until timeout; cap hops and chunks explicitly.
No per-hop eval — only measuring end-to-end hides hop-1 recall failures; log recall@k per sub-question.
Ignoring dedup — the same runbook chunk in hop 1 and hop 3 wastes context budget.
Decomposition drift — decomposer invents sub-questions the corpus cannot answer; validate against schema or refuse early.
Flat k increase — raising k from 12 to 50 does not fix multi-hop; it adds noise without bridge entities.
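
Per-hop evaluation needs only a gold set of chunk IDs for each sub-question. The sketch below computes recall@k per hop; the record layout (`sub_question`, `retrieved`, `gold`) is an assumed logging format, not a prescribed one.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold chunks that appear in the top-k retrieved IDs."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    hits = gold_ids & set(retrieved_ids[:k])
    return len(hits) / len(gold_ids)


def per_hop_recall(hops: list[dict], k: int) -> list[dict]:
    """Score each hop separately so a hop-1 miss is not hidden by the final score."""
    return [
        {
            "sub_question": hop["sub_question"],
            "recall_at_k": recall_at_k(hop["retrieved"], hop["gold"], k),
        }
        for hop in hops
    ]
```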

Production checklist

Tag golden QA set with hop_count and required bridge entity types.
Build a router: single-hop vs multi-hop (keywords, entity count, classifier).
Define JSON schema for bridge entities with allowlisted types.
Implement per-hop retrieve + rerank with metadata filters on hops 2+.
Cap hops (3) and total chunks (24) across all rounds.
Log sub-questions, bridge values, and chunk IDs per hop for debugging.
Measure retrieval recall@k per hop, not only faithfulness of the final answer.
Refuse synthesis when any sub-question lacks evidence.
Keep single-shot path for simple queries to control cost.
Re-evaluate when CMDB or graph exports change entity ID formats.
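
As a starting point for the router item in the checklist, a heuristic version can count known entity mentions and look for chaining cues. The cue phrases and the two-entity threshold below are hypothetical and would need tuning against a labeled golden set; a trained classifier would replace this in most deployments.

```python
DEFAULT_CUES = ("for the service that", "and what is", "same as")


def route(
    question: str,
    known_entities: list[str],
    cue_phrases: tuple[str, ...] = DEFAULT_CUES,
) -> str:
    """Return 'multi_hop' when the question looks chained, else 'single_shot'."""
    text = question.lower()
    entity_hits = sum(1 for entity in known_entities if entity.lower() in text)
    has_cue = any(cue in text for cue in cue_phrases)
    return "multi_hop" if entity_hits >= 2 or has_cue else "single_shot"
```

Misroutes in either direction are worth logging: a multi-hop question sent to single-shot shows up as a faithfulness failure, while the reverse shows up only as extra cost.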

Key takeaways

Multi-hop questions need chained retrieval — one embedding query cannot surface all required evidence.
Bridge entities extracted from hop n drive the query for hop n+1; validate extractions before committing.
Harbor Engineering cut wrong escalations 31%→8% with bounded 3-hop pipelines and single-shot routing for simple FAQs.
Graph edges help stable ownership relations; flat RAG still wins hop 1 for paraphrased service names.
Route by question shape — do not pay multi-hop latency on every query.