Agentic RAG explained

Classic RAG embeds the user question once, pulls the top-k chunks, and generates an answer. That pipeline is fast and cheap, but it fails when the question is ambiguous, spans multiple documents, or needs a follow-up search after the first pass misses the mark. Agentic RAG wraps retrieval in an agent loop: the model plans sub-queries, calls search tools, grades whether the context is relevant and complete, and retries with refined queries until it can answer confidently or must escalate. Patterns like Self-RAG, Corrective RAG (CRAG), and ReAct-style tool orchestration have moved iterative retrieval from a research curiosity into a common production pattern for support bots, legal research, and internal knowledge assistants. This guide explains how agentic RAG differs from single-shot vector RAG and graph RAG, then covers the core loop and named patterns, tool integration with LLM agents, cost and latency tradeoffs, a worked example (the Harbor Support escalation agent), an architecture decision table, common pitfalls, and a production checklist. It is meant to be read alongside our RAG evaluation guide.

Single-shot RAG vs agentic retrieval

In naive RAG, the retrieval step is stateless: one embedding, one search, one prompt assembly. The generator never sees evidence it did not retrieve on the first try. Three failure modes dominate production logs:

  • Query mismatch: the user asks about “refund policy for annual plans cancelled mid-cycle” but the embedding matches a generic billing FAQ because the phrasing differs from the indexed docs.
  • Multi-hop gaps: answering requires facts from two unrelated chunks (pricing table + exception clause), neither of which ranks in the top-k on its own.
  • Insufficient context: retrieved chunks are topically related but do not contain the specific date, SKU, or jurisdiction the answer needs.

Agentic RAG treats retrieval as a sequence of decisions rather than a single function call. An orchestrator (often the same LLM with tool definitions) can decompose the question, run multiple searches, reject irrelevant passages, rewrite the query, and only then synthesize. The extra LLM rounds add latency and token cost — but they recover accuracy on questions where naive RAG hallucinates or refuses.

Where agentic RAG sits in the stack

Agentic RAG is not a replacement for good indexing. Chunking, hybrid search, reranking, and graph extraction still matter; the agent loop sits on top of those retrievers as a controller. A single agent session can orchestrate vector search, BM25, SQL over a warehouse, and a function-calling API. The agent chooses which tool to call and what query to send, not just how to phrase the final answer.

Named patterns: Self-RAG, CRAG, and query decomposition

Self-RAG (retrieve, generate, critique)

Self-RAG (Asai et al., 2023) trains the model to emit reflection tokens at each step: should I retrieve? are these passages relevant? is my draft supported? is the answer useful? At inference, the model can skip retrieval when internal knowledge suffices, retrieve when uncertain, and discard irrelevant chunks before generation. Even without special training tokens, production systems mimic Self-RAG with a lightweight retrieval grader (a small classifier or LLM call) that scores each chunk 0–1 before prompt assembly.
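A retrieval grader of this kind can be sketched in a few lines. The example below is a minimal illustration, not the Self-RAG method itself: `keyword_overlap_grader` is a hypothetical stand-in for a small classifier or LLM call, and the 0.5 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Chunk:
    doc_id: str
    text: str


def keyword_overlap_grader(query: str, chunk: Chunk) -> float:
    """Toy grader (assumption): share of query terms found in the chunk, 0 to 1."""
    terms = {t.lower() for t in query.split()}
    if not terms:
        return 0.0
    words = {w.lower().strip(".,;:?!") for w in chunk.text.split()}
    return len(terms & words) / len(terms)


def filter_chunks(
    query: str,
    chunks: List[Chunk],
    grader: Callable[[str, Chunk], float] = keyword_overlap_grader,
    threshold: float = 0.5,  # hypothetical cut-off; calibrate on real data
) -> List[Tuple[float, Chunk]]:
    """Score every chunk and keep only those at or above the threshold, best first."""
    scored = [(grader(query, c), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

In a real pipeline the grader would be swapped for a model call with the same signature, so the filtering logic stays unchanged.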

Corrective RAG (CRAG)

Corrective RAG adds an explicit corrective step when the initial retrieval scores below a threshold. Low-confidence retrievals trigger a fallback: web search, a broader keyword query, or a different index partition. High-confidence retrievals pass through a knowledge refinement step that strips irrelevant sentences from chunks before the generator sees them. CRAG thereby reduces noise in the context window, a common cause of confident wrong answers.
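The two branches described above can be sketched as follows. This is a simplified illustration under stated assumptions: `overlap_score` is a toy grader, the 0.6 threshold is arbitrary, and `primary` and `fallback` are hypothetical retriever callables. It follows the two-branch description in this guide and does not model any intermediate case.

```python
from typing import Callable, List, Tuple

Retriever = Callable[[str], List[str]]


def overlap_score(query: str, text: str) -> float:
    """Toy grader (assumption): share of query terms that appear in the text."""
    terms = set(query.lower().split())
    words = {w.strip(".,;:") for w in text.lower().split()}
    return len(terms & words) / len(terms) if terms else 0.0


def refine(query: str, passage: str, min_score: float) -> str:
    """Knowledge refinement: keep only the sentences the grader rates relevant."""
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    kept = [s for s in sentences if overlap_score(query, s) >= min_score]
    return (". ".join(kept) + ".") if kept else ""


def corrective_retrieve(
    query: str, primary: Retriever, fallback: Retriever, threshold: float = 0.6
) -> Tuple[str, List[str]]:
    """Return ("refined", passages) on a confident hit, else ("fallback", passages)."""
    passages = primary(query)
    confident = [p for p in passages if overlap_score(query, p) >= threshold]
    if not confident:
        # Low confidence: switch to the fallback source instead of generating.
        return "fallback", fallback(query)
    refined = [refine(query, p, threshold) for p in confident]
    return "refined", [r for r in refined if r]
```

Sentence splitting on periods is deliberately naive here; a production refinement step would need a proper sentence segmenter.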

Query decomposition and step-back prompting

Complex questions split into sub-queries executed in parallel or sequence:

  • Decomposition — “Compare EU and US data retention requirements for enterprise tier” becomes two retrieval calls plus a synthesis step.
  • Step-back — first retrieve a high-level concept doc (“what is our data retention framework?”), then a specific policy page.
  • HyDE (Hypothetical Document Embeddings) — the model writes a fake answer paragraph, embeds that text, and searches with the hypothetical doc vector when the raw question embedding underperforms.
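As an illustration of the HyDE item, the sketch below searches with the embedding of a hypothetical answer instead of the raw question. Everything here is a toy assumption: `embed` is a bag-of-words counter standing in for a real embedding model, and `write_hypothetical` is a placeholder for an LLM call.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    docs: List[str],
    write_hypothetical: Callable[[str], str],  # hypothetical LLM wrapper
    k: int = 1,
) -> List[str]:
    """Embed a model-written hypothetical answer and rank docs against it."""
    hypothetical = write_hypothetical(question)
    query_vec = embed(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]
```

The point of the technique is that the hypothetical answer tends to share vocabulary with the target document even when the user's phrasing does not.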

Decomposition pairs naturally with multi-agent orchestration when sub-tasks have different tool permissions or model tiers.

The agent loop: plan, retrieve, grade, refine

Most production agentic RAG pipelines implement the same state machine with different names:

  1. Plan: classify query difficulty; decide the tool set and max iterations (budget cap).
  2. Retrieve: execute one or more search tools (vector, keyword, graph traverse, SQL).
  3. Grade: score relevance, coverage, and freshness; drop or down-rank weak passages.
  4. Refine: if coverage is insufficient, rewrite the query, broaden filters, or switch tools; loop back to Retrieve until coverage is sufficient or the budget is exhausted.
  5. Generate: synthesize with citations; optionally run a faithfulness check against the retrieved text.
  6. Escalate or abstain: if the loops fail, hand off to a human or return “I could not find authoritative documentation.”
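The state machine above can be sketched as a bounded loop. This is a minimal outline, not a framework API: `retrieve`, `grade`, `rewrite`, and `generate` are hypothetical callables you would back with real tools and models, and the default budget and coverage threshold are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    status: str                      # "answered" or "escalated"
    answer: Optional[str]
    iterations: int
    queries: List[str] = field(default_factory=list)


def agentic_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], float],
    rewrite: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_iterations: int = 3,         # budget cap chosen in the Plan step
    min_coverage: float = 0.7,       # hypothetical grader threshold
) -> LoopResult:
    query = question
    tried: List[str] = []
    for i in range(1, max_iterations + 1):
        tried.append(query)
        passages = retrieve(query)                                          # Retrieve
        kept = [p for p in passages if grade(question, p) >= min_coverage]  # Grade
        if kept:
            return LoopResult("answered", generate(question, kept), i, tried)  # Generate
        if i < max_iterations:
            query = rewrite(question, tried)                                # Refine
    return LoopResult("escalated", None, max_iterations, tried)            # Escalate
```

Returning the list of attempted queries alongside the result makes it easier to see why a request escalated.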

ReAct-style tool use

ReAct (reason + act) interleaves natural-language reasoning traces with tool calls. The model writes “I need the 2024 SOC2 report section on encryption” then invokes search(docs, query=...), reads results, and continues. This pattern is the practical backbone of agentic RAG in frameworks like LangGraph, LlamaIndex agents, and custom OpenAI Assistants workflows. Cap iteration count and total retrieved tokens — unbounded ReAct loops are a top source of runaway inference bills.
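A capped ReAct-style loop might look like the sketch below. It is framework-agnostic and does not use the LangGraph, LlamaIndex, or OpenAI APIs: `policy` is a hypothetical callable standing in for the model, and the retrieval budget is counted in characters as a rough proxy for tokens.

```python
from typing import Callable, Dict, List, Tuple

# (kind, name_or_answer, argument): kind is "act" or "finish".
Step = Tuple[str, str, str]


def react_loop(
    question: str,
    policy: Callable[[List[str]], Step],    # hypothetical model wrapper
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 4,
    max_context_chars: int = 2000,          # character proxy for a token budget
) -> Tuple[str, List[str]]:
    """Alternate model decisions and tool calls until finish or a budget is hit."""
    transcript = [f"Question: {question}"]
    used = 0
    for _ in range(max_steps):
        kind, name, arg = policy(transcript)
        if kind == "finish":
            return name, transcript
        if name not in tools:
            transcript.append(f"Observation: unknown tool {name}")
            continue
        observation = tools[name](arg)
        used += len(observation)
        if used > max_context_chars:
            return "ABSTAIN: retrieval budget exceeded", transcript
        transcript.append(f"Action: {name}({arg!r})")
        transcript.append(f"Observation: {observation}")
    return "ABSTAIN: step budget exceeded", transcript
```

Both caps return an explicit abstain message, so a runaway loop ends in a visible outcome instead of an open-ended bill.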

Latency and cost knobs

  • Max iterations — typically 2–5 for support bots; more for research assistants with async UX.
  • Grader model tier — use a 3B–8B classifier for chunk scoring; reserve the frontier model for final synthesis.
  • Parallel sub-queries — decomposition steps can fan out; wall-clock equals slowest branch, not sum of serial searches.
  • Early exit — if the grader scores all chunks above 0.85 on the first pass, skip refinement (most queries should hit this path).
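The parallel sub-query point can be demonstrated with `asyncio.gather`. In this sketch the `search` coroutine is a hypothetical stand-in that simulates network latency with `asyncio.sleep`; the delays are made-up values.

```python
import asyncio
import time
from typing import List, Tuple


async def search(query: str, delay: float) -> str:
    """Stand-in for a retrieval call; the delay simulates network latency."""
    await asyncio.sleep(delay)
    return f"results for {query}"


async def fan_out(sub_queries: List[Tuple[str, float]]) -> List[str]:
    """Run all sub-queries concurrently; results come back in input order."""
    return await asyncio.gather(*(search(q, d) for q, d in sub_queries))


def timed_fan_out(sub_queries: List[Tuple[str, float]]) -> Tuple[List[str], float]:
    """Return the results plus elapsed wall-clock seconds."""
    start = time.perf_counter()
    results = asyncio.run(fan_out(sub_queries))
    return results, time.perf_counter() - start
```

With two branches of 0.1 s and 0.2 s, the elapsed time should land near 0.2 s, not the 0.3 s a serial run would take.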

Tool integration beyond vector search

Agentic RAG shines when retrieval is heterogeneous. Common tools in one agent session:

  • Vector + BM25 hybrid — via hybrid search with metadata filters (product, region, doc version).
  • Structured SQL — “How many enterprise accounts cancelled last quarter?” needs a warehouse query, not a help article.
  • Graph traversal — follow entity edges when the question names relationships (see graph RAG indexes as a tool, not a separate product).
  • Live APIs — ticket status, account entitlements, shipping tracking; stale docs cannot answer these.
  • Web search fallback — CRAG-style external retrieval when internal corpus confidence is low (with source attribution and policy guardrails).
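To make the heterogeneous-tool idea concrete, here is a small sketch of a tool registry that exposes a structured SQL tool next to a keyword search. The table, its rows, and the tool names are invented for illustration; the SQL tool accepts parameters only, so the agent never writes raw SQL.

```python
import sqlite3
from typing import Any, Callable, Dict, List


def make_warehouse() -> sqlite3.Connection:
    """Build a tiny in-memory table standing in for a real warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER, tier TEXT, cancelled_quarter TEXT)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "enterprise", "2024Q4"), (2, "pro", "2024Q4"),
         (3, "enterprise", "2024Q4"), (4, "enterprise", None)],
    )
    return conn


def build_tools(conn: sqlite3.Connection) -> Dict[str, Callable[..., Any]]:
    def count_cancellations(tier: str, quarter: str) -> int:
        # Parameterized query: the agent supplies values, never SQL text.
        row = conn.execute(
            "SELECT COUNT(*) FROM accounts WHERE tier = ? AND cancelled_quarter = ?",
            (tier, quarter),
        ).fetchone()
        return row[0]

    def search_kb(query: str) -> List[str]:
        # Toy keyword match over two made-up article titles.
        articles = ["How to cancel a subscription", "Enterprise onboarding guide"]
        words = query.lower().split()
        return [a for a in articles if any(w in a.lower() for w in words)]

    return {"count_cancellations": count_cancellations, "search_kb": search_kb}


def call_tool(tools: Dict[str, Callable[..., Any]], name: str, **kwargs: Any) -> Any:
    """Dispatch a tool call by name, rejecting anything not registered."""
    if name not in tools:
        raise ValueError(f"unknown tool: {name}")
    return tools[name](**kwargs)
```

Keeping the registry explicit also gives one place to attach per-tool credentials and logging.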

Route simple lookups to single-shot RAG via a model router; reserve the full agent loop for queries the classifier marks as multi-step or whose first-pass retrieval score is low.

Worked example: Harbor Support escalation agent

Harbor Support handles B2B SaaS tickets. Tier-1 uses naive RAG over 4,200 help articles. Tier-2 agentic RAG activates when tier-1 confidence is below 0.7 or the ticket tags include billing-dispute, compliance, or multi-product.
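The activation rule just described is simple enough to express directly. The function name and signature below are assumptions; the 0.7 threshold and the three tags come from the description above.

```python
from typing import Iterable

ESCALATION_TAGS = {"billing-dispute", "compliance", "multi-product"}


def needs_tier2(
    tier1_confidence: float, tags: Iterable[str], threshold: float = 0.7
) -> bool:
    """Activate tier-2 when confidence is below the threshold or a listed tag is present."""
    return tier1_confidence < threshold or bool(ESCALATION_TAGS & set(tags))
```

For example, `needs_tier2(0.9, ["compliance"])` is true because of the tag, even though the tier-1 confidence is high.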

Ticket: “We upgraded from Pro to Enterprise in March, were charged twice in April, and need confirmation our EU customer data stays in Frankfurt under the new contract.”

  1. Plan: router flags multi-domain (billing + data residency); budget: 4 iterations, tools: search_kb, query_billing_db, fetch_contract.
  2. Retrieve (pass 1): search_kb("duplicate charge enterprise upgrade") returns a generic payment FAQ (grader: 0.4 relevance).
  3. Refine: agent decomposes into (a) duplicate charge policy, (b) Enterprise data residency addendum.
  4. Retrieve (pass 2): in parallel, query_billing_db(account_id, month=April) shows two invoices; search_kb("EU data residency Frankfurt enterprise DPA") hits the DPA appendix (grader: 0.92).
  5. Retrieve (pass 3): fetch_contract(account_id, effective=March) confirms the Frankfurt region clause.
  6. Generate: draft cites invoice IDs, refund SOP link, and DPA section 4.2; faithfulness checker passes.

Median tier-2 latency: 6.8s (vs 1.1s tier-1). Escalation-to-human rate dropped 34% on the pilot queue. Token cost per tier-2 ticket averages 4.2× tier-1 — still cheaper than a 12-minute human handle time.
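A quick way to reason about the 4.2× figure is to compute the blended cost across both tiers. The share of traffic routed to tier-2 is not reported for Harbor, so the 20% used in the usage note is an assumption, borrowed from the 80/20 router split in the decision table.

```python
def blended_cost_multiplier(agentic_share: float, agentic_cost_ratio: float) -> float:
    """Average cost per ticket relative to an all-single-shot baseline of 1.0."""
    if not 0.0 <= agentic_share <= 1.0:
        raise ValueError("agentic_share must be between 0 and 1")
    return (1.0 - agentic_share) * 1.0 + agentic_share * agentic_cost_ratio
```

With `agentic_share=0.2` and `agentic_cost_ratio=4.2`, the blend is 0.8 × 1 + 0.2 × 4.2 = 1.64, so the average ticket would cost roughly 1.64× the tier-1 baseline under that assumed split.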

Architecture decision table

| Scenario | Prefer | Avoid |
| --- | --- | --- |
| FAQ with answer in one doc | Single-shot vector RAG | Multi-iteration agent loop |
| Ambiguous or compound question | Agentic RAG with decomposition | Raising top-k to 50 chunks |
| First retrieval often off-topic | CRAG with fallback search + refinement | Stuffing irrelevant chunks into context |
| Needs live account or ticket data | ReAct agent with API tools + doc RAG | Indexing PII-heavy tables into vectors |
| Sub-2s chat widget SLA | Cached single-shot + async agent for hard queries | Synchronous 5-iteration loop on every message |
| Regulated answers with citations | Self-RAG grading + faithfulness check | Agent web search without allowlists |
| Tight inference budget | Router: 80% single-shot, 20% agentic | Frontier model on every grader call |
| Cross-document entity reasoning | Graph RAG index as agent tool + vector | Serial blind re-queries without graph structure |

Common pitfalls

  • Unbounded loops — agents that never abstain burn budget and annoy users; hard-cap iterations and total retrieved tokens.
  • Grader too weak or too strong — a lenient grader passes noise; an overly strict grader triggers endless re-queries.
  • Agentic RAG on bad indexes — no loop fixes 500-token chunks that split tables across boundaries.
  • Ignoring routing — running the full agent on every query multiplies cost without accuracy gains on easy questions.
  • No observability per step — without logging each tool call and grader score, debugging wrong answers is guesswork.
  • Tool permission sprawl — giving the agent SQL + delete + email without scoped credentials is a security incident waiting to happen.
  • Confusing agentic RAG with fine-tuning — loops fix retrieval gaps; they do not teach the model new facts not in tools or corpus.
  • Skipping evaluation on multi-step QA — single-hop benchmarks understate agentic value; build compound question sets.
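For the observability pitfall, a per-request trace can be as simple as the sketch below. The class and field names are hypothetical; the idea is only that every tool call and grader score is recorded in a form that can be serialized and inspected later.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    step: str                        # e.g. "retrieve", "refine", "generate"
    tool: str
    query: str
    grader_score: Optional[float] = None


@dataclass
class RequestTrace:
    request_id: str
    steps: List[StepRecord] = field(default_factory=list)

    def log(self, step: str, tool: str, query: str,
            grader_score: Optional[float] = None) -> None:
        """Append one step to the trace."""
        self.steps.append(StepRecord(step, tool, query, grader_score))

    def weakest_step(self) -> Optional[StepRecord]:
        """Return the graded step with the lowest score, or None if nothing was graded."""
        graded = [s for s in self.steps if s.grader_score is not None]
        return min(graded, key=lambda s: s.grader_score, default=None)

    def to_json(self) -> str:
        """Serialize the whole trace for a log sink."""
        return json.dumps(asdict(self))
```

When an answer turns out wrong, `weakest_step()` points at the retrieval that most likely caused it.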

Production checklist

  • Benchmark single-shot RAG baseline before adding agent loops.
  • Define query router rules or classifier for agentic vs fast path.
  • Set max iterations, max tool calls, and per-request token budget.
  • Implement retrieval grader with calibrated thresholds on a golden set.
  • Log plan, each tool call, grader scores, and final citations per request.
  • Run parallel sub-queries where decomposition allows.
  • Add CRAG fallback only for allowlisted external sources.
  • Enforce tool auth scopes; never expose raw SQL without row-level security.
  • Measure p50/p95 latency and cost per tier separately.
  • Run faithfulness and citation checks on agent-generated answers.
  • A/B test escalation and human-handoff rates vs naive RAG.
  • Document abstain and escalate paths when loops exhaust budget.
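The grader-calibration item can be illustrated with a small threshold sweep. This is one possible approach, assuming a golden set of (grader score, human relevance label) pairs and F1 as the selection metric; the scores in the usage note are invented.

```python
from typing import Iterable, List, Tuple

Scored = List[Tuple[float, bool]]   # (grader score, labelled relevant?)


def f1_at(threshold: float, golden: Scored) -> float:
    """F1 of the 'keep chunk' decision at a given threshold."""
    tp = sum(1 for s, rel in golden if s >= threshold and rel)
    fp = sum(1 for s, rel in golden if s >= threshold and not rel)
    fn = sum(1 for s, rel in golden if s < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate(golden: Scored, candidates: Iterable[float]) -> float:
    """Pick the candidate threshold with the highest F1 on the golden set."""
    return max(candidates, key=lambda t: f1_at(t, golden))
```

On a golden set such as `[(0.92, True), (0.81, True), (0.66, True), (0.58, False), (0.40, False), (0.35, False)]` with candidates `[0.3, 0.5, 0.6, 0.7, 0.85]`, the sweep selects 0.6, the only candidate that separates the relevant and irrelevant chunks cleanly.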

Key takeaways

  • Agentic RAG wraps retrieval in a plan-grade-refine loop instead of one embedding search.
  • Self-RAG and CRAG name the critique and correction steps production systems already implement with graders and fallbacks.
  • Query decomposition and ReAct tool use handle multi-hop and live-data questions naive RAG cannot.
  • Routing is essential — reserve agent loops for hard queries; keep FAQs on single-shot paths.
  • Invest in observability, iteration caps, and multi-step evaluation before scaling agentic retrieval to all traffic.
