Guide
Agentic RAG explained
Classic RAG embeds the user question once, pulls the top-k chunks, and generates an answer. That pipeline is fast and cheap — and it fails when the question is ambiguous, spans multiple documents, or needs a follow-up search after the first pass misses the mark. Agentic RAG wraps retrieval in an agent loop: the model plans sub-queries, calls search tools, grades whether context is relevant and complete, and retries with refined queries until it can answer confidently or escalate. Patterns like Self-RAG, Corrective RAG (CRAG), and ReAct-style tool orchestration turned iterative retrieval from a research curiosity into a production default for support bots, legal research, and internal knowledge assistants. This guide explains how agentic RAG differs from single-shot vector RAG and graph RAG, the core loop and named patterns, tool integration with LLM agents, cost and latency tradeoffs, a Harbor Support escalation agent worked example, an architecture decision table, common pitfalls, and a production checklist alongside our RAG evaluation guide.
Single-shot RAG vs agentic retrieval
In naive RAG, the retrieval step is stateless: one embedding, one search, one prompt assembly. The generator never sees evidence it did not retrieve on the first try. Three failure modes dominate production logs:
- Query mismatch — the user asks about “refund policy for annual plans cancelled mid-cycle” but the embedding matches a generic billing FAQ because the phrasing differs from indexed docs.
- Multi-hop gaps — answering requires facts from two unrelated chunks (pricing table + exception clause), neither of which ranks in the top-k on its own.
- Insufficient context — retrieved chunks are topically related but do not contain the specific date, SKU, or jurisdiction the answer needs.
Agentic RAG treats retrieval as a sequence of decisions rather than a single function call. An orchestrator (often the same LLM with tool definitions) can decompose the question, run multiple searches, reject irrelevant passages, rewrite the query, and only then synthesize. The extra LLM rounds add latency and token cost — but they recover accuracy on questions where naive RAG hallucinates or refuses.
Where agentic RAG sits in the stack
Agentic RAG is not a replacement for good indexing. Chunking, hybrid search, reranking, and graph extraction still matter; the agent loop sits on top of those retrievers as a controller. A single agent session can orchestrate vector search, BM25, SQL over a warehouse, and a function-calling API. The agent chooses which tool to call and what query to send, not just how to phrase the final answer.
Named patterns: Self-RAG, CRAG, and query decomposition
Self-RAG (retrieve, generate, critique)
Self-RAG (Asai et al., 2023) trains or prompts the model to emit reflection tokens at each step: should I retrieve? are these passages relevant? is my draft supported? is the answer useful? At inference, the model can skip retrieval when internal knowledge suffices, retrieve when uncertain, and discard irrelevant chunks before generation. Even without special training tokens, production systems mimic Self-RAG with a lightweight retrieval grader (a small classifier or LLM call) that scores each chunk 0–1 before prompt assembly.
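A minimal sketch of the lightweight grader idea, under loud assumptions: `score_chunk` is a hypothetical stand-in for the small classifier or LLM call, implemented here as a toy term-overlap score so the example runs without a model. The `Chunk` type and the 0.5 cutoff are illustrative, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def _terms(text: str) -> set[str]:
    """Lowercased words with trailing punctuation removed."""
    return {t.lower().strip("?.,") for t in text.split()} - {""}


def score_chunk(question: str, chunk: Chunk) -> float:
    """Toy relevance score in [0, 1]: share of question terms found in the chunk.

    Hypothetical stand-in for a small classifier or an LLM grading call.
    """
    q = _terms(question)
    return len(q & _terms(chunk.text)) / len(q) if q else 0.0


def grade_chunks(question: str, chunks: list[Chunk], threshold: float = 0.5):
    """Keep chunks scoring at or above the threshold, best first."""
    scored = [(c, score_chunk(question, c)) for c in chunks]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


chunks = [
    Chunk("billing-faq", "Refund policy for annual plans cancelled mid-cycle"),
    Chunk("sso-setup", "Configure SAML single sign-on for your workspace"),
]
graded = grade_chunks("refund policy annual plans", chunks)
```

Only the chunks in `graded` would go into prompt assembly; the off-topic SSO chunk is dropped before the generator sees it.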
Corrective RAG (CRAG)
Corrective RAG adds an explicit correction step when initial retrieval scores below a threshold. Low-confidence retrievals trigger a fallback: web search, a broader keyword query, or a different index partition. High-confidence retrievals pass through a knowledge refinement step that strips irrelevant sentences from chunks before the generator sees them. CRAG reduces noise in the context window, a common cause of confident wrong answers.
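The CRAG control flow can be sketched roughly as follows. Everything here is an assumption for illustration: `overlap_grade` is a toy word-overlap grader, `kb_search` and `web_search` are hypothetical retrievers with placeholder text, and the 0.6 threshold is arbitrary. It is not the published CRAG implementation.

```python
def overlap_grade(query: str, passage: str) -> float:
    """Toy grader: share of query words that appear in the passage."""
    q = {w.lower() for w in query.split()}
    p = {w.lower().strip(".,;:") for w in passage.split()}
    return len(q & p) / len(q) if q else 0.0


def refine(passage: str, query: str) -> str:
    """Knowledge refinement: keep only sentences that share a word with the query."""
    q = {w.lower() for w in query.split()}
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    kept = [s for s in sentences if q & {w.lower().strip(",;:") for w in s.split()}]
    return " ".join(s + "." for s in kept)


def corrective_retrieve(query, primary, fallback, grade=overlap_grade, threshold=0.6):
    """Return (source, passages); fall back when no primary passage is confident."""
    scored = [(p, grade(query, p)) for p in primary(query)]
    confident = [p for p, s in scored if s >= threshold]
    if not confident:
        return "fallback", fallback(query)
    return "primary", [refine(p, query) for p in confident]


def kb_search(query):   # hypothetical internal index
    return ["Enterprise refunds follow the refund SOP. Our office hours are listed elsewhere."]


def web_search(query):  # hypothetical allowlisted external search
    return ["(external result placeholder)"]


source, passages = corrective_retrieve("enterprise refunds", kb_search, web_search)
```

With this toy data the first query stays on the primary path and the office-hours sentence is stripped; a query such as "data residency" scores zero against the knowledge base and takes the fallback branch.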
Query decomposition and step-back prompting
Complex questions split into sub-queries executed in parallel or sequence:
- Decomposition — “Compare EU and US data retention requirements for enterprise tier” becomes two retrieval calls plus a synthesis step.
- Step-back — first retrieve a high-level concept doc (“what is our data retention framework?”), then a specific policy page.
- HyDE (Hypothetical Document Embeddings) — the model writes a fake answer paragraph, embeds that text, and searches with the hypothetical doc vector when the raw question embedding underperforms.
Decomposition pairs naturally with multi-agent orchestration when sub-tasks have different tool permissions or model tiers.
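As a rough sketch of decomposition with parallel retrieval: `decompose` below is a hypothetical stand-in for an LLM planning call and returns fixed sub-queries, and `INDEX` is a placeholder in-memory retriever. Only `ThreadPoolExecutor` is real library API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory index standing in for a real retriever.
INDEX = {
    "eu data retention enterprise": "EU retention policy page (placeholder)",
    "us data retention enterprise": "US retention policy page (placeholder)",
}


def decompose(question: str) -> list[str]:
    """Stand-in for an LLM planning call; returns fixed sub-queries for the demo."""
    return ["eu data retention enterprise", "us data retention enterprise"]


def search(sub_query: str) -> str:
    return INDEX.get(sub_query, "")


def retrieve_decomposed(question: str) -> dict[str, str]:
    """Fan sub-queries out in parallel and collect evidence per sub-query."""
    sub_queries = decompose(question)
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        results = list(pool.map(search, sub_queries))  # map preserves input order
    return dict(zip(sub_queries, results))


evidence = retrieve_decomposed(
    "Compare EU and US data retention requirements for enterprise tier"
)
```

The synthesis step would then receive `evidence` keyed by sub-query, which makes it easy to notice when one branch came back empty.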
The agent loop: plan, retrieve, grade, refine
Most production agentic RAG pipelines implement the same state machine with different names:
- Plan — classify query difficulty; decide tool set and max iterations (budget cap).
- Retrieve — execute one or more search tools (vector, keyword, graph traverse, SQL).
- Grade — score relevance, coverage, and freshness; drop or down-rank weak passages.
- Refine — if coverage is insufficient, rewrite the query, broaden filters, or switch tools; loop until coverage is sufficient or the budget is exhausted.
- Generate — synthesize with citations; optionally run a faithfulness check against retrieved text.
- Escalate or abstain — if loops fail, hand off to human or return “I could not find authoritative documentation.”
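The state machine above can be sketched as a bounded loop. This is a simplified illustration, not a framework API: `retrieve`, `grade`, and `rewrite` are hypothetical callables supplied by the caller, the plan step is reduced to fixed defaults, and generation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class LoopResult:
    status: str                      # "answered" or "abstained"
    iterations: int
    passages: list[str] = field(default_factory=list)


def agentic_retrieve(question, retrieve, grade, rewrite,
                     max_iterations=3, min_score=0.7):
    """Retrieve, grade, and refine until passages pass or the budget runs out."""
    query = question
    for i in range(1, max_iterations + 1):
        passages = retrieve(query)                                        # Retrieve
        good = [p for p in passages if grade(question, p) >= min_score]   # Grade
        if good:
            return LoopResult("answered", i, good)                        # ready to Generate
        query = rewrite(question, query, i)                               # Refine
    return LoopResult("abstained", max_iterations)                        # Escalate or abstain


# Toy stand-ins so the sketch runs without a model or an index.
DOCS = {
    "duplicate charge enterprise upgrade": ["Generic payment FAQ"],
    "duplicate charge refund policy": ["Refund SOP for duplicate charges"],
}


def retrieve(query):
    return DOCS.get(query, [])


def grade(question, passage):
    return 0.9 if "duplicate" in passage.lower() else 0.4


def rewrite(question, last_query, attempt):
    return "duplicate charge refund policy"


result = agentic_retrieve("duplicate charge enterprise upgrade", retrieve, grade, rewrite)
```

Here the first pass fails grading, the rewritten query succeeds on the second, and a retriever that never returns anything ends in `abstained` after `max_iterations` passes.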
ReAct-style tool use
ReAct (reason + act) interleaves natural-language reasoning traces with tool calls. The model writes “I need the 2024 SOC2 report section on encryption” then invokes search(docs, query=...), reads results, and continues. This pattern is the practical backbone of agentic RAG in frameworks like LangGraph, LlamaIndex agents, and custom OpenAI Assistants workflows. Cap the iteration count and total retrieved tokens, because unbounded ReAct loops are a top source of runaway inference bills.
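One way to enforce both caps is a small budget object checked around every tool call. The sketch below is framework-agnostic and entirely hypothetical: `policy` stands in for the model deciding the next action, and tokens are estimated by whitespace splitting rather than a real tokenizer.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the loop hits its step or retrieved-token cap."""


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def start_step(self) -> None:
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step cap reached")
        self.steps += 1

    def add_observation(self, text: str) -> None:
        self.tokens += len(text.split())  # crude whitespace token estimate
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("retrieved-token cap reached")


def run_react(policy, tools, budget):
    """policy(trace) returns ("call", tool_name, arg) or ("finish", answer)."""
    trace = []
    try:
        while True:
            action = policy(trace)
            if action[0] == "finish":
                return action[1], trace
            _, tool_name, arg = action
            budget.start_step()                  # refuse before spending on the call
            observation = tools[tool_name](arg)
            budget.add_observation(observation)
            trace.append((tool_name, arg, observation))
    except BudgetExceeded:
        return None, trace                       # caller should abstain or escalate


def looping_policy(trace):
    """A policy that never finishes, to show the cap doing its job."""
    return ("call", "search", "SOC2 encryption")


def search(query):
    return "no match"


budget = Budget(max_steps=3, max_tokens=100)
answer, trace = run_react(looping_policy, {"search": search}, budget)
```

The looping policy is stopped after three tool calls and returns `None`, which the caller can map to an abstain or escalate path.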
Latency and cost knobs
- Max iterations — typically 2–5 for support bots; more for research assistants with async UX.
- Grader model tier — use a 3B–8B classifier for chunk scoring; reserve the frontier model for final synthesis.
- Parallel sub-queries — decomposition steps can fan out; wall-clock equals slowest branch, not sum of serial searches.
- Early exit — if the grader scores all chunks above 0.85 on the first pass, skip refinement (most queries should hit this path).
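These knobs are easiest to reason about when they live in one config object. A minimal sketch, assuming hypothetical model names (`small-grader`, `frontier-model`) and the 0.85 early-exit threshold mentioned above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopConfig:
    max_iterations: int = 3
    early_exit_threshold: float = 0.85
    grader_model: str = "small-grader"       # hypothetical model name
    synthesis_model: str = "frontier-model"  # hypothetical model name


def should_exit_early(scores: list[float], config: LoopConfig) -> bool:
    """True when every first-pass chunk scores above the early-exit threshold."""
    return bool(scores) and all(s > config.early_exit_threshold for s in scores)


config = LoopConfig()
fast_path = should_exit_early([0.90, 0.92], config)
needs_refinement = not should_exit_early([0.90, 0.40], config)
```

An empty score list deliberately does not count as an early exit: no retrieved chunks means the loop should refine or abstain, not generate.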
Tool integration beyond vector search
Agentic RAG shines when retrieval is heterogeneous. Common tools in one agent session:
- Vector + BM25 hybrid — via hybrid search with metadata filters (product, region, doc version).
- Structured SQL — “How many enterprise accounts cancelled last quarter?” needs a warehouse query, not a help article.
- Graph traversal — follow entity edges when the question names relationships (treat graph RAG indexes as a tool, not a separate product).
- Live APIs — ticket status, account entitlements, shipping tracking; stale docs cannot answer these.
- Web search fallback — CRAG-style external retrieval when internal corpus confidence is low (with source attribution and policy guardrails).
Route simple lookups to single-shot RAG via a model router; reserve the full agent loop for queries the classifier marks as multi-step or that score low on first-pass retrieval.
Worked example: Harbor Support escalation agent
Harbor Support handles B2B SaaS tickets. Tier-1 uses naive RAG over 4,200 help articles. Tier-2 agentic RAG activates when tier-1 confidence is below 0.7 or the ticket tags include billing-dispute, compliance, or multi-product.
Ticket: “We upgraded from Pro to Enterprise in March, were charged twice in April, and need confirmation our EU customer data stays in Frankfurt under the new contract.”
- Plan — router flags multi-domain (billing + data residency); budget: 4 iterations, tools: search_kb, query_billing_db, fetch_contract.
- Retrieve (pass 1) — search_kb("duplicate charge enterprise upgrade") returns generic payment FAQ (grader: 0.4 relevance).
- Refine — agent decomposes: (a) duplicate charge policy, (b) Enterprise data residency addendum.
- Retrieve (pass 2) — parallel: query_billing_db(account_id, month=April) shows two invoices; search_kb("EU data residency Frankfurt enterprise DPA") hits DPA appendix (grader: 0.92).
- Retrieve (pass 3) — fetch_contract(account_id, effective=March) confirms Frankfurt region clause.
- Generate — draft cites invoice IDs, refund SOP link, and DPA section 4.2; faithfulness checker passes.
Median tier-2 latency: 6.8s (vs 1.1s tier-1). Escalation-to-human rate dropped 34% on the pilot queue. Token cost per tier-2 ticket averages 4.2× tier-1 — still cheaper than a 12-minute human handle time.
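The tier-2 activation rule stated at the start of this example can be written as a small routing function. The 0.7 floor and the three tags come from the description above; the function and constant names are illustrative, and treating a confidence of exactly 0.7 as tier-1 is an assumption about how "below 0.7" is applied.

```python
ESCALATION_TAGS = {"billing-dispute", "compliance", "multi-product"}
CONFIDENCE_FLOOR = 0.7


def select_tier(tier1_confidence: float, tags: set[str]) -> int:
    """Return 2 when tier-2 agentic RAG should handle the ticket, else 1."""
    if tier1_confidence < CONFIDENCE_FLOOR or tags & ESCALATION_TAGS:
        return 2
    return 1


# A confident tier-1 answer still escalates when the ticket carries a listed tag.
tier = select_tier(0.95, {"compliance", "billing-dispute"})
```

Keeping the rule in one function makes it straightforward to log the routing decision next to the latency and cost figures for each tier.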
Architecture decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| FAQ with answer in one doc | Single-shot vector RAG | Multi-iteration agent loop |
| Ambiguous or compound question | Agentic RAG with decomposition | Raising top-k to 50 chunks |
| First retrieval often off-topic | CRAG with fallback search + refinement | Stuffing irrelevant chunks into context |
| Needs live account or ticket data | ReAct agent with API tools + doc RAG | Indexing PII-heavy tables into vectors |
| Sub-2s chat widget SLA | Cached single-shot + async agent for hard queries | Synchronous 5-iteration loop on every message |
| Regulated answers with citations | Self-RAG grading + faithfulness check | Agent web search without allowlists |
| Tight inference budget | Router: 80% single-shot, 20% agentic | Frontier model on every grader call |
| Cross-document entity reasoning | Graph RAG index as agent tool + vector | Serial blind re-queries without graph structure |
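To see what the routing row implies for spend, here is a back-of-the-envelope calculation. It combines the 80/20 split from the table with the 4.2× tier-2 token cost from the Harbor Support example; pairing those two figures is an assumption for illustration, since they come from different contexts.

```python
def blended_cost_multiplier(agentic_share: float, agentic_multiplier: float) -> float:
    """Average per-query token cost relative to an all-single-shot baseline of 1.0."""
    if not 0.0 <= agentic_share <= 1.0:
        raise ValueError("agentic_share must be between 0 and 1")
    return (1.0 - agentic_share) * 1.0 + agentic_share * agentic_multiplier


blended = blended_cost_multiplier(agentic_share=0.20, agentic_multiplier=4.2)
print(f"{blended:.2f}x")  # prints 1.64x
```

Under these assumptions, routing 20% of traffic to the agent loop raises average token cost by roughly 64% over pure single-shot, compared with 320% if every query took the agentic path.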
Common pitfalls
- Unbounded loops — agents that never abstain burn budget and annoy users; hard-cap iterations and total retrieved tokens.
- Grader too lenient or too strict — a lenient grader passes noise; an overly strict grader triggers endless re-queries.
- Agentic RAG on bad indexes — no loop fixes 500-token chunks that split tables across boundaries.
- Ignoring routing — running the full agent on every query multiplies cost without accuracy gains on easy questions.
- No observability per step — without logging each tool call and grader score, debugging wrong answers is guesswork.
- Tool permission sprawl — giving the agent SQL + delete + email without scoped credentials is a security incident waiting to happen.
- Confusing agentic RAG with fine-tuning — loops fix retrieval gaps; they do not teach the model new facts not in tools or corpus.
- Skipping evaluation on multi-step QA — single-hop benchmarks understate agentic value; build compound question sets.
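For the tool permission pitfall, one simple guard is an allowlist wrapper between the agent and its tools. This is a sketch with hypothetical names (`ScopedTools`, `search_kb`, `delete_record`); it complements scoped credentials at the data layer and does not replace them.

```python
class ToolPermissionError(Exception):
    """Raised when the agent requests a tool outside its allowlist."""


class ScopedTools:
    def __init__(self, tools: dict, allowed: set[str]):
        unknown = set(allowed) - set(tools)
        if unknown:
            raise ValueError(f"allowlist names unknown tools: {sorted(unknown)}")
        self._tools = dict(tools)
        self._allowed = frozenset(allowed)

    def call(self, name: str, *args, **kwargs):
        """Dispatch to a tool only if this agent is allowed to use it."""
        if name not in self._allowed:
            raise ToolPermissionError(f"tool {name!r} is not allowed for this agent")
        return self._tools[name](*args, **kwargs)


def search_kb(query):          # hypothetical read-only tool
    return f"results for {query}"


def delete_record(record_id):  # hypothetical destructive tool
    return f"deleted {record_id}"


support_agent_tools = ScopedTools(
    {"search_kb": search_kb, "delete_record": delete_record},
    allowed={"search_kb"},
)
found = support_agent_tools.call("search_kb", "refund policy")
```

A call to `delete_record` through `support_agent_tools` raises `ToolPermissionError` instead of executing, even though the function is registered.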
Production checklist
- Benchmark single-shot RAG baseline before adding agent loops.
- Define query router rules or classifier for agentic vs fast path.
- Set max iterations, max tool calls, and per-request token budget.
- Implement retrieval grader with calibrated thresholds on a golden set.
- Log plan, each tool call, grader scores, and final citations per request.
- Run parallel sub-queries where decomposition allows.
- Add CRAG fallback only for allowlisted external sources.
- Enforce tool auth scopes; never expose raw SQL without row-level security.
- Measure p50/p95 latency and cost per tier separately.
- Run faithfulness and citation checks on agent-generated answers.
- A/B test escalation and human-handoff rates vs naive RAG.
- Document abstain and escalate paths when loops exhaust budget.
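The logging item on this checklist can start as one structured record per request. A minimal sketch using only the standard library; the `RequestTrace` class and its field names are illustrative assumptions, not a standard schema.

```python
import json
import time
import uuid


class RequestTrace:
    """Collects the plan, tool calls, grader scores, and citations for one request."""

    def __init__(self, question: str):
        self.record = {
            "request_id": str(uuid.uuid4()),
            "question": question,
            "steps": [],
        }

    def log_step(self, kind: str, **details) -> None:
        self.record["steps"].append({"kind": kind, "ts": time.time(), **details})

    def to_json(self) -> str:
        return json.dumps(self.record)


trace = RequestTrace("Were we charged twice in April?")
trace.log_step("plan", tools=["search_kb", "query_billing_db"], max_iterations=4)
trace.log_step("tool_call", tool="search_kb",
               query="duplicate charge enterprise upgrade", grader_score=0.4)
trace.log_step("generate", citations=["refund SOP"])
line = trace.to_json()  # ship this line to the log pipeline of your choice
```

With one JSON line per request, a wrong answer can be traced back to the specific tool call or grader score that sent the loop in the wrong direction.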
Key takeaways
- Agentic RAG wraps retrieval in a plan-grade-refine loop instead of one embedding search.
- Self-RAG and CRAG name the critique and correction steps production systems already implement with graders and fallbacks.
- Query decomposition and ReAct tool use handle multi-hop and live-data questions naive RAG cannot.
- Routing is essential — reserve agent loops for hard queries; keep FAQs on single-shot paths.
- Invest in observability, iteration caps, and multi-step evaluation before scaling agentic retrieval to all traffic.
Related reading
- RAG explained — chunking, embeddings, and the single-shot retrieval baseline
- Graph RAG explained — entity graphs and community summaries as a retrieval tool
- AI agents and tool use explained — function calling, permissions, and agent safety
- RAG evaluation explained — faithfulness, context recall, and multi-hop test sets