
RAG retrieval pipelines for LLM agents, explained

Harbor Legal runs an internal policy assistant over 14,000 pages of HR handbooks, state addenda, and vendor contracts. The first production version naively pasted the top three whole documents into every agent turn. Context windows filled before the model could reason, irrelevant clauses drowned out the one paragraph that mattered, and the agent invented policy where retrieval missed. In the first month, 52% of sampled answers contained unsupported claims. Engineering then replaced prompt stuffing with a staged retrieval-augmented generation (RAG) pipeline: query rewriting, hybrid dense+sparse search, cross-encoder reranking, and citation-bound tool loops coordinated with context budgets. The unsupported-claim rate fell to 7% and p95 time-to-first-token dropped 38%.

RAG in agent systems is not a single vector lookup. It is a pipeline that decides when to retrieve, what to search, how to fuse and rank candidates, and how to inject evidence into the model without blowing token limits or leaking cross-tenant data. This guide covers agent-specific retrieval triggers, index design, hybrid search, rerankers, citation contracts, coordination with episodic and vector memory, Harbor Legal’s refactor, a technique decision table, pitfalls, and a production checklist.

Why agents need a pipeline, not a single embed call

Chatbots often run one embedding query per user message. Agents add complexity because retrieval fires at multiple lifecycle points:

  • Pre-plan retrieval — gather domain facts before the model chooses tools.
  • Mid-loop retrieval — each tool step may need fresh evidence (CRM record, ticket thread, code file).
  • Post-tool synthesis — compress noisy tool output into citeable spans for the next turn.
  • Memory hydration — pull prior session summaries and long-term facts from vector stores.
  • Guardrail verification — retrieve policy snippets to ground refusal or approval decisions.

Each stage needs different k, filters, and latency budgets. A monolithic “search the whole corpus” step wastes tokens and invites hallucination when the wrong chunks rank high. Instrument each stage as its own span in distributed tracing so you can see whether bad answers come from retrieval miss, ranking drift, or synthesis failure.
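One way to make the per-stage differences explicit is a small settings table keyed by lifecycle point. The sketch below is illustrative only: the stage names mirror the list above, but `StageConfig`, `config_for`, the `session_id` and `document_class` filter names, and every number are assumptions, not values from Harbor Legal.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class StageConfig:
    k: int                      # how many chunks this stage may return
    latency_budget_ms: int      # soft deadline for the stage
    filters: Tuple[str, ...]    # metadata fields that must be constrained


# Illustrative values only; tune them against your own traces.
STAGES: Dict[str, StageConfig] = {
    "pre_plan": StageConfig(8, 400, ("tenant_id", "jurisdiction")),
    "mid_loop": StageConfig(4, 250, ("tenant_id",)),
    "post_tool": StageConfig(3, 200, ("tenant_id",)),
    "memory_hydration": StageConfig(5, 150, ("tenant_id", "session_id")),
    "guardrail": StageConfig(2, 100, ("tenant_id", "document_class")),
}


def config_for(stage: str) -> StageConfig:
    """Look up a stage's settings, failing loudly on unknown stage names."""
    if stage not in STAGES:
        raise ValueError(f"unknown retrieval stage: {stage!r}")
    return STAGES[stage]
```

Using the same stage name as the tracing span name keeps configuration and traces aligned, so a latency regression can be read against the budget that stage was given.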

Pipeline stages in production order

1. Retrieval trigger and query formulation

Not every turn needs RAG. Use lightweight classifiers or heuristics: factual questions, policy lookups, and unknown entities trigger retrieval; pure formatting or arithmetic may not. When retrieval runs, generate a search query distinct from the user’s raw text — resolve pronouns, expand acronyms, and strip chit-chat. Multi-hop agents may emit several sub-queries in parallel; cap fan-out with rate limits on embed and search APIs.
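A minimal heuristic sketch of the trigger and rewrite steps follows. The patterns, the `ACRONYMS` table, and the `last_entity` argument are hypothetical stand-ins; a production system would more likely use a small classifier for the trigger and a model call for pronoun resolution.

```python
import re
from typing import Optional

# Hypothetical acronym table; in practice this comes from a domain glossary.
ACRONYMS = {"pto": "paid time off", "fmla": "Family and Medical Leave Act"}

CHITCHAT = re.compile(r"^(hi|hello|hey|thanks|thank you|please)[,!. ]+", re.IGNORECASE)
SKIP_PATTERNS = [
    re.compile(r"^\s*[\d\s.+\-*/()]+\s*=?\s*$"),  # pure arithmetic
    re.compile(r"\b(reformat|rephrase|bullet points)\b", re.IGNORECASE),  # formatting
]


def should_retrieve(message: str) -> bool:
    """Skip retrieval for turns that look like arithmetic or formatting requests."""
    return not any(p.search(message) for p in SKIP_PATTERNS)


def rewrite_query(message: str, last_entity: Optional[str] = None) -> str:
    """Strip a leading greeting, resolve one pronoun, and expand known acronyms."""
    text = CHITCHAT.sub("", message.strip())
    if last_entity:
        # Crude pronoun resolution: replace the first it/that/this with the
        # entity most recently discussed in the session.
        text = re.sub(r"\b(it|that|this)\b", lambda m: last_entity, text,
                      count=1, flags=re.IGNORECASE)

    def expand(match: "re.Match[str]") -> str:
        word = match.group(0)
        return ACRONYMS.get(word.lower(), word)

    return re.sub(r"\b[A-Za-z]{2,6}\b", expand, text)
```

For example, `rewrite_query("Hi, how much PTO do new hires accrue?")` yields a query with the greeting removed and "PTO" spelled out, which gives both the dense and sparse legs more to match on.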

2. Metadata filters and tenant isolation

Apply hard filters before vector math: tenant_id, document class, effective date, jurisdiction, clearance level. Filters belong in the index query, not in prompt instructions, because a model can drop a constraint that exists only as prompt text. Coordinate with your tenant isolation design so namespace mistakes cannot return another customer’s chunks.
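The ordering matters: ineligible chunks should never reach similarity scoring. The in-memory sketch below shows that ordering under stated assumptions. The `Chunk` fields and `filtered_search` are hypothetical, and a real vector store would apply the same predicate through its own filter syntax.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class Chunk:
    chunk_id: str
    tenant_id: str
    jurisdiction: str
    effective_from: date
    effective_to: Optional[date]   # None means still in force
    embedding: List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks: Sequence[Chunk], query_vec: Sequence[float], *,
                    tenant_id: str, jurisdiction: str, as_of: date,
                    k: int = 5) -> List[Chunk]:
    """Apply hard metadata filters first, then rank only the survivors."""
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing an unscoped search")
    eligible = [
        c for c in chunks
        if c.tenant_id == tenant_id
        and c.jurisdiction == jurisdiction
        and c.effective_from <= as_of
        and (c.effective_to is None or as_of <= c.effective_to)
    ]
    eligible.sort(key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return eligible[:k]
```

Because `tenant_id` is a required keyword argument supplied by the calling code, a forgotten filter fails with an error instead of silently searching every tenant.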

3. Hybrid retrieval: dense + sparse

Dense embeddings excel at paraphrase and conceptual match. Sparse BM25-style retrieval catches exact SKUs, statute numbers, and rare tokens that embeddings smooth away. Production systems fuse the two ranked lists with reciprocal rank fusion (RRF) or learned weights. Start with RRF: it works on ranks rather than raw scores, so it stays robust when you lack labeled click data. Tune chunk boundaries as part of your chunking strategy; bad splits hurt both the dense and sparse paths.
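RRF gives each document a score of 1 / (k + rank) in every list it appears in and sums those scores across lists. The sketch below assumes each retriever returns an ordered list of document IDs; the function name is illustrative, and k = 60 is the commonly used default constant, not a tuned value.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; rank 1 is the best hit in each list."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]    # from the embedding index
sparse_hits = ["c", "a", "d"]   # from the BM25 index
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["a", "c", "b", "d"]
```

Documents found by both legs ("a" and "c") rise above those found by only one, without any need to normalize BM25 scores against cosine similarities.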

4. Reranking

Retrieve the top 50–100 candidates cheaply, then rescore that pool with a cross-encoder or lightweight reranker model and keep roughly the top 10. This step is where Harbor Legal recovered precision on near-duplicate policy sections. Budget rerank latency separately from first-token latency: users tolerate a short “searching policy…” beat if streaming starts immediately after.
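The two-stage shape can be sketched with a pluggable pair scorer. In the sketch below, `rerank` and its parameters are hypothetical, and `token_overlap` is a toy stand-in used only so the example runs without a model; it is not a cross-encoder. In practice `score_pair` would wrap a reranker model's scoring call.

```python
from typing import Callable, List, Sequence


def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float],
           pool_size: int = 80, keep: int = 8) -> List[str]:
    """Score each (query, passage) pair in the candidate pool and keep the best."""
    pool = list(candidates[:pool_size])
    scored = [(score_pair(query, text), idx, text) for idx, text in enumerate(pool)]
    # Sort by score descending; the original retrieval position breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]


def token_overlap(query: str, passage: str) -> float:
    """Toy scorer (fraction of query tokens present). NOT a cross-encoder."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "vendor contract termination clause",
    "parental leave policy ohio 2026",
    "parental leave policy texas 2022",
]
top = rerank("parental leave ohio 2026", candidates, token_overlap, keep=2)
```

The defaults of 80 and 8 echo the pool and cutoff sizes described for Harbor Legal's v2 pipeline; both are knobs to tune against rerank latency.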

5. Context assembly and citation contract

Pack reranked chunks into a structured envelope: source ID, title, effective date, and verbatim span. Instruct the model to cite source IDs in answers and to abstain when evidence is insufficient. Pair with output validation that rejects answers referencing unknown citation keys. Keep total retrieved text inside the run’s token allocation — retrieval is a consumer of the same budget as tool results.
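A sketch of the envelope and the orphan-citation check follows. The `Evidence` fields match the envelope described above, but the exact tag format accepted by `CITATION`, the four-characters-per-token estimate, and the function names are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Evidence:
    source_id: str
    title: str
    effective_date: str
    span: str   # verbatim text from the source


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); use a real tokenizer in production.
    return max(1, len(text) // 4)


def assemble_context(evidence: Sequence[Evidence], budget_tokens: int) -> Tuple[str, List[str]]:
    """Pack evidence in rerank order until the retrieval token budget is spent."""
    blocks: List[str] = []
    ids: List[str] = []
    used = 0
    for ev in evidence:
        block = f"[source:{ev.source_id}] {ev.title} (effective {ev.effective_date})\n{ev.span}"
        cost = estimate_tokens(block)
        if used + cost > budget_tokens:
            break   # stop at the first block that does not fit; order is preserved
        blocks.append(block)
        ids.append(ev.source_id)
        used += cost
    return "\n\n".join(blocks), ids


# Assumed ID alphabet: letters, digits, underscore, dot, hyphen.
CITATION = re.compile(r"\[source:([A-Za-z0-9_.\-]+)\]")


def orphan_citations(answer: str, allowed_ids: Sequence[str]) -> List[str]:
    """Return cited source IDs that were never supplied to the model."""
    return sorted(set(CITATION.findall(answer)) - set(allowed_ids))
```

A guardrail can reject any answer for which `orphan_citations` returns a non-empty list, and separately decide how to treat answers that cite nothing at all.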

6. Refresh, staleness, and index versioning

Documents change. Tag chunks with index_version and content_hash. When a source updates, invalidate affected chunks before the agent cites revoked policy. Async ingestion pipelines should not block live queries — serve stale-with-warning or exclude in-flight namespaces until re-embed completes.
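Invalidation by content hash can be sketched with an in-memory index. `ChunkIndex`, its method names, and the `doc_id#position` chunk ID scheme are hypothetical; the point is only the ordering, where stale chunks are dropped before new ones become visible.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ChunkIndex:
    """Toy chunk store that tags every record with index_version and content_hash."""

    def __init__(self, index_version: str) -> None:
        self.index_version = index_version
        self._chunks: Dict[str, dict] = {}

    def upsert_document(self, doc_id: str, chunk_texts: List[str]) -> List[str]:
        """Replace a document's chunks; return the chunk IDs that were invalidated."""
        incoming = {f"{doc_id}#{i}": text for i, text in enumerate(chunk_texts)}
        stale = [
            cid for cid, rec in self._chunks.items()
            if rec["doc_id"] == doc_id
            and (cid not in incoming or content_hash(incoming[cid]) != rec["content_hash"])
        ]
        for cid in stale:            # invalidate first, so revoked text is never served
            del self._chunks[cid]
        for cid, text in incoming.items():
            if cid not in self._chunks:   # unchanged chunks are left as they are
                self._chunks[cid] = {
                    "doc_id": doc_id,
                    "text": text,
                    "content_hash": content_hash(text),
                    "index_version": self.index_version,
                }
        return sorted(stale)

    def live_texts(self, doc_id: str) -> List[str]:
        items = sorted(self._chunks.items(), key=lambda kv: kv[0])
        return [rec["text"] for _, rec in items if rec["doc_id"] == doc_id]
```

Unchanged chunks keep their existing records, so only the edited sections of a document would need re-embedding.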

RAG as an agent tool vs middleware injection

Two integration patterns dominate:

| Pattern | How it works | Best for |
| --- | --- | --- |
| Middleware pre-fill | Pipeline runs before every model call; evidence appended to system or tool context automatically. | Support bots, policy Q&A, always-on knowledge bases. |
| Explicit retrieve tool | Model calls search_knowledge_base(query) when it chooses. | Multi-domain agents, optional retrieval, cost-sensitive workloads. |
| Hybrid | Middleware retrieves session-critical facts; tool available for deep dives. | Long-running ops agents, research assistants. |

Explicit tools reduce wasted retrieval but depend on the model calling them. Middleware guarantees coverage but can over-fetch. Harbor Legal uses middleware for jurisdiction-scoped policy baselines and a tool for ad-hoc contract clause lookup.
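For the explicit-tool pattern, a sketch of the tool definition and its dispatcher is shown below. Only the tool name `search_knowledge_base` and its `query` argument come from the table above; the JSON-Schema-style definition, `dispatch_tool_call`, and the `search_fn` callback are assumptions, and the exact tool format depends on the model API in use.

```python
import json
from typing import Callable, List

# Generic JSON-Schema-style tool definition; adapt to your model provider's format.
SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": "Search internal policy documents and return citeable passages.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def dispatch_tool_call(name: str, arguments_json: str,
                       search_fn: Callable[[str, str], List[str]],
                       tenant_id: str) -> List[str]:
    """Validate a model-issued tool call and run it with session-scoped filters."""
    if name != SEARCH_TOOL["name"]:
        raise ValueError(f"unknown tool: {name!r}")
    args = json.loads(arguments_json)
    query = args.get("query") if isinstance(args, dict) else None
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    # tenant_id comes from the authenticated session, never from model arguments.
    return search_fn(query.strip(), tenant_id)
```

Keeping the tenant out of the tool's parameters means the model cannot be prompted into searching another tenant's namespace.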

Harbor Legal refactor: from stuffing to staged RAG

The broken v1 pipeline: embed user message → top-3 whole PDFs → paste into system prompt. Problems stacked quickly:

  • PDFs averaged 180k tokens each — truncation dropped the relevant section.
  • No effective-date filter — superseded 2022 leave policy appeared beside 2026 rules.
  • No citation keys — auditors could not trace answers to sources.
  • Single dense index — exact regulation numbers missed.

The v2 pipeline: query rewrite → metadata filter (state + effective_date) → hybrid retrieve k=80 → cross-encoder rerank to k=8 → structured citation block → model answer with mandatory [source:…] tags → guardrail rejects orphan citations. Unsupported claims dropped from 52% to 7%, and faithfulness on Harbor Legal’s golden set rose from 61% to 94%, as measured with RAG evaluation metrics.

Decision table: RAG pipeline vs alternatives

| Approach | Strength | Weakness | Use when |
| --- | --- | --- | --- |
| Full prompt stuffing | Simple, no infra | Token blow-up, stale data, poor precision | Prototype only, tiny corpora |
| Staged RAG pipeline | Scales to large corpora, citeable, tunable | Index ops, eval harness needed | Production agents over changing knowledge |
| Fine-tuned parametric memory | Low latency at inference | Hard to update, opaque, costly retraining | Stable style/format, not factual corpora |
| Tool-only live APIs | Always fresh for connected systems | No offline docs, rate limits, latency | CRM, tickets, code repos with APIs |

Common pitfalls

  • Retrieve-then-ignore — model answers from parametric memory; enforce citation or abstain prompts plus validation.
  • Chunk overlap too low — sentences split across boundaries; tune overlap and parent-child indexes.
  • No sparse leg — exact identifiers never rank; add BM25 or SPLADE.
  • Reranker skipped — embedding top-5 looks good in demos, fails on near-duplicates at scale.
  • Cross-tenant leakage — filter bug exposes another customer’s chunks; test with adversarial tenant IDs.
  • PII in retrieved spans — run PII redaction before logging traces or echoing to users.
  • Unbounded mid-loop retrieval — agent loops retrieve every turn until budget exhaustion; cap calls per run.
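The last pitfall, unbounded mid-loop retrieval, can be guarded with a per-run budget object that the agent loop consults before each search. `RetrievalBudget` and its limits are hypothetical names and values for illustration.

```python
class RetrievalBudget:
    """Caps both the number of retrieval calls and the retrieved tokens per run."""

    def __init__(self, max_calls: int, max_tokens: int) -> None:
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens_used = 0

    def try_spend(self, tokens: int) -> bool:
        """Record one retrieval if it fits; return False when the run is over budget."""
        if self.calls >= self.max_calls:
            return False
        if self.tokens_used + tokens > self.max_tokens:
            return False
        self.calls += 1
        self.tokens_used += tokens
        return True


budget = RetrievalBudget(max_calls=3, max_tokens=1000)
results = [budget.try_spend(400), budget.try_spend(400), budget.try_spend(400)]
```

When `try_spend` returns False, the loop can answer from the evidence already gathered or abstain, instead of retrieving again.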

Production checklist

  • Define retrieval triggers: when to search vs skip.
  • Implement query rewriting with pronoun resolution and acronym expansion.
  • Enforce metadata filters (tenant, date, jurisdiction) before vector search.
  • Deploy hybrid dense+sparse with fused ranking (start with RRF).
  • Add cross-encoder reranking on top-k candidate pool.
  • Structure context blocks with source IDs and effective dates.
  • Require citations or explicit abstention in model contract.
  • Validate outputs against known citation keys in guardrails.
  • Version indexes; invalidate stale chunks on document update.
  • Measure Precision@k, faithfulness, and latency per pipeline stage.
  • Load-test retrieval under tenant fan-out and embed rate limits.
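For the measurement item, Precision@k is the fraction of the top k retrieved chunks that a labeled golden set marks as relevant. The sketch below assumes chunk IDs as strings and a set of relevant IDs per query; the function name and sample IDs are illustrative.

```python
from typing import Iterable, Sequence


def precision_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant_set)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k


retrieved_ids = ["a", "b", "c", "d"]
golden = {"a", "c"}
p_at_2 = precision_at_k(retrieved_ids, golden, k=2)   # 0.5
p_at_4 = precision_at_k(retrieved_ids, golden, k=4)   # 0.5
```

Computing this separately on the hybrid output and on the reranked output shows how much precision the reranker recovers.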

Key takeaways

  • Agent RAG is multi-stage — trigger, search, rerank, inject, verify.
  • Hybrid retrieval beats dense-only for exact tokens and rare entities.
  • Citation contracts turn retrieval from a hint into auditable evidence.
  • Tenant filters belong in the index, not the prompt.
  • Harbor Legal cut unsupported claims 52% → 7% with staged hybrid RAG and reranking.
