LLM agent hallucination detection and grounding verification systems explained
Harbor Legal deployed a contract-review agent for in-house counsel. The RAG pipeline retrieved relevant clauses with high recall — section numbers, defined terms, and indemnity caps usually appeared in the context window. Yet 41% of final answers contained at least one factual claim not supported by retrieved evidence: invented limitation periods, wrong notice windows, and paraphrased obligations that reversed who bore risk. Associates trusted the fluent prose; two draft client memos shipped unsupported statements before a partner caught the pattern. Root cause was not retrieval failure — the model synthesized beyond what chunks authorized, and nothing verified claim-by-claim before delivery.
Hallucination detection and grounding verification systems decompose model outputs into checkable atomic claims, match each claim to evidence spans (retrieved chunks, tool results, or structured records), score support strength, and gate delivery: answer, hedge, strip unsupported sentences, or escalate. Harbor Legal's post-generation verifier cut unsupported-claim rate from 41% to 3.1% while preserving useful synthesis on supported facts. This guide explains claim extraction, evidence matching, citation integrity, confidence thresholds, integration with output guardrails and reflection loops, the Harbor Legal refactor, a decision table, pitfalls, and a production checklist.
What hallucination means in agent systems
In production agents, a hallucination is not merely “the model was wrong.” It is a user-visible statement presented as fact that lacks adequate support from authorized evidence at generation time. Categories that matter for verification design:
- Unsupported factual claims — dates, amounts, entity names, policy limits not in context or tool output.
- Over-extrapolation — reasonable-sounding inference that exceeds what sources permit (legal “therefore” without basis).
- Citation fabrication — bracketed references, footnotes, or URLs that do not map to real chunks.
- Tool result misreporting — summarizing an API response incorrectly while citing the tool call.
- Stale knowledge bleed — parametric priors overriding fresher retrieved text without disclosure.
Hallucination detection differs from schema guardrails (syntax and policy) and from PII redaction (privacy). Grounding verification asks: is this specific sentence justified by evidence we are allowed to show the user?
The grounding verification pipeline
Mature systems run verification as a structured pipeline after draft generation (and optionally between tool steps):
- Evidence bundle assembly — normalize retrieved chunks, tool JSON, and database rows into a single indexed evidence set with stable chunk ids.
- Claim extraction — split the draft answer into atomic, verifiable statements (one fact per claim; split compound sentences).
- Evidence retrieval per claim — dense + lexical search over the evidence bundle; optional cross-encoder rerank.
- Entailment scoring — NLI model or LLM judge: does span support, contradict, or not address the claim?
- Aggregation and gating — per-answer support score; apply tenant policy (block, hedge, strip, escalate).
- User-facing packaging — inline citations, confidence badges, or “insufficient evidence” fallbacks.
def answer_with_verification(agent, context, evidence_bundle, tenant):
    draft = agent.generate(context, evidence_bundle)
    claims = extract_claims(draft)  # atomic factual units
    for claim in claims:
        spans = retrieve_spans(claim, evidence_bundle, top_k=5)
        verdict = entailment_judge(claim, spans)
        claim.label = verdict.label  # assumed field: SUPPORT | CONTRADICT | NEUTRAL
        claim.support_score = verdict.confidence
        claim.citation_ids = verdict.matched_chunk_ids
    # profiles: STRICT_BLOCK | STRIP_UNSUPPORTED | HEDGE | HITL_ESCALATE | AUDIT_ONLY
    answer_policy = gate(draft, claims, tenant.grounding_profile)
    # claims is a plain list, so pass it directly as the audit trail
    return package(answer_policy.text, audit_trail=claims)
Log every claim verdict on the run trace. Without per-claim audit rows, you cannot explain why an answer was blocked or which chunk failed to match during incident review.
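A minimal sketch of what a per-claim audit row could look like, assuming a JSON-lines trace store. The `ClaimVerdict` fields and the `audit_rows` helper are hypothetical names chosen for illustration, not part of any particular tracing library.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ClaimVerdict:
    # Hypothetical audit schema; adapt field names to your own trace store.
    run_id: str
    claim_text: str
    label: str                      # "SUPPORT" | "CONTRADICT" | "NEUTRAL"
    support_score: float
    citation_ids: list = field(default_factory=list)

def audit_rows(verdicts):
    """Serialize each verdict as one JSON line so it can be attached to the run trace."""
    return [json.dumps(asdict(v), sort_keys=True) for v in verdicts]

rows = audit_rows([
    ClaimVerdict("run-1", "Notice period is 30 days.", "SUPPORT", 0.93, ["c12"]),
    ClaimVerdict("run-1", "Limitation period is 2 years.", "NEUTRAL", 0.21),
])
```

One row per claim, rather than one row per answer, is what lets an incident reviewer see which sentence failed and which chunk ids (if any) were matched.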
Claim extraction: making answers checkable
Verification quality depends on claim granularity. Rules Harbor Legal adopted:
Atomic claims only
“The agreement limits liability to $2M and requires 30-day notice” becomes two claims. Compound claims hide partial hallucination — one half may be grounded while the other is invented.
Exclude non-verifiable rhetoric
Greetings, empathy, and procedural instructions skip verification.
Tag claim types: VERIFIABLE_FACT, OPINION, PROCEDURAL, META (e.g. “I searched the contract”).
Normalize entities and numbers
Map “thirty days” and “30-day” to the same token; resolve defined terms to canonical clause references before matching.
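As a rough illustration of number normalization, the sketch below maps spelled-out and hyphenated durations to one canonical token before matching. The tiny `WORD_NUMBERS` vocabulary and the `30_day` token format are assumptions for the example; a production normalizer would need a fuller number parser and defined-term resolution.

```python
import re

# Illustrative vocabulary only; real contracts need a complete number-word parser.
WORD_NUMBERS = {"ten": 10, "fifteen": 15, "thirty": 30, "sixty": 60, "ninety": 90}
UNIT_PATTERN = r"(day|month|year)s?"

def normalize_durations(text):
    """Rewrite 'thirty days', '30-day', and '30 days' as the same token, e.g. '30_day'."""
    lowered = text.lower()
    for word, value in WORD_NUMBERS.items():
        lowered = re.sub(rf"\b{word}\b", str(value), lowered)
    # Collapse "<number><space or hyphen><unit>" into "<number>_<unit>".
    return re.sub(rf"\b(\d+)[\s-]+{UNIT_PATTERN}\b", r"\1_\2", lowered)
```

Applying the same function to both the claim and the evidence span means a lexical matcher sees identical tokens regardless of how the drafter wrote the number.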
Tool-output claims are first-class
When the agent says “the CRM shows status: shipped,” verify against the actual tool response blob, not against RAG chunks alone. Misreporting tool results was 18% of Harbor's false positives before they indexed tool JSON in the evidence bundle.
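One way to make tool output checkable is to flatten the raw response into path/value pairs and compare the claimed field against it. This is a sketch under the assumption that the claim extractor can emit a field path and a claimed value; `flatten` and `tool_claim_supported` are hypothetical helpers.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into {'order.status': 'shipped', ...} evidence entries."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items

def tool_claim_supported(field_path, claimed_value, tool_response_json):
    """True only if the raw tool response contains the field with the claimed value."""
    evidence = flatten(json.loads(tool_response_json))
    if field_path not in evidence:
        return False
    return str(evidence[field_path]).lower() == str(claimed_value).lower()

response = '{"order": {"status": "shipped", "items": [{"sku": "A1"}]}}'
```

The check runs against the stored response blob, not against the model's summary of it, so a misreported field fails even when the tool call itself is cited correctly.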
Evidence matching and entailment judges
Retrieval for verification differs from user-facing RAG: you need high recall on short claims against a known evidence set, not broad question answering. Common judge stack:
| Layer | Role | Trade-off |
|---|---|---|
| Lexical overlap (BM25) | Catch exact dates, amounts, names | Misses paraphrase |
| Embedding similarity | Paraphrase-tolerant span retrieval | False support on topical but non-entailing text |
| Cross-encoder rerank | Precision on top candidates | Latency per claim |
| NLI classifier | Support / contradict / neutral | Domain shift on legal/medical text |
| LLM-as-judge (structured) | Complex multi-hop checks | Cost; judge hallucination risk |
Production systems often use a cascade: cheap lexical + embedding filter, then NLI only on borderline claims (support score 0.4–0.7). Contradiction verdicts trigger hard blocks or immediate human escalation — delivering a confidently wrong answer is worse than refusing.
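The cascade can be sketched as follows, with a crude token-overlap score standing in for the lexical and embedding filter and the NLI judge passed in as a callable. The function names and the overlap heuristic are assumptions for illustration; the 0.4 and 0.7 bounds mirror the borderline band mentioned above. Note that this simplified version would not catch a contradiction that shares most of its tokens with the evidence, which is one reason real systems calibrate the band carefully.

```python
def lexical_overlap(claim, span):
    """Fraction of claim tokens that also occur in the span (stand-in for BM25/embeddings)."""
    claim_tokens = set(claim.lower().split())
    span_tokens = set(span.lower().split())
    return len(claim_tokens & span_tokens) / len(claim_tokens) if claim_tokens else 0.0

def cascade_verdict(claim, spans, nli_judge, low=0.4, high=0.7):
    """Return (label, score); call the expensive judge only for borderline scores."""
    cheap = max((lexical_overlap(claim, s) for s in spans), default=0.0)
    if cheap >= high:
        return "SUPPORT", cheap
    if cheap < low:
        return "NEUTRAL", cheap
    return nli_judge(claim, spans)   # borderline: defer to NLI or LLM judge

judge_calls = []

def stub_judge(claim, spans):
    # Placeholder judge for the example; a real one would run an NLI model.
    judge_calls.append(claim)
    return "CONTRADICT", 0.9

spans = ["notice must be given within 30 days"]
```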
Calibrate thresholds per domain on a labeled golden set. Harbor Legal maintained 420 claim–evidence pairs reviewed by attorneys; weekly regression runs caught judge drift when embedding models changed.
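A simple way to calibrate is to pick the lowest threshold that still meets a target precision on the labeled pairs. The sketch below assumes each golden item is a `(judge_score, is_supported)` pair; `pick_threshold` and the 0.95 target are illustrative choices, and the sample data is invented for the example rather than taken from Harbor Legal's set.

```python
def pick_threshold(golden, min_precision=0.95):
    """Lowest score threshold whose accepted claims meet min_precision, else None.

    golden: list of (judge_score, is_supported) pairs labeled by reviewers.
    """
    for tau in sorted({score for score, _ in golden}):
        accepted = [label for score, label in golden if score >= tau]
        precision = sum(accepted) / len(accepted)  # accepted is never empty here
        if precision >= min_precision:
            return tau
    return None

# Invented example data: (judge score, reviewer says the evidence supports the claim)
golden = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False)]
```

Rerunning the same function after a judge or embedding model change gives a quick regression signal: if the chosen threshold moves sharply or becomes `None`, the judge has drifted.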
Citation integrity and user trust
Citations are not decoration — they are the user's audit trail. Integrity checks:
- Bijection — every inline citation id maps to exactly one evidence chunk shown or expandable on click.
- No orphan citations — ids in brackets must not reference chunks removed by post-processing.
- Span alignment — highlighted excerpt must entail the sentence it supports (not merely share keywords).
- Version pinning — cite doc_rev and chunk offset so contract amendments do not silently invalidate old answers.
When support is partial, prefer honest hedging: “Section 8.2 addresses notice periods; the agreement does not specify a limitation period in the retrieved excerpts” over silent invention. Users forgive gaps; they do not forgive confident fabrication.
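The orphan and version-pinning checks listed above lend themselves to a small linter. This sketch assumes inline citations look like `[c12]` and that each evidence record carries a `doc_rev` field; both the format and the `lint_citations` helper are assumptions for illustration. Span alignment is not covered here because it requires an entailment judge.

```python
import re

CITATION_PATTERN = re.compile(r"\[(c\d+)\]")  # assumed inline format, e.g. "[c12]"

def lint_citations(answer, evidence):
    """Report cited ids with no evidence record, and cited records lacking a doc_rev.

    evidence: {citation_id: {"doc_rev": str, "offset": int, "text": str}}
    """
    cited = set(CITATION_PATTERN.findall(answer))
    orphans = sorted(cid for cid in cited if cid not in evidence)
    unpinned = sorted(
        cid for cid in cited
        if cid in evidence and not evidence[cid].get("doc_rev")
    )
    return {"orphans": orphans, "unpinned": unpinned}

evidence = {
    "c1": {"doc_rev": "r7", "offset": 120, "text": "Notice must be given within 30 days."},
    "c2": {"doc_rev": "", "offset": 480, "text": "Liability is capped at $2M."},
}
answer = "Notice is due within 30 days [c1]. Liability is capped at $2M [c2]. Term is five years [c9]."
```

A non-empty `orphans` list is the condition that would block delivery under a strict profile; `unpinned` entries are a softer warning that an amendment could invalidate the citation later.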
Confidence gating and delivery policies
Tenant-configurable grounding profiles map aggregate scores to behavior:
| Profile | Trigger | User experience |
|---|---|---|
| STRICT_BLOCK | Any VERIFIABLE_FACT below threshold | Refuse; offer to search more sources |
| STRIP_UNSUPPORTED | Per-sentence support < τ | Deliver only supported sentences + citation list |
| HEDGE | Borderline support | Prefix with uncertainty; require user acknowledgment |
| HITL_ESCALATE | Contradiction or high-stakes low support | Queue for human review before send |
| AUDIT_ONLY | Log failures; deliver full draft | Internal QA / low-risk channels only |
Pair gating with reflection loops: on STRIP_UNSUPPORTED or HEDGE, optionally regenerate once with explicit “unsupported claims removed” feedback, but cap retries. Regeneration without new evidence often hallucinates again in a different way.
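The profile table can be read as a small dispatch function. The sketch below is one possible interpretation, not a reference implementation: the `Claim` fields, the action names, the hedge wording, and the default threshold are all assumptions, and it treats any contradiction as an escalation in every profile except AUDIT_ONLY.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    sentence: str
    label: str            # "SUPPORT" | "CONTRADICT" | "NEUTRAL"
    support_score: float

HEDGE_PREFIX = "Some statements below are only partially supported by the retrieved sources. "

def gate(claims, profile, tau=0.7):
    """Return (action, text) for a list of judged claims under a grounding profile."""
    full_text = " ".join(c.sentence for c in claims)
    if profile == "AUDIT_ONLY":
        return "DELIVER", full_text            # failures are logged elsewhere
    if any(c.label == "CONTRADICT" for c in claims):
        return "ESCALATE", ""                  # never deliver against conflicting evidence
    weak = [c for c in claims if c.label != "SUPPORT" or c.support_score < tau]
    if not weak:
        return "DELIVER", full_text
    if profile == "STRICT_BLOCK":
        return "REFUSE", ""
    if profile == "STRIP_UNSUPPORTED":
        kept = [c.sentence for c in claims if c not in weak]
        return "DELIVER", " ".join(kept)
    if profile == "HEDGE":
        return "DELIVER", HEDGE_PREFIX + full_text
    return "ESCALATE", ""                      # HITL_ESCALATE and unknown profiles fail closed

claims = [Claim("A.", "SUPPORT", 0.9), Claim("B.", "NEUTRAL", 0.3)]
```

Failing closed on unknown profile names is a deliberate choice in this sketch: a misconfigured tenant routes to human review instead of silently delivering.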
Where verification runs in the agent loop
Three placement options, often combined:
Post-answer (default)
Verify the final user-facing message before send. Lowest integration cost; cannot prevent a bad tool write unless paired with pre-tool guardrails.
Pre-tool side effects
Before write_crm or send_email, verify claims in the proposed payload against evidence. Essential for agents that act, not only chat.
Per retrieval step
After RAG fetch, verify the draft plan (“I will cite sections 4 and 9”) against returned chunks before generation. Catches retrieval overconfidence early.
Store verification artifacts on the run record for offline eval and compliance replay. Metrics to dashboard: unsupported-claim rate, contradiction rate, strip rate, escalation rate, median claims per answer.
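Given stored per-claim labels, the dashboard numbers reduce to a few aggregates. In this sketch the unsupported rate is computed per answer (share of answers with at least one claim that is not SUPPORT), which is an assumption about how the metric is defined; `grounding_metrics` is a hypothetical helper.

```python
from statistics import median

def grounding_metrics(runs):
    """Aggregate verification labels. runs: one list of per-claim labels per answer."""
    if not runs:
        return {}
    labels = [label for run in runs for label in run]
    flagged = sum(1 for run in runs if any(label != "SUPPORT" for label in run))
    return {
        # Share of answers containing at least one claim that is not SUPPORT.
        "unsupported_answer_rate": flagged / len(runs),
        # Share of all claims judged CONTRADICT.
        "contradiction_rate": labels.count("CONTRADICT") / len(labels) if labels else 0.0,
        "median_claims_per_answer": median(len(run) for run in runs),
    }

runs = [
    ["SUPPORT", "SUPPORT"],
    ["SUPPORT", "NEUTRAL", "CONTRADICT"],
    ["SUPPORT"],
    ["NEUTRAL"],
]
metrics = grounding_metrics(runs)
```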
Harbor Legal refactor
Root causes beyond “RAG was enabled”:
- No claim decomposition — reviewers judged whole paragraphs, missing single-sentence inventions.
- Evidence bundle incomplete — tool JSON and prior turn context excluded from matching.
- Citation theater — model emitted section numbers not present in retrieved chunks.
- Binary pass/fail QA — no continuous unsupported-claim metric in production.
Shipped fixes:
- Post-generation claim extractor + NLI cascade with attorney-calibrated thresholds.
- Unified evidence index: RAG chunks + tool responses + structured clause table.
- Citation bijection linter; orphan ids block delivery in STRICT profiles.
- Weekly golden-set regression in CI; unsupported-claim SLO on live traffic sample.
- Associate UI: click claim to see support spans and judge score.
Unsupported-claim rate fell from 41% to 3.1% over one quarter. Escalation volume rose 12% (expected — borderline cases routed to humans instead of slipping through). Client memo rework tickets dropped 78%.
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Prompt-only (“cite sources”) | Prototypes, low stakes | Regulated domains, any write side effects |
| RAG without verification | High-recall search assistants | Users treat synthesis as authoritative fact |
| Post-hoc claim + NLI verify | Q&A, summaries, legal/medical review | Ultra-low latency chat (< 200 ms added) |
| Structured output + DB lookup | Enumerated fields (plan IDs, SKUs) | Narrative answers with mixed facts |
| Reflection self-critique only | Catching obvious inconsistencies | Model rationalizes its own hallucinations |
| Human review all answers | Maximum safety, tiny volume | Scale and cost |
Common pitfalls
- Verifying against the draft, not evidence — judge sees the answer and “confirms” itself.
- Chunk boundaries split facts — limitation period spans two chunks; matcher misses both.
- Ignoring contradictions — treating NEUTRAL as pass; delivering when evidence conflicts.
- Citation ids without span checks — users click empty or irrelevant highlights.
- One global threshold — medical dosing needs STRICT_BLOCK; internal brainstorm can use HEDGE.
- No tool-result indexing — agent lies about API fields verification cannot see.
- Endless regenerate loops — each retry invents new unsupported claims.
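For the chunk-boundary pitfall, one possible mitigation (not described in the Harbor Legal refactor) is to let the matcher also see adjacent chunks joined together. The sketch assumes chunks arrive in document order from a single document; `with_adjacent_windows` is a hypothetical helper.

```python
def with_adjacent_windows(chunks):
    """Return each chunk plus every adjacent pair joined, keeping the source chunk ids.

    chunks: ordered list of (chunk_id, text) from the same document.
    """
    windows = [([chunk_id], text) for chunk_id, text in chunks]
    for (id_a, text_a), (id_b, text_b) in zip(chunks, chunks[1:]):
        windows.append(([id_a, id_b], f"{text_a} {text_b}"))
    return windows

chunks = [
    ("c1", "Claims must be brought within"),
    ("c2", "two years of the breach."),
    ("c3", "Notice: 30 days."),
]
windows = with_adjacent_windows(chunks)
```

A claim about the two-year period now has a single window containing the whole fact, and the window still carries both chunk ids so citations stay traceable.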
Production checklist
- Define claim types and atomic extraction rules for your domain.
- Build a unified evidence bundle (RAG + tools + structured records).
- Implement retrieve-then-entail cascade with contradiction handling.
- Calibrate thresholds on a labeled golden claim set; regression in CI.
- Enforce citation bijection and span alignment in user UI.
- Configure per-tenant grounding profiles (block / strip / hedge / escalate).
- Verify before irreversible tool writes, not only before chat send.
- Log per-claim verdicts on traces for audit and offline eval.
- Dashboard unsupported-claim rate and escalation rate weekly.
- Cap reflection retries; fetch more evidence instead of re-guessing.
- Run adversarial red-team prompts (fabrication pressure) quarterly.
Key takeaways
- RAG improves recall; grounding verification enforces that synthesis stays within evidence.
- Atomic claim extraction is the foundation — compound sentences hide partial hallucinations.
- Entailment judges need domain calibration; retrieval similarity alone is not support.
- Citations must map bijectively to checkable spans or user trust collapses.
- Harbor Legal cut unsupported claims from 41% to 3.1% with post-generation verification.