Guide

LLM agent hallucination detection and grounding verification systems explained

Harbor Legal deployed a contract-review agent for in-house counsel. The RAG pipeline retrieved relevant clauses with high recall — section numbers, defined terms, and indemnity caps usually appeared in the context window. Yet 41% of final answers contained at least one factual claim not supported by retrieved evidence: invented limitation periods, wrong notice windows, and paraphrased obligations that reversed who bore risk. Associates trusted the fluent prose; two draft client memos shipped unsupported statements before a partner caught the pattern. Root cause was not retrieval failure — the model synthesized beyond what chunks authorized, and nothing verified claim-by-claim before delivery.

Hallucination detection and grounding verification systems decompose model outputs into checkable atomic claims, match each claim to evidence spans (retrieved chunks, tool results, or structured records), score support strength, and gate delivery: answer, hedge, strip unsupported sentences, or escalate. Harbor Legal's post-generation verifier cut unsupported-claim rate from 41% to 3.1% while preserving useful synthesis on supported facts. This guide explains claim extraction, evidence matching, citation integrity, confidence thresholds, integration with output guardrails and reflection loops, the Harbor Legal refactor, a decision table, pitfalls, and a production checklist.

What hallucination means in agent systems

In production agents, a hallucination is not merely “the model was wrong.” It is a user-visible statement presented as fact that lacks adequate support from authorized evidence at generation time. Categories that matter for verification design:

Unsupported factual claims — dates, amounts, entity names, policy limits not in context or tool output.
Over-extrapolation — reasonable-sounding inference that exceeds what sources permit (legal “therefore” without basis).
Citation fabrication — bracketed references, footnotes, or URLs that do not map to real chunks.
Tool result misreporting — summarizing an API response incorrectly while citing the tool call.
Stale knowledge bleed — parametric priors overriding fresher retrieved text without disclosure.

Hallucination detection differs from schema guardrails (syntax and policy) and from PII redaction (privacy). Grounding verification asks: is this specific sentence justified by evidence we are allowed to show the user?

The grounding verification pipeline

Mature systems run verification as a structured pipeline after draft generation (and optionally between tool steps):

Evidence bundle assembly — normalize retrieved chunks, tool JSON, and database rows into a single indexed evidence set with stable chunk ids.
Claim extraction — split the draft answer into atomic, verifiable statements (one fact per claim; split compound sentences).
Evidence retrieval per claim — dense + lexical search over the evidence bundle; optional cross-encoder rerank.
Entailment scoring — NLI model or LLM judge: does span support, contradict, or not address the claim?
Aggregation and gating — per-answer support score; apply tenant policy (block, hedge, strip, escalate).
User-facing packaging — inline citations, confidence badges, or “insufficient evidence” fallbacks.

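# Sketch only: extract_claims, retrieve_spans, entailment_judge, gate,
# build_audit_trail, and package are placeholders for your own components.
def verify_and_deliver(agent, context, evidence_bundle, tenant):
    draft = agent.generate(context, evidence_bundle)
    claims = extract_claims(draft)  # atomic factual units

    for claim in claims:
        spans = retrieve_spans(claim, evidence_bundle, top_k=5)
        verdict = entailment_judge(claim, spans)  # SUPPORT | CONTRADICT | NEUTRAL
        claim.verdict = verdict.label  # keep the label: a confident CONTRADICT is not support
        claim.support_score = verdict.confidence
        claim.citation_ids = verdict.matched_chunk_ids

    answer_policy = gate(draft, claims, tenant.grounding_profile)
    # profiles: STRICT_BLOCK | STRIP_UNSUPPORTED | HEDGE | HITL_ESCALATE | AUDIT_ONLY
    audit_trail = build_audit_trail(claims)  # one row per claim for the run trace
    return package(answer_policy.text, audit_trail)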

Log every claim verdict on the run trace. Without per-claim audit rows, you cannot explain why an answer was blocked or which chunk failed to match during incident review.
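
One possible shape for those per-claim audit rows is sketched below. The field names (run_id, claim_id, and so on) and the JSON Lines format are assumptions for illustration, not Harbor Legal's actual schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ClaimAuditRow:
    """One verifier decision, written to the run trace (hypothetical schema)."""
    run_id: str
    claim_id: int
    claim_text: str
    verdict: str            # SUPPORT | CONTRADICT | NEUTRAL
    support_score: float
    citation_ids: tuple = ()


def to_trace_lines(rows):
    """Serialize rows as JSON Lines so each verdict is one searchable record."""
    return [json.dumps(asdict(row), sort_keys=True) for row in rows]


# Made-up example rows for a single run.
rows = [
    ClaimAuditRow("run-001", 0, "Liability is capped at $2M.", "SUPPORT", 0.93, ("c12",)),
    ClaimAuditRow("run-001", 1, "Notice must be given within 30 days.", "NEUTRAL", 0.41),
]
for line in to_trace_lines(rows):
    print(line)
```

Keeping one record per claim, rather than one per answer, is what lets an incident reviewer filter for NEUTRAL or CONTRADICT verdicts on a given run.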

Claim extraction: making answers checkable

Verification quality depends on claim granularity. Rules Harbor Legal adopted:

Atomic claims only

“The agreement limits liability to $2M and requires 30-day notice” becomes two claims. Compound claims hide partial hallucination — one half may be grounded while the other is invented.

Exclude non-verifiable rhetoric

Greetings, empathy, and procedural instructions skip verification. Tag claim types: VERIFIABLE_FACT, OPINION, PROCEDURAL, META (e.g. “I searched the contract”).

Normalize entities and numbers

Map “thirty days” and “30-day” to the same token; resolve defined terms to canonical clause references before matching.
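
A minimal sketch of duration normalization, assuming a small hand-written number lexicon and a made-up token format (DUR_30_DAY). A real system would need a fuller number parser and the clause-reference resolution, which is not shown here.

```python
import re

# Small illustrative lexicon (an assumption); extend or replace with a number parser.
_WORD_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "five": 5, "seven": 7, "ten": 10,
    "fourteen": 14, "fifteen": 15, "thirty": 30, "sixty": 60, "ninety": 90,
}
_UNITS = r"(day|week|month|year)"
_PATTERN = re.compile(
    rf"\b(\d+|{'|'.join(_WORD_NUMBERS)})[\s-]+{_UNITS}s?\b", re.IGNORECASE
)


def normalize_durations(text: str) -> str:
    """Rewrite duration mentions as canonical tokens such as DUR_30_DAY."""
    def to_token(match: re.Match) -> str:
        raw = match.group(1).lower()
        value = int(raw) if raw.isdigit() else _WORD_NUMBERS[raw]
        return f"DUR_{value}_{match.group(2).upper()}"

    return _PATTERN.sub(to_token, text)


print(normalize_durations("written notice within thirty days"))
print(normalize_durations("a 30-day notice period"))
```

Applying the same function to both the claim and the evidence span before matching means the lexical layer compares like with like.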

Tool-output claims are first-class

When the agent says “the CRM shows status: shipped,” verify against the actual tool response blob, not against RAG chunks alone. Before Harbor indexed tool JSON in the evidence bundle, misreported tool results accounted for 18% of the unsupported claims that wrongly passed verification.
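
One way to make tool output matchable is to flatten each response into addressable evidence lines. The sketch below assumes JSON-like payloads; the id format (tool:call id:path) and the function name are hypothetical.

```python
def flatten_tool_result(tool_name, call_id, payload):
    """Turn a nested tool response into (evidence_id, text) pairs.

    Each leaf value gets a stable id built from the call id and its path, so a
    claim about one field can cite exactly that field.
    """
    evidence = []

    def walk(node, path):
        if isinstance(node, dict):
            for key in sorted(node):
                walk(node[key], f"{path}.{key}")
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}[{index}]")
        else:
            evidence.append((f"tool:{call_id}:{path}", f"{tool_name} {path} = {node!r}"))

    walk(payload, "$")
    return evidence


# Made-up CRM response: the order is still processing, not shipped.
crm_response = {"order": {"id": "A-17", "status": "processing"}, "items": [{"sku": "X1"}]}
for evidence_id, text in flatten_tool_result("crm.get_order", "call-7", crm_response):
    print(evidence_id, "|", text)
```

With these lines in the evidence bundle, a draft claiming the status is shipped has a concrete span to be checked against, instead of being invisible to the verifier.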

Evidence matching and entailment judges

Retrieval for verification has a different target than user-facing RAG: high recall for short claims against a small, known evidence set, not broad question answering. Common judge stack:

Layer	Role	Trade-off
Lexical overlap (BM25)	Catch exact dates, amounts, names	Misses paraphrase
Embedding similarity	Paraphrase-tolerant span retrieval	False support on topical but non-entailing text
Cross-encoder rerank	Precision on top candidates	Latency per claim
NLI classifier	Support / contradict / neutral	Domain shift on legal/medical text
LLM-as-judge (structured)	Complex multi-hop checks	Cost; judge hallucination risk

Production systems often use a cascade: cheap lexical + embedding filter, then NLI only on borderline claims (support score 0.4–0.7). Contradiction verdicts trigger hard blocks or immediate human escalation — delivering a confidently wrong answer is worse than refusing.
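
The cascade can be sketched as follows. Token overlap stands in for the lexical and embedding layers, and the judge is passed in as a callable; both are simplifying assumptions, as is the stub judge used in the example.

```python
import re
from typing import Callable, NamedTuple

BORDERLINE = (0.4, 0.7)  # band quoted in the text; calibrate per domain


class Verdict(NamedTuple):
    label: str   # SUPPORT | CONTRADICT | NEUTRAL
    score: float
    stage: str   # which layer decided: "cheap" or "judge"


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9$]+", text.lower()))


def cheap_support(claim: str, span: str) -> float:
    """Fraction of claim tokens present in the span (stand-in for BM25 + embeddings)."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(span)) / len(claim_tokens)


def cascade(claim: str, spans: list, judge: Callable[[str, str], Verdict]) -> Verdict:
    """Decide cheaply when the score is clear; call the judge only when borderline."""
    if not spans:
        return Verdict("NEUTRAL", 0.0, "cheap")
    best = max(spans, key=lambda span: cheap_support(claim, span))
    score = cheap_support(claim, best)
    low, high = BORDERLINE
    if score < low:
        return Verdict("NEUTRAL", score, "cheap")
    if score > high:
        return Verdict("SUPPORT", score, "cheap")
    return judge(claim, best)


judge_calls = []


def stub_judge(claim: str, span: str) -> Verdict:
    """Placeholder for an NLI model or LLM judge; always reports a contradiction."""
    judge_calls.append(claim)
    return Verdict("CONTRADICT", 0.9, "judge")


clear = cascade("liability capped at $2m", ["liability is capped at $2m in total"], stub_judge)
borderline = cascade("termination notice is 60 days", ["the notice period is 30 days"], stub_judge)
print(clear.stage, borderline.label)
```

A caveat on this design: the cheap stage cannot see negation or a changed number when most other tokens match, so stricter deployments may prefer to send every claim above the lower bound to the judge and use the cheap score only to discard clearly unrelated spans.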

Calibrate thresholds per domain on a labeled golden set. Harbor Legal maintained 420 claim–evidence pairs reviewed by attorneys; weekly regression runs caught judge drift when embedding models changed.
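
Threshold calibration on a golden set can be as simple as a sweep over judge scores. The sketch below assumes each golden pair reduces to a judge score plus a reviewer label, and that the goal is a minimum precision on SUPPORT verdicts; the target value and the toy data are made up.

```python
def calibrate_threshold(golden, min_precision=0.98):
    """Return the lowest threshold whose SUPPORT precision meets the target.

    golden: iterable of (judge_score, reviewer_says_supported) pairs.
    A claim counts as accepted when its score is >= the threshold.
    Returns None when no threshold reaches the target.
    """
    pairs = sorted(golden, key=lambda pair: pair[0], reverse=True)
    best = None
    accepted = true_positives = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume all pairs tied at this score before measuring precision.
        while i < len(pairs) and pairs[i][0] == score:
            accepted += 1
            true_positives += bool(pairs[i][1])
            i += 1
        if true_positives / accepted >= min_precision:
            best = score
    return best


# Toy golden set, not real data.
golden = [(0.95, True), (0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
print(calibrate_threshold(golden, min_precision=1.0))
```

Rerunning the same sweep in CI after a judge or embedding model change, and failing the build when the returned threshold moves beyond a tolerance, is one way to surface drift.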

Citation integrity and user trust

Citations are not decoration — they are the user's audit trail. Integrity checks:

Bijection — every inline citation id maps to exactly one evidence chunk shown or expandable on click.
No orphan citations — ids in brackets must not reference chunks removed by post-processing.
Span alignment — highlighted excerpt must entail the sentence it supports (not merely share keywords).
Version pinning — cite doc_rev and chunk offset so contract amendments do not silently invalidate old answers.
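
The orphan and version-pinning checks lend themselves to a small linter. This sketch assumes inline citations look like [c12] and that evidence is a dict keyed by chunk id with an optional doc_rev field; both are illustrative conventions. Span alignment needs an entailment judge and is not covered here.

```python
import re

CITATION = re.compile(r"\[(c\d+)\]")  # assumed inline format, e.g. "[c12]"


def lint_citations(answer: str, evidence: dict) -> dict:
    """Report cited ids with no backing chunk, and cited chunks with no revision."""
    cited = CITATION.findall(answer)
    orphans = sorted({cid for cid in cited if cid not in evidence})
    unpinned = sorted(
        {cid for cid in cited if cid in evidence and not evidence[cid].get("doc_rev")}
    )
    return {"orphans": orphans, "unpinned": unpinned, "ok": not orphans and not unpinned}


# Made-up evidence bundle: c14 lacks a revision, c99 does not exist.
evidence = {
    "c12": {"doc_rev": "r3", "offset": 410, "text": "Liability shall not exceed $2,000,000."},
    "c14": {"text": "Notice shall be given within 30 days."},
}
report = lint_citations(
    "Liability is capped at $2M [c12]. Notice is due in 30 days [c14][c99].", evidence
)
print(report)
```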

When support is partial, prefer honest hedging: “Section 8.2 addresses notice periods; the agreement does not specify a limitation period in the retrieved excerpts” over silent invention. Users forgive gaps; they do not forgive confident fabrication.

Confidence gating and delivery policies

Tenant-configurable grounding profiles map aggregate scores to behavior:

Profile	Trigger	User experience
STRICT_BLOCK	Any VERIFIABLE_FACT below threshold	Refuse; offer to search more sources
STRIP_UNSUPPORTED	Per-sentence support below the tenant threshold τ	Deliver only supported sentences + citation list
HEDGE	Borderline support	Prefix with uncertainty; require user acknowledgment
HITL_ESCALATE	Contradiction or high-stakes low support	Queue for human review before send
AUDIT_ONLY	Any verification failure	Log the failure and deliver the full draft; internal QA / low-risk channels only
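
A gate over the profiles in the table might look like the sketch below. The Claim fields, the action names, the default threshold of 0.7, and the choice to escalate on any contradiction are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    kind: str      # VERIFIABLE_FACT | OPINION | PROCEDURAL | META
    verdict: str   # SUPPORT | CONTRADICT | NEUTRAL
    score: float


def gate(claims, profile, tau=0.7):
    """Map claim verdicts to (action, delivered_sentences) for one profile."""
    facts = [c for c in claims if c.kind == "VERIFIABLE_FACT"]
    weak = [c for c in facts if c.verdict != "SUPPORT" or c.score < tau]
    if profile == "AUDIT_ONLY" or not weak:
        return "DELIVER", [c.text for c in claims]
    if any(c.verdict == "CONTRADICT" for c in facts):
        return "ESCALATE", []  # never deliver against conflicting evidence
    if profile == "STRICT_BLOCK":
        return "REFUSE", []
    if profile == "STRIP_UNSUPPORTED":
        return "DELIVER", [c.text for c in claims if c not in weak]
    if profile == "HEDGE":
        return "DELIVER_WITH_HEDGE", [c.text for c in claims]
    return "ESCALATE", []  # HITL_ESCALATE and unknown profiles fail closed


# Made-up claims from one draft answer.
claims = [
    Claim("Liability is capped at $2M.", "VERIFIABLE_FACT", "SUPPORT", 0.93),
    Claim("The limitation period is two years.", "VERIFIABLE_FACT", "NEUTRAL", 0.2),
    Claim("Let me know if you need the full clause.", "PROCEDURAL", "NEUTRAL", 0.0),
]
print(gate(claims, "STRIP_UNSUPPORTED"))
```

Only VERIFIABLE_FACT claims can fail the gate, so procedural or conversational sentences pass through untouched.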

Pair gating with reflection loops: on STRIP_UNSUPPORTED or HEDGE, optionally regenerate once with explicit “unsupported claims removed” feedback, but cap retries. Regeneration without new evidence often re-hallucinates differently.
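
A capped retry loop can be sketched as below. The generate and verify callables are placeholders for the agent and the verifier, and the feedback wording and return shape are assumptions.

```python
def deliver_with_bounded_retry(generate, verify, max_retries=1):
    """Regenerate at most max_retries times, feeding back the removed claims.

    generate(feedback) -> draft text; verify(draft) -> list of unsupported claims.
    Returns (draft, retries_used, action).
    """
    feedback = None
    draft = ""
    for attempt in range(max_retries + 1):
        draft = generate(feedback)
        unsupported = verify(draft)
        if not unsupported:
            return draft, attempt, "DELIVER"
        feedback = "Unsupported claims removed: " + "; ".join(unsupported)
    return draft, max_retries, "ESCALATE"


# Stub agent: the first draft contains an unsupported sentence, the second does not.
drafts = iter(["Cap is $2M. Period is two years.", "Cap is $2M."])
result = deliver_with_bounded_retry(
    generate=lambda feedback: next(drafts),
    verify=lambda draft: ["Period is two years."] if "two years" in draft else [],
)
print(result)
```

When the cap is hit the loop escalates rather than trying again, which matches the advice to fetch more evidence instead of re-guessing.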

Where verification runs in the agent loop

Three placement options, often combined:

Post-answer (default)

Verify the final user-facing message before send. Lowest integration cost; cannot prevent a bad tool write unless paired with pre-tool guardrails.

Pre-tool side effects

Before write_crm or send_email, verify claims in the proposed payload against evidence. Essential for agents that act, not only chat.
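
A minimal guard for side-effecting tools, assuming the claims in the payload have already been extracted and that is_supported wraps whatever verifier is in use. All names here are hypothetical.

```python
class UnverifiedPayload(Exception):
    """Raised when a write is attempted with claims the verifier did not support."""


def guarded_call(tool, payload, claims_in_payload, is_supported):
    """Run tool(payload) only if every claim extracted from the payload is supported."""
    failed = [claim for claim in claims_in_payload if not is_supported(claim)]
    if failed:
        raise UnverifiedPayload(f"{len(failed)} unsupported claim(s): {failed}")
    return tool(payload)


# Stub tool that records what would have been sent.
outbox = []
guarded_call(
    tool=outbox.append,
    payload={"to": "client@example.com", "body": "Liability is capped at $2M."},
    claims_in_payload=["Liability is capped at $2M."],
    is_supported=lambda claim: True,
)
print(len(outbox))
```

Raising instead of silently dropping the call keeps the failure visible on the run trace and lets the orchestrator decide whether to escalate.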

Per retrieval step

After RAG fetch, verify the draft plan (“I will cite sections 4 and 9”) against returned chunks before generation. Catches retrieval overconfidence early.

Store verification artifacts on the run record for offline eval and compliance replay. Metrics to dashboard: unsupported-claim rate, contradiction rate, strip rate, escalation rate, median claims per answer.
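
These dashboard metrics can be computed directly from stored run records. The record shape below is an assumption, and unsupported-claim rate is taken to mean the share of answers with at least one unsupported factual claim, which is how the 41% figure is described earlier.

```python
from statistics import median


def grounding_metrics(runs):
    """Aggregate dashboard metrics from per-run verification records.

    Each run is a dict with "verdicts" (SUPPORT / CONTRADICT / NEUTRAL for its
    verifiable claims) and "action" (e.g. DELIVER, STRIP, HEDGE, ESCALATE, REFUSE).
    """
    total = len(runs)
    if total == 0:
        return {}

    def rate(predicate):
        return sum(1 for run in runs if predicate(run)) / total

    return {
        "unsupported_claim_rate": rate(lambda r: any(v != "SUPPORT" for v in r["verdicts"])),
        "contradiction_rate": rate(lambda r: "CONTRADICT" in r["verdicts"]),
        "strip_rate": rate(lambda r: r["action"] == "STRIP"),
        "escalation_rate": rate(lambda r: r["action"] == "ESCALATE"),
        "median_claims_per_answer": median(len(r["verdicts"]) for r in runs),
    }


# Made-up sample of four runs.
runs = [
    {"verdicts": ["SUPPORT", "SUPPORT"], "action": "DELIVER"},
    {"verdicts": ["SUPPORT", "NEUTRAL", "SUPPORT"], "action": "STRIP"},
    {"verdicts": ["CONTRADICT"], "action": "ESCALATE"},
    {"verdicts": ["SUPPORT"], "action": "DELIVER"},
]
print(grounding_metrics(runs))
```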

Harbor Legal refactor

Root causes beyond “RAG was enabled”:

No claim decomposition — reviewers judged whole paragraphs, missing single-sentence inventions.
Evidence bundle incomplete — tool JSON and prior turn context excluded from matching.
Citation theater — model emitted section numbers not present in retrieved chunks.
Binary pass/fail QA — no continuous unsupported-claim metric in production.

Shipped fixes:

Post-generation claim extractor + NLI cascade with attorney-calibrated thresholds.
Unified evidence index: RAG chunks + tool responses + structured clause table.
Citation bijection linter; orphan ids block delivery under the STRICT_BLOCK profile.
Weekly golden-set regression in CI; unsupported-claim SLO on live traffic sample.
Associate UI: click claim to see support spans and judge score.

Unsupported-claim rate fell from 41% to 3.1% over one quarter. Escalation volume rose 12% (expected — borderline cases routed to humans instead of slipping through). Client memo rework tickets dropped 78%.

Technique decision table

Approach	Best for	Weak when
Prompt-only (“cite sources”)	Prototypes, low stakes	Regulated domains, any write side effects
RAG without verification	High-recall search assistants	Users treat synthesis as authoritative fact
Post-hoc claim + NLI verify	Q&A, summaries, legal/medical review	Ultra-low latency chat (< 200 ms added)
Structured output + DB lookup	Enumerated fields (plan IDs, SKUs)	Narrative answers with mixed facts
Reflection self-critique only	Catching obvious inconsistencies	Model rationalizes its own hallucinations
Human review all answers	Maximum safety, tiny volume	Scale and cost

Common pitfalls

Verifying against the draft, not evidence — judge sees the answer and “confirms” itself.
Chunk boundaries split facts — limitation period spans two chunks; matcher misses both.
Ignoring contradictions — treating NEUTRAL as pass; delivering when evidence conflicts.
Citation ids without span checks — users click empty or irrelevant highlights.
One global threshold — medical dosing needs STRICT_BLOCK; internal brainstorm can use HEDGE.
No tool-result indexing — agent lies about API fields verification cannot see.
Endless regenerate loops — each retry invents new unsupported claims.
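
For the chunk-boundary pitfall above, one mitigation is to match claims against windows of adjacent chunks as well as single chunks. The sketch assumes chunks arrive as (id, text) pairs in document order; the function name and window size are illustrative.

```python
def adjacent_windows(chunks, size=2):
    """Yield (ids, text) for each run of `size` consecutive chunks.

    Matching a claim against these windows lets a fact that straddles a chunk
    boundary be found in a single span, with both chunk ids kept for citation.
    """
    for start in range(max(len(chunks) - size + 1, 0)):
        group = chunks[start:start + size]
        yield tuple(cid for cid, _ in group), " ".join(text for _, text in group)


# Made-up chunks: the limitation period is split across c1 and c2.
chunks = [
    ("c1", "Claims under this agreement must be brought"),
    ("c2", "within two years of the breach."),
    ("c3", "Notices go to the Legal department."),
]
windows = list(adjacent_windows(chunks))
print(windows[0])
```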

Production checklist

Define claim types and atomic extraction rules for your domain.
Build a unified evidence bundle (RAG + tools + structured records).
Implement retrieve-then-entail cascade with contradiction handling.
Calibrate thresholds on a labeled golden claim set; regression in CI.
Enforce citation bijection and span alignment in user UI.
Configure per-tenant grounding profiles (block / strip / hedge / escalate).
Verify before irreversible tool writes, not only before chat send.
Log per-claim verdicts on traces for audit and offline eval.
Dashboard unsupported-claim rate and escalation rate weekly.
Cap reflection retries; fetch more evidence instead of re-guessing.
Run adversarial red-team prompts (fabrication pressure) quarterly.

Key takeaways

RAG improves recall; grounding verification enforces that synthesis stays within evidence.
Atomic claim extraction is the foundation — compound sentences hide partial hallucinations.
Entailment judges need domain calibration; retrieval similarity alone is not support.
Citations must map bijectively to checkable spans or user trust collapses.
Harbor Legal cut unsupported claims from 41% to 3.1% with post-generation verification.