
LLM response grounding and factuality verification explained

Harbor Legal's internal policy Q&A bot retrieved the right employee handbook sections 89% of the time and always attached footnote links. Legal reviewers still flagged 23% of answers for “unsupported assertions” — invented section numbers, dates that never appeared in source text, and paraphrases that reversed obligation direction (“may” became “must”). Prompt instructions to “only use retrieved context” did not move the metric. Offline exact-match benchmarks looked fine because evaluators scored topical relevance, not atomic factuality.

The team added a grounding verification layer between generation and the user: decompose the draft into atomic claims, grade each claim against retrieved passages with natural-language inference (NLI), verify citation span overlap for quoted facts, and run deterministic checks on numbers and dates. The unsupported-assertion rate fell from 23% to 4%. Safe paraphrases that previously failed brittle string matching now pass entailment grading. This guide covers verification architecture, claim decomposition, entailment and overlap scoring, structured field validation, confidence gating and abstention, the Harbor Legal refactor, a decision table comparing verification techniques with prompt-only mitigations and full human review, pitfalls, and a production checklist. It pairs with related material on hallucination causes, chain-of-verification prompting, and NLI faithfulness for RAG.

Grounding vs faithfulness vs factuality

These terms overlap in blog posts but mean different things in production pipelines:

Grounding — every substantive claim in the answer can be traced to an allowed source (retrieved chunk, tool output, or structured database row).
Faithfulness — the answer does not contradict or distort those sources; paraphrase is allowed if meaning is preserved.
Factuality — claims are true in the real world, which verification cannot fully guarantee without external knowledge bases; in RAG products you usually operationalize “factuality” as faithfulness to retrieved evidence plus tool checks where available.

A response can be grounded but unfaithful (“Section 4.2 requires 30 days notice” when the source says 14). It can be faithful to wrong retrieval (garbage in, faithful garbage out). Verification gates address the first failure mode; retrieval quality and citation UX address the second.

Verification pipeline architecture

Treat verification as a post-generation guard, not a hope embedded in the system prompt. A typical production stack:

Generate draft answer with retrieved context and optional citations.
Decompose into atomic claims (one verifiable proposition per unit).
Align each claim to supporting evidence spans (retrieval chunks, tool JSON fields).
Grade with NLI entailment, overlap metrics, or rule engines per claim type.
Gate — block, rewrite, downgrade to excerpt-only, or route to human review based on aggregate score.
Log per-claim scores for online monitoring and regression tests.

Latency budget matters: run fast deterministic checks first (regex on dates, schema validation), then parallelize NLI batches on remaining claims. Cache entailment results keyed by (claim hash, evidence hash) when users ask follow-ups on the same thread.
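The caching idea above can be sketched as follows. This is a minimal illustration, not a reference implementation: `EntailmentCache`, `digest`, and `toy_grader` are hypothetical names, the in-memory dict stands in for whatever store a real deployment uses, and the toy grader is a placeholder for an actual NLI model call.

```python
import hashlib
from typing import Callable, Dict, Tuple


def digest(text: str) -> str:
    """Hash text after collapsing whitespace so trivial spacing changes share a key."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class EntailmentCache:
    """Memoizes NLI labels keyed by (claim hash, evidence hash)."""

    def __init__(self, grade_fn: Callable[[str, str], str]) -> None:
        self._grade_fn = grade_fn
        self._store: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # counts how often the underlying grader was called

    def grade(self, claim: str, evidence: str) -> str:
        key = (digest(claim), digest(evidence))
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._grade_fn(claim, evidence)
        return self._store[key]


def toy_grader(claim: str, evidence: str) -> str:
    # Placeholder for a real NLI model: substring match only (assumption).
    return "entailment" if claim.lower() in evidence.lower() else "neutral"


cache = EntailmentCache(toy_grader)
first = cache.grade("Notice is 14 days", "Notice is 14 days for all staff.")
second = cache.grade("Notice  is 14 days", "Notice is 14 days for all staff.")
```

The second call differs only in spacing, so it is served from the cache and the grader runs once. Hashing both sides keeps keys small even when evidence spans are long.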

Atomic claim decomposition

Whole-answer grading hides localized errors. Decomposition splits “Employees must submit expenses within 30 days and managers approve within 5 business days” into two claims. Use a small model or structured output schema:

One sentence per claim when possible.
Preserve modals (may, must, shall) — they are legal load-bearing.
Flag numerics, dates, named entities, and comparative statements as high-risk claim types.
Skip pure discourse (“In summary”) from grading.

Decomposition quality dominates downstream precision. Audit 200 decomposed answers manually before trusting automated gates in production.
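One way to represent decomposed claims is a small record that carries the modal and risk tags alongside the text. The sketch below assumes the sentence splitting itself has already been done (by a small model or structured output); the `Claim` fields, the modal list, and the regex heuristics are illustrative assumptions, and named-entity tagging is left out because it needs an NER component.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MODALS = ("must", "shall", "may", "should", "can")        # assumed modal list
DISCOURSE = ("in summary", "in conclusion", "overall")    # assumed discourse markers

ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
DIGIT_RE = re.compile(r"\d")
COMPARATIVE_RE = re.compile(
    r"\b(?:more|less|fewer|greater|higher|lower) than\b|\bat (?:least|most)\b",
    re.IGNORECASE,
)


@dataclass
class Claim:
    text: str
    modal: Optional[str]   # first modal verb found, preserved for grading
    risk_tags: List[str]   # high-risk claim types detected by heuristics
    gradable: bool         # False for pure discourse such as "In summary"


def annotate(sentence: str) -> Claim:
    """Attach modal and risk tags to one already-decomposed claim sentence."""
    lowered = sentence.lower().strip()
    words = re.findall(r"[a-z']+", lowered)
    modal = next((w for w in words if w in MODALS), None)
    tags = []
    if ISO_DATE_RE.search(sentence):
        tags.append("date")
    if DIGIT_RE.search(sentence):
        tags.append("numeric")
    if COMPARATIVE_RE.search(sentence):
        tags.append("comparative")
    gradable = lowered.rstrip(".,:;! ") not in DISCOURSE
    return Claim(text=sentence, modal=modal, risk_tags=tags, gradable=gradable)


claim = annotate("Employees must submit expenses within 30 days.")
```

The word-level modal match is a heuristic: a sentence such as "Submit by May 5" would be tagged with the modal "may", which is one reason the manual audit above matters before gating on these fields.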

NLI entailment grading

For each (claim, evidence_span) pair, an NLI classifier labels entailment, neutral, or contradiction. Production policy examples:

Entailment on any aligned span → claim passes.
Contradiction on best-aligned span → claim fails; block or strip.
All neutral → fail closed for regulated domains; warn-only for low-stakes FAQ.

Cross-encoders beat bi-encoders on accuracy but cost more; use a cross-encoder on high-risk claims only. Calibrate thresholds per domain on a labeled dev set; a default softmax cutoff of 0.5 is rarely optimal.
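The policy examples above can be expressed as a small decision function over per-span NLI probabilities. This is a sketch under stated assumptions: `SpanScore` and `decide` are hypothetical names, the thresholds passed in are placeholders to be calibrated on a labeled dev set, and checking contradiction on the best-aligned span before looking for entailment is one possible precedence, not something the policy list prescribes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SpanScore:
    alignment: float          # how well this evidence span aligns to the claim
    probs: Dict[str, float]   # keys: "entailment", "neutral", "contradiction"


def decide(
    scores: List[SpanScore],
    entail_threshold: float,
    contra_threshold: float,
    fail_closed: bool,
) -> str:
    """Return "pass", "fail", or "warn" for one claim."""
    fallback = "fail" if fail_closed else "warn"
    if not scores:
        return fallback  # no aligned evidence at all
    best = max(scores, key=lambda s: s.alignment)
    if best.probs["contradiction"] >= contra_threshold:
        return "fail"  # contradiction on the best-aligned span
    if any(s.probs["entailment"] >= entail_threshold for s in scores):
        return "pass"  # entailment on any aligned span
    return fallback    # all neutral: depends on domain policy


neutral_only = [SpanScore(0.8, {"entailment": 0.2, "neutral": 0.7, "contradiction": 0.1})]
regulated = decide(neutral_only, 0.7, 0.6, fail_closed=True)
low_stakes = decide(neutral_only, 0.7, 0.6, fail_closed=False)
```

Here the same all-neutral evidence fails in a regulated domain and only warns in a low-stakes one, which is the split described in the third policy line.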

Citation span overlap

When the model emits explicit citations, verify the cited span actually contains the fact:

Token F1 overlap between claim entities and cited span.
Numeric extraction match — claim “30 days” must appear in span or be arithmetically derivable.
Quote integrity — quoted strings must be substrings of source after normalization.

Overlap alone is brittle for paraphrase; pair it with NLI. Overlap catches fake citations that NLI sometimes accepts when evidence is vaguely related.
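A rough sketch of the three overlap checks, using only the standard library. The function names and the normalization choices (lowercasing, whitespace collapsing, comma stripping in numbers) are assumptions for illustration; the numeric check covers literal presence only and does not attempt the "arithmetically derivable" case.

```python
import re
from collections import Counter

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
QUOTE_RE = re.compile(r'["“]([^"”]+)["”]')  # straight or curly double quotes


def _tokens(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(claim: str, span: str) -> float:
    """Token-level F1 between the claim and the cited span."""
    c, s = Counter(_tokens(claim)), Counter(_tokens(span))
    overlap = sum((c & s).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(s.values())
    return 2 * precision * recall / (precision + recall)


def numbers_supported(claim: str, span: str) -> bool:
    """True if every number in the claim literally appears in the span."""
    span_numbers = set(NUMBER_RE.findall(span.replace(",", "")))
    return all(n in span_numbers for n in NUMBER_RE.findall(claim.replace(",", "")))


def quotes_intact(answer: str, source: str) -> bool:
    """True if each quoted string is a substring of the source after normalization."""
    normalize = lambda t: " ".join(t.lower().split())
    src = normalize(source)
    return all(normalize(q) in src for q in QUOTE_RE.findall(answer))


source = "Employees must give 14 days notice before leave."
wrong_number = numbers_supported("The notice period is 30 days.", source)
right_number = numbers_supported("Notice is 14 days.", source)
```

With long cited spans the recall term pulls F1 down, so a real gate might score claim-side precision instead; that choice should be validated on labeled pairs.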

Structured and tool-backed verification

When answers include JSON fields (SQL results, API payloads), run schema validation and cross-field consistency checks before natural-language grading. For math and units, prefer deterministic evaluators over NLI. Output parsing and validation belong in the same layer as faithfulness grading.
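As an illustration of schema validation plus a cross-field consistency check, the sketch below validates a made-up expense payload. The field names (`currency`, `items`, `amount`, `total`) and the allowed currency set are hypothetical; a production system would more likely use a schema library, which is swapped here for plain standard-library checks to keep the example self-contained.

```python
import json
from decimal import Decimal, InvalidOperation
from typing import List

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed enum for illustration


def check_expense_payload(raw: str) -> List[str]:
    """Return a list of validation errors; an empty list means the payload passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["payload must be a JSON object"]

    errors = []
    required = {"currency": str, "items": list, "total": str}
    for name, expected in required.items():
        if not isinstance(data.get(name), expected):
            errors.append(f"{name}: expected {expected.__name__}")
    if errors:
        return errors

    if data["currency"] not in ALLOWED_CURRENCIES:
        errors.append("currency: not an allowed value")
    try:
        # Decimal avoids float rounding when comparing money amounts.
        item_sum = sum(Decimal(item["amount"]) for item in data["items"])
        if item_sum != Decimal(data["total"]):
            errors.append("total does not equal sum of items")
    except (KeyError, TypeError, InvalidOperation):
        errors.append("items: each entry needs a decimal 'amount'")
    return errors


good = '{"currency": "USD", "items": [{"amount": "12.50"}, {"amount": "7.50"}], "total": "20.00"}'
result = check_expense_payload(good)
```

Running checks like this before NLI means a payload with an inconsistent total never reaches the more expensive grading step.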

Confidence gating and abstention policies

Binary pass/fail is too coarse for mixed-quality drafts. Common policies:

Strip-and-ship — remove failing claims; deliver remainder with disclaimer.
Excerpt fallback — replace synthesis with direct quote from highest-scoring retrieval chunk.
Abstain — “I cannot verify this from available policy text” beats a confident wrong answer.
Human queue — borderline scores (entailment 0.45–0.55) route to review within SLA.

Define per-intent policies: HR policy Q&A abstains aggressively; marketing copy may warn-only. Tie gates to human-in-the-loop queues with feedback that improves decomposition prompts and thresholds.
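Per-intent policies can be kept as data rather than scattered conditionals. The sketch below is one possible shape: the intent names, the `IntentPolicy` fields, and the `route` function are hypothetical, and the 0.45 to 0.55 review band is taken from the human-queue example above rather than from any calibrated setting.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class IntentPolicy:
    review_band: Tuple[float, float]  # borderline entailment scores go to humans
    on_fail: str                      # action when the score falls below the band


# Hypothetical policy table: HR abstains aggressively, marketing only warns.
POLICIES: Dict[str, IntentPolicy] = {
    "hr_policy": IntentPolicy(review_band=(0.45, 0.55), on_fail="abstain"),
    "marketing": IntentPolicy(review_band=(0.45, 0.55), on_fail="warn"),
}


def route(intent: str, entailment: float) -> str:
    """Map one claim's entailment score to a gate action for the given intent."""
    policy = POLICIES[intent]
    low, high = policy.review_band
    if low <= entailment <= high:
        return "human_queue"
    if entailment > high:
        return "pass"
    return policy.on_fail


hr_low = route("hr_policy", 0.20)
marketing_low = route("marketing", 0.20)
```

The same low score abstains for HR policy questions and only warns for marketing copy. Keeping the table as data also makes it straightforward to adjust bands from reviewer feedback.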

Harbor Legal refactor: 23% to 4% unsupported assertions

Pre-gate stack: GPT-4 class model, hybrid retrieval over 1,200 handbook chunks, system prompt demanding citations. Problems clustered into three buckets:

Invented references — “Section 7.3.1” with no matching heading in corpus.
Modal drift — optional guidance rendered as mandatory obligation.
Numeric hallucination — deadlines and dollar caps not present in any retrieved span.

The verification layer added:

Claim decomposition via structured JSON (avg 3.2 claims per answer).
NLI cross-encoder graded against top-3 aligned chunks per claim.
Regex + parser verification for dates, currency, and section IDs against retrieval metadata.
Strip-and-ship for failing claims; full abstain if >40% claims fail or any contradiction score > 0.7.
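The aggregate rule in the last item can be sketched as below. The 40% failure ratio and the 0.7 contradiction cutoff come from the description above; the `GradedClaim` structure, the action names, the treatment of an answer with no graded claims as an abstain, and the reuse of the abstain wording from the gating section are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

ABSTAIN_COPY = "I cannot verify this from available policy text."


@dataclass
class GradedClaim:
    text: str
    passed: bool
    contradiction: float  # contradiction score from the NLI grader


def gate(
    claims: List[GradedClaim],
    max_fail_ratio: float = 0.40,
    contradiction_cutoff: float = 0.7,
) -> Dict[str, str]:
    """Strip failing claims, or abstain entirely when the draft is too unreliable."""
    if not claims:
        # Assumption: nothing gradable means nothing verifiable, so abstain.
        return {"action": "abstain", "text": ABSTAIN_COPY}
    failed = [c for c in claims if not c.passed]
    too_many_failures = len(failed) / len(claims) > max_fail_ratio
    hard_contradiction = any(c.contradiction > contradiction_cutoff for c in claims)
    if too_many_failures or hard_contradiction:
        return {"action": "abstain", "text": ABSTAIN_COPY}
    kept = " ".join(c.text for c in claims if c.passed)
    action = "strip_and_ship" if failed else "ship"
    return {"action": action, "text": kept}


draft = [
    GradedClaim("Expenses are due within 30 days.", True, 0.05),
    GradedClaim("Managers approve within 5 business days.", True, 0.02),
    GradedClaim("See Section 7.3.1.", False, 0.30),
]
decision = gate(draft)
```

In this example one of three claims fails (about 33%), so the invented section reference is stripped and the remaining two sentences ship.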

Outcomes after four weeks: unsupported assertions 23% → 4%; abstain rate 8%; median added latency 340 ms (parallel NLI batch); reviewer time per ticket −31%. False abstains on valid paraphrase dropped from 12% to 3% after entailment threshold tuning on 800 labeled pairs.

Technique decision table

Technique	Best when	Risk
Prompt-only (“cite sources”)	Prototypes, low-stakes internal tools	Confident hallucination persists; no measurable gate
Chain-of-verification prompting	Single-turn Q&A without strict latency SLO	Model may verify its own mistakes; not auditable
NLI entailment per claim	RAG over prose documents; paraphrase expected	Threshold tuning; neutral class ambiguity
Citation span overlap rules	Explicit footnotes; legal/medical citation culture	Rejects valid paraphrase if used alone
Deterministic numeric/date parsers	Finance, HR, compliance with structured facts	Misses qualitative obligation language
LLM-as-judge on faithfulness	Rapid iteration before NLI deployment	Judge correlates with generator; position bias
Full human review pre-send	Regulated outbound, low volume	Does not scale; reviewer fatigue
Hybrid: decompose + NLI + parsers + HITL queue	Production enterprise RAG with audit trail	Engineering and labeling investment upfront

Prompting reduces the frequency of unsupported claims; verification measures and blocks what slips through. Use chain-of-verification (CoVe) prompting to improve drafts, then NLI gates to enforce policy.

Common pitfalls

Grading whole answers — one good sentence hides three invented facts.
Evidence misalignment — grading claims against the full retrieval list instead of the span the model actually used.
Neutral treated as pass — vague related text passes; tighten policy for high-risk domains.
Ignoring modals — entailment on “employees submit expenses” misses must vs may inversion.
Stale retrieval IDs — verification passes against chunks that were not in the user-facing citation.
Serial latency bottleneck — running NLI sequentially on 12 claims blows the p95 latency budget; batch the calls and short-circuit on parser failures.
No abstain path — strip logic leaves broken grammar or empty answers without user-facing fallback.
Evaluating on relevance only — offline sets must label atomic support, not topic match.
Generator-judge coupling — same model grades itself without blinded evidence presentation.

Production checklist

Define grounding policy per product surface (pass, strip, abstain, HITL).
Build labeled dev set with atomic claim support labels (500+ examples).
Implement claim decomposition with structured output and manual QA sample.
Align each claim to evidence spans (citation pointers or retrieval alignment).
Run deterministic parsers on numbers, dates, IDs, and enum fields first.
Batch NLI entailment with calibrated thresholds per claim risk tier.
Add citation span overlap checks for explicit footnote workflows.
Define aggregate gate: max failure rate, any contradiction, borderline HITL band.
Log per-claim scores, evidence IDs, and gate decision for audit.
Monitor unsupported-assertion rate and abstain rate in online evaluation.
Regression test gate on every prompt, retrieval, or model change.
Document abstain copy and excerpt fallback UX for end users.
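The logging and monitoring items in the checklist can be illustrated with a per-claim audit record and a rate computed from it. This is a sketch under assumptions: the record fields are hypothetical, JSON lines are one of several reasonable log formats, and the unsupported-assertion rate is defined here as the share of answers with at least one claim not labeled as entailment, which may differ from how a given team counts it.

```python
import json
from typing import Dict, Iterable, List


def audit_record(
    answer_id: str,
    claim_id: int,
    label: str,
    score: float,
    evidence_ids: List[str],
    decision: str,
) -> str:
    """Serialize one graded claim as a JSON line for the audit log."""
    return json.dumps(
        {
            "answer_id": answer_id,
            "claim_id": claim_id,
            "label": label,            # entailment / neutral / contradiction
            "score": score,
            "evidence_ids": evidence_ids,
            "decision": decision,      # gate decision applied to the answer
        },
        sort_keys=True,
    )


def unsupported_rate(lines: Iterable[str]) -> float:
    """Share of answers containing at least one claim not labeled entailment."""
    has_unsupported: Dict[str, bool] = {}
    for line in lines:
        record = json.loads(line)
        has_unsupported.setdefault(record["answer_id"], False)
        if record["label"] != "entailment":
            has_unsupported[record["answer_id"]] = True
    if not has_unsupported:
        return 0.0
    return sum(has_unsupported.values()) / len(has_unsupported)


log = [
    audit_record("a1", 0, "entailment", 0.91, ["chunk-12"], "strip_and_ship"),
    audit_record("a1", 1, "neutral", 0.40, ["chunk-12"], "strip_and_ship"),
    audit_record("a2", 0, "entailment", 0.88, ["chunk-07"], "ship"),
]
rate = unsupported_rate(log)
```

Because each record carries evidence IDs and the gate decision, the same log can back both the online metric and regression tests that replay stored claims after a prompt, retrieval, or model change.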

Key takeaways

Grounding verification is a post-generation gate — prompts alone do not produce auditable factuality enforcement.
Atomic claim decomposition localizes errors that whole-answer metrics miss.
NLI entailment handles paraphrase; citation overlap and parsers catch invented references and numbers.
Abstain and strip-and-ship policies are product decisions, not just model decisions.
Harbor Legal cut unsupported assertions from 23% to 4% with decompose + NLI + numeric parsers and 340 ms median latency.