
LLM NLI faithfulness for RAG explained

Harbor Support’s billing RAG bot answered “Can I downgrade mid-cycle and keep my annual discount?” with a confident yes, citing a chunk about upgrades that never mentioned proration. Retrieval had returned plausible text; the generator stitched together a policy that did not exist. Post-hoc review flagged 19% of production answers as unsupported on a 400-ticket golden set. Routing every response through GPT-4-as-judge cut that to 9% but added 1.8 seconds and $0.04 per ticket. After the team inserted a natural language inference (NLI) entailment pass that scores each atomic claim against the retrieved sources, unsupported answers dropped to 6% at about 110 ms of added p95 latency and $0.0003 per check.

NLI models classify whether a hypothesis (a generated sentence or claim) is entailed by, contradicted by, or neutral relative to a premise (a source chunk). That three-way judgment is cheaper and more consistent than asking a frontier LLM “is this faithful?” on every turn, and it plugs directly into retrieval grading, chain-of-verification loops, and citation enforcement. This guide covers NLI fundamentals, where NLI fits in RAG pipelines, claim decomposition, the Harbor Support refactor, a technique decision table comparing NLI with cross-encoder reranking and LLM judges, pitfalls, and a production checklist. It complements the material on hallucination taxonomy and two-stage reranking.

What natural language inference is

Natural language inference (NLI) is the task of determining whether one text logically follows from another. Given a premise P and hypothesis H, a classifier outputs one of three labels:

Entailment — if P is true, H must be true (“Refunds within 30 days” entails “You can get a refund within one month”).
Contradiction — P and H cannot both be true (“No mid-cycle downgrades” contradicts “Downgrade anytime”).
Neutral — P neither proves nor disproves H (upgrade policy is silent on downgrade discounts).

Production systems typically use fine-tuned cross-encoders (DeBERTa on MNLI, SNLI, or domain-adapted corpora) or small encoder models exported to ONNX. Unlike bi-encoder similarity, NLI reads premise and hypothesis jointly — it catches paraphrase entailment and explicit contradiction that cosine distance misses.
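To make the three-way output concrete, here is a minimal sketch of turning a classifier's raw logits into label probabilities. The label order and the logit values are assumptions for illustration; real checkpoints define their own order in the model config, so read it from there rather than hard-coding it.

```python
import math

# Assumed label order for illustration; check the model config of your checkpoint.
LABELS = ("contradiction", "entailment", "neutral")

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels=LABELS):
    """Return (predicted_label, {label: probability}) for one premise/hypothesis pair."""
    probs = dict(zip(labels, softmax(logits)))
    return max(probs, key=probs.get), probs

# Made-up logits for premise "No mid-cycle downgrades" and hypothesis "Downgrade anytime".
label, probs = classify([3.1, -2.0, 0.4])
```

Working with probabilities rather than logits keeps scores in the 0 to 1 range, which makes it easier to compare them against a fixed threshold.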

Signal	What it measures	Typical model
Embedding cosine similarity	Semantic relatedness	Bi-encoder (e5, BGE)
Cross-encoder relevance	Query–passage match quality	ms-marco MiniLM
NLI entailment score	Does source support this claim?	DeBERTa-MNLI, TRUE benchmark models
LLM-as-judge	Holistic faithfulness rubric	Frontier model + structured prompt

Where NLI fits in RAG pipelines

1. Retrieval grading (pre-generation)

In Self-RAG and Corrective RAG, a retrieval grader decides whether retrieved chunks are relevant enough to answer the query. NLI formulation: premise = chunk, hypothesis = “This passage answers: {user question}.” Reject chunks below an entailment threshold; trigger re-query when fewer than k chunks pass. This is cheaper than generating a draft answer to grade retrieval quality.
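The grading rule above could be sketched roughly as follows. The `entail` callable is a hypothetical stand-in for whatever NLI model is deployed, and the default `tau` and `k` are placeholders, not recommended values.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer: (premise, hypothesis) -> entailment probability in [0, 1].
EntailFn = Callable[[str, str], float]

def grade_retrieval(question: str, chunks: List[str], entail: EntailFn,
                    tau: float = 0.5, k: int = 2) -> Tuple[List[str], bool]:
    """Keep chunks whose entailment score clears tau; flag a re-query if fewer than k pass."""
    hypothesis = f"This passage answers: {question}"
    kept = [chunk for chunk in chunks if entail(chunk, hypothesis) >= tau]
    needs_requery = len(kept) < k
    return kept, needs_requery
```

The caller decides what a re-query means (rewriting the query, widening top-n, or switching indexes); this sketch only reports that one is needed.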

2. Answer faithfulness (post-generation)

After the LLM produces an answer, split it into atomic claims (one policy fact per sentence). For each claim, score entailment against every retrieved chunk and take the max entailment score. Claims below threshold are stripped or rewritten with “I don’t have documentation for…”, or the entire response is blocked. A contradiction score above threshold triggers a hard failure: the model asserted something the sources explicitly deny.

3. Citation alignment

When answers include inline citations, NLI verifies that the cited chunk entails the sentence it supports, not merely that the chunk was retrieved. Mismatched citations are a common hallucination pattern in legal and medical RAG.
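A citation check along these lines might look like the sketch below. It assumes citations appear as bracketed numbers such as `[2]` and that `entail` is a hypothetical scorer returning an entailment probability; the 0.72 default simply reuses the Harbor Support threshold as a placeholder.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed inline citation format: "... sentence [2]."
CITATION = re.compile(r"\s*\[(\d+)\]")

def misaligned_citations(sentences: List[str], chunks: Dict[int, str],
                         entail: Callable[[str, str], float],
                         tau: float = 0.72) -> List[Tuple[str, int]]:
    """Return (sentence, citation_id) pairs where the cited chunk does not entail the sentence."""
    bad = []
    for sentence in sentences:
        claim = CITATION.sub("", sentence).strip()  # score the text without the markers
        for cid in map(int, CITATION.findall(sentence)):
            chunk = chunks.get(cid)
            # A citation that points at a missing chunk counts as unsupported.
            if chunk is None or entail(chunk, claim) < tau:
                bad.append((sentence, cid))
    return bad
```

Note that each sentence is scored only against the chunk it cites, not against the best chunk retrieved, which is what separates citation alignment from the general faithfulness check.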

4. Summarization and compression guards

Map-reduce and chain-of-density summarization can drop negation or invert quantities. Run NLI between each summary sentence and the source span it was derived from before merging into the context window.

Claim decomposition and scoring

Raw answers are rarely one sentence. A practical pipeline:

Segment: split on sentence boundaries; optionally use a lightweight LLM to extract “checkable claims” (dates, amounts, permissions).
Pair: for each claim, form (premise, hypothesis) pairs with the top-n retrieved chunks (usually 3–8).
Score: batch the pairs through the NLI cross-encoder; record entailment, contradiction, and neutral probabilities (softmax over the logits), since the thresholds apply to probabilities.
Aggregate: per claim, take max(entailment) across chunks; flag if max < τ_e or any contradiction > τ_c.
Act: regenerate with a stricter prompt, escalate to a human, or return a partial answer with unsupported sentences removed.
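The aggregate step can be sketched as below. The `score` callable is a hypothetical wrapper around the NLI model that returns per-label probabilities, and the default thresholds borrow the Harbor Support values purely as placeholders.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer: (premise, hypothesis) ->
#   {"entailment": p, "contradiction": p, "neutral": p}
ScoreFn = Callable[[str, str], Dict[str, float]]

def verify_claims(claims: List[str], chunks: List[str], score: ScoreFn,
                  tau_e: float = 0.72, tau_c: float = 0.55) -> List[Tuple[str, str]]:
    """Label each claim 'contradicted', 'unsupported', or 'supported'."""
    results = []
    for claim in claims:
        scores = [score(chunk, claim) for chunk in chunks]
        max_entail = max((s["entailment"] for s in scores), default=0.0)
        max_contra = max((s["contradiction"] for s in scores), default=0.0)
        if max_contra > tau_c:      # some chunk denies the claim
            status = "contradicted"
        elif max_entail < tau_e:    # no chunk supports the claim
            status = "unsupported"
        else:
            status = "supported"
        results.append((claim, status))
    return results
```

Checking contradiction first means a claim that one chunk supports and another denies is still flagged, rather than passing on the strength of the supporting chunk alone.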

Thresholds are domain-specific. Harbor Support tuned τ_e = 0.72 on DeBERTa-v3-base-MNLI after calibrating on 200 labeled ticket pairs. Legal and medical deployments often require τ_e > 0.85 and zero tolerated contradictions on numeric fields.
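One simple way to calibrate τ_e on a labeled set is to sweep candidate thresholds and keep the lowest one that meets a precision target. This is a sketch of that idea, not the procedure Harbor Support used; the `min_precision` target and the input format are assumptions.

```python
from typing import List, Tuple

def pick_threshold(scored: List[Tuple[float, bool]], min_precision: float = 0.95) -> float:
    """Return the lowest threshold whose accepted claims meet min_precision.

    scored holds (max_entailment_score, human_says_supported) pairs.
    Falls back to 1.0 (accept almost nothing) if no threshold qualifies.
    """
    for tau in sorted({s for s, _ in scored}):
        accepted = [ok for s, ok in scored if s >= tau]
        precision = sum(accepted) / len(accepted)
        if precision >= min_precision:
            return tau  # lowest qualifying threshold keeps the most claims
    return 1.0
```

Choosing the lowest qualifying threshold keeps as many supported claims as possible while holding the false-approval rate under the target; stricter domains would raise `min_precision` rather than hand-pick τ_e.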

Batching and latency

A 4-sentence answer against 5 chunks = 20 NLI pairs. Batched on GPU, that is ~80–150 ms; on CPU with ONNX INT8, ~200–400 ms. Against the 1.8 s frontier judge call, that is roughly 12–22× faster on GPU and 4–9× faster on CPU. Cache NLI scores for repeated (chunk, claim) pairs within a session.
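A sketch of pair construction with a session cache follows. `batch_score` is a hypothetical function that scores a list of (chunk, claim) pairs in one model call and returns one entailment score per pair; the class name is illustrative.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (chunk, claim)

class CachedScorer:
    """Wrap a hypothetical batch scorer with a per-session cache."""

    def __init__(self, batch_score: Callable[[List[Pair]], List[float]]):
        self.batch_score = batch_score
        self.cache: Dict[Pair, float] = {}

    def score(self, claims: List[str], chunks: List[str]) -> Dict[Pair, float]:
        pairs = [(chunk, claim) for claim in claims for chunk in chunks]
        # dict.fromkeys drops duplicate pairs while keeping their order.
        missing = [p for p in dict.fromkeys(pairs) if p not in self.cache]
        if missing:
            # One batched call covers every pair not seen earlier in the session.
            self.cache.update(zip(missing, self.batch_score(missing)))
        return {p: self.cache[p] for p in pairs}
```

With 4 claims and 5 chunks the first call scores all 20 pairs in one batch; a follow-up turn that repeats a claim against the same chunks only pays for the new pairs.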

Harbor Support answer-guard refactor

Before refactor, Harbor Support’s RAG stack used cross-encoder reranking for retrieval only. Faithfulness relied on prompt instructions (“only use provided context”) plus weekly human audits. Failure modes:

Conflation hallucinations — merging upgrade and downgrade policies from adjacent chunks.
Negation drops — “not eligible” became “eligible” in short answers.
Citation theater — footnotes pointed to chunks that did not support the sentence.

Refactor steps:

Deployed cross-encoder/nli-deberta-v3-base on a shared CPU inference pod (ONNX, batch size 32).
Added claim splitter: rule-based for short answers; LLM extract for answers >120 tokens.
Blocked any response with a contradiction score > 0.55 against any retrieved chunk.
Stripped sentences with max entailment < 0.72; if >50% of sentences stripped, route to human agent instead of partial answer.
Logged (claim, chunk, scores) for weekly threshold review.
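The blocking, stripping, and escalation rules in these steps can be expressed as a small policy function. This is a sketch assembled from the thresholds stated above, not Harbor Support's actual code; the function name and return values are illustrative.

```python
from typing import List, Tuple

def guard(claims: List[Tuple[str, float, float]],
          tau_e: float = 0.72, tau_c: float = 0.55,
          max_stripped: float = 0.5) -> Tuple[str, List[str]]:
    """Apply the answer-guard policy.

    claims holds (sentence, max_entailment, max_contradiction) triples.
    Returns (action, sentences_to_send) with action in {"block", "escalate", "answer"}.
    """
    # Any contradiction above tau_c blocks the whole response.
    if any(contra > tau_c for _, _, contra in claims):
        return "block", []
    kept = [sentence for sentence, entail, _ in claims if entail >= tau_e]
    stripped = len(claims) - len(kept)
    # More than half the sentences stripped: hand off instead of sending a partial answer.
    if claims and stripped / len(claims) > max_stripped:
        return "escalate", []
    return "answer", kept
```

Keeping the policy separate from scoring makes the weekly threshold review a configuration change rather than a code change.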

Outcomes: unsupported-answer rate 19% → 6% on golden set; p95 latency +110 ms vs +1,800 ms for GPT-4 judge; faithfulness-related escalations down 34%. Retrieval recall unchanged — NLI does not fix bad search, only catches overconfident generation.

Technique decision table

Scenario	Prefer	Avoid
Per-claim source support check	NLI entailment vs retrieved chunks	Embedding similarity alone
Retrieval relevance ranking	Cross-encoder reranker (ms-marco)	NLI with synthetic hypothesis per query
High-stakes numeric/compliance answers	NLI + regex validators + human escalation	Single entailment threshold without contradiction check
Nuanced tone or multi-hop synthesis	LLM-as-judge on sampled traffic	Expecting NLI to grade rhetorical quality
Latency-sensitive chat (<500 ms budget)	Small NLI on CPU, claim cap at 4 sentences	Frontier judge on every turn
Contradiction detection between sources	Pairwise NLI across chunks + UI disclosure	Blindly merging conflicting policies
Multilingual RAG	Multilingual NLI (XNLI-finetuned) or translate-then-NLI	English-only MNLI on non-English pairs without eval

NLI complements uncertainty estimation: low entailment is a hard gate; calibrated LLM logprobs handle gray-area phrasing NLI was not trained on.

Common pitfalls

Treating neutral as pass — neutral means unsupported; only entailment should approve a claim.
One chunk per claim — multi-hop answers may require conjunctive support across chunks; max-over-chunks misses split evidence unless you add merge steps.
NLI on the user question — premise must be source text, not the query; query-chunk NLI confuses relevance with faithfulness.
Ignoring contradiction — high entailment on chunk A while chunk B contradicts the claim means conflicting corpus, not a safe answer.
Over-segmentation — splitting “Refunds are available within 30 days of purchase” into two claims loses joint entailment.
Domain drift — MNLI models weak on SLA tables, JSON logs, and code; fine-tune or add structured validators.
Replacing retrieval quality — NLI catches bad answers; it does not retrieve missing documents. Pair with re-query loops.
Threshold copy-paste — 0.72 worked for Harbor Support; calibrate on your labeled set or you will over-block or under-guard.
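For the split-evidence pitfall, one possible merge step is to fall back to concatenated chunk pairs when no single chunk clears the threshold. This is a sketch of one option among several; `entail` is a hypothetical scorer, and concatenated premises may exceed the model's input length, in which case they would be truncated.

```python
from itertools import combinations
from typing import Callable, List

def max_entailment_with_merge(claim: str, chunks: List[str],
                              entail: Callable[[str, str], float],
                              tau_e: float = 0.72) -> float:
    """Max entailment over single chunks, then over chunk pairs if singles fall short."""
    best = max((entail(chunk, claim) for chunk in chunks), default=0.0)
    if best >= tau_e:
        return best
    # Merge step: concatenate each pair so evidence split across chunks is read jointly.
    for a, b in combinations(chunks, 2):
        best = max(best, entail(f"{a} {b}", claim))
    return best
```

The fallback adds n(n-1)/2 extra pairs per failing claim (10 for 5 chunks), so it is worth running only on claims that miss the single-chunk threshold.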

Production checklist

Label 150+ (source chunk, claim, supported/unsupported/contradicted) examples for threshold tuning.
Deploy NLI as a batched microservice with ONNX or TensorRT; avoid per-pair HTTP to Hugging Face Inference API in hot path.
Implement claim segmentation with tests on negation, dates, and currency.
Log entailment and contradiction scores per claim for drift monitoring.
Block on contradiction before checking entailment for compliance tiers.
Define partial-answer policy: strip vs block vs escalate when claims fail.
Run NLI on citation pairs in CI when prompt templates change.
Sample 5% of traffic to LLM-as-judge to detect NLI blind spots.
Evaluate multilingual pairs if corpus is not English-only.
Document false positive/negative review process for support agents.
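For the traffic-sampling item, hashing a stable identifier gives a deterministic sample without storing state. A minimal sketch, assuming each ticket has a string ID (`ticket_id` is a hypothetical name):

```python
import hashlib

def sample_for_judge(ticket_id: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of tickets by hashing the ID."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the ID, reprocessing the same ticket always gives the same answer, which keeps audit samples reproducible.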

Key takeaways

NLI classifies entailment, contradiction, and neutral between a source premise and a generated hypothesis.
It fits pre-generation retrieval grading and post-generation claim verification — different jobs than reranking or embedding search.
Claim decomposition plus max entailment over chunks is the standard faithfulness pattern; contradiction scores need explicit gates.
Harbor Support cut unsupported answers from 19% to 6% with ~110 ms of added p95 latency, vs 1.8 s for a frontier judge.
Calibrate thresholds on domain data; MNLI alone is not enough for tables, code, or multilingual corpora without extra validators.