LLM NLI faithfulness for RAG explained
Harbor Support’s billing RAG bot answered “Can I downgrade mid-cycle and keep my annual discount?” with a confident yes — citing a chunk about upgrades that never mentioned proration. Retrieval had returned plausible text; the generator stitched a policy that did not exist. Post-hoc review flagged 19% of production answers as unsupported on a 400-ticket golden set. Routing every response through GPT-4-as-judge cut that to 9% but added 1.8 seconds and $0.04 per ticket. After inserting a natural language inference (NLI) entailment pass that scores each atomic claim against retrieved sources, unsupported answers dropped to 6% at 120 ms and $0.0003 per check.
NLI models classify whether a hypothesis (a generated sentence or claim) is entailed by, contradicted by, or neutral relative to a premise (a source chunk). That three-way judgment is cheaper and more consistent than asking a frontier LLM “is this faithful?” on every turn, and it plugs directly into retrieval grading, chain-of-verification loops, and citation enforcement. This guide covers NLI fundamentals, where it fits in RAG pipelines, claim decomposition, the Harbor Support refactor, a technique decision table versus cross-encoder reranking and LLM judges, pitfalls, and a production checklist — complementing hallucination taxonomy and two-stage reranking.
What natural language inference is
Natural language inference (NLI) is the task of determining whether one text logically follows from another. Given a premise P and hypothesis H, a classifier outputs one of three labels:
- Entailment — if P is true, H must be true (“Refunds within 30 days” entails “You can get a refund within one month”).
- Contradiction — P and H cannot both be true (“No mid-cycle downgrades” contradicts “Downgrade anytime”).
- Neutral — P neither proves nor disproves H (upgrade policy is silent on downgrade discounts).
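As a minimal sketch of the three-way decision, the snippet below turns three raw logits into probabilities and picks a label. The label order in `LABELS` and the example logits are assumptions for illustration; real checkpoints define their own order, so it should be read from the model's configuration rather than hard-coded.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed label order for illustration only; check the checkpoint's own mapping.
LABELS = ("contradiction", "entailment", "neutral")


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: Sequence[float], labels: Sequence[str] = LABELS) -> Tuple[str, Dict[str, float]]:
    """Return the most probable label and the full probability map."""
    scores = dict(zip(labels, softmax(logits)))
    return max(scores, key=scores.get), scores


# Illustrative logits, not real model output.
label, scores = classify([-2.1, 3.4, 0.3])
print(label, round(scores["entailment"], 3))
```

Keeping the full probability map, not just the winning label, is what lets later stages apply separate entailment and contradiction thresholds.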
Production systems typically use fine-tuned cross-encoders (DeBERTa on MNLI, SNLI, or domain-adapted corpora) or small encoder models exported to ONNX. Unlike bi-encoder similarity, NLI reads premise and hypothesis jointly — it catches paraphrase entailment and explicit contradiction that cosine distance misses.
| Signal | What it measures | Typical model |
|---|---|---|
| Embedding cosine similarity | Semantic relatedness | Bi-encoder (e5, BGE) |
| Cross-encoder relevance | Query–passage match quality | ms-marco MiniLM |
| NLI entailment score | Does source support this claim? | DeBERTa-MNLI, TRUE benchmark models |
| LLM-as-judge | Holistic faithfulness rubric | Frontier model + structured prompt |
Where NLI fits in RAG pipelines
1. Retrieval grading (pre-generation)
In Self-RAG and Corrective RAG, a retrieval grader decides whether retrieved chunks are relevant enough to answer the query. NLI formulation: premise = chunk, hypothesis = “This passage answers: {user question}.” Reject chunks below an entailment threshold; trigger re-query when fewer than k chunks pass. This is cheaper than generating a draft answer to grade retrieval quality.
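The grading rule above can be sketched as follows. The `score` callable is a hypothetical stand-in for whatever NLI model is deployed (it takes a premise and a hypothesis and returns label probabilities), and the default `tau` and `k` values are placeholders, not recommendations.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer interface: (premise, hypothesis) -> label probabilities.
Scorer = Callable[[str, str], Dict[str, float]]


def grade_retrieval(
    question: str,
    chunks: List[str],
    score: Scorer,
    tau: float = 0.5,  # placeholder entailment threshold
    k: int = 2,        # placeholder minimum number of passing chunks
) -> Tuple[List[str], bool]:
    """Return (chunks that pass the grader, whether a re-query should be triggered)."""
    hypothesis = f"This passage answers: {question}"
    passed = [c for c in chunks if score(c, hypothesis)["entailment"] >= tau]
    return passed, len(passed) < k
```

A caller would drop the rejected chunks before generation and, when the second return value is true, reformulate the query and retrieve again.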
2. Answer faithfulness (post-generation)
After the LLM produces an answer, split it into atomic claims (one policy fact per sentence). For each claim, score entailment against every retrieved chunk and take the max entailment score. Claims below threshold are stripped or rewritten as “I don’t have documentation for…”, or they block the entire response. Contradiction scores above threshold trigger hard failure, because the model asserted something the sources explicitly deny.
3. Citation alignment
When answers include inline citations, NLI verifies that the cited chunk entails the sentence it supports, not merely that the chunk was retrieved. Mismatched citations are a common hallucination pattern in legal and medical RAG.
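A rough sketch of a citation check, assuming citations appear as bracketed numbers such as `[1]` and that chunks are stored in a dict keyed by that number. Both the citation format and the `score` callable are assumptions; the 0.72 default simply reuses the entailment threshold quoted elsewhere in this guide.

```python
import re
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], Dict[str, float]]  # hypothetical: (premise, hypothesis) -> probabilities
CITATION = re.compile(r"\s*\[(\d+)\]")           # assumed citation style: "[1]"


def misaligned_citations(
    sentences: List[str],
    chunks_by_id: Dict[int, str],
    score: Scorer,
    tau_e: float = 0.72,
) -> List[Tuple[str, int]]:
    """Return (claim, cited_id) pairs where the cited chunk is missing or does not entail the claim."""
    problems = []
    for sentence in sentences:
        cited_ids = [int(n) for n in CITATION.findall(sentence)]
        claim = CITATION.sub("", sentence).strip()  # score the sentence without its markers
        for cid in cited_ids:
            chunk = chunks_by_id.get(cid)
            if chunk is None or score(chunk, claim)["entailment"] < tau_e:
                problems.append((claim, cid))
    return problems
```

Each sentence is scored only against the chunk it cites, which is what separates this check from the max-over-chunks faithfulness pass.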
4. Summarization and compression guards
Map-reduce and chain-of-density summarization can drop negation or invert quantities. Run NLI between each summary sentence and the source span it was derived from before merging into the context window.
Claim decomposition and scoring
Raw answers are rarely one sentence. A practical pipeline:
- Segment — split on sentence boundaries; optionally use a lightweight LLM to extract “checkable claims” (dates, amounts, permissions).
- Pair — for each claim, form (premise, hypothesis) pairs with top-n retrieved chunks (usually 3–8).
- Score — batch through NLI cross-encoder; record entailment, contradiction, neutral logits.
- Aggregate — per claim: `max(entailment)` across chunks; flag if max < τe or any contradiction > τc.
- Act — regenerate with stricter prompt, escalate to human, or return partial answer with unsupported sentences removed.
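The segment, pair, score, and aggregate steps can be sketched in a few lines. This is an illustration under assumptions: `split_claims` is a naive regex splitter standing in for a real segmenter, `score` is a hypothetical per-pair NLI callable, and the default thresholds are the values quoted for Harbor Support rather than general recommendations.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

Scorer = Callable[[str, str], Dict[str, float]]  # hypothetical: (premise, hypothesis) -> probabilities


@dataclass
class ClaimVerdict:
    claim: str
    max_entailment: float
    max_contradiction: float
    supported: bool


def split_claims(answer: str) -> List[str]:
    """Naive segmenter: split after ., ! or ? when followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", answer.strip()) if p]


def verify(answer: str, chunks: List[str], score: Scorer,
           tau_e: float = 0.72, tau_c: float = 0.55) -> List[ClaimVerdict]:
    """Score every (chunk, claim) pair and aggregate per claim."""
    verdicts = []
    for claim in split_claims(answer):
        results = [score(chunk, claim) for chunk in chunks]
        max_e = max((r["entailment"] for r in results), default=0.0)
        max_c = max((r["contradiction"] for r in results), default=0.0)
        # Flag when max entailment < tau_e or any contradiction > tau_c.
        supported = max_e >= tau_e and max_c <= tau_c
        verdicts.append(ClaimVerdict(claim, max_e, max_c, supported))
    return verdicts
```

The "act" step is left to the caller, since whether to strip, regenerate, or escalate is a policy decision rather than a scoring one.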
Thresholds are domain-specific. Harbor Support tuned τe = 0.72 on DeBERTa-v3-base-MNLI after calibrating on 200 labeled ticket pairs. Legal and medical deployments often require τe > 0.85 and zero tolerated contradictions on numeric fields.
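The source does not say how the calibration was done, so the following is only one plausible approach: sweep the observed entailment scores on a labeled set and keep the lowest threshold whose accepted claims meet a precision target. The function name, the `min_precision` default, and the sample data are all hypothetical.

```python
from typing import List, Optional, Tuple


def calibrate_tau(labeled: List[Tuple[float, bool]], min_precision: float = 0.95) -> Optional[float]:
    """labeled holds (entailment_score, is_actually_supported) pairs.

    Returns the lowest observed score that, used as a threshold, keeps precision
    on accepted claims at or above min_precision; None if no threshold qualifies.
    """
    for tau in sorted({s for s, _ in labeled}):
        accepted = [ok for s, ok in labeled if s >= tau]
        precision = sum(accepted) / len(accepted)  # never empty: tau is an observed score
        if precision >= min_precision:
            return tau
    return None


# Made-up scores and labels, for illustration only.
sample = [(0.90, True), (0.80, True), (0.75, True), (0.70, False), (0.60, False)]
print(calibrate_tau(sample))
```

Choosing the lowest qualifying threshold keeps as many supported claims as possible for a given precision target; a stricter domain would raise `min_precision` instead of hand-picking a number.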
Batching and latency
A 4-sentence answer checked against 5 chunks yields 4 × 5 = 20 NLI pairs. Batched on GPU, that takes roughly 80–150 ms; on CPU with ONNX INT8, roughly 200–400 ms. Both are well below a frontier judge call, which added about 1.8 s per ticket at Harbor Support. Cache NLI scores for repeated (chunk, claim) pairs within a session.
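A small sketch of pair construction and session-level caching. `score_batch` is a hypothetical function that scores a list of (chunk, claim) pairs in one model call; the cache is a plain in-memory dict, which assumes one `CachedScorer` instance per session.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]              # (chunk, claim)
Scores = Dict[str, float]
BatchScorer = Callable[[List[Pair]], List[Scores]]  # hypothetical batched NLI call


def build_pairs(claims: List[str], chunks: List[str]) -> List[Pair]:
    """Every claim against every chunk: len(claims) * len(chunks) pairs."""
    return [(chunk, claim) for claim, chunk in product(claims, chunks)]


class CachedScorer:
    """Sends only unseen pairs to the model and serves repeats from memory."""

    def __init__(self, score_batch: BatchScorer) -> None:
        self._score_batch = score_batch
        self._cache: Dict[Pair, Scores] = {}

    def __call__(self, pairs: List[Pair]) -> List[Scores]:
        # dict.fromkeys de-duplicates while preserving order.
        missing = [p for p in dict.fromkeys(pairs) if p not in self._cache]
        if missing:
            for pair, result in zip(missing, self._score_batch(missing)):
                self._cache[pair] = result
        return [self._cache[p] for p in pairs]
```

Scoring all missing pairs in a single batch is what keeps the 20-pair case within one forward pass instead of 20 separate calls.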
Harbor Support answer-guard refactor
Before the refactor, Harbor Support’s RAG stack used cross-encoder reranking for retrieval only. Faithfulness relied on prompt instructions (“only use provided context”) plus weekly human audits. Failure modes:
- Conflation hallucinations — merging upgrade and downgrade policies from adjacent chunks.
- Negation drops — “not eligible” became “eligible” in short answers.
- Citation theater — footnotes pointed to chunks that did not support the sentence.
Refactor steps:
- Deployed `cross-encoder/nli-deberta-v3-base` on a shared CPU inference pod (ONNX, batch size 32).
- Added claim splitter: rule-based for short answers; LLM extract for answers >120 tokens.
- Blocked any response with a contradiction score > 0.55 against any retrieved chunk.
- Stripped sentences with max entailment < 0.72; if >50% of sentences stripped, route to human agent instead of partial answer.
- Logged (claim, chunk, scores) for weekly threshold review.
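The block, strip, and escalate rules in the steps above could be expressed roughly like this. The function name, input shape, and return values are illustrative; only the three numeric thresholds come from the description above.

```python
from typing import List, Tuple

# (sentence, max entailment across chunks, max contradiction across chunks)
SentenceScore = Tuple[str, float, float]


def guard(
    sentence_scores: List[SentenceScore],
    tau_e: float = 0.72,        # strip sentences with max entailment below this
    tau_c: float = 0.55,        # block the response if any contradiction exceeds this
    max_stripped: float = 0.5,  # route to a human if more than this fraction is stripped
) -> Tuple[str, List[str]]:
    """Return (action, sentences to send), where action is 'block', 'escalate', or 'answer'."""
    if any(c > tau_c for _, _, c in sentence_scores):
        return "block", []
    kept = [s for s, e, _ in sentence_scores if e >= tau_e]
    stripped = len(sentence_scores) - len(kept)
    if sentence_scores and stripped / len(sentence_scores) > max_stripped:
        return "escalate", []
    return "answer", kept


print(guard([("Refunds are available within 30 days.", 0.91, 0.02),
             ("Discounts carry over on downgrade.", 0.40, 0.10)]))
```

The contradiction check runs first so that a response with one well-supported sentence and one contradicted sentence is never partially delivered.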
Outcomes: unsupported-answer rate 19% → 6% on golden set; p95 latency +110 ms vs +1,800 ms for GPT-4 judge; faithfulness-related escalations down 34%. Retrieval recall unchanged — NLI does not fix bad search, only catches overconfident generation.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Per-claim source support check | NLI entailment vs retrieved chunks | Embedding similarity alone |
| Retrieval relevance ranking | Cross-encoder reranker (ms-marco) | NLI with synthetic hypothesis per query |
| High-stakes numeric/compliance answers | NLI + regex validators + human escalation | Single entailment threshold without contradiction check |
| Nuanced tone or multi-hop synthesis | LLM-as-judge on sampled traffic | Expecting NLI to grade rhetorical quality |
| Latency-sensitive chat (<500 ms budget) | Small NLI on CPU, claim cap at 4 sentences | Frontier judge on every turn |
| Contradiction detection between sources | Pairwise NLI across chunks + UI disclosure | Blindly merging conflicting policies |
| Multilingual RAG | Multilingual NLI (XNLI-finetuned) or translate-then-NLI | English-only MNLI on non-English pairs without eval |
NLI complements uncertainty estimation: low entailment is a hard gate; calibrated LLM logprobs handle gray-area phrasing NLI was not trained on.
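For the "contradiction detection between sources" row, a pairwise pass over retrieved chunks might look like the sketch below. NLI is directional, so each pair is scored in both orders. The `score` callable and the 0.55 default are assumptions, and scoring whole chunks against each other is a simplification: hypotheses in common NLI training sets are usually single sentences, so sentence-level comparison may behave better in practice.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], Dict[str, float]]  # hypothetical: (premise, hypothesis) -> probabilities


def conflicting_chunks(chunks: List[str], score: Scorer, tau_c: float = 0.55) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with i < j where either direction exceeds the contradiction threshold."""
    conflicts = set()
    for i, j in permutations(range(len(chunks)), 2):
        if score(chunks[i], chunks[j])["contradiction"] > tau_c:
            conflicts.add((min(i, j), max(i, j)))
    return sorted(conflicts)
```

The returned index pairs can drive a UI disclosure ("sources disagree on this point") instead of letting the generator silently merge the two policies.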
Common pitfalls
- Treating neutral as pass — neutral means unsupported; only entailment should approve a claim.
- One chunk per claim — multi-hop answers may require conjunctive support across chunks; max-over-chunks misses split evidence unless you add merge steps.
- NLI on the user question — for faithfulness checks, the premise must be source text and the hypothesis a generated claim, not the query; query-chunk NLI measures relevance, not faithfulness.
- Ignoring contradiction — high entailment on chunk A while chunk B contradicts the claim means conflicting corpus, not a safe answer.
- Over-segmentation — splitting “Refunds are available within 30 days of purchase” into two claims loses joint entailment.
- Domain drift — MNLI models are weak on SLA tables, JSON logs, and code; fine-tune or add structured validators.
- Replacing retrieval quality — NLI catches bad answers; it does not retrieve missing documents. Pair with re-query loops.
- Threshold copy-paste — 0.72 worked for Harbor Support; calibrate on your labeled set or you will over-block or under-guard.
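One possible "merge step" for the split-evidence pitfall is sketched below: if no single chunk supports a claim, retry with each pair of chunks concatenated into one premise. This is an assumption about what such a step could look like, not a method described in the source; `score` is a hypothetical NLI callable, and pairwise merging grows quadratically with the number of chunks.

```python
from itertools import combinations
from typing import Callable, Dict, List

Scorer = Callable[[str, str], Dict[str, float]]  # hypothetical: (premise, hypothesis) -> probabilities


def max_entailment_with_merge(claim: str, chunks: List[str], score: Scorer, tau_e: float = 0.72) -> float:
    """Max entailment over single chunks, falling back to concatenated chunk pairs."""
    best = max((score(c, claim)["entailment"] for c in chunks), default=0.0)
    if best >= tau_e:
        return best  # a single chunk already supports the claim
    for a, b in combinations(chunks, 2):
        merged = a + "\n" + b  # joint premise for evidence split across two chunks
        best = max(best, score(merged, claim)["entailment"])
    return best
```

The fallback only runs for claims that fail the single-chunk check, so the extra pairs are paid for only where they might change the outcome.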
Production checklist
- Label 150+ (source chunk, claim, supported/unsupported/contradicted) examples for threshold tuning.
- Deploy NLI as a batched microservice with ONNX or TensorRT; avoid per-pair HTTP to Hugging Face Inference API in hot path.
- Implement claim segmentation with tests on negation, dates, and currency.
- Log entailment and contradiction scores per claim for drift monitoring.
- Block on contradiction before checking entailment for compliance tiers.
- Define partial-answer policy: strip vs block vs escalate when claims fail.
- Run NLI on citation pairs in CI when prompt templates change.
- Sample 5% of traffic to LLM-as-judge to detect NLI blind spots.
- Evaluate multilingual pairs if corpus is not English-only.
- Document false positive/negative review process for support agents.
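Two of the checklist items, per-claim score logging and sampling a share of traffic for an LLM judge, can be sketched with the standard library alone. The record fields, the function names, and the hash-based sampling scheme are assumptions for illustration; hashing the ticket ID keeps the sampling decision stable across retries.

```python
import hashlib
import json
from typing import Dict


def sampled_for_judge(ticket_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of tickets by hashing the ID into [0, 1)."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def log_record(claim: str, chunk_id: str, scores: Dict[str, float]) -> str:
    """Serialize one (claim, chunk, scores) record as a JSON line for drift monitoring."""
    return json.dumps({"claim": claim, "chunk_id": chunk_id, **scores}, sort_keys=True)


print(log_record("Refunds are available within 30 days.", "chunk-17",
                 {"entailment": 0.91, "contradiction": 0.02, "neutral": 0.07}))
```

Logging the full score triple rather than only the pass/fail decision makes it possible to re-evaluate historical traffic when thresholds are retuned.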
Key takeaways
- NLI classifies entailment, contradiction, and neutral between a source premise and a generated hypothesis.
- It fits pre-generation retrieval grading and post-generation claim verification — different jobs than reranking or embedding search.
- Claim decomposition plus max entailment over chunks is the standard faithfulness pattern; contradiction scores need explicit gates.
- Harbor Support cut unsupported answers from 19% to 6% with ~110 ms latency vs 1.8 s for a frontier judge.
- Calibrate thresholds on domain data; MNLI alone is not enough for tables, code, or multilingual corpora without extra validators.
Related reading
- Agentic RAG explained — retrieval grading and re-query loops where NLI graders plug in
- LLM hallucinations explained — taxonomy of unsupported generation and mitigation layers
- LLM chain-of-verification explained — planned fact-check questions before final answers
- RAG citation and source attribution explained — tying answers to verifiable spans