LLM chain-of-verification explained
Harbor Support's policy Q&A bot answered “Can I return a digital download after 14 days?” with a confident yes, citing a 30-day window that the knowledge base never granted to digital goods. The model had blended a physical-goods rule with a promo FAQ snippet, and a single chain-of-thought pass made the error sound reasoned. Adding chain-of-verification (CoVe) (Dhuliawala et al., 2023) cut the policy hallucination rate on a 400-question audit set from 18% to 6% at roughly 3.2× token cost. The method has four steps: draft an answer, plan verification questions, answer those questions without the draft visible, then revise. CoVe targets the failure mode where fluent reasoning locks in a wrong fact before anyone checks it.
Unlike retrieval-only fixes, CoVe makes the model interrogate its own draft: verification questions decompose claims into checkable atoms. Unlike self-consistency, it does not assume diverse samples will vote out a shared wrong premise; it explicitly re-queries facts in isolation. This guide walks through the CoVe loop, variants (joint, two-step, and factored verification), how it pairs with RAG, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
What chain-of-verification is
CoVe is a four-stage inference pattern for reducing factual errors in free-form generation:
- Baseline response — produce an initial answer to the user query (with or without CoT).
- Plan verification questions — given the baseline, list questions whose answers would confirm or refute specific claims in the draft (dates, amounts, eligibility rules, names, counts).
- Execute verification — answer each verification question independently, without the baseline response in context. This breaks confirmation bias: the model cannot rationalize the original mistake while “checking.”
- Final verified response — revise the baseline using verification results; drop or correct contradicted claims.
The key insight is conditioning independence. If the model sees its draft while answering “What is the digital refund window?” it tends to defend the draft. Hiding the draft forces a fresh lookup from parametric memory (or retrieved context, if wired in). CoVe does not guarantee truth — verification answers can still hallucinate — but it separates generation from critique in a way that catches many entangled errors.
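The four stages above can be sketched as a small function. This is a minimal illustration, not a reference implementation: the `llm` callable (prompt string in, completion string out), the prompt wording, and the `max_questions` parameter are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

def chain_of_verification(query: str, llm: LLM, max_questions: int = 5) -> str:
    # Stage 1: baseline response.
    baseline = llm(f"Answer the question.\nQuestion: {query}")

    # Stage 2: plan verification questions (one per line) from the draft.
    plan = llm(
        "List one verification question per line for each factual claim "
        f"in this draft.\nQuestion: {query}\nDraft: {baseline}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]
    questions = questions[:max_questions]

    # Stage 3: answer each question WITHOUT the draft in the prompt.
    checks = [(q, llm(f"Answer concisely.\nQuestion: {q}")) for q in questions]

    # Stage 4: revise the draft against the independent answers.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the draft so it agrees with the verified facts; drop or correct "
        "contradicted claims.\n"
        f"Question: {query}\nDraft: {baseline}\nVerified facts:\n{evidence}"
    )
```

The only structural requirement is in stage 3: the verification prompt is built from the question alone, so the draft cannot influence the check.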
CoVe variants: joint, two-step, and factored
Joint verification
One prompt plans all verification questions and answers them in a single completion. Lowest latency, but the answers are generated with the baseline draft in context and can also pick up bias from earlier answers in the same completion.
Two-step verification
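The first call plans the questions; the second call answers all of them together without the baseline in context; a third call revises. "Two-step" refers to splitting verification into separate planning and execution prompts. Clear separation; moderate cost.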
Factored verification (recommended for production)
Each verification question gets its own API call (or batched calls with isolated prompts). Highest cost, strongest independence. Parallelize with async workers; typical policies need 3–8 verification questions per user query.
| Variant | Independence | Latency | Token cost |
|---|---|---|---|
| Joint | Low | Lowest | ~2× baseline |
| Two-step | Medium | Medium | ~2.5× |
| Factored | High | Higher (parallelizable) | ~3–5× |
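Factored verification with parallel workers might look like the sketch below. The async `llm` callable, the prompt wording, and the `max_concurrency` limit are assumptions for illustration; a real client would add timeouts and retries.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical async client: prompt in, completion out.
AsyncLLM = Callable[[str], Awaitable[str]]

async def verify_factored(
    questions: List[str], llm: AsyncLLM, max_concurrency: int = 4
) -> Dict[str, str]:
    # One isolated prompt per question; the baseline draft is never passed in.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def ask(question: str) -> str:
        async with semaphore:
            return await llm(f"Answer concisely.\nQuestion: {question}")

    # gather preserves input order, so answers line up with questions.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    return dict(zip(questions, answers))
```

Because the calls run concurrently, wall-clock latency tends to track the slowest single verification call rather than the sum, while token cost still scales with the number of questions.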
Designing good verification questions
Question quality drives CoVe gains. Weak plans ask vague questions (“Is this policy fair?”); strong plans target atomic facts:
- Entity questions — “What is the official name of plan tier X?”
- Numeric questions — “How many days after purchase are digital goods refundable?”
- Boolean eligibility — “Does promo code HARBOR20 extend the refund window?”
- Temporal questions — “When did policy v3.2 take effect?”
- Scope questions — “Does the EU consumer directive override the US terms for this SKU?”
Prompt the planner with examples from your domain. Cap question count (e.g. 5–7) to control cost. Reject meta-questions that merely restate the user query. Use structured outputs (JSON array of {"claim", "question"} pairs) so downstream code can fan out factored calls reliably.
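As a rough sketch of the validation step, the function below parses planner output, drops malformed pairs, filters out questions that simply restate the user query, and applies the cap. The function name, the exact-match restatement filter, and the default cap of 7 are assumptions; a production filter would likely need fuzzier matching.

```python
import json
from typing import Dict, List

def parse_plan(raw: str, user_query: str, max_questions: int = 7) -> List[Dict[str, str]]:
    """Validate planner output: a JSON array of {"claim", "question"} objects."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("planner output must be a JSON array")

    normalized_query = user_query.strip().lower().rstrip("?")
    plan: List[Dict[str, str]] = []
    for item in items:
        if not isinstance(item, dict):
            continue  # skip non-object entries
        claim = str(item.get("claim", "")).strip()
        question = str(item.get("question", "")).strip()
        if not claim or not question:
            continue  # skip malformed pairs
        # Crude meta-question filter (assumption): drop exact restatements.
        if question.lower().rstrip("?") == normalized_query:
            continue
        plan.append({"claim": claim, "question": question})
    return plan[:max_questions]
```

Each surviving pair can then be handed to a factored verification call, with the `claim` field kept so the revision step knows which part of the draft an answer supports or contradicts.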
CoVe vs RAG, self-consistency, and external fact-checkers
| Approach | Mechanism | Best when | Weak when |
|---|---|---|---|
| RAG alone | Retrieve chunks, generate grounded answer | Corpus is complete and chunks match query | Model ignores or misquotes retrieved text |
| CoVe | Self-posed verification Q&A, then revise | Atomic facts checkable from memory or docs | Verification answers share same blind spots |
| Self-consistency | Sample N answers, majority vote | Discrete labels, reasoning diversity helps | Shared wrong premise across samples |
| LLM-as-judge | Separate model scores faithfulness | Need continuous quality scores | Judge agrees with generator's errors |
| Tool / DB lookup | Deterministic API for each fact | Structured policy DB, pricing APIs | Unstructured prose policies |
Production stacks combine layers: RAG supplies evidence, CoVe catches misreads and over-generalizations, deterministic tools arbitrate numbers and dates. CoVe is not a replacement for citation requirements — attach source chunk IDs in the final response when compliance demands audit trails.
Harbor Support policy Q&A refactor
The refactor kept the same vector index over 1,200 policy paragraphs but changed the answer path:
- Retrieve — top-6 chunks via hybrid search (unchanged).
- Baseline — CoT answer citing chunk IDs in a JSON schema with a claims[] array.
- Plan — one call outputs 3–6 verification questions mapped to specific claims.
- Verify — factored calls with retrieval only (no baseline text); each answer includes a supporting chunk ID or unknown.
- Revise — final user-facing prose; claims without verification support are omitted or softened (“contact support”).
- Router — FAQ intents with exact doc matches skip CoVe (single-call fast path).
In a human review of 400 held-out tickets, the hallucination rate fell from 18% to 6%, p95 latency on the CoVe tier rose from 2.4 s to 6.1 s, and CSAT on policy answers rose by 0.4 points. Escalations to human agents fell 12% because fewer answers contradicted the written policy.
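The router step can be illustrated with a small lookup. This is only a sketch: the `Route` type, the `faq_exact` mapping, and the whitespace/case normalization are hypothetical, and the source does not say how Harbor defines an exact doc match.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Route:
    path: str          # "fast" (single call) or "cove" (full verification loop)
    answer: str = ""   # filled only on the fast path

def route_query(query: str, faq_exact: Dict[str, str]) -> Route:
    # Normalization is an assumption: lowercase and collapse whitespace.
    key = " ".join(query.lower().split())
    if key in faq_exact:
        return Route(path="fast", answer=faq_exact[key])
    # Anything without an exact FAQ match goes through the CoVe tier.
    return Route(path="cove")
```

Keeping the fast path strict (exact matches only) errs toward sending ambiguous queries through verification, which trades latency for fewer unverified answers.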
CoVe with retrieval: wiring verification to evidence
Pure parametric CoVe still hallucinates if the model never saw the policy. Harbor's factored verification prompts include the same retrieved chunks as the baseline step, but not the baseline answer:
System: Answer using only the provided policy excerpts.
If the excerpts do not contain the answer, respond UNKNOWN.
Policy excerpts:
[chunk 42] Digital goods: non-refundable after download...
[chunk 87] Physical goods: 30-day return window...
Question: How many days after purchase can a digital download be refunded?
Mismatches between verification answers and baseline claims trigger automatic revision. Track hallucination proxies: rate of UNKNOWN verification responses, claim drop rate in revision, and user thumbs-down on policy topics.
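A minimal sketch of the prompt assembly and two of the proxies follows. The function names and the way claims are counted are assumptions for illustration; the prompt text mirrors the example above.

```python
from typing import Dict, List

SYSTEM = (
    "Answer using only the provided policy excerpts.\n"
    "If the excerpts do not contain the answer, respond UNKNOWN."
)

def build_verify_prompt(question: str, chunks: Dict[int, str]) -> str:
    # Only excerpts and the question go in; the baseline answer is deliberately absent.
    excerpts = "\n".join(f"[chunk {cid}] {text}" for cid, text in chunks.items())
    return f"System: {SYSTEM}\nPolicy excerpts:\n{excerpts}\nQuestion: {question}"

def unknown_rate(verify_answers: List[str]) -> float:
    # Share of verification answers that admitted missing evidence.
    if not verify_answers:
        return 0.0
    unknown = sum(1 for a in verify_answers if a.strip().upper() == "UNKNOWN")
    return unknown / len(verify_answers)

def claim_drop_rate(baseline_claims: int, final_claims: int) -> float:
    # Fraction of baseline claims removed during revision.
    if baseline_claims == 0:
        return 0.0
    return (baseline_claims - final_claims) / baseline_claims
```

A rising UNKNOWN rate may point at retrieval gaps rather than model errors, so it is worth reading alongside the claim drop rate instead of in isolation.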
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Single-shot RAG | High volume, simple FAQ, strong retrieval | High-stakes compliance, frequent misquotes |
| CoVe (factored) | Multi-claim answers, policy/legal/medical prose | Creative writing, opinion, low factual stakes |
| Self-consistency | Math, classification, discrete choices | Long narrative with entangled factual threads |
| CoVe + tool calls | Dates, prices, inventory in structured APIs | No machine-readable source of truth |
| Human-in-the-loop | Irreversible actions (refunds, medical) | Latency-sensitive chat for low-risk topics |
Common pitfalls
- Verification sees the baseline — negates the whole method; enforce prompt templates that strip the draft from verify calls.
- Vague verification questions — “Is this correct?” produces yes-man answers; require atomic, answerable questions.
- Too many questions — latency and cost explode; cap and prioritize high-risk claims (money, health, legal).
- No UNKNOWN escape hatch — models guess when evidence is missing; instruct the model to answer UNKNOWN when the evidence does not contain the answer.
- Ignoring contradictions — revision step must drop or rewrite claims when verify conflicts with baseline.
- CoVe on every query — route chit-chat and creative tasks to fast paths; reserve CoVe for factual tiers.
- Same retrieval for all steps — verification may need different chunks; re-retrieve per question when queries are specific.
- No regression suite — policy updates silently break verification templates; maintain labeled claim-level tests.
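Two of the pitfalls above, draft leakage and the missing regression suite, lend themselves to automated checks. The sketch below is one possible shape, with assumed names throughout: `draft_leaked` is a verbatim-sentence heuristic (it will miss paraphrased leaks), and `run_claim_regression` compares a hypothetical `verify` callable against labeled expected substrings.

```python
from typing import Callable, List, Tuple

def draft_leaked(verify_prompt: str, baseline: str, min_len: int = 20) -> bool:
    # Heuristic: flag the prompt if any baseline sentence of at least
    # min_len characters appears in it verbatim.
    sentences = [s.strip() for s in baseline.split(".") if len(s.strip()) >= min_len]
    return any(s in verify_prompt for s in sentences)

def run_claim_regression(
    verify: Callable[[str], str], cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    # Each case is (verification question, expected answer substring).
    # Returns the failing cases as (question, expected, actual).
    failures = []
    for question, expected in cases:
        got = verify(question)
        if expected.lower() not in got.lower():
            failures.append((question, expected, got))
    return failures
```

Running the regression cases after each policy or template change gives a claim-level signal that an aggregate hallucination rate can hide.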
Production checklist
- Define which intents require CoVe vs single-shot (risk tiering).
- Implement factored verification with baseline hidden from verify prompts.
- Use structured JSON for claim extraction and question planning.
- Wire verification to retrieval or tools, not parametric memory alone.
- Cap verification questions per query; prioritize high-impact claims.
- Implement revision rules: drop, soften, or escalate on contradiction.
- Log baseline, questions, verify answers, and final text for audit.
- Benchmark hallucination rate and latency on a labeled holdout set.
- Parallelize factored verify calls; monitor p95 latency and token spend.
- Re-run evals after every policy corpus or prompt template change.
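The revision rules in the checklist (drop, soften, or escalate on contradiction) could be encoded along these lines. The mapping is an assumption made for illustration, as are the `Action` enum and the `high_risk` flag; the exact string comparison is a stand-in for whatever claim-matching logic a real system would use.

```python
from enum import Enum

class Action(Enum):
    KEEP = "keep"
    SOFTEN = "soften"
    DROP = "drop"
    ESCALATE = "escalate"

def revision_action(claim_value: str, verified_value: str, high_risk: bool = False) -> Action:
    verified = verified_value.strip()
    if verified.upper() == "UNKNOWN":
        # No supporting evidence: soften (e.g. "contact support"),
        # or hand off to a human when the claim is high risk.
        return Action.ESCALATE if high_risk else Action.SOFTEN
    if verified.lower() == claim_value.strip().lower():
        return Action.KEEP  # verification agrees with the draft
    # Verification contradicts the draft.
    return Action.ESCALATE if high_risk else Action.DROP
```

For example, a draft claim of "30 days" checked against a verification answer of "non-refundable" would be dropped on a low-risk tier and escalated on a high-risk one.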
Key takeaways
- Chain-of-verification drafts an answer, plans atomic verification questions, answers them without the draft visible, then revises — breaking confirmation bias that fuels confident hallucinations.
- Factored verification (one call per question) costs more but delivers the independence that makes CoVe work; joint variants are faster but weaker.
- Harbor Support cut policy hallucinations from 18% to 6% by pairing RAG with CoVe and a fast-path router for exact FAQ matches.
- CoVe complements RAG and self-consistency; it does not replace citations, structured tools, or human review for high-stakes decisions.
- Question quality and an UNKNOWN escape hatch matter as much as the four-step loop — vague checks produce vague fixes.
Related reading
- LLM hallucinations explained — why models invent facts and how to measure them
- RAG explained — grounding answers in retrieved evidence
- LLM self-consistency reasoning explained — voting across samples vs verification loops
- LLM-as-judge explained — separate evaluators for faithfulness scoring