
LLM chain-of-verification explained

Harbor Support's policy Q&A bot answered “Can I return a digital download after 14 days?” with a confident yes, citing a 30-day digital refund window that never existed in the knowledge base. The model had blended a physical-goods rule with a promo FAQ snippet, and a single chain-of-thought pass made the error sound reasoned. Adding chain-of-verification (CoVe) (Dhuliawala et al., 2023) cut the policy hallucination rate on a 400-question audit set from 18% to 6% at roughly 3.2× the token cost. The method has four steps: draft an answer, plan verification questions, answer those questions without the draft visible, then revise. CoVe targets the failure mode where fluent reasoning locks in a wrong fact before anyone checks it.

Unlike retrieval-only fixes, CoVe works on the model's own draft: verification questions decompose its claims into checkable atoms. Unlike self-consistency, it does not assume diverse samples will vote out a shared wrong premise; it explicitly re-queries facts in isolation. This guide walks through the CoVe loop, its variants (joint, two-step, and factored verification), how it pairs with RAG, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

What chain-of-verification is

CoVe is a four-stage inference pattern for reducing factual errors in free-form generation:

Baseline response — produce an initial answer to the user query (with or without CoT).
Plan verification questions — given the baseline, list questions whose answers would confirm or refute specific claims in the draft (dates, amounts, eligibility rules, names, counts).
Execute verification — answer each verification question independently, without the baseline response in context. This breaks confirmation bias: the model cannot rationalize the original mistake while “checking.”
Final verified response — revise the baseline using verification results; drop or correct contradicted claims.

The key insight is conditioning independence. If the model sees its draft while answering “What is the digital refund window?” it tends to defend the draft. Hiding the draft forces a fresh lookup from parametric memory (or retrieved context, if wired in). CoVe does not guarantee truth — verification answers can still hallucinate — but it separates generation from critique in a way that catches many entangled errors.
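The four stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's reference implementation: `llm` is a hypothetical callable (prompt in, completion out) standing in for whatever client you use, and the prompt wording and the `max_questions` parameter are placeholders.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion


def chain_of_verification(query: str, llm: LLM, max_questions: int = 5) -> dict:
    # Stage 1: baseline draft.
    baseline = llm(f"Answer the question.\n\nQuestion: {query}")

    # Stage 2: plan verification questions. The planner does see the draft.
    plan = llm(
        "List one verification question per line for the factual claims "
        f"in this draft.\n\nQuestion: {query}\nDraft: {baseline}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]
    questions = questions[:max_questions]

    # Stage 3: answer each question in isolation. The draft is NOT in the prompt.
    answers = [llm(f"Answer concisely.\n\nQuestion: {q}") for q in questions]

    # Stage 4: revise the draft against the verification results.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    final = llm(
        "Revise the draft so it agrees with the verification results. "
        "Drop or correct contradicted claims.\n\n"
        f"Question: {query}\nDraft: {baseline}\nVerification:\n{evidence}"
    )
    return {
        "baseline": baseline,
        "questions": questions,
        "answers": answers,
        "final": final,
    }
```

The structural point is stage 3: each verification prompt is built from the question alone, so the draft has no path into it.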

CoVe variants: joint, two-step, and factored

Joint verification

One prompt plans the verification questions and answers them in a single completion. Lowest latency, but the draft and the earlier answers sit in the same context window, so later answers can inherit their bias.

Two-step verification

The first call plans the questions, the second answers all of them together without the baseline in context, and a third revises. Separation from the draft is clean, though the answers still share one context; cost is moderate.

Factored verification (recommended for production)

Each verification question gets its own API call (or batched calls with isolated prompts). Highest cost, strongest independence. Parallelize with async workers; typical policies need 3–8 verification questions per user query.

Variant	Independence	Latency	Token cost (vs single call)
Joint	Low	Lowest	~2×
Two-step	Medium	Medium	~2.5×
Factored	High	Higher (parallelizable)	~3–5×
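Factored verification parallelizes naturally. The sketch below assumes a hypothetical async callable `llm` (prompt in, completion out) and a hypothetical `max_parallel` limit; it uses only the standard library's `asyncio` to fan out one isolated prompt per question.

```python
import asyncio
from typing import Awaitable, Callable, List

AsyncLLM = Callable[[str], Awaitable[str]]  # hypothetical async client wrapper


async def verify_factored(
    questions: List[str], llm: AsyncLLM, max_parallel: int = 4
) -> List[str]:
    # Bound concurrency so a burst of questions does not hit rate limits.
    sem = asyncio.Semaphore(max_parallel)

    async def ask(question: str) -> str:
        # Each prompt contains exactly one question and nothing else:
        # no draft, no sibling questions, no sibling answers.
        async with sem:
            return await llm(f"Answer concisely.\n\nQuestion: {question}")

    # gather returns results in the same order as the input questions.
    return list(await asyncio.gather(*(ask(q) for q in questions)))
```

With calls in flight concurrently, wall-clock latency tends toward the slowest single verification call rather than the sum, while token cost still scales with the number of questions.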

Designing good verification questions

Question quality drives CoVe gains. Weak plans ask vague questions (“Is this policy fair?”); strong plans target atomic facts:

Entity questions — “What is the official name of plan tier X?”
Numeric questions — “How many days after purchase are digital goods refundable?”
Boolean eligibility — “Does promo code HARBOR20 extend the refund window?”
Temporal questions — “When did policy v3.2 take effect?”
Scope questions — “Does the EU consumer directive override the US terms for this SKU?”

Prompt the planner with examples from your domain. Cap question count (e.g. 5–7) to control cost. Reject meta-questions that merely restate the user query. Use structured outputs (JSON array of {"claim", "question"} pairs) so downstream code can fan out factored calls reliably.
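Planner output deserves validation before it drives a fan-out. The following sketch assumes the planner returns a JSON array of {"claim", "question"} objects as described above; the function name `parse_plan` and the exact-match test for restated queries are illustrative assumptions, and a real system would likely want a fuzzier similarity check.

```python
import json
from typing import Dict, List


def _norm(text: str) -> str:
    # Lowercase, collapse whitespace, ignore a trailing question mark.
    return " ".join(text.lower().split()).rstrip("?")


def parse_plan(raw: str, user_query: str, max_questions: int = 6) -> List[Dict[str, str]]:
    """Validate planner output: a JSON array of {"claim", "question"} objects."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    query_key = _norm(user_query)
    seen = set()
    plan: List[Dict[str, str]] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        claim, question = item.get("claim"), item.get("question")
        if not (isinstance(claim, str) and isinstance(question, str)):
            continue
        key = _norm(question)
        # Reject empty questions, duplicates, and restatements of the user query.
        if not key or key in seen or key == query_key:
            continue
        seen.add(key)
        plan.append({"claim": claim.strip(), "question": question.strip()})
        if len(plan) == max_questions:  # cap to control cost
            break
    return plan
```

Returning an empty plan on malformed output lets the caller decide whether to retry the planner or fall back to a softened answer.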

CoVe vs RAG, self-consistency, and external fact-checkers

Approach	Mechanism	Best when	Weak when
RAG alone	Retrieve chunks, generate grounded answer	Corpus is complete and chunks match query	Model ignores or misquotes retrieved text
CoVe	Self-posed verification Q&A, then revise	Atomic facts checkable from memory or docs	Verification answers share same blind spots
Self-consistency	Sample N answers, majority vote	Discrete labels, reasoning diversity helps	Shared wrong premise across samples
LLM-as-judge	Separate model scores faithfulness	Need continuous quality scores	Judge shares the generator's blind spots
Tool / DB lookup	Deterministic API for each fact	Structured policy DB, pricing APIs	Unstructured prose policies

Production stacks combine layers: RAG supplies evidence, CoVe catches misreads and over-generalizations, deterministic tools arbitrate numbers and dates. CoVe is not a replacement for citation requirements — attach source chunk IDs in the final response when compliance demands audit trails.

Harbor Support policy Q&A refactor

The refactor kept the same vector index over 1,200 policy paragraphs but changed the answer path:

Retrieve — top-6 chunks via hybrid search (unchanged).
Baseline — CoT answer citing chunk IDs in JSON schema with claims[] array.
Plan — one call outputs 3–6 verification questions mapped to specific claims.
Verify — factored calls with retrieval only (no baseline text); each answer includes supporting chunk ID or unknown.
Revise — final user-facing prose; claims without verification support are omitted or softened (“contact support”).
Router — FAQ intents with exact doc matches skip CoVe (single-call fast path).
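The router step can be as simple as a normalized lookup. The sketch below is an assumption about how an "exact doc match" might be tested, not Harbor's actual router: `faq_docs` is a hypothetical mapping from normalized FAQ questions to document IDs, and anything that misses goes to the CoVe tier.

```python
import re
from typing import Dict, Optional, Tuple


def normalize(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def route(query: str, faq_docs: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return ("fast_path", doc_id) on an exact FAQ match, else ("cove", None)."""
    doc_id = faq_docs.get(normalize(query))
    if doc_id is not None:
        return "fast_path", doc_id
    return "cove", None
```

Defaulting to the CoVe tier on a miss keeps the failure mode conservative: an unrecognized policy question pays extra latency rather than skipping verification.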

Human review of 400 held-out tickets showed the hallucination rate falling from 18% to 6%, p95 latency rising from 2.4 s to 6.1 s on the CoVe tier, and CSAT on policy answers gaining 0.4 points. Escalations to human agents fell 12% because fewer answers contradicted the written policy.

CoVe with retrieval: wiring verification to evidence

Pure parametric CoVe still hallucinates if the model never saw the policy. Harbor's factored verification prompts include the same retrieved chunks as the baseline step, but not the baseline answer:

System: Answer using only the provided policy excerpts.
If the excerpts do not contain the answer, respond UNKNOWN.

Policy excerpts:
[chunk 42] Digital goods: non-refundable after download...
[chunk 87] Physical goods: 30-day return window...

Question: How many days after purchase can a digital download be refunded?

Mismatches between verification answers and baseline claims trigger automatic revision. Track hallucination proxies: rate of UNKNOWN verification responses, claim drop rate in revision, and user thumbs-down on policy topics.
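A sketch of the verification prompt builder and the mismatch check follows. The function names are hypothetical, and comparing answers by normalized string equality is a deliberate simplification; production code would more plausibly compare parsed values (days, amounts) or ask a model to judge agreement.

```python
from typing import Dict, List


def build_verify_prompt(question: str, chunks: Dict[int, str]) -> str:
    # The signature takes only a question and excerpts, so the baseline
    # answer cannot be passed in by accident.
    excerpts = "\n".join(f"[chunk {cid}] {text}" for cid, text in chunks.items())
    return (
        "System: Answer using only the provided policy excerpts.\n"
        "If the excerpts do not contain the answer, respond UNKNOWN.\n\n"
        f"Policy excerpts:\n{excerpts}\n\n"
        f"Question: {question}"
    )


def revision_action(baseline_value: str, verified_value: str) -> str:
    """Decide what the revise step should do with one claim."""
    verified = verified_value.strip().lower()
    if verified == "unknown":
        return "soften"   # no supporting evidence: omit or soften the claim
    if baseline_value.strip().lower() != verified:
        return "rewrite"  # verification contradicts the draft
    return "keep"


def unknown_rate(verify_answers: List[str]) -> float:
    """Share of verification answers that were UNKNOWN (a hallucination proxy)."""
    if not verify_answers:
        return 0.0
    unknowns = sum(a.strip().upper() == "UNKNOWN" for a in verify_answers)
    return unknowns / len(verify_answers)
```

A rising UNKNOWN rate after a corpus update usually points at retrieval gaps rather than at the generator.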

Technique decision table

Approach	Best when	Skip when
Single-shot RAG	High volume, simple FAQ, strong retrieval	High-stakes compliance, frequent misquotes
CoVe (factored)	Multi-claim answers, policy/legal/medical prose	Creative writing, opinion, low factual stakes
Self-consistency	Math, classification, discrete choices	Long narrative with entangled factual threads
CoVe + tool calls	Dates, prices, inventory in structured APIs	No machine-readable source of truth
Human-in-the-loop	Irreversible actions (refunds, medical)	Latency-sensitive chat for low-risk topics

Common pitfalls

Verification sees the baseline — negates the whole method; enforce prompt templates that strip the draft from verify calls.
Vague verification questions — “Is this correct?” produces yes-man answers; require atomic, answerable questions.
Too many questions — latency and cost explode; cap and prioritize high-risk claims (money, health, legal).
No UNKNOWN escape hatch — models guess when evidence is missing; train prompts to admit ignorance.
Ignoring contradictions — revision step must drop or rewrite claims when verify conflicts with baseline.
CoVe on every query — route chit-chat and creative tasks to fast paths; reserve CoVe for factual tiers.
Same retrieval for all steps — verification may need different chunks; re-retrieve per question when queries are specific.
No regression suite — policy updates silently break verification templates; maintain labeled claim-level tests.
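The first pitfall is cheap to guard against in a regression suite. The check below is a rough heuristic, not a proof of isolation: it flags a verification prompt that contains any run of `n` consecutive words from the draft. The function name and the default `n=5` are assumptions, and the check can fire falsely when the draft quotes a retrieved excerpt verbatim, so it fits best on the non-excerpt part of the prompt.

```python
def leaks_baseline(verify_prompt: str, baseline: str, n: int = 5) -> bool:
    """Return True if any n-word run from the baseline appears in the prompt."""
    words = baseline.lower().split()
    if not words:
        return False
    # Pad with spaces so matches align on word boundaries.
    prompt = " " + " ".join(verify_prompt.lower().split()) + " "
    if len(words) < n:
        return f" {' '.join(words)} " in prompt
    return any(
        f" {' '.join(words[i:i + n])} " in prompt
        for i in range(len(words) - n + 1)
    )
```

Running this over logged verify prompts after every template change catches a draft that has crept back into the verification context.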

Production checklist

Define which intents require CoVe vs single-shot (risk tiering).
Implement factored verification with baseline hidden from verify prompts.
Use structured JSON for claim extraction and question planning.
Wire verification to retrieval or tools, not parametric memory alone.
Cap verification questions per query; prioritize high-impact claims.
Implement revision rules: drop, soften, or escalate on contradiction.
Log baseline, questions, verify answers, and final text for audit.
Benchmark hallucination rate and latency on a labeled holdout set.
Parallelize factored verify calls; monitor p95 latency and token spend.
Re-run evals after every policy corpus or prompt template change.
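For the audit-logging item, one JSON line per request is often enough. The record below is a hypothetical shape (the class and field names are assumptions) that captures the four artifacts the checklist names.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class CoVeTrace:
    query: str
    baseline: str
    questions: List[str]
    verify_answers: List[str]
    final: str

    def to_json_line(self) -> str:
        # One compact JSON object per line, suitable for append-only logs.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)
```

Keeping the baseline alongside the final text makes it possible to measure the claim drop rate offline without re-running the model.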

Key takeaways

Chain-of-verification drafts an answer, plans atomic verification questions, answers them without the draft visible, then revises — breaking confirmation bias that fuels confident hallucinations.
Factored verification (one call per question) costs more but delivers the independence that makes CoVe work; joint variants are faster but weaker.
Harbor Support cut policy hallucinations from 18% to 6% by pairing RAG with CoVe and a fast-path router for exact FAQ matches.
CoVe complements RAG and self-consistency; it does not replace citations, structured tools, or human review for high-stakes decisions.
Question quality and an UNKNOWN escape hatch matter as much as the four-step loop — vague checks produce vague fixes.