LLM self-consistency reasoning explained
Harbor Analytics' overnight fraud-scoring pipeline classified merchant chargebacks with a single greedy chain-of-thought pass: “list risk signals, weigh each, output approve or review.” On borderline cases (mixed velocity patterns, partial address matches, promo abuse that looked like friendly fraud), one wrong intermediate leap produced confident wrong labels. Switching to self-consistency (sample N=7 independent CoT completions at temperature 0.7, majority-vote the discrete outcome) lifted macro-F1 from 0.74 to 0.81 on a held-out dispute set at 6.8× token cost. Self-consistency is the simplest form of test-time compute: run the same prompt many times with stochastic decoding, then aggregate answers instead of trusting the first trace.
Wang et al. (2022) showed the pattern on grade-school math and commonsense benchmarks: diverse reasoning paths often converge on the correct final answer even when any single path errs. Unlike Tree of Thoughts, there is no explicit search tree or backtracking — only parallel full completions and a vote. This guide covers the self-consistency loop, aggregation variants, how it differs from verifiers and process reward models, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
What self-consistency is
Autoregressive LLMs are stochastic when temperature or top-p sampling
is enabled. A math word problem might be solved correctly in five out of eight
sampled chains; greedy decoding (temperature 0) locks in whichever single path
the model prefers. Self-consistency exploits that diversity:
- Prompt for reasoning — use CoT (“think step by step”) or a domain template so each sample exposes intermediate steps, not only a final token.
- Sample N independent completions with temperature > 0 (typical N = 5–40 depending on task difficulty and budget).
- Extract the final answer from each trace: parsed number, JSON enum, multiple-choice letter, or normalized string.
- Aggregate via majority vote, weighted vote, or confidence from logprobs.
- Return the winning answer; optionally attach the highest-scored supporting trace for audit.
The method assumes answer-level diversity: different reasoning paths should sometimes reach different conclusions, and the correct answer should appear more often than any single wrong one. It does not fix shared systematic bias (every sample misreads the same ambiguous clause) or tasks where intermediate steps are not verbalized.
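The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample_fn` is a hypothetical callable standing in for whatever client produces one stochastic completion per call, and the `ANSWER:` final-line convention is an assumed output format.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumed output convention: each trace ends with a line like "ANSWER: 42".
ANSWER_RE = re.compile(r"^ANSWER:\s*(.+?)\s*$", re.MULTILINE)


def extract_answer(trace: str) -> Optional[str]:
    """Return the last ANSWER value in a trace (lowercased), or None if missing."""
    matches = ANSWER_RE.findall(trace)
    return matches[-1].lower() if matches else None


def self_consistency(
    prompt: str,
    sample_fn: Callable[[str], str],  # hypothetical: one sampled completion per call
    n: int = 5,
) -> Optional[str]:
    """Sample n traces, extract final answers, and return the plurality answer."""
    traces = [sample_fn(prompt) for _ in range(n)]
    answers = [a for a in (extract_answer(t) for t in traces) if a is not None]
    if not answers:
        return None  # every sample failed to parse
    return Counter(answers).most_common(1)[0][0]
```

In practice the `n` calls would be issued in parallel or as one batched request; the sequential list comprehension only keeps the sketch short.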
Self-consistency vs single CoT and tree search
| Technique | Paths explored | Backtracking | Parallelism | Typical cost |
|---|---|---|---|---|
| Single CoT (greedy) | 1 | No | Trivial | 1× |
| Self-consistency | N full chains | No | Excellent (batched) | N× |
| Tree of Thoughts | Branching partial states | Yes | Moderate | Variable, often higher |
| Best-of-N + verifier | N chains, rerank by score | No | Good | N× + judge |
Self-consistency is easier to ship than ToT: no thought granularity design, no state evaluator between steps, no BFS/DFS controller. It parallelizes cleanly on vLLM and API batch endpoints. The tradeoff is that it cannot recover from a wrong first step that every sample shares: if the prompt anchors a false premise, voting amplifies the error. ToT prunes bad branches early; self-consistency pays for full wrong completions.
Aggregation strategies
Majority vote (plurality)
Count extracted final answers; return the mode. Ties break by highest mean logprob, shortest trace, or a secondary LLM judge. Works for classification labels, multiple choice, and numeric answers after normalization (strip commas, round to agreed precision).
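A minimal sketch of plurality voting with numeric normalization and a logprob tie-break follows. The function names and the `(answer, mean_logprob)` input shape are assumptions made for illustration; the precision of two decimal places is an arbitrary example of an "agreed precision".

```python
from collections import Counter, defaultdict
from decimal import Decimal, InvalidOperation
from statistics import mean
from typing import List, Optional, Tuple


def normalize_number(raw: str, places: int = 2) -> Optional[str]:
    """Strip commas and round to an agreed precision; None if not a finite number."""
    try:
        value = Decimal(raw.replace(",", "").strip())
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    return str(round(value, places))


def plurality_vote(samples: List[Tuple[str, float]]) -> str:
    """samples holds (normalized_answer, mean_token_logprob) pairs.

    Returns the most frequent answer; ties go to the tied answer whose
    samples have the highest average mean logprob.
    """
    if not samples:
        raise ValueError("no valid samples to vote on")
    counts = Counter(answer for answer, _ in samples)
    top = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    logprobs = defaultdict(list)
    for answer, logprob in samples:
        logprobs[answer].append(logprob)
    return max(tied, key=lambda answer: mean(logprobs[answer]))
```

Normalizing before counting is what keeps `42` and `42.0` in the same bucket; the tie-break only runs when the top count is shared.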
Weighted vote
Weight each sample by average token logprob, self-reported confidence (“confidence: 0.85” in structured output), or a lightweight verifier score. Reduces influence of fluent but wrong chains. Watch for miscalibrated confidence — models often sound certain when incorrect.
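One way to sketch a weighted vote is to weight each sample by the exponential of its mean token logprob (the geometric-mean token probability). That weighting is an illustrative choice, not something the source prescribes, and the `(answer, mean_logprob)` input shape is assumed.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(samples: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """samples holds (answer, mean_token_logprob) pairs.

    Each sample contributes exp(mean_logprob) to its answer's score.
    Returns the winning answer and each answer's share of total weight.
    """
    if not samples:
        raise ValueError("no valid samples to vote on")
    scores: Dict[str, float] = defaultdict(float)
    for answer, mean_logprob in samples:
        scores[answer] += math.exp(mean_logprob)
    total = sum(scores.values())
    shares = {answer: score / total for answer, score in scores.items()}
    return max(shares, key=shares.get), shares
```

With samples such as one confident `approve` and two low-confidence `review` chains, this can overturn the plurality result, which is exactly why the weights should be checked for calibration on a labeled dev set before they are trusted.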
Universal self-consistency (USC)
For free-form answers without a discrete label, prompt a separate aggregation pass:
“Given these N candidate answers, synthesize the best
response.” Useful for summarization and open QA; adds latency and another
failure surface.
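A minimal sketch of the aggregation pass, assuming a hypothetical `complete_fn` that takes a prompt and returns one completion. The prompt layout is illustrative; only the final instruction sentence comes from the description above.

```python
from typing import Callable, List


def build_usc_prompt(question: str, candidates: List[str]) -> str:
    """Lay out the question and numbered candidates, then ask for a synthesis."""
    numbered = "\n\n".join(
        f"Candidate {i}:\n{text.strip()}" for i, text in enumerate(candidates, 1)
    )
    return (
        f"Question:\n{question.strip()}\n\n"
        f"{numbered}\n\n"
        f"Given these {len(candidates)} candidate answers, "
        "synthesize the best response."
    )


def universal_self_consistency(
    question: str,
    candidates: List[str],
    complete_fn: Callable[[str], str],  # hypothetical: prompt in, completion out
) -> str:
    """Aggregate free-form candidates with one extra model call."""
    # Drop empty candidates and exact duplicates while preserving order.
    unique = list(dict.fromkeys(c.strip() for c in candidates if c.strip()))
    if not unique:
        raise ValueError("no non-empty candidates to aggregate")
    if len(unique) == 1:
        return unique[0]  # nothing to reconcile, so skip the extra call
    return complete_fn(build_usc_prompt(question, unique))
```

Skipping the aggregation call when all candidates already agree avoids the added latency in the easy case.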
Choosing N and temperature
Start with N=5, temperature 0.5–0.8. Plot accuracy vs N on a dev set; diminishing returns often appear after N=10–20. Too-low temperature collapses diversity (all samples identical); too-high temperature produces incoherent traces that parse poorly. Match sampling settings to the task: math benefits from moderate diversity; JSON extraction may need lower temperature with format checks.
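The accuracy-vs-N curve can be computed offline from one large batch of samples per dev example, by voting over the first n answers for each candidate n. The sketch below assumes answers are already extracted and normalized; `accuracy_vs_n` and its argument names are made up for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy_vs_n(
    sampled: Sequence[Sequence[str]],  # sampled[i]: answers for dev example i, in sampling order
    gold: Sequence[str],               # gold[i]: reference label for dev example i
    n_values: Sequence[int],
) -> Dict[int, float]:
    """For each n, majority-vote over the first n samples and score against gold."""
    results: Dict[int, float] = {}
    for n in n_values:
        correct = 0
        for answers, label in zip(sampled, gold):
            # Ties resolve to the answer that was sampled first.
            vote = Counter(answers[:n]).most_common(1)[0][0]
            correct += vote == label
        results[n] = correct / len(gold)
    return results
```

Each example needs at least `max(n_values)` samples. Repeating the sweep per temperature gives the full grid, and the point where the curve flattens is a reasonable production N.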
Answer extraction and normalization
Voting fails silently when parsers disagree. Standardize extraction:
- Structured output: require a final line `ANSWER: <value>` or JSON field `{"decision": "review"}` so regex parsing is reliable.
- Numeric tasks: parse with SymPy or a strict float regex; treat 42 and 42.0 as equivalent.
- Equivalence classes: map “approve”, “APPROVE”, and “pass” to one bucket via a synonym table.
- Invalid samples: discard chains that fail format validation rather than voting “null”; track discard rate as a quality signal.
Log all raw traces for dispute review. Harbor Analytics stores hashes of each sample plus the voted label for regulatory audit.
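The extraction rules above might look like the following sketch. It uses a strict float regex rather than SymPy to stay dependency-free, and the synonym table is a hypothetical example; a real table would come from the task's label set.

```python
import json
import re
from typing import List, Optional, Tuple

ANSWER_LINE = re.compile(r"^ANSWER:\s*(.+?)\s*$", re.MULTILINE)
STRICT_FLOAT = re.compile(r"^[+-]?\d+(\.\d+)?$")
# Hypothetical synonym table mapping surface forms to one bucket each.
SYNONYMS = {"approve": "approve", "pass": "approve", "review": "review", "deny": "deny"}


def normalize(raw: str) -> Optional[str]:
    """Map a raw value to a canonical bucket, or None if it is not recognized."""
    text = raw.strip().replace(",", "")
    if STRICT_FLOAT.match(text):
        return str(float(text))  # "42" and "42.0" both become "42.0"
    return SYNONYMS.get(text.lower())


def extract(trace: str) -> Optional[str]:
    """Prefer the last ANSWER line; fall back to a JSON object with a 'decision' field."""
    lines = ANSWER_LINE.findall(trace)
    if lines:
        return normalize(lines[-1])
    try:
        payload = json.loads(trace)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and isinstance(payload.get("decision"), str):
        return normalize(payload["decision"])
    return None


def extract_all(traces: List[str]) -> Tuple[List[str], float]:
    """Return valid answers plus the discard rate, which should be logged."""
    answers = [a for a in map(extract, traces) if a is not None]
    discard_rate = 1 - len(answers) / len(traces) if traces else 0.0
    return answers, discard_rate
```

Invalid samples are dropped before voting instead of being counted as a `None` bucket, so a common parse failure can never win the vote.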
Harbor Analytics fraud-scoring refactor
The refactor kept the same CoT rubric (velocity, device fingerprint, history, promo flags) but changed inference:
- Samples: N=7 at temperature 0.65, max 800 tokens per chain.
- Extraction: JSON schema with `decision` enum `approve | review | deny` and `top_signals` array; invalid JSON discarded.
- Vote: plurality on `decision`; ties favor `review` (asymmetric error cost).
- Routing: unanimous `approve` skips human queue; split votes always escalate.
- Budget: complexity router sends obvious low-risk merchants to single-sample fast path (<300 ms).
p95 latency rose from 1.1 s to 4.6 s on the review tier, but false denials (merchant churn cost) fell 29%. Token spend per scored transaction went from ~900 to ~6.2k on the hard slice only — roughly 18% of volume.
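The vote and routing rules described for this refactor could be expressed roughly as below. This is an illustrative sketch, not Harbor Analytics' code: the function names are invented, and two behaviors are assumptions the text does not specify (an all-invalid batch falls back to `review`, and any outcome other than a unanimous `approve` goes to the human queue).

```python
import json
from collections import Counter
from typing import List, Optional, Tuple

VALID_DECISIONS = {"approve", "review", "deny"}


def parse_decision(raw: str) -> Optional[str]:
    """Return the decision if the sample is valid JSON matching the schema, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    decision = payload.get("decision")
    if not isinstance(decision, str) or decision not in VALID_DECISIONS:
        return None
    if not isinstance(payload.get("top_signals"), list):
        return None
    return decision


def vote_and_route(raw_samples: List[str]) -> Tuple[str, bool]:
    """Return (decision, needs_human) for one scored transaction."""
    decisions = [d for d in map(parse_decision, raw_samples) if d is not None]
    if not decisions:
        return "review", True  # assumption: nothing parsed, so stay conservative
    counts = Counter(decisions)
    top = max(counts.values())
    tied = [d for d, c in counts.items() if c == top]
    decision = tied[0] if len(tied) == 1 else "review"  # ties favor review
    unanimous = len(counts) == 1
    # Assumption: only a unanimous approve skips the human queue.
    needs_human = not (unanimous and decision == "approve")
    return decision, needs_human
```

Note that unanimity here is measured over the valid samples only; a stricter variant could also escalate whenever any sample was discarded.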
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Single greedy CoT | High volume, tight latency, easy labels | Borderline reasoning, high asymmetric error cost |
| Self-consistency | Discrete answers, parallel inference, moderate budget | Shared wrong first step; free-form synthesis tasks |
| Best-of-N + ORM/PRM | Need ranked traces, continuous quality scores | Extra judge latency and training data |
| Tree of Thoughts | Branchy constraints, early mistakes costly | Simple classification; no clear intermediate states |
| RL reasoning model (o1-class) | Math/code at scale with vendor SLA | Opaque traces; fine-grained business rules |
Common pitfalls
- Temperature 0 with self-consistency: all samples identical; you pay N× cost for zero benefit.
- Fragile answer parsing: voting on unnormalized strings splits one correct answer across buckets.
- Ignoring invalid chains: counting malformed outputs skews the vote toward the most common parse error.
- Systematic prompt bias: a misleading few-shot example poisons every sample; fix the prompt before sampling more.
- Wrong tie-break policy: defaulting to the aggressive class (deny) vs conservative (review) changes business metrics sharply.
- No cost router: running N=20 on every query burns budget; route easy cases to single-sample paths.
- Conflating fluency with correctness: weighted voting on raw logprobs favors verbose wrong answers on some models.
- Missing regression tests: changing temperature or N without re-benchmarking silently drifts accuracy.
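Two of these pitfalls (no cost router, temperature 0 with voting) can be guarded against in one small planning function. Everything here is a hypothetical sketch: `difficulty` is assumed to come from some cheap pre-classifier scoring queries in [0, 1], and the threshold and defaults are placeholders to be tuned on real traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPlan:
    n: int
    temperature: float


def plan_sampling(
    difficulty: float,            # hypothetical score in [0, 1] from a cheap pre-classifier
    easy_threshold: float = 0.3,  # placeholder; tune on labeled traffic
    n_hard: int = 7,
    temperature: float = 0.7,
    max_n: int = 20,              # hard budget cap on samples per query
) -> SamplingPlan:
    """Route easy queries to one greedy pass and hard ones to capped voting."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if difficulty < easy_threshold:
        return SamplingPlan(n=1, temperature=0.0)  # single-sample fast path
    if temperature <= 0.0:
        # Voting over identical greedy samples would cost N times more for no gain.
        raise ValueError("temperature must be > 0 when voting over several samples")
    return SamplingPlan(n=min(n_hard, max_n), temperature=temperature)
```

Keeping the plan as a frozen dataclass makes it easy to log alongside each decision, which helps when a later change to N or temperature needs to be traced in regression results.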
Production checklist
- Define extractable answer format (enum, number, JSON field) before sampling.
- Benchmark accuracy vs N and temperature on a labeled holdout set.
- Implement normalization and invalid-sample discard with logged discard rates.
- Choose plurality vs weighted vote and document tie-break rules.
- Batch parallel samples on inference servers for throughput.
- Add complexity router: easy queries use single-sample fast path.
- Store all sample traces (or hashes) for audit and debugging.
- Monitor token cost per decision, p95 latency, and vote entropy (all-agree vs split).
- Alert when unanimous wrong votes spike — signals prompt or data drift.
- Compare against ToT and verifier baselines before committing to production N.
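For the vote-entropy item in the checklist, one plausible metric is the Shannon entropy of the per-decision vote distribution. The choice of Shannon entropy in bits is an assumption for illustration; any agreement measure that separates all-agree from split votes would serve.

```python
import math
from collections import Counter
from typing import Sequence


def vote_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (bits) of the vote distribution; 0.0 means all samples agree."""
    if not answers:
        raise ValueError("no answers to measure")
    total = len(answers)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(answers).values()
    )
```

A unanimous vote scores 0.0 and an even two-way split scores 1.0, so a rising average entropy on stable traffic is an early hint that the prompt or the input data has drifted.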
Key takeaways
- Self-consistency samples multiple independent chain-of-thought completions and majority-votes the final answer — the simplest parallel test-time compute boost for discrete reasoning tasks.
- It is easier to deploy than Tree of Thoughts (no branch evaluators or search controllers) and parallelizes well, but cannot recover when every sample shares the same wrong first step.
- Harbor Analytics raised fraud-scoring macro-F1 from 0.74 to 0.81 with N=7 voting at 6.8× token cost on the hard slice, plus a router for fast single-sample paths on easy merchants.
- Reliable answer extraction and normalization matter as much as voting — fragile parsers silently destroy the gains.
- Pair self-consistency with routing and budget caps; running many samples on every query is wasteful when a single CoT suffices.
Related reading
- LLM chain-of-thought explained — the reasoning format self-consistency samples
- LLM test-time compute explained — inference scaling, verifiers, and reasoning models
- Tree of Thoughts reasoning explained — branching search when voting is not enough
- LLM sampling and decoding strategies explained — temperature, top-p, and diversity control