
LLM self-consistency reasoning explained

Harbor Analytics' overnight fraud-scoring pipeline classified merchant chargebacks with a single greedy chain-of-thought pass: “list risk signals, weigh each, output approve or review.” On borderline cases (mixed velocity patterns, partial address matches, promo abuse that looked like friendly fraud), one wrong intermediate leap produced confidently wrong labels. Switching to self-consistency (sample N=7 independent CoT completions at a temperature of roughly 0.7, then majority-vote the discrete outcome) lifted macro-F1 from 0.74 to 0.81 on a held-out dispute set at 6.8× token cost. Self-consistency is the simplest form of test-time compute: run the same prompt many times with stochastic decoding, then aggregate the answers instead of trusting the first trace.

Wang et al. (2022) showed the pattern on grade-school math and commonsense benchmarks: diverse reasoning paths often converge on the correct final answer even when any single path errs. Unlike Tree of Thoughts, there is no explicit search tree or backtracking — only parallel full completions and a vote. This guide covers the self-consistency loop, aggregation variants, how it differs from verifiers and process reward models, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

What self-consistency is

Autoregressive LLMs are stochastic when temperature or top-p sampling is enabled. A math word problem might be solved correctly in five out of eight sampled chains; greedy decoding (temperature 0) locks in whichever single path the model prefers. Self-consistency exploits that diversity:

  1. Prompt for reasoning — use CoT (“think step by step”) or a domain template so each sample exposes intermediate steps, not only a final token.
  2. Sample N independent completions with temperature > 0 (typical N = 5–40 depending on task difficulty and budget).
  3. Extract the final answer from each trace — parsed number, JSON enum, multiple-choice letter, or normalized string.
  4. Aggregate via majority vote, weighted vote, or confidence from logprobs.
  5. Return the winning answer; optionally attach the highest-scored supporting trace for audit.
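The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `sample_fn` is a hypothetical stand-in for whatever client calls the model, the final-line `ANSWER:` convention is an assumption, and the stub sampler only replays canned traces so the loop runs without an API.

```python
import re
from collections import Counter
from itertools import cycle
from typing import Callable, List, Optional, Tuple

# Assumed output convention: each trace ends with a line "ANSWER: <value>".
ANSWER_RE = re.compile(r"^ANSWER:\s*(\S+)\s*$")


def extract_answer(trace: str) -> Optional[str]:
    """Step 3: read the value from the last line, or None if it is malformed."""
    lines = trace.strip().splitlines()
    if not lines:
        return None
    match = ANSWER_RE.match(lines[-1].strip())
    return match.group(1).lower() if match else None


def self_consistency(
    prompt: str,
    sample_fn: Callable[[str, float], str],
    n: int = 7,
    temperature: float = 0.7,
) -> Tuple[Optional[str], List[str]]:
    """Steps 2-5: sample n traces, extract answers, return the plurality winner."""
    traces = [sample_fn(prompt, temperature) for _ in range(n)]
    answers = [a for a in (extract_answer(t) for t in traces) if a is not None]
    if not answers:
        return None, traces  # every sample failed to parse
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, traces


def make_stub_sampler(canned: List[str]) -> Callable[[str, float], str]:
    """Hypothetical sampler that replays canned traces instead of calling a model."""
    replay = cycle(canned)
    return lambda prompt, temperature: next(replay)
```

Returning the raw traces alongside the winner keeps the audit trail available to the caller; picking one supporting trace to display is left out of the sketch.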

The method assumes answer-level diversity: different reasoning paths should sometimes reach different conclusions, and the correct answer should appear more often than any single wrong one. It does not correct shared systematic bias (every sample misreads the same ambiguous clause), nor does it help on tasks where intermediate steps are not verbalized.

Self-consistency vs single CoT and tree search

| Technique | Paths explored | Backtracking | Parallelism | Typical cost |
|---|---|---|---|---|
| Single CoT (greedy) | 1 | No | n/a | Trivial |
| Self-consistency | N full chains | No | Excellent (batched) | N× |
| Tree of Thoughts | Branching partial states | Yes | Moderate | Variable, often higher |
| Best-of-N + verifier | N chains, rerank by score | No | Good | N× + judge |

Self-consistency is easier to ship than ToT: no thought granularity design, no state evaluator between steps, no BFS/DFS controller. It parallelizes cleanly on vLLM and API batch endpoints. The tradeoff is that it cannot recover from a wrong first step that every sample shares: if the prompt anchors a false premise, voting amplifies the error. ToT prunes bad branches early; self-consistency pays for full wrong completions.
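Because the N chains never depend on each other, they can be issued concurrently. The sketch below shows the shape of that fan-out with `asyncio.gather`; `fake_sample` is a hypothetical placeholder that sleeps instead of calling a model, and a real deployment would swap in an async client or rely on the inference server's own batching.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_sample(prompt: str, temperature: float, seed: int) -> str:
    """Hypothetical stand-in for one async model call; replace with a real client."""
    await asyncio.sleep(0.01)  # simulates network latency
    decision = "review" if seed % 3 == 0 else "approve"
    return f"(reasoning omitted)\nANSWER: {decision}"


async def sample_parallel(
    prompt: str,
    n: int,
    temperature: float,
    sample_fn: Callable[[str, float, int], Awaitable[str]] = fake_sample,
) -> List[str]:
    """Launch n independent completions at once and wait for all of them."""
    calls = [sample_fn(prompt, temperature, seed) for seed in range(n)]
    return list(await asyncio.gather(*calls))


def run_samples(prompt: str, n: int = 7, temperature: float = 0.7) -> List[str]:
    """Synchronous entry point for callers that are not already async."""
    return asyncio.run(sample_parallel(prompt, n, temperature))
```

With this layout, wall-clock latency is roughly that of the slowest single chain rather than the sum of all N, which is why the cost of self-consistency shows up mainly in tokens.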

Aggregation strategies

Majority vote (plurality)

Count extracted final answers; return the mode. Ties break by highest mean logprob, shortest trace, or a secondary LLM judge. Works for classification labels, multiple choice, and numeric answers after normalization (strip commas, round to agreed precision).
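A small sketch of plurality voting with one of the tie-breaks mentioned above (highest mean logprob). The `Sample` record and its field names are illustrative assumptions, not part of any particular API.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    answer: str          # already extracted and normalized
    mean_logprob: float  # average token logprob of the whole chain (assumed available)


def plurality_vote(samples: List[Sample]) -> str:
    """Return the most frequent answer; break ties by highest average mean_logprob."""
    if not samples:
        raise ValueError("no valid samples to vote on")
    counts = Counter(s.answer for s in samples)
    top = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: compare the average logprob of the chains behind each tied answer.
    logprobs: Dict[str, List[float]] = defaultdict(list)
    for s in samples:
        if s.answer in tied:
            logprobs[s.answer].append(s.mean_logprob)
    return max(tied, key=lambda a: sum(logprobs[a]) / len(logprobs[a]))
```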

Weighted vote

Weight each sample by average token logprob, self-reported confidence (“confidence: 0.85” in structured output), or a lightweight verifier score. Reduces influence of fluent but wrong chains. Watch for miscalibrated confidence — models often sound certain when incorrect.
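One way to turn an average token logprob into a vote weight is to exponentiate it, which gives the geometric-mean token probability of the chain. That choice is an assumption made for this sketch; a verifier score or a calibrated confidence could be plugged in the same way.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(samples: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """samples: (answer, mean_logprob) pairs. Returns the winner and per-answer totals."""
    if not samples:
        raise ValueError("no valid samples to vote on")
    totals: Dict[str, float] = defaultdict(float)
    for answer, mean_logprob in samples:
        # exp(mean logprob) lies in (0, 1]; higher means the model found the chain likelier.
        totals[answer] += math.exp(mean_logprob)
    winner = max(totals, key=lambda a: totals[a])
    return winner, dict(totals)
```

Note that weighting can overturn the plurality: one high-likelihood chain may outweigh two low-likelihood ones. Whether that helps depends entirely on how well the weights track correctness, which is the calibration caveat above.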

Universal self-consistency (USC)

For free-form answers without a discrete label, prompt a separate aggregation pass: “Given these N candidate answers, synthesize the best response.” Useful for summarization and open QA; adds latency and another failure surface.
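The aggregation pass is mostly prompt assembly. A hedged sketch, reusing the instruction wording quoted above; the layout of the prompt and the function name are illustrative, and the call to the model itself is omitted.

```python
from typing import List


def build_usc_prompt(question: str, candidates: List[str]) -> str:
    """Assemble an aggregation prompt that lists every candidate answer by number."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    numbered = "\n\n".join(
        f"Candidate {i}:\n{text.strip()}" for i, text in enumerate(candidates, start=1)
    )
    instruction = (
        f"Given these {len(candidates)} candidate answers, synthesize the best response."
    )
    return f"Question:\n{question.strip()}\n\n{numbered}\n\n{instruction}"
```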

Choosing N and temperature

Start with N=5, temperature 0.5–0.8. Plot accuracy vs N on a dev set; diminishing returns often appear after N=10–20. Too-low temperature collapses diversity (all samples identical); too-high temperature produces incoherent traces that parse poorly. Match sampling settings to the task — math benefits from moderate diversity; JSON extraction may need lower temperature with format checks.
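A back-of-the-envelope model helps explain the diminishing returns. Assume a binary task, an odd N, and samples that are independently correct with probability p. Real samples are correlated, so treat this as an optimistic idealization rather than a forecast.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct."""
    if n % 2 == 0:
        raise ValueError("use an odd n so a binary vote cannot tie")
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


if __name__ == "__main__":
    for n in (1, 5, 11, 21):
        print(n, round(majority_correct_prob(0.6, n), 3))
```

Under these assumptions a per-sample accuracy of 0.6 rises to about 0.68 at N=5, and each further increase in N buys less. The same formula shows the downside: when p is below 0.5, voting makes the result worse than a single sample.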

Answer extraction and normalization

Voting fails silently when parsers disagree. Standardize extraction:

  • Structured output — require a final line ANSWER: <value> or JSON field {"decision": "review"} so regex parsing is reliable.
  • Numeric tasks — parse with SymPy or a strict float regex; treat 42 and 42.0 as equivalent.
  • Equivalence classes — map “approve”, “APPROVE”, and “pass” to one bucket via a synonym table.
  • Invalid samples — discard chains that fail format validation rather than voting “null”; track discard rate as a quality signal.
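The four rules above might look like this in code. The synonym table, the `ANSWER:` line convention, and the function names are assumptions for illustration; a strict regex is used for numbers here in place of SymPy.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical synonym table mapping raw strings to canonical labels.
SYNONYMS = {"approve": "approve", "pass": "approve", "review": "review", "deny": "deny"}
ANSWER_LINE = re.compile(r"^ANSWER:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def normalize_label(raw: str) -> Optional[str]:
    """Case-fold and map through the synonym table; unknown labels become None."""
    return SYNONYMS.get(raw.strip().lower())


def normalize_number(raw: str, precision: int = 2) -> Optional[float]:
    """Strip thousands separators and round, so '42' and '42.0' compare equal."""
    cleaned = raw.replace(",", "").strip()
    if not NUMBER_RE.fullmatch(cleaned):
        return None
    return round(float(cleaned), precision)


def extract_label(trace: str) -> Optional[str]:
    """Use the last ANSWER: line in the trace; return None if absent or unknown."""
    matches = ANSWER_LINE.findall(trace)
    return normalize_label(matches[-1]) if matches else None


def extract_batch(traces: List[str]) -> Tuple[List[str], float]:
    """Return the valid labels plus the fraction of traces that were discarded."""
    labels = [extract_label(t) for t in traces]
    valid = [label for label in labels if label is not None]
    discard_rate = (len(traces) - len(valid)) / len(traces) if traces else 0.0
    return valid, discard_rate
```

Only `valid` goes to the vote; `discard_rate` goes to monitoring.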

Log all raw traces for dispute review. Harbor Analytics stores hashes of each sample plus the voted label for regulatory audit.
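A hash-per-sample audit record could be as simple as the following. The record layout and field names are hypothetical; the source only states that hashes and the voted label are stored.

```python
import hashlib
from typing import Dict, List, Union


def audit_record(
    transaction_id: str, traces: List[str], voted_label: str
) -> Dict[str, Union[str, List[str]]]:
    """Build a compact record: one SHA-256 digest per raw trace plus the final label."""
    return {
        "transaction_id": transaction_id,
        "voted_label": voted_label,
        "sample_hashes": [
            hashlib.sha256(trace.encode("utf-8")).hexdigest() for trace in traces
        ],
    }
```

Hashes prove that a stored trace has not been altered but cannot reconstruct it, so the raw traces still need to live somewhere if reviewers must read them.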

Harbor Analytics fraud-scoring refactor

The refactor kept the same CoT rubric (velocity, device fingerprint, history, promo flags) but changed inference:

  • Samples: N=7 at temperature 0.65, max 800 tokens per chain.
  • Extraction: JSON schema with a decision enum (approve | review | deny) and a top_signals array; invalid JSON is discarded.
  • Vote: plurality on decision; ties favor review (asymmetric error cost).
  • Routing: a unanimous approve skips the human queue; split votes always escalate.
  • Budget: a complexity router sends obvious low-risk merchants to a single-sample fast path (<300 ms).
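The vote and routing rules in this list can be expressed as one small function. This is a sketch of the stated policy, not Harbor Analytics' code. Two details are assumptions because the list does not cover them: unanimity is judged over valid samples only, and every outcome other than a unanimous approve (including a unanimous review or deny, or no valid samples at all) goes to the human queue.

```python
from collections import Counter
from typing import List, Tuple


def vote_and_route(decisions: List[str]) -> Tuple[str, str]:
    """decisions: valid, normalized decisions from the N samples.

    Returns (label, route), where route is "skip_human_queue" or "human_queue".
    """
    if not decisions:
        # Assumption: with nothing to vote on, take the conservative path.
        return "review", "human_queue"
    counts = Counter(decisions)
    top = max(counts.values())
    leaders = [d for d, c in counts.items() if c == top]
    # Plurality wins; any tie resolves to review (asymmetric error cost).
    label = leaders[0] if len(leaders) == 1 else "review"
    unanimous = len(counts) == 1
    if unanimous and label == "approve":
        return label, "skip_human_queue"
    return label, "human_queue"
```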

p95 latency rose from 1.1 s to 4.6 s on the review tier, but false denials (merchant churn cost) fell 29%. Token spend per scored transaction went from ~900 to ~6.2k on the hard slice only — roughly 18% of volume.
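The router's effect on average spend follows from those figures. The calculation below assumes the easy slice (the remaining ~82% of volume) stays at roughly 900 tokens on the single-sample path, which the source implies but does not state outright.

```python
def blended_tokens(easy_tokens: float, hard_tokens: float, hard_share: float) -> float:
    """Volume-weighted average tokens per scored transaction."""
    return hard_share * hard_tokens + (1.0 - hard_share) * easy_tokens


# Figures quoted in the text; all are approximate.
EASY_TOKENS = 900    # single-sample path
HARD_TOKENS = 6200   # N=7 self-consistency path
HARD_SHARE = 0.18    # fraction of volume routed to the hard slice

average = blended_tokens(EASY_TOKENS, HARD_TOKENS, HARD_SHARE)
overall_multiplier = average / EASY_TOKENS
```

That works out to about 1,854 tokens per transaction on average (0.18 × 6,200 + 0.82 × 900), or roughly 2.1× the original spend across all traffic, far below the multiplier on the hard slice alone.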

Technique decision table

| Approach | Best when | Weak when |
|---|---|---|
| Single greedy CoT | High volume, tight latency, easy labels | Borderline reasoning, high asymmetric error cost |
| Self-consistency | Discrete answers, parallel inference, moderate budget | Shared wrong first step; free-form synthesis tasks |
| Best-of-N + ORM/PRM | Need ranked traces, continuous quality scores | Extra judge latency and training data |
| Tree of Thoughts | Branchy constraints, early mistakes costly | Simple classification; no clear intermediate states |
| RL reasoning model (o1-class) | Math/code at scale with vendor SLA | Opaque traces; fine-grained business rules |

Common pitfalls

  • Temperature 0 with self-consistency — all samples identical; you pay N× cost for zero benefit.
  • Fragile answer parsing — voting on unnormalized strings splits one correct answer across buckets.
  • Counting invalid chains: letting malformed outputs vote skews the result toward the most common parse error; discard them and track the discard rate instead.
  • Systematic prompt bias — a misleading few-shot example poisons every sample; fix the prompt before sampling more.
  • Wrong tie-break policy — defaulting to the aggressive class (deny) vs conservative (review) changes business metrics sharply.
  • No cost router — running N=20 on every query burns budget; route easy cases to single-sample paths.
  • Conflating fluency with correctness — weighted voting on raw logprobs favors verbose wrong answers on some models.
  • Missing regression tests — changing temperature or N without re-benchmarking silently drifts accuracy.

Production checklist

  • Define extractable answer format (enum, number, JSON field) before sampling.
  • Benchmark accuracy vs N and temperature on a labeled holdout set.
  • Implement normalization and invalid-sample discard with logged discard rates.
  • Choose plurality vs weighted vote and document tie-break rules.
  • Batch parallel samples on inference servers for throughput.
  • Add complexity router: easy queries use single-sample fast path.
  • Store all sample traces (or hashes) for audit and debugging.
  • Monitor token cost per decision, p95 latency, and vote entropy (all-agree vs split).
  • Alert when unanimous wrong votes spike — signals prompt or data drift.
  • Compare against ToT and verifier baselines before committing to production N.
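Vote entropy, mentioned in the monitoring item above, is straightforward to compute per decision. A minimal sketch using Shannon entropy in bits; the function name is illustrative.

```python
import math
from collections import Counter
from typing import List


def vote_entropy(answers: List[str]) -> float:
    """Shannon entropy (bits) of the vote distribution: 0.0 means all samples agree."""
    if not answers:
        raise ValueError("no answers to measure")
    n = len(answers)
    counts = Counter(answers)
    # Sum p * log2(1/p) over the observed answers.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

A unanimous vote scores 0.0 and an even two-way split scores 1.0, so a rising average over time suggests that more traffic is landing in the borderline region.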

Key takeaways

  • Self-consistency samples multiple independent chain-of-thought completions and majority-votes the final answer — the simplest parallel test-time compute boost for discrete reasoning tasks.
  • It is easier to deploy than Tree of Thoughts (no branch evaluators or search controllers) and parallelizes well, but cannot recover when every sample shares the same wrong first step.
  • Harbor Analytics raised fraud-scoring macro-F1 from 0.74 to 0.81 with N=7 voting at 6.8× token cost on the hard slice, plus a router for fast single-sample paths on easy merchants.
  • Reliable answer extraction and normalization matter as much as voting — fragile parsers silently destroy the gains.
  • Pair self-consistency with routing and budget caps; running many samples on every query is wasteful when a single CoT suffices.
