Guide

LLM chain-of-thought reasoning explained

Large language models predict the next token. On multi-step math, logic puzzles, and planning tasks, jumping straight to the final answer often fails — the model has no room to decompose the problem. Chain-of-thought (CoT) prompting asks the model to write intermediate reasoning steps before the answer, dramatically improving accuracy on tasks that require deliberate computation. The technique is simple to try ("Let's think step by step") but subtle to deploy: it increases latency and cost, can produce plausible-sounding wrong reasoning, and is being superseded by reasoning-native models that internalize the process. This guide explains how CoT works, zero-shot vs few-shot variants, self-consistency and tree-of-thought extensions, how reasoning models differ, when CoT helps vs hurts, and how to evaluate reasoning quality in production alongside prompt engineering and hallucination controls.

What chain-of-thought prompting is

Standard prompting: question in, answer out. Chain-of-thought prompting inserts an explicit reasoning trace between them:

Q: A store sells apples at $2 each. Maria buys 3 apples and pays with a $10 bill. How much change does she receive?

A: Maria buys 3 apples at $2 each, so 3 × $2 = $6. She pays $10, so change is $10 − $6 = $4. The answer is $4.

The intermediate lines are not magic — they allocate more autoregressive steps for the model to "compute" before committing to a final number. Wei et al. (2022) showed that on grade-school math benchmarks, CoT unlocked capabilities that appeared only in larger models; below a certain scale, models could not follow multi-step reasoning even when prompted.

CoT is not the same as showing work to the user for transparency (though it can serve that purpose). It is a compute budget expressed in natural language. Each reasoning token is another forward pass through the transformer, giving the model more opportunities to correct early mistakes before the answer token.

Zero-shot vs few-shot chain-of-thought

Zero-shot CoT

Kojima et al. (2022) found that appending a single phrase — "Let's think step by step" — to the question, without any worked examples, often triggers reasoning behavior in instruction-tuned models. Zero-shot CoT is the cheapest way to experiment: no example curation, no extra context tokens for demonstrations.

Variants that work similarly: "Think through this carefully before answering," "Break the problem into steps," or asking the model to output reasoning inside <thinking> tags and the final answer inside <answer> tags for easier parsing.

Few-shot CoT

Few-shot CoT includes one or more fully worked examples in the prompt — question, reasoning trace, and answer — before the actual question. This is more reliable for domain-specific formats (financial calculations, unit conversions, code debugging) because the examples teach both how to reason and how to format output.

Tradeoff: every example consumes context window and input-token cost. For long-context tasks, few-shot CoT may crowd out retrieved documents in a RAG pipeline. A common pattern is one concise exemplar plus zero-shot phrasing on the real question.

Why CoT improves accuracy (and when it does not)

CoT helps most on tasks with compositional structure:

  • Arithmetic and word problems (multi-step math)
  • Symbolic logic, constraint satisfaction, and puzzle games
  • Multi-hop factual questions where each hop must be correct
  • Planning and scheduling with interacting constraints
  • Code debugging where the bug is several steps from the symptom

CoT often does not help (or actively hurts) on:

  • Creative writing — extra "reasoning" adds fluff without improving prose.
  • Simple retrieval — "What is the capital of France?" needs no decomposition.
  • Highly parallel pattern matching — sentiment classification, entity tagging.
  • Tasks the model cannot do at all — CoT cannot fix missing knowledge; it may fabricate convincing but wrong intermediate steps (see hallucinations).

The failure mode to watch: confident wrong reasoning. A model that writes "Step 1: … Step 2: … Therefore the answer is 42" can persuade human reviewers even when Step 1 was wrong. Never treat the reasoning trace as ground truth — verify the final answer against tools, retrieval, or independent checks.

Self-consistency: vote across multiple reasoning paths

Wang et al. (2022) proposed self-consistency: sample multiple CoT completions at non-zero temperature, extract the final answer from each, and take the majority vote. Different reasoning paths may diverge early but converge on the correct answer; wrong paths often disagree with each other.

Typical setup: generate 5–20 completions, parse the final numeric or multiple-choice answer from each, return the mode. Accuracy gains on GSM8K-style math can be substantial — at the cost of generation tokens and latency.

Production tips:

  • Use structured output or regex to extract answers reliably from free-form CoT.
  • Stop early if all samples agree — no need to burn budget on unanimous paths.
  • Pair with a cheap verifier model or calculator tool for high-stakes decisions.
  • Log disagreement rate as a quality signal; high variance means low confidence.

Beyond linear CoT: tree-of-thought and search

Linear chain-of-thought commits to one reasoning path. Tree-of-thought (ToT) and graph-of-thought (GoT) treat reasoning as search: the model proposes multiple intermediate steps, a scorer evaluates promising branches, and unpromising paths are pruned. This mirrors how humans backtrack when a line of reasoning dead-ends.

ToT shines on puzzles with exploration (Game of 24, mini crosswords, strategic planning) where a single greedy CoT path fails. Costs are higher still — multiple LLM calls per depth level — so it is rarely the default in latency-sensitive APIs. Use when accuracy dominates and problems are small enough to search (bounded branching factor, shallow depth).

Agent frameworks sometimes implement a lighter version: the model proposes a plan, executes tool calls, observes results, and revises — functionally a tree with tool nodes as leaves. See AI agents and tool use for that architecture.

Reasoning models vs prompt-level CoT

A new class of reasoning-native models — OpenAI o-series, DeepSeek-R1, QwQ, and similar — are trained with reinforcement learning on chains that lead to correct answers. They produce long internal reasoning traces (sometimes hidden from API consumers) before a concise final response.

Differences from classic CoT prompting:

  • Learned behavior — reasoning is baked into weights, not only elicited by a phrase.
  • Hidden reasoning tokens — providers may bill for reasoning tokens separately and omit them from the visible response.
  • Higher baseline cost — even "simple" questions may trigger long internal chains unless the API offers a reasoning-effort knob.
  • Less prompt sensitivity — "Let's think step by step" adds little when the model already defaults to deep reasoning.

When to use which: prompt CoT on fast, cheap general models (GPT-4o-mini, Claude Haiku, open-weight 7B–70B) for math-heavy workflows where you control token budget. Reach for reasoning models on competition-level math, complex code synthesis, or multi-hour agent tasks where error cost exceeds inference cost.

Latency, cost, and UX tradeoffs

CoT reasoning is paid in tokens. A 50-token direct answer might become 300–800 tokens with visible reasoning — 6–16× output cost plus added time-to-first-token if you stream the whole trace.

  • Stream reasoning separately — show a collapsible "thinking" panel so users see progress without reading every step.
  • Cap reasoning length — "Use at most 5 steps" prevents runaway chains.
  • Route by difficulty — a classifier sends easy queries to direct answers and hard queries to CoT or a reasoning model.
  • Tool augmentation — offload arithmetic to a calculator, SQL to a database, instead of trusting the model to compute in prose.
  • Cache reasoning patterns — similar to prompt caching, static system prompts with CoT instructions benefit from prefix reuse.

Evaluating reasoning quality

Accuracy on the final answer is necessary but not sufficient. Production systems should track:

  • Answer correctness — exact match on math; LLM-as-judge or rubric scoring on open-ended tasks (see LLM evaluation).
  • Reasoning faithfulness — do intermediate steps actually support the conclusion, or are they post-hoc rationalization?
  • Parse failure rate — how often extraction of the final answer fails.
  • Token efficiency — accuracy per dollar across CoT vs direct vs reasoning-model routes.
  • User override rate — how often humans reject CoT-backed answers.

Benchmark suites commonly used: GSM8K, MATH, ARC-Challenge, BIG-Bench hard subsets. Run them on your exact prompt template — generic leaderboard scores do not transfer when your system prompt, tools, and retrieval differ.

Production checklist

  1. Classify queries: direct answer vs CoT vs reasoning model vs tool-augmented.
  2. Start with zero-shot CoT; add one few-shot exemplar only if format drift appears.
  3. Separate reasoning and answer in the prompt; parse the answer field programmatically.
  4. Never show raw CoT as authoritative — label it as model-generated working notes.
  5. Use calculators, code interpreters, or retrieval for facts the model should not infer.
  6. For high-stakes math, enable self-consistency (3–5 samples) or a verifier pass.
  7. Set max output tokens and step limits to control cost spikes.
  8. Log reasoning length, latency, and disagreement rate per route.
  9. A/B test CoT against direct answers on real user traffic — lab benchmarks lie.
  10. Re-evaluate when switching models; CoT sensitivity varies by training generation.

Key takeaways

  • CoT allocates compute — intermediate tokens give models room to decompose hard problems before answering.
  • Zero-shot is cheap to try — "Let's think step by step" costs one phrase; few-shot adds reliability for structured domains.
  • Self-consistency trades cost for accuracy — sample multiple chains and vote on the final answer.
  • Reasoning models internalize CoT — prompt tricks matter less; budget for hidden reasoning tokens instead.
  • Verify outputs — plausible reasoning can mask wrong answers; tools and eval harnesses are mandatory for production.

Related reading