Guide

LLM MCTS reasoning explained

Harbor Analytics ships a spreadsheet copilot that answers multi-step word problems: inventory turns, margin bridges, and unit-economics puzzles where one wrong intermediate number poisons the final answer. A single chain-of-thought pass scored 61% on their 150-problem eval. Engineers wrapped the same checkpoint in Monte Carlo tree search (MCTS): each node is a partial derivation step, UCB1 picks which branch to expand, the LLM proposes child moves and runs cheap rollouts to terminal answers, and visit counts backpropagate success signals up the tree. Accuracy rose to 84%; p95 latency grew from 3.1 s to 14.6 s. The win came from sample-efficient deep search, not from a bigger model.

LLM MCTS adapts the select-expand-simulate-backpropagate loop from game AI to language reasoning. The LLM acts as both policy (proposing next steps) and rollout policy (playing out random or greedy completions to estimate leaf value). Unlike fixed-width beam search, MCTS allocates more simulations to promising subtrees. Unlike breadth-first tree-of-thought, it can go deep on a single line when rollouts look strong. This guide covers the MCTS loop for text states, UCB and exploration constants, LLM priors and value heads, rollout design, the Harbor Analytics refactor, a technique decision table versus tree-of-thought and best-of-N, pitfalls, and a production checklist.

The MCTS loop for language states

Classic MCTS builds a search tree over game positions. For LLM reasoning, a state is the problem statement plus all accepted intermediate steps so far (equations, code lines, plan items). An action is one next step proposed by the model. The four phases repeat until a simulation budget is exhausted:

  1. Select — walk from the root using UCB1 (or PUCT with neural priors) until you reach a node that is not fully expanded or is terminal.
  2. Expand — ask the LLM for k candidate next steps; add one new child (or all k in batched variants).
  3. Simulate (rollout) — from the new child, let the LLM greedily or stochastically complete the derivation to a final answer; score with a verifier or outcome check.
  4. Backpropagate — update visit count N and accumulated value W for every ancestor; value is typically win=1 / loss=0 or a continuous verifier score.

After B simulations, return the action at the root with highest visit count (robust child) or highest mean value. Repeat action selection for multi-step problems: re-root at the chosen child and run another MCTS episode, or run one deep tree if states are shallow.

UCB, priors, and exploration

UCB1 balances exploitation and exploration:

UCB(child) = W/N + c * sqrt( ln(N_parent) / N_child )

c controls exploration (common values 1.0–1.4 for text). PUCT blends in an LLM prior P(a|s) from a policy prompt or logprob ranking:

PUCT(child) = Q/N + c_puct * P(a|s) * sqrt(N_parent) / (1 + N_child)

Priors matter when branching is wide: without them, early random rollouts starve good lines. Use the same checkpoint to score “how likely is this next step?” with a short completion or token logprobs if your stack exposes them. Cap children per node (k = 3–5) to keep expansion cost bounded.

Virtual loss (incrementing N during in-flight rollouts) prevents parallel workers from hammering the same child in batch MCTS deployments.

Rollouts, verifiers, and value signals

Rollout quality dominates MCTS strength. Options:

  • Fast greedy rollout — temperature 0 completion to final answer; cheapest; biased toward local errors.
  • Stochastic rollout — sample 2–3 rollouts per expansion and average value; reduces variance at 2–3× cost.
  • Verifier-terminated rollout — stop when a deterministic checker fires (unit mismatch, syntax error, constraint violation) and assign 0 value immediately.
  • Learned value head — a process reward model or outcome model scores partial states without full rollout.

For math, pipe final answers through a symbolic checker or numeric tolerance compare. For code, run partial AST parse plus unit tests on completion. Hybrid value = α · rollout_win + (1−α) · PRM_score often cuts rollout count by half while preserving accuracy.

State representation and action grammar

MCTS needs stable state transitions. Best practices:

  • Structured steps — JSON list of {step_id, expression, result} rather than free-form prose that parsers reinterpret each visit.
  • Typed actions — “define variable”, “apply formula”, “simplify”; prevents duplicate semantically identical children.
  • Terminal detection — explicit FINAL_ANSWER: token or schema validation before backprop.
  • Dedup hashing — hash normalized state strings to merge transpositions (same partial derivation reached by different paths).

Transposition tables are underused in LLM MCTS but valuable when order of commutative steps varies (“add A then B” vs “add B then A”).

Harbor Analytics refactor (worked example)

Problem class: GSM-style business math (margins, growth rates, break-even). State = problem text + list of derived quantities. Action = one derivation step with explicit unit labels. Policy prompt asks for three ranked next steps; PUCT uses logprob rank as prior.

Configuration: B = 40 simulations per decision, k = 4 expansions max per node, c_puct = 1.25, greedy rollout with temperature 0.3 for tie-break diversity. Verifier: sympy numeric compare within 0.1% plus unit dimensional analysis on intermediate steps.

Results on 150 held-out problems (unchanged 8B instruct model):

  • Exact-match accuracy: 61% → 84%
  • Problems with wrong intermediate caught before final: 12% → 47%
  • Mean simulations used: 38 (early terminal on easy items)
  • p95 latency: 3.1 s → 14.6 s
  • Mean LLM calls per problem: 1 → 52

Remaining failures: ambiguous problem statements where multiple interpretations scored equally on rollouts, and long chains (>8 steps) where simulation budget exhausted before depth. Raising B to 80 recovered 4 points on hard tier only; they tiered budget by user plan instead.

Technique decision table

ApproachStrengthWeaknessUse when
Linear CoT One pass; lowest latency No recovery from early errors Short chains; cheap verifier at end only
Beam / BFS ToT Systematic layer-by-layer pruning Fixed width wastes budget on weak layers Uniform branching depth; good per-step value function
LLM MCTS Adaptive depth; concentrates simulations on strong subtrees Complex orchestration; rollout variance Uneven branching; simulator or verifier at leaves; depth varies
Best-of-N Simple parallelization No intermediate scoring Short outputs; strong end verifier only
Learned policy (RL) Fast inference after training Expensive offline pipeline MCTS cost too high at scale; abundant labeled trajectories

See test-time compute for how MCTS fits the broader inference-budget toolbox alongside o1-style adaptive depth and verifier-guided search.

Common pitfalls

  • Rollout-policy mismatch — expansion uses diverse samples but rollout is greedy; values mis-rank children. Match temperatures or distill a rollout policy.
  • Unnormalized rewards — mixing 0/1 wins with continuous PRM scores without scaling breaks UCB comparisons.
  • No transposition handling — tree explodes on commutative steps; add state hashing.
  • Too few simulationsB < 20 often behaves like noisy best-of-N; calibrate B on a hard slice.
  • Expensive rollouts — full code execution every simulation; use lightweight partial checks plus one final integration test.
  • Ignoring priors — uniform UCB on 10+ children wanders randomly until visit counts stabilize.
  • Unbounded tree memory — cap total nodes per request; prune low-UCB subtrees after each action commit.

Production checklist

  • Define state schema, action grammar, and terminal conditions.
  • Choose UCB vs PUCT; document c, c_puct, k, B.
  • Implement verifier at rollout end and optional mid-rollout hard stops.
  • Calibrate simulation budget vs accuracy on easy/medium/hard tiers.
  • Add transposition hashing if steps are reorderable.
  • Log per-node visits, values, and chosen rollouts for debugging.
  • Cap wall-clock time; fall back to beam ToT or linear CoT on timeout.
  • A/B against ToT at equal token budget before defaulting to MCTS.
  • Cache LLM completions keyed by (state, action) across simulations.
  • Consider distilling high-visit MCTS trajectories into SFT or DPO data.

Key takeaways

  • MCTS adapts select-expand-simulate-backpropagate to LLM reasoning states.
  • UCB and PUCT priors allocate simulation budget to promising derivation lines.
  • Harbor Analytics lifted multi-step math accuracy from 61% to 84% with 40 simulations per step.
  • Rollout design and verifiers matter more than raw simulation count.
  • Compare against tree-of-thought and best-of-N at equal token budget before shipping.

Related reading