Guide

LLM tree-of-thought reasoning explained

Harbor Logistics runs an internal copilot that drafts same-day delivery routes: twelve stops, two vans, hard time windows, and a cold-chain constraint on three pallets. A single linear chain-of-thought answer — “visit stops in geographic order” — produced feasible routes only 54% of the time on their 200-scenario eval set. Early mistakes (locking the wrong anchor stop) could not be recovered because the model committed to one narrative. Engineers wrapped the same checkpoint in tree-of-thought (ToT) search: at each step the model proposes three candidate next-stop assignments, a lightweight value function scores partial routes, and breadth-first search keeps the top two branches until all stops are assigned. Feasible-route rate rose to 87%; median latency grew from 2.4 s to 9.1 s. The gain came from exploring reasoning, not from training a larger model.

Tree-of-thought treats LLM reasoning as search over a tree of intermediate states instead of one greedy chain. Each node is a partial solution (a thought); edges are proposed extensions; a controller (BFS, DFS, or beam search) decides which branches to expand and which to prune. This guide covers the ToT loop, thought decomposition and prompting, value functions vs verifiers, search controllers, the Harbor Logistics refactor, a technique decision table versus linear CoT, self-consistency, and best-of-N, pitfalls, and a production checklist.

From one chain to a search tree

Standard chain-of-thought (CoT) asks the model to emit step 1, then step 2, then step 3 in one pass. If step 1 is wrong, later steps rationalize the error. ToT breaks the problem into thought steps — semantically meaningful chunks (one stop assignment, one algebraic manipulation, one code refactor) — and generates multiple candidate thoughts per step before moving on.

The generic loop:

Decompose the task into a sequence of thought types (fixed upfront or chosen dynamically).
Propose k candidate thoughts for the current partial state (sampling or diverse prompts).
Evaluate each candidate with a value function or hard verifier.
Select which nodes to expand next (BFS: best breadth layer; DFS: deepest promising path; beam: fixed width b).
Terminate when a leaf passes completion checks or the budget is exhausted; backtrack if all branches fail.

Unlike self-consistency, which samples full independent chains and votes on the final answer, ToT can prune bad prefixes early and reuse promising partial work. Unlike best-of-N, which scores only complete outputs, ToT scores intermediate states — critical when wrong early choices make recovery impossible.

Thought decomposition and prompting

Decomposition quality dominates ToT success. Good thought units are:

Locally evaluable — you can score the partial state without seeing the full solution (remaining slack in a schedule, partial sum in math).
Small enough — one decision per thought; “plan the entire route” is not a thought, “assign stop 7 to van B” is.
Stable under paraphrase — the model should propose comparable candidates at each node (same schema, different content).

Prompt pattern for proposals at node s:

Given partial state:
{state_json}

Propose exactly 3 distinct next thoughts.
Each thought must be one sentence describing the next action only.
Format: numbered list 1–3.

For evaluation prompts, ask the model (or a separate judge) to rate 1–10 how promising the partial state is toward a valid completion, or pipe the state into a deterministic checker (time-window SAT, unit-test partial pass). Hybrid scoring — LLM heuristic plus hard constraint filter — is common in production.

Value functions, verifiers, and pruning

Every ToT implementation needs a scoring function V(state). Options:

LLM self-eval — fast to ship, noisy; calibrate with a held-out set and cap trust at early depths.
Process reward models (PRMs) — step-level classifiers trained on human labels; see process reward models.
Deterministic verifiers — constraint solvers, partial SQL execution, simulator steps; return −∞ on violation.
Learned heuristics — regression on features (remaining capacity, slack minutes) from historical optimal routes.

Pruning drops branches below a threshold or outside the top-b per layer. Aggressive pruning cuts cost but risks eliminating the only valid path; conservative pruning approaches brute force. Harbor Logistics used b = 2, k = 3 proposals per node, and a hard filter that discarded any partial route violating cold-chain dwell time before LLM scoring ran.

BFS, DFS, and beam controllers

The search strategy trades exploration breadth against depth:

Controller	Behavior	Best when
BFS (by value)	Expand all nodes at depth d, keep top b, proceed to d+1	Early mistakes are common; optimal prefix is unclear (planning, puzzles)
DFS (with backtrack)	Go deep on highest-V child; backtrack on failure	Depth is modest; good value function; lower memory
Beam search	Fixed-width frontier at each step (ToT paper default)	Balanced cost; known thought count

Token budget accounting: each proposal costs one short completion; each evaluation costs another. Total spend scales roughly O(depth × k × (1 + eval_cost) × effective_branches). Cap depth, k, and b per user tier. Async jobs (route planning, code migration plans) can afford BFS; chat assistants often use beam 2 with depth 3.

Harbor Logistics refactor (worked example)

Problem encoding: JSON with stops (geo, window, service minutes), vans (capacity, shift end), and tagged cold pallets. Thought = assign one unrouted stop to a van. Proposal prompt includes current assignments and remaining stop IDs. Value function = remaining slack minutes (learned regression) minus penalty for geographic zig-zag.

Controller: BFS with b = 2, k = 3, max depth 12. Deterministic filter rejects infeasible windows before scoring. On leaf nodes, a full OR-Tools check confirms feasibility; if none pass, return the highest-scoring partial with a “needs human review” flag.

Results on 200 held-out scenarios (unchanged base model):

Feasible routes: 54% → 87%
Median optimality gap vs solver baseline: 18% → 7%
p95 latency: 2.4 s → 9.1 s
Mean LLM calls per request: 1 → 28

Failure modes that remained: ambiguous geocodes (model invented plausible but wrong addresses) and equal-value ties at depth 4 where both branches looked fine until stop 10. Adding a tie-break diversity bonus (prefer under-used vans) recovered another 4 percentage points on a subset.

Technique decision table

Approach	Strength	Weakness	Use when
Linear CoT	One pass, lowest latency	No recovery from early errors	Simple multi-step tasks with low branching factor
Self-consistency	Easy parallelization; no tree bookkeeping	Wastes compute on full bad chains	Final answer easy to vote; steps not independently scorable
Best-of-N	Strong when full-output verifier exists	Ignores partial progress	Code generation with unit tests; short outputs
Tree-of-thought	Prunes bad prefixes; reuses partial states	Complex orchestration; many LLM calls	Planning, puzzles, multi-commitment search spaces
MCTS / learned policy	Sample-efficient deep search	Harder to implement; needs rollouts	Very large branching; simulator available

See test-time compute for how ToT fits the broader inference-budget toolbox alongside o1-style adaptive depth and verifier-guided search.

Common pitfalls

Thoughts too large — whole-plan nodes cannot be scored mid-flight; search collapses to best-of-N.
Unreliable value function — LLM self-grades correlate with fluency, not correctness; always anchor with deterministic checks where possible.
Duplicate proposals — low temperature yields three identical “next thoughts”; enforce diversity (different stop IDs, re-prompt with “avoid previous options”).
Unbounded depth — runaway token bills; set hard depth and call caps per request.
State drift — free-text partial states parse inconsistently; use structured JSON thoughts.
No baseline comparison — ToT adds latency; measure against CoT + self-consistency on the same eval before shipping.
Ignoring cache — identical sub-states across branches should reuse KV cache / prefix cache where the stack supports it.

Production checklist

Define thought granularity and a machine-readable state schema.
Choose controller (BFS / DFS / beam) and document k, b, max depth.
Implement deterministic hard filters before soft value scoring.
Calibrate the value function on a stratified eval set (report prune regret).
Cap LLM calls and wall-clock time; degrade to linear CoT on timeout.
Log the search tree (parent pointers, scores) for debugging and compliance.
Measure feasible-rate / accuracy vs p95 latency across budget tiers.
A/B against self-consistency at equal token budget before defaulting to ToT.
Version prompts for propose and evaluate steps independently.
Pair online ToT with offline PRM or DPO training when search cost is too high at scale.

Key takeaways

ToT searches over intermediate thoughts instead of one greedy chain.
Value functions and hard verifiers enable early pruning of dead branches.
Harbor Logistics lifted feasible delivery routes from 54% to 87% with BFS ToT.
BFS favors planning puzzles; beam search balances cost on fixed-depth tasks.
Compare against self-consistency and best-of-N at equal token budget before shipping.