Guide

Tree of Thoughts reasoning explained

Harbor Support's tier-2 escalation bot handled multi-constraint refund disputes with a single chain-of-thought pass: “check policy, check payment rail, decide outcome.” On edge cases — partial refunds across Stripe and on-chain SOL, conflicting timestamps, and promotional credits — the linear trace often committed to the wrong branch by step two and never recovered. Replacing one-shot CoT with Tree of Thoughts (ToT) search (generate multiple next-step hypotheses, score partial states, prune dead ends, backtrack) lifted correct-resolution rate from 61% to 79% on a held-out dispute set at 4.2× token cost. ToT treats reasoning as search over a tree of intermediate thoughts, not a single left-to-right monologue.

Yao et al. (2023) and Long (2023) formalized the pattern for puzzles, creative writing, and math: the LLM proposes candidate “thoughts” (coherent reasoning steps or partial plans), a separate prompt or judge model scores whether each state is promising, and a controller (breadth-first or depth-first) expands the frontier until a goal state is found or budget is exhausted. This guide covers the ToT loop, how it differs from self-consistency and test-time compute, BFS vs DFS tradeoffs, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

What Tree of Thoughts is

Standard autoregressive generation picks one next token at a time; early mistakes compound. Chain-of-thought exposes intermediate steps but still follows a single path. Tree of Thoughts generalizes CoT into explicit search:

Decompose the task into steps where each step is a “thought” — a short paragraph, equation line, policy clause citation, or game move.
Generate k candidate thoughts at each node (sampling or diverse prompting).
Evaluate each partial state: is it valid, on track, or hopeless? Scores can be LLM self-critique, a rubric, or a trained value head.
Search with BFS (keep top-b states per level) or DFS (go deep, backtrack on low scores).
Terminate when a leaf satisfies goal checks; return the winning trace and final answer.

ToT shines when the problem has explorable structure: Game of 24, crosswords, itinerary planning with hard constraints, or multi-step code edits where wrong assumptions early invalidate the whole patch. It is overkill for factual QA where retrieval plus one CoT pass suffices.

ToT vs chain-of-thought and self-consistency

Technique	Paths explored	Backtracking	Typical cost
Chain-of-thought	1 linear trace	No	Low
Self-consistency	`N` full independent chains, majority vote on answer	No (discard losers only at end)	Medium (×`N`)
Tree of Thoughts	Branching partial states, pruned by scores	Yes	High (depends on `k`, depth, `b`)

Self-consistency samples entire solutions in parallel; if every sample shares the same wrong first step, voting does not help. ToT can abandon a bad first step before spending tokens on a full completion. Conversely, self-consistency is simpler to implement (no state representation or evaluator) and parallelizes cleanly on batched inference servers.

Reasoning-native models (o1-class) internalize long hidden chains; they overlap with ToT's goal but do not replace explicit search when you need auditable branches, domain-specific validators, or integration with external tools (calculators, policy APIs).

The ToT control loop

Thought generation

Prompt the model with the problem, the path so far, and: “Propose k distinct next steps.” Diversity matters — use temperature, different system nudges (“optimistic vs conservative interpretation”), or ask for explicitly contradictory hypotheses in regulated domains.

State evaluation

The evaluator answers: How promising is this partial state on a 1–10 scale? or Does this violate any hard constraint? Separating generator and evaluator reduces sycophancy (the same model praising its own bad ideas). Options:

LLM-as-judge with a fixed rubric and few-shot calibration.
Programmatic checks — syntax parse, unit tests, SQL EXPLAIN, balance-sheet identities.
Learned value model — a small classifier on thought embeddings (similar to process reward models).

Search strategy

Breadth-first (BFS) keeps the top-b scored states at each depth. Good when the correct path is shallow but many first moves look plausible (creative writing outlines, policy interpretation forks).

Depth-first (DFS) commits to the highest-scored child until failure, then backtracks. Better when depth is moderate and evaluation is noisy — fewer parallel states, lower memory. Cap max depth and maintain a global best-so-far answer to avoid infinite wandering.

Termination and extraction

Define goal predicates: puzzle solved, all constraints satisfied, or evaluator score above threshold. Return the trace for user display (support audit logs) and the structured outcome (refund amount, code patch, plan JSON).

Harbor Support escalation refactor

The refactor modeled each dispute as a tree of policy interpretations rather than a single narrative:

Root — ingest ticket, payment IDs, timestamps, SKU rules.
Level 1 thoughts — three branches: full refund eligible, partial only, deny with appeal path.
Level 2 — per branch, allocate across rails (card chargeback window vs on-chain finality).
Evaluator — hybrid: hard SQL lookups on ledger tables (pass/fail) plus LLM judge for ambiguous promo-text clauses.
Search — BFS with b=2, k=3, max depth 4; timeout 12 s before fallback to human queue.

Token spend rose from ~1.2k to ~5.1k per hard ticket, but human handle time dropped 34% because agents received ranked traces with cited policy lines. Simple tickets still route through a single CoT path (<200 ms) via a complexity router.

Technique decision table

Approach	Best when	Weak when
Single CoT	Most support, extraction, single-hop math	Irreversible early mistakes, many constraints
Self-consistency	Stochastic tasks, no clear intermediate states	Shared systematic bias across samples
Tree of Thoughts	Planning, puzzles, multi-constraint policy/code	High latency budget unavailable; fuzzy evaluators
MCTS + world model	Games, simulations with rollouts	No cheap rollout; see MCTS guide
RL-trained reasoning model	Math/code at scale with vendor SLA	Opaque traces, fine-grained business rules

Common pitfalls

Same model generates and judges — inflated scores on fluent nonsense; split roles or add programmatic gates.
Unbounded branching — k=5 at depth 8 explodes cost; cap b, depth, and total token budget.
Vague thoughts — “consider alternatives” is not a searchable state; enforce concrete, checkable step formats.
No early-exit on easy cases — running full ToT on every query wastes money; route by classifier or heuristics.
Evaluator drift — rubric changes without regression tests; track agreement with human labels.
Ignoring latency SLOs — serial LLM calls add up; batch evaluations where vLLM allows.
Missing audit trail — regulated workflows need stored trees, not only final answers.
ToT on retrieval-heavy QA — fix RAG first; search over bad context multiplies errors.

Production checklist

Define thought granularity and max depth for the task domain.
Implement separate generator and evaluator (LLM, code, or hybrid).
Choose BFS vs DFS from branching factor and evaluation noise.
Set hard caps: k, b, depth, tokens, wall-clock timeout.
Add complexity router: easy queries skip ToT entirely.
Log full search trees for debugging and compliance review.
Benchmark against CoT and self-consistency on a labeled holdout set.
Monitor cost per resolved ticket and p95 latency separately.
Version prompts and rubrics; alert on evaluator score distribution shifts.
Provide human fallback when all branches score below threshold.

Key takeaways

Tree of Thoughts turns LLM reasoning into explicit search — generate multiple next-step hypotheses, score partial states, prune and backtrack.
It beats single-path CoT when early mistakes are costly and constraints are branchy; self-consistency alone cannot recover from a wrong first step shared across samples.
Harbor Support raised hard-dispute accuracy from 61% to 79% with BFS ToT at 4.2× token cost, plus a router that keeps simple tickets on fast CoT.
Separate generator and evaluator, cap branching, and log traces — fluent judges praising bad thoughts is the main failure mode.
Use ToT only where search structure exists; retrieval, routing, and reasoning models address different bottlenecks.