Guide

Monte Carlo Tree Search explained

Chess engines once relied on alpha-beta pruning over handcrafted evaluation functions. Go, with its vast branching factor, resisted that approach for decades. In 2006, Monte Carlo Tree Search (MCTS) changed the landscape: instead of evaluating every line to a fixed depth, MCTS builds a partial search tree guided by random rollouts (playouts to terminal or cutoff states) and statistically aggregates which moves win most often. Combined with deep neural networks, MCTS powered AlphaGo's victory over Lee Sedol and remains the backbone of strong board-game AI, turn-based strategy planners, and research into LLM reasoning. This guide covers the four MCTS phases, the UCB1 selection rule, neural enhancements, a Harbor Chess midgame planning worked example, an algorithm decision table, common pitfalls, and a practitioner checklist — alongside our reinforcement learning guide, multi-armed bandits guide, and game behavior trees guide.

What problem MCTS solves

Planning in games and decision problems means choosing an action when the future is uncertain and the search space is enormous. Brute-force minimax works when you can search deep enough with a reliable evaluator — checkers, chess at moderate depth. It fails when branching explodes (Go, hex, complex strategy games) or when no hand-tuned evaluation function exists.

MCTS treats each legal move from the current state as an arm in a multi-armed bandit problem at every tree node. It allocates simulation budget to moves that look promising or are poorly explored, then picks the move with the best empirical win rate after a fixed number of iterations (or time budget). No domain-specific evaluation function is required — only a simulator that can apply moves and detect terminal outcomes.

When MCTS is a good fit

  • Discrete actions with a manageable branching factor per step (roughly 10–400 legal moves).
  • Fast rollouts — random play to game end (or a depth cap) must be cheap.
  • Win/loss/draw signal or scalar outcome at terminal states.
  • Perfect or near-perfect information (standard MCTS assumes full observability).

MCTS struggles in continuous control (robot joint angles), real-time games requiring sub-16 ms decisions without GPU acceleration, and domains where rollouts are slow (full physics sim per step). In those cases, learned value networks, behavior trees, or model-free RL often fit better.

The four phases: selection, expansion, simulation, backpropagation

Each iteration of MCTS runs four steps. After N iterations (commonly 1,000–100,000 depending on time budget), the root's child with the highest visit count (or highest win rate) is played.

1. Selection

Start at the root. While the current node is fully expanded and not terminal, descend to the child with the highest UCB1 (Upper Confidence Bound) score — balancing exploitation (high win rate) and exploration (few visits):

UCB1(i) = wi / ni + c √(ln N / ni)

Here wi is total win value backpropagated through child i, ni is visit count for child i, N is parent visit count, and c is an exploration constant (classically √2 ≈ 1.41). The second term grows for rarely visited children, forcing the tree to try uncertain lines. This is the same exploration principle as UCB bandit algorithms.

2. Expansion

When selection reaches a node that is non-terminal but has unexpanded children, add one child — typically one random untried legal move. That child becomes the leaf for simulation. Some implementations expand all children at once; others expand lazily one at a time for memory efficiency.

3. Simulation (rollout)

From the new leaf, play moves until terminal or a depth/time cutoff. The rollout policy is usually uniform random over legal moves — cheap and unbiased. Stronger bots use a lightweight policy network or heuristic (capture moves first, avoid self-atari in Go) to make rollouts more informative at the cost of speed.

4. Backpropagation

Propagate the simulation result up the path to the root. Increment visit counts; add +1 for a win, 0 for loss, 0.5 for draw (or use values from a value network in AlphaGo-style MCTS). Flip perspective each level in two-player zero-sum games so parent statistics reflect the player-to-move at that node.

# Pseudocode: one MCTS iteration (two-player, win=1 loss=0)
def mcts_iteration(root):
    node = root
    path = [node]
    # Selection
    while node.is_fully_expanded() and not node.is_terminal():
        node = node.select_child_ucb1()
        path.append(node)
    # Expansion
    if not node.is_terminal() and not node.is_fully_expanded():
        node = node.expand_one_child()
        path.append(node)
    # Simulation
    result = rollout(node.state)  # 1.0 win, 0.0 loss, 0.5 draw
    # Backpropagation
    for n in reversed(path):
        n.visits += 1
        n.wins += result
        result = 1.0 - result  # alternate player perspective

AlphaGo and neural MCTS

Pure random-rollout MCTS reached strong amateur Go play but plateaued below professional level. AlphaGo replaced two weak links with neural networks trained by reinforcement learning:

  • Policy network p(a|s) — biases selection and rollouts toward plausible moves, shrinking effective branching.
  • Value network v(s) — estimates win probability from a position, replacing long random rollouts with a single forward pass.

PUCT (Predictor + UCT) modifies UCB to use prior probabilities from the policy network: moves the network likes get explored earlier even with few visits. At leaf nodes, the value network's v(s) substitutes for rollout outcome. Self-play generates training data; the loop — better nets improve MCTS, better MCTS improves self-play — is the core of modern game AI scaling.

The same pattern appears in chess engines (Leela Chess Zero), hex bots, and research systems that use LLMs as policy priors over reasoning steps — though LLM tree search is still an active research area with different cost profiles than board games.

Worked example: Harbor Chess midgame knight fork

Harbor Games ships a lightweight chess mode for casual players. The built-in AI uses MCTS with a 200 ms budget per move on mobile. Consider this midgame position (White to move): White's knight on e5 attacks the black queen on d7 and rook on c6 — a classic fork threat if the knight can land safely.

Candidate moves from the root include Nxd7 (win queen, but is it safe?), Nxc6 (win rook), Nd3 (reposition), and several king/rook shuffles. Random rollouts from Nxd7 often show White ahead when Black's recapture is unsound; rollouts from quiet moves show murkier outcomes. After 3,000 iterations in 200 ms:

  • Nxd7 — 1,420 visits, 58% win rate (exploitation favors this).
  • Nxc6 — 890 visits, 51% win rate.
  • Nd3 — 410 visits, 49% win rate (exploration bonus kept it alive).
  • Other moves — under 280 visits each, mostly below 45%.

The engine plays Nxd7 (highest visit count — robust child per common convention). A minimax engine with a shallow material evaluator might agree, but MCTS reached the conclusion without encoding "queen = 9 points." It discovered that line through statistics. If Black had a hidden refutation, UCB would eventually allocate more visits to defensive tries as rollout data shifted — the tree self-corrects given budget.

Production tuning for Harbor Chess: c = 1.2 (slightly less exploration than √2 for shorter budgets), rollout cap at 40 plies, illegal-move filtering in expansion, transposition table keyed by Zobrist hash to merge identical positions reached by different move orders.

Algorithm decision table

Method Best for Needs evaluator? Typical cost
MCTS (random rollout) Go, hex, medium branching, fast sim No (terminal outcome only) O(iterations × rollout depth)
Neural MCTS (AlphaGo-style) Top-tier board games, self-play training Learned policy + value nets GPU inference per node
Minimax + alpha-beta Chess, checkers, low branching with good eval Handcrafted or NN eval O(bd) pruned
Model-free RL (PPO, DQN) Continuous control, stochastic envs, no simulator Learned from interaction Many environment steps
Greedy / beam search LLM decoding, puzzles with heuristic score Per-step score function O(beam width × depth)

Common pitfalls

  • Too few iterations — 50 ms and 50 rollouts produce noisy move choices; always scale budget to hardware.
  • Biased rollouts without calibration — heuristic rollout policies can overfit quirks; validate against random baselines.
  • Ignoring transpositions — duplicate positions waste budget; use a transposition table or graph MCTS.
  • Wrong root selection rule — max visits (robust child) vs max win rate (aggressive) diverge under uncertainty; document your choice.
  • Applying vanilla MCTS to imperfect information — card games need ISMCTS (determinization) or different algorithms.
  • Memory blow-up — unbounded trees on long games require progressive widening or RAVE heuristics to limit children.
  • Determinization leakage — in hidden-info games, treating guessed worlds as real produces overconfident play.

Practitioner checklist

  • Confirm rollouts are fast enough to run thousands per second on target hardware.
  • Log visit counts and win rates at the root for debugging suspicious moves.
  • Tune exploration constant c on a held-out position suite, not one puzzle.
  • Add transposition hashing if the same position is reachable by transposition.
  • Use max-visit child at the root for stability; max-win-rate for analysis mode.
  • Cap rollout depth or switch to a value net when games exceed 200 plies routinely.
  • For two-player games, alternate backprop values (+1 / 0) per tree level.
  • Parallelize iterations with virtual loss to avoid thread contention on hot nodes.
  • Benchmark against minimax or human baselines before shipping AI difficulty tiers.
  • Document time budget per move so QA can reproduce AI behavior across devices.

Key takeaways

  • MCTS builds a search tree through selection, expansion, random rollouts, and backpropagation — no handcrafted eval required.
  • UCB1 at each node balances trying promising moves with exploring uncertain ones.
  • AlphaGo-style neural MCTS replaces random rollouts with policy priors and value networks for professional-strength play.
  • MCTS excels in discrete, simulatable, perfect-information games with fast playouts.
  • Production game AI needs transposition tables, time budgets, and tuned exploration — not just textbook pseudocode.

Related reading