Guide

Monte Carlo Tree Search explained

Minute twelve of a ranked chess match. Your queen sits on d3, both rooks are connected, and the opponent just pushed a knight to f6 — a move that looks quiet but opens three tactical forks on the next tempo. Brute-force minimax to depth eight evaluates 10⁸ nodes and still misses the knight sacrifice on h7 that only pays off six plies later. A human titled player does not enumerate every line; they sample plausible continuations, notice which branches keep returning winning endings, and spend more analysis time on those ideas. Monte Carlo Tree Search (MCTS) automates that sampling loop: it builds a search tree incrementally, plays thousands of fast random (or learned) rollouts from each candidate move, and backs win/loss statistics up the tree so promising lines get explored more often. AlphaGo used MCTS plus deep neural networks to defeat world champions at Go; modern board-game engines, card agents, and turn-based strategy titles use variants of the same core. This guide covers the four MCTS phases, UCB1 exploration, rollout policies, neural MCTS, when MCTS beats classical minimax, a Harbor Chess midgame-fork worked example, an algorithm decision table, common pitfalls, and a production checklist — the planning layer that complements reinforcement learning and board-game design on our site.

What MCTS is (and what it is not)

Monte Carlo Tree Search is a best-first search algorithm for sequential decision problems — especially two-player, perfect-information games with large branching factors (Go, chess, shogi, hex, many card games). Instead of evaluating every child at fixed depth like minimax, MCTS allocates simulations dynamically: branches that look strong receive more rollouts; weak-looking lines are sampled just enough to avoid missing traps.

MCTS is not a replacement for all game AI:

Real-time action games (fighters, shooters) rarely have time for thousands of rollouts per frame; behavior trees, utility AI, and perception-driven state machines dominate there.
Hidden-information games (poker, bridge) need belief states or information-set search; vanilla MCTS assumes you see the full board.
Tiny state spaces (tic-tac-toe, connect-four at low depth) are solved faster by alpha-beta pruning than by noisy rollouts.

Where MCTS shines

Use MCTS when legal moves per position exceed what depth-limited minimax can search, when evaluation functions are hard to hand-craft (Go before neural nets), or when you can run hundreds to millions of simulations per move offline or between turns. The algorithm is also a building block in RL pipelines: AlphaZero self-play generates training data by playing MCTS against itself.

The four phases: selection, expansion, simulation, backpropagation

Each MCTS iteration repeats four steps starting from the root (current game state):

Selection — walk down the existing tree, choosing child nodes by a policy (usually UCB1) until you reach a node that still has unexplored legal moves.
Expansion — add one new child node representing a single untried move from that position.
Simulation (rollout) — play moves from the new node to a terminal state (win/loss/draw) using a fast default policy — often uniform random legal moves.
Backpropagation — update visit counts and win statistics on every ancestor up to the root, incrementing wins for the side to move at each level according to rollout outcome.

After N iterations (bounded by a time budget or iteration count), MCTS picks the root child with the highest visit count — not necessarily the highest win rate, because visit count aggregates confidence from many simulations.

Tree reuse between moves

In turn-based games, the opponent’s reply narrows the tree: descend to the matching child and promote it as the new root, discarding sibling branches. Reusing prior work amortizes search cost across the game. Failing to reuse (rebuilding from scratch each ply) wastes CPU on mobile or web targets.

UCB1 and the exploration-exploitation tradeoff

During selection, each child node stores visit count N and win count W. The classic UCB1 (Upper Confidence Bound) score balances exploiting high win rates with exploring rarely visited moves:

UCB1 = W/N + C × sqrt(ln(N_parent) / N)

The exploitation term W/N favors moves that won past rollouts. The exploration term grows when a child has few visits relative to its parent. Constant C (often sqrt(2) for [0,1] rewards) tunes how aggressively you probe uncertain lines. TPUCT variants in AlphaGo replace the second term with a prior from a policy network — the same mathematical idea as UCB in multi-armed bandits: you have a limited budget of pulls (simulations) and must decide which lever (move) to sample next.

Progressive widening and RAVE

In games with hundreds of legal moves (Go), expanding every child at once is wasteful. Progressive widening limits how many children exist as visit count grows. RAVE (Rapid Action Value Estimation) shares statistics across moves that appear in the same rollout, giving early signal before a specific node accumulates direct visits — useful when simulations are expensive.

Rollout policies: random, heuristic, and neural

Rollout quality dominates MCTS strength. Pure random play to the end is easy to implement but noisy: a brilliant sacrifice looks losing if random replies throw away the advantage. Common upgrades:

Lightweight heuristics — capture highest-value piece, avoid immediate blunders, prefer center control in chess. Cheap and 10–50× stronger than uniform random.
Playout policies trained offline — small networks or logistic models that mimic human amateur play.
Truncated rollouts + value net — stop after k moves and evaluate the leaf with a learned value function instead of playing to terminal — the AlphaGo/AlphaZero pattern.

Rollouts must be fast. A chess engine budget of 800 ms per move might run 40,000 iterations; if each rollout averages 50 plies of move generation, microseconds per step matter. Bitboards, incremental Zobrist hashing, and precomputed attack tables are standard engineering, not optional optimizations.

Neural MCTS: AlphaGo and AlphaZero

AlphaGo (2016) combined MCTS with two neural networks: a policy network p(a|s) that predicts expert moves (narrowing which children to expand) and a value network v(s) that estimates win probability from a position (replacing random rollout endings). Selection uses PUCT; leaf evaluation calls the value net instead of simulating to the end.

AlphaZero dropped human game databases: self-play games generated by MCTS produced training positions; the network improved; stronger networks produced better MCTS; repeat. The same loop mastered Go, chess, and shogi from rules alone. For indie developers, full AlphaZero training is usually impractical, but inference-time neural MCTS with a small net trained on your own puzzle positions can beat hand-tuned heuristics in custom board games where no Stockfish exists.

Harbor Chess: midgame knight fork worked example

Harbor Chess is our internal teaching engine. Consider this simplified position after 1.e4 e5 2.Nf3 Nc6 3.Bc4 Nf6 4.Ng5 d5 5.exd5 Na5 — Black ’s knight on f6 eyes e4 and h7; White’s bishop on c4 pins f7 indirectly. MCTS with 2,000 random rollouts in 300 ms might show:

d5 recapture lines — 42% win rate, 380 visits
Nxf7 sacrifice — 58% win rate, 920 visits (UCB kept exploring after early wins)
Bxf7+ — 51% win rate, 700 visits

Random rollouts from Nxf7 often regain material or mate on f7 when Black mis-defends; the tree concentrates visits on the sacrifice even though one-ply minimax with a material-only eval might reject it. After the opponent plays Qe7, tree reuse promotes the Nxf7 subtree; the next search budget focuses on whether Bxf7+ or Ne5 refutes the queen. This mirrors how production engines blend MCTS-style statistics with tactical quiescence search — Harbor logs visit counts in debug UI so designers see why the AI chose a non-obvious move.

Algorithm decision table

Approach	Best for	Limitation	Typical budget
Minimax + alpha-beta	Chess-like games, strong eval functions	Fixed depth misses long tactics	Depth 8–14 plies
Plain MCTS + random rollouts	Prototyping, custom rulesets, Go-like games	Noisy without good rollouts	1k–100k sims/move
MCTS + policy/value net	Competitive agents without hand eval	Training pipeline cost	800 sims with GPU inference
Heuristic greedy / utility AI	RTS, RPG companions, mobile opponents	No lookahead guarantees	Sub-millisecond
Tablebase / solved endgames	≤7 piece chess, connect-four	Does not cover opening/midgame	Lookup O(1)

Common pitfalls

Choosing max win rate instead of max visits at the root — high win rate on 3 visits is noise; visit count is the robust pick.
Symmetric games without perspective flip — backprop must credit the player who stood to move at each node.
Illegal or slow move generation in rollouts — dominates wall time; profile before tuning UCB constants.
No tree reuse — recomputing from root every ply multiplies latency on web clients.
Applying vanilla MCTS to simultaneous or hidden moves — needs determinization or IS-MCTS extensions.
Overfitting self-play nets to one opening — diversify training positions or the agent collapses to gimmick lines.
Ignoring draws — threefold repetition and stalemate must score as 0.5 wins in backprop or endgame behavior drifts.

Practitioner checklist

Implement the four phases with unit tests on tic-tac-toe (known outcomes).
Log root children: visits, wins, UCB score, chosen move each turn.
Profile rollout throughput (simulations per second) before neural nets.
Add tree reuse on opponent moves; verify memory stays bounded (prune old roots).
Swap random rollouts for capture-aware or policy-guided playouts; measure Elo gain.
Tune exploration constant C on a fixed puzzle suite; do not guess.
Handle draws, resign thresholds, and time controls explicitly in terminal checks.
For shipped titles, cap thinking time and show “AI is thinking” with cancel on mobile.
Regression-test tactical puzzles where MCTS must find known mates in 3.
Document when the AI is MCTS vs scripted so designers set difficulty expectations.

Key takeaways

MCTS allocates search effort by statistics from simulated playouts, not fixed-depth tree search.
UCB1 formalizes exploration vs exploitation; neural priors extend the same idea.
Rollout quality matters more than iteration count once basics are correct.
AlphaZero-style self-play links MCTS to reinforcement learning for superhuman custom games.
Tree reuse and fast move gen separate prototype MCTS from shippable turn-based AI.