
Decision trees and random forests explained

A decision tree is one of the most intuitive machine learning models: a flowchart of yes/no questions that routes each example to a leaf prediction. Trees handle nonlinear relationships and feature interactions without manual engineering, for example: if age > 65 and income < $40k, predict "high churn risk." The catch is that a deep, unconstrained tree memorizes training noise. Random forests counter that by training hundreds of slightly different trees on bootstrap samples and random feature subsets, then averaging or voting on their predictions. Together they form the backbone of much of modern tabular ML and the conceptual foundation behind gradient boosting. This guide covers split criteria (Gini, entropy), pruning, bagging, feature importance, hyperparameters, and when forests beat logistic regression on structured data, building on machine learning fundamentals.

How a decision tree splits data

Training starts at the root with all examples. The algorithm searches every feature and every possible threshold (for numeric columns) or category partition (for categorical columns) to find the split that best separates the target labels. "Best" is measured by an impurity reduction score: how much purer the child nodes are than the parent.

Gini impurity

For classification, Gini impurity measures how often a randomly chosen example would be mislabeled if you assigned it according to the class distribution in the node:

Gini = 1 − Σ p_k²

where p_k is the fraction of class k in the node. A node with 100% one class has Gini 0 (pure); a two-class node split 50/50 has Gini 0.5 (maximally impure for two classes). The algorithm picks the split with the largest impurity decrease: gain = Gini(parent) − weighted average Gini(children), where each child is weighted by its share of the parent's examples.
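As a minimal sketch of the formula and the split search described above, the following pure-Python helpers compute Gini impurity and scan candidate thresholds on one numeric column. The function names (`gini`, `split_gain`, `best_threshold`) and the toy ages and labels are illustrative assumptions, not part of any library; real implementations use sorted scans and are far more efficient.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Scan midpoints between consecutive distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    best_t, best_gain = None, 0.0
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = split_gain(labels, left, right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Toy data (hypothetical): older customers churn.
ages = [22, 35, 47, 68, 71]
labels = ["stay", "stay", "stay", "churn", "churn"]
threshold, gain = best_threshold(ages, labels)
print(threshold, round(gain, 3))
```

On this toy column the only threshold that yields two pure children lies between 47 and 68, so it receives the full parent impurity as its gain.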

Entropy and information gain

An alternative criterion is entropy from information theory: H = −Σ p_k log₂(p_k). Splits maximize the reduction in entropy, which is called information gain. In practice Gini and entropy produce nearly identical trees; scikit-learn defaults to Gini, which is also slightly cheaper to compute because it needs no logarithm.
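To make the entropy formula concrete, here is a small sketch in plain Python. The helper names `entropy` and `information_gain` and the toy label lists are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum over classes of -p * log2(p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    fractions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in fractions)


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


balanced = ["a", "b"] * 5      # 50/50 node
skewed = ["a"] * 9 + ["b"]     # 90/10 node
print(entropy(balanced))
print(round(entropy(skewed), 3))
print(information_gain(balanced, ["a"] * 5, ["b"] * 5))
```

A balanced two-class node carries one full bit of entropy, and a split that separates the classes perfectly recovers all of it as gain.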

Regression trees

For continuous targets, splits minimize variance (equivalently, mean squared error) in the child nodes. Each leaf predicts the mean of the training targets that reach it, or the median when the split criterion is absolute error. Regression trees are also the base learners inside gradient boosting libraries and underpin extensions such as quantile regression forests.
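The regression criterion can be sketched the same way. The example below assumes a hypothetical list of targets already ordered by some feature and scores one candidate split by its variance reduction; `variance_reduction` is an illustrative name, not a library function.

```python
from statistics import fmean, pvariance


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = len(left) / n * pvariance(left) + len(right) / n * pvariance(right)
    return pvariance(parent) - weighted


# Hypothetical targets, ordered by the feature being split on.
targets = [10.0, 12.0, 11.0, 30.0, 32.0]
left, right = targets[:3], targets[3:]

print(round(variance_reduction(targets, left, right), 3))
print(fmean(left), fmean(right))  # each leaf predicts the mean of its targets
```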

Growing, stopping and pruning

Left unchecked, a tree grows until every leaf is pure or contains a single example — perfect training accuracy, terrible generalization. Production trees need stopping rules:

max_depth — cap tree height (most important knob).
min_samples_split — require at least N examples before splitting a node.
min_samples_leaf — each leaf must hold at least N examples (prevents tiny leaves that fit noise).
max_leaf_nodes — limit total leaves for a balanced complexity budget.

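Post-pruning (cost-complexity pruning) grows a full tree first, then collapses the subtrees whose impurity reduction does not justify their extra leaves, with a per-leaf penalty controlling how aggressively it prunes. scikit-learn's DecisionTreeClassifier supports this via ccp_alpha, which is usually chosen by validation or cross-validation. Pruning often beats pre-set depth limits because it adapts complexity to local data density.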
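One way to pick ccp_alpha is to walk the pruning path that scikit-learn computes and keep the value that scores best on validation data. This is a sketch on synthetic data from `make_classification`; the dataset, split sizes, and selection rule are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption: stands in for a real table).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(full_tree.get_n_leaves(), pruned.get_n_leaves(), round(best_score, 3))
```

Selecting alpha on a single validation split is the simplest option; cross-validating over the same candidate alphas is more stable on small datasets.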

Tune stopping parameters with cross-validation on a grid of max_depth, min_samples_leaf, and ccp_alpha. A shallow tree (depth 3–5) is a strong interpretable baseline; depth beyond 10 rarely helps on real tabular data without ensembling.
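A grid search along these lines might look like the sketch below. The grid values and the synthetic dataset are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption); replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```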

Why single trees fail — and how random forests fix it

Individual trees are high-variance estimators: a small change in training data can produce a completely different tree structure. Random forests apply two randomization tricks popularized by Leo Breiman:

Bootstrap aggregating (bagging) — each tree trains on a random sample with replacement (~63% unique rows per tree). Predictions average (regression) or majority-vote (classification) across trees.
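Feature randomness — at each split, only a random subset of features is considered (typically sqrt(n_features) for classification and, in Breiman's original recommendation, n_features / 3 for regression; scikit-learn's RandomForestRegressor defaults to all features). This decorrelates the trees so that averaging actually reduces variance.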
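Both tricks are simple to simulate. The sketch below uses only the standard library to draw one bootstrap sample, pick a random feature subset for one split, and take a majority vote. The row count, feature count, and vote list are made-up values for illustration.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
n_rows, n_features = 10_000, 16

# Bagging: draw n_rows row indices with replacement.
sample = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(sample)) / n_rows  # approaches 1 - 1/e, about 0.632

# Feature randomness: consider only sqrt(n_features) columns at this split.
k = int(math.sqrt(n_features))
candidate_features = rng.sample(range(n_features), k)

# Aggregation: majority vote across (hypothetical) tree predictions.
votes = ["churn", "stay", "churn", "churn", "stay"]
prediction = Counter(votes).most_common(1)[0][0]

print(round(unique_fraction, 3), sorted(candidate_features), prediction)
```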

The result: lower variance without the bias increase you'd get from aggressively pruning a single tree. Random forests handle mixed numeric and categorical features, missing values (with appropriate implementations), and nonlinear interactions — all without explicit feature engineering.

Out-of-bag (OOB) error

Each bootstrap sample leaves out roughly one-third of the rows (about 37% on average). Those out-of-bag examples act as a free validation set: each tree predicts the OOB rows it did not train on, and the errors aggregate across the forest. The OOB score approximates cross-validation without extra compute, which makes it useful for quick model selection, but you should still validate on a held-out test set before deployment.
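In scikit-learn the OOB estimate is available by passing oob_score=True. The sketch below compares it with a held-out test score on synthetic data; the dataset and split are assumptions, and the two numbers will usually be close but not identical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption).
X, y = make_classification(n_samples=1000, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_             # estimated from rows each tree never saw
test_accuracy = forest.score(X_test, y_test)
print(round(oob_accuracy, 3), round(test_accuracy, 3))
```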

Feature importance and interpretability

Random forests expose feature importance scores — typically mean decrease in impurity (how much each feature reduces Gini/entropy across all splits, weighted by node size). Importance ranks which columns drive predictions: useful for debugging, stakeholder reports, and pruning useless features.

Caveats: impurity-based importance is biased toward high-cardinality features (many unique values get more split opportunities) and is computed on training data. Permutation importance (shuffle one column, measure the drop in validation score) gives a less biased ranking, although strongly correlated features can still share or mask each other's importance. For regulatory or clinical settings where you need to trace individual decisions, a single pruned tree or SHAP values on the forest may be clearer than reading 500 trees.
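The two importance measures can be compared side by side. This sketch assumes synthetic data in which, because shuffle=False, the first three columns are informative and the last three are pure noise; the sizes and repeat count are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0-2 are informative, columns 3-5 are noise (assumption of this toy setup).
X, y = make_classification(n_samples=800, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

impurity_importance = forest.feature_importances_   # mean decrease in impurity
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
permutation_scores = result.importances_mean        # mean drop in validation accuracy

for i, (mdi, perm) in enumerate(zip(impurity_importance, permutation_scores)):
    print(f"feature {i}: impurity={mdi:.3f} permutation={perm:.3f}")
```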

Partial dependence plots show how predictions change as one feature varies while others are marginalized — a middle ground between global importance and local explanations.
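The computation behind a one-dimensional partial dependence curve is short enough to write by hand. The sketch below uses a hypothetical stand-in model (`toy_predict`, echoing the churn rule from the introduction) and made-up rows; with a real forest you would pass its prediction function instead.

```python
def partial_dependence_1d(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite one feature, keep the rest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Hypothetical model: churn probability from (age, income)."""
    age, income = row
    return 0.8 if age > 65 and income < 40_000 else 0.1


rows = [(30, 25_000), (50, 60_000), (70, 35_000), (80, 90_000)]
curve = partial_dependence_1d(toy_predict, rows, feature_index=0, grid=[40, 60, 70])
print([round(v, 2) for v in curve])
```

The curve stays flat until age crosses 65, then rises only as far as the share of rows with low income allows, which is exactly the averaging over the other features that the plot represents.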

Key hyperparameters

Parameter	Typical range	Effect
`n_estimators`	100–500	More trees reduce variance; diminishing returns after ~300; linear inference cost
`max_depth`	None or 10–30	Deeper trees fit more interactions; None lets leaves grow until pure (often fine inside a forest)
`max_features`	`sqrt`, `log2`, or 0.3–0.5	Lower values decorrelate trees more; too low adds bias
`min_samples_leaf`	1–10	Higher values smooth predictions; helps with noisy labels
`class_weight`	`balanced` or dict	Upweights minority classes in imbalanced classification

Random forests have fewer fragile hyperparameters than gradient boosting — a reason they are popular defaults. Start with n_estimators=200, max_features='sqrt', and tune min_samples_leaf and max_depth if OOB or validation scores plateau.
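Starting from those defaults, a small sweep over min_samples_leaf scored by OOB accuracy could look like this. The synthetic dataset and the candidate values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with some label noise (assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           flip_y=0.05, random_state=0)

oob_by_leaf = {}
for leaf in (1, 5, 10):
    forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                    min_samples_leaf=leaf, oob_score=True,
                                    random_state=0)
    forest.fit(X, y)
    oob_by_leaf[leaf] = forest.oob_score_

best_leaf = max(oob_by_leaf, key=oob_by_leaf.get)
print(oob_by_leaf, best_leaf)
```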

Decision trees vs random forests vs gradient boosting

Method	Bias / variance	Training speed	Typical use
Single decision tree	Low bias, high variance	Very fast	Interpretable rules, teaching, quick EDA
Random forest	Moderate bias, low variance	Fast (parallelizable)	Robust tabular baseline, feature screening
Gradient boosting	Low bias, moderate variance	Slower (sequential)	Maximum accuracy on structured data, competitions
Logistic regression	High bias, low variance	Fastest	Linear boundaries, regulated interpretability

A practical workflow: train a random forest for a strong baseline and feature importance map; if you need more accuracy and can invest in tuning, switch to XGBoost or LightGBM (see our gradient boosting guide). Keep logistic regression when coefficients must be auditable or data is tiny.
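A fair comparison reuses the same folds for every model. The sketch below does that with scikit-learn only; its GradientBoostingClassifier is used here as a stand-in for XGBoost or LightGBM, and the synthetic dataset and model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption).
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)

# A fixed random_state makes every model see identical train/validation folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=folds, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, auc in scores.items():
    print(f"{name}: AUC={auc:.3f}")
```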

Common mistakes

Evaluating on training data — a deep single tree hits 100% training accuracy while failing on new rows. Always hold out a test set.
One-hot encoding before tree models — often unnecessary: libraries with native categorical support (CatBoost, LightGBM) take categories directly, and ordinal encoding often works fine for implementations that need numeric input, such as scikit-learn's forests.
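Trusting impurity importance blindly — it is computed on training data, favors high-cardinality features, and divides credit among correlated features; cross-check the ranking with permutation importance on held-out data.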
Using random forests on tiny datasets — with <500 rows, a single pruned tree or regularized linear model often generalizes better.
Ignoring class imbalance — set class_weight='balanced' or tune threshold using precision-recall metrics, not accuracy.
Expecting calibrated probabilities — forest vote fractions rank well but may need Platt scaling or isotonic regression for true probability estimates.
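The first mistake in this list is easy to reproduce. The sketch below fits an unconstrained tree to synthetic data with deliberately noisy labels (an assumption of the example) and compares training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise that a deep tree will memorize (synthetic assumption).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

print(train_accuracy, round(test_accuracy, 3))
```

The training score alone would suggest a perfect model; only the held-out score reveals how much of that fit was noise.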

Production checklist

Split data with temporal or group constraints if leakage is possible.
Train a random forest baseline; record OOB score and permutation importance.
Tune max_depth, min_samples_leaf, and n_estimators via cross-validation.
Compare against logistic regression and gradient boosting on the same validation folds.
Evaluate with task-appropriate metrics (AUC, F1, RMSE) on held-out test data.
Serialize the fitted model with joblib or ONNX; pin scikit-learn version.
Document top features and sanity-check against domain knowledge.
Monitor feature distributions in production; retrain when drift appears.
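For the serialization step, one possible pattern is to store the fitted model together with the scikit-learn version that produced it, so a loader can check compatibility. The bundle layout and file name below are assumptions of this sketch, not a standard format.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "forest.joblib"  # hypothetical file name
    # Record the library version next to the model so it can be pinned later.
    joblib.dump({"model": forest, "sklearn_version": sklearn.__version__}, path)
    bundle = joblib.load(path)

restored = bundle["model"]
same_predictions = bool((restored.predict(X) == forest.predict(X)).all())
print(bundle["sklearn_version"], same_predictions)
```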

Key takeaways

Decision trees partition feature space with greedy impurity-minimizing splits — intuitive but high-variance alone.
Random forests bag bootstrap samples and randomize features per split, averaging decorrelated trees for robust tabular predictions.
Stopping rules and pruning control single-tree complexity; forests tolerate deeper base trees because averaging reduces variance.
Feature importance helps screening and explanation, but prefer permutation importance on held-out data for less biased rankings.
Forests are a strong default on structured data; reach for gradient boosting when you need the last few points of accuracy.