Decision trees and random forests explained
A decision tree is the most intuitive machine learning model: a flowchart of yes/no questions that routes each example to a leaf prediction. Trees handle nonlinear relationships and feature interactions without manual engineering — if age > 65 and income < $40k, predict "high churn risk." The catch is that a deep, unconstrained tree memorizes training noise. Random forests fix that by training hundreds of slightly different trees on bootstrap samples and random feature subsets, then averaging their votes. Together they form the backbone of modern tabular ML — and the conceptual foundation behind gradient boosting. This guide covers split criteria (Gini, entropy), pruning, bagging, feature importance, hyperparameters, and when forests beat logistic regression on structured data — building on machine learning fundamentals.
How a decision tree splits data
Training starts at the root with all examples. The algorithm searches every feature and every candidate threshold (for numeric columns) or category partition (for categorical columns) to find the split that best separates the target labels. "Best" is measured by an impurity reduction score: how much purer the child nodes are than the parent.
Gini impurity
For classification, Gini impurity measures how often a randomly chosen example would be mislabeled if you assigned it a label at random according to the class distribution in the node:
Gini = 1 − Σ_k p_k²
where p_k is the fraction of class k in the node. A node with 100% one class has Gini 0 (pure); a 50/50 binary node has Gini 0.5 (maximally impure for two classes). The algorithm picks the split with the largest impurity reduction:
gain = Gini(parent) − weighted average Gini(children),
where each child is weighted by its share of the parent's examples.
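To make the formula concrete, here is a minimal sketch in plain Python. The function names (`gini`, `gini_gain`) and the toy churn labels are made up for illustration; real libraries compute the same quantity with optimized code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Impurity reduction: Gini(parent) minus the size-weighted Gini of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical node holding 5 "churn" and 5 "stay" examples.
parent = ["churn"] * 5 + ["stay"] * 5
left = ["churn"] * 4 + ["stay"] * 1   # e.g. rows that satisfy the candidate split
right = ["churn"] * 1 + ["stay"] * 4  # the remaining rows

print(gini(parent))                    # 0.5
print(gini_gain(parent, left, right))  # approximately 0.18
```

A split search would evaluate `gini_gain` for every candidate split and keep the largest value.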
Entropy and information gain
An alternative criterion is entropy from information theory:
H = −Σ_k p_k log2(p_k)
Splits maximize the reduction in entropy, known as information gain. In practice Gini and entropy produce nearly identical trees; scikit-learn defaults to Gini, which is slightly faster to compute (no logarithm).
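The similarity between the two criteria is easy to see numerically. The sketch below (illustrative helper names, toy binary nodes of 10 examples) prints both measures for a few class mixes; both are zero for a pure node and peak at a 50/50 mix.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum over classes of -p_k * log2(p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)


def gini(labels):
    """Gini impurity: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Binary nodes with 10 examples and a varying number of positives.
for positives in (0, 1, 3, 5):
    node = [1] * positives + [0] * (10 - positives)
    print(positives, round(gini(node), 3), round(entropy(node), 3))
```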
Regression trees
For continuous targets, splits minimize variance (or mean squared error) in the child nodes. Each leaf predicts the mean (or median) of training targets that reach it. Regression trees power quantile forecasting, partial dependence plots, and the base learners inside gradient boosting libraries.
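As a rough sketch of the regression case, the code below scans candidate thresholds on one numeric feature and keeps the one that minimizes the total squared error of the two children (equivalent to minimizing their size-weighted variance). The helper names and the toy data are assumptions for illustration, and the brute-force scan is far slower than what production libraries do.

```python
def sse(values):
    """Sum of squared errors around the mean (n times the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Try midpoints between consecutive unique x values; return (threshold, total_sse)."""
    unique_xs = sorted(set(xs))
    best = (None, sse(ys))  # baseline: no split at all
    for lo, hi in zip(unique_xs, unique_xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data with an obvious jump in the target between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

threshold, total = best_threshold(xs, ys)
print(threshold)  # 6.5
```

Each resulting leaf would then predict the mean of the `ys` values that fall on its side of the threshold.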
Growing, stopping and pruning
Left unchecked, a tree grows until every leaf is pure or contains a single example — perfect training accuracy, terrible generalization. Production trees need stopping rules:
- max_depth — cap tree height (most important knob).
- min_samples_split — require at least N examples before splitting a node.
- min_samples_leaf — each leaf must hold at least N examples (prevents tiny leaves that fit noise).
- max_leaf_nodes — limit total leaves for a balanced complexity budget.
Post-pruning (cost-complexity pruning) grows a full tree first, then removes the subtrees whose impurity reduction does not justify their extra leaves, with the penalty per leaf set by a parameter alpha. scikit-learn's DecisionTreeClassifier supports this via ccp_alpha. Pruning often beats pre-set depth limits because it adapts complexity to local data density.
Tune stopping parameters with cross-validation on a grid of max_depth, min_samples_leaf, and ccp_alpha. A shallow tree (depth 3–5) is a strong interpretable baseline; depth beyond 10 rarely helps on real tabular data without ensembling.
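One way to run that grid search with scikit-learn is sketched below. The dataset (the built-in breast cancer data) and the specific grid values are assumptions chosen only to keep the example small and runnable; sensible ranges depend on your data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid over the three knobs discussed above.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

For a finer set of `ccp_alpha` candidates, `DecisionTreeClassifier.cost_complexity_pruning_path` returns the alphas at which the fitted tree would actually change.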
Why single trees fail — and how random forests fix it
Individual trees are high-variance estimators: a small change in training data can produce a completely different tree structure. Random forests apply two randomization tricks popularized by Leo Breiman:
- Bootstrap aggregating (bagging): each tree trains on a random sample drawn with replacement (about 63% unique rows per tree). Predictions are averaged (regression) or majority-voted (classification) across trees.
- Feature randomness: at each split, only a random subset of features is considered (typically sqrt(n_features) for classification, n_features / 3 for regression). This decorrelates the trees so that averaging actually reduces variance.
The result: lower variance without the bias increase you'd get from aggressively pruning a single tree. Random forests handle mixed numeric and categorical features, missing values (with appropriate implementations), and nonlinear interactions — all without explicit feature engineering.
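The two sources of randomness can be simulated in a few lines of standard-library Python. This is only a sketch of the sampling step, not a forest implementation; the row and feature counts are arbitrary.

```python
import math
import random

random.seed(0)

# Bagging: draw n_rows row indices with replacement.
n_rows = 10_000
bootstrap = [random.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(bootstrap)) / n_rows
print(round(unique_fraction, 3))  # close to 1 - 1/e ≈ 0.632

# Feature randomness: at one split, consider only about sqrt(n_features) columns.
n_features = 25
k = max(1, int(math.sqrt(n_features)))
candidate_features = random.sample(range(n_features), k)
print(sorted(candidate_features))
```

The rows missing from `bootstrap` (a bit over a third of them) are the ones a tree never sees during training.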
Out-of-bag (OOB) error
Roughly one-third of rows never appear in a given tree's bootstrap sample. Those out-of-bag examples act as a free validation set: each tree predicts OOB rows it did not train on, and errors aggregate across the forest. OOB score approximates cross-validation without extra compute — useful for quick model selection, though still validate on a held-out test set before deployment.
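In scikit-learn, the OOB estimate is available by passing `oob_score=True` to the forest. The sketch below uses the built-in breast cancer dataset purely as a stand-in and prints the OOB score next to a held-out test score so the two can be compared.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    oob_score=True,   # score each training row using only trees that did not see it
    random_state=0,
)
forest.fit(X_train, y_train)

print(round(forest.oob_score_, 3))               # OOB accuracy on the training rows
print(round(forest.score(X_test, y_test), 3))    # accuracy on the held-out test set
```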
Feature importance and interpretability
Random forests expose feature importance scores — typically mean decrease in impurity (how much each feature reduces Gini/entropy across all splits, weighted by node size). Importance ranks which columns drive predictions: useful for debugging, stakeholder reports, and pruning useless features.
Caveats: impurity-based importance is biased toward high-cardinality features (many unique values get more split opportunities). Prefer permutation importance (shuffle one column, measure the drop in validation score) for a less biased ranking, keeping in mind that strongly correlated features can still share or mask each other's importance. For regulatory or clinical settings where you need to trace individual decisions, a single pruned tree or SHAP values on the forest may be clearer than reading 500 trees.
Partial dependence plots show how predictions change as one feature varies while others are marginalized — a middle ground between global importance and local explanations.
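The two importance measures can be compared side by side with scikit-learn. The following is a sketch on the built-in breast cancer dataset (an assumption for illustration); it prints the five features with the highest permutation importance on a validation split, alongside their impurity-based scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity, computed from the training-time splits.
mdi = forest.feature_importances_

# Permutation importance: shuffle one column at a time on validation data.
perm = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)

top = perm.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} perm={perm.importances_mean[i]:.4f} mdi={mdi[i]:.4f}")
```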
Key hyperparameters
| Parameter | Typical range | Effect |
|---|---|---|
| n_estimators | 100–500 | More trees reduce variance; diminishing returns after ~300; linear inference cost |
| max_depth | None or 10–30 | Deeper trees fit more interactions; None lets leaves grow until pure (often fine inside a forest) |
| max_features | sqrt, log2, or 0.3–0.5 | Lower values decorrelate trees more; too low adds bias |
| min_samples_leaf | 1–10 | Higher values smooth predictions; helps with noisy labels |
| class_weight | balanced or dict | Upweights minority classes in imbalanced classification |
Random forests have fewer fragile hyperparameters than gradient boosting, which is one reason they are popular defaults. Start with n_estimators=200 and max_features='sqrt', then tune min_samples_leaf and max_depth if OOB or validation scores plateau.
Decision trees vs random forests vs gradient boosting
| Method | Bias / variance | Training speed | Typical use |
|---|---|---|---|
| Single decision tree | Low bias, high variance | Very fast | Interpretable rules, teaching, quick EDA |
| Random forest | Moderate bias, low variance | Fast (parallelizable) | Robust tabular baseline, feature screening |
| Gradient boosting | Low bias, moderate variance | Slower (sequential) | Maximum accuracy on structured data, competitions |
| Logistic regression | High bias, low variance | Fastest | Linear boundaries, regulated interpretability |
A practical workflow: train a random forest for a strong baseline and feature importance map; if you need more accuracy and can invest in tuning, switch to XGBoost or LightGBM (see our gradient boosting guide). Keep logistic regression when coefficients must be auditable or data is tiny.
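A fair comparison scores every model on the same folds. The sketch below does that with scikit-learn; note that it uses scikit-learn's own `GradientBoostingClassifier` as a stand-in for XGBoost or LightGBM, and the dataset and settings are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A fixed random_state makes every model see identical train/validation folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    # Logistic regression benefits from scaled inputs, so wrap it in a pipeline.
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {
    name: cross_val_score(model, X, y, cv=folds, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:<10} AUC={score:.3f}")
```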
Common mistakes
- Evaluating on training data: a deep single tree hits 100% training accuracy while failing on new rows. Always hold out a test set.
- One-hot encoding before tree models: often unnecessary; native categorical support (CatBoost, LightGBM) or ordinal encoding usually works fine.
- Trusting impurity importance blindly: correlated features split importance between them; use permutation importance for ranking.
- Using random forests on tiny datasets: with fewer than 500 rows, a single pruned tree or regularized linear model often generalizes better.
- Ignoring class imbalance: set class_weight='balanced' or tune the decision threshold using precision-recall metrics, not accuracy.
- Expecting calibrated probabilities: forest vote fractions rank well but may need Platt scaling or isotonic regression for true probability estimates.
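The last two items can be sketched together with scikit-learn: a class-weighted forest, and the same forest wrapped in `CalibratedClassifierCV` with isotonic regression. The synthetic imbalanced dataset is an assumption for illustration, and the Brier score (lower is better) is just one way to check calibration; whether calibration helps depends on the data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives.
X, y = make_classification(
    n_samples=3000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Uncalibrated forest with minority-class upweighting.
raw = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
raw.fit(X_train, y_train)

# Same model, with isotonic calibration fitted on internal cross-validation folds.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0),
    method="isotonic",
    cv=5,
)
calibrated.fit(X_train, y_train)

raw_brier = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
cal_brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
print(round(raw_brier, 4), round(cal_brier, 4))
```

Passing `method="sigmoid"` instead would apply Platt scaling.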
Production checklist
- Split data with temporal or group constraints if leakage is possible.
- Train a random forest baseline; record OOB score and permutation importance.
- Tune max_depth, min_samples_leaf, and n_estimators via cross-validation.
- Compare against logistic regression and gradient boosting on the same validation folds.
- Evaluate with task-appropriate metrics (AUC, F1, RMSE) on held-out test data.
- Serialize the fitted model with joblib or ONNX; pin scikit-learn version.
- Document top features and sanity-check against domain knowledge.
- Monitor feature distributions in production; retrain when drift appears.
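For the serialization step, one possible pattern is to store the scikit-learn version next to the model and check it at load time. The bundle layout (`"model"`, `"sklearn_version"` keys) and the temporary file path are assumptions made for this sketch, not a standard format.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Save the model together with the library version it was trained under.
bundle = {"model": forest, "sklearn_version": sklearn.__version__}
path = os.path.join(tempfile.mkdtemp(), "forest.joblib")
joblib.dump(bundle, path)

# At load time, refuse to use a model pickled under a different version.
loaded = joblib.load(path)
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("scikit-learn version mismatch; retrain or install the pinned version")

restored = loaded["model"]
print((restored.predict(X) == forest.predict(X)).all())  # True
```

Only load joblib files from sources you trust, since unpickling can execute arbitrary code.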
Key takeaways
- Decision trees partition feature space with greedy impurity-minimizing splits — intuitive but high-variance alone.
- Random forests bag bootstrap samples and randomize features per split, averaging decorrelated trees for robust tabular predictions.
- Stopping rules and pruning control single-tree complexity; forests tolerate deeper base trees because averaging reduces variance.
- Feature importance helps screening and explanation, but prefer permutation importance for unbiased rankings.
- Forests are a strong default on structured data; reach for gradient boosting when you need the last few points of accuracy.
Related reading
- Machine learning fundamentals explained — supervised learning, bias-variance tradeoff, and evaluation basics
- Gradient boosting and ensemble learning explained — when sequential boosting beats bagging on tabular data
- Feature engineering explained — encoding, binning, and interaction features that still help linear models
- Overfitting and cross-validation explained — reliable hyperparameter tuning without leaking test data