Hyperparameter tuning explained

A gradient-boosted classifier with default settings scores 0.81 AUC on your holdout set. After a weekend of manual tweaking (deeper trees, lower learning rate, more regularization), the same pipeline hits 0.89. Nothing about the data changed; only the hyperparameters did.

Hyperparameters are the knobs you set before training: learning rate, tree depth, batch size, dropout, weight decay. Model parameters (weights) are learned during training; hyperparameters control how that learning happens. Tune them badly and you burn compute on models that overfit or underfit. Tune them well and a modest architecture can beat a larger one trained with defaults. This guide covers search strategies from grid search to Bayesian optimization, validation discipline that keeps you from gaming your own metrics, and a checklist for teams shipping models to production.

Hyperparameters vs model parameters

Every training run has two layers of configuration. Model parameters (neural network weights, tree split thresholds, linear regression coefficients) are learned from the data by the training algorithm, through gradient updates or greedy split search. You do not set them by hand.

Hyperparameters sit outside the gradient loop. They define the hypothesis space and training dynamics:

Capacity — number of layers, max_depth in XGBoost, embedding dimension.
Learning dynamics — learning rate, batch size, momentum, warmup steps.
Regularization — L1/L2 penalties, weight decay, dropout rate, min_child_weight, early-stopping patience.
Data handling — augmentation strength, class weights, subsample ratio.
Training budget — number of boosting rounds, training epochs (often coupled with early stopping).

More data or a bigger cluster rarely rescues poor hyperparameter choices. A learning rate ten times too high can make training diverge; a tree depth of 20 memorizes noise on a 5,000-row tabular set. Tuning is not optional polish; it is part of building a model that generalizes.
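As a minimal sketch of the distinction, assume a one-feature linear model fit by plain gradient descent (the data, the `fit_slope` helper, and the step count are illustrative, not taken from any library). The slope `w` is a model parameter that training updates; `learning_rate` is a hyperparameter fixed before the loop starts. For this toy data the update is only stable below roughly 0.133, so multiplying a working rate by ten makes the same code diverge.

```python
def fit_slope(xs, ys, learning_rate, steps=50):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # model parameter: learned inside the loop
    n = len(xs)
    for _ in range(steps):
        grad = 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad  # hyperparameter: chosen before training
    return w


# Toy data with a true slope of 3 (an assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

w_good = fit_slope(xs, ys, learning_rate=0.05)  # settles very close to 3.0
w_bad = fit_slope(xs, ys, learning_rate=0.5)    # ten times higher: blows up
```

Nothing inside the loop can repair the second run, because the weight update itself is what the learning rate breaks.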

Common hyperparameters by model family

Search spaces differ by architecture. Start with the knobs that move metrics most before exploring exotic combinations.

Gradient-boosted trees (XGBoost, LightGBM, CatBoost)

learning_rate (eta) — step size per tree; lower values need more trees but often generalize better.
max_depth / num_leaves — tree complexity; primary overfitting lever on small datasets.
subsample and colsample_bytree — row/column bagging; cheap regularization.
min_child_weight / min_data_in_leaf — minimum Hessian sum (XGBoost) or minimum sample count (LightGBM) required in a leaf; higher values raise bias and cut variance.
reg_alpha / reg_lambda — L1/L2 penalties on leaf weights.
n_estimators — pair with early stopping rather than fixing a large number blindly.

See our gradient boosting guide for how these interact during training.

Neural networks and deep learning

Learning rate schedule — constant, cosine decay, or warmup-then-decay; often the single biggest lever.
Batch size — affects gradient noise, memory, and effective learning rate; a common heuristic is to scale the learning rate in proportion to batch size (linear scaling rule).
Optimizer choice — AdamW is a strong default; SGD with momentum can win with careful scheduling.
Dropout and weight decay — regularization for transformers and CNNs.
Architecture width/depth — sometimes a hyperparameter when comparing model sizes under a FLOPs budget.
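The linear scaling rule mentioned in the list above is simple arithmetic. The sketch below assumes a reference run at batch size 256 with learning rate 0.1; both numbers and the helper name are placeholders. Treat the result as a starting point for search, since the heuristic is usually paired with warmup and tends to break down at very large batch sizes.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size


# Quadrupling the batch size quadruples the suggested learning rate.
scaled_lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=1024)
```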

LLM fine-tuning

LoRA rank and target modules — capacity vs memory tradeoff.
Learning rate — typically 1e-5 to 5e-5 for full fine-tune; 1e-4 to 3e-4 for LoRA.
Epochs vs steps — small curated datasets overfit fast; watch eval loss, not train loss.
Batch size and gradient accumulation — effective batch without OOM.
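For the last item, the effective batch size is the per-device batch times the number of devices times the accumulation steps. A small sketch of that arithmetic, with a hypothetical helper name and example numbers:

```python
import math


def accumulation_steps(target_effective_batch, per_device_batch, num_devices=1):
    """Gradient accumulation steps needed to reach a target effective batch."""
    samples_per_step = per_device_batch * num_devices
    return math.ceil(target_effective_batch / samples_per_step)


# Example: memory only allows 4 samples per GPU, on 2 GPUs, and the target is 64.
steps = accumulation_steps(target_effective_batch=64, per_device_batch=4, num_devices=2)
effective_batch = 4 * 2 * steps
```

When the target is not a multiple of the per-step sample count, rounding up overshoots the target slightly, so log the effective batch actually used.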

Search strategies: grid, random, and Bayesian

Exhaustive search over all combinations explodes combinatorially. Three families dominate production workflows.

Grid search

Define a discrete list per hyperparameter and evaluate every combination. Simple to implement and parallelize, but cost grows as O(∏ n_i). Five parameters with four values each is 1,024 training runs. Grid search also wastes budget on dimensions that do not matter — if learning rate dominates and dropout is irrelevant, you still pay for every dropout value at every learning rate.

Use grid search when you have two or three well-understood knobs and a cheap training loop (small tabular models).
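The combinatorics are easy to check with the standard library. The grid below is an illustrative assumption (parameter names borrowed from the boosting section, values arbitrary): five parameters with four values each.

```python
from itertools import product

# Hypothetical grid: five knobs, four candidate values each.
grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.6, 0.7, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 1.0],
    "min_child_weight": [1, 3, 5, 10],
}

names = list(grid)
# Every combination becomes one training run.
configs = [dict(zip(names, values)) for values in product(*grid.values())]

n_runs = len(configs)  # 4 ** 5 = 1024
distinct_learning_rates = len({c["learning_rate"] for c in configs})  # still only 4
```

All 1,024 runs between them try only four distinct learning rates, which is the waste described above when learning rate is the knob that matters.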

Random search

Sample each hyperparameter independently from a defined distribution (uniform, log-uniform for learning rates). Bergstra and Bengio (2012) showed random search finds good regions faster than grid search when only a few dimensions matter — you explore more unique values of the important knobs per fixed budget.

Default choice for most teams: 50–200 random trials with cross-validated scoring often beats a coarse 5×5×5 grid (125 runs) at a comparable budget.
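A minimal random search sketch in plain Python follows. The ranges, the `toy_score` function, and the trial count are assumptions for illustration; in practice `toy_score` would be replaced by a cross-validated training run.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Sample so that every decade between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    return {
        "learning_rate": sample_log_uniform(rng, 1e-3, 3e-1),
        "max_depth": rng.randint(3, 10),
        "subsample": rng.uniform(0.5, 1.0),
    }


def toy_score(config):
    # Hypothetical stand-in for a validation score; peaks near lr = 0.03, depth 6.
    lr_term = (math.log10(config["learning_rate"]) + 1.5) ** 2
    depth_term = 0.05 * (config["max_depth"] - 6) ** 2
    return -lr_term - depth_term


def random_search(n_trials, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((toy_score(config), config))
    return trials


trials = random_search(100)
best_score, best_config = max(trials, key=lambda t: t[0])
```

Each of the 100 trials draws a fresh learning rate, so the important dimension is explored far more densely than a grid with a handful of fixed values would allow.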

Bayesian optimization

Build a surrogate model (Gaussian process, TPE) of score = f(hyperparameters) from past trials. The acquisition function (Expected Improvement, UCB) picks the next point that balances exploration and exploitation. Libraries like Optuna, Hyperopt, and Ray Tune wrap this pattern.

Bayesian search shines when each trial is expensive (deep nets, large boosting runs) and you have 20–100 trials, not 10,000. It adapts to which dimensions actually move the metric.
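To show the surrogate-plus-acquisition loop without depending on any library, here is a toy one-dimensional sketch: a Gaussian process with an RBF kernel as the surrogate and an upper confidence bound as the acquisition function. Everything in it (the kernel length scale, `kappa`, the starting points, and the `toy_objective` standing in for an expensive validation score) is an assumption for illustration. It is not how Optuna or Hyperopt are implemented; their default sampler is TPE.

```python
import math


def rbf(a, b, length_scale=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and variance at x_new, using mean-centered observations."""
    y_mean = sum(ys) / len(ys)
    centered = [y - y_mean for y in ys]
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_new = [rbf(a, x_new) for a in xs]
    alpha = solve(gram, centered)
    beta = solve(gram, k_new)
    mean = y_mean + sum(k * a for k, a in zip(k_new, alpha))
    var = max(rbf(x_new, x_new) - sum(k * b for k, b in zip(k_new, beta)), 0.0)
    return mean, var


def toy_objective(log10_lr):
    # Hypothetical expensive score, best at a learning rate of 1e-2.
    return -(log10_lr + 2.0) ** 2


def bayes_opt(n_iter=8, kappa=2.0):
    candidates = [-4.0 + 0.1 * i for i in range(41)]  # log10(lr) from -4 to 0
    xs = [-4.0, -1.0, 0.0]                            # initial trials
    ys = [toy_objective(x) for x in xs]
    for _ in range(n_iter):
        def ucb(x):
            mean, var = gp_posterior(xs, ys, x)
            return mean + kappa * math.sqrt(var)  # exploit mean, explore variance
        untried = [c for c in candidates if all(abs(c - x) > 1e-9 for x in xs)]
        x_next = max(untried, key=ucb)
        xs.append(x_next)
        ys.append(toy_objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


best_x, best_y, tried = bayes_opt()
```

Each new trial is chosen where the surrogate predicts either a high score or high uncertainty, which is why this style of search can make progress with tens of trials instead of thousands.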

Successive halving and Hyperband

Not every trial deserves a full training budget. Successive halving trains many configurations on a small data subset or for a few epochs, keeps the top half, doubles their budget, and repeats. Hyperband runs multiple halving brackets with different starting populations. ASHA (Asynchronous Successive Halving) in Ray Tune stops bad trials early without waiting for a synchronization barrier.

Pair Hyperband with Bayesian suggesters: Optuna's TPESampler + HyperbandPruner is a common production stack for neural architecture and boosting search.
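The halving loop itself is short. In this sketch `train_and_score` is a hypothetical stand-in for partial training: every configuration has a hidden "quality" ceiling that it approaches as budget grows. Real learning curves can cross, so an early ranking is not guaranteed to hold, which is the risk successive halving accepts in exchange for speed.

```python
import random


def train_and_score(config, budget):
    # Hypothetical partial-training score: approaches config["quality"] with budget.
    return config["quality"] * (1.0 - 0.5 ** budget)


def successive_halving(configs, min_budget=1, rounds=3):
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget per config, number of configs trained) for each round
    for _ in range(rounds):
        ranked = sorted(survivors, key=lambda c: train_and_score(c, budget), reverse=True)
        history.append((budget, len(ranked)))
        survivors = ranked[: max(1, len(ranked) // 2)]  # keep the top half
        budget *= 2                                     # double their budget
    return survivors, history


rng = random.Random(0)
configs = [{"id": i, "quality": rng.random()} for i in range(16)]
survivors, history = successive_halving(configs, rounds=4)
total_budget = sum(budget * count for budget, count in history)
```

With 16 starting configurations and four rounds, the schedule spends 64 budget units in total, compared with 128 if every configuration were trained to the final budget of 8.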

Validation discipline: do not tune on your test set

Hyperparameter search is itself a learning process. If you pick the config with the best validation score across 200 trials, you have implicitly fit to validation noise. The fix is layered validation.

Holdout structure

Train set — used inside each trial to fit weights.
Validation set — used to score each trial and pick hyperparameters.
Test set — touched once, after all tuning, for an unbiased final estimate.

Nested cross-validation

For small datasets, a single validation fold is noisy. Nested CV wraps an inner loop (tune on inner train/val splits) inside an outer loop (report performance on held-out outer folds). The outer score estimates generalization; the inner loop picks hyperparameters per outer train fold. Expensive but gold standard for publications and high-stakes models.
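The two loops can be sketched with index lists alone. The code below assumes the rows are already shuffled (folds are contiguous), and uses a deliberately trivial "model", a shrunken training mean, so it stays self-contained; `shrunk_mean_score` and the candidate values are placeholders for a real fit-and-evaluate function and search space.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv(n_samples, candidates, fit_and_score, outer_k=5, inner_k=3):
    outer_scores = []
    for test_idx in k_fold_indices(n_samples, outer_k):
        held_out = set(test_idx)
        outer_train = [i for i in range(n_samples) if i not in held_out]
        inner_folds = k_fold_indices(len(outer_train), inner_k)

        def inner_score(params):
            # Inner loop: average validation score using only outer-train rows.
            total = 0.0
            for fold in inner_folds:
                val = [outer_train[j] for j in fold]
                val_set = set(val)
                train = [i for i in outer_train if i not in val_set]
                total += fit_and_score(params, train, val)
            return total / inner_k

        best_params = max(candidates, key=inner_score)
        # Outer loop: score the chosen params on rows the inner loop never saw.
        outer_scores.append(fit_and_score(best_params, outer_train, test_idx))
    return outer_scores


# Toy data and model (assumptions for illustration).
ys = [float(i % 7) for i in range(30)]


def shrunk_mean_score(shrink, train_idx, val_idx):
    mean = sum(ys[i] for i in train_idx) / len(train_idx)
    prediction = (1.0 - shrink) * mean
    return -sum((ys[i] - prediction) ** 2 for i in val_idx) / len(val_idx)


outer_scores = nested_cv(len(ys), [0.0, 0.5, 0.9], shrunk_mean_score)
```

The mean of `outer_scores` is the number to report; the hyperparameters picked in each outer fold may differ, and that variation is itself a useful stability signal.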

Leakage traps

Preprocessing on full data — scaling or imputing before the split leaks validation statistics into training. Fit scalers inside each CV fold.
Temporal leakage — random splits on time-series data let the model see the future. Use chronological splits.
Repeated peeking — running 500 trials, noticing the test set underperforms, and tuning again. The test set is then burned; collect fresh holdout data.
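The first two traps have the same shape of fix: decide the split first, then compute anything data-dependent from the training portion only. A minimal sketch, with hypothetical row fields (`timestamp`, `value`) and helper names:

```python
def fit_standardizer(values):
    """Compute mean and standard deviation from training values only."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # avoid dividing by zero on constant features
    return mean, std


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


def chronological_split(rows, train_fraction=0.8):
    """Oldest rows train, newest rows validate; no shuffling."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


rows = [{"timestamp": t, "value": float(t * t)} for t in range(10)]
train_rows, val_rows = chronological_split(rows)

# Statistics come from the training rows; validation rows are only transformed.
mean, std = fit_standardizer([r["value"] for r in train_rows])
train_scaled = transform([r["value"] for r in train_rows], mean, std)
val_scaled = transform([r["value"] for r in val_rows], mean, std)
```

Inside cross-validation the same pattern repeats per fold: refit the standardizer on each fold's training rows rather than once on the full dataset.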

Our cross-validation guide covers stratified, group, and time-series CV patterns in depth.

Early stopping as implicit hyperparameter search

n_estimators in boosting and epochs in deep learning are often better treated as upper bounds than fixed values. Early stopping monitors validation loss each round or epoch and halts when the loss has not improved for a set number of consecutive steps (the patience).

This turns training duration into a learned hyperparameter without a separate search loop. Best practice for gradient boosting:

Set n_estimators high (e.g. 5,000).
Set early_stopping_rounds (e.g. 50).
Pass a validation set to fit().
Use best_iteration for inference.

For neural nets, save checkpoints on best validation metric, not last epoch. Learning rate schedulers (cosine, one-cycle) interact with early stopping — tune them together, not in isolation.
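The patience logic is independent of any particular library. In the sketch below a precomputed list of validation losses stands in for evaluating a model after each round (an assumption made to keep the example self-contained), and the list length plays the role of the upper bound on rounds. Round indices are zero-based.

```python
def early_stopping_run(val_losses, patience):
    """Return (best_round, best_loss, rounds_run) for a validation loss trace."""
    best_loss = float("inf")
    best_round = None
    stale_rounds = 0
    rounds_run = 0
    for round_idx, loss in enumerate(val_losses):
        rounds_run = round_idx + 1
        if loss < best_loss:
            # New best: remember it (a real trainer would checkpoint here).
            best_loss, best_round, stale_rounds = loss, round_idx, 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                break  # no improvement for `patience` consecutive rounds
    return best_round, best_loss, rounds_run


# Hypothetical loss trace with a small bump before the true minimum.
val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47, 0.48]

patient = early_stopping_run(val_losses, patience=3)    # rides out the bump
impatient = early_stopping_run(val_losses, patience=2)  # stops before the minimum
```

With a patience of 3 the run reaches the minimum at round 6; with a patience of 2 it stops at the bump and returns round 3 instead. Patience is therefore a hyperparameter in its own right, and noisy validation curves call for larger values.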

Multi-objective and constrained tuning

Production models rarely optimize a single number. You may need high recall, sub-50 ms latency, and a model under 100 MB, all at once.

Pareto fronts — Optuna's multi-objective studies return non-dominated configs; product picks from the frontier.
Constrained optimization — maximize AUC subject to latency < 50ms; prune trials that violate constraints early.
Threshold tuning — hyperparameter search finds the model; a separate pass tunes classification threshold on validation for target precision/recall.
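The first two items reduce to simple filters over logged trials. The sketch below assumes each trial record carries an `auc` and a `latency_ms` field; the trial values are made up for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and better on at least one."""
    no_worse = a["auc"] >= b["auc"] and a["latency_ms"] <= b["latency_ms"]
    better = a["auc"] > b["auc"] or a["latency_ms"] < b["latency_ms"]
    return no_worse and better


def pareto_front(trials):
    """Trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(other, t) for other in trials)]


def best_under_constraint(trials, max_latency_ms):
    """Highest-AUC trial with latency strictly below the limit, or None."""
    feasible = [t for t in trials if t["latency_ms"] < max_latency_ms]
    return max(feasible, key=lambda t: t["auc"]) if feasible else None


# Hypothetical logged trials.
trials = [
    {"name": "a", "auc": 0.890, "latency_ms": 80.0},
    {"name": "b", "auc": 0.887, "latency_ms": 40.0},
    {"name": "c", "auc": 0.870, "latency_ms": 45.0},
    {"name": "d", "auc": 0.860, "latency_ms": 20.0},
]

front = [t["name"] for t in pareto_front(trials)]
choice = best_under_constraint(trials, max_latency_ms=50.0)
```

Trial "c" drops out because "b" is both more accurate and faster. Under a 50 ms limit the choice is "b", which gives up 0.003 AUC relative to "a" for half the latency.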

Log inference latency and model size alongside validation metrics during every trial. A 0.3% AUC gain that doubles serving cost is often a bad trade.

Production workflow and reproducibility

Tuning without tracking is archaeology. Standardize on:

Experiment tracking — MLflow, Weights & Biases, or Neptune; log params, metrics, artifacts, git SHA, data version.
Seeds — fix random seeds per trial for reproducibility; still run 3–5 seeds for the final selected config to estimate variance.
Search space as code — version-control Optuna suggest_float ranges in git, not notebook cells.
Promotion gates — best trial on validation must beat production baseline on a frozen shadow dataset before deploy.
Drift awareness — hyperparameters tuned on 2024 data may need re-search when concept drift shifts the problem.
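The seed advice can be sketched as follows. `run_trial` is a hypothetical stand-in for a full training run whose score varies with the seed; the noise level and base score are invented for illustration.

```python
import random
import statistics


def run_trial(config, seed):
    # Hypothetical training run: same config and seed always give the same score.
    rng = random.Random(seed)
    return config["base_score"] + rng.gauss(0.0, 0.005)


def evaluate_across_seeds(config, seeds):
    """Mean and sample standard deviation of the score over several seeds."""
    scores = [run_trial(config, seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


final_config = {"base_score": 0.89}
seeds = [0, 1, 2, 3, 4]
mean_score, std_score = evaluate_across_seeds(final_config, seeds)
```

Logging the seed list next to the config makes the estimate repeatable, and the standard deviation gives a yardstick for deciding whether a gain over the production baseline is larger than seed noise.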

Common failure modes

Tuning on the test set — optimistic metrics that collapse in production.
Too few trials — drawing five random samples and declaring victory; at that sample size the apparent signal is mostly noise.
Log-scale mistakes — sampling learning rate uniformly between 0.001 and 0.1 puts about 91% of trials in 0.01–0.1 and only about 9% in 0.001–0.01; use log-uniform so each decade gets equal coverage.
Ignoring interaction effects — high learning rate needs fewer trees; tune jointly, not one-at-a-time.
Stale search space — ranges copied from a blog post for a different dataset size.
No baseline — spending a week tuning a neural net that still loses to default LightGBM on tabular data.
Chasing validation noise — 0.001 AUC differences across trials; pick simpler config or collect more validation data.
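The log-scale mistake in the list above is easy to quantify. This short calculation (helper names are illustrative) compares how much of the search budget lands in the lower decade of a learning rate range from 0.001 to 0.1 under each sampling scheme.

```python
import math


def uniform_share(low, high, a, b):
    """Fraction of uniform samples on [low, high] that fall in [a, b]."""
    return (b - a) / (high - low)


def log_uniform_share(low, high, a, b):
    """Fraction of log-uniform samples on [low, high] that fall in [a, b]."""
    return (math.log(b) - math.log(a)) / (math.log(high) - math.log(low))


# Share of trials that land in the lower decade, 0.001 to 0.01.
uniform_low_decade = uniform_share(0.001, 0.1, 0.001, 0.01)       # about 0.09
log_low_decade = log_uniform_share(0.001, 0.1, 0.001, 0.01)       # 0.5
```

Uniform sampling gives the lower decade roughly one trial in eleven; log-uniform sampling splits trials evenly between the two decades.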

Production checklist

Define train / validation / test splits before any search; document split logic (random, temporal, group).
Establish a strong default baseline (library defaults or last production config).
Choose search method by trial cost: grid (cheap, few dims), random (general), Bayesian + pruner (expensive trials).
Use log-uniform sampling for learning rates and regularization strengths.
Enable early stopping for iterative trainers; treat round/epoch count as a bound, not a target.
Log every trial's params, metrics, duration, and model artifact to a tracking server.
Evaluate the winning config on the untouched test set once; report confidence intervals across seeds.
Document promoted hyperparameters in model cards and deployment configs.
Schedule periodic re-tuning when drift monitors flag performance decay.
Compare tuned model against baseline on business KPIs, not only offline metrics.

Key takeaways

Hyperparameters control how models learn; bad defaults waste data and compute.
Random search beats coarse grid search on the same budget when few dimensions matter.
Bayesian optimization + early pruning is the right tool when each trial is expensive.
Nested validation keeps your reported score honest when hyperparameters could otherwise be fit to a lucky fold.
Early stopping is free hyperparameter tuning for training duration.
Track, version, and re-tune — tuning is not a one-time notebook exercise.