Guide
Bayesian optimization explained
Harbor Analytics needed to tune a gradient-boosted defect classifier with
eight hyperparameters — tree depth, learning rate, min_child_weight,
subsample ratio, column sampling, L1/L2 regularization, and class-weight scale.
A full grid over coarse steps would need 3,000+ training runs at
twelve minutes each — nearly a month of GPU time. Random search would cut
that, but still waste trials in flat regions of the space. They switched to
Bayesian optimization: after twenty warm-up trials, each new
configuration was chosen by a probabilistic model of “which hyperparameters
are most likely to beat our current best validation F1.” Forty trials later
they matched the grid’s best score and stopped. Bayesian optimization
(BO) is a sequential strategy for tuning expensive, noisy black-box
functions when each evaluation costs real time or money. It builds a surrogate
model of the objective, quantifies uncertainty, and picks the next point to
evaluate by maximizing an acquisition function that balances exploration
and exploitation. This guide explains the GP + acquisition-function loop, TPE
as a scalable alternative, multi-fidelity extensions, a Harbor Analytics classifier
worked example, a method decision table, common pitfalls, and a production checklist.
What Bayesian optimization is
You want to maximize (or minimize) an unknown function
f(x) over a bounded search space x ∈ X.
Each evaluation of f is expensive: training a neural network,
running a physics simulation, A/B testing a product change. You cannot take
gradients of f because it is a black box — you submit
x, wait, and receive a noisy scalar score y = f(x) + ε.
BO maintains a surrogate model — typically a Gaussian
process (GP) — that estimates both the mean and uncertainty of
f(x) at unseen points given observations so far. The next
configuration is chosen by maximizing an acquisition function
that rewards high predicted values and high uncertainty (regions worth
exploring). After evaluating that point, the surrogate updates and the loop
repeats.
The “Bayesian” label reflects updating beliefs about
f as data arrive — the same framing as
Bayesian inference,
though BO is a decision procedure, not posterior estimation for its own sake.
For low-dimensional, moderate-budget tuning it routinely beats grid and random
search; it is the engine behind TPE samplers in
Optuna
and GP-based tools like BoTorch and scikit-optimize.
The algorithm in five steps
- Define the search space — continuous ranges
(
learning_rate ∈ [1e-4, 0.3]log-uniform), integers (max_depth ∈ {3,…,12}), categoricals (booster ∈ {gbtree, dart}). - Initialize with n random or quasi-random (Sobol/Latin hypercube) evaluations to seed the surrogate.
- Fit a surrogate on all
(xₙ, yₙ)pairs. A GP places a prior over functions and yields a posterior meanμ(x)and varianceσ²(x)at any candidatex. - Optimize an acquisition function over
Xto pickxnext— not by retraining, but by querying the cheap surrogate millions of times. - Evaluate
f(xnext)(train + validate), append the result, refit the surrogate, and loop until budget or convergence.
The surrogate is cheap; the true f is expensive. BO spends
compute on modeling so it spends less on bad trials.
Gaussian process surrogates
A Gaussian process is a distribution over functions. Pick any finite set of inputs and the function values are jointly Gaussian. A kernel (covariance function) encodes smoothness: the Matérn kernel is common for HPO because it is less rigid than squared exponential. Hyperparameters of the kernel itself (length scales) are often fit by maximizing marginal likelihood on observed data.
GPs shine in low to moderate dimension (roughly 2–20 effective dimensions) with tens to a few hundred observations. Strengths:
- Well-calibrated uncertainty away from observed points.
- Closed-form posterior updates after each new observation (though refitting kernel hyperparameters is still iterative).
- Natural handling of noisy observations via a noise term in the kernel.
Weaknesses: cubic scaling in the number of observations (refit cost grows),
curse of dimensionality in raw feature count, and awkwardness with many
categorical variables unless encoded carefully. That is why production HPO
libraries often use TPE (Tree-structured Parzen Estimator)
for higher-dimensional spaces — it models
P(x | y < threshold) vs P(x | y ≥ threshold)
with kernel density trees instead of a full GP over f.
Acquisition functions: exploration vs exploitation
The acquisition function α(x) scores how valuable it would
be to evaluate x next. Three classics:
Expected Improvement (EI)
EI measures the expected gain over the current best observed value
f+, accounting for uncertainty. It is zero at points
already sampled (no duplicate trials) and high where the posterior mean is
good or variance is large. EI is the default in many GP-BO libraries
because it is intuitive and empirically strong.
Upper Confidence Bound (UCB)
UCB picks argmax μ(x) + κσ(x). The exploration
weight κ trades off mean vs uncertainty explicitly.
Large κ explores; small κ exploits.
UCB connects BO to
multi-armed bandit
algorithms — each hyperparameter vector is an arm with unknown reward.
Probability of Improvement (PI)
PI maximizes the chance of beating f+ by any amount.
It over-exploits compared to EI (too greedy in practice) but is useful when
noise is high and any marginal gain matters.
Modern stacks add batch acquisition (evaluate several points per round on parallel GPUs) and entropy search for multi-objective tuning (accuracy vs latency). The principle is unchanged: spend the next expensive evaluation where the surrogate is most informative.
Worked example: Harbor Analytics defect classifier
Harbor’s pipeline: XGBoost on tabular sensor features, optimized for validation F1 on a fixed stratified split. Search space:
max_depth: 3–10 (integer)learning_rate: 0.01–0.3 (log-uniform)subsample,colsample_bytree: 0.5–1.0min_child_weight: 1–20reg_alpha,reg_lambda: 1e-8–10 (log-uniform)scale_pos_weight: 1–15 (class imbalance knob)
Setup: Optuna study with TPE sampler, 20 random startup trials, then 40 TPE-guided trials. Median pruner kills trials whose intermediate F1 trails the study median at epoch 50 of boosting rounds.
Trajectory: Random startup found F1 = 0.847 by trial 11.
TPE trials 22–28 explored high scale_pos_weight with shallow
trees — poor scores but informative for the density model. Trial 31
combined depth 6, learning rate 0.06, and scale_pos_weight 8.2
for F1 = 0.871. Trials 32–60 clustered nearby (exploitation) with no
improvement; study stopped at budget.
Outcome: Best F1 0.871 in 60 trials (~12 GPU-hours) vs projected 600+ hours for a naive grid. The winning config was not in the team’s manual shortlist — BO found a interaction between class weight and shallow trees that grid templates missed.
Multi-fidelity and early stopping
Pure BO assumes each evaluation has one cost. Training often allows cheap partial evaluations: fewer boosting rounds, smaller data subsample, fewer epochs. Multi-fidelity BO (Hyperband, BOHB, Successive Halving) allocates more budget only to promising configs.
Pattern: run many configs at low fidelity (10% data, 50 trees), keep the top third, promote survivors to medium fidelity, repeat. The surrogate sees low-fidelity scores as biased but correlated proxies for the true metric. Combine with pruning (as in Optuna’s MedianPruner) to stop hopeless trials mid-training.
This is the practical bridge between BO theory and hyperparameter tuning workflows on real clusters: Bayesian logic for which configs to try, bandit-style scheduling for how long to train each one.
Method decision table
| Scenario | Recommended approach | Why |
|---|---|---|
| 1–3 hyperparameters, cheap evaluations | Grid search or exhaustive sweep | BO overhead not worth it; full coverage is feasible |
| 4–15 dims, expensive training, <200 trials | GP-BO (EI/UCB) or TPE | Sequential learning beats random; TPE scales better in dimension |
| 20+ dims, mixed categorical/continuous | TPE, CMA-ES, or random + early stopping | GP surrogates degrade; density-ratio methods handle categoricals |
| Parallel GPU cluster (8+ workers) | TPE + asynchronous constant-liar or batch EI | Fill workers without waiting for sequential GP refits |
| Training supports cheap partial metrics | BOHB / Hyperband + MedianPruner | Multi-fidelity cuts wall-clock dramatically |
| Two objectives (accuracy + latency) | Multi-objective BO (Pareto front) | Single scalarization hides tradeoffs stakeholders need |
| Online product knob tuning | Contextual bandits or Thompson sampling | Non-stationary traffic; BO assumes fixed f |
Common pitfalls
- Validation leakage — tuning on the test set, or refitting preprocessors on the full dataset inside each trial, inflates scores. Keep a locked holdout; use nested CV for final reporting.
- Noisy objectives — single-seed neural training is noisy; BO chases variance. Fix seeds, average 2–3 runs for finalists, or model noise explicitly in the GP.
- Too few startup trials — TPE/GP with <10 points extrapolates badly; use at least 10–20 random trials first.
- Wrong scaling — learning rates and regularization should be log-uniform; linear sampling wastes trials at uninformative extremes.
- Ignoring trial failures — OOM crashes and NaN losses should be recorded as worst-case penalties, not dropped silently (surrogate needs to learn forbidden regions).
- Over-tuning validation — 500 BO trials on one validation fold overfits that fold. Cap budget, refresh folds, or use nested cross-validation for the final config pick.
- High-dimensional spaces — BO cannot rescue 50 irrelevant features; reduce dimension with domain knowledge or feature selection first.
Production checklist
- Define the objective (maximize F1, minimize RMSE + latency penalty) and the evaluation protocol before launching trials.
- Specify search space with sensible bounds from literature or coarse random probe — not unbounded.
- Log-uniform sample rates, regularization, and small positive hyperparameters.
- Seed 10–20 random trials; then switch to TPE or GP-BO.
- Enable pruning / multi-fidelity when intermediate metrics are available.
- Fix data splits and preprocessing pipelines across trials for fair comparison.
- Store every trial in a database (Optuna RDB, MLflow) with params, metrics, and artifacts.
- Set a hard trial budget; stop early if improvement plateaus for N consecutive trials.
- Re-evaluate the winner on a fresh holdout or nested CV before production.
- Document the chosen config and the search range so future retrains do not re-tune from scratch unnecessarily.
Key takeaways
- BO models the objective with a cheap surrogate and picks the next expensive evaluation strategically.
- Acquisition functions formalize exploration vs exploitation — EI and UCB are the workhorses.
- TPE scales to the mixed spaces common in ML tooling; GP-BO is strongest in low dimensions.
- Multi-fidelity scheduling pairs BO with early stopping for real training budgets.
- BO saves compute, not discipline — validation hygiene still determines whether the best trial generalizes.
Related reading
- Hyperparameter tuning explained — grid vs random vs Bayesian search in the full HPO workflow
- Optuna fundamentals explained — TPE samplers, pruners, and distributed studies in practice
- Bayesian inference explained — priors, posteriors, and the probabilistic framing BO inherits
- Multi-armed bandits explained — online exploration algorithms related to UCB acquisition