Guide

Bayesian optimization explained

Harbor Analytics needed to tune a gradient-boosted defect classifier with eight hyperparameters — tree depth, learning rate, min_child_weight, subsample ratio, column sampling, L1/L2 regularization, and class-weight scale. A full grid over coarse steps would need 3,000+ training runs at twelve minutes each — nearly a month of GPU time. Random search would cut that, but still waste trials in flat regions of the space. They switched to Bayesian optimization: after twenty warm-up trials, each new configuration was chosen by a probabilistic model of “which hyperparameters are most likely to beat our current best validation F1.” Forty trials later they matched the grid’s best score and stopped. Bayesian optimization (BO) is a sequential strategy for tuning expensive, noisy black-box functions when each evaluation costs real time or money. It builds a surrogate model of the objective, quantifies uncertainty, and picks the next point to evaluate by maximizing an acquisition function that balances exploration and exploitation. This guide explains the GP + acquisition-function loop, TPE as a scalable alternative, multi-fidelity extensions, a Harbor Analytics classifier worked example, a method decision table, common pitfalls, and a production checklist.

What Bayesian optimization is

You want to maximize (or minimize) an unknown function f(x) over a bounded search space x ∈ X. Each evaluation of f is expensive: training a neural network, running a physics simulation, A/B testing a product change. You cannot take gradients of f because it is a black box — you submit x, wait, and receive a noisy scalar score y = f(x) + ε.

BO maintains a surrogate model — typically a Gaussian process (GP) — that estimates both the mean and uncertainty of f(x) at unseen points given observations so far. The next configuration is chosen by maximizing an acquisition function that rewards high predicted values and high uncertainty (regions worth exploring). After evaluating that point, the surrogate updates and the loop repeats.

The “Bayesian” label reflects updating beliefs about f as data arrive — the same framing as Bayesian inference, though BO is a decision procedure, not posterior estimation for its own sake. For low-dimensional, moderate-budget tuning it routinely beats grid and random search; it is the engine behind TPE samplers in Optuna and GP-based tools like BoTorch and scikit-optimize.

The algorithm in five steps

Define the search space — continuous ranges (learning_rate ∈ [1e-4, 0.3] log-uniform), integers (max_depth ∈ {3,…,12}), categoricals (booster ∈ {gbtree, dart}).
Initialize with n random or quasi-random (Sobol/Latin hypercube) evaluations to seed the surrogate.
Fit a surrogate on all (xₙ, yₙ) pairs. A GP places a prior over functions and yields a posterior mean μ(x) and variance σ²(x) at any candidate x.
Optimize an acquisition function over X to pick x_next — not by retraining, but by querying the cheap surrogate millions of times.
Evaluate f(x_next) (train + validate), append the result, refit the surrogate, and loop until budget or convergence.

The surrogate is cheap; the true f is expensive. BO spends compute on modeling so it spends less on bad trials.

Gaussian process surrogates

A Gaussian process is a distribution over functions. Pick any finite set of inputs and the function values are jointly Gaussian. A kernel (covariance function) encodes smoothness: the Matérn kernel is common for HPO because it is less rigid than squared exponential. Hyperparameters of the kernel itself (length scales) are often fit by maximizing marginal likelihood on observed data.

GPs shine in low to moderate dimension (roughly 2–20 effective dimensions) with tens to a few hundred observations. Strengths:

Well-calibrated uncertainty away from observed points.
Closed-form posterior updates after each new observation (though refitting kernel hyperparameters is still iterative).
Natural handling of noisy observations via a noise term in the kernel.

Weaknesses: cubic scaling in the number of observations (refit cost grows), curse of dimensionality in raw feature count, and awkwardness with many categorical variables unless encoded carefully. That is why production HPO libraries often use TPE (Tree-structured Parzen Estimator) for higher-dimensional spaces — it models P(x | y < threshold) vs P(x | y ≥ threshold) with kernel density trees instead of a full GP over f.

Acquisition functions: exploration vs exploitation

The acquisition function α(x) scores how valuable it would be to evaluate x next. Three classics:

Expected Improvement (EI)

EI measures the expected gain over the current best observed value f⁺, accounting for uncertainty. It is zero at points already sampled (no duplicate trials) and high where the posterior mean is good or variance is large. EI is the default in many GP-BO libraries because it is intuitive and empirically strong.

Upper Confidence Bound (UCB)

UCB picks argmax μ(x) + κσ(x). The exploration weight κ trades off mean vs uncertainty explicitly. Large κ explores; small κ exploits. UCB connects BO to multi-armed bandit algorithms — each hyperparameter vector is an arm with unknown reward.

Probability of Improvement (PI)

PI maximizes the chance of beating f⁺ by any amount. It over-exploits compared to EI (too greedy in practice) but is useful when noise is high and any marginal gain matters.

Modern stacks add batch acquisition (evaluate several points per round on parallel GPUs) and entropy search for multi-objective tuning (accuracy vs latency). The principle is unchanged: spend the next expensive evaluation where the surrogate is most informative.

Worked example: Harbor Analytics defect classifier

Harbor’s pipeline: XGBoost on tabular sensor features, optimized for validation F1 on a fixed stratified split. Search space:

max_depth: 3–10 (integer)
learning_rate: 0.01–0.3 (log-uniform)
subsample, colsample_bytree: 0.5–1.0
min_child_weight: 1–20
reg_alpha, reg_lambda: 1e-8–10 (log-uniform)
scale_pos_weight: 1–15 (class imbalance knob)

Setup: Optuna study with TPE sampler, 20 random startup trials, then 40 TPE-guided trials. Median pruner kills trials whose intermediate F1 trails the study median at epoch 50 of boosting rounds.

Trajectory: Random startup found F1 = 0.847 by trial 11. TPE trials 22–28 explored high scale_pos_weight with shallow trees — poor scores but informative for the density model. Trial 31 combined depth 6, learning rate 0.06, and scale_pos_weight 8.2 for F1 = 0.871. Trials 32–60 clustered nearby (exploitation) with no improvement; study stopped at budget.

Outcome: Best F1 0.871 in 60 trials (~12 GPU-hours) vs projected 600+ hours for a naive grid. The winning config was not in the team’s manual shortlist — BO found a interaction between class weight and shallow trees that grid templates missed.

Multi-fidelity and early stopping

Pure BO assumes each evaluation has one cost. Training often allows cheap partial evaluations: fewer boosting rounds, smaller data subsample, fewer epochs. Multi-fidelity BO (Hyperband, BOHB, Successive Halving) allocates more budget only to promising configs.

Pattern: run many configs at low fidelity (10% data, 50 trees), keep the top third, promote survivors to medium fidelity, repeat. The surrogate sees low-fidelity scores as biased but correlated proxies for the true metric. Combine with pruning (as in Optuna’s MedianPruner) to stop hopeless trials mid-training.

This is the practical bridge between BO theory and hyperparameter tuning workflows on real clusters: Bayesian logic for which configs to try, bandit-style scheduling for how long to train each one.

Method decision table

Scenario	Recommended approach	Why
1–3 hyperparameters, cheap evaluations	Grid search or exhaustive sweep	BO overhead not worth it; full coverage is feasible
4–15 dims, expensive training, <200 trials	GP-BO (EI/UCB) or TPE	Sequential learning beats random; TPE scales better in dimension
20+ dims, mixed categorical/continuous	TPE, CMA-ES, or random + early stopping	GP surrogates degrade; density-ratio methods handle categoricals
Parallel GPU cluster (8+ workers)	TPE + asynchronous constant-liar or batch EI	Fill workers without waiting for sequential GP refits
Training supports cheap partial metrics	BOHB / Hyperband + MedianPruner	Multi-fidelity cuts wall-clock dramatically
Two objectives (accuracy + latency)	Multi-objective BO (Pareto front)	Single scalarization hides tradeoffs stakeholders need
Online product knob tuning	Contextual bandits or Thompson sampling	Non-stationary traffic; BO assumes fixed `f`

Common pitfalls

Validation leakage — tuning on the test set, or refitting preprocessors on the full dataset inside each trial, inflates scores. Keep a locked holdout; use nested CV for final reporting.
Noisy objectives — single-seed neural training is noisy; BO chases variance. Fix seeds, average 2–3 runs for finalists, or model noise explicitly in the GP.
Too few startup trials — TPE/GP with <10 points extrapolates badly; use at least 10–20 random trials first.
Wrong scaling — learning rates and regularization should be log-uniform; linear sampling wastes trials at uninformative extremes.
Ignoring trial failures — OOM crashes and NaN losses should be recorded as worst-case penalties, not dropped silently (surrogate needs to learn forbidden regions).
Over-tuning validation — 500 BO trials on one validation fold overfits that fold. Cap budget, refresh folds, or use nested cross-validation for the final config pick.
High-dimensional spaces — BO cannot rescue 50 irrelevant features; reduce dimension with domain knowledge or feature selection first.

Production checklist

Define the objective (maximize F1, minimize RMSE + latency penalty) and the evaluation protocol before launching trials.
Specify search space with sensible bounds from literature or coarse random probe — not unbounded.
Log-uniform sample rates, regularization, and small positive hyperparameters.
Seed 10–20 random trials; then switch to TPE or GP-BO.
Enable pruning / multi-fidelity when intermediate metrics are available.
Fix data splits and preprocessing pipelines across trials for fair comparison.
Store every trial in a database (Optuna RDB, MLflow) with params, metrics, and artifacts.
Set a hard trial budget; stop early if improvement plateaus for N consecutive trials.
Re-evaluate the winner on a fresh holdout or nested CV before production.
Document the chosen config and the search range so future retrains do not re-tune from scratch unnecessarily.

Key takeaways

BO models the objective with a cheap surrogate and picks the next expensive evaluation strategically.
Acquisition functions formalize exploration vs exploitation — EI and UCB are the workhorses.
TPE scales to the mixed spaces common in ML tooling; GP-BO is strongest in low dimensions.
Multi-fidelity scheduling pairs BO with early stopping for real training budgets.
BO saves compute, not discipline — validation hygiene still determines whether the best trial generalizes.