Optuna fundamentals explained
A team trains an XGBoost classifier with twelve hyperparameters. Grid search over coarse ranges (for example, two values per hyperparameter, 2^12 combinations) needs 4,096 fits; random search wastes budget on obviously bad combinations; and half the trials train for 500 trees when 50 would have revealed they cannot beat the current champion. Optuna is an open-source hyperparameter optimization framework that replaces brute force with adaptive sampling (Tree-structured Parzen Estimator, CMA-ES), early pruning (Median, Hyperband), and a define-by-run API in which the search space is built as the objective runs, so later suggestions can depend on values suggested earlier in the same trial. It integrates with scikit-learn, PyTorch, LightGBM, and MLflow for experiment logging. This guide covers studies and trials, samplers and pruners, storage backends, multi-objective optimization, a Harbor Analytics defect classifier worked example, a tooling decision table, common pitfalls, and a practitioner checklist, building on our broader hyperparameter tuning overview.
Core concepts: studies, trials, and objectives
Optuna organizes search around three primitives:
- Study — a named optimization session with a direction (minimize or maximize), sampler, pruner, and optional storage backend. One study answers one question: “What hyperparameters maximize validation AUC for this dataset?”
- Trial — a single evaluation of one hyperparameter configuration. The objective(trial) function suggests parameters, trains a model, and returns a scalar metric (or a tuple for multi-objective studies).
- Objective function — your training code wrapped to accept an optuna.Trial object and return the metric Optuna should optimize.
The minimal pattern:
    import optuna

    def objective(trial):
        lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
        depth = trial.suggest_int("max_depth", 3, 12)
        # ... train and evaluate ...
        return val_auc

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)
    print(study.best_params, study.best_value)
Unlike sklearn’s GridSearchCV, which requires a fixed
parameter grid upfront, Optuna’s define-by-run style lets
you branch: suggest a large hidden_size only when
architecture == "mlp", or skip expensive data-augmentation knobs
when a cheap baseline already fails. That flexibility matters for neural
architectures and conditional pipelines where the search space is not a
rectangular grid.
Suggesting parameters: distributions and constraints
trial.suggest_* methods define the search space per trial:
- suggest_float(name, low, high, log=True) — uniform or log-uniform continuous values; use log=True for learning rates and regularization strengths that span orders of magnitude.
- suggest_int(name, low, high, step=2) — discrete integers (tree depth, layer counts).
- suggest_categorical(name, choices) — enums like ["gbtree", "dart"] booster types or optimizer names.
- suggest_discrete_uniform / suggest_loguniform — legacy methods deprecated since Optuna 3.0; prefer suggest_float with step=... or log=True respectively.
Parameter constraints and relationships
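Use optuna.samplers.TPESampler(constant_liar=True) for parallel
workers so concurrent trials do not pile onto the same region, and enforce
relationships between parameters in code:
    assert trial.params["min_child_weight"] <= trial.params["max_depth"] * 10
or reject invalid combos by raising optuna.TrialPruned(), which records the
trial as pruned without stopping the study. (A failed assert raises
AssertionError and aborts study.optimize unless you pass
catch=(AssertionError,).) For constraints that couple several parameters (e.g.
lr * batch_size < 1.0), Optuna 3.x samplers such as TPESampler and
NSGAIISampler accept a constraints_func argument. It is passed to the sampler,
not the study, and a trial counts as feasible when every returned value is
<= 0. Document valid regions in your
objective docstring so future maintainers know which suggestions are legal.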
Samplers: from random search to Bayesian optimization
The sampler decides which hyperparameters to try next given completed trials:
- TPESampler (default) — Tree-structured Parzen Estimator; models good vs bad regions of the search space separately and samples where the “good” density is high. Strong default for mixed continuous/categorical spaces with 50–500 trials.
- CmaEsSampler — covariance matrix adaptation; excels on continuous spaces under ~20 dimensions when trials are expensive. Poor fit for heavy categorical branching.
- RandomSampler — baseline and ablation; useful for verifying TPE actually beats random on your problem size.
- NSGAIISampler / TPESampler — multi-objective Pareto fronts when you optimize accuracy and latency jointly. TPESampler handles multi-objective studies directly; the separate MOTPESampler is deprecated.
- GridSampler / BruteForceSampler — exhaustive small grids when you truly need complete coverage.
Set explicitly when defaults misbehave:
optuna.create_study(sampler=optuna.samplers.TPESampler(seed=42, n_startup_trials=20)).
The n_startup_trials knob runs pure random exploration before TPE
builds density models — raise it when the search space is wide or noisy.
Pruners: stop bad trials early
Training 500 epochs for every trial is wasteful when validation loss diverges by epoch 12. Pruners compare intermediate metrics across trials and terminate unpromising ones:
- MedianPruner — prunes trials worse than the median of previous trials at the same step. Simple default for iterative training.
- HyperbandPruner / SuccessiveHalvingPruner — allocate more budget to promising trials; ideal when epoch count or dataset subset size is the resource dimension.
- PercentilePruner — prune below a configurable percentile at each step.
- NopPruner — disable pruning for fast, single-shot estimators (small sklearn fits).
Report intermediate values inside training loops:
    for epoch in range(100):
        val_loss = train_one_epoch(...)
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
Pruning interacts with cross-validation: report the running mean of the fold
metrics with the fold index as the step, rather than per-epoch metrics from
folds of different sizes. For gradient boosting, report the validation metric
at a fixed interval of boosting rounds (for example, every
early_stopping_rounds trees) so Hyperband can stop doomed configurations cheaply.
Storage, parallelism, and dashboards
By default Optuna stores studies in memory — fine for notebooks, lost on restart. Production setups use RDB storage:
    storage = "postgresql://optuna:***@db.internal/optuna"
    study = optuna.create_study(
        study_name="harbor-defect-v4",
        storage=storage,
        load_if_exists=True,
        direction="maximize",
    )
Multiple workers call study.optimize(objective, n_trials=25)
concurrently; the storage backend coordinates trial allocation. SQLite works for
single-machine parallelism with storage="sqlite:///optuna.db" but
serializes writes — prefer PostgreSQL or MySQL beyond a handful of
workers.
Visualization and callbacks
optuna.visualization.plot_optimization_history,
plot_param_importances, and plot_parallel_coordinate
render in Jupyter via Plotly. The optional Optuna Dashboard
(optuna-dashboard package) serves a web UI over RDB storage for
teams without a separate experiment tracker. Callbacks like
MLflowCallback log each trial as an MLflow run automatically,
bridging search and registry workflows in
MLOps pipelines.
Framework integrations
Optuna ships integration modules that reduce boilerplate:
- LightGBM / XGBoost / CatBoost — optuna.integration.LightGBMPruningCallback wires pruning into training callbacks natively; XGBoostPruningCallback and CatBoostPruningCallback do the same for the other two libraries.
- PyTorch Lightning — PyTorchLightningPruningCallback monitors validation loss per epoch.
- Keras — KerasPruningCallback for model.fit loops.
- sklearn — wrap estimators manually or use optuna.integration.sklearn helpers; pair with cross_val_score inside the objective for stable metrics.
For sklearn pipelines, return the mean out-of-fold metric from
StratifiedKFold rather than a single train/val split —
otherwise Optuna overfits the holdout fold you accidentally tuned on. Nested
cross-validation (outer loop for final estimate, inner loop inside objective) is
expensive but mandatory when reporting unbiased performance to stakeholders.
Worked example: Harbor Analytics defect classifier
Harbor Supply photographs conveyor belts for surface defects. The vision team trains a ResNet-18 classifier on 40,000 labeled crops with heavy class imbalance (defects are 2.3% of frames). They need better recall at fixed false-positive rate, not just raw accuracy.
Study design
- Study: harbor-defect-resnet18-v4, direction maximize, metric = validation PR-AUC on a held-out week of production images.
- Sampler: TPESampler(n_startup_trials=30, seed=7)
- Pruner: HyperbandPruner(min_resource=5, max_resource=50, reduction_factor=3) where resource = training epochs.
- Storage: PostgreSQL shared across four GPU workers, 200 trials budget.
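To see what the pruner settings above imply, the following sketch computes the epochs at which a single successive-halving bracket makes its keep-or-stop decisions. It is a simplification: Optuna's HyperbandPruner runs several brackets with different starting rungs and prunes asynchronously, so the helper function and the cohort of 27 trials are illustrative assumptions, not the library's internals.

```python
def successive_halving_rungs(min_resource, max_resource, reduction_factor):
    """Epochs at which one bracket compares trials, starting at min_resource."""
    rungs = []
    resource = min_resource
    while resource <= max_resource:
        rungs.append(resource)
        resource *= reduction_factor
    return rungs


rungs = successive_halving_rungs(min_resource=5, max_resource=50, reduction_factor=3)
print(rungs)  # [5, 15, 45]

# Idealized survivor counts: roughly 1 in reduction_factor passes each rung.
cohort = 27  # hypothetical number of trials entering the bracket
survivors = [cohort // (3 ** i) for i in range(len(rungs))]
for epoch, alive in zip(rungs, survivors):
    print(f"epoch {epoch}: about {alive} trials still training")
```

Under these assumptions the first two decision points fall at epochs 5 and 15, which is why most of the savings in a run like this come from trials stopped early in training.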
Search space (excerpt)
    def objective(trial):
        lr = trial.suggest_float("lr", 1e-5, 3e-3, log=True)
        wd = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
        aug = trial.suggest_categorical("augment", ["light", "heavy"])
        focal_gamma = trial.suggest_float("focal_gamma", 0.0, 3.0)
        batch = trial.suggest_categorical("batch", [32, 64, 128])
        # train with epoch loop + trial.report / should_prune
        return pr_auc
After 200 trials, Hyperband pruned 61% before epoch 15, saving roughly 120 GPU-hours.
Best trial (#173) achieved PR-AUC 0.84 vs 0.79 for the hand-tuned baseline.
The team logs all trials via MLflowCallback, registers the best
checkpoint through
MLflow,
and exports plot_param_importances to show that
focal_gamma and augment dominated gains —
informing the next labeling budget for hard negatives.
Tooling decision table
| Need | Optuna fit | Alternative |
|---|---|---|
| Python HPO with pruning + define-by-run | Strong default | Ray Tune, Hyperopt |
| Simple grid on sklearn pipeline | Overkill | GridSearchCV, RandomizedSearchCV |
| Distributed GPU cluster scheduling | Via storage + workers | Ray Tune + Ray cluster |
| Multi-objective Pareto search | NSGA-II / MOTPE built-in | BoTorch, pymoo |
| JVM / Spark native tuning | Python only | Hyperopt Spark, SynapseML |
| Experiment UI + model registry | Via MLflow callback | W&B Sweeps (SaaS) |
| Neural architecture search (NAS) | Partial (conditional params) | DARTS, AutoKeras |
Common pitfalls
- Optimizing on the test set — repeatedly evaluating the same holdout fold across trials leaks information. Hold out a final test set untouched until search completes.
- Noisy objectives — single-split metrics swing wildly; use cross-validation or multiple seeds averaged per trial.
- Pruning without intermediate reports — forgetting trial.report makes every pruner a no-op.
- Too few startup trials — TPE with n_startup_trials=5 on a 15-dimensional space explores poorly; budget 10–20% of total trials for random warmup.
- In-memory studies on long jobs — a killed worker loses all history. Persist to RDB from day one.
- Duplicate study names — load_if_exists=True resumes old trials; use versioned study names (project-v4) when you change the search space.
- Logging every trial to Production — register only the champion after nested validation; Optuna finds training optima, not deployment guarantees.
Practitioner checklist
- Define a single scalar objective aligned with business metrics (PR-AUC, not accuracy on imbalanced data).
- Reserve a final test set never seen during study.optimize.
- Choose TPE as default; switch to CmaEs for low-dimensional continuous problems.
- Enable Hyperband or Median pruning for iterative trainers; use NopPruner for instant fits.
- Configure PostgreSQL (or equivalent) storage before parallel workers start.
- Set n_startup_trials to at least 10% of total budget.
- Report intermediate metrics at consistent step indices across trials.
- Integrate MLflowCallback or manual logging so trials are auditable.
- Visualize param importances before declaring search complete.
- Re-train the best config with multiple seeds and report mean +/- std.
- Version the study name when data distributions or features change.
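The multi-seed re-training item in the checklist can be sketched as follows. train_and_evaluate is a hypothetical stand-in that returns a noisy score, and the best_params values are placeholders; in practice it would re-train the model with study.best_params and the given seed.

```python
import random
import statistics


def train_and_evaluate(params, seed):
    """Hypothetical stand-in: returns a simulated validation score."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0.0, 0.01)  # placeholder metric with seed noise


# Placeholder values; in a real run use study.best_params.
best_params = {"lr": 3e-4, "focal_gamma": 2.0}

scores = [train_and_evaluate(best_params, seed) for seed in range(5)]
mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"score: {mean:.3f} +/- {std:.3f} over {len(scores)} seeds")
```

Reporting the spread alongside the mean shows whether the gap between the tuned configuration and the baseline is larger than seed-to-seed noise.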
Key takeaways
- Optuna frames hyperparameter search as studies of trials optimizing an objective function.
- Define-by-run APIs support conditional search spaces impossible with fixed grids.
- TPESampler and HyperbandPruner are a strong default pairing for iterative ML training.
- RDB storage enables parallel workers and survives process restarts.
- Pair Optuna with MLflow for logging and registry; hold out a final test set Optuna never sees.
Related reading
- Hyperparameter tuning explained — grid, random, and Bayesian search theory behind Optuna’s samplers
- MLflow fundamentals explained — logging Optuna trials and registering champion models
- scikit-learn fundamentals explained — pipelines and estimators commonly wrapped in Optuna objectives
- Gradient boosting explained — tree hyperparameters Optuna tunes most often on tabular data