Optuna fundamentals explained

A team trains an XGBoost classifier with twelve hyperparameters. Even a coarse grid with two values per parameter needs 2^12 = 4,096 fits; random search wastes budget on obviously bad combinations; and half the trials train for 500 trees when 50 would have shown they cannot beat the current champion. Optuna is an open-source hyperparameter optimization framework that replaces brute force with adaptive sampling (Tree-structured Parzen Estimator, CMA-ES), early pruning (Median, Hyperband), and a define-by-run API in which the search space is built as the objective executes, so later suggestions can depend on values chosen earlier in the same trial. It integrates with scikit-learn, PyTorch, LightGBM, and MLflow for experiment logging. This guide covers studies and trials, samplers and pruners, storage backends, multi-objective optimization, a Harbor Analytics defect classifier worked example, a tooling decision table, common pitfalls, and a practitioner checklist, building on our broader hyperparameter tuning overview.

Core concepts: studies, trials, and objectives

Optuna organizes search around three primitives:

Study — a named optimization session with a direction (minimize or maximize), sampler, pruner, and optional storage backend. One study answers one question: “What hyperparameters maximize validation AUC for this dataset?”
Trial — a single evaluation of one hyperparameter configuration. The objective(trial) function suggests parameters, trains a model, and returns a scalar metric (or a tuple for multi-objective studies).
Objective function — your training code wrapped to accept an optuna.Trial object and return the metric Optuna should optimize.

The minimal pattern:

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 3, 12)
    # ... train and evaluate ...
    return val_auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)

Unlike sklearn’s GridSearchCV, which requires a fixed parameter grid upfront, Optuna’s define-by-run style lets you branch: suggest a large hidden_size only when architecture == "mlp", or skip expensive data-augmentation knobs when a cheap baseline already fails. That flexibility matters for neural architectures and conditional pipelines where the search space is not a rectangular grid.

Suggesting parameters: distributions and constraints

trial.suggest_* methods define the search space per trial:

suggest_float(name, low, high, step=None, log=False) — uniform continuous values by default; pass log=True for learning rates and regularization strengths that span orders of magnitude, or step to discretize the range.
suggest_int(name, low, high, step=1, log=False) — discrete integers (tree depth, layer counts); raise step to skip values.
suggest_categorical(name, choices) — enums like ["gbtree", "dart"] booster types or optimizer names.
suggest_discrete_uniform / suggest_loguniform — deprecated since Optuna 3.0; use suggest_float with step or log=True respectively.

Parameter constraints and relationships

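Enforce relationships between parameters in the objective itself. A bare assert such as assert trial.params["min_child_weight"] <= trial.params["max_depth"] * 10 raises AssertionError, which aborts study.optimize unless that exception type is listed in its catch argument, so for routine rejection of invalid combinations raise optuna.TrialPruned() instead. For inequality constraints across parameters (e.g. lr * batch_size < 1.0), samplers such as TPESampler and NSGAIISampler accept a constraints_func argument, still marked experimental, that returns one value per constraint; values of zero or less count as feasible. These are soft constraints: they steer sampling toward the feasible region rather than forbidding infeasible trials. When several workers share a study, optuna.samplers.TPESampler(constant_liar=True) discourages them from sampling near configurations that are still running. Document valid regions in your objective docstring so future maintainers know which suggestions are legal.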

Samplers: from random search to Bayesian optimization

The sampler decides which hyperparameters to try next given completed trials:

TPESampler (default) — Tree-structured Parzen Estimator; models good vs bad regions of the search space separately and samples where the “good” density is high. Strong default for mixed continuous/categorical spaces with 50–500 trials.
CmaEsSampler — covariance matrix adaptation; works well on continuous spaces under ~20 dimensions when the budget allows many trials. Poor fit for heavy categorical branching, because categorical parameters fall back to an independent sampler.
RandomSampler — baseline and ablation; useful for verifying TPE actually beats random on your problem size.
NSGAIISampler / TPESampler (multi-objective mode) — multi-objective Pareto fronts when you optimize accuracy and latency jointly. The older MOTPESampler is deprecated in favor of TPESampler.
GridSampler / BruteForceSampler — exhaustive small grids when you truly need complete coverage.

Set the sampler explicitly when defaults misbehave: optuna.create_study(sampler=optuna.samplers.TPESampler(seed=42, n_startup_trials=20)). The n_startup_trials knob (default 10) sets how many trials use pure random exploration before TPE builds density models; raise it when the search space is wide or noisy.

Pruners: stop bad trials early

Training 500 epochs for every trial is wasteful when validation loss diverges by epoch 12. Pruners compare intermediate metrics across trials and terminate unpromising ones:

MedianPruner — prunes trials worse than the median of previous trials at the same step. Simple default for iterative training.
HyperbandPruner / SuccessiveHalvingPruner — allocate more budget to promising trials; ideal when epoch count or dataset subset size is the resource dimension.
PercentilePruner — prune below a configurable percentile at each step.
NopPruner — disable pruning for fast, single-shot estimators (small sklearn fits).
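
The median rule itself is simple enough to sketch in plain Python. This is a simplification for intuition only: the function median_should_prune is hypothetical, and Optuna's MedianPruner adds warm-up settings such as n_startup_trials and n_warmup_steps on top of the comparison.

```python
from statistics import median


def median_should_prune(current_value, peer_values, direction="minimize"):
    """Return True when current_value is worse than the median of peer_values.

    peer_values are the intermediate values that earlier trials reported
    at the same step. With no peers there is nothing to compare against.
    """
    if not peer_values:
        return False
    threshold = median(peer_values)
    if direction == "minimize":
        return current_value > threshold
    return current_value < threshold


# Validation losses that five earlier trials reported at epoch 12 (made-up numbers).
peers_at_epoch_12 = [0.42, 0.38, 0.51, 0.47, 0.40]
print(median_should_prune(0.55, peers_at_epoch_12))  # True: worse than the median 0.42
print(median_should_prune(0.39, peers_at_epoch_12))  # False: better than the median
```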

Report intermediate values inside training loops:

for epoch in range(100):
    val_loss = train_one_epoch(...)
    trial.report(val_loss, epoch)
    if trial.should_prune():
        raise optuna.TrialPruned()

Pruning interacts with cross-validation: report the running mean of the fold metrics with the fold index as the step, rather than per-epoch metrics from folds of different sizes. For gradient boosting, report the validation metric at a fixed interval of boosting rounds (for example, every early_stopping_rounds trees) so Hyperband can stop weak configurations cheaply.

Storage, parallelism, and dashboards

By default Optuna stores studies in memory — fine for notebooks, lost on restart. Production setups use RDB storage:

storage = "postgresql://optuna:***@db.internal/optuna"
study = optuna.create_study(
    study_name="harbor-defect-v4",
    storage=storage,
    load_if_exists=True,
    direction="maximize",
)

Multiple workers call study.optimize(objective, n_trials=25) concurrently; the storage backend coordinates trial allocation. SQLite works for single-machine parallelism with storage="sqlite:///optuna.db" but serializes writes — prefer PostgreSQL or MySQL beyond a handful of workers.

Visualization and callbacks

optuna.visualization.plot_optimization_history, plot_param_importances, and plot_parallel_coordinate render in Jupyter via Plotly. The optional Optuna Dashboard (optuna-dashboard package) serves a web UI over RDB storage for teams without a separate experiment tracker. Callbacks like MLflowCallback log each trial as an MLflow run automatically, bridging search and registry workflows in MLOps pipelines.

Framework integrations

Optuna provides integration modules that reduce boilerplate (in recent releases they are packaged separately as optuna-integration):

LightGBM / XGBoost / CatBoost — optuna.integration.LightGBMPruningCallback, XGBoostPruningCallback, and CatBoostPruningCallback wire pruning into each library's training callbacks.
PyTorch Lightning — PyTorchLightningPruningCallback monitors validation loss per epoch.
Keras — KerasPruningCallback for model.fit loops.
sklearn — wrap estimators manually and call cross_val_score inside the objective for stable metrics, or use optuna.integration.OptunaSearchCV as a scikit-learn-style search estimator.

For sklearn pipelines, return the mean out-of-fold metric from StratifiedKFold rather than the score on a single train/val split; otherwise the search overfits the one validation split it keeps seeing. Nested cross-validation (outer loop for the final estimate, inner loop inside the objective) is expensive, but it or an untouched final test set is required when reporting unbiased performance to stakeholders.

Worked example: Harbor Analytics defect classifier

Harbor Analytics photographs conveyor belts for surface defects. The vision team trains a ResNet-18 classifier on 40,000 labeled crops with heavy class imbalance (defects are 2.3% of frames). They need better recall at a fixed false-positive rate, not just raw accuracy.

Study design

Study: harbor-defect-resnet18-v4, direction maximize, metric = validation PR-AUC on a held-out week of production images.
Sampler: TPESampler(n_startup_trials=30, seed=7)
Pruner: HyperbandPruner(min_resource=5, max_resource=50, reduction_factor=3) where resource = training epochs.
Storage: PostgreSQL shared across four GPU workers, 200 trials budget.

Search space (excerpt)

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 3e-3, log=True)
    wd = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    aug = trial.suggest_categorical("augment", ["light", "heavy"])
    focal_gamma = trial.suggest_float("focal_gamma", 0.0, 3.0)
    batch = trial.suggest_categorical("batch", [32, 64, 128])
    # train with epoch loop + trial.report / should_prune
    return pr_auc

After 200 trials, Hyperband pruned 61% before epoch 15, saving roughly 120 GPU-hours. Best trial (#173) achieved PR-AUC 0.84 vs 0.79 for the hand-tuned baseline. The team logs all trials via MLflowCallback, registers the best checkpoint through MLflow, and exports plot_param_importances to show that focal_gamma and augment dominated gains — informing the next labeling budget for hard negatives.

Tooling decision table

Need	Optuna fit	Alternative
Python HPO with pruning + define-by-run	Strong default	Ray Tune, Hyperopt
Simple grid on sklearn pipeline	Overkill	GridSearchCV, RandomizedSearchCV
Distributed GPU cluster scheduling	Via storage + workers	Ray Tune + Ray cluster
Multi-objective Pareto search	NSGA-II / multi-objective TPE built-in	BoTorch, pymoo
JVM / Spark native tuning	Python only	Hyperopt Spark, SynapseML
Experiment UI + model registry	Via MLflow callback	W&B Sweeps (SaaS)
Neural architecture search (NAS)	Partial (conditional params)	DARTS, AutoKeras

Common pitfalls

Optimizing on the test set — repeatedly evaluating the same holdout fold across trials leaks information. Hold out a final test set untouched until search completes.
Noisy objectives — single-split metrics swing wildly; use cross-validation or multiple seeds averaged per trial.
Pruning without intermediate reports — forgetting trial.report makes every pruner a no-op.
Too few startup trials — TPE with n_startup_trials=5 on a 15-dimensional space explores poorly; budget 10–20% of total trials for random warmup.
In-memory studies on long jobs — a killed worker loses all history. Persist to RDB from day one.
Duplicate study names — load_if_exists=True resumes old trials; use versioned study names (project-v4) when you change the search space.
Promoting every trial to Production — register only the champion after nested validation; Optuna finds validation optima, not deployment guarantees.

Practitioner checklist

Define a single scalar objective aligned with business metrics (PR-AUC, not accuracy on imbalanced data).
Reserve a final test set never seen during study.optimize.
Choose TPE as default; switch to CmaEsSampler for low-dimensional continuous problems.
Enable Hyperband or Median pruning for iterative trainers; use NopPruner for instant fits.
Configure PostgreSQL (or equivalent) storage before parallel workers start.
Set n_startup_trials to at least 10% of total budget.
Report intermediate metrics at consistent step indices across trials.
Integrate MLflowCallback or manual logging so trials are auditable.
Visualize param importances before declaring search complete.
Re-train the best config with multiple seeds and report mean +/- std.
Version the study name when data distributions or features change.
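
The multi-seed item in the checklist can be sketched in a few lines. The function train_and_score is a hypothetical stand-in that returns a deterministic toy score, and best_params holds illustrative values rather than results from any study.

```python
from statistics import mean, stdev


def train_and_score(params, seed):
    # Hypothetical stand-in: retrain with `params` under `seed` and return the metric.
    # A deterministic toy formula replaces real training here.
    return 0.80 + 0.01 * ((seed * 37) % 5 - 2) / 2


best_params = {"lr": 3e-4, "focal_gamma": 2.0}  # illustrative values only
seeds = [0, 1, 2, 3, 4]
scores = [train_and_score(best_params, seed) for seed in seeds]
print(f"metric {mean(scores):.3f} +/- {stdev(scores):.3f} over {len(seeds)} seeds")
```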

Key takeaways

Optuna frames hyperparameter search as studies of trials optimizing an objective function.
Define-by-run APIs support conditional search spaces impossible with fixed grids.
TPESampler with HyperbandPruner is a strong combination for iterative ML training; note that Optuna's default pruner is MedianPruner, so Hyperband must be requested explicitly.
RDB storage enables parallel workers and survives process restarts.
Pair Optuna with MLflow for logging and registry; hold out a final test set Optuna never sees.