ML experiment tracking explained
Your teammate asks which run produced the 0.94 AUC churn model in production.
You scroll through Slack, three Jupyter notebooks named final_v2.ipynb,
and a CSV export from six weeks ago. Nobody remembers the learning rate, the exact
train/validation split, or whether class weights were enabled. That is what happens
when experiments live only in notebooks. ML experiment tracking is
the discipline — and the tooling — of logging every training run with its
hyperparameters, metrics, code revision, data snapshot, and output artifacts so
results are comparable, auditable, and reproducible. This guide covers what to log,
how trackers organize runs, reproducibility pillars, integration with MLOps and
hyperparameter tuning,
a churn-classifier worked example, a build-vs-buy decision table, common pitfalls,
and a production checklist.
What experiment tracking records
An experiment is a logical project — “churn Q2 2026” or “fraud classifier v3.” Each run is one training attempt inside that experiment. Mature trackers capture four layers:
- Parameters: hyperparameters (learning_rate, max_depth), feature flags, random seeds, data split ratios.
- Metrics: scalar time series such as train/val loss, AUC, F1, and RMSE, logged per epoch or per step.
- Artifacts: model files, confusion matrices, SHAP plots, prediction CSVs, ONNX exports.
- Context: git commit hash, Docker image digest, library versions, dataset version or hash, who launched the run, hardware type.
Without all four, you can compare numbers but cannot reproduce or defend a model in a regulated review. Trackers turn ad-hoc notebook work into an indexed lab notebook the whole team can search.
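To make the four layers concrete, here is a minimal sketch of a run record using only the Python standard library. The class name RunRecord, its field names, and the max_depth value are illustrative assumptions, not the schema of any particular tracker.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """One training run, split into the four layers described above."""

    run_name: str
    params: dict = field(default_factory=dict)     # hyperparameters, seeds, split ratios
    metrics: dict = field(default_factory=dict)    # metric name -> list of (step, value)
    artifacts: list = field(default_factory=list)  # paths or URIs of output files
    context: dict = field(default_factory=dict)    # git SHA, image digest, dataset version

    def log_metric(self, name: str, step: int, value: float) -> None:
        """Append one point to a metric's time series."""
        self.metrics.setdefault(name, []).append((step, value))

    def to_json(self) -> str:
        """Serialize the record so it can be stored and searched later."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical values for illustration only.
run = RunRecord(
    run_name="churn-lgbm-lr0.05-cw1-seed42",
    params={"learning_rate": 0.05, "max_depth": 6, "seed": 42},
    context={"git_sha": "a3f91c2"},
)
run.log_metric("val_auc", step=1, value=0.91)
run.log_metric("val_auc", step=2, value=0.93)
run.artifacts.append("confusion_matrix.png")
record_json = run.to_json()
```

A real tracker adds storage, a UI, and access control on top, but the record it indexes has roughly this shape.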
Core concepts: runs, experiments, and registries
Runs and parallel comparison
When you sweep twelve learning rates, you get twelve runs under one experiment. The UI (or API) lets you sort by validation AUC, overlay learning curves, and diff parameters side by side. That is the minimum viable value — killing the “spreadsheet of results” anti-pattern.
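The sort-and-diff workflow can be sketched in a few lines. The run names, metric values, and the diff_params helper below are hypothetical; a tracker's UI or API would do the same over stored runs.

```python
def diff_params(params_a: dict, params_b: dict) -> dict:
    """Return {param: (value_in_a, value_in_b)} for every param that differs."""
    keys = sorted(set(params_a) | set(params_b))
    return {
        key: (params_a.get(key), params_b.get(key))
        for key in keys
        if params_a.get(key) != params_b.get(key)
    }


# Hypothetical runs from one sweep; values are illustrative only.
runs = [
    {"name": "run-a", "params": {"learning_rate": 0.01, "num_leaves": 32}, "val_auc": 0.921},
    {"name": "run-b", "params": {"learning_rate": 0.05, "num_leaves": 32}, "val_auc": 0.941},
    {"name": "run-c", "params": {"learning_rate": 0.2, "num_leaves": 32}, "val_auc": 0.902},
]

# Sort by validation AUC, best first, then diff the top two runs.
ranked = sorted(runs, key=lambda r: r["val_auc"], reverse=True)
best, runner_up = ranked[0], ranked[1]
changed = diff_params(best["params"], runner_up["params"])
```

Here changed contains only learning_rate, which is the kind of answer a results spreadsheet rarely gives without manual cross-checking.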
Tags and naming conventions
Adopt a naming scheme early: churn-lgbm-lr0.05-cw1-seed42 beats
run_7. Tags like baseline, production-candidate,
or leakage-fix make filtering across hundreds of runs practical.
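One way to keep names consistent is to generate them from the parameters instead of typing them. The helper below is a sketch that reproduces the pattern shown above; the function names and the allowed-tag set are assumptions that a team would define for itself.

```python
# Tags taken from the examples above; a real project would maintain its own list.
ALLOWED_TAGS = {"baseline", "production-candidate", "leakage-fix"}


def make_run_name(model: str, learning_rate: float, class_weight: int, seed: int,
                  task: str = "churn") -> str:
    """Build a run name such as churn-lgbm-lr0.05-cw1-seed42 from its parameters."""
    return f"{task}-{model}-lr{learning_rate:g}-cw{class_weight}-seed{seed}"


def validate_tags(tags: list) -> list:
    """Reject tags outside the agreed vocabulary so filters stay reliable."""
    unknown = sorted(set(tags) - ALLOWED_TAGS)
    if unknown:
        raise ValueError(f"unknown tags: {unknown}")
    return sorted(set(tags))


name = make_run_name("lgbm", learning_rate=0.05, class_weight=1, seed=42)
tags = validate_tags(["baseline"])
```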
Model registry
The best run is not automatically production-ready. A model registry promotes a versioned artifact through stages — Staging, Production, Archived — with approval gates, lineage back to the parent run, and links to serving endpoints. This closes the loop between experimentation and model serving.
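The promotion flow can be modeled as a small state machine. This is a simplified sketch, not the API of any real registry: the ModelVersion class, the promote function, the transition table, and the run_id value are all hypothetical.

```python
from dataclasses import dataclass, field

# Allowed stage transitions; the stage names come from the text above.
TRANSITIONS = {
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str            # lineage back to the parent run
    stage: str = "Staging"
    history: list = field(default_factory=list)


def promote(model: ModelVersion, target: str, approver: str, note: str = "") -> ModelVersion:
    """Move a version to a new stage, enforcing transitions and an approval gate."""
    if target not in TRANSITIONS[model.stage]:
        raise ValueError(f"cannot move {model.stage} -> {target}")
    if target == "Production" and not approver:
        raise PermissionError("Production promotion requires a named approver")
    model.history.append(
        {"from": model.stage, "to": target, "approver": approver, "note": note}
    )
    model.stage = target
    return model


# Hypothetical version; run_id is made up for illustration.
candidate = ModelVersion(name="churn-lgbm", version=3, run_id="run-123")
promote(candidate, "Production", approver="ml-lead", note="shadow test passed")
```

Keeping the history on the version itself means a rollback or audit can see who approved each move and why.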
Reproducibility: the four pillars
Tracking logs what happened; reproducibility ensures you can make it happen again. Four pillars must be pinned together:
1. Code
Log the git SHA (or tarball hash for non-git workflows). Uncommitted local edits are the #1 reproducibility killer — CI should refuse to register production candidates from dirty trees.
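A sketch of capturing the commit and refusing dirty trees, assuming git is on the PATH and the script runs inside a repository. The function names are hypothetical; the run argument exists only so the git calls can be replaced in tests.

```python
import subprocess


def git_state(run=subprocess.run) -> dict:
    """Return the current commit SHA and whether the working tree has local edits."""
    sha = run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    # --porcelain prints one line per modified or untracked file; empty means clean.
    status = run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
    ).stdout
    return {"git_sha": sha, "dirty": bool(status.strip())}


def require_clean_tree(state: dict) -> str:
    """Raise if the tree is dirty; otherwise return the SHA to log with the run."""
    if state["dirty"]:
        raise RuntimeError("uncommitted changes: refusing to register this run")
    return state["git_sha"]
```

A CI job would call require_clean_tree(git_state()) before registering a production candidate, while exploratory runs could log the dirty flag and continue.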
2. Data
Record dataset version, snapshot path, or content hash. If training reads “latest” from a warehouse without a snapshot, rerunning the same params next week trains on different rows. Pair tracking with immutable snapshots or tools like DVC/LakeFS for large files.
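For a local snapshot file, a content hash is cheap to compute and log. The sketch below uses SHA-256 from the standard library; the function names are hypothetical, and data in object storage or a warehouse would need a version identifier from that system instead.

```python
import hashlib
import os


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_context(path: str) -> dict:
    """Collect the data fields worth logging at run start."""
    return {
        "path": str(path),
        "size_bytes": os.path.getsize(path),
        "sha256": file_sha256(path),
    }
```

If two runs log different hashes for what is nominally the same dataset, the data changed underneath them and their metrics are not directly comparable.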
3. Environment
Pin Python, CUDA, and library versions in a lockfile or container image. Log the image digest at run start. A model trained on scikit-learn 1.3 may behave differently, or fail to load, on 1.5, because estimator internals and defaults can change between releases.
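A minimal sketch of an environment snapshot taken at run start. It records the interpreter version and installed package versions through importlib.metadata; the IMAGE_DIGEST environment variable is an assumption about how a container platform might expose the digest, not a standard.

```python
import os
import platform
from importlib import metadata


def environment_snapshot(packages: list) -> dict:
    """Record interpreter, platform, and package versions for the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        # Hypothetical variable name; set it in the image build or job spec.
        "image_digest": os.environ.get("IMAGE_DIGEST"),
    }


snapshot = environment_snapshot(["scikit-learn", "lightgbm", "numpy"])
```

Logging this dictionary as run context makes version drift visible when two runs with identical parameters disagree.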
4. Randomness
Set and log seeds for Python, NumPy, and framework RNGs. Some GPU ops remain nondeterministic — document that limitation rather than pretending bit-identical replay.
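A sketch of seeding in one place and returning what was seeded so it can be logged. It covers Python's random module and NumPy if installed; framework RNGs (for example a deep learning library's own seed call) are deliberately left out and would need to be added per framework. The function name is hypothetical.

```python
import random


def seed_everything(seed: int) -> dict:
    """Seed the RNGs available in this environment and report which were seeded."""
    random.seed(seed)
    seeded = ["python.random"]
    try:
        import numpy as np  # optional dependency

        np.random.seed(seed)  # seeds NumPy's legacy global RNG
        seeded.append("numpy")
    except ImportError:
        pass
    return {"seed": seed, "seeded": seeded}


seed_info = seed_everything(42)  # log this dict with the run's parameters
```

Returning the list of seeded libraries documents the limits of the guarantee: anything not in the list may still vary between replays.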
Reproducibility supports fair comparison during cross-validation and is a prerequisite for serious hyperparameter search.
What to log (and what to skip)
Log everything needed to answer: “Why did this run beat that one?” and “Can we ship it?”
- Always log — hyperparameters, final and per-epoch metrics, data split config, class weights, feature list version, git SHA, seed.
- Log selectively — large intermediate tensors, per-sample gradients (storage cost explodes). Sample or aggregate instead.
- Log on failure too — stack traces and partial metrics help debug OOMs and NaN divergences without rerunning blindly.
- Log business metrics — not just AUC but expected revenue lift or fraud dollars saved at a chosen threshold.
For deep learning, log learning-rate schedules, batch size, and gradient norms.
For gradient boosting, log num_leaves, min_child_weight,
and early-stopping round. Consistency matters more than volume.
Integration with the ML lifecycle
Experiment tracking sits at the center of a mature MLOps stack:
- Feature pipelines — link runs to feature store versions so train-serve parity is traceable.
- Training CI — nightly retrain jobs auto-log runs; regressions trigger alerts when val AUC drops below the production baseline.
- Hyperparameter search — Optuna, Ray Tune, or SageMaker emit runs to the tracker; the best trial gets promoted to registry.
- Serving — production endpoint metadata points to registry version and parent run for rollback.
- Monitoring — when drift fires, engineers spawn a new experiment branch from the last known-good run.
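The regression alert in the training CI item reduces to a comparison against the production baseline. The sketch below assumes a single validation AUC per run; the function name and the tolerance parameter are hypothetical additions to absorb run-to-run noise.

```python
def check_regression(candidate_auc: float, production_auc: float,
                     tolerance: float = 0.0) -> dict:
    """Flag a retrain whose validation AUC falls below the production baseline."""
    drop = production_auc - candidate_auc
    return {"regressed": drop > tolerance, "drop": drop}


# Illustrative numbers only; 0.941 is the production run from the worked example.
nightly = check_regression(candidate_auc=0.930, production_auc=0.941)
within_noise = check_regression(candidate_auc=0.938, production_auc=0.941, tolerance=0.005)
```

A nightly job would read production_auc from the run linked to the current registry version and raise an alert when regressed is true.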
Worked example: churn classifier sweep
A subscription SaaS team predicts 30-day churn. They create experiment
churn-2026q2 and launch twelve LightGBM runs varying
learning_rate (0.01–0.2), num_leaves (16–64),
and scale_pos_weight (1 vs 3) with a fixed seed and frozen
training snapshot s3://ml-data/churn/v2026-06-01.parquet.
Each run logs per-iteration val AUC, precision at 0.5 threshold, training
duration, and git SHA a3f91c2. Run churn-lgbm-lr0.05-leaves32-cw3
tops the leaderboard at val AUC 0.941. The team logs a confusion matrix artifact,
registers the model as churn-lgbm/3 in Staging, runs shadow inference
for two weeks, then promotes to Production with an approval note linking the run ID.
Three months later, drift alerts fire. They filter runs tagged
production-candidate, compare against the current prod run, and
branch a new experiment from the same data snapshot — without guessing which
notebook worked.
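The twelve-run sweep can be generated instead of hand-written. The text gives only ranges, so the specific grid points below are assumptions chosen so that the product is twelve, and the seed value 42 is likewise assumed. The snapshot path, git SHA, and naming pattern come from the example.

```python
from itertools import product

# Assumed grid points within the ranges given in the example (3 x 2 x 2 = 12).
LEARNING_RATES = (0.01, 0.05, 0.2)
NUM_LEAVES = (16, 32)
SCALE_POS_WEIGHTS = (1, 3)

SEED = 42  # assumption: the example says only that the seed is fixed
SNAPSHOT = "s3://ml-data/churn/v2026-06-01.parquet"
GIT_SHA = "a3f91c2"


def build_sweep() -> list:
    """Return one config per run, each pinned to the same seed, data, and code."""
    configs = []
    for lr, leaves, spw in product(LEARNING_RATES, NUM_LEAVES, SCALE_POS_WEIGHTS):
        configs.append({
            "run_name": f"churn-lgbm-lr{lr:g}-leaves{leaves}-cw{spw}",
            "params": {
                "learning_rate": lr,
                "num_leaves": leaves,
                "scale_pos_weight": spw,
                "seed": SEED,
            },
            "data_snapshot": SNAPSHOT,
            "git_sha": GIT_SHA,
        })
    return configs


sweep = build_sweep()
```

Because every config carries the same snapshot, seed, and SHA, any difference in validation AUC can be attributed to the three swept parameters.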
Tool decision table
| Situation | Recommendation |
|---|---|
| Solo researcher, local sklearn/PyTorch | MLflow local or TensorBoard — low setup, sufficient for dozens of runs. |
| Small team, need collaboration and rich viz | Weights & Biases or Neptune — shared dashboards, report generation. |
| Enterprise, on-prem governance requirements | MLflow Tracking Server + Model Registry behind SSO; or SageMaker Experiments. |
| Heavy deep learning, GPU cluster | W&B or TensorBoard with cluster launcher integration; log system metrics. |
| Regulated industry needing audit trails | Registry with approval workflows; immutable run records; tie to data lineage. |
| Prototype with <10 runs total | Structured JSON logs in git may suffice — upgrade before team grows. |
MLflow is a widely used open-source option (params/metrics/artifacts API, framework autologging for sklearn, XGBoost, PyTorch). W&B excels at visualization and team workflows. Avoid building a bespoke tracker unless you have unusual compliance constraints, because the maintenance cost is high.
Common pitfalls
- Logging without conventions: inconsistent param names (lr vs learning_rate) break comparison queries.
- Mutable “latest” data: runs are not reproducible if the underlying table changes daily.
- Notebook-only workflow: interactive tweaks are never logged, and the “best” cell output is lost on kernel restart.
- Metric leakage in logs: logging test-set AUC during tuning and picking the maximum turns the test set into a second validation set. Evaluate on the held-out test set only once, at the end.
- Artifact bloat: storing full model checkpoints every epoch fills storage; keep best-only or top-k checkpoints.
- Orphan runs: CI jobs that crash before run.end() leave incomplete records; use try/finally or context managers.
- Registry without governance: anyone can push to Production; add role-based promotion and changelog notes.
- Ignoring failed runs: failures teach as much as successes; tag root cause (OOM, NaN, data schema mismatch).
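The orphan-run and failed-run pitfalls can both be addressed with a context manager that always closes the record. This sketch stores runs in a plain list; tracked_run and its record fields are hypothetical stand-ins for a real tracker's run object.

```python
import traceback
from contextlib import contextmanager


@contextmanager
def tracked_run(store: list, name: str):
    """Open a run record and guarantee it is closed with a final status."""
    record = {"name": name, "status": "RUNNING", "metrics": {}, "error": None, "ended": False}
    store.append(record)
    try:
        yield record
        record["status"] = "FINISHED"
    except Exception as exc:
        # Keep partial metrics and a short error summary, then re-raise for CI.
        record["status"] = "FAILED"
        record["error"] = "".join(traceback.format_exception_only(type(exc), exc)).strip()
        raise
    finally:
        record["ended"] = True  # runs on success and on failure alike


runs = []
try:
    with tracked_run(runs, "demo-oom") as run:
        run["metrics"]["val_auc"] = 0.88  # partial metric logged before the crash
        raise MemoryError("simulated OOM")
except MemoryError:
    pass  # the record above is already marked FAILED
```

The failed record keeps the partial metric and the error text, so the root cause can be tagged without rerunning the job.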
Production checklist
- Every training script calls the tracker API — no manual copy-paste metrics.
- Standard param and metric naming schema documented in the repo README.
- Git SHA, seed, and data snapshot version logged at run start.
- Container image digest or conda lockfile recorded per run.
- Failed runs log exceptions and partial metrics before exit.
- Leaderboard query defined (e.g., max val AUC with min precision floor).
- Model registry stages with required approvers for Production promotion.
- Production endpoint metadata links to registry version and parent run.
- Retention policy for artifacts (e.g., 90 days for exploratory, indefinite for prod lineage).
- Quarterly audit: can you reproduce the current production model from logged artifacts?
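The leaderboard query in the checklist (max val AUC with a minimum precision floor) can be written down once so every promotion uses the same rule. The run data, field names, and the 0.70 floor below are hypothetical.

```python
from typing import Optional


def select_best(runs: list, min_precision: float) -> Optional[dict]:
    """Return the run with the highest val AUC among those meeting the precision floor."""
    eligible = [r for r in runs if r["precision_at_0_5"] >= min_precision]
    if not eligible:
        return None  # nothing qualifies; do not promote
    return max(eligible, key=lambda r: r["val_auc"])


# Illustrative runs: the top-AUC run fails the precision floor.
candidates = [
    {"name": "run-a", "val_auc": 0.945, "precision_at_0_5": 0.58},
    {"name": "run-b", "val_auc": 0.941, "precision_at_0_5": 0.71},
    {"name": "run-c", "val_auc": 0.930, "precision_at_0_5": 0.75},
]
winner = select_best(candidates, min_precision=0.70)
```

Codifying the rule prevents a promotion based on AUC alone when the operating-point metric is below what the business can accept.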
Key takeaways
- Experiment tracking logs params, metrics, artifacts, and context for every training run.
- Reproducibility requires pinning code, data, environment, and seeds — not just saving the model file.
- Model registries bridge experimentation and production serving with versioned promotion.
- Adopt trackers early — retrofitting after hundreds of notebook experiments is painful.
- Pair tracking with solid MLOps hygiene and fair validation practices.
Related reading
- MLOps explained — pipelines, deployment, and monitoring around tracked experiments
- Hyperparameter tuning explained — search strategies that generate the runs you need to compare
- Feature stores explained — versioning entity features linked to training runs
- Overfitting and cross-validation explained — evaluating runs without leakage