
ML experiment tracking explained

Your teammate asks which run produced the 0.94 AUC churn model in production. You scroll through Slack, three Jupyter notebooks each named some variant of final_v2.ipynb, and a CSV export from six weeks ago. Nobody remembers the learning rate, the exact train/validation split, or whether class weights were enabled. That is what happens when experiments live only in notebooks.

ML experiment tracking is the discipline, and the tooling, of logging every training run with its hyperparameters, metrics, code revision, data snapshot, and output artifacts so that results are comparable, auditable, and reproducible. This guide covers what to log, how trackers organize runs, the four reproducibility pillars, integration with MLOps and hyperparameter tuning, a churn-classifier worked example, a tool decision table, common pitfalls, and a production checklist.

What experiment tracking records

An experiment is a logical project — “churn Q2 2026” or “fraud classifier v3.” Each run is one training attempt inside that experiment. Mature trackers capture four layers:

Parameters — hyperparameters (learning_rate, max_depth), feature flags, random seeds, data split ratios.
Metrics — scalar time series: train/val loss, AUC, F1, RMSE, logged per epoch or per step.
Artifacts — model files, confusion matrices, SHAP plots, prediction CSVs, ONNX exports.
Context — git commit hash, Docker image digest, library versions, dataset version or hash, who launched the run, hardware type.

Without all four, you can compare numbers but cannot reproduce or defend a model in a regulated review. Trackers turn ad-hoc notebook work into an indexed lab notebook the whole team can search.
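As a tracker-agnostic sketch of the four layers, the record below uses a plain dataclass. The class name RunRecord, its field names, and the example values are illustrative assumptions, not the schema of any particular tracker.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RunRecord:
    """One training run, split into the four layers described above."""
    experiment: str
    run_name: str
    params: dict = field(default_factory=dict)     # hyperparameters, seeds, split ratios
    metrics: dict = field(default_factory=dict)    # metric name -> list of (step, value)
    artifacts: list = field(default_factory=list)  # paths or URIs of output files
    context: dict = field(default_factory=dict)    # git SHA, image digest, dataset version

    def log_metric(self, name, step, value):
        """Append one point to a metric's time series."""
        self.metrics.setdefault(name, []).append((step, value))

    def to_json(self):
        """Serialize the whole record so it can be stored and searched later."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical run; every value here is made up for illustration.
run = RunRecord(experiment="churn-2026q2", run_name="example-run")
run.params.update({"learning_rate": 0.05, "seed": 42})
run.log_metric("val_auc", step=1, value=0.91)
run.log_metric("val_auc", step=2, value=0.93)
run.artifacts.append("confusion_matrix.png")
run.context["git_sha"] = "a3f91c2"

record = json.loads(run.to_json())
```

Keeping the layers as separate fields is a design choice: it lets a later query filter on params or context without parsing metric histories.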

Core concepts: runs, experiments, and registries

Runs and parallel comparison

When you sweep twelve learning rates, you get twelve runs under one experiment. The UI (or API) lets you sort by validation AUC, overlay learning curves, and diff parameters side by side. That alone is the minimum viable value: it kills the “spreadsheet of results” anti-pattern.
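The sort-and-diff workflow can be sketched without any tracker at all. In the example below the run dictionaries, the helper names (best_value, rank_runs, diff_params), and the metric values are all hypothetical; a real tracker would expose the same idea through its UI or query API.

```python
def best_value(run, metric):
    """Best (highest) value a run reached on a per-epoch metric."""
    return max(run["metrics"][metric])


def rank_runs(runs, metric):
    """Sort runs so the highest best-epoch value comes first."""
    return sorted(runs, key=lambda r: best_value(r, metric), reverse=True)


def diff_params(a, b):
    """Return only the parameters whose values differ between two runs."""
    keys = sorted(set(a["params"]) | set(b["params"]))
    return {
        k: (a["params"].get(k), b["params"].get(k))
        for k in keys
        if a["params"].get(k) != b["params"].get(k)
    }


# Hypothetical runs; the metric values are made up for illustration.
runs = [
    {"name": "lr0.01", "params": {"learning_rate": 0.01, "num_leaves": 32},
     "metrics": {"val_auc": [0.88, 0.90, 0.91]}},
    {"name": "lr0.05", "params": {"learning_rate": 0.05, "num_leaves": 32},
     "metrics": {"val_auc": [0.90, 0.93, 0.92]}},
    {"name": "lr0.2", "params": {"learning_rate": 0.2, "num_leaves": 32},
     "metrics": {"val_auc": [0.90, 0.89, 0.87]}},
]

ranked = rank_runs(runs, "val_auc")
changed = diff_params(ranked[0], ranked[1])
```

Here changed contains only learning_rate, because num_leaves is identical in the two top runs.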

Tags and naming conventions

Adopt a naming scheme early: churn-lgbm-lr0.05-cw1-seed42 beats run_7. Tags like baseline, production-candidate, or leakage-fix make filtering across hundreds of runs practical.
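One way to keep names consistent is to generate them from the parameters rather than type them by hand. The sketch below assumes a fixed field order and the hypothetical helpers make_run_name and filter_by_tag; the tags and run entries are illustrative.

```python
def make_run_name(project, model, learning_rate, class_weight, seed):
    """Build a run name from its key parameters in a fixed order."""
    # The 'g' format drops trailing zeros, so 0.05 stays "0.05" and 1 stays "1".
    return f"{project}-{model}-lr{learning_rate:g}-cw{class_weight:g}-seed{seed}"


def filter_by_tag(runs, tag):
    """Keep only the runs that carry the given tag."""
    return [r for r in runs if tag in r.get("tags", ())]


name = make_run_name("churn", "lgbm", learning_rate=0.05, class_weight=1, seed=42)

# Hypothetical run list for illustration.
runs = [
    {"name": name, "tags": ["baseline"]},
    {"name": make_run_name("churn", "lgbm", 0.1, 3, 42), "tags": ["production-candidate"]},
    {"name": "run_7"},  # untagged legacy run
]
candidates = filter_by_tag(runs, "production-candidate")
```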

Model registry

The best run is not automatically production-ready. A model registry promotes a versioned artifact through stages — Staging, Production, Archived — with approval gates, lineage back to the parent run, and links to serving endpoints. This closes the loop between experimentation and model serving.
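The promotion flow can be sketched as a small state machine. This is not the API of any real registry: the ModelVersion class, the promote function, the initial "Registered" stage, and the rule that only Production needs an approver are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Allowed stage transitions. "Registered" is an assumed initial stage.
ALLOWED = {
    "Registered": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str                      # lineage back to the parent run
    stage: str = "Registered"
    history: list = field(default_factory=list)


def promote(mv, target, approver=None, note=""):
    """Move a model version to a new stage, enforcing the approval gate."""
    if target not in ALLOWED[mv.stage]:
        raise ValueError(f"cannot move from {mv.stage} to {target}")
    if target == "Production" and not approver:
        raise PermissionError("Production promotion requires a named approver")
    mv.history.append((mv.stage, target, approver, note))
    mv.stage = target
    return mv


mv = ModelVersion(name="churn-lgbm", version=3, run_id="hypothetical-run-id")
promote(mv, "Staging")
try:
    promote(mv, "Production")  # no approver, so the gate blocks it
    blocked = False
except PermissionError:
    blocked = True
promote(mv, "Production", approver="reviewer@example.com", note="shadow test passed")
```

The history list is what gives an auditor the changelog: who approved which transition, and why.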

Reproducibility: the four pillars

Tracking logs what happened; reproducibility ensures you can make it happen again. Four pillars must be pinned together:

1. Code

Log the git SHA (or a tarball hash for non-git workflows). Uncommitted local edits are among the most common reproducibility killers, because the logged SHA no longer describes the code that actually ran, so CI should refuse to register production candidates from dirty trees.
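A dirty-tree check might look like the sketch below. It shells out to git rev-parse HEAD and git status --porcelain, which prints nothing when the working tree is clean. The function names are hypothetical, and note one assumption: untracked files also count as dirty here, which may be stricter than some teams want.

```python
import subprocess


def parse_git_state(sha_output, status_output):
    """Turn raw git output into a small dict that can be logged with the run."""
    return {
        "git_sha": sha_output.strip(),
        "dirty": bool(status_output.strip()),  # any porcelain line means local changes
    }


def read_git_state(repo_dir="."):
    """Query git for the current commit and working-tree status."""
    def git(*args):
        return subprocess.run(
            ["git", *args], cwd=repo_dir, check=True,
            capture_output=True, text=True,
        ).stdout
    return parse_git_state(git("rev-parse", "HEAD"), git("status", "--porcelain"))


def require_clean_tree(state):
    """Refuse to proceed when the logged SHA would not match the code that ran."""
    if state["dirty"]:
        raise RuntimeError("refusing to register a candidate from a dirty tree")
    return state["git_sha"]


# Parsing is kept separate so it can be exercised without a repository.
clean = parse_git_state("a3f91c2\n", "")
dirty = parse_git_state("a3f91c2\n", " M train.py\n")
```

In a CI job, require_clean_tree(read_git_state()) would run before any registry call.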

2. Data

Record dataset version, snapshot path, or content hash. If training reads “latest” from a warehouse without a snapshot, rerunning the same params next week trains on different rows. Pair tracking with immutable snapshots or tools like DVC/LakeFS for large files.
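For a snapshot that is available as a local file, a content hash is cheap to compute and log. The sketch below streams the file in chunks so large snapshots do not need to fit in memory; the function name file_sha256 and the sample file contents are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's bytes in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file with made-up contents.
sample = b"customer_id,churned\n1,0\n2,1\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "snapshot.csv")
    with open(path, "wb") as fh:
        fh.write(sample)
    snapshot_hash = file_sha256(path)

expected_hash = hashlib.sha256(sample).hexdigest()
```

Logging snapshot_hash next to the snapshot path lets a later rerun verify that the bytes it reads are the bytes the original run trained on.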

3. Environment

Pin Python, CUDA, and library versions in a lockfile or container image. Log the image digest at run start. The same training code can produce a different model under a newer library release, for example scikit-learn 1.3 versus 1.5, if an estimator's defaults or implementation changed in between.
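When no container digest is available, a run can at least record the interpreter and package versions it actually saw. The following is a minimal sketch using the standard library's importlib.metadata; the function name capture_environment and the package list are assumptions.

```python
import platform
from importlib import metadata


def capture_environment(packages):
    """Record the Python version, platform string, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# The last name is deliberately bogus to show how a missing package is recorded.
env = capture_environment(["scikit-learn", "lightgbm", "a-package-that-is-not-installed-xyz"])
```

This complements a lockfile rather than replacing it: the lockfile states what should be installed, while the captured dict states what was.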

4. Randomness

Set and log seeds for Python, NumPy, and framework RNGs. Some GPU ops remain nondeterministic; document that limitation rather than claiming bit-identical replay.
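A seeding helper can return what it seeded so the result is logged with the run. The sketch below covers Python's random module and NumPy if it is installed; the function name set_seeds is an assumption, and framework-specific calls would need to be added for whatever framework is in use.

```python
import random


def set_seeds(seed):
    """Seed the available RNGs and report which ones were seeded."""
    random.seed(seed)
    seeded = ["python"]
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy RNG
        seeded.append("numpy")
    except ImportError:
        pass
    # Framework RNGs (for example torch.manual_seed(seed)) would be seeded here too.
    return {"seed": seed, "seeded": seeded}


seed_info = set_seeds(42)
first = [random.random() for _ in range(3)]

set_seeds(42)
second = [random.random() for _ in range(3)]
```

The two lists match because the generator was reset to the same state, which is the property a rerun relies on for CPU-side randomness.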

Reproducibility supports fair comparison during cross-validation and is a prerequisite for serious hyperparameter search.

What to log (and what to skip)

Log everything needed to answer: “Why did this run beat that one?” and “Can we ship it?”

Always log — hyperparameters, final and per-epoch metrics, data split config, class weights, feature list version, git SHA, seed.
Log selectively — large intermediate tensors, per-sample gradients (storage cost explodes). Sample or aggregate instead.
Log on failure too — stack traces and partial metrics help debug OOMs and NaN divergences without rerunning blindly.
Log business metrics — not just AUC but expected revenue lift or fraud dollars saved at a chosen threshold.

For deep learning, log learning-rate schedules, batch size, and gradient norms. For gradient boosting, log num_leaves, min_child_weight, and early-stopping round. Consistency matters more than volume.
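A business metric at a chosen threshold can be derived from the same predictions used for AUC. In the sketch below the labels, scores, value per retained customer (120), and cost per unnecessary offer (15) are all invented for illustration, as are the function names.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when scores at or above the threshold are flagged."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and y:
            tp += 1
        elif flagged and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def business_metrics(y_true, scores, threshold, value_per_tp, cost_per_fp):
    """Precision plus a simple net value: gains from true positives minus false-positive costs."""
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "threshold": threshold,
        "precision": precision,
        "net_value": tp * value_per_tp - fp * cost_per_fp,
    }


# Hypothetical validation labels and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.55]
result = business_metrics(y_true, scores, threshold=0.5, value_per_tp=120, cost_per_fp=15)
```

With these made-up numbers there are two true positives and two false positives, so precision is 0.5 and net value is 2 × 120 − 2 × 15 = 210.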

Integration with the ML lifecycle

Experiment tracking sits at the center of a mature MLOps stack:

Feature pipelines — link runs to feature store versions so train-serve parity is traceable.
Training CI — nightly retrain jobs auto-log runs; regressions trigger alerts when val AUC drops below the production baseline.
Hyperparameter search — Optuna, Ray Tune, or SageMaker emit runs to the tracker; the best trial gets promoted to the registry.
Serving — production endpoint metadata points to registry version and parent run for rollback.
Monitoring — when drift fires, engineers spawn a new experiment branch from the last known-good run.
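The training-CI regression alert in the list above reduces to a comparison against the production baseline. The sketch below assumes a hypothetical check_regression helper and a tolerance of 0.005 AUC to absorb run-to-run noise; both the name and the tolerance are illustrative choices.

```python
def check_regression(candidate_auc, baseline_auc, tolerance=0.005):
    """Flag a nightly run whose val AUC falls below the baseline by more than the tolerance."""
    drop = baseline_auc - candidate_auc
    return {"regressed": drop > tolerance, "drop": round(drop, 6)}


# Hypothetical nightly results compared against a 0.941 production baseline.
alert = check_regression(candidate_auc=0.930, baseline_auc=0.941)
ok = check_regression(candidate_auc=0.939, baseline_auc=0.941)
```

A CI job would log the returned dict with the run and fail the pipeline or page someone when regressed is true.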

Worked example: churn classifier sweep

A subscription SaaS team predicts 30-day churn. They create experiment churn-2026q2 and launch twelve LightGBM runs varying learning_rate (0.01–0.2), num_leaves (16–64), and scale_pos_weight (1 vs 3) with a fixed seed and frozen training snapshot s3://ml-data/churn/v2026-06-01.parquet.

Each run logs per-iteration val AUC, precision at a 0.5 threshold, training duration, and git SHA a3f91c2. Run churn-lgbm-lr0.05-leaves32-cw3 tops the leaderboard at val AUC 0.941. The team logs a confusion matrix artifact, registers the model as churn-lgbm/3 in Staging, runs shadow inference for two weeks, then promotes to Production with an approval note linking the run ID.

Three months later, drift alerts fire. They filter runs tagged production-candidate, compare against the current prod run, and branch a new experiment from the same data snapshot — without guessing which notebook worked.

Tool decision table

Situation	Recommendation
Solo researcher, local sklearn/PyTorch	MLflow local or TensorBoard — low setup, sufficient for dozens of runs.
Small team, need collaboration and rich viz	Weights & Biases or Neptune — shared dashboards, report generation.
Enterprise, on-prem governance requirements	MLflow Tracking Server + Model Registry behind SSO; or SageMaker Experiments.
Heavy deep learning, GPU cluster	W&B or TensorBoard with cluster launcher integration; log system metrics.
Regulated industry needing audit trails	Registry with approval workflows; immutable run records; tie to data lineage.
Prototype with <10 runs total	Structured JSON logs in git may suffice — upgrade before team grows.

MLflow is a widely adopted open-source option (params/metrics/artifacts API, plus autologging for frameworks such as scikit-learn, XGBoost, and PyTorch via PyTorch Lightning). W&B excels at visualization and team workflows. Avoid building a bespoke tracker unless you have unusual compliance constraints, because the maintenance cost is high.

Common pitfalls

Logging without conventions — inconsistent param names (lr vs learning_rate) break comparison queries.
Mutable “latest” data — runs are not reproducible if the underlying table changes daily.
Notebook-only workflow — interactive tweaks never logged; the “best” cell output is lost on kernel restart.
Metric leakage in logs — logging test-set AUC during tuning and picking the max turns the test set into a second validation set. Tune on validation metrics and evaluate on the held-out test set only once, at the end.
Artifact bloat — storing full model checkpoints every epoch fills storage; keep best-only or top-k checkpoints.
Orphan runs — CI jobs that crash before the run is closed (for example, before mlflow.end_run() or W&B's run.finish() is called) leave incomplete records; use try/finally or context managers.
Registry without governance — anyone can push to Production; add role-based promotion and changelog notes.
Ignoring failed runs — failures teach as much as successes; tag root cause (OOM, NaN, data schema mismatch).
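The orphan-run and failed-run pitfalls can both be addressed with a context manager that always closes the record. The sketch below is tracker-agnostic: tracked_run, the in-memory store list, and the status strings are assumptions, and a real tracker's own context manager would play the same role.

```python
import contextlib
import time
import traceback


@contextlib.contextmanager
def tracked_run(store, name):
    """Open a run record, and guarantee it is closed with a final status."""
    record = {"name": name, "status": "RUNNING", "metrics": {}, "started": time.time()}
    store.append(record)
    try:
        yield record
        record["status"] = "FINISHED"
    except BaseException as exc:
        # Keep partial metrics and the error text, then let the exception propagate.
        record["status"] = "FAILED"
        record["error"] = "".join(traceback.format_exception_only(type(exc), exc)).strip()
        raise
    finally:
        record["ended"] = time.time()


store = []
try:
    with tracked_run(store, "demo-run") as run:
        run["metrics"]["val_auc"] = 0.90          # partial metric logged before the crash
        raise MemoryError("simulated OOM")        # stand-in for a real failure
except MemoryError:
    pass

failed = store[0]
```

After the simulated crash, the record is marked FAILED, still holds the partial val_auc, and has an end timestamp, so nothing is left in a RUNNING state.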

Production checklist

Every training script calls the tracker API; no metrics are copied and pasted by hand.
Standard param and metric naming schema documented in the repo README.
Git SHA, seed, and data snapshot version logged at run start.
Container image digest or conda lockfile recorded per run.
Failed runs log exceptions and partial metrics before exit.
Leaderboard query defined (e.g., max val AUC with min precision floor).
Model registry stages with required approvers for Production promotion.
Production endpoint metadata links to registry version and parent run.
Retention policy for artifacts (e.g., 90 days for exploratory, indefinite for prod lineage).
Quarterly audit: can you reproduce the current production model from logged artifacts?
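The leaderboard query from the checklist (max val AUC with a minimum precision floor) can be written down once so that everyone ranks runs the same way. The sketch below assumes each run exposes final val_auc and precision values; the leaderboard function, the 0.60 floor, and the run data are hypothetical.

```python
def leaderboard(runs, metric="val_auc", floor_metric="precision", floor=0.0, top_k=3):
    """Rank runs by the metric, keeping only those at or above the floor."""
    eligible = [r for r in runs if r["metrics"].get(floor_metric, 0.0) >= floor]
    return sorted(eligible, key=lambda r: r["metrics"][metric], reverse=True)[:top_k]


# Hypothetical final metrics for four runs.
runs = [
    {"name": "run-a", "metrics": {"val_auc": 0.945, "precision": 0.52}},
    {"name": "run-b", "metrics": {"val_auc": 0.941, "precision": 0.71}},
    {"name": "run-c", "metrics": {"val_auc": 0.930, "precision": 0.68}},
    {"name": "run-d", "metrics": {"val_auc": 0.925}},  # precision never logged
]

top = leaderboard(runs, floor=0.60, top_k=2)
```

Here run-a has the highest AUC but is excluded by the precision floor, and run-d is excluded because a missing precision value is treated as 0.0.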

Key takeaways

Experiment tracking logs params, metrics, artifacts, and context for every training run.
Reproducibility requires pinning code, data, environment, and seeds — not just saving the model file.
Model registries bridge experimentation and production serving with versioned promotion.
Adopt trackers early — retrofitting after hundreds of notebook experiments is painful.
Pair tracking with solid MLOps hygiene and fair validation practices.