MLflow fundamentals explained
A data scientist retrains a churn model every Monday. By Wednesday nobody can remember which random seed, feature subset, or threshold produced the 0.91 AUC that shipped last month. The notebook is gone; the pickle file on a shared drive has no metadata. MLflow exists to prevent exactly this. It is an open-source platform that records every training attempt as a structured run with parameters, metrics, code version, and artifacts, then packages the winning model for deployment through a model registry. It integrates with scikit-learn, PyTorch, XGBoost, and other popular libraries via autologging. This guide covers experiments and runs, the Tracking API, Models and flavors, registry stages, Projects, a Harbor Analytics churn forecaster worked example, a tooling decision table, common pitfalls, and a production checklist, and it complements our broader MLOps overview.
What MLflow is and the four components
MLflow is an Apache-licensed toolkit for managing the machine learning lifecycle. Unlike a single-purpose logging library, it spans four loosely coupled components that teams adopt incrementally:
- Tracking: log parameters, metrics, tags, and artifacts per run; compare results in the UI or API.
- Projects: package training code with an MLproject file and reproducible entry points.
- Models: save models in a standard format with environment metadata (conda.yaml or requirements.txt).
- Registry: versioned model store with staging labels (Staging, Production, Archived) and access control.
Most teams start with Tracking only: wrap training loops in
mlflow.start_run(), point the tracking URI at a local directory or
remote server, and stop losing experiment history. Registry and Projects matter
once multiple engineers promote models through CI/CD pipelines described in our
model serving guide.
When MLflow is the right default
- Python-first ML teams training tabular models, classical NLP, or modest deep learning workloads.
- Self-hosted or air-gapped environments where SaaS experiment trackers are blocked by policy.
- Organizations already on Databricks — managed MLflow is the native tracking layer.
- Teams needing a model registry without building custom artifact storage and promotion workflows.
Consider alternatives when you need rich collaborative dashboards for large research teams (Weights & Biases), Git-native data versioning as the primary primitive (DVC), or Kubernetes-native pipeline orchestration without Python glue (Kubeflow Pipelines). MLflow complements those tools; it rarely replaces an entire MLOps stack alone.
Experiments, runs, and the Tracking API
An experiment is a named bucket for related runs — for example
churn-forecaster-v2 or fraud-xgboost-tuning. Each
run is one execution of training code. Runs capture:
- Parameters — hyperparameters and config knobs (strings, floats, ints). Immutable once logged.
- Metrics — numeric time series (AUC per epoch, validation loss). Can be logged multiple times per run.
- Tags — arbitrary key-value metadata (git commit, dataset version, author).
- Artifacts — files: model weights, plots, confusion matrices, SHAP summaries.
The minimal Python pattern wraps your training block:
import mlflow
mlflow.set_experiment("churn-forecaster")
with mlflow.start_run(run_name="lr-baseline"):
    mlflow.log_param("C", 1.0)
    mlflow.log_param("max_iter", 500)
    # ... train model ...
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("reports/confusion_matrix.png")
Set MLFLOW_TRACKING_URI to file:///path/to/mlruns for local
storage or http://mlflow-server:5000 for a shared tracking server. For production,
back that server with a database metadata store (for example PostgreSQL) and object
storage for artifacts (for example S3). The MLflow UI
(mlflow ui) renders parallel coordinates plots and metric charts so you
can filter runs by parameter ranges — essential when
hyperparameter search
produces hundreds of candidates.
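The same filtering can be done from Python when the candidate list is too long to browse. The following is a minimal sketch, assuming metric names made of plain letters, digits, and underscores (other names would need quoting in the filter string). The helper names metric_filter and top_runs are hypothetical; mlflow.search_runs is the library call that does the work.

```python
def metric_filter(bounds):
    """Build an MLflow search filter from {metric: (comparator, value)} pairs."""
    allowed = {">", ">=", "<", "<=", "=", "!="}
    clauses = []
    for name, (op, value) in bounds.items():
        if op not in allowed:
            raise ValueError(f"unsupported comparator: {op}")
        clauses.append(f"metrics.{name} {op} {value}")
    return " and ".join(clauses)


def top_runs(experiment_name, bounds, order_metric, limit=5):
    """Return the best matching runs as a pandas DataFrame."""
    # Imported here so metric_filter stays usable without MLflow installed.
    import mlflow

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=metric_filter(bounds),
        order_by=[f"metrics.{order_metric} DESC"],
        max_results=limit,
    )
```

Calling top_runs("churn-forecaster", {"val_auc": (">", 0.85)}, "val_auc") would query whichever tracking server MLFLOW_TRACKING_URI points at; the 0.85 threshold is only an illustration.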
Autologging
For supported libraries, one line captures most training metadata automatically:
import mlflow
mlflow.sklearn.autolog()  # or mlflow.pytorch.autolog()
with mlflow.start_run():
    pipeline.fit(X_train, y_train)  # pipeline, X_train, y_train are defined elsewhere
Autolog records estimator parameters, training metrics, the fitted model as an
artifact, and the conda environment. Disable specific facets with
mlflow.autolog(log_models=False) when models are huge or you log them
manually with custom signatures.
MLflow Models: flavors, signatures, and loading
Saving a model through MLflow produces a directory (or archive) with an
MLmodel YAML manifest describing flavors:
framework-specific load paths. A scikit-learn classifier might expose both
python_function (generic pyfunc) and sklearn flavors.
mlflow.sklearn.log_model(
    sk_model=clf,
    artifact_path="model",
    registered_model_name="churn-classifier",
)
A model signature documents input and output schemas (column names,
tensor shapes). Infer it with mlflow.models.infer_signature from sample
inputs and the model's predictions on them so downstream serving layers can
validate requests. Load any logged model with
mlflow.pyfunc.load_model("runs:/<run_id>/model") for framework-agnostic
inference, or use flavor-specific loaders when you need native objects.
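Putting signature inference, logging, and loading together might look like the sketch below. It assumes a fitted scikit-learn estimator and a small sample of its inputs. The function names run_model_uri and log_and_reload are hypothetical, and the artifact path "model" simply mirrors the snippet above.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI that mlflow.pyfunc.load_model accepts."""
    return f"runs:/{run_id}/{artifact_path}"


def log_and_reload(clf, X_sample, artifact_path="model"):
    """Log a fitted scikit-learn model with a signature, then reload it as pyfunc."""
    # Imported here so run_model_uri stays usable without MLflow installed.
    import mlflow
    import mlflow.pyfunc
    import mlflow.sklearn
    from mlflow.models import infer_signature

    # Infer the schema from real inputs and the model's own predictions.
    signature = infer_signature(X_sample, clf.predict(X_sample))
    with mlflow.start_run() as run:
        mlflow.sklearn.log_model(
            sk_model=clf,
            artifact_path=artifact_path,
            signature=signature,
        )
    # The pyfunc wrapper exposes a uniform predict() regardless of flavor.
    return mlflow.pyfunc.load_model(run_model_uri(run.info.run_id, artifact_path))
```

The returned object offers predict() on the same kind of input used for X_sample. When the native estimator is needed instead, mlflow.sklearn.load_model takes the same URI.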
Input examples and environment files
Attach input_example when logging so MLflow can infer signatures and
generate example payloads. The bundled conda.yaml or
python_env.yaml pins dependencies — critical for reproducibility,
but review it: autolog sometimes captures overly broad version ranges. Pin exact
versions before promoting to Production.
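A pre-promotion check for loose version constraints can be quite small. The sketch below assumes the dependencies are available as requirements.txt-style lines (the pip section of a conda.yaml would have to be extracted first) and treats only an exact == pin without wildcards as pinned. The function name unpinned_requirements is hypothetical.

```python
import re

# A requirement counts as pinned only if it uses "==" with no wildcard.
_PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[^=*\s]+$")


def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version."""
    loose = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(line):
            loose.append(line)
    return loose
```

A CI job could fail the promotion step whenever this list is non-empty for the requirements file stored with the model.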
Model Registry: stages, versions, and promotion
The Model Registry sits on top of the tracking server. Registering
a model creates a named entity (e.g. harbor-churn-classifier) with
versioned artifacts linked to source runs. Each version carries:
- Stage labels: None, Staging, Production, Archived
- Description and tags (owner, validation dataset hash, approval ticket)
- Lineage back to the originating run ID and git commit tag
Promotion workflows typically move a validated version to Staging after offline
evaluation, then to Production after shadow or canary deployment. Archive superseded
versions instead of deleting — auditors and rollback paths depend on history.
Use the Python client (MlflowClient().transition_model_version_stage)
or REST API from CI pipelines; avoid manual UI-only promotion at scale.
Aliases and deployment targets
MLflow supports model aliases (@champion,
@challenger) that resolve to a specific version at load time; since MLflow 2.9,
fixed stages are deprecated in favor of aliases. Serving infrastructure (SageMaker,
Azure ML, custom FastAPI wrappers) loads by alias URI
(models:/harbor-churn-classifier@champion) or by stage URI
(models:/harbor-churn-classifier/Production).
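Alias-based resolution can be sketched as follows. The helper names model_uri and promote_to_champion are hypothetical; the registry call used is MlflowClient.set_registered_model_alias, and it needs a reachable tracking server with a registry behind it.

```python
def model_uri(name, alias=None, version=None):
    """Build a models:/ URI from an alias or an explicit version number."""
    if (alias is None) == (version is None):
        raise ValueError("pass exactly one of alias or version")
    if alias is not None:
        return f"models:/{name}@{alias.lstrip('@')}"
    return f"models:/{name}/{version}"


def promote_to_champion(name, version):
    """Point the champion alias at a version and return the URI to serve from."""
    # Imported here so model_uri stays usable without MLflow installed.
    from mlflow import MlflowClient

    client = MlflowClient()
    client.set_registered_model_alias(name, "champion", str(version))
    return model_uri(name, alias="champion")
```

Because the serving layer keeps loading the same alias URI, moving the alias swaps the served version without a configuration change on the serving side.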
MLflow Projects and reproducibility
An MLflow Project is a directory with an MLproject file defining
named entry points, parameters with defaults, and the environment (conda, virtualenv,
or Docker). Run a project directly from a Git URL with:
mlflow run https://github.com/org/churn-training \
    -P learning_rate=0.01 -P max_depth=6
Projects shine when training code lives in Git and you want identical environments on laptops, CI runners, and GPU nodes. They are less popular than plain scripts plus Docker for teams already standardized on containers, but the parameter schema and automatic run logging remain useful for internal ML platforms.
Always tag runs with mlflow.set_tag("mlflow.source.git.commit", sha) or
enable Git integration so the UI links to the exact code revision. Pair with dataset
version tags (DVC hash, Delta table version) for full lineage.
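One way to gather those lineage tags is sketched below. It assumes git is on the PATH and that training runs from inside the repository. The tag names dataset_version and author are hypothetical conventions, not MLflow built-ins; mlflow.source.git.commit is the tag named above.

```python
import getpass
import subprocess


def current_git_sha(repo_dir="."):
    """Return the checked-out commit SHA, or None outside a Git repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return out.stdout.strip()


def lineage_tags(dataset_version, sha=None, author=None):
    """Assemble the tag dict to pass to mlflow.set_tags inside a run."""
    tags = {
        "dataset_version": str(dataset_version),  # hypothetical tag name
        "author": author or getpass.getuser(),    # hypothetical tag name
    }
    if sha:
        tags["mlflow.source.git.commit"] = sha
    return tags
```

Inside a with mlflow.start_run() block, mlflow.set_tags(lineage_tags(snapshot_id, sha=current_git_sha())) would attach all of them in one call, where snapshot_id stands for whatever dataset identifier the team uses.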
Worked example: Harbor Analytics churn forecaster
Harbor Analytics predicts which free-tier users will cancel within 30 days. The team runs weekly retraining on a feature store export with scikit-learn pipelines (imputation, scaling, gradient boosting classifier).
Experiment structure
- Experiment: harbor-churn-weekly
- Tags per run: data_snapshot=2026-06-02, feature_set=v3
- Parameters: max_depth, learning_rate, min_child_weight
- Metrics: val_auc, val_precision_at_10pct, calibration_error
- Artifacts: SHAP summary plot, calibration curve PNG, threshold_report.json
After Optuna search logs 80 runs, the data scientist filters
val_auc > 0.89 and calibration_error < 0.03, selects
run a1b2c3, registers it as harbor-churn-classifier version 47,
transitions it to Staging, and assigns it the @challenger alias. The deployment service loads
models:/harbor-churn-classifier@challenger for shadow traffic against
version 46 in Production. If shadow precision holds for seven days, CI promotes
version 47 to Production and archives 46.
This closes the loop between experimentation and MLOps governance without custom spreadsheet tracking.
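The CI gate in this example could be sketched as below. What "holds" means is an assumption here: the challenger's daily precision must be at least the champion's, minus an optional tolerance, on each of the last seven days. The function names and the daily-precision inputs are hypothetical. The stage transition uses MlflowClient.transition_model_version_stage, which MLflow 2.9 and later mark as deprecated in favor of aliases.

```python
def shadow_holds(challenger, champion, days=7, tolerance=0.0):
    """True if the challenger matched the champion on each of the last `days` days.

    Both arguments are lists of daily precision values, oldest first.
    """
    if len(challenger) < days or len(champion) < days:
        return False  # not enough shadow history yet
    recent = zip(challenger[-days:], champion[-days:])
    return all(c >= p - tolerance for c, p in recent)


def promote_if_ready(name, version, challenger, champion):
    """Promote the version to Production only when the shadow gate passes."""
    if not shadow_holds(challenger, champion):
        return False
    # Imported here so the gate can be unit-tested without MLflow installed.
    from mlflow import MlflowClient

    MlflowClient().transition_model_version_stage(
        name=name,
        version=str(version),
        stage="Production",
        archive_existing_versions=True,  # archives the previous Production version
    )
    return True
```

Keeping the gate a pure function makes it easy to unit-test in CI before it is ever allowed to touch the registry.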
Tooling decision table
| Need | MLflow fit | Alternative |
|---|---|---|
| Python experiment logging + model registry | Strong default | Neptune, W&B (SaaS) |
| Self-hosted, no external SaaS | Strong | ClearML self-hosted |
| Git-native dataset versioning | Partial (tags only) | DVC + CML |
| Rich media dashboards (images, 3D) | Basic UI | Weights & Biases |
| Databricks lakehouse integration | Native | Unity Catalog models |
| Kubernetes pipeline DAGs | Via Projects or external | Kubeflow, Airflow |
| LLM prompt/version tracking | Emerging (LLM flavor) | LangSmith, Phoenix |
Common pitfalls
- Logging inside tight loops: calling log_metric per mini-batch on thousands of steps floods the tracking server. Log per epoch or use throttled callbacks.
- Giant artifacts: logging multi-GB checkpoints without log_models=False fills S3 buckets. Log only the final epoch or use external artifact stores with lifecycle rules.
- Mutable parameters: MLflow parameters are meant to be immutable. Log final values once; use metrics for values that change during training.
- Missing signatures: models without input schemas break serving validation. Always infer or declare signatures on promotion.
- Stale Production aliases: promoting in the UI but forgetting to update the serving URI leaves traffic on old weights. Automate promotion in CI with health checks.
- Local-only tracking on teams: ./mlruns on laptops is not shared knowledge. Centralize the tracking URI early.
- Environment drift: trusting autologged conda.yaml without pinning lets Production diverge from training. Lock versions in the registry artifact.
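For the first pitfall, a throttled logger might look like the sketch below. The class name ThrottledMetricLogger and the default interval of 100 steps are assumptions. The log function is injected so that mlflow.log_metric can be passed in during training and a stub in tests.

```python
class ThrottledMetricLogger:
    """Forward a metric to log_fn only every `every` steps."""

    def __init__(self, log_fn, every=100):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.log_fn = log_fn
        self.every = every

    def log(self, key, value, step):
        """Log on steps 0, every, 2*every, ...; return True if forwarded."""
        if step % self.every != 0:
            return False
        self.log_fn(key, value, step=step)
        return True
```

In a training loop, ThrottledMetricLogger(mlflow.log_metric, every=100).log("train_loss", loss, step) would send one request per hundred mini-batches instead of one per batch.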
Practitioner checklist
- Set MLFLOW_TRACKING_URI to a shared server or cloud backend before team training begins.
- Name experiments consistently (project-task); use run names or tags for variants.
- Enable autolog for supported frameworks; disable model logging when artifacts are too large.
- Tag every run with git commit, dataset version, and author.
- Log validation metrics on held-out data — never only training loss.
- Define model signatures with infer_signature on representative inputs.
- Register models by logical name; use Staging before Production promotion.
- Wire CI to transition stages after automated evaluation gates pass.
- Archive old Production versions; never delete registry history without policy review.
- Monitor served models for drift separately — MLflow tracks training, not live traffic.
Key takeaways
- MLflow Tracking records parameters, metrics, tags, and artifacts per run for reproducible comparison.
- Autologging integrates with scikit-learn, PyTorch, XGBoost, and more with minimal code changes.
- MLflow Models package estimators with flavors, signatures, and environment metadata for portable deployment.
- Model Registry versions models through Staging and Production with lineage to source runs.
- Projects are optional but useful for parameterized, Git-based reproducible training entry points.
Related reading
- MLOps explained — pipelines, deployment, monitoring, and governance beyond tracking
- scikit-learn fundamentals explained — pipelines and estimators MLflow autologs
- Hyperparameter tuning explained — search strategies whose runs MLflow should capture
- Model serving explained — loading registry models into production inference APIs