MLflow fundamentals explained

A data scientist retrains a churn model every Monday. By Wednesday nobody can remember which random seed, feature subset, or threshold produced the 0.91 AUC that shipped last month. The notebook is gone; the pickle file on a shared drive has no metadata. MLflow exists to prevent exactly this. It is an open-source platform that records every training attempt as a structured run with parameters, metrics, code version, and artifacts, then versions the winning model in a registry so it can be deployed. It integrates natively with scikit-learn, PyTorch, XGBoost, and dozens of other libraries via autologging. This guide covers experiments and runs, the Tracking API, Projects, Models and flavors, registry stages, a Harbor Analytics churn forecaster worked example, a tooling decision table, common pitfalls, and a production checklist, alongside our broader MLOps overview.

What MLflow is and the four components

MLflow is an Apache-licensed toolkit for managing the machine learning lifecycle. Unlike a single-purpose logging library, it spans four loosely coupled components that teams adopt incrementally:

  • Tracking — log parameters, metrics, tags, and artifacts per run; compare results in the UI or API.
  • Projects — package training code with an MLproject file and reproducible entry points.
  • Models — save models in a standard format with environment metadata (conda.yaml or requirements.txt).
  • Registry — versioned model store with stage labels (Staging, Production, Archived) and access control.

Most teams start with Tracking only: wrap training loops in mlflow.start_run(), point the tracking URI at a local directory or remote server, and stop losing experiment history. Registry and Projects matter once multiple engineers promote models through CI/CD pipelines described in our model serving guide.

When MLflow is the right default

  • Python-first ML teams training tabular models, classical NLP, or modest deep learning workloads.
  • Self-hosted or air-gapped environments where SaaS experiment trackers are blocked by policy.
  • Organizations already on Databricks — managed MLflow is the native tracking layer.
  • Teams needing a model registry without building custom artifact storage and promotion workflows.

Consider alternatives when you need rich collaborative dashboards for large research teams (Weights & Biases), Git-native data versioning as the primary primitive (DVC), or Kubernetes-native pipeline orchestration without Python glue (Kubeflow Pipelines). MLflow complements those tools; it rarely replaces an entire MLOps stack alone.

Experiments, runs, and the Tracking API

An experiment is a named bucket for related runs — for example churn-forecaster-v2 or fraud-xgboost-tuning. Each run is one execution of training code. Runs capture:

  • Parameters — hyperparameters and config knobs (strings, floats, ints; MLflow stores them all as strings). Immutable once logged.
  • Metrics — numeric time series (AUC per epoch, validation loss). Can be logged multiple times per run.
  • Tags — arbitrary key-value metadata (git commit, dataset version, author).
  • Artifacts — files: model weights, plots, confusion matrices, SHAP summaries.

The minimal Python pattern wraps your training block:

import mlflow

mlflow.set_experiment("churn-forecaster")
with mlflow.start_run(run_name="lr-baseline"):
    mlflow.log_param("C", 1.0)
    mlflow.log_param("max_iter", 500)
    # ... train model ...
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("reports/confusion_matrix.png")

Set MLFLOW_TRACKING_URI to file:///path/to/mlruns for local storage, http://mlflow-server:5000 for a shared tracking server, or the URL of a tracking server backed by cloud storage (for example S3 for artifacts and PostgreSQL for metadata) for production. The MLflow UI (mlflow ui) renders parallel coordinates plots and metric charts so you can filter runs by parameter ranges, which is essential when hyperparameter search produces hundreds of candidates.
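As a minimal sketch of the URI choices above, the helper below picks a tracking backend in one place. The names resolve_tracking_uri and log_baseline_run, and the ./mlruns fallback, are assumptions made for illustration; they are not part of MLflow, which also reads MLFLOW_TRACKING_URI on its own.

```python
import os
from pathlib import Path


def resolve_tracking_uri(env=None, local_dir="mlruns"):
    """Return MLFLOW_TRACKING_URI if set, else a file:// URI for a local directory.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI")
    if uri:
        return uri
    # Path.as_uri() requires an absolute path, so resolve first.
    return Path(local_dir).resolve().as_uri()


def log_baseline_run(params, metrics):
    """Log one run against whichever backend resolve_tracking_uri() picks."""
    import mlflow  # imported lazily so the helper above works without MLflow

    mlflow.set_tracking_uri(resolve_tracking_uri())
    mlflow.set_experiment("churn-forecaster")
    with mlflow.start_run(run_name="lr-baseline"):
        mlflow.log_params(params)    # dict of name -> value
        mlflow.log_metrics(metrics)  # dict of name -> float
```

A call such as log_baseline_run({"C": 1.0, "max_iter": 500}, {"val_auc": 0.87}) would then write to the shared server when the variable is set and to a local directory otherwise.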

Autologging

For supported libraries, one line captures most training metadata automatically:

import mlflow

mlflow.sklearn.autolog()  # or mlflow.pytorch.autolog()
with mlflow.start_run():
    # pipeline, X_train, and y_train come from your own training code
    pipeline.fit(X_train, y_train)

Autolog records estimator parameters, training metrics, the fitted model as an artifact, and the conda environment. Disable specific facets with mlflow.autolog(log_models=False) when models are huge or you log them manually with custom signatures.

MLflow Models: flavors, signatures, and loading

Saving a model through MLflow produces a directory (or archive) with an MLmodel YAML manifest describing flavors, which are framework-specific load paths. A scikit-learn classifier might expose both python_function (generic pyfunc) and sklearn flavors.

mlflow.sklearn.log_model(
    sk_model=clf,
    artifact_path="model",
    registered_model_name="churn-classifier"
)

A model signature documents input and output schemas (column names, tensor shapes). Define it explicitly with mlflow.models.infer_signature on sample inputs so downstream serving layers can validate requests. Load any logged model with mlflow.pyfunc.load_model("runs:/<run_id>/model") for framework-agnostic inference, or use flavor-specific loaders when you need native objects.
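The signature and loading steps can be sketched together as follows. This is an illustration under assumptions: run_model_uri and log_with_signature are hypothetical helper names, clf is any fitted scikit-learn estimator, and X_sample is a small representative slice of its input (a NumPy array or DataFrame).

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI that mlflow.pyfunc.load_model accepts."""
    return f"runs:/{run_id}/{artifact_path}"


def log_with_signature(clf, X_sample):
    """Log a fitted scikit-learn model with an inferred signature, then reload it."""
    import mlflow  # lazy import keeps run_model_uri usable without MLflow
    from mlflow.models import infer_signature

    # Infer the input and output schema from sample inputs and predictions.
    signature = infer_signature(X_sample, clf.predict(X_sample))
    with mlflow.start_run() as run:
        mlflow.sklearn.log_model(
            sk_model=clf,
            artifact_path="model",
            signature=signature,
            input_example=X_sample[:5],
        )
    # Framework-agnostic reload through the pyfunc flavor.
    return mlflow.pyfunc.load_model(run_model_uri(run.info.run_id))
```

The returned object exposes predict(), so a caller can score new rows without knowing the model came from scikit-learn.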

Input examples and environment files

Attach input_example when logging so MLflow can infer signatures and generate example payloads. The bundled conda.yaml or python_env.yaml pins dependencies — critical for reproducibility, but review it: autolog sometimes captures overly broad version ranges. Pin exact versions before promoting to Production.

Model Registry: stages, versions, and promotion

The Model Registry sits on top of the tracking server. Registering a model creates a named entity (e.g. harbor-churn-classifier) with versioned artifacts linked to source runs. Each version carries:

  • Stage labels: None, Staging, Production, Archived
  • Description and tags (owner, validation dataset hash, approval ticket)
  • Lineage back to the originating run ID and git commit tag

Promotion workflows typically move a validated version to Staging after offline evaluation, then to Production after shadow or canary deployment. Archive superseded versions instead of deleting — auditors and rollback paths depend on history. Use the Python client (MlflowClient().transition_model_version_stage) or REST API from CI pipelines; avoid manual UI-only promotion at scale.
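One way a CI job might gate promotion is sketched below. The function names and the floor/ceiling threshold dictionaries are assumptions for illustration; only MlflowClient().transition_model_version_stage comes from the text above, and recent MLflow releases emit a deprecation warning for it.

```python
def passes_gates(metrics, min_gates, max_gates):
    """Return True when every metric clears its floor and stays under its ceiling.

    A metric missing from `metrics` fails its gate.
    """
    for name, floor in min_gates.items():
        if metrics.get(name, float("-inf")) < floor:
            return False
    for name, ceiling in max_gates.items():
        if metrics.get(name, float("inf")) > ceiling:
            return False
    return True


def promote_if_valid(name, version, metrics, min_gates, max_gates, stage="Staging"):
    """Transition a registered model version only when the gates pass."""
    if not passes_gates(metrics, min_gates, max_gates):
        return False
    from mlflow.tracking import MlflowClient  # lazy import, needed only on success

    MlflowClient().transition_model_version_stage(
        name=name, version=version, stage=stage
    )
    return True
```

Keeping the gate logic separate from the registry call makes it easy to unit test the thresholds without a tracking server.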

Aliases and deployment targets

MLflow supports model aliases (@champion, @challenger) for dynamic resolution without rewriting stage enums, and since MLflow 2.9 stages are deprecated in favor of aliases. Serving infrastructure (SageMaker, Azure ML, custom FastAPI wrappers) loads by alias URI (models:/harbor-churn-classifier@champion) or by stage URI (models:/harbor-churn-classifier/Production).
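The URI forms mentioned here can be sketched with a small builder. model_uri and point_alias are hypothetical helper names introduced for illustration; the registry call used is MlflowClient().set_registered_model_alias.

```python
def model_uri(name, alias=None, stage=None, version=None):
    """Build a models:/ URI from exactly one of alias, stage, or version."""
    chosen = [x for x in (alias, stage, version) if x is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of alias, stage, or version")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{stage if stage is not None else version}"


def point_alias(name, alias, version):
    """Move an alias such as 'champion' to the given registered version."""
    from mlflow.tracking import MlflowClient  # lazy import

    MlflowClient().set_registered_model_alias(name, alias, version)
```

A serving process that always loads model_uri("harbor-churn-classifier", alias="champion") picks up a new version as soon as the alias is moved, with no change to the serving configuration.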

MLflow Projects and reproducibility

An MLflow Project is a directory with an MLproject file defining named entry points, parameters with defaults, and the environment (conda, virtualenv, or Docker). Run a project directly from a Git URL with:

mlflow run https://github.com/org/churn-training \
  -P learning_rate=0.01 -P max_depth=6

Projects shine when training code lives in Git and you want identical environments on laptops, CI runners, and GPU nodes. They are less popular than plain scripts plus Docker for teams already standardized on containers, but the parameter schema and automatic run logging remain useful for internal ML platforms.
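For CI scripts that launch projects programmatically, the command line above can be assembled from a parameter dictionary. This is a rough sketch: mlflow_run_argv and launch are hypothetical names, and the entry point is assumed to be called main.

```python
import subprocess


def mlflow_run_argv(uri, params, entry_point="main"):
    """Build the argument list for `mlflow run`, one -P flag per parameter."""
    argv = ["mlflow", "run", uri, "-e", entry_point]
    for key, value in params.items():
        argv += ["-P", f"{key}={value}"]
    return argv


def launch(uri, params):
    """Run the project and raise CalledProcessError if the CLI exits non-zero."""
    subprocess.run(mlflow_run_argv(uri, params), check=True)
```

The same launch is available from Python through mlflow.projects.run(uri=..., parameters=...), which may be preferable when the caller already runs inside an MLflow-enabled environment.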

Always tag runs with mlflow.set_tag("mlflow.source.git.commit", sha) or enable Git integration so the UI links to the exact code revision. Pair with dataset version tags (DVC hash, Delta table version) for full lineage.
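A sketch of that tagging step, assuming git is on the PATH and the training code runs inside a checkout. The helper names and the dataset_version and author tag keys are illustrative choices, not MLflow conventions.

```python
import subprocess


def git_commit_sha(repo_dir="."):
    """Return the current commit SHA, or None when git or the repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def lineage_tags(sha, dataset_version, author):
    """Build the tag dict; the commit tag is added only when a SHA is known."""
    tags = {"dataset_version": dataset_version, "author": author}
    if sha:
        tags["mlflow.source.git.commit"] = sha
    return tags
```

Inside an active run, mlflow.set_tags(lineage_tags(git_commit_sha(), "2026-06-02", "alice")) would attach all three tags in one call; the date and author values here are placeholders.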

Worked example: Harbor Analytics churn forecaster

Harbor Analytics predicts which free-tier users will cancel within 30 days. The team runs weekly retraining on a feature store export with scikit-learn pipelines (imputation, scaling, gradient boosting classifier).

Experiment structure

  • Experiment: harbor-churn-weekly
  • Tags per run: data_snapshot=2026-06-02, feature_set=v3
  • Parameters: max_depth, learning_rate, min_child_weight
  • Metrics: val_auc, val_precision_at_10pct, calibration_error
  • Artifacts: SHAP summary plot, calibration curve PNG, threshold_report.json

After an Optuna search logs 80 runs, the data scientist filters on val_auc > 0.89 and calibration_error < 0.03, selects run a1b2c3, registers it as harbor-churn-classifier version 47, transitions it to Staging, and assigns it the @challenger alias. The deployment service loads models:/harbor-churn-classifier@challenger for shadow traffic against version 46 in Production. If shadow precision holds for seven days, CI promotes version 47 to Production and archives 46.

This closes the loop between experimentation and MLOps governance without custom spreadsheet tracking.
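The run-filtering step in this example could be expressed with mlflow.search_runs roughly as follows. build_filter and best_candidate_run_id are hypothetical helpers, and ranking the survivors by val_auc is an assumption about how the team picks a winner.

```python
def build_filter(min_auc=0.89, max_calibration_error=0.03):
    """Return a search_runs filter string for the two thresholds in the example."""
    return (
        f"metrics.val_auc > {min_auc} "
        f"and metrics.calibration_error < {max_calibration_error}"
    )


def best_candidate_run_id(experiment_name="harbor-churn-weekly"):
    """Return the run_id of the highest-AUC run that passes the filter, or None."""
    import mlflow  # lazy import keeps build_filter usable without MLflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=build_filter(),
        order_by=["metrics.val_auc DESC"],
        max_results=1,
    )  # returns a pandas DataFrame by default
    return None if runs.empty else runs.iloc[0]["run_id"]
```

The selected run can then be registered with mlflow.register_model(f"runs:/{run_id}/model", "harbor-churn-classifier"), which returns the new model version.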

Tooling decision table

Need                                       | MLflow fit               | Alternative
Python experiment logging + model registry | Strong default           | Neptune, W&B (SaaS)
Self-hosted, no external SaaS              | Strong                   | ClearML self-hosted
Git-native dataset versioning              | Partial (tags only)      | DVC + CML
Rich media dashboards (images, 3D)         | Basic UI                 | Weights & Biases
Databricks lakehouse integration           | Native                   | Unity Catalog models
Kubernetes pipeline DAGs                   | Via Projects or external | Kubeflow, Airflow
LLM prompt/version tracking                | Emerging (LLM flavor)    | LangSmith, Phoenix

Common pitfalls

  • Logging inside tight loops — calling log_metric per mini-batch on thousands of steps floods the tracking server. Log per epoch or use throttled callbacks.
  • Giant artifacts — logging multi-GB checkpoints without log_models=False fills S3 buckets. Log only the final epoch or use external artifact stores with lifecycle rules.
  • Mutable parameters — MLflow parameters are meant to be immutable. Log final values once; use metrics for values that change during training.
  • Missing signatures — models without input schemas break serving validation. Always infer or declare signatures on promotion.
  • Stale Production aliases — promoting in the UI but forgetting to update the serving URI leaves traffic on old weights. Automate promotion in CI with health checks.
  • Local-only tracking on teams — ./mlruns on laptops is not shared knowledge. Centralize the tracking URI early.
  • Environment drift — trusting autologged conda.yaml without pinning lets Production diverge from training. Lock versions in the registry artifact.
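For the tight-loop pitfall, a throttled wrapper is one possible mitigation. ThrottledMetricLogger is a hypothetical class written for this sketch, not an MLflow API; it forwards only every Nth step to whatever logging callable it is given.

```python
class ThrottledMetricLogger:
    """Forward every Nth step to a logging callable (illustrative helper)."""

    def __init__(self, log_fn, every_n_steps=100):
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be >= 1")
        self.log_fn = log_fn
        self.every_n_steps = every_n_steps

    def log(self, key, value, step):
        """Log only when step is a multiple of every_n_steps; return True if logged."""
        if step % self.every_n_steps != 0:
            return False
        self.log_fn(key, value, step=step)
        return True
```

Passing mlflow.log_metric as log_fn works because it accepts key, value, and a step keyword. With every_n_steps=100, a 10,000-step loop sends 100 requests instead of 10,000.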

Practitioner checklist

  • Set MLFLOW_TRACKING_URI to a shared server or cloud backend before team training begins.
  • Name experiments consistently (project-task); use run names or tags for variants.
  • Enable autolog for supported frameworks; disable model logging when artifacts are too large.
  • Tag every run with git commit, dataset version, and author.
  • Log validation metrics on held-out data — never only training loss.
  • Define model signatures with infer_signature on representative inputs.
  • Register models by logical name; use Staging before Production promotion.
  • Wire CI to transition stages after automated evaluation gates pass.
  • Archive old Production versions; never delete registry history without policy review.
  • Monitor served models for drift separately — MLflow tracks training, not live traffic.

Key takeaways

  • MLflow Tracking records parameters, metrics, tags, and artifacts per run for reproducible comparison.
  • Autologging integrates with scikit-learn, PyTorch, XGBoost, and more with minimal code changes.
  • MLflow Models package estimators with flavors, signatures, and environment metadata for portable deployment.
  • Model Registry versions models through Staging and Production with lineage to source runs.
  • Projects are optional but useful for parameterized, Git-based reproducible training entry points.

Related reading