MLOps explained

A data scientist ships a model with 0.94 AUC in a Jupyter notebook. Six weeks later, revenue from the feature it powers has dropped 18% — but nobody notices until finance asks why. The training script was never versioned, features in production diverged from training, and nobody was watching for drift. MLOps (machine learning operations) is the engineering discipline that closes that gap: treating ML systems like software products with reproducible pipelines, tested deployments, continuous monitoring, and governed retraining — not one-off experiments. This guide covers the ML lifecycle, experiment tracking, feature stores, serving patterns, CI/CD for models, monitoring, governance, and how MLOps connects to ML fundamentals, feature engineering, and validation discipline.

MLOps vs DevOps — same goals, harder artifacts

DevOps automates building, testing, and deploying code. MLOps adds three complications that plain CI/CD does not solve alone:

  • Data is a dependency — training sets change daily; a green build today can train on corrupted labels tomorrow.
  • Models decay — unlike a REST handler, a classifier’s accuracy erodes as the world shifts; you must monitor and retrain on a schedule or trigger.
  • Reproducibility is fragile — random seeds, library versions, hardware (GPU vs CPU), and non-deterministic ops can make “the same” training run produce different weights.

MLOps is not a single tool — it is a set of practices and platform choices (often MLflow, Kubeflow, Weights & Biases, SageMaker, Vertex AI, or bespoke orchestration) that wrap the full loop: ingest data, train, evaluate, register, deploy, observe, and retrain.

The ML lifecycle in production

Map every production ML system to these stages. Gaps between stages are where incidents hide:

  1. Data ingestion and validation — schema checks, null rates, distribution baselines, PII handling, and lineage (which raw table produced which feature set).
  2. Feature engineering — transforms applied consistently in training and serving; ideally centralized in a feature store so online and offline paths share definitions.
  3. Training and evaluation — versioned code, data snapshots, hyperparameters, and metrics logged automatically; holdout sets that reflect production traffic, not random shuffles that leak future information.
  4. Model registry — approved artifacts with metadata (metrics, data hash, author, approval status). Only registered models reach production.
  5. Deployment — batch scoring jobs, real-time APIs, or embedded edge models; often with canary or shadow traffic before full cutover.
  6. Monitoring — latency, throughput, data drift, prediction drift, and business KPIs tied to model output.
  7. Retraining — scheduled or event-driven pipelines that promote a new model only if it beats the incumbent on agreed gates.

Teams that stop at stage 5 (“we deployed it”) discover stage 6 the hard way when silent degradation eats margin.

Experiment tracking and reproducibility

Without tracking, “the model in prod” becomes folklore. Every training run should record:

  • Git commit hash of training code
  • Dataset version or snapshot ID (not “latest CSV on Bob’s laptop”)
  • Hyperparameters and random seeds
  • Environment lockfile (conda, pip, Docker image digest)
  • Evaluation metrics on multiple slices (overall, per segment, per time window)
  • Artifacts: weights, ONNX export, preprocessing pickles

Tools like MLflow, W&B, or Neptune centralize this. The goal is that any engineer can reproduce run exp-2847 and explain why it beat exp-2812 — critical for audits, debugging regressions, and hyperparameter decisions that actually stick.
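As a rough illustration of the fields listed above, the sketch below builds a run record using only the Python standard library. It is not the MLflow, W&B, or Neptune API; `RunRecord`, `sha256_bytes`, and every value in the example are hypothetical, and the commit hash is a placeholder that the caller would supply (for instance from `git rev-parse HEAD`).

```python
import hashlib
import json
import platform
import sys
from dataclasses import asdict, dataclass, field


def sha256_bytes(data: bytes) -> str:
    """Content hash that can serve as a dataset snapshot ID."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical minimal record of one training run."""
    run_id: str
    git_commit: str       # supplied by the caller, not looked up here
    dataset_sha256: str   # hash of the exact training snapshot
    hyperparameters: dict
    seed: int
    metrics: dict         # keyed by slice, e.g. {"overall": {...}}
    python_version: str = field(default_factory=lambda: sys.version.split()[0])
    os_platform: str = field(default_factory=platform.platform)

    def to_json(self) -> str:
        # sort_keys keeps the output stable so two records can be diffed
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Illustrative values only; the dataset bytes stand in for a real snapshot file.
record = RunRecord(
    run_id="exp-2847",
    git_commit="<commit hash>",
    dataset_sha256=sha256_bytes(b"user_id,label\n1,0\n2,1\n"),
    hyperparameters={"learning_rate": 0.05, "max_depth": 6},
    seed=42,
    metrics={"overall": {"auc": 0.94}},
)
print(record.to_json())
```

Hashing the snapshot contents rather than recording a file path means a silently edited dataset shows up as a different ID. A lockfile or Docker image digest would be added as further fields in the same way.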

Train-serve skew

The most common MLOps failure: training uses pandas aggregations in a notebook; serving reimplements them in Java with subtle differences (timezone handling, missing-value defaults, string normalization). Skew shows up as great offline metrics and poor live performance. Fix with shared feature definitions, contract tests that compare training-batch vs online-batch outputs on identical rows, and feature stores that compute once and serve many consumers.
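A contract test of this kind can be sketched in a few lines. The example assumes two hypothetical transforms, `training_transform` and `serving_transform`, that are meant to compute the same feature but disagree on the default for missing values; `find_skew` is likewise an illustrative helper, not part of any library.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Features = Dict[str, float]


def training_transform(row: Row) -> Features:
    """Hypothetical offline logic: a missing amount defaults to 0.0."""
    amount = row["amount"] if row["amount"] is not None else 0.0
    return {"log_amount": math.log1p(amount)}


def serving_transform(row: Row) -> Features:
    """Hypothetical online reimplementation: a missing amount yields a -1.0 sentinel."""
    if row["amount"] is None:
        return {"log_amount": -1.0}
    return {"log_amount": math.log1p(row["amount"])}


def find_skew(
    rows: List[Row],
    offline: Callable[[Row], Features],
    online: Callable[[Row], Features],
    tol: float = 1e-9,
) -> List[Tuple[int, str, Optional[float], Optional[float]]]:
    """Run both paths on identical rows and report every differing feature."""
    mismatches = []
    for i, row in enumerate(rows):
        off, on = offline(row), online(row)
        for name in sorted(set(off) | set(on)):
            a, b = off.get(name), on.get(name)
            # A feature missing on one side counts as skew too.
            if a is None or b is None or not math.isclose(a, b, abs_tol=tol):
                mismatches.append((i, name, a, b))
    return mismatches


fixture = [{"amount": 10.0}, {"amount": None}, {"amount": 0.0}]
mismatches = find_skew(fixture, training_transform, serving_transform)
for row_index, name, offline_value, online_value in mismatches:
    print(f"row {row_index}: {name} offline={offline_value} online={online_value}")
```

Only the row with a missing amount is flagged, which is the kind of edge case that rarely appears in a handful of manual spot checks. In practice the fixture would include rows that exercise timezones, nulls, and string normalization.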

Feature stores — one definition, two speeds

A feature store (Feast, Tecton, Hopsworks, or cloud-native equivalents) separates feature logic from model code:

  • Offline store — historical tables for training and backtesting at scale (warehouse or lake).
  • Online store — low-latency key-value lookups for real-time inference (Redis, DynamoDB, Bigtable).

Point-in-time correct joins prevent leakage: when training on events from March, you only use features that would have been known at each event timestamp — not aggregates that peek into April. Feature stores encode those temporal rules so data scientists do not re-derive them per project.
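The core of a point-in-time join is an "as of" lookup: for each event, take the latest feature value whose timestamp is not after the event. The sketch below shows that rule with the standard library only. It is not how Feast, Tecton, or Hopsworks implement it; `value_as_of`, the integer timestamps, and the sample history are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Per-entity feature history: (timestamp, value) pairs sorted by timestamp.
FeatureHistory = Dict[str, List[Tuple[int, float]]]


def value_as_of(history: FeatureHistory, entity_id: str, event_ts: int) -> Optional[float]:
    """Return the latest value with timestamp <= event_ts, or None if none exists."""
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    # bisect_right places event_ts after any equal timestamps, so idx - 1 is
    # the last row that was already known at the moment of the event.
    idx = bisect_right(timestamps, event_ts)
    return rows[idx - 1][1] if idx > 0 else None


history: FeatureHistory = {"u1": [(100, 1.0), (200, 3.0), (300, 7.0)]}
events = [("u1", 150), ("u1", 200), ("u1", 50), ("u2", 150)]

training_rows = [(entity, ts, value_as_of(history, entity, ts)) for entity, ts in events]
for row in training_rows:
    print(row)
```

The event at timestamp 150 sees the value written at 100, never the later values at 200 or 300, and events with no prior history get `None` instead of a value borrowed from the future. Whether a feature written at exactly the event timestamp should count as known is a policy choice; this sketch treats it as known.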

Not every team needs a feature store on day one. Adopt one when you have three or more models sharing features, or when train-serve skew incidents keep recurring. Until then, strict shared libraries and integration tests may suffice.

Model serving patterns

How you serve depends on latency, throughput, and freshness requirements:

  • Batch inference — score millions of rows hourly or nightly (churn lists, credit limits). Simple to operate; stale until the next run.
  • Real-time API — REST or gRPC endpoint behind a load balancer; target p99 latency budgets (e.g. under 100 ms for fraud). GPU autoscaling, model quantization, and caching hot features matter here.
  • Streaming — Kafka/Flink pipelines score events as they arrive; useful for personalization and anomaly alerts.
  • Embedded / edge — ONNX or TFLite on device; pairs with federated or periodic cloud retraining.

Safe rollouts

Never shift 100% of traffic to a new model at once. These patterns are borrowed from progressive software delivery:

  • Shadow mode — new model receives live traffic but predictions are logged, not acted on; compare against production model offline.
  • Canary — route 1–5% of traffic to the challenger; promote if business and error metrics hold.
  • A/B test — when the decision is product-facing (ranking, pricing), run controlled experiments with statistical power, not gut feel.

Rollback must be one command: pin the registry to the previous artifact version and flip traffic — no emergency retraining during an outage.
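Canary routing and one-step rollback can be sketched as follows. The example assumes requests carry a stable key such as a user ID, so the same caller always lands on the same model; `route` and the in-memory `Registry` class are hypothetical stand-ins for a real traffic router and model registry.

```python
import hashlib
from typing import List


def route(request_key: str, canary_percent: float) -> str:
    """Deterministically assign a request key to 'champion' or 'challenger'."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # canary_percent=5 sends buckets 0..499 (5% of keys) to the challenger.
    return "challenger" if bucket < canary_percent * 100 else "champion"


class Registry:
    """Toy in-memory registry; a real one would persist versions and approvals."""

    def __init__(self, initial_version: str) -> None:
        self._history: List[str] = [initial_version]

    @property
    def production(self) -> str:
        return self._history[-1]

    def promote(self, version: str) -> None:
        self._history.append(version)

    def rollback(self) -> str:
        """Pin production back to the previous version in a single call."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self.production


registry = Registry("v3.2.0")
registry.promote("v3.2.1")
share = sum(route(f"user-{i}", 5) == "challenger" for i in range(10_000)) / 10_000
print(f"challenger share: {share:.3f}")
print("after rollback:", registry.rollback())
```

Hashing the key instead of drawing a random number keeps each user on one model for the whole canary, which makes per-user business metrics comparable between the two arms.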

CI/CD for machine learning

ML pipelines need tests at three layers:

  • Data tests — Great Expectations, dbt tests, or custom validators: row counts, uniqueness, referential integrity, distribution bounds. Fail the pipeline before bad data trains a model.
  • Model tests — minimum accuracy/AUC on a golden set; fairness thresholds across protected groups; maximum inference latency on a reference machine; no regression beyond X% vs current production champion.
  • Integration tests — end-to-end scoring on fixture inputs; contract tests between feature service and model container.
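To make the data-test layer concrete, here is a minimal validator written with the standard library rather than Great Expectations or dbt. The function name `validate_batch`, the column names, and the thresholds are all assumptions chosen for the example.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[object]]


def validate_batch(
    rows: Sequence[Row],
    key: str = "user_id",
    required: Sequence[str] = ("amount",),
    max_null_rate: float = 0.05,
    min_rows: int = 1,
) -> List[str]:
    """Return a list of human-readable failures; an empty list means the batch passes."""
    errors: List[str] = []
    if len(rows) < min_rows:
        errors.append(f"expected at least {min_rows} rows, got {len(rows)}")
    keys = [row.get(key) for row in rows]
    if len(set(keys)) != len(keys):
        errors.append(f"duplicate values in key column {key!r}")
    for column in required:
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows) if rows else 1.0
        if rate > max_null_rate:
            errors.append(f"null rate {rate:.2f} in {column!r} exceeds {max_null_rate:.2f}")
    return errors


good = [{"user_id": 1, "amount": 10.0}, {"user_id": 2, "amount": 5.0}]
bad = [{"user_id": 1, "amount": None}, {"user_id": 1, "amount": 3.0}]

print(validate_batch(good))
for problem in validate_batch(bad):
    print("FAIL:", problem)
```

A pipeline step would stop with a non-zero exit code whenever the returned list is non-empty, so training never starts on a batch that failed its checks.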

Continuous training (CT) retrains on a schedule or when new labeled data arrives; continuous delivery (CD) promotes the result only if the gates pass. Treat promoted models like releases: changelog, owner, rollback plan. For LLM-heavy stacks, add eval suites (human rubrics, LLM-as-judge with skepticism, regression sets of golden prompts) before swapping production endpoints.
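A promotion gate of this kind can be a small pure function that CI calls after evaluation. The sketch below is one possible shape, not a standard API: the `Gates` thresholds, the metric names, and the example numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gates:
    """Hypothetical promotion thresholds agreed on by the team."""
    min_auc: float = 0.80
    max_auc_regression: float = 0.01   # allowed drop versus the champion
    max_p99_latency_ms: float = 100.0


def evaluate_promotion(
    challenger: Dict[str, float],
    champion: Dict[str, float],
    gates: Gates = Gates(),
) -> Tuple[bool, List[str]]:
    """Return (promote?, reasons for refusal) for a challenger model."""
    failures: List[str] = []
    if challenger["auc"] < gates.min_auc:
        failures.append("AUC below absolute minimum")
    if champion["auc"] - challenger["auc"] > gates.max_auc_regression:
        failures.append("AUC regressed versus champion")
    if challenger["p99_latency_ms"] > gates.max_p99_latency_ms:
        failures.append("p99 latency above budget")
    return (not failures, failures)


champion = {"auc": 0.90, "p99_latency_ms": 80.0}
better = {"auc": 0.91, "p99_latency_ms": 85.0}
worse = {"auc": 0.85, "p99_latency_ms": 140.0}

print(evaluate_promotion(better, champion))
print(evaluate_promotion(worse, champion))
```

Returning the list of reasons alongside the decision gives the changelog and the on-call engineer something concrete to read when a retrained model is refused.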

Monitoring — what to watch after launch

Infrastructure metrics (CPU, GPU memory) are necessary but insufficient. ML-specific signals include:

  • Data drift — input feature distributions shift vs training baseline (PSI, KL divergence, per-feature alerts).
  • Prediction drift — score distribution changes even if inputs look stable (often a precursor to concept drift).
  • Performance decay — when labels arrive with delay (fraud confirmed days later), track rolling precision/recall; proxy metrics when labels are sparse.
  • Operational SLOs — inference latency, error rate, queue depth for batch jobs; tie to SLOs and error budgets if ML is on the critical path.
  • Business KPIs — click-through, approval rate, chargebacks, support tickets — the metrics executives actually care about.

Alert on actionable thresholds with runbooks: “PSI > 0.25 on income_band for 24h → page on-call, do not auto-retrain without human review.” Blind auto-retraining on drift can amplify bad data, for example label leakage introduced by a broken upstream ETL job.
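For reference, the population stability index compares the share of each bin in the baseline (e_i) with its share in current traffic (a_i) and sums (a_i - e_i) * ln(a_i / e_i) over all bins. The sketch below computes it for a categorical feature with the standard library. The function name `psi`, the epsilon floor for empty bins, and the sample data are assumptions; numeric features would first need to be binned.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def psi(baseline: Iterable[Hashable], current: Iterable[Hashable], eps: float = 1e-6) -> float:
    """Population stability index between two samples of a categorical feature."""
    base_counts, cur_counts = Counter(baseline), Counter(current)
    base_n, cur_n = sum(base_counts.values()), sum(cur_counts.values())
    if base_n == 0 or cur_n == 0:
        raise ValueError("both samples must be non-empty")
    total = 0.0
    for category in set(base_counts) | set(cur_counts):
        # Floor the proportions at eps so a category absent from one sample
        # does not produce log(0) or a division by zero.
        expected = max(base_counts[category] / base_n, eps)
        actual = max(cur_counts[category] / cur_n, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


# Made-up income_band samples: training baseline versus a shifted live window.
training = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
live = ["low"] * 20 + ["mid"] * 30 + ["high"] * 50

score = psi(training, live)
print(f"PSI = {score:.3f}, alert = {score > 0.25}")
```

With these samples the index is about 0.55, well above the 0.25 threshold in the runbook example, while comparing a sample with itself gives exactly zero.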

Governance, lineage, and model cards

Regulated industries and responsible-AI programs require traceability:

  • Lineage — which data snapshot + code version produced model v3.2.1; which downstream dashboards consume its scores.
  • Model cards — intended use, training data demographics, known limitations, bias evaluation results, and contact for questions.
  • Access control — who can promote to production; separation between experimenters and approvers.
  • PII and retention — features logged for debugging may contain sensitive fields; TTL and redaction policies apply to prediction logs too.

Governance is not bureaucracy — it is how you answer “why did the model deny this loan?” without a three-week archaeology project.

Common anti-patterns

  • Notebook in production — manual export of pickles with no registry or tests.
  • Metrics theater — optimizing AUC on a shuffled validation set while production is a sliding time window with covariate shift.
  • Monitoring only uptime — API returns 200 while predictions are nonsense.
  • Retrain on everything — no champion/challenger comparison; new model silently worse.
  • Feature duplication — three teams define “days_since_last_purchase” three different ways.
  • Ignoring feedback delay — fraud labels arrive in 30 days; you declare victory after 48 hours of live traffic.

Production checklist

  • Version training code, data snapshots, and environment for every run.
  • Centralize experiments; register only approved models for deployment.
  • Share feature definitions between training and serving; test for skew.
  • Choose serving pattern (batch, real-time, streaming) with explicit latency SLOs.
  • Roll out via shadow or canary; document one-step rollback.
  • Add data, model, and integration tests to CI/CD pipelines.
  • Monitor drift, prediction distributions, delayed performance, and business KPIs.
  • Define retraining triggers and human approval gates.
  • Publish model cards and maintain lineage for audit requests.

Key takeaways

  • MLOps operationalizes the full ML lifecycle — not just training accuracy.
  • Reproducibility and train-serve parity prevent silent production failures.
  • Feature stores and registries scale shared infrastructure across models.
  • Safe rollouts (shadow, canary) and drift monitoring catch decay before revenue does.
  • Governance (lineage, model cards) turns ML from a black box into an accountable system.