Model serving explained

Your fraud classifier scores 0.97 AUC offline. In production, checkout latency spikes from 120 ms to 2.4 s during a flash sale: GPUs sit idle 70% of the time because each request runs as a batch of one, and the preprocessing pipeline in the API layer has diverged from training. Model serving is the engineering layer that turns a trained artifact into a reliable inference endpoint with predictable latency, high throughput, versioned rollouts, and observability. It sits at the center of MLOps, after training and before users see predictions. This guide covers batch, online, and streaming inference; the request path through preprocessing and postprocessing; popular inference runtimes (ONNX Runtime, TensorRT, Triton, vLLM); batching strategies; autoscaling and cold starts; safe canary and shadow deployments; LLM-specific concerns such as KV-cache and quantization; a worked fraud-scoring API example; a serving-pattern decision table; common pitfalls; and a production checklist.

What model serving is — and what it is not

Model serving is the runtime that accepts inputs, runs inference, and returns predictions (or embeddings, or generated tokens) within agreed latency and availability targets. It is distinct from:

Training — offline, batch-oriented, optimizes weights over epochs; tolerates hours of compute.
Batch scoring — offline inference over a fixed dataset (nightly churn predictions). No user waits; throughput matters more than p99 latency.
Feature computation — often upstream in a stream processor or feature store; serving consumes precomputed features or raw inputs.

Serving must handle concurrent requests, cold starts, model updates without downtime, and monitoring for drift. A notebook that calls model.predict() in a loop is not serving — it lacks isolation, scaling, versioning, and SLO enforcement.

Three serving patterns

Batch inference

Run the model over a scheduled or triggered dataset, for example scoring every active user for churn risk at 03:00 UTC. Results land in a warehouse or cache, and the product reads precomputed scores. Pros: high GPU utilization and no per-request latency budget. Cons: predictions can be hours stale, so the pattern is not suitable for real-time personalization or fraud blocks.

Online (real-time) inference

Each user action triggers a synchronous API call: “Is this transaction fraudulent?” The caller blocks until a prediction returns. Target latencies are typically 10–200 ms for tabular/vision models, 200 ms–2 s for small LLMs. This is the default pattern for interactive products.

Streaming inference

A stream processor (Kafka, Flink) applies the model to events as they arrive — e.g. anomaly scores on IoT sensor readings. Latency is “event time to score,” often sub-second, but not tied to a single HTTP request. Useful when many downstream consumers need the same score without hammering a central API.

The serving request path

A production inference request passes through four stages. Skipping or misaligning any stage is a common cause of silent accuracy loss:

Ingress — authentication, rate limiting, request validation, schema checks (JSON schema, protobuf).
Preprocessing — tokenization, normalization, one-hot encoding, image resize — must match training exactly. Store preprocessing code or a serialized pipeline (sklearn Pipeline, ONNX preprocessing ops) with the model artifact.
Inference — forward pass on CPU/GPU/TPU. This is what specialized model servers optimize.
Postprocessing — softmax to label, threshold calibration, bounding-box NMS, JSON formatting. Return confidence scores when downstream systems need them for monitoring.
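
The four stages can be sketched as four small functions composed into one handler. This is a minimal illustration, not a production server: the `amount` field, the logistic stand-in for the model, and the 0.5 threshold are all assumptions made up for the example.

```python
import json
import math


def ingress(raw_body: str) -> dict:
    """Stage 1: validate the request before any model code runs."""
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    amount = payload.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount < 0:
        raise ValueError("amount must be a non-negative number")
    return payload


def preprocess(payload: dict) -> list:
    """Stage 2: the same transform that was applied at training time."""
    return [math.log1p(payload["amount"])]


def infer(features: list) -> float:
    """Stage 3: stand-in for the forward pass (a fixed logistic function)."""
    z = 0.8 * features[0] - 5.0
    return 1.0 / (1.0 + math.exp(-z))


def postprocess(score: float, threshold: float = 0.5) -> str:
    """Stage 4: apply the decision rule and format the response."""
    return json.dumps({"score": round(score, 4), "flagged": score > threshold})


def handle(raw_body: str) -> str:
    """Full request path: ingress, preprocess, infer, postprocess."""
    return postprocess(infer(preprocess(ingress(raw_body))))
```

Keeping each stage a separate function makes it easier to version and test preprocessing together with the model, and to swap the inference stage for a call to a dedicated model server.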

Train-serve skew occurs when any preprocessing step differs between training and serving — different tokenizers, missing null handling, floating-point order-of-operations. Log a sample of raw inputs and preprocessed tensors in shadow mode to catch skew before it hits revenue.
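
One way to catch skew is to run logged raw inputs through both preprocessing implementations and compare the resulting features. The sketch below assumes two hypothetical functions, `train_preprocess` and `serve_preprocess`, that differ only in how they fill a missing `amount`; the feature names are invented for illustration.

```python
import math


def train_preprocess(row: dict) -> list:
    """Preprocessing as run at training time: missing amount becomes 0.0."""
    amount = row.get("amount")
    if amount is None:
        amount = 0.0
    return [math.log1p(amount), 1.0 if row.get("country") == "US" else 0.0]


def serve_preprocess(row: dict) -> list:
    """Re-implementation in the API layer: missing amount becomes 50.0 (the skew)."""
    amount = row.get("amount")
    if amount is None:
        amount = 50.0
    return [math.log1p(amount), 1.0 if row.get("country") == "US" else 0.0]


def find_skew(rows, train_fn, serve_fn, tol=1e-6):
    """Return (row_index, feature_index, train_value, serve_value) per mismatch."""
    mismatches = []
    for i, row in enumerate(rows):
        expected, actual = train_fn(row), serve_fn(row)
        if len(expected) != len(actual):
            # Different feature counts: report lengths, no feature index.
            mismatches.append((i, None, len(expected), len(actual)))
            continue
        for j, (a, b) in enumerate(zip(expected, actual)):
            if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
                mismatches.append((i, j, a, b))
    return mismatches


sample_rows = [
    {"amount": 10.0, "country": "US"},
    {"amount": None, "country": "DE"},  # only this row exposes the skew
]
skew_report = find_skew(sample_rows, train_preprocess, serve_preprocess)
```

Here only the row with a null `amount` is reported, which is typical of skew: the common case matches and the divergence hides in rarely exercised branches.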

Inference runtimes and model servers

Raw PyTorch or TensorFlow in a Flask wrapper works for demos; production teams adopt runtimes that fuse operators, manage GPU memory pools, and batch concurrent requests:

ONNX Runtime — cross-framework, CPU and GPU EPs (execution providers). Export sklearn, PyTorch, or TensorFlow to ONNX for a portable artifact. Good default for tabular and mid-size vision models.
TensorRT — NVIDIA-specific graph optimization and kernel fusion. Often 2–5× faster than eager PyTorch on the same GPU; requires NVIDIA hardware.
NVIDIA Triton Inference Server — multi-framework, multi-model, dynamic batching, model ensemble pipelines, gRPC and HTTP. Industry standard for GPU clusters orchestrated by Kubernetes.
TorchServe — PyTorch-native, custom handlers, model archiving (.mar files). Simpler ops if your stack is already PyTorch-only.
vLLM / TensorRT-LLM / TGI — LLM-specific servers with continuous batching, PagedAttention KV-cache, and tensor parallelism for 7B–70B models. Pair with speculative decoding when latency-critical.
BentoML / Ray Serve — Python-first frameworks that wrap any model behind FastAPI/gRPC with built-in batching and scaling hooks.

Choose based on framework lock-in, hardware, and team ops maturity — not benchmark leaderboard scores on synthetic workloads.

Latency, throughput, and batching

GPUs usually reach peak throughput only at larger batch sizes (often 8–64, depending on the model and hardware), but users send single requests. Dynamic batching queues incoming requests for a short window (e.g. 5 ms), groups them, runs one forward pass, and fans out results. You trade a few milliseconds of queue delay for throughput that is often 3–10× higher.
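
The queue-then-batch idea can be sketched with asyncio. This is an illustrative toy, not how Triton or any specific server implements it: `DynamicBatcher`, `predict_batch`, and the demo values are hypothetical names chosen for the example.

```python
import asyncio


class DynamicBatcher:
    """Groups concurrent requests into one batched model call."""

    def __init__(self, predict_batch, max_batch=32, max_delay_s=0.005):
        self.predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_delay_s = max_delay_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # recorded for inspection only

    async def submit(self, x):
        """Called once per request; resolves when the batched result is ready."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        """Worker loop: wait for one request, then fill the batch until the deadline."""
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]
            deadline = loop.time() + self.max_delay_s
            while len(items) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.predict_batch([x for x, _ in items])  # one forward pass
            self.batch_sizes.append(len(items))
            for (_, fut), y in zip(items, outputs):
                if not fut.done():
                    fut.set_result(y)  # fan results back out to callers


async def demo():
    # The "model" here just doubles each input.
    batcher = DynamicBatcher(lambda xs: [2 * x for x in xs], max_batch=8, max_delay_s=0.01)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(20)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

The two knobs mirror the trade-off in the text: `max_delay_s` bounds the added queue delay, and `max_batch` bounds how much work a single forward pass takes on.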

Key metrics to define in your SLO:

p50 / p95 / p99 latency — tail latency drives user experience; optimize p99, not just averages.
Throughput — requests per second or tokens per second (LLMs).
GPU utilization — sustained below 40% often means under-batching or CPU preprocessing bottlenecks.
Queue depth — growing queues signal you need more replicas before latency explodes.
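
A short calculation shows why the mean hides tail latency. The sketch below uses the nearest-rank percentile definition and a made-up sample of 100 request latencies; both are assumptions for illustration, and production systems usually compute percentiles from histograms instead.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 97 fast requests and 3 slow ones, in milliseconds.
latencies_ms = [12.0] * 97 + [40.0, 180.0, 950.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

For this sample the mean is about 23 ms and both p50 and p95 are 12 ms, while p99 is 180 ms. An SLO written against the mean would never notice the slow requests.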

Static batching (fixed batch size N) suits offline scoring. Continuous batching (vLLM) adds new sequences to an in-flight batch as others finish — essential for variable-length LLM generation.

Scaling, replicas, and cold starts

Horizontally scale stateless inference replicas behind a load balancer. Each replica loads the model into GPU memory at startup, which for large LLMs means a cold start of 30–120 s. Mitigations:

Keep a minimum replica count > 0 in production (no scale-to-zero for latency-sensitive paths).
Send warmup requests after deploy, before shifting traffic.
Cache model weights on shared volumes so replicas do not re-download multi-GB files.
Fall back to CPU for tiny models when the GPU pool is saturated (with explicit latency degradation alerts).

Autoscale on GPU utilization, request queue depth, or custom metrics (tokens/s). Scale-up should be faster than scale-down to avoid flapping during traffic spikes.
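
The asymmetry between scale-up and scale-down can be written as a small decision rule. This is a sketch under assumptions, not the algorithm of any particular autoscaler: the function name, the per-replica queue target, and the one-replica-per-evaluation scale-down step are all hypothetical.

```python
import math


def desired_replicas(current, queue_depth, target_queue_per_replica,
                     min_replicas=1, max_replicas=16, max_scale_down_step=1):
    """Scale up straight to the needed count; scale down by a bounded step."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    needed = math.ceil(queue_depth / target_queue_per_replica)
    needed = max(min_replicas, min(max_replicas, needed))
    if needed >= current:
        return needed  # fast scale-up
    # Slow scale-down: shed at most max_scale_down_step replicas per evaluation.
    return max(needed, current - max_scale_down_step)
```

With a target of 10 queued requests per replica, a queue of 90 moves 2 replicas to 9 in one step, while a queue of 5 moves 9 replicas down to only 8. The minimum of one replica also reflects the no-scale-to-zero advice above.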

Versioning, canary, and shadow deployments

Serving multiple model versions simultaneously is standard practice:

Blue/green — switch 100% traffic from v1 to v2 atomically after health checks pass.
Canary — route 5% of traffic to v2; compare latency, error rate, and business metrics before full rollout.
Shadow — v2 receives a copy of live traffic but responses are discarded; compare v1 vs v2 predictions offline without user risk.
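
Canary routing is often done by hashing a stable key, so the same user or request always lands on the same version. The sketch below assumes a hypothetical `route_version` helper and version labels `v1` and `v2`; real deployments usually configure this split in the load balancer or service mesh.

```python
import hashlib


def route_version(request_key: str, canary_percent: float,
                  stable: str = "v1", canary: str = "v2") -> str:
    """Deterministically send about canary_percent of keys to the canary version."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return canary if bucket < canary_percent * 100 else stable


# Share of 10,000 hypothetical user keys routed to the canary at 5%.
canary_share = sum(
    route_version(f"user-{i}", 5) == "v2" for i in range(10_000)
) / 10_000
```

Hashing rather than random sampling keeps a given key on one version for the whole rollout, which makes per-user business metrics comparable between v1 and v2.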

Pin model version, preprocessing hash, and runtime version in every prediction log. When accuracy regresses, you need to know exactly which artifact served each request.
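
A prediction log record with those pins might look like the following. The field names and the `prediction_log` helper are assumptions for illustration; the point is that every record carries enough identifiers to reconstruct which artifact served it.

```python
import hashlib
import json
import time
import uuid


def prediction_log(features: dict, score: float, latency_ms: float,
                   model_version: str, preprocessing_hash: str,
                   runtime_version: str, request_id=None) -> str:
    """Build one JSON log line for a single prediction."""
    # Hash a canonical form of the input so key order does not change the hash.
    canonical = json.dumps(features, sort_keys=True).encode("utf-8")
    record = {
        "request_id": request_id or str(uuid.uuid4()),
        "ts": time.time(),
        "model_version": model_version,
        "preprocessing_hash": preprocessing_hash,
        "runtime_version": runtime_version,
        "input_hash": hashlib.sha256(canonical).hexdigest(),
        "score": score,
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)
```

Logging an input hash instead of raw features is one option when inputs are sensitive; teams that need replay would store the raw inputs separately under the same request ID.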

LLM serving — what changes

Large language models add constraints classical serving guides underplay:

Autoregressive decoding — each token depends on all prior tokens; latency grows with output length.
KV-cache — store key/value tensors per sequence to avoid recomputing attention over the prefix. Memory, not FLOPs, often caps concurrency.
Quantization — INT8/FP8/INT4 weights cut weight memory roughly 2–4× relative to FP16, usually with modest quality loss; this is often what lets a 70B model fit on a single GPU.
Streaming responses — return tokens via SSE/WebSocket so time-to-first-token feels fast even when total generation takes seconds.
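
The memory claims above follow from simple arithmetic. The sketch below assumes standard multi-head attention with an FP16 cache and a hypothetical 7B-class shape (32 layers, 32 KV heads, head dimension 128); models using grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV-cache one token occupies: a K and a V tensor in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def weight_bytes(num_params, bits_per_weight):
    """Memory for the weights alone at a given precision."""
    return num_params * bits_per_weight / 8


def max_concurrent_sequences(cache_budget_bytes, context_len, bytes_per_token):
    """How many full-length sequences fit in the memory left for the cache."""
    return cache_budget_bytes // (context_len * bytes_per_token)


# Hypothetical 7B-class shape, FP16 cache: 0.5 MiB per token.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# A 4096-token sequence then needs 2 GiB of cache, so 40 GiB holds 20 of them.
concurrency = max_concurrent_sequences(40 * 2**30, 4096, per_token)

# Weights of a 70B-parameter model at different precisions, in GB.
fp16_gb = weight_bytes(70_000_000_000, 16) / 1e9   # 140.0
int8_gb = weight_bytes(70_000_000_000, 8) / 1e9    # 70.0
int4_gb = weight_bytes(70_000_000_000, 4) / 1e9    # 35.0
```

Under these assumptions concurrency is capped at 20 full-length sequences by cache memory alone, regardless of available compute, which is why paged cache allocation and quantization matter for LLM throughput.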

For retrieval-augmented generation, serving splits across an embedding service, a vector database, and the LLM — each with its own SLO and scaling policy.

Worked example: real-time fraud scoring API

A fintech team deploys an XGBoost fraud model scoring card transactions at checkout:

SLO — p99 < 50 ms, 99.9% availability, max 0.1% error rate.
Artifact — ONNX export of XGBoost + sklearn preprocessing pipeline (log transforms, one-hot for merchant category).
Runtime — Triton on 2 GPU replicas (g4dn.xlarge) with dynamic batching (max batch 32, queue delay 3 ms).
API — gRPC from payment service; REST gateway for admin tools only.
Threshold — score > 0.82 blocks transaction; 0.60–0.82 routes to step-up auth (postprocessing rule, not model output).
Monitoring — log score distribution hourly; PSI alert if merchant-category mix shifts; shadow v2 model at 10% traffic for two weeks before canary.
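
The threshold rule lives in postprocessing, so it can be a plain function that is versioned and tested separately from the model. In this sketch the action names are hypothetical, and two details are assumptions the example does not state: scores below 0.60 are approved, and both ends of the 0.60–0.82 band route to step-up auth.

```python
BLOCK_THRESHOLD = 0.82
STEP_UP_THRESHOLD = 0.60


def decide(score: float) -> str:
    """Map a fraud score in [0, 1] to a checkout action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score!r}")
    if score > BLOCK_THRESHOLD:
        return "block"
    if score >= STEP_UP_THRESHOLD:
        return "step_up_auth"
    return "approve"  # assumption: anything below the step-up band passes
```

Keeping thresholds out of the model artifact means they can be recalibrated without retraining, as long as each change is logged alongside the model version.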

Peak traffic: 800 RPS. Compared with per-request inference, dynamic batching lifts GPU utilization from 25% to 78% and cuts p99 latency from 110 ms to 38 ms.

Serving pattern decision table

Scenario	Recommended pattern	Typical runtime
Nightly user churn scores	Batch inference on Spark/warehouse	ONNX Runtime CPU cluster
Checkout fraud block (<50 ms)	Online sync API + dynamic batching	Triton + ONNX or native XGBoost
Product search embeddings	Online API + result cache (Redis)	ONNX Runtime or dedicated embedding server
Chat assistant (7B LLM)	Online streaming + continuous batching	vLLM or TensorRT-LLM
IoT anomaly on sensor stream	Streaming inference in Flink/Kafka	Lightweight ONNX on stream task
Computer vision on edge devices	On-device inference	TensorRT / CoreML / TFLite

Common pitfalls

Train-serve skew — different preprocessing silently erodes accuracy; version and test preprocessing with the model.
No batching on GPU — serving one row at a time can leave most of the GPU idle; enable dynamic batching or move small models to CPU.
Ignoring tail latency — optimizing mean latency while p99 is 10× higher frustrates users during load spikes.
Scale-to-zero on GPUs — cold starts during traffic ramps cause timeout cascades; keep warm replicas.
Unlogged predictions — without score logs you cannot debug drift, audit decisions, or replay incidents.
Big-bang model swaps — deploying v2 to 100% without shadow or canary risks undetected regressions on edge cases.
Preprocessing on the API thread — CPU-bound tokenization blocks the event loop; offload to thread pools or compile into the graph.
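
For the last pitfall, one option in an asyncio service is to push preprocessing onto a thread pool so the event loop stays free to accept requests. The sketch below uses a hypothetical `tokenize` function whose `time.sleep` stands in for native-code work that releases the GIL; pure-Python CPU work would still contend for the GIL and may need a process pool instead.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def tokenize(text: str) -> list:
    """Stand-in for heavy preprocessing; the sleep simulates native-code work."""
    time.sleep(0.05)
    return text.lower().split()


async def handle(text: str, pool: ThreadPoolExecutor) -> int:
    """Request handler: preprocessing runs off the event loop thread."""
    loop = asyncio.get_running_loop()
    tokens = await loop.run_in_executor(pool, tokenize, text)
    return len(tokens)


async def demo():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Four concurrent requests share the pool instead of queuing on the loop.
        return await asyncio.gather(
            *(handle("Card Present Purchase", pool) for _ in range(4))
        )


token_counts = asyncio.run(demo())
```

Calling `tokenize` directly inside `handle` would block every other coroutine for the duration of each call, which shows up as tail latency under load.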

Production checklist

Define SLOs: p99 latency, availability, max error rate, throughput target.
Serialize preprocessing with the model artifact; test train-serve parity.
Choose runtime based on framework, hardware, and ops maturity — not hype.
Enable dynamic or continuous batching for GPU workloads.
Log every prediction: model version, input hash, score, latency, request ID.
Implement canary or shadow rollout for every model version change.
Monitor latency percentiles, error rate, GPU utilization, and queue depth.
Alert on prediction distribution shift (PSI) and error-rate spikes.
Run warmup traffic after deploy before promoting to production traffic.
Document rollback procedure — previous model version hot-swappable in <5 min.
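
The PSI alert in the checklist compares a baseline distribution with a recent one, bucket by bucket. Below is a minimal sketch of the usual formula, the sum over buckets of (actual − expected) × ln(actual / expected) on bucket shares. The bucket counts are invented, and the `eps` floor for empty buckets is one common workaround, not a standard.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two histograms with identical buckets."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must have the same number of buckets")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total <= 0 or actual_total <= 0:
        raise ValueError("histograms must not be empty")
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / expected_total, eps)  # floor avoids log(0)
        a_share = max(a / actual_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Hypothetical merchant-category counts: training baseline vs the last hour.
baseline = [500, 300, 200]
stable_hour = [52, 29, 19]
shifted_hour = [20, 20, 60]

psi_stable = psi(baseline, stable_hour)
psi_shifted = psi(baseline, shifted_hour)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but alert thresholds are best tuned against the service's own history.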

Key takeaways

Model serving is the production runtime for inference — distinct from training and from offline batch scoring.
The request path is ingress → preprocess → infer → postprocess; train-serve skew is the silent killer.
Dynamic batching trades milliseconds of queue time for multi-fold GPU throughput gains.
Use canary and shadow deployments to validate new models before full traffic cutover.
LLM serving adds KV-cache, streaming, and quantization as first-class concerns — not optional optimizations.