Guide
LLM inference serving explained: batching, engines and production deployment
A model that runs beautifully in a Jupyter notebook can collapse under real traffic. Inference serving is the engineering layer that turns a frozen checkpoint into a reliable API: tokenizing requests, scheduling GPU work, streaming completions, and staying within memory and latency budgets when hundreds of users chat at once. This guide walks through the request lifecycle, why batching matters more than raw FLOPs, popular serving engines like vLLM and Hugging Face TGI, parallelism strategies, the metrics that actually define user experience, and a checklist for shipping LLM endpoints that survive production load — building on KV cache and quantization fundamentals.
Serving vs training: different optimization targets
Training maximizes throughput over huge batches across many GPUs for days. Serving maximizes useful work per dollar per second under unpredictable, bursty request patterns. A 70B model trained on a cluster is useless to end users until you can answer: how many concurrent chats fit on one A100, what p95 latency looks like at peak, and what happens when a 32K-token RAG prompt lands in the same batch as a three-word query.
Three constraints dominate production inference:
- GPU memory — model weights plus KV cache for every active sequence; memory is usually the hard ceiling before compute.
- Memory bandwidth — during decode, each new token reads the full weight stack; smaller quantized models help because bytes moved per token drop.
- Scheduling — how requests share the same GPU without one long generation starving everyone else.
Teams that skip serving engineering and call a hosted API still face the same trade-offs — they just pay per token instead of per GPU hour. Understanding the stack helps you choose models, context lengths, and routing policies intelligently.
The inference request lifecycle
Every chat completion, regardless of framework, follows the same coarse pipeline:
- Tokenization — raw text becomes token IDs using the model's vocabulary (BPE or SentencePiece). Special tokens mark roles in chat templates.
- Prefill — the model processes the entire prompt in parallel (often one forward pass over all prompt tokens). This phase builds the initial KV cache and dominates time to first token (TTFT).
- Decode — autoregressive generation: one token at a time, each step appending to the KV cache. Long answers spend most wall-clock time here.
- Detokenization and streaming — IDs convert back to UTF-8; servers typically stream Server-Sent Events or chunked HTTP so clients see partial output before generation finishes.
Prefill is compute-bound matrix math across many tokens at once. Decode is bandwidth-bound — the GPU loads weights repeatedly for single-token steps. Serving systems optimize both phases differently: larger batches help prefill utilization; continuous batching and KV paging help decode when sequences finish at different times.
Batching: static, dynamic, and continuous
GPUs waste silicon on batch size 1. Static batching waits until N requests arrive, pads them to equal length, runs one forward pass, and returns all outputs together. Simple to implement — terrible for chat, because the batch cannot advance until the longest sequence finishes generating.
Continuous batching (also called iteration-level or dynamic batching) solves this. Each decode step forms a new micro-batch from whichever sequences are still active. When one user’s answer ends, their slot frees immediately; a new request can join on the next iteration. vLLM, TensorRT-LLM, and TGI all implement variants of this pattern. Throughput under mixed-length traffic can jump 5–10x versus static padding.
Trade-off: scheduling complexity and memory fragmentation. That is why PagedAttention treats KV cache like virtual memory — non-contiguous blocks allocated on demand — instead of one giant contiguous tensor per sequence. Without paging, continuous batching exhausts VRAM on fragmentation long before compute saturates.
Popular serving engines and when to use them
You rarely write a decode loop from scratch. Common choices in 2026:
- vLLM — open-source, PagedAttention, strong continuous batching, OpenAI-compatible HTTP API, broad model support. Default choice for self-hosted chat APIs on NVIDIA hardware.
- Hugging Face TGI (Text Generation Inference) — Rust core, good Hugging Face Hub integration, flash-attention support, enterprise deployments on AWS and GCP marketplaces.
- TensorRT-LLM — NVIDIA-optimized kernels, aggressive fusion, best when you need maximum perf on specific datacenter GPUs and can invest in engine build times per model revision.
- llama.cpp / Ollama — CPU and Apple Silicon friendly, GGUF quantization, ideal for local dev and edge prototypes rather than multi-tenant cloud scale.
- Managed APIs — OpenAI, Anthropic, Google, Together, Fireworks. You outsource serving entirely; optimize via model routing, prompt caching, and distilled fallbacks for cheap queries.
Engine choice interacts with quantization format: GPTQ/AWQ weights load cleanly in vLLM; GGUF suits llama.cpp. Mismatch between training export format and serving runtime is a common week-one integration trap.
Scaling beyond one GPU: parallelism modes
When one card cannot hold the model:
- Tensor parallelism (TP) — split individual layers across GPUs; high-bandwidth NVLink required; typical for 70B+ models on 2–8 GPUs in one node.
- Pipeline parallelism (PP) — different layers on different devices; micro-batches pipeline through stages; adds bubble overhead but works across nodes.
- Data parallelism — duplicate full model copies, route requests round-robin; simplest scale-out when each replica fits in VRAM.
Most production chat stacks use data parallel replicas behind a load balancer, each replica running continuous batching internally. Tensor parallelism is reserved for models that cannot fit otherwise. Cross-node pipeline parallelism is rare outside frontier labs due to latency and operational cost.
Metrics that define user experience
Vanity FLOPs do not matter. Track these instead:
- TTFT (time to first token) — perceived responsiveness; dominated by prefill length and queue wait. RAG apps with 20K-token contexts can have multi-second TTFT even on fast hardware.
- Tokens per second (TPS) — generation speed after streaming starts; users notice sustained slowness on long answers.
- Inter-token latency (ITL) — gap between streamed chunks; jitter feels worse than slightly lower average TPS.
- Queue depth and wait time — under load, queuing dominates TTFT before GPU saturation.
- GPU utilization vs memory headroom — 95% util with 0% VRAM free means the next long context will OOM-kill a replica.
Set SLOs per product surface: interactive chat might target p95 TTFT under 800ms for prompts under 4K tokens; batch summarization jobs can tolerate minutes if throughput is high. Mixing both on one pool without priority queues guarantees unhappy chat users.
Optimizations beyond batching
Speculative decoding
A small draft model proposes several tokens; the large model verifies them in parallel. Accepted runs advance multiple tokens per full forward pass, boosting decode TPS when draft and target distributions align. Works best when you already distilled a student from the same teacher.
Prefix and prompt caching
Stable system prompts and RAG document prefixes repeat across requests. Caching their KV tensors avoids recomputing prefill. OpenAI and Anthropic expose this as discounted cached-input pricing; self-hosted stacks implement radix-tree or hash-keyed prefix stores.
Structured output and tool calls
JSON schema constraints and function-calling add CPU-side validation loops. Serving latency includes retries when the model emits invalid tokens — budget for validate-repair cycles in agent workloads.
Production architecture patterns
A minimal reliable stack looks like:
- API gateway — auth, rate limits, request ID tracing.
- Router — sends easy queries to a small model, hard queries to a frontier model; can cut cost 60–80% with acceptable quality loss on routine tasks.
- Inference replicas — stateless GPU workers behind a queue or direct load balancer.
- Observability — log prompt token count, output tokens, TTFT, TPS, GPU OOM events; alert on queue depth growth.
Autoscaling GPU workloads is slow (cold starts measured in minutes if models load from network storage). Prefer queue-aware scaling with headroom replicas before traffic spikes — product launches, marketing emails, or viral agent demos. Scale-to-zero saves money but guarantees bad p99 latency on the first wave.
For regulated or privacy-sensitive workloads, hybrid routing sends PII-heavy prompts to on-prem replicas while offloading bulk summarization to cloud APIs — see edge AI patterns for the client-side half of that split.
Common failure modes
- OOM on long contexts — one 128K prompt evicts an entire replica; cap max input tokens per tier and reject early with clear errors.
- Head-of-line blocking — one huge batch job monopolizes GPUs; separate pools or priority queues for interactive vs batch.
- Stale model versions — replicas on different checkpoints after rolling deploy; use version tags in responses for debugging.
- Eval-free deploys — quantization or engine upgrades without rerunning golden sets; quality cliffs show up in support tickets, not dashboards.
- Ignoring tokenizer drift — swapping models without updating chat templates breaks tool calling and inflates token counts silently.
Production checklist
- Measure TTFT, TPS, and queue wait separately — optimize the actual bottleneck.
- Size VRAM as weights + peak concurrent KV cache + paging overhead, not weights alone.
- Use continuous batching with PagedAttention (or equivalent) for multi-user chat.
- Set per-request max input/output tokens; fail fast before OOM.
- Run product eval suites after every quantization, engine, or model version change.
- Implement model routing: small model first, escalate on confidence or user tier.
- Cache stable prompt prefixes; monitor cache hit rate and cost savings.
- Stream responses; clients should handle disconnects and partial outputs gracefully.
- Keep at least one warm replica; document cold-start time honestly in capacity plans.
- Log token counts per request for cost attribution and abuse detection.
Key takeaways
- Inference serving is scheduling and memory management, not just loading weights.
- Prefill sets TTFT; decode sets streaming speed — optimize each differently.
- Continuous batching is the biggest throughput win for mixed chat workloads.
- Choose engines (vLLM, TGI, TensorRT-LLM, managed APIs) based on hardware, quant format, and ops capacity.
- SLOs, routing, and eval discipline matter as much as raw tokens per second.
Related reading
- LLM KV cache explained — prefill vs decode memory and PagedAttention foundations
- LLM quantization and inference explained — INT4 weights and bandwidth trade-offs on serve paths
- LLM model distillation explained — smaller models for routing and speculative decode drafts
- LLM evaluation and benchmarking explained — golden sets to run after every serving change