Guide
LLM continuous batching explained
Harbor Analytics moved its internal chat API from a naive Hugging Face
generate() loop to vLLM.
Peak concurrent users barely changed — about forty analysts during month-end
close — but p95 time-to-first-token dropped from 4.2 s to 680 ms and
tokens-per-second per GPU rose 3.1×. The model weights were identical. The win
was continuous batching: a scheduler that adds and removes requests
from the GPU batch on every forward pass instead of waiting for an entire static
batch to finish decoding.
Continuous batching (also called iteration-level batching, and sometimes loosely called dynamic batching, a term this guide reserves for time-window batching) is the serving technique that lets one GPU run dozens of chat sessions in parallel without padding every sequence to the length of the longest one. It pairs with efficient KV cache management, especially PagedAttention in vLLM, and is the default in production engines like vLLM, TensorRT-LLM, and Hugging Face TGI. This guide covers static vs continuous batching, prefill and decode scheduling, memory and fairness trade-offs, configuration knobs, the Harbor Analytics gateway refactor, a technique decision table, pitfalls, and a production checklist.
What continuous batching decides
Autoregressive LLM inference is a loop: each step appends one token and runs another
forward pass. In a naive server, one request monopolizes the GPU until it hits its
max_tokens limit or emits EOS. A static batch groups N
requests and runs them together, but once a short sequence finishes, the GPU
keeps computing padded steps for its slot until the longest sequence
completes, wasting FLOPs and memory bandwidth.
Continuous batching fixes that at the scheduler layer:
- Join — new requests enter the batch as soon as a slot frees (another sequence hit EOS or was cancelled).
- Leave — finished sequences exit immediately; their KV cache blocks are recycled.
- Mix phases — some sequences may be in long prefill (processing the prompt) while others are in decode (one token at a time), depending on engine design.
The result is higher GPU utilization and more tokens per second at a given hardware budget. The trade-off is scheduler complexity, potential head-of-line blocking when one huge prefill arrives, and harder reasoning about per-request latency unless you configure limits and priorities.
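The utilization gain can be illustrated with a toy simulation. This is a minimal sketch, not any engine's scheduler: it ignores prefill cost and memory, treats each request as a number of decode steps, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Forward passes needed when each batch runs until its longest member ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batch_steps(lengths, batch_size):
    """Forward passes needed when membership is recomputed after every pass."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Join: fill free slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass produces one token per active sequence.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Leave: finished sequences free their slots immediately.
        running = [remaining for remaining in running if remaining > 0]
    return steps


# Hypothetical workload: short and long completions interleaved.
lengths = [10, 100, 10, 100, 10, 100, 10, 100]
static_steps = static_batch_steps(lengths, batch_size=4)
continuous_steps = continuous_batch_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With this workload the static scheduler needs 200 forward passes and the continuous one 130, because short sequences stop holding slots as soon as they finish.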
Static, dynamic, and iteration-level batching
Static batching
Collect B requests, pad to the longest sequence, run until all finish. Simple for offline jobs (nightly summarization of fixed corpora) but poor for interactive chat where sequence lengths vary 10× within a batch.
Dynamic batching (time-window)
Wait up to T milliseconds to accumulate requests, then form a batch. Better than pure static batching for APIs, but the batch is still held until its slowest member finishes decoding. Common in early Triton and custom gRPC servers.
Iteration-level (continuous) batching
After each forward pass, the scheduler recomputes membership. Finished
sequences leave; waiting sequences enter if KV blocks are available. vLLM’s
scheduler also enforces max_num_seqs (concurrent sequences) and
max_num_batched_tokens (total tokens in one iteration) to cap memory
and kernel launch size. This is the pattern behind most 2025–2026
self-hosted chat APIs.
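The two caps can be pictured as budgets checked before each forward pass. The sketch below is a simplified illustration and not vLLM's scheduler: the `Request` type and `admit_for_iteration` function are hypothetical, whole prompts are admitted at once, and KV block availability is ignored.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


def admit_for_iteration(num_running, waiting, max_num_seqs, max_num_batched_tokens):
    """Pick waiting requests that fit into the next forward pass (FCFS)."""
    # Every running sequence decodes one token this step.
    token_budget = max_num_batched_tokens - num_running
    free_slots = max_num_seqs - num_running
    admitted = []
    for request in waiting:
        if free_slots <= 0 or request.prompt_tokens > token_budget:
            break  # FCFS: stop at the first request that does not fit
        admitted.append(request)
        token_budget -= request.prompt_tokens
        free_slots -= 1
    return admitted


waiting = [Request("a", 1000), Request("b", 4000), Request("c", 500)]
admitted = admit_for_iteration(
    num_running=3, waiting=waiting, max_num_seqs=8, max_num_batched_tokens=4096
)
print([request.request_id for request in admitted])
```

Here only request "a" is admitted: "b" would exceed the remaining token budget, and strict FCFS keeps "c" from jumping ahead of it.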
Prefill vs decode and why scheduling splits them
Prefill processes the entire prompt in one or a few forward passes. It is compute-bound, with high parallelism across prompt tokens. Decode generates one token per step per sequence and is memory-bandwidth-bound, because each step reads the full weights and a growing KV cache.
Mixed batches (some prefilling, some decoding) improve utilization but complicate kernels. Many engines use:
- Chunked prefill — split long prompts across iterations so decode-heavy slots are not starved.
- Prefill prioritization — bound time-to-first-token (TTFT) for new chats by reserving batch token budget for prefill.
- Disaggregated prefill/decode — separate GPU pools when prompt lengths dominate; see prefill/decode disaggregation for when to split clusters.
Continuous batching without chunked prefill can let one 32K-token RAG prompt block dozens of short decode-only chats. Production configs almost always set explicit prefill token caps per iteration.
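To show how a per-iteration cap spreads a long prompt out, here is a minimal sketch of chunked prefill. It assumes the number of decoding sequences stays constant while the prompt is processed, which real traffic does not guarantee; `chunked_prefill_plan` is a hypothetical helper, not an engine API.

```python
def chunked_prefill_plan(prompt_tokens, num_decoding, max_num_batched_tokens):
    """Return the prefill chunk size used in each iteration for one long prompt.

    One token of the per-iteration budget is reserved for every sequence
    that is already decoding, so decode is never starved.
    """
    chunk_budget = max_num_batched_tokens - num_decoding
    if chunk_budget <= 0:
        raise ValueError("token budget is fully consumed by decoding sequences")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, chunk_budget)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 32K-token RAG prompt arriving while 48 short chats are decoding.
plan = chunked_prefill_plan(
    prompt_tokens=32768, num_decoding=48, max_num_batched_tokens=2048
)
print(len(plan), plan[0], plan[-1])
```

Under these assumptions the prompt is spread over 17 iterations of at most 2,000 prefill tokens, and the 48 chats keep producing a token in every one of them.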
Memory: why batching needs PagedAttention
Each active sequence stores key/value tensors for every layer and token in the KV cache. Continuous batching constantly allocates and frees cache for arriving and departing sequences. Contiguous per-sequence buffers fragment GPU memory the way OS heap fragmentation wastes RAM.
PagedAttention (vLLM) stores KV in fixed-size blocks mapped by a block table, analogous to virtual memory pages. When a sequence finishes, its blocks return to a free list; a new request reuses them. Without paged or similarly pooled allocation, continuous batching either wastes memory to fragmentation and over-reservation or caps concurrency far below what the GPU's compute could sustain.
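The block-table idea can be sketched in a few lines. This is a toy allocator for illustration only: the `BlockPool` class and its block size of 16 tokens are assumptions, and it tracks block ids rather than real GPU tensors.

```python
class BlockPool:
    """Toy KV block allocator: fixed-size blocks, a free list, per-sequence tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of block ids (the block table)
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if the pool is exhausted."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # current block is full, or first token
            if not self.free:
                return False
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id):
        """Return all blocks of a finished or cancelled sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pool = BlockPool(num_blocks=4)
for _ in range(17):  # 17 tokens need two 16-token blocks
    pool.append_token("chat-1")
blocks_in_use = len(pool.tables["chat-1"])
pool.release("chat-1")
print(blocks_in_use, len(pool.free))
```

Because blocks are interchangeable, a departing sequence leaves no hole that only a same-sized request could fill.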
Scheduler limits tie directly to memory:
- gpu_memory_utilization: fraction of VRAM vLLM may claim.
- max_model_len: upper bound on prompt + completion tokens.
- max_num_seqs: max concurrent sequences given block pool size.
- max_num_batched_tokens: tokens scheduled per forward step.
Raising concurrency without raising block pool size causes admission rejections or queue growth at the gateway — coordinate with multi-tenant quotas.
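A back-of-the-envelope calculation shows why concurrency is bounded by the block pool. The sketch below assumes Llama-3-8B-like shapes (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache, and a hypothetical 40 GiB left for KV after weights; none of these figures come from a specific deployment.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes, bytes_per_token):
    """Total tokens the KV pool can hold across all active sequences."""
    return kv_budget_bytes // bytes_per_token


# Assumed model shape and memory budget (illustrative only).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
total_tokens = max_cached_tokens(40 * 1024**3, per_token)
tokens_per_seq = total_tokens // 64  # average room per sequence at max_num_seqs=64
print(per_token, total_tokens, tokens_per_seq)
```

Under these assumptions each token costs 128 KiB, the pool holds 327,680 tokens, and 64 concurrent sequences average 5,120 tokens each before admission has to wait.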
Scheduler policies and fairness
Default schedulers are often FCFS (first-come, first-served) within priority tiers. Under load, a single long completion can delay TTFT for everyone behind it unless you add:
- Priority lanes: P0 interactive vs P2 batch summarization; common in gateway rate limiting.
- Preemption: pause or swap low-priority KV blocks (advanced; not all engines support it).
- Per-tenant concurrency caps: one tenant cannot occupy all max_num_seqs slots.
- Timeout and cancel propagation: client disconnect should free scheduler slots immediately.
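Priority lanes and per-tenant caps can be combined in a small admission queue at the gateway. The following is a minimal sketch under stated assumptions: `PriorityAdmission` is a hypothetical class, priorities are integers where lower means more urgent, and preemption and timeouts are left out.

```python
import heapq
import itertools
from collections import Counter


class PriorityAdmission:
    """Admit by priority tier, FCFS within a tier, with a per-tenant cap."""

    def __init__(self, max_num_seqs, per_tenant_cap):
        self.max_num_seqs = max_num_seqs
        self.per_tenant_cap = per_tenant_cap
        self._queue = []                   # heap of (priority, arrival, tenant, request_id)
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order
        self.active = Counter()            # tenant -> sequences currently running

    def submit(self, priority, tenant, request_id):
        heapq.heappush(self._queue, (priority, next(self._arrival), tenant, request_id))

    def admit(self):
        """Admit queued requests until the global cap is reached."""
        admitted, skipped = [], []
        while self._queue and sum(self.active.values()) < self.max_num_seqs:
            item = heapq.heappop(self._queue)
            tenant, request_id = item[2], item[3]
            if self.active[tenant] >= self.per_tenant_cap:
                skipped.append(item)  # tenant is at its cap; retry later
                continue
            self.active[tenant] += 1
            admitted.append(request_id)
        for item in skipped:
            heapq.heappush(self._queue, item)
        return admitted

    def finish(self, tenant):
        """Call when a sequence ends or is cancelled."""
        self.active[tenant] -= 1


gate = PriorityAdmission(max_num_seqs=3, per_tenant_cap=2)
gate.submit(2, "batch", "r1")  # P2 batch report
for request_id in ("r2", "r3", "r4"):
    gate.submit(0, "acme", request_id)  # P0 interactive, same tenant
gate.submit(0, "beta", "r5")
first_round = gate.admit()
gate.finish("acme")
second_round = gate.admit()
print(first_round, second_round)
```

In this run the third "acme" request waits even though it is P0, so "beta" gets a slot; once an "acme" sequence finishes, the waiting P0 request is admitted ahead of the P2 batch job.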
Streaming does not change batching mechanics — tokens still decode one step at a time — but it changes user-perceived latency. Measure TTFT and inter-token latency separately from aggregate throughput when tuning batch caps.
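Both latency figures can be derived from the timestamps at which streamed tokens arrive at the client. This is a minimal sketch; `latency_metrics` is a hypothetical helper and the timestamps are invented.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, mean inter-token latency) in the unit of the timestamps."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


# Timestamps in seconds, e.g. recorded with time.monotonic() on each SSE chunk.
ttft, mean_itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, mean_itl)
```

Tracking the two numbers separately shows whether a tuning change hurt the wait for the first token (queueing and prefill) or the streaming rate (decode batch size).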
Throughput, latency, and config tuning
Continuous batching optimizes system throughput (total tokens/s across all users). Per-user latency can improve (more concurrent capacity) or worsen (queueing behind large prefills) depending on knobs:
| Knob | Higher value effect | Risk |
|---|---|---|
| max_num_seqs | More parallel chats, better GPU util | KV OOM, longer queues, noisy-neighbor latency |
| max_num_batched_tokens | Bigger prefills per step, higher throughput | Decode starvation, TTFT spikes for short chats |
| Chunked prefill (enabled) | Smoother mixed batches | Slightly higher TTFT on very short prompts |
| Speculative decoding (enabled) | Fewer effective decode steps | Draft model memory; see speculative decoding |
Load-test with a realistic mix: 70% short prompts (<512 tokens), 20% medium RAG (2–8K), 10% long documents. Synthetic uniform-length tests mis-tune schedulers. Pair with observability on batch size, cache usage, and prefill/decode time per iteration.
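A load generator can draw prompt lengths from that mix instead of a single fixed length. The sketch below is illustrative: the 70/20/10 split and the short and medium ranges follow the text, while the 16-token floor and the 8K to 32K range for long documents are assumptions.

```python
import random


def sample_prompt_lengths(n, seed=0):
    """Draw n prompt lengths (in tokens) from a short/medium/long mix."""
    rng = random.Random(seed)  # seeded so load tests are repeatable
    ranges = [(16, 511), (2048, 8192), (8193, 32768)]  # long range is an assumption
    weights = [0.70, 0.20, 0.10]
    picks = rng.choices(ranges, weights=weights, k=n)
    return [rng.randint(low, high) for low, high in picks]


lengths = sample_prompt_lengths(10_000)
short_share = sum(1 for length in lengths if length < 512) / len(lengths)
print(len(lengths), round(short_share, 2))
```

Feeding these lengths into the load test exercises the scheduler with the same short-versus-long contention it will see in production.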
Harbor Analytics gateway refactor
Harbor’s migration from a single-threaded generate loop to continuous batching followed a measured path:
- Baseline metrics: TTFT, tokens/s, GPU util, and queue depth at month-end peak; identified padding waste in static batches of eight.
- vLLM pilot: same 8B instruct checkpoint, OpenAI-compatible port behind the existing LiteLLM gateway; max_num_seqs=64, max_num_batched_tokens=8192.
- Chunked prefill: enabled after RAG tickets with 12K-token contexts caused 3 s TTFT regressions for short queries.
- Priority tiers: month-end batch reports tagged P2; interactive dashboards P0; gateway enforces per-tenant seq caps.
- FP8 weights: freed VRAM for larger block pools; see FP8 inference for quality validation steps.
- Cancel handling: SSE client disconnect aborts vLLM sequences within one decode step, reclaiming slots.
Result: 3.1× throughput on one A100, p95 TTFT under 700 ms at 40 concurrent users, and zero OOM events after block-pool tuning. The gain let the team defer buying a second GPU for six months.
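For reference, settings like those in the pilot map onto `vllm serve` command-line flags. The sketch below only builds the argument list. The model name is a placeholder, and the `gpu_memory_utilization` and `max_model_len` values are assumptions rather than Harbor's actual settings; check flag names against the vLLM version you deploy.

```python
def build_vllm_serve_args(model, settings):
    """Turn a settings dict into a `vllm serve` argument list."""
    args = ["vllm", "serve", model]
    for key, value in settings.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switch, no value
        elif value is False or value is None:
            continue           # leave the engine default in place
        else:
            args.extend([flag, str(value)])
    return args


pilot_settings = {
    "max_num_seqs": 64,                # from the pilot
    "max_num_batched_tokens": 8192,    # from the pilot
    "enable_chunked_prefill": True,    # enabled after the RAG regressions
    "gpu_memory_utilization": 0.90,    # assumed value
    "max_model_len": 16384,            # assumed; must cover 12K-token contexts
}
command = build_vllm_serve_args("your-org/your-8b-instruct", pilot_settings)
print(" ".join(command))
```

Keeping the settings in one dict makes it easier to version them alongside load-test results and to re-tune after a model swap.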
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| Offline nightly batch, fixed input sizes | Static or async batch API jobs | Continuous batching complexity for one-shot jobs |
| Interactive multi-user chat API | Continuous batching (vLLM, TGI, TRT-LLM) | Per-request generate() loops on one GPU |
| Mostly long RAG prefills | Chunked prefill + token budget; consider disaggregated prefill | Unlimited prefill in mixed batches |
| Strict TTFT SLO (<500 ms) | Lower max_num_batched_tokens, P0 lanes, prefill caps | Maxing max_num_seqs without latency testing |
| Cost-sensitive self-host | Continuous batching + quantization + right-sized GPU | Over-provisioning GPUs because the scheduler is naive |
| Provider API only (OpenAI, Anthropic) | Gateway queueing and async batch for bulk work | Reimplementing continuous batching client-side |
| Multi-tenant SaaS inference | Per-tenant seq caps + weighted fair queuing (WFQ) at gateway + continuous batching engine | Shared pool with no isolation on max_num_seqs |
Common pitfalls
- Confusing throughput with per-user latency. High tokens/s can coexist with multi-second TTFT under load.
- No chunked prefill on RAG workloads. One long prompt starves the batch.
- Ignoring KV block pool limits. OOM or admission failures at peak concurrency.
- Load tests with uniform lengths. Misleading scheduler tuning vs production traffic.
- Not propagating client cancel. Ghost sequences waste slots for hundreds of decode steps.
- Maxing batch size on latency-sensitive tiers. Interactive and batch jobs need separate pools or priorities.
- Skipping observability on scheduler metrics. You cannot tune what you do not measure per iteration.
- Assuming provider APIs expose batching knobs. Continuous batching is an engine concern; gateways queue externally.
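The cancel-propagation pitfall is usually a missing cleanup path. Below is a minimal asyncio sketch of the pattern, assuming the gateway cancels the streaming task when the client disconnects; `stream_tokens` and the `slots` set are hypothetical stand-ins for an engine call and its scheduler slots.

```python
import asyncio


async def stream_tokens(slots, request_id, num_tokens, produced):
    """Simulate a streamed generation that always frees its scheduler slot."""
    slots.add(request_id)
    try:
        for token_index in range(num_tokens):
            await asyncio.sleep(0)  # stands in for one decode step
            produced.append(token_index)
    finally:
        # Runs on normal completion and on cancellation alike.
        slots.discard(request_id)


async def demo():
    slots, produced = set(), []
    task = asyncio.create_task(stream_tokens(slots, "req-1", 1000, produced))
    await asyncio.sleep(0)  # let the generation start
    task.cancel()           # client disconnected
    try:
        await task
    except asyncio.CancelledError:
        pass
    return slots, len(produced)


slots_after, tokens_produced = asyncio.run(demo())
print(slots_after, tokens_produced)
```

Without the `finally` block, the cancelled request would keep its slot until some other timeout fired, which is the ghost-sequence behavior described above.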
Production checklist
- Replace naive generate loops with a continuous-batching engine for concurrent APIs.
- Set max_num_seqs and max_num_batched_tokens from load tests with realistic length mix.
- Enable chunked prefill when median prompt exceeds ~2K tokens.
- Pair scheduling with PagedAttention or equivalent KV block pooling.
- Instrument TTFT, inter-token latency, tokens/s, batch size, and KV cache usage.
- Implement priority lanes and per-tenant concurrency caps at the gateway.
- Propagate client disconnect to free scheduler slots within one step.
- Validate quality after FP8/INT4 changes that enlarge block pools.
- Document a rollback path to static offline jobs for bulk work in case the serving engine must be taken out of service.
- Re-tune after model swaps (context length and KV bytes per token change).
- Load-test month-end or campaign peaks, not average midday traffic.
- Review disaggregated prefill when TTFT SLO breaks despite chunked prefill.
Key takeaways
- Continuous batching adds and removes sequences every forward pass, eliminating padding waste from static batches.
- Prefill and decode have different compute profiles; chunked prefill and token budgets protect TTFT for short chats.
- PagedAttention-style KV pooling is the memory partner that makes high concurrency practical.
- Harbor Analytics roughly tripled throughput (3.1×) by changing the scheduler, not by changing the model or adding GPUs.
- Tune with realistic length distributions and separate interactive from batch priority tiers.
Related reading
- vLLM fundamentals explained — PagedAttention, API server, and parallelism
- LLM inference serving explained — engines, autoscaling, and SLO checklist
- LLM KV cache explained — memory growth and cache precision
- LLM multi-tenant isolation explained — fair queues and noisy-neighbor control