LLM continuous batching explained

Harbor Analytics moved its internal chat API from a naive Hugging Face generate() loop to vLLM. Peak concurrent users barely changed — about forty analysts during month-end close — but p95 time-to-first-token dropped from 4.2 s to 680 ms and tokens-per-second per GPU rose 3.1×. The model weights were identical. The win was continuous batching: a scheduler that adds and removes requests from the GPU batch on every forward pass instead of waiting for an entire static batch to finish decoding.

Continuous batching (also called iteration-level batching, and sometimes loosely called dynamic batching, a term this guide reserves for the time-window approach described later) is the serving technique that lets one GPU run dozens of chat sessions in parallel without padding every sequence to the length of the longest one. It pairs with efficient KV cache management, especially PagedAttention in vLLM, and is the default in production engines like vLLM, TensorRT-LLM, and Hugging Face TGI. This guide covers static vs continuous batching, prefill and decode scheduling, memory and fairness trade-offs, configuration knobs, the Harbor Analytics gateway refactor, a technique decision table, pitfalls, and a production checklist.

What continuous batching decides

Autoregressive LLM inference is a loop: each step runs a forward pass and appends one token. In a naive server, one request monopolizes the GPU until it emits EOS or reaches its max_tokens limit. A static batch groups N requests and runs them together, but when the shortest sequence finishes, its slot stays in the batch as padding and the GPU keeps computing it until the longest sequence completes. Those padded steps are wasted FLOPs and memory bandwidth.

Continuous batching fixes that at the scheduler layer:

  • Join — new requests enter the batch as soon as a slot frees (another sequence hit EOS or was cancelled).
  • Leave — finished sequences exit immediately; their KV cache blocks are recycled.
  • Mix phases — some sequences may be in long prefill (processing the prompt) while others are in decode (one token at a time), depending on engine design.

The result is higher GPU utilization and more tokens per second at a given hardware budget. The trade-off is scheduler complexity, potential head-of-line blocking when one huge prefill arrives, and harder reasoning about per-request latency unless you configure limits and priorities.
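The join and leave behavior described above can be sketched with a toy simulation. It counts forward passes only, ignores prefill, and treats every sequence as a number of decode steps; the function names and the sample lengths are illustrative assumptions, not taken from any engine.

```python
from collections import deque


def static_batch_steps(decode_lengths, batch_size):
    """Forward passes and wasted slot-steps when each batch runs until its longest member ends."""
    steps = 0
    wasted = 0
    for i in range(0, len(decode_lengths), batch_size):
        batch = decode_lengths[i:i + batch_size]
        longest = max(batch)
        steps += longest
        # Finished sequences keep their slot as padding until the batch ends.
        wasted += sum(longest - n for n in batch)
    return steps, wasted


def continuous_batch_steps(decode_lengths, batch_size):
    """Forward passes when finished sequences leave and waiting ones join after every step."""
    waiting = deque(decode_lengths)
    running = []  # remaining decode steps per active sequence
    steps = 0
    while waiting or running:
        # Join: fill free slots from the queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass: every running sequence emits one token.
        running = [n - 1 for n in running]
        steps += 1
        # Leave: drop sequences that just finished.
        running = [n for n in running if n > 0]
    return steps


lengths = [10, 200, 15, 180, 20, 30, 25, 190]  # hypothetical decode lengths
print(static_batch_steps(lengths, batch_size=4))      # (390, 890)
print(continuous_batch_steps(lengths, batch_size=4))  # 235
```

With these made-up lengths the same eight requests finish in 235 forward passes instead of 390, because short sequences hand their slots to waiting requests instead of idling as padding.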

Static, dynamic, and iteration-level batching

Static batching

Collect B requests, pad to the longest sequence, run until all finish. Simple for offline jobs (nightly summarization of fixed corpora) but poor for interactive chat where sequence lengths vary 10× within a batch.

Dynamic batching (time-window)

Wait up to T milliseconds to accumulate requests, then form a batch. Better than pure static batching for APIs, but once a batch is formed it still runs until its slowest member finishes decoding, so requests that arrive mid-batch wait for the next window. Common in early Triton and custom gRPC servers.

Iteration-level (continuous) batching

After each forward pass, the scheduler recomputes membership. Finished sequences leave; waiting sequences enter if KV blocks are available. vLLM’s scheduler also enforces max_num_seqs (concurrent sequences) and max_num_batched_tokens (total tokens in one iteration) to cap memory and kernel launch size. This is the pattern behind most 2025–2026 self-hosted chat APIs.
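A minimal sketch of one admission step under the two caps named above. This is a toy model, not vLLM's scheduler: `Request` and `schedule_step` are hypothetical names, each running sequence is assumed to cost one token per iteration, and each admitted request is assumed to prefill its whole prompt at once.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int      # tokens to prefill on admission
    max_new_tokens: int  # decode steps the request may use


def schedule_step(running, waiting, max_num_seqs, max_num_batched_tokens):
    """Return the requests admitted this iteration, FCFS, without breaking either cap."""
    # Running sequences decode one token each, so they consume budget first.
    budget = max_num_batched_tokens - len(running)
    admitted = []
    while waiting and len(running) + len(admitted) < max_num_seqs:
        head = waiting[0]
        if head.prompt_len > budget:
            break  # FCFS: do not skip past a request that does not fit yet
        budget -= head.prompt_len
        admitted.append(waiting.popleft())
    return admitted


running = [Request(f"r{i}", 0, 10) for i in range(3)]
waiting = deque([Request("a", 300, 50), Request("b", 600, 50), Request("c", 100, 50)])
admitted = schedule_step(running, waiting, max_num_seqs=8, max_num_batched_tokens=512)
print([r.req_id for r in admitted])  # ['a']
```

Here request "b" does not fit in the remaining token budget, so it and everything behind it wait for a later iteration. A real scheduler also has to check that KV blocks are available before admitting.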

Prefill vs decode and why scheduling splits them

Prefill processes the entire prompt in one or a few forward passes. It is compute-bound, with high parallelism across prompt tokens. Decode generates one token per step per sequence. It is memory-bandwidth-bound because each step reads the full weights and a growing KV cache.
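A rough back-of-envelope model shows why batching helps the bandwidth-bound decode phase: the weights are read once per step regardless of how many sequences share that step. The sketch below assumes step time is purely memory traffic divided by bandwidth; the 16 GB weight size, 128 KiB of KV per token, and 2 TB/s bandwidth are hypothetical figures chosen for illustration.

```python
def decode_tokens_per_second(batch_size, context_len, weight_bytes,
                             kv_bytes_per_token, mem_bandwidth_bytes_per_s):
    """Estimate decode throughput if each step streams the weights plus every KV cache."""
    kv_bytes = batch_size * context_len * kv_bytes_per_token
    step_seconds = (weight_bytes + kv_bytes) / mem_bandwidth_bytes_per_s
    return batch_size / step_seconds


# Hypothetical hardware and model figures, for illustration only.
for batch in (1, 8, 32):
    tps = decode_tokens_per_second(batch, 1024, 16e9, 131_072, 2e12)
    print(f"batch={batch:>2}  ~{tps:,.0f} tokens/s")
```

Under these assumptions throughput grows almost linearly with batch size until KV cache reads start to rival the weight reads, which is the regime the memory section below is concerned with.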

Mixed batches (some prefilling, some decoding) improve utilization but complicate kernels. Many engines use:

  • Chunked prefill — split long prompts across iterations so decode-heavy slots are not starved.
  • Prefill prioritization — bound time-to-first-token (TTFT) for new chats by reserving batch token budget for prefill.
  • Disaggregated prefill/decode — separate GPU pools when prompt lengths dominate; see prefill/decode disaggregation for when to split clusters.

Continuous batching without chunked prefill can let one 32K-token RAG prompt block dozens of short decode-only chats. Production configs almost always set explicit prefill token caps per iteration.
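To make the chunking arithmetic concrete, the sketch below splits one long prompt across iterations after reserving one token of budget per decoding sequence. It assumes the number of decoding sequences stays constant while the prompt is prefilled; `plan_prefill_chunks` and the numbers in the example are hypothetical.

```python
def plan_prefill_chunks(prompt_len, num_decoding, token_budget):
    """Return the prefill chunk size scheduled in each iteration for one long prompt.

    Every iteration first reserves one token per decoding sequence, then gives
    the remaining budget to the next chunk of the prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget leaves no room for prefill")
    chunk_cap = token_budget - num_decoding
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(chunk_cap, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


chunks = plan_prefill_chunks(prompt_len=32_000, num_decoding=48, token_budget=2_048)
print(len(chunks), chunks[0])  # 16 2000
```

In this example the 32K-token prompt takes 16 iterations to prefill, and the 48 short chats still emit a token on every one of them instead of stalling behind a single giant pass.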

Memory: why batching needs PagedAttention

Each active sequence stores key/value tensors for every layer and token in the KV cache. Continuous batching constantly allocates and frees cache for arriving and departing sequences. Contiguous per-sequence buffers fragment GPU memory the way OS heap fragmentation wastes RAM.

PagedAttention (vLLM) stores KV in fixed-size blocks mapped by a block table, analogous to virtual memory pages. When a sequence finishes, its blocks return to a free list; a new request reuses them. Without paged or similarly pooled allocation, continuous batching either wastes memory to fragmentation and over-reservation or caps concurrency far below what the GPU's compute could sustain.
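The block-table idea can be illustrated with a toy allocator. This is a sketch of the bookkeeping only, with no tensors and no attention kernel; the class name, method names, and the block size of 16 tokens are assumptions made for the example rather than vLLM internals.

```python
class BlockAllocator:
    """Toy KV block pool: fixed-size blocks, a free list, and a per-sequence block table."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_tokens(self, seq_id, num_tokens):
        """Grow a sequence, taking new blocks from the free list only when needed."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = -(-new_len // self.block_size) - len(table)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV block pool exhausted")
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free list for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


pool = BlockAllocator(num_blocks=8)
pool.append_tokens("a", 40)   # 3 blocks
pool.append_tokens("b", 70)   # 5 blocks, pool is now full
pool.free("b")                # "b" finished: its 5 blocks are reusable at once
pool.append_tokens("c", 30)   # 2 blocks taken from the recycled ones
print(len(pool.free_blocks))  # 3
```

Because blocks need not be contiguous, a departing sequence never leaves a hole that only a same-sized request could fill.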

Scheduler limits tie directly to memory:

  • gpu_memory_utilization — fraction of VRAM vLLM may claim.
  • max_model_len — upper bound on prompt + completion tokens.
  • max_num_seqs — max concurrent sequences given block pool size.
  • max_num_batched_tokens — tokens scheduled per forward step.

Raising concurrency without raising block pool size causes admission rejections or queue growth at the gateway — coordinate with multi-tenant quotas.
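The link between these limits and memory is simple arithmetic, sketched below. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values), the 80 GiB card, and the 16 GiB of weights are illustrative assumptions, and the estimate ignores activations and other runtime overhead, so treat the result as an upper bound rather than a figure an engine would report.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache bytes for one token: a key and a value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_pool_tokens(vram_bytes, gpu_memory_utilization, weight_bytes, bytes_per_token):
    """Tokens that fit in the KV pool once the weights are loaded (overheads ignored)."""
    pool_bytes = int(vram_bytes * gpu_memory_utilization) - weight_bytes
    return max(pool_bytes, 0) // bytes_per_token


GIB = 2**30
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
pool_tokens = kv_pool_tokens(80 * GIB, 0.90, 16 * GIB, per_token)
max_model_len = 8192

print(per_token)                     # 131072 bytes, i.e. 128 KiB per token
print(pool_tokens)                   # 458752 tokens
print(pool_tokens // max_model_len)  # 56 sequences if every one hits max_model_len
```

Setting max_num_seqs above the worst-case figure is reasonable when most sequences are short, but it relies on the length mix staying that way.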

Scheduler policies and fairness

Default schedulers are often FCFS (first-come, first-served) within priority tiers. Under load, a single long completion can delay TTFT for everyone behind it unless you add:

  • Priority lanes — P0 interactive vs P2 batch summarization; common in gateway rate limiting.
  • Preemption — pause or swap low-priority KV blocks (advanced; not all engines support it).
  • Per-tenant concurrency caps — one tenant cannot occupy all max_num_seqs slots.
  • Timeout and cancel propagation — client disconnect should free scheduler slots immediately.
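Priority lanes and per-tenant caps from the list above can be combined in a small gateway-side queue. The sketch is an assumption about how such a queue might look, not the behavior of any particular gateway; `AdmissionQueue` and its methods are hypothetical, and lower priority numbers are served first (P0 before P2).

```python
import heapq
from collections import Counter
from itertools import count


class AdmissionQueue:
    """Toy gateway queue: priority tiers, FCFS within a tier, per-tenant running cap."""

    def __init__(self, max_num_seqs, per_tenant_cap):
        self.max_num_seqs = max_num_seqs
        self.per_tenant_cap = per_tenant_cap
        self._heap = []
        self._arrival = count()   # tie-breaker that preserves arrival order
        self.running = Counter()  # tenant -> sequences currently in the engine

    def submit(self, tenant, priority, req_id):
        heapq.heappush(self._heap, (priority, next(self._arrival), tenant, req_id))

    def admit(self):
        """Admit in priority order until the engine is full; defer tenants at their cap."""
        admitted, deferred = [], []
        while self._heap and sum(self.running.values()) < self.max_num_seqs:
            item = heapq.heappop(self._heap)
            _, _, tenant, req_id = item
            if self.running[tenant] >= self.per_tenant_cap:
                deferred.append(item)  # stays queued with its original arrival order
                continue
            self.running[tenant] += 1
            admitted.append(req_id)
        for item in deferred:
            heapq.heappush(self._heap, item)
        return admitted

    def release(self, tenant):
        """Call when a sequence finishes or its client disconnects."""
        self.running[tenant] -= 1


q = AdmissionQueue(max_num_seqs=4, per_tenant_cap=2)
for req_id in ("r1", "r2", "r3"):
    q.submit("reports", priority=2, req_id=req_id)
q.submit("dashboards", priority=0, req_id="d1")

first = q.admit()
print(first)      # ['d1', 'r1', 'r2']
q.release("reports")
second = q.admit()
print(second)     # ['r3']
```

The interactive request jumps the batch reports even though it arrived last, and the third report waits because its tenant already holds two slots, leaving one slot free for other tenants.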

Streaming does not change batching mechanics — tokens still decode one step at a time — but it changes user-perceived latency. Measure TTFT and inter-token latency separately from aggregate throughput when tuning batch caps.
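Both latency metrics can be derived from the timestamps at which streamed tokens reach the client. A minimal sketch, assuming timestamps are in seconds on a single clock; `latency_metrics` is a hypothetical helper.

```python
from statistics import mean


def latency_metrics(request_sent, token_times):
    """Return (TTFT, mean inter-token latency) for one streamed response."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token = mean(gaps) if gaps else 0.0
    return ttft, inter_token


ttft, itl = latency_metrics(request_sent=0.0, token_times=[0.5, 0.75, 1.0])
print(ttft, itl)  # 0.5 0.25
```

Aggregating these per request into percentiles, rather than averaging tokens per second across the fleet, is what exposes queueing behind large prefills.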

Throughput, latency, and config tuning

Continuous batching optimizes system throughput (total tokens/s across all users). Per-user latency can improve (more concurrent capacity) or worsen (queueing behind large prefills) depending on knobs:

Knob | Higher value effect | Risk
-----|---------------------|-----
max_num_seqs | More parallel chats, better GPU util | KV OOM, longer queues, noisy-neighbor latency
max_num_batched_tokens | Bigger prefills per step, higher throughput | Decode starvation, TTFT spikes for short chats
Chunked prefill size | Smoother mixed batches | Slightly higher TTFT on very short prompts
Speculative decoding | Lower effective decode steps | Draft model memory; see speculative decoding

Load-test with a realistic mix: 70% short prompts (<512 tokens), 20% medium RAG (2–8K), 10% long documents. Synthetic uniform-length tests mis-tune schedulers. Pair with observability on batch size, cache usage, and prefill/decode time per iteration.
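A load generator can sample prompt lengths from the mix described above instead of using one fixed length. In the sketch below the lower bound for short prompts and the upper bound for long documents are assumptions, since the text only gives the shares and the medium range; `LENGTH_MIX` and `sample_prompt_lengths` are hypothetical names.

```python
import random

# (share, min_tokens, max_tokens). The 32 and 32_000 bounds are assumed, not from the text.
LENGTH_MIX = [
    (0.70, 32, 511),         # short prompts (<512 tokens)
    (0.20, 2_000, 8_000),    # medium RAG (2-8K)
    (0.10, 8_001, 32_000),   # long documents
]


def sample_prompt_lengths(n, seed=0):
    """Draw n prompt lengths following LENGTH_MIX, reproducibly for a given seed."""
    rng = random.Random(seed)
    weights = [share for share, _, _ in LENGTH_MIX]
    lengths = []
    for _ in range(n):
        _, low, high = rng.choices(LENGTH_MIX, weights=weights, k=1)[0]
        lengths.append(rng.randint(low, high))
    return lengths


lengths = sample_prompt_lengths(10_000)
short_share = sum(1 for n in lengths if n < 512) / len(lengths)
print(f"short share: {short_share:.2f}")
```

Fixing the seed keeps runs comparable when you change a scheduler knob between tests.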

Harbor Analytics gateway refactor

Harbor’s migration from a single-threaded generate loop to continuous batching followed a measured path:

  1. Baseline metrics — TTFT, tokens/s, GPU util, and queue depth at month-end peak; identified padding waste in static batches of eight.
  2. vLLM pilot — same 8B instruct checkpoint, OpenAI-compatible port behind existing LiteLLM gateway; max_num_seqs=64, max_num_batched_tokens=8192.
  3. Chunked prefill — enabled after RAG tickets with 12K-token contexts caused 3 s TTFT regressions for short queries.
  4. Priority tiers — month-end batch reports tagged P2; interactive dashboards P0; gateway enforces per-tenant seq caps.
  5. FP8 weights — freed VRAM for larger block pools; see FP8 inference for quality validation steps.
  6. Cancel handling — SSE client disconnect aborts vLLM sequences within one decode step, reclaiming slots.
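The cancel-handling step can be illustrated with plain asyncio. This is a toy stand-in for a streaming handler, not Harbor's gateway code or a vLLM API: the `slots` set plays the role of scheduler slots, the sleep stands in for one decode step, and in a real deployment the cleanup branch would call whatever abort mechanism the engine provides.

```python
import asyncio


async def generate_stream(slots, seq_id, num_tokens):
    """Toy decode loop that holds a scheduler slot until it finishes or is cancelled."""
    slots.add(seq_id)
    try:
        for _ in range(num_tokens):
            await asyncio.sleep(0.01)  # stands in for one decode step
    finally:
        slots.discard(seq_id)  # runs on normal completion and on cancellation


async def main():
    slots = set()
    task = asyncio.create_task(generate_stream(slots, "chat-1", num_tokens=1000))
    await asyncio.sleep(0.03)
    held_before_cancel = "chat-1" in slots
    task.cancel()  # what a gateway would do when the SSE client disconnects
    try:
        await task
    except asyncio.CancelledError:
        pass
    return held_before_cancel, slots


held, remaining = asyncio.run(main())
print(held, remaining)  # True set()
```

Cancellation is delivered at the next await, so the slot is released within one simulated decode step instead of running out the remaining 997.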

Result: 3.1× throughput on one A100, p95 TTFT under 700 ms at 40 concurrent users, and zero OOM errors after block-pool tuning. The added headroom let Harbor avoid adding a second GPU for six months.

Technique decision table

Your situation | Prefer | Avoid
---------------|--------|------
Offline nightly batch, fixed input sizes | Static or async batch API jobs | Continuous batching complexity for one-shot jobs
Interactive multi-user chat API | Continuous batching (vLLM, TGI, TRT-LLM) | Per-request generate() loops on one GPU
Mostly long RAG prefills | Chunked prefill + token budget; consider disaggregated prefill | Unlimited prefill in mixed batches
Strict TTFT SLO (<500 ms) | Lower max_num_batched_tokens, P0 lanes, prefill caps | Maxing max_num_seqs without latency testing
Cost-sensitive self-host | Continuous batching + quantization + right-sized GPU | Over-provisioning GPUs because the scheduler is naive
Provider API only (OpenAI, Anthropic) | Gateway queueing and async batch for bulk work | Reimplementing continuous batching client-side
Multi-tenant SaaS inference | Per-tenant seq caps + WFQ at gateway + continuous batching engine | Shared pool with no isolation on max_num_seqs

Common pitfalls

  • Confusing throughput with per-user latency. High tokens/s can coexist with multi-second TTFT under load.
  • No chunked prefill on RAG workloads. One long prompt starves the batch.
  • Ignoring KV block pool limits. OOM or admission failures at peak concurrency.
  • Load tests with uniform lengths. Misleading scheduler tuning vs production traffic.
  • Not propagating client cancel. Ghost sequences waste slots for hundreds of decode steps.
  • Maxing batch size on latency-sensitive tiers. Interactive and batch jobs need separate pools or priorities.
  • Skipping observability on scheduler metrics. You cannot tune what you do not measure per iteration.
  • Assuming provider APIs expose batching knobs. Continuous batching is an engine concern; gateways queue externally.

Production checklist

  • Replace naive generate loops with a continuous-batching engine for concurrent APIs.
  • Set max_num_seqs and max_num_batched_tokens from load tests with realistic length mix.
  • Enable chunked prefill when median prompt exceeds ~2K tokens.
  • Pair scheduling with PagedAttention or equivalent KV block pooling.
  • Instrument TTFT, inter-token latency, tokens/s, batch size, and KV cache usage.
  • Implement priority lanes and per-tenant concurrency caps at the gateway.
  • Propagate client disconnect to free scheduler slots within one step.
  • Validate quality after FP8/INT4 changes that enlarge block pools.
  • Document rollback to static offline jobs for disaster-batch scenarios.
  • Re-tune after model swaps (context length and KV bytes per token change).
  • Load-test month-end or campaign peaks, not average midday traffic.
  • Review disaggregated prefill when TTFT SLO breaks despite chunked prefill.

Key takeaways

  • Continuous batching adds and removes sequences every forward pass, eliminating padding waste from static batches.
  • Prefill and decode have different compute profiles; chunked prefill and token budgets protect TTFT for short chats.
  • PagedAttention-style KV pooling is the memory partner that makes high concurrency practical.
  • Harbor Analytics tripled effective throughput through scheduling alone, with the same model weights and without adding a GPU.
  • Tune with realistic length distributions and separate interactive from batch priority tiers.
