
LLM chunked prefill explained

Harbor Analytics turned on continuous batching in vLLM and cut p95 time-to-first-token (TTFT) from 4.2 s to 680 ms for short analyst queries. Two weeks later, a new RAG pipeline shipped 14K-token earnings-call transcripts as system context. TTFT for those long requests was acceptable, but p95 TTFT for “quick lookup” chats under 200 tokens jumped to 2.8 s. GPU utilization looked healthy. The culprit was not the model but monolithic prefill: one long prompt consumed the entire per-iteration token budget and starved decode slots for everyone else in the batch.

Chunked prefill splits long prompt processing across multiple forward passes, interleaving partial prefills with ongoing decode steps for other sequences. It is the standard fix in vLLM, TensorRT-LLM, and SGLang when RAG, agent tool dumps, or document QA push median prompt length past a few thousand tokens. This guide covers prefill vs decode compute profiles, how chunked scheduling interacts with max_num_batched_tokens, configuration knobs, the Harbor Analytics RAG gateway refactor, a technique decision table, pitfalls, and a production checklist.

What chunked prefill decides

Autoregressive inference has two phases. Prefill ingests the prompt: one or more forward passes compute attention over all prompt tokens at once — highly parallel, compute-bound. Decode generates output one token per step per sequence — memory-bandwidth-bound because each step reads full weights and an ever-growing KV cache.

In a continuous batch, the scheduler runs one forward pass per “iteration” that may include multiple sequences. Without chunking, a single 20K-token prefill can occupy the entire iteration — every other sequence waits, even if they only need one decode token. Chunked prefill caps how many new prompt tokens any sequence may process per iteration, leaving budget for decode work on other slots.

  • Fairness: short chats keep getting tokens while a long RAG prompt prefills in the background across iterations.
  • Predictable TTFT: new requests join the batch without waiting for one very long prefill to finish in a single pass.
  • Stable mixed batches: prefill and decode coexist in the same forward pass without head-of-line blocking.

The trade-off is a small throughput cost (more kernel launches, and each later chunk must re-read the KV cache written by earlier chunks) and marginally longer absolute TTFT on very long prompts compared to a dedicated prefill GPU. That cost is usually acceptable on single-GPU or cost-constrained clusters.
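The scheduling difference described in this section can be sketched as a toy planner. This is a minimal illustration, not vLLM's scheduler: the name `schedule_prefill` and its decode-first policy are assumptions made for the example, and real engines also track KV blocks, preemption, and several concurrent prefills.

```python
def schedule_prefill(prompt_tokens, decode_seqs, token_budget, chunked):
    """Toy plan for one long prompt: a list of (prefill_tokens, decode_tokens)
    pairs, one pair per forward pass, until the prompt is fully prefilled."""
    if chunked and token_budget <= decode_seqs:
        raise ValueError("token budget must exceed the number of decoding sequences")
    iterations = []
    remaining = prompt_tokens
    while remaining > 0:
        if chunked:
            # Decode first: one token per running sequence, then fill the
            # rest of the budget with the next slice of the long prompt.
            decode = decode_seqs
            prefill = min(remaining, token_budget - decode)
        else:
            # Monolithic: the whole prompt runs in one pass and decodes wait.
            decode = 0
            prefill = remaining
        iterations.append((prefill, decode))
        remaining -= prefill
    return iterations


chunked_plan = schedule_prefill(20_000, 8, 4096, chunked=True)
monolithic_plan = schedule_prefill(20_000, 8, 4096, chunked=False)
```

Under these assumptions, the 20,000-token prompt is spread over five passes and each of the eight chats still receives a token in every pass. In the monolithic plan the chats receive nothing until the single pass ends.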

Monolithic prefill vs chunked prefill

Monolithic (single-pass) prefill

Process the entire prompt in one forward pass before any decode begins. Optimal when only one sequence is active or prompts are uniformly short. In multi-tenant APIs with variable lengths, one 32K RAG context can freeze the batch for tens of milliseconds per layer — multiplied across queue depth, users perceive multi-second TTFT spikes.

Chunked prefill

Split the prompt into segments that fit the per-iteration token budget, max_num_batched_tokens (or a dedicated chunk size). Each iteration can carry pending decode tokens for other sequences alongside the next slice of the long prompt, so short requests keep advancing while the long prefill resumes pass after pass. vLLM exposes this via enable_chunked_prefill=True (default in recent versions) paired with max_num_batched_tokens.

Disaggregated prefill/decode

When prompts routinely exceed 16–32K tokens and TTFT SLOs still break despite chunking, split workloads across GPU pools: prefill nodes optimized for compute, decode nodes for bandwidth. See prefill/decode disaggregation for cluster topology. Chunked prefill is the lighter-weight fix; disaggregation is the scale-out step.

Token budgets and scheduler interaction

Chunked prefill is not a separate subsystem — it is a policy inside the iteration scheduler. The critical knobs in vLLM and similar engines:

  • max_num_batched_tokens — upper bound on total tokens (prefill + decode) scheduled in one forward pass. Acts as the chunk ceiling.
  • max_num_seqs — max concurrent sequences; interacts with KV block pool size from PagedAttention.
  • enable_chunked_prefill — toggles splitting long prefills across iterations.
  • max_model_len — hard cap on prompt + completion; must fit in KV memory regardless of chunking.
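Because max_model_len and max_num_seqs are both bounded by KV memory, a back-of-envelope estimate helps when choosing them. The sketch below uses the usual per-token KV size (keys plus values, for every layer and KV head). The model shape and the 20 GiB budget in the usage lines are illustrative assumptions, not Harbor's deployment, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes, bytes_per_token):
    """Tokens that fit in a KV memory budget, summed over all sequences."""
    return kv_budget_bytes // bytes_per_token


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 2-byte (fp16/bf16) values.
per_token = kv_bytes_per_token(32, 8, 128)              # 131072 bytes = 128 KiB
capacity = max_cached_tokens(20 * 1024**3, per_token)   # assumed 20 GiB KV budget
```

Chunking changes when prompt tokens are processed, not how much cache they occupy: a 14K-token prompt still holds roughly 14K tokens of cache once its prefill completes.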

Example: max_num_batched_tokens=4096 with chunked prefill enabled. A 12K-token RAG prompt prefills in roughly three iterations (~4K tokens each). During those iterations, the scheduler keeps running decode steps for eight short chats already in the batch. Without chunking, those eight chats wait until all 12K prompt tokens complete in one pass.
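The arithmetic in this example can be checked with a ceiling division. The sketch assumes each running chat takes one token of the budget per iteration; `prefill_iterations` is a hypothetical helper, not an engine API.

```python
def prefill_iterations(prompt_tokens, token_budget, decode_seqs=0):
    """Forward passes needed to prefill a prompt when decode tokens share the budget."""
    per_iteration = token_budget - decode_seqs  # budget left for prompt tokens
    if per_iteration <= 0:
        raise ValueError("token budget must exceed the number of decoding sequences")
    # Integer ceiling division avoids floating-point rounding.
    return -(-prompt_tokens // per_iteration)


passes_12000 = prefill_iterations(12_000, 4096, decode_seqs=8)
passes_12288 = prefill_iterations(12_288, 4096, decode_seqs=8)
```

"Roughly three" holds for a 12,000-token prompt. A prompt of exactly 12,288 tokens spills into a fourth, very short pass, because eight budget slots per pass go to decode.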

Tuning is workload-specific. RAG-heavy traffic wants lower per-iteration prefill share (smaller effective chunks or priority lanes). Batch summarization overnight may disable chunking or raise max_num_batched_tokens to maximize throughput when TTFT does not matter.
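One way to keep the tiers apart is to hold their engine arguments side by side. The keys below are the vLLM engine arguments named in this guide, and the 4096 budget mirrors the interactive example above. The model name, max_num_seqs, max_model_len, and every batch-tier value are placeholder assumptions to be replaced with numbers from your own load tests.

```python
# Hypothetical tier profiles; only the 4096 budget comes from this guide.
INTERACTIVE_TIER = {
    "enable_chunked_prefill": True,
    "max_num_batched_tokens": 4096,    # small budget keeps decode steps frequent
    "max_num_seqs": 64,                # assumed
    "max_model_len": 32768,            # assumed
}

BATCH_TIER = {
    "enable_chunked_prefill": True,
    "max_num_batched_tokens": 16384,   # assumed: larger budget favors throughput
    "max_num_seqs": 128,               # assumed
    "max_model_len": 32768,            # assumed
}


def engine_kwargs(model, tier):
    """Merge a tier profile with a model name into engine keyword arguments."""
    return {"model": model, **tier}


interactive = engine_kwargs("your-model-name", INTERACTIVE_TIER)
```

With vLLM installed, a dictionary like this would typically be passed as `LLM(**interactive)`, or expressed as the matching `--max-num-batched-tokens`, `--max-num-seqs`, and `--max-model-len` flags on `vllm serve`. Check the argument names against the version you run.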

When chunked prefill matters most

Chunked prefill pays off when prompt length variance is high and interactive latency matters:

  • RAG pipelines — retrieved chunks plus system prompts routinely 4–20K tokens.
  • Agent tool loops — large JSON observations appended each turn; see tool result compression to reduce need, but chunking handles what remains.
  • Code assistants — multi-file context windows with FIM layouts still produce long prefills.
  • Multi-tenant chat APIs — one tenant’s document upload must not block another tenant’s 50-token question.

It matters less when all prompts are short (<1K tokens), offline batch jobs dominate, or you already run disaggregated prefill clusters with dedicated hardware.

Latency, throughput, and observability

Measure three metrics separately when tuning chunked prefill:

  • TTFT — arrival to first output token; sensitive to prefill queue position and chunk size.
  • Inter-token latency (ITL) — gap between streamed tokens during decode; sensitive to decode starvation from oversized prefills.
  • System throughput — aggregate tokens/s; may dip slightly with chunking due to extra kernel launches.
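The three metrics above can be derived from per-request timestamps. This is a minimal sketch that assumes the gateway records an arrival time and one timestamp per streamed token, in milliseconds. The helper names are hypothetical, and the percentile uses the simple nearest-rank method.

```python
def ttft(arrival_ms, token_times_ms):
    """Time from request arrival to the first streamed token."""
    return token_times_ms[0] - arrival_ms


def inter_token_latencies(token_times_ms):
    """Gaps between consecutive streamed tokens."""
    return [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]


def throughput(total_tokens, wall_seconds):
    """Aggregate tokens per second over a measurement window."""
    return total_tokens / wall_seconds


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


# One request: arrived at t=0, tokens streamed at these times (ms).
times = [680, 700, 725, 900]
first_token = ttft(0, times)
gaps = inter_token_latencies(times)
```

A single large gap in `gaps`, such as the last one here, is the kind of ITL spike that decode starvation produces. Tracking p95 of these gaps separately from TTFT keeps the two effects apart.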

Put per-iteration breakdowns on a dashboard: prefill tokens scheduled, decode tokens scheduled, batch occupancy, and KV block utilization. Spikes where prefill tokens equal max_num_batched_tokens while decode tokens are zero signal monolithic behavior or a chunk size set too high. Pair these with inference observability and load tests that use a realistic length mix (for example, 70% short, 20% medium RAG, 10% long documents).
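The starvation signal described here is easy to flag once per-iteration counts are available. The sketch assumes each iteration is logged as a (prefill_tokens, decode_tokens) pair. That record format and the function name are hypothetical, since engines expose these counters in different ways.

```python
def starved_iterations(iterations, token_budget):
    """Indexes of iterations where prefill filled the budget and no decode ran.

    `iterations` is a list of (prefill_tokens, decode_tokens) pairs.
    """
    return [
        index
        for index, (prefill, decode) in enumerate(iterations)
        if decode == 0 and prefill >= token_budget
    ]


log = [(4088, 8), (4096, 0), (0, 12), (14000, 0)]
flagged = starved_iterations(log, token_budget=4096)
```

A steady stream of flagged iterations during mixed traffic suggests that chunking is off or that the budget is large enough to behave like a monolithic prefill.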

Harbor Analytics RAG gateway refactor

Harbor’s regression after the earnings-transcript RAG launch followed a structured fix:

  1. Reproduce under load — synthetic mix: 60% queries <256 tokens, 30% RAG 8–16K, 10% full transcripts 14K+. Confirmed TTFT inversion: short queries queued behind monolithic prefills.
  2. Enable chunked prefill — vLLM enable_chunked_prefill=True, reduced max_num_batched_tokens from 8192 to 4096 for the interactive tier.
  3. Priority lanes — LiteLLM tags: P0 for <512-token queries, P1 for standard RAG, P2 for batch reports. P0 gets reserved decode budget via gateway concurrency caps per tag.
  4. Context hygiene — reranker top-6 chunks instead of top-12; median RAG prompt dropped from 14K to 7K tokens without recall loss on golden eval.
  5. Validation — p95 TTFT for P0 restored to 720 ms; P1 RAG p95 TTFT 1.4 s (acceptable for document QA); system throughput down 4% vs monolithic prefill but within SLO.
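The priority-lane step can be sketched in plain Python. The P0/P1/P2 thresholds follow the list above. The concurrency caps are invented placeholders, and the classes here are hypothetical helpers rather than LiteLLM configuration or API.

```python
from collections import Counter

# Assumed per-lane concurrency caps; tune from load tests.
LANE_CAPS = {"P0": 64, "P1": 16, "P2": 4}


def lane_for(prompt_tokens, is_batch_report=False):
    """P0 for queries under 512 tokens, P2 for batch reports, P1 otherwise."""
    if is_batch_report:
        return "P2"
    return "P0" if prompt_tokens < 512 else "P1"


class LaneGate:
    """Caps in-flight requests per lane so long prompts cannot crowd out P0."""

    def __init__(self, caps):
        self.caps = dict(caps)
        self.in_flight = Counter()

    def try_admit(self, lane):
        if self.in_flight[lane] >= self.caps[lane]:
            return False  # caller should queue or shed the request
        self.in_flight[lane] += 1
        return True

    def release(self, lane):
        if self.in_flight[lane] > 0:
            self.in_flight[lane] -= 1


gate = LaneGate(LANE_CAPS)
admitted = gate.try_admit(lane_for(180))
```

Capping P1 and P2 at the gateway is what leaves decode budget for P0 inside the engine. The engine itself never sees the lane tags in this sketch.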

The lesson: continuous batching solves padding waste; chunked prefill solves prefill/decode interference inside the batch. Both are required for mixed RAG + chat traffic on shared GPUs.

Technique decision table

Your situation | Prefer | Avoid
Uniform short prompts (<1K), single-tenant | Monolithic prefill; higher max_num_batched_tokens | Chunking overhead for no fairness gain
Mixed RAG + interactive chat on one GPU | Chunked prefill + priority lanes | Unlimited monolithic prefills in shared batch
Median prompt >4K, TTFT SLO <1 s | Smaller chunk budget; context compression upstream | Raising max_num_seqs without chunking
Routine 32K+ prompts, strict TTFT | Disaggregated prefill cluster | Chunking alone on one bandwidth-limited GPU
Offline batch summarization | Large batches, chunking optional | Interactive-tier chunk sizes tuned for batch jobs
Provider API only (no self-host) | Prompt truncation, retrieval limits, async batch for bulk | Assuming you control engine prefill scheduling
Multi-tenant SaaS | Chunked prefill + per-tenant token caps at gateway | One tenant uploading 50K context without admission control

Common pitfalls

  • Enabling continuous batching without chunked prefill on RAG traffic. The classic TTFT regression Harbor hit.
  • Setting max_num_batched_tokens too high. Defeats chunking; behaves like monolithic prefill.
  • Setting chunk budget too low. TTFT on long prompts stretches; throughput drops from kernel launch overhead.
  • Tuning on uniform-length load tests. Hides starvation patterns visible only with long-tail prompts.
  • Ignoring KV block pool limits. More concurrent partial prefills still consume cache blocks.
  • Same config for interactive and batch tiers. Batch jobs want throughput; chat wants fairness.
  • Skipping context compression. Chunking helps scheduling; it does not replace retrieval quality or token budgets.
  • Assuming chunking fixes disaggregated-scale problems. At extreme context lengths, split clusters remain necessary.

Production checklist

  • Enable chunked prefill when median prompt exceeds ~2K tokens or RAG is in production.
  • Set max_num_batched_tokens from load tests with realistic long-tail length mix.
  • Separate interactive and batch inference tiers with different token budgets.
  • Instrument per-iteration prefill vs decode token counts and TTFT by priority lane.
  • Cap per-request prompt length at gateway; reject or summarize oversize uploads.
  • Pair chunking with PagedAttention or equivalent KV block pooling.
  • Validate RAG recall after reducing retrieved chunk count to shrink prefills.
  • Load-test month-end peaks with concurrent long-document and short-query traffic.
  • Document rollback: disable chunking and raise batch tokens for emergency batch-only mode.
  • Re-tune after model or context-length changes (KV bytes per token depend on the model; per-sequence KV footprint grows with context length).
  • Evaluate disaggregated prefill when p95 TTFT breaks SLO despite chunked config.
  • Propagate client cancel so partial prefills release scheduler slots immediately.
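The last checklist item, cancel propagation, can be sketched with asyncio. Here `fake_generate` is a hypothetical stand-in for an engine call, and the `disconnected` event is assumed to be set by the web framework when the client goes away. A real gateway would abort the engine request in the cleanup branch.

```python
import asyncio


async def fake_generate(released: asyncio.Event) -> None:
    """Stand-in for an engine call that is still prefilling a long prompt."""
    try:
        await asyncio.sleep(3600)
    finally:
        # Runs on cancellation too; a real gateway would abort the engine
        # request here so its scheduler slot and KV blocks are freed.
        released.set()


async def serve(disconnected: asyncio.Event) -> bool:
    """Race generation against client disconnect and cancel whichever loses."""
    released = asyncio.Event()
    generation = asyncio.create_task(fake_generate(released))
    hangup = asyncio.create_task(disconnected.wait())
    _, pending = await asyncio.wait(
        {generation, hangup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Wait for cancelled tasks to finish their cleanup before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return released.is_set()
```

The design choice is to wait for the cancelled task's cleanup rather than fire and forget, so the slot is known to be released before the handler returns.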

Key takeaways

  • Chunked prefill splits long prompts across iterations so decode work for other sequences is not starved.
  • max_num_batched_tokens is the primary chunk ceiling; it must be tuned with workload variance in mind.
  • Continuous batching eliminates padding waste; chunked prefill eliminates prefill head-of-line blocking inside the batch.
  • Harbor Analytics restored sub-second TTFT for short queries by chunking prefills and tagging priority lanes after RAG launched.
  • At extreme context lengths, graduate from chunking to disaggregated prefill/decode clusters.
