LLM chunked prefill explained
Harbor Analytics turned on continuous batching in vLLM and cut p95 time-to-first-token (TTFT) from 4.2 s to 680 ms for short analyst queries. Two weeks later, a new RAG pipeline shipped 14K-token earnings-call transcripts as system context. TTFT for those long requests was fine, but p95 TTFT for “quick lookup” chats under 200 tokens jumped to 2.8 s. GPU utilization looked healthy. The culprit was not the model; it was monolithic prefill: one long prompt consumed the entire per-iteration token budget and starved decode slots for everyone else in the batch.
Chunked prefill splits long prompt processing across multiple forward
passes, interleaving partial prefills with ongoing decode steps for other sequences.
It is the standard fix in vLLM, TensorRT-LLM, and SGLang when RAG, agent tool dumps,
or document QA push median prompt length past a few thousand tokens. This guide covers
prefill vs decode compute profiles, how chunked scheduling interacts with
max_num_batched_tokens, configuration knobs, the Harbor Analytics RAG
gateway refactor, a technique decision table, pitfalls, and a production checklist.
What chunked prefill decides
Autoregressive inference has two phases. Prefill ingests the prompt: one or more forward passes compute attention over all prompt tokens at once — highly parallel, compute-bound. Decode generates output one token per step per sequence — memory-bandwidth-bound because each step reads full weights and an ever-growing KV cache.
In a continuous batch, the scheduler runs one forward pass per “iteration” that may include multiple sequences. Without chunking, a single 20K-token prefill can occupy the entire iteration — every other sequence waits, even if they only need one decode token. Chunked prefill caps how many new prompt tokens any sequence may process per iteration, leaving budget for decode work on other slots.
- Fairness — short chats keep getting tokens while a long RAG prompt prefills in the background across iterations.
- Predictable TTFT — new requests join the batch without waiting for one monster prefill to finish monolithically.
- Stable mixed batches — prefill and decode coexist in the same kernel launch without head-of-line blocking.
The trade-off is slightly higher total prefill cost (more kernel launches, and each later chunk re-reads the KV cache written by earlier chunks) and marginally longer absolute TTFT on very long prompts compared to a dedicated prefill GPU. That cost is usually acceptable on single-GPU or cost-constrained clusters.
Monolithic prefill vs chunked prefill
Monolithic (single-pass) prefill
Process the entire prompt in one forward pass before any decode begins. Optimal when only one sequence is active or prompts are uniformly short. In multi-tenant APIs with variable lengths, one 32K RAG context can freeze the batch for tens of milliseconds per layer — multiplied across queue depth, users perceive multi-second TTFT spikes.
Chunked prefill
Split the prompt into segments of at most max_num_batched_tokens (or a
dedicated chunk size) per iteration. After each chunk, the scheduler may schedule
decode tokens for other sequences, then resume prefill on the long prompt. vLLM
exposes this via enable_chunked_prefill=True (default in recent
versions) paired with max_num_batched_tokens.
Disaggregated prefill/decode
When prompts routinely exceed 16–32K tokens and TTFT SLOs still break despite chunking, split workloads across GPU pools: prefill nodes optimized for compute, decode nodes for bandwidth. See prefill/decode disaggregation for cluster topology. Chunked prefill is the lighter-weight fix; disaggregation is the scale-out step.
Token budgets and scheduler interaction
Chunked prefill is not a separate subsystem — it is a policy inside the iteration scheduler. The critical knobs in vLLM and similar engines:
- max_num_batched_tokens: upper bound on total tokens (prefill + decode) scheduled in one forward pass. Acts as the chunk ceiling.
- max_num_seqs: max concurrent sequences; interacts with KV block pool size from PagedAttention.
- enable_chunked_prefill: toggles splitting long prefills across iterations.
- max_model_len: hard cap on prompt + completion; must fit in KV memory regardless of chunking.
Example: max_num_batched_tokens=4096 with chunked prefill enabled.
A 12K-token RAG prompt prefills in roughly three iterations (~4K tokens each).
Between chunks, the scheduler can run decode steps for eight short chats already
in the batch. Without chunking, those eight chats wait until all 12K prompt tokens
complete in one pass.
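The arithmetic in this example can be sketched as a toy scheduler. This is an illustration only, not vLLM's scheduler: it assumes a 12,000-token prompt, that each of the eight chats needs exactly one decode token per iteration, and that decode tokens are served before prefill tokens. The function name `schedule` and its parameters are made up for this sketch.

```python
def schedule(prompt_tokens: int, decode_seqs: int, budget: int, chunked: bool):
    """Return one (prefill_tokens, decode_tokens) pair per iteration.

    Toy model: every decoding sequence needs one token per iteration and a
    single long prompt takes whatever token budget is left over.
    """
    if chunked and budget <= decode_seqs:
        raise ValueError("budget must leave room for at least one prefill token")
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        if chunked:
            decode = decode_seqs                       # decodes are served first
            prefill = min(remaining, budget - decode)  # prompt gets the rest
        else:
            decode = 0                                 # decodes wait this iteration
            prefill = remaining                        # whole prompt in one pass
        plan.append((prefill, decode))
        remaining -= prefill
    return plan


if __name__ == "__main__":
    for chunked in (False, True):
        plan = schedule(12_000, decode_seqs=8, budget=4096, chunked=chunked)
        print("chunked" if chunked else "monolithic", plan)
```

Under these assumptions the monolithic plan is a single iteration of 12,000 prefill tokens with zero decode tokens, while the chunked plan is three iterations of at most 4,088 prefill tokens, each carrying all eight decode tokens. Because decode tokens share the budget, a prompt only slightly longer than three full chunks would need a fourth iteration.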
Tuning is workload-specific. RAG-heavy traffic wants lower per-iteration prefill
share (smaller effective chunks or priority lanes). Batch summarization overnight
may disable chunking or raise max_num_batched_tokens to maximize
throughput when TTFT does not matter.
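One way to keep the two tunings apart is a per-tier profile of engine arguments. The sketch below uses the four vLLM knobs named above; the tier names, the helper functions, and every number except the interactive tier's 4096 budget are illustrative assumptions that should be replaced with values from load tests.

```python
# Hypothetical tier profiles. The keys are vLLM engine arguments; the values
# are placeholders for illustration, not recommendations.
TIER_ENGINE_ARGS = {
    "interactive": {
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 4096,   # small budget favors fairness
        "max_num_seqs": 64,
        "max_model_len": 32768,
    },
    "batch": {
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 16384,  # large budget favors throughput
        "max_num_seqs": 256,
        "max_model_len": 32768,
    },
}


def build_engine(model: str, tier: str):
    """Construct an offline vLLM engine for one tier."""
    from vllm import LLM  # imported lazily; requires vLLM to be installed

    return LLM(model=model, **TIER_ENGINE_ARGS[tier])


def serve_command(model: str, tier: str) -> list[str]:
    """Build the matching `vllm serve` command line for one tier."""
    args = TIER_ENGINE_ARGS[tier]
    cmd = ["vllm", "serve", model]
    if args["enable_chunked_prefill"]:
        cmd.append("--enable-chunked-prefill")
    cmd += ["--max-num-batched-tokens", str(args["max_num_batched_tokens"])]
    cmd += ["--max-num-seqs", str(args["max_num_seqs"])]
    cmd += ["--max-model-len", str(args["max_model_len"])]
    return cmd
```

Keeping the profiles in one table makes the interactive and batch settings easy to diff and to roll back independently.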
When chunked prefill matters most
Chunked prefill pays off when prompt length variance is high and interactive latency matters:
- RAG pipelines — retrieved chunks plus system prompts routinely 4–20K tokens.
- Agent tool loops — large JSON observations appended each turn; see tool result compression to reduce need, but chunking handles what remains.
- Code assistants — multi-file context windows with FIM layouts still produce long prefills.
- Multi-tenant chat APIs — one tenant’s document upload must not block another tenant’s 50-token question.
It matters less when all prompts are short (<1K tokens), offline batch jobs dominate, or you already run disaggregated prefill clusters with dedicated hardware.
Latency, throughput, and observability
Measure three metrics separately when tuning chunked prefill:
- TTFT — arrival to first output token; sensitive to prefill queue position and chunk size.
- Inter-token latency (ITL) — gap between streamed tokens during decode; sensitive to decode starvation from oversized prefills.
- System throughput — aggregate tokens/s; may dip slightly with chunking due to extra kernel launches.
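These three metrics can be derived from timestamps the gateway already has. The following is a minimal sketch, assuming each request records its arrival time and the wall-clock time of every streamed token in seconds; the function names are hypothetical, not part of any serving library.

```python
from statistics import quantiles


def ttft(arrival: float, token_times: list[float]) -> float:
    """Seconds from request arrival to the first output token."""
    return token_times[0] - arrival


def inter_token_latencies(token_times: list[float]) -> list[float]:
    """Gaps between consecutive streamed tokens of one request."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def throughput(total_tokens: int, wall_seconds: float) -> float:
    """Aggregate tokens per second over a measurement window."""
    return total_tokens / wall_seconds


def p95(values: list[float]) -> float:
    """95th percentile of a sample; needs at least two values."""
    # n=20 yields 19 cut points at 5% steps, so the last one is p95.
    return quantiles(values, n=20, method="inclusive")[-1]
```

Reporting `p95` over TTFT and over the pooled inter-token gaps separately keeps a decode-starvation regression from hiding behind a healthy throughput number.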
Chart per-iteration breakdowns on a dashboard: prefill tokens scheduled, decode tokens scheduled,
batch occupancy, and KV block utilization. Spikes where prefill tokens equal
max_num_batched_tokens while decode tokens are zero signal monolithic
behavior or a chunk size set too high. Pair these dashboards with
inference observability
and test against realistic load mixes (for example 70% short, 20% medium RAG, 10% long documents).
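The starvation signal and the load mix are both easy to script. The sketch below assumes per-iteration token counts have already been exported from the engine's logs or metrics into simple records; the `IterationStats` type, the function names, and the class labels are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class IterationStats:
    prefill_tokens: int  # prompt tokens scheduled in this iteration
    decode_tokens: int   # decode tokens scheduled in this iteration


def starved_iterations(stats: list[IterationStats], budget: int) -> list[int]:
    """Indices of iterations where prefill filled the budget and no decode ran."""
    return [
        i for i, s in enumerate(stats)
        if s.prefill_tokens >= budget and s.decode_tokens == 0
    ]


def sample_load_mix(n: int, seed: int = 0) -> list[str]:
    """Draw n request classes with a 70/20/10 short/medium/long split."""
    rng = random.Random(seed)  # seeded so load tests are repeatable
    return rng.choices(["short", "medium_rag", "long_doc"], weights=[70, 20, 10], k=n)
```

Alerting on the share of starved iterations rather than on any single one avoids paging on the occasional iteration that simply had no decode work queued.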
Harbor Analytics RAG gateway refactor
Harbor’s regression after the earnings-transcript RAG launch followed a structured fix:
- Reproduce under load — synthetic mix: 60% queries <256 tokens, 30% RAG 8–16K, 10% full transcripts 14K+. Confirmed TTFT inversion: short queries queued behind monolithic prefills.
- Enable chunked prefill: vLLM enable_chunked_prefill=True, reduced max_num_batched_tokens from 8192 to 4096 for the interactive tier.
- Priority lanes: LiteLLM tags P0 for <512-token queries, P1 for standard RAG, P2 for batch reports. P0 gets reserved decode budget via gateway concurrency caps per tag.
- Context hygiene — reranker top-6 chunks instead of top-12; median RAG prompt dropped from 14K to 7K tokens without recall loss on golden eval.
- Validation — p95 TTFT for P0 restored to 720 ms; P1 RAG p95 TTFT 1.4 s (acceptable for document QA); system throughput down 4% vs monolithic prefill but within SLO.
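The priority-lane step can be sketched as a tagging rule plus a per-lane concurrency cap. This is not LiteLLM's API or Harbor's actual code: the 512-token P0 threshold comes from the list above, while the cap values, the `is_batch_report` flag, and the `LaneLimiter` class are assumptions made for illustration.

```python
# Hypothetical caps on in-flight requests per lane. Tight caps on P1 and P2
# limit how many long prefills compete with P0 at once.
LANE_CAPS = {"P0": 32, "P1": 8, "P2": 2}


def assign_lane(prompt_tokens: int, is_batch_report: bool = False) -> str:
    """Tag a request: P2 for batch reports, P0 under 512 tokens, else P1."""
    if is_batch_report:
        return "P2"
    return "P0" if prompt_tokens < 512 else "P1"


class LaneLimiter:
    """Counts in-flight requests per lane and refuses work above the cap."""

    def __init__(self, caps: dict[str, int]):
        self._caps = dict(caps)
        self._in_flight = {lane: 0 for lane in caps}

    def try_acquire(self, lane: str) -> bool:
        """Reserve a slot in the lane; return False if the lane is full."""
        if self._in_flight[lane] >= self._caps[lane]:
            return False
        self._in_flight[lane] += 1
        return True

    def release(self, lane: str) -> None:
        """Free a slot when the request finishes or is cancelled."""
        self._in_flight[lane] = max(0, self._in_flight[lane] - 1)
```

A gateway would call `try_acquire(assign_lane(...))` before forwarding a request and `release` in a `finally` block, queueing or rejecting requests whose lane is full.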
The lesson: continuous batching solves padding waste; chunked prefill solves prefill/decode interference inside the batch. Both are required for mixed RAG + chat traffic on shared GPUs.
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| Uniform short prompts (<1K), single-tenant | Monolithic prefill; higher max_num_batched_tokens | Chunking overhead for no fairness gain |
| Mixed RAG + interactive chat on one GPU | Chunked prefill + priority lanes | Unlimited monolithic prefills in shared batch |
| Median prompt >4K, TTFT SLO <1 s | Smaller chunk budget; context compression upstream | Raising max_num_seqs without chunking |
| Routine 32K+ prompts, strict TTFT | Disaggregated prefill cluster | Chunking alone on one bandwidth-limited GPU |
| Offline batch summarization | Large batches, chunking optional | Interactive-tier chunk sizes tuned for batch jobs |
| Provider API only (no self-host) | Prompt truncation, retrieval limits, async batch for bulk | Assuming you control engine prefill scheduling |
| Multi-tenant SaaS | Chunked prefill + per-tenant token caps at gateway | One tenant uploading 50K context without admission control |
Common pitfalls
- Enabling continuous batching without chunked prefill on RAG traffic. The classic TTFT regression Harbor hit.
- Setting max_num_batched_tokens too high. Defeats chunking; behaves like monolithic prefill.
- Setting chunk budget too low. TTFT on long prompts stretches; throughput drops from kernel launch overhead.
- Tuning on uniform-length load tests. Hides starvation patterns visible only with long-tail prompts.
- Ignoring KV block pool limits. More concurrent partial prefills still consume cache blocks.
- Same config for interactive and batch tiers. Batch jobs want throughput; chat wants fairness.
- Skipping context compression. Chunking helps scheduling; it does not replace retrieval quality or token budgets.
- Assuming chunking fixes disaggregated-scale problems. At extreme context lengths, split clusters remain necessary.
Production checklist
- Enable chunked prefill when median prompt exceeds ~2K tokens or RAG is in production.
- Set max_num_batched_tokens from load tests with realistic long-tail length mix.
- Separate interactive and batch inference tiers with different token budgets.
- Instrument per-iteration prefill vs decode token counts and TTFT by priority lane.
- Cap per-request prompt length at gateway; reject or summarize oversize uploads.
- Pair chunking with PagedAttention or equivalent KV block pooling.
- Validate RAG recall after reducing retrieved chunk count to shrink prefills.
- Load-test month-end peaks with concurrent long-document and short-query traffic.
- Document rollback: disable chunking and raise batch tokens for emergency batch-only mode.
- Re-tune after model or context-length changes (KV bytes per token and the maximum per-sequence KV footprint shift).
- Evaluate disaggregated prefill when p95 TTFT breaks SLO despite chunked config.
- Propagate client cancel so partial prefills release scheduler slots immediately.
Key takeaways
- Chunked prefill splits long prompts across iterations so decode work for other sequences is not starved.
- max_num_batched_tokens is the primary chunk ceiling; it must be tuned with workload variance in mind.
- Continuous batching eliminates padding waste; chunked prefill eliminates prefill head-of-line blocking inside the batch.
- Harbor Analytics restored sub-second TTFT for short queries by chunking prefills and tagging priority lanes after RAG launched.
- At extreme context lengths, graduate from chunking to disaggregated prefill/decode clusters.
Related reading
- LLM continuous batching explained — iteration-level scheduling and mixed batches
- vLLM fundamentals explained — PagedAttention, API server, and parallelism
- LLM prefill/decode disaggregation explained — when to split GPU pools
- LLM KV cache explained — memory growth across prefill chunks