Guide
LLM prefill-decode disaggregation explained
Harbor Support's tier-1 chat gateway ran a 70B model on eight A100s with vLLM continuous batching. Short prompts felt snappy, but RAG tickets with 12K-token context routinely queued behind decode-heavy chats, and P99 time-to-first-token (TTFT) spiked to 4.2 seconds even though decode throughput was healthy. The team had already tuned speculative decoding and GQA; the remaining bottleneck was architectural: prefill and decode share one GPU pool but want opposite hardware profiles. Prefill-decode disaggregation splits them into a prefill fleet optimized for compute-bound matrix work and a decode fleet optimized for memory-bandwidth-bound autoregression, with KV cache tensors handed off between pools. After the split, P99 TTFT on long prompts dropped 31% and sustained decode tok/s rose 18% on the same GPU count.
Prefill-decode disaggregation is an inference architecture that runs prompt ingestion and token generation on separate worker pools, transferring the populated KV cache between them so each phase can scale independently.
This guide covers why the phases differ, colocated vs split serving, KV transfer mechanics, scheduling and load balancing, framework support, the Harbor Support refactor, an architecture decision table, pitfalls, and a production checklist.
Why prefill and decode want different GPUs
LLM inference has two distinct phases, described in depth in our KV cache guide. Prefill processes the entire prompt in parallel: matrix dimensions are large, tensor cores stay saturated, and the workload is compute-bound. Decode generates one token at a time per sequence: each step is a skinny matmul with a single new token per sequence, and the GPU spends most of its time streaming model weights and KV tensors from HBM, so the workload is memory-bandwidth-bound.
The interference problem
When both phases share a GPU pool, they interfere. A burst of long-prefill RAG requests monopolizes SM cycles and evicts decode batches from cache, inflating inter-token latency for everyone. Conversely, a decode-heavy load leaves tensor cores idle during gaps between token steps — wasted compute if prefill jobs are queued. Colocated serving is simple and works when prompt lengths are uniform and traffic is moderate. Disaggregation pays off when your workload mixes long prefill (documents, agent tool outputs) with chatty decode (multi-turn support threads).
Resource sizing intuition
A rough starting ratio for mixed chat workloads: allocate 30–40% of GPUs to prefill and 60–70% to decode. Document-heavy APIs may flip toward 50/50. Monitor queue depth per pool and adjust weekly — the ratio is not static across product launches.
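The starting ratio can be turned into whole-GPU counts with a few lines of arithmetic. The sketch below is illustrative only: `split_pools` is a hypothetical helper, and it ignores practical constraints such as whether the model's tensor-parallel layout fits on the resulting pool sizes.

```python
def split_pools(total_gpus: int, prefill_fraction: float) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) for a target prefill share."""
    if total_gpus < 2:
        raise ValueError("need at least one GPU per pool")
    if not 0.0 < prefill_fraction < 1.0:
        raise ValueError("prefill_fraction must be strictly between 0 and 1")
    # Round to the nearest whole GPU, then keep at least one GPU in each pool.
    prefill = round(total_gpus * prefill_fraction)
    prefill = max(1, min(total_gpus - 1, prefill))
    return prefill, total_gpus - prefill


if __name__ == "__main__":
    print(split_pools(8, 0.35))  # mixed chat starting point
    print(split_pools(8, 0.50))  # document-heavy starting point
```

Treat the result as a first guess to be corrected by the per-pool queue-depth metrics, not as a sizing rule.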
Disaggregated architecture
A disaggregated stack has three logical components:
- Router / scheduler — accepts requests, assigns a prefill worker, then routes the completed KV state to a decode worker.
- Prefill workers — run the forward pass over the prompt, populate per-layer K and V tensors, and serialize or DMA them to decode.
- Decode workers — receive KV state, attach to their own PagedAttention blocks, and run autoregressive generation (optionally with speculative decoding).
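The three components above can be sketched as a toy, in-process request flow. Everything here is hypothetical (class names, round-robin placement, placeholder token ids); it only shows the order of operations, not how any particular framework implements them.

```python
from dataclasses import dataclass, field


@dataclass
class KVState:
    request_id: str
    num_tokens: int  # prompt tokens whose K/V entries were computed


@dataclass
class PrefillWorker:
    name: str

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVState:
        # Stand-in for the real forward pass over the whole prompt.
        return KVState(request_id, len(prompt_tokens))


@dataclass
class DecodeWorker:
    name: str
    attached: dict[str, KVState] = field(default_factory=dict)

    def attach(self, kv: KVState) -> None:
        # Stand-in for mapping received KV tensors into local cache blocks.
        self.attached[kv.request_id] = kv

    def decode(self, request_id: str, max_new_tokens: int) -> list[int]:
        kv = self.attached.pop(request_id)
        # Placeholder token ids; a real worker would sample from the model.
        return list(range(kv.num_tokens, kv.num_tokens + max_new_tokens))


class Router:
    def __init__(self, prefill_pool: list[PrefillWorker], decode_pool: list[DecodeWorker]):
        self.prefill_pool = prefill_pool
        self.decode_pool = decode_pool
        self._next = 0

    def handle(self, request_id: str, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
        # Round-robin placement, purely for illustration.
        p = self.prefill_pool[self._next % len(self.prefill_pool)]
        d = self.decode_pool[self._next % len(self.decode_pool)]
        self._next += 1
        kv = p.prefill(request_id, prompt_tokens)     # stage 1: prefill
        d.attach(kv)                                  # KV handoff
        return d.decode(request_id, max_new_tokens)   # stage 2: decode
```

In a real deployment the handoff is a network or NVLink transfer and the two stages run asynchronously; the synchronous call chain here is a simplification.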
KV cache handoff
The expensive part of disaggregation is moving KV tensors between machines. For a 70B GQA model at FP16, KV cache size scales roughly as:
kv_bytes ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_elem
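The estimate above can be checked numerically. The sketch below plugs in the published Llama-3-70B configuration (80 layers, 8 KV heads under GQA, head dimension 128) at FP16; the function name is an illustrative choice.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence; the leading 2 counts both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Llama-3-70B: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes).
    per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
    ctx_16k = kv_cache_bytes(80, 8, 128, seq_len=16_384)
    print(per_token)           # 327680 bytes, i.e. 320 KiB per token
    print(ctx_16k / 2**30)     # 5.0 GiB for a 16K-token context
```

Because the size is linear in `seq_len`, the per-token figure is the number worth remembering when budgeting handoff bandwidth.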
At FP16, Llama-3-70B (80 layers, 8 KV heads, head dimension 128) stores about 320 KiB of KV data per token, so a 16K-token context comes to roughly 5 GiB per request. Transfer options, fastest to slowest:
- NVLink / NVSwitch within a node — 300–900 GB/s; handoff latency in single-digit milliseconds for typical contexts.
- RDMA over InfiniBand across nodes — 100–400 Gb/s per NIC; add 5–20 ms for multi-node clusters depending on size.
- TCP over Ethernet — workable for prototyping; production clusters at Harbor scale moved off TCP after P99 handoff exceeded 80 ms.
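A quick way to sanity-check a link against a handoff budget is a bandwidth-only estimate. The helper names below are hypothetical, and the result is a lower bound for a single serial transfer: it ignores protocol overhead and serialization on one side, and on the other side any layer-by-layer overlap with prefill or striping across several NICs, which can hide much of the transfer from the critical path.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a NIC rating in Gb/s to GB/s."""
    return gbit_per_s / 8


def transfer_ms(kv_bytes: int, link_gbytes_per_s: float) -> float:
    """Bandwidth-only transfer time in milliseconds (1 GB = 1e9 bytes)."""
    if link_gbytes_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return kv_bytes / (link_gbytes_per_s * 1e9) * 1e3


if __name__ == "__main__":
    kv = 1_000_000_000  # 1 GB of KV data, an arbitrary example size
    print(transfer_ms(kv, 600))                 # NVLink-class link
    print(transfer_ms(kv, gbit_to_gbyte(400)))  # one 400 Gb/s RDMA NIC
```

Measured handoff latency should always take precedence over this estimate; the calculation is mainly useful for ruling out links that cannot meet the budget even in theory.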
Some frameworks support prefix-aware routing: if two requests share an identical system prompt, the prefill worker computes KV once and broadcasts to multiple decode workers — a natural extension of prompt caching.
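One simple way to picture prefix-aware routing is a map from a hash of the shared prefix to the worker that already holds its KV. This is a minimal sketch under stated assumptions: `PrefixRouter` and `prefix_key` are hypothetical, the prefix length is fixed, and there is no eviction. Production routers typically match variable-length prefixes with a tree structure instead.

```python
import hashlib


def prefix_key(token_ids: list[int], prefix_len: int) -> str:
    """Hash the first prefix_len token ids into a stable lookup key."""
    head = token_ids[:prefix_len]
    raw = ",".join(map(str, head)).encode()
    return hashlib.sha256(raw).hexdigest()


class PrefixRouter:
    def __init__(self, prefill_workers: list[str], prefix_len: int):
        self.workers = prefill_workers
        self.prefix_len = prefix_len
        self.owner: dict[str, str] = {}  # prefix key -> worker holding its KV
        self._rr = 0

    def route(self, token_ids: list[int]) -> tuple[str, bool]:
        """Return (worker, cache_hit) for a tokenized prompt."""
        key = prefix_key(token_ids, self.prefix_len)
        if key in self.owner:
            return self.owner[key], True
        # First time this prefix is seen: place it round-robin and remember it.
        worker = self.workers[self._rr % len(self.workers)]
        self._rr += 1
        self.owner[key] = worker
        return worker, False
```

A hit means the worker can skip recomputing the shared prefix and only prefill the request-specific suffix.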
Scheduling and load balancing
Disaggregation introduces a two-stage queue. Poor scheduling erases the gains.
Prefill scheduling
Batch long prompts together to maximize tensor-core utilization. Chunked prefill (processing very long contexts in segments) reduces peak memory but adds coordination overhead — enable only when single-pass prefill OOMs.
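Chunked prefill can be pictured as feeding the prompt in fixed-size segments while tracking how many tokens are already cached. In this sketch `run_chunk` is a hypothetical callback standing in for the model's forward pass; real frameworks handle the chunking internally.

```python
from typing import Callable


def chunk_prompt(token_ids: list[int], chunk_size: int) -> list[list[int]]:
    """Split a prompt into consecutive segments of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def chunked_prefill(token_ids: list[int], chunk_size: int,
                    run_chunk: Callable[[list[int], int], None]) -> int:
    """Process chunks in order and return the number of tokens prefilled."""
    past_len = 0
    for chunk in chunk_prompt(token_ids, chunk_size):
        # Each chunk attends to the past_len tokens already in the KV cache.
        run_chunk(chunk, past_len)
        past_len += len(chunk)
    return past_len
```

Smaller chunks lower peak activation memory but add more scheduling steps, which is the coordination overhead mentioned above.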
Decode scheduling
Continuous batching still applies on decode workers. The scheduler should reserve decode capacity before starting prefill when TTFT SLOs are tight — otherwise a finished prefill waits while decode queues drain.
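Reserving decode capacity before prefill starts can be sketched as a small slot tracker that the router consults at admission time. `DecodeSlots` and `admit` are hypothetical names, and one slot per request is a simplifying assumption; a real scheduler would reserve KV blocks sized to the expected context.

```python
from typing import Callable


class DecodeSlots:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reserved: set[str] = set()

    def try_reserve(self, request_id: str) -> bool:
        if request_id in self.reserved:
            return True
        if len(self.reserved) >= self.capacity:
            return False
        self.reserved.add(request_id)
        return True

    def release(self, request_id: str) -> None:
        self.reserved.discard(request_id)


def admit(request_id: str, slots: DecodeSlots,
          start_prefill: Callable[[str], None]) -> bool:
    """Start prefill only if a decode slot is already held for the request."""
    if not slots.try_reserve(request_id):
        return False  # caller keeps the request queued
    try:
        start_prefill(request_id)
    except Exception:
        slots.release(request_id)  # do not leak the reservation on failure
        raise
    return True
```

The caller is expected to call `release` when decoding finishes or the request is cancelled.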
Back-pressure and cancellation
If decode pools are saturated, the router should delay accepting new prefills rather than completing prefills that cannot decode. Client-side timeouts must cancel both stages atomically; orphaned KV blocks leak GPU memory on decode nodes.
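The cancellation requirement can be illustrated with a tracker that frees blocks in both pools under one lock and refuses to attach KV for a request that was cancelled mid-handoff. This is a single-process sketch with hypothetical names (`KVPool`, `RequestTracker`); across machines the same guarantee needs a coordination protocol that this example does not cover.

```python
import threading


class KVPool:
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[str, int] = {}  # request id -> allocated block count

    def allocate(self, request_id: str, num_blocks: int) -> None:
        self.blocks[request_id] = num_blocks

    def free(self, request_id: str) -> int:
        return self.blocks.pop(request_id, 0)


class RequestTracker:
    def __init__(self, prefill_pool: KVPool, decode_pool: KVPool):
        self._lock = threading.Lock()
        self.prefill_pool = prefill_pool
        self.decode_pool = decode_pool
        self.cancelled: set[str] = set()

    def cancel(self, request_id: str) -> int:
        """Free blocks in both pools in one critical section; return blocks freed."""
        with self._lock:
            self.cancelled.add(request_id)
            return self.prefill_pool.free(request_id) + self.decode_pool.free(request_id)

    def on_handoff(self, request_id: str, num_blocks: int) -> bool:
        """Called when prefill finishes; skips the attach for cancelled requests."""
        with self._lock:
            self.prefill_pool.free(request_id)
            if request_id in self.cancelled:
                return False
            self.decode_pool.allocate(request_id, num_blocks)
            return True
```

Checking the cancelled set inside the same lock as the allocation is what prevents a late handoff from recreating blocks for a request the client has already abandoned.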
Framework support
Disaggregated serving moved from research papers to production frameworks in 2024–2025:
- SGLang — native prefill-decode disaggregation with RadixAttention prefix sharing; popular for multi-node clusters.
- vLLM — disaggregated prefill support via external connectors; pairs with existing PagedAttention and speculative decode paths.
- TensorRT-LLM — inflight batching with optional split serving on NVIDIA Triton; strong on single-vendor H100 deployments.
- DistServe / Splitwise — academic reference designs that influenced production implementations; useful for capacity-planning models.
All assume identical model weights on both pools (or weight streaming from a shared object store). Quantization level and tokenizer must match exactly or KV tensors will be invalid.
Harbor Support gateway refactor
Harbor Support's RAG pipeline injects retrieved chunks into a 70B instruct model. Before disaggregation, a single eight-GPU vLLM pool served 40 concurrent chats. Long-ticket prefills (median 9K tokens) caused decode batches to stall; short chats saw elevated inter-token gaps during RAG spikes.
Changes made
- Split into three prefill GPUs and five decode GPUs on one NVLink domain.
- Deployed SGLang router with prefix-aware routing for the shared 2K-token system prompt used across all tiers.
- Reserved one decode GPU's worth of batch slots before launching prefill on tickets flagged > 6K tokens.
- Added handoff latency histograms; alert when P95 KV transfer > 25 ms.
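The P95 handoff alert in the last item can be sketched with a nearest-rank percentile over recent samples. The function names are hypothetical and the 25 ms default mirrors the threshold above; a production setup would more likely compute this from histogram buckets in the metrics system.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def handoff_alert(latencies_ms: list[float], threshold_ms: float = 25.0) -> bool:
    """True when P95 KV transfer latency exceeds the threshold."""
    return percentile(latencies_ms, 95) > threshold_ms
```

With 20 samples, a single slow transfer does not trip the alert, but two do, which is the behavior a P95 threshold is meant to give.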
Results
P99 TTFT on > 6K-token tickets: 4.2 s → 2.9 s (−31%). Sustained decode throughput: 19 → 22.4 tok/s (+18%). No additional GPUs. Short-chat TTFT unchanged (< 200 ms) because the router still colocates tiny prefills on decode nodes when prefill queue depth is zero.
Architecture decision table
| Approach | Best when | Tradeoff |
|---|---|---|
| Colocated (single pool) | Uniform prompt lengths, < 20 concurrent users, prototype stage | Simplest ops; long prefills starve decode |
| Disaggregated (split pools) | Mixed short chat + long RAG/agent context, production SLOs | KV handoff latency; dual pool monitoring |
| Prefill-only burst scaling | Batch document summarization with rare decode | Scale prefill horizontally; decode pool can be tiny |
| Speculative decode (colocated) | Decode-bound, aligned draft model available | Does not fix prefill-decode interference alone |
| Disaggregation + speculative decode | High-traffic production chat at scale | Maximum complexity; best throughput per dollar at scale |
Common pitfalls
- Ignoring handoff cost — on 4K-token contexts over TCP, transfer can exceed prefill savings; profile before committing.
- Mismatched quantization — FP8 prefill with FP16 decode produces garbage logits; enforce one precision profile per deployment.
- Orphaned KV blocks — client disconnect after prefill but before decode must free PagedAttention slots on both pools.
- Static GPU ratios — launching a document product without shifting prefill capacity causes decode GPUs to idle.
- Skipping prefix routing — identical system prompts recomputed on every request waste prefill cycles disaggregation was meant to optimize.
- Disaggregating too early — below ~4 GPUs total, router overhead and handoff latency often exceed interference costs.
Production checklist
- Profile prefill vs decode time split on production prompt-length distribution.
- Measure KV bytes per request at P50 and P99 context lengths.
- Benchmark handoff latency on your network (NVLink, RDMA, or TCP) before splitting pools.
- Deploy separate queue-depth metrics and alerts for prefill and decode workers.
- Implement atomic cancel that frees KV on both pools when clients time out.
- Enable prefix-aware routing for shared system prompts and tool schemas.
- Reserve decode capacity before long prefills when TTFT SLOs are strict.
- Match tokenizer, quantization, and model revision across all pools.
- Re-evaluate GPU ratio weekly against traffic mix changes, and again after each product launch.
- Document fallback to colocated mode if handoff error rate spikes.
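The checklist item on matching tokenizer, quantization, and model revision lends itself to a startup check. The sketch below compares two hypothetical `PoolProfile` records field by field; the field names and example values are assumptions, not any framework's configuration schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PoolProfile:
    model_revision: str
    tokenizer: str
    kv_dtype: str
    weight_quant: str


def profile_mismatches(prefill: PoolProfile, decode: PoolProfile) -> list[str]:
    """Return the names of fields that differ between the two pools."""
    a, b = asdict(prefill), asdict(decode)
    return [name for name in a if a[name] != b[name]]


if __name__ == "__main__":
    prefill = PoolProfile("rev-a", "tok-v1", "fp16", "fp16")
    decode = PoolProfile("rev-a", "tok-v1", "fp8", "fp16")
    print(profile_mismatches(prefill, decode))  # lists the differing field names
```

Refusing to start the router when the list is non-empty is cheaper than debugging invalid KV tensors after the fact.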
Key takeaways
- Prefill is compute-bound; decode is memory-bandwidth-bound — sharing one GPU pool creates interference under mixed workloads.
- Disaggregation splits phases onto separate fleets and hands off KV cache tensors between them.
- Handoff latency over NVLink or RDMA must be budgeted; TCP-only clusters may not benefit.
- Scheduling matters as much as hardware split — reserve decode slots and enable prefix routing.
- Pair with speculative decoding on decode pools for maximum throughput at production scale.
Related reading
- LLM KV cache explained — prefill vs decode and cache memory scaling
- vLLM fundamentals explained — PagedAttention and continuous batching
- Speculative decoding explained — speeding decode on disaggregated pools
- LLM prompt caching explained — prefix reuse across requests