LLM pipeline parallelism for inference explained
Harbor Analytics runs a nightly batch job that summarizes 40,000 support tickets through a 405B-class model. Each ticket is independent, per-document latency matters less than total wall-clock throughput, and the model cannot fit on fewer than eight H100s even with FP8 weights. The team first deployed with tensor parallelism (TP) across all eight GPUs. Decode worked, but sustained tokens per second plateaued: every token step triggered all-reduce collectives across the full mesh, while the skinny decode matmuls kept only a fraction of each GPU's compute busy. Switching to pipeline parallelism (PP), with four stages of ~25 transformer blocks each, every stage tensor-parallel across two GPUs, and micro-batches of 16 sequences fed through continuous batching, raised nightly throughput 2.1× on the same hardware. Interactive chat on the same PP cluster was unusable: pipeline bubbles at batch depth one added 180 ms per token. PP is one parallelism mode for inference, not a universal upgrade.
Pipeline parallelism assigns contiguous layer ranges to different GPUs. Activations flow stage-to-stage like an assembly line; intermediate tensors cross NVLink between neighbors instead of all-reducing across every device each layer. The cost is pipeline bubbles — idle GPU cycles when stages wait for upstream or downstream work — which shrink only when batch depth or micro-batch count rises. This guide explains stage partitioning, 1F1B and interleaved schedules, how PP differs from TP and data parallelism at serving time, vLLM and framework knobs, the Harbor Analytics batch refactor, a technique decision table, pitfalls, and a production checklist.
What pipeline parallelism does at inference
Pipeline parallelism splits the model vertically by depth: GPU 0 runs layers 0–31, GPU 1 runs layers 32–63, and so on. A forward pass becomes a relay: stage 0 computes on a micro-batch and sends activations to stage 1 while stage 0 can start the next micro-batch (in pipelined schedules). Memory per GPU drops roughly by the pipeline degree because each device stores only its layer block's weights and a sliding window of activations for in-flight micro-batches.
Contrast with other parallelism modes:
- Data parallelism — full model replica per GPU, different requests per replica. Best when the model fits on one card.
- Tensor parallelism — each layer is split horizontally across GPUs; all-reduce synchronizes every layer. Low bubble overhead at batch 1 but high communication per token step.
- Pipeline parallelism — each stage owns a contiguous layer block; point-to-point sends between neighbors. Communication volume per token is smaller than TP, but bubbles appear unless multiple micro-batches fill the pipe.
At inference, PP is most common for very large models (70B+) where neither single-GPU memory nor pure TP efficiency is acceptable, and for offline or batch workloads where you can afford micro-batch depth. Real deployments often combine PP with TP inside each stage, a hybrid layout that becomes 3D parallelism once data-parallel replicas are added, as described in our distributed training guide. The same math applies at inference, but latency SLOs tighten the acceptable bubble fraction.
Stages, micro-batches, and bubble math
Stage partitioning
Given pipeline degree p and L transformer layers, each stage typically holds about L/p blocks (embedding and LM head placement vary by framework). Uneven splits can reduce straggler effects: the first and last stages often carry the embedding and LM head on top of their attention+MLP blocks, so giving the middle stages slightly more layers helps even out per-stage time. vLLM and Megatron-style stacks expose a pipeline-parallel size alongside a tensor-parallel size (pipeline_parallel_size and tensor_parallel_size in vLLM); the product PP × TP must equal the number of GPUs assigned to one model replica.
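As a minimal sketch of the split described above, the Python helper below divides L layers into p contiguous stages and hands any leftover layers to the middle stages first. The function names and the 98-layer example are illustrative assumptions, not taken from vLLM or Megatron, and real frameworks may place the embedding and LM head differently.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layers into contiguous per-stage ranges (hypothetical helper)."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    counts = [base] * num_stages
    # Leftover layers go to the stages closest to the middle first, on the
    # assumption that the first and last stages also host embedding / LM head.
    centre = (num_stages - 1) / 2
    by_centrality = sorted(range(num_stages), key=lambda s: abs(s - centre))
    for stage in by_centrality[:extra]:
        counts[stage] += 1
    ranges, start = [], 0
    for count in counts:
        ranges.append(range(start, start + count))
        start += count
    return ranges


def check_gpu_budget(num_gpus: int, pp: int, tp: int) -> None:
    """PP x TP must equal the GPUs assigned to one model replica."""
    if pp * tp != num_gpus:
        raise ValueError(f"PP={pp} x TP={tp} does not use exactly {num_gpus} GPUs")


if __name__ == "__main__":
    check_gpu_budget(num_gpus=8, pp=4, tp=2)
    # 98 layers is an arbitrary example count, chosen to show a remainder.
    for stage, layers in enumerate(partition_layers(98, 4)):
        print(f"stage {stage}: layers {layers.start}-{layers.stop - 1} "
              f"({len(layers)} blocks)")
```

Counting layers is only a first approximation; measured per-stage time should drive the final split.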
Micro-batches and schedules
A micro-batch is a subset of the global batch that enters the pipeline independently. With m micro-batches and p stages, a GPipe-style schedule takes m + p - 1 time slots, and each stage sits idle for p - 1 of them while the pipeline fills and drains. That is a bubble fraction of (p-1)/(m+p-1) of wall time, or (p-1)/m relative to useful compute: the classic pipeline bubble. 1F1B (one forward, one backward in training; forward-only variants at inference) has the same bubble as GPipe and mainly caps in-flight activation memory. Interleaved pipeline schedules (v layer chunks per stage, rotating through virtual stages) cut the bubble-to-compute ratio to roughly (p-1)/(v·m). In every case the bubble shrinks as m grows.
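To make the bubble arithmetic concrete, here is a small sketch for a forward-only pass, assuming every stage takes one equal time slot per micro-batch and transfers are free. Under those assumptions the pass takes m + p - 1 slots and each stage works in m of them. The function names are hypothetical; the second function replays the schedule slot by slot as a cross-check on the closed form.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of wall time per stage: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (p - 1) / (m + p - 1)


def simulated_idle_fraction(num_stages: int, num_microbatches: int) -> float:
    """Replay the schedule: stage s runs micro-batch i in slot s + i."""
    p, m = num_stages, num_microbatches
    total_slots = m + p - 1
    busy = [[False] * total_slots for _ in range(p)]
    for stage in range(p):
        for mb in range(m):
            busy[stage][stage + mb] = True
    idle_slots = sum(row.count(False) for row in busy)
    return idle_slots / (p * total_slots)


if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        print(f"p=4, m={m:>2}: bubble = {bubble_fraction(4, m):.3f}")
    # p=4, m= 1: bubble = 0.750
    # p=4, m= 4: bubble = 0.429
    # p=4, m=16: bubble = 0.158
    # p=4, m=64: bubble = 0.045
```

With four stages, a single micro-batch leaves each GPU idle three quarters of the time, while 16 micro-batches bring the idle share under 16%.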
For decode-heavy serving, each new token is effectively a micro-batch of size one traversing all stages sequentially unless the framework batches multiple sequences' decode steps together. That is why PP shines behind continuous batching with tens of concurrent sequences and struggles for single-user chat unless paired with aggressive batching or prefill-decode disaggregation.
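The decode case can be illustrated with a rough model, assuming equal-cost stages and free transfers. A sequence needs its previous token before it can start the next one, so a group of sequences batched together occupies only one stage at a time. With k independent groups in flight, at most min(k, p) of the p stages are busy. The function names below are hypothetical.

```python
def decode_stage_utilization(num_stages: int, groups_in_flight: int) -> float:
    """Upper bound on the busy share of each stage during steady-state decode."""
    if num_stages < 1 or groups_in_flight < 0:
        raise ValueError("need num_stages >= 1 and groups_in_flight >= 0")
    return min(groups_in_flight, num_stages) / num_stages


def simulated_utilization(num_stages: int, groups_in_flight: int,
                          steps: int = 1000) -> float:
    """Rotate groups through the stages and count occupied stage-slots."""
    active = min(groups_in_flight, num_stages)  # extra groups wait in the queue
    busy = 0
    for t in range(steps):
        # At step t, group g sits on stage (t + g) % num_stages.
        occupied = {(t + g) % num_stages for g in range(active)}
        busy += len(occupied)
    return busy / (steps * num_stages)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(f"p=4, groups={k}: utilization <= {decode_stage_utilization(4, k):.2f}")
```

A single chat session on a four-stage pipeline keeps each GPU busy at most a quarter of the time under this model, which is why batch depth matters more for PP than for TP.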
Activation and KV memory
Unlike TP, which shards each layer's KV cache by attention head across the group, PP partitions the KV cache by layer: each stage stores keys and values only for the attention layers it owns. In hybrid TP+PP, KV shards follow the TP group within each stage. Budget peak memory per GPU as: that stage's weight shard + in-flight activations × micro-batch count + KV cache for the stage's layers across all sequences assigned to the replica.
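The budget can be turned into a rough per-GPU estimate. The sketch below assumes weights split evenly across PP × TP GPUs, each stage holds the KV cache for its own layers only, and KV heads divide evenly across the TP group. The class, the function, and the example model shape are hypothetical placeholders, and activation memory is passed in as a measured number rather than modeled.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    num_params: float   # total parameter count
    num_layers: int     # transformer blocks
    num_kv_heads: int   # KV heads per layer (after GQA, if used)
    head_dim: int


def per_gpu_memory_gib(model: ModelShape, pp: int, tp: int,
                       weight_bytes: float, kv_bytes: float,
                       num_seqs: int, context_len: int,
                       activation_gib: float = 0.0) -> float:
    """Estimate weights + KV + activations for one GPU of one stage."""
    weights = model.num_params * weight_bytes / (pp * tp)
    layers_per_stage = model.num_layers / pp
    # Keys and values: 2 tensors per layer, per token.
    kv_per_token_per_layer = 2 * model.num_kv_heads * model.head_dim * kv_bytes
    kv = layers_per_stage * kv_per_token_per_layer * num_seqs * context_len / tp
    return (weights + kv) / GIB + activation_gib


if __name__ == "__main__":
    # Illustrative 405B-class shape; layer and head counts are assumptions.
    shape = ModelShape(num_params=405e9, num_layers=126,
                       num_kv_heads=8, head_dim=128)
    need = per_gpu_memory_gib(shape, pp=4, tp=2, weight_bytes=1, kv_bytes=1,
                              num_seqs=64, context_len=8192, activation_gib=4.0)
    print(f"estimated per-GPU footprint: {need:.1f} GiB")
```

Treat the result as a sizing floor: allocator fragmentation, CUDA graphs, and framework buffers add overhead that this estimate ignores.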
PP vs TP for inference: when each wins
| Dimension | Pipeline parallel | Tensor parallel |
|---|---|---|
| Communication pattern | Neighbor send/recv between stages | All-reduce / all-gather every layer |
| Batch size 1 latency | Poor — serial stage traversal + bubbles | Better — all GPUs cooperate on one token |
| Large batch throughput | Strong when micro-batches fill the pipe | Strong when matmuls amortize collectives |
| Very deep models (100+ layers) | Natural fit — few stages, many layers each | Many all-reduces per token — costly |
| Cross-node scaling | Easier — stage boundaries map to NIC hops | Hard — every layer needs fast all-reduce |
| Framework maturity at inference | Growing; more batch-oriented paths | Mature in vLLM, TGI, TensorRT-LLM |
Rule of thumb: if your SLO is interactive chat at concurrency 1–4, prefer the smallest TP degree that fits memory before adding PP. If your SLO is overnight document ingestion at concurrency 64+, PP (often PP+TP hybrid) deserves a benchmark.
Framework and vLLM configuration
In vLLM, set pipeline_parallel_size and tensor_parallel_size so their product matches GPUs per replica. Example: eight GPUs as PP=2, TP=4 means two pipeline stages, each stage tensor-parallel across four GPUs. Launch flags must align with physical topology; stage boundaries should sit on NVLink islands when possible.
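A small pre-launch check can catch a mismatched layout before any GPU is touched. The sketch below validates the PP × TP product and lists which GPU indices would form each stage, assuming consecutive indices share a TP group; actual rank placement is decided by the framework and should be confirmed against your topology. The function is hypothetical and does not import vLLM; the dictionary keys simply mirror the two setting names used above.

```python
def parallel_layout(num_gpus: int, pipeline_parallel_size: int,
                    tensor_parallel_size: int) -> dict:
    """Validate a PP x TP layout and map stages to GPU indices."""
    pp, tp = pipeline_parallel_size, tensor_parallel_size
    if pp < 1 or tp < 1:
        raise ValueError("parallel sizes must be positive")
    if pp * tp != num_gpus:
        raise ValueError(f"PP={pp} x TP={tp} = {pp * tp}, expected {num_gpus} GPUs")
    # Assumption: GPUs [s*tp, (s+1)*tp) form the TP group of stage s.
    stage_to_gpus = [list(range(s * tp, (s + 1) * tp)) for s in range(pp)]
    return {
        "engine_kwargs": {
            "pipeline_parallel_size": pp,
            "tensor_parallel_size": tp,
        },
        "stage_to_gpus": stage_to_gpus,
    }


if __name__ == "__main__":
    layout = parallel_layout(num_gpus=8, pipeline_parallel_size=2,
                             tensor_parallel_size=4)
    for stage, gpus in enumerate(layout["stage_to_gpus"]):
        print(f"stage {stage}: GPUs {gpus}")
    # stage 0: GPUs [0, 1, 2, 3]
    # stage 1: GPUs [4, 5, 6, 7]
```

The engine_kwargs dictionary is meant to be passed to whichever engine entry point your vLLM version supports for pipeline parallelism; check the release notes, since PP support has differed between the offline and server paths.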
- Batch-oriented engines (vLLM offline, TensorRT-LLM inflight batching with PP) expose micro-batch and max-num-seqs knobs that directly affect bubble fill.
- Disaggregated serving can run PP on a prefill pool (fat prompts, high micro-batch) and TP-only on a decode pool (low latency) — see our prefill-decode guide for handoff patterns.
- Quantization reduces per-stage weight footprint, sometimes eliminating one PP stage entirely; profile before adding hardware.
Pair PP deployments with admission control so interactive traffic cannot drain micro-batch depth needed by batch workers on a shared cluster.
Harbor Analytics batch summarization refactor
Harbor's 405B nightly job originally used TP=8 on one eight-GPU node. Profiling showed NCCL all-reduce consumed 34% of decode wall time at batch 32 while GPU tensor-core utilization averaged 41%. The refactor:
- Repartitioned to PP=4, TP=2: four stages of ~24 layers, each stage TP=2 across an NVLink pair.
- Routed the job through a dedicated offline queue with max_num_seqs=64 and micro-batch scheduling tuned for 1F1B-style fill.
- Left interactive 70B chat on a separate TP=4 replica pool without PP.
- Added KV-aware backpressure so partial pipeline stalls did not OOM stage 2 when upstream prefill surged.
Results on the same eight H100s: nightly job wall time fell from 6.1 hours to 2.9 hours (2.1× throughput). P50 per-document latency in the batch queue dropped because higher GPU utilization cleared the backlog faster. The interactive tier saw no regression because it never shared the PP cluster. Lesson: physically separate PP batch replicas from TP latency replicas when one organization runs both workloads.
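The arithmetic behind those figures is easy to check. The sketch below assumes each nightly run processes the full 40,000-ticket batch mentioned at the top of this guide; the tickets-per-hour values are derived here, not reported by Harbor.

```python
TICKETS_PER_NIGHT = 40_000
HOURS_TP8 = 6.1        # wall time before the refactor
HOURS_PP4_TP2 = 2.9    # wall time after the refactor

# Same work in less time, so the speedup is the ratio of wall times.
speedup = HOURS_TP8 / HOURS_PP4_TP2
rate_before = TICKETS_PER_NIGHT / HOURS_TP8
rate_after = TICKETS_PER_NIGHT / HOURS_PP4_TP2

print(f"speedup: {speedup:.1f}x")
print(f"tickets/hour: {rate_before:.0f} -> {rate_after:.0f}")
# speedup: 2.1x
# tickets/hour: 6557 -> 13793
```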
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| 405B+ model, offline batch | PP with high micro-batch depth; hybrid TP within stage | TP-only across many GPUs for decode-heavy batch |
| Interactive chat, concurrency < 8 | Minimal TP; quantization; smaller model tier | PP without batch depth |
| Multi-node model serving | PP across nodes, TP within node | Cross-node TP all-reduce on every layer |
| Mixed prefill/decode on one cluster | Disaggregated pools; PP on prefill only | Single PP replica for both phases |
| Model fits on 2× GPUs with INT4 | TP=2 or single GPU quant | PP=2 “for throughput” at batch 1 |
| Latency SLO < 100 ms/token | TP or single-GPU speculative decode | Deep pipelines without concurrent sequences |
Common pitfalls
- PP on interactive traffic without batch depth. Bubbles dominate; latency spikes feel random.
- Uneven stage layer counts. One overweight stage becomes a straggler; profile FLOPs per block before splitting.
- Ignoring KV memory per stage. Weights fit; attention cache on middle stages still OOMs under long context.
- Sharing PP cluster with chat and batch. Chat drains micro-batches; batch throughput collapses.
- Cross-PCIe stage boundaries. Neighbor activation transfers on slow links erase PP's communication advantage over TP.
- Benchmarking prefill only. Decode pipeline fill differs; measure end-to-end tok/s at realistic concurrency.
- Copying training PP config. Training tolerates bubbles with large gradient accumulation; inference SLOs may not.
- No failover plan. A dead stage halts the entire pipeline; health checks must be stage-aware.
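For the last pitfall, a stage-aware health check can be sketched as below. It assumes each stage reports a heartbeat timestamp to some monitor; the function names, the dictionary layout, and the 5-second timeout are all hypothetical and not part of any serving framework.

```python
def unhealthy_stages(last_heartbeat: dict[int, float], now: float,
                     timeout_s: float = 5.0) -> list[int]:
    """Return stage ids whose most recent heartbeat is older than timeout_s."""
    return sorted(stage for stage, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


def pipeline_is_healthy(last_heartbeat: dict[int, float], now: float,
                        timeout_s: float = 5.0) -> bool:
    """One dead stage halts the whole pipeline, so every stage must be alive."""
    return not unhealthy_stages(last_heartbeat, now, timeout_s)


if __name__ == "__main__":
    beats = {0: 100.0, 1: 99.0, 2: 90.0, 3: 100.0}  # seconds, sample data
    dead = unhealthy_stages(beats, now=101.0)
    if dead:
        print(f"fail fast: stages {dead} missed heartbeats")
```

Reporting the specific stage id, rather than a generic replica failure, is what lets an operator tell a dead GPU from a stalled upstream neighbor.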
Production checklist
- Calculate minimum PP degree from per-stage weight bytes + KV + activation window.
- Balance layers per stage by measured FLOPs, not layer count alone.
- Benchmark throughput at target micro-batch count before locking PP degree.
- Measure bubble fraction: idle GPU time / total time per stage at steady state.
- Set pipeline_parallel_size and tensor_parallel_size explicitly.
- Place stage boundaries on NVLink pairs or nodes with highest bisection bandwidth.
- Isolate batch PP replicas from interactive TP replicas when workloads differ.
- Pair with admission control and queue priorities per workload class.
- Validate quantized PP paths against BF16 on representative long-context prompts.
- Monitor per-stage latency histograms — stragglers show up as tail skew.
- Test stage failure: pipeline should fail fast with clear alerts, not hang.
- Re-run sizing after context-length or model architecture changes.
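Two of the checklist items, measuring bubble fraction and spotting stragglers, reduce to simple arithmetic once per-stage busy time is available. The sketch below assumes you can export busy seconds per stage over a steady-state window, for example from a profiler; the function name and sample numbers are hypothetical.

```python
def stage_report(busy_seconds: dict[int, float],
                 wall_seconds: float) -> tuple[dict[int, float], int]:
    """Return (bubble fraction per stage, id of the busiest stage)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    if not busy_seconds:
        raise ValueError("need at least one stage")
    # Bubble fraction = idle time / total time for each stage.
    bubble = {stage: 1.0 - busy / wall_seconds
              for stage, busy in busy_seconds.items()}
    # The busiest stage is the likely straggler that the others wait on.
    straggler = max(busy_seconds, key=busy_seconds.get)
    return bubble, straggler


if __name__ == "__main__":
    sample = {0: 7.0, 1: 9.5, 2: 7.5, 3: 7.0}  # busy seconds in a 10 s window
    bubble, straggler = stage_report(sample, wall_seconds=10.0)
    for stage in sorted(bubble):
        print(f"stage {stage}: bubble {bubble[stage]:.0%}")
    print(f"likely straggler: stage {straggler}")
```

If one stage shows a much lower bubble fraction than its neighbors, moving a layer or two off it is usually cheaper than adding micro-batches.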
Key takeaways
- Pipeline parallelism shards layers across stages so very large models can serve without every GPU participating in every layer's collective.
- Bubbles are the central tradeoff — PP needs micro-batch depth or high concurrency to keep all stages busy.
- Interactive low-concurrency chat favors TP; offline high-concurrency batch favors PP or hybrid PP+TP.
- Harbor Analytics doubled nightly 405B throughput by moving batch work from TP=8 to PP=4 / TP=2 on the same eight H100s.
- Separate replica pools per workload class — do not mix PP batch and TP chat on one pipeline without careful scheduling.
Related reading
- LLM tensor parallelism for inference explained — horizontal sharding and all-reduce tradeoffs
- vLLM fundamentals explained — PP and TP launch flags, continuous batching
- Distributed LLM training explained — 3D parallelism context for hybrid PP+TP
- LLM prefill-decode disaggregation explained — running PP on prefill pools only