LLM pipeline parallelism for inference explained

Harbor Analytics runs a nightly batch job that summarizes 40,000 support tickets through a 405B-class model. Each ticket is independent, per-document latency matters less than total wall-clock throughput, and the model cannot fit on fewer than eight H100s even with FP8 weights. The team first deployed with tensor parallelism (TP) across all eight GPUs. Decode worked, but sustained tokens per second plateaued: every token step triggered all-reduce collectives across the full mesh, while the skinny decode matmuls kept only a fraction of each GPU's streaming multiprocessors (SMs) busy. Switching to pipeline parallelism (PP), with four stages of ~24 transformer blocks each, two tensor-parallel GPUs per stage, and micro-batches of 16 sequences fed through continuous batching, raised nightly throughput 2.1× on the same hardware. Interactive chat on the same PP cluster was unusable: pipeline bubbles at batch depth one added 180 ms per token. PP is a parallelism mode for inference, not a universal upgrade.

Pipeline parallelism assigns contiguous layer ranges to different GPUs. Activations flow stage-to-stage like an assembly line; intermediate tensors cross NVLink between neighbors instead of all-reducing across every device each layer. The cost is pipeline bubbles — idle GPU cycles when stages wait for upstream or downstream work — which shrink only when batch depth or micro-batch count rises. This guide explains stage partitioning, 1F1B and interleaved schedules, how PP differs from TP and data parallelism at serving time, vLLM and framework knobs, the Harbor Analytics batch refactor, a technique decision table, pitfalls, and a production checklist.

What pipeline parallelism does at inference

Pipeline parallelism splits the model vertically by depth: GPU 0 runs layers 0–31, GPU 1 runs layers 32–63, and so on. A forward pass becomes a relay: stage 0 computes on a micro-batch and sends activations to stage 1, and in pipelined schedules stage 0 starts the next micro-batch while stage 1 is still working. Weight memory per GPU drops by roughly a factor of the pipeline degree, because each device stores only its own layer block's weights plus a sliding window of activations for in-flight micro-batches.

Contrast with other parallelism modes:

  • Data parallelism — full model replica per GPU, different requests per replica. Best when the model fits on one card.
  • Tensor parallelism — each layer is split horizontally across GPUs; all-reduce synchronizes every layer. Low bubble overhead at batch 1 but high communication per token step.
  • Pipeline parallelism — each stage owns a contiguous layer block; point-to-point sends between neighbors. Communication volume per token is smaller than TP, but bubbles appear unless multiple micro-batches fill the pipe.

At inference, PP is most common for very large models (70B+) where neither single-GPU memory nor pure TP efficiency is acceptable, and for offline or batch workloads where you can afford micro-batch depth. Real deployments often combine PP with TP inside each stage (hybrid PP+TP; adding data-parallel replicas on top gives the 3D parallelism described in our distributed training guide). The same math applies at inference, but latency SLOs tighten the acceptable bubble fraction.

Stages, micro-batches, and bubble math

Stage partitioning

Given pipeline degree p and L transformer layers, each stage typically holds L/p blocks (embedding and LM head placement vary by framework). Uneven splits can reduce straggler effects: attention and MLP FLOPs dominate per-block cost, so the first and last stages, which often also host the embedding and LM head, can take slightly fewer blocks while middle stages take more. vLLM and Megatron-style stacks expose pipeline_parallel_size alongside tensor_parallel_size; the product PP × TP must equal the number of GPUs assigned to one model replica.
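A minimal sketch of the partitioning step, assuming a plain list of layer indices and a policy that hands any leftover layers to middle stages first. The function name partition_layers is hypothetical and does not belong to any framework; real stacks apply their own placement rules.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous per-stage ranges.

    Assumption: leftover layers (when num_layers is not divisible by
    num_stages) go to the middle stages first, because the first and last
    stages often also host the embedding and LM head.
    """
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")

    base, extra = divmod(num_layers, num_stages)
    sizes = [base] * num_stages

    # Rank stages by distance from the middle of the pipeline.
    centre = (num_stages - 1) / 2
    by_centrality = sorted(range(num_stages), key=lambda s: abs(s - centre))
    for stage in by_centrality[:extra]:
        sizes[stage] += 1

    # Turn per-stage sizes into contiguous index ranges.
    ranges, start = [], 0
    for size in sizes:
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for stage, layers in enumerate(partition_layers(98, 4)):
        print(f"stage {stage}: layers {layers.start}-{layers.stop - 1} ({len(layers)})")
```

Balancing by layer count is only a starting point; measured per-block latency is the better weight when blocks differ in cost.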

Micro-batches and schedules

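A micro-batch is a subset of the global batch that enters the pipeline independently. With m micro-batches and p stages, and assuming every stage takes the same time per micro-batch, a GPipe-style schedule keeps each stage busy for m of the m+p-1 time slots. The bubble fraction is therefore (p-1)/(m+p-1) of total time, or (p-1)/m relative to useful compute. This is the classic pipeline bubble, and it comes from filling and draining the pipe rather than from steady state. 1F1B (one forward, one backward in training; forward-only variants at inference) has the same bubble as GPipe but caps in-flight activation memory. Interleaved pipelines (multiple layer chunks per stage, rotating through virtual stages) cut the bubble overhead to roughly (p-1)/(v·m) for v chunks per device, at the cost of more stage-to-stage communication.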
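The bubble arithmetic can be sketched in a few lines. This uses the common idealization that every stage takes one equal time slot per micro-batch, so a stage idles for p-1 of the m+p-1 slots in one pass. The function name and the virtual_chunks parameter are illustrative, not a framework API.

```python
def bubble_fraction(num_stages: int, num_microbatches: int, virtual_chunks: int = 1) -> float:
    """Idle share of total pipeline time under an idealized schedule.

    Assumptions: equal time per stage, free communication, one fill and one
    drain per batch. virtual_chunks > 1 models an interleaved schedule in
    which each device hosts several smaller layer chunks.
    """
    if num_stages < 1 or num_microbatches < 1 or virtual_chunks < 1:
        raise ValueError("all arguments must be >= 1")

    # Idle slots per stage, measured in units of one micro-batch's stage time.
    idle_slots = (num_stages - 1) / virtual_chunks
    return idle_slots / (num_microbatches + idle_slots)


if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        print(f"p=4, m={m:>2}: bubble = {bubble_fraction(4, m):.1%}")
```

With four stages and a single micro-batch, three quarters of the pipeline time is idle, which is the batch-depth-one regime. Continuous batching helps because the pipe is refilled before it drains, so the fill and drain cost is paid rarely.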

For decode-heavy serving, each new token is effectively a micro-batch of size one traversing all stages sequentially unless the framework batches multiple sequences' decode steps together. That is why PP shines behind continuous batching with tens of concurrent sequences and struggles for single-user chat unless paired with aggressive batching or prefill-decode disaggregation.

Activation and KV memory

Unlike TP, which shards the KV cache by attention head, PP partitions it by layer: each stage stores the KV cache only for the attention layers it hosts, with all KV heads of those layers. In hybrid TP+PP, KV shards follow the TP group within each stage. Budget peak memory per stage as: stage weights + in-flight activations × micro-batch count + KV for that stage's layers across all sequences assigned to the replica.
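The budget line above can be turned into a rough per-GPU estimate. This is a sketch under simplifying assumptions: weights are spread evenly across layers, embedding and LM head are ignored, KV heads divide evenly across the TP group, and activations are not sharded. All function and parameter names are hypothetical.

```python
def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of K plus V stored for one token in one attention layer."""
    return 2 * num_kv_heads * head_dim * bytes_per_elem


def stage_gpu_bytes(
    *,
    total_weight_bytes: float,
    total_layers: int,
    stage_layers: int,
    tp_degree: int,
    activation_bytes_per_microbatch: float,
    microbatches_in_flight: int,
    kv_token_layer_bytes: float,
    resident_tokens: int,
) -> dict[str, float]:
    """Rough peak memory for one GPU of one pipeline stage."""
    # Weights: this stage's share of layers, split across the TP group.
    weights = total_weight_bytes * stage_layers / total_layers / tp_degree
    # Activations: one buffer per in-flight micro-batch (kept unsharded here).
    activations = activation_bytes_per_microbatch * microbatches_in_flight
    # KV cache: only this stage's layers, for every resident token.
    kv = kv_token_layer_bytes * stage_layers * resident_tokens / tp_degree
    return {
        "weights": weights,
        "activations": activations,
        "kv": kv,
        "total": weights + activations + kv,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any deployment.
    budget = stage_gpu_bytes(
        total_weight_bytes=400e9,
        total_layers=100,
        stage_layers=25,
        tp_degree=2,
        activation_bytes_per_microbatch=0.5e9,
        microbatches_in_flight=4,
        kv_token_layer_bytes=kv_bytes_per_token_per_layer(8, 128, 2),
        resident_tokens=64 * 4096,
    )
    for name, value in budget.items():
        print(f"{name:>11}: {value / 1e9:6.1f} GB")
```

Compare the total against usable device memory with headroom for fragmentation and runtime buffers; the estimate is a lower bound, not a guarantee.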

PP vs TP for inference: when each wins

| Dimension | Pipeline parallel | Tensor parallel |
| --- | --- | --- |
| Communication pattern | Neighbor send/recv between stages | All-reduce / all-gather every layer |
| Batch size 1 latency | Poor: serial stage traversal + bubbles | Better: all GPUs cooperate on one token |
| Large batch throughput | Strong when micro-batches fill the pipe | Strong when matmuls amortize collectives |
| Very deep models (100+ layers) | Natural fit: few stages, many layers each | Many all-reduces per token, which is costly |
| Cross-node scaling | Easier: stage boundaries map to NIC hops | Hard: every layer needs fast all-reduce |
| Framework maturity at inference | Growing; more batch-oriented paths | Mature in vLLM, TGI, TensorRT-LLM |

Rule of thumb: if your SLO is interactive chat at concurrency 1–4, prefer the smallest TP degree that fits memory before adding PP. If your SLO is overnight document ingestion at concurrency 64+, PP (often PP+TP hybrid) deserves a benchmark.

Framework and vLLM configuration

In vLLM, set pipeline_parallel_size and tensor_parallel_size so their product matches GPUs per replica. Example: eight GPUs as PP=2, TP=4 means two pipeline stages, each stage tensor-parallel across four GPUs. Launch flags must align with physical topology — stage boundaries should sit on NVLink islands when possible.
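A small sketch of the sizing rule, assuming the vllm serve command line with its --pipeline-parallel-size and --tensor-parallel-size flags. The helper name vllm_serve_args and the model path are hypothetical placeholders, and the function only builds and validates an argument list; it does not launch anything.

```python
def vllm_serve_args(model: str, gpus_per_replica: int, pp: int, tp: int) -> list[str]:
    """Build a `vllm serve` argument list after checking PP x TP == GPUs."""
    if pp < 1 or tp < 1:
        raise ValueError("pp and tp must be >= 1")
    if pp * tp != gpus_per_replica:
        raise ValueError(
            f"PP ({pp}) x TP ({tp}) = {pp * tp}, "
            f"but the replica has {gpus_per_replica} GPUs"
        )
    return [
        "vllm", "serve", model,
        "--pipeline-parallel-size", str(pp),
        "--tensor-parallel-size", str(tp),
    ]


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a real checkpoint.
    args = vllm_serve_args("your-org/your-model", gpus_per_replica=8, pp=2, tp=4)
    print(" ".join(args))
```

Passing the resulting list to subprocess.run would start the server. Check the flags against the vLLM version you deploy, since pipeline-parallel support has changed between releases.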

  • Batch-oriented engines (vLLM offline, TensorRT-LLM inflight batching with PP) expose micro-batch and max-num-seqs knobs that directly affect bubble fill.
  • Disaggregated serving can run PP on a prefill pool (fat prompts, high micro-batch) and TP-only on a decode pool (low latency) — see our prefill-decode guide for handoff patterns.
  • Quantization reduces per-stage weight footprint, sometimes eliminating one PP stage entirely; profile before adding hardware.

Pair PP deployments with admission control so interactive traffic cannot drain micro-batch depth needed by batch workers on a shared cluster.

Harbor Analytics batch summarization refactor

Harbor's 405B nightly job originally used TP=8 on one eight-GPU node. Profiling showed NCCL all-reduce consumed 34% of decode wall time at batch 32 while GPU tensor-core utilization averaged 41%. The refactor:

  1. Repartitioned to PP=4, TP=2: four stages of ~24 layers, each stage TP=2 across an NVLink pair.
  2. Routed the job through a dedicated offline queue with max_num_seqs=64 and micro-batch scheduling tuned for 1F1B-style fill.
  3. Left interactive 70B chat on a separate TP=4 replica pool without PP.
  4. Added KV-aware backpressure so partial pipeline stalls did not OOM stage 2 when upstream prefill surged.
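Step 4 is described only at a high level, so the following is a hypothetical sketch of what a KV-aware admission check could look like, not Harbor's implementation. It assumes the scheduler knows each stage's free KV bytes and the KV bytes a new sequence would need on each stage; the names can_admit and reserve_bytes are invented for illustration.

```python
def can_admit(
    free_kv_bytes: list[int],
    needed_kv_bytes: list[int],
    reserve_bytes: int = 0,
) -> bool:
    """Admit a sequence only if every stage keeps `reserve_bytes` of KV headroom.

    free_kv_bytes[i]   : KV bytes currently free on stage i
    needed_kv_bytes[i] : KV bytes the new sequence would occupy on stage i
    """
    if len(free_kv_bytes) != len(needed_kv_bytes):
        raise ValueError("one entry per pipeline stage is required")
    # The tightest stage decides: one full stage stalls the whole pipeline.
    return all(
        free - needed >= reserve_bytes
        for free, needed in zip(free_kv_bytes, needed_kv_bytes)
    )


if __name__ == "__main__":
    free = [100, 100, 40, 100]   # stage 2 is nearly full
    need = [30, 30, 30, 30]
    print(can_admit(free, need, reserve_bytes=20))
```

Checking every stage rather than an aggregate matters because free memory on one stage cannot absorb a shortfall on another.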

Results on the same eight H100s: nightly job wall time fell from 6.1 hours to 2.9 hours (2.1× throughput). P50 per-document latency in the batch queue dropped because higher GPU utilization cleared the backlog faster. The interactive tier saw no regression because it never shared the PP cluster. Lesson: physically separate PP batch replicas from TP latency replicas when one organization runs both workloads.

Technique decision table

| Scenario | Prefer | Avoid |
| --- | --- | --- |
| 405B+ model, offline batch | PP with high micro-batch depth; hybrid TP within stage | TP-only across many GPUs for decode-heavy batch |
| Interactive chat, concurrency < 8 | Minimal TP; quantization; smaller model tier | PP without batch depth |
| Multi-node model serving | PP across nodes, TP within node | Cross-node TP all-reduce on every layer |
| Mixed prefill/decode on one cluster | Disaggregated pools; PP on prefill only | Single PP replica for both phases |
| Model fits on 2× GPUs with INT4 | TP=2 or single GPU quant | PP=2 “for throughput” at batch 1 |
| Latency SLO < 100 ms/token | TP or single-GPU speculative decode | Deep pipelines without concurrent sequences |

Common pitfalls

  • PP on interactive traffic without batch depth. Bubbles dominate; latency spikes feel random.
  • Uneven stage layer counts. One overweight stage becomes a straggler; profile FLOPs per block before splitting.
  • Ignoring KV memory per stage. Weights fit; attention cache on middle stages still OOMs under long context.
  • Sharing PP cluster with chat and batch. Chat drains micro-batches; batch throughput collapses.
  • Cross-PCIe stage boundaries. Neighbor activation transfers on slow links erase PP's communication advantage over TP.
  • Benchmarking prefill only. Decode pipeline fill differs; measure end-to-end tok/s at realistic concurrency.
  • Copying training PP config. Training tolerates bubbles with large gradient accumulation; inference SLOs may not.
  • No failover plan. A dead stage halts the entire pipeline; health checks must be stage-aware.

Production checklist

  • Calculate minimum PP degree from per-stage weight bytes + KV + activation window.
  • Balance layers per stage by measured FLOPs, not layer count alone.
  • Benchmark throughput at target micro-batch count before locking PP degree.
  • Measure bubble fraction: idle GPU time / total time per stage at steady state.
  • Set pipeline_parallel_size and tensor_parallel_size explicitly.
  • Place stage boundaries on NVLink pairs or nodes with highest bisection bandwidth.
  • Isolate batch PP replicas from interactive TP replicas when workloads differ.
  • Pair with admission control and queue priorities per workload class.
  • Validate quantized PP paths against BF16 on representative long-context prompts.
  • Monitor per-stage latency histograms — stragglers show up as tail skew.
  • Test stage failure: pipeline should fail fast with clear alerts, not hang.
  • Re-run sizing after context-length or model architecture changes.
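For the bubble-fraction item in the checklist, a minimal sketch of the measurement, assuming you can export per-stage busy intervals (start and end timestamps of kernels or forward steps) from a profiler or trace. The function name and input format are hypothetical.

```python
def measured_bubble_fraction(
    busy_intervals: list[tuple[float, float]],
    window_start: float,
    window_end: float,
) -> float:
    """Idle time / total time for one stage over an observation window.

    busy_intervals may overlap or extend past the window; overlapping time is
    counted once and everything outside the window is ignored.
    """
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window_end must be greater than window_start")

    busy = 0.0
    cursor = window_start  # everything before cursor is already accounted for
    for start, end in sorted(busy_intervals):
        start = max(start, cursor)
        end = min(end, window_end)
        if end > start:
            busy += end - start
            cursor = end
    return 1.0 - busy / total


if __name__ == "__main__":
    # Timestamps in seconds for a single stage; illustrative values only.
    stage_trace = [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)]
    print(f"bubble fraction: {measured_bubble_fraction(stage_trace, 0.0, 10.0):.0%}")
```

Run it per stage over the same window: a stage with a much lower bubble fraction than its neighbors is the likely straggler.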

Key takeaways

  • Pipeline parallelism shards layers across stages so very large models can serve without every GPU participating in every layer's collective.
  • Bubbles are the central tradeoff — PP needs micro-batch depth or high concurrency to keep all stages busy.
  • Interactive low-concurrency chat favors TP; offline high-concurrency batch favors PP or hybrid PP+TP.
  • Harbor Analytics doubled nightly 405B throughput by moving batch work from TP=8 to PP=4 / TP=2 on the same eight H100s.
  • Separate replica pools per workload class — do not mix PP batch and TP chat on one pipeline without careful scheduling.
