Guide
LLM data parallel inference explained
Harbor Analytics runs a 70B chat assistant on four H100s configured as tensor parallel (TP=2) across two GPUs per replica. Black Friday traffic doubled concurrent sessions, but P95 time-to-first-token (TTFT) stayed flat while tokens per second per user collapsed: each replica was saturated. The team considered widening TP to four GPUs per replica, which would have halved per-token latency for a single user but left aggregate throughput unchanged on the same four cards. Instead they deployed a second identical TP=2 replica and put a gateway in front that routes new sessions to the least-loaded worker. Total chat throughput nearly doubled; per-user latency at moderate concurrency matched the single-replica baseline. That is data parallel (DP) inference — the simplest scaling lever when the model already fits on one replica.
Data parallelism at inference means running independent full copies of the model, each handling different requests. Unlike pipeline or tensor parallelism, DP does not split layers; it multiplies capacity by adding replicas. The hard parts are routing, session stickiness, KV cache isolation, and knowing when another replica helps more than deeper sharding. This guide covers replica pools, load-balancing policies, DP combined with TP/PP, vLLM and gateway patterns, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
What data parallel inference does
Data parallel inference replicates the entire model (including any internal TP or PP groups) on separate GPU sets. Request A goes to replica 0, request B to replica 1; no cross-replica communication occurs during forward passes. Throughput scales roughly linearly with replica count until a bottleneck appears elsewhere — gateway CPU, network ingress, embedding lookups, or downstream databases.
Compare the three model-parallel modes:
- Data parallelism — N identical replicas; each serves different requests. Scales aggregate throughput; does not reduce per-request latency on an unloaded replica.
- Tensor parallelism — one logical model split across GPUs within a replica. Reduces per-replica memory and can improve single-request latency when matmuls were GPU-bound.
- Pipeline parallelism — layers staged across GPUs within a replica. Fits models too large for one card; needs batch depth to avoid bubbles.
Production stacks almost always combine them: a replica might be TP=2 on two GPUs, and the cluster runs four such replicas for DP=4 effective throughput. Our distributed training guide covers DDP and 3D parallelism during fine-tuning; at inference the same replica math applies, but routing and KV cache lifetime dominate operations.
Replica pools and request routing
When each request needs a home
Stateless completion APIs (single-turn, no server-side history) can use simple
round-robin or least-connections routing: any replica is equally valid. Multi-turn
chat is stateful: the
KV cache
for prior tokens lives on the GPU that handled earlier turns. Re-routing turn two
to a different replica forces a full prefill replay — wasted compute and
higher TTFT. Gateways therefore use session affinity (sticky
routing by session_id, user hash, or conversation key) for the
lifetime of a conversation, with optional cache migration only when a replica
fails.
Load-balancing policies
Common policies, from simplest to most SLO-aware:
- Round-robin — fine for stateless workloads; poor for sticky chat if sessions have uneven lengths.
- Least connections / least in-flight — routes to the replica with fewest active sequences; works well with continuous batching when each engine exposes queue depth metrics.
- Weighted / capacity-aware — assigns weights by GPU count, max context, or measured tok/s; needed when replicas are heterogeneous (one TP=4 replica vs two TP=2 replicas).
- SLO-aware shedding — when all replicas exceed wait thresholds, return 429 or queue at the gateway per admission control policy instead of overloading workers.
KV isolation and memory
Each replica owns its KV blocks independently. Total cluster KV capacity is the sum of per-replica limits, not shared. Hot sessions on one replica can OOM that worker while siblings sit idle — a sign to improve affinity hashing (spread long conversations) or add replicas. PagedAttention pools in vLLM apply per process; DP does not unify memory across replicas.
DP vs deeper model parallelism
| Dimension | Add data parallel replicas | Increase TP / PP within one replica |
|---|---|---|
| Primary goal | More concurrent users / aggregate tok/s | Fit larger model or cut per-token latency |
| Single-user latency (low load) | Unchanged vs one replica | Often improves if one GPU was saturated |
| Memory per request | Same as one replica | PP/TP spread weights; KV rules vary |
| Hardware efficiency at high concurrency | High — replicas fill independently | TP all-reduce can limit scaling |
| Operational complexity | Routing, stickiness, health checks | NCCL topology, stage boundaries |
| Failure blast radius | One replica down — partial capacity loss | One GPU down — entire replica dead |
Rule of thumb: if profiling shows GPUs underutilized at target concurrency and the model fits comfortably on one replica, add DP replicas before widening TP. If a single request exceeds memory or misses latency SLO on one replica, fix TP/PP/quantization first; DP alone will not help.
Deployment patterns and vLLM
Most engines run one model process (or Ray worker group) per replica. In vLLM,
launch N independent servers with identical tensor_parallel_size
flags, each binding to disjoint GPU sets, then place nginx, Envoy, or a custom
gateway in front. Kubernetes patterns use one Deployment per replica size class
or a single Deployment with one pod per replica and anti-affinity on GPU nodes.
- Homogeneous pools — all replicas same model, same TP; simplest routing and autoscaling.
- Tiered pools — small model DP pool for cheap queries, large model pool for escalations; pairs with model routing.
- Prefill/decode split — not DP in the classic sense, but multiple decode replicas behind a router after a shared prefill stage; see prefill-decode disaggregation.
- Speculative decode — draft and target models can each be DP-scaled independently; acceptance rate affects effective load per replica.
Autoscaling signals should use queue wait and GPU batch utilization, not CPU. Scale replicas when P95 wait exceeds SLO despite healthy per-replica tok/s. Pair with multi-tenant isolation so one tenant cannot pin all sticky sessions on one replica.
Harbor Analytics chat gateway refactor
Harbor's 70B assistant initially ran one vLLM instance with TP=2 on GPUs 0–1.
Peak holiday traffic pushed num_running_seqs to the
max_num_seqs ceiling; new sessions queued 4–8 seconds at the
gateway. Profiling showed GPU tensor-core utilization at 78% — not memory-bound,
concurrency-bound. The refactor:
- Added a second TP=2 replica on GPUs 2–3 with identical
model weights and quantization (
fp8per our FP8 guide). - Deployed an Envoy gateway with consistent-hash routing on
conversation_idfor sticky sessions and least-request fallback for new sessions. - Exposed per-replica queue depth to the gateway via health endpoints; shed load at 429 when both replicas exceeded 200 ms wait budget.
- Kept batch summarization on a separate PP cluster so chat DP scaling did not steal GPUs from offline jobs.
Results: aggregate chat tok/s rose 1.9× at the same P95 TTFT for sessions under 8k context. Long-context sessions (>32k tokens) still skewed load; consistent hashing by user ID instead of conversation ID spread heavy users more evenly. Lesson: measure whether you are replica-bound or shard-bound before widening TP — Harbor was replica-bound.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Model fits one GPU; high concurrent chat | DP replicas (one GPU each) | TP=2 “for throughput” |
| 70B on 2× H100; queue depth maxed | Second TP=2 replica (DP=2) | TP=4 on four GPUs without traffic increase |
| Single-user latency SLO missed; low concurrency | TP, quantization, or smaller model tier | More DP replicas |
| 405B model; cannot fit one replica | PP+TP within replica; then DP of replicas | DP-only thinking before model fits |
| Stateless embedding batch API | Round-robin DP; no stickiness | Session affinity overhead |
| Uneven conversation lengths | Least-in-flight + user-hash spread | Naive round-robin on sticky chat |
| Regional latency requirements | Geo-distributed DP pools | One global pool with cross-region stickiness |
Common pitfalls
- Sticky sessions without failover plan. Replica death evicts all pinned conversations unless you replay history on another worker.
- Adding replicas when GPU utilization is low. Fix batching, chunked prefill, or admission control first.
- Ignoring weight load time. N replicas means N copies loaded from disk or object storage at startup — stagger rollouts.
- Heterogeneous replicas behind one pool. Mixing TP=2 and TP=4 without weighted routing skews latency.
- Autoscaling on CPU. LLM workers idle CPU while GPUs saturate; scale on queue depth and GPU metrics.
- Shared rate limits per replica. Per-tenant caps must be cluster-wide, not per process, or tenants multiply quotas by replica count.
- DP for memory relief. Replicas duplicate weights; DP does not shrink per-replica footprint.
- No cross-replica observability. Debugging tail latency requires per-replica tok/s, batch size, and KV usage dashboards.
Production checklist
- Confirm the model fits on one replica with headroom for peak context and batch depth.
- Profile GPU utilization at target concurrency — distinguish replica-bound vs shard-bound.
- Define routing: stateless (round-robin) vs sticky (conversation or user hash).
- Implement least-in-flight or capacity-aware balancing with per-replica metrics.
- Plan replica failure: replay prefill on new worker or return graceful retry to client.
- Isolate replica pools by workload (chat vs batch vs embedding).
- Align autoscaling triggers with queue wait and GPU batch utilization.
- Apply cluster-wide tenant quotas and admission control, not per-replica silos.
- Stagger replica startup to avoid simultaneous weight downloads hammering storage.
- Validate quantization and TP settings identically across replicas.
- Load-test sticky hash distribution with realistic long/short session mixes.
- Document replica count × TP degree × PP degree = total GPUs required.
Key takeaways
- Data parallel inference adds full model replicas to scale aggregate throughput, not single-request latency.
- Multi-turn chat requires sticky routing so KV caches stay on the worker that started the session.
- When GPUs are concurrency-saturated but not memory-bound, another replica usually beats wider tensor parallelism.
- Harbor Analytics nearly doubled chat capacity by running two TP=2 replicas instead of one TP=4 group on four H100s.
- Combine DP at the cluster level with TP/PP inside each replica for large models and high traffic.
Related reading
- LLM tensor parallelism for inference explained — sharding within a replica
- LLM pipeline parallelism for inference explained — stage sharding for very large models
- LLM inference serving explained — end-to-end serving stack overview
- LLM admission control explained — gateway queues when all replicas are hot