Guide

LLM data parallel inference explained

Harbor Analytics runs a 70B chat assistant on four H100s configured as tensor parallel (TP=2) across two GPUs per replica. Black Friday traffic doubled concurrent sessions, but P95 time-to-first-token (TTFT) stayed flat while tokens per second per user collapsed: each replica was saturated. The team considered widening TP to four GPUs per replica, which would have halved per-token latency for a single user but left aggregate throughput unchanged on the same four cards. Instead they deployed a second identical TP=2 replica and put a gateway in front that routes new sessions to the least-loaded worker. Total chat throughput nearly doubled; per-user latency at moderate concurrency matched the single-replica baseline. That is data parallel (DP) inference — the simplest scaling lever when the model already fits on one replica.

Data parallelism at inference means running independent full copies of the model, each handling different requests. Unlike pipeline or tensor parallelism, DP does not split layers; it multiplies capacity by adding replicas. The hard parts are routing, session stickiness, KV cache isolation, and knowing when another replica helps more than deeper sharding. This guide covers replica pools, load-balancing policies, DP combined with TP/PP, vLLM and gateway patterns, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

What data parallel inference does

Data parallel inference replicates the entire model (including any internal TP or PP groups) on separate GPU sets. Request A goes to replica 0, request B to replica 1; no cross-replica communication occurs during forward passes. Throughput scales roughly linearly with replica count until a bottleneck appears elsewhere — gateway CPU, network ingress, embedding lookups, or downstream databases.

Compare the three model-parallel modes:

  • Data parallelism — N identical replicas; each serves different requests. Scales aggregate throughput; does not reduce per-request latency on an unloaded replica.
  • Tensor parallelism — one logical model split across GPUs within a replica. Reduces per-replica memory and can improve single-request latency when matmuls were GPU-bound.
  • Pipeline parallelism — layers staged across GPUs within a replica. Fits models too large for one card; needs batch depth to avoid bubbles.

Production stacks almost always combine them: a replica might be TP=2 on two GPUs, and the cluster runs four such replicas for DP=4 effective throughput. Our distributed training guide covers DDP and 3D parallelism during fine-tuning; at inference the same replica math applies, but routing and KV cache lifetime dominate operations.

Replica pools and request routing

When each request needs a home

Stateless completion APIs (single-turn, no server-side history) can use simple round-robin or least-connections routing: any replica is equally valid. Multi-turn chat is stateful: the KV cache for prior tokens lives on the GPU that handled earlier turns. Re-routing turn two to a different replica forces a full prefill replay — wasted compute and higher TTFT. Gateways therefore use session affinity (sticky routing by session_id, user hash, or conversation key) for the lifetime of a conversation, with optional cache migration only when a replica fails.

Load-balancing policies

Common policies, from simplest to most SLO-aware:

  • Round-robin — fine for stateless workloads; poor for sticky chat if sessions have uneven lengths.
  • Least connections / least in-flight — routes to the replica with fewest active sequences; works well with continuous batching when each engine exposes queue depth metrics.
  • Weighted / capacity-aware — assigns weights by GPU count, max context, or measured tok/s; needed when replicas are heterogeneous (one TP=4 replica vs two TP=2 replicas).
  • SLO-aware shedding — when all replicas exceed wait thresholds, return 429 or queue at the gateway per admission control policy instead of overloading workers.

KV isolation and memory

Each replica owns its KV blocks independently. Total cluster KV capacity is the sum of per-replica limits, not shared. Hot sessions on one replica can OOM that worker while siblings sit idle — a sign to improve affinity hashing (spread long conversations) or add replicas. PagedAttention pools in vLLM apply per process; DP does not unify memory across replicas.

DP vs deeper model parallelism

Dimension Add data parallel replicas Increase TP / PP within one replica
Primary goal More concurrent users / aggregate tok/s Fit larger model or cut per-token latency
Single-user latency (low load) Unchanged vs one replica Often improves if one GPU was saturated
Memory per request Same as one replica PP/TP spread weights; KV rules vary
Hardware efficiency at high concurrency High — replicas fill independently TP all-reduce can limit scaling
Operational complexity Routing, stickiness, health checks NCCL topology, stage boundaries
Failure blast radius One replica down — partial capacity loss One GPU down — entire replica dead

Rule of thumb: if profiling shows GPUs underutilized at target concurrency and the model fits comfortably on one replica, add DP replicas before widening TP. If a single request exceeds memory or misses latency SLO on one replica, fix TP/PP/quantization first; DP alone will not help.

Deployment patterns and vLLM

Most engines run one model process (or Ray worker group) per replica. In vLLM, launch N independent servers with identical tensor_parallel_size flags, each binding to disjoint GPU sets, then place nginx, Envoy, or a custom gateway in front. Kubernetes patterns use one Deployment per replica size class or a single Deployment with one pod per replica and anti-affinity on GPU nodes.

  • Homogeneous pools — all replicas same model, same TP; simplest routing and autoscaling.
  • Tiered pools — small model DP pool for cheap queries, large model pool for escalations; pairs with model routing.
  • Prefill/decode split — not DP in the classic sense, but multiple decode replicas behind a router after a shared prefill stage; see prefill-decode disaggregation.
  • Speculative decode — draft and target models can each be DP-scaled independently; acceptance rate affects effective load per replica.

Autoscaling signals should use queue wait and GPU batch utilization, not CPU. Scale replicas when P95 wait exceeds SLO despite healthy per-replica tok/s. Pair with multi-tenant isolation so one tenant cannot pin all sticky sessions on one replica.

Harbor Analytics chat gateway refactor

Harbor's 70B assistant initially ran one vLLM instance with TP=2 on GPUs 0–1. Peak holiday traffic pushed num_running_seqs to the max_num_seqs ceiling; new sessions queued 4–8 seconds at the gateway. Profiling showed GPU tensor-core utilization at 78% — not memory-bound, concurrency-bound. The refactor:

  1. Added a second TP=2 replica on GPUs 2–3 with identical model weights and quantization (fp8 per our FP8 guide).
  2. Deployed an Envoy gateway with consistent-hash routing on conversation_id for sticky sessions and least-request fallback for new sessions.
  3. Exposed per-replica queue depth to the gateway via health endpoints; shed load at 429 when both replicas exceeded 200 ms wait budget.
  4. Kept batch summarization on a separate PP cluster so chat DP scaling did not steal GPUs from offline jobs.

Results: aggregate chat tok/s rose 1.9× at the same P95 TTFT for sessions under 8k context. Long-context sessions (>32k tokens) still skewed load; consistent hashing by user ID instead of conversation ID spread heavy users more evenly. Lesson: measure whether you are replica-bound or shard-bound before widening TP — Harbor was replica-bound.

Technique decision table

Scenario Prefer Avoid
Model fits one GPU; high concurrent chat DP replicas (one GPU each) TP=2 “for throughput”
70B on 2× H100; queue depth maxed Second TP=2 replica (DP=2) TP=4 on four GPUs without traffic increase
Single-user latency SLO missed; low concurrency TP, quantization, or smaller model tier More DP replicas
405B model; cannot fit one replica PP+TP within replica; then DP of replicas DP-only thinking before model fits
Stateless embedding batch API Round-robin DP; no stickiness Session affinity overhead
Uneven conversation lengths Least-in-flight + user-hash spread Naive round-robin on sticky chat
Regional latency requirements Geo-distributed DP pools One global pool with cross-region stickiness

Common pitfalls

  • Sticky sessions without failover plan. Replica death evicts all pinned conversations unless you replay history on another worker.
  • Adding replicas when GPU utilization is low. Fix batching, chunked prefill, or admission control first.
  • Ignoring weight load time. N replicas means N copies loaded from disk or object storage at startup — stagger rollouts.
  • Heterogeneous replicas behind one pool. Mixing TP=2 and TP=4 without weighted routing skews latency.
  • Autoscaling on CPU. LLM workers idle CPU while GPUs saturate; scale on queue depth and GPU metrics.
  • Shared rate limits per replica. Per-tenant caps must be cluster-wide, not per process, or tenants multiply quotas by replica count.
  • DP for memory relief. Replicas duplicate weights; DP does not shrink per-replica footprint.
  • No cross-replica observability. Debugging tail latency requires per-replica tok/s, batch size, and KV usage dashboards.

Production checklist

  • Confirm the model fits on one replica with headroom for peak context and batch depth.
  • Profile GPU utilization at target concurrency — distinguish replica-bound vs shard-bound.
  • Define routing: stateless (round-robin) vs sticky (conversation or user hash).
  • Implement least-in-flight or capacity-aware balancing with per-replica metrics.
  • Plan replica failure: replay prefill on new worker or return graceful retry to client.
  • Isolate replica pools by workload (chat vs batch vs embedding).
  • Align autoscaling triggers with queue wait and GPU batch utilization.
  • Apply cluster-wide tenant quotas and admission control, not per-replica silos.
  • Stagger replica startup to avoid simultaneous weight downloads hammering storage.
  • Validate quantization and TP settings identically across replicas.
  • Load-test sticky hash distribution with realistic long/short session mixes.
  • Document replica count × TP degree × PP degree = total GPUs required.

Key takeaways

  • Data parallel inference adds full model replicas to scale aggregate throughput, not single-request latency.
  • Multi-turn chat requires sticky routing so KV caches stay on the worker that started the session.
  • When GPUs are concurrency-saturated but not memory-bound, another replica usually beats wider tensor parallelism.
  • Harbor Analytics nearly doubled chat capacity by running two TP=2 replicas instead of one TP=4 group on four H100s.
  • Combine DP at the cluster level with TP/PP inside each replica for large models and high traffic.

Related reading