Guide

LLM tensor parallelism for inference explained

Harbor Analytics upgraded its internal summarization tier from a quantized 13B model on one A100 to a full-precision 70B Llama-class model for legal contract review. The quality jump was obvious — fewer missed indemnity clauses, better citation fidelity. The deployment problem was immediate: 70B weights in BF16 need roughly 140 GB of GPU memory before KV cache blocks are allocated. A single 80 GB card cannot hold the model, let alone serve concurrent RAG requests with continuous batching. The fix was tensor parallelism (TP): shard each transformer layer across four GPUs on one node so every device holds a slice of the weight matrices and activations flow through synchronized forward passes.

TP is not free parallelism. Every layer that spans devices triggers all-reduce communication across NVLink or PCIe. At batch size one, that overhead can dominate decode latency. At batch size thirty-two, the same TP group often wins on throughput because matmul FLOPs amortize the comms tax. This guide explains how tensor parallelism shards attention and MLP blocks, how TP differs from pipeline and data parallelism at inference time, how to pick tensor_parallel_size in vLLM, the Harbor Analytics 70B refactor, a technique decision table, pitfalls, and a production checklist.

What tensor parallelism does at inference

Tensor parallelism splits individual weight tensors — typically the query/key/value projections and the MLP up/down matrices — across a group of GPUs that cooperate on one forward pass for one batch of sequences. Each GPU stores a fraction of each layer's parameters and computes a partial result. Before the next sub-layer can run, partial activations must be combined via collective communication (most often all-reduce or all-gather, depending on whether the split is column-parallel or row-parallel).

Contrast this with:

Data parallelism (replica serving) — each GPU holds a full copy of the model and processes different requests. Scales throughput when the model fits on one card; does not help when it does not.
Pipeline parallelism (PP) — different layers live on different GPUs. Reduces per-device memory but introduces pipeline bubbles unless batch depth is high; more common in training than low-latency interactive inference.
Tensor parallelism (TP) — different columns or rows of the same layer live on different GPUs. Cuts per-device weight and activation memory roughly by the TP degree; pays per-layer comms each token step.

For models that exceed single-GPU memory, TP is often the first knob teams turn because it keeps latency predictable for interactive serving — provided the TP group sits on a high-bandwidth interconnect (NVLink within a DGX/HGX node).

Column-parallel vs row-parallel splits

Megatron-style TP (used by vLLM, TensorRT-LLM, and most production stacks) alternates split directions to minimize communication:

Column-parallel linear — weight matrix W is split along output dimension. Each GPU computes part of the output; an all-gather (or equivalent) reassembles the full activation before the next op.
Row-parallel linear — W is split along input dimension. Each GPU multiplies its shard; an all-reduce sums partial outputs into the final activation.

Attention blocks follow the same pattern: QKV projections are column-parallel; the output projection is row-parallel. MLP blocks column-parallel the up-projection and row-parallel the down-projection. The net effect: two collective ops per transformer layer in a standard implementation, each proportional to hidden size and batch token count.

Wider hidden dimensions and longer contexts increase activation sizes moving across the bus — not just weight footprint. That is why TP on PCIe-only multi-GPU desktops often disappoints for decode-heavy chat workloads even when the model technically fits.

TP degree, memory math, and the latency tradeoff

TP degree (often tensor_parallel_size) is the number of GPUs in the TP group. Degree 4 on four A100-80GB cards roughly quarters per-device weight memory for each layer shard. KV cache is typically replicated or partitioned depending on engine implementation; in vLLM, TP shards weights while the scheduler still manages PagedAttention blocks across the group.

Sizing rules of thumb:

Minimum degree to fit — smallest TP such that model weights + peak KV + CUDA overhead fit under GPU memory with headroom (10–15%).
Do not over-shard — TP 8 on an 8B model that fits on one GPU adds comms with no memory benefit.
Batch size interacts with TP — small batches pay nearly full all-reduce cost per token; large batches hide it behind bigger matmuls.
NVLink vs PCIe — TP groups should be confined to one high-bandwidth domain; cross-node TP without InfiniBand is rarely worth it for latency-sensitive APIs.

Teams chasing sub-second time-to-first-token on 70B models sometimes run TP 2 with aggressive FP8 quantization instead of TP 4 in BF16 — trading numeric range for fewer GPUs and less cross-device traffic.

vLLM and production configuration

In vLLM, tensor parallelism is enabled at launch:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 8192

Key interactions:

tensor_parallel_size must divide the GPU count on the node (or match the visible device set).
Combine with chunked prefill when long prompts contend with decode on the same TP group.
Pair with admission control — TP does not increase KV capacity linearly with GPU count; overload still OOMs the group.
For multi-node, some stacks support TP across nodes; verify NCCL topology and expect higher tail latency unless workloads are throughput-bound.

Monitor per-GPU SM utilization and NCCL kernel time. If NCCL dominates decode steps, either reduce TP degree, quantize, or increase batch/concurrency so matmul work grows relative to comms.

Harbor Analytics 70B refactor (worked example)

Harbor's legal summarization pod ran on two DGX nodes. Initial attempt: TP 8 across eight GPUs for Llama-3.1-70B BF16 with 32K context ceiling. Model fit, but p50 decode latency for 200-token completions was 94 ms/token at batch 1 — unacceptable for interactive review UI.

Refactor steps:

Reduced max context to 8K for the interactive tier (batch jobs kept a separate 32K endpoint).
Dropped to TP 4 with FP8 weights (still four GPUs, ~half the weight bytes, tighter memory for KV).
Enabled chunked prefill and capped concurrent sequences via admission control at 24 per TP group.
Routed overflow traffic to the existing 13B replica tier with a latency SLO router.

Result: p50 decode improved to 38 ms/token at batch 1; p95 TTFT under 1.2 s for 6K-token contract prompts. Throughput at batch 16 held 2,400 output tokens/s across the TP group — sufficient for the legal team's peak afternoon load without provisioning a second 70B replica.

Technique decision table

Scenario	Prefer	Avoid
Model exceeds single-GPU memory	TP on NVLink node; quantize if close to fitting	Loading full weights on CPU offload for production APIs
8B–13B fits on one GPU	Single GPU + data-parallel replicas for throughput	TP 2+ “because we have two GPUs”
Interactive chat, batch 1–4	Lowest TP degree that fits; FP8/INT8 if needed	TP 8 on PCIe without batch depth
Offline batch summarization	Higher batch + moderate TP; pipeline parallel for very deep models	Optimizing for TTFT
Multi-tenant SaaS gateway	TP groups behind admission control; replica pools per model size	One oversized TP cluster with no shedding
Training vs inference confusion	Inference TP sized for latency + KV headroom	Copying training 3D-parallel config without re-benchmarking decode

Common pitfalls

TP without fast interconnect. PCIe all-reduces erase decode gains.
Over-sharding small models. Communication dominates; use replicas instead.
Ignoring KV memory at TP degree. Weights fit; cache still OOMs under concurrent long prompts.
Mixing TP groups across models on shared GPUs. MIG and fractional GPU + TP rarely compose cleanly.
Benchmarking prefill only. Decode-heavy chat sees different NCCL costs per token.
Assuming TP doubles throughput. It mainly enables larger models and batch depth; replica scaling adds QPS more linearly.
Skipping quantization before adding GPUs. FP8/AWQ may remove an entire TP hop.
No NCCL topology checks. Wrong NIC binding silently routes collectives through slow paths.

Production checklist

Calculate minimum TP degree from weight bytes + peak KV + safety margin.
Keep TP groups on NVLink-connected GPUs within one host when possible.
Benchmark decode ms/token at batch 1 and batch 16 before locking TP size.
Profile NCCL/all-reduce time per layer alongside matmul utilization.
Set tensor_parallel_size explicitly in vLLM/TGI launch configs.
Pair TP with admission control and chunked prefill for mixed workloads.
Document fallback route to smaller model tier when 70B queue depth spikes.
Validate FP8/quantized TP paths against BF16 quality on golden prompts.
Monitor per-GPU memory; TP does not eliminate KV exhaustion under burst.
Re-run sizing after context-length or model upgrades.
Test failure modes: single GPU drop in TP group should fail fast, not hang.
Compare TP vs multi-replica smaller model on cost-per-quality metric.

Key takeaways

Tensor parallelism shards layers across GPUs so models too large for one card can still serve inference.
Each layer pays collective communication cost — TP hurts more at batch 1 than at high batch throughput.
Pick the smallest TP degree that fits memory; prefer quantization before over-sharding.
NVLink locality matters; cross-PCIe TP is a common latency trap.
Harbor Analytics cut 70B decode latency 60% by moving from TP 8 BF16 to TP 4 FP8 with tighter admission and context caps.