LLM multi-LoRA serving explained
Harbor Analytics ran twelve separate 7B chat replicas, one per product line, because each team had shipped a LoRA fine-tune merged into its own checkpoint. GPU memory on the inference cluster sat at 94% utilization, cold starts took 38 seconds per replica, and a single tenant spike could not borrow spare capacity from quiet neighbors. The platform team consolidated to one frozen base model with 47 dynamically loaded LoRA adapters on a shared vLLM fleet. VRAM use fell 68%, median time-to-first-token on cold tenants dropped 41%, and adapter routing errors went to zero once the gateway published explicit `lora_request` metadata instead of guessing from API keys alone.
Multi-LoRA serving keeps base transformer weights in GPU memory once while applying different low-rank delta matrices per request or per batch slot. It is the production answer to “we have thirty tone adapters but cannot afford thirty full models.” This guide covers adapter routing and registry design, how continuous batching mixes LoRA IDs in one forward pass, vLLM configuration knobs (`max_loras`, `max_lora_rank`, `lora_modules`), memory and eviction policies, interaction with multi-tenant isolation, the Harbor Analytics gateway refactor, a technique decision table versus merged adapters and full replicas, pitfalls, and a production checklist.
Why multi-LoRA instead of merged weights or replicas
After fine-tuning, teams typically choose among three deployment shapes:
| Approach | VRAM per variant | Swap latency | Best when |
|---|---|---|---|
| Full model replica | 100% of base (e.g. 14 GB for 7B FP16) | Process restart or second GPU | Adapters diverge so far that LoRA rank cannot fit |
| Merged LoRA into base weights | 100% of base per merged checkpoint | Load new checkpoint file | One stable production adapter; no per-request switching |
| Multi-LoRA sidecar serving | Base once + small delta per active adapter | Microseconds to milliseconds (in-GPU) | Many tenants, frequent A/B adapters, shared base |
Multi-LoRA wins when adapter count is high, each delta is small (rank 8–64), and traffic is bursty across tenants. It loses when adapters need ranks above engine limits, target different base model revisions, or require incompatible quantization schemes. The math is simple: one 7B base at 14 GB plus forty rank-16 adapters at roughly 50–120 MB each beats forty merged 14 GB checkpoints by an order of magnitude — if the scheduler can batch mixed adapters efficiently.
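The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that plugs in the rough figures quoted in the paragraph (14 GB base, forty adapters, 50 to 120 MB each); it counts weights only and ignores KV cache and runtime overhead, and the function names are illustrative.

```python
# Worked arithmetic for the merged-vs-multi-LoRA comparison.
# All inputs are the rough figures quoted in the text, not measurements.
BASE_GB = 14.0                 # one 7B base in FP16
NUM_ADAPTERS = 40
ADAPTER_MB_RANGE = (50, 120)   # rough size of one rank-16 adapter


def merged_total_gb(num_adapters: int, base_gb: float) -> float:
    """Every merged checkpoint carries a full copy of the base weights."""
    return num_adapters * base_gb


def multi_lora_total_gb(num_adapters: int, base_gb: float, adapter_mb: float) -> float:
    """One shared base plus one small delta per resident adapter."""
    return base_gb + num_adapters * adapter_mb / 1000.0


merged = merged_total_gb(NUM_ADAPTERS, BASE_GB)
low = multi_lora_total_gb(NUM_ADAPTERS, BASE_GB, ADAPTER_MB_RANGE[0])
high = multi_lora_total_gb(NUM_ADAPTERS, BASE_GB, ADAPTER_MB_RANGE[1])

print(f"merged checkpoints: {merged:.0f} GB")
print(f"multi-LoRA weights: {low:.1f} to {high:.1f} GB")
print(f"ratio:              {merged / high:.0f}x to {merged / low:.0f}x")
```

With these inputs the weight footprint is 560 GB for merged checkpoints against roughly 16 to 19 GB for the shared base plus adapters, which is where the order-of-magnitude claim comes from.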
Adapter registry and request routing
Serving starts with a registry that maps stable adapter IDs to artifact paths, base model hashes, rank, alpha, and target module lists. The inference gateway, not the client SDK, should resolve tenant identity to `adapter_id`. Letting callers pass arbitrary filesystem paths is an injection and isolation failure.
A minimal OpenAI-compatible extension passes LoRA metadata on each completion:
    POST /v1/chat/completions
    {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "messages": [...],
      "extra_body": {
        "lora_request": { "lora_name": "harbor-legal-v3", "lora_int_id": 12 }
      }
    }
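vLLM maps `lora_int_id` to loaded GPU tensors. In the stock vLLM OpenAI-compatible server, an adapter registered through `--lora-modules` is selected by sending its registered name as `model`, so treat the `lora_request` field as a gateway-level extension that the gateway translates before forwarding. Your gateway should maintain the int ID table, validate that the requested adapter was trained against the served base revision (hash match), and reject cross-base requests before they hit the scheduler. Version pins belong in the registry: deploying `harbor-legal-v3` while `harbor-legal-v2` still receives 5% canary traffic requires two slots in `max_loras`, not a silent overwrite.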
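The registry and gateway checks described above might look like the following in-memory sketch. The class names, the tenant key, the hash strings, and the artifact path are all hypothetical; a production registry would live in a database and sit behind the gateway's auth layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterRecord:
    adapter_id: str     # stable name, e.g. "harbor-legal-v3"
    lora_int_id: int    # integer handle the engine uses
    path: str           # artifact location, never exposed to clients
    base_hash: str      # base revision the adapter was trained against
    rank: int


class AdapterRegistry:
    """Hypothetical in-memory registry owned by the gateway."""

    def __init__(self, served_base_hash: str, max_lora_rank: int):
        self.served_base_hash = served_base_hash
        self.max_lora_rank = max_lora_rank
        self._by_tenant = {}   # tenant -> AdapterRecord
        self._int_ids = {}     # lora_int_id -> adapter_id

    def register(self, tenant: str, record: AdapterRecord) -> None:
        # Reject cross-base adapters at ingest, before any request sees them.
        if record.base_hash != self.served_base_hash:
            raise ValueError(f"{record.adapter_id}: base hash mismatch")
        if record.rank > self.max_lora_rank:
            raise ValueError(f"{record.adapter_id}: rank exceeds engine ceiling")
        # An int ID may not be silently reused by a different adapter version.
        owner = self._int_ids.get(record.lora_int_id)
        if owner is not None and owner != record.adapter_id:
            raise ValueError(f"lora_int_id {record.lora_int_id} already used by {owner}")
        self._int_ids[record.lora_int_id] = record.adapter_id
        self._by_tenant[tenant] = record

    def resolve(self, tenant: str) -> dict:
        """Map an authenticated tenant to LoRA metadata; the path stays private."""
        record = self._by_tenant.get(tenant)
        if record is None:
            raise KeyError(f"no adapter registered for tenant {tenant!r}")
        return {"lora_name": record.adapter_id, "lora_int_id": record.lora_int_id}


registry = AdapterRegistry(served_base_hash="sha256:base-rev-a", max_lora_rank=64)
registry.register(
    "tenant-legal",
    AdapterRecord("harbor-legal-v3", 12, "/adapters/harbor-legal-v3", "sha256:base-rev-a", 16),
)
print(registry.resolve("tenant-legal"))
```

Because `resolve` takes only the tenant identity, a caller has no way to name a filesystem path, and a canary version needs its own int ID rather than overwriting the production one.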
vLLM multi-LoRA configuration
vLLM loads a frozen base through its standard engine path, then attaches LoRA layers dynamically. Key server flags:
- `--enable-lora`: turns on LoRA kernels in the worker.
- `--max-loras`: maximum distinct adapters resident in GPU memory simultaneously (default 1). Set to peak concurrent adapter count plus headroom for canaries.
- `--max-lora-rank`: engine startup ceiling; requests above this rank fail fast. Must be ≥ the largest adapter you register.
- `--lora-modules name=path ...`: pre-register adapters at startup so the first request avoids disk I/O.
- `--max-cpu-loras`: CPU-staged adapters for swap-in when GPU slots are full; trades host RAM for fewer OOM kills.
Pair multi-LoRA with PagedAttention and CUDA graphs on decode: adapter switching adds gather-scatter work in attention projections, so shape buckets should include LoRA ID dimensions where the engine supports it. After enabling graphs, run regression tests on mixed-adapter batches: a graph captured only for adapter A silently produces wrong results for adapter B if bucket keys omit LoRA state.
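One way to keep these flags consistent with the capacity plan is to generate the launch command from a validated config. The sketch below is illustrative: the `LoraServeConfig` class and the adapter paths are hypothetical, while the flags are the ones listed above. The check that `max_cpu_loras` is at least `max_loras` reflects a constraint vLLM itself enforces, to the best of my knowledge.

```python
from dataclasses import dataclass, field


@dataclass
class LoraServeConfig:
    base_model: str
    max_loras: int
    max_lora_rank: int
    max_cpu_loras: int
    # adapter name -> (artifact path, adapter rank); paths are hypothetical
    adapters: dict = field(default_factory=dict)


def build_serve_args(cfg: LoraServeConfig) -> list:
    """Validate the capacity plan, then emit a `vllm serve` argument list."""
    if cfg.max_cpu_loras < cfg.max_loras:
        raise ValueError("max_cpu_loras must be >= max_loras")
    for name, (_path, rank) in cfg.adapters.items():
        if rank > cfg.max_lora_rank:
            raise ValueError(
                f"{name}: rank {rank} exceeds max_lora_rank {cfg.max_lora_rank}"
            )

    args = [
        "vllm", "serve", cfg.base_model,
        "--enable-lora",
        "--max-loras", str(cfg.max_loras),
        "--max-lora-rank", str(cfg.max_lora_rank),
        "--max-cpu-loras", str(cfg.max_cpu_loras),
    ]
    if cfg.adapters:
        # Pre-register adapters so the first request avoids disk I/O.
        args.append("--lora-modules")
        args.extend(f"{name}={path}" for name, (path, _rank) in cfg.adapters.items())
    return args


cfg = LoraServeConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    max_loras=16,
    max_lora_rank=64,
    max_cpu_loras=64,
    adapters={
        "harbor-legal-v3": ("/adapters/harbor-legal-v3", 16),
        "harbor-support": ("/adapters/harbor-support", 32),
    },
)
serve_args = build_serve_args(cfg)
print(" ".join(serve_args))
```

Failing at config-build time when an adapter's rank exceeds the ceiling is cheaper than discovering it on the first request to a freshly started worker.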
Batch scheduling with mixed adapters
Continuous batching adds and removes sequences every decode step. With LoRA, each sequence carries a `lora_id` tensor alongside its KV cache block table. The scheduler can mix adapters in one batch when the backend implements grouped GEMM: sequences sharing an adapter run one fused kernel; sequences with different adapters execute back-to-back micro-groups within the same CUDA stream.
Throughput implications:
- Homogeneous batches (one adapter) achieve near-native speed — ideal for large tenant spikes.
- Heterogeneous batches pay adapter-switch overhead proportional to the number of distinct IDs in the batch. Coalescing helps: a short queue wait (1–3 ms) to batch two `harbor-support` requests beats immediate dispatch into a batch already running `harbor-legal`.
- Adapter locality matters for CPU-staged LoRAs: thrashing fifty adapters with `max_loras=8` causes PCIe reload storms. LRU eviction of GPU slots should consider request rate, not just last use.
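The last bullet suggests weighting eviction by request rate rather than recency alone. A minimal sketch of that idea follows, assuming an exponentially decayed request counter per resident adapter; the class, the half-life value, and the adapter names are hypothetical, and this is not how vLLM's own slot manager is implemented.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotStats:
    last_used: float = 0.0
    rate: float = 0.0   # exponentially decayed request count


class RateAwareSlots:
    """Tracks GPU-resident adapters and evicts the one with the lowest decayed rate."""

    def __init__(self, max_loras: int, half_life_s: float = 60.0):
        self.max_loras = max_loras
        self.decay = math.log(2) / half_life_s
        self.resident = {}   # adapter_id -> SlotStats

    def _decayed_rate(self, stats: SlotStats, now: float) -> float:
        return stats.rate * math.exp(-self.decay * (now - stats.last_used))

    def touch(self, adapter_id: str, now: float) -> Optional[str]:
        """Record a request; return the adapter evicted to make room, if any."""
        evicted = None
        if adapter_id not in self.resident:
            if len(self.resident) >= self.max_loras:
                evicted = min(
                    self.resident,
                    key=lambda a: self._decayed_rate(self.resident[a], now),
                )
                del self.resident[evicted]
            self.resident[adapter_id] = SlotStats(last_used=now, rate=0.0)
        stats = self.resident[adapter_id]
        stats.rate = self._decayed_rate(stats, now) + 1.0
        stats.last_used = now
        return evicted


slots = RateAwareSlots(max_loras=2)
for t in range(10):
    slots.touch("harbor-support", now=float(t))    # hot adapter, last used at t=9
slots.touch("harbor-legal", now=10.0)              # cold adapter, used most recently
victim = slots.touch("harbor-sales", now=11.0)     # needs a slot
print(victim)
```

In this run plain LRU would evict `harbor-support` because it was touched least recently, whereas the rate-aware policy evicts `harbor-legal`, which has served a single request.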
Instrument lora_swaps_per_second, distinct_loras_per_batch,
and P50/P99 decode latency stratified by adapter count. Harbor Analytics found
P99 decode blew out when distinct adapters per batch exceeded four on A10G
hardware — their fix was tenant-aware micro-batching windows, not more
GPUs.
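A tenant-aware batching window can be approximated by deciding, at the end of each short wait, which queued requests to admit so that the number of distinct adapters stays capped. The following is a sketch under assumed names (`Pending`, `form_batch`, the caps); it models only the admission decision, not the timer or the engine's scheduler.

```python
from collections import defaultdict
from typing import NamedTuple


class Pending(NamedTuple):
    request_id: str
    adapter_id: str


def form_batch(pending, running_adapters, max_batch, max_distinct):
    """Pick requests for the next step while capping distinct adapters.

    Adapters already running in the batch are admitted first, then the
    adapters with the deepest queues, until max_distinct is reached.
    Requests that are not picked stay queued for the next window.
    """
    by_adapter = defaultdict(list)
    for req in pending:
        by_adapter[req.adapter_id].append(req)

    # Running adapters first, then deepest queue, then name for determinism.
    order = sorted(
        by_adapter,
        key=lambda a: (a not in running_adapters, -len(by_adapter[a]), a),
    )

    chosen = set(running_adapters)
    batch = []
    for adapter in order:
        if len(batch) >= max_batch:
            break
        if adapter not in chosen:
            if len(chosen) >= max_distinct:
                continue
            chosen.add(adapter)
        room = max_batch - len(batch)
        batch.extend(by_adapter[adapter][:room])
    return batch


pending = [
    Pending("r1", "harbor-legal"),
    Pending("r2", "harbor-support"),
    Pending("r3", "harbor-support"),
    Pending("r4", "harbor-sales"),
    Pending("r5", "harbor-billing"),
]
batch = form_batch(pending, {"harbor-legal"}, max_batch=4, max_distinct=2)
print([r.request_id for r in batch])
```

Here the two `harbor-support` requests join the running `harbor-legal` batch, and the single sales and billing requests wait for a later window. A real scheduler would also need an aging rule so low-volume tenants are not starved.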
Memory accounting
Budget VRAM as:
total = base_weights + kv_pool + Σ(active_adapter_i) + cuda_graph_overhead
Base weights dominate (quantized INT4/FP8 bases shrink this term). KV pool scales with concurrent sequences and context length — adapters do not duplicate KV cache per tenant. Each active LoRA adds roughly:
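adapter_bytes ≈ 2 × rank × d_model × num_target_modules × bytes_per_param
Here num_target_modules counts every adapted projection across all layers (for example, 4 attention projections × 32 layers = 128), and the factor of 2 covers the A and B matrices of a roughly square projection. Divide by 10^6 for MB.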
For rank 16 on all attention projections in a 32-layer 7B model, expect
40–90 MB per adapter in FP16. max_loras is therefore a
hard capacity plan: eight rank-16 adapters might cost under 1 GB — cheap
compared to an extra base replica. Watch max_lora_rank padding:
vLLM allocates for the configured maximum rank even when the loaded adapter
uses rank 8, wasting slack if ranks are heterogeneous.
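The adapter term and the padding effect can be worked through numerically. This sketch assumes `d_model = 4096`, 32 layers, four square attention projections per layer, and FP16 storage; those shapes are typical of a 7B Llama-style model but are assumptions here, and projections that are not square (MLP layers, grouped-query K/V) will deviate from the estimate.

```python
MB = 1_000_000


def adapter_bytes(rank, d_model, num_target_modules, bytes_per_param=2):
    """Size of one adapter: an A (rank x d_model) and B (d_model x rank) per module."""
    return 2 * rank * d_model * num_target_modules * bytes_per_param


def resident_adapter_bytes(max_loras, max_lora_rank, d_model,
                           num_target_modules, bytes_per_param=2):
    """GPU slots are sized for max_lora_rank, whatever rank is actually loaded."""
    return max_loras * adapter_bytes(
        max_lora_rank, d_model, num_target_modules, bytes_per_param
    )


D_MODEL = 4096          # assumed hidden size
ATTN_MODULES = 4 * 32   # q, k, v, o projections across 32 layers (assumed)

one_rank16 = adapter_bytes(16, D_MODEL, ATTN_MODULES)
eight_at_16 = resident_adapter_bytes(8, 16, D_MODEL, ATTN_MODULES)
eight_at_64 = resident_adapter_bytes(8, 64, D_MODEL, ATTN_MODULES)

print(f"one rank-16 adapter:           {one_rank16 / MB:.1f} MB")
print(f"8 slots, max_lora_rank=16:     {eight_at_16 / MB:.0f} MB")
print(f"8 slots, max_lora_rank=64:     {eight_at_64 / MB:.0f} MB")
```

Under these assumptions an attention-only rank-16 adapter comes out near 34 MB, a little below the range quoted above; adapters that also target the wider MLP projections land higher. Raising `max_lora_rank` from 16 to 64 multiplies the reserved slot memory by four even if every loaded adapter is still rank 16.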
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| 40+ tenant tone adapters on one base | Multi-LoRA serving with registry + gateway routing | Full replica per tenant |
| Single production adapter, no A/B | Merge LoRA into base; simplest ops | Runtime adapter lookup overhead |
| Adapters on different base revisions | Separate engine per base hash | One multi-LoRA pool |
| Rank > 128 or full-layer fine-tune | Small dedicated model or merged checkpoint | Low-rank serving assumptions |
| Strict tenant isolation audits | Per-tenant engine or crypto-separated keys + quota | Shared batch without audit logs |
| Latency-sensitive, one hot adapter | Pin hot adapter + homogeneous batching | Low max_loras with constant churn |
Harbor Analytics refactor
The gateway consolidation shipped these changes:
- Retired twelve merged 7B replicas; one FP8 base on four A10G workers.
- Central adapter registry in Postgres: ID, path, base hash, rank, owner team, canary weight.
- Envoy ext_authz resolves API key to `lora_int_id`; clients never send raw paths.
- `max_loras=16`, `max_lora_rank=64`, `max_cpu_loras=64` with LRU GPU promotion.
- Tenant-aware 2 ms batching window reduced distinct adapters per batch from 5.1 to 2.4 average.
- Dashboards: swap rate, adapter residency time, decode latency by heterogeneity bucket.
Post-migration, cluster GPU memory utilization stabilized at 61% with headroom for traffic spikes; support tickets tagged “wrong tone/model” fell 83% after routing fixes alone — no retraining required.
Pitfalls
- Base hash mismatch. Adapter trained on Llama 3.1 revision A applied to revision B produces fluent nonsense; enforce hash checks at registry ingest.
- Client-supplied adapter paths. Path traversal and cross-tenant leakage; gateway must own ID resolution.
- max_loras too low for canaries. Blue/green deploy evicts production adapters mid-request.
- CUDA graphs without LoRA bucket keys. Silent wrong-token generation on mixed batches.
- Ignoring heterogeneous batch penalty. P50 looks fine while P99 kills SLAs; stratify metrics by adapter count.
- Merging for serving then also loading sidecar. Double application of deltas amplifies weights past training scale.
- No audit log per adapter invocation. Compliance teams cannot reconstruct which tenant model answered a given prompt.
Production checklist
- Publish adapter registry with base model hash and rank metadata.
- Resolve tenant to `lora_int_id` in gateway; block raw path passthrough.
- Set `max_lora_rank` to fleet maximum; pad consciously.
- Size `max_loras` for peak concurrent adapters plus canary slots.
- Pre-register hot adapters via `--lora-modules` at worker start.
- Enable CPU staging (`max_cpu_loras`) before hard OOM on swap.
- Instrument swaps, residency time, and latency by batch heterogeneity.
- Regression-test CUDA graphs on mixed-adapter decode batches.
- Document merge-vs-serve policy; never apply both to the same request.
- Log adapter ID on every completion for audit and debug.
Key takeaways
- Multi-LoRA serving loads one base and many small adapter deltas — far cheaper than full replicas when ranks stay low.
- Gateway-owned adapter routing prevents wrong-tenant responses and path injection.
- Continuous batching can mix adapters, but heterogeneous batches tax P99 decode latency.
- vLLM `max_loras`, rank ceilings, and CPU staging are capacity-planning knobs, not afterthoughts.
- Harbor Analytics cut GPU memory 68% by consolidating twelve replicas into one multi-LoRA fleet.
Related reading
- LoRA fine-tuning explained — training adapters you later serve sidecar
- vLLM fundamentals explained — engine baseline for multi-LoRA workers
- LLM continuous batching explained — mixed-adapter batch scheduling
- LLM multi-tenant isolation explained — quota and audit patterns atop shared bases