Guide

LLM multi-LoRA serving explained

Harbor Analytics ran twelve separate 7B chat replicas — one per product line — because each team had shipped a LoRA fine-tune merged into its own checkpoint. GPU memory on the inference cluster sat at 94% utilization, cold starts took 38 seconds per replica, and a single tenant spike could not borrow spare capacity from quiet neighbors. The platform team consolidated to one frozen base model with 47 dynamically loaded LoRA adapters on a shared vLLM fleet. VRAM use fell 68%, median time-to-first-token on cold tenants dropped 41%, and adapter routing errors went to zero once the gateway published explicit lora_request metadata instead of guessing from API keys alone.

Multi-LoRA serving keeps base transformer weights in GPU memory once while applying different low-rank delta matrices per request or per batch slot. It is the production answer to “we have thirty tone adapters but cannot afford thirty full models.” This guide covers adapter routing and registry design, how continuous batching mixes LoRA IDs in one forward pass, vLLM configuration knobs (max_loras, max_lora_rank, lora_modules), memory and eviction policies, interaction with multi-tenant isolation, the Harbor Analytics gateway refactor, a technique decision table versus merged adapters and full replicas, pitfalls, and a production checklist.

Why multi-LoRA instead of merged weights or replicas

After fine-tuning, teams typically choose among three deployment shapes:

Approach	VRAM per variant	Swap latency	Best when
Full model replica	100% of base (e.g. 14 GB for 7B FP16)	Process restart or second GPU	Adapters diverge so far that LoRA rank cannot fit
Merged LoRA into base weights	100% of base per merged checkpoint	Load new checkpoint file	One stable production adapter; no per-request switching
Multi-LoRA sidecar serving	Base once + small delta per active adapter	Microseconds to milliseconds (in-GPU)	Many tenants, frequent A/B adapters, shared base

Multi-LoRA wins when adapter count is high, each delta is small (rank 8–64), and traffic is bursty across tenants. It loses when adapters need ranks above engine limits, target different base model revisions, or require incompatible quantization schemes. The math is simple: one 7B base at 14 GB plus forty rank-16 adapters at roughly 50–120 MB each beats forty merged 14 GB checkpoints by an order of magnitude — if the scheduler can batch mixed adapters efficiently.
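The order-of-magnitude claim can be checked with back-of-envelope arithmetic. The sketch below uses only the figures from the paragraph above (a 14 GB base, forty adapters, 50–120 MB each) and counts weights only, so KV cache and runtime overhead are deliberately left out; the function names are illustrative.

```python
BASE_GB = 14.0       # 7B base in FP16, as in the table above
NUM_ADAPTERS = 40


def multi_lora_gb(base_gb: float, n: int, adapter_mb: float) -> float:
    """Weights-only footprint: one shared base plus n adapter deltas."""
    return base_gb + n * adapter_mb / 1024


def merged_gb(base_gb: float, n: int) -> float:
    """Weights-only footprint: one full merged checkpoint per adapter."""
    return base_gb * n


low = multi_lora_gb(BASE_GB, NUM_ADAPTERS, 50)    # small adapters
high = multi_lora_gb(BASE_GB, NUM_ADAPTERS, 120)  # large adapters
merged = merged_gb(BASE_GB, NUM_ADAPTERS)

print(f"multi-LoRA: {low:.1f} to {high:.1f} GB, merged: {merged:.0f} GB")
print(f"merged is {merged / high:.0f}x to {merged / low:.0f}x larger")
```

Even at the pessimistic 120 MB per adapter, the shared-base layout stays under 19 GB of weights against 560 GB for forty merged checkpoints.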

Adapter registry and request routing

Serving starts with a registry that maps stable adapter IDs to artifact paths, base model hashes, rank, alpha, and target module lists. The inference gateway — not the client SDK — should resolve tenant identity to adapter_id. Letting callers pass arbitrary filesystem paths is an injection and isolation failure.

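Stock vLLM's OpenAI-compatible server selects a registered adapter by putting its name in the model field. A gateway can instead expose a minimal extension that passes explicit LoRA metadata on each completion: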

POST /v1/chat/completions
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [...],
  "extra_body": {
    "lora_request": { "lora_name": "harbor-legal-v3", "lora_int_id": 12 }
  }
}

vLLM maps lora_int_id to loaded GPU tensors. Your gateway should maintain the int ID table, validate that the requested adapter was trained against the served base revision (hash match), and reject cross-base requests before they hit the scheduler. Version pins belong in the registry: deploying harbor-legal-v3 while harbor-legal-v2 still receives 5% canary traffic requires two slots in max_loras, not a silent overwrite.
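A registry with gateway-side resolution might look like the following minimal sketch. The class names, the tenant keys, the hash strings, and the artifact path are all hypothetical; the only points it illustrates are that compatibility is enforced at ingest and that callers never supply a path.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AdapterRecord:
    adapter_id: str    # stable name, e.g. "harbor-legal-v3"
    lora_int_id: int   # integer handle passed to the engine
    path: str          # artifact location, never supplied by clients
    base_hash: str     # hash of the base revision the adapter was trained on
    rank: int


class AdapterResolutionError(Exception):
    pass


class Registry:
    def __init__(self, served_base_hash: str, max_lora_rank: int):
        self.served_base_hash = served_base_hash
        self.max_lora_rank = max_lora_rank
        self._by_tenant: Dict[str, AdapterRecord] = {}

    def register(self, tenant: str, rec: AdapterRecord) -> None:
        # Reject incompatible adapters at ingest, before any request can use them.
        if rec.base_hash != self.served_base_hash:
            raise AdapterResolutionError(f"{rec.adapter_id}: base hash mismatch")
        if rec.rank > self.max_lora_rank:
            raise AdapterResolutionError(f"{rec.adapter_id}: rank above ceiling")
        self._by_tenant[tenant] = rec

    def resolve(self, tenant: str) -> dict:
        # The gateway maps tenant identity to adapter metadata; no path is exposed.
        rec = self._by_tenant.get(tenant)
        if rec is None:
            raise AdapterResolutionError(f"no adapter for tenant {tenant!r}")
        return {"lora_name": rec.adapter_id, "lora_int_id": rec.lora_int_id}


registry = Registry(served_base_hash="sha256:aaaa", max_lora_rank=64)
registry.register(
    "legal-team",
    AdapterRecord("harbor-legal-v3", 12, "/adapters/harbor-legal-v3", "sha256:aaaa", 16),
)
print(registry.resolve("legal-team"))
```

A canary version would be a second record with its own lora_int_id, which is why it needs its own slot rather than reusing the production one.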

vLLM multi-LoRA configuration

vLLM loads a frozen base through its standard engine path, then attaches LoRA layers dynamically. Key server flags:

--enable-lora: turns on LoRA kernels in the worker.
--max-loras: maximum distinct adapters resident in GPU memory simultaneously (default 1). Set to peak concurrent adapter count plus headroom for canaries.
--max-lora-rank: ceiling fixed at engine startup; adapters whose rank exceeds it are rejected. Must be ≥ the largest adapter you register.
--lora-modules name=path ...: pre-registers adapters at startup so the first request avoids disk I/O.
--max-cpu-loras: number of adapters staged in CPU memory for swap-in when GPU slots are full; must be ≥ --max-loras. Trades host RAM for fewer OOM kills.
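One way to keep these flags consistent is to generate the launch command from a single validated config object. The sketch below is an assumption-laden illustration: it assumes the `vllm serve` entry point, the config class and adapter paths are hypothetical, and the values mirror the Harbor Analytics settings described later in this guide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LoraServerConfig:
    model: str
    max_loras: int
    max_lora_rank: int
    max_cpu_loras: Optional[int] = None
    lora_modules: Dict[str, str] = field(default_factory=dict)  # name -> path


def build_args(cfg: LoraServerConfig, adapter_ranks: Dict[str, int]) -> List[str]:
    """Return an argv list, failing fast on settings that contradict each other."""
    too_big = {n: r for n, r in adapter_ranks.items() if r > cfg.max_lora_rank}
    if too_big:
        raise ValueError(f"adapters exceed max_lora_rank={cfg.max_lora_rank}: {too_big}")
    if cfg.max_cpu_loras is not None and cfg.max_cpu_loras < cfg.max_loras:
        raise ValueError("max_cpu_loras should be at least max_loras")
    args = ["vllm", "serve", cfg.model, "--enable-lora",
            "--max-loras", str(cfg.max_loras),
            "--max-lora-rank", str(cfg.max_lora_rank)]
    if cfg.max_cpu_loras is not None:
        args += ["--max-cpu-loras", str(cfg.max_cpu_loras)]
    if cfg.lora_modules:
        args.append("--lora-modules")
        args += [f"{name}={path}" for name, path in cfg.lora_modules.items()]
    return args


cfg = LoraServerConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_loras=16,
    max_lora_rank=64,
    max_cpu_loras=64,
    lora_modules={"harbor-legal-v3": "/adapters/harbor-legal-v3"},  # hypothetical path
)
args = build_args(cfg, adapter_ranks={"harbor-legal-v3": 16})
print(" ".join(args))
```

Checking adapter ranks against the ceiling here moves the failure from the first live request to deploy time.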

Pair multi-LoRA with PagedAttention and CUDA graphs on decode: adapter switching adds gather-scatter work in attention projections, so shape buckets should include LoRA ID dimensions where the engine supports it. After enabling graphs, run regression tests on mixed-adapter batches. A graph captured only for adapter A can silently produce wrong results for adapter B if the bucket keys omit LoRA state.
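A regression check of this kind can be a plain output comparison between a trusted path and the graph-enabled path. The harness below is a sketch, not an engine API: the two generate callables are stand-ins you would wire to your own eager and graph-enabled deployments, and exact string equality assumes greedy decoding (a tolerance on logprobs may be more realistic).

```python
from typing import Callable, List, Tuple

Batch = List[Tuple[str, str]]            # (adapter_name, prompt) pairs
Generate = Callable[[Batch], List[str]]  # one completion per pair, same order


def find_graph_mismatches(reference: Generate, candidate: Generate, batch: Batch) -> List[dict]:
    """Return every batch entry where the candidate path disagrees with the reference."""
    ref_out = reference(batch)
    cand_out = candidate(batch)
    return [
        {"adapter": adapter, "prompt": prompt, "expected": r, "got": c}
        for (adapter, prompt), r, c in zip(batch, ref_out, cand_out)
        if r != c
    ]


# Stand-in generators used only to exercise the harness.
def fake_eager(batch: Batch) -> List[str]:
    return [f"{adapter}:{prompt}" for adapter, prompt in batch]


def fake_buggy_graph(batch: Batch) -> List[str]:
    first_adapter = batch[0][0]  # simulates a graph that baked in one adapter
    return [f"{first_adapter}:{prompt}" for _, prompt in batch]


mixed = [("harbor-legal", "q1"), ("harbor-support", "q2")]
mismatches = find_graph_mismatches(fake_eager, fake_buggy_graph, mixed)
print(mismatches)
```

The important property is that the test batch is genuinely mixed; a homogeneous batch would pass even with the simulated bug.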

Batch scheduling with mixed adapters

Continuous batching adds and removes sequences every decode step. With LoRA, each sequence carries a lora_id alongside its KV cache block table. The scheduler can mix adapters in one batch when the backend implements grouped GEMM: sequences sharing an adapter run one fused kernel; sequences with different adapters execute back-to-back micro-groups within the same CUDA stream.
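The bookkeeping behind micro-groups is small enough to sketch in plain Python. This shows only the grouping step (which batch slots share which adapter), not the GPU kernel; the function name and the use of None for base-only sequences are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def micro_groups(batch: List[Tuple[int, Optional[int]]]) -> Dict[Optional[int], List[int]]:
    """Map each lora_id (None = base model only) to the batch slots that use it."""
    groups: Dict[Optional[int], List[int]] = defaultdict(list)
    for slot, (_seq_id, lora_id) in enumerate(batch):
        groups[lora_id].append(slot)
    return dict(groups)


# (sequence_id, lora_id) for each slot in one decode step
batch = [(101, 12), (102, 7), (103, 12), (104, None)]
groups = micro_groups(batch)
print(groups)
print("distinct adapters in batch:", sum(1 for k in groups if k is not None))
```

Each group's slot list is what a grouped kernel would gather before applying that adapter's low-rank update, and the number of non-base groups is the heterogeneity figure worth exporting as a metric.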

Throughput implications:

Homogeneous batches (one adapter) achieve near-native speed — ideal for large tenant spikes.
Heterogeneous batches pay adapter-switch overhead proportional to the number of distinct IDs in the batch. Coalescing helps: a short queue wait (1–3 ms) to batch two `harbor-support` requests beats immediate dispatch into a batch already running `harbor-legal`.
Adapter locality matters for CPU-staged LoRAs: thrashing fifty adapters with max_loras=8 causes PCIe reload storms. LRU eviction of GPU slots should consider request rate, not just last use.
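The last point, eviction that weighs request rate as well as recency, can be sketched as follows. This is one possible scoring rule under stated assumptions, not vLLM's eviction logic: the class, the exponentially weighted rate, and the score formula are all illustrative.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SlotStats:
    last_used: float         # seconds on a monotonic clock
    rate_ewma: float = 0.0   # smoothed requests per second


class RateAwareSlots:
    def __init__(self, max_loras: int, alpha: float = 0.2):
        self.max_loras = max_loras
        self.alpha = alpha
        self.slots: Dict[str, SlotStats] = {}

    def touch(self, adapter: str, now: Optional[float] = None) -> Optional[str]:
        """Record a request; return the adapter evicted to make room, if any."""
        now = time.monotonic() if now is None else now
        evicted = None
        stats = self.slots.get(adapter)
        if stats is None:
            if len(self.slots) >= self.max_loras:
                evicted = self._victim(now)
                del self.slots[evicted]
            self.slots[adapter] = SlotStats(last_used=now)
        else:
            gap = max(now - stats.last_used, 1e-6)
            stats.rate_ewma = (1 - self.alpha) * stats.rate_ewma + self.alpha / gap
            stats.last_used = now
        return evicted

    def _victim(self, now: float) -> str:
        # Lowest score is evicted: a high request rate and recent use both protect a slot.
        def score(name: str) -> float:
            s = self.slots[name]
            return s.rate_ewma / (1.0 + now - s.last_used)
        return min(self.slots, key=score)


slots = RateAwareSlots(max_loras=2)
for t in (0.0, 0.1, 0.2):
    slots.touch("hot", now=t)        # frequently requested adapter
slots.touch("cold", now=0.3)         # requested once, but most recently
evicted = slots.touch("new", now=0.4)
print("evicted:", evicted)
```

In this trace plain LRU would evict "hot" because "cold" was touched later, while the rate-aware score evicts "cold". A production policy would also need a grace period so that newly loaded adapters are not always the first victims.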

Instrument lora_swaps_per_second, distinct_loras_per_batch, and P50/P99 decode latency stratified by adapter count. Harbor Analytics found P99 decode blew out when distinct adapters per batch exceeded four on A10G hardware — their fix was tenant-aware micro-batching windows, not more GPUs.
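Stratifying latency by adapter count needs nothing more than bucketing samples before computing percentiles. The sketch below assumes you already collect one (distinct_loras_per_batch, decode latency in ms) pair per decode step; the function name and the synthetic samples are illustrative.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, List, Tuple


def latency_by_heterogeneity(samples: Iterable[Tuple[int, float]]) -> Dict[int, Dict[str, float]]:
    """Compute P50/P99 decode latency per distinct-adapter-count bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for distinct, latency_ms in samples:
        buckets[distinct].append(latency_ms)
    report: Dict[int, Dict[str, float]] = {}
    for distinct, values in sorted(buckets.items()):
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        report[distinct] = {"p50": cuts[49], "p99": cuts[98], "count": len(values)}
    return report


# Synthetic samples: batches with one distinct adapter, latencies 1..101 ms.
samples = [(1, float(v)) for v in range(1, 102)]
report = latency_by_heterogeneity(samples)
print(report)
```

Plotting P99 per bucket is what exposes a threshold like the one Harbor Analytics hit; a single fleet-wide P99 averages it away.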

Memory accounting

Budget VRAM as:

total = base_weights + kv_pool + Σ(active_adapter_i) + cuda_graph_overhead

Base weights dominate (quantized INT4/FP8 bases shrink this term). KV pool scales with concurrent sequences and context length — adapters do not duplicate KV cache per tenant. Each active LoRA adds roughly:

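adapter_bytes ≈ 2 × rank × d_model × num_target_modules × bytes_per_param

Here num_target_modules is the layer count times the number of targeted projections per layer, and the factor of 2 covers the A and B matrices of each square d_model × d_model projection. For rank 16 on all four attention projections in a 32-layer 7B model (d_model = 4096, so 128 target modules), this gives about 34 MB per adapter in FP16; adapters that also target the wider MLP projections typically land in the 40–90 MB range. max_loras is therefore a hard capacity plan: eight rank-16 adapters might cost under 1 GB, which is cheap compared to an extra base replica. Watch max_lora_rank padding: vLLM allocates for the configured maximum rank even when the loaded adapter uses rank 8, wasting slack if ranks are heterogeneous.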
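The adapter estimate and the padding effect are easy to put into numbers. This sketch assumes square d_model × d_model target projections with d_model = 4096 and four attention projections in each of 32 layers (128 target modules); those dimensions are assumptions for a typical 7B model, and real adapters that also target MLP projections will be larger.

```python
def adapter_bytes(rank: int, d_model: int, num_target_modules: int,
                  bytes_per_param: int = 2) -> int:
    """Two low-rank factors (A and B) per targeted square projection."""
    return 2 * rank * d_model * num_target_modules * bytes_per_param


def slot_pool_bytes(max_loras: int, max_lora_rank: int, d_model: int,
                    num_target_modules: int) -> int:
    """GPU adapter pool when every slot is sized for the configured maximum rank."""
    return max_loras * adapter_bytes(max_lora_rank, d_model, num_target_modules)


D_MODEL = 4096          # assumed hidden size
TARGET_MODULES = 32 * 4  # 32 layers x 4 attention projections

rank16 = adapter_bytes(16, D_MODEL, TARGET_MODULES)
pool_r16 = slot_pool_bytes(16, 16, D_MODEL, TARGET_MODULES)
pool_r64 = slot_pool_bytes(16, 64, D_MODEL, TARGET_MODULES)

print(f"one rank-16 adapter: {rank16 / 1e6:.1f} MB")
print(f"16 slots sized for rank 16: {pool_r16 / 1e9:.2f} GB")
print(f"16 slots sized for rank 64: {pool_r64 / 1e9:.2f} GB")
```

Raising the rank ceiling from 16 to 64 quadruples the slot pool under this model even if every loaded adapter is still rank 16, which is the padding cost to weigh when one outlier adapter sets the fleet maximum.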

Technique decision table

Your situation	Prefer	Avoid
40+ tenant tone adapters on one base	Multi-LoRA serving with registry + gateway routing	Full replica per tenant
Single production adapter, no A/B	Merge LoRA into base; simplest ops	Runtime adapter lookup overhead
Adapters on different base revisions	Separate engine per base hash	One multi-LoRA pool
Rank > 128 or full-layer fine-tune	Small dedicated model or merged checkpoint	Low-rank serving assumptions
Strict tenant isolation audits	Per-tenant engine or crypto-separated keys + quota	Shared batch without audit logs
Latency-sensitive, one hot adapter	Pin hot adapter + homogeneous batching	Low max_loras with constant churn

Harbor Analytics refactor

The gateway consolidation shipped these changes:

Retired twelve merged 7B replicas; one FP8 base on four A10G workers.
Central adapter registry in Postgres: ID, path, base hash, rank, owner team, canary weight.
Envoy ext_authz resolves API key to lora_int_id; clients never send raw paths.
max_loras=16, max_lora_rank=64, max_cpu_loras=64 with LRU GPU promotion.
Tenant-aware 2 ms batching window reduced distinct adapters per batch from 5.1 to 2.4 average.
Dashboards: swap rate, adapter residency time, decode latency by heterogeneity bucket.
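The tenant-aware batching window can be sketched as an admission rule applied to the pending queue at each scheduling tick. This is an illustration of the idea, not Harbor Analytics' implementation: the types and function are hypothetical, and the defaults reuse the 2 ms window and the four-adapter threshold mentioned in this guide.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Pending:
    request_id: str
    adapter: str
    arrived_ms: float


def pick_batch(queue: Iterable[Pending], running_adapters: Set[str], now_ms: float,
               window_ms: float = 2.0, max_distinct: int = 4) -> Tuple[List[Pending], List[Pending]]:
    """Split the queue into (dispatch_now, keep_waiting)."""
    active = set(running_adapters)
    dispatch: List[Pending] = []
    waiting: List[Pending] = []
    for req in sorted(queue, key=lambda r: r.arrived_ms):
        waited_out = now_ms - req.arrived_ms >= window_ms
        if req.adapter in active:
            dispatch.append(req)        # joins without adding a new adapter
        elif waited_out and len(active) < max_distinct:
            active.add(req.adapter)     # window expired: admit a new adapter
            dispatch.append(req)
        else:
            waiting.append(req)         # hold briefly, hoping to coalesce
    return dispatch, waiting


queue = [
    Pending("a", "harbor-support", 0.0),
    Pending("b", "harbor-legal", 1.5),
    Pending("c", "harbor-support", 1.8),
]
early = pick_batch(queue, {"harbor-legal"}, now_ms=1.9)
late = pick_batch(queue, {"harbor-legal"}, now_ms=2.0)
print([r.request_id for r in early[0]], [r.request_id for r in early[1]])
print([r.request_id for r in late[0]], [r.request_id for r in late[1]])
```

At 1.9 ms only the request matching the running adapter is dispatched; at 2.0 ms the oldest `harbor-support` request has waited out the window, so both `harbor-support` requests enter together. A real scheduler would also need a hard timeout so that a request cannot starve while the batch stays at max_distinct.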

Post-migration, cluster GPU memory utilization stabilized at 61% with headroom for traffic spikes; support tickets tagged “wrong tone/model” fell 83% after routing fixes alone — no retraining required.

Pitfalls

Base hash mismatch. Adapter trained on Llama 3.1 revision A applied to revision B produces fluent nonsense; enforce hash checks at registry ingest.
Client-supplied adapter paths. Path traversal and cross-tenant leakage; gateway must own ID resolution.
max_loras too low for canaries. Blue/green deploy evicts production adapters mid-request.
CUDA graphs without LoRA bucket keys. Silent wrong-token generation on mixed batches.
Ignoring heterogeneous batch penalty. P50 looks fine while P99 kills SLAs; stratify metrics by adapter count.
Merging for serving then also loading sidecar. Double application of deltas amplifies weights past training scale.
No audit log per adapter invocation. Compliance teams cannot reconstruct which tenant model answered a given prompt.

Production checklist

Publish adapter registry with base model hash and rank metadata.
Resolve tenant to lora_int_id in gateway; block raw path passthrough.
Set max_lora_rank to fleet maximum; pad consciously.
Size max_loras for peak concurrent adapters plus canary slots.
Pre-register hot adapters via --lora-modules at worker start.
Enable CPU staging (max_cpu_loras) before hard OOM on swap.
Instrument swaps, residency time, and latency by batch heterogeneity.
Regression-test CUDA graphs on mixed-adapter decode batches.
Document merge-vs-serve policy; never apply both to the same request.
Log adapter ID on every completion for audit and debug.

Key takeaways

Multi-LoRA serving loads one base and many small adapter deltas — far cheaper than full replicas when ranks stay low.
Gateway-owned adapter routing prevents wrong-tenant responses and path injection.
Continuous batching can mix adapters, but heterogeneous batches tax P99 decode latency.
vLLM max_loras, rank ceilings, and CPU staging are capacity-planning knobs, not afterthoughts.
Harbor Analytics cut GPU memory 68% by consolidating twelve replicas into one multi-LoRA fleet.