LLM multi-tenant isolation explained
Harbor Analytics ran one shared vLLM cluster for every internal product. Finance loved the utilization numbers until marketing launched an overnight embedding backfill: P99 latency on the customer-support chatbot jumped from 1.2 s to 14 s, and two enterprise tenants hit hard 429 rate limits because a single API key had no per-tenant ceiling. That incident was not a model quality problem — it was missing multi-tenant isolation: the set of policies, queues, and observability hooks that keep one team's workload from starving another's on shared GPUs.
Multi-tenant LLM serving differs from classic SaaS tenancy. You rarely spin up a dedicated GPU per customer at scale; instead you share tensor cores and KV-cache memory while enforcing fairness at the gateway. Isolation spans identity (who is calling), economics (how many tokens they may burn), scheduling (whose requests enter the batch first), and compliance (whether prompts leave a tenant's trust boundary). This guide covers tenant identity and API keys, rate limits and token budgets, queue fairness and priority tiers, noisy-neighbor mitigation on continuous batching runtimes, data and logging boundaries, the Harbor Analytics gateway refactor, a technique decision table, pitfalls, and a production checklist. Pair it with LLM observability and API rate limiting for the full operations picture.
What multi-tenant isolation means for LLMs
In a multi-tenant inference stack, multiple customers or internal teams share the same model weights and GPU pool. Isolation does not require separate hardware for every tenant; it requires enforceable boundaries so that:
- Capacity — one tenant cannot exhaust GPU memory or queue depth and block others.
- Latency — interactive chat SLOs survive batch jobs from other tenants.
- Billing — token usage attributes cleanly to the caller for chargeback or cost optimization.
- Security — prompts, completions, and cached prefixes do not leak across tenants; logs respect PII redaction policy per tenant.
Isolation sits in the gateway and scheduler, not inside the transformer. vLLM, TensorRT-LLM, and TGI batch requests from a shared queue; your control plane decides which requests enter that queue and at what priority.
Tenant identity and API key scoping
Every request should carry a resolvable tenant ID before it touches a GPU. Common patterns:
- Dedicated API keys per tenant — map key hash to tenant record in the gateway. Rotate keys independently; revoke one tenant without affecting others.
- JWT or mTLS service identity — internal microservices present signed claims (tenant_id, tier, allowed_models). Prefer this over shared service keys that obscure attribution.
- Model and feature allowlists — enterprise tenant A may call only harbor-70b; sandbox tenants restricted to a smaller model to cap cost and risk.
Anti-pattern: one global API key embedded in twelve repos. You cannot rate-limit,
bill, or audit fairly. Harbor moved to per-product keys with a central registry synced
from their internal service catalog; deploys that omit X-Tenant-Id now fail
closed at the gateway.
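The key-to-tenant lookup described above can be sketched as follows. This is a minimal illustration, not Harbor's implementation: the `Tenant` record, the `TenantRegistry` class, and the in-memory dictionary are all assumptions, and a real gateway would back the registry with a service catalog or database.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tenant:
    # Hypothetical tenant record; field names mirror the claims listed above.
    tenant_id: str
    tier: str
    allowed_models: frozenset


class TenantRegistry:
    """Maps SHA-256 hashes of API keys to tenant records (illustrative only)."""

    def __init__(self) -> None:
        self._by_key_hash = {}

    @staticmethod
    def _hash(api_key: str) -> str:
        # Store only the hash so a registry dump does not expose raw keys.
        return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

    def register(self, api_key: str, tenant: Tenant) -> None:
        self._by_key_hash[self._hash(api_key)] = tenant

    def revoke(self, api_key: str) -> None:
        # Revoking one key leaves every other tenant untouched.
        self._by_key_hash.pop(self._hash(api_key), None)

    def resolve(self, api_key: Optional[str], model: str) -> Tenant:
        """Fail closed on a missing key, an unknown key, or a disallowed model."""
        if not api_key:
            raise PermissionError("missing API key")
        tenant = self._by_key_hash.get(self._hash(api_key))
        if tenant is None:
            raise PermissionError("unknown API key")
        if model not in tenant.allowed_models:
            raise PermissionError(f"model {model!r} not allowed for {tenant.tenant_id}")
        return tenant
```

Because `resolve` raises rather than returning a default tenant, a request with no resolvable identity never reaches a GPU.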
Rate limits, token budgets, and concurrency caps
Requests per minute (RPM) and tokens per minute (TPM)
Standard rate limiting applies at the HTTP edge: RPM stops brute-force abuse; TPM enforces economic caps. LLM workloads need both because a single request can carry 100k input tokens (long RAG context) or generate 4k output tokens. Set limits per tenant and per model — a 70B endpoint deserves lower TPM than an 8B embedder.
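One way to enforce both limits per tenant and per model is a pair of token buckets keyed by (tenant, model). The sketch below is an assumption-laden illustration: the class names are hypothetical, the limits are made-up numbers, and `tokens` is assumed to be an up-front estimate (for example prompt tokens plus the requested maximum output).

```python
class TokenBucket:
    """Continuously refilling bucket; capacity equals one minute of allowance."""

    def __init__(self, per_minute: float, now: float) -> None:
        self.capacity = float(per_minute)
        self.refill_per_sec = per_minute / 60.0
        self.level = float(per_minute)
        self.updated = now

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_sec)
        self.updated = max(self.updated, now)


class TenantModelLimiter:
    """One RPM bucket and one TPM bucket per (tenant_id, model) pair."""

    def __init__(self, limits: dict) -> None:
        # limits: {(tenant_id, model): (rpm, tpm)}; values are illustrative.
        self._limits = limits
        self._buckets = {}

    def admit(self, tenant_id: str, model: str, tokens: int, now: float) -> bool:
        key = (tenant_id, model)
        if key not in self._limits:
            return False  # fail closed for unknown tenant/model pairs
        if key not in self._buckets:
            rpm, tpm = self._limits[key]
            self._buckets[key] = (TokenBucket(rpm, now), TokenBucket(tpm, now))
        requests, token_budget = self._buckets[key]
        requests.refill(now)
        token_budget.refill(now)
        # Check both buckets before consuming so a rejected request costs nothing.
        if requests.level < 1 or token_budget.level < tokens:
            return False
        requests.level -= 1
        token_budget.level -= tokens
        return True
```

Passing `now` explicitly keeps the limiter deterministic under test; in a gateway it would typically come from `time.monotonic()`.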
Concurrency slots
Limit how many in-flight generations a tenant may hold. Without concurrency caps, one client opening 200 parallel streaming connections can monopolize continuous batching slots even when its RPM and TPM look fine averaged over a minute.
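A per-tenant in-flight counter is enough to illustrate the idea. This is a minimal sketch with hypothetical names; it assumes the gateway calls `try_acquire` before proxying a generation and `release` when the stream ends, including on errors and client disconnects.

```python
import threading


class ConcurrencyLimiter:
    """Caps simultaneous in-flight generations per tenant (illustrative)."""

    def __init__(self, max_in_flight: dict) -> None:
        # max_in_flight: {tenant_id: cap}; tenants without an entry get zero slots.
        self._max = max_in_flight
        self._in_flight = {}
        self._lock = threading.Lock()

    def try_acquire(self, tenant_id: str) -> bool:
        with self._lock:
            current = self._in_flight.get(tenant_id, 0)
            if current >= self._max.get(tenant_id, 0):
                return False
            self._in_flight[tenant_id] = current + 1
            return True

    def release(self, tenant_id: str) -> None:
        with self._lock:
            current = self._in_flight.get(tenant_id, 0)
            if current > 0:
                self._in_flight[tenant_id] = current - 1
```

A rejected `try_acquire` would normally translate into a 429 rather than queuing the request, so the cap also bounds queue depth per tenant.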
Daily and monthly budgets
Apply hard stops to free-tier or trial tenants and send soft alerts at 80% of budget for paid accounts. Budget exhaustion should return a structured 429 with a Retry-After header and a dashboard link, not a generic timeout after the request has already consumed GPU time.
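The budget check and the structured 429 can be sketched as below. The function names, the JSON field names, and the example dashboard URL are assumptions for illustration; only the 429 status and the `Retry-After` header are standard HTTP.

```python
import json
import math


def budget_state(used_tokens: int, budget_tokens: int, alert_percent: int = 80) -> str:
    """Return "ok", "alert" (soft threshold crossed), or "exhausted"."""
    if used_tokens >= budget_tokens:
        return "exhausted"
    # Integer arithmetic avoids floating-point surprises at the threshold.
    if used_tokens * 100 >= budget_tokens * alert_percent:
        return "alert"
    return "ok"


def budget_exhausted_response(tenant_id, used_tokens, budget_tokens,
                              seconds_until_reset, dashboard_url):
    """Build (status, headers, body) for a tenant whose budget is spent."""
    retry_after = max(1, math.ceil(seconds_until_reset))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),  # delay in whole seconds
    }
    body = {
        "error": {
            "type": "budget_exhausted",  # hypothetical error code
            "tenant_id": tenant_id,
            "used_tokens": used_tokens,
            "budget_tokens": budget_tokens,
            "retry_after_seconds": retry_after,
            "dashboard_url": dashboard_url,
        }
    }
    return 429, headers, json.dumps(body)
```

Running this check before the request is proxied is what keeps an over-budget tenant from consuming GPU time at all.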
Queue fairness and priority tiers
Shared inference runtimes use a central request queue. Fair scheduling prevents a single tenant from filling the queue:
- Weighted fair queuing (WFQ) — each tenant gets a quantum of batch slots proportional to their tier. A premium tenant with weight 4 receives four times the decode opportunities of a sandbox tenant with weight 1.
- Separate queues by workload class — interactive chat, batch embedding, and offline summarization land in different queues bound to different replica pools or priority bands.
- Preemption policy — some stacks allow high-priority requests to jump the queue; document whether preemption cancels in-flight partial generations (bad for UX) or only affects not-yet-started work.
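The weighted fair queuing bullet can be illustrated with virtual finish times. The sketch below is a simplified, self-clocked variant rather than a textbook WFQ scheduler, and it is not how vLLM or any other runtime schedules internally; `WeightedFairQueue` and the `cost` parameter are hypothetical names, and it assumes the gateway owns this queue in front of the runtime.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Orders requests by virtual finish time = start + cost / weight."""

    def __init__(self, weights: dict) -> None:
        self._weights = weights      # {tenant_id: weight}, e.g. premium=4, sandbox=1
        self._heap = []
        self._last_finish = {}       # last finish tag assigned per tenant
        self._virtual_time = 0.0
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def push(self, tenant_id: str, request, cost: float = 1.0) -> None:
        weight = self._weights[tenant_id]
        start = max(self._virtual_time, self._last_finish.get(tenant_id, 0.0))
        finish = start + cost / weight
        self._last_finish[tenant_id] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant_id, request))

    def pop(self):
        finish, _, tenant_id, request = heapq.heappop(self._heap)
        # Self-clocking: virtual time advances to the tag of the request served.
        self._virtual_time = finish
        return tenant_id, request

    def __len__(self) -> int:
        return len(self._heap)
```

With weights 4 and 1 and both tenants backlogged, the premium tenant is dequeued about four times as often. Setting `cost` to an estimated token count instead of 1.0 makes the fairness token-weighted rather than request-weighted.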
Harbor split replicas: three GPUs on a latency pool (max batch 16, no requests above 8k context), two GPUs on a throughput pool for batch jobs. Cross-pool routing is gateway-enforced — marketing backfills cannot target the latency pool even with a valid API key.
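Gateway-enforced pool routing of this kind might look like the sketch below. The pool names, the exact 8,192-token reading of "8k", and the choice to send oversized interactive prompts to the throughput pool instead of rejecting them are all assumptions; the source only states that the latency pool takes no requests above 8k context and that batch work cannot target it.

```python
from typing import Optional

LATENCY_POOL_MAX_CONTEXT = 8_192  # assumed value for the "8k" cap


def select_pool(tier: str, prompt_tokens: int, requested_pool: Optional[str] = None) -> str:
    """Pick a replica pool from the authenticated tier, never from client wishes alone."""
    if tier == "batch":
        if requested_pool == "latency":
            # A valid API key is not enough: batch tenants may not target the latency pool.
            raise PermissionError("batch tier cannot use the latency pool")
        return "throughput"
    if prompt_tokens > LATENCY_POOL_MAX_CONTEXT:
        # Assumption: long prompts are rerouted rather than rejected.
        return "throughput"
    return "latency"
```

The key property is that `tier` comes from the resolved tenant record, so a client cannot talk its way into the latency pool by setting a header.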
Noisy-neighbor mitigation on shared GPUs
Even with fair queuing, tenants on the same GPU share KV-cache memory and memory bandwidth. Mitigations:
- Context length caps per tier — reject or truncate prompts above tenant max context before prefill allocates cache blocks.
- Max batch token limits — vLLM's max_num_batched_tokens prevents one enormous prefill from starving decode steps for others in the same batch.
- Replica pinning for noisy tenants — isolate a known heavy batch customer to dedicated replicas while keeping smaller tenants on a shared pool. Cheaper than full single-tenant clusters.
- Autoscaling on queue depth per pool — scale replicas when latency-pool queue depth exceeds threshold, not when the throughput pool is busy. Mixing signals causes over-provisioning.
Pair with model routing so sandbox tenants hit smaller models by default, reducing per-request memory footprint on shared hardware.
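The per-pool autoscaling point can be made concrete with a small sizing function that only ever sees one pool's queue depth. The function names, the target depth per replica, and the replica bounds are hypothetical; a real deployment would feed this from its own metrics and autoscaler.

```python
import math


def desired_replicas(queue_depth: int, target_depth_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Size one pool from its own queue depth, clamped to [min, max]."""
    if target_depth_per_replica <= 0:
        raise ValueError("target_depth_per_replica must be positive")
    needed = math.ceil(queue_depth / target_depth_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def plan_scaling(pools: dict) -> dict:
    """pools: {pool_name: kwargs for desired_replicas}; each pool is sized independently."""
    return {name: desired_replicas(**cfg) for name, cfg in pools.items()}
```

Because each pool is evaluated on its own signal, a busy throughput pool never triggers extra latency-pool replicas.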
Data, cache, and logging boundaries
Prompt and completion storage
If you log prompts for debugging, partition storage by tenant_id with
separate encryption keys for regulated customers. Retention TTLs differ: enterprise
contracts may require 7-day max; internal dev sandboxes may allow 30 days.
Semantic and prefix cache isolation
Semantic caches and prefix caches must key entries by tenant. Never return a cached completion generated for tenant A to tenant B even if the prompt text matches — subtle system-prompt differences or compliance rules may differ.
Fine-tuned and adapter weights
LoRA adapters per tenant load dynamically in multi-tenant serving. Ensure adapter selection is bound to authenticated tenant identity, not client-supplied model names alone. Adapter memory adds per-request overhead; cap how many distinct adapters a single replica hot-loads.
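Binding adapter selection to the authenticated tenant, with a cap on hot adapters, can be sketched like this. `AdapterGate` and its methods are hypothetical, and the sketch only tracks which adapters count as hot; actually loading and unloading weights is runtime-specific and not shown.

```python
from collections import OrderedDict


class AdapterGate:
    """Authorizes adapters per tenant and tracks an LRU set of hot adapters."""

    def __init__(self, adapters_by_tenant: dict, max_hot: int) -> None:
        self._adapters_by_tenant = adapters_by_tenant  # {tenant_id: set of adapter IDs}
        self._max_hot = max_hot
        self._hot = OrderedDict()  # adapter_id -> owning tenant, oldest first

    def select(self, tenant_id: str, requested_adapter: str) -> str:
        allowed = self._adapters_by_tenant.get(tenant_id, frozenset())
        if requested_adapter not in allowed:
            # The client-supplied name is only a request; ownership decides.
            raise PermissionError(f"{requested_adapter!r} is not owned by {tenant_id}")
        if requested_adapter in self._hot:
            self._hot.move_to_end(requested_adapter)
        else:
            if len(self._hot) >= self._max_hot:
                self._hot.popitem(last=False)  # evict the least recently used adapter
            self._hot[requested_adapter] = tenant_id
        return requested_adapter

    def hot_adapters(self) -> list:
        return list(self._hot)
```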
Harbor Analytics gateway refactor
After the marketing backfill incident, Harbor rebuilt the inference edge:
- Gateway service — Envoy sidecar validates API keys, injects tenant_id and tier, enforces RPM/TPM/concurrency before proxying to vLLM.
- Tier matrix — interactive (support, in-app chat), standard (internal tools), batch (embeddings, overnight jobs). Each tier maps to pool, limits, and allowed models.
- Token metering — stream usage chunks back to a Kafka topic; daily rollups per tenant feed FinOps dashboards and budget alerts.
- Observability — every trace span tags tenant_id; P99 latency dashboards split by tier so regressions are visible before customers complain.
- Incident runbook — on-call can throttle a single tenant to 10% TPM without draining the shared pool for everyone else.
Result: support chat P99 returned to 1.4 s during concurrent batch work; marketing jobs completed on the throughput pool without touching latency SLOs. Monthly GPU spend rose 8% from the extra replica — acceptable versus the prior invisible cross-team contention.
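The daily per-tenant rollup in the token metering step could be computed roughly as below. The event shape (`ts`, `tenant_id`, `prompt_tokens`, `completion_tokens`) is an assumption about what the usage chunks contain, and the sketch aggregates an in-memory list rather than consuming from Kafka.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_rollup(events):
    """Sum token usage per (tenant_id, UTC day) from an iterable of usage events."""
    totals = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0})
    for event in events:
        # Bucket by UTC calendar day so rollups do not shift with server time zones.
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).date().isoformat()
        bucket = totals[(event["tenant_id"], day)]
        bucket["prompt_tokens"] += event["prompt_tokens"]
        bucket["completion_tokens"] += event["completion_tokens"]
    return dict(totals)
```

The output of such a rollup is what a budget check or a FinOps dashboard would read, keyed the same way as the limits: by tenant.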
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| Multiple internal teams on one GPU cluster | Per-tenant keys + WFQ + separate latency/throughput pools | One shared API key and best-effort FIFO queue |
| External SaaS API with free tier | Hard TPM/concurrency caps + daily budget + smaller default model | Same limits as enterprise on shared replicas |
| Regulated customer (HIPAA, finance) | Dedicated replicas or single-tenant cluster + partitioned logs | Shared semantic cache across tenants |
| Heavy batch plus real-time chat | Physically separate replica pools with gateway routing | Priority flag only, same GPU pool |
| Early prototype, one team | Simple per-key RPM limit; defer full WFQ | Over-building tenant registry before second consumer exists |
| Per-tenant LoRA adapters | Gateway-bound adapter ID + cap hot adapters per replica | Client-supplied adapter path without auth binding |
Common pitfalls
- Rate limits only at the edge, not at the scheduler. Requests that pass HTTP limits can still pile up inside vLLM's internal queue.
- TPM averaged globally. One tenant's spike hides in fleet-wide metrics; per-tenant dashboards are mandatory.
- Shared prefix cache keys. Cross-tenant cache hits are a privacy incident waiting to happen.
- Ignoring concurrency. TPM/RPM look healthy while 500 streams block the batcher.
- Same pool for 200k-token RAG and 500-token chat. Long prefills are the classic noisy neighbor; route or cap by context class.
- No tenant-level circuit breaker. A buggy retry loop in one service DDoS-es your own GPUs; per-tenant breakers contain blast radius.
- Logging prompts without tenant-scoped retention. Compliance audits fail when one bucket mixes all customers.
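The tenant-level circuit breaker mentioned in the pitfalls can be sketched with consecutive-failure counting. The class name, the threshold, and the cooldown are illustrative assumptions; production breakers usually add a half-open probe state and error-rate windows.

```python
class TenantCircuitBreaker:
    """Opens per tenant after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0) -> None:
        self._threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._failures = {}    # consecutive failures per tenant
        self._open_until = {}  # timestamp until which the tenant is blocked

    def allow(self, tenant_id: str, now: float) -> bool:
        return now >= self._open_until.get(tenant_id, 0.0)

    def record(self, tenant_id: str, ok: bool, now: float) -> None:
        if ok:
            self._failures[tenant_id] = 0
            return
        count = self._failures.get(tenant_id, 0) + 1
        self._failures[tenant_id] = count
        if count >= self._threshold:
            # Trip only this tenant; every other tenant keeps flowing.
            self._open_until[tenant_id] = now + self._cooldown_s
            self._failures[tenant_id] = 0
```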
Production checklist
- Issue unique API keys or JWT claims per tenant; ban shared production keys.
- Enforce RPM, TPM, concurrency, and max context per tier at the gateway.
- Tag every request and observability span with tenant_id and tier.
- Split latency-sensitive and batch workloads across pools or replica groups.
- Implement weighted fair queuing or equivalent scheduler fairness.
- Key semantic and prefix caches by tenant; never cross-tenant cache hits.
- Stream token usage to a metering pipeline for billing and budget alerts.
- Expose per-tenant P50/P99 latency and 429 rates on a dashboard.
- Document a single-tenant throttle runbook for incident response.
- Bind LoRA or fine-tuned adapter selection to authenticated tenant identity.
- Apply per-tenant PII redaction rules before prompt logging.
- Load-test one tenant at max tier while others hold baseline traffic.
Key takeaways
- Multi-tenant LLM isolation is a gateway and scheduling problem: identity, quotas, queue fairness, and data boundaries — not separate GPUs for every user.
- RPM and TPM limits are necessary but not sufficient; concurrency caps and context-class routing stop noisy neighbors on shared tensor cores.
- Split latency and throughput pools when batch jobs and interactive chat share a fleet.
- Caches and logs must partition by tenant; a shared semantic cache is a compliance risk.
- Harbor Analytics restored support SLOs by tiered pools, per-tenant metering, and throttle runbooks — not by buying a GPU per team.
Related reading
- LLM observability explained — traces, token metrics, and production monitoring
- API rate limiting explained — token buckets, sliding windows, and 429 design
- vLLM fundamentals explained — continuous batching and PagedAttention
- LLM cost optimization explained — token budgets and chargeback models