Guide
LiteLLM fundamentals explained
Every LLM vendor ships a slightly different SDK, error shape, and rate-limit header. Your support bot calls OpenAI; the analytics team prefers Claude; compliance wants a fallback to Ollama on a private subnet. Without a gateway layer you end up with three client libraries, three retry policies, and three places to rotate API keys. LiteLLM is an open-source Python library and proxy server that normalizes completions, embeddings, and image calls behind a single OpenAI-compatible interface. It handles provider-specific auth, maps model names, tracks token spend, and can fail over when a vendor returns 429 or 503. This guide covers the Python SDK, the proxy deployment, routing and fallbacks, budgets and caching, a Harbor Analytics multi-provider gateway worked example, a tooling decision table, common pitfalls, and a production checklist. Pair it with model routing for application-level cascades and LLM observability for tracing what the gateway forwards.
What LiteLLM is (and is not)
LiteLLM is a unified LLM client and optional HTTP proxy. In library
mode you call litellm.completion() with a model string like
anthropic/claude-sonnet-4-20250514 or ollama/llama3.2 and
receive an OpenAI-shaped response object regardless of backend. In proxy mode you run
litellm --config config.yaml, point existing OpenAI SDKs at
http://localhost:4000, and let the proxy translate requests to the right
upstream.
It is not a vector database, a prompt management UI, or a full MLOps platform. RAG ingestion belongs in LlamaIndex or LangChain; LiteLLM is the transport layer those frameworks can call. Agent orchestration still lives in LangGraph or CrewAI. Reach for LiteLLM when you need one integration surface for many vendors, centralized API keys, spend caps, or hot-swapping models without redeploying every microservice.
Core concepts
- Model string —
provider/model-nametells LiteLLM which adapter and env vars to use. - Completion — chat, text, embedding, image, and audio endpoints normalized to OpenAI schemas where possible.
- Router — weighted deployments, latency-based selection, and cooldown after repeated failures.
- Fallbacks — ordered list of alternate models when the primary raises specific exceptions.
- Proxy config — YAML listing model groups, API keys, rate limits, and team budgets.
- Callbacks — hooks for logging spend, forwarding to Langfuse, or custom audit pipelines.
Installation and your first completion
Install LiteLLM as a project dependency so proxy and library versions stay aligned:
pip install litellm
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
python -c "
from litellm import completion
resp = completion(
model='gpt-4o-mini',
messages=[{'role': 'user', 'content': 'Summarize vector databases in one sentence.'}]
)
print(resp.choices[0].message.content)
"
Switch providers by changing the model string only:
completion(model='anthropic/claude-sonnet-4-20250514', messages=messages)
completion(model='gemini/gemini-2.0-flash', messages=messages)
completion(model='ollama/llama3.2', api_base='http://localhost:11434', messages=messages)
LiteLLM reads standard environment variable names per provider. For Ollama and vLLM you
pass api_base explicitly. Responses include usage token counts
LiteLLM can price against its built-in model cost table — useful before you wire
a full billing export.
Streaming and async
Streaming uses the same call with stream=True; chunks arrive in OpenAI SSE
format. For FastAPI services prefer acompletion() so concurrent requests do
not block the event loop. Set litellm.set_verbose=True only in development;
verbose logs print request bodies that may contain PII.
The LiteLLM proxy server
Library mode works for scripts; production teams usually deploy the proxy
so every service shares one key vault, budget policy, and model catalog. A minimal
config.yaml:
model_list:
- model_name: gpt-4o-mini
litellm_params:
model: openai/gpt-4o-mini
api_key: os.environ/OPENAI_API_KEY
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: local-llama
litellm_params:
model: ollama/llama3.2
api_base: http://ollama.internal:11434
litellm_settings:
drop_params: true
set_verbose: false
Start the server:
litellm --config config.yaml --port 4000
Point the official OpenAI Python client at the proxy:
from openai import OpenAI
client = OpenAI(base_url='http://localhost:4000', api_key='sk-anything')
client.chat.completions.create(model='claude-sonnet', messages=[...])
The proxy validates keys you configure in general_settings.master_key,
enforces per-team budgets, and writes structured spend logs. Containerize behind your
existing ingress with TLS termination; never expose an unauthenticated proxy to the
public internet.
Routing, fallbacks and reliability
Vendor outages are routine. LiteLLM supports fallback chains at both SDK and proxy level:
from litellm import completion
response = completion(
model='gpt-4o',
messages=messages,
fallbacks=['claude-sonnet', 'local-llama'],
num_retries=2,
timeout=30
)
When OpenAI returns rate-limit or server errors, LiteLLM retries with exponential backoff
then tries the next model in the list. In proxy config, group multiple deployments under
one model_name and assign rpm (requests per minute) limits per
deployment to spread load across keys.
Router strategies
- Simple shuffle — random pick among healthy deployments with the same alias.
- Lowest latency — track rolling latency per deployment; prefer the fastest.
- Lowest cost — route to the cheapest model that meets a quality bar (pair with evals).
- Cooldown — temporarily remove deployments after N consecutive failures.
Application-level routing in cascade classifiers decides which task tier to use; LiteLLM decides which vendor endpoint fulfills that tier. Keep the boundary clear so you do not duplicate fallback logic in both places.
Budgets, caching and observability
Runaway agent loops can burn thousands of dollars before anyone notices. LiteLLM proxy supports max budgets per API key, team, or user id passed in request metadata. When spend exceeds the cap, the proxy returns 429 with a clear message instead of forwarding to OpenAI.
Semantic caching (optional Redis backend) stores responses keyed by embedding similarity of the prompt. Enable only for idempotent read-heavy workloads; support tickets and personalized summaries should not cache across users without strict key scoping. For prefix-level savings on repeated system prompts, vendor-native prompt caching still wins on supported models.
Wire success_callback and failure_callback to Langfuse, Helicone,
or your own webhook. Each log line includes model, latency, token counts, and estimated
cost — the raw material for
production dashboards.
Redact message content in callbacks when GDPR or HIPAA applies; log metadata only.
Harbor Analytics: multi-provider gateway worked example
Harbor Analytics runs a nightly policy Q&A pipeline: analysts upload regulatory PDFs,
the system chunks and embeds them, and a chat endpoint answers questions with citations.
Early versions hard-coded gpt-4o; when OpenAI degraded during a US holiday,
the whole pipeline stalled.
The team deployed LiteLLM proxy on an internal VM with three model aliases:
- harbor-fast —
gpt-4o-miniwith Anthropic Haiku fallback for bulk summarization. - harbor-quality —
claude-sonnetprimary,gpt-4ofallback for citation-heavy answers. - harbor-local —
ollama/llama3.2on a GPU box for PII-tagged documents that cannot leave the VPC.
The FastAPI service kept using the OpenAI SDK; only base_url and model alias
changed. Per-team API keys mapped analysts to harbor-quality and batch jobs
to harbor-fast with a $50/day budget. Spend logs fed a Grafana panel; when
Haiku absorbed failover traffic, on-call saw the shift within minutes instead of discovering
it on the monthly cloud invoice.
The migration took one sprint: stand up proxy, mirror traffic in shadow mode, compare
answer quality with their existing eval set, then cut over. The highest-leverage config
was drop_params: true, which strips unsupported parameters instead of
failing when Claude rejects an OpenAI-only field.
Tooling decision table
| Need | Reach for | Why |
|---|---|---|
| One Python API for 100+ LLM vendors | LiteLLM library | Fastest integration; OpenAI response shape |
| Centralized keys, budgets, team ACLs | LiteLLM proxy | OpenAI-compatible drop-in for existing services |
| Application task routing (cheap vs quality) | Custom router + LiteLLM | LiteLLM handles vendor; your code handles intent |
| Local-only inference | Ollama via LiteLLM | Same client code for cloud and on-prem |
| Managed multi-model gateway (hosted) | Portkey, OpenRouter | Less ops; LiteLLM when you need self-host control |
| Full agent framework | LangChain / LangGraph | Use LiteLLM as the model backend inside the framework |
Common pitfalls
- Duplicating fallbacks — if LiteLLM already fails over, do not wrap the same chain in application retry loops.
- Exposing the proxy publicly — always require master key or SSO; an open proxy becomes a free LLM laundering endpoint.
- Ignoring
drop_params— cross-vendor calls fail mysteriously when you pass OpenAI-only fields to Anthropic. - Logging full prompts — verbose mode and default callbacks may persist regulated data; scrub or hash content.
- Stale model strings — vendors rename models frequently; pin aliases in proxy config, not model ids in every repo.
- Semantic cache across tenants — shared cache keys leak answers between customers; scope by tenant id.
- Skipping eval on failover models — Haiku answering like Sonnet is not guaranteed; test fallback quality offline.
Production checklist
- Pin
litellmversion in requirements; read changelog before upgrades. - Store provider API keys in a secret manager; reference via
os.environ/in proxy config. - Deploy proxy behind TLS with authentication on every route.
- Define model aliases (
harbor-fast) decoupled from vendor model ids. - Configure fallbacks and
num_retriesfor each alias based on SLO tier. - Set per-team or per-key
max_budgetwith alerting at 80% consumption. - Enable spend and latency callbacks to your observability stack.
- Run shadow traffic before switching production
base_url. - Document which parameters each alias supports; enable
drop_paramsfor mixed vendors. - Re-run quality evals whenever fallback ordering or model versions change.
Key takeaways
- LiteLLM normalizes LLM vendor APIs behind one OpenAI-shaped interface.
- The proxy centralizes keys, budgets, and model catalogs for many services.
- Fallbacks belong in the gateway; task-level routing stays in application code.
- Model aliases insulate apps from vendor renames and simplify failover.
- Pair LiteLLM with observability and evals — cheap failover is worthless if quality collapses.
Related reading
- Ollama fundamentals explained — local models as a LiteLLM backend
- LLM model routing explained — application cascades above the gateway
- LangChain fundamentals explained — agents that call LiteLLM as the LLM layer
- LLM observability explained — trace spend and latency from proxy callbacks