Guide

LiteLLM fundamentals explained

Every LLM vendor ships a slightly different SDK, error shape, and rate-limit header. Your support bot calls OpenAI; the analytics team prefers Claude; compliance wants a fallback to Ollama on a private subnet. Without a gateway layer you end up with three client libraries, three retry policies, and three places to rotate API keys. LiteLLM is an open-source Python library and proxy server that normalizes completions, embeddings, and image calls behind a single OpenAI-compatible interface. It handles provider-specific auth, maps model names, tracks token spend, and can fail over when a vendor returns 429 or 503. This guide covers the Python SDK, the proxy deployment, routing and fallbacks, budgets and caching, a Harbor Analytics multi-provider gateway worked example, a tooling decision table, common pitfalls, and a production checklist. Pair it with model routing for application-level cascades and LLM observability for tracing what the gateway forwards.

What LiteLLM is (and is not)

LiteLLM is a unified LLM client and optional HTTP proxy. In library mode you call litellm.completion() with a model string like anthropic/claude-sonnet-4-20250514 or ollama/llama3.2 and receive an OpenAI-shaped response object regardless of backend. In proxy mode you run litellm --config config.yaml, point existing OpenAI SDKs at http://localhost:4000, and let the proxy translate requests to the right upstream.

It is not a vector database, a prompt management UI, or a full MLOps platform. RAG ingestion belongs in LlamaIndex or LangChain; LiteLLM is the transport layer those frameworks can call. Agent orchestration still lives in LangGraph or CrewAI. Reach for LiteLLM when you need one integration surface for many vendors, centralized API keys, spend caps, or hot-swapping models without redeploying every microservice.

Core concepts

Model string — provider/model-name tells LiteLLM which adapter and env vars to use.
Completion — chat, text, embedding, image, and audio endpoints normalized to OpenAI schemas where possible.
Router — weighted deployments, latency-based selection, and cooldown after repeated failures.
Fallbacks — ordered list of alternate models when the primary raises specific exceptions.
Proxy config — YAML listing model groups, API keys, rate limits, and team budgets.
Callbacks — hooks for logging spend, forwarding to Langfuse, or custom audit pipelines.

Installation and your first completion

Install LiteLLM as a project dependency so proxy and library versions stay aligned:

pip install litellm

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...

python -c "
from litellm import completion
resp = completion(
    model='gpt-4o-mini',
    messages=[{'role': 'user', 'content': 'Summarize vector databases in one sentence.'}]
)
print(resp.choices[0].message.content)
"

Switch providers by changing the model string only:

completion(model='anthropic/claude-sonnet-4-20250514', messages=messages)
completion(model='gemini/gemini-2.0-flash', messages=messages)
completion(model='ollama/llama3.2', api_base='http://localhost:11434', messages=messages)

LiteLLM reads standard environment variable names per provider. For Ollama and vLLM you pass api_base explicitly. Responses include usage token counts LiteLLM can price against its built-in model cost table — useful before you wire a full billing export.

Streaming and async

Streaming uses the same call with stream=True; chunks arrive in OpenAI SSE format. For FastAPI services prefer acompletion() so concurrent requests do not block the event loop. Set litellm.set_verbose=True only in development; verbose logs print request bodies that may contain PII.

The LiteLLM proxy server

Library mode works for scripts; production teams usually deploy the proxy so every service shares one key vault, budget policy, and model catalog. A minimal config.yaml:

model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama.internal:11434

litellm_settings:
  drop_params: true
  set_verbose: false

Start the server:

litellm --config config.yaml --port 4000

Point the official OpenAI Python client at the proxy:

from openai import OpenAI
client = OpenAI(base_url='http://localhost:4000', api_key='sk-anything')
client.chat.completions.create(model='claude-sonnet', messages=[...])

The proxy validates keys you configure in general_settings.master_key, enforces per-team budgets, and writes structured spend logs. Containerize behind your existing ingress with TLS termination; never expose an unauthenticated proxy to the public internet.

Routing, fallbacks and reliability

Vendor outages are routine. LiteLLM supports fallback chains at both SDK and proxy level:

from litellm import completion

response = completion(
    model='gpt-4o',
    messages=messages,
    fallbacks=['claude-sonnet', 'local-llama'],
    num_retries=2,
    timeout=30
)

When OpenAI returns rate-limit or server errors, LiteLLM retries with exponential backoff then tries the next model in the list. In proxy config, group multiple deployments under one model_name and assign rpm (requests per minute) limits per deployment to spread load across keys.

Router strategies

Simple shuffle — random pick among healthy deployments with the same alias.
Lowest latency — track rolling latency per deployment; prefer the fastest.
Lowest cost — route to the cheapest model that meets a quality bar (pair with evals).
Cooldown — temporarily remove deployments after N consecutive failures.

Application-level routing in cascade classifiers decides which task tier to use; LiteLLM decides which vendor endpoint fulfills that tier. Keep the boundary clear so you do not duplicate fallback logic in both places.

Budgets, caching and observability

Runaway agent loops can burn thousands of dollars before anyone notices. LiteLLM proxy supports max budgets per API key, team, or user id passed in request metadata. When spend exceeds the cap, the proxy returns 429 with a clear message instead of forwarding to OpenAI.

Semantic caching (optional Redis backend) stores responses keyed by embedding similarity of the prompt. Enable only for idempotent read-heavy workloads; support tickets and personalized summaries should not cache across users without strict key scoping. For prefix-level savings on repeated system prompts, vendor-native prompt caching still wins on supported models.

Wire success_callback and failure_callback to Langfuse, Helicone, or your own webhook. Each log line includes model, latency, token counts, and estimated cost — the raw material for production dashboards. Redact message content in callbacks when GDPR or HIPAA applies; log metadata only.

Harbor Analytics: multi-provider gateway worked example

Harbor Analytics runs a nightly policy Q&A pipeline: analysts upload regulatory PDFs, the system chunks and embeds them, and a chat endpoint answers questions with citations. Early versions hard-coded gpt-4o; when OpenAI degraded during a US holiday, the whole pipeline stalled.

The team deployed LiteLLM proxy on an internal VM with three model aliases:

harbor-fast — gpt-4o-mini with Anthropic Haiku fallback for bulk summarization.
harbor-quality — claude-sonnet primary, gpt-4o fallback for citation-heavy answers.
harbor-local — ollama/llama3.2 on a GPU box for PII-tagged documents that cannot leave the VPC.

The FastAPI service kept using the OpenAI SDK; only base_url and model alias changed. Per-team API keys mapped analysts to harbor-quality and batch jobs to harbor-fast with a $50/day budget. Spend logs fed a Grafana panel; when Haiku absorbed failover traffic, on-call saw the shift within minutes instead of discovering it on the monthly cloud invoice.

The migration took one sprint: stand up proxy, mirror traffic in shadow mode, compare answer quality with their existing eval set, then cut over. The highest-leverage config was drop_params: true, which strips unsupported parameters instead of failing when Claude rejects an OpenAI-only field.

Tooling decision table

Need	Reach for	Why
One Python API for 100+ LLM vendors	LiteLLM library	Fastest integration; OpenAI response shape
Centralized keys, budgets, team ACLs	LiteLLM proxy	OpenAI-compatible drop-in for existing services
Application task routing (cheap vs quality)	Custom router + LiteLLM	LiteLLM handles vendor; your code handles intent
Local-only inference	Ollama via LiteLLM	Same client code for cloud and on-prem
Managed multi-model gateway (hosted)	Portkey, OpenRouter	Less ops; LiteLLM when you need self-host control
Full agent framework	LangChain / LangGraph	Use LiteLLM as the model backend inside the framework

Common pitfalls

Duplicating fallbacks — if LiteLLM already fails over, do not wrap the same chain in application retry loops.
Exposing the proxy publicly — always require master key or SSO; an open proxy becomes a free LLM laundering endpoint.
Ignoring drop_params — cross-vendor calls fail mysteriously when you pass OpenAI-only fields to Anthropic.
Logging full prompts — verbose mode and default callbacks may persist regulated data; scrub or hash content.
Stale model strings — vendors rename models frequently; pin aliases in proxy config, not model ids in every repo.
Semantic cache across tenants — shared cache keys leak answers between customers; scope by tenant id.
Skipping eval on failover models — Haiku answering like Sonnet is not guaranteed; test fallback quality offline.

Production checklist

Pin litellm version in requirements; read changelog before upgrades.
Store provider API keys in a secret manager; reference via os.environ/ in proxy config.
Deploy proxy behind TLS with authentication on every route.
Define model aliases (harbor-fast) decoupled from vendor model ids.
Configure fallbacks and num_retries for each alias based on SLO tier.
Set per-team or per-key max_budget with alerting at 80% consumption.
Enable spend and latency callbacks to your observability stack.
Run shadow traffic before switching production base_url.
Document which parameters each alias supports; enable drop_params for mixed vendors.
Re-run quality evals whenever fallback ordering or model versions change.

Key takeaways

LiteLLM normalizes LLM vendor APIs behind one OpenAI-shaped interface.
The proxy centralizes keys, budgets, and model catalogs for many services.
Fallbacks belong in the gateway; task-level routing stays in application code.
Model aliases insulate apps from vendor renames and simplify failover.
Pair LiteLLM with observability and evals — cheap failover is worthless if quality collapses.