Guide
LLM cost optimization explained
A support chatbot that worked fine in staging can burn thousands of dollars a week in production — not because traffic exploded, but because every ticket stuffed 40 pages of documentation into the prompt and asked a frontier model to write a novel-length reply. LLM cost optimization is the discipline of measuring where inference spend actually goes, then applying targeted changes — model routing, prompt trimming, caching, quantization, and serving efficiency — without silently degrading the user experience. This guide covers token economics, the highest-leverage savings levers, a Harbor support bot worked example, a technique decision table, common pitfalls, and a production checklist. Pair cost work with prompt caching fundamentals and inference serving when you operate your own GPUs.
Where the money goes
Hosted APIs and self-hosted clusters both bill in the same dimensions — you just see them as invoices or GPU hours. Understanding the breakdown is step one before optimizing anything.
Input vs output tokens
Most providers price input tokens (the prompt, system message, retrieved context, tool results) separately from output tokens (generated completion). Output is often 2–4x more expensive per token because decode is serial and memory-heavy. A chat that sends 8K tokens of RAG context and receives a 200-token answer still pays heavily on the input side — trimming context is frequently the fastest win.
Prefill vs decode
During prefill, the model processes the entire prompt in parallel.
Long prompts increase time-to-first-token and GPU occupancy. During
decode, the model generates one token at a time until
max_tokens or stop sequences fire. Verbose models that ramble cost
more on every extra sentence — cap output length and use structured formats when
possible.
Hidden multipliers
- Retries — failed JSON parsing or tool calls double-bill the same user turn.
- Agent loops — each tool round-trip re-sends conversation history plus new observations.
- Embeddings — vectorizing large corpora on every query instead of caching document embeddings.
- Evaluation and logging — shadow traffic to a second model for quality checks.
Instrument every request with token counts split by role (system, user, assistant, tool) and by model tier. You cannot optimize what you do not measure.
Model selection and routing
Not every task needs the largest model. A three-tier cascade routes cheap classifiers or small models first, escalating only when confidence is low:
- Tier 0 — rules and retrieval — FAQ keyword match, cached answers, no LLM call at all.
- Tier 1 — small/fast model — 7B–13B class or a mini API tier for classification, extraction, and simple rewrites.
- Tier 2 — frontier model — complex reasoning, multi-step planning, or high-stakes drafting.
A lightweight router model (or even logistic regression on embedding similarity) can classify intent in under 50 ms. If 70% of tickets are password resets, never touch GPT-4-class pricing for them. For coding assistants, route syntax questions to a code-specialized mid-tier model and reserve frontier capacity for architecture discussions.
Fine-tuning a smaller model on your domain often beats prompting a frontier model — fewer tokens needed in the system prompt because behavior is baked into weights, and per-token rates drop with model size.
Prompt and context trimming
Prompt engineering is not only about quality — it is a cost lever. Every redundant instruction and duplicated example in the system prompt is billed on every request.
- Compress system prompts — replace three paragraphs of tone guidance with five bullet rules; move rarely used policies to retrieval.
- Limit few-shot examples — two strong examples beat six mediocre ones; use dynamic example selection instead of a static block.
- Structured outputs — JSON or tool schemas reduce rambling and retry loops (structured outputs guide).
- Summarize history — after N turns, compress older messages into a rolling summary instead of re-sending full transcripts.
- Right-size context windows — do not pay for 128K if median prompts are 2K (context windows guide).
For RAG pipelines, retrieval quality matters more than retrieval quantity. Retrieve top-3 chunks with a reranker instead of top-20 without reranking. Strip HTML boilerplate before embedding. Deduplicate overlapping chunks so the same paragraph is not billed three times under different headings.
Caching layers
Caching is the highest ROI technique when prompts share stable prefixes — support bots, legal review templates, and code assistants with fixed system instructions.
Provider prompt caching
Major APIs cache identical prompt prefixes across requests at reduced per-token rates. Layout prompts with static content first (system rules, tool definitions, document corpus headers) and variable content last (user question). See the dedicated prompt caching guide for breakpoint placement and TTL behavior.
Semantic response cache
Store (embedding, answer) pairs. When a new question is cosine-similar above a threshold to a prior query, return the cached answer without calling the LLM. Works well for FAQs; risky for time-sensitive facts — attach TTLs and invalidation on knowledge-base updates.
Self-hosted KV reuse
On your own stack, prefix caching and KV cache management in serving engines like vLLM avoid recomputing prefill for shared system prompts.
Infrastructure and algorithmic efficiency
When you control the stack, hardware choices compound:
- Quantization — INT8 or INT4 weights cut memory bandwidth per decode step, fitting larger batches on the same GPU (quantization guide).
- Continuous batching — mix short and long requests on one GPU instead of padding to the longest sequence.
- Speculative decoding — a small draft model proposes tokens; the large model verifies in parallel, raising tokens-per-second.
- Right GPU count — one under-provisioned GPU with a deep queue wastes user patience; two GPUs with good batching often beat four idle ones.
Compare cost per successful task, not cost per token. A cheaper model that fails 30% of the time and triggers retries can cost more than a reliable mid-tier model on the first attempt.
Worked example: Harbor support bot
Harbor's customer support assistant handled 12,000 tickets per week. Initial stack: one frontier model, full 50-page knowledge base injected per ticket, average 9,200 input tokens and 480 output tokens per conversation, three turns median.
Baseline weekly estimate: ~12,000 tickets × 3 turns × (9,200 input + 480 output) ≈ 346M tokens/week — dominated by repeated documentation dumps.
Changes applied over four weeks:
- Intent router sent 58% of tickets to a fine-tuned 8B model (Tier 1).
- RAG top-k reduced from 15 to 4 chunks with a cross-encoder reranker; median input dropped to 2,100 tokens.
- System prompt compressed from 1,800 to 420 tokens; static prefix marked for provider caching (90% cache hit rate on prefix).
- Output capped at 350 tokens with JSON schema for ticket metadata.
- Semantic cache answered 11% of Tier-1 queries with zero LLM calls.
Result: effective billed tokens fell ~74% with CSAT unchanged (+0.2 points). The reranker and router added ~40 ms latency — acceptable for async email support. Biggest single win: stopping the full-doc dump (input tokens −68% alone).
Technique decision table
| Technique | Best when | Typical savings | Risk |
|---|---|---|---|
| Model cascade / routing | Traffic mixes simple and complex intents | 40–70% on blended traffic | Mis-routed hard queries get weak answers |
| RAG chunk trimming + rerank | Large knowledge bases, long median prompts | 50–80% input tokens | Missed retrieval on edge-case docs |
| Prompt caching (API) | Stable system prompts, high QPS | 30–60% on input side | Cache invalidation on prompt edits |
| Semantic response cache | Repeated FAQ-style questions | 10–25% zero-call hits | Stale answers if TTL too long |
| Quantization + batching | Self-hosted inference at scale | 2–4x throughput per GPU | Quality loss on 4-bit without calibration |
| Fine-tune smaller model | Stable domain, high volume | 5–20x vs frontier API | Upfront training + eval investment |
| Output token caps + JSON | Verbose models, structured downstream | 20–50% output tokens | Truncated answers if cap too tight |
Common pitfalls
- Optimizing average tokens, ignoring p99 — one agent loop with twelve tool calls dominates the bill; cap iterations.
- Cheaper model, more retries — measure end-to-end cost per successful outcome, not per call.
- Cache without invalidation — prompt edits silently serve stale cached prefixes until TTL expires.
- Over-aggressive summarization — compressed history loses constraints the user stated three turns ago.
- Ignoring embedding costs — re-embedding the whole corpus nightly when only 2% of docs changed.
- No quality guardrails after cuts — run eval suites on a golden set before shipping routing changes.
- Shadow traffic at full price — sample 1–5% for quality checks, not 100% dual-model runs.
- Max tokens left at defaults — 4,096 output ceiling invites runaway generation on mis-parsed tool calls.
Production checklist
- Per-request logging: model, input tokens, output tokens, latency, success/fail.
- Dashboard: cost per day by feature, model tier, and team.
- Alerts when daily spend exceeds trailing 7-day average by 25%.
- Intent router or rules layer before any frontier model call.
- RAG pipeline audited for median retrieved token count.
- System prompt reviewed quarterly — delete instructions nobody follows.
- Prompt caching breakpoints documented and tested after deploys.
- Output
max_tokensset per use case, not globally at model max. - Agent and tool loops hard-capped (e.g., 5 iterations).
- Golden eval set re-run before routing or quantization changes ship.
Key takeaways
- Input tokens — especially bloated RAG context — often dominate bills more than output length.
- Model routing sends easy work to cheap tiers; frontier capacity is a scarce resource.
- Caching (prompt, semantic, KV) multiplies savings when prefixes repeat across traffic.
- Measure cost per successful task, not cost per token in isolation.
- Pair cost cuts with eval guardrails so savings do not become silent quality regressions.
Related reading
- LLM prompt caching explained — prefix reuse, cache breakpoints, and static-first layout patterns
- LLM inference serving explained — batching, vLLM, and GPU throughput for self-hosted stacks
- LLM model quantization explained — INT8/INT4 trade-offs and calibration for production
- RAG explained — retrieval pipelines where chunk budget directly drives token cost