
LLM context windows explained: tokens, limits, and trade-offs

Every large language model has a context window — a hard cap on how many tokens it can process in a single request. That limit shapes everything from chat history length to how much source code an agent can read before answering. Understanding context windows is essential for anyone building with GPT, Claude, Gemini, or open-weight models: it drives cost, latency, and whether your app remembers what the user said three turns ago.

What is a context window?

A context window is the maximum number of tokens the model attends to at once. Tokens are not exactly words — English averages roughly three-quarters of a word per token, while code and JSON often tokenize less efficiently. A 128K context model can hold on the order of 90,000 words of mixed text, but that budget is shared across every piece of text in the request.

The window typically includes:

  • System instructions — persona, rules, tool definitions
  • Retrieved documents — RAG chunks, file uploads, search results
  • Conversation history — prior user and assistant messages
  • The current user message
  • Model output — the completion you are generating right now

Input and output both count toward the same ceiling on most APIs. If your window is 200K tokens and you send 180K tokens of repository files, you have only 20K left for the answer, and the next turn must fit everything again. Prefix caching of prior key-value states (more on that below) can make the re-sent text cheaper and faster to process, but cached tokens still occupy the window.
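The budget arithmetic in that example can be sketched in a few lines. This is a minimal illustration that assumes a single shared ceiling for input and output; the function name `remaining_output_budget` is hypothetical, and some APIs also enforce a separate, smaller cap on output length.

```python
def remaining_output_budget(window: int, input_tokens: int) -> int:
    """Tokens left for the completion when input and output share one ceiling."""
    if input_tokens > window:
        raise ValueError("prompt alone exceeds the context window")
    return window - input_tokens


# The example from the text: a 200K window with 180K tokens of repository files.
left = remaining_output_budget(200_000, 180_000)
print(left)  # 20000
```

Running a check like this before each request makes an overflow a clear error in your code rather than a truncated answer.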

Why context length keeps growing

Early chat models shipped with 4K–8K token windows — enough for a short essay. By 2024–2026, frontier models advertise 128K, 200K, or even 1M+ token contexts. Longer windows let developers paste entire codebases, legal contracts, or multi-hour meeting transcripts without manual chunking.

The engineering cost is real. In a naive transformer, attention compute scales quadratically with sequence length, so vendors draw on techniques such as sparse attention, sliding-window attention, and KV cache compression to make long contexts affordable at inference time. Bigger windows do not automatically mean better recall: models can still lose track of facts buried in the middle of a huge prompt, a phenomenon researchers call the lost-in-the-middle problem.
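To make "quadratically" concrete, the sketch below counts the query-key scores in one full attention matrix. It is a back-of-the-envelope illustration only: it assumes naive full attention, ignores causal masking and every optimization mentioned above, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key scores in one full (naive) self-attention matrix."""
    return seq_len * seq_len


base = attention_pairs(8_000)
for n in (8_000, 128_000, 1_000_000):
    ratio = attention_pairs(n) / base
    print(f"{n:>9,} tokens -> {ratio:,.0f}x the scores of an 8K prompt")
```

A prompt 16 times longer (8K to 128K) produces 256 times as many scores, which is why long-context serving depends on the tricks listed above.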

Tokens, pricing, and the KV cache

API bills usually split input tokens and output tokens. Input is often cheaper per token, but a 100K-token prompt still dwarfs a 500-token reply. For agent loops that re-send the full transcript every turn, costs compound fast — which is why agent tokenomics studies find code review and re-read steps eating the majority of LLM spend.
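The compounding effect is easy to underestimate, so here is a small sketch of it. It assumes no caching and that every turn re-sends the whole transcript as input; the turn sizes are made up for illustration and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(turn_sizes: list[int]) -> int:
    """Total input tokens billed when every turn re-sends the full transcript so far."""
    total = 0
    transcript = 0
    for size in turn_sizes:
        transcript += size   # the new turn joins the transcript
        total += transcript  # the whole transcript is sent as input again
    return total


turns = [2_000] * 10  # ten turns of 2,000 tokens each (hypothetical)
print(sum(turns))                      # 20000 tokens of new content
print(cumulative_input_tokens(turns))  # 110000 tokens billed as input
```

Ten modest turns add up to more than five times their own size in billed input, and the gap widens with every further turn.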

During generation the model stores key-value (KV) cache tensors for each prior token so it does not recompute attention over the entire prefix on every new token. Reusing cached prefix tokens on follow-up turns (when the API supports it) cuts latency and input cost dramatically — but only for unchanged leading text. Edit one character in turn three and everything after it may need recomputation.
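The "unchanged leading text" rule can be illustrated by measuring how many leading tokens two requests share. This is a simplified sketch, not any provider's caching logic: the token IDs are invented, `shared_prefix_len` is a hypothetical helper, and real caches may apply their own granularity and minimum lengths.

```python
def shared_prefix_len(previous: list[int], current: list[int]) -> int:
    """Count leading tokens that are identical in both requests."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


previous = [11, 12, 13, 14, 15, 16]
appended = previous + [17, 18]              # follow-up turn added at the end
edited = [11, 12, 99, 14, 15, 16, 17, 18]   # one token changed early on

print(shared_prefix_len(previous, appended))  # 6: the whole earlier prompt is reusable
print(shared_prefix_len(previous, edited))    # 2: everything from the edit onward is recomputed
```

The practical consequence is to put stable material (system prompt, tool schemas) first and volatile material last.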

Practical cost levers

  • Summarize old turns instead of forwarding verbatim chat logs.
  • Retrieve selectively with embedding-based search instead of dumping whole repos.
  • Keep system prompts lean — giant instruction files burn budget every call.
  • Choose model tier by task — small models for classification, large for synthesis.
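As a simpler cousin of the first lever, the sketch below drops the oldest turns until the history fits a token budget; a summarizing variant would replace the dropped turns with a short digest instead. The `count_tokens` heuristic (about 4 characters per token) and both function names are assumptions for illustration, and a real system would use the provider's tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in: about 4 characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(system: str, turns: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the most recent turns that fit in the budget."""
    remaining = budget - count_tokens(system)
    kept: list[str] = []
    for turn in reversed(turns):      # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system] + kept[::-1]      # restore chronological order


turns = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
trimmed = trim_history("s" * 40, turns, budget=250)
print(len(trimmed))  # 3: the system prompt plus the two newest turns
```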

Context vs memory vs fine-tuning

Beginners often conflate three different ideas:

Mechanism       | What it does                                  | Typical lifetime
Context window  | Text the model sees this request              | One API call (or cached prefix across calls)
External memory | Vector DB, notes, files the app injects later | Persistent across sessions
Fine-tuning     | Weight updates teaching style or format       | Baked into model weights

Context is volatile RAM; fine-tuning is long-term skill. Most production agents combine a modest context window with retrieval and structured tool outputs rather than hoping the model memorizes a 400-page manual in one shot.

Design patterns that respect the limit

Retrieval-augmented generation (RAG)

Split documents into chunks, embed each chunk, search for the top-k passages most relevant to the query, and inject only those into the prompt. You trade perfect recall for predictable token use. Good RAG usually beats a naive "paste everything" approach on both cost and accuracy for large corpora.
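The retrieval step can be sketched with nothing but the standard library. This is a toy under loud assumptions: `embed` here is just a bag-of-words count standing in for a real embedding model, the chunks are invented, and a production system would use a vector index rather than sorting every chunk.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "refunds are issued within five business days",
    "our office is closed on public holidays",
    "to request a refund open the orders page",
]
print(top_k("how do I request a refund", chunks, k=1))
```

Note that the toy embedding misses the first chunk because "refunds" and "refund" are different words; learned embeddings exist precisely to close that kind of gap.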

Map-reduce and hierarchical summarization

For material that exceeds the window, split it into sections, summarize each chunk with a smaller model, then synthesize a final answer from the summaries. Legal and research workflows use this pattern daily.
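The shape of that pipeline is sketched below. The summarizer is passed in as a function so the example runs offline; `first_sentence` is a deliberately trivial placeholder, not a real summarizer, and splitting by character count is an assumption (real pipelines usually split on section or paragraph boundaries).

```python
from typing import Callable


def split_sections(text: str, max_chars: int) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def map_reduce(text: str, summarize: Callable[[str], str], max_chars: int) -> str:
    """Summarize each section (map), then summarize the joined summaries (reduce)."""
    partials = [summarize(section) for section in split_sections(text, max_chars)]
    return summarize("\n".join(partials))


def first_sentence(text: str) -> str:
    """Placeholder summarizer so the sketch runs offline; swap in a model call."""
    return text.split(".")[0].strip() + "."


doc = "Alpha clause applies. Details follow. " * 3
print(map_reduce(doc, first_sentence, max_chars=38))
```

If the joined summaries are themselves too long for one call, the reduce step can be applied recursively, which is the hierarchical variant.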

Tool use instead of inline data

Rather than stuffing a database dump into context, expose tools: search_issues, read_file(path, lines), run_tests. The model pulls only what it needs. This is the core idea behind agent-first repositories where narrow, testable actions replace megabyte instruction files.
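A minimal sketch of the dispatch side follows. The tool name `read_file(path, lines)` comes from the text; everything else is an assumption for illustration, including the in-memory `FILES` stand-in for a repository and the `{"name": ..., "arguments": {...}}` call shape, which is not any specific provider's schema.

```python
import json
from typing import Any, Callable

# In-memory stand-in for a repository so the sketch runs anywhere.
FILES = {"app.py": "import os\n\ndef main():\n    pass\n"}


def read_file(path: str, lines: int) -> str:
    """Return only the first `lines` lines of a file, not the whole thing."""
    return "".join(FILES[path].splitlines(keepends=True)[:lines])


# Registry mapping tool names the model may request to Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"read_file": read_file}


def dispatch(tool_call_json: str) -> Any:
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])


result = dispatch('{"name": "read_file", "arguments": {"path": "app.py", "lines": 1}}')
print(repr(result))  # 'import os\n'
```

Only the requested line enters the context, so the token cost tracks what the model asked for rather than the size of the repository.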

Structured state outside the model

Store user preferences, order IDs, and session flags in your database. Pass a compact JSON snapshot into context instead of narrating the entire history in prose.
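One way to build such a snapshot is shown below. The field names and values are invented for illustration; the point is the compact separators (no padding whitespace) and sorted keys, which keep the serialized text short and stable from call to call.

```python
import json

# Hypothetical session state, as it might be loaded from your database.
session = {
    "user_id": "u_123",
    "plan": "pro",
    "open_order_ids": ["A-1001", "A-1002"],
    "prefers_email": True,
}

# Compact separators drop padding spaces; sorted keys keep the text stable.
snapshot = json.dumps(session, separators=(",", ":"), sort_keys=True)
prompt_block = f"Session state (JSON): {snapshot}"
print(prompt_block)
```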

Choosing a context size for your product

Bigger is not always better. Match window to workflow:

  • Customer support bot — 8K–32K is often enough with good retrieval.
  • Code assistant on one repo — 64K–128K helps; still prefer targeted file reads.
  • Document Q&A over PDFs — long window or map-reduce; test middle-of-doc accuracy.
  • Multi-agent orchestration — each sub-agent gets its own smaller window; a coordinator passes summaries.

Benchmark with real user prompts, not synthetic max-length tests. Measure dollars per resolved task, not tokens per request — the same philosophy as building for end-user utility first rather than demo-friendly specs.
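The metric itself is simple division, but writing it down keeps comparisons honest. The figures below are hypothetical and only show how a setup that spends more per request can still be judged on outcomes; `cost_per_resolved_task` is an illustrative helper name.

```python
def cost_per_resolved_task(total_cost_usd: float, resolved: int) -> float:
    """Total spend divided by the number of tasks actually resolved."""
    if resolved == 0:
        return float("inf")
    return total_cost_usd / resolved


# Hypothetical results from running the same 100 real prompts through two setups.
lean_retrieval = cost_per_resolved_task(total_cost_usd=20.0, resolved=80)
long_context = cost_per_resolved_task(total_cost_usd=45.0, resolved=90)

print(lean_retrieval)  # 0.25
print(long_context)    # 0.5
```

In this made-up comparison the long-context setup resolves more tasks but costs twice as much per resolution; whether that is worth it is a product decision, not a token count.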

Common mistakes

  • Assuming the model "remembers" prior sessions — unless you persist and re-inject history, each chat starts cold.
  • Stuffing duplicate content — repeating the system prompt and tool schemas every turn without caching.
  • Ignoring output budget — reserving no tokens for the answer guarantees truncated JSON or cut-off code.
  • Trusting ultra-long single-shot reasoning — chain-of-thought helps, but verify with tests and tools on critical paths.
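  • Confusing characters with tokens: a 50K-character paste is roughly 12.5K tokens of English prose, but the same length of text that tokenizes poorly (some tokenizers spend one or more tokens per CJK character) can overflow a 32K-token window.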

Quick reference: token math

Rough rules of thumb (English prose; your mileage varies):

  • 1 token ≈ 4 characters or 0.75 words
  • 1,000 tokens ≈ 750 words ≈ 1.5 pages
  • 128K tokens ≈ 96,000 words — novel-length, but shared across input and output

Tokenizers differ by model family. Use the provider's tokenizer utility when estimating bills for non-English or code-heavy prompts.
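For quick planning, the rules of thumb above translate into a small estimator. This is a rough sketch for English prose only: the 4-characters-per-token figure is the heuristic from this section, the function names are hypothetical, and billing estimates should come from the provider's tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough English-prose estimate using the 4-characters-per-token rule."""
    return math.ceil(len(text) / 4)


def fits(text: str, window: int, reserved_output: int) -> bool:
    """True if the estimated prompt plus the reserved output fits the window."""
    return estimate_tokens(text) + reserved_output <= window


paste = "x" * 50_000  # a 50K-character paste
print(estimate_tokens(paste))                               # 12500
print(fits(paste, window=32_000, reserved_output=4_000))    # True
print(fits(paste, window=16_000, reserved_output=4_000))    # False
```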
