LLM agent context budget and token management explained
Harbor Support’s tier-1 agent concatenated every CRM note, full ticket thread, and raw JSON from five tool calls into each turn. Threads beyond 40 messages hit provider limits; shorter threads still cost $0.42 median per ticket while first-contact resolution stalled at 61%. The model wasn’t dumb; it was drowning in context: duplicate policy snippets, 12 KB stack traces, and CRM diffs the agent had already acted on. Introducing an explicit context budget (fixed token caps per tier, rolling summarization past 8 K tokens of history, and structured tool-result compression) cut median cost per resolved ticket 38% and lifted first-contact resolution to 72%.
Context budgeting is the operational discipline of deciding what enters the context window each turn, what gets compressed into external memory, and what is dropped entirely.
This guide covers budget anatomy (system, tools, history, RAG, traces), allocation strategies, summarization triggers, tool-output shaping, prefill vs decode economics, multi-step agent loops, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
What a context budget is
A context budget is a per-turn token ceiling and allocation plan. Unlike a hard provider limit (128 K, 200 K), a budget is a product decision: you reserve capacity for the pieces that most affect the next action and refuse to spend tokens on low-value repetition.
Typical agent prompts are assembled from layers:
- System + policy — instructions, tone, safety rules (often 1–4 K tokens)
- Tool definitions — JSON schemas for every callable function (500–8 K+)
- Retrieved knowledge — RAG chunks, policy docs, user profile snippets
- Conversation history — user/assistant turns and prior tool calls
- Current turn — latest user message and pending tool results
- Scratch / reasoning — chain-of-thought or planner output if enabled
Without explicit caps, the noisiest layer wins: tool traces and retrieved docs expand until history is truncated mid-sentence or the model attends to irrelevant middle content — the same recall degradation documented in long-context research and production RAG systems.
Budget vs memory
Memory is durable storage across turns (summaries, vector stores, user profiles). Budgeting is what you pull from memory into this inference call. A 50 K-token window does not mean you should fill 50 K tokens; many teams target 60–75% utilization to leave headroom for tool replies and model output.
Allocation strategies
Fixed tier percentages
Harbor’s v2 allocator on a 32 K effective budget (after output reserve):
- system + tool definitions: max 6 K (pinned, never truncated)
- retrieved docs: max 8 K (ranked, deduped)
- rolling history: max 12 K (summarized beyond 6 K)
- current turn + tool results: remainder (hard cap 6 K per tool result)
Percentages are tuned per agent class: coding agents bias toward larger tool caps; support agents bias toward retrieval and compact history.
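A minimal sketch of a fixed-cap allocator using the figures above. The layer names, the `allocate` function, and the choice to compute the remainder from what was actually granted (rather than from the caps) are illustrative assumptions, not Harbor's actual code.

```python
# Caps in tokens, taken from the allocator figures above.
TOTAL_BUDGET = 32_000
LAYER_CAPS = {
    "system_tools": 6_000,
    "retrieved_docs": 8_000,
    "rolling_history": 12_000,
}


def allocate(requested: dict[str, int]) -> dict[str, int]:
    """Grant each capped layer min(request, cap); the current turn gets what is left."""
    granted = {
        layer: min(requested.get(layer, 0), cap)
        for layer, cap in LAYER_CAPS.items()
    }
    # Assumption: unused capacity in capped layers flows to the current turn.
    remainder = TOTAL_BUDGET - sum(granted.values())
    granted["current_turn"] = min(requested.get("current_turn", 0), remainder)
    return granted


grants = allocate({
    "system_tools": 5_000,
    "retrieved_docs": 10_000,   # over cap, clipped to 8_000
    "rolling_history": 15_000,  # over cap, clipped to 12_000
    "current_turn": 9_000,      # clipped to the 7_000 remainder
})
```

The allocator only decides how many tokens each layer may spend; choosing which content fills each grant is left to the ranking, summarization, and eviction steps.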
Sliding window with anchor turns
Keep the last N user/assistant turns verbatim plus “anchor” turns: first message (intent), last successful tool outcome, and any turn where the user corrected the agent. Drop middle turns or replace them with a one-line summary. This is safer than a blind “last 10 messages” window when turn 3 contained the account ID.
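The anchor-turn selection can be sketched as below. The `Turn` fields, including the `is_correction` and `tool_ok` flags, are hypothetical and assumed to be set by upstream code; this version drops middle turns rather than summarizing them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int
    role: str                    # "user", "assistant", or "tool"
    text: str
    is_correction: bool = False  # hypothetical flag: user corrected the agent
    tool_ok: bool = False        # hypothetical flag: successful tool outcome


def select_turns(turns: list[Turn], keep_last: int = 4) -> list[Turn]:
    """Keep the verbatim tail plus anchor turns, preserving original order."""
    if not turns:
        return []
    keep = {t.index for t in turns[-keep_last:]}           # recent tail
    keep.add(turns[0].index)                               # first message (intent)
    keep.update(t.index for t in turns if t.is_correction)  # user corrections
    successes = [t for t in turns if t.tool_ok]
    if successes:
        keep.add(successes[-1].index)                      # last successful tool outcome
    return [t for t in turns if t.index in keep]


history = [Turn(i, "user", f"message {i}") for i in range(10)]
history[3].is_correction = True
history[5].tool_ok = True
kept = [t.index for t in select_turns(history, keep_last=2)]
# kept is [0, 3, 5, 8, 9]
```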
Summarization triggers
Run a background summarizer when history_tokens > threshold or every K turns. Store the summary in episodic memory; inject only the summary + recent verbatim tail into the next call. Triggers beat summarizing every turn (latency + cost). Use a smaller/cheaper model for summaries; validate that summaries preserve entities (ticket IDs, dates, amounts) with regex spot checks.
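A rough sketch of the trigger and the regex spot check. The default threshold and interval, the function names, and the three patterns are illustrative assumptions; real ticket data will need patterns matched to its own ID and amount formats.

```python
import re

# Hypothetical patterns for the entity classes named above.
ENTITY_PATTERNS = [
    re.compile(r"#\d+"),                   # ticket or invoice IDs such as #8842
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO dates
    re.compile(r"\$\d+(?:\.\d{2})?"),      # dollar amounts
]


def should_summarize(history_tokens: int, turn_count: int,
                     threshold: int = 8_000, every_k: int = 6) -> bool:
    """Fire when history exceeds the token threshold or on every K-th turn."""
    return history_tokens > threshold or (turn_count > 0 and turn_count % every_k == 0)


def missing_entities(original: str, summary: str) -> set[str]:
    """Return entities present in the original text but absent from the summary."""
    found: set[str] = set()
    for pattern in ENTITY_PATTERNS:
        found.update(pattern.findall(original))
    return {entity for entity in found if entity not in summary}


lost = missing_entities(
    "Ticket #8842 opened 2024-03-01, refund $42.50 requested",
    "Customer wants a refund on ticket #8842",
)
# lost is {"2024-03-01", "$42.50"}
```

A non-empty result is a signal to reject or regenerate the summary before it replaces verbatim history.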
Priority eviction
Assign each chunk a priority score: user message = 100, system = 100 (pinned), failed tool = 80, successful tool = 40, stale RAG = 20. When over budget, evict lowest priority first. Never evict pinned policy or the active user question.
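One way to express this eviction rule, using the priority scores listed above. The `Chunk` type and the `evict` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Priority scores from the paragraph above.
PRIORITY = {
    "user_message": 100,
    "system": 100,
    "failed_tool": 80,
    "successful_tool": 40,
    "stale_rag": 20,
}


@dataclass
class Chunk:
    kind: str
    tokens: int
    pinned: bool = False  # pinned chunks are never evicted


def evict(chunks: list[Chunk], budget: int) -> tuple[list[Chunk], list[Chunk]]:
    """Evict unpinned chunks, lowest priority first, until the total fits the budget."""
    total = sum(c.tokens for c in chunks)
    candidates = sorted((c for c in chunks if not c.pinned),
                        key=lambda c: PRIORITY[c.kind])
    evicted: list[Chunk] = []
    for chunk in candidates:
        if total <= budget:
            break
        evicted.append(chunk)
        total -= chunk.tokens
    evicted_ids = {id(c) for c in evicted}  # compare by identity, not by value
    kept = [c for c in chunks if id(c) not in evicted_ids]
    return kept, evicted


chunks = [
    Chunk("system", 3_000, pinned=True),
    Chunk("user_message", 500, pinned=True),
    Chunk("failed_tool", 2_000),
    Chunk("successful_tool", 4_000),
    Chunk("stale_rag", 3_000),
]
kept, evicted = evict(chunks, budget=7_000)
# evicted kinds, in order: stale_rag, successful_tool
```

If pinned content alone exceeds the budget, this sketch returns an over-budget result; a caller would need a separate overflow policy for that case.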
Tool results: where budgets break
A single search_logs or run_sql call can return tens of thousands of tokens. Production agents need a tool result pipeline before anything hits the model:
- Structured excerpt — return top-k rows, column subset, error code + message only
- Schema-aware truncation — keep headers, clip cell length, add “+412 more rows”
- Reference by ID — store full payload in object storage; pass handle + summary to the model
- Deterministic compression — diff against previous result when polling the same endpoint
- Relevance filter — embed tool output, keep sentences matching the user query
The same patterns in context compression apply, but tool JSON benefits from domain parsers (SQL result sets, stack traces, HTML) rather than generic LLM summarization alone — summarizers can drop error codes the agent needs to retry correctly.
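As an illustration of the structured-excerpt and schema-aware-truncation items, here is a small sketch for tabular results. The `excerpt_rows` function and its defaults are assumptions; a real pipeline would likely add a token cap on top of the row and cell caps.

```python
def excerpt_rows(rows: list[dict], columns: list[str],
                 max_rows: int = 5, max_cell_chars: int = 80) -> str:
    """Keep the header and a column subset, clip long cells, and note hidden rows."""
    lines = [" | ".join(columns)]
    for row in rows[:max_rows]:
        cells = []
        for col in columns:
            value = str(row.get(col, ""))
            if len(value) > max_cell_chars:
                value = value[:max_cell_chars] + "..."
            cells.append(value)
        lines.append(" | ".join(cells))
    hidden = len(rows) - max_rows
    if hidden > 0:
        lines.append(f"+{hidden} more rows")
    return "\n".join(lines)


# 417 fake rows; the wide "note" column is excluded by the column subset.
rows = [{"id": i, "status": "overdue", "note": "x" * 200} for i in range(417)]
excerpt = excerpt_rows(rows, ["id", "status"], max_rows=5)
# The last line of excerpt is "+412 more rows"
```

Because the excerpt is built by a deterministic parser, fields such as error codes survive exactly as returned, which a generic LLM summary cannot guarantee.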
Multi-tool loops
In ReAct-style loops, each iteration re-sends prior tool I/O. Budgeting must account for accumulated trace length. Options: compress older tool steps into a bullet ledger (“Step 2: found invoice #8842, status overdue”), or reset working context after subgoal completion while persisting state in external memory.
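The bullet-ledger option might look like the sketch below. The `Step` type is hypothetical, and its `finding` field is assumed to be a one-line outcome produced by a tool wrapper or a cheap summarizer.

```python
from dataclasses import dataclass


@dataclass
class Step:
    number: int
    tool: str
    raw_output: str
    finding: str  # assumed one-line outcome for the ledger


def render_trace(steps: list[Step], keep_verbatim: int = 1) -> str:
    """Render older steps as ledger lines and keep only the newest steps verbatim."""
    cutoff = max(len(steps) - keep_verbatim, 0)
    older, recent = steps[:cutoff], steps[cutoff:]
    lines = [f"Step {s.number}: {s.finding}" for s in older]
    for s in recent:
        lines.append(f"Step {s.number} ({s.tool}) output:\n{s.raw_output}")
    return "\n".join(lines)


steps = [
    Step(1, "search_crm", '{"customer": 17, "raw": "..."}', "found customer 17"),
    Step(2, "run_sql", '{"rows": [[8842, "overdue"]]}', "found invoice #8842, status overdue"),
    Step(3, "run_sql", '{"rows": [[8842, 120.0]]}', "invoice #8842 amount is 120.00"),
]
trace = render_trace(steps, keep_verbatim=1)
```

Only step 3 keeps its raw payload here; steps 1 and 2 each cost a single ledger line on every later iteration.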
Prefill vs decode economics
Most providers price input (prefill) and output (decode) separately; long agent traces make input tokens the dominant cost. Every turn in a 15-step loop re-prefills the entire history unless you use prompt caching or prefix-stable layouts.
- Pin static system + tool schemas at the prefix for cache hits
- Append-only history helps some hosts reuse KV blocks on shared prefixes
- Tightening the input budget usually saves more than lowering max_tokens, because every loop step pays for the full input again
- Route summarization to a cheap model; keep the planner on the capable model
Tie budget metrics to cost dashboards: tokens per resolved task, not tokens per API call.
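A back-of-the-envelope sketch of why input dominates in a loop. It assumes no prompt caching and a trace that grows by a fixed amount per step; the token counts and both per-million-token prices are hypothetical parameters, not any provider's rates.

```python
def loop_token_totals(prefix_tokens: int, growth_per_step: int,
                      output_per_step: int, steps: int) -> tuple[int, int]:
    """Step k (0-based) re-sends the prefix plus k earlier steps of accumulated trace."""
    input_total = sum(prefix_tokens + growth_per_step * k for k in range(steps))
    output_total = output_per_step * steps
    return input_total, output_total


def loop_cost_usd(input_total: int, output_total: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Prices are per million tokens and are supplied by the caller."""
    return (input_total * input_price_per_mtok
            + output_total * output_price_per_mtok) / 1_000_000


# 15-step loop: 6 K static prefix, 1.5 K of new trace per step, 300 output tokens per step.
input_total, output_total = loop_token_totals(6_000, 1_500, 300, 15)
# input_total is 247_500 and output_total is 4_500
```

With these assumed numbers the loop sends 55 times more input tokens than it generates, so even with output priced several times higher per token, input remains the larger line item.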
Harbor Support refactor (worked example)
Baseline: 14 K system+tools, unbounded CRM fetch (median 9 K), full thread history (median 11 K), raw tool JSON (median 7 K per turn). P95 prompt size 41 K; 8% of tickets hit truncation errors.
Changes:
- CRM tool returns structured excerpt: status, owner, last 3 public notes (cap 800 tokens)
- Policy RAG: max 6 chunks, MMR dedupe, citation IDs not full paragraphs
- History: verbatim last 4 turns + rolling summary refreshed every 6 turns
- Hard stop at 28 K input; overflow triggers emergency summarize-and-retry once
- Logged budget_evictions per ticket for tuning
Results: Median input tokens 31 K → 19 K; cost per resolution $0.42 → $0.26; first-contact resolution 61% → 72%; truncation errors 8% → 0.3%. Escalations to humans fell because the model stopped losing account IDs buried in turn 6.
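The hard-stop rule in the changes above could be sketched as follows. The `count_tokens` and `summarize` callables are placeholders supplied by the caller, and raising `ContextOverflow` after the single retry is an assumption; the source does not say what Harbor does when the retry still overflows.

```python
from typing import Callable


class ContextOverflow(Exception):
    """Raised when the prompt still exceeds the hard stop after one retry."""


def enforce_hard_stop(parts: list[str],
                      count_tokens: Callable[[str], int],
                      summarize: Callable[[list[str]], list[str]],
                      hard_stop: int = 28_000) -> list[str]:
    """Return parts unchanged if they fit; otherwise summarize once and re-check."""
    if sum(count_tokens(p) for p in parts) <= hard_stop:
        return parts
    retried = summarize(parts)  # single emergency summarization pass
    if sum(count_tokens(p) for p in retried) > hard_stop:
        # Assumed behavior: fail closed so the caller can escalate.
        raise ContextOverflow("prompt exceeds hard stop after summarize-and-retry")
    return retried


# Toy stand-ins: character count as the token counter, head clipping as the summarizer.
clip = lambda parts: [p[:10] for p in parts]
result = enforce_hard_stop(["a" * 30, "b" * 30], len, clip, hard_stop=50)
```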
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Long support / sales threads | Rolling summary + anchor turns | Full verbatim history every turn |
| SQL / log tool agents | Structured excerpt + row cap | LLM summary of raw JSON only |
| Single-shot Q&A | Tight RAG top-k, minimal system | Loading entire knowledge base |
| 15+ step coding agent | Step ledger + external file state | Re-sending all file contents each step |
| Sub-100 ms latency SLA | Small budget, fewer tools exposed | On-the-fly summarization every turn |
| Regulated audit trail | Full logs in DB; budgeted view to model | Dropping history without archival |
Pitfalls
- Summarization amnesia — summaries drop ticket IDs, SKU codes, or negation (“do not refund”). Validate with entity retention tests.
- Over-compression before action — the agent never sees the row it needs; tighten compression only after confirming the model still has enough to decide.
- Tool schema bloat — 40 tools × 400 tokens each consumes budget before any user text. Use dynamic tool routing or grouped tools.
- Duplicate RAG — same policy chunk retrieved every turn; dedupe by document ID and freshness.
- Budget without telemetry — you cannot tune caps you do not measure; log tokens per layer per turn.
- Confusing window size with budget — 200 K context invites slow, expensive, unfocused prompts.
- Evicting corrections — user fixes (“wrong account”) must be anchor-pinned or the agent repeats mistakes.
- Cache-unfriendly ordering — mutating system prompt breaks prefix cache; keep static content stable.
Production checklist
- Define max input tokens per agent class (include output reserve).
- Allocate fixed caps per layer: system, tools, RAG, history, current turn.
- Implement tool-result excerpting before model injection.
- Set summarization trigger on history token count or turn count.
- Pin anchor turns (intent, corrections, last tool success).
- Log tokens per layer, evictions, and truncation events every turn.
- Archive full traces externally for audit even when the model sees summaries.
- Test entity retention after summarization on real ticket samples.
- Expose dynamic tool subsets when tool schemas exceed 20% of budget.
- Align budget metrics with cost-per-resolved-task, not per-call averages.
- Load-test P95 prompt size under worst-case tool payloads.
- Document overflow behavior: fail closed, summarize-and-retry, or escalate.
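Several checklist items reduce to emitting one structured record per turn. The sketch below shows one possible shape; the field names and the `layer_report` function are assumptions, and the 20% threshold comes from the dynamic tool subset item above.

```python
import json


def layer_report(layer_tokens: dict[str, int], budget: int,
                 evictions: int = 0, truncated: bool = False) -> dict:
    """Build a per-turn telemetry record: tokens per layer plus budget health flags."""
    total = sum(layer_tokens.values())
    tool_share = layer_tokens.get("tools", 0) / budget
    return {
        "layers": dict(layer_tokens),
        "total_tokens": total,
        "over_budget": total > budget,
        "budget_evictions": evictions,
        "truncated": truncated,
        # Tool schemas above 20% of budget suggest exposing a dynamic subset.
        "needs_dynamic_tools": tool_share > 0.20,
    }


report = layer_report(
    {"system": 2_000, "tools": 16_000, "rag": 4_000, "history": 6_000, "current_turn": 2_000},
    budget=32_000,
    evictions=3,
)
log_line = json.dumps(report)  # ready for whatever log sink is in use
```

The 16 K tools figure mirrors the 40 tools × 400 tokens pitfall: half the budget is spent before any user text arrives, so the record flags it.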
Key takeaways
- Context budgeting is a product control, not just a provider limit.
- Tool results usually dominate growth; compress them before history.
- Rolling summaries plus anchor turns beat blind sliding windows.
- Harbor Support cut cost 38% and improved resolution 11 points with explicit caps.
- Measure tokens per layer; tune budgets with eviction telemetry.
Related reading
- Agent memory explained — durable tiers that feed the budget each turn
- Tool result compression explained — shaping JSON and logs before injection
- LLM cost optimization explained — routing, caching, and unit economics
- Context compression explained — RAG and transcript shrinking techniques