Guide

LLM agent context budget and token management explained

Harbor Support’s tier-1 agent concatenated every CRM note, full ticket thread, and raw JSON from five tool calls into each turn. Threads beyond 40 messages hit provider limits; shorter threads still cost $0.42 median per ticket while first-contact resolution stalled at 61%. The model wasn’t dumb — it was drowning in context: duplicate policy snippets, 12 KB stack traces, and CRM diffs the agent had already acted on. Introducing an explicit context budget — fixed token caps per tier, rolling summarization past 8 K tokens of history, and structured tool-result compression — cut median cost per resolved ticket 38% and lifted first-contact resolution to 72%. Context budgeting is the operational discipline of deciding what enters the context window each turn, what gets compressed into external memory, and what is dropped entirely.

This guide covers budget anatomy (system, tools, history, RAG, traces), allocation strategies, summarization triggers, tool-output shaping, prefill vs decode economics, multi-step agent loops, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

What a context budget is

A context budget is a per-turn token ceiling and allocation plan. Unlike a hard provider limit (128 K, 200 K), a budget is a product decision: you reserve capacity for the pieces that most affect the next action and refuse to spend tokens on low-value repetition.

Typical agent prompts are assembled from layers:

System + policy — instructions, tone, safety rules (often 1–4 K tokens)
Tool definitions — JSON schemas for every callable function (500–8 K+)
Retrieved knowledge — RAG chunks, policy docs, user profile snippets
Conversation history — user/assistant turns and prior tool calls
Current turn — latest user message and pending tool results
Scratch / reasoning — chain-of-thought or planner output if enabled

Without explicit caps, the noisiest layer wins: tool traces and retrieved docs expand until history is truncated mid-sentence or the model attends to irrelevant middle content — the same recall degradation documented in long-context research and production RAG systems.

Budget vs memory

Memory is durable storage across turns (summaries, vector stores, user profiles). Budgeting is what you pull from memory into this inference call. A 50 K-token window does not mean you should fill 50 K tokens; many teams target 60–75% utilization to leave headroom for tool replies and model output.

Allocation strategies

Fixed tier percentages

Harbor’s v2 allocator on a 32 K effective budget (after output reserve):

system + tools:     max 6 K  (pinned, never truncated)
retrieved docs:     max 8 K  (ranked, deduped)
rolling history:    max 12 K (summarized beyond 6 K)
current turn + tools: remainder (hard cap 6 K per tool)

Percentages are tuned per agent class: coding agents bias toward larger tool caps; support agents bias toward retrieval and compact history.

Sliding window with anchor turns

Keep the last N user/assistant turns verbatim plus “anchor” turns: first message (intent), last successful tool outcome, and any turn where the user corrected the agent. Drop middle turns or replace them with a one-line summary. Safer than blind “last 10 messages” when turn 3 contained the account ID.

Summarization triggers

Run a background summarizer when history_tokens > threshold or every K turns. Store the summary in episodic memory; inject only the summary + recent verbatim tail into the next call. Triggers beat summarizing every turn (latency + cost). Use a smaller/cheaper model for summaries; validate that summaries preserve entities (ticket IDs, dates, amounts) with regex spot checks.

Priority eviction

Assign each chunk a priority score: user message = 100, system = 100 (pinned), failed tool = 80, successful tool = 40, stale RAG = 20. When over budget, evict lowest priority first. Never evict pinned policy or the active user question.

Tool results: where budgets break

A single search_logs or run_sql call can return tens of thousands of tokens. Production agents need a tool result pipeline before anything hits the model:

Structured excerpt — return top-k rows, column subset, error code + message only
Schema-aware truncation — keep headers, clip cell length, add “+412 more rows”
Reference by ID — store full payload in object storage; pass handle + summary to the model
Deterministic compression — diff against previous result when polling the same endpoint
Relevance filter — embed tool output, keep sentences matching the user query

The same patterns in context compression apply, but tool JSON benefits from domain parsers (SQL result sets, stack traces, HTML) rather than generic LLM summarization alone — summarizers can drop error codes the agent needs to retry correctly.

Multi-tool loops

In ReAct-style loops, each iteration re-sends prior tool I/O. Budgeting must account for accumulated trace length. Options: compress older tool steps into a bullet ledger (“Step 2: found invoice #8842, status overdue”), or reset working context after subgoal completion while persisting state in external memory.

Prefill vs decode economics

Most providers price input (prefill) and output (decode) separately; long agent traces make input tokens the dominant cost. Every turn in a 15-step loop re-prefills the entire history unless you use prompt caching or prefix-stable layouts.

Pin static system + tool schemas at the prefix for cache hits
Append-only history helps some hosts reuse KV blocks on shared prefixes
Shorter budgets directly reduce prefill dollars more than shaving max_tokens
Route summarization to a cheap model; keep the planner on the capable model

Tie budget metrics to cost dashboards: tokens per resolved task, not tokens per API call.

Harbor Support refactor (worked example)

Baseline: 14 K system+tools, unbounded CRM fetch (median 9 K), full thread history (median 11 K), raw tool JSON (median 7 K per turn). P95 prompt size 41 K; 8% of tickets hit truncation errors.

Changes:

CRM tool returns structured excerpt: status, owner, last 3 public notes (cap 800 tokens)
Policy RAG: max 6 chunks, MMR dedupe, citation IDs not full paragraphs
History: verbatim last 4 turns + rolling summary refreshed every 6 turns
Hard stop at 28 K input; overflow triggers emergency summarize-and-retry once
Logged budget_evictions per ticket for tuning

Results: Median input tokens 31 K → 19 K; cost per resolution $0.42 → $0.26; first-contact resolution 61% → 72%; truncation errors 8% → 0.3%. Escalations to humans fell because the model stopped losing account IDs buried in turn 6.

Technique decision table

Scenario	Prefer	Avoid
Long support / sales threads	Rolling summary + anchor turns	Full verbatim history every turn
SQL / log tool agents	Structured excerpt + row cap	LLM summary of raw JSON only
Single-shot Q&A	Tight RAG top-k, minimal system	Loading entire knowledge base
15+ step coding agent	Step ledger + external file state	Re-sending all file contents each step
Sub-100 ms latency SLA	Small budget, fewer tools exposed	On-the-fly summarization every turn
Regulated audit trail	Full logs in DB; budgeted view to model	Dropping history without archival

Pitfalls

Summarization amnesia — summaries drop ticket IDs, SKU codes, or negation (“do not refund”). Validate with entity retention tests.
Over-compression before action — agent never sees the row it needs; cap compression only after confirming the model has enough to decide.
Tool schema bloat — 40 tools × 400 tokens each consumes budget before any user text. Use dynamic tool routing or grouped tools.
Duplicate RAG — same policy chunk retrieved every turn; dedupe by document ID and freshness.
Budget without telemetry — you cannot tune caps you do not measure; log tokens per layer per turn.
Confusing window size with budget — 200 K context invites slow, expensive, unfocused prompts.
Evicting corrections — user fixes (“wrong account”) must be anchor-pinned or the agent repeats mistakes.
Cache-unfriendly ordering — mutating system prompt breaks prefix cache; keep static content stable.

Production checklist

Define max input tokens per agent class (include output reserve).
Allocate fixed caps per layer: system, tools, RAG, history, current turn.
Implement tool-result excerpting before model injection.
Set summarization trigger on history token count or turn count.
Pin anchor turns (intent, corrections, last tool success).
Log tokens per layer, evictions, and truncation events every turn.
Archive full traces externally for audit even when the model sees summaries.
Test entity retention after summarization on real ticket samples.
Expose dynamic tool subsets when tool schemas exceed 20% of budget.
Align budget metrics with cost-per-resolved-task, not per-call averages.
Load-test P95 prompt size under worst-case tool payloads.
Document overflow behavior: fail closed, summarize-and-retry, or escalate.

Key takeaways

Context budgeting is a product control, not just a provider limit.
Tool results usually dominate growth; compress them before history.
Rolling summaries plus anchor turns beat blind sliding windows.
Harbor Support cut cost 38% and improved resolution 11 points with explicit caps.
Measure tokens per layer; tune budgets with eviction telemetry.