Guide
LLM tool result compression explained
Harbor Analytics' on-call agent investigated a payment spike by calling list_logs with a one-hour window. The API returned 14,200 rows, roughly 180K tokens of JSON once serialized into the ReAct observation block. The gateway rejected the next turn outright. After engineers temporarily removed the gateway cap, the prompt was accepted, but the model missed the root cause: it skimmed the first hundred rows, never reached the error burst at offset 8,400, and blamed a routine deploy. Accuracy on log-heavy incidents was 41%.
Tool result compression is the layer that shrinks tool outputs before they enter the next model turn. It sits after execution and before history assembly. Unlike document compression (one big file) or conversation pruning (chat turns), it targets ephemeral observations from function calls that can dwarf everything else in an agent loop. This guide covers compression taxonomy, per-tool budgets, the Harbor Analytics refactor, a technique decision table versus raw dumps, pitfalls, and a production checklist.
What tool result compression is
In agent architectures, each tool call produces an observation — the string (usually JSON) returned to the model as context for the next thought or action. APIs for logs, databases, CRM search, and file listings routinely return payloads far larger than a single model turn can use. Compression transforms that payload into a smaller representation that preserves decision-relevant signal.
Compression is not optional once agents chain more than a few tools. A typical production budget might allocate:
- 2–4K tokens for system policy and persona
- 4–8K tokens for user history and retrieved docs
- 1–3K tokens for tool definitions (after dynamic selection)
- 1–4K tokens for recent observations
One unbounded SELECT * or log dump can consume the entire window. Compression
enforces observation budgets: hard per-tool and per-turn caps with
documented fallback behavior when exceeded.
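A hard cap with a documented fallback could look roughly like the sketch below. The 4-characters-per-token estimate, the function names, and the shape of the fallback stub are all assumptions for illustration; a real deployment would use the model's own tokenizer.

```python
import json

def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

def enforce_budget(observation: dict, max_tokens: int) -> dict:
    """Return the observation if it fits the cap, else a small fallback stub."""
    serialized = json.dumps(observation, separators=(",", ":"))
    tokens = estimate_tokens(serialized)
    if tokens <= max_tokens:
        return observation
    # Documented fallback: tell the model what happened and how to recover,
    # instead of silently cutting the payload mid-structure.
    return {
        "truncated": True,
        "reason": "observation_budget_exceeded",
        "estimated_tokens": tokens,
        "max_tokens": max_tokens,
        "hint": "narrow the query or request a smaller page",
    }
```

The stub is deliberately tiny so that the fallback itself can never exceed the budget it reports on.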
Compression taxonomy
1. Structural projection
Keep only fields the model needs for the current task. A search_customers response might project to {id, name, tier, open_ticket_count} and drop billing addresses, audit trails, and marketing tags. Use JSONPath or server-side DTOs; document the projection per tool in your registry alongside schema definitions.
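As a minimal sketch, projection can be a plain allow-list applied to each row. The field list comes from the search_customers example above; the function name and the sample record are hypothetical.

```python
from typing import Any

# Allow-list for search_customers, taken from the example projection above.
CUSTOMER_PROJECTION = ("id", "name", "tier", "open_ticket_count")

def project(rows: list[dict[str, Any]], fields: tuple[str, ...]) -> list[dict[str, Any]]:
    """Keep only allow-listed fields; fields missing from a row are skipped."""
    return [{f: row[f] for f in fields if f in row} for row in rows]
```

An allow-list is safer than a deny-list here: a new field added by the upstream API stays out of the prompt until someone decides the model needs it.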
2. Truncation with structure
Naive head truncation loses tail signal (Harbor's missed error burst). Better patterns:
- Head + tail: first N and last M rows with a ... [{total - N - M} rows omitted] ... marker.
- Stratified sampling: evenly spaced rows across the full range.
- Error-first filtering: rows matching severity or status predicates before cap.
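The three patterns above can be sketched as small list helpers. The function names, the is_error predicate, and the choice to fill leftover slots with a stratified sample are illustrative assumptions, not a prescribed implementation.

```python
from typing import Any, Callable

def head_tail(rows: list[Any], head: int, tail: int) -> list[Any]:
    """First `head` and last `tail` rows, with an omission marker in between."""
    total = len(rows)
    if total <= head + tail:
        return list(rows)
    marker = f"... [{total - head - tail} rows omitted] ..."
    return rows[:head] + [marker] + rows[total - tail:]

def stratified(rows: list[Any], k: int) -> list[Any]:
    """Up to k evenly spaced rows, always including the first and last row."""
    total = len(rows)
    if k <= 0:
        return []
    if total <= k:
        return list(rows)
    if k == 1:
        return [rows[0]]
    # Integer arithmetic keeps the indices strictly increasing.
    return [rows[(i * (total - 1)) // (k - 1)] for i in range(k)]

def error_first(rows: list[Any], cap: int, is_error: Callable[[Any], bool]) -> list[Any]:
    """Matching rows first (up to cap), then a stratified sample of the rest."""
    errors = [r for r in rows if is_error(r)]
    others = [r for r in rows if not is_error(r)]
    kept = errors[:cap]
    remaining = cap - len(kept)
    if remaining > 0:
        kept += stratified(others, remaining)
    return kept
```

On a 14,200-row result with an error burst near offset 8,400, head_tail alone would still miss the burst; error_first surfaces it regardless of position, which is why the order of filtering and sampling matters.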
3. Pagination and lazy follow-up
Return a page plus next_cursor and teach the model (via tool description) to call get_page only when it needs more. The first observation stays small; depth is opt-in. Pair with parallel calls carefully: four paginated reads can still overflow if each page is huge.
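A cursor-based page could be shaped like the sketch below. It assumes the full result is already available server-side and uses a stringified offset as the cursor; the page size of 50 and the response field names other than next_cursor are assumptions.

```python
from typing import Any, Optional

def get_page(rows: list[Any], cursor: Optional[str] = None, page_size: int = 50) -> dict:
    """Return one page plus a cursor for the next page (None when exhausted)."""
    start = int(cursor) if cursor else 0
    end = min(start + page_size, len(rows))
    return {
        "items": rows[start:end],
        "next_cursor": str(end) if end < len(rows) else None,
        "total": len(rows),  # lets the model judge whether more pages are worth fetching
    }
```

Returning the total alongside the page gives the model enough information to decide whether to paginate further or switch to a narrower query.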
4. Summarization passes
Run a cheap model or rules engine to compress raw output into prose or bullet facts. Techniques from chain-of-density summarization work well for log clusters: iterate until a token target while preserving named entities (service names, error codes, trace IDs). Log the summary and retain a handle to full data for human review.
5. External storage handles
Store the full payload in object storage or a scratch table; return
{ref: "obs_7f3a", byte_size: 2400000, preview: [...]}. Expose a
fetch_observation_slice tool for targeted follow-up. This pattern scales to
megabyte responses without stuffing the context window.
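The handle pattern might be sketched as follows. An in-memory dict stands in for object storage or a scratch table, and the ref format, preview size, and error shape are assumptions made for the example.

```python
import json
import uuid
from typing import Any

# Stand-in for object storage or a scratch table (assumption for this sketch).
_STORE: dict[str, list[Any]] = {}

def store_observation(rows: list[Any], preview_rows: int = 3) -> dict:
    """Persist the full payload and return a small handle for the prompt."""
    ref = f"obs_{uuid.uuid4().hex[:8]}"
    _STORE[ref] = rows
    byte_size = len(json.dumps(rows).encode("utf-8"))
    return {"ref": ref, "byte_size": byte_size, "preview": rows[:preview_rows]}

def fetch_observation_slice(ref: str, start: int, count: int) -> dict:
    """Tool the agent can call for a targeted slice of a stored payload."""
    rows = _STORE.get(ref)
    if rows is None:
        # Expired or unknown handle: report it as a failure, not as empty data.
        return {"error": {"code": "ref_not_found", "message": f"no stored observation for {ref}"}}
    return {"ref": ref, "start": start, "rows": rows[start:start + count], "total": len(rows)}
```

Usage: the runtime calls store_observation on the raw result and appends only the returned handle to history; the model then calls fetch_observation_slice with the ref when it needs specific rows.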
6. Typed aggregates
For analytics tools, return histograms, top-K keys, and anomaly flags instead of raw events.
list_logs becomes {error_count_by_service, spike_window, sample_traces}.
The model reasons over statistics; humans drill down via dashboard links in metadata.
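One way to build the aggregate shape named above is sketched here. It assumes each log row is a dict with ts (integer epoch seconds), service, level, and trace_id keys; those field names, the 60-second bucket, and the sample size are hypothetical.

```python
from collections import Counter
from typing import Any

def aggregate_logs(rows: list[dict[str, Any]], bucket_seconds: int = 60, samples: int = 3) -> dict:
    """Reduce raw log rows to error counts, the busiest error window, and sample trace IDs."""
    errors = [r for r in rows if r["level"] == "error"]
    by_service = Counter(r["service"] for r in errors)
    # Bucket errors by time window and pick the busiest (earliest window wins ties).
    buckets = Counter(r["ts"] // bucket_seconds * bucket_seconds for r in errors)
    spike = None
    if buckets:
        start, count = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
        spike = {"start": start, "end": start + bucket_seconds, "errors": count}
    # Keep a few distinct trace IDs so the model can issue follow-up calls.
    sample_traces: list[str] = []
    for r in errors:
        if r["trace_id"] not in sample_traces:
            sample_traces.append(r["trace_id"])
        if len(sample_traces) == samples:
            break
    return {
        "error_count_by_service": dict(by_service),
        "spike_window": spike,
        "sample_traces": sample_traces,
    }
```

The output size depends on the number of services and samples, not on the number of log rows, which is what makes this strategy suitable for high-volume tools.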
Where compression sits in the agent loop
A robust pipeline looks like this:
- Model emits tool_calls.
- Runtime executes against the real API (full fidelity server-side).
- Compression middleware applies tool-specific policy.
- Compressed observation appends to turn history.
- History manager applies turn-level budget (may further prune older observations).
- Next model call proceeds.
Compression belongs in middleware, not in prompts asking the model to “summarize this JSON.” Model-side summarization costs an extra turn, burns tokens, and drifts from ground truth. Middleware compression is deterministic, testable, and auditable.
On errors, return structured error observations (code, message, retry hint) rather than stack traces unless the agent is explicitly a debugging tool with a higher budget.
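The execute-then-compress steps and the structured error observation might be combined in a middleware function like the sketch below. The run_tool name, the ok/result/error envelope, and the retry hint strings are assumptions; only the code, message, and retry hint fields come from the text above.

```python
import json
from typing import Any, Callable

Compressor = Callable[[Any], Any]

def run_tool(name: str, tool_fn: Callable[..., Any], args: dict,
             compressors: dict[str, Compressor]) -> str:
    """Execute a tool at full fidelity, then compress before it reaches history."""
    try:
        raw = tool_fn(**args)
    except TimeoutError as exc:
        # Structured error instead of a stack trace; distinct from an empty result.
        return json.dumps({"ok": False, "error": {
            "code": "timeout", "message": str(exc),
            "retry_hint": "retry with a narrower window"}})
    except Exception as exc:
        return json.dumps({"ok": False, "error": {
            "code": "tool_error", "message": str(exc),
            "retry_hint": "do not retry unchanged"}})
    # Tool-specific policy; tools without a registered compressor pass through.
    compress = compressors.get(name, lambda payload: payload)
    return json.dumps({"ok": True, "result": compress(raw)})
```

Because the compressor is a pure function of the raw payload, the same call always yields the same observation, which is what makes golden-file tests practical.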
Per-tool budget design
Define budgets in your tool registry, not ad hoc per agent:
- max_observation_tokens — hard cap after compression.
- compression_strategy — enum: project, truncate, summarize, aggregate, handle.
- priority_fields — always included even under severe caps.
- full_retention_hours — how long external handles stay fetchable.
Single-record read tools (get_order) can allow 800 tokens; search tools, 1,500; log and SQL tools, 2,500 with mandatory aggregation. Write confirmations stay under 200 tokens. Budgets should correlate with failure cost: a wrong refund hurts more than a missed log line, so financial tools favor complete structured payloads over aggressive summarization.
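A registry entry with the four fields above could be modeled as a small dataclass. The token caps mirror the example figures in this section; the create_refund tool, the priority field names, the retention values, and the default budget are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Strategy(str, Enum):
    PROJECT = "project"
    TRUNCATE = "truncate"
    SUMMARIZE = "summarize"
    AGGREGATE = "aggregate"
    HANDLE = "handle"

@dataclass(frozen=True)
class ToolBudget:
    max_observation_tokens: int          # hard cap after compression
    compression_strategy: Strategy
    priority_fields: tuple[str, ...] = ()  # always kept, even under severe caps
    full_retention_hours: int = 0          # how long external handles stay fetchable

# Illustrative registry; tool names other than those in the text are made up.
REGISTRY = {
    "get_order": ToolBudget(800, Strategy.PROJECT, ("order_id", "status", "amount")),
    "search_customers": ToolBudget(1500, Strategy.PROJECT, ("id",)),
    "list_logs": ToolBudget(2500, Strategy.AGGREGATE, ("trace_id",), full_retention_hours=24),
    "create_refund": ToolBudget(200, Strategy.PROJECT, ("refund_id", "status")),
}

# Conservative default for tools nobody has registered yet (assumption).
DEFAULT_BUDGET = ToolBudget(500, Strategy.TRUNCATE)

def budget_for(tool_name: str) -> ToolBudget:
    return REGISTRY.get(tool_name, DEFAULT_BUDGET)
```

Keeping the policy in one registry means every agent that uses a tool inherits the same cap, and a budget change is a reviewable one-line diff.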
Harbor Analytics refactor
Before compression, Harbor's incident agent passed raw API JSON into observations. Median observation size was 22K tokens; p95 exceeded 90K. Multi-tool turns failed 31% of the time at the gateway.
After refactor:
- Added LogAggregateMiddleware: error counts by service, top stack traces, spike timestamps, five representative lines per cluster.
- Capped list_logs observations at 1,800 tokens; full logs stored 24h with obs_ref handle.
- SQL tool returns max 50 rows plus row_count and column stats when truncated.
- Registered compression strategy per tool; CI tests golden outputs against fixtures.
- Logged raw_bytes, compressed_tokens, and strategy per call for weekly drift review.
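The SQL rule in the list above (50 rows plus row_count and column stats when truncated) might look like this sketch. It is not Harbor's implementation: the function name, the particular statistics, and the assumption that cell values are hashable scalars are all illustrative.

```python
from typing import Any

def compress_sql_result(rows: list[dict[str, Any]], max_rows: int = 50) -> dict:
    """Cap returned rows; when truncating, add row_count and per-column stats."""
    truncated = len(rows) > max_rows
    result: dict[str, Any] = {
        "rows": rows[:max_rows],
        "row_count": len(rows),
        "truncated": truncated,
    }
    if truncated:
        stats: dict[str, dict[str, Any]] = {}
        for col in rows[0]:
            # Stats are computed over ALL rows, not just the ones returned.
            values = [r[col] for r in rows if r.get(col) is not None]
            col_stats: dict[str, Any] = {"non_null": len(values), "distinct": len(set(values))}
            numeric = values and all(
                isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
            )
            if numeric:
                col_stats["min"] = min(values)
                col_stats["max"] = max(values)
            stats[col] = col_stats
        result["column_stats"] = stats
    return result
```

Computing statistics over the full result set lets the model see that, for example, 60 of 120 rows failed even though it only receives the first 50.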
On a 500-incident holdout, root-cause accuracy rose from 41% to 78%, gateway rejections dropped from 31% to 0.4%, and median agent latency fell from 6.1s to 2.4s (smaller prefill). False positives from over-summarization (missing rare errors) fell after adding error-first filtering before stratified sampling.
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Raw dump (no compression) | Payload <500 tokens, single-field reads | Search, logs, SQL, file listings |
| Schema projection | Verbose API objects with many nullable fields | Model must cite exact verbatim strings |
| Head + tail / stratified truncate | Ordered lists where signal may be anywhere | Need exact row counts for compliance proofs |
| LLM summarization pass | Unstructured text blobs, incident narratives | Latency-sensitive voice agents, numeric precision |
| External handle + slice tool | MB-scale payloads, multi-step investigations | Simple FAQ bots with one-shot lookups |
| Typed aggregates | Telemetry, metrics, high-cardinality logs | User expects full export in chat |
Common pitfalls
- Compressing before auth checks — leaking summarized PII from rows the agent should never see; filter by tenant and role first.
- Head-only truncation — misses tail anomalies; use stratified or error-first sampling.
- Non-deterministic summarization — same tool call yields different facts each turn; breaks regression tests and user trust.
- Dropping identifiers — summaries without trace IDs or order IDs prevent follow-up tool calls; always preserve join keys.
- One global cap — financial reads and log dumps need different policies; budget per tool.
- Compressing errors into emptiness — “No data” when the API timed out; distinguish empty results from failures.
- Orphan handles — external refs expire before the agent finishes; align TTL with max session length.
- Double compression — middleware summarizes, then history manager summarizes again, erasing detail; coordinate layers.
Production checklist
- Measure observation token share per tool; flag any median >1,500 tokens.
- Register max_observation_tokens and compression_strategy per tool.
- Implement compression in middleware, not as an extra model turn.
- Preserve join keys (IDs, trace IDs) under all compression modes.
- Use error-first or stratified sampling for ordered lists, not head-only cuts.
- Store full payloads externally when raw size exceeds 10× the observation cap.
- Expose a slice or fetch tool for agents that need deeper inspection.
- Golden-test compressed output against fixtures when APIs change.
- Log raw size, compressed tokens, and strategy for every tool call.
- Coordinate with conversation history manager to avoid double summarization.
- Review summarization drift weekly on incident and support eval sets.
- Document which fields are never compressed (amounts, statuses, legal text).
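The first checklist item can be checked from the per-call log. The sketch below assumes each log entry is a dict with tool and compressed_tokens keys, matching the fields the checklist says to record; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

def flag_heavy_tools(call_log: list[dict], threshold: int = 1500) -> dict:
    """Return tools whose median compressed observation exceeds the threshold."""
    by_tool: dict[str, list[int]] = defaultdict(list)
    for entry in call_log:
        by_tool[entry["tool"]].append(entry["compressed_tokens"])
    medians = {tool: median(tokens) for tool, tokens in by_tool.items()}
    return {tool: m for tool, m in medians.items() if m > threshold}
```

Run against a week of logs, the result is a short list of tools whose compression policy deserves a second look.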
Key takeaways
- Tool result compression shrinks observations before the next model turn — essential once agents call search, SQL, or log APIs.
- Combine projection, smart truncation, aggregates, and external handles; no single strategy fits every tool.
- Harbor Analytics raised incident root-cause accuracy from 41% to 78% by aggregating logs instead of dumping 180K-token JSON.
- Run compression in deterministic middleware, not via an extra LLM summarization turn.
- Per-tool observation budgets pair with dynamic tool selection and history management to keep agent loops inside context limits.
Related reading
- LLM ReAct agent loop explained — where observations feed the next thought-action cycle
- LLM context compression explained — shrinking large documents and long inputs
- LLM dynamic tool selection explained — pruning which tools enter the prompt each turn
- LLM conversation history management explained — turn-level budgets across the full thread