
LLM tool result compression explained

Harbor Analytics' on-call agent investigated a payment spike by calling list_logs with a one-hour window. The API returned 14,200 rows, roughly 180K tokens of JSON once serialized into the ReAct observation block, and the gateway rejected the next turn outright. When engineers temporarily lifted the cap, the prompt fit, but the model still missed the root cause: it skimmed the first hundred rows, never reached the error burst at offset 8,400, and blamed a routine deploy. Root-cause accuracy on log-heavy incidents was 41%.

Tool result compression is the layer that shrinks tool outputs before they enter the next model turn. It sits after execution and before history assembly. Unlike document compression (one big file) or conversation pruning (chat turns), it targets ephemeral observations from function calls that can dwarf everything else in an agent loop. This guide covers the compression taxonomy, per-tool budgets, the Harbor Analytics refactor, a decision table comparing each technique with raw dumps, common pitfalls, and a production checklist.

What tool result compression is

In agent architectures, each tool call produces an observation — the string (usually JSON) returned to the model as context for the next thought or action. APIs for logs, databases, CRM search, and file listings routinely return payloads far larger than a single model turn can use. Compression transforms that payload into a smaller representation that preserves decision-relevant signal.

Compression is not optional once agents chain more than a few tools. A typical production budget might allocate:

  • 2–4K tokens for system policy and persona
  • 4–8K tokens for user history and retrieved docs
  • 1–3K tokens for tool definitions (after dynamic selection)
  • 1–4K tokens for recent observations

One unbounded SELECT * or log dump can consume the entire window. Compression enforces observation budgets: hard per-tool and per-turn caps with documented fallback behavior when exceeded.
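
As a minimal sketch of a hard cap with a documented fallback, the snippet below uses the upper bounds from the list above. The names TURN_BUDGET, estimate_tokens, and enforce_cap are illustrative, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer.

```python
import json

# Hypothetical per-turn allocation built from the upper bounds listed above.
TURN_BUDGET = {
    "system": 4_000,
    "history_and_docs": 8_000,
    "tool_definitions": 3_000,
    "observations": 4_000,
}


def estimate_tokens(text: str) -> int:
    """Rough estimate of ~4 characters per token (assumption, not a tokenizer)."""
    return (len(text) + 3) // 4


def enforce_cap(observation: str, cap_tokens: int) -> str:
    """Return the observation unchanged if it fits, else a structured fallback."""
    used = estimate_tokens(observation)
    if used <= cap_tokens:
        return observation
    # Fallback: never pass the oversized payload through silently.
    return json.dumps({
        "status": "over_budget",
        "estimated_tokens": used,
        "cap_tokens": cap_tokens,
        "hint": "narrow the query or request a slice",
    })
```

With these numbers the whole turn adds up to 19K tokens, so a single 180K-token observation is roughly nine times the full turn budget, not just the observation share.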

Compression taxonomy

1. Structural projection

Keep only fields the model needs for the current task. A search_customers response might project to {id, name, tier, open_ticket_count} and drop billing addresses, audit trails, and marketing tags. Use JSONPath or server-side DTOs; document the projection per tool in your registry alongside schema definitions.
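
A projection like the one described above can be sketched in a few lines. The field tuple mirrors the search_customers example; the function name project and the sample row are illustrative assumptions.

```python
from typing import Any

# Fields kept for search_customers, as in the example above.
SEARCH_CUSTOMERS_FIELDS = ("id", "name", "tier", "open_ticket_count")


def project(rows: list[dict[str, Any]], fields: tuple[str, ...]) -> list[dict[str, Any]]:
    """Keep only the listed fields; absent fields are omitted, not filled in."""
    return [{f: row[f] for f in fields if f in row} for row in rows]


# Illustrative row; billing_address and marketing_tags are dropped.
raw = [{
    "id": 7,
    "name": "Acme",
    "tier": "gold",
    "open_ticket_count": 2,
    "billing_address": "1 Main St",
    "marketing_tags": ["newsletter"],
}]
projected = project(raw, SEARCH_CUSTOMERS_FIELDS)
```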

2. Truncation with structure

Naive head truncation loses tail signal (Harbor's missed error burst). Better patterns:

  • Head + tail — first N and last M rows with a ... [{total - N - M} rows omitted] ... marker.
  • Stratified sampling — evenly spaced rows across the full range.
  • Error-first filtering — rows matching severity or status predicates before cap.
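
The three patterns above can be sketched as small list helpers. Function names and the shape of the omission marker are assumptions for illustration; a production version would also count tokens rather than rows.

```python
from typing import Any, Callable


def head_tail(rows: list[Any], head: int, tail: int) -> list[Any]:
    """First `head` and last `tail` rows with an omission marker in between."""
    if len(rows) <= head + tail:
        return list(rows)
    omitted = len(rows) - head - tail
    marker = f"... [{omitted} rows omitted] ..."
    return rows[:head] + [marker] + rows[len(rows) - tail:]


def stratified(rows: list[Any], k: int) -> list[Any]:
    """Up to k evenly spaced rows, always including the first and last."""
    n = len(rows)
    if k <= 0:
        return []
    if n <= k:
        return list(rows)
    if k == 1:
        return [rows[0]]
    return [rows[i * (n - 1) // (k - 1)] for i in range(k)]


def error_first(rows: list[Any], is_error: Callable[[Any], bool], cap: int) -> list[Any]:
    """Rows matching the predicate come first, then the rest, then the cap applies."""
    errors = [r for r in rows if is_error(r)]
    others = [r for r in rows if not is_error(r)]
    return (errors + others)[:cap]
```

Applied to a case like Harbor's, error_first would surface a burst at offset 8,400 even with a cap of 100 rows, where a head-only cut of the same size would not.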

3. Pagination and lazy follow-up

Return a page plus next_cursor and teach the model (via tool description) to call get_page only when it needs more. The first observation stays small; depth is opt-in. Pair with parallel calls carefully — four paginated reads can still overflow if each page is huge.
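
A minimal sketch of the page-plus-cursor shape follows. It assumes an offset-based cursor encoded as a string and an in-memory list of rows; real APIs often use opaque cursors instead, and the page_size default is illustrative.

```python
from typing import Any, Optional


def get_page(rows: list[Any], cursor: Optional[str] = None, page_size: int = 50) -> dict:
    """Return one page plus next_cursor; next_cursor is None on the last page."""
    start = int(cursor) if cursor is not None else 0
    end = start + page_size
    return {
        "rows": rows[start:end],
        "next_cursor": str(end) if end < len(rows) else None,
        "total": len(rows),
    }
```

Including total in every page lets the model judge whether a follow-up call is worth it before spending the tokens.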

4. Summarization passes

Run a cheap model or rules engine to compress raw output into prose or bullet facts. Techniques from chain-of-density summarization work well for log clusters: iterate until the summary meets a token target while preserving named entities (service names, error codes, trace IDs). Log the summary and retain a handle to the full data for human review.
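
The rules-engine variant can be sketched as below. This is not chain-of-density itself, only a rule-based pass that shrinks to a token target while keeping service names, error codes, and trace IDs. The log line format, the regular expression, and the 4-characters-per-token estimate are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical log line format: "service=<name> code=<code> trace=<id> ..."
LINE_RE = re.compile(r"service=(?P<service>\S+) code=(?P<code>\S+) trace=(?P<trace>\S+)")


def estimate_tokens(text: str) -> int:
    """Rough estimate of ~4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def summarize_logs(lines: list[str], target_tokens: int, max_traces: int = 2) -> list[str]:
    """Cluster lines by (service, code) and emit one bullet fact per cluster."""
    clusters: dict[tuple[str, str], list[str]] = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            clusters[(match["service"], match["code"])].append(match["trace"])
    # Largest clusters first; ties broken by name so output is deterministic.
    ranked = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    bullets = [
        f"{service} {code} x{len(traces)} (traces: {', '.join(traces[:max_traces])})"
        for (service, code), traces in ranked
    ]
    # Drop the smallest clusters until the summary meets the token target.
    while len(bullets) > 1 and estimate_tokens("\n".join(bullets)) > target_tokens:
        bullets.pop()
    return bullets
```

Dropping the smallest clusters first is a design choice with a cost: rare errors are the first to go, so this pass is safest when an error-first filter has already run.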

5. External storage handles

Store the full payload in object storage or a scratch table; return {ref: "obs_7f3a", byte_size: 2400000, preview: [...]}. Expose a fetch_observation_slice tool for targeted follow-up. This pattern scales to megabyte responses without stuffing the context window.
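
The handle pattern might look like the sketch below. An in-memory dictionary stands in for object storage, the ref format imitates the obs_7f3a example, and the row_count field and the ref_not_found error code are illustrative additions.

```python
import json
import uuid
from typing import Any

# In-memory stand-in for object storage or a scratch table (assumption).
_STORE: dict[str, list[Any]] = {}


def store_observation(rows: list[Any], preview_rows: int = 3) -> dict:
    """Persist the full payload and return a small handle with a preview."""
    ref = f"obs_{uuid.uuid4().hex[:8]}"
    _STORE[ref] = rows
    return {
        "ref": ref,
        "byte_size": len(json.dumps(rows).encode("utf-8")),
        "row_count": len(rows),
        "preview": rows[:preview_rows],
    }


def fetch_observation_slice(ref: str, start: int, stop: int) -> dict:
    """Return rows[start:stop] for a stored handle, or a structured error."""
    rows = _STORE.get(ref)
    if rows is None:
        return {"error": {"code": "ref_not_found", "message": f"unknown or expired ref {ref}"}}
    return {"ref": ref, "start": start, "rows": rows[start:stop]}
```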

6. Typed aggregates

For analytics tools, return histograms, top-K keys, and anomaly flags instead of raw events. list_logs becomes {error_count_by_service, spike_window, sample_traces}. The model reasons over statistics; humans drill down via dashboard links in metadata.
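
A rough sketch of the list_logs aggregate shape follows. The output keys come from the example above; the event fields (ts, level, service, trace_id) and the one-minute bucket used for spike_window are assumptions.

```python
from collections import Counter
from typing import Any


def aggregate_logs(events: list[dict[str, Any]], samples: int = 3) -> dict:
    """Reduce raw events to counts, the busiest error minute, and sample traces."""
    errors = [e for e in events if e["level"] == "error"]
    by_service = Counter(e["service"] for e in errors)
    # Bucket ISO-8601 timestamps by minute: "YYYY-MM-DDTHH:MM".
    by_minute = Counter(e["ts"][:16] for e in errors)
    spike = by_minute.most_common(1)[0][0] if by_minute else None
    return {
        "error_count_by_service": dict(by_service),
        "spike_window": spike,
        "sample_traces": [e["trace_id"] for e in errors[:samples]],
    }
```

The output size depends on the number of services and samples, not on the number of raw events, which is what keeps the observation bounded.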

Where compression sits in the agent loop

A robust pipeline looks like this:

  1. Model emits tool_calls.
  2. Runtime executes against the real API (full fidelity server-side).
  3. Compression middleware applies tool-specific policy.
  4. Compressed observation appends to turn history.
  5. History manager applies turn-level budget (may further prune older observations).
  6. Next model call proceeds.

Compression belongs in middleware, not in prompts asking the model to “summarize this JSON.” Model-side summarization costs an extra turn, burns tokens, and can drift from ground truth. Middleware compression is testable and auditable, and rule-based strategies are deterministic as well.
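
Steps 2 through 4 of the pipeline can be sketched as a single function. The POLICIES table, the execute callback, and the message shape appended to history are hypothetical; step 5 (turn-level pruning) is left out to keep the sketch short.

```python
import json
from typing import Any, Callable

Compressor = Callable[[Any], Any]

# Hypothetical tool-specific policies; unregistered tools pass through unchanged.
POLICIES: dict[str, Compressor] = {
    "list_logs": lambda rows: {"row_count": len(rows), "sample": rows[:2]},
}


def run_tool_call(
    name: str,
    args: dict,
    execute: Callable[[str, dict], Any],
    history: list[dict],
) -> str:
    """Execute a tool, compress its result, and append it to turn history."""
    raw = execute(name, args)                  # step 2: full fidelity server-side
    policy = POLICIES.get(name, lambda x: x)   # step 3: tool-specific policy
    observation = json.dumps(policy(raw))
    history.append({"role": "tool", "name": name, "content": observation})  # step 4
    return observation
```

Because the policy is a plain function of the raw result, it can be unit-tested against fixtures without calling a model.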

On errors, return structured error observations (code, message, retry hint) rather than stack traces unless the agent is explicitly a debugging tool with a higher budget.
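
One possible shape for such an error observation is sketched below. The status field, the code values, and the retry hints are illustrative assumptions; the point is that a failure and an empty result look different to the model.

```python
def error_observation(exc: Exception) -> dict:
    """Map an exception to a compact structured observation, without a stack trace."""
    if isinstance(exc, TimeoutError):
        return {
            "status": "error",
            "code": "timeout",
            "message": "upstream API timed out",
            "retry_hint": "retry with a narrower time window",
        }
    return {
        "status": "error",
        "code": "internal",
        "message": str(exc)[:200],  # truncate so a long message cannot blow the budget
        "retry_hint": "do not retry; report to the user",
    }


def empty_observation() -> dict:
    """A successful call that matched nothing; distinct from any failure."""
    return {"status": "ok", "row_count": 0, "rows": []}
```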

Per-tool budget design

Define budgets in your tool registry, not ad hoc per agent:

  • max_observation_tokens — hard cap after compression.
  • compression_strategy — enum: project, truncate, summarize, aggregate, handle.
  • priority_fields — always included even under severe caps.
  • full_retention_hours — how long external handles stay fetchable.

Single-record read tools such as get_order can allow 800 tokens; search tools 1,500; log and SQL tools 2,500 with mandatory aggregation. Write confirmations stay under 200 tokens. Budgets should track failure cost: a wrong refund hurts more than a missed log line, so financial tools favor complete structured payloads over aggressive summarization.
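
A registry entry with the four fields above could be modeled as follows. The field names and token caps come from this section; the Strategy enum members mirror the listed strategies, while the priority fields, the run_sql and issue_refund tool names, and the retention values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    PROJECT = "project"
    TRUNCATE = "truncate"
    SUMMARIZE = "summarize"
    AGGREGATE = "aggregate"
    HANDLE = "handle"


@dataclass(frozen=True)
class ToolBudget:
    max_observation_tokens: int            # hard cap after compression
    compression_strategy: Strategy
    priority_fields: tuple[str, ...] = ()  # always included, even under severe caps
    full_retention_hours: int = 0          # how long external handles stay fetchable


# Illustrative registry; run_sql and issue_refund are hypothetical tool names.
REGISTRY: dict[str, ToolBudget] = {
    "get_order": ToolBudget(800, Strategy.PROJECT, ("order_id", "status", "amount")),
    "search_customers": ToolBudget(1_500, Strategy.PROJECT, ("id",)),
    "list_logs": ToolBudget(2_500, Strategy.AGGREGATE, ("trace_id",), 24),
    "run_sql": ToolBudget(2_500, Strategy.AGGREGATE, (), 24),
    "issue_refund": ToolBudget(200, Strategy.PROJECT, ("refund_id", "amount", "status")),
}


def budget_for(tool_name: str) -> ToolBudget:
    """Fail loudly for unregistered tools instead of falling back to no cap."""
    try:
        return REGISTRY[tool_name]
    except KeyError:
        raise LookupError(f"no observation budget registered for {tool_name!r}") from None
```

Raising on an unregistered tool is a deliberate choice in this sketch: a new tool that ships without a budget fails in testing rather than dumping raw output in production.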

Harbor Analytics refactor

Before compression, Harbor's incident agent passed raw API JSON into observations. Median observation size was 22K tokens; p95 exceeded 90K. Multi-tool turns failed 31% of the time at the gateway.

After refactor:

  • Added LogAggregateMiddleware: error counts by service, top stack traces, spike timestamps, five representative lines per cluster.
  • Capped list_logs observations at 1,800 tokens; full logs stored 24h with obs_ref handle.
  • SQL tool returns max 50 rows plus row_count and column stats when truncated.
  • Registered compression strategy per tool; CI tests golden outputs against fixtures.
  • Logged raw_bytes, compressed_tokens, and strategy per call for weekly drift review.

On a 500-incident holdout, root-cause accuracy rose from 41% to 78%, gateway rejections dropped from 31% to 0.4%, and median agent latency fell from 6.1s to 2.4s thanks to smaller prefill. Misdiagnoses caused by over-summarization, where rare errors were dropped from the summary, fell after error-first filtering was added ahead of stratified sampling.

Technique decision table

Approach                          | Best when                                    | Skip when
----------------------------------|----------------------------------------------|--------------------------------------------------
Raw dump (no compression)         | Payload <500 tokens, single-field reads      | Search, logs, SQL, file listings
Schema projection                 | Verbose API objects with many nullable fields | Model must cite exact verbatim strings
Head + tail / stratified truncate | Ordered lists where signal may be anywhere   | Need exact row counts for compliance proofs
LLM summarization pass            | Unstructured text blobs, incident narratives | Latency-sensitive voice agents, numeric precision
External handle + slice tool      | MB-scale payloads, multi-step investigations | Simple FAQ bots with one-shot lookups
Typed aggregates                  | Telemetry, metrics, high-cardinality logs    | User expects full export in chat

Common pitfalls

  • Compressing before auth checks — leaking summarized PII from rows the agent should never see; filter by tenant and role first.
  • Head-only truncation — misses tail anomalies; use stratified or error-first sampling.
  • Non-deterministic summarization — same tool call yields different facts each turn; breaks regression tests and user trust.
  • Dropping identifiers — summaries without trace IDs or order IDs prevent follow-up tool calls; always preserve join keys.
  • One global cap — financial reads and log dumps need different policies; budget per tool.
  • Compressing errors into emptiness — “No data” when the API timed out; distinguish empty results from failures.
  • Orphan handles — external refs expire before the agent finishes; align TTL with max session length.
  • Double compression — middleware summarizes, then history manager summarizes again, erasing detail; coordinate layers.
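
The first pitfall is about ordering, which a short sketch makes concrete: authorization runs on the raw rows, and compression only ever sees what survives. The tenant_id field, the allowed_fields set, and the function names are assumptions.

```python
from typing import Any


def visible_rows(
    rows: list[dict[str, Any]], tenant_id: str, allowed_fields: set[str]
) -> list[dict[str, Any]]:
    """Drop other tenants' rows and any field the caller's role may not see."""
    return [
        {k: v for k, v in row.items() if k in allowed_fields}
        for row in rows
        if row.get("tenant_id") == tenant_id
    ]


def compress_for_agent(
    rows: list[dict[str, Any]], tenant_id: str, allowed_fields: set[str], cap: int
) -> dict:
    """Authorize first, then compress; counts and samples cover visible rows only."""
    safe = visible_rows(rows, tenant_id, allowed_fields)
    return {"row_count": len(safe), "rows": safe[:cap]}
```

Running the filter first also keeps row_count honest: a count computed over unfiltered rows would itself leak information about other tenants.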

Production checklist

  • Measure observation token share per tool; flag any median >1,500 tokens.
  • Register max_observation_tokens and compression_strategy per tool.
  • Implement compression in middleware, not as an extra model turn.
  • Preserve join keys (IDs, trace IDs) under all compression modes.
  • Use error-first or stratified sampling for ordered lists, not head-only cuts.
  • Store full payloads externally when raw size exceeds 10× the observation cap.
  • Expose a slice or fetch tool for agents that need deeper inspection.
  • Golden-test compressed output against fixtures when APIs change.
  • Log raw size, compressed tokens, and strategy for every tool call.
  • Coordinate with conversation history manager to avoid double summarization.
  • Review summarization drift weekly on incident and support eval sets.
  • Document which fields are never compressed (amounts, statuses, legal text).
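
Two of the checklist items, per-call logging and golden tests, can be sketched together. The metric field names follow the ones used in this guide (raw_bytes, compressed_tokens, strategy); the helper names, the token estimate, and the fixture shape are assumptions.

```python
import json
from typing import Any, Callable


def estimate_tokens(text: str) -> int:
    """Rough estimate of ~4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def call_metrics(tool: str, raw: str, compressed: str, strategy: str) -> dict:
    """One log record per tool call, suitable for a weekly drift review."""
    return {
        "tool": tool,
        "raw_bytes": len(raw.encode("utf-8")),
        "compressed_tokens": estimate_tokens(compressed),
        "strategy": strategy,
    }


def matches_golden(compress: Callable[[Any], Any], fixture_input: Any, golden: Any) -> bool:
    """Compare compressed output with a stored golden value, ignoring key order."""
    actual = json.dumps(compress(fixture_input), sort_keys=True)
    return actual == json.dumps(golden, sort_keys=True)
```

A golden check like this only works when the compressor is deterministic, which is one more reason to prefer rule-based strategies where they are good enough.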

Key takeaways

  • Tool result compression shrinks observations before the next model turn — essential once agents call search, SQL, or log APIs.
  • Combine projection, smart truncation, aggregates, and external handles; no single strategy fits every tool.
  • Harbor Analytics raised incident root-cause accuracy from 41% to 78% by aggregating logs instead of dumping 180K-token JSON.
  • Run compression in deterministic middleware, not via an extra LLM summarization turn.
  • Per-tool observation budgets pair with dynamic tool selection and history management to keep agent loops inside context limits.
