
LLM agent tool result summarization and truncation explained

Harbor DevOps shipped an on-call agent that could query CloudWatch, pull recent deploy events, and draft incident summaries. In staging, a single get_log_events call returned a 180 KB JSON blob — thousands of log lines with stack traces, health-check noise, and duplicate retries. The agent stuffed the entire payload into the next turn. Within three tool steps the run exceeded its context budget, dropped earlier reasoning, and started inventing error codes that never appeared in the logs. On-call engineers reported the agent was “confident and wrong” on 41% of Sev-2 pages where log volume was high.

Tool result summarization and truncation is the layer between raw API responses and what the model actually reads. It enforces per-tool observation caps, projects JSON to the fields the agent needs, paginates with cursors when more data exists, and optionally runs a cheap summarizer before the main reasoning model sees the payload. This guide covers observation budgets, truncation strategies, structured projection, inline summarization pipelines, pagination patterns, integration with parallel tool execution and tool error envelopes, the Harbor DevOps refactor, a technique decision table, pitfalls, and a production checklist.

Why raw tool outputs are a production hazard

Agents are only as good as the observations they can fit in working memory. Database queries, log searches, file reads, and REST list endpoints routinely return payloads far larger than a single model turn should consume. Dumping them verbatim causes predictable failures:

Context eviction — system prompt, tool definitions, and prior reasoning get pushed out when a 50k-token JSON array lands in history.
Needle loss — the signal line sits at row 4,812; the model attends to the head and tail, misses the middle, and guesses.
Cost and latency spirals — every subsequent turn re-pays tokens for irrelevant fields; parallel calls multiply the damage.
False confidence — models narrate patterns in truncated garbage as if they read the full file.

Summarization is not optional polish. It is part of the tool contract — the same way you would not expose an unbounded SQL SELECT * to a web form.

Observation budgets and per-tool caps

Start with a numeric observation budget per tool invocation, expressed in tokens (or in characters, converted with a conservative chars÷4 estimate when no tokenizer is available). Typical production bands:

Structured status tools (health checks, single-record lookups) — 500–2,000 tokens.
List and search tools — 2,000–8,000 tokens with hard row caps.
Document and log tools — 4,000–12,000 tokens after projection; never unbounded.

Budgets should compose with run-level ceilings from your context budget manager: if three parallel tools each return 8k tokens, the merge step must still fit. When a result is truncated, return metadata the model can act on, for example truncated: true, total_rows: 12400, returned_rows: 40, next_cursor: "...".
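A minimal Python sketch of how a tool adapter might enforce such a budget and report what it cut. The names estimate_tokens and cap_rows are hypothetical, and the chars÷4 ratio is only the rough estimate mentioned above, not a real tokenizer.

```python
import json
from typing import Any, Optional


def estimate_tokens(text: str) -> int:
    """Rough chars/4 estimate, rounded up. Swap in a real tokenizer if you have one."""
    return (len(text) + 3) // 4


def cap_rows(rows: list[dict[str, Any]], max_tokens: int,
             next_cursor: Optional[str] = None) -> dict[str, Any]:
    """Keep whole rows until the estimated budget is spent, then report the cut."""
    kept: list[dict[str, Any]] = []
    used = 0
    for row in rows:
        cost = estimate_tokens(json.dumps(row))
        if used + cost > max_tokens:
            break  # stop at a record boundary, never mid-row
        kept.append(row)
        used += cost
    truncated = len(kept) < len(rows)
    return {
        "rows": kept,
        "truncated": truncated,
        "total_rows": len(rows),
        "returned_rows": len(kept),
        # Only advertise a cursor when there is actually more to fetch.
        "next_cursor": next_cursor if truncated else None,
    }
```

Counting cost per serialized row keeps the cut on a record boundary, so the model never sees half a JSON object.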

Truncation strategies (fast path)

Head and tail preservation

For chronological logs, keep the first N and last M lines — errors often appear at boundaries (connection open, final exception). Insert an explicit ellipsis marker: [... 11,240 lines omitted ...]. Never silently cut mid-JSON; parse first, then truncate at record boundaries.
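As an illustration, head and tail preservation over already-parsed log lines could look like the sketch below. The function name head_tail and its parameters are assumptions for this example; it expects the payload to have been split into records first.

```python
def head_tail(lines: list[str], head: int, tail: int) -> list[str]:
    """Keep the first `head` and last `tail` lines with an explicit omission marker."""
    if len(lines) <= head + tail:
        return list(lines)  # nothing to cut
    omitted = len(lines) - head - tail
    marker = f"[... {omitted:,} lines omitted ...]"
    tail_part = lines[-tail:] if tail > 0 else []
    return lines[:head] + [marker] + tail_part
```

With 11,300 input lines, head=40 and tail=20, the marker reads [... 11,240 lines omitted ...], so the model can see how much it has not read.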

JSON path projection

Define allowlists per tool: $.items[*].{id, status, updated_at}. Drop base64 blobs, full HTML bodies, and nested audit trails unless the agent requested a detail mode. Projection runs server-side in the tool adapter, not in the model prompt.
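One way to sketch adapter-side projection in Python, assuming a flat per-tool allowlist. The tool name list_deploys and the helper project_items are hypothetical; a real adapter might use a JSONPath or JMESPath library instead of a dict comprehension.

```python
from typing import Any

# Hypothetical per-tool allowlists; anything not listed is dropped.
ALLOWLISTS: dict[str, tuple[str, ...]] = {
    "list_deploys": ("id", "status", "updated_at"),
}


def project_items(tool: str, payload: dict[str, Any]) -> dict[str, Any]:
    """Return only allowlisted fields of each item; missing fields are skipped."""
    fields = ALLOWLISTS[tool]
    items = [
        {key: item[key] for key in fields if key in item}
        for item in payload.get("items", [])
    ]
    return {"items": items}
```

Because the allowlist lives in the adapter, blobs and audit trails never reach the model, regardless of what the prompt says.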

Row caps and stable sort

Return the most relevant slice: sort by severity desc, then timestamp desc; cap at 50 rows. Tell the model the sort key so it knows what is missing.
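A short sketch of the sort-then-cap step, under the assumption that each row carries a level string and a comparable timestamp (epoch seconds here). The severity ranking and the top_rows helper are illustrative, not a standard.

```python
from typing import Any

# Assumed severity ordering; unknown levels sort last.
SEVERITY_RANK = {"CRITICAL": 4, "ERROR": 3, "WARN": 2, "INFO": 1, "DEBUG": 0}


def top_rows(rows: list[dict[str, Any]], cap: int = 50) -> dict[str, Any]:
    """Sort by severity desc, then timestamp desc, and keep at most `cap` rows."""
    ordered = sorted(
        rows,
        key=lambda r: (SEVERITY_RANK.get(r["level"], -1), r["timestamp"]),
        reverse=True,
    )
    return {
        "rows": ordered[:cap],
        "sort": "severity desc, timestamp desc",  # tell the model what was dropped last
        "truncated": len(rows) > cap,
        "total_rows": len(rows),
        "returned_rows": min(len(rows), cap),
    }
```

Returning the sort key alongside the rows lets the model infer that anything missing is less severe or older than what it sees.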

Binary and blob refusal

Replace file bytes with { "sha256": "...", "size_bytes": 8910021, "mime": "application/pdf", "preview": null } and offer an extract_text_page(range) follow-up tool instead of inlining.
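Building that descriptor takes only the standard library. The sketch below assumes the adapter already holds the bytes and knows the MIME type; describe_blob is a hypothetical helper name.

```python
import hashlib
from typing import Any


def describe_blob(data: bytes, mime: str) -> dict[str, Any]:
    """Return a small descriptor in place of the raw bytes."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "mime": mime,
        "preview": None,  # serialized as null in JSON
    }
```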

Summarization pipelines (when truncation is not enough)

When projected rows still exceed budget — long incident threads, multi-file diffs, aggregated metrics — run a summarization hop before the observation enters agent history:

Extractive pass — regex or parser pulls error codes, HTTP status lines, exception types, and top stack frames. Cheap and faithful.
Inline LLM summary — a small fast model condenses the projected payload into a bullet brief with cited line numbers or record IDs. Cap output at 800 tokens.
Structured summary schema — force JSON: { "symptoms": [], "likely_services": [], "timeline": [], "open_questions": [] } so the main agent plans against fields, not prose drift.
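The extractive pass is the easiest of the three to sketch. The example below assumes log lines that contain exception class names ending in Error or Exception and HTTP statuses written as status=NNN; both patterns are assumptions about the log format and would need adjusting for real data.

```python
import re
from collections import Counter
from typing import Any

# Assumed log conventions; tune both patterns to your own format.
EXCEPTION_RE = re.compile(r"\b([A-Za-z_][\w.]*(?:Error|Exception))\b")
STATUS_RE = re.compile(r"\bstatus=(\d{3})\b")


def extract_signals(lines: list[str]) -> dict[str, Any]:
    """Count exception types and 5xx statuses, citing 1-based line numbers."""
    exceptions: Counter[str] = Counter()
    statuses: Counter[str] = Counter()
    cited: list[int] = []
    for lineno, line in enumerate(lines, start=1):
        hit = False
        for name in EXCEPTION_RE.findall(line):
            exceptions[name] += 1
            hit = True
        for code in STATUS_RE.findall(line):
            if code.startswith("5"):
                statuses[code] += 1
                hit = True
        if hit:
            cited.append(lineno)
    return {
        "exception_types": dict(exceptions),
        "http_5xx": dict(statuses),
        "cited_lines": cited,
    }
```

The cited line numbers give a later LLM summary, or the main agent, something concrete to ground its claims in.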

Store the full raw payload in object storage keyed by observation_id for audit replay (see deterministic replay), but only pass the summary + pointer into the conversation. This mirrors context compression but at the tool boundary instead of mid-history.
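A sketch of the summary-plus-pointer pattern, with an in-memory dict standing in for object storage. The RawStore class, the obs_ prefix, and the choice to derive observation_id from a content hash are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any


class RawStore:
    """In-memory stand-in for an object store keyed by observation_id."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, payload: bytes) -> str:
        # Content-derived id (an assumption): identical payloads share one key.
        observation_id = "obs_" + hashlib.sha256(payload).hexdigest()[:16]
        self._blobs[observation_id] = payload
        return observation_id

    def get(self, observation_id: str) -> bytes:
        return self._blobs[observation_id]


def to_observation(store: RawStore, raw: Any, summary: dict[str, Any]) -> dict[str, Any]:
    """Persist the raw payload and return only the summary plus a pointer."""
    payload = json.dumps(raw, sort_keys=True).encode("utf-8")
    return {
        "observation_id": store.put(payload),
        "raw_bytes": len(payload),
        "summary": summary,
    }
```

Only the returned dict enters the conversation; an audit or replay job can fetch the full payload later through store.get(observation_id).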

Pagination, cursors and follow-up tools

Truncation without navigation traps the agent in partial data. Prefer cursor-based pagination over offset paging for live logs:

First call returns summary + next_cursor when has_more: true.
Expose tool_name_continue(cursor) or a page_token argument on the same tool.
Let the model request depth explicitly: “fetch next page” beats auto-inlining ten pages.

For parallel fan-out (querying twelve microservices), summarize each shard locally, then merge summaries — not twelve full JSON trees into one observation.
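A rough sketch of summarize-then-merge for fan-out results. The per-shard summary fields (events, errors, top_error) and the cap of twelve services are illustrative assumptions, not a fixed schema.

```python
from collections import Counter
from typing import Any


def summarize_shard(service: str, events: list[dict[str, Any]]) -> dict[str, Any]:
    """Reduce one service's events to a few counters before anything is merged."""
    errors = [e for e in events if e["level"] == "ERROR"]
    top = Counter(e["message"] for e in errors).most_common(1)
    return {
        "service": service,
        "events": len(events),
        "errors": len(errors),
        "top_error": top[0][0] if top else None,
    }


def merge_summaries(summaries: list[dict[str, Any]],
                    max_services: int = 12) -> dict[str, Any]:
    """Rank shard summaries by error count and cap how many reach the model."""
    ranked = sorted(summaries, key=lambda s: s["errors"], reverse=True)
    return {
        "services": ranked[:max_services],
        "omitted_services": max(0, len(ranked) - max_services),
    }
```

Each shard is reduced where it is fetched, so the merged observation grows with the number of services rather than with total log volume.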

Harbor DevOps refactor walkthrough

Harbor’s incident agent had three log tools with no size guards. The refactor added a shared observation middleware:

CloudWatch adapter projects to timestamp, level, message, request_id and caps at 60 lines sorted by severity heuristic.
Overflow triggers an extractive pass (exception types, 5xx counts per service) plus a 600-token inline summary with line citations.
Raw events land in S3; conversation carries observation_id only.
Parallel deploy-status tools return one-row structs; a merge step builds a 12-line timeline instead of concatenating JSON.

Outcome: mean observation tokens per incident run fell from 38,400 to 4,200 (−89%). Sev-2 pages with wrong root cause dropped from 41% to 9%. Median time-to-first-actionable hypothesis improved because the model stopped re-reading noise.

Technique decision table

Scenario	Prefer	Avoid
Single JSON record < 2k tokens	Pass through with schema validation	LLM summary (adds latency, drops fields)
Search results > 50 rows	Sort + cap + cursor	Head-only truncate (loses recent rows)
Chronological logs	Head/tail + extractive errors	Random 8k-token middle slice
Multi-shard parallel queries	Per-shard summary then merge	Concatenate raw shards
Compliance audit need	Summary in context, raw in vault	Drop raw payload entirely
Model keeps asking for “full file”	Page tool with explicit ranges	Raise the budget without limit

Common pitfalls

Silent truncation — model does not know data was cut; always set truncated flags.
Summarizing before projection — paying LLM tokens on fields you will discard anyway.
Lossy summary without citations — ungrounded bullets are worse than smaller raw slices with line IDs.
One global cap for all tools — a metrics query and a PDF extract need different budgets.
Re-injecting full history on repair — after tool errors, re-fetch with tighter filters, not bigger blobs.
Skipping eval on compressed paths — golden tests must cover truncated and paginated branches.
PII in summaries — scrub before the summarizer model if raw logs contain secrets.

Production checklist

Every tool declares max_observation_tokens in its schema.
JSON tools project to documented field allowlists server-side.
Truncation sets truncated, total_rows, and optional next_cursor.
Raw payloads > cap stored by observation_id for replay.
Inline summarizer output capped and schema-validated when structured.
Parallel merge step summarizes shards before entering history.
Dashboard: p95 observation tokens per tool and per run.
Golden tests include truncated, paginated, and empty-result cases.
Alert when any single observation exceeds 150% of budget (regression).
Document which fields are lossy vs lossless in tool descriptions.
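The two monitoring items in the checklist reduce to small computations. The sketch below assumes observations are recorded as (tool, tokens) pairs and uses a nearest-rank p95; the function names and the sample budgets in the usage note are hypothetical.

```python
def p95(samples: list[int]) -> int:
    """Nearest-rank 95th percentile of observation token counts."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def budget_regressions(observations: list[tuple[str, int]],
                       budgets: dict[str, int],
                       factor: float = 1.5) -> list[tuple[str, int]]:
    """Return observations that exceed `factor` times their tool's budget."""
    return [
        (tool, tokens)
        for tool, tokens in observations
        if tokens > budgets[tool] * factor
    ]
```

With a budget of 8,000 tokens for a logs tool, an observation of 12,001 tokens is flagged while one of exactly 12,000 is not.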

Key takeaways

Treat tool output size as part of the API contract — not an afterthought.
Project JSON before you summarize — drop noise cheaply, then condense signal.
Always tell the model what was omitted — flags and cursors beat silent cuts.
Keep raw data for audit, summaries for reasoning — replay without context bloat.
Harbor DevOps cut observation tokens 89% and wrong root-cause pages from 41% to 9% with middleware, not a bigger model.