LLM agent tool result summarization and truncation explained
Harbor DevOps shipped an on-call agent that could query CloudWatch, pull recent
deploy events, and draft incident summaries. In staging, a single
get_log_events call returned a 180 KB JSON blob — thousands
of log lines with stack traces, health-check noise, and duplicate retries.
The agent stuffed the entire payload into the next turn. Within three tool
steps the run exceeded its
context budget,
dropped earlier reasoning, and started inventing error codes that never
appeared in the logs. On-call engineers reported the agent was “confident
and wrong” on 41% of Sev-2 pages where log volume was high.
Tool result summarization and truncation is the layer between raw API responses and what the model actually reads. It enforces per-tool observation caps, projects JSON to the fields the agent needs, paginates with cursors when more data exists, and optionally runs a cheap summarizer before the main reasoning model sees the payload. This guide covers observation budgets, truncation strategies, structured projection, inline summarization pipelines, pagination patterns, integration with parallel tool execution and tool error envelopes, the Harbor DevOps refactor, a technique decision table, pitfalls, and a production checklist.
Why raw tool outputs are a production hazard
Agents are only as good as the observations they can fit in working memory. Database queries, log searches, file reads, and REST list endpoints routinely return payloads far larger than a single model turn should consume. Dumping them verbatim causes predictable failures:
- Context eviction — system prompt, tool definitions, and prior reasoning get pushed out when a 50k-token JSON array lands in history.
- Needle loss — the signal line sits at row 4,812; the model attends to the head and tail, misses the middle, and guesses.
- Cost and latency spirals — every subsequent turn re-pays tokens for irrelevant fields; parallel calls multiply the damage.
- False confidence — models narrate patterns in truncated garbage as if they read the full file.
Summarization is not optional polish. It is part of the tool contract:
just as you would not expose an unbounded SQL SELECT * to a web form,
you should not hand an agent an unbounded tool result.
Observation budgets and per-tool caps
Start with a numeric observation budget per tool invocation, expressed in tokens (or bytes with a conservative chars÷4 estimate). Typical production bands:
- Structured status tools (health checks, single-record lookups) — 500–2,000 tokens.
- List and search tools — 2,000–8,000 tokens with hard row caps.
- Document and log tools — 4,000–12,000 tokens after projection; never unbounded.
Budgets should compose with run-level ceilings from your
context budget manager:
if three parallel tools each return 8k tokens, the merge step must still
fit. Return metadata the model can act on when truncated:
truncated: true, total_rows: 12400,
returned_rows: 40, next_cursor: "...".
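As a rough illustration of a per-tool cap, the sketch below keeps whole rows until a token budget is spent and returns the metadata fields named above. The function names estimate_tokens and cap_rows are hypothetical, and the chars÷4 estimate is only an approximation of real tokenizer counts.

```python
import json
from typing import Any, Optional


def estimate_tokens(text: str) -> int:
    """Conservative chars/4 estimate, rounded up (not a real tokenizer)."""
    return (len(text) + 3) // 4


def cap_rows(rows: list, max_tokens: int,
             next_cursor: Optional[str] = None) -> dict:
    """Keep whole rows, in order, until the observation budget is spent."""
    kept: list = []
    used = 0
    for row in rows:
        cost = estimate_tokens(json.dumps(row))
        if used + cost > max_tokens:
            break  # stop at a record boundary, never mid-row
        kept.append(row)
        used += cost
    truncated = len(kept) < len(rows)
    return {
        "rows": kept,
        "truncated": truncated,
        "total_rows": len(rows),
        "returned_rows": len(kept),
        # Only hand out a cursor when there is actually more to fetch.
        "next_cursor": next_cursor if truncated else None,
    }
```

A production adapter would likely count tokens with the model's own tokenizer and reserve part of the budget for the metadata envelope itself.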
Truncation strategies (fast path)
Head and tail preservation
For chronological logs, keep the first N and last M lines — errors often
appear at boundaries (connection open, final exception). Insert an explicit
ellipsis marker: [... 11,240 lines omitted ...]. Never silently
cut mid-JSON; parse first, then truncate at record boundaries.
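A minimal sketch of head and tail preservation, assuming the log has already been parsed into a list of lines or records so that cuts always fall on record boundaries. The helper name head_tail is illustrative.

```python
def head_tail(lines: list, head: int, tail: int) -> list:
    """Keep the first `head` and last `tail` lines with an explicit marker."""
    if len(lines) <= head + tail:
        return list(lines)  # nothing to omit
    omitted = len(lines) - head - tail
    marker = f"[... {omitted:,} lines omitted ...]"
    tail_part = lines[-tail:] if tail > 0 else []
    return lines[:head] + [marker] + tail_part
```

With 11,260 input lines and head=10, tail=10, the marker reads [... 11,240 lines omitted ...], so the model can see how much it has not read.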
JSON path projection
Define allowlists per tool: $.items[*].{id, status, updated_at}.
Drop base64 blobs, full HTML bodies, and nested audit trails unless the agent
requested a detail mode. Projection runs server-side in the tool adapter, not
in the model prompt.
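One way to apply such an allowlist in the adapter is plain dictionary filtering, as sketched here. The tool name list_deploys and the ALLOWLISTS table are assumptions for illustration, and the code implements the projection by hand rather than through a JSONPath library.

```python
from typing import Any

# Hypothetical per-tool allowlists; tool and field names are illustrative.
ALLOWLISTS = {
    "list_deploys": ("id", "status", "updated_at"),
}


def project_items(tool_name: str, payload: dict) -> dict:
    """Keep only allowlisted fields of each item; drop everything else."""
    fields = ALLOWLISTS[tool_name]
    items = payload.get("items", [])
    return {
        "items": [
            {key: item[key] for key in fields if key in item}
            for item in items
        ]
    }
```

A detail mode could be modeled as a second, wider allowlist selected by an explicit tool argument, so the default path stays small.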
Row caps and stable sort
Return the most relevant slice: sort by severity desc, then timestamp desc; cap at 50 rows. Tell the model the sort key so it knows what is missing.
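The sort-then-cap step might look like the sketch below. The severity ranking and the level and timestamp field names are assumptions, and timestamps are taken to be ISO 8601 strings in a single timezone so that string order matches time order.

```python
# Assumed severity levels; unknown levels sort last.
SEVERITY_RANK = {"ERROR": 3, "WARN": 2, "INFO": 1, "DEBUG": 0}


def top_rows(rows: list, cap: int = 50) -> dict:
    """Sort by severity desc, then timestamp desc, and keep the top `cap`."""
    ordered = sorted(
        rows,
        key=lambda r: (SEVERITY_RANK.get(r["level"], -1), r["timestamp"]),
        reverse=True,
    )
    return {
        "rows": ordered[:cap],
        # State the sort key so the model knows what the cut removed.
        "sort": "severity desc, timestamp desc",
        "total_rows": len(rows),
        "returned_rows": min(cap, len(rows)),
        "truncated": len(rows) > cap,
    }
```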
Binary and blob refusal
Replace file bytes with { "sha256": "...", "size_bytes": 8910021,
"mime": "application/pdf", "preview": null } and offer an
extract_text_page(range) follow-up tool instead of inlining.
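Building that descriptor takes only the standard library. The sketch assumes the adapter already knows the MIME type; describe_blob is an illustrative name.

```python
import hashlib


def describe_blob(data: bytes, mime: str) -> dict:
    """Return a small descriptor in place of the raw bytes."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "mime": mime,
        "preview": None,  # serialized as JSON null
    }
```

The hash lets a later tool call or an audit job confirm it is looking at the same file without the bytes ever entering the conversation.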
Summarization pipelines (when truncation is not enough)
When projected rows still exceed budget — long incident threads, multi-file diffs, aggregated metrics — run a summarization hop before the observation enters agent history:
- Extractive pass — regex or parser pulls error codes, HTTP status lines, exception types, and top stack frames. Cheap and faithful.
- Inline LLM summary — a small fast model condenses the projected payload into a bullet brief with cited line numbers or record IDs. Cap output at 800 tokens.
- Structured summary schema — force JSON: { "symptoms": [], "likely_services": [], "timeline": [], "open_questions": [] } so the main agent plans against fields, not prose drift.
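The extractive pass in the first bullet can be approximated with two regular expressions. This sketch assumes log lines carry HTTP status as status=NNN and that exception names end in Error or Exception; both patterns are illustrative and would need tuning for real log formats.

```python
import re
from collections import Counter

# Assumed formats: names ending in Error/Exception, and "status=503" fields.
EXCEPTION_RE = re.compile(r"\b([A-Za-z_][\w.]*(?:Error|Exception))\b")
STATUS_RE = re.compile(r"\bstatus=(\d{3})\b")


def extract_signals(lines: list) -> dict:
    """Pull exception types and 5xx counts, citing 1-based line numbers."""
    exceptions = Counter()
    statuses = Counter()
    cited = []
    for lineno, line in enumerate(lines, start=1):
        found = EXCEPTION_RE.findall(line)
        codes = [c for c in STATUS_RE.findall(line) if c.startswith("5")]
        exceptions.update(found)
        statuses.update(codes)
        if found or codes:
            cited.append(lineno)
    return {
        "exception_types": dict(exceptions),
        "status_5xx": dict(statuses),
        "cited_lines": cited,
    }
```

Because every count traces back to a cited line number, the result stays faithful to the raw log in a way a free-form LLM summary cannot guarantee.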
Store the full raw payload in object storage keyed by observation_id
for audit replay (see
deterministic replay),
but only pass the summary + pointer into the conversation. This mirrors
context compression
but at the tool boundary instead of mid-history.
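The summary-plus-pointer pattern could be sketched as follows. RawStore is an in-memory stand-in for object storage, and deriving observation_id from a content hash is an assumption made here for illustration; the source only says payloads are keyed by observation_id.

```python
import hashlib
import json


class RawStore:
    """In-memory stand-in for object storage (hypothetical)."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, raw: bytes) -> str:
        # Assumption: content-addressed IDs, so identical payloads dedupe.
        observation_id = "obs_" + hashlib.sha256(raw).hexdigest()[:16]
        self._blobs[observation_id] = raw
        return observation_id

    def get(self, observation_id: str) -> bytes:
        return self._blobs[observation_id]


def to_observation(store: RawStore, raw_payload: dict, summary: dict) -> dict:
    """Persist the raw payload; return only summary + pointer for history."""
    raw = json.dumps(raw_payload, sort_keys=True).encode("utf-8")
    return {
        "observation_id": store.put(raw),
        "summary": summary,
        "raw_bytes": len(raw),
    }
```

Only the returned dictionary is appended to the conversation; replay tooling resolves observation_id against the store when it needs the full payload.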
Pagination, cursors and follow-up tools
Truncation without navigation traps the agent in partial data. Prefer cursor-based pagination over offset paging for live logs:
- First call returns summary + next_cursor when has_more: true.
- Expose tool_name_continue(cursor) or a page_token argument on the same tool.
- Let the model request depth explicitly: “fetch next page” beats auto-inlining ten pages.
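A cursor keyed on the last returned timestamp stays valid while new lines arrive, whereas an offset shifts whenever rows are added. The sketch below assumes events are sorted ascending by a unique ts field and that page_size is at least 1; the function names and cursor encoding are illustrative.

```python
import base64
import json
from typing import Optional


def encode_cursor(last_ts: int) -> str:
    """Opaque cursor: the model passes it back without interpreting it."""
    return base64.urlsafe_b64encode(
        json.dumps({"after": last_ts}).encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))["after"]


def search_logs(events: list, page_size: int,
                cursor: Optional[str] = None) -> dict:
    """Return one page of events strictly after the cursor position."""
    after = decode_cursor(cursor) if cursor else None
    remaining = [e for e in events if after is None or e["ts"] > after]
    page = remaining[:page_size]
    has_more = len(remaining) > len(page)
    return {
        "events": page,
        "has_more": has_more,
        "next_cursor": encode_cursor(page[-1]["ts"]) if has_more else None,
    }
```

The agent sees has_more: true and decides whether the next page is worth the tokens, rather than having the adapter inline it automatically.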
For parallel fan-out (querying twelve microservices), summarize each shard locally, then merge summaries — not twelve full JSON trees into one observation.
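A hedged sketch of summarize-then-merge: each shard is reduced to a few fields in its own worker, and only those small structs are combined. The shard data is passed in directly here for simplicity; in a real system each worker would fetch from its service first. Field names such as level and message are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_shard(service: str, events: list) -> dict:
    """Reduce one shard to a small struct before it leaves the worker."""
    errors = [e for e in events if e["level"] == "ERROR"]
    return {
        "service": service,
        "total_events": len(events),
        "error_count": len(errors),
        "first_error": errors[0]["message"] if errors else None,
    }


def fan_out(shards: dict) -> dict:
    """Summarize shards in parallel, then merge only the summaries."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        summaries = list(
            pool.map(lambda kv: summarize_shard(*kv), shards.items()))
    # Noisiest services first, so a later cap drops the quiet ones.
    summaries.sort(key=lambda s: s["error_count"], reverse=True)
    return {"shards": summaries, "shard_count": len(summaries)}
```

The merged observation grows with the number of shards, not with the volume of events inside each one.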
Harbor DevOps refactor walkthrough
Harbor’s incident agent had three log tools with no size guards. The refactor added a shared observation middleware:
- CloudWatch adapter projects to timestamp, level, message, request_id and caps at 60 lines sorted by severity heuristic.
- Overflow triggers an extractive pass (exception types, 5xx counts per service) plus a 600-token inline summary with line citations.
- Raw events land in S3; conversation carries observation_id only.
- Parallel deploy-status tools return one-row structs; a merge step builds a 12-line timeline instead of concatenating JSON.
Outcome: mean observation tokens per incident run fell from 38,400 to 4,200 (−89%). Sev-2 pages with wrong root cause dropped from 41% to 9%. Median time-to-first-actionable hypothesis improved because the model stopped re-reading noise.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Single JSON record < 2k tokens | Pass through with schema validation | LLM summary (adds latency, drops fields) |
| Search results > 50 rows | Sort + cap + cursor | Head-only truncate (loses recent rows) |
| Chronological logs | Head/tail + extractive errors | Random 8k-token middle slice |
| Multi-shard parallel queries | Per-shard summary then merge | Concatenate raw shards |
| Compliance audit need | Summary in context, raw in vault | Drop raw payload entirely |
| Model keeps asking for “full file” | Page tool with explicit ranges | Raise budget unbounded |
Common pitfalls
- Silent truncation — model does not know data was cut; always set truncated flags.
- Summarizing before projection — paying LLM tokens on fields you will discard anyway.
- Lossy summary without citations — ungrounded bullets are worse than smaller raw slices with line IDs.
- One global cap for all tools — a metrics query and a PDF extract need different budgets.
- Re-injecting full history on repair — after tool errors, re-fetch with tighter filters, not bigger blobs.
- Skipping eval on compressed paths — golden tests must cover truncated and paginated branches.
- PII in summaries — scrub before the summarizer model if raw logs contain secrets.
Production checklist
- Every tool declares max_observation_tokens in its schema.
- JSON tools project to documented field allowlists server-side.
- Truncation sets truncated, total_rows, and optional next_cursor.
- Raw payloads > cap stored by observation_id for replay.
- Inline summarizer output capped and schema-validated when structured.
- Parallel merge step summarizes shards before entering history.
- Dashboard: p95 observation tokens per tool and per run.
- Golden tests include truncated, paginated, and empty-result cases.
- Alert when any single observation exceeds 150% of budget (regression).
- Document which fields are lossy vs lossless in tool descriptions.
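The 150% regression alert from the checklist reduces to a small comparison. This is a sketch only: the status labels and the function name budget_status are assumptions, and a real deployment would emit a metric or page rather than return a string.

```python
def budget_status(observation_tokens: int, max_observation_tokens: int) -> str:
    """Classify one observation against its declared per-tool budget."""
    if observation_tokens > 1.5 * max_observation_tokens:
        return "alert"        # regression: more than 150% of budget
    if observation_tokens > max_observation_tokens:
        return "over_budget"  # truncation should have caught this
    return "ok"
```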
Key takeaways
- Treat tool output size as part of the API contract — not an afterthought.
- Project JSON before you summarize — drop noise cheaply, then condense signal.
- Always tell the model what was omitted — flags and cursors beat silent cuts.
- Keep raw data for audit, summaries for reasoning — replay without context bloat.
- Harbor DevOps cut observation tokens 89% and wrong root-cause pages from 41% to 9% with middleware, not a bigger model.
Related reading
- Agent context budget and token management explained — run-level ceilings and allocation
- Tool error handling explained — structured observations on failure
- Context compression explained — mid-history pruning patterns
- Parallel tool execution explained — shard summaries before merge