LLM agent tool result caching systems explained
Harbor Integrations shipped a support triage agent that reads Zendesk
tickets, searches CRM accounts, and drafts replies through a
draft-critic loop. On paper the design was sound. In production,
reflection added three extra model passes per ticket, and each pass
re-invoked get_ticket and
search_customer with identical arguments.
A single user message could trigger eight external API calls for the
same ticket body. Rate limits throttled the fleet; p95 run latency
climbed; and finance noticed API spend scaling with reflection depth,
not ticket volume. Duplicate external reads had reached
62% of tool calls before anyone recognized the problem as a
caching gap rather than a model-quality issue.
Tool result caching stores normalized observations from read tools so agents do not re-pay latency, quota, and money for deterministic lookups within a freshness window. It is distinct from prompt prefix caching (model prefill reuse) and from summarization (shrinking payloads). Caching answers: “have we already fetched this exact tool input recently, and is the answer still safe to reuse?” This guide covers cache key design, freshness classes, in-run vs cross-run tiers, invalidation on writes, tenant isolation, integration with parallel tool execution, the Harbor Integrations refactor, a technique decision table, pitfalls, and a production checklist tied to cost attribution.
When tool caching helps and when it hurts
Not every tool observation belongs in a cache. Writes, irreversible actions, and security-sensitive reads need stricter rules than idempotent GET-shaped lookups.
- Cache candidates — idempotent reads: ticket fetch, account lookup, inventory snapshot, geocode, public docs retrieval, deterministic calculator outputs.
- Never cache by default — payment capture, password reset, permission grant, send-email, any tool whose side effect is the point of the call.
- Conditional cache — search/list tools where results change frequently; use short TTL or cursor-versioned keys.
- In-run vs cross-run — reflection loops and multi-step plans benefit from in-run memory even when cross-run Redis would be too risky.
Harbor classified every tool in the registry with a
cache_policy enum. That single field prevented
engineers from sprinkling ad-hoc memoization inside tool handlers.
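A registry-level policy field like the one described above could look roughly like the following. This is a minimal sketch, not Harbor's actual schema: the enum members, the `ToolSpec` class, the `is_cacheable` helper, and the `send_email` entry are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CachePolicy(Enum):
    CACHE = "cache"              # idempotent reads
    CONDITIONAL = "conditional"  # search/list tools: short TTL or versioned keys
    NEVER = "never"              # writes and other side-effecting tools


@dataclass(frozen=True)
class ToolSpec:
    name: str
    cache_policy: CachePolicy


# Hypothetical registry; the policies assigned here are illustrative only.
REGISTRY = {
    "get_ticket": ToolSpec("get_ticket", CachePolicy.CACHE),
    "search_customer": ToolSpec("search_customer", CachePolicy.CONDITIONAL),
    "send_email": ToolSpec("send_email", CachePolicy.NEVER),
}


def is_cacheable(tool_name: str) -> bool:
    """Undeclared tools count as NEVER, so a missing entry cannot enable caching."""
    spec = REGISTRY.get(tool_name)
    return spec is not None and spec.cache_policy is not CachePolicy.NEVER
```

Defaulting unknown tools to "never cache" keeps the failure mode safe: forgetting to declare a policy costs latency, not correctness.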
Cache key design: exact, normalized, versioned
Bad keys cause stale wrong answers or cross-tenant leaks. A production key should bind:
cache_key = hash(
    tenant_id,
    tool_name,
    tool_manifest_version,
    canonical_json(normalized_args),
    freshness_class
)
Normalization rules matter. Sort JSON object keys; lowercase emails; trim whitespace; reject ambiguous floats; map equivalent enums to canonical values. Two calls that look identical to the model must hash identically.
- tool_manifest_version — bump when response schema or upstream API behavior changes; avoids serving v1-shaped JSON to v2 parsers.
- tenant_id — mandatory; never rely on args alone. See tenant isolation.
- user/session scope — optional extra dimension when the same tool args return user-specific rows under row-level security.
- No secrets in keys — hash args after redacting tokens; store ciphertext values outside the key.
Exact-key caching is the default. Semantic similarity caching (embedding near-duplicate queries) is a separate, higher-risk layer — useful for RAG retrieval, dangerous for CRM record IDs.
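The key recipe and normalization rules in this section can be sketched in a few lines of Python. The helper names, the choice of SHA-256, and the assumption that email arguments live in a field literally named `email` are illustrative. The sketch only normalizes top-level arguments, and a real implementation would also map equivalent enums and redact secrets before hashing.

```python
import hashlib
import json


def normalize_args(args: dict) -> dict:
    """Illustrative normalization: trim strings, lowercase emails, reject floats."""
    out = {}
    for key, value in args.items():
        if isinstance(value, float):
            raise ValueError(f"ambiguous float for {key!r}; pass a string or int")
        if isinstance(value, str):
            value = value.strip()
            if key == "email":  # assumption: the email field is named "email"
                value = value.lower()
        out[key] = value
    return out


def canonical_json(args: dict) -> str:
    # Sorted keys and fixed separators give one byte sequence per logical input.
    return json.dumps(args, sort_keys=True, separators=(",", ":"))


def cache_key(tenant_id, tool_name, manifest_version, args, freshness_class) -> str:
    parts = [
        tenant_id,
        tool_name,
        manifest_version,
        canonical_json(normalize_args(args)),
        freshness_class,
    ]
    # Encode the parts as a JSON array so field boundaries stay unambiguous.
    payload = json.dumps(parts, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Serializing the parts as a JSON array rather than joining them with a delimiter avoids collisions when a tenant ID or tool name happens to contain the delimiter.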
Freshness classes and TTL policy
One global TTL fails. Harbor uses freshness classes declared in the tool registry:
| Class | TTL (indicative) | Example tools |
|---|---|---|
| ephemeral | In-run only (process memory) | Reflection re-reads, planner scratch lookups |
| short | 30–120 seconds | Ticket status, order tracking |
| medium | 5–30 minutes | CRM account profile, product catalog slice |
| long | Hours (immutable refs) | Geocode, static policy PDF metadata |
| none | Never cache | Writes, auth, rate-limit probes |
Attach cached_at and freshness_class to every cache hit returned to the model so the agent can reason about staleness (“as of 14:02 UTC”). For compliance workflows, log cache hits in the audit trail with the same metadata.
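One way to turn the freshness classes into concrete expiry times is a small lookup table. The specific durations below are assumptions picked from inside the indicative ranges in the table, and `expires_at` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed values chosen from within the indicative ranges; tune per tool.
SHARED_TTL = {
    "short": timedelta(seconds=120),
    "medium": timedelta(minutes=15),
    "long": timedelta(hours=6),
}


def expires_at(freshness_class: str, cached_at: datetime) -> Optional[datetime]:
    """Return the shared-store expiry, or None for classes kept out of it."""
    if freshness_class in ("ephemeral", "none"):
        return None  # ephemeral lives only in run memory; none is never cached
    return cached_at + SHARED_TTL[freshness_class]  # unknown class raises KeyError


cached = datetime(2026, 6, 12, 14, 2, 11, tzinfo=timezone.utc)
short_expiry = expires_at("short", cached)
```

Letting an unknown class raise instead of falling back to a default TTL means a typo in the registry surfaces at the first call rather than as silently stale data.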
Cache tiers: in-run memory, shared store, negative cache
Layer caches from cheapest to broadest:
- L0 — in-run dict keyed by cache_key; cleared when the run completes. Eliminates reflection duplicate calls with zero cross-tenant risk.
- L1 — tenant-scoped Redis for cross-run reuse within TTL; namespaced tenant:{id}:tool_cache:{hash}.
- L2 — optional CDN/edge only for truly public static tool backends (rare for agents).
Negative caching stores “not found” responses briefly so a confused agent looping on a bad ID does not hammer the API. Cap negative TTL lower than positive hits; never negative-cache permission errors as 404.
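A minimal sketch of an in-run cache with a shorter negative TTL might look like this. It assumes the tool signals "not found" by raising `LookupError`, which is a convention invented for this example. The class name, the sentinel, and the default TTL values are illustrative too.

```python
import time

NOT_FOUND = object()  # sentinel stored for negative ("not found") entries


class InRunCache:
    """In-run cache: one instance per run, entries expire on a monotonic clock."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, key, value, ttl_s):
        self._entries[key] = (value, self._clock() + ttl_s)

    def get(self, key):
        """Return (hit, value); expired entries are dropped and count as a miss."""
        entry = self._entries.get(key)
        if entry is None:
            return False, None
        value, deadline = entry
        if self._clock() >= deadline:
            del self._entries[key]
            return False, None
        return True, value


def cached_read(cache, key, fetch, ttl_s=90.0, negative_ttl_s=10.0):
    hit, value = cache.get(key)
    if hit:
        return None if value is NOT_FOUND else value
    try:
        value = fetch()
    except LookupError:  # assumption: tools raise LookupError for "not found"
        cache.put(key, NOT_FOUND, negative_ttl_s)
        return None
    # PermissionError and any other failure propagates without being cached.
    cache.put(key, value, ttl_s)
    return value
```

Only the "not found" path writes a negative entry, so permission errors and upstream failures are retried on the next call instead of being remembered.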
On parallel waves, use per-key inflight coalescing: concurrent identical reads share one upstream request (request de-duplication / singleflight).
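Per-key inflight coalescing can be sketched with `asyncio` as follows. The `SingleFlight` class and its `do` method are names chosen for this example, and the sketch assumes all callers run on the same event loop.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one upstream request."""

    def __init__(self):
        self._inflight = {}

    async def do(self, key, fetch):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(fetch())
            self._inflight[key] = task
            # Forget the key once the request settles so later calls refetch.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield keeps one cancelled waiter from cancelling the shared request.
        return await asyncio.shield(task)
```

Every waiter receives the same result object, or the same exception if the upstream call fails. Singleflight only deduplicates requests that overlap in time; reuse after the request completes is the job of the cache tiers.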
Invalidation: writes, webhooks, and explicit bust
Cached reads go stale when the world changes. Three invalidation patterns:
- Write-through bust — any successful write tool deletes keys matching a declared invalidates_prefix list (ticket:{id}:*).
- Event-driven bust — Zendesk webhook enqueues cache delete for affected entities; pairs well with webhook ingress.
- TTL-only — acceptable for low-risk catalog reads when invalidation plumbing is not worth building yet.
Harbor wires write tools with explicit invalidation maps in the tool manifest. The model never chooses invalidation; the runtime applies it after confirmed success, same as idempotency ledger updates.
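The write-through pattern can be sketched with a plain dict standing in for the cache store. The `update_ticket` tool name, the `INVALIDATION_MAP` layout, and the assumption that stored keys carry an entity prefix such as `ticket:123:` are all hypothetical. With Redis, the same idea usually needs a per-entity index set or a `SCAN` with `MATCH` rather than iterating every key.

```python
# Hypothetical manifest fragment: write tool name -> key prefixes it invalidates.
INVALIDATION_MAP = {
    "update_ticket": ["ticket:{id}:"],
}


def bust_after_write(store: dict, tool_name: str, args: dict, succeeded: bool) -> int:
    """Delete cached reads covered by the write tool's prefixes; return the count."""
    if not succeeded:
        return 0  # only confirmed writes invalidate
    prefixes = [p.format(**args) for p in INVALIDATION_MAP.get(tool_name, [])]
    doomed = [k for k in store if any(k.startswith(p) for p in prefixes)]
    for key in doomed:
        del store[key]
    return len(doomed)
```

The trailing colon in the prefix matters: without it, a write to ticket 123 would also bust entries for ticket 1234.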
Security, PII, and what never enters the cache
- Tenant isolation — separate namespaces; integration tests that prove tenant A cannot read tenant B keys.
- PII encryption at rest — cache values containing emails or phone numbers encrypted with tenant KMS keys; align with PII pipelines.
- No OAuth tokens in cached payloads — strip auth headers from stored JSON; tools fetch tokens per call.
- Cache poisoning defense — only the trusted tool runtime writes cache entries; model cannot inject cache keys.
- User deletion — GDPR erasure must purge cache rows for that subject, not only vector stores.
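Two of the rules above, stripping credentials and namespacing by tenant, lend themselves to small guard functions in the runtime. This is a sketch under assumptions: the list of sensitive field names is invented and would need to match the real tool payloads, and encryption is out of scope here.

```python
# Assumed field names; extend to match the payloads your tools actually return.
SENSITIVE_KEYS = {"authorization", "access_token", "refresh_token", "api_key"}


def scrub(payload):
    """Recursively drop credential-looking fields before a payload is cached."""
    if isinstance(payload, dict):
        return {
            k: scrub(v)
            for k, v in payload.items()
            if k.lower() not in SENSITIVE_KEYS
        }
    if isinstance(payload, list):
        return [scrub(v) for v in payload]
    return payload


def namespaced(tenant_id: str, key_hash: str) -> str:
    """Build the tenant-scoped store key; reject IDs that could break the namespace."""
    if not tenant_id or ":" in tenant_id:
        raise ValueError("tenant_id must be non-empty and contain no ':'")
    return f"tenant:{tenant_id}:tool_cache:{key_hash}"
```

Rejecting a colon in `tenant_id` prevents a crafted ID from producing a key that lands inside another tenant's namespace.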
Observation envelope returned to the model
Cache hits should be visibly labeled so the agent does not treat stale data as live:
{
  "ok": true,
  "data": { "...": "..." },
  "_cache": {
    "hit": true,
    "tier": "L0_in_run",
    "cached_at": "2026-06-12T14:02:11Z",
    "freshness_class": "short",
    "expires_at": "2026-06-12T14:04:11Z"
  }
}
If a read exceeds its freshness budget mid-run, the runtime can force a refresh or return the hit plus a stale_warning for the model to decide. Do not silently refresh on every hit; that defeats caching. Pair large payloads with summarization on first fetch, and cache the truncated projection, not the 38k-token raw blob.
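Building the envelope, including the stale case, might look like the sketch below. The function names are hypothetical, and representing `stale_warning` as a boolean inside `_cache` is an assumption, since the guide does not specify its shape.

```python
from datetime import datetime, timezone


def _iso(dt: datetime) -> str:
    """Format an aware datetime as a UTC timestamp with a trailing Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def envelope(data, tier, cached_at, expires_at, freshness_class, now=None):
    """Wrap a cache hit so the model can see where and when it came from."""
    if now is None:
        now = datetime.now(timezone.utc)
    cache_meta = {
        "hit": True,
        "tier": tier,
        "cached_at": _iso(cached_at),
        "freshness_class": freshness_class,
        "expires_at": _iso(expires_at),
    }
    if now >= expires_at:
        # Assumed shape: a boolean flag the model can act on.
        cache_meta["stale_warning"] = True
    return {"ok": True, "data": data, "_cache": cache_meta}
```

Passing `now` explicitly keeps the function deterministic in tests. In production the default reads the current UTC time.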
Harbor Integrations refactor
Four changes cut duplicate external reads from 62% to 8% of tool calls (the remaining 8% were intentional refreshes after writes or TTL expiry):
- cache_policy on every tool in the manifest with freshness class and invalidation map.
- L0 in-run cache on the agent runtime so reflection loops reuse ticket and customer payloads.
- L1 Redis with tenant namespaces and 90-second TTL for ticket reads; webhook bust on ticket.updated.
- Singleflight on parallel duplicate reads in the same wave.
p95 run latency fell 41%; Zendesk 429 rate-limit events dropped sharply. External API spend per resolved ticket tracked in cost attribution dashboards now scales with unique entities touched, not reflection depth.
Technique decision table
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| No caching | Always fresh; simplest mental model | Duplicate calls; rate limits; high latency | Early prototypes, write-heavy agents |
| In-run memory only (recommended minimum) | Safe; kills reflection duplicates | No cross-run benefit | All production read tools |
| Cross-run exact-key cache (recommended) | Cuts API cost across sessions | Needs TTL + invalidation discipline | CRM, ticketing, catalog lookups |
| Prompt prefix caching | Cheaper model prefill | Does not reduce external API calls | Static system prompts, tool schemas |
| Semantic tool cache | Hits near-duplicate queries | Wrong-record risk; hard to invalidate | Public knowledge search only |
Common pitfalls
- Caching write tool responses — never treat mutation acknowledgments as reusable reads.
- Keys without tenant_id — the worst bug class; one hit leaks another customer’s row.
- Ignoring manifest version — schema drift turns hits into parse failures.
- Unnormalized args — {"id": "123"} vs {"id": 123} hash to different keys, so both calls miss the cache.
- Caching pre-redaction PII — encrypt or strip before Redis; audits still need lineage.
- No invalidation on writes — agents confidently cite stale ticket status after an update tool ran.
- Caching errors too long — negative cache of 500s blocks recovery after upstream heals.
- Hidden cache hits — model cannot disclose freshness to the user when _cache metadata is stripped.
Production checklist
- Declare cache_policy and freshness class per tool in the manifest.
- Build cache keys from tenant + tool + version + canonical args.
- Implement L0 in-run cache on every agent worker.
- Add L1 tenant-scoped store for cross-run reads with TTL per class.
- Return _cache metadata on hits for model and audit visibility.
- Wire write tools to invalidation maps; add webhook bust where available.
- Use singleflight for parallel duplicate reads in the same wave.
- Encrypt or redact PII in cached values; purge on subject erasure.
- Metric: cache hit rate, duplicate-call rate, stale-served count.
- Integration test: cross-tenant key collision must miss, never hit wrong row.
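For the metrics item, one possible definition of duplicate-call rate is the share of upstream calls whose cache key had already been fetched earlier in the same log. This definition and the function name are assumptions; the guide does not say how the 62% figure was computed.

```python
from collections import Counter


def duplicate_call_rate(call_keys):
    """Share of upstream calls that repeat a cache key already seen in the log."""
    if not call_keys:
        return 0.0
    counts = Counter(call_keys)
    # Each key's first call is necessary; every further call is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(call_keys)
```

Under this definition, eight calls for the same ticket count as seven duplicates out of eight.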
Key takeaways
- Tool caching cuts duplicate external reads — especially across reflection and planner loops.
- Exact normalized keys + tenant scope are non-negotiable; semantic cache is a special case.
- Freshness classes beat one global TTL — ticket status and geocode do not share the same staleness budget.
- Harbor Integrations cut duplicate external reads from 62% to 8% of tool calls with L0/L1 tiers and webhook invalidation.
- Pair caching with summarization — cache the projection the model actually needs, not raw megabyte payloads.
Related reading
- LLM agent tool result summarization explained — shrink payloads before they enter context
- LLM agent prompt caching explained — prefix reuse for cheaper prefill
- LLM agent parallel tool execution explained — singleflight and wave scheduling
- LLM agent cost attribution explained — meter API and token spend per run