
LLM agent tool result caching systems explained

Harbor Integrations shipped a support triage agent that reads Zendesk tickets, searches CRM accounts, and drafts replies through a draft-critic loop. On paper the design was sound. In production, reflection added three extra model passes per ticket, and each pass re-invoked get_ticket and search_customer with identical arguments. A single user message could trigger eight external API calls for the same ticket body. Rate limits throttled the fleet, p95 run latency climbed, and finance noticed API spend scaling with reflection depth, not ticket volume. Duplicate external reads reached 62% of tool calls before anyone recognized the problem as one of caching rather than model quality.

Tool result caching stores normalized observations from read tools so agents do not re-pay latency, quota, and money for deterministic lookups within a freshness window. It is distinct from prompt prefix caching (model prefill reuse) and from summarization (shrinking payloads). Caching answers: “have we already fetched this exact tool input recently, and is the answer still safe to reuse?” This guide covers cache key design, freshness classes, in-run vs cross-run tiers, invalidation on writes, tenant isolation, integration with parallel tool execution, the Harbor Integrations refactor, a technique decision table, pitfalls, and a production checklist tied to cost attribution.

When tool caching helps and when it hurts

Not every tool observation belongs in a cache. Writes, irreversible actions, and security-sensitive reads need stricter rules than idempotent GET-shaped lookups.

  • Cache candidates — idempotent reads: ticket fetch, account lookup, inventory snapshot, geocode, public docs retrieval, deterministic calculator outputs.
  • Never cache by default — payment capture, password reset, permission grant, send-email, any tool whose side effect is the point of the call.
  • Conditional cache — search/list tools where results change frequently; use short TTL or cursor-versioned keys.
  • In-run vs cross-run — reflection loops and multi-step plans benefit from in-run memory even when cross-run Redis would be too risky.

Harbor classified every tool in the registry with a cache_policy enum. That single field prevented engineers from sprinkling ad-hoc memoization inside tool handlers.
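
To make the registry field concrete, here is a minimal sketch of what a cache_policy enum and lookup could look like. The enum members, the ToolSpec class, and the send_email entry are hypothetical names chosen for illustration; the source only states that a cache_policy enum exists on every tool.

```python
from dataclasses import dataclass
from enum import Enum


class CachePolicy(Enum):
    # Hypothetical members mirroring the categories listed above.
    CACHE = "cache"              # idempotent reads
    CONDITIONAL = "conditional"  # search/list tools that need short TTLs
    IN_RUN_ONLY = "in_run_only"  # reusable inside one run, never across runs
    NEVER = "never"              # writes and other side-effecting tools


@dataclass(frozen=True)
class ToolSpec:
    name: str
    cache_policy: CachePolicy


# Illustrative registry; the policy assigned to each tool is an assumption.
REGISTRY = {
    spec.name: spec
    for spec in [
        ToolSpec("get_ticket", CachePolicy.CACHE),
        ToolSpec("search_customer", CachePolicy.CONDITIONAL),
        ToolSpec("send_email", CachePolicy.NEVER),
    ]
}


def is_cacheable(tool_name: str) -> bool:
    """Unknown tools are treated as not cacheable, the conservative default."""
    spec = REGISTRY.get(tool_name)
    return spec is not None and spec.cache_policy is not CachePolicy.NEVER
```

Because the decision lives in one lookup, a tool handler never needs its own memoization; the runtime asks the registry before touching any cache.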

Cache key design: exact, normalized, versioned

Bad keys cause stale wrong answers or cross-tenant leaks. A production key should bind:

cache_key = hash(
  tenant_id,
  tool_name,
  tool_manifest_version,
  canonical_json(normalized_args),
  freshness_class
)

Normalization rules matter. Sort JSON object keys; lowercase emails; trim whitespace; reject ambiguous floats; map equivalent enums to canonical values. Two calls that look identical to the model must hash identically.

  • tool_manifest_version — bump when response schema or upstream API behavior changes; avoids serving v1-shaped JSON to v2 parsers.
  • tenant_id — mandatory; never rely on args alone. See tenant isolation.
  • user/session scope — optional extra dimension when the same tool args return user-specific rows under row-level security.
  • No secrets in keys — hash args after redacting tokens; store ciphertext values outside the key.
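
The key recipe and normalization rules above can be sketched in a few lines of Python. This is one possible reading, not Harbor's implementation: it assumes email fields are literally named "email", handles only top-level arguments (nested values are not normalized), and uses SHA-256 over canonical JSON as the hash.

```python
import hashlib
import json


def normalize_args(args: dict) -> dict:
    """Trim strings, lowercase emails, and reject floats in top-level args."""
    out = {}
    for key, value in args.items():
        if isinstance(value, float):
            raise ValueError(f"ambiguous float for {key!r}; pass a string or int")
        if isinstance(value, str):
            value = value.strip()
            if key == "email":  # assumption: email fields are named "email"
                value = value.lower()
        out[key] = value
    return out


def canonical_json(obj) -> str:
    # Sorted keys and fixed separators give one byte string per logical value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def cache_key(tenant_id, tool_name, manifest_version, args, freshness_class) -> str:
    payload = canonical_json(
        [tenant_id, tool_name, manifest_version, normalize_args(args), freshness_class]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this sketch, {"email": " Ana@Example.com ", "limit": 5} and {"limit": 5, "email": "ana@example.com"} produce the same key for one tenant, while the same arguments under a different tenant_id produce a different key.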

Exact-key caching is the default. Semantic similarity caching (embedding near-duplicate queries) is a separate, higher-risk layer — useful for RAG retrieval, dangerous for CRM record IDs.

Freshness classes and TTL policy

One global TTL fails. Harbor uses freshness classes declared in the tool registry:

Class | TTL (indicative) | Example tools
ephemeral | In-run only (process memory) | Reflection re-reads, planner scratch lookups
short | 30–120 seconds | Ticket status, order tracking
medium | 5–30 minutes | CRM account profile, product catalog slice
long | Hours (immutable refs) | Geocode, static policy PDF metadata
none | Never cache | Writes, auth, rate-limit probes

Attach cached_at and freshness_class to every cache hit returned to the model so the agent can reason about staleness (“as of 14:02 UTC”). For compliance workflows, log cache hits in the audit trail with the same metadata.
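
A freshness check for the shared store might look like the sketch below. The concrete TTL values are assumptions picked from inside the indicative ranges in the table; the ephemeral and none classes deliberately have no entry because they never reach a cross-run store.

```python
from datetime import datetime, timedelta, timezone

# Assumed concrete TTLs chosen from the indicative ranges above.
TTL_BY_CLASS = {
    "short": timedelta(seconds=90),
    "medium": timedelta(minutes=15),
    "long": timedelta(hours=6),
}


def expires_at(cached_at: datetime, freshness_class: str):
    """Return the expiry time, or None for classes with no shared-store TTL."""
    ttl = TTL_BY_CLASS.get(freshness_class)
    return None if ttl is None else cached_at + ttl


def is_fresh(cached_at: datetime, freshness_class: str, now: datetime) -> bool:
    """True only while a shared-store entry is inside its class TTL."""
    deadline = expires_at(cached_at, freshness_class)
    return deadline is not None and now < deadline
```

Passing now explicitly, rather than reading the clock inside the function, keeps the check easy to test and lets the runtime use one timestamp per tool call.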

Cache tiers: in-run memory, shared store, negative cache

Layer caches from cheapest to broadest:

  1. L0 — in-run dict keyed by cache_key; cleared when the run completes. Eliminates reflection duplicate calls with zero cross-tenant risk.
  2. L1 — tenant-scoped Redis for cross-run reuse within TTL; namespaced tenant:{id}:tool_cache:{hash}.
  3. L2 — optional CDN/edge only for truly public static tool backends (rare for agents).
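
The first two tiers can be sketched as a lookup that checks the in-run dict before the shared store. A plain dict stands in for Redis here, and the class name, the "L1_shared" tier label, and the entry layout are assumptions made for the example.

```python
import time
from typing import Optional


class TieredToolCache:
    """L0 dict owned by one run, plus a stand-in for the shared L1 store."""

    def __init__(self, tenant_id: str, shared_store: dict):
        self.tenant_id = tenant_id
        self.l0 = {}            # discarded when the run completes
        self.l1 = shared_store  # assumption: a dict standing in for Redis

    def _l1_key(self, key_hash: str) -> str:
        return f"tenant:{self.tenant_id}:tool_cache:{key_hash}"

    def get(self, key_hash: str, now: Optional[float] = None):
        now = time.time() if now is None else now
        if key_hash in self.l0:
            return self.l0[key_hash], "L0_in_run"
        entry = self.l1.get(self._l1_key(key_hash))
        if entry is not None and entry["expires_at"] > now:
            self.l0[key_hash] = entry["value"]  # promote for later passes
            return entry["value"], "L1_shared"
        return None, None

    def put(self, key_hash: str, value, ttl_seconds: float, now: Optional[float] = None):
        now = time.time() if now is None else now
        self.l0[key_hash] = value
        self.l1[self._l1_key(key_hash)] = {"value": value, "expires_at": now + ttl_seconds}
```

A second run for the same tenant finds the entry in L1 and promotes it to its own L0; a run for a different tenant builds a different namespaced key and misses.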

Negative caching stores “not found” responses briefly so a confused agent looping on a bad ID does not hammer the API. Cap negative TTL lower than positive hits; never negative-cache permission errors as 404.
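
One way to encode that rule is a small function that maps an upstream HTTP status to a TTL. The two TTL values are assumptions; the only constraints taken from the text are that the negative TTL is shorter and that permission errors are never stored as "not found".

```python
POSITIVE_TTL = 90  # seconds; assumed value for a short-class read
NEGATIVE_TTL = 15  # seconds; assumed, deliberately shorter than POSITIVE_TTL


def ttl_for_response(status: int):
    """Return a TTL in seconds, or None when the response must not be cached."""
    if 200 <= status < 300:
        return POSITIVE_TTL
    if status == 404:
        return NEGATIVE_TTL  # brief memory of "not found"
    return None              # 401/403, 429, and 5xx always go back upstream
```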

On parallel waves, use per-key inflight coalescing: concurrent identical reads share one upstream request (request de-duplication / singleflight).
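
In-flight coalescing can be sketched with asyncio: the first caller for a key starts the upstream request, and concurrent callers await the same task. The SingleFlight class and the demo function are illustrative names, and the sketch assumes all callers share one event loop.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one upstream request."""

    def __init__(self):
        self._inflight = {}

    async def do(self, key, fetch):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.create_task(fetch())
            self._inflight[key] = task
            # Forget the key once the request settles so later calls refetch.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        return await task


async def demo():
    calls = 0

    async def fetch_ticket():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for the upstream API call
        return {"id": 123, "status": "open"}

    flight = SingleFlight()
    results = await asyncio.gather(
        *(flight.do("ticket:123", fetch_ticket) for _ in range(5))
    )
    return calls, results
```

Running asyncio.run(demo()) issues one upstream fetch for the five concurrent reads. Note that if the shared request fails, every waiter sees the same exception, which is usually what a parallel wave wants.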

Invalidation: writes, webhooks, and explicit bust

Cached reads go stale when the world changes. Three invalidation patterns:

  • Write-through bust — any successful write tool deletes keys matching a declared invalidates_prefix list (ticket:{id}:*).
  • Event-driven bust — Zendesk webhook enqueues cache delete for affected entities; pairs well with webhook ingress.
  • TTL-only — acceptable for low-risk catalog reads when invalidation plumbing is not worth building yet.

Harbor wires write tools with explicit invalidation maps in the tool manifest. The model never chooses invalidation; the runtime applies it after confirmed success, same as idempotency ledger updates.
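
A write-through bust driven by a manifest map might look like the following sketch. It assumes stored keys carry an entity prefix after the tenant namespace (for example tenant:acme:ticket:42:...) so that prefix matching is possible; the update_ticket and merge_customer tool names and the map layout are hypothetical.

```python
# Hypothetical manifest fragment: write tool -> key-prefix templates to delete.
INVALIDATION_MAP = {
    "update_ticket": ["ticket:{id}:"],
    "merge_customer": ["customer:{source_id}:", "customer:{target_id}:"],
}


def bust_after_write(store: dict, tenant_id: str, tool_name: str,
                     args: dict, succeeded: bool) -> int:
    """Delete matching entries after a confirmed write; return the count removed."""
    if not succeeded:
        return 0  # never invalidate on a failed or unconfirmed write
    removed = 0
    for template in INVALIDATION_MAP.get(tool_name, []):
        prefix = f"tenant:{tenant_id}:" + template.format(**args)
        # Collect keys first so the dict is not mutated while iterating.
        for key in [k for k in store if k.startswith(prefix)]:
            del store[key]
            removed += 1
    return removed
```

The trailing colon in each template matters: without it, busting ticket 42 would also delete entries for ticket 421. A dict scan is fine for a sketch; a shared store would likely need an index of keys per entity instead.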

Security, PII, and what never enters the cache

  • Tenant isolation — separate namespaces; integration tests that prove tenant A cannot read tenant B keys.
  • PII encryption at rest — cache values containing emails or phone numbers encrypted with tenant KMS keys; align with PII pipelines.
  • No OAuth tokens in cached payloads — strip auth headers from stored JSON; tools fetch tokens per call.
  • Cache poisoning defense — only the trusted tool runtime writes cache entries; model cannot inject cache keys.
  • User deletion — GDPR erasure must purge cache rows for that subject, not only vector stores.
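
The "no OAuth tokens in cached payloads" rule can be enforced with a scrubber that runs before any write to the cache. This is a sketch; the list of sensitive field names is an assumption and would need to match the actual tool responses.

```python
# Assumed field names; extend to match what the tool backends really return.
SENSITIVE_KEYS = {"authorization", "access_token", "refresh_token", "api_key"}


def strip_secrets(payload):
    """Return a copy of a JSON-like payload with credential fields removed at any depth."""
    if isinstance(payload, dict):
        return {
            key: strip_secrets(value)
            for key, value in payload.items()
            if key.lower() not in SENSITIVE_KEYS
        }
    if isinstance(payload, list):
        return [strip_secrets(item) for item in payload]
    return payload
```

The function returns a new structure and leaves the input untouched, so the live response can still carry its headers while only the scrubbed copy is stored.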

Observation envelope returned to the model

Cache hits should be visibly labeled so the agent does not treat stale data as live:

{
  "ok": true,
  "data": { "...": "..." },
  "_cache": {
    "hit": true,
    "tier": "L0_in_run",
    "cached_at": "2026-06-12T14:02:11Z",
    "freshness_class": "short",
    "expires_at": "2026-06-12T14:04:11Z"
  }
}

If a read exceeds freshness budget mid-run, the runtime can force refresh or return hit plus stale_warning for the model to decide. Do not silently refresh on every hit — that defeats caching. Pair large payloads with summarization on first fetch; cache the truncated projection, not the 38k-token raw blob.
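
Building the envelope, including the stale_warning option, could look like this sketch. It assumes all datetimes are timezone-aware UTC values and that stale_warning is a boolean placed inside _cache; the source names the field but does not specify its position or type.

```python
from datetime import datetime, timezone

ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"  # matches the timestamp style in the envelope above


def wrap_hit(data, tier: str, cached_at: datetime, expires_at: datetime,
             freshness_class: str, now: datetime) -> dict:
    """Build the observation envelope for a cache hit, flagging expiry instead of hiding it."""
    cache_meta = {
        "hit": True,
        "tier": tier,
        "cached_at": cached_at.strftime(ISO_UTC),
        "freshness_class": freshness_class,
        "expires_at": expires_at.strftime(ISO_UTC),
    }
    if now >= expires_at:
        # Assumed shape: a boolean flag the model can read and act on.
        cache_meta["stale_warning"] = True
    return {"ok": True, "data": data, "_cache": cache_meta}
```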

Harbor Integrations refactor

Four changes dropped duplicate external reads from 62% to 8% of tool calls (the remaining 8% were intentional refreshes after writes or TTL expiry):

  1. cache_policy on every tool in the manifest with freshness class and invalidation map.
  2. L0 in-run cache on the agent runtime so reflection loops reuse ticket and customer payloads.
  3. L1 Redis with tenant namespaces and 90-second TTL for ticket reads; webhook bust on ticket.updated.
  4. Singleflight on parallel duplicate reads in the same wave.

p95 run latency fell 41%, and Zendesk 429 rate-limit events dropped sharply. External API spend per resolved ticket, as tracked in cost attribution dashboards, now scales with unique entities touched, not reflection depth.

Technique decision table

Approach | Strengths | Weaknesses | Best for
No caching | Always fresh; simplest mental model | Duplicate calls; rate limits; high latency | Early prototypes, write-heavy agents
In-run memory only (recommended minimum) | Safe; kills reflection duplicates | No cross-run benefit | All production read tools
Cross-run exact-key cache (recommended) | Cuts API cost across sessions | Needs TTL + invalidation discipline | CRM, ticketing, catalog lookups
Prompt prefix caching | Cheaper model prefill | Does not reduce external API calls | Static system prompts, tool schemas
Semantic tool cache | Hits near-duplicate queries | Wrong-record risk; hard to invalidate | Public knowledge search only

Common pitfalls

  • Caching write tool responses — never treat mutation acknowledgments as reusable reads.
  • Keys without tenant_id — the worst bug class; one hit leaks another customer’s row.
  • Ignoring manifest version — schema drift turns hits into parse failures.
  • Unnormalized args — {"id": "123"} vs {"id": 123} hash to different keys, so both calls miss the cache.
  • Caching pre-redaction PII — encrypt or strip before Redis; audits still need lineage.
  • No invalidation on writes — agents confidently cite stale ticket status after an update tool ran.
  • Caching errors too long — negative cache of 500s blocks recovery after upstream heals.
  • Hidden cache hits — model cannot disclose freshness to the user when _cache metadata is stripped.

Production checklist

  • Declare cache_policy and freshness class per tool in the manifest.
  • Build cache keys from tenant + tool + version + canonical args.
  • Implement L0 in-run cache on every agent worker.
  • Add L1 tenant-scoped store for cross-run reads with TTL per class.
  • Return _cache metadata on hits for model and audit visibility.
  • Wire write tools to invalidation maps; add webhook bust where available.
  • Use singleflight for parallel duplicate reads in the same wave.
  • Encrypt or redact PII in cached values; purge on subject erasure.
  • Metric: cache hit rate, duplicate-call rate, stale-served count.
  • Integration test: cross-tenant key collision must miss, never hit wrong row.
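
The three metrics in the checklist can be tracked with a small counter, sketched below. The definition of a duplicate call used here (a key that was already requested earlier in the same tracking window) is an assumption; the source does not define how Harbor measured it.

```python
from collections import Counter


class CacheMetrics:
    """Count calls, hits, repeats, and stale serves for one tracking window."""

    def __init__(self):
        self.counts = Counter()
        self._seen_keys = set()

    def record(self, key_hash: str, hit: bool, stale: bool = False) -> None:
        self.counts["calls"] += 1
        if key_hash in self._seen_keys:
            self.counts["duplicate_calls"] += 1  # assumed definition of "duplicate"
        self._seen_keys.add(key_hash)
        if hit:
            self.counts["hits"] += 1
            if stale:
                self.counts["stale_served"] += 1

    def rates(self) -> dict:
        calls = self.counts["calls"] or 1  # avoid division by zero on an empty window
        return {
            "hit_rate": self.counts["hits"] / calls,
            "duplicate_call_rate": self.counts["duplicate_calls"] / calls,
            "stale_served": self.counts["stale_served"],
        }
```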

Key takeaways

  • Tool caching cuts duplicate external reads — especially across reflection and planner loops.
  • Exact normalized keys + tenant scope are non-negotiable; semantic cache is a special case.
  • Freshness classes beat one global TTL — ticket status and geocode do not share the same staleness budget.
  • Harbor Integrations cut duplicate external reads from 62% to 8% of tool calls with L0/L1 tiers and webhook invalidation.
  • Pair caching with summarization — cache the projection the model actually needs, not raw megabyte payloads.
