Guide
LLM agent memory explained
A single-turn chatbot only needs the current prompt. An AI agent that runs for hours or returns next week needs memory: what happened in this session, what the user prefers, which files were already processed, and which facts are still true. Models do not persist state between API calls — memory is always something you engineer: conversation history in the context window, summaries written to a database, or embeddings retrieved from a vector store. This guide explains the memory tiers production agents use, when each pattern fits, and how to avoid agents that forget everything or remember things they should not.
Why agents need memory beyond the context window
Even models with million-token context windows cannot treat "dump everything" as a strategy. Long contexts cost money, slow inference, and models still lose focus in the middle of huge transcripts — the "lost in the middle" effect. Agents that call tools generate verbose traces: search results, SQL rows, API JSON, stack traces. Keeping all of that verbatim burns tokens fast.
Memory architecture answers three questions for every agent turn:
- What does the model see right now? — the working set injected into the prompt.
- What gets saved for later? — durable records the agent can retrieve in future sessions.
- How do we find the right saved facts? — retrieval, ranking, and freshness rules.
Good memory design is not about maximizing recall. It is about surfacing the minimum relevant context so the model can act correctly without drowning in noise or stale data.
Memory tiers: a practical taxonomy
Research papers use many labels; production teams converge on a small set of tiers that map cleanly to storage and retrieval mechanics:
Working memory (short-term)
Everything currently in the prompt: system instructions, recent user messages, tool outputs from this run, and any retrieved chunks. Working memory is ephemeral — it disappears when the session ends unless you persist it. Size is bounded by context limits and latency budgets.
Episodic memory (event logs)
Timestamped records of what the agent did: user messages, assistant replies, tool calls and results, errors, human approvals. Episodic logs power audit trails, debugging, and "what did we decide last Tuesday?" queries. Storage is usually append-only (Postgres, S3, event streams).
Semantic memory (facts and knowledge)
Distilled information not tied to a single transcript line: "User prefers metric units," "Project codename is Cedar," "API rate limit is 100 req/min." Semantic memory is often stored as key-value profiles, structured rows, or RAG chunks in a vector index. Retrieval is by similarity, metadata filters, or exact keys.
Procedural memory (how to act)
Not facts about the world but rules for behavior: tool schemas, workflow graphs, few-shot examples, fine-tuned weights. This lives in prompts, config files, or model weights — not typically in a user-specific memory DB.
Managing conversation history in context
The simplest memory is the message list you pass to the chat API. Three patterns dominate production systems:
- Sliding window — keep the last N turns; drop older messages. Cheap, but amnesia is guaranteed.
- Summarization — when history exceeds a token threshold, call the model (or a smaller model) to compress older turns into a rolling summary prepended to the prompt.
- Selective retention — always keep system prompt + user profile + last K turns; summarize or archive the rest to episodic storage.
Summaries save tokens but introduce compression loss. Mitigate by storing the raw episodic log alongside the summary so retrieval can pull verbatim quotes when precision matters (legal, finance, code). Never summarize away tool outputs that contain IDs, URLs, or numbers the agent must reuse verbatim.
Token budgeting should be explicit: reserve fixed slices for system prompt, memory injection, RAG results, and the model's reply. When a retrieval step returns too much, truncate by relevance score — not by arbitrary character cuts that split JSON.
External memory: vector stores and structured stores
When facts must survive across sessions or outgrow context, agents write to external systems and retrieve on demand — the same machinery as RAG, but oriented around agent experience rather than a static document corpus.
Vector memory
Each memory item is embedded and stored with metadata (user ID, session ID, timestamp, type, TTL). On each turn, embed the current query and fetch top-k neighbors. Use metadata filters aggressively: "memories for this user from the last 30 days about billing" beats a global similarity search that returns another user's notes.
Structured memory
User profiles, CRM fields, and task state fit better in SQL or document DBs than in vectors. A hybrid pattern works well: structured store for canonical facts ("plan: enterprise"), vector store for fuzzy recall ("that conversation about the Q3 roadmap"). Sync conflicts arise when both exist — pick one source of truth per field.
Scratchpads and tool-result caches
Agents often need a per-task notepad: intermediate calculations, draft outlines, file paths. Scratchpads can live in Redis with a session TTL — faster than embedding every scratch into long-term memory. Promote to semantic memory only when the agent (or a policy) decides the fact has lasting value.
Write policies: when should the agent remember?
Uncontrolled memory growth is a common failure mode. Agents "remember" noise, contradictions, and injected instructions disguised as user preferences. Define explicit write rules:
- Explicit user requests — "remember that I prefer dark mode" → high confidence write.
- Implicit extraction — model proposes a memory candidate; user confirms or auto-accept after N consistent mentions.
- Tool-gated writes — only a
save_memorytool can persist; the model must justify the entry in structured fields. - TTL and decay — episodic items expire; low-salience semantic facts fade unless reinforced.
Deduplicate before insert: embedding similarity against existing memories prevents fifty nearly identical entries about the same preference. Version memories when facts change ("office moved to Austin" supersedes "office in Denver") rather than appending contradictions the retriever may surface together.
Retrieval strategies at inference time
Memory is useless if the wrong items load into context. A typical retrieval pipeline:
- Query formation — use the latest user message, or a model-generated "memory query" step.
- Candidate fetch — vector search + structured lookups (profile row, open tasks).
- Reranking — cross-encoder or lightweight model scores candidates for relevance.
- Injection — format memories as a dedicated system or user block with citations and timestamps.
Tell the model which memories are authoritative vs advisory. Prefix entries with
[memory: 2026-05-12] so the model can discount stale items. For
multi-user workspaces, hard-filter by tenant ID before similarity search — vector
indexes do not enforce isolation by themselves.
Multi-agent setups split memory per agent role: a researcher agent writes episodic notes; a writer agent reads summaries; a supervisor holds the task graph. Shared memory buses (files, databases) need concurrency control the same as any distributed system.
Privacy, security, and forgetting
Memory is a liability surface. Stored transcripts may contain PII, credentials, health data, or trade secrets. Treat the memory layer with the same rigor as your primary database:
- Encrypt at rest; scope access by user and role.
- Redact secrets before write — regex and secret scanners on tool outputs.
- Support deletion (GDPR erasure) across episodic, vector, and summary stores.
- Audit who triggered each memory write and retrieval in production logs.
"Forget this" must actually delete embeddings and summaries, not just hide them from the UI. Pair memory with guardrails so retrieved content from untrusted documents cannot override system policy.
Common failure modes
- Context stuffing — retrieving too many memories; model ignores instructions. Fix with tighter top-k and reranking.
- Stale fact persistence — old memories contradict current APIs or policies. Fix with TTLs, version fields, and periodic reconciliation jobs.
- Memory poisoning — attacker plants "always approve refunds" in long-term store. Fix with write gates, human review, and retrieval filtering.
- False personalization — retrieving another user's memory due to missing tenant filter. Fix with mandatory metadata filters on every query.
- Summary drift — rolling summaries hallucinate details not in raw logs. Fix by retrieving verbatim episodic snippets for high-stakes turns.
Production checklist
- Define memory tiers (working, episodic, semantic) and what belongs in each.
- Set token budgets per slice of the prompt; measure fill rate in production.
- Implement summarization with raw-log fallback for precision tasks.
- Use write policies with deduplication, TTL, and supersession for changing facts.
- Retrieve with tenant filters, reranking, and timestamp visibility in injected context.
- Log memory reads/writes for audit; support user-initiated deletion.
- Eval retrieval quality on a fixed set of "remembered preference" scenarios.
- Load-test vector index growth; plan re-embedding when models change.
Key takeaways
- LLMs are stateless; memory is always external architecture you design.
- Working context, episodic logs, and semantic stores solve different problems — use all three when agents run long or return later.
- Summarization saves tokens but loses detail; keep raw episodic data for retrieval and audit.
- Write policies matter as much as retrieval — uncontrolled memory poisons personalization.
- Tenant isolation and deletion are non-negotiable for multi-user agent products.
- Start simple (sliding window + profile row); add vector memory when cross-session recall justifies the complexity.
Related reading
- AI agents and tool use explained — planning loops where memory feeds the next tool call
- RAG explained — retrieval pipelines shared with semantic agent memory
- LLM context windows explained — hard limits that force summarization and external stores
- Vector databases explained — indexes and filters behind semantic memory