Guide
LLM conversation history management explained
Harbor Support's refund assistant handled a 16-turn billing dispute cleanly until turn 17, when it asked the customer to re-upload their receipt — even though the agent had already confirmed a $47.50 credit on turn 6 and promised email confirmation on turn 9. The model was not hallucinating; the early turns had been silently truncated when the assembled prompt crossed the gateway's 32K-token cap. System instructions and the latest user message survived; the credit promise did not.
Conversation history management is how you keep multi-turn chats coherent when the raw transcript no longer fits inside a context window. It is distinct from long-term agent memory (cross-session recall in vector stores) and from context compression (shrinking a single large document). This guide covers history taxonomy, token budget allocation, summarization triggers, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
Why raw transcripts fail
Every turn appends user and assistant messages. Tool calls, retrieved documents, and reasoning traces inflate token count faster than users expect. A 20-turn support thread with order JSON and policy snippets can exceed 40K tokens while the model only sees 32K.
Naive fixes create new failures:
- Drop oldest messages — loses commitments, account IDs, and constraints stated early.
- Keep only the last N turns — same problem with a predictable cutoff.
- Summarize everything into one blob — compresses away exact amounts, dates, and ticket numbers the model must quote verbatim.
- Never trim — hits hard limits, raises cost, and triggers lost-in-the-middle recall gaps even when the window technically fits.
Production systems need a budget-aware assembly pipeline: decide what must survive, what can be compressed, and what can be retrieved on demand.
History management taxonomy
Sliding window (verbatim recent turns)
Keep the last k user-assistant pairs in full text. Cheap, deterministic, easy to debug. Fails when critical facts live outside the window — exactly Harbor Support's bug. Best for short chats, demos, and flows where state lives in structured slots rather than free-form dialogue.
Rolling summarization
When cumulative history exceeds a threshold, summarize turns 1–n into a compact narrative block and retain recent turns verbatim. The summary is re-generated or incrementally updated as the thread grows. Preserves gist but risks losing precise numbers unless you extract them separately.
Facts block (structured state)
A JSON or bullet list of immutable session facts: order ID, refund amount, policy exception flags, user preferences. Updated by extraction after each turn, not by free-form summarization. Pinned near the system prompt so it survives every trim pass. This is the highest-leverage pattern for transactional bots.
Hybrid assembly
Typical production stack:
- System prompt + facts block (always included).
- Rolling summary of older turns (compressed).
- Last 4–8 turns verbatim (recency for tone and follow-ups).
- Current retrieval context (RAG chunks for this question only).
- Latest user message (never truncated).
External recall (not inline history)
Store full transcripts in a database; retrieve relevant past turns or memories by semantic search when needed. Overlaps with agent memory tiers but scoped to “this conversation” vs “all past sessions with this user.”
Token budget allocation
Treat the context window as a fixed pie. Reserve slices before filling chat history:
| Slice | Typical share | Notes |
|---|---|---|
| System + tools schema | 5–15% | Fixed overhead; count tool definitions |
| Facts block / session state | 3–8% | Pinned; never sacrificed to history |
| RAG / retrieved docs | 20–40% | Per-turn; rerank before inject |
| History (summary + recent) | 25–45% | Dynamic; summarize when over budget |
| Generation headroom | 10–20% | Reserved for model output |
Measure with your tokenizer (cl100k_base, SentencePiece, etc.) — character counts lie. Run assembly in code, not in prompt instructions: “please remember earlier” does not survive truncation.
Summarization triggers and quality
When to summarize:
- History slice exceeds its budget (hard trigger).
- Turn count crosses a threshold (e.g. every 6 turns after turn 10).
- Topic shift detected (embedding distance between latest user message and running summary).
- Before handoff to a more expensive model (compress first, then escalate).
How to summarize well:
- Use a dedicated summarization prompt with explicit fields: goals, decisions, open questions, quoted amounts and IDs.
- Run extraction in parallel: summary for narrative + structured JSON for facts block.
- Validate: if facts block omits a regex-matched order ID from the last turn, re-extract.
- Version summaries; store the pre-summary transcript in your DB for audit and human review.
- Never summarize tool outputs that contain authoritative numbers — copy them to facts block first.
Harbor Support ticket refactor
Pre-refactor: sliding window of last 12 messages, no facts block, RAG policy chunks competing with history for the same 32K cap. Escalation rate on turn 15+: 23%. Repeat questions (“what was the credit amount?”) spiked after turn 14.
The refactor:
- Session facts block — order_id, refund_amount, credit_status, promised_actions[], customer_email; updated by structured extraction after each assistant turn.
- Rolling summary — regenerated when history slice > 6K tokens; keeps tone and dispute narrative, not numbers.
- Verbatim tail — last 6 turns always full text.
- Budget assembler — fills slices in priority order; drops oldest summary paragraphs before touching facts or tail.
- Audit log — full transcript in Postgres; model never sees everything, humans can.
Post-refactor: repeat-question rate on long threads fell from 23% to 4% over three weeks. Average tokens per request dropped 18% because policy RAG stopped fighting a bloated raw history. CSAT on 15+ turn tickets rose 0.6 points.
Technique decision table
| Approach | Best when | Risk |
|---|---|---|
| Sliding window only | Short chats, stateless Q&A, prototypes | Silent amnesia on long threads |
| Rolling summary | Open-ended coaching, narrative support | Number drift; needs facts block alongside |
| Facts block + recent tail | Transactional bots, refunds, bookings | Extraction errors propagate; validate each turn |
| Full external memory | Personal assistants, months-long relationships | Retrieval noise; privacy and tenant isolation |
| Giant context, no management | Low-volume internal tools with budget | Cost, latency, middle-loss on 100K+ pastes |
| Prompt caching only | Stable long prefixes (system + docs) | Does not solve unbounded chat growth |
Common pitfalls
- Truncating without logging. You cannot debug amnesia if you do not know what was dropped.
- Summarizing after the fact on failure. Schedule summarization before you hit the hard cap, not when the API returns 400.
- Double-counting system instructions. Re-sending the full policy doc every turn burns budget history could use.
- Assistant apologies in history. “Sorry, I forgot” turns pollute summaries and reinforce bad behavior.
- Tool traces in verbatim tail. Raw JSON tool outputs balloon fast; compress to facts block and drop raw payloads.
- Assuming provider “memory” is yours. Vendor session features may not meet audit, PII, or cross-channel needs.
- No regression on turn 20+. Test long threads explicitly; turn 3 quality is meaningless.
Production checklist
- Token budget per slice documented and enforced in code.
- Full transcript persisted outside the model context.
- Session facts block with structured extraction after each turn.
- Rolling summary trigger threshold configured and tested.
- Verbatim recent-tail count chosen (typically 4–8 turns).
- Assembly priority: system > facts > RAG > summary > tail > user message.
- Summarization prompt requires quoted IDs, amounts, and dates.
- Extraction validation for critical fields (regex or schema check).
- Metrics: tokens per slice, turn count at first trim, repeat-question rate.
- Eval suite includes 15+ turn threads with buried commitments.
- Human-readable debug view of assembled prompt for support escalations.
- Re-run assembly tests when switching models or context limits.
Key takeaways
- History management is assembly engineering, not a bigger window.
- Facts blocks protect numbers; summaries protect narrative.
- Trim with priority order — never drop the latest user message.
- Persist full transcripts; the model sees a curated subset.
- Test turn 20, not turn 2 — that is where production breaks.
Related reading
- LLM agent memory explained — cross-session episodic and semantic recall tiers
- LLM context compression explained — shrinking large documents, not chat threads
- Context engineering explained — designing prompts as attention budgets
- LLM lost in the middle explained — why long pasted history still fails recall