Guide

LLM conversation history management explained

Harbor Support's refund assistant handled a 16-turn billing dispute cleanly until turn 17, when it asked the customer to re-upload their receipt — even though the agent had already confirmed a $47.50 credit on turn 6 and promised email confirmation on turn 9. The model was not hallucinating; the early turns had been silently truncated when the assembled prompt crossed the gateway's 32K-token cap. System instructions and the latest user message survived; the credit promise did not.

Conversation history management is how you keep multi-turn chats coherent when the raw transcript no longer fits inside a context window. It is distinct from long-term agent memory (cross-session recall in vector stores) and from context compression (shrinking a single large document). This guide covers history taxonomy, token budget allocation, summarization triggers, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why raw transcripts fail

Every turn appends user and assistant messages. Tool calls, retrieved documents, and reasoning traces inflate token count faster than users expect. A 20-turn support thread with order JSON and policy snippets can exceed 40K tokens while the model only sees 32K.

Naive fixes create new failures:

  • Drop oldest messages — loses commitments, account IDs, and constraints stated early.
  • Keep only the last N turns — same problem with a predictable cutoff.
  • Summarize everything into one blob — compresses away exact amounts, dates, and ticket numbers the model must quote verbatim.
  • Never trim — hits hard limits, raises cost, and triggers lost-in-the-middle recall gaps even when the window technically fits.

Production systems need a budget-aware assembly pipeline: decide what must survive, what can be compressed, and what can be retrieved on demand.

History management taxonomy

Sliding window (verbatim recent turns)

Keep the last k user-assistant pairs in full text. Cheap, deterministic, easy to debug. Fails when critical facts live outside the window — exactly Harbor Support's bug. Best for short chats, demos, and flows where state lives in structured slots rather than free-form dialogue.

Rolling summarization

When cumulative history exceeds a threshold, summarize turns 1–n into a compact narrative block and retain recent turns verbatim. The summary is re-generated or incrementally updated as the thread grows. Preserves gist but risks losing precise numbers unless you extract them separately.

Facts block (structured state)

A JSON or bullet list of immutable session facts: order ID, refund amount, policy exception flags, user preferences. Updated by extraction after each turn, not by free-form summarization. Pinned near the system prompt so it survives every trim pass. This is the highest-leverage pattern for transactional bots.

Hybrid assembly

Typical production stack:

  1. System prompt + facts block (always included).
  2. Rolling summary of older turns (compressed).
  3. Last 4–8 turns verbatim (recency for tone and follow-ups).
  4. Current retrieval context (RAG chunks for this question only).
  5. Latest user message (never truncated).

External recall (not inline history)

Store full transcripts in a database; retrieve relevant past turns or memories by semantic search when needed. Overlaps with agent memory tiers but scoped to “this conversation” vs “all past sessions with this user.”

Token budget allocation

Treat the context window as a fixed pie. Reserve slices before filling chat history:

Slice Typical share Notes
System + tools schema 5–15% Fixed overhead; count tool definitions
Facts block / session state 3–8% Pinned; never sacrificed to history
RAG / retrieved docs 20–40% Per-turn; rerank before inject
History (summary + recent) 25–45% Dynamic; summarize when over budget
Generation headroom 10–20% Reserved for model output

Measure with your tokenizer (cl100k_base, SentencePiece, etc.) — character counts lie. Run assembly in code, not in prompt instructions: “please remember earlier” does not survive truncation.

Summarization triggers and quality

When to summarize:

  • History slice exceeds its budget (hard trigger).
  • Turn count crosses a threshold (e.g. every 6 turns after turn 10).
  • Topic shift detected (embedding distance between latest user message and running summary).
  • Before handoff to a more expensive model (compress first, then escalate).

How to summarize well:

  • Use a dedicated summarization prompt with explicit fields: goals, decisions, open questions, quoted amounts and IDs.
  • Run extraction in parallel: summary for narrative + structured JSON for facts block.
  • Validate: if facts block omits a regex-matched order ID from the last turn, re-extract.
  • Version summaries; store the pre-summary transcript in your DB for audit and human review.
  • Never summarize tool outputs that contain authoritative numbers — copy them to facts block first.

Harbor Support ticket refactor

Pre-refactor: sliding window of last 12 messages, no facts block, RAG policy chunks competing with history for the same 32K cap. Escalation rate on turn 15+: 23%. Repeat questions (“what was the credit amount?”) spiked after turn 14.

The refactor:

  1. Session facts block — order_id, refund_amount, credit_status, promised_actions[], customer_email; updated by structured extraction after each assistant turn.
  2. Rolling summary — regenerated when history slice > 6K tokens; keeps tone and dispute narrative, not numbers.
  3. Verbatim tail — last 6 turns always full text.
  4. Budget assembler — fills slices in priority order; drops oldest summary paragraphs before touching facts or tail.
  5. Audit log — full transcript in Postgres; model never sees everything, humans can.

Post-refactor: repeat-question rate on long threads fell from 23% to 4% over three weeks. Average tokens per request dropped 18% because policy RAG stopped fighting a bloated raw history. CSAT on 15+ turn tickets rose 0.6 points.

Technique decision table

Approach Best when Risk
Sliding window only Short chats, stateless Q&A, prototypes Silent amnesia on long threads
Rolling summary Open-ended coaching, narrative support Number drift; needs facts block alongside
Facts block + recent tail Transactional bots, refunds, bookings Extraction errors propagate; validate each turn
Full external memory Personal assistants, months-long relationships Retrieval noise; privacy and tenant isolation
Giant context, no management Low-volume internal tools with budget Cost, latency, middle-loss on 100K+ pastes
Prompt caching only Stable long prefixes (system + docs) Does not solve unbounded chat growth

Common pitfalls

  • Truncating without logging. You cannot debug amnesia if you do not know what was dropped.
  • Summarizing after the fact on failure. Schedule summarization before you hit the hard cap, not when the API returns 400.
  • Double-counting system instructions. Re-sending the full policy doc every turn burns budget history could use.
  • Assistant apologies in history. “Sorry, I forgot” turns pollute summaries and reinforce bad behavior.
  • Tool traces in verbatim tail. Raw JSON tool outputs balloon fast; compress to facts block and drop raw payloads.
  • Assuming provider “memory” is yours. Vendor session features may not meet audit, PII, or cross-channel needs.
  • No regression on turn 20+. Test long threads explicitly; turn 3 quality is meaningless.

Production checklist

  • Token budget per slice documented and enforced in code.
  • Full transcript persisted outside the model context.
  • Session facts block with structured extraction after each turn.
  • Rolling summary trigger threshold configured and tested.
  • Verbatim recent-tail count chosen (typically 4–8 turns).
  • Assembly priority: system > facts > RAG > summary > tail > user message.
  • Summarization prompt requires quoted IDs, amounts, and dates.
  • Extraction validation for critical fields (regex or schema check).
  • Metrics: tokens per slice, turn count at first trim, repeat-question rate.
  • Eval suite includes 15+ turn threads with buried commitments.
  • Human-readable debug view of assembled prompt for support escalations.
  • Re-run assembly tests when switching models or context limits.

Key takeaways

  • History management is assembly engineering, not a bigger window.
  • Facts blocks protect numbers; summaries protect narrative.
  • Trim with priority order — never drop the latest user message.
  • Persist full transcripts; the model sees a curated subset.
  • Test turn 20, not turn 2 — that is where production breaks.

Related reading