Guide
LLM agent conversation history compaction and summarization systems explained
Harbor Legal shipped a contract-review assistant that walked paralegals through redlines, cited clause IDs from a document store, and filed structured notes through function-calling tools. Staging threads rarely exceeded fifteen turns. In production, complex M&A packets ran thirty to fifty turns with large tool payloads. When the assembled prompt crossed the model’s context ceiling, the runtime dropped the oldest messages — including the user’s explicit “never waive Section 4.2(b)” constraint and the assistant’s own commitment to flag indemnity caps above $2M. Wrong answers on follow-up questions hit 31% after turn twenty-five. Traces showed the model wasn’t hallucinating randomly; it was answering from a truncated thread that no longer contained binding facts. After Harbor replaced naive head-chopping with a conversation compaction pipeline — structured rollups, anchor invariants, and tool-result archival — context-loss errors fell to 4.8% while median prefill tokens per turn dropped forty-one percent.
Compaction is not the same as long-term memory and not the same as shrinking a single tool observation. It is the middleware that rewrites the visible chat transcript before each model call so the agent stays within context budget without forgetting what already mattered. This guide covers compaction triggers, structured summary schemas, anchor preservation rules, tool-result tiering, incremental vs full re-summarization, the Harbor Legal refactor, a decision table versus naive truncation and vector-only recall, implementation pitfalls, and a production checklist.
What compaction does in the agent loop
Every turn, the runtime assembles a prompt from: system instructions, optional retrieved memory, the conversation transcript, and the latest user message. As turns accumulate, transcript tokens dominate. Compaction runs between turns (or immediately before a call when a threshold fires) to produce a shorter transcript that is semantically equivalent for decision-making.
- Measure — count tokens in transcript + pending tool results against per-model budget headroom.
- Classify — tag messages as anchor, compressible, or archivable (see below).
- Roll up — replace a contiguous compressible span with a structured summary block.
- Archive — move full text to durable storage keyed by message_id for replay and audit.
- Validate — run invariant checks that anchors survived; reject summaries that drop required fields.
Compaction output is itself a message (often a synthetic system or assistant role block) prepended to the retained tail. Downstream steps must treat it as authoritative history, not optional flavor text.
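The measure, roll up, archive, and validate steps can be sketched as a single pass. This is a minimal illustration under stated assumptions, not Harbor's implementation: the `Message` type, the `klass` field (classification is assumed to have already happened), the whitespace-based `count_tokens`, and the caller-supplied `summarize` callable are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    message_id: str
    role: str
    text: str
    klass: str  # "anchor", "tail", "compressible", or "ephemeral" (assumed labels)


def count_tokens(text: str) -> int:
    # Placeholder: whitespace word count, not a real tokenizer.
    return len(text.split())


def compact(messages, budget, archive, summarize):
    """Run one compaction pass and return the new transcript."""
    # Measure: skip the pass while the transcript fits the budget.
    if sum(count_tokens(m.text) for m in messages) <= budget:
        return list(messages)

    span = [m for m in messages if m.klass == "compressible"]
    if not span:
        return list(messages)

    # Archive: keep the full text retrievable by message_id.
    for m in span:
        archive[m.message_id] = m.text

    # Roll up: one synthetic message stands in for the whole span.
    rollup = Message(
        message_id=f"rollup:{span[0].message_id}..{span[-1].message_id}",
        role="system",
        text=summarize(span),
        klass="rollup",
    )
    anchors = [m for m in messages if m.klass == "anchor"]
    tail = [m for m in messages if m.klass == "tail"]
    compacted = anchors + [rollup] + tail  # ephemeral messages are dropped

    # Validate: every anchor must appear verbatim in the assembled text.
    prompt = "\n".join(m.text for m in compacted)
    lost = [m.message_id for m in anchors if m.text not in prompt]
    if lost:
        raise ValueError(f"anchors lost during compaction: {lost}")
    return compacted
```

In this sketch anchors are pinned ahead of the rollup and the raw tail follows it. A real pipeline would roll up contiguous spans and use the model's own tokenizer.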
Message classification taxonomy
Production pipelines assign each message a compaction class before any summarizer runs:
| Class | Examples | Compaction rule |
|---|---|---|
| Anchor | User constraints, approved plans, legal holds, confirmed tool commits | Never summarized away; copy verbatim or extract to pinned anchor block |
| Tail | Last k user/assistant turns (typically 4–8) | Keep raw for conversational coherence and repair loops |
| Compressible | Exploratory Q&A, superseded drafts, redundant clarifications | Eligible for structured rollup |
| Tool payload | JSON API responses, search hit lists, document excerpts | Project via tool-result summarization before or during compaction |
| Ephemeral | Typing indicators, debug spans, duplicate retries | Drop entirely; log to observability only |
Harbor tagged any user message containing modal verbs tied to obligations (“must”, “never”, “do not”) as anchor candidates. A lightweight classifier confirmed; false positives were cheaper than dropped constraints.
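A first-pass tagger for anchor candidates might look like the sketch below. The regular expression covers only the three phrases quoted above, and the function name is hypothetical. The confirming classifier is not shown, since the source does not describe it.

```python
import re

# Obligation phrases quoted in the text; a production list would be longer.
OBLIGATION_PATTERN = re.compile(r"\b(must|never|do not)\b", re.IGNORECASE)


def is_anchor_candidate(role: str, text: str) -> bool:
    """Flag user messages that contain obligation language.

    Deliberately over-inclusive: a false positive only costs a few pinned
    tokens, while a miss can drop a binding constraint.
    """
    return role == "user" and OBLIGATION_PATTERN.search(text) is not None
```

For example, `is_anchor_candidate("user", "Never waive Section 4.2(b)")` is true, while the same text from the assistant role is not flagged by this particular rule.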
Structured summary schema
Free-text “summarize the conversation so far” prompts drift. Use a fixed JSON or markdown template the model must populate — same discipline as structured output pipelines:
```json
{
  "rollup_version": 3,
  "covered_turns": [4, 22],
  "user_goals": ["Review indemnity cap in draft v3"],
  "constraints": ["Never waive Section 4.2(b)", "Cap > $2M requires partner review"],
  "decisions_made": ["Reject unlimited liability clause in §7"],
  "open_questions": ["Insurance certificate expiry date"],
  "tool_facts": [{"id": "clause-4.2b", "summary": "Mutual cap at 1x fees"}],
  "superseded": ["Draft v1 discussion — replaced by v3"]
}
```
Required fields depend on domain. Legal agents need constraints and decisions_made; support agents need ticket_ids and escalation_state. Validate the rollup against schema before swapping it into the transcript. Failed validation triggers a smaller compressible window or a dedicated re-read of archived messages, never silent truncation.
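A validation gate for a rollup can be small. The sketch below uses only the standard library and checks the fields a legal agent would require. The `REQUIRED_FIELDS` table and the reading of `covered_turns` as a `[first, last]` pair are assumptions made for illustration.

```python
# Required fields for a legal-domain rollup (assumed; adjust per domain).
REQUIRED_FIELDS = {
    "rollup_version": int,
    "covered_turns": list,
    "constraints": list,
    "decisions_made": list,
}


def validate_rollup(rollup: dict) -> list[str]:
    """Return a list of problems; an empty list means the rollup may be swapped in."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in rollup:
            errors.append(f"missing field: {name}")
        elif not isinstance(rollup[name], expected_type):
            errors.append(f"wrong type for field: {name}")

    # Assumption: covered_turns is a [first, last] pair of turn numbers.
    turns = rollup.get("covered_turns")
    if isinstance(turns, list) and (len(turns) != 2 or turns[0] > turns[1]):
        errors.append("covered_turns must be [first, last]")
    return errors
```

A non-empty result is the signal to retry with a smaller window or re-read the archive rather than swap the rollup into the transcript.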
Incremental vs full re-summarization
Incremental: summarize only new compressible turns since the last rollup, then merge into the previous structured object. Cheaper, but merge bugs can appear when a new decision contradicts an earlier one.
Full: re-summarize the entire compressible span from archives each time. Higher cost; easier to reason about after policy changes.
Harbor used incremental rollups for threads under thirty turns, with a full re-roll every tenth compaction or whenever a user edited an anchor.
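An incremental merge and a re-roll policy could be sketched as follows. The function names and the list-union merge are assumptions, not Harbor's code. This naive merge does not detect contradicting decisions, which is exactly the merge-bug risk named above.

```python
LIST_FIELDS = ("user_goals", "constraints", "decisions_made", "open_questions")


def merge_rollups(previous: dict, delta: dict) -> dict:
    """Fold a rollup of newly compressed turns into the previous rollup."""
    merged = dict(previous)
    # Extend the covered range to the last turn in the delta.
    merged["covered_turns"] = [previous["covered_turns"][0], delta["covered_turns"][1]]
    for key in LIST_FIELDS:
        combined = list(previous.get(key, []))
        for item in delta.get(key, []):
            if item not in combined:  # order-preserving de-duplication
                combined.append(item)
        merged[key] = combined
    return merged


def needs_full_reroll(compaction_count: int, anchor_edited: bool, every: int = 10) -> bool:
    """Full re-summarization every `every` compactions or after an anchor edit."""
    return anchor_edited or (compaction_count > 0 and compaction_count % every == 0)
```

Periodic full re-rolls bound how long a bad merge can persist, which is one reason to pair the two modes rather than pick one.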
Trigger design and budgets
Compaction should be proactive, not emergency-only:
- Soft threshold — at 70% of transcript budget, compact oldest compressible span.
- Hard threshold — at 85%, compact aggressively (shorter tail, stronger tool projection).
- Pre-call guard — if still over budget after compaction, block the call and surface a user-visible “thread too long” with handoff options per session transfer patterns.
- Post-tool spike — compact immediately after a tool returns a payload > N tokens, before the next model step.
Charge compaction summarizer tokens against the same run budget as the main agent. Hidden compaction spend is a common FinOps surprise.
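The thresholds above translate into a small decision function. In this sketch the 70% and 85% figures come from the text. The `spike_limit` default stands in for the unspecified N and is a placeholder, and treating a post-tool spike as a soft compaction is an assumption.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    SOFT_COMPACT = "soft_compact"
    HARD_COMPACT = "hard_compact"


SOFT_PCT = 70  # compact the oldest compressible span
HARD_PCT = 85  # compact aggressively


def choose_action(transcript_tokens: int, budget: int,
                  last_tool_tokens: int = 0, spike_limit: int = 2000) -> Action:
    """Pick a compaction action from current usage (spike_limit is a placeholder)."""
    # Integer math avoids float rounding at the exact threshold.
    if transcript_tokens * 100 >= HARD_PCT * budget:
        return Action.HARD_COMPACT
    if transcript_tokens * 100 >= SOFT_PCT * budget:
        return Action.SOFT_COMPACT
    if last_tool_tokens > spike_limit:
        return Action.SOFT_COMPACT  # post-tool spike
    return Action.NONE


def pre_call_guard(tokens_after_compaction: int, budget: int) -> bool:
    """Return True when the call may proceed; False means block and hand off."""
    return tokens_after_compaction <= budget
```

The guard runs after compaction, so a `False` result means the thread cannot be made to fit and should be surfaced to the user instead of truncated.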
Archival, replay, and audit
Compaction destroys raw text in the hot path but must not destroy evidence:
- Store pre-compaction messages in durable run state with immutable message_id and content hash.
- Attach rollup_hash and archived_span to traces so support can reconstruct what the model saw.
- For regulated workflows, export anchor blocks to the audit trail before summarization — do not rely on the model to quote them faithfully.
- Replay tests: feed golden long threads through compaction and assert required anchors appear in the assembled prompt.
Vector memory can supplement compaction but should not replace anchors. Semantic search may miss an exact dollar threshold; pinned constraints should not depend on embedding luck.
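The archival bullets can be illustrated with content hashing from the standard library. This is a sketch under assumptions: the in-memory `store` dict stands in for durable run state, the message dicts with `message_id` and `text` keys are a hypothetical shape, and SHA-256 over canonical JSON is one reasonable choice for `rollup_hash`, not something the source specifies.

```python
import hashlib
import json


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def archive_span(store: dict, messages: list[dict]) -> dict:
    """Write messages to the archive and return span metadata for the trace."""
    if not messages:
        raise ValueError("nothing to archive")
    for m in messages:
        record = {"text": m["text"], "hash": content_hash(m["text"])}
        existing = store.get(m["message_id"])
        # Immutability: an archived message_id may never change content.
        if existing is not None and existing != record:
            raise ValueError(f"archived message {m['message_id']} is immutable")
        store[m["message_id"]] = record
    return {"archived_span": [messages[0]["message_id"], messages[-1]["message_id"]]}


def rollup_hash(rollup: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(rollup, sort_keys=True, separators=(",", ":"))
    return content_hash(canonical)
```

With both values on the trace, support can fetch the archived span, recompute the hashes, and confirm what the model saw on a given turn.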
Harbor Legal refactor
Before: FIFO message drop when token_count > limit. After:
- Compaction middleware registered in the hook pipeline after tool-result projection, before prompt cache keying.
- Domain schema for rollups with CI fixtures from anonymized production threads.
- Anchor extractor on user messages + confirmed tool writes (e.g. file_redline success).
- Separate small model for rollups (cheaper, temperature 0) with schema validation and one repair attempt.
- Dashboard: compaction events per run, tokens saved, validation failure rate, anchor miss alerts (post-hoc check that constraint strings appear in prompt).
Context-loss errors (answers contradicting earlier binding constraints) went from 31% to 4.8% on threads over twenty-five turns. Remaining failures were true ambiguities in source documents, not dropped history.
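The post-hoc anchor check behind the dashboard alert can be as simple as a substring test over the assembled prompt. The helper below is a hypothetical sketch. It normalizes whitespace so that line wrapping in the prompt does not cause false alarms, and it would also serve as the assertion in a golden-thread test.

```python
def _normalize(text: str) -> str:
    # Collapse runs of whitespace so wrapping does not hide a match.
    return " ".join(text.split())


def missing_anchors(anchors: list[str], assembled_prompt: str) -> list[str]:
    """Return the anchor strings that do not appear verbatim in the prompt."""
    haystack = _normalize(assembled_prompt)
    return [a for a in anchors if _normalize(a) not in haystack]
```

An empty result means every pinned constraint reached the model. Anything else is worth an alert, because a paraphrased anchor counts as missing under this check.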
Decision table: compaction vs alternatives
| Approach | When it fits | Production risk |
|---|---|---|
| Structured compaction pipeline (this guide) | Long multi-turn agents, tool-heavy threads, obligation-tracking workflows | Low when anchors are pinned and rollups are schema-validated |
| Naive oldest-message truncation | Prototypes, stateless Q&A, threads under ten turns | High — silent loss of constraints and tool facts |
| Vector memory only (no transcript rewrite) | Open-ended research with fuzzy recall needs | Medium — misses exact IDs, numbers, and negation |
| Unstructured one-paragraph summary | Low-stakes chat without commitments | Medium — summary drift; hard to test regressions |
| Hard thread length cap (reject new turns) | High-risk domains requiring human handoff | Low safety risk, but poor UX unless paired with a transfer flow |
Compaction works alongside context-budget tiering: budgets decide how much room exists; compaction decides what fills that room. It complements episodic memory by feeding higher-quality rollups into long-term stores instead of raw noisy spans.
Common pitfalls
- Summarizing anchor messages — one paraphrase changes “never” to “avoid”; pin verbatim constraints.
- Compacting before tool errors resolve — loses error context needed for repair; keep failed tool exchanges in tail.
- No archive of pre-compaction text — impossible to debug user reports of “it forgot what I said.”
- Same model, same temperature for rollup and reply — creative summarization invents facts; use temperature 0 and schema gates.
- Ignoring cache invalidation — changing the rollup block busts prompt cache prefixes; account for cost in trigger thresholds.
- Compaction without regression tests — model upgrades change summary quality; golden threads should run in CI.
Production checklist
- Define message classes (anchor, tail, compressible, tool, ephemeral) with explicit rules.
- Implement soft and hard token thresholds; compact proactively, not at OOM.
- Use a structured rollup schema validated before transcript swap.
- Pin anchor constraints outside the compressible span or in a dedicated block.
- Project large tool results before compaction; link to truncation policies.
- Archive full pre-compaction messages to durable storage with content hashes.
- Attach rollup_hash and span metadata to observability traces.
- Run golden-thread CI tests asserting anchors survive compaction.
- Block or hand off when hard budget exceeded after max compaction — never silent drop.
- Meter compaction model tokens separately in run cost attribution.
Key takeaways
- Long agent threads need a compaction pipeline, not FIFO truncation — oldest messages often carry binding constraints.
- Structured rollups with schema validation beat one-paragraph summaries for testability and anchor preservation.
- Classify messages before summarizing: anchors and recent tail stay raw; tool payloads get projected first.
- Harbor Legal cut context-loss errors from 31% to 4.8% with pinned constraints, archival replay, and a dedicated rollup model.
- Compaction rewrites the hot transcript; vector memory and durable checkpoints hold cold detail for audit and recall.
Related reading
- LLM agent context budget and token management explained — per-turn allocation and budget tiers
- LLM agent memory explained — episodic logs and vector recall beyond the hot transcript
- LLM agent tool result summarization and truncation explained — shrinking observations before rollups
- LLM agent durable state and checkpointing explained — archival and crash-safe run records