Guide

LLM agent conversation history compaction and summarization systems explained

Harbor Legal shipped a contract-review assistant that walked paralegals through redlines, cited clause IDs from a document store, and filed structured notes through function-calling tools. Staging threads rarely exceeded fifteen turns. In production, complex M&A packets ran thirty to fifty turns with large tool payloads. When the assembled prompt crossed the model’s context ceiling, the runtime dropped the oldest messages — including the user’s explicit “never waive Section 4.2(b)” constraint and the assistant’s own commitment to flag indemnity caps above $2M. Wrong answers on follow-up questions hit 31% after turn twenty-five. Traces showed the model wasn’t hallucinating randomly; it was answering from a truncated thread that no longer contained binding facts. After Harbor replaced naive head-chopping with a conversation compaction pipeline — structured rollups, anchor invariants, and tool-result archival — context-loss errors fell to 4.8% while median prefill tokens per turn dropped forty-one percent.

Compaction is not the same as long-term memory and not the same as shrinking a single tool observation. It is the middleware that rewrites the visible chat transcript before each model call so the agent stays within context budget without forgetting what already mattered. This guide covers compaction triggers, structured summary schemas, anchor preservation rules, tool-result tiering, incremental vs full re-summarization, the Harbor Legal refactor, a decision table versus naive truncation and vector-only recall, implementation pitfalls, and a production checklist.

What compaction does in the agent loop

Every turn, the runtime assembles a prompt from: system instructions, optional retrieved memory, the conversation transcript, and the latest user message. As turns accumulate, transcript tokens dominate. Compaction runs between turns (or immediately before a call when a threshold fires) to produce a shorter transcript that is semantically equivalent for decision-making.

Measure — count tokens in transcript + pending tool results against per-model budget headroom.
Classify — tag messages as anchor, compressible, or archivable (see below).
Roll up — replace a contiguous compressible span with a structured summary block.
Archive — move full text to durable storage keyed by message_id for replay and audit.
Validate — run invariant checks that anchors survived; reject summaries that drop required fields.

Compaction output is itself a message (often a synthetic system or assistant role block) prepended to the retained tail. Downstream steps must treat it as authoritative history, not optional flavor text.

Message classification taxonomy

Production pipelines assign each message a compaction class before any summarizer runs:

Class	Examples	Compaction rule
Anchor	User constraints, approved plans, legal holds, confirmed tool commits	Never summarized away; copy verbatim or extract to pinned anchor block
Tail	Last k user/assistant turns (typically 4–8)	Keep raw for conversational coherence and repair loops
Compressible	Exploratory Q&A, superseded drafts, redundant clarifications	Eligible for structured rollup
Tool payload	JSON API responses, search hit lists, document excerpts	Project via tool-result summarization before or during compaction
Ephemeral	Typing indicators, debug spans, duplicate retries	Drop entirely; log to observability only

Harbor tagged any user message containing modal verbs tied to obligations (“must”, “never”, “do not”) as anchor candidates. A lightweight classifier confirmed; false positives were cheaper than dropped constraints.

Structured summary schema

Free-text “summarize the conversation so far” prompts drift. Use a fixed JSON or markdown template the model must populate — same discipline as structured output pipelines:

{
  "rollup_version": 3,
  "covered_turns": [4, 22],
  "user_goals": ["Review indemnity cap in draft v3"],
  "constraints": ["Never waive Section 4.2(b)", "Cap > $2M requires partner review"],
  "decisions_made": ["Reject unlimited liability clause in §7"],
  "open_questions": ["Insurance certificate expiry date"],
  "tool_facts": [{"id": "clause-4.2b", "summary": "Mutual cap at 1x fees"}],
  "superseded": ["Draft v1 discussion — replaced by v3"]
}

Required fields depend on domain. Legal agents need constraints and decisions_made; support agents need ticket_ids and escalation_state. Validate the rollup against schema before swapping it into the transcript. Failed validation triggers a smaller compressible window or a dedicated re-read of archived messages — never silent truncation.

Incremental vs full re-summarization

Incremental: summarize only new compressible turns since the last rollup, then merge into the previous structured object. Cheaper; risk of merge bugs when decisions contradict.
Full: re-summarize the entire compressible span from archives each time. Higher cost; easier to reason about after policy changes.
Harbor used incremental for turns under thirty, full re-roll every tenth compaction or when a user edits an anchor.

Trigger design and budgets

Compaction should be proactive, not emergency-only:

Soft threshold — at 70% of transcript budget, compact oldest compressible span.
Hard threshold — at 85%, compact aggressively (shorter tail, stronger tool projection).
Pre-call guard — if still over budget after compaction, block the call and surface a user-visible “thread too long” with handoff options per session transfer patterns.
Post-tool spike — compact immediately after a tool returns a payload > N tokens, before the next model step.

Charge compaction summarizer tokens against the same run budget as the main agent. Hidden compaction spend is a common FinOps surprise.

Archival, replay, and audit

Compaction destroys raw text in the hot path but must not destroy evidence:

Store pre-compaction messages in durable run state with immutable message_id and content hash.
Attach rollup_hash and archived_span to traces so support can reconstruct what the model saw.
For regulated workflows, export anchor blocks to the audit trail before summarization — do not rely on the model to quote them faithfully.
Replay tests: feed golden long threads through compaction and assert required anchors appear in the assembled prompt.

Vector memory can supplement compaction but should not replace anchors. Semantic search may miss an exact dollar threshold; pinned constraints should not depend on embedding luck.

Harbor Legal refactor

Before: FIFO message drop when token_count > limit. After:

Compaction middleware registered in the hook pipeline after tool-result projection, before prompt cache keying.
Domain schema for rollups with CI fixtures from anonymized production threads.
Anchor extractor on user messages + confirmed tool writes (e.g. file_redline success).
Separate small model for rollups (cheaper, temperature 0) with schema validation and one repair attempt.
Dashboard: compaction events per run, tokens saved, validation failure rate, anchor miss alerts (post-hoc check that constraint strings appear in prompt).

Context-loss errors (answers contradicting earlier binding constraints) went from 31% to 4.8% on threads over twenty-five turns. Remaining failures were true ambiguities in source documents, not dropped history.

Decision table: compaction vs alternatives

Approach	When it fits	Production risk
Structured compaction pipeline (this guide)	Long multi-turn agents, tool-heavy threads, obligation-tracking workflows	Low when anchors are pinned and rollups are schema-validated
Naive oldest-message truncation	Prototypes, stateless Q&A, threads under ten turns	High — silent loss of constraints and tool facts
Vector memory only (no transcript rewrite)	Open-ended research with fuzzy recall needs	Medium — misses exact IDs, numbers, and negation
Unstructured one-paragraph summary	Low-stakes chat without commitments	Medium — summary drift; hard to test regressions
Hard thread length cap (reject new turns)	High-risk domains requiring human handoff	Low safety, poor UX unless paired with transfer flow

Compaction works alongside context-budget tiering: budgets decide how much room exists; compaction decides what fills that room. It complements episodic memory by feeding higher-quality rollups into long-term stores instead of raw noisy spans.

Common pitfalls

Summarizing anchor messages — one paraphrase changes “never” to “avoid”; pin verbatim constraints.
Compacting before tool errors resolve — loses error context needed for repair; keep failed tool exchanges in tail.
No archive of pre-compaction text — impossible to debug user reports of “it forgot what I said.”
Same model, same temperature for rollup and reply — creative summarization invents facts; use temperature 0 and schema gates.
Ignoring cache invalidation — changing the rollup block busts prompt cache prefixes; account for cost in trigger thresholds.
Compaction without regression tests — model upgrades change summary quality; golden threads should run in CI.

Production checklist

Define message classes (anchor, tail, compressible, tool, ephemeral) with explicit rules.
Implement soft and hard token thresholds; compact proactively, not at OOM.
Use a structured rollup schema validated before transcript swap.
Pin anchor constraints outside the compressible span or in a dedicated block.
Project large tool results before compaction; link to truncation policies.
Archive full pre-compaction messages to durable storage with content hashes.
Attach rollup_hash and span metadata to observability traces.
Run golden-thread CI tests asserting anchors survive compaction.
Block or hand off when hard budget exceeded after max compaction — never silent drop.
Meter compaction model tokens separately in run cost attribution.

Key takeaways

Long agent threads need a compaction pipeline, not FIFO truncation — oldest messages often carry binding constraints.
Structured rollups with schema validation beat one-paragraph summaries for testability and anchor preservation.
Classify messages before summarizing: anchors and recent tail stay raw; tool payloads get projected first.
Harbor Legal cut context-loss errors from 31% to 4.8% with pinned constraints, archival replay, and a dedicated rollup model.
Compaction rewrites the hot transcript; vector memory and durable checkpoints hold cold detail for audit and recall.