Guide

LLM agent context overflow recovery mid-run systems explained

Harbor Research shipped a literature-review agent that chained fourteen tool steps across three hours: semantic search over an internal corpus, full-text PDF extraction, citation graph walks, and draft synthesis. Step 13 returned a consolidated bibliography payload of 182,400 tokens from a single ingest tool. The orchestrator appended it verbatim to the working context and called the planner model. The provider rejected the request with context_length_exceeded. The run had no compaction policy, no overflow handler, and no checkpoint after step 12 — users saw a hard failure after 52% of long runs hit the same wall. Support tickets blamed “the model forgot everything” when the real fault was unbounded observation growth mid-loop.

Harbor added tokenizer preflight on every append, a four-rung emergency compaction ladder, and anchor-preserving rollups before any model call. Abandoned runs fell from 52% to 3.1% while median task completion quality held within 2 points on an internal rubric. This guide explains how context overflow happens in production agent loops, detection before and after provider calls, preemptive budgets that prevent most overflows, mid-run recovery without losing commitments, the Harbor Research refactor, a decision table, pitfalls, and a production checklist.

What context overflow means in agent loops

A context overflow occurs when the serialized prompt — system instructions, tool definitions, conversation history, and tool observations — exceeds the model’s hard input limit or your platform’s configured safety cap. Unlike a single chat turn that grows slowly, agents compound context every loop iteration: each tool result becomes a permanent message until something removes or compresses it.

Why mid-run overflow is worse than startup overflow

Sunk work — hours of tool calls, paid API fees, and user waiting time disappear if the only recovery path is restart.
Hidden growth — one “small” JSON blob from a search or log-dump tool can dwarf the entire prior thread.
Provider variance — tokenizer counts differ from character heuristics; a prompt that passed staging on one model fails on the production route.
Cascade failures — retrying the same oversized payload without mutation burns rate limits and duplicates side effects if write tools are not idempotent.

Proactive context budgeting prevents most overflows; this guide covers what happens when budgets slip or a surprise payload arrives anyway.

Detection: catch overflow before the provider does

Production systems should treat context size as a first-class metric updated after every state mutation, not only when the API returns 400.

Preflight tokenizer pass

Run the same tokenizer the provider uses (or a conservative over-estimator) on the fully materialized prompt immediately before each model call. Store estimated_tokens, hard_limit, and headroom_pct on the run record. Trigger soft warnings at 80% and hard compaction at 90% of limit — do not wait for the provider error.

Per-append guards on tool observations

Before merging a tool result into working memory, compute delta_tokens. If the delta alone exceeds a per-tool cap, route through summarization or structured projection first. Never append first and compact later unless compaction is synchronous and blocking.

Provider error taxonomy

context_length_exceeded / max_tokens — input too large; requires compaction or model swap, not blind retry.
rate_limit — unrelated; do not confuse with overflow (retry with backoff).
invalid_request with token field hints — parse provider message for actual vs allowed counts when exposed.

Map each code to a recovery handler in the run FSM. Overflow handlers must never re-submit the identical payload.

Preemptive budgets that stop most overflows

Emergency compaction is expensive and lossy. Design the happy path so it rarely runs.

Fixed tier allocation

Reserve slices of the window up front: system + tools (static), user goal (pinned), rolling history (sliding), observation buffer (per-step cap), and generation reserve (output tokens). Reject new runs at enqueue time if the static slice already exceeds 40% of the model limit.

Observation buffer with spill to store

Tool results larger than the observation cap are written to object storage or a vector index; only a handle, schema summary, and first N rows enter context. The agent can re-fetch via a read_artifact tool with pagination. This pattern pairs with durable checkpointing so spill files survive pod restarts.

Scheduled compaction on step count

Every K planner steps (Harbor uses K=6), run lightweight history compaction even if headroom looks fine. Long runs drift upward from accumulated formatting overhead and tool metadata.

Emergency compaction ladder (mid-run recovery)

When preflight reports >90% utilization or the provider returns overflow, execute a deterministic ladder. Stop at the first rung that restores headroom; log which rung fired for postmortems.

Rung 1: Shed verbose tool traces

Replace full tool stdout with one-line outcome summaries for completed steps. Keep the last two tool results verbatim for local coherence. Never shed steps that contain unresolved user commitments or pending write confirmations.

Rung 2: Structured rollup of middle history

Collapse messages between fixed anchors (original user goal + last three turns) into a JSON rollup: goals, decisions made, open questions, artifact handles. Use a small, fast model or template extractor; validate rollup against anchor invariants before continuing.

Rung 3: Drop retrievable evidence

Move raw citations and log excerpts to external store; inject only IDs and one-sentence claims. The planner must use retrieval tools to expand detail again if needed.

Rung 4: Model swap or split subagent

If still over limit, route the next step to a larger-context model (with cost approval) or spawn a subagent with an isolated budget that returns only a bounded executive summary to the parent. See subagent delegation for parent-child contracts.

After any rung, re-run tokenizer preflight. Cap ladder attempts at two per step to avoid infinite compact-and-fail loops.

Harbor Research refactor (case study)

Harbor’s literature agent previously stored every PDF chunk inline. The refactor introduced three changes:

Ingest cap — PDF parser writes chunks to S3; context receives chunk index + 400-token abstract max.
Preflight gate — planner call blocked until headroom_pct >= 12%; otherwise rung 1–2 ladder executes automatically.
Checkpoint after compaction — WAL event records pre/post token counts and rollup hash so replay does not double-compact.

Abandoned runs dropped from 52% to 3.1%. Remaining failures were almost entirely user-attached files above the tenant upload cap — now rejected at ingress with a clear error instead of mid-run collapse.

Technique decision table

Approach	Strength	Weakness	Best for
Restart run from scratch	Simple implementation	Wastes work; bad UX on long jobs	Dev only
Naive tail truncation	Fast	Drops recent tool results; breaks coherence	Never in production
Per-tool observation caps + spill	Prevents most surprises	Needs artifact store + pagination tools	Default for data-heavy agents
Scheduled anchor compaction	Smooth token curve	CPU cost; summary quality risk	Long multi-step workflows
Emergency ladder on overflow error	Recovers edge cases	Lossy; harder to debug	Safety net behind preflight
Larger model on overflow	Preserves full context	Cost spike; not always available	High-value one-shot tasks

Pitfalls

Character-count estimates — English prose and code tokenize differently; use provider tokenizers.
Compacting after a failed call only — by then side-effecting tools may have already fired on a partial retry.
Summarizing away write confirmations — the agent re-issues duplicate CRM updates or payments.
Unbounded repair loops — compact, fail, compact again with no progress metric.
Ignoring tool definition tokens — shrinking history while leaving a 200-tool manifest consumes half the window.
No user-visible notice — silent compaction erodes trust when answers miss earlier constraints.
Skipping checkpoint before compaction — crash mid-ladder loses the only copy of full observations.

Production checklist

Tokenizer preflight on every model call; soft warn at 80%, compact at 90%.
Per-tool observation caps with spill-to-store for oversized payloads.
Fixed tier budget: static, pinned goal, history, observations, generation reserve.
Scheduled compaction every N steps on long-running agents.
Deterministic emergency ladder (shed traces → rollup → externalize → swap model).
Anchor invariants: never compact away user goals or pending write approvals.
Map provider overflow errors to compaction handlers, not blind retry.
Checkpoint WAL before and after compaction with token counts and rollup hash.
Emit user-visible notice when emergency compaction runs.
Metric: overflow rate, ladder rung distribution, post-compaction task success.
Regression-test ladder on recorded oversized tool payloads in CI.
Pair with dynamic tool routing to keep manifest size under control.

Key takeaways

Context overflow mid-run is an orchestration problem, not a model memory flaw — unbounded tool observations are the usual trigger.
Tokenizer preflight before every call catches overflows cheaper than provider 400 errors after a failed planner step.
Spill large observations to durable storage and pass handles into context instead of raw payloads.
A deterministic compaction ladder with anchor invariants recovers runs without restart-from-scratch.
Harbor Research cut abandoned runs from 52% to 3.1% with ingest caps, preflight gates, and checkpointed compaction.