Guide
LLM agent context overflow recovery mid-run systems explained
Harbor Research shipped a literature-review agent that chained fourteen tool
steps across three hours: semantic search over an internal corpus, full-text
PDF extraction, citation graph walks, and draft synthesis. Step 13 returned a
consolidated bibliography payload of 182,400 tokens from a
single ingest tool. The orchestrator appended it verbatim to the working
context and called the planner model. The provider rejected the request with
context_length_exceeded. The run had no compaction policy, no
overflow handler, and no checkpoint after step 12 — users saw a hard
failure after 52% of long runs hit the same wall. Support
tickets blamed “the model forgot everything” when the real fault
was unbounded observation growth mid-loop.
Harbor added tokenizer preflight on every append, a four-rung emergency compaction ladder, and anchor-preserving rollups before any model call. Abandoned runs fell from 52% to 3.1% while median task completion quality held within 2 points on an internal rubric. This guide explains how context overflow happens in production agent loops, detection before and after provider calls, preemptive budgets that prevent most overflows, mid-run recovery without losing commitments, the Harbor Research refactor, a decision table, pitfalls, and a production checklist.
What context overflow means in agent loops
A context overflow occurs when the serialized prompt — system instructions, tool definitions, conversation history, and tool observations — exceeds the model’s hard input limit or your platform’s configured safety cap. Unlike a single chat turn that grows slowly, agents compound context every loop iteration: each tool result becomes a permanent message until something removes or compresses it.
Why mid-run overflow is worse than startup overflow
- Sunk work — hours of tool calls, paid API fees, and user waiting time disappear if the only recovery path is restart.
- Hidden growth — one “small” JSON blob from a search or log-dump tool can dwarf the entire prior thread.
- Provider variance — tokenizer counts differ from character heuristics; a prompt that passed staging on one model fails on the production route.
- Cascade failures — retrying the same oversized payload without mutation burns rate limits and duplicates side effects if write tools are not idempotent.
Proactive context budgeting prevents most overflows; this guide covers what happens when budgets slip or a surprise payload arrives anyway.
Detection: catch overflow before the provider does
Production systems should treat context size as a first-class metric updated after every state mutation, not only when the API returns 400.
Preflight tokenizer pass
Run the same tokenizer the provider uses (or a conservative over-estimator)
on the fully materialized prompt immediately before each model call. Store
estimated_tokens, hard_limit, and
headroom_pct on the run record. Trigger soft warnings at 80%
and hard compaction at 90% of limit — do not wait for the provider
error.
Per-append guards on tool observations
Before merging a tool result into working memory, compute
delta_tokens. If the delta alone exceeds a per-tool cap,
route through
summarization or structured projection
first. Never append first and compact later unless compaction is
synchronous and blocking.
Provider error taxonomy
- context_length_exceeded / max_tokens — input too large; requires compaction or model swap, not blind retry.
- rate_limit — unrelated; do not confuse with overflow (retry with backoff).
- invalid_request with token field hints — parse provider message for actual vs allowed counts when exposed.
Map each code to a recovery handler in the run FSM. Overflow handlers must never re-submit the identical payload.
Preemptive budgets that stop most overflows
Emergency compaction is expensive and lossy. Design the happy path so it rarely runs.
Fixed tier allocation
Reserve slices of the window up front: system + tools (static), user goal (pinned), rolling history (sliding), observation buffer (per-step cap), and generation reserve (output tokens). Reject new runs at enqueue time if the static slice already exceeds 40% of the model limit.
Observation buffer with spill to store
Tool results larger than the observation cap are written to object storage
or a vector index; only a handle, schema summary, and first N rows enter
context. The agent can re-fetch via a read_artifact tool with
pagination. This pattern pairs with
durable checkpointing
so spill files survive pod restarts.
Scheduled compaction on step count
Every K planner steps (Harbor uses K=6), run lightweight history compaction even if headroom looks fine. Long runs drift upward from accumulated formatting overhead and tool metadata.
Emergency compaction ladder (mid-run recovery)
When preflight reports >90% utilization or the provider returns overflow, execute a deterministic ladder. Stop at the first rung that restores headroom; log which rung fired for postmortems.
Rung 1: Shed verbose tool traces
Replace full tool stdout with one-line outcome summaries for completed steps. Keep the last two tool results verbatim for local coherence. Never shed steps that contain unresolved user commitments or pending write confirmations.
Rung 2: Structured rollup of middle history
Collapse messages between fixed anchors (original user goal + last three turns) into a JSON rollup: goals, decisions made, open questions, artifact handles. Use a small, fast model or template extractor; validate rollup against anchor invariants before continuing.
Rung 3: Drop retrievable evidence
Move raw citations and log excerpts to external store; inject only IDs and one-sentence claims. The planner must use retrieval tools to expand detail again if needed.
Rung 4: Model swap or split subagent
If still over limit, route the next step to a larger-context model (with cost approval) or spawn a subagent with an isolated budget that returns only a bounded executive summary to the parent. See subagent delegation for parent-child contracts.
After any rung, re-run tokenizer preflight. Cap ladder attempts at two per step to avoid infinite compact-and-fail loops.
Harbor Research refactor (case study)
Harbor’s literature agent previously stored every PDF chunk inline. The refactor introduced three changes:
- Ingest cap — PDF parser writes chunks to S3; context receives chunk index + 400-token abstract max.
- Preflight gate — planner call blocked until
headroom_pct >= 12%; otherwise rung 1–2 ladder executes automatically. - Checkpoint after compaction — WAL event records pre/post token counts and rollup hash so replay does not double-compact.
Abandoned runs dropped from 52% to 3.1%. Remaining failures were almost entirely user-attached files above the tenant upload cap — now rejected at ingress with a clear error instead of mid-run collapse.
Technique decision table
| Approach | Strength | Weakness | Best for |
|---|---|---|---|
| Restart run from scratch | Simple implementation | Wastes work; bad UX on long jobs | Dev only |
| Naive tail truncation | Fast | Drops recent tool results; breaks coherence | Never in production |
| Per-tool observation caps + spill | Prevents most surprises | Needs artifact store + pagination tools | Default for data-heavy agents |
| Scheduled anchor compaction | Smooth token curve | CPU cost; summary quality risk | Long multi-step workflows |
| Emergency ladder on overflow error | Recovers edge cases | Lossy; harder to debug | Safety net behind preflight |
| Larger model on overflow | Preserves full context | Cost spike; not always available | High-value one-shot tasks |
Pitfalls
- Character-count estimates — English prose and code tokenize differently; use provider tokenizers.
- Compacting after a failed call only — by then side-effecting tools may have already fired on a partial retry.
- Summarizing away write confirmations — the agent re-issues duplicate CRM updates or payments.
- Unbounded repair loops — compact, fail, compact again with no progress metric.
- Ignoring tool definition tokens — shrinking history while leaving a 200-tool manifest consumes half the window.
- No user-visible notice — silent compaction erodes trust when answers miss earlier constraints.
- Skipping checkpoint before compaction — crash mid-ladder loses the only copy of full observations.
Production checklist
- Tokenizer preflight on every model call; soft warn at 80%, compact at 90%.
- Per-tool observation caps with spill-to-store for oversized payloads.
- Fixed tier budget: static, pinned goal, history, observations, generation reserve.
- Scheduled compaction every N steps on long-running agents.
- Deterministic emergency ladder (shed traces → rollup → externalize → swap model).
- Anchor invariants: never compact away user goals or pending write approvals.
- Map provider overflow errors to compaction handlers, not blind retry.
- Checkpoint WAL before and after compaction with token counts and rollup hash.
- Emit user-visible notice when emergency compaction runs.
- Metric: overflow rate, ladder rung distribution, post-compaction task success.
- Regression-test ladder on recorded oversized tool payloads in CI.
- Pair with dynamic tool routing to keep manifest size under control.
Key takeaways
- Context overflow mid-run is an orchestration problem, not a model memory flaw — unbounded tool observations are the usual trigger.
- Tokenizer preflight before every call catches overflows cheaper than provider 400 errors after a failed planner step.
- Spill large observations to durable storage and pass handles into context instead of raw payloads.
- A deterministic compaction ladder with anchor invariants recovers runs without restart-from-scratch.
- Harbor Research cut abandoned runs from 52% to 3.1% with ingest caps, preflight gates, and checkpointed compaction.
Related reading
- Context budget and token management explained — proactive allocation before overflow happens
- Conversation history compaction and summarization explained — anchor-preserving rollups for long threads
- Tool result summarization and truncation explained — shrink observations at ingest time
- ReAct agent loop explained — where context grows each think-act-observe cycle