
LLM agent durable state, checkpointing and run persistence explained

Harbor Research shipped a literature-review agent that crawled PubMed, summarized papers, and drafted synthesis sections over four to six hours. Every step lived in process memory. When Kubernetes rolled the deployment during a node drain, or a worker was OOM-killed mid-batch, the run simply vanished. Users saw a spinner, then nothing. Analytics showed 41% of runs longer than 90 minutes ended without a terminal status, not because the model failed but because the runtime had no durable record of progress. Researchers re-submitted the same queries, burning $18k/month in duplicate embedding and summarization spend before platform engineering treated persistence as a first-class concern.

Durable state and checkpointing at the agent layer means every meaningful transition (planner decision, tool invocation, tool result, budget decrement, human approval) is recorded to storage before the next step proceeds, so a replacement worker can resume from the last consistent point. This guide covers checkpoint schemas, write-ahead event logs, and resume tokens; coordination with retries, idempotency, and lifecycle cancellation; the Harbor Research refactor; a technique decision table; common pitfalls; and a production checklist.

What must survive a crash: the persistence boundary

Not every byte of an agent session belongs on disk. Production systems draw a persistence boundary around state that is expensive to recompute or unsafe to repeat:

  • Conversation transcript — user messages, assistant turns, and tool observations needed to continue planning.
  • Tool side effects ledger — which write tools committed, with idempotency keys and external IDs (see saga rollbacks when undo is required).
  • Run control metadata — current FSM state, step index, token and cost budgets consumed, model routing decisions.
  • External handles — uploaded file IDs, vector index job IDs, webhook correlation tokens.

Ephemeral caches (prompt template renders, read-only API responses with short TTL) can be recomputed on resume if the checkpoint records what was fetched and when, enabling conditional refresh. The boundary should be explicit in your run manifest so engineers know which fields are authoritative after restart.
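One way to make the boundary explicit is a manifest type whose fields are exactly the state treated as authoritative after restart. The sketch below is illustrative only: the `RunManifest` class and all of its field names are hypothetical, chosen to mirror the four categories above, and a real system would likely store the transcript separately from the control metadata.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RunManifest:
    """Hypothetical manifest: every field here is authoritative after restart."""
    run_id: str
    fsm_state: str = "running"          # run control metadata
    step_index: int = 0
    tokens_used: int = 0
    transcript: list = field(default_factory=list)        # messages and tool observations
    side_effects: list = field(default_factory=list)      # committed writes with idempotency keys
    external_handles: dict = field(default_factory=dict)  # file IDs, job IDs, webhook tokens
    # Ephemeral caches are not stored. Only a record of what was fetched and
    # when is kept, so a resumed worker can decide whether to refresh.
    fetch_log: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "RunManifest":
        return cls(**json.loads(raw))


manifest = RunManifest(run_id="run-123")
manifest.side_effects.append(
    {"tool": "create_doc", "idempotency_key": "run-123:4", "external_id": "doc-9"}
)
manifest.fetch_log.append(
    {"url": "https://example.org/paper", "fetched_at": "2024-01-01T00:00:00Z"}
)
restored = RunManifest.from_json(manifest.to_json())
```

Anything that is not a field on the manifest is, by construction, something a replacement worker must be able to recompute.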

Checkpoint models: snapshot vs event sourcing

Two patterns dominate agent persistence; most production stacks combine them.

Periodic snapshots

Serialize the full run state every N steps or M seconds. This is simple to implement and fast to restore. Downsides: large payloads (full message history), a risk of saving inconsistent state mid-tool-call, and expensive writes when context is huge. Mitigate by summarizing observations before each snapshot and by snapshotting only at step boundaries.

Write-ahead event log (WAL)

Append immutable events: RunStarted, ModelCallCompleted, ToolInvoked, ToolResultRecorded, BudgetDebited, HumanApprovalReceived. Rebuild state by replaying events from offset zero or from the latest snapshot plus tail events. WAL integrates cleanly with audit trails and deterministic replay for debugging.

Hybrid is the pragmatic default: snapshot every K events plus WAL between snapshots. On resume, load latest snapshot, replay tail, verify checksum, then continue from step_index + 1.
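The hybrid restore path can be sketched as a pure reducer folded over the tail of the log. This is a minimal illustration, not a reference implementation: the event payload fields (`seq`, `tokens`), the `last_seq` marker, and the choice that `ToolResultRecorded` advances `step_index` are all assumptions made for the example.

```python
import hashlib
import json


def checksum(state: dict) -> str:
    """Stable hash of a state dict (keys sorted so ordering does not matter)."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


def apply_event(state: dict, event: dict) -> dict:
    """Pure reducer: fold one event into a copy of the run state."""
    state = dict(state)
    if event["type"] == "ModelCallCompleted":
        state["tokens_used"] = state.get("tokens_used", 0) + event["tokens"]
    elif event["type"] == "ToolResultRecorded":
        # Assumption for this sketch: a recorded result closes the step.
        state["step_index"] = state.get("step_index", 0) + 1
    state["last_seq"] = event["seq"]
    return state


def restore(snapshot: dict, events: list) -> dict:
    """Load the snapshot, verify it, then replay only events after it."""
    if checksum(snapshot["state"]) != snapshot["checksum"]:
        raise ValueError("snapshot checksum mismatch")
    state = snapshot["state"]
    for event in events:
        if event["seq"] > state["last_seq"]:
            state = apply_event(state, event)
    return state


base = {"step_index": 2, "tokens_used": 900, "last_seq": 5}
snapshot = {"state": base, "checksum": checksum(base)}
log = [
    {"seq": 5, "type": "ToolResultRecorded"},                 # already in the snapshot
    {"seq": 6, "type": "ModelCallCompleted", "tokens": 100},
    {"seq": 7, "type": "ToolResultRecorded"},
]
state = restore(snapshot, log)
```

Because the reducer is pure and events at or below `last_seq` are skipped, replaying the same tail twice yields the same state, which is what makes resume safe to retry.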

Checkpoint timing and consistency rules

A checkpoint is only useful if it represents a consistent cut — no orphaned tool calls, no debited budget without recorded outcome. Enforce these ordering rules:

  1. Write before invoke — persist ToolInvoked with idempotency key before HTTP leaves the worker.
  2. Result before advance — record ToolResultRecorded (or permanent failure) before incrementing step index or calling the model again.
  3. Budget atomicity — debit tokens in the same transaction as the completion event; never double-debit on resume.
  4. Human gates — checkpoint AwaitingApproval with frozen plan hash so resume cannot skip an approval step.

For ambiguous timeouts (the tool may have committed), checkpoint the status ToolOutcomeUnknown and block auto-resume until the outcome is reconciled. This is the same discipline as retry classification, not blind continuation.
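The first three ordering rules can be illustrated with a small transactional store. The sketch uses an in-memory SQLite database from the Python standard library purely for demonstration; the `budgets` table, the helper names, and the `run_id:seq` idempotency key format are hypothetical. It relies on a primary key over `(run_id, seq)` so that a replayed debit fails instead of charging twice.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE run_events (run_id TEXT, seq INTEGER, type TEXT, payload TEXT,
                         PRIMARY KEY (run_id, seq));
CREATE TABLE budgets (run_id TEXT PRIMARY KEY, tokens_left INTEGER);
""")
db.execute("INSERT INTO budgets VALUES ('run-1', 1000)")
db.commit()


def append_event(run_id, seq, type_, payload):
    db.execute("INSERT INTO run_events VALUES (?, ?, ?, ?)",
               (run_id, seq, type_, json.dumps(payload)))


def record_model_call(run_id, seq, tokens):
    # Rule 3: the completion event and the debit commit together or not at all.
    with db:  # commits on success, rolls back on exception
        append_event(run_id, seq, "ModelCallCompleted", {"tokens": tokens})
        db.execute("UPDATE budgets SET tokens_left = tokens_left - ? WHERE run_id = ?",
                   (tokens, run_id))


def run_tool(run_id, seq, tool, args, call):
    key = f"{run_id}:{seq}"
    # Rule 1: persist the intent before any network traffic leaves the worker.
    with db:
        append_event(run_id, seq, "ToolInvoked", {"tool": tool, "idempotency_key": key})
    result = call(args, key)
    # Rule 2: record the result before the caller advances the step index.
    with db:
        append_event(run_id, seq + 1, "ToolResultRecorded",
                     {"idempotency_key": key, "result": result})
    return result


record_model_call("run-1", 1, 200)
try:
    record_model_call("run-1", 1, 200)   # same event replayed after a crash
except sqlite3.IntegrityError:
    pass                                 # duplicate seq rejected, budget untouched
run_tool("run-1", 2, "search", {"q": "sepsis"}, lambda args, key: {"hits": 3})

tokens_left = db.execute("SELECT tokens_left FROM budgets").fetchone()[0]
event_types = [row[0] for row in db.execute("SELECT type FROM run_events ORDER BY seq")]
```

If the worker dies between the two `with db:` blocks in `run_tool`, the log shows `ToolInvoked` without a result, which is exactly the state that must be reconciled before resuming.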

Resume tokens, leases and single-writer semantics

Long runs often span multiple workers. Without coordination, two replicas can resume the same run and duplicate side effects.

  • Resume token — opaque ID returned to clients; maps to run_id + epoch. Each successful checkpoint bumps epoch; stale workers holding old epoch lose the lease.
  • Distributed lease — short TTL lock (Redis, DynamoDB) acquired at resume; renewed on heartbeat during execution.
  • Single-writer per run — database row version or optimistic concurrency on last_event_seq; second writer gets Conflict and backs off.

Client UX: expose GET /runs/{id} with status running | paused | completed | failed | resumable and last checkpoint timestamp so users trust that progress is not lost during deploys.
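The lease and epoch mechanics above can be sketched with an in-memory stand-in. The `RunLeases` class and its methods are hypothetical, and bumping the epoch on takeover (in addition to on each checkpoint) is an assumption of this example. In a real deployment the check-and-update inside each method would need to be a single atomic operation in the backing store, for example a conditional write.

```python
import time


class LeaseConflict(Exception):
    pass


class RunLeases:
    """In-memory stand-in for a lease table; not safe across processes."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.rows = {}  # run_id -> {"owner", "epoch", "expires_at"}

    def acquire(self, run_id, worker):
        row = self.rows.get(run_id)
        now = self.clock()
        if row and row["expires_at"] > now and row["owner"] != worker:
            raise LeaseConflict(f"{run_id} is held by {row['owner']}")
        # Taking over (or first acquisition) starts a new epoch.
        epoch = row["epoch"] + 1 if row else 1
        self.rows[run_id] = {"owner": worker, "epoch": epoch,
                             "expires_at": now + self.ttl}
        return epoch

    def checkpoint(self, run_id, worker, epoch):
        """Fenced write: only the current owner at the current epoch may proceed."""
        row = self.rows.get(run_id)
        if not row or row["owner"] != worker or row["epoch"] != epoch:
            raise LeaseConflict("stale epoch; another worker resumed this run")
        row["epoch"] += 1                         # each successful checkpoint bumps epoch
        row["expires_at"] = self.clock() + self.ttl
        return row["epoch"]


now = [0.0]                                       # fake clock for a deterministic demo
leases = RunLeases(ttl_seconds=30.0, clock=lambda: now[0])
epoch_a = leases.acquire("run-1", "worker-a")
epoch_a = leases.checkpoint("run-1", "worker-a", epoch_a)
now[0] = 45.0                                     # worker-a stalls past the TTL
epoch_b = leases.acquire("run-1", "worker-b")
try:
    leases.checkpoint("run-1", "worker-a", epoch_a)
    stale_rejected = False
except LeaseConflict:
    stale_rejected = True
```

The stalled worker still believes it holds the run, but its checkpoint is rejected because the epoch moved on, so it cannot append events or trigger further side effects.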

Idempotent replay on resume

Resuming is replay with side effects. Every write tool must be safe when the WAL shows ToolInvoked but the worker died before ToolResultRecorded:

  • Re-query idempotency store with the persisted key before re-invoking.
  • If external system reports “already exists,” synthesize result and append ToolResultReconciled event.
  • Never re-run irreversible tools (send email, charge card) without human confirmation or compensating flow.

Model calls on resume should use the same message list reconstructed from WAL, not a lossy summary, unless context budget policy explicitly compressed earlier steps and recorded the compression manifest in the checkpoint.
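The reconcile step can be sketched as a function that inspects the log for invocations without a result and decides what to do with each. This is an assumption-laden illustration: the event shape, the `lookup_external` callback (standing in for a query against an idempotency store), and the action labels are hypothetical.

```python
def find_dangling_invocations(events):
    """Return ToolInvoked events with no recorded or reconciled result."""
    resolved = {e["idempotency_key"] for e in events
                if e["type"] in ("ToolResultRecorded", "ToolResultReconciled")}
    return [e for e in events
            if e["type"] == "ToolInvoked" and e["idempotency_key"] not in resolved]


def reconcile(events, lookup_external, irreversible_tools):
    """Plan an action for each dangling invocation before the run resumes."""
    actions = []
    for inv in find_dangling_invocations(events):
        existing = lookup_external(inv["idempotency_key"])
        if existing is not None:
            # The write already committed: synthesize the result, do not re-invoke.
            actions.append(("append", {"type": "ToolResultReconciled",
                                       "idempotency_key": inv["idempotency_key"],
                                       "result": existing}))
        elif inv["tool"] in irreversible_tools:
            # Outcome unknown and the tool cannot be undone: stop for a human.
            actions.append(("needs_human", inv))
        else:
            actions.append(("reinvoke", inv))
    return actions


events = [
    {"type": "ToolInvoked", "tool": "create_doc", "idempotency_key": "r1:1"},
    {"type": "ToolResultRecorded", "idempotency_key": "r1:1"},
    {"type": "ToolInvoked", "tool": "upload_file", "idempotency_key": "r1:2"},
    {"type": "ToolInvoked", "tool": "send_email", "idempotency_key": "r1:3"},
    {"type": "ToolInvoked", "tool": "index_paper", "idempotency_key": "r1:4"},
]
external = {"r1:2": {"file_id": "f-7"}}   # what the external system reports
actions = reconcile(events, external.get, irreversible_tools={"send_email"})
kinds = [kind for kind, _ in actions]
```

Re-invocation in the last branch is only safe because the persisted idempotency key is sent again with the request; without it, that branch should also route to a human.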

Storage backends and retention

Choose storage by durability SLA, query pattern, and compliance retention:

Backend | Good for | Watch for
PostgreSQL JSONB | Transactional snapshots + event table, strong consistency | Row size limits on huge transcripts
S3 / object store | Large snapshot blobs, cheap retention | Eventual consistency without careful key design
Redis Streams | Low-latency WAL tail, lease locks | Not sole durability tier without AOF/cluster
Event bus (Kafka) | Multi-consumer audit + analytics | Resume path still needs materialized state

Tier retention: hot WAL 7–30 days for resume, warm snapshots 90 days, cold archive for compliance. Encrypt at rest; redact secrets from checkpoints per credential injection policy (store references, not raw tokens).
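Redaction before serialization might look like the following sketch. The `SECRET_KEYS` set and the `secret_ref` placeholder shape are assumptions for illustration; a real policy would map each secret to a reference in whatever secret manager the platform uses, and key-name matching alone will not catch secrets embedded in free text.

```python
import json

# Hypothetical list of key names treated as secret-bearing (compared lowercase).
SECRET_KEYS = {"api_key", "authorization", "password", "token"}


def redact(value, secret_keys=SECRET_KEYS):
    """Return a copy of a JSON-like value with secret fields replaced by references."""
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            if key.lower() in secret_keys:
                out[key] = {"secret_ref": key}   # store a reference, never the raw value
            else:
                out[key] = redact(item, secret_keys)
        return out
    if isinstance(value, list):
        return [redact(item, secret_keys) for item in value]
    return value


checkpoint = {
    "step_index": 7,
    "tool_calls": [
        {"tool": "pubmed_search",
         "headers": {"Authorization": "Bearer abc123"},
         "args": {"q": "sepsis"}},
    ],
}
safe = redact(checkpoint)
serialized = json.dumps(safe)
```

The function builds new containers rather than mutating its input, so the live run state keeps its credentials while only the redacted copy reaches storage.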

Harbor Research refactor walkthrough

Harbor replaced in-memory-only runs with a RunPersistence module:

  1. EventStore — append-only Postgres table run_events(seq, run_id, type, payload, created_at) with monotonic seq per run.
  2. Snapshotter — every 10 events or 5 minutes, writes compressed snapshot to S3; payload includes summarized tool observations over 4k tokens.
  3. LeaseManager — 30s Redis lease per active worker; resume API requires valid lease acquisition.
  4. ResumeOrchestrator — load snapshot + replay tail; reconcile unknown tools; restore budgets from ledger.
  5. Deploy hook — graceful drain: finish current step, checkpoint, release lease; new pods pick up resumable runs within 60s.
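The Snapshotter's trigger (every 10 events or 5 minutes) can be sketched as below. This is not Harbor's code: the class shape, the key naming scheme, and the use of a plain dict in place of an object store are assumptions, and `zlib` stands in for whatever compression was actually used.

```python
import json
import zlib


class Snapshotter:
    """Writes a compressed snapshot when enough events or enough time has passed."""

    def __init__(self, store, every_events=10, every_seconds=300.0):
        self.store = store              # dict standing in for an object store
        self.every_events = every_events
        self.every_seconds = every_seconds
        self.last_seq = 0
        self.last_time = 0.0

    def maybe_snapshot(self, run_id, seq, now, state):
        if seq <= self.last_seq:
            return False                # nothing new since the last snapshot
        due = (seq - self.last_seq >= self.every_events
               or now - self.last_time >= self.every_seconds)
        if not due:
            return False
        key = f"{run_id}/snapshot-{seq:08d}.json.z"
        self.store[key] = zlib.compress(json.dumps(state, sort_keys=True).encode())
        self.last_seq, self.last_time = seq, now
        return True


store = {}
snap = Snapshotter(store)
# One event per second: only the event-count trigger fires.
taken = [seq for seq in range(1, 26)
         if snap.maybe_snapshot("run-1", seq, now=float(seq), state={"step_index": seq})]
# A slow stretch: few events, but more than five minutes since the last snapshot.
time_triggered = snap.maybe_snapshot("run-1", 26, now=400.0, state={"step_index": 26})
latest = json.loads(zlib.decompress(store[max(store)]))
```

Zero-padding the sequence number in the key means the lexicographically largest key is also the latest snapshot, which keeps the resume path to a single prefix listing.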

Results after eight weeks: the abandoned long-run rate fell from 41% to 2.8% (the remainder were genuine model failures or user cancellations), duplicate embedding spend fell from $18k/month to $2.1k/month, and median resume latency after pod death was under 45 seconds. User NPS on the research product rose 23 points once progress survived deploys.

Technique decision table

Scenario | Prefer | Avoid
Sub-60s chat agent | In-memory + optional final transcript persist | Full WAL every token
Multi-hour research / ETL agent | WAL + periodic snapshots + resume API | Snapshot only at process exit
Human-in-the-loop approval | Checkpoint frozen plan at gate | Resume into new model sample
Write-heavy tool chain | Idempotency keys in every checkpoint | Replay writes on resume
Multi-worker fleet | Lease + epoch resume tokens | Last-writer-wins on shared row
Regulated audit needs | Immutable WAL + compliance retention | Mutable snapshot overwrites only
Deploy during active runs | Drain step + resumable status | SIGKILL without checkpoint

Common pitfalls

  • Checkpoint mid-tool-call — inconsistent state on replay.
  • No lease on resume — duplicate workers double-charge APIs.
  • Secrets in snapshots — compliance breach when blobs are copied.
  • Full context snapshots only — disk and restore latency explode.
  • Resume without reconcile — repeats committed writes.
  • Orphan runs — no TTL or terminal status; dashboards lie forever.
  • Ignoring cancel on resume — zombie runs after user abort.

Production checklist

  • Define persistence boundary (transcript, ledger, budgets, handles).
  • Append WAL events at step boundaries with monotonic sequence per run.
  • Snapshot every K events or T minutes to object store or JSONB.
  • Persist ToolInvoked before network; ToolResult before step advance.
  • Issue resume tokens with epoch; enforce single-writer lease.
  • Reconcile unknown tool outcomes before continuing write tools.
  • Integrate with cancellation FSM and budget ledger atomically.
  • Expose resumable status and last checkpoint time to clients.
  • Graceful deploy drain with checkpoint before pod termination.
  • Load-test kill -9 mid-run; verify resume within SLA without duplicate side effects.

Key takeaways

  • Durable agents record progress before the next step, not after success.
  • WAL + snapshots balance restore speed, auditability, and storage cost.
  • Resume is replay — idempotency and leases prevent duplicate work.
  • Consistent cuts at step boundaries avoid corrupted checkpoint state.
  • Harbor Research cut abandoned long runs from 41% to 2.8% with event-sourced persistence and resume orchestration.
