Guide
LLM agent durable state, checkpointing and run persistence explained
Harbor Research shipped a literature-review agent that crawled PubMed, summarized papers, and drafted synthesis sections over four to six hours. Every step lived in process memory. When Kubernetes rolled the deployment during a node drain, or a worker was OOM-killed mid-batch, the run simply vanished. Users saw a spinner, then nothing. Analytics showed that 41% of runs longer than 90 minutes ended without a terminal status, not because the model failed but because the runtime had no durable record of progress. Researchers resubmitted the same queries, burning $18k/month in duplicate embedding and summarization spend before platform engineering treated persistence as a first-class concern.
Durable state and checkpointing at the agent layer means that every meaningful transition (planner decision, tool invocation, tool result, budget decrement, human approval) is recorded to storage before the next step proceeds, so a replacement worker can resume from the last consistent point. This guide covers checkpoint schemas; write-ahead event logs; resume tokens; coordination with retry, idempotency, and lifecycle cancellation; the Harbor Research refactor; a technique decision table; pitfalls; and a production checklist.
What must survive a crash: the persistence boundary
Not every byte of an agent session belongs on disk. Production systems draw a persistence boundary around state that is expensive to recompute or unsafe to repeat:
- Conversation transcript — user messages, assistant turns, and tool observations needed to continue planning.
- Tool side effects ledger — which write tools committed, with idempotency keys and external IDs (see saga rollbacks when undo is required).
- Run control metadata — current FSM state, step index, token and cost budgets consumed, model routing decisions.
- External handles — uploaded file IDs, vector index job IDs, webhook correlation tokens.
Ephemeral caches (prompt template renders, read-only API responses with short TTL) can be recomputed on resume if the checkpoint records what was fetched and when, enabling conditional refresh. The boundary should be explicit in your run manifest so engineers know which fields are authoritative after restart.
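One way to make the boundary explicit is a typed run manifest. The sketch below is illustrative only: the class name RunManifest, its field names, and the needs_refresh helper are assumptions, not part of any particular framework. It separates authoritative fields from a record of what was fetched and when, which is enough to drive conditional refresh on resume.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RunManifest:
    """Hypothetical manifest; field names are illustrative assumptions."""
    # Authoritative after restart: expensive to recompute or unsafe to repeat.
    run_id: str
    fsm_state: str = "running"
    step_index: int = 0
    tokens_used: int = 0
    transcript: list = field(default_factory=list)
    side_effects: dict = field(default_factory=dict)      # idempotency key -> external ID
    external_handles: dict = field(default_factory=dict)  # e.g. file IDs, job IDs
    # Not authoritative: only records what was fetched and when (unix seconds).
    fetched_at: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "RunManifest":
        return cls(**json.loads(raw))


def needs_refresh(manifest: RunManifest, resource: str, now: float, ttl_seconds: float) -> bool:
    """True when a cached read was never recorded or is older than its TTL."""
    fetched = manifest.fetched_at.get(resource)
    return fetched is None or now - fetched > ttl_seconds
```

On resume, a worker would deserialize the manifest, trust the authoritative fields as-is, and call needs_refresh for each cached read before reusing it.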
Checkpoint models: snapshot vs event sourcing
Two patterns dominate agent persistence; most production stacks combine them.
Periodic snapshots
Serialize the full run state every N steps or M seconds. Simple to implement and fast to restore. Downsides: large payloads (full message history), risk of saving mid-tool-call inconsistent state, and expensive writes when context is huge. Mitigate with observation summarization before snapshot and by snapshotting only at step boundaries.
Write-ahead event log (WAL)
Append immutable events: RunStarted, ModelCallCompleted, ToolInvoked, ToolResultRecorded, BudgetDebited, HumanApprovalReceived. Rebuild state by replaying events from offset zero or from the latest snapshot plus tail events. WAL integrates cleanly with audit trails and deterministic replay for debugging.
Hybrid is the pragmatic default: snapshot every K events plus WAL between snapshots. On resume, load latest snapshot, replay tail, verify checksum, then continue from step_index + 1.
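The resume path of the hybrid model can be sketched as a pure fold over events. This is a minimal illustration, assuming events are dicts with a seq field, that the snapshot stores a SHA-256 checksum of its own state, and that only two event types change state. The function names apply, checksum, and restore are hypothetical. In this sketch the checksum is verified when the snapshot is loaded, before the tail is replayed.

```python
import hashlib
import json


def apply(state: dict, event: dict) -> dict:
    """Fold one event into the run state without mutating the input."""
    new = dict(state)
    if event["type"] == "ModelCallCompleted":
        new["tokens_used"] = new.get("tokens_used", 0) + event["tokens"]
    elif event["type"] == "ToolResultRecorded":
        new["step_index"] = event["step_index"]
    new["last_seq"] = event["seq"]
    return new


def checksum(state: dict) -> str:
    """Stable digest of the state (keys sorted so ordering cannot change it)."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


def restore(snapshot: dict, events: list) -> dict:
    """Load a snapshot, verify it, then replay only the tail events after it."""
    if checksum(snapshot["state"]) != snapshot["checksum"]:
        raise ValueError("snapshot checksum mismatch")
    state = snapshot["state"]
    for event in events:
        if event["seq"] <= state["last_seq"]:
            continue  # already reflected in the snapshot
        if event["seq"] != state["last_seq"] + 1:
            raise ValueError("gap in event log")
        state = apply(state, event)
    return state
```

After restore returns, the worker would continue from state["step_index"] + 1. Rejecting sequence gaps is a design choice here: a missing event is treated as corruption rather than silently skipped.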
Checkpoint timing and consistency rules
A checkpoint is only useful if it represents a consistent cut — no orphaned tool calls, no debited budget without recorded outcome. Enforce these ordering rules:
- Write before invoke: persist ToolInvoked with idempotency key before HTTP leaves the worker.
- Result before advance: record ToolResultRecorded (or permanent failure) before incrementing step index or calling the model again.
- Budget atomicity: debit tokens in the same transaction as the completion event; never double-debit on resume.
- Human gates: checkpoint AwaitingApproval with frozen plan hash so resume cannot skip an approval step.
For ambiguous timeouts (tool may have committed), checkpoint status ToolOutcomeUnknown and block auto-resume until reconciliation completes. This is the same discipline as retry classification, not blind continuation.
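The first two ordering rules and the ambiguous-timeout rule can be sketched in one step function. This is an illustration under assumptions: log is a plain list standing in for a durable append, call_tool is a hypothetical callable that takes the idempotency key, a TimeoutError is treated as the ambiguous case, and any other exception is treated as a permanent failure.

```python
def run_tool_step(log: list, step_index: int, key: str, call_tool) -> str:
    """Run one tool step, appending events in the order the rules require.

    Returns "advance", "failed", or "blocked" so the caller knows whether
    the step index may be incremented.
    """
    # Write before invoke: intent and idempotency key are recorded first.
    log.append({"type": "ToolInvoked", "step_index": step_index, "key": key})
    try:
        result = call_tool(key)
    except TimeoutError:
        # The tool may have committed; record that the outcome is unknown
        # and refuse to advance until reconciliation settles it.
        log.append({"type": "ToolOutcomeUnknown", "step_index": step_index, "key": key})
        return "blocked"
    except Exception as exc:
        # Assumption: every non-timeout error is a permanent failure.
        log.append({"type": "ToolResultRecorded", "step_index": step_index,
                    "key": key, "ok": False, "error": str(exc)})
        return "failed"
    # Result before advance: the outcome is recorded before the caller moves on.
    log.append({"type": "ToolResultRecorded", "step_index": step_index,
                "key": key, "ok": True, "result": result})
    return "advance"
```

A real worker would replace the list append with a durable write and would classify errors more finely, but the event order would stay the same.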
Resume tokens, leases and single-writer semantics
Long runs often span multiple workers. Without coordination, two replicas can resume the same run and duplicate side effects.
- Resume token: opaque ID returned to clients; maps to run_id + epoch. Each successful checkpoint bumps epoch; stale workers holding old epoch lose the lease.
- Distributed lease: short TTL lock (Redis, DynamoDB) acquired at resume; renewed on heartbeat during execution.
- Single-writer per run: database row version or optimistic concurrency on last_event_seq; second writer gets Conflict and backs off.
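The interaction between leases and epochs can be sketched with an in-memory store. This is a stand-in for a TTL lock in Redis or DynamoDB, not a client for either. The class LeaseStore and its methods are hypothetical, and the sketch assumes the epoch is bumped both when a lease is taken over and on each committed checkpoint.

```python
import time


class LeaseStore:
    """In-memory stand-in for a distributed TTL lease (hypothetical API)."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._leases = {}  # run_id -> (owner, epoch, expires_at)

    def _held_by(self, run_id, owner, epoch) -> bool:
        lease = self._leases.get(run_id)
        return (lease is not None and lease[0] == owner
                and lease[1] == epoch and lease[2] > self._clock())

    def acquire(self, run_id, owner):
        """Return a fresh epoch, or None while another owner's lease is live."""
        lease = self._leases.get(run_id)
        if lease is not None and lease[0] != owner and lease[2] > self._clock():
            return None
        epoch = (lease[1] if lease else 0) + 1
        self._leases[run_id] = (owner, epoch, self._clock() + self._ttl)
        return epoch

    def heartbeat(self, run_id, owner, epoch) -> bool:
        """Extend the TTL; fails for a stale epoch or an expired lease."""
        if not self._held_by(run_id, owner, epoch):
            return False
        self._leases[run_id] = (owner, epoch, self._clock() + self._ttl)
        return True

    def commit_checkpoint(self, run_id, owner, epoch):
        """Bump the epoch on a successful checkpoint; None means the lease was lost."""
        if not self._held_by(run_id, owner, epoch):
            return None
        self._leases[run_id] = (owner, epoch + 1, self._clock() + self._ttl)
        return epoch + 1
```

A worker that gets None from commit_checkpoint should stop immediately: another worker owns the run, and anything written under the old epoch would be a duplicate.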
Client UX: expose GET /runs/{id} with status running | paused | completed | failed | resumable and last checkpoint timestamp so users trust that progress is not lost during deploys.
Idempotent replay on resume
Resuming is replay with side effects. Every write tool must be safe when the WAL shows ToolInvoked but the worker died before ToolResultRecorded:
- Re-query idempotency store with the persisted key before re-invoking.
- If external system reports “already exists,” synthesize result and append ToolResultReconciled event.
- Never re-run irreversible tools (send email, charge card) without human confirmation or compensating flow.
Model calls on resume should use the same message list reconstructed from WAL, not a lossy summary, unless context budget policy explicitly compressed earlier steps and recorded the compression manifest in the checkpoint.
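The reconciliation rules above can be sketched as a pass over the WAL. This is a minimal illustration with assumed shapes: events are dicts carrying a key and, for ToolInvoked, a tool name; idempotency_store is a plain dict from key to the previously committed result; irreversible is a set of tool names. The function name reconcile_pending and the decision labels are hypothetical.

```python
def reconcile_pending(events: list, idempotency_store: dict, irreversible: set):
    """Classify every ToolInvoked that has no recorded result.

    Returns (events_to_append, decisions), where decisions pairs an
    idempotency key with "needs_human" or "safe_to_reinvoke".
    """
    resolved = {e["key"] for e in events
                if e["type"] in ("ToolResultRecorded", "ToolResultReconciled")}
    to_append, decisions = [], []
    for e in events:
        if e["type"] != "ToolInvoked" or e["key"] in resolved:
            continue
        prior = idempotency_store.get(e["key"])
        if prior is not None:
            # The write already committed: synthesize the result instead of re-invoking.
            to_append.append({"type": "ToolResultReconciled",
                              "key": e["key"], "result": prior})
        elif e["tool"] in irreversible:
            # No evidence either way for an irreversible tool: never retry blindly.
            decisions.append((e["key"], "needs_human"))
        else:
            # Assumption: the tool honors the idempotency key, so a retry is safe.
            decisions.append((e["key"], "safe_to_reinvoke"))
    return to_append, decisions
```

The resume path would append the synthesized events first, then hold the run at any needs_human decision before the model is called again.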
Storage backends and retention
Choose storage by durability SLA, query pattern, and compliance retention:
| Backend | Good for | Watch for |
|---|---|---|
| PostgreSQL JSONB | Transactional snapshots + event table, strong consistency | Row size limits on huge transcripts |
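| S3 / object store | Large snapshot blobs, cheap retention | Last-writer-wins overwrites and no multi-object transactions; use immutable, versioned keys (Amazon S3 reads are strongly consistent, other object stores may not be) |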
| Redis Streams | Low-latency WAL tail, lease locks | Not sole durability tier without AOF/cluster |
| Event bus (Kafka) | Multi-consumer audit + analytics | Resume path still needs materialized state |
Tier retention: hot WAL 7–30 days for resume, warm snapshots 90 days, cold archive for compliance. Encrypt at rest; redact secrets from checkpoints per credential injection policy (store references, not raw tokens).
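Storing references instead of raw tokens can be sketched as a recursive redaction pass run before any checkpoint is written. The field list SECRET_FIELDS, the secret_ref shape, and the vault_refs mapping are assumptions for illustration. A real policy would come from your credential injection layer.

```python
# Hypothetical deny-list of field names whose values must never be persisted.
SECRET_FIELDS = {"api_key", "access_token", "password", "authorization"}


def redact(value, vault_refs: dict):
    """Return a copy of value with secret fields replaced by vault references."""
    if isinstance(value, dict):
        out = {}
        for name, inner in value.items():
            if name.lower() in SECRET_FIELDS:
                # Keep a pointer to the secret, never the secret itself.
                out[name] = {"secret_ref": vault_refs.get(name, "unresolved")}
            else:
                out[name] = redact(inner, vault_refs)
        return out
    if isinstance(value, list):
        return [redact(item, vault_refs) for item in value]
    return value
```

Name-based matching only catches secrets stored under known keys. Tokens embedded in free text, such as a tool observation that echoes a header, would need a separate scanner.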
Harbor Research refactor walkthrough
Harbor replaced in-memory-only runs with a RunPersistence module:
- EventStore: append-only Postgres table run_events(seq, run_id, type, payload, created_at) with monotonic seq per run.
- Snapshotter: every 10 events or 5 minutes, writes a compressed snapshot to S3; tool observations over 4k tokens are summarized before they go into the payload.
- LeaseManager: 30s Redis lease per active worker; resume API requires valid lease acquisition.
- ResumeOrchestrator: load snapshot + replay tail; reconcile unknown tools; restore budgets from ledger.
- Deploy hook: graceful drain that finishes the current step, checkpoints, and releases the lease; new pods pick up resumable runs within 60s.
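The EventStore and the Snapshotter trigger described above can be sketched as follows. This is not Harbor's code: it uses SQLite from the Python standard library as a stand-in for Postgres, and the names append_event, Conflict, and should_snapshot are hypothetical. The composite primary key on (run_id, seq) is what enforces one writer per sequence number.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE run_events (
    seq        INTEGER NOT NULL,
    run_id     TEXT    NOT NULL,
    type       TEXT    NOT NULL,
    payload    TEXT    NOT NULL,
    created_at REAL    NOT NULL,
    PRIMARY KEY (run_id, seq)
)
"""


class Conflict(Exception):
    """Another writer already appended at this sequence number."""


def append_event(db, run_id: str, expected_seq: int, event_type: str, payload: dict) -> int:
    """Append at expected_seq + 1, or raise Conflict if that slot is taken."""
    try:
        with db:  # commits on success, rolls back on exception
            db.execute(
                "INSERT INTO run_events VALUES (?, ?, ?, ?, ?)",
                (expected_seq + 1, run_id, event_type, json.dumps(payload), time.time()),
            )
    except sqlite3.IntegrityError as exc:
        raise Conflict(f"{run_id} already has seq {expected_seq + 1}") from exc
    return expected_seq + 1


def should_snapshot(events_since: int, seconds_since: float,
                    max_events: int = 10, max_seconds: float = 300.0) -> bool:
    """Snapshot every 10 events or 5 minutes, whichever comes first."""
    return events_since >= max_events or seconds_since >= max_seconds
```

A caller would open the store with sqlite3.connect(":memory:") for tests, execute SCHEMA once, and pass the last sequence number it has seen as expected_seq. A worker that receives Conflict should back off and reload the log rather than retry with a higher number.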
Results after eight weeks: the abandoned long-run rate fell from 41% to 2.8% (the remainder were genuine model failures or user cancellations), duplicate embedding spend dropped from $18k/mo to $2.1k/mo, and median resume latency after pod death was under 45 seconds. User NPS on the research product rose 23 points once progress survived deploys.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Sub-60s chat agent | In-memory + optional final transcript persist | Full WAL every token |
| Multi-hour research / ETL agent | WAL + periodic snapshots + resume API | Snapshot only at process exit |
| Human-in-the-loop approval | Checkpoint frozen plan at gate | Resume into new model sample |
| Write-heavy tool chain | Idempotency keys in every checkpoint | Replay writes on resume |
| Multi-worker fleet | Lease + epoch resume tokens | Last-writer-wins on shared row |
| Regulated audit needs | Immutable WAL + compliance retention | Mutable snapshot overwrites only |
| Deploy during active runs | Drain step + resumable status | SIGKILL without checkpoint |
Common pitfalls
- Checkpoint mid-tool-call — inconsistent state on replay.
- No lease on resume — duplicate workers double-charge APIs.
- Secrets in snapshots — compliance breach when blobs are copied.
- Full context snapshots only — disk and restore latency explode.
- Resume without reconcile — repeats committed writes.
- Orphan runs — no TTL or terminal status; dashboards lie forever.
- Ignoring cancel on resume — zombie runs after user abort.
Production checklist
- Define persistence boundary (transcript, ledger, budgets, handles).
- Append WAL events at step boundaries with monotonic sequence per run.
- Snapshot every K events or T minutes to object store or JSONB.
- Persist ToolInvoked before the network call; persist ToolResultRecorded before step advance.
- Issue resume tokens with epoch; enforce single-writer lease.
- Reconcile unknown tool outcomes before continuing write tools.
- Integrate with cancellation FSM and budget ledger atomically.
- Expose resumable status and last checkpoint time to clients.
- Graceful deploy drain with checkpoint before pod termination.
- Load-test kill -9 mid-run; verify resume within SLA without duplicate side effects.
Key takeaways
- Durable agents record progress before the next step, not after success.
- WAL + snapshots balance restore speed, auditability, and storage cost.
- Resume is replay — idempotency and leases prevent duplicate work.
- Consistent cuts at step boundaries avoid corrupted checkpoint state.
- Harbor Research cut abandoned long runs from 41% to 2.8% with event-sourced persistence and resume orchestration.
Related reading
- LLM agent cancellation, timeout and lifecycle management explained — cooperative shutdown and run FSMs
- LLM agent deterministic replay and run reproducibility explained — record-replay for debug
- LLM agent retry, backoff and transient failure recovery explained — idempotency on ambiguous timeouts
- LLM agent handoff and session transfer explained — warm transfer across channels