Guide
LLM durable agent execution and checkpointing explained
Harbor Logistics’ vendor onboarding agent was designed for a six-to-forty-eight-hour
lifecycle: collect W-9s, run sanctions screening, open a NetSuite vendor record, route
a purchase order through finance, and provision API credentials. It worked beautifully in
staging where nothing restarted. In production, a routine deploy at hour thirty-six
wiped the in-memory
ReAct loop.
The resumed process had no memory of the PO it had already filed. It called
create_purchase_order again. Finance approved a duplicate $84,000 commitment
before anyone noticed. Over three months, 23% of long runs produced at
least one duplicate side effect after crash or deploy.
Durable agent execution treats each agent run as a persisted workflow: checkpoint state after every tool call, record idempotency keys for mutating actions, pause cleanly at human approval gates, and resume from the last committed step instead of replaying from scratch. This guide covers run state machines, checkpoint schemas, side-effect ledgers, pause/resume semantics, integration with observability and tool error recovery, the Harbor Logistics refactor, a technique decision table versus in-memory loops and generic job queues, pitfalls, and a production checklist.
Why in-memory agents fail in production
A typical prototype keeps conversation history, tool results, and plan state in a Python
dict or Redis session keyed by session_id. That works until reality intervenes:
- Process restarts — deploys, OOM kills, autoscaling scale-in, spot instance preemption.
- Long human waits — legal review over a weekend; the worker must not hold memory or time out.
- Partial tool success — payment API returns 200 but the connection drops before the agent records the receipt.
- At-least-once delivery — queue consumers retry messages; the same step may execute twice without guards.
- Multi-worker routing — step 4 runs on pod A; after resume, step 5 lands on pod B with no shared RAM.
Naive “restart the whole prompt from the beginning” wastes tokens, duplicates expensive tool calls, and can violate compliance if a mutating action runs twice. Durable execution separates orchestration state (what step are we on, what is committed) from ephemeral inference (the next model call).
Run state machine
Model each agent task as a finite state machine with explicit transitions. A minimal production set:
pending— accepted, not yet started.running— actively executing model/tool steps.waiting_tool— external async tool in flight (webhook expected).waiting_human— blocked on approval or input from a review queue.paused— operator or budget policy halted the run.completed— terminal success; no further steps.failed— terminal error; may allow manual retry from last checkpoint.cancelled— user or policy aborted; side effects already committed stand.
Transitions must be atomic in your persistence layer: only one worker
may move running → waiting_tool for a given run_id. Use
optimistic locking (version column) or row-level locks. Emit state changes
to your
trace backend
so operators see “stuck in waiting_human 52 hours” without reading logs.
Checkpoint schema
Persist a checkpoint after every committed step — not after every token streamed. A practical JSON document:
{
"run_id": "run_8f3a…",
"trace_id": "trace_91bc…",
"step_index": 7,
"agent_version": "vendor-onboard/v14",
"prompt_template_hash": "a4e2c1…",
"messages": [ /* full LLM context at this step */ ],
"plan": { "current_goal": "provision_api_keys", "subtasks_done": ["kyc", "po"] },
"tool_ledger": [
{ "step": 3, "tool": "create_vendor", "idempotency_key": "run_8f3a:step_3",
"status": "committed", "external_id": "VND-4421", "observation_digest": "sha256:…" }
],
"budget": { "tokens_in": 42000, "tokens_out": 8100, "usd_estimate": 1.24 },
"state": "waiting_human",
"waiting_for": { "type": "approval", "queue_id": "finance_po", "since": "2026-06-10T14:22:00Z" }
}
Store large observations out-of-band (S3, blob store) and reference by digest in the
checkpoint. Keep the hot checkpoint row under a few hundred kilobytes so resume stays
fast. Version the schema; migrations should read checkpoint_schema_version
and upgrade on load.
What to checkpoint vs recompute
- Always persist — messages, tool arguments, tool receipts, external IDs, idempotency keys, step index.
- Safe to recompute — derived summaries if you can reproduce them deterministically from persisted messages.
- Never rely on recompute — mutating tool outcomes; the ledger is the source of truth for what happened in the real world.
Idempotent tools and side-effect ledgers
Durable execution without idempotency duplicates damage. Every mutating tool should
accept an idempotency_key scoped to run_id + step_index (or
a UUID generated before the call and stored in the checkpoint). The tool server:
- Looks up prior result by key in a durable store.
- If found, returns the cached receipt without re-executing.
- If not found, executes, stores result, returns.
The agent’s tool ledger mirrors this: before invoking
create_purchase_order, write status: pending with the key;
after success, flip to committed with external_id. On resume,
skip any step whose ledger entry is already committed. Pair with the
patterns in
tool error handling:
distinguish retryable transport errors (safe to retry same key) from business-logic
failures (mark failed, do not auto-retry).
Read-only tools (search, fetch document) need weaker guarantees but still benefit from short TTL caches keyed by arguments to avoid thundering herds on resume.
Pause, resume and human approval gates
Long-running agents spend most wall-clock time waiting, not inferring. When a step requires human sign-off:
- Checkpoint with
state: waiting_humanand a pointer to the review item. - Release the worker process; do not block a GPU or hold a queue lease for days.
- On approval webhook, enqueue a
resumejob withrun_id. - Load checkpoint, append the human decision as a structured message, set
state: running, continue fromstep_index + 1.
Rejections should branch explicitly: either terminal failed with reason,
or a repair loop where the agent drafts a revised action and re-enters the queue.
Tie into
HITL
confidence routing so only high-risk mutations pause; low-risk steps stream through.
Enforce
budget caps
across pause boundaries — total tokens and wall-clock SLA include wait time.
Workflow engines and async tool callbacks
You can build durable execution on Postgres plus a job queue, but complex graphs (parallel sub-agents, timers, compensating transactions) map cleanly to workflow engines (Temporal, Restate, Inngest, LangGraph with a persistent checkpointer). Patterns that matter for LLM agents:
- Activity timeouts — model calls get minutes; human gates get days with heartbeats.
- Signals — inject “user cancelled” or “approval granted” without polling.
- Child workflows — isolate a research sub-agent; parent waits on a result handle.
- Saga compensations — if step 8 fails after step 5 committed, run defined rollback tools.
Whether you use a framework or roll your own, the invariant is the same: the workflow code must be deterministic given the checkpoint. Non-determinism belongs inside activities (model calls, HTTP) whose outputs are recorded in the ledger.
Harbor Logistics refactor
Harbor’s vendor onboarding agent previously ran as a single long-lived Celery task with in-process message history. After the duplicate PO incident, the team shipped:
- Postgres
agent_runstable with optimistic locking and the checkpoint schema above. - Idempotency middleware on all mutating internal APIs (
create_vendor,create_po,provision_credentials). - Finance and legal approvals as
waiting_humanwith Slack deep links; workers released within 200 ms of entering wait. - Resume jobs triggered by approval webhooks and by a nightly sweeper for stuck runs.
- Dashboards: median wall time, time-in-waiting_human p95, duplicate-tool-call rate (target zero).
Outcomes over eight weeks: duplicate side effects on long runs 23% → 0%; successful resume after deploy or crash 94% (remaining 6% required manual intervention on corrupted checkpoints); median token spend per completed onboarding −18% because resume no longer replayed early research steps. Mean time to onboard a standard vendor 41 → 38 hours — modest wall-clock gain, but finance trust recovered enough to expand automated onboarding from 40% to 72% of new vendors.
Technique decision table
| Need | Prefer | Avoid |
|---|---|---|
| Sub-minute chat agent | In-memory session; optional trace export | Full workflow engine per message |
| Multi-hour task with approvals | Checkpoint + waiting_human + resume queue | Blocking worker holding RAM for days |
| Mutating enterprise APIs | Idempotency keys + tool ledger | Blind retry on resume |
| Crash during tool call | Ledger pending → reconcile or retry same key | Assume failure and re-run committed step |
| Parallel research subtasks | Child workflows with merged checkpoint | One giant message thread |
| Debug stuck run | State machine + trace_id in observability UI | Grep worker stdout |
| Regulated audit trail | Immutable ledger entries per committed tool | Reconstructed chat-only history |
Common pitfalls
- Checkpointing only on success — crashes mid-tool leave ambiguous state; write pending before call.
- Idempotency keys that change on resume — duplicates every side effect; key must be stable per logical step.
- Storing megabyte observations in the DB row — slow resume; use digest + object storage.
- No version field on checkpoints — two workers resume concurrently and diverge.
- Replaying the model from step zero — burns budget and may choose different tools; restore messages from checkpoint.
- Human wait without SLA alerts — runs rot in
waiting_humanfor weeks. - Missing cancellation path — operators cannot stop a run that already committed partial work.
- Ledger not visible in traces — on-call cannot tell committed vs attempted actions.
Production checklist
- Assign stable
run_idand propagatetrace_idacross resume attempts. - Define explicit run states and atomic transition rules with optimistic locking.
- Checkpoint after every committed tool step; persist pending before mutating calls.
- Require
idempotency_keyon all mutating tools; cache receipts server-side. - Maintain a tool ledger with
pending | committed | failedper step. - Store large observations by digest; keep checkpoint rows lean.
- Implement
waiting_humanwith webhook resume and release workers immediately. - Run a sweeper for stuck states (waiting_tool, waiting_human past SLA).
- On resume, skip ledger entries already
committed; never double-execute. - Enforce token and cost budgets cumulatively across pauses.
- Export metrics: resume success rate, duplicate-tool rate, time-in-wait p95.
- Document manual recovery for corrupted checkpoints and partial compensations.
Key takeaways
- Production agents are workflows, not single HTTP requests — state must survive restarts.
- Checkpoints + idempotency keys prevent duplicate POs and double charges on resume.
- Harbor Logistics cut duplicate side effects 23% → 0% with ledgers and approval pauses.
- Release workers during human waits — durability is about stored state, not running processes.
- Pair with tracing and HITL so operators see where a run is stuck and why.
Related reading
- ReAct agent loop — step anatomy your checkpoints should capture
- Human-in-the-loop — approval queues that trigger resume
- Tool error handling — retry vs fail semantics with idempotency
- Agent observability — trace state transitions and ledger events