Guide
LLM agent deterministic replay and run reproducibility explained
Harbor DevOps shipped a Kubernetes migration agent that reads cluster state,
proposes a rollout plan, opens a change ticket, and applies Terraform through
a gated pipeline. On a Tuesday night it approved a destructive node-pool
replacement on a staging cluster that should have been excluded by policy.
On-call pulled the OpenTelemetry trace: twelve ReAct steps, three parallel
tool batches, and a final apply_terraform call with wrong
arguments. Engineers tried to replay the run locally with the same user
prompt. They failed 27% of the time on the first attempt and
could not match the production trajectory in 73% of sessions
even after ten tries. Temperature was 0.7 in prod but 0 in dev. Live
describe_cluster returned different pod counts. The model version
had rolled forward overnight. Nobody had stored the exact tool observations
from the incident window.
Deterministic replay means re-executing an agent run so that every LLM completion and every tool observation matches a recorded baseline unless you explicitly swap in a new model or policy under test. Run reproducibility is the broader discipline: capturing enough of the environment — prompts, model IDs, decoding params, tool responses, clocks, feature flags — that another engineer can reproduce the failure or verify a fix. This complements agent tracing (what happened) and audit trails (provable history) with a rerunnable artifact. Harbor built run manifests, LLM cassettes, and tool replay stubs into their E2E harness. Incident reproducibility rose from 73% to 94%; mean time to root cause on P1 agent failures fell from 6.8 hours to 1.4 hours. This guide covers replay architecture, seeding and nondeterminism sources, record-replay modes, the Harbor refactor, a technique decision table, pitfalls, and a production checklist.
Why traces alone are not replay
Traces answer “what did the agent do?” They rarely answer “can I make it do that again?” Agent runs diverge for predictable reasons:
- LLM sampling — temperature, top-p, and provider-side batching change token sequences even with identical prompts.
- Model and prompt drift — silent router upgrades, prompt template changes, or prompt registry tags not pinned on the incident run.
- Live tool variance — APIs return different JSON; pagination cursors move; rate-limit headers differ; clock skew shifts “now” in relative date tools.
- Parallel ordering — parallel tool batches may complete in different order; observations land in different sequence unless the runtime sorts them.
- External state — saga ledgers, checkpoints, and idempotency keys from half-finished runs change the next turn.
- Missing context — traces redact PII or truncate large tool payloads; replay without the full observation body misleads the model.
Reproducibility requires treating a run as a versioned bundle, not a screenshot of spans.
Run manifest: the minimum reproducibility contract
Harbor stores a run manifest at the start of every production agent session and updates it on each step:
run_id,parent_run_id, tenant, and policy version- Initial user message and system prompt hash (not always the raw text if secrets embed)
- Model route: provider, model ID, API version, decoding params
- Tool catalog snapshot hash (names, schemas, compensator pairs)
- Feature flags, permission scope, and approval gate outcomes
- Injected clock (
2026-06-12T03:14:00Z) when tools use relative dates - Random seed or
seed_mode: recordfor any local RNG in tools
On incident export, the manifest plus a step-indexed event log
becomes the replay input. The event log append-only records:
LLM_REQUEST, LLM_RESPONSE, TOOL_CALL,
TOOL_RESULT, CHECKPOINT, POLICY_DECISION.
Each entry carries content hashes so tampering is detectable, aligned with
compliance audit patterns.
Full replay vs partial replay
- Full deterministic replay — LLM responses served from cassettes; tools return recorded JSON. Re-executes orchestration logic only. Use for regression tests and exact incident reproduction.
- Hybrid replay — cassettes for tools, live LLM with temperature 0 and pinned model. Use when testing prompt fixes against real model behavior while holding external APIs constant.
- Live re-run — same manifest constraints but fresh LLM and tools. Use for flake detection, not single-incident RCA.
LLM cassettes and tool stubs
An LLM cassette maps a normalized request fingerprint to the exact completion bytes returned in production:
fingerprint = hash(model_id, messages[], tools[], decoding_params)
cassette[fingerprint] = { completion, logprobs?, finish_reason, usage }
Normalization matters: strip nondeterministic metadata, canonicalize JSON key order in tool definitions, and hash system prompts by registry version ID rather than raw string when templates include dynamic timestamps.
Tool stubs replay TOOL_RESULT events by
(tool_name, arguments_hash, step_index). Step index disambiguates
identical calls in one run (e.g. two get_pod calls with the same
name but different namespaces on different turns). For mutating tools, stubs
return recorded responses without side effects; compensators are not invoked
during replay unless you run a
saga simulation
mode.
Harbor's harness fails CI if a replayed run produces a fingerprint miss — signaling orchestration drift or cassette staleness before merge.
Controlling nondeterminism at the source
Replay is cheaper when production is already replay-friendly:
- Pin decoding in prod for high-risk agents — temperature 0 for migration, finance, and provisioning agents; document exceptions.
- Stable observation ordering — sort parallel tool results
by
step_indexbefore appending to context. - Inject clocks — pass
now()from the runtime into tools instead of calling wall clock inside tool implementations. - Version everything — model route, prompt tag, tool schema bundle, policy engine ruleset.
- Record, don't rely on re-fetch — store full tool payloads in the event log (with PII scrubbing at write time), not just span summaries.
- Separate replay sandboxes — replay never hits production APIs even if cassettes miss; default to stub mode with explicit opt-in to live.
Harbor DevOps refactor walkthrough
Harbor's incident agent had failed because step 7's
describe_cluster observation listed zero staging taints on a
cluster that did have an exclusion taint in prod — a transient
cache bug in their wrapper. Re-running with live tools never reproduced step 7
exactly; the model sometimes skipped the taint check entirely.
After refactor:
- Every prod run wrote manifest + event log to object storage within 200 ms of step completion.
- On-call clicked “export replay bundle” from the trace UI; bundle included manifest, cassettes for all LLM calls, and tool results.
- Local CLI
harbor-agent replay --bundle id=run_8f2a --mode fullre-executed orchestration; step 7 matched prod byte-for-byte. - Engineers swapped only the taint-parser fix into the tool stub layer and re-ran hybrid mode (live LLM, stubbed tools) to verify the model now blocked apply.
- Regression test added: full replay of run_8f2a must end in
POLICY_DENYafter fix merge.
Metrics: reproducibility 73% → 94% (remaining gap = runs before manifest rollout and PII-redacted payloads too short to replay). P1 MTTR 6.8 h → 1.4 h. CI replay suite grew to 840 cassettes covering top agent workflows.
Technique decision table
| Approach | Best for | Weak when | Harbor-style signal |
|---|---|---|---|
| Trace-only debug | Quick latency and error taxonomy | Exact trajectory reproduction | 10 replays, 7 different outcomes |
| Full record-replay | Incident RCA, regression locks | Testing novel model behavior | Need byte-identical repro |
| Hybrid (stub tools, live LLM) | Prompt and policy fixes | Tool-wrapper bugs | Model chose wrong action given fixed facts |
| Live re-run with pinned manifest | Flake rate measurement | Single incident debugging | Probabilistic pass/fail gates |
| Counterfactual replay | “What if step 5 returned X?&rdash; | Large search space | Policy what-if analysis |
Common pitfalls
- Cassette keyed only on user message — ignores tool observations that changed the next LLM request; false cache hits.
- Storing truncated tool results in traces — replay context differs from prod; model takes a new path immediately.
- Replaying against live mutating APIs — doubles charges, sends duplicate emails; always stub writes during full replay.
- Ignoring parallel step order — cassettes miss when batch completion order flips.
- No manifest on subagent runs — parent replay succeeds; delegated branch diverges silently.
- PII redaction without replay tier — compliance scrubs bodies engineers need; use role-gated full-fidelity export.
- Cassette rot after prompt change — mass false failures; version cassettes per prompt tag and prune on registry bump.
- Assuming temperature 0 is enough — provider updates, batching, and tool variance still break live re-runs.
Engineer checklist
- Emit a run manifest at session start with model, prompt, tool, and policy versions.
- Append-only event log: LLM requests/responses, tool calls/results, checkpoints.
- Normalize LLM request fingerprints; store completion bytes in cassettes.
- Key tool stubs by tool name, arguments hash, and step index.
- Sort parallel tool observations before context injection.
- Inject runtime clock into date-sensitive tools.
- One-click export replay bundle from trace or audit UI.
- CLI/API: full, hybrid, and live re-run modes with sandbox defaults.
- Fail CI on cassette miss when orchestration code changes.
- Add regression replay for every P1/P2 agent incident after fix.
- Role-gate full-fidelity exports; redact only in general audit streams.
- Include subagent manifests in parent bundle exports.
Key takeaways
- Traces show history; replay bundles let you rerun it.
- Run manifests pin every versioned input — model, prompt, tools, policy, clock.
- LLM cassettes plus tool stubs enable deterministic full replay without prod side effects.
- Hybrid replay separates model bugs from tool and environment bugs.
- Harbor cut P1 MTTR 6.8 h → 1.4 h by making incidents exportable artifacts, not archaeology exercises.
Related reading
- LLM agent testing, mocking and E2E harness explained — cassettes, fakes, and CI integration
- Agent run audit trail and compliance logging explained — immutable event streams
- Agent observability and tracing explained — spans and production debug
- Durable agent execution and checkpointing explained — resume state in replay bundles