LLM agent testing, mocking and E2E harness explained
Harbor DevOps merged a pull request that added a “rollback failed deploy” agent workflow. The nightly regression suite passed, but only because it called the real OpenAI API with a live Postgres staging database. Three days later, a dependency bump changed token sampling enough that the agent skipped the verify_healthcheck tool call, and production rolled back the wrong Kubernetes namespace. The post-mortem found that 38% of CI runs on agent PRs had been failing intermittently; engineers had been retrying until green.
Agent testing harnesses exercise orchestration code, tool wiring, and policy gates without depending on non-deterministic model output or shared staging data on every commit. This is distinct from offline evaluation (scoring trajectories on golden tasks) and from sandbox isolation (containing blast radius).
Harbor split tests into three layers: pure unit tests with stubbed LLM responses, integration tests with record-replay cassettes, and a thin nightly E2E slice against a disposable environment. PR failure rate on agent changes fell from 38% to 4%, and the namespace regression would have been caught by a deterministic trajectory assertion.
This guide covers the agent test pyramid, mocking strategies, cassette design, harness APIs, CI placement, the Harbor DevOps refactor, a technique decision table, pitfalls, and a production checklist.
The agent test pyramid
Single-turn prompt tests do not cover multi-step agents. Map coverage to layers with different cost and flakiness profiles.
Unit layer (milliseconds, no network)
- Tool schema validation — argument parsers reject malformed JSON before any HTTP call
- Policy gates — permission tiers block mutating tools when approval state is missing
- State machine transitions — run FSM legal edges without invoking the model
- Prompt assembly — snapshot system prompts and tool manifests; catch accidental token bloat
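As a rough illustration of the unit layer, the sketch below checks a policy gate and a tool argument parser with no model and no network. The tool table, the tier numbers, and the helper names `isToolAllowed` and `parseRollbackArgs` are hypothetical, chosen only to mirror the rollback example in this guide.

```typescript
// Hypothetical tool metadata: whether a tool mutates state and which approval tier it needs.
interface ToolPolicy {
  mutating: boolean;
  minApprovalTier: number;
}

const TOOL_POLICIES: Record<string, ToolPolicy> = {
  get_deploy: { mutating: false, minApprovalTier: 0 },
  verify_healthcheck: { mutating: false, minApprovalTier: 0 },
  rollback: { mutating: true, minApprovalTier: 2 },
};

// Policy gate: unknown tools are denied; mutating tools need an explicit, sufficient tier.
function isToolAllowed(tool: string, approvalTier: number | undefined): boolean {
  const policy = TOOL_POLICIES[tool];
  if (policy === undefined) return false;
  if (!policy.mutating) return true;
  return approvalTier !== undefined && approvalTier >= policy.minApprovalTier;
}

interface RollbackArgs {
  namespace: string;
  version: string;
}

type ParseResult = { ok: true; args: RollbackArgs } | { ok: false; error: string };

// Argument parser: rejects malformed JSON and wrong field types before any side effect.
function parseRollbackArgs(raw: string): ParseResult {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { ok: false, error: "arguments are not valid JSON" };
  }
  if (typeof value !== "object" || value === null) {
    return { ok: false, error: "arguments must be a JSON object" };
  }
  const fields = value as Record<string, unknown>;
  const namespace = fields["namespace"];
  const version = fields["version"];
  if (typeof namespace !== "string" || typeof version !== "string") {
    return { ok: false, error: "namespace and version must be strings" };
  }
  return { ok: true, args: { namespace, version } };
}
```

Both functions are pure, so a test runner can exercise every branch in milliseconds.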
Integration layer (seconds, mocked LLM + mocked tools)
Feed canned assistant messages that mirror a recorded trajectory. Assert the harness invokes the correct next tool with expected arguments, handles tool errors, and reaches a terminal state. This is where most agent regressions belong.
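To make the integration layer concrete, here is a minimal sketch of an orchestration loop driven by canned assistant messages. The `runAgent` function, the message shape, and the terminal-state names are assumptions for illustration, and the loop is synchronous to keep the example short. In a real suite the code under test would be your production orchestrator entrypoint, not a copy written for the test.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AssistantMessage {
  content: string;
  toolCalls: ToolCall[];
}

// A tool takes parsed arguments and returns output text; it throws on failure.
type ToolFn = (args: Record<string, unknown>) => string;

interface RunResult {
  toolsCalled: ToolCall[];
  terminalState: "succeeded" | "failed";
}

// Stand-in orchestrator: ask the model, run the requested tools, stop when no tool is requested.
function runAgent(
  nextMessage: (toolOutputs: string[]) => AssistantMessage,
  tools: Record<string, ToolFn>,
  maxSteps: number
): RunResult {
  const toolsCalled: ToolCall[] = [];
  let outputs: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const message = nextMessage(outputs);
    if (message.toolCalls.length === 0) {
      return { toolsCalled, terminalState: "succeeded" };
    }
    outputs = [];
    for (const call of message.toolCalls) {
      const tool = tools[call.name];
      if (tool === undefined) return { toolsCalled, terminalState: "failed" };
      toolsCalled.push(call);
      try {
        outputs.push(tool(call.args));
      } catch {
        return { toolsCalled, terminalState: "failed" }; // tool error ends the run
      }
    }
  }
  return { toolsCalled, terminalState: "failed" }; // step budget exhausted
}

// Canned model: replays a fixed trajectory, ignoring the tool outputs it is given.
function cannedModel(messages: AssistantMessage[]): (toolOutputs: string[]) => AssistantMessage {
  const queue = messages.slice();
  return () => {
    const next = queue.shift();
    if (next === undefined) throw new Error("canned trajectory exhausted");
    return next;
  };
}
```

A test then passes `cannedModel([...])` plus a table of fake tools to `runAgent` and asserts on `toolsCalled` and `terminalState`.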
E2E layer (minutes, real stack slice)
A small, curated set of tasks runs against ephemeral infrastructure with either live models (temperature 0, fixed seed where supported) or approved cassettes refreshed on a schedule. Reserve this layer for cross-service contracts and deploy smoke tests; do not run it on every PR.
Mocking the LLM: fixtures, stubs and cassettes
Non-determinism is the root cause of most flakes. Choose a strategy based on what you need to prove.
Hard-coded response queues
Push an ordered list of assistant messages into a fake LLM client. Each chat.completions.create call pops the next item. Fastest for unit tests; brittle when prompts change wording but tool calls stay the same.
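A response queue can be sketched in a few lines. The `QueuedLlmClient` class and its `complete` method below are hypothetical and synchronous for brevity; a real fake would mirror the async signature of whatever client interface your orchestrator depends on.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AssistantMessage {
  content: string;
  toolCalls: ToolCall[];
}

class QueuedLlmClient {
  // Prompts are recorded so tests can snapshot what the orchestrator actually sent.
  public readonly prompts: string[] = [];
  private readonly queue: AssistantMessage[];

  constructor(responses: AssistantMessage[]) {
    this.queue = responses.slice(); // copy so the caller's fixture array is not mutated
  }

  // Pops the next canned message; fails loudly when the agent makes an unplanned extra call.
  complete(prompt: string): AssistantMessage {
    this.prompts.push(prompt);
    const next = this.queue.shift();
    if (next === undefined) {
      throw new Error("QueuedLlmClient: no canned response left for call #" + this.prompts.length);
    }
    return next;
  }

  // Lets a test assert that every canned message was consumed.
  remaining(): number {
    return this.queue.length;
  }
}
```

Throwing on an empty queue is a deliberate choice: an agent that loops one step too many fails the test instead of silently receiving an empty reply.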
Record-replay cassettes
On first run (or a dedicated --record mode), call the real API and serialize request hashes plus responses to versioned YAML. CI replays from disk. Redact API keys and PII at write time. Re-record cassettes when you intentionally change prompts or model versions; treat cassette diffs like snapshot review.
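The record and replay halves can be sketched as follows. Everything here is an assumption made for illustration: the `Cassette` class, the in-memory JSON serialization (standing in for the YAML files described above), the FNV-1a hash (standing in for a real digest such as SHA-256), and the key-shaped redaction pattern.

```typescript
interface LlmRequest {
  model: string;
  messages: { role: string; content: string }[];
}

interface LlmResponse {
  content: string;
}

type LlmCall = (req: LlmRequest) => LlmResponse;

// FNV-1a 32-bit hash of the normalized request; a placeholder for a cryptographic digest.
function hashRequest(req: LlmRequest): string {
  const text = JSON.stringify({ model: req.model, messages: req.messages });
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Redaction happens at write time so secrets never reach the stored cassette.
function redact(text: string): string {
  return text.replace(/sk-[A-Za-z0-9_-]{8,}/g, "[REDACTED_API_KEY]");
}

class Cassette {
  private readonly entries: Record<string, LlmResponse>;

  constructor(entries: Record<string, LlmResponse> = {}) {
    this.entries = entries;
  }

  // Record mode: call the live client once and store the redacted response under the request hash.
  record(req: LlmRequest, live: LlmCall): LlmResponse {
    const response = { content: redact(live(req).content) };
    this.entries[hashRequest(req)] = response;
    return response;
  }

  // Replay mode: never touches the network; an unknown request is a test failure.
  replay(req: LlmRequest): LlmResponse {
    const key = hashRequest(req);
    const hit = this.entries[key];
    if (hit === undefined) throw new Error("cassette miss for request hash " + key);
    return hit;
  }

  serialize(): string {
    return JSON.stringify(this.entries, null, 2);
  }

  static parse(text: string): Cassette {
    return new Cassette(JSON.parse(text) as Record<string, LlmResponse>);
  }
}
```

Because the key is a hash of the full request, any prompt or model change produces a cassette miss, which is the signal to re-record and review the diff.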
Tool-choice matchers
Instead of matching full natural-language replies, assert structural properties: “next message must include a tool_calls entry for list_pods with namespace label.” Pair with minimal stub text so tests survive harmless phrasing drift.
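A structural matcher of this kind might look like the sketch below. The `hasToolCall` helper and the message shape are hypothetical; the comparison is shallow, which is usually enough for flat tool arguments.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AssistantMessage {
  content: string;
  toolCalls: ToolCall[];
}

// True when the message calls `name` and every key in `argSubset` matches exactly.
// Extra arguments and the prose in `content` are ignored on purpose.
function hasToolCall(
  message: AssistantMessage,
  name: string,
  argSubset: Record<string, unknown> = {}
): boolean {
  return message.toolCalls.some((call) => {
    if (call.name !== name) return false;
    return Object.keys(argSubset).every((key) => call.args[key] === argSubset[key]);
  });
}
```

Two replies with different wording but the same `list_pods` call both satisfy `hasToolCall(message, "list_pods", { namespace: "canary" })`.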
Seeded live calls (sparingly)
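Some providers expose a seed parameter for best-effort repeatability, and logprobs can help spot sampling drift. Neither guarantees identical output across runs, so use them only in nightly jobs with budget caps; never as the default PR gate.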
Mocking tools and external systems
Agents fail in the glue between model decisions and side effects. Fake the boundaries, not the orchestration loop.
- In-memory fakes — FakeK8sClient with preset pod lists; assert rollback targets the canary namespace
- HTTP wire mocks — Mountebank, WireMock, or msw for REST tools; match on method, path, and body subset
- Fault injection — Return 503 on third execute_sql call; verify retry policy and fallback resilience
- Latency simulation — Delay tool responses to exercise timeout and cancel paths without wall-clock waits in unit tests (injectable clocks)
- Idempotency keys — Fake stores dedupe by key; replay same tool call twice and assert single side effect
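Two of the items above, fault injection and idempotency keys, can be sketched with small in-memory fakes. The class names, the `executeWithRetry` policy, and the result shape are hypothetical; the retry helper stands in for whatever retry policy your orchestrator actually implements.

```typescript
interface SqlResult {
  status: number;
  rows: string[];
}

// Fake execute_sql tool: returns 503 on the configured call number, 200 otherwise.
class FaultySqlFake {
  calls = 0;
  private readonly failOnCall: number;

  constructor(failOnCall: number) {
    this.failOnCall = failOnCall;
  }

  executeSql(query: string): SqlResult {
    this.calls += 1;
    if (this.calls === this.failOnCall) return { status: 503, rows: [] };
    return { status: 200, rows: ["row for: " + query] };
  }
}

// Assumed retry policy under test: retry on 503 until maxAttempts is reached.
function executeWithRetry(tool: FaultySqlFake, query: string, maxAttempts: number): SqlResult {
  let result = tool.executeSql(query);
  for (let attempt = 1; attempt < maxAttempts && result.status === 503; attempt++) {
    result = tool.executeSql(query);
  }
  return result;
}

// Fake side-effect store that dedupes by idempotency key.
class IdempotentRollbackFake {
  readonly applied: string[] = [];
  private readonly seenKeys: Record<string, boolean> = {};

  rollback(namespace: string, idempotencyKey: string): void {
    if (this.seenKeys[idempotencyKey]) return; // replayed call: no second side effect
    this.seenKeys[idempotencyKey] = true;
    this.applied.push(namespace);
  }
}
```

With `new FaultySqlFake(3)`, the third call fails and the fourth succeeds, so a test can assert both the final status and the exact call count.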
Register mocks in a dependency-injection container keyed by test name so parallel CI workers do not share mutable global state.
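One possible shape for such a container is a registry scoped by test name. The `FakeRegistry` class below is a hypothetical sketch, not a specific DI library; the port names are plain strings for simplicity.

```typescript
class FakeRegistry {
  // Outer key: test name. Inner key: port name such as "k8s" or "llm".
  private readonly scopes: Record<string, Record<string, unknown>> = {};

  register(testName: string, port: string, fake: unknown): void {
    const scope = this.scopes[testName] || {};
    scope[port] = fake;
    this.scopes[testName] = scope;
  }

  // The caller states the expected type; the registry does not verify it at runtime.
  resolve<T>(testName: string, port: string): T {
    const scope = this.scopes[testName];
    if (scope === undefined || !(port in scope)) {
      throw new Error("no fake registered for port '" + port + "' in test '" + testName + "'");
    }
    return scope[port] as T;
  }

  // Call from the test's teardown hook so fakes do not outlive their test.
  dispose(testName: string): void {
    delete this.scopes[testName];
  }
}
```

Each test registers its own fake instances, so mutating one test's fake cannot affect another test resolving the same port.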
Harness API design
A good harness looks like a test SDK, not a pile of copy-pasted setup blocks.
Scenario builder
// Keep a reference to the fake so assertions can inspect it after the run.
const fakeK8s = new FakeK8s({ namespaces: ['canary', 'prod'] });
const run = await harness
  .withLlmCassette('rollback-happy-path.yaml')
  .withToolFake('k8s', fakeK8s)
  .withInitialState({ run_id: 'test-1', approval_tier: 2 })
  .executeAgent({ user: 'Rollback deploy v2.4.1' });
Trajectory assertions
expect(run.toolsCalled).toEqualSequence(['get_deploy', 'verify_healthcheck', 'rollback']);
expect(run.terminalState).toBe('succeeded');
expect(fakeK8s.rollbackTarget).toBe('canary');
Trace export for debugging
On failure, dump OpenTelemetry-style spans (or a JSON trace) to the CI artifact store. Engineers should see which stub returned unexpected data without reproducing locally with live keys.
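A minimal trace recorder could look like this. The span fields are loosely modeled on OpenTelemetry concepts, but the JSON layout and the `TraceRecorder` class are assumptions for illustration, not the OTLP wire format. Writing the string to the CI artifact directory is left to the test runner's failure hook.

```typescript
interface Span {
  name: string; // e.g. "tool:get_deploy" or "llm:completion"
  startMs: number;
  endMs: number;
  attributes: Record<string, string>;
  status: "ok" | "error";
}

class TraceRecorder {
  private readonly spans: Span[] = [];

  record(span: Span): void {
    this.spans.push(span);
  }

  // JSON document a CI step could upload as an artifact for the failing run.
  toJson(runId: string): string {
    return JSON.stringify({ runId, spans: this.spans }, null, 2);
  }

  // First failing span: usually the stub that returned unexpected data.
  firstError(): Span | undefined {
    for (const span of this.spans) {
      if (span.status === "error") return span;
    }
    return undefined;
  }
}
```

Putting the stub name in `attributes` lets an engineer see which fake misbehaved without rerunning the scenario locally.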
Golden task files
Store scenario definitions as data: user message, expected tool sequence, optional negative cases. Share the same files between CI harness and offline eval — eval scores quality; harness asserts invariants.
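As one way to treat scenarios as data, the sketch below loads golden tasks from JSON and checks a recorded trajectory against them. The file format, the field names, and the helpers `parseGoldenTasks` and `checkTrajectory` are assumptions; YAML or another format would work equally well.

```typescript
interface GoldenTask {
  id: string;
  userMessage: string;
  expectedTools: string[];
  forbiddenTools?: string[]; // optional negative cases
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Validates the file shape up front so a typo in a task fails fast with a clear message.
function parseGoldenTasks(json: string): GoldenTask[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) throw new Error("golden task file must be a JSON array");
  return data.map((item: unknown, index: number) => {
    const task = item as Partial<GoldenTask> | null;
    if (
      task === null ||
      typeof task !== "object" ||
      typeof task.id !== "string" ||
      typeof task.userMessage !== "string" ||
      !isStringArray(task.expectedTools)
    ) {
      throw new Error("golden task #" + index + " is malformed");
    }
    return {
      id: task.id,
      userMessage: task.userMessage,
      expectedTools: task.expectedTools,
      forbiddenTools: isStringArray(task.forbiddenTools) ? task.forbiddenTools : [],
    };
  });
}

// Returns a list of invariant violations; an empty list means the trajectory passes.
function checkTrajectory(task: GoldenTask, toolsCalled: string[]): string[] {
  const violations: string[] = [];
  const sameSequence =
    toolsCalled.length === task.expectedTools.length &&
    toolsCalled.every((name, i) => name === task.expectedTools[i]);
  if (!sameSequence) {
    violations.push(
      "expected [" + task.expectedTools.join(", ") + "] but got [" + toolsCalled.join(", ") + "]"
    );
  }
  for (const tool of task.forbiddenTools || []) {
    if (toolsCalled.indexOf(tool) !== -1) violations.push("forbidden tool called: " + tool);
  }
  return violations;
}
```

The harness would fail a test on any violation, while an offline eval job could read the same file and score the trajectory instead.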
CI placement and flake control
- PR gate — unit + integration with cassettes; target < 90 s total on agent packages
- Merge queue — optional replay against fresh cassettes if multiple agents touched shared prompts
- Nightly — live-model E2E on disposable env; refresh cassettes when drift > threshold
- Quarantine — flaky tests get an issue owner; do not add retry: 3 without fixing root cause
- Parallelism — isolate DB/schema per worker; agents that write shared staging tables will race
Tag tests with @agent and @cassette-v3 so model-version upgrades trigger selective re-recording instead of turning the full suite red.
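Selective re-recording can be driven by cassette metadata. The sketch below assumes each cassette stores the model ID and prompt hash it was recorded with, plus its test tags; the `CassetteMeta` shape and the `selectStale` helper are hypothetical.

```typescript
interface CassetteMeta {
  file: string;
  modelId: string; // model the cassette was recorded against
  promptHash: string; // hash of the prompt template at record time
  tags: string[];
}

// A cassette is stale when the model or the prompt has changed since it was recorded.
function needsRerecord(meta: CassetteMeta, currentModelId: string, currentPromptHash: string): boolean {
  return meta.modelId !== currentModelId || meta.promptHash !== currentPromptHash;
}

// Returns the files to re-record for one tag, leaving unrelated cassettes untouched.
function selectStale(
  all: CassetteMeta[],
  tag: string,
  currentModelId: string,
  currentPromptHash: string
): string[] {
  return all
    .filter((meta) => meta.tags.indexOf(tag) !== -1)
    .filter((meta) => needsRerecord(meta, currentModelId, currentPromptHash))
    .map((meta) => meta.file);
}
```

A model upgrade job could call `selectStale` with the new model ID and re-record only the returned files.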
Harbor DevOps refactor
Before the harness, Harbor’s agent package had twelve integration tests that each called OpenAI and a shared staging cluster. Flakes clustered around rate limits, stale deploy fixtures, and non-reproducible tool ordering when the model paraphrased success messages.
The team introduced AgentTestHarness with injectable LLM and tool ports, migrated eight scenarios to YAML cassettes, and added three pure unit tests for the rollback FSM. A new negative case asserted verify_healthcheck must run before rollback, the exact gap that caused the namespace incident. PR agent-test failure rate dropped from 38% to 4%; median CI time on agent PRs fell from 14 minutes to 3.2 minutes because live API calls moved to nightly only.
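The ordering invariant and the FSM unit tests can be approximated as below. This is not Harbor's code: the state names, the edge table, and the helpers `calledBefore` and `canTransition` are illustrative assumptions.

```typescript
// True when every call to `later` is preceded by at least one call to `earlier`.
function calledBefore(toolsCalled: string[], earlier: string, later: string): boolean {
  let seenEarlier = false;
  for (const name of toolsCalled) {
    if (name === earlier) seenEarlier = true;
    if (name === later && !seenEarlier) return false;
  }
  return true;
}

type RollbackState = "planned" | "health_verified" | "rolling_back" | "succeeded" | "failed";

// Hypothetical legal edges: a rollback cannot start until the health check has run.
const LEGAL_EDGES: Record<RollbackState, RollbackState[]> = {
  planned: ["health_verified", "failed"],
  health_verified: ["rolling_back", "failed"],
  rolling_back: ["succeeded", "failed"],
  succeeded: [],
  failed: [],
};

function canTransition(from: RollbackState, to: RollbackState): boolean {
  return LEGAL_EDGES[from].indexOf(to) !== -1;
}
```

A negative test feeds a trajectory that skips `verify_healthcheck` and expects `calledBefore` to return false, which is the deterministic check that would have flagged the incident.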
Harbor wired harness trace dumps into the same backend as production observability, so failing CI runs were inspectable with the same UI engineers used for on-call.
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Live API on every PR | Early prototypes, zero harness investment | Flakes, cost, non-reproducible failures at scale |
| Hard-coded LLM stubs | FSM, policy, and parser unit tests | Prompt wording changes break tests with no functional regression |
| Record-replay cassettes | Stable integration regression on tool trajectories | Model upgrades require deliberate re-record review |
| Eval-only gate (no harness) | Quality benchmarking, leaderboard tracking | Misses wiring bugs eval scenarios do not cover |
| Nightly live E2E | Cross-service contracts, provider drift detection | Too slow and expensive for per-commit feedback |
Common pitfalls
- Testing the model, not the agent — asserting exact prose instead of tool calls and terminal state.
- Shared mutable staging — parallel CI workers corrupt each other’s data; use per-test fakes or isolated schemas.
- Cassettes with secrets — API keys in committed YAML; redact at record time and scan in pre-commit.
- No negative paths — only happy-path cassettes; production fails on tool 503s and missing approvals.
- Harness bypasses production code — duplicate loop logic in tests; always execute the same orchestrator entrypoint.
- Ignoring checkpoint resume — crash mid-run and resume must be tested when using durable execution.
- Retry until green — masks flake rate; track test_flake_rate as an engineering metric.
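The last pitfall depends on having a flake-rate number to track. The sketch below uses one possible definition, stated here as an assumption: a test on a given commit counts as flaky when it has both a failing and a passing attempt. The `TestAttempt` shape and the `flakeRate` helper are hypothetical.

```typescript
interface TestAttempt {
  test: string;
  commit: string;
  passed: boolean;
}

// Fraction of (test, commit) pairs that both failed and passed, i.e. went green only on retry.
function flakeRate(attempts: TestAttempt[]): number {
  const seen: Record<string, { passed: boolean; failed: boolean }> = {};
  for (const attempt of attempts) {
    const key = attempt.test + "@" + attempt.commit;
    const entry = seen[key] || { passed: false, failed: false };
    if (attempt.passed) entry.passed = true;
    else entry.failed = true;
    seen[key] = entry;
  }
  const keys = Object.keys(seen);
  if (keys.length === 0) return 0;
  const flaky = keys.filter((key) => seen[key].passed && seen[key].failed).length;
  return flaky / keys.length;
}
```

A test that fails consistently is not counted as flaky under this definition; it is a plain failure and should block the merge.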
Production checklist
- Define agent test pyramid: unit, integration (cassettes), nightly E2E.
- Injectable LLM port with stub, cassette, and live implementations.
- Tool fakes with fault injection and idempotency assertions.
- Scenario builder API shared across packages.
- Trajectory assertions on tool name, args, and order — not full text.
- Version cassettes with model ID and prompt hash metadata.
- Redact secrets and PII when recording cassettes.
- Export trace JSON on CI failure to artifact store.
- Align golden task files with offline eval datasets.
- Test cancel, timeout, and resume paths with injectable clocks.
- PR gate under 90 s for agent integration suite.
- Track and alert on test flake rate; quarantine chronic offenders.
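The injectable-clock item in the checklist tends to need a little scaffolding, so here is a minimal sketch. The `Clock` interface, the `FakeClock` class, and the `callWithTimeout` wrapper are assumptions; the wrapper checks the deadline after the tool returns rather than cancelling it mid-flight, which keeps the example synchronous.

```typescript
interface Clock {
  now(): number; // milliseconds
}

// Test clock: time moves only when the test (or a fake tool) advances it.
class FakeClock implements Clock {
  private current = 0;

  now(): number {
    return this.current;
  }

  advance(ms: number): void {
    this.current += ms;
  }
}

type TimedResult = { status: "ok"; value: string } | { status: "timeout" };

// Deadline check against the injected clock; production code would pass a wall-clock implementation.
function callWithTimeout(clock: Clock, timeoutMs: number, tool: () => string): TimedResult {
  const start = clock.now();
  const value = tool();
  if (clock.now() - start > timeoutMs) return { status: "timeout" };
  return { status: "ok", value };
}
```

A fake tool simulates latency by calling `clock.advance(5000)` before returning, so the timeout path runs instantly with no real waiting.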
Key takeaways
- Separate harness from eval — harness proves wiring; eval scores quality on golden tasks.
- Harbor cut PR failures 38% → 4% with cassettes and tool fakes.
- Assert tool trajectories, not verbatim model prose.
- Record-replay balances realism and determinism for CI.
- Nightly live E2E catches provider drift without blocking every commit.
Related reading
- Agent evaluation and benchmarking — trajectory scoring and regression suites on golden tasks
- Sandbox execution — isolating tool blast radius in tests and production
- Durable agent execution — resume and idempotency cases the harness must cover
- Agent observability — trace export for failing CI runs