LLM agent A/B test experiment assignment and traffic splitting systems explained
Harbor Insights, a B2B analytics agent serving 400 enterprise tenants, ran
nine simultaneous prompt and retrieval experiments. Each microservice hashed
user_id with a different salt: the RAG reranker assigned Alice to
variant B while the main orchestrator assigned her to variant A. Session-level
metrics looked positive for “winning” bundles, but post-hoc replay
showed 52% of promoted variants were statistical noise or
interaction artifacts, not real improvements. For two quarters, the roadmap
shipped regressions masked as wins.
The fix was a centralized experiment assignment plane: sticky bucketing on a stable unit id, atomic variant bundles that move together across services, mutual-exclusion groups, guardrail metrics with automatic holdouts, and assignment records attached to every run trace. False-positive promotions fell from 52% to 7.8% over two release cycles. This guide explains what agent A/B testing is, how it differs from canaries and feature flags, bucketing units, traffic splits, experiment lifecycle FSMs, the Harbor Insights refactor, a decision table, pitfalls, and a production checklist.
What agent A/B testing measures
An agent A/B test compares two or more variant bundles on live traffic while holding everything else as constant as production allows. A variant bundle is not just a prompt string — it is the full behavioral slice that should move together:
- Prompt template revision — system and task instructions.
- Model route — primary model, temperature, max tokens.
- Tool set and policy — which tools are exposed and approval gates.
- Retrieval stack — index version, reranker, hybrid weights.
- Loop parameters — reflection on/off, max iterations, parallel tool cap.
Primary metrics are usually task success (resolution rate, human thumbs-up, automated rubric score), latency, and cost per successful run. Guardrail metrics catch collateral damage: escalation rate, policy violations, tool error rate, and p95 latency. Pair experiments with offline evaluation suites so you do not promote variants that win on one narrow metric but fail broad regression tests.
Assignment units and sticky bucketing
The assignment unit is the entity whose experience must stay consistent across sessions and services. Common choices:
| Unit | Sticky across sessions | Best for | Risk |
|---|---|---|---|
| user_id | Yes | Product UX, habit formation | Long experiments dilute if users churn |
| tenant_id | Yes | B2B contracts, support load per customer | Small tenants never reach significance |
| session_id | No (new session re-rolls) | Short-lived chat, low-stakes tasks | User sees flip-flopping behavior |
| run_id | No | One-shot API calls, batch jobs | Cannot measure multi-turn outcomes |
Sticky bucketing means the same unit always maps to the same variant for the lifetime of an experiment (unless explicitly re-randomized at a defined boundary). Implementation pattern:
    bucket = stable_hash(
        salt = experiment.salt,
        unit = assignment_unit_id,
        experiment_id = "exp_reflection_v3"
    ) % 10000
    if bucket < experiment.control_weight_bp:
        variant = "control"
    else:
        variant = pick_treatment(bucket, experiment.treatments)
Use a stable hash (Murmur3, SHA-256 truncated) with a
per-experiment salt stored in the experiment registry — not
hash(user_id) % 2 copy-pasted in three repos. Log
experiment_id, variant_id, bucket, and
assignment_unit on the run root span for every assignment.
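A minimal runnable sketch of the pattern above, in Python. It assumes SHA-256 truncated to 64 bits as the stable hash. The `Experiment` record, its field names, and the rule that unallocated buckets fall back to control are illustrative assumptions, not the API of any particular registry. Python's built-in `hash()` is unsuitable here because string hashing is randomized per process by default.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """Hypothetical registry record; field names are illustrative."""
    experiment_id: str
    salt: str
    control_weight_bp: int                   # control share in basis points (0..10000)
    treatments: tuple[tuple[str, int], ...]  # (variant_id, weight_bp) pairs


def stable_bucket(exp: Experiment, unit_id: str) -> int:
    """Map a unit to a bucket in [0, 10000); the result is identical in every process."""
    key = f"{exp.salt}:{exp.experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def assign_variant(exp: Experiment, unit_id: str) -> tuple[str, int]:
    """Return (variant_id, bucket) so both can be logged on the run root span."""
    bucket = stable_bucket(exp, unit_id)
    if bucket < exp.control_weight_bp:
        return "control", bucket
    upper = exp.control_weight_bp
    for variant_id, weight_bp in exp.treatments:
        upper += weight_bp
        if bucket < upper:
            return variant_id, bucket
    # Assumption: if the weights sum to less than 10000, the remainder stays on control.
    return "control", bucket
```

Calling `assign_variant(exp, "user-42")` twice, from any service or process, yields the same variant and bucket as long as the salt and experiment id come from the shared registry.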
Traffic splitting at ingress
Assignment should happen once at ingress (API gateway, queue
consumer, or start_run handler) and propagate as an immutable
ExperimentContext attached to the run. Downstream services read
the context; they do not re-roll.
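One way to make the context immutable and easy to propagate is a frozen dataclass serialized into a request header. This is a sketch under assumptions: the field set mirrors the values logged on the run root span, and the serialization helpers (and any header name such as `x-experiment-context`) are hypothetical.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class ExperimentContext:
    """Assigned once at ingress; downstream services only read it."""
    experiment_id: str
    variant_id: str
    bucket: int
    assignment_unit: str

    def to_header(self) -> str:
        # Deterministic JSON so the same context always serializes identically.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_header(cls, value: str) -> "ExperimentContext":
        return cls(**json.loads(value))


ctx = ExperimentContext("exp_reflection_v3", "treatment_b", 9421, "user_id")
restored = ExperimentContext.from_header(ctx.to_header())
```

Because the dataclass is frozen, an attempt such as `ctx.variant_id = "control"` raises `FrozenInstanceError`, which makes an accidental downstream re-roll fail loudly instead of silently changing the variant.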
Split types
- Percentage split — 90% control / 10% treatment; standard for gradual tests.
- Whitelist override — internal dogfood tenants pinned to treatment for QA.
- Geographic or tier slice — enterprise tier only; avoids SMB noise.
- Holdout reserve — 5% never sees any experiment; measures cumulative lift vs true baseline.
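The four split types can be composed into a single resolver. The sketch below is one possible ordering (whitelist, then holdout, then tier slice, then percentage split); the function name, lane labels, and the choice to let whitelist overrides bypass the holdout are assumptions made for illustration.

```python
import hashlib

HOLDOUT_BP = 500  # 5% holdout reserve, in basis points


def _bucket(salt: str, unit_id: str) -> int:
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def resolve_lane(unit_id: str, tier: str, *, exp_salt: str, holdout_salt: str,
                 treatment_bp: int, eligible_tiers: set, whitelist: set) -> str:
    # 1. Whitelist override: dogfood units are pinned to treatment for QA.
    if unit_id in whitelist:
        return "treatment"
    # 2. Holdout reserve: uses its own salt so it is the same for every experiment.
    if _bucket(holdout_salt, unit_id) < HOLDOUT_BP:
        return "holdout"
    # 3. Tier slice: units outside the slice are served the global baseline.
    if tier not in eligible_tiers:
        return "ineligible"
    # 4. Percentage split on the per-experiment salt.
    return "treatment" if _bucket(exp_salt, unit_id) < treatment_bp else "control"
```

Whitelisted units should be excluded from analysis, since they were not randomly assigned.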
Integrate splits with dynamic configuration: the experiment resolver is one layer in the config stack. When an experiment ends, winning variants promote into the global snapshot; losing variants retire with a deprecation window.
For high-risk changes, run a shadow canary first (replay production inputs without serving outputs), then an A/B test on live traffic with guardrails. Shadow proves safety; A/B proves lift.
Variant bundles and mutual exclusion
The deadliest mistake is letting independent experiments assign different variants to different layers of the same run. Harbor's nine experiments created 512 theoretical combinations; production traffic hit hundreds of unplanned bundles.
Rules that prevent this:
- One active experiment per layer — at most one live test touching prompts, one touching retrieval, one touching model route.
- Mutual-exclusion groups — experiments in group main_loop cannot overlap; registry enforces at publish time.
- Atomic variant bundles — variant treatment_b is a manifest listing all pinned revisions; services load the manifest, not local hashes.
- Factorial only by design — if you need 2×2 factorial analysis, register it as one experiment with four cells, not two independent coin flips.
    variant_manifest = {
        "variant_id": "treatment_b",
        "prompt_rev": "support_triage@2.4.1",
        "model_route": "gpt-4.1-mini",
        "rag_index_rev": 1187,
        "reflection_enabled": true
    }
Pin manifests at run start and on checkpoint resume so multi-hour workflows do not cross variant boundaries mid-trajectory.
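A small sketch of pinning, assuming the checkpoint stores a serialized copy of the manifest taken at run start. The `Checkpoint` shape and the `start_run` / `resume_run` helpers are hypothetical names used only to illustrate the idea.

```python
import json
from dataclasses import dataclass


@dataclass
class Checkpoint:
    run_id: str
    step: int
    pinned_manifest_json: str  # frozen copy of the manifest captured at run start


def start_run(run_id: str, manifest: dict) -> Checkpoint:
    # Serialize so later registry changes cannot mutate the pinned copy.
    return Checkpoint(run_id, 0, json.dumps(manifest, sort_keys=True))


def resume_run(checkpoint: Checkpoint) -> dict:
    # Resume reads the pinned copy, never the registry's current manifest.
    return json.loads(checkpoint.pinned_manifest_json)


manifest = {
    "variant_id": "treatment_b",
    "prompt_rev": "support_triage@2.4.1",
    "model_route": "gpt-4.1-mini",
    "rag_index_rev": 1187,
    "reflection_enabled": True,
}
checkpoint = start_run("run-001", manifest)
manifest["rag_index_rev"] = 1190        # registry moves on while the run is suspended
resumed = resume_run(checkpoint)        # still sees rag_index_rev == 1187
```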
Experiment lifecycle FSM
Treat experiments as stateful resources with an explicit finite-state machine:
| State | Traffic | Allowed actions |
|---|---|---|
| DRAFT | 0% | Edit hypothesis, metrics, manifests; offline eval only |
| RUNNING | Configured split | Monitor; no manifest edits without new revision |
| PAUSED | 0% (all to control) | Investigate guardrail breach; preserves bucket assignments |
| CONCLUDED_WIN | Ramp down | Promote manifest to global config snapshot |
| CONCLUDED_LOSS | 0% | Archive; document learnings |
| ABORTED | Immediate 0% | Guardrail auto-trip or manual kill |
Automatic abort triggers: policy violation rate > 2× control for 15 minutes, p95 latency > SLO ceiling, or cost per run > budget cap. Abort must flip traffic in seconds via the config watch stream — not after a deploy.
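The lifecycle and abort triggers can be sketched as a transition map plus a predicate. The state names come from the table above; the allowed edges between states are an assumption (the table does not list them), as are the parameter names of `should_abort`.

```python
from enum import Enum


class State(Enum):
    DRAFT = "DRAFT"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CONCLUDED_WIN = "CONCLUDED_WIN"
    CONCLUDED_LOSS = "CONCLUDED_LOSS"
    ABORTED = "ABORTED"


# Assumed edges; terminal states have no outgoing transitions.
ALLOWED = {
    State.DRAFT: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.CONCLUDED_WIN, State.CONCLUDED_LOSS, State.ABORTED},
    State.PAUSED: {State.RUNNING, State.CONCLUDED_LOSS, State.ABORTED},
    State.CONCLUDED_WIN: set(),
    State.CONCLUDED_LOSS: set(),
    State.ABORTED: set(),
}


def transition(current: State, target: State) -> State:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def should_abort(*, treatment_violation_rate: float, control_violation_rate: float,
                 breach_minutes: float, p95_latency_ms: float, slo_ceiling_ms: float,
                 cost_per_run: float, budget_cap: float) -> bool:
    # Policy violations must exceed 2x control for a sustained 15 minutes.
    sustained_violations = (
        treatment_violation_rate > 2 * control_violation_rate and breach_minutes >= 15
    )
    return sustained_violations or p95_latency_ms > slo_ceiling_ms or cost_per_run > budget_cap
```

Rejecting illegal transitions in code keeps a concluded or aborted experiment from being silently restarted with stale bucket assignments.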
Statistical discipline and guardrails
Agent metrics are noisy: small samples, heavy tails, and weekday seasonality. Production systems should:
- Pre-register primary and guardrail metrics before RUNNING.
- Require minimum exposure (e.g. 5,000 runs per variant) before auto-conclude.
- Use sequential testing or fixed-horizon tests — not peek-and-promote daily.
- Slice results by tenant tier, task type, and model provider — aggregate lifts hide regressions.
- Keep a permanent holdout to detect drift when everything is “winning.”
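As an illustration of the exposure and fixed-horizon rules, the sketch below gates analysis on both conditions and only then computes a two-sided p-value for a difference in success rates. It assumes a pooled two-proportion z-test, which is one common fixed-horizon choice and not necessarily the right test for every agent metric; the function names and the 14-day default are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def can_conclude(n_control: int, n_treatment: int, days_elapsed: int,
                 *, min_runs: int = 5000, horizon_days: int = 14) -> bool:
    """Both the exposure floor and the pre-registered horizon must be met."""
    return min(n_control, n_treatment) >= min_runs and days_elapsed >= horizon_days


def two_proportion_p_value(success_c: int, n_c: int, success_t: int, n_t: int) -> float:
    """Two-sided p-value for H0: control and treatment success rates are equal."""
    pooled = (success_c + success_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0  # no variance at all: nothing to distinguish
    z = (success_t / n_t - success_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

The p-value is computed once, when `can_conclude` first returns true. Evaluating it daily and stopping at the first significant result is the peek-and-promote pattern the list warns against.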
Export assignment and outcome events to the audit trail so data science can
reproduce analysis without re-querying raw logs. Include prompt_rev, model_id,
and tool call outcomes, not just thumbs-up/down.
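A possible shape for one exported event, written as a JSON line. The schema is an assumption for illustration: the field names echo identifiers used in this guide, while `run_id`, `task_success`, and `emitted_at` are hypothetical additions.

```python
import json
from datetime import datetime, timezone


def build_outcome_event(*, run_id: str, experiment_id: str, variant_id: str, bucket: int,
                        assignment_unit: str, prompt_rev: str, model_id: str,
                        tool_outcomes: list, task_success: bool) -> str:
    """Serialize assignment and outcome together so analysis needs no log joins."""
    event = {
        "run_id": run_id,
        "experiment_id": experiment_id,
        "variant_id": variant_id,
        "bucket": bucket,
        "assignment_unit": assignment_unit,
        "prompt_rev": prompt_rev,
        "model_id": model_id,
        "tool_outcomes": tool_outcomes,
        "task_success": task_success,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True)


line = build_outcome_event(
    run_id="run-001", experiment_id="exp_reflection_v3", variant_id="treatment_b",
    bucket=9421, assignment_unit="user_id", prompt_rev="support_triage@2.4.1",
    model_id="gpt-4.1-mini",
    tool_outcomes=[{"tool": "search", "ok": True}, {"tool": "ticket_update", "ok": False}],
    task_success=True,
)
```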
Harbor Insights refactor
Root causes beyond inconsistent salts:
- Per-service assignment — three binaries, three hash functions, zero shared registry.
- Session vs user mismatch — retrieval sticky on session; main loop on user; multi-tab chaos.
- Overlapping experiments — no mutual-exclusion enforcement at publish time.
- Promotion without holdout — declared wins never validated against a clean control reserve.
- Missing assignment on traces — could not debug which variant produced a bad run.
Shipped fixes:
- Central experiment registry with FSM, mutual-exclusion groups, and manifest bundles.
- Single assignment at start_run; ExperimentContext propagated on all internal RPCs.
- 5% global holdout lane excluded from all experiments.
- Guardrail auto-pause wired to PagerDuty; median abort time 41 seconds.
- Promotion workflow requires offline eval pass + 14-day fixed-horizon live result.
False-positive promotions dropped from 52% to 7.8%. Median time from experiment conclusion to safe global rollout fell from 11 days to 3 days because manifests promoted directly into the config snapshot.
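The publish-time mutual-exclusion check in a central registry could be sketched as follows. The `ExperimentSpec` record and `check_publish` function are hypothetical, and treating PAUSED experiments as still occupying their group (because they preserve bucket assignments) is an assumption.

```python
from dataclasses import dataclass

LIVE_STATES = {"RUNNING", "PAUSED"}  # assumed: paused experiments still hold their group


@dataclass(frozen=True)
class ExperimentSpec:
    experiment_id: str
    exclusion_group: str
    state: str


class ExclusionConflict(Exception):
    """Raised when a publish would overlap a live experiment in the same group."""


def check_publish(candidate: ExperimentSpec, registry: list) -> None:
    for other in registry:
        if (
            other.experiment_id != candidate.experiment_id
            and other.exclusion_group == candidate.exclusion_group
            and other.state in LIVE_STATES
        ):
            raise ExclusionConflict(
                f"{candidate.experiment_id} overlaps {other.experiment_id} "
                f"in group {other.exclusion_group}"
            )


registry = [
    ExperimentSpec("exp_reflection_v3", "main_loop", "RUNNING"),
    ExperimentSpec("exp_reranker_v2", "retrieval", "CONCLUDED_LOSS"),
]
check_publish(ExperimentSpec("exp_reranker_v3", "retrieval", "DRAFT"), registry)  # passes
```

Publishing a second `main_loop` experiment against this registry raises `ExclusionConflict`, so the overlap is rejected before any traffic is split.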
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Per-service random coin flip | Local prototypes | Any multi-service agent; uninterpretable metrics |
| Shadow canary only | Safety validation before any user sees change | Measuring real user satisfaction or revenue lift |
| Central sticky assignment + manifests | Production A/B on prompts, models, retrieval | Team unwilling to operate experiment registry |
| Manual tenant pin | Design partner feedback, QA | Statistical inference at scale |
| Offline eval only | Regression gates in CI | Behaviors that only emerge on live long-tail inputs |
Common pitfalls
- Different hash salts per service — the Harbor failure mode; users get incoherent bundles.
- Re-randomizing mid-session — variant flip on reconnect destroys trust and invalidates metrics.
- Overlapping experiments without factorial design — interaction effects look like wins.
- Peeking and early stop — promotes noise; use pre-registered horizons.
- Optimizing proxy metrics — shorter answers score well but increase escalations.
- No assignment on traces — impossible to replay or dispute a conclusion.
- Tiny tenant slices — one enterprise customer dominates significance.
- Forgetting checkpoint resume — long runs pick up new variant after crash recovery.
Production checklist
- Define assignment unit per product surface (user, tenant, or session) and document it.
- Assign once at ingress; propagate ExperimentContext on all internal calls.
- Store experiments in a registry with FSM, salts, splits, and mutual-exclusion groups.
- Ship variant manifests that pin prompt, model, tool, and retrieval revisions together.
- Log experiment_id, variant_id, and bucket on every run root span.
- Pre-register primary and guardrail metrics; set auto-pause thresholds.
- Maintain a 3–10% holdout excluded from all experiments.
- Require minimum sample size before auto-conclude or promote.
- Integrate promotion with config snapshot and prompt template registry.
- Run shadow canaries for high-risk variants before live split.
- Pin experiment context on checkpoint resume for durable workflows.
- Review concluded experiments quarterly; archive salts and manifests for replay.
Key takeaways
- Agent A/B tests compare variant bundles, not isolated prompt tweaks.
- Sticky assignment at ingress keeps multi-service trajectories coherent.
- Mutual exclusion and atomic manifests prevent unplanned combination explosions.
- Guardrails and holdouts separate real lift from statistical noise.
- Harbor Insights cut false-positive promotions from 52% to 7.8% with a central assignment plane.