
LLM agent A/B test experiment assignment and traffic splitting systems explained

Harbor Insights, a B2B analytics agent serving 400 enterprise tenants, ran nine simultaneous prompt and retrieval experiments. Each microservice hashed user_id with a different salt, so the RAG reranker assigned Alice to variant B while the main orchestrator assigned her to variant A. Session-level metrics looked positive for “winning” bundles, but post-hoc replay showed that 52% of promoted variants were statistical noise or interaction artifacts, not real improvements. Two quarters of roadmap work shipped regressions masked as wins.

The fix was a centralized experiment assignment plane: sticky bucketing on a stable unit id, atomic variant bundles that move together across services, mutual-exclusion groups, guardrail metrics with automatic holdouts, and assignment records attached to every run trace. False-positive promotions fell from 52% to 7.8% over two release cycles. This guide covers what agent A/B testing measures, how it differs from canaries and feature flags, how to choose bucketing units and traffic splits, the experiment lifecycle FSM, the Harbor Insights refactor, a decision table, common pitfalls, and a production checklist.

What agent A/B testing measures

An agent A/B test compares two or more variant bundles on live traffic while holding everything else as constant as production allows. A variant bundle is not just a prompt string — it is the full behavioral slice that should move together:

Prompt template revision — system and task instructions.
Model route — primary model, temperature, max tokens.
Tool set and policy — which tools are exposed and approval gates.
Retrieval stack — index version, reranker, hybrid weights.
Loop parameters — reflection on/off, max iterations, parallel tool cap.

Primary metrics are usually task success (resolution rate, human thumbs-up, automated rubric score), latency, and cost per successful run. Guardrail metrics catch collateral damage: escalation rate, policy violations, tool error rate, and p95 latency. Pair experiments with offline evaluation suites so you do not promote variants that win on one narrow metric but fail broad regression tests.

Assignment units and sticky bucketing

The assignment unit is the entity whose experience must stay consistent across sessions and services. Common choices:

Unit	Sticky across sessions	Best for	Risk
`user_id`	Yes	Product UX, habit formation	Long experiments dilute if users churn
`tenant_id`	Yes	B2B contracts, support load per customer	Small tenants never reach significance
`session_id`	No (new session re-rolls)	Short-lived chat, low-stakes tasks	User sees flip-flopping behavior
`run_id`	No	One-shot API calls, batch jobs	Cannot measure multi-turn outcomes

Sticky bucketing means the same unit always maps to the same variant for the lifetime of an experiment (unless explicitly re-randomized at a defined boundary). Implementation pattern:

bucket = stable_hash(
  salt = experiment.salt,
  unit = assignment_unit_id,
  experiment_id = "exp_reflection_v3"
) % 10000

if bucket < experiment.control_weight_bp:
  variant = "control"
else:
  variant = pick_treatment(bucket, experiment.treatments)

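Use a stable hash (for example MurmurHash3 or a truncated SHA-256) with a per-experiment salt stored in the experiment registry, not hash(user_id) % 2 copy-pasted into three repos. Language built-ins are a particular trap: Python's hash() on strings is randomized per process by default, so it cannot produce sticky buckets. Log experiment_id, variant_id, bucket, and assignment_unit on the run root span for every assignment.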
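A minimal runnable sketch of the bucketing pattern in Python, assuming SHA-256 as the stable hash and weights expressed in basis points. The Experiment record, its field names, and the salt value are illustrative assumptions, not Harbor's actual schema.

```python
import hashlib
from dataclasses import dataclass

BUCKETS = 10_000  # basis points, matching the % 10000 in the pattern above


@dataclass(frozen=True)
class Experiment:
    """Hypothetical registry record; field names are assumptions."""
    experiment_id: str
    salt: str
    control_weight_bp: int
    treatments: tuple  # ((variant_id, weight_bp), ...) in a fixed order


def stable_bucket(salt: str, experiment_id: str, unit_id: str) -> int:
    """Hash salt + experiment + unit to a bucket in [0, BUCKETS).

    SHA-256 yields the same digest in every process, so every service
    that shares the salt computes the same bucket for the same unit.
    """
    key = f"{salt}:{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign(exp: Experiment, unit_id: str) -> tuple:
    """Return (variant_id, bucket) for a unit."""
    bucket = stable_bucket(exp.salt, exp.experiment_id, unit_id)
    if bucket < exp.control_weight_bp:
        return "control", bucket
    upper = exp.control_weight_bp
    for variant_id, weight_bp in exp.treatments:
        upper += weight_bp
        if bucket < upper:
            return variant_id, bucket
    # Weights did not cover the full range: fall back to control.
    return "control", bucket


# 90% control / 10% treatment_a; the salt is a made-up example value.
exp = Experiment("exp_reflection_v3", "salt-7f3a", 9_000, (("treatment_a", 1_000),))
variant, bucket = assign(exp, "user_alice")
```

Calling assign again with the same experiment and unit returns the same pair, which is the stickiness property. Changing the salt reshuffles every unit, which is why the salt belongs in the registry rather than in each service.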

Traffic splitting at ingress

Assignment should happen once at ingress (API gateway, queue consumer, or start_run handler) and propagate as an immutable ExperimentContext attached to the run. Downstream services read the context; they do not re-roll.
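One way to make the context immutable and easy to propagate is a frozen dataclass serialized into a request header. This is a sketch under assumptions: the header name x-experiment-context, the field set, and the helper names are hypothetical.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass

HEADER = "x-experiment-context"  # hypothetical header name


@dataclass(frozen=True)
class ExperimentContext:
    experiment_id: str
    variant_id: str
    bucket: int
    assignment_unit: str

    def encode(self) -> str:
        """Serialize for transport on internal RPCs."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def decode(cls, raw: str) -> "ExperimentContext":
        return cls(**json.loads(raw))


def ingress_headers(ctx: ExperimentContext) -> dict:
    """Called once at ingress, right after assignment."""
    return {HEADER: ctx.encode()}


def downstream_variant(headers: dict) -> str:
    """Downstream services read the propagated context; they never re-hash."""
    return ExperimentContext.decode(headers[HEADER]).variant_id


ctx = ExperimentContext("exp_reflection_v3", "treatment_b", 9412, "user_id")
headers = ingress_headers(ctx)
```

Because the dataclass is frozen, an accidental ctx.variant_id = "control" in a downstream handler raises FrozenInstanceError instead of silently re-rolling the run.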

Split types

Percentage split — 90% control / 10% treatment; standard for gradual tests.
Whitelist override — internal dogfood tenants pinned to treatment for QA.
Geographic or tier slice — enterprise tier only; avoids SMB noise.
Holdout reserve — 5% never sees any experiment; measures cumulative lift vs true baseline.
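The split types above can be combined in a single resolver. The sketch below assumes one possible precedence (whitelist, then holdout, then tier slice, then percentage split); the source does not prescribe an order, and the function name, parameters, and the "not_enrolled" label are hypothetical.

```python
import hashlib


def bucket_bp(salt: str, unit_id: str) -> int:
    """Stable bucket in [0, 10000) from a salted SHA-256 digest."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def resolve_variant(unit_id, tier, *, exp_salt, control_bp=9_000,
                    pinned=None, eligible_tiers=None,
                    holdout_salt="global-holdout-v1", holdout_bp=500):
    # 1. Whitelist override: dogfood tenants pinned for QA.
    if pinned and unit_id in pinned:
        return pinned[unit_id]
    # 2. Holdout reserve: a separate salt keeps holdout membership
    #    independent of any single experiment's bucketing.
    if bucket_bp(holdout_salt, unit_id) < holdout_bp:
        return "holdout"
    # 3. Tier slice: units outside the slice are not enrolled at all.
    if eligible_tiers is not None and tier not in eligible_tiers:
        return "not_enrolled"
    # 4. Percentage split, control first as in the bucketing pattern.
    return "control" if bucket_bp(exp_salt, unit_id) < control_bp else "treatment"


choice = resolve_variant(
    "tenant_42", "enterprise",
    exp_salt="salt-7f3a",
    pinned={"tenant_internal": "treatment"},
    eligible_tiers={"enterprise"},
)
```

Units that return "holdout" or "not_enrolled" should be excluded from the experiment's analysis population rather than counted as control.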

Integrate splits with dynamic configuration: the experiment resolver is one layer in the config stack. When an experiment ends, the winning variant is promoted into the global snapshot and losing variants are retired with a deprecation window.
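As an illustration of the layering, the sketch below overlays a variant's overrides on the global snapshot with collections.ChainMap and promotes a winner by merging it into a new snapshot. The key names follow the manifest fields used in this guide; the 2.4.0 baseline revision and both helper functions are assumptions.

```python
from collections import ChainMap

# Hypothetical global snapshot; the 2.4.0 revision is a made-up baseline.
global_snapshot = {
    "prompt_rev": "support_triage@2.4.0",
    "model_route": "gpt-4.1-mini",
    "reflection_enabled": False,
}
treatment_overrides = {
    "prompt_rev": "support_triage@2.4.1",
    "reflection_enabled": True,
}


def effective_config(snapshot: dict, overrides) -> dict:
    """Experiment layer wins over the snapshot; no overrides means control."""
    return dict(ChainMap(overrides or {}, snapshot))


def promote(snapshot: dict, overrides: dict) -> dict:
    """Return a new snapshot with the winning overrides merged in."""
    return {**snapshot, **overrides}


treatment_cfg = effective_config(global_snapshot, treatment_overrides)
control_cfg = effective_config(global_snapshot, None)
next_snapshot = promote(global_snapshot, treatment_overrides)
```

After promotion, the effective config for every unit equals what treatment units were already seeing, so rollout does not introduce a bundle that was never tested.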

For high-risk changes, run a shadow canary first (replay production inputs without serving outputs), then an A/B test on live traffic with guardrails. The shadow run checks safety; the A/B test measures lift.

Variant bundles and mutual exclusion

The deadliest mistake is letting independent experiments assign different variants to different layers of the same run. Harbor's nine experiments created 512 theoretical combinations (2^9, assuming two variants each); production traffic hit hundreds of unplanned bundles.

Rules that prevent this:

One active experiment per layer — at most one live test touching prompts, one touching retrieval, one touching model route.
Mutual-exclusion groups — experiments in group main_loop cannot overlap; registry enforces at publish time.
Atomic variant bundles — variant treatment_b is a manifest listing all pinned revisions; services load the manifest, not local hashes.
Factorial only by design — if you need 2×2 factorial analysis, register it as one experiment with four cells, not two independent coin flips.
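The first two rules can be enforced with a small publish-time check. This is a sketch, assuming each experiment declares one layer and one exclusion group; ExperimentSpec, PublishConflict, and check_publish are hypothetical names.

```python
from dataclasses import dataclass


class PublishConflict(Exception):
    """Raised when a candidate would overlap a running experiment."""


@dataclass(frozen=True)
class ExperimentSpec:
    experiment_id: str
    layer: str            # e.g. "prompt", "retrieval", "model_route"
    exclusion_group: str  # e.g. "main_loop"
    state: str = "DRAFT"


def check_publish(candidate: ExperimentSpec, registry: list) -> None:
    """Reject the candidate if it shares a layer or group with a live test."""
    for live in registry:
        if live.state != "RUNNING" or live.experiment_id == candidate.experiment_id:
            continue
        if live.layer == candidate.layer:
            raise PublishConflict(
                f"{live.experiment_id} already runs on layer {live.layer!r}")
        if live.exclusion_group == candidate.exclusion_group:
            raise PublishConflict(
                f"{live.experiment_id} already holds group {live.exclusion_group!r}")


registry = [ExperimentSpec("exp_reflection_v3", "prompt", "main_loop", "RUNNING")]

# A retrieval test in a different group is allowed to publish.
check_publish(ExperimentSpec("exp_rerank_v2", "retrieval", "rag"), registry)
```

Under this scheme a 2×2 factorial would be registered as a single ExperimentSpec whose four cells are four variant manifests, so the check never sees two overlapping experiments.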

variant_manifest = {
  "variant_id": "treatment_b",
  "prompt_rev": "support_triage@2.4.1",
  "model_route": "gpt-4.1-mini",
  "rag_index_rev": 1187,
  "reflection_enabled": True
}

Pin manifests at run start and on checkpoint resume so multi-hour workflows do not cross variant boundaries mid-trajectory.
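A sketch of pinning, assuming checkpoints are JSON blobs in a key-value store (a plain dict stands in for it here). The function names and the drift flag are illustrative assumptions.

```python
import copy
import json


def start_run(run_id: str, manifest: dict, store: dict) -> dict:
    """Pin the manifest into the first checkpoint at run start."""
    checkpoint = {"run_id": run_id, "step": 0, "manifest": copy.deepcopy(manifest)}
    store[run_id] = json.dumps(checkpoint)
    return checkpoint


def resume_run(run_id: str, store: dict, live_manifest: dict) -> tuple:
    """Resume with the pinned manifest, never the live one.

    Returns (pinned_manifest, drifted) so the caller can log that the
    registry moved on while this run was in flight.
    """
    checkpoint = json.loads(store[run_id])
    pinned = checkpoint["manifest"]
    return pinned, pinned != live_manifest


store = {}
pinned_at_start = {
    "variant_id": "treatment_b",
    "prompt_rev": "support_triage@2.4.1",
    "model_route": "gpt-4.1-mini",
    "rag_index_rev": 1187,
    "reflection_enabled": True,
}
start_run("run_001", pinned_at_start, store)

# The registry publishes a newer revision while the run is suspended
# (2.5.0 is a made-up revision for the example).
live = {**pinned_at_start, "prompt_rev": "support_triage@2.5.0"}
manifest, drifted = resume_run("run_001", store, live)
```

The resumed run keeps support_triage@2.4.1 and reports drift, so the trajectory stays within one variant even though the live config changed.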

Experiment lifecycle FSM

Treat experiments as stateful resources with an explicit finite-state machine:

State	Traffic	Allowed actions
`DRAFT`	0%	Edit hypothesis, metrics, manifests; offline eval only
`RUNNING`	Configured split	Monitor; no manifest edits without new revision
`PAUSED`	0% (all to control)	Investigate guardrail breach; preserves bucket assignments
`CONCLUDED_WIN`	Ramp down	Promote manifest to global config snapshot
`CONCLUDED_LOSS`	0%	Archive; document learnings
`ABORTED`	Immediate 0%	Guardrail auto-trip or manual kill
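The states in the table can be encoded as an enum with an explicit transition map. The table lists states and allowed actions but not transitions, so the edges below are an assumption about a reasonable lifecycle, not a specification.

```python
from enum import Enum


class State(Enum):
    DRAFT = "DRAFT"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CONCLUDED_WIN = "CONCLUDED_WIN"
    CONCLUDED_LOSS = "CONCLUDED_LOSS"
    ABORTED = "ABORTED"


# Assumed edges: terminal states have no outgoing transitions.
ALLOWED = {
    State.DRAFT: {State.RUNNING},
    State.RUNNING: {State.PAUSED, State.CONCLUDED_WIN,
                    State.CONCLUDED_LOSS, State.ABORTED},
    State.PAUSED: {State.RUNNING, State.CONCLUDED_LOSS, State.ABORTED},
    State.CONCLUDED_WIN: set(),
    State.CONCLUDED_LOSS: set(),
    State.ABORTED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the edge is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def can_edit_manifest(state: State) -> bool:
    """Manifests are editable only in DRAFT; later changes need a new revision."""
    return state is State.DRAFT


state = transition(State.DRAFT, State.RUNNING)
state = transition(state, State.PAUSED)
```

Rejecting illegal edges in the registry means an aborted experiment cannot be quietly restarted with the same salt; it has to be republished as a new revision.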

Automatic abort triggers: policy violation rate > 2× control for 15 minutes, p95 latency > SLO ceiling, or cost per run > budget cap. Abort must flip traffic in seconds via the config watch stream — not after a deploy.
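The three triggers can be evaluated as follows. This sketch assumes per-minute rate samples are already aggregated upstream, and reads "for 15 minutes" as "every one of the last 15 samples breaches"; the function names and thresholds in the example are hypothetical.

```python
def violation_breach(per_minute, ratio=2.0, window=15):
    """True if treatment exceeded ratio x control in each of the last `window` minutes.

    per_minute: list of (treatment_rate, control_rate) pairs, oldest first.
    """
    if len(per_minute) < window:
        return False
    return all(t > ratio * c for t, c in per_minute[-window:])


def abort_reasons(per_minute, p95_latency_ms, cost_per_run_usd,
                  *, slo_p95_ms, cost_cap_usd):
    """Return the list of tripped guardrails; an empty list means keep running."""
    reasons = []
    if violation_breach(per_minute):
        reasons.append("policy_violation_rate")
    if p95_latency_ms > slo_p95_ms:
        reasons.append("p95_latency")
    if cost_per_run_usd > cost_cap_usd:
        reasons.append("cost_per_run")
    return reasons


# 15 minutes at 5% vs 2% violations: 0.05 > 2 * 0.02, so the first rule trips.
samples = [(0.05, 0.02)] * 15
tripped = abort_reasons(samples, p95_latency_ms=4200, cost_per_run_usd=0.03,
                        slo_p95_ms=6000, cost_cap_usd=0.05)
```

Note that a control rate of zero makes any nonzero treatment rate a breach under this rule; a production version would likely add a minimum-count floor.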

Statistical discipline and guardrails

Agent metrics are noisy: small samples, heavy tails, and weekday seasonality. Production systems should:

Pre-register primary and guardrail metrics before RUNNING.
Require minimum exposure (e.g. 5,000 runs per variant) before auto-conclude.
Use sequential testing or fixed-horizon tests — not peek-and-promote daily.
Slice results by tenant tier, task type, and model provider — aggregate lifts hide regressions.
Keep a permanent holdout to detect drift when everything is “winning.”
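For a binary primary metric such as resolution rate, a fixed-horizon check might look like the sketch below. It swaps in a pooled two-proportion z-test as one common choice (the source does not name a specific test) and gates it on the minimum exposure and horizon; the 5,000-run floor and 14-day horizon are taken from this guide, and the function names are assumptions.

```python
import math


def ready_to_conclude(n_control, n_treatment, days_elapsed,
                      *, min_runs=5_000, horizon_days=14):
    """Only evaluate once both arms hit minimum exposure and the horizon ends."""
    return min(n_control, n_treatment) >= min_runs and days_elapsed >= horizon_days


def two_proportion_z(success_c, n_c, success_t, n_t):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_c, p_t = success_c / n_c, success_t / n_t
    pooled = (success_c + success_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 0.0, 1.0
    z = (p_t - p_c) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return z, math.erfc(abs(z) / math.sqrt(2))


# Example: 60% vs 64% resolution rate with 5,000 runs per arm after 14 days.
if ready_to_conclude(5_000, 5_000, days_elapsed=14):
    z, p_value = two_proportion_z(3_000, 5_000, 3_200, 5_000)
```

This treats runs as independent. When the assignment unit is a user or tenant, runs from the same unit are correlated, so aggregating to one observation per unit before testing is the safer choice. Running this check daily and stopping at the first significant result is exactly the peeking the list warns against.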

Export assignment and outcome events to the audit trail so data science can reproduce analysis without re-querying raw logs. Include prompt_rev, model_id, and tool call outcomes — not just thumbs-up/down.
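A possible shape for those events, written as JSON Lines so analysts can join assignment to outcome on run_id. The event schema, field names beyond those listed above, and the example values are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def assignment_event(run_id, experiment_id, variant_id, bucket,
                     assignment_unit, prompt_rev, model_id):
    """Emitted once per run, at ingress."""
    return {
        "type": "assignment",
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "experiment_id": experiment_id,
        "variant_id": variant_id,
        "bucket": bucket,
        "assignment_unit": assignment_unit,
        "prompt_rev": prompt_rev,
        "model_id": model_id,
    }


def outcome_event(run_id, success, tool_calls):
    """Emitted when the run finishes; tool_calls is a list of {"tool", "ok"}."""
    return {
        "type": "outcome",
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "success": success,
        "tool_calls": tool_calls,
        "tool_error_count": sum(1 for call in tool_calls if not call["ok"]),
    }


def to_jsonl(events):
    """One JSON object per line, keys sorted for stable diffs."""
    return "\n".join(json.dumps(event, sort_keys=True) for event in events)


events = [
    assignment_event("run_001", "exp_reflection_v3", "treatment_b", 9412,
                     "user_id", "support_triage@2.4.1", "gpt-4.1-mini"),
    outcome_event("run_001", True, [{"tool": "search", "ok": True},
                                    {"tool": "crm_lookup", "ok": False}]),
]
audit_lines = to_jsonl(events)
```

Keeping the pinned revisions on the assignment event means a later replay does not depend on the registry still holding the same manifest.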

Harbor Insights refactor

Root causes beyond inconsistent salts:

Per-service assignment — three binaries, three hash functions, zero shared registry.
Session vs user mismatch — retrieval sticky on session; main loop on user; multi-tab chaos.
Overlapping experiments — no mutual-exclusion enforcement at publish time.
Promotion without holdout — declared wins never validated against a clean control reserve.
Missing assignment on traces — could not debug which variant produced a bad run.

Shipped fixes:

Central experiment registry with FSM, mutual-exclusion groups, and manifest bundles.
Single assignment at start_run; ExperimentContext propagated on all internal RPCs.
5% global holdout lane excluded from all experiments.
Guardrail auto-pause wired to PagerDuty; median abort time 41 seconds.
Promotion workflow requires offline eval pass + 14-day fixed-horizon live result.

False-positive promotions dropped from 52% to 7.8%. Median time from experiment conclusion to safe global rollout fell from 11 days to 3 days because manifests promoted directly into the config snapshot.

Technique decision table

Approach	Best for	Weak when
Per-service random coin flip	Local prototypes	Any multi-service agent; uninterpretable metrics
Shadow canary only	Safety validation before any user sees change	Measuring real user satisfaction or revenue lift
Central sticky assignment + manifests	Production A/B on prompts, models, retrieval	Team unwilling to operate experiment registry
Manual tenant pin	Design partner feedback, QA	Statistical inference at scale
Offline eval only	Regression gates in CI	Behaviors that only emerge on live long-tail inputs

Common pitfalls

Different hash salts per service — the Harbor failure mode; users get incoherent bundles.
Re-randomizing mid-session — variant flip on reconnect destroys trust and invalidates metrics.
Overlapping experiments without factorial design — interaction effects look like wins.
Peeking and early stop — promotes noise; use pre-registered horizons.
Optimizing proxy metrics — shorter answers score well but increase escalations.
No assignment on traces — impossible to replay or dispute a conclusion.
Tiny tenant slices — one enterprise customer dominates significance.
Forgetting checkpoint resume — long runs pick up new variant after crash recovery.

Production checklist

Define assignment unit per product surface (user, tenant, or session) and document it.
Assign once at ingress; propagate ExperimentContext on all internal calls.
Store experiments in a registry with FSM, salts, splits, and mutual-exclusion groups.
Ship variant manifests that pin prompt, model, tool, and retrieval revisions together.
Log experiment_id, variant_id, and bucket on every run root span.
Pre-register primary and guardrail metrics; set auto-pause thresholds.
Maintain a 3–10% holdout excluded from all experiments.
Require minimum sample size before auto-conclude or promote.
Integrate promotion with config snapshot and prompt template registry.
Run shadow canaries for high-risk variants before live split.
Pin experiment context on checkpoint resume for durable workflows.
Review concluded experiments quarterly; archive salts and manifests for replay.

Key takeaways

Agent A/B tests compare variant bundles, not isolated prompt tweaks.
Sticky assignment at ingress keeps multi-service trajectories coherent.
Mutual exclusion and atomic manifests prevent unplanned combination explosions.
Guardrails and holdouts separate real lift from statistical noise.
Harbor Insights cut false-positive promotions from 52% to 7.8% with a central assignment plane.