
LLM agent A/B test experiment assignment and traffic splitting systems explained

Harbor Insights, a B2B analytics agent serving 400 enterprise tenants, ran nine simultaneous prompt and retrieval experiments. Each microservice hashed user_id with a different salt, so the RAG reranker assigned Alice to variant B while the main orchestrator assigned her to variant A. Session-level metrics looked positive for “winning” bundles, but post-hoc replay showed that 52% of promoted variants were statistical noise or interaction artifacts, not real improvements. Two quarters of roadmap work shipped regressions masked as wins.

The fix was a centralized experiment assignment plane: sticky bucketing on a stable unit id, atomic variant bundles that move together across services, mutual-exclusion groups, guardrail metrics with automatic holdouts, and assignment records attached to every run trace. False-positive promotions fell from 52% to 7.8% over two release cycles. This guide explains what agent A/B testing is, how it differs from canaries and feature flags, bucketing units, traffic splits, experiment lifecycle FSMs, the Harbor Insights refactor, a decision table, pitfalls, and a production checklist.

What agent A/B testing measures

An agent A/B test compares two or more variant bundles on live traffic while holding everything else as constant as production allows. A variant bundle is not just a prompt string — it is the full behavioral slice that should move together:

  • Prompt template revision — system and task instructions.
  • Model route — primary model, temperature, max tokens.
  • Tool set and policy — which tools are exposed and approval gates.
  • Retrieval stack — index version, reranker, hybrid weights.
  • Loop parameters — reflection on/off, max iterations, parallel tool cap.

Primary metrics are usually task success (resolution rate, human thumbs-up, automated rubric score), latency, and cost per successful run. Guardrail metrics catch collateral damage: escalation rate, policy violations, tool error rate, and p95 latency. Pair experiments with offline evaluation suites so you do not promote variants that win on one narrow metric but fail broad regression tests.

Assignment units and sticky bucketing

The assignment unit is the entity whose experience must stay consistent across sessions and services. Common choices:

Unit       | Sticky across sessions    | Best for                                 | Risk
-----------+---------------------------+------------------------------------------+---------------------------------------
user_id    | Yes                       | Product UX, habit formation              | Long experiments dilute if users churn
tenant_id  | Yes                       | B2B contracts, support load per customer | Small tenants never reach significance
session_id | No (new session re-rolls) | Short-lived chat, low-stakes tasks       | User sees flip-flopping behavior
run_id     | No                        | One-shot API calls, batch jobs           | Cannot measure multi-turn outcomes

Sticky bucketing means the same unit always maps to the same variant for the lifetime of an experiment (unless explicitly re-randomized at a defined boundary). Implementation pattern:

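import hashlib

def stable_hash(salt: str, unit: str, experiment_id: str) -> int:
    # SHA-256 truncated to 64 bits: the same inputs give the same integer in every service.
    digest = hashlib.sha256(f"{salt}:{experiment_id}:{unit}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Buckets are basis points (0..9999), so 10000 buckets cover 100% of traffic.
bucket = stable_hash(
    salt=experiment.salt,
    unit=assignment_unit_id,
    experiment_id="exp_reflection_v3",
) % 10000

if bucket < experiment.control_weight_bp:
    variant = "control"
else:
    variant = pick_treatment(bucket, experiment.treatments)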

Use a stable hash (Murmur3, SHA-256 truncated) with a per-experiment salt stored in the experiment registry — not hash(user_id) % 2 copy-pasted in three repos. Log experiment_id, variant_id, bucket, and assignment_unit on the run root span for every assignment.
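
The guide does not define pick_treatment or the shape of the logged record, so the following is only a minimal sketch under stated assumptions: treatments are (variant_id, weight_bp) pairs that occupy consecutive basis-point ranges starting where control ends, pick_treatment takes the control weight as an extra argument, and assignment_record is a hypothetical helper that returns the four fields named above.

```python
from typing import Dict, Sequence, Tuple

BUCKETS = 10_000  # basis points: one bucket is 0.01% of traffic


def pick_treatment(bucket: int, control_weight_bp: int,
                   treatments: Sequence[Tuple[str, int]]) -> str:
    """Return the treatment whose basis-point range contains `bucket`.

    Assumption: treatments occupy consecutive ranges that start where
    the control range ends.
    """
    upper = control_weight_bp
    for variant_id, weight_bp in treatments:
        upper += weight_bp
        if bucket < upper:
            return variant_id
    raise ValueError(f"bucket {bucket} is outside the configured weights")


def assignment_record(experiment_id: str, assignment_unit: str, bucket: int,
                      control_weight_bp: int,
                      treatments: Sequence[Tuple[str, int]]) -> Dict[str, object]:
    """Build the fields to attach to the run root span (hypothetical helper)."""
    total = control_weight_bp + sum(weight for _, weight in treatments)
    if total != BUCKETS:
        raise ValueError(f"weights sum to {total}, expected {BUCKETS}")
    if not 0 <= bucket < BUCKETS:
        raise ValueError(f"bucket {bucket} is not in [0, {BUCKETS})")
    if bucket < control_weight_bp:
        variant_id = "control"
    else:
        variant_id = pick_treatment(bucket, control_weight_bp, treatments)
    return {
        "experiment_id": experiment_id,
        "variant_id": variant_id,
        "bucket": bucket,
        "assignment_unit": assignment_unit,
    }
```

With a 90/5/5 split, for example, assignment_record("exp_reflection_v3", "user_42", 9250, 9000, [("treatment_a", 500), ("treatment_b", 500)]) yields variant_id "treatment_a", because bucket 9250 falls in the 9000 to 9499 range.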

Traffic splitting at ingress

Assignment should happen once at ingress (API gateway, queue consumer, or start_run handler) and propagate as an immutable ExperimentContext attached to the run. Downstream services read the context; they do not re-roll.
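
One way to make "assign once, read everywhere" concrete is a frozen context object that is serialized onto internal calls. The sketch below is illustrative only: the field names, the start_run signature, and the x-experiment-context header name are assumptions, not part of any described Harbor API.

```python
import json
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping

HEADER = "x-experiment-context"  # hypothetical header name


@dataclass(frozen=True)
class ExperimentContext:
    run_id: str
    assignment_unit: str
    assignments: Mapping[str, str]  # experiment_id -> variant_id

    def variant_for(self, experiment_id: str, default: str = "control") -> str:
        """Downstream services read the assignment; they never re-roll it."""
        return self.assignments.get(experiment_id, default)

    def to_headers(self) -> Dict[str, str]:
        """Serialize the context for propagation on an internal RPC."""
        payload = {
            "run_id": self.run_id,
            "assignment_unit": self.assignment_unit,
            "assignments": dict(self.assignments),
        }
        return {HEADER: json.dumps(payload, sort_keys=True)}

    @classmethod
    def from_headers(cls, headers: Mapping[str, str]) -> "ExperimentContext":
        """Rebuild the same read-only context on the receiving service."""
        payload = json.loads(headers[HEADER])
        return cls(
            run_id=payload["run_id"],
            assignment_unit=payload["assignment_unit"],
            assignments=MappingProxyType(payload["assignments"]),
        )


def start_run(run_id: str, assignment_unit: str,
              assignments: Mapping[str, str]) -> ExperimentContext:
    """Ingress handler: copy the assignments once and freeze them."""
    return ExperimentContext(
        run_id=run_id,
        assignment_unit=assignment_unit,
        assignments=MappingProxyType(dict(assignments)),
    )
```

The frozen dataclass rejects attribute reassignment and the MappingProxyType rejects item assignment, so a downstream service that tries to overwrite a variant fails loudly instead of silently diverging.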

Split types

  • Percentage split — 90% control / 10% treatment; standard for gradual tests.
  • Whitelist override — internal dogfood tenants pinned to treatment for QA.
  • Geographic or tier slice — enterprise tier only; avoids SMB noise.
  • Holdout reserve — 5% never sees any experiment; measures cumulative lift vs true baseline.
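
The four split types can be combined in a single resolver. The sketch below assumes a precedence order the guide does not specify (whitelist, then holdout, then tier eligibility, then percentage split), a separate global salt for the holdout so that it is independent of any experiment, and a single treatment arm; all function and parameter names are hypothetical.

```python
import hashlib
from typing import AbstractSet, Mapping, Tuple


def bucket_of(salt: str, unit_id: str, buckets: int = 10_000) -> int:
    """Stable bucket in [0, buckets) from a truncated SHA-256 digest."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def resolve_variant(unit_id: str, tier: str, *,
                    experiment_salt: str,
                    treatment_bp: int,
                    pinned: Mapping[str, str],
                    eligible_tiers: AbstractSet[str],
                    holdout_salt: str = "global_holdout_v1",
                    holdout_bp: int = 500) -> Tuple[str, str]:
    """Return (variant, reason). The precedence order is an assumption."""
    # 1. Whitelist override: dogfood or QA units pinned to a variant.
    if unit_id in pinned:
        return pinned[unit_id], "whitelist"
    # 2. Holdout reserve: hashed with its own salt, never sees any experiment.
    if bucket_of(holdout_salt, unit_id) < holdout_bp:
        return "control", "holdout"
    # 3. Tier slice: units outside the targeted tiers stay on control.
    if tier not in eligible_tiers:
        return "control", "ineligible"
    # 4. Percentage split: control occupies the low buckets, as in the pattern above.
    control_bp = 10_000 - treatment_bp
    bucket = bucket_of(experiment_salt, unit_id)
    return ("control" if bucket < control_bp else "treatment"), "split"
```

Returning the reason alongside the variant makes it possible to exclude whitelisted and holdout units from the lift analysis, since neither was randomly assigned within the experiment.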

Integrate splits with dynamic configuration: the experiment resolver is one layer in the config stack. When an experiment ends, winning variants promote into the global snapshot; losing variants retire with a deprecation window.

For high-risk changes, run a shadow canary first (replay production inputs without serving outputs), then an A/B test on live traffic with guardrails. Shadow proves safety; A/B proves lift.

Variant bundles and mutual exclusion

The deadliest mistake is letting independent experiments assign different variants to different layers of the same run. With two variants each, Harbor's nine experiments allowed 2^9 = 512 theoretical combinations; production traffic hit hundreds of these unplanned bundles.

Rules that prevent this:

  1. One active experiment per layer — at most one live test touching prompts, one touching retrieval, one touching model route.
  2. Mutual-exclusion groups — experiments in group main_loop cannot overlap; registry enforces at publish time.
  3. Atomic variant bundles — variant treatment_b is a manifest listing all pinned revisions; services load the manifest, not local hashes.
  4. Factorial only by design — if you need 2×2 factorial analysis, register it as one experiment with four cells, not two independent coin flips.

variant_manifest = {
    "variant_id": "treatment_b",
    "prompt_rev": "support_triage@2.4.1",
    "model_route": "gpt-4.1-mini",
    "rag_index_rev": 1187,
    "reflection_enabled": True,
}
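
Rules 1 and 2 can be enforced as a publish-time check in the registry. The sketch below is a hypothetical shape for that check: the ExperimentSpec fields, the PublishConflict exception, and the choice to treat both RUNNING and PAUSED experiments as occupying their group and layer are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ExperimentSpec:
    experiment_id: str
    exclusion_group: str  # e.g. "main_loop"
    layer: str            # e.g. "prompt", "retrieval", "model_route"
    state: str = "DRAFT"


class PublishConflict(Exception):
    """Raised when a candidate would overlap a live experiment."""


# Assumption: paused experiments keep their bucket assignments, so they still block.
OCCUPYING_STATES = {"RUNNING", "PAUSED"}


def check_publish(candidate: ExperimentSpec,
                  registry: Iterable[ExperimentSpec]) -> None:
    """Reject a publish that shares a group or a layer with a live experiment."""
    for live in registry:
        if live.experiment_id == candidate.experiment_id:
            continue
        if live.state not in OCCUPYING_STATES:
            continue
        if live.exclusion_group == candidate.exclusion_group:
            raise PublishConflict(
                f"{candidate.experiment_id} overlaps {live.experiment_id} "
                f"in group {live.exclusion_group}")
        if live.layer == candidate.layer:
            raise PublishConflict(
                f"{candidate.experiment_id} and {live.experiment_id} "
                f"both touch layer {live.layer}")
```

Running this check when an experiment moves from DRAFT to RUNNING means a conflict blocks the publish instead of surfacing weeks later as an interaction artifact in the metrics.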

Pin manifests at run start and on checkpoint resume so multi-hour workflows do not cross variant boundaries mid-trajectory.
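
A minimal sketch of pinning on resume, assuming the checkpoint store is a simple key-value mapping (a dict stands in for durable storage here) and that the manifest is JSON-serializable. The function names are hypothetical.

```python
import json
from typing import Any, Dict, MutableMapping

Manifest = Dict[str, Any]


def start_run(run_id: str, manifest: Manifest,
              store: MutableMapping[str, str]) -> Manifest:
    """Persist the manifest with the first checkpoint so it travels with the run."""
    store[run_id] = json.dumps({"manifest": manifest, "step": 0})
    return manifest


def save_checkpoint(run_id: str, step: int,
                    store: MutableMapping[str, str]) -> None:
    """Advance the step counter without touching the pinned manifest."""
    checkpoint = json.loads(store[run_id])
    checkpoint["step"] = step
    store[run_id] = json.dumps(checkpoint)


def resume_run(run_id: str, currently_live_manifest: Manifest,
               store: MutableMapping[str, str]) -> Manifest:
    """Return the manifest pinned at run start.

    The currently live manifest is accepted only to show that it is ignored:
    a resumed run must not pick up whatever the experiment serves now.
    """
    checkpoint = json.loads(store[run_id])
    return checkpoint["manifest"]
```

Because the manifest is read back from the checkpoint, a run that started on support_triage@2.4.1 finishes on that revision even if the experiment has concluded or a new revision has been promoted in the meantime.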

Experiment lifecycle FSM

Treat experiments as stateful resources with an explicit finite-state machine:

State          | Traffic             | Allowed actions
---------------+---------------------+-----------------------------------------------------------
DRAFT          | 0%                  | Edit hypothesis, metrics, manifests; offline eval only
RUNNING        | Configured split    | Monitor; no manifest edits without new revision
PAUSED         | 0% (all to control) | Investigate guardrail breach; preserves bucket assignments
CONCLUDED_WIN  | Ramp down           | Promote manifest to global config snapshot
CONCLUDED_LOSS | 0%                  | Archive; document learnings
ABORTED        | Immediate 0%        | Guardrail auto-trip or manual kill
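
The table lists states but not the edges between them. The sketch below fills in one plausible transition map as an assumption (for example, that PAUSED can resume to RUNNING and that the three concluded or aborted states are terminal); adjust the edges to match your own registry.

```python
from enum import Enum
from typing import Dict, FrozenSet


class State(Enum):
    DRAFT = "DRAFT"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CONCLUDED_WIN = "CONCLUDED_WIN"
    CONCLUDED_LOSS = "CONCLUDED_LOSS"
    ABORTED = "ABORTED"


# Assumed edges: the guide names the states, not the allowed transitions.
TRANSITIONS: Dict[State, FrozenSet[State]] = {
    State.DRAFT: frozenset({State.RUNNING}),
    State.RUNNING: frozenset({State.PAUSED, State.CONCLUDED_WIN,
                              State.CONCLUDED_LOSS, State.ABORTED}),
    State.PAUSED: frozenset({State.RUNNING, State.CONCLUDED_LOSS,
                             State.ABORTED}),
    State.CONCLUDED_WIN: frozenset(),
    State.CONCLUDED_LOSS: frozenset(),
    State.ABORTED: frozenset(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the edge is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_terminal(state: State) -> bool:
    """Terminal states have no outgoing edges."""
    return not TRANSITIONS[state]
```

Keeping the map as data makes it easy to audit, and rejecting illegal edges prevents shortcuts such as promoting a DRAFT experiment that never received live traffic.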

Automatic abort triggers: policy violation rate > 2× control for 15 minutes, p95 latency > SLO ceiling, or cost per run > budget cap. Abort must flip traffic in seconds via the config watch stream — not after a deploy.
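
The three triggers can be evaluated by a small pure function that the monitoring loop calls on each tick. This is a sketch under assumptions: metrics arrive as one sample per minute, "for 15 minutes" means every one of the last 15 samples breaches the ratio, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MinuteSample:
    treatment_violation_rate: float
    control_violation_rate: float


def abort_reasons(samples: Sequence[MinuteSample],
                  p95_latency_ms: float, slo_ceiling_ms: float,
                  cost_per_run: float, budget_cap: float,
                  sustain_minutes: int = 15,
                  ratio: float = 2.0) -> List[str]:
    """Return the guardrails that have tripped; an empty list means keep running."""
    reasons: List[str] = []

    # Policy violations must exceed ratio x control for the whole window.
    recent = list(samples)[-sustain_minutes:]
    sustained = len(recent) == sustain_minutes and all(
        s.treatment_violation_rate > ratio * s.control_violation_rate
        for s in recent
    )
    if sustained:
        reasons.append("policy_violation_rate")

    if p95_latency_ms > slo_ceiling_ms:
        reasons.append("p95_latency")

    if cost_per_run > budget_cap:
        reasons.append("cost_per_run")

    return reasons
```

A non-empty result would drive the experiment to ABORTED and push the change through the config watch stream; requiring a full window of samples avoids tripping on a single noisy minute.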

Statistical discipline and guardrails

Agent metrics are noisy: small samples, heavy tails, and weekday seasonality. Production systems should:

  • Pre-register primary and guardrail metrics before RUNNING.
  • Require minimum exposure (e.g. 5,000 runs per variant) before auto-conclude.
  • Use sequential testing or fixed-horizon tests — not peek-and-promote daily.
  • Slice results by tenant tier, task type, and model provider — aggregate lifts hide regressions.
  • Keep a permanent holdout to detect drift when everything is “winning.”
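
As an illustration of the exposure gate and a fixed-horizon test, the sketch below uses a pooled two-proportion z-test on a binary success metric. The z-test is a choice made for this example, not a method the guide prescribes, and the default thresholds (5,000 runs per variant, a 14-day horizon, alpha of 0.05) are assumptions, although the first two echo figures used elsewhere in this guide.

```python
import math
from typing import Tuple


def ready_to_conclude(n_control: int, n_treatment: int, days_elapsed: int,
                      min_runs: int = 5000, horizon_days: int = 14) -> bool:
    """Gate: enough exposure in every arm and the pre-registered horizon reached."""
    return min(n_control, n_treatment) >= min_runs and days_elapsed >= horizon_days


def two_proportion_z(success_c: int, n_c: int,
                     success_t: int, n_t: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for treatment minus control success rate."""
    p_c = success_c / n_c
    p_t = success_t / n_t
    pooled = (success_c + success_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    if se == 0.0:
        return 0.0, 1.0  # all successes or all failures: no detectable difference
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def decide(success_c: int, n_c: int, success_t: int, n_t: int,
           days_elapsed: int, alpha: float = 0.05) -> str:
    """Evaluate once at the horizon; before that, always keep running."""
    if not ready_to_conclude(n_c, n_t, days_elapsed):
        return "keep_running"
    z, p_value = two_proportion_z(success_c, n_c, success_t, n_t)
    if p_value < alpha and z > 0:
        return "candidate_win"
    return "no_detected_lift"
```

The point of the gate is that decide returns "keep_running" regardless of how good the interim numbers look, which is what prevents peek-and-promote. A "candidate_win" would still need to pass the guardrail and slice checks listed above before promotion.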

Export assignment and outcome events to the audit trail so data science can reproduce analysis without re-querying raw logs. Include prompt_rev, model_id, and tool call outcomes — not just thumbs-up/down.

Harbor Insights refactor

Root causes beyond inconsistent salts:

  1. Per-service assignment — three binaries, three hash functions, zero shared registry.
  2. Session vs user mismatch — retrieval sticky on session; main loop on user; multi-tab chaos.
  3. Overlapping experiments — no mutual-exclusion enforcement at publish time.
  4. Promotion without holdout — declared wins never validated against a clean control reserve.
  5. Missing assignment on traces — could not debug which variant produced a bad run.

Shipped fixes:

  • Central experiment registry with FSM, mutual-exclusion groups, and manifest bundles.
  • Single assignment at start_run; ExperimentContext propagated on all internal RPCs.
  • 5% global holdout lane excluded from all experiments.
  • Guardrail auto-pause wired to PagerDuty; median abort time 41 seconds.
  • Promotion workflow requires offline eval pass + 14-day fixed-horizon live result.

False-positive promotions dropped from 52% to 7.8%. Median time from experiment conclusion to safe global rollout fell from 11 days to 3 days because manifests promoted directly into the config snapshot.

Technique decision table

Approach                              | Best for                                      | Weak when
--------------------------------------+-----------------------------------------------+----------------------------------------------------
Per-service random coin flip          | Local prototypes                              | Any multi-service agent; uninterpretable metrics
Shadow canary only                    | Safety validation before any user sees change | Measuring real user satisfaction or revenue lift
Central sticky assignment + manifests | Production A/B on prompts, models, retrieval  | Team unwilling to operate experiment registry
Manual tenant pin                     | Design partner feedback, QA                   | Statistical inference at scale
Offline eval only                     | Regression gates in CI                        | Behaviors that only emerge on live long-tail inputs

Common pitfalls

  • Different hash salts per service — the Harbor failure mode; users get incoherent bundles.
  • Re-randomizing mid-session — variant flip on reconnect destroys trust and invalidates metrics.
  • Overlapping experiments without factorial design — interaction effects look like wins.
  • Peeking and early stop — promotes noise; use pre-registered horizons.
  • Optimizing proxy metrics — shorter answers score well but increase escalations.
  • No assignment on traces — impossible to replay or dispute a conclusion.
  • Tiny tenant slices — one enterprise customer dominates significance.
  • Forgetting checkpoint resume — long runs pick up new variant after crash recovery.

Production checklist

  • Define assignment unit per product surface (user, tenant, or session) and document it.
  • Assign once at ingress; propagate ExperimentContext on all internal calls.
  • Store experiments in a registry with FSM, salts, splits, and mutual-exclusion groups.
  • Ship variant manifests that pin prompt, model, tool, and retrieval revisions together.
  • Log experiment_id, variant_id, and bucket on every run root span.
  • Pre-register primary and guardrail metrics; set auto-pause thresholds.
  • Maintain a 3–10% holdout excluded from all experiments.
  • Require minimum sample size before auto-conclude or promote.
  • Integrate promotion with config snapshot and prompt template registry.
  • Run shadow canaries for high-risk variants before live split.
  • Pin experiment context on checkpoint resume for durable workflows.
  • Review concluded experiments quarterly; archive salts and manifests for replay.

Key takeaways

  • Agent A/B tests compare variant bundles, not isolated prompt tweaks.
  • Sticky assignment at ingress keeps multi-service trajectories coherent.
  • Mutual exclusion and atomic manifests prevent unplanned combination explosions.
  • Guardrails and holdouts separate real lift from statistical noise.
  • Harbor Insights cut false-positive promotions from 52% to 7.8% with a central assignment plane.
