Guide
LLM A/B testing and online experimentation explained
Harbor Analytics ran two candidate system prompts for its internal SQL assistant through the same 400-question offline benchmark. Variant A and Variant B scored within one point on exact-match accuracy. The team picked B because evaluators rated its explanations slightly more “helpful.” In production, B drove executable-query success from 71% to 89% on warehouse traffic — but also doubled references to tables that do not exist in the customer schema. Variant A would have been safer; nobody measured schema-grounding rate because the offline set used sanitized table names. A two-week randomized A/B test with sticky session bucketing, a primary metric of query execution success, and a guardrail on hallucinated entity rate would have caught the tradeoff before full rollout.
A/B testing randomly assigns users or sessions to control and treatment stacks, then compares outcome metrics with statistical discipline. For LLM products it is how you decide which prompt, model, retrieval configuration, or tool schema actually wins on real queries — not which looked best in a lab notebook. This guide covers experiment design, bucketing, metric hierarchies, variance reduction, sequential testing, interaction with canary deployment and online evaluation, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
What makes LLM experiments different from classic web A/B tests
Button-color tests measure one binary click. LLM stacks emit variable-length text, call tools, retrieve documents, and fail in ways offline suites miss. High variance in response quality, heavy-tailed latency, and session-level carryover (a bad first answer poisons the whole thread) all break naive analysis.
Unit of randomization: user, session, or request?
- User-level — stable experience; good for retention and habit metrics; slow to reach sample size on low-traffic products.
- Session-level — common for support bots and copilots; balances consistency within a conversation with faster accumulation of events.
- Request-level — only when each query is independent; risks inconsistent tone mid-thread unless you accept that tradeoff for speed on stateless APIs.
Pick one unit, document it, and never re-randomize mid-session without resetting conversation state. Pair bucketing with prompt versioning so control and treatment artifacts are immutable for the experiment duration.
What you can experiment on
- System and developer prompts (tone, format, refusal policy).
- Base model or fine-tuned checkpoint (same API, different weights).
- Retrieval top-k, reranker, chunk size, or embedding model.
- Tool schemas, decoding temperature, max tokens, structured-output mode.
- Agent loop depth, planner prompts, human-in-the-loop gates.
Change one layer per experiment when possible. Bundling a new model, a new prompt, and a new index in one arm makes attribution impossible and complicates rollback, because you cannot tell which layer to revert.
Metric design: one primary, explicit guardrails
Every experiment needs exactly one primary metric that decides ship or no-ship, pre-registered before traffic flows. Everything else is diagnostic or guardrail.
Good primary metrics for LLM products
- Task success rate — SQL executed without error, ticket resolved without escalation, code compiled, form submitted.
- Human thumbs-up rate on a stratified sample (not the raw aggregate; see pitfalls).
- Downstream conversion — purchase, signup, or workflow completion attributed to the assistant turn.
- Cost-adjusted success — success per dollar of inference when margin matters.
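As a rough illustration of the cost-adjusted metric, the sketch below divides successful tasks by inference spend. The function name and the arm totals are hypothetical, chosen only to show how a variant can win on raw success and still lose per dollar.

```python
def cost_adjusted_success(successes: int, total_cost_usd: float) -> float:
    """Successful tasks per dollar of inference spend."""
    if total_cost_usd <= 0:
        raise ValueError("total_cost_usd must be positive")
    return successes / total_cost_usd


# Hypothetical arm totals, for illustration only (not measured data).
control = cost_adjusted_success(successes=710, total_cost_usd=50.0)
treatment = cost_adjusted_success(successes=890, total_cost_usd=150.0)

print(f"control:   {control:.2f} successes per dollar")
print(f"treatment: {treatment:.2f} successes per dollar")
```

In this made-up case the treatment produces more successes but fewer successes per dollar, which is the situation a cost-adjusted primary metric is meant to expose.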
Guardrail metrics (auto-halt if breached)
- Hallucination or ungrounded-claim rate on sampled turns.
- P95 latency and token cost per session.
- Safety policy violation rate (PII leak, disallowed content).
- Escalation-to-human rate, where lower is better when the goal is deflection (or higher, if the product is meant to hand risky cases to a person).
- Quality retention on existing slices (a new fine-tune can regress them through catastrophic forgetting).
Wire guardrails into automated experiment kill switches. A treatment that lifts the primary metric 5% but doubles policy violations is a loss, not a win pending review.
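One way to picture an automated kill switch is a small check that compares observed guardrail values against pre-registered limits. This is a minimal sketch: the `Guardrail` class, the threshold values, and the sample counts are all assumptions for illustration, and a real system would feed it from the metrics pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    max_value: float   # breach when the observed value exceeds this limit
    min_samples: int   # skip evaluation until enough observations exist


def breached_guardrails(guardrails, observed):
    """Return names of guardrails whose observed value exceeds its limit.

    `observed` maps guardrail name -> (value, sample_count). Metrics that
    are missing or have too few samples are skipped, not treated as breaches.
    """
    breached = []
    for g in guardrails:
        if g.name not in observed:
            continue
        value, n = observed[g.name]
        if n >= g.min_samples and value > g.max_value:
            breached.append(g.name)
    return breached


def should_halt(guardrails, observed) -> bool:
    """True when at least one guardrail is breached."""
    return bool(breached_guardrails(guardrails, observed))


# Hypothetical limits; real values belong in the pre-registration document.
GUARDRAILS = [
    Guardrail("hallucination_rate", max_value=0.05, min_samples=500),
    Guardrail("p95_latency_s", max_value=4.0, min_samples=500),
    Guardrail("policy_violation_rate", max_value=0.001, min_samples=500),
]

observed = {
    "hallucination_rate": (0.097, 1200),  # sample count is made up
    "p95_latency_s": (2.8, 1200),
}
print(breached_guardrails(GUARDRAILS, observed))
```

The `min_samples` floor is a design choice to avoid halting on a handful of noisy early sessions; whether a missing metric should instead count as a breach is a policy decision this sketch does not settle.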
Randomization, stickiness, and sample size
Sticky bucketing
Hash a stable identifier (user ID, session ID, or anonymous cookie) into a bucket 0–99. Assign buckets 0–49 to control and 50–99 to treatment for a 50/50 split. Sticky assignment prevents users from seeing both variants in one journey, which would confound session metrics and create bizarre UX.
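The hashing scheme described above can be sketched with the standard library. Mixing the experiment ID into the hash key is an assumption added here so that different experiments get independent splits; the function names are illustrative.

```python
import hashlib


def bucket(unit_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """Map a stable identifier to a bucket in [0, n_buckets)."""
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % n_buckets


def assign_variant(unit_id: str, experiment_id: str,
                   treatment_share: float = 0.5,
                   n_buckets: int = 100) -> str:
    """Low buckets go to control, high buckets to treatment (0-49 / 50-99 at 50/50)."""
    control_buckets = n_buckets - round(treatment_share * n_buckets)
    if bucket(unit_id, experiment_id, n_buckets) >= control_buckets:
        return "treatment"
    return "control"


# The same session always lands in the same arm for a given experiment.
first = assign_variant("session-123", "exp-schema-prompt")
second = assign_variant("session-123", "exp-schema-prompt")
print(first, second)
```

Because assignment is a pure function of the identifier, no lookup table is needed to keep it sticky, as long as the identifier itself is stable for the chosen randomization unit.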
Power and duration
Estimate the required sample size from the baseline rate of the primary metric and the minimum detectable effect (MDE). LLM metrics are often noisier than checkout funnels, so plan for longer runs or accept a larger MDE. A common mistake is stopping when the p-value flickers below 0.05 after three days; instead, declare a fixed-horizon test upfront or use sequential testing with an alpha-spending function.
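For a binary primary metric, a standard normal-approximation formula for a two-sided, two-proportion test gives a first estimate of the sample size per arm. The sketch below assumes equal arm sizes and independent units; the 5-point MDE in the example is a hypothetical choice, while the 71% baseline echoes the Harbor figure.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Units needed per arm to detect an absolute lift of `mde` over `baseline`."""
    p1 = baseline
    p2 = baseline + mde
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and baseline + mde must lie in (0, 1)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


n_five_points = sample_size_per_arm(baseline=0.71, mde=0.05)
n_two_points = sample_size_per_arm(baseline=0.71, mde=0.02)
print(n_five_points, n_two_points)
```

Halving the MDE roughly quadruples the required sample, which is why the choice of MDE drives experiment duration on low-traffic products. Session-level metrics with several correlated requests per session will typically need more than this estimate suggests.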
CUPED and stratification
CUPED (Controlled-experiment Using Pre-Experiment Data) adjusts post-period outcomes by pre-period covariates on the same users, cutting variance 20–40% on repeat visitors. Stratify analysis by intent, locale, or query length so a global lift does not hide a collapse on long-tail slices. Log experiment ID, variant, model hash, and prompt version on every request for observability joins.
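The CUPED adjustment itself is short: estimate theta as cov(pre, post) / var(pre), then subtract theta times the centered pre-period covariate from each post-period outcome. The sketch below is a minimal version under the assumption that theta is estimated on data pooled across both arms; the numbers are toy values, not measurements.

```python
from statistics import fmean, pvariance


def cuped_adjust(post, pre):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x))."""
    if len(post) != len(pre) or len(post) < 2:
        raise ValueError("post and pre must have equal length >= 2")
    mean_pre = fmean(pre)
    mean_post = fmean(post)
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    var = sum((x - mean_pre) ** 2 for x in pre)
    if var == 0:
        return list(post)  # covariate carries no information
    theta = cov / var
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]


# Toy per-user success rates before and during the experiment (illustrative).
pre = [0.2, 0.4, 0.6, 0.8]
post = [0.3, 0.5, 0.6, 0.9]
adjusted = cuped_adjust(post, pre)

print(f"variance before: {pvariance(post):.4f}")
print(f"variance after:  {pvariance(adjusted):.4f}")
```

The adjustment leaves the mean unchanged and removes the part of the variance explained by the pre-period covariate, so the arm comparison is then run on the adjusted values. Users with no pre-period history need a separate treatment, which this sketch does not cover.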
Experiment lifecycle: from hypothesis to decision
- Hypothesis — “Shorter schema-injection prompt reduces hallucinated table names without lowering execution success.”
- Pre-register primary metric, guardrails, MDE, max duration, and analysis plan (including multiple-comparison correction if there are more than two arms).
- Shadow or offline gate — run candidate in shadow mode per canary/shadow patterns before exposing output to users; fail fast on safety regressions.
- Ramp — start 5–10% traffic; verify logging and guardrail pipelines; expand to target split.
- Monitor — daily dashboards on primary and guardrails; automated alerts on guardrail breach.
- Decide — ship winner, keep control, or iterate. Document negative results; they prevent re-running the same failed prompt.
- Cleanup — remove dead variants from registry; promote winner through canary to 100% with rollback path.
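The pre-registration step in the lifecycle above can be captured as an immutable record that is written before traffic flows. This is one possible shape, not a standard schema: the class, its field names, and the use of a Bonferroni correction for extra arms are assumptions for illustration, and the MDE value is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the plan cannot be edited after creation
class ExperimentPlan:
    experiment_id: str
    hypothesis: str
    randomization_unit: str      # "user", "session", or "request"
    primary_metric: str
    guardrails: tuple
    mde: float                   # absolute minimum detectable effect
    max_duration_days: int
    n_arms: int = 2              # control plus treatments
    alpha: float = 0.05

    def per_comparison_alpha(self) -> float:
        """Bonferroni-adjusted alpha for each treatment-vs-control comparison."""
        comparisons = max(self.n_arms - 1, 1)
        return self.alpha / comparisons


plan = ExperimentPlan(
    experiment_id="exp-schema-prompt",
    hypothesis=("Shorter schema-injection prompt reduces hallucinated table "
                "names without lowering execution success."),
    randomization_unit="session",
    primary_metric="query_execution_success",
    guardrails=("hallucinated_entity_rate", "p95_latency", "tokens_per_session"),
    mde=0.05,                    # hypothetical value
    max_duration_days=14,
)
three_arm_alpha = ExperimentPlan(
    experiment_id="exp-three-arm", hypothesis="(placeholder)",
    randomization_unit="session", primary_metric="query_execution_success",
    guardrails=(), mde=0.05, max_duration_days=14, n_arms=3,
).per_comparison_alpha()
print(plan.per_comparison_alpha(), three_arm_alpha)
```

Bonferroni is the simplest correction and is conservative; other corrections would slot into the same method.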
Harbor Analytics refactor: from offline ties to grounded A/B decisions
Harbor’s SQL copilot serves analysts who ask natural-language questions against a 200-table warehouse. Two prompt strategies tied on offline exact-match: Variant A injected full CREATE TABLE snippets for the 30 most-queried tables; Variant B injected only column names and relied on the model to infer joins.
The production experiment ran 14 days, session-level 50/50 split, primary metric = percentage of generated queries that executed successfully on first attempt. Guardrails: hallucinated table/column rate (judge + schema validator), P95 latency, tokens per session.
- Variant A: execution success 71% → 84% (+13 pp, p < 0.01); hallucination rate flat at 4.2%.
- Variant B: execution success 71% → 89% (+18 pp) but hallucination rate 4.1% → 9.7% (guardrail breach at day 6; traffic auto-capped at 25% for remainder).
Decision: ship A as default; use B only on sessions pre-classified as single-table aggregates via an intent router. The hybrid lifted overall success to 87% with hallucination under 5%. None of this was visible in the offline benchmark because eval questions used abbreviated schemas.
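The schema-validator half of the hallucination guardrail can be approximated by checking table references against the known schema. The sketch below is deliberately naive and is not Harbor's implementation: it uses a regular expression rather than a SQL parser, so it will misread constructs such as CTE names or `EXTRACT(... FROM ...)`, and the function names are hypothetical.

```python
import re

# Identifier that follows FROM or JOIN; subqueries starting with "(" are skipped.
_TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def unknown_tables(sql: str, known_tables: set[str]) -> set[str]:
    """Return referenced table names that are absent from the schema."""
    referenced = {name.lower() for name in _TABLE_REF.findall(sql)}
    return referenced - {t.lower() for t in known_tables}


def hallucinated_table_rate(queries: list[str], known_tables: set[str]) -> float:
    """Share of queries that reference at least one unknown table."""
    if not queries:
        return 0.0
    flagged = sum(1 for q in queries if unknown_tables(q, known_tables))
    return flagged / len(queries)


schema = {"orders", "customers"}  # stand-in for the real warehouse schema
queries = [
    "SELECT c.name, SUM(o.total) FROM orders o JOIN customers c ON o.cid = c.id GROUP BY c.name",
    "select count(*) from order_items",
]
print(unknown_tables(queries[1], schema))
print(hallucinated_table_rate(queries, schema))
```

A production validator would more likely resolve names through a real SQL parser or the warehouse's own planner, and would cover columns as well as tables.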
Technique decision table
| Approach | When it wins | Limitation |
|---|---|---|
| Classic A/B (2 arms, fixed horizon) | Clear ship/no-ship on one primary metric; regulated pre-registration | Slow on low traffic; one winner only |
| Sequential A/B (spending function) | Early stop on clear win or futility without p-hacking | Requires disciplined tooling; easy to misconfigure |
| Multi-armed bandit | Many variants, optimize exploration/exploitation on revenue | Harder to interpret; weak for rare safety events |
| Canary-only (no control comparison) | Fast rollout of obvious fix; safety monitoring | No causal estimate of lift vs status quo |
| Shadow inference | Compare quality before users see treatment | No user-behavior metrics (clicks, retention) |
| Offline eval only | Cheap screening of dozens of candidates | Misses distribution shift, tool feedback, user patience |
| Interleaving (ranking) | Side-by-side preference for two completions | UX cost; best for search/ranking surfaces |
Use offline eval to prune candidates; shadow to catch safety issues; A/B to measure causal lift on outcomes that matter to the business.
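For the sequential row in the table, it helps to see how a spending function rations the false-positive budget across interim looks. The sketch below evaluates the Lan-DeMets O'Brien-Fleming-type spending function, alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), where t is the fraction of the planned sample collected so far. It only reports cumulative alpha spent; converting that into per-look stopping boundaries requires numerical integration that is usually left to a statistics package.

```python
from statistics import NormalDist


def obf_alpha_spent(information_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent at a given information fraction."""
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / information_fraction ** 0.5))


# Four equally spaced looks over the planned sample.
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  cumulative alpha spent={obf_alpha_spent(t):.5f}")
```

Almost none of the budget is available at the first look, so only a very large early effect can stop the test, and the full alpha is reached only at the planned horizon.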
Common pitfalls
- Peeking without correction — stopping when p < 0.05 any day inflates false-positive rate; pre-specify horizon or use sequential methods.
- Multiple metrics fishing — reporting the one metric that moved; register primary metric before launch.
- Non-sticky assignment — users see both variants; session metrics become meaningless.
- Simpson’s paradox — global lift hides harm on mobile or non-English slices; always segment.
- Contaminated control — caching, CDN, or retrieval index updated mid-experiment for both arms unevenly.
- Thumbs-up bias — verbose answers get more clicks; pair with task success and LLM-as-judge on blinded pairs.
- Underpowered tests — declaring “no difference” after 200 sessions; report confidence intervals, not only p-values.
- Bundle changes — new model + new RAG + new prompt in one arm; impossible to learn what worked.
- Ignoring cost — treatment wins quality but 3x tokens; primary metric should reflect unit economics when relevant.
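To make the underpowered-test pitfall concrete, the sketch below computes a normal-approximation (Wald) confidence interval for the difference in success rates. The counts are hypothetical: 200 sessions split evenly, with the control rate set to the 71% baseline.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(success_c: int, n_c: int,
                             success_t: int, n_t: int,
                             confidence: float = 0.95):
    """Wald interval for (treatment rate - control rate)."""
    p_c = success_c / n_c
    p_t = success_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


# Hypothetical counts: 100 sessions per arm.
low, high = diff_confidence_interval(success_c=71, n_c=100,
                                     success_t=78, n_t=100)
print(f"95% CI for lift: [{low:+.3f}, {high:+.3f}]")
```

The interval contains zero, but it is wide enough to be compatible with both a meaningful regression and a large improvement, so "no significant difference" here says little about whether the variants actually differ.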
Production checklist
- Pre-register primary metric, guardrails, MDE, duration, and randomization unit.
- Implement sticky bucketing on user or session ID with immutable variant assignment.
- Log experiment_id, variant, model hash, prompt version, and retrieval snapshot ID per request.
- Run shadow or offline safety gate before exposing treatment output to users.
- Automate guardrail kill switches (hallucination rate, latency, policy violations).
- Stratify results by intent, locale, and query complexity; report slices, not only global.
- Apply CUPED or covariate adjustment when repeat users dominate traffic.
- Use fixed-horizon or sequential testing; document stopping rule before start.
- Archive negative results in experiment registry to prevent re-running failed variants.
- Promote winner via canary ramp with rollback; retire losing artifacts from serving path.
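The stratification item in the checklist can be sketched as a per-slice lift report alongside the global number. The slice names and counts below are invented to show a global lift that hides a regression on a small slice; the input layout is an assumption, not a required format.

```python
def slice_lifts(counts):
    """Absolute lift (treatment - control) per slice, plus the global lift.

    `counts` maps slice name -> {"control": (successes, total),
                                 "treatment": (successes, total)}.
    """
    lifts = {}
    totals = {"control": [0, 0], "treatment": [0, 0]}
    for slice_name, arms in counts.items():
        rates = {}
        for variant in ("control", "treatment"):
            successes, total = arms[variant]
            rates[variant] = successes / total
            totals[variant][0] += successes
            totals[variant][1] += total
        lifts[slice_name] = rates["treatment"] - rates["control"]
    lifts["ALL"] = (totals["treatment"][0] / totals["treatment"][1]
                    - totals["control"][0] / totals["control"][1])
    return lifts


# Hypothetical counts, for illustration only.
counts = {
    "en":     {"control": (700, 1000), "treatment": (800, 1000)},
    "non_en": {"control": (70, 100),   "treatment": (50, 100)},
}
for name, lift in slice_lifts(counts).items():
    print(f"{name:7s} lift = {lift:+.3f}")
```

Here the global lift is positive while the smaller slice loses 20 points, which is the pattern that a global-only dashboard would miss.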
Key takeaways
- Offline benchmarks screen candidates; A/B tests measure causal lift on outcomes real users care about — you need both, in that order.
- One pre-registered primary metric plus explicit guardrails prevents metric fishing and unsafe wins.
- Sticky session bucketing and immutable prompt versions are non-negotiable for interpretable LLM experiments.
- Harbor Analytics chose a hybrid routing policy only after an A/B test revealed Variant B's execution gain came with doubled schema hallucinations.
- Pair experimentation with canary deployment for promotion and with online evaluation for continuous monitoring after ship.
Related reading
- LLM online evaluation explained — live quality signals beyond experiment windows
- LLM canary and shadow deployment explained — safe ramps after an experiment picks a winner
- LLM prompt versioning registry explained — immutable artifacts for control and treatment arms
- LLM observability explained — tracing and metrics joins for experiment analysis