Guide

LLM A/B testing and online experimentation explained

Harbor Analytics ran two candidate system prompts for its internal SQL assistant through the same 400-question offline benchmark. Variant A and Variant B scored within one point of each other on exact-match accuracy. The team picked B because evaluators rated its explanations slightly more “helpful.” In production, B drove executable-query success from 71% to 89% on warehouse traffic, but it also more than doubled references to tables that do not exist in the customer schema. Variant A would have been safer; nobody measured schema-grounding rate because the offline set used sanitized table names. A two-week randomized A/B test with sticky session bucketing, a primary metric of query execution success, and a guardrail on hallucinated entity rate would have caught the tradeoff before full rollout.

A/B testing randomly assigns users or sessions to control and treatment stacks, then compares outcome metrics with statistical discipline. For LLM products it is how you decide which prompt, model, retrieval configuration, or tool schema actually wins on real queries — not which looked best in a lab notebook. This guide covers experiment design, bucketing, metric hierarchies, variance reduction, sequential testing, interaction with canary deployment and online evaluation, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

What makes LLM experiments different from classic web A/B tests

Button-color tests measure one binary click. LLM stacks emit variable-length text, call tools, retrieve documents, and fail in ways offline suites miss. High variance in response quality, heavy-tailed latency, and session-level carryover (a bad first answer poisons the whole thread) all break naive analysis.

Unit of randomization: user, session, or request?

User-level — stable experience; good for retention and habit metrics; slow to reach sample size on low-traffic products.
Session-level — common for support bots and copilots; balances consistency within a conversation with faster accumulation of events.
Request-level — only when each query is independent; risks inconsistent tone mid-thread unless you accept that tradeoff for speed on stateless APIs.

Pick one unit, document it, and never re-randomize mid-session without resetting conversation state. Pair bucketing with prompt versioning so control and treatment artifacts are immutable for the experiment duration.

What you can experiment on

System and developer prompts (tone, format, refusal policy).
Base model or fine-tuned checkpoint (same API, different weights).
Retrieval top-k, reranker, chunk size, or embedding model.
Tool schemas, decoding temperature, max tokens, structured-output mode.
Agent loop depth, planner prompts, human-in-the-loop gates.

Change one layer per experiment when possible. Bundling a new model plus a new prompt plus a new index makes attribution impossible and muddies rollback, because you cannot tell which layer to revert.

Metric design: one primary, explicit guardrails

Every experiment needs exactly one primary metric that decides ship or no-ship, pre-registered before traffic flows. Everything else is diagnostic or guardrail.

Good primary metrics for LLM products

Task success rate — SQL executed without error, ticket resolved without escalation, code compiled, form submitted.
Human thumbs-up rate on stratified sample (not raw aggregate — see pitfalls).
Downstream conversion — purchase, signup, or workflow completion attributed to the assistant turn.
Cost-adjusted success — success per dollar of inference when margin matters.

Guardrail metrics (auto-halt if breached)

Hallucination or ungrounded-claim rate on sampled turns.
P95 latency and token cost per session.
Safety policy violation rate (PII leak, disallowed content).
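Escalation-to-human rate, with the direction declared up front: lower is better when the product goal is deflection, while a falling rate can signal harm when escalation is the safe path for risky requests.
Retention of quality on existing slices (new fine-tunes can regress on them; see catastrophic forgetting).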

Wire guardrails into automated experiment kill switches. A treatment that lifts the primary metric 5% but doubles policy violations is a loss, not a win pending review.
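A guardrail check of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not a reference implementation: the Guardrail class, the metric names, and the thresholds are all hypothetical, and a missing metric is treated as a breach so that a broken logging pipeline halts the treatment instead of hiding a regression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """One auto-halt rule. Names and thresholds here are illustrative assumptions."""
    name: str
    threshold: float
    higher_is_worse: bool = True

    def breached(self, observed: float) -> bool:
        if self.higher_is_worse:
            return observed > self.threshold
        return observed < self.threshold


def breached_guardrails(guardrails, observed):
    """Return the names of breached guardrails.

    A metric that is missing from `observed` counts as a breach (fail closed).
    """
    breached = []
    for rule in guardrails:
        value = observed.get(rule.name)
        if value is None or rule.breached(value):
            breached.append(rule.name)
    return breached


# Hypothetical thresholds; the observed values would come from the metrics pipeline.
rules = [
    Guardrail("hallucination_rate", threshold=0.05),
    Guardrail("p95_latency_s", threshold=4.0),
    Guardrail("policy_violation_rate", threshold=0.001),
]
treatment_metrics = {
    "hallucination_rate": 0.097,
    "p95_latency_s": 3.1,
    "policy_violation_rate": 0.0004,
}
halt_reasons = breached_guardrails(rules, treatment_metrics)
should_halt = bool(halt_reasons)
```

In this sketch the ship decision never looks at the primary metric once should_halt is true, which is the point: a guardrail breach overrides a primary-metric win.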

Randomization, stickiness, and sample size

Sticky bucketing

Hash a stable identifier (user ID, session ID, or anonymous cookie) into a bucket 0–99. Assign buckets 0–49 to control and 50–99 to treatment for a 50/50 split. Sticky assignment prevents users from seeing both variants in one journey, which would confound session metrics and create bizarre UX.
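The hashing scheme described above can be sketched as follows. This is one possible implementation under stated assumptions: the function names are hypothetical, SHA-256 is an arbitrary choice of stable hash, and salting the hash with the experiment ID is an addition not described in the text, included so that the same user does not land in the same bucket in every experiment.

```python
import hashlib


def bucket(unit_id: str, experiment_id: str, buckets: int = 100) -> int:
    """Map a stable identifier to a bucket in [0, buckets)."""
    # Assumption: the experiment ID is used as a salt so assignments are
    # independent across experiments.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def assign_variant(unit_id: str, experiment_id: str, treatment_share: int = 50) -> str:
    """Low buckets go to control, the top `treatment_share` buckets to treatment."""
    if bucket(unit_id, experiment_id) >= 100 - treatment_share:
        return "treatment"
    return "control"


# The same session always gets the same answer, with no assignment table to store.
first = assign_variant("session-1234", "exp-demo")
second = assign_variant("session-1234", "exp-demo")
```

Because the assignment is a pure function of the identifier and the experiment ID, any service in the request path can recompute it, which avoids a lookup and keeps the variant stable across retries.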

Power and duration

Estimate required sample size from the baseline conversion rate and the minimum detectable effect (MDE). LLM metrics are often noisier than checkout funnels, so plan for longer runs or accept a larger MDE. A common mistake is stopping when the p-value flickers below 0.05 after three days; use sequential testing with spending functions or a fixed-horizon test declared upfront.
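For a binary primary metric such as execution success, a rough sample-size estimate can come from the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test, equal arm sizes, and independent units; sessions_per_arm is a hypothetical helper name, and the 3-point MDE in the example is an illustrative choice, not a figure from the Harbor experiment.

```python
import math
from statistics import NormalDist


def sessions_per_arm(baseline: float, mde: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate units per arm to detect an absolute lift of `mde`.

    Normal approximation for a two-sided, two-proportion test with equal arms.
    """
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / mde ** 2)


# Baseline of 71% success, hypothetical MDE of 3 percentage points.
n_small_effect = sessions_per_arm(0.71, 0.03)
# Accepting a larger MDE shrinks the requirement substantially.
n_large_effect = sessions_per_arm(0.71, 0.05)
```

With these inputs the estimate is on the order of a few thousand sessions per arm. Dividing that by daily eligible traffic gives a first guess at duration, which is the number to pre-register.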

CUPED and stratification

CUPED (Controlled-experiment Using Pre-Experiment Data) adjusts post-period outcomes by pre-period covariates on the same users. The variance reduction equals the squared correlation between the covariate and the outcome, which often works out to 20–40% on repeat visitors and to nothing for first-time users with no history. Stratify analysis by intent, locale, or query length so a global lift does not hide a collapse on long-tail slices. Log experiment ID, variant, model hash, and prompt version on every request for observability joins.
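The CUPED adjustment itself is short: subtract from each outcome the part that the pre-period covariate predicts. The sketch below uses only the standard library and synthetic data; cuped_adjust is a hypothetical name, and the simulated success rates are made up purely to show the mechanics.

```python
import random
from statistics import mean, variance


def cuped_adjust(post, pre):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x)).

    theta = cov(x, y) / var(x), estimated on the data passed in.
    """
    x_bar, y_bar = mean(pre), mean(post)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre, post)) / (len(pre) - 1)
    theta = cov / variance(pre)
    return [y - theta * (x - x_bar) for x, y in zip(pre, post)]


# Synthetic example: each user's post-period success rate tracks their
# pre-period rate plus noise. All numbers are illustrative.
rng = random.Random(0)
pre_period = [rng.gauss(0.70, 0.10) for _ in range(500)]
post_period = [x + rng.gauss(0.0, 0.05) for x in pre_period]

adjusted = cuped_adjust(post_period, pre_period)
variance_ratio = variance(adjusted) / variance(post_period)
```

The adjusted values keep the same mean as the raw outcomes but have lower variance, so the same traffic yields tighter confidence intervals. In a real experiment theta would typically be estimated once on control and treatment pooled together and then applied to both arms, so the adjustment cannot introduce a difference between them.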

Experiment lifecycle: from hypothesis to decision

Hypothesis — “Shorter schema-injection prompt reduces hallucinated table names without lowering execution success.”
Pre-register primary metric, guardrails, MDE, max duration, and analysis plan (including multiple-comparison correction if >2 arms).
Shadow or offline gate — run candidate in shadow mode per canary/shadow patterns before exposing output to users; fail fast on safety regressions.
Ramp — start 5–10% traffic; verify logging and guardrail pipelines; expand to target split.
Monitor — daily dashboards on primary and guardrails; automated alerts on guardrail breach.
Decide — ship winner, keep control, or iterate. Document negative results; they prevent re-running the same failed prompt.
Cleanup — remove dead variants from registry; promote winner through canary to 100% with rollback path.
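One way to make pre-registration concrete is to freeze the plan in a versioned, immutable record before any traffic flows. The sketch below is an assumption about how such a record might look: ExperimentPlan, its field names, the experiment ID, the MDE, and the start date are all hypothetical, while the hypothesis, metric names, and 14-day horizon echo the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ExperimentPlan:
    """Immutable pre-registration record (illustrative field set)."""
    experiment_id: str
    hypothesis: str
    primary_metric: str          # exactly one ship/no-ship metric
    guardrails: tuple            # metric names that trigger auto-halt
    randomization_unit: str      # "user", "session", or "request"
    mde: float                   # absolute minimum detectable effect
    start: date
    max_days: int
    ramp_shares: tuple = (0.05, 0.10, 0.50)

    def __post_init__(self):
        if self.randomization_unit not in {"user", "session", "request"}:
            raise ValueError("randomization_unit must be user, session, or request")
        if not self.guardrails:
            raise ValueError("at least one guardrail is required")
        if tuple(sorted(self.ramp_shares)) != tuple(self.ramp_shares):
            raise ValueError("ramp_shares must be non-decreasing")

    @property
    def end(self) -> date:
        """Last day of the fixed horizon."""
        return self.start + timedelta(days=self.max_days)


plan = ExperimentPlan(
    experiment_id="sql-prompt-schema-injection",  # hypothetical ID
    hypothesis=("Shorter schema-injection prompt reduces hallucinated table "
                "names without lowering execution success."),
    primary_metric="first_attempt_execution_success",
    guardrails=("hallucinated_entity_rate", "p95_latency", "tokens_per_session"),
    randomization_unit="session",
    mde=0.03,                 # hypothetical
    start=date(2025, 1, 6),   # hypothetical
    max_days=14,
)
```

Freezing the dataclass means the primary metric cannot be quietly swapped after results come in; changing the plan requires creating a new record, which leaves an audit trail.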

Harbor Analytics refactor: from offline ties to grounded A/B decisions

Harbor’s SQL copilot serves analysts who ask natural-language questions against a 200-table warehouse. Two prompt strategies tied on offline exact-match: Variant A injected full CREATE TABLE snippets for the 30 most-queried tables; Variant B injected only column names and relied on the model to infer joins.

The production experiment ran 14 days, session-level 50/50 split, primary metric = percentage of generated queries that executed successfully on first attempt. Guardrails: hallucinated table/column rate (judge + schema validator), P95 latency, tokens per session.

Variant A: execution success 71% → 84% (+13 pp, p < 0.01); hallucination rate flat at 4.2%.
Variant B: execution success 71% → 89% (+18 pp) but hallucination rate 4.1% → 9.7% (guardrail breach at day 6; traffic auto-capped at 25% for remainder).

Decision: ship A as default; use B only on sessions pre-classified as single-table aggregates via intent router. The hybrid lifted overall success to 87% with hallucination under 5%. None of this was visible in the offline benchmark because eval questions used abbreviated schemas.

Technique decision table

Approach	When it wins	Limitation
Classic A/B (2 arms, fixed horizon)	Clear ship/no-ship on one primary metric; regulated pre-registration	Slow on low traffic; one winner only
Sequential A/B (spending function)	Early stop on clear win or futility without p-hacking	Requires disciplined tooling; easy to misconfigure
Multi-armed bandit	Many variants, optimize exploration/exploitation on revenue	Harder to interpret; weak for rare safety events
Canary-only (no control comparison)	Fast rollout of obvious fix; safety monitoring	No causal estimate of lift vs status quo
Shadow inference	Compare quality before users see treatment	No user-behavior metrics (clicks, retention)
Offline eval only	Cheap screening of dozens of candidates	Misses distribution shift, tool feedback, user patience
Interleaving (ranking)	Side-by-side preference for two completions	UX cost; best for search/ranking surfaces

Use offline eval to prune candidates; shadow to catch safety issues; A/B to measure causal lift on outcomes that matter to the business.

Common pitfalls

Peeking without correction — stopping when p < 0.05 any day inflates false-positive rate; pre-specify horizon or use sequential methods.
Multiple metrics fishing — reporting the one metric that moved; register primary metric before launch.
Non-sticky assignment — users see both variants; session metrics become meaningless.
Simpson’s paradox — global lift hides harm on mobile or non-English slices; always segment.
Contaminated control — caching, CDN, or retrieval index updated mid-experiment for both arms unevenly.
Thumbs-up bias — verbose answers get more clicks; pair with task success and LLM-as-judge on blinded pairs.
Underpowered tests — declaring “no difference” after 200 sessions; report confidence intervals, not only p-values.
Bundle changes — new model + new RAG + new prompt in one arm; impossible to learn what worked.
Ignoring cost — treatment wins quality but 3x tokens; primary metric should reflect unit economics when relevant.
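The cost of peeking is easy to demonstrate with a simulation. The sketch below runs A/A tests, where both arms draw from the same distribution so every "significant" result is a false positive, and compares checking at every look against checking once at the planned horizon. The number of looks, samples per look, and simulation count are arbitrary choices for illustration.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% critical value for a single look


def simulate_aa_test(rng, looks=10, n_per_look=50):
    """Simulate one A/A test with unit-variance normal outcomes.

    Returns (significant_at_any_look, significant_at_final_look).
    """
    sum_a = sum_b = 0.0
    n = 0
    crossed_any = False
    z = 0.0
    for _ in range(looks):
        for _ in range(n_per_look):
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
        n += n_per_look
        # z-statistic for a difference in means with known unit variance
        z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
        if abs(z) > Z_CRIT:
            crossed_any = True
    return crossed_any, abs(z) > Z_CRIT


def false_positive_rates(simulations=1000, seed=0):
    """Return (rate when stopping at any significant look, rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(simulations):
        any_look, final_look = simulate_aa_test(rng)
        peeking += any_look
        fixed += final_look
    return peeking / simulations, fixed / simulations


peeking_fpr, fixed_fpr = false_positive_rates()
```

The fixed-horizon rate should land near the nominal 5%, while the stop-at-any-look rate comes out several times higher. Sequential methods close that gap by raising the critical value at each interim look instead of reusing 1.96.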

Production checklist

Pre-register primary metric, guardrails, MDE, duration, and randomization unit.
Implement sticky bucketing on user or session ID with immutable variant assignment.
Log experiment_id, variant, model hash, prompt version, and retrieval snapshot ID per request.
Run shadow or offline safety gate before exposing treatment output to users.
Automate guardrail kill switches (hallucination rate, latency, policy violations).
Stratify results by intent, locale, and query complexity; report slices, not only global.
Apply CUPED or covariate adjustment when repeat users dominate traffic.
Use fixed-horizon or sequential testing; document stopping rule before start.
Archive negative results in experiment registry to prevent re-running failed variants.
Promote winner via canary ramp with rollback; retire losing artifacts from serving path.
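The logging and stratification items in this checklist work together: if every request record carries the experiment fields, slice-level reporting is a simple group-by. The sketch below uses made-up records and counts (the field values, locales, and success numbers are all hypothetical) to show a global lift that hides a collapse on one slice.

```python
from collections import defaultdict


def success_rates(records, key):
    """Group records by key(record) and return the success rate per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, requests]
    for record in records:
        group = key(record)
        totals[group][0] += record["success"]
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


def make_records(variant, locale, successes, total):
    """Build synthetic per-request log records carrying the checklist fields."""
    return [
        {
            "experiment_id": "exp-demo",          # hypothetical values throughout
            "variant": variant,
            "model_hash": "m-0000",
            "prompt_version": "p-" + variant,
            "retrieval_snapshot_id": "idx-0000",
            "locale": locale,
            "success": int(i < successes),
        }
        for i in range(total)
    ]


records = (
    make_records("control", "en", 800, 1000)
    + make_records("control", "non-en", 120, 200)
    + make_records("treatment", "en", 880, 1000)
    + make_records("treatment", "non-en", 90, 200)
)

overall = success_rates(records, key=lambda r: r["variant"])
by_slice = success_rates(records, key=lambda r: (r["locale"], r["variant"]))
```

In this synthetic data the treatment wins overall while the non-English slice drops from 60% to 45%, which a global dashboard alone would not show.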

Key takeaways

Offline benchmarks screen candidates; A/B tests measure causal lift on outcomes real users care about — you need both, in that order.
One pre-registered primary metric plus explicit guardrails prevents metric fishing and unsafe wins.
Sticky session bucketing and immutable prompt versions are non-negotiable for interpretable LLM experiments.
Harbor Analytics chose a hybrid routing policy only after an A/B test revealed that Variant B’s execution gain came with more than double the schema hallucination rate.
Pair experimentation with canary deployment for promotion and with online evaluation for continuous monitoring after ship.