LLM online evaluation explained

Harbor Support shipped a prompt refactor after its offline eval suite showed a 91% pass rate on 800 held-out tickets. Two weeks later, first-contact resolution on live chats fell from 74% to 61%, a 13-point drop invisible to the benchmark because long, multi-intent threads (18% of traffic but 41% of escalations) were underrepresented in the test set. Thumbs-down rate barely moved; users simply abandoned chats and reopened tickets. The team rebuilt evaluation around online signals: stratified sampling from production logs, task-completion proxies, and nightly LLM-as-judge scoring on 3% of sessions. Within one sprint they identified the failure mode, rolled back the prompt, and added a canary gate tied to resolution rate before any future promotion.

Online evaluation measures model and prompt quality on real user traffic — not just static benchmarks run before deploy. It combines implicit behavioral signals, explicit feedback, sampled human review, and automated judges to detect regressions, drift, and edge cases that offline sets miss. This guide explains the offline–online gap, signal taxonomy, sampling and stratification, judge pipelines on live logs, tying metrics to canary promotion gates, the Harbor Support refactor, a technique decision table versus offline-only eval and full human review, pitfalls, and a production checklist.

Why offline benchmarks are necessary but not sufficient

Offline evaluation — fixed prompt–response pairs, golden sets, regression suites — is fast, reproducible, and cheap to run in CI. Every serious LLM product needs it. But offline scores systematically diverge from production reality for predictable reasons:

Distribution shift — live queries differ in length, language, intent mix, and tool usage from curated test sets.
Stale golden sets — product features, policies, and retrieval corpora change; benchmarks lag by weeks.
Missing failure modes — jailbreaks, ambiguous inputs, and multi-turn context collapse rarely appear in small held-out sets.
Proxy mismatch — BLEU, ROUGE, and even rubric scores often correlate only weakly with business outcomes such as resolution rate or revenue.

Online evaluation closes the loop: it observes what actually happens when real users interact with the system, weighted by traffic volume and business impact. It complements — never replaces — offline gates run before merge and deploy.

Signal taxonomy: implicit, explicit, and outcome metrics

Implicit behavioral signals

Users reveal preference without clicking thumbs up or down:

Task completion — ticket closed without reopen, checkout finished, code compiled, query answered in one turn.
Regeneration and edit rate — how often users click “try again” or heavily edit model output before sending.
Abandonment and bounce — session ended mid-task, chat closed without resolution, page exit after one assistant turn.
Copy and share rate — positive proxy when users copy assistant text into their workflow.
Escalation to human — explicit failure signal in copilot and support products; track rate by intent and model version.
Latency and cost — not quality directly, but regressions in time-to-first-token or tokens-per-resolution often accompany prompt or retrieval changes.
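
The signals above can be reduced to per-version rates once sessions are logged. The following is a minimal sketch, not a reference implementation: the `Session` fields and the `implicit_rates` helper are hypothetical names chosen for illustration, and a real pipeline would derive these flags from its own event schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical signal names; map these to whatever events your product emits.
SIGNALS = ("completed", "regenerated", "abandoned", "escalated")


@dataclass(frozen=True)
class Session:
    model_version: str
    completed: bool = False    # task finished without a reopen
    regenerated: bool = False  # user clicked "try again" at least once
    abandoned: bool = False    # session ended mid-task
    escalated: bool = False    # handed off to a human


def implicit_rates(sessions):
    """Return {model_version: {signal: rate}} over the given sessions."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: dict.fromkeys(SIGNALS, 0))
    for s in sessions:
        totals[s.model_version] += 1
        for name in SIGNALS:
            counts[s.model_version][name] += int(getattr(s, name))
    return {
        version: {name: counts[version][name] / totals[version] for name in SIGNALS}
        for version in totals
    }
```

Grouping by model version is the smallest useful slice; the same pattern extends to intent or prompt hash by changing the grouping key.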

Explicit feedback

Thumbs, star ratings, free-text comments, and structured “was this helpful?” prompts. Explicit signals are sparse (typically 1–5% of sessions) and biased toward extremes: angry and delighted users are over-represented, while the silent majority simply abandons. Weight explicit feedback heavily when present, but never rely on it alone.

Outcome and business metrics

Tie LLM quality to metrics leadership already tracks: support resolution rate, sales conversion, deflection rate, time-to-resolution, refund rate, or developer PR merge rate for coding assistants. These are lagging but authoritative; online eval pipelines should dashboard them by model version, prompt hash, and retrieval index snapshot.

Building an online eval pipeline

1. Log everything needed for replay

Each request should emit a structured trace: prompt template version, model ID, retrieval chunks with scores, tool calls and results, final response, latency breakdown, user/session ID (hashed), and downstream outcome flags. Without replayable logs, online eval devolves into aggregate counters that cannot diagnose regressions. See LLM observability for trace schema design.
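
One way to make the trace concrete is a small set of dataclasses serialized as one JSON line per request. This is a sketch under assumptions: every class, field, and function name below is hypothetical and simply mirrors the items listed above, and the salted SHA-256 hash is one common choice for pseudonymizing IDs rather than a prescribed one.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievedChunk:
    doc_id: str
    score: float


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str


@dataclass
class Trace:
    prompt_template_version: str
    model_id: str
    user_id_hash: str
    response: str
    latency_ms: dict                                   # e.g. {"retrieval": 40, "generation": 900}
    retrieved: list = field(default_factory=list)      # list of RetrievedChunk
    tool_calls: list = field(default_factory=list)     # list of ToolCall
    outcome_flags: dict = field(default_factory=dict)  # filled in later, e.g. {"resolved": True}


def hash_user_id(raw_id: str, salt: str) -> str:
    """Pseudonymize an ID so traces can be joined without storing the raw value."""
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()


def to_json_line(trace: Trace) -> str:
    """Serialize a trace, including nested chunks and tool calls, as one JSON line."""
    return json.dumps(asdict(trace), sort_keys=True)
```

Keeping outcome flags on the same record as the prompt version and model ID is what lets a later regression be bisected to a specific change.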

2. Stratified sampling

Uniform random sampling at 2% leaves rare, high-impact segments with too few sessions to evaluate. Stratify samples by:

Intent or product area (billing vs technical vs account)
Session length quartile (single-turn vs 10+ turns)
Language and locale
Model version and prompt hash (for A/B and canary comparison)
Outcome bucket (resolved, escalated, abandoned)

Oversample tail segments even if they are only 5% of traffic; they can drive a disproportionate share of escalations and brand risk.
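
A minimal sketch of stratified sampling with per-stratum rates follows. The function name, its parameters, and the example rates are assumptions for illustration. Each sampled session carries an inverse-probability weight so that metrics computed on the oversampled set can still be re-weighted to traffic-level estimates.

```python
import random
from collections import defaultdict


def stratified_sample(sessions, key, rates, default_rate, seed=0):
    """Sample each stratum at its own rate.

    key:          function mapping a session to its stratum label
    rates:        {stratum: sampling rate}; strata not listed use default_rate
    Returns a list of (session, weight) pairs, where weight is the number of
    sessions in the stratum that each sampled session stands for.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for s in sessions:
        by_stratum[key(s)].append(s)

    sample = []
    for stratum, members in by_stratum.items():
        rate = rates.get(stratum, default_rate)
        # Take at least one session per stratum, and never more than exist.
        k = min(len(members), max(1, round(len(members) * rate)))
        weight = len(members) / k
        sample.extend((s, weight) for s in rng.sample(members, k))
    return sample
```

For example, calling it with `rates={"complex": 0.5}` and `default_rate=0.02` samples half of the complex sessions while keeping the bulk of traffic at 2%.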

3. Automated judging on samples

Run a separate judge model (or rubric chain) nightly on stratified samples. Score dimensions relevant to your product: correctness, policy compliance, tone, faithfulness to retrieved context, tool-use accuracy. Calibrate judge scores against a weekly human-labeled slice; if judge–human correlation drops below 0.7 on a dimension, pause automated alerts until recalibrated.
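
The calibration check can be sketched as a per-dimension correlation between judge and human scores on the same sessions. The text does not say which correlation statistic to use; Pearson is assumed here, and the function names are hypothetical. The 0.7 threshold comes from the paragraph above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores carry no ranking information
    return cov / sqrt(var_x * var_y)


def dimensions_to_pause(judge_scores, human_scores, threshold=0.7):
    """Return the dimensions whose judge-human correlation is below threshold.

    Both arguments map a dimension name to scores for the same sessions,
    in the same order.
    """
    return sorted(
        dim
        for dim in judge_scores
        if pearson(judge_scores[dim], human_scores[dim]) < threshold
    )
```

A rank correlation such as Spearman may suit ordinal rubric scores better; the gating logic stays the same either way.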

4. Human review queue

Route high-uncertainty sessions (low judge confidence, explicit thumbs-down, escalation, or safety flags) to a human review queue. Human labels feed back into golden sets, judge calibration, and fine-tuning datasets — closing the offline–online loop.
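
The routing rule can be sketched as a function that collects the reasons a session needs review and orders the queue. The field names, the 0.6 confidence cutoff, and the safety-first ordering are all assumptions for illustration, not values from the text.

```python
from dataclasses import dataclass


@dataclass
class ScoredSession:
    session_id: str
    judge_confidence: float  # 0.0 to 1.0, as reported by the judge pipeline
    thumbs_down: bool = False
    escalated: bool = False
    safety_flag: bool = False


def review_reasons(s, min_confidence=0.6):
    """Return the list of triggers that apply to this session (may be empty)."""
    reasons = []
    if s.safety_flag:
        reasons.append("safety_flag")
    if s.escalated:
        reasons.append("escalation")
    if s.thumbs_down:
        reasons.append("thumbs_down")
    if s.judge_confidence < min_confidence:
        reasons.append("low_judge_confidence")
    return reasons


def build_review_queue(sessions, min_confidence=0.6):
    """Return (session_id, reasons) pairs: safety flags first, then most reasons."""
    queue = []
    for s in sessions:
        reasons = review_reasons(s, min_confidence)
        if reasons:
            queue.append((s.session_id, reasons))
    queue.sort(key=lambda item: ("safety_flag" not in item[1], -len(item[1])))
    return queue
```

Recording the reasons alongside each queued session lets reviewers' labels be traced back to the trigger that surfaced them.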

5. Alerting and promotion gates

Define SLIs with thresholds: e.g., resolution rate must stay within 2 points of baseline over 72 hours at 5% canary traffic. Wire alerts to rollback runbooks. Online eval is not a dashboard exercise; it must block bad promotions automatically when paired with canary deployment.
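
A promotion gate of this shape can be sketched as a pure function over the observed window. The 2-point and 72-hour defaults come from the example above; the `Window` type, the `min_sessions` floor, and the three decision labels are assumptions. This sketch treats only drops as breaches, which is one reading of "within 2 points of baseline".

```python
from dataclasses import dataclass


@dataclass
class Window:
    hours_observed: float
    canary_resolved: int
    canary_total: int
    control_resolved: int
    control_total: int


def gate_decision(w, max_drop_points=2.0, min_hours=72.0, min_sessions=500):
    """Return "rollback", "hold", or "promote" for a canary window."""
    # Too little data to judge either way: keep the canary where it is.
    if w.canary_total < min_sessions or w.control_total == 0:
        return "hold"
    canary_rate = 100.0 * w.canary_resolved / w.canary_total
    control_rate = 100.0 * w.control_resolved / w.control_total
    drop = control_rate - canary_rate  # positive means the canary is worse
    if drop > max_drop_points:
        return "rollback"
    if w.hours_observed < min_hours:
        return "hold"
    return "promote"
```

Returning a decision rather than performing the rollback keeps the gate easy to test; the deploy system is then responsible for acting on "rollback".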

Detecting drift and regressions over time

Quality drift happens without code deploys: retrieval corpus ages, user behavior shifts seasonally, upstream model APIs change behavior silently. Monitor:

Input distribution — embedding centroid shift, new intent clusters, rising average prompt length.
Output distribution — refusal rate spikes, average response length, vocabulary drift, rising tool-error rate.
Score trends — rolling 7-day judge scores by dimension; CUSUM or Bayesian change-point detection on resolution rate.
Segment breakdowns — aggregate metrics hide segment regressions; always slice by intent, locale, and model version.

When drift is detected, triage whether the root cause is data (stale RAG), model (provider update), prompt (accidental edit), or traffic (new user cohort). Each root cause has a different fix; misdiagnosis wastes sprints.
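
As an illustration of the CUSUM option mentioned above, here is a minimal one-sided sketch that watches for a sustained drop in a daily metric such as resolution rate. The function name and the `target`, `slack`, and `threshold` parameters are assumptions; in practice they would be tuned against historical noise.

```python
def cusum_drop(values, target, slack, threshold):
    """Return the index where a downward shift is first flagged, or None.

    values:    metric per period (e.g. daily resolution rate in points)
    target:    expected in-control level
    slack:     shortfall per period that is tolerated as noise
    threshold: cumulative shortfall that triggers an alert
    """
    cumulative = 0.0
    for i, x in enumerate(values):
        # Accumulate only shortfalls beyond the slack; never go below zero.
        cumulative = max(0.0, cumulative + (target - x - slack))
        if cumulative > threshold:
            return i
    return None
```

With `target=74`, `slack=1`, and `threshold=8`, the series `[74, 75, 73, 74, 70, 69, 68, 67]` is flagged at index 6: the shortfalls of 3, 4, and 5 accumulate to 12. A series that only wobbles around 74 is never flagged.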

Harbor Support refactor: from offline pass rate to live resolution

Before the refactor, Harbor Support evaluated prompt changes on a static 800-ticket set updated quarterly. The failing prompt scored 91% pass — but pass meant “judge model said acceptable,” not “ticket resolved.”

The online eval rebuild added:

Primary SLI: first-contact resolution rate within 24 hours, segmented by ticket complexity tier.
Secondary SLIs: escalation rate, average handle time, policy-violation flag rate, thumbs-down per 1,000 sessions.
Sampling: 3% of sessions nightly to LLM-as-judge with rubric covering correctness, empathy, and policy; 100% of escalations to human review within 4 hours.
Canary gate: promote only if resolution rate at 5% traffic stays within 1.5 points of control for 72 hours.

The next prompt experiment that looked strong offline (+4 points on golden set) failed canary at hour 36: resolution on “complex” tier dropped 6 points while simple tier improved. Auto-rollback fired; offline-only deploy would have shipped a net-negative change. After tuning stratified weights in the golden set to match live intent mix, a revised prompt passed canary and lifted resolution 3 points globally.
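
The reason a per-tier gate catches what an aggregate gate misses can be shown with a little arithmetic. The rates and traffic mix below are hypothetical numbers chosen to mirror the pattern described (a 6-point drop on the complex tier, a small gain on the simple tier); only the 1.5-point threshold comes from the text, and the function names are illustrative.

```python
def blended_rate(rates, mix):
    """Traffic-weighted average of per-tier rates (mix values sum to 1)."""
    return sum(rates[tier] * mix[tier] for tier in rates)


def breached_tiers(control, canary, max_drop_points=1.5):
    """Return the tiers where the canary trails control by more than the limit."""
    return sorted(
        tier for tier in control if control[tier] - canary[tier] > max_drop_points
    )


# Hypothetical resolution rates in percentage points, and a hypothetical mix.
mix = {"simple": 0.82, "complex": 0.18}
control = {"simple": 80.0, "complex": 50.0}
canary = {"simple": 81.0, "complex": 44.0}

aggregate_drop = blended_rate(control, mix) - blended_rate(canary, mix)
tier_breaches = breached_tiers(control, canary)
```

Under these assumed numbers the blended drop is about 0.26 points, well inside a 1.5-point aggregate gate, while `tier_breaches` contains the complex tier. Gating per tier is what turns a hidden regression into a rollback.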

Technique decision table

Approach	Signal fidelity	Cost / latency	Time to detect regressions	Best when
Offline golden set only	Low for live drift	Very low	Pre-deploy only	CI regression gates, fast iteration
Implicit metrics dashboard	Medium–high (outcome-linked)	Low (existing analytics)	Hours to days	Products with clear completion events
LLM-as-judge on live samples	Medium (needs calibration)	Medium GPU/API cost	Hours (batch nightly)	High-volume chat without strong outcome metrics
Full human review of all traffic	Highest	Very high labor	Real-time	Low-volume, high-stakes (medical, legal send)
A/B test with business metric	High (causal)	Medium (split traffic)	Days to weeks	Major model upgrades needing causal proof
Canary + online SLI gates	High for deploy decisions	Low incremental	Hours at small traffic %	Every production prompt/model change

Common pitfalls

Optimizing offline score alone — teams ship when golden set passes even if live resolution flatlines; tie releases to online SLIs.
Thumbs as sole signal — sparse and biased; pair with implicit abandonment and outcome metrics.
Uniform random sampling — misses tail segments that drive escalations; stratify by intent and session length.
Uncalibrated LLM judges — automated scores drift as judge models update; recalibrate against the weekly human-labeled slice.
Aggregate dashboards without slices — a flat resolution rate hides a crashing complex tier and improving simple tier.
No version attribution — logs missing prompt hash or model ID make regressions impossible to bisect.
Alert fatigue — alerting on every 0.5-point noise causes teams to ignore real regressions; use minimum effect sizes and duration windows.
Ignoring seasonality — holiday traffic mixes differ; compare canary to control on matched cohorts, not year-ago baselines alone.

Production checklist

Define 1–2 primary online SLIs tied to business outcomes (resolution, conversion, task success).
Log prompt version, model ID, retrieval snapshot, tools, latency, and outcome flags on every request.
Run offline regression suite in CI; block merge on golden-set failures.
Sample 2–5% of live traffic for nightly LLM-as-judge scoring with stratified oversampling of tail segments.
Calibrate judge scores against weekly human-labeled slice; alert if correlation drops below threshold.
Route escalations, thumbs-down, and low-confidence judge scores to human review within SLA.
Wire canary promotion gates to online SLIs with automatic rollback on breach.
Dashboard metrics by model version, prompt hash, intent, and locale — never aggregate-only.
Feed human review labels back into golden sets and training data monthly.
Run a quarterly audit: check how well offline pass rate correlates with live SLIs, and refresh the test set distribution.

Key takeaways

Offline benchmarks catch regressions before deploy but miss distribution shift, stale tests, and proxy mismatch with real user outcomes.
Online evaluation combines implicit behavior, explicit feedback, sampled judges, and business metrics to measure quality on live traffic.
Stratified sampling oversamples high-impact tail segments — random 2% sampling misses the threads that drive escalations.
Pair online SLIs with canary gates so bad prompt or model changes auto-rollback before full traffic exposure.
Harbor Support shipped a prompt that scored 91% offline and lost 13 points of live resolution; once releases were gated on live resolution rate, the canary blocked a later net-negative prompt whose complex-tier regression was invisible to the golden set.