Guide

LLM preference data curation and labeling explained

Harbor Support exported 40,000 production chat rows with thumbs-up and thumbs-down flags, converted them directly into chosen/rejected pairs, and ran DPO on a 7B support model. Offline win rate against the SFT baseline looked fine at 61%, but live A/B traffic showed a 6-point regression on first-contact resolution. The postmortem found that 34% of “chosen” answers were longer but less accurate; 19% of pairs compared the same model family at different temperatures, not different policies; and 12% of prompts were duplicate billing FAQs that dominated the loss. The alignment team did not need a new algorithm. They needed a preference data pipeline: stratified prompt sampling, blind side-by-side labeling with a written rubric, adjudication for disagreements, and hard filters on length skew and near-tie pairs. After rebuilding the dataset as 8,200 curated pairs (not 40k noisy rows), DPO win rate rose from 58% to 79% on a held-out human eval, and live resolution recovered to 4 points above the pre-DPO baseline.

Alignment methods (RLHF, DPO, ORPO, SimPO, KTO) all assume your labels encode what you actually want. Garbage preferences teach the model to be verbose, sycophantic, or confidently wrong. This guide covers label formats (pairwise, pointwise, listwise), rubric and UI design, stratification and coverage targets, quality control and inter-annotator agreement, debiasing length and position effects, when to use model-assisted labeling, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist for teams shipping preference data into training.

What preference data is (and what it is not)

Preference data records which of two or more model outputs better satisfies a goal for a fixed prompt (and optional context). Training consumes it as implicit rewards: if humans consistently prefer answer A over B, the optimizer increases the likelihood of A under Bradley-Terry or classification-style losses.
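As a minimal sketch of the Bradley-Terry view described above, the snippet below turns two scalar rewards into a preference probability and a pairwise loss. The function names and the use of plain floats for rewards are illustrative assumptions; real trainers compute these quantities from model log-probabilities or a reward model head.

```python
import math


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the chosen answer beats the rejected one.

    Under Bradley-Terry this is a sigmoid of the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed human preference."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))


# Equal rewards mean a coin flip; a larger margin lowers the loss.
print(bradley_terry_prob(0.0, 0.0))
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0))
```

The loss only shrinks when the chosen answer's reward rises relative to the rejected one, which is why mislabeled "chosen" answers push the model in the wrong direction.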

It is not the same as SFT demonstrations (single gold completions), thumbs-up logs without rejected alternatives, or benchmark scores. A thumbs-up on one reply does not tell the model what to avoid unless you pair it with a rejected candidate from the same prompt. It is also not a substitute for safety red-teaming: preference labels optimize average user satisfaction, which can drift toward agreeable hallucination unless rubrics explicitly punish fabrication.

Pairwise vs pointwise vs listwise

Pairwise (A vs B, pick winner or tie) is the default for DPO and reward modeling because Bradley-Terry theory maps cleanly to P(chosen > rejected). Pointwise scores (1–5 helpfulness) work for filtering and for methods like KTO that accept unpaired good/bad labels, but require calibrated rubrics so a “4” means the same across annotators. Listwise ranking (order 4+ completions) improves coverage per prompt but is cognitively expensive; many teams run pairwise tournaments (Swiss or round-robin) to derive rankings offline.
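One way to derive a ranking from a round-robin of pairwise judgments is a simple points table: a win scores 1, a tie scores 0.5 for each side. This is a sketch under assumptions; the tuple layout and the outcome strings "a", "b", and "tie" are hypothetical, and a production pipeline might fit Bradley-Terry strengths instead of counting points.

```python
from collections import defaultdict


def rank_from_pairwise(judgments):
    """Rank candidates from (cand_a, cand_b, outcome) judgments.

    outcome is "a", "b", or "tie" (a hypothetical encoding).
    Returns candidate ids sorted by points, best first.
    """
    scores = defaultdict(float)
    for cand_a, cand_b, outcome in judgments:
        # Register both candidates so zero-point ones still appear.
        scores[cand_a] += 0.0
        scores[cand_b] += 0.0
        if outcome == "a":
            scores[cand_a] += 1.0
        elif outcome == "b":
            scores[cand_b] += 1.0
        elif outcome == "tie":
            scores[cand_a] += 0.5
            scores[cand_b] += 0.5
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
    # Break point ties by id so the ordering is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


judgments = [("c1", "c2", "a"), ("c1", "c3", "a"), ("c2", "c3", "tie")]
print(rank_from_pairwise(judgments))
```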

Rubric design and labeling UI

A rubric is a short, ordered checklist annotators apply before picking a winner. Harbor Support's v2 rubric was: (1) factual correctness on account state, (2) complete resolution without unnecessary steps, (3) tone appropriate to upset customers, (4) no unauthorized promises. Ties were allowed when both failed (2) or both passed all four with negligible difference.

UI choices matter. Side-by-side blind comparison (randomized left/right, model IDs hidden) cuts position bias by roughly half versus always showing “new model” on the right. Show the full prompt and any retrieved policy snippets annotators must verify against. Disable markdown rendering tricks that make one answer look more “professional” through headers alone. For long answers, collapse to diff view on the sentences that differ.
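The blind, randomized presentation can be sketched as a small task builder. Everything here is an assumption for illustration: the field names, the salted SHA-256 masking, and the 12-character truncation are hypothetical choices, not Harbor's actual UI payload.

```python
import hashlib
import random


def mask_model_id(model_id: str, salt: str) -> str:
    """Hide the model identity; the salt makes guessing IDs harder."""
    digest = hashlib.sha256((salt + model_id).encode("utf-8")).hexdigest()
    return digest[:12]


def build_blind_task(prompt, candidates, salt, rng):
    """Build one side-by-side task from two (model_id, text) tuples.

    Left/right order is shuffled per task so neither model
    systematically occupies one side.
    """
    if len(candidates) != 2:
        raise ValueError("expected exactly two candidates")
    order = list(candidates)  # copy so the caller's list is not mutated
    rng.shuffle(order)
    (left_model, left_text), (right_model, right_text) = order
    return {
        "prompt": prompt,
        "left": left_text,
        "right": right_text,
        "left_model_hash": mask_model_id(left_model, salt),
        "right_model_hash": mask_model_id(right_model, salt),
    }


rng = random.Random(7)  # seeded only to make the example repeatable
pair = [("prod-v1", "Answer one"), ("cand-v3", "Answer two")]
task = build_blind_task("Where is my refund?", pair, "secret-salt", rng)
print(sorted(task))
```

Logging the hashed IDs alongside the side each answer appeared on makes it possible to audit left-win rates later without exposing model names to annotators.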

Budget 30–90 seconds per pairwise judgment for consumer support; legal or medical domains often need 3–5 minutes and specialist annotators. If median time drops below 15 seconds, you are measuring vibe, not rubric compliance.

Stratification, coverage and pair construction

Raw production logs overweight frequent intents. Harbor's first dataset was 41% password-reset FAQs and 3% refund disputes — yet disputes drove 28% of escalations. Fix: define strata (intent, difficulty, user sentiment, language, policy edge case) and sample prompts to hit minimum counts per stratum before generating candidate pairs.
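A minimal sketch of the stratified sampling fix follows, assuming each prompt row is a dict with a "stratum" key (a hypothetical schema). It samples up to a fixed count per stratum and reports shortfalls so under-covered strata are visible before pair generation starts.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, per_stratum, rng):
    """Sample up to per_stratum prompts from each stratum.

    Returns (sample, shortfalls), where shortfalls maps a stratum
    to how many prompts it is missing relative to the target.
    """
    by_stratum = defaultdict(list)
    for row in prompts:
        by_stratum[row["stratum"]].append(row)

    sample, shortfalls = [], {}
    for stratum in sorted(by_stratum):
        rows = by_stratum[stratum]
        take = min(per_stratum, len(rows))
        sample.extend(rng.sample(rows, take))
        if len(rows) < per_stratum:
            shortfalls[stratum] = per_stratum - len(rows)
    return sample, shortfalls


# Toy log that is dominated by one intent, as in the raw export.
log = [{"id": i, "stratum": "password_reset"} for i in range(41)]
log += [{"id": 100 + i, "stratum": "refund_dispute"} for i in range(3)]
sample, shortfalls = stratified_sample(log, per_stratum=5, rng=random.Random(0))
print(len(sample), shortfalls)
```

A non-empty shortfall map is a signal to collect or synthesize more prompts for that stratum, not to silently proceed with fewer.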

Pair construction should compare policies you want to distinguish:

Current production model vs candidate fine-tune
Candidate A vs candidate B at the same temperature
SFT baseline vs RLHF checkpoint (not two RLHF temps)

Generate 2–4 completions per prompt with diverse decoding (temperature, top-p) but a fixed system prompt. Discard pairs where both outputs are near-duplicates (BLEU or embedding cosine above a threshold): they add noise without gradient signal.
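The near-duplicate filter can be sketched with the standard library alone. Note the substitution: this uses `difflib.SequenceMatcher` as a cheap character-level stand-in for the BLEU or embedding-cosine check named above, and the 0.9 threshold is an assumed value that would need tuning on real data.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    """True when two completions are too similar to be a useful pair.

    threshold is an assumed cutoff on SequenceMatcher's 0..1 ratio.
    """
    ratio = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
    return ratio >= threshold


def drop_near_duplicates(pairs, threshold: float = 0.9):
    """Keep only (chosen, rejected) pairs that differ meaningfully."""
    return [p for p in pairs if not is_near_duplicate(p[0], p[1], threshold)]


pairs = [
    ("Reset your password from Settings.", "Reset your  password from settings."),
    ("Reset your password from Settings.", "Your refund was issued on Tuesday."),
]
print(len(drop_near_duplicates(pairs)))
```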

Target 5k–50k high-quality pairs for domain-specific DPO on 7B–13B models before chasing six-figure scale; doubling noisy data often hurts more than it helps.

Quality control and adjudication

Sample 10–15% duplicate labeling on every batch. Track Cohen's kappa or percent agreement per stratum; pause batches below 0.55 kappa on factual tasks until rubric training improves. Maintain a golden set of 50–100 prompts with committee-adjudicated labels; require annotators to pass 90%+ on gold before production work.
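For the duplicate-labeled sample, Cohen's kappa for two annotators can be computed directly. The sketch below is a plain implementation of the standard formula; the `should_pause` helper and its use of the 0.55 gate from the text are illustrative, and the label values are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def should_pause(labels_a, labels_b, min_kappa=0.55):
    """Gate a batch on the agreement threshold from the text."""
    return cohens_kappa(labels_a, labels_b) < min_kappa


annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]
print(cohens_kappa(annotator_1, annotator_2))
print(should_pause(annotator_1, annotator_2))
```

Running this per stratum rather than per batch surfaces cases where agreement is fine on FAQs but poor on disputes.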

Route disagreements and low-confidence ties to a senior adjudicator. Log why (rubric clause), not just winner — those notes become failure-mode buckets for the next SFT pass. Integrate with human-in-the-loop review queues so production escalations feed back into strata that are underrepresented.

Debiasing length, style and position

Models learn length bias quickly: longer answers win when annotators skim. Mitigations: cap displayed length behind an expand control, normalize by token count in analysis, drop pairs where the winner is 40%+ longer but rubric scores are tied on correctness, and add an explicit rubric clause, “shorter wins if equally correct.” Position bias: randomize order and balance the left-win rate in QA dashboards. Style bias: bullet lists tend to beat plain text, so audit with formatting-stripped rendering.
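The length-skew filter and the left-win audit can both be sketched in a few lines. Assumptions to note: whitespace-split words stand in for tokenizer counts, the correctness scores are whatever scale the rubric uses, and the function names are hypothetical.

```python
def is_length_skewed(chosen, rejected, chosen_correct, rejected_correct,
                     max_ratio=1.4):
    """Flag pairs where the winner is 40%+ longer with tied correctness.

    Word counts approximate token counts here (an assumption).
    """
    chosen_len = len(chosen.split())
    rejected_len = max(len(rejected.split()), 1)  # avoid division by zero
    tied = chosen_correct == rejected_correct
    return tied and (chosen_len / rejected_len) >= max_ratio


def left_win_rate(winning_sides):
    """Share of non-tie judgments won by the left side.

    winning_sides holds "left", "right", or "tie" per judgment.
    A rate far from 0.5 suggests position bias.
    """
    decided = [s for s in winning_sides if s in ("left", "right")]
    if not decided:
        raise ValueError("no decided judgments to audit")
    return decided.count("left") / len(decided)


long_answer = " ".join(["word"] * 20)
short_answer = " ".join(["word"] * 10)
print(is_length_skewed(long_answer, short_answer, 1, 1))
print(left_win_rate(["left", "right", "tie", "left", "right"]))
```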

Model-assisted labeling (when and how)

LLM judges can pre-filter obvious losers or suggest ties, but should not be the sole label for high-stakes alignment. A practical pattern: the model ranks 4 samples, and humans adjudicate only the prompts where the gap between the judge's top two candidates is below 0.15 in probability. With this pattern, Harbor cut human hours 52% while keeping gold agreement flat.
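The routing rule for judge-assisted labeling can be sketched as a margin check. This assumes the judge emits one probability per candidate of being the best answer; that output format, the dict layout, and the function name are hypothetical, while the 0.15 gap comes from the text.

```python
def needs_human_review(judge_probs, margin=0.15):
    """Route a prompt to humans when the judge's top two are close.

    judge_probs maps candidate id -> judge probability of being best
    (an assumed output format for the judge model).
    """
    if len(judge_probs) < 2:
        raise ValueError("need at least two candidates")
    top, runner_up = sorted(judge_probs.values(), reverse=True)[:2]
    return (top - runner_up) < margin


close_call = {"a": 0.40, "b": 0.35, "c": 0.15, "d": 0.10}
clear_win = {"a": 0.70, "b": 0.15, "c": 0.10, "d": 0.05}
print(needs_human_review(close_call))
print(needs_human_review(clear_win))
```

Spot-checking a small random share of the "clear win" cases against humans is still worthwhile, since a confident judge can be confidently wrong.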

Constitutional AI / RLAIF scales synthetic critiques; treat those labels as a separate stratum and spot-check 5–10% against humans so constitution gaps do not become permanent blind spots. Never mix synthetic and human pairs in the same batch without a source tag — you will need to ablate later.

Harbor Support pipeline walkthrough

Intent taxonomy — 24 support intents with minimum 200 prompts each from stratified production sample.
Candidate generation — SFT baseline vs DPO candidate v3, temperature 0.7, same retrieval context.
Blind UI — randomized order, model IDs hashed, rubric sidebar mandatory before submit.
QC gates — drop length-skew pairs, near-duplicate pairs, and annotators failing gold set.
Adjudication — 8% of rows reviewed by lead; tie rate capped at 15% per stratum.
Export — JSONL with prompt, chosen, rejected, stratum, annotator_tier, rubric_version.
Train — DPO beta sweep on 8.2k pairs; select checkpoint on held-out human eval, not training loss.
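The export step above can be sketched as a validating JSONL writer. The field names come from the export step; the validation behavior (rejecting rows with missing or empty fields) and the example values are assumptions for illustration.

```python
import json

REQUIRED_FIELDS = (
    "prompt", "chosen", "rejected",
    "stratum", "annotator_tier", "rubric_version",
)


def to_jsonl(rows):
    """Serialize preference rows to JSONL, one pair per line.

    Raises ValueError if a row lacks any required field, so untagged
    rows never reach training.
    """
    lines = []
    for index, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            raise ValueError(f"row {index} missing fields: {missing}")
        record = {field: row[field] for field in REQUIRED_FIELDS}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


row = {
    "prompt": "Why was I charged twice?",
    "chosen": "I see two charges; one is a pending hold that will drop off.",
    "rejected": "Charges can happen for many reasons.",
    "stratum": "refund_dispute",
    "annotator_tier": "senior",   # hypothetical tier name
    "rubric_version": "v2",
}
print(to_jsonl([row]), end="")
```

Failing loudly on a missing `rubric_version` is cheaper than debugging a regression caused by mixed rubric versions later.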

Result: human eval win rate 58% → 79%; live resolution +4 points vs pre-DPO baseline; average reply length −11% with fewer policy violations.

Technique decision table

Data strategy	Label cost	Best for	Risk
Curated pairwise + rubric	High	DPO, reward models, production alignment	Slow to scale
Thumbs-up/down only (unpaired)	Low	KTO, filtering, triage	Weak signal for pairwise methods
LLM-as-judge pre-filter	Medium	Large candidate pools, RAG QA	Judge-model bias propagates
RLAIF / constitutional critiques	Low marginal	Harmlessness, style norms	Constitution gaps, self-reinforcement
Crowd without domain rubric	Medium	Generic helpfulness	Factual errors in specialized domains
Synthetic preference only	Very low	Prototyping, data ablations	Offline metrics lie; live regression

Common pitfalls

Logging thumbs without rejected candidates — cannot form valid DPO pairs retroactively without regenerating losers.
Comparing different temperatures as policies — teaches verbosity/luck, not alignment.
Intent imbalance — model optimizes FAQ chit-chat while escalations persist.
No tie option — forces noise on equivalent answers; inflates fake margins.
Rubric drift — v1 and v3 labels mixed without version tags; impossible to debug regressions.
Annotator fatigue batches — quality collapses after 200 judgments/session; enforce breaks.
Training on eval prompts — benchmark contamination; hold out entire strata.
Ignoring near-ties — DPO on coin-flip pairs adds gradient noise; filter or down-weight.
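For the near-tie pitfall, one possible down-weighting scheme uses the annotator vote margin as a per-pair weight. This is only a sketch: the weighting formula and the `min_margin` cutoff of 0.2 are assumptions, not a method described above, and whether a given trainer accepts per-example weights depends on the training code.

```python
def pair_weight(votes_for_chosen, votes_for_rejected, min_margin=0.2):
    """Weight a pair by how decisively annotators preferred the winner.

    Returns 0.0 for coin-flip pairs (margin below min_margin) so they
    can be filtered, otherwise the normalized vote margin in (0, 1].
    """
    total = votes_for_chosen + votes_for_rejected
    if total == 0:
        raise ValueError("pair has no votes")
    margin = abs(votes_for_chosen - votes_for_rejected) / total
    return 0.0 if margin < min_margin else margin


print(pair_weight(3, 0))  # unanimous
print(pair_weight(3, 1))  # clear but contested
print(pair_weight(2, 2))  # coin flip, filtered out
```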

Production checklist

Publish rubric version and annotator training materials before labeling starts.
Define strata with minimum per-stratum pair counts tied to business KPIs.
Generate candidates from policies you intend to compare at fixed decoding settings.
Randomize side-by-side order; hide model identity; log position and length metadata.
Run 10%+ duplicate labeling; compute kappa per stratum weekly.
Maintain golden adjudicated set; gate annotator access on gold accuracy.
Filter length-skew, near-duplicate, and low-agreement pairs before training export.
Tag rows with source (human, RLAIF, judge-filtered) and rubric_version.
Hold out full strata for human eval; select checkpoints on eval, not train loss.
Close the loop: production escalations feed back into under-covered strata.

Key takeaways

Preference quality dominates preference quantity for DPO and reward modeling.
Pairwise labels need rejected candidates from the same prompt — raw thumbs-up logs are not enough.
Stratified sampling and rubrics prevent FAQ-heavy datasets from hijacking the loss.
Blind side-by-side UI and length debiasing cut position and verbosity skew.
Harbor Support recovered a 6-point live regression by rebuilding 8.2k curated pairs instead of using 40k noisy logs.
Model-assisted labeling scales adjudication; it does not replace human gold on high-stakes facts.