LLM KTO Kahneman-Tversky optimization explained
Harbor Support’s triage assistant logged 42,000 agent reactions over six months: thumbs up on helpful draft replies, thumbs down on hallucinated refund policies or wrong ticket routing. Only 3,200 prompts had paired rankings from side-by-side evals, the format DPO and ORPO expect, so engineers were throwing away 92% of production signal. Switching to KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024) let the team train directly on binary desirable/undesirable labels, with no matched winner/loser required per prompt. It raised first-contact resolution rate on held-out tickets from 67% to 83% while using the full feedback corpus.
KTO reframes alignment through prospect theory: humans judge outcomes
relative to a reference point and feel losses more sharply than equivalent gains. The
training objective mirrors that asymmetry, penalizing undesirable completions with
stronger weight than it rewards desirable ones, while anchoring the policy to a frozen
reference model, as DPO does. Unlike pairwise methods, each training row is simply
(x, y, s) where s ∈ {+1, −1}, which suits
production logs, moderation queues, and implicit feedback where pairing is expensive.
This guide covers the KTO objective, loss-aversion hyperparameters, unpaired data
pipelines, when KTO beats DPO and SIMPO, the Harbor Support refactor, a technique
decision table, common pitfalls, and a production checklist.
What KTO changes versus pairwise preference methods
DPO, ORPO, and SIMPO all consume pairwise data: for prompt
x, a preferred completion y_w and a rejected completion
y_l. That structure is clean for offline eval but expensive
in production. Real systems emit:
- Thumbs up/down on a single model output per turn.
- Implicit signals — copy-paste, edit distance, ticket reopen, escalation to human.
- Moderation flags — approve or reject without generating a paired alternative.
- Class imbalance — mostly negatives on early model versions, mostly positives after iteration.
Converting binary logs to pairwise data requires either re-sampling a rejected completion (noisy) or discarding unpaired rows (wasteful). KTO trains on each labeled example independently:
- Desirable (s = +1): increase the implicit reward for y relative to the reference policy.
- Undesirable (s = −1): decrease it, with prospect-theory weighting so the gradient on bad outputs is stronger than the mirror gradient on good ones.
The reference model π_ref supplies the anchor DPO users
know: KTO compares log-likelihood ratios against frozen base weights rather than
letting the policy drift unbounded. You still pay reference-model memory, unlike
reference-free ORPO/SIMPO, but you gain access to data formats those methods
cannot ingest natively.
The KTO objective in practice
Implicit rewards from log-ratios
As in DPO, define an implicit reward for completion y given prompt x:
$$r_\theta(x, y) = \beta \cdot \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
Higher reward means the policy assigns more probability mass than the reference.
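β scales the log-ratio, so it sets how strongly the objective reacts to each unit
of divergence from the reference: a larger β keeps the policy closer to π_ref, a
smaller β lets it move further. This is the same role β plays in DPO.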
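The reward definition above can be sketched in a few lines. This is a minimal illustration, assuming per-token log-probabilities have already been gathered from the policy and the frozen reference; the function name, the mask convention (1 for completion tokens, 0 for prompt tokens), and the sample numbers are all hypothetical.

```python
from typing import Sequence


def implicit_reward(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    completion_mask: Sequence[int],
    beta: float = 0.05,
) -> float:
    """Return beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    Only positions where completion_mask is 1 contribute, so prompt
    tokens never leak into the log-ratio.
    """
    if not (len(policy_logprobs) == len(ref_logprobs) == len(completion_mask)):
        raise ValueError("all inputs must have the same length")
    log_ratio = sum(
        m * (p - r)
        for p, r, m in zip(policy_logprobs, ref_logprobs, completion_mask)
    )
    return beta * log_ratio


# Hypothetical per-token log-probs: two prompt tokens, three completion tokens.
policy = [-1.0, -0.5, -0.2, -0.3, -0.1]
ref = [-1.0, -0.6, -0.4, -0.5, -0.3]
mask = [0, 0, 1, 1, 1]

# Each completion token is 0.2 nats more likely under the policy than the reference.
reward = implicit_reward(policy, ref, mask, beta=0.1)
```

The second prompt token also differs between the two models, but the mask zeroes it out, so only the three completion tokens enter the reward.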
Prospect-theory weighting
KTO assigns different utility to desirable versus undesirable labels. For desirable
examples, the loss encourages reward above a reference point; for undesirable
examples, it pushes reward below that point, scaled by a loss-aversion
coefficient λ_u > 1 (typical values here are 1.5–2.0; the KTOConfig in
HuggingFace TRL defaults both desirable_weight and undesirable_weight to 1.0, so the
asymmetry has to be set explicitly). Intuitively: one bad support reply that invents
a refund policy should pull gradients harder than one good reply pushes them.
Implementations express this through sigmoid gating on the reward with asymmetric
weights λ_d (desirable) and λ_u (undesirable). The exact functional form varies
by trainer (HuggingFace TRL, Axolotl, custom loops) but the invariant is:
undesirable rows contribute larger-magnitude gradients per example
when reward is insufficiently negative.
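One way to make the asymmetry concrete is a per-example loss with sigmoid gating, sketched below. It assumes the reward is already β-scaled and that the reference point is a plain number in the same units; in the KTO paper that point comes from a batch-level KL estimate, which is omitted here. The function names and default weights are illustrative, not any trainer's API.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_example_loss(
    reward: float,
    label: int,
    ref_point: float = 0.0,
    lambda_d: float = 1.0,
    lambda_u: float = 1.8,
) -> float:
    """Loss for one (x, y, s) row; label is +1 (desirable) or -1 (undesirable)."""
    if label == +1:
        # Small when the reward sits well above the reference point.
        return lambda_d * (1.0 - sigmoid(reward - ref_point))
    if label == -1:
        # Small when the reward sits well below the reference point.
        return lambda_u * (1.0 - sigmoid(ref_point - reward))
    raise ValueError("label must be +1 or -1")


def kto_example_grad(
    reward: float,
    label: int,
    ref_point: float = 0.0,
    lambda_d: float = 1.0,
    lambda_u: float = 1.8,
) -> float:
    """Derivative of kto_example_loss with respect to the reward."""
    if label not in (+1, -1):
        raise ValueError("label must be +1 or -1")
    s = sigmoid(reward - ref_point)
    slope = s * (1.0 - s)  # identical for both labels by symmetry of the sigmoid
    return -lambda_d * slope if label == +1 else lambda_u * slope
```

At the same distance from the reference point, the undesirable gradient is λ_u / λ_d times larger in magnitude than the desirable one, which is the invariant described above. Descending the loss raises the reward on desirable rows (negative derivative) and lowers it on undesirable rows (positive derivative).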
Unpaired batching
Each training step samples a batch of labeled tuples. A batch may contain only
positives, only negatives, or a mix; there is no requirement that a desirable and
an undesirable completion (y_w and y_l in pairwise notation) share the same
x. Shuffle desirable and undesirable rows from separate parquet
partitions; apply class weights if production logs skew 90% negative on early
checkpoints. Monitor effective gradient norm per class so one regime does not
dominate after filtering.
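Class-balanced batching is one way to handle skewed logs. The sketch below draws half of each batch from the desirable pool and half from the undesirable pool, sampling with replacement so the smaller class is oversampled. The row layout `(prompt, completion, label)` and the function name are assumptions for illustration; a real pipeline would read from its own storage.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[str, str, int]  # (prompt, completion, label) with label in {+1, -1}


def balanced_batches(
    rows: Sequence[Row],
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Row]]:
    """Yield batches with a fixed desirable/undesirable split.

    Rows are sampled with replacement, so the minority class is
    oversampled rather than the majority class being discarded.
    """
    rng = random.Random(seed)
    desirable = [r for r in rows if r[2] == +1]
    undesirable = [r for r in rows if r[2] == -1]
    if not desirable or not undesirable:
        raise ValueError("need at least one desirable and one undesirable row")

    n_pos = batch_size // 2
    n_neg = batch_size - n_pos
    for _ in range(num_batches):
        batch = rng.choices(desirable, k=n_pos) + rng.choices(undesirable, k=n_neg)
        rng.shuffle(batch)  # avoid a fixed positives-then-negatives order
        yield batch
```

Because prompts are not matched across classes, a batch can pair a desirable reply to one ticket with an undesirable reply to an unrelated one. Oversampling repeats minority rows, so it is worth watching for memorization of the rare class.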
Starting checkpoint and SFT
KTO assumes a reasonable reference — usually an SFT instruct model. Harbor Support started from an internal 7B SFT checkpoint trained on 18,000 resolved ticket transcripts. KTO then fine-tuned on the 42,000 binary labels without an intermediate pairwise curation stage. Teams with small pairwise sets can merge both: pairwise rows for hard comparisons, unpaired thumbs for volume — KTO consumes both as labeled singles.
When KTO beats DPO and when it does not
KTO tends to help when:
- Production feedback is binary and unpaired — support, sales copilots, moderation tools where only one completion exists per event.
- Negative examples are informative — edited drafts, reopened tickets, and safety flags carry strong “do not do this” signal.
- Pairwise collection is too slow — labeling two completions per prompt doubles annotator cost; KTO uses every logged row.
- Class imbalance reflects real failure modes — overweighting rare catastrophic outputs matches prospect-theory loss aversion.
DPO, ORPO, or SIMPO may still win when:
- Clean pairwise rankings exist at scale: offline eval farms producing matched winner/loser pairs for the same prompt give sharper relative gradients than isolated labels.
- Reference model memory is prohibitive: ORPO and SIMPO drop π_ref; KTO keeps it.
- Both completions in a pair are bad: pairwise methods still learn “less bad”; KTO needs explicit desirable examples somewhere in the corpus.
- Length bias dominates — SIMPO’s per-token normalization targets verbosity hacking; KTO uses sequence log-probs unless you add masking tricks.
- Rewards require on-policy rollouts — verifiable math/code tasks often need GRPO or outcome reward models.
Harbor Support refactor: unpaired logs to KTO
Harbor Support routes tier-1 tickets: password resets, billing disputes, SLA
questions. The copilot drafts replies; agents accept, edit, or reject. The logging
pipeline recorded (ticket_id, prompt, completion, label) where label
derived from accept-without-edit (+1), heavy edit (−1), or explicit thumbs
down (−1). Distribution: 28% desirable, 72% undesirable — typical for
a model still learning policy boundaries.
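The label derivation described above could be sketched as follows. The `LogEvent` fields, the `edit_ratio` measure, and the 0.3 threshold are hypothetical: the source names the three outcomes but does not define "heavy edit" or say how light edits are handled, so this sketch leaves light edits unlabeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class LogEvent:
    ticket_id: str
    prompt: str
    completion: str
    accepted: bool
    edit_ratio: float  # hypothetical: fraction of the draft the agent changed, 0.0 to 1.0
    thumbs_down: bool


# Hypothetical cutoff for a "heavy" edit; tune against an audited sample.
HEAVY_EDIT_THRESHOLD = 0.3


def to_label(event: LogEvent) -> Optional[int]:
    """Map one logged reaction to +1, -1, or None (ambiguous, skip)."""
    if event.thumbs_down:
        return -1
    if event.edit_ratio >= HEAVY_EDIT_THRESHOLD:
        return -1
    if event.accepted and event.edit_ratio == 0.0:
        return +1
    # Light edits (typo fixes and similar) are treated as unlabeled here.
    return None


def to_kto_row(event: LogEvent) -> Optional[Dict[str, Union[str, int]]]:
    """Build a (ticket_id, prompt, completion, label) row, or None to drop it."""
    label = to_label(event)
    if label is None:
        return None
    return {
        "ticket_id": event.ticket_id,
        "prompt": event.prompt,
        "completion": event.completion,
        "label": label,
    }
```

Returning `None` for light edits keeps minor corrections from being counted as policy violations; a tiered severity scheme would be another reasonable choice.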
First attempt: subsample 3,200 prompts with two model samples each, human-rank,
train DPO (β = 0.1). Resolution rate on a blind ticket set: 71%.
Most production negatives never entered training. Second attempt: KTO on all 42,000
labels, β = 0.05, λ_u = 1.8,
desirable weight 1.0, three epochs, class-balanced batching (50/50 per step via
oversampling positives). Resolution rate: 83%; hallucinated refund-policy rate
dropped from 11% to 3.4%; median draft length unchanged (no verbosity collapse).
They kept DPO weights for prompts with rich pairwise evals as an ensemble fallback but shipped KTO for production because it consumed the full feedback firehose. Reference checkpoint remained the pre-KTO SFT snapshot for auditability.
Technique decision table
| Method | Strength | Weakness | Best when |
|---|---|---|---|
| KTO | Unpaired binary labels; loss-averse weighting; uses full production logs | Reference model memory; needs some desirable examples | Thumbs up/down, moderation, implicit feedback at scale |
| DPO | Sharp relative signal; mature tooling | Requires paired data; throws away unpaired rows | Offline ranking eval with matched winner/loser |
| ORPO | Joint SFT + preference; no reference model | Pairwise only; chosen quality critical | Single-stage pairwise alignment on good winners |
| SIMPO | Reference-free; length-normalized margin | Pairwise only; no binary-native loss | Verbosity bias after SFT; tight GPU budget |
| RLHF (PPO + RM) | Online rewards; flexible scoring | Expensive; fragile | Rich reward models; research-scale compute |
Common pitfalls
- Label noise from implicit feedback — agents accept drafts to save time even when wrong; audit a stratified sample before trusting thumbs.
- All-negative corpora — KTO still needs desirable examples to know what to increase; synthesize positives via SFT or oversample rare accepts.
- Prompt leakage into loss — mask prompt tokens; only completion tokens contribute to log-prob ratios.
- Ignoring λ_u: setting loss aversion to 1.0 removes KTO’s main advantage over symmetric losses.
- Stale reference after long KTO runs: large β with many epochs can saturate against π_ref; re-evaluate KL drift.
- Conflating edit distance with label: minor typo fixes are not equivalent to policy violations; tier label severity.
- Evaluating only aggregate thumbs accuracy — track task KPIs (resolution rate, escalation rate, safety incidents).
- Skipping red team after alignment — run adversarial probes on refund, medical, and legal edge cases post-KTO.
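For the KL-drift check mentioned in the pitfalls, a rough Monte Carlo estimate is often enough. The sketch below assumes you have sampled completions from the current policy and scored each one under both the policy and the frozen reference (sequence-level log-probs over completion tokens). The function names and the budget value are hypothetical.

```python
from statistics import mean
from typing import Sequence


def estimate_kl_drift(
    policy_seq_logprobs: Sequence[float],
    ref_seq_logprobs: Sequence[float],
) -> float:
    """Estimate KL(pi_theta || pi_ref) as the mean log-ratio.

    The estimate is only meaningful when the completions were sampled
    from the current policy, not taken from the training set.
    """
    if len(policy_seq_logprobs) != len(ref_seq_logprobs):
        raise ValueError("inputs must have the same length")
    if not policy_seq_logprobs:
        raise ValueError("need at least one sampled completion")
    return mean([p - r for p, r in zip(policy_seq_logprobs, ref_seq_logprobs)])


def drift_exceeded(kl_estimate: float, budget: float) -> bool:
    """Flag a checkpoint whose estimated drift is above a chosen budget."""
    return kl_estimate > budget


# Hypothetical sequence log-probs for two sampled completions.
kl = estimate_kl_drift([-10.0, -12.0], [-11.0, -14.0])
needs_review = drift_exceeded(kl, budget=1.0)
```

With few samples the estimate is noisy and can even come out negative, so it works better as a trend across checkpoints than as a hard threshold on a single one.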
Production checklist
- Define label policy: explicit thumbs vs implicit edits vs escalation; document edge cases.
- Audit 200+ rows per class for label quality before training.
- Choose SFT reference checkpoint; freeze π_ref with version hash.
- Implement masked log-prob ratio reward; configure β, λ_d, λ_u.
- Balance batches if production logs skew heavily negative or positive.
- Sweep β and λ_u on held-out binary accuracy and task KPIs.
- Log per-epoch desirable/undesirable loss, KL drift, median completion length.
- A/B against DPO on overlapping pairwise slice before full cutover.
- Export weights with dataset version, label policy, and hyperparameter manifest.
- Run regression evals: resolution rate, hallucination rate, refusal behavior, latency.
- Schedule periodic retrain as new production labels accumulate.
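The version-hash and manifest items in the checklist can be covered with the standard library alone. This is a sketch under assumptions: the manifest keys, function names, and file layout are hypothetical, and the reference checkpoint is treated as a single file on disk.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    reference_path: Path,
    dataset_version: str,
    label_policy: str,
    beta: float,
    lambda_d: float,
    lambda_u: float,
) -> Dict[str, Any]:
    """Collect what is needed to reproduce or audit a KTO run."""
    return {
        "reference_sha256": sha256_of_file(reference_path),
        "dataset_version": dataset_version,
        "label_policy": label_policy,
        "hyperparameters": {
            "beta": beta,
            "lambda_d": lambda_d,
            "lambda_u": lambda_u,
        },
    }


def write_manifest(manifest: Dict[str, Any], out_path: Path) -> None:
    """Write the manifest as stable, diff-friendly JSON."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Storing the manifest next to the exported weights makes it possible to tell later which reference snapshot and label policy a given checkpoint was trained against.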
Key takeaways
- KTO aligns from binary desirable/undesirable labels without pairwise matching.
- Prospect-theory weighting penalizes bad completions more sharply than it rewards good ones.
- Production thumbs-up/down and implicit feedback are first-class training data, not DPO leftovers.
- Harbor Support used 42,000 unpaired labels to lift resolution rate from 67% to 83%.
- KTO keeps a reference model like DPO; choose ORPO/SIMPO if you have pairwise data and reference memory is the bottleneck.
- Balance batches and audit label noise — implicit accept is not ground truth.
Related reading
- Direct preference optimization (DPO) explained — pairwise alignment with implicit rewards and reference anchoring
- LLM ORPO explained — joint SFT and preference optimization without a reference model
- LLM SIMPO explained — reference-free alignment with length-normalized margins
- RLHF explained — reward models, PPO, and the classic three-stage alignment stack