Guide

Learning to rank explained

Learning to rank (LTR) is the branch of machine learning that optimizes order, not just relevance scores. Search engines, product catalogs, ad auctions, and feed recommenders all share the same shape: retrieve hundreds of candidates, then rank the top few where users actually look. A classifier that predicts click probability per item is a start, but ranking quality depends on relative order within each query — that is what LTR trains directly. This guide covers pointwise, pairwise, and listwise paradigms, NDCG and MRR evaluation, feature design, LambdaMART and neural rankers, a Harbor Supply catalog search worked example, an approach decision table, common pitfalls, and a practitioner checklist. For retrieval before ranking, see semantic search explained; for cross-encoder refinement, see LLM reranking explained; for feed-style personalization, see recommendation systems explained.

Retrieval vs ranking — two stages, one goal

Modern search and recommendation pipelines almost always split into two stages. Retrieval (first stage) casts a wide net: BM25 keyword match, approximate nearest-neighbor vector search, or collaborative filtering to produce 100–1,000 candidates in milliseconds. Ranking (second stage) scores and reorders that shortlist using richer features — query-document interactions, click history, inventory signals, freshness — because those features are too expensive to compute across the full corpus.

Learning to rank applies to the second stage (and sometimes to lightweight first-stage rankers). The objective is not "is this document relevant?" in isolation but "should document A appear above document B for this query?" That positional focus is why ranking metrics like NDCG differ from classification metrics like accuracy or AUC computed over all rows: those treat every example equally and ignore query groups, while NDCG is computed per query and weights the top positions most. It is also why a model with excellent pointwise accuracy can still produce a bad user experience at the top of the results page.

What makes ranking different from classification

In classification, each example is independent. In ranking, examples come in groups (one group per query or user session), and only order within the group matters. Labels are often graded relevance (0 = irrelevant, 1 = marginally relevant, 2 = highly relevant) rather than binary. Evaluation weights top positions heavily — a relevant result at rank 10 contributes far less than the same result at rank 1.

Three LTR paradigms

Pointwise — predict a score per document

Pointwise methods treat each query-document pair as an independent training example. A regression model predicts a relevance score; a classifier predicts click or conversion probability. At inference, sort by predicted score. This is the easiest to implement — you can reuse logistic regression, gradient boosting, or a small neural net — and it works when labels are reliable per-item scores. The downside: it ignores relative order during training. Two documents both labeled "relevant" are treated equally even if one should clearly outrank the other.
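A small illustration of that downside, using made-up labels and predictions for a single query: the first set of predictions has the lower squared error, yet it puts the wrong document on top. The function names are illustrative only.

```python
# Hypothetical graded labels for one query (2 = highly relevant).
labels = [2, 1, 0]

# Two made-up sets of pointwise predictions for the same three documents.
preds_a = [0.9, 1.0, 0.1]  # close to the labels, but doc 1 edges out doc 0
preds_b = [3.0, 2.0, 1.0]  # every score is off by 1, but the order is right


def mse(y_true, y_pred):
    """Mean squared error, the usual pointwise regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def ranking(scores):
    """Document indices sorted by descending predicted score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


print(mse(labels, preds_a), ranking(preds_a))  # lower error, wrong top result
print(mse(labels, preds_b), ranking(preds_b))  # higher error, correct order
```

A pointwise loss prefers the first model; a user looking at the top result prefers the second.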

Pairwise — learn from document pairs

Pairwise methods generate training pairs within each query group: for documents A and B where A was clicked and B was not, the model learns to score A higher than B. RankNet and LambdaRank follow this pattern. The loss is defined on pairs, so the model directly optimizes relative order. Pairwise training scales with the number of pairs per query, which can explode for large candidate sets — subsampling negative pairs is standard practice.
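The pair construction and loss can be sketched as follows. This is a minimal illustration, not any library's implementation: `sample_pairs`, its `negatives_per_positive` parameter, and the SKU identifiers are assumptions, and the loss is the logistic pairwise loss used by RankNet.

```python
import math
import random


def sample_pairs(clicked, unclicked, negatives_per_positive=5, seed=0):
    """Build (preferred, other) pairs inside ONE query group.

    Each clicked document is paired with a random subsample of unclicked
    documents instead of all of them, which keeps the pair count bounded.
    """
    rng = random.Random(seed)
    pairs = []
    for pos in clicked:
        k = min(negatives_per_positive, len(unclicked))
        for neg in rng.sample(unclicked, k):
            pairs.append((pos, neg))
    return pairs


def ranknet_loss(score_pos, score_neg):
    """Logistic pairwise loss: small when the preferred doc scores higher."""
    return math.log1p(math.exp(-(score_pos - score_neg)))


# Hypothetical query group: one clicked SKU, 48 unclicked candidates.
pairs = sample_pairs(["sku_1"], [f"sku_{i}" for i in range(2, 50)])
print(len(pairs))               # 5 pairs instead of 48
print(ranknet_loss(2.0, 0.0))   # correctly ordered pair: low loss
print(ranknet_loss(0.0, 2.0))   # misordered pair: high loss
```

Pairs are only ever formed within a query group; comparing documents retrieved for different queries would teach the model nothing about order.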

Listwise — optimize the whole ranked list

Listwise methods define loss functions over entire ranked lists. ListNet and ListMLE fall here. Listwise training aligns most closely with metrics like NDCG but is computationally heavier and less common in production than pairwise or pointwise-plus-reranking hybrids. LambdaMART is often grouped with listwise methods even though its loss is built from pairs: each pair's gradient (the "lambda") is scaled by how much NDCG would change if the two documents swapped positions. That weighting is why many teams achieve listwise-quality results with pairwise LambdaMART.
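As one concrete listwise loss, here is a minimal sketch of the ListNet top-one objective: both the labels and the model scores for a query are turned into probability distributions with a softmax, and the loss is the cross-entropy between them. The function names and example values are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over one query's documents."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(labels, scores):
    """Cross-entropy between top-one distributions of labels and scores."""
    target = softmax([float(label) for label in labels])
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


labels = [2, 1, 0]  # hypothetical graded relevance for one query
print(listnet_top1_loss(labels, [3.0, 1.0, -1.0]))  # order agrees: lower loss
print(listnet_top1_loss(labels, [-1.0, 1.0, 3.0]))  # order reversed: higher loss
```

The loss is computed once per query over the whole list, which is what distinguishes it from the per-document and per-pair losses above.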

Ranking metrics that match the product

Offline ranking evaluation must mirror what users experience. These metrics appear in every LTR paper and production dashboard:

NDCG (Normalized Discounted Cumulative Gain)

NDCG@k is the default metric for graded relevance. Each document has a gain (0, 1, 2, 3 for irrelevant through perfect). DCG@k sums gain / log2(rank + 1) over the top k positions, so lower positions are discounted; some implementations use 2^gain − 1 in the numerator to emphasize the highest grades. NDCG divides by the ideal DCG (the DCG of the best possible ordering) so scores range 0–1. NDCG@10 is standard for search; NDCG@3 matters when the UI shows only three results. NDCG is designed for graded labels; if you only have clicks, fall back to binary gain (click = 1, no click = 0).
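The definition translates directly into code. This sketch uses the linear gain described above and computes the ideal ordering from the same list of judged documents, which is an assumption; some evaluation setups build the ideal list from every judged document for the query.

```python
import math


def dcg_at_k(gains, k):
    """gains are in ranked order; position 1 is the top result."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG of the actual order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical result list: grades of the documents as they were ranked.
print(ndcg_at_k([3, 2, 0, 1], k=10))  # near-perfect order, a bit below 1.0
print(ndcg_at_k([0, 1, 2, 3], k=10))  # best document last, clearly lower
```

Returning 0.0 when a query has no relevant documents is a convention; some teams drop such queries from the average instead.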

MRR (Mean Reciprocal Rank)

MRR answers: how quickly does the first relevant result appear? Reciprocal rank is 1/position of the first hit. MRR averages across queries. Use MRR when users need one correct answer (FAQ lookup, entity search) rather than a diverse list.
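A minimal sketch of the computation, assuming each query is represented as a list of 0/1 relevance flags in ranked order (a hypothetical input format):

```python
def reciprocal_rank(relevance):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a non-empty list of queries."""
    return sum(reciprocal_rank(r) for r in queries) / len(queries)


# Three hypothetical queries: hit at rank 1, hit at rank 3, no hit at all.
queries = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]
print(mean_reciprocal_rank(queries))  # (1 + 1/3 + 0) / 3
```

Only the first hit counts; anything relevant below it does not change the score, which is why MRR fits one-answer tasks.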

MAP and precision at k

Mean Average Precision (MAP) suits binary relevance with multiple relevant documents per query. Precision@k measures the fraction of top-k results that are relevant — useful when screen space is fixed. Pair these with click-through rate (CTR) and conversion rate in online A/B tests; offline metrics gate deployment, but business metrics decide success.
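Both metrics can be sketched in a few lines. One assumption to note: this version of average precision divides by the number of relevant documents that appear in the ranked list, whereas a stricter definition divides by all relevant documents for the query, including any that were never retrieved.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (0/1 flags)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision@i taken at each position i that holds a hit."""
    hits, total = 0, 0.0
    for position, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    return sum(average_precision(r) for r in queries) / len(queries)


ranked = [1, 0, 1, 0]  # hypothetical relevance flags in ranked order
print(precision_at_k(ranked, 2))  # 1 of the top 2 is relevant
print(average_precision(ranked))  # (1/1 + 2/3) / 2
```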

Features that move rankers

LTR models consume query features, document features, and query-document interaction features. Interaction features usually dominate:

Text overlap — BM25 score, TF-IDF cosine, exact title match, token overlap count.
Semantic similarity — embedding dot product or cosine between query and document vectors (see cosine similarity explained).
Behavioral signals — historical CTR for this query-document pair, popularity, add-to-cart rate, return rate.
Freshness and inventory — days since publish, in-stock flag, margin, shipping speed.
Personalization — user category affinity, past purchases, session context (careful with data leakage).
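A feature vector for one query-document pair might be assembled along these lines. The document schema (`title`, `vector`, `ctr_30d`, `in_stock`) and the feature names are hypothetical; a BM25 score would normally be taken from the retrieval engine rather than recomputed here.

```python
import math


def token_overlap(query, title):
    """Number of distinct lowercase tokens shared by query and title."""
    return len(set(query.lower().split()) & set(title.lower().split()))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_features(query, query_vec, doc):
    """doc is a dict following the hypothetical schema described above."""
    return {
        "exact_title_match": float(query.lower() == doc["title"].lower()),
        "token_overlap": float(token_overlap(query, doc["title"])),
        "embedding_cosine": cosine(query_vec, doc["vector"]),
        "ctr_30d": doc["ctr_30d"],          # behavioral signal
        "in_stock": float(doc["in_stock"]),  # inventory signal
    }


doc = {"title": "High-temp flange gasket", "vector": [0.6, 0.8],
       "ctr_30d": 0.12, "in_stock": True}
print(build_features("flange gasket", [0.6, 0.8], doc))
```

Behavioral values such as `ctr_30d` should be read from a snapshot taken before the training label's timestamp, not from the current table.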

Tree-based rankers like LambdaMART handle heterogeneous features and nonlinear interactions natively. Neural rankers such as Deep Interest Network (DIN) and transformer cross-encoders excel when you have large click logs and GPU budget but add latency, so they are often reserved for a third re-ranking stage on the top 20–50 candidates.

Common algorithms in production

LambdaMART — gradient-boosted trees (XGBoost/LightGBM with LambdaRank objective). Industry workhorse for tabular LTR features. Fast inference, interpretable feature importance, handles missing values.
RankNet / LambdaRank — neural pairwise rankers; predecessors of LambdaMART. Still used when embedding features feed a small MLP ranker.
Linear models with SVMrank — lightweight baseline; good sanity check before boosting complexity.
Cross-encoder transformers — score query-document pairs jointly; highest accuracy, highest latency. Typical third-stage reranker.

Most production stacks combine them: BM25 or ANN retrieval, LambdaMART on 200 candidates, optional cross-encoder on top 30. See gradient boosting explained for the underlying tree ensemble mechanics.
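The control flow of such a stack can be sketched with plain functions. Everything here is a placeholder: `retrieve`, `ltr_scorer`, and `cross_scorer` stand in for whatever retrieval engine and models a team actually runs, and the defaults mirror the 200 and 30 candidate counts mentioned above.

```python
def rerank(query, docs, scorer, keep):
    """Sort docs by scorer(query, doc), highest first, and keep the top ones."""
    ranked = sorted(docs, key=lambda d: scorer(query, d), reverse=True)
    return ranked[:keep]


def search(query, retrieve, ltr_scorer, cross_scorer=None,
           n_candidates=200, n_rerank=30):
    """Retrieval, then LTR scoring, then an optional cross-encoder pass."""
    candidates = retrieve(query, n_candidates)           # stage 1: cheap, wide
    ranked = rerank(query, candidates, ltr_scorer,
                    keep=len(candidates))                # stage 2: LTR model
    if cross_scorer is None:
        return ranked                                    # fallback: skip stage 3
    head = rerank(query, ranked[:n_rerank], cross_scorer,
                  keep=n_rerank)                         # stage 3: expensive
    return head + ranked[n_rerank:]


# Toy stand-ins so the sketch runs end to end.
def toy_retrieve(query, n):
    return [f"d{i}" for i in range(5)][:n]


def toy_ltr(query, doc):
    return int(doc[1:])


def toy_cross(query, doc):
    return -int(doc[1:])


print(search("gasket", toy_retrieve, toy_ltr))
print(search("gasket", toy_retrieve, toy_ltr, toy_cross, n_rerank=2))
```

Making the third stage optional keeps a fallback path: if the cross-encoder times out or is disabled, the LTR ordering is still returned.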

Worked example: Harbor Supply catalog search

Harbor Supply, a fictional industrial parts distributor, serves 400k SKUs. Users search by part number, description, or application. The team ships search in three iterations:

Baseline: BM25 only

Elasticsearch BM25 retrieval sorted by score. NDCG@10 = 0.61. Exact part-number queries work; descriptive queries ("high-temp gasket for 2 inch flange") surface wrong categories because BM25 cannot learn that "flange" queries should boost fittings over unrelated gaskets.

Iteration two: pointwise click predictor

A LightGBM classifier is trained on (query, SKU) pairs with click labels. Features: BM25 score, embedding cosine, category match, 30-day SKU CTR, in-stock flag. Results are sorted by predicted click probability. NDCG@10 rises to 0.69, but part-number queries regress: the model underweights exact-match features because they are rare in the training distribution.

Iteration three: LambdaMART pairwise ranker

Same features, LambdaRank objective with pairwise sampling (1 positive click vs 5 random negatives per query). Query-level group boundaries enforced in training. NDCG@10 reaches 0.78 offline; A/B test shows +11% search-attributed revenue and −14% null-result rate. Exact-match queries recover because pairwise loss penalizes wrong ordering within each query group, not just wrong absolute scores.
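To make the LambdaRank objective in this iteration less abstract, the sketch below computes the per-document "lambdas" for one query group: a RankNet-style pairwise gradient scaled by the NDCG change from swapping the pair. It is a simplification for illustration, not the LightGBM or XGBoost implementation, and it assumes linear gain and no truncation at k.

```python
import math


def discount(rank):
    return 1.0 / math.log2(rank + 1)


def lambda_pushes(labels, scores):
    """Per-document updates for ONE query group.

    Positive values mean "raise this score", negative mean "lower it".
    """
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rank = {doc: pos for pos, doc in enumerate(order, start=1)}
    ideal = sum(g * discount(r)
                for r, g in enumerate(sorted(labels, reverse=True), start=1))
    pushes = [0.0] * n
    if ideal == 0:
        return pushes  # no relevant document, nothing to learn from
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where i should outrank j
            # Pairwise gradient magnitude: large when the pair is misordered.
            rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
            # NDCG change if i and j swapped places in the current ranking.
            delta = abs((labels[i] - labels[j])
                        * (discount(rank[i]) - discount(rank[j]))) / ideal
            pushes[i] += rho * delta
            pushes[j] -= rho * delta
    return pushes


# Hypothetical group where the irrelevant document currently ranks first.
print(lambda_pushes(labels=[0, 2, 1], scores=[3.0, 1.0, 2.0]))
```

Because the swap involving rank 1 changes NDCG the most, the largest pushes go to fixing the top of the list, which is the behavior the pointwise model lacked.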

Harbor keeps a cross-encoder reranker on the top 25 results for premium enterprise accounts only — latency budget 120 ms p95 vs 35 ms for LambdaMART alone.

Approach decision table

Situation	Recommended approach	Primary metric
Small catalog, sparse clicks	BM25 + manual boosts	MRR, Precision@5
Medium catalog, click logs available	Pointwise LightGBM or LambdaMART	NDCG@10
Large catalog, rich features, latency-sensitive	Two-stage: ANN retrieval + LambdaMART	NDCG@10, p95 latency
One correct answer (support KB, entity lookup)	Pairwise ranker + MRR optimization	MRR
Feed recommendations, diverse items	Listwise or pairwise + diversity re-ranking	NDCG@k, coverage
Maximum accuracy, latency budget > 100 ms	Three-stage + cross-encoder rerank	NDCG@3, conversion rate
Cold-start queries (no click history)	Content features + semantic similarity	NDCG on held-out query set

Common pitfalls

Position bias in click logs — users click what they see first, not what is most relevant. Use inverse propensity scoring (IPS) or randomize a small fraction of results to debias training data.
Leaking future clicks into features — aggregating CTR over all time includes clicks that happened after the training label cutoff. Use point-in-time feature snapshots.
Evaluating on training queries only — head queries dominate click volume; hold out tail and cold-start queries separately.
Optimizing NDCG offline while ignoring latency — a cross-encoder that gains 0.02 NDCG but adds 200 ms can lose more users to slow results than the relevance gain wins back. Track p95 latency alongside NDCG.
Duplicate or near-duplicate results — NDCG rewards relevance but not diversity; add MMR or category caps post-ranking.
Train-serve feature skew — offline BM25 computed in batch differs from online Elasticsearch scoring. Log features at serve time and compare distributions weekly.
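For the first pitfall, inverse propensity scoring can be sketched as follows. The propensity values are made up; in practice they are estimated, for example from the small randomized slice of traffic mentioned above. The function name and log format are assumptions.

```python
def ips_weighted_ctr(impressions, propensity):
    """Position-debiased click rate for one (query, document) pair.

    impressions: list of (position, clicked) tuples from the click log.
    propensity: dict mapping position -> estimated probability that a
        user examines that position (hypothetical values here).
    """
    if not impressions:
        return 0.0
    weighted = sum(clicked / propensity[pos] for pos, clicked in impressions)
    return weighted / len(impressions)


propensity = {1: 1.0, 4: 0.4}  # made-up examination probabilities

doc_a = [(1, 1), (1, 1), (1, 0), (1, 0)]  # shown at rank 1, raw CTR 0.50
doc_b = [(4, 1), (4, 0), (4, 0), (4, 0)]  # shown at rank 4, raw CTR 0.25

print(ips_weighted_ctr(doc_a, propensity))
print(ips_weighted_ctr(doc_b, propensity))  # ranks above doc_a once debiased
```

Very small propensities inflate the weights, so estimates for deep positions are noisy; capping the weight at a maximum value is a common safeguard.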

Practitioner checklist

Define the ranking task: query group boundaries, label source (clicks, grades, purchases).
Split train/validation by query ID, not by row — never put the same query in both sets.
Establish a BM25 or popularity baseline; beat it on NDCG@10 before adding complexity.
Engineer query-document interaction features; verify point-in-time correctness.
Debias click logs or supplement with human relevance judgments on a sample.
Train LambdaMART (or equivalent) with query-group-aware objective.
Report NDCG@k, MRR, and latency p50/p95 on held-out queries.
Run an online A/B test measuring CTR, conversion, and null-result rate.
Monitor feature drift and ranking stability after deploy.
Document the three-stage pipeline: retrieval, LTR, optional rerank — with fallbacks.
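The query-level split from the checklist can be done by hashing the query ID, so that every row of a query lands on the same side. This is one possible approach, assuming rows are dicts with a `query_id` key (a hypothetical schema).

```python
import hashlib


def assign_split(query_id, validation_fraction=0.2):
    """Deterministically assign a whole query group to one split."""
    digest = hashlib.sha256(str(query_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0
    return "validation" if bucket < validation_fraction else "train"


def split_rows(rows, validation_fraction=0.2):
    """Route each row by its query_id, never by row index."""
    train, validation = [], []
    for row in rows:
        if assign_split(row["query_id"], validation_fraction) == "validation":
            validation.append(row)
        else:
            train.append(row)
    return train, validation


# Hypothetical log: 50 queries with 3 candidate SKUs each.
rows = [{"query_id": f"q{i}", "sku": j} for i in range(50) for j in range(3)]
train, validation = split_rows(rows)
shared = {r["query_id"] for r in train} & {r["query_id"] for r in validation}
print(len(train), len(validation), len(shared))  # shared is always 0
```

Hashing keeps the assignment stable across retraining runs, so a query never drifts from validation into training as new logs arrive.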

Key takeaways

Ranking optimizes order within query groups, not independent classification accuracy.
Pointwise is easy; pairwise and listwise align better with NDCG and user experience.
LambdaMART on rich interaction features is the production default for tabular LTR.
NDCG@10 for search quality; MRR when one correct answer matters most.
Position bias and temporal leakage are the top silent killers in click-trained rankers.