Recommendation systems explained

A recommendation system predicts which items a user is most likely to engage with next — a product to buy, a video to watch, a song to play, or an article to read. The engine behind Netflix rows, Amazon “customers also bought,” and TikTok feeds is not one algorithm but a pipeline: collect interaction signals, build user and item representations, score millions of candidates, rank the top few, and continuously measure whether clicks and dwell time improve. This guide walks through collaborative and content-based filtering, matrix factorization, modern two-tower deep models, the cold-start problem, ranking metrics like precision@K and NDCG, exploration vs exploitation, and the production checklist teams use before shipping a recommender.

The problem: sparse user-item interactions

At the core is a user-item interaction matrix: rows are users, columns are items, and each cell holds a signal such as an explicit rating (1–5 stars), an implicit event (click, purchase, watch time), or both. Real matrices are enormous and sparse: a shopper who browsed 200 products from a catalog of two million leaves 99.99% of their row empty. The job is to fill in plausible scores for unseen pairs and return the highest-ranked items per user.
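
To make the sparsity arithmetic concrete, here is a minimal Python sketch. It assumes a hypothetical event log of (user, item, value) tuples and stores only the observed cells in nested dictionaries; the names `build_sparse_matrix` and `sparsity` are illustrative and do not come from any library.

```python
from collections import defaultdict

# Hypothetical interaction log: (user_id, item_id, signal value).
events = [
    ("u1", "i1", 5.0),  # explicit 5-star rating
    ("u1", "i3", 1.0),  # implicit click
    ("u2", "i2", 1.0),  # implicit purchase
]

def build_sparse_matrix(events):
    """Store only observed cells as {user: {item: value}}."""
    matrix = defaultdict(dict)
    for user, item, value in events:
        matrix[user][item] = value
    return matrix

def sparsity(n_observed, n_users, n_items):
    """Fraction of cells in the full user-item grid that are empty."""
    return 1.0 - n_observed / (n_users * n_items)

matrix = build_sparse_matrix(events)

# The shopper from the text: 200 browsed products, two million items.
shopper_sparsity = sparsity(200, 1, 2_000_000)
print(f"{shopper_sparsity:.2%} of the shopper's row is empty")
```

Storing only the observed cells keeps memory proportional to the number of interactions rather than to users times items.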

Signals fall into two families:

  • Explicit feedback — deliberate ratings or likes. High signal quality but sparse; most users never rate anything.
  • Implicit feedback — clicks, add-to-cart, play counts, scroll depth. Abundant but noisy: a click does not always mean satisfaction, and lack of a click does not mean dislike.

Production systems almost always blend both. Before modeling, engineers define a positive event (purchase, 80% video completion) and often down-weight or ignore negatives to avoid treating “never seen” as “disliked.” That label design step is where many recommenders fail before any algorithm runs — the same discipline applies in feature engineering for tabular ML.
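
The label-design step can be sketched as a small function. This is only an illustration: the event fields (`type`, `completion`) and the 0.8 completion threshold are assumptions standing in for whatever a product team agrees on.

```python
# Hypothetical raw events; field names and thresholds are assumptions.
raw_events = [
    {"user": "u1", "item": "v1", "type": "watch", "completion": 0.92},
    {"user": "u1", "item": "v2", "type": "watch", "completion": 0.10},
    {"user": "u2", "item": "p7", "type": "purchase"},
    {"user": "u2", "item": "p9", "type": "impression"},
]

def is_positive(event, min_completion=0.8):
    """Return True when an event counts as a positive label."""
    if event["type"] == "purchase":
        return True
    if event["type"] == "watch":
        return event.get("completion", 0.0) >= min_completion
    # Impressions and other event types are not positives on their own.
    return False

positives = {(e["user"], e["item"]) for e in raw_events if is_positive(e)}
print(sorted(positives))
```

Note that pairs absent from the log receive no label at all here, which keeps "never seen" distinct from "disliked".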

Collaborative filtering: wisdom of the crowd

Collaborative filtering (CF) assumes users who agreed in the past will agree in the future. It ignores item attributes entirely and learns from co-occurrence patterns in the interaction matrix.

User-based and item-based CF

User-based CF finds neighbors with similar rating histories and recommends items those neighbors liked. Item-based CF flips the perspective: “users who bought X also bought Y,” computed via cosine similarity between item columns. Item-based CF scales better for large catalogs because item vectors change slowly and can be precomputed.
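
A minimal sketch of item-based CF on binary purchase data, assuming a hypothetical `purchases` mapping from user to the set of items bought. For binary columns, cosine similarity reduces to the number of shared buyers divided by the square root of the product of the two buyer counts.

```python
import math
from collections import defaultdict

# Hypothetical implicit interactions: user -> set of items bought.
purchases = {
    "alice": {"X", "Y"},
    "bob": {"X", "Y", "Z"},
    "carol": {"X"},
    "dave": {"Z"},
}

def item_columns(purchases):
    """Invert user -> items into item -> set of users (binary columns)."""
    columns = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            columns[item].add(user)
    return columns

def cosine(users_a, users_b):
    """Cosine similarity between two binary item columns."""
    if not users_a or not users_b:
        return 0.0
    shared = len(users_a & users_b)
    return shared / math.sqrt(len(users_a) * len(users_b))

def similar_items(target, columns, k=2):
    """Return the k items most similar to target, best first."""
    scores = [
        (other, cosine(columns[target], users))
        for other, users in columns.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

columns = item_columns(purchases)
neighbors = similar_items("X", columns)
print(neighbors)  # Y ranks above Z: it shares two buyers with X, Z shares one
```

Because the similarity table depends only on item columns, it can be computed offline and looked up at request time.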

Matrix factorization

Latent-factor models decompose the sparse matrix into low-dimensional user embeddings and item embeddings. If user u has vector p_u and item i has vector q_i, the predicted score is the dot product p_u · q_i (plus biases). Training minimizes squared error on observed entries; classic algorithms include alternating least squares (ALS) and stochastic gradient descent on sampled (user, item, rating) triples.
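
The SGD variant can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the ratings are invented, bias terms are omitted for brevity, an L2 penalty (`reg`) is added as is common practice, and the hyperparameters are arbitrary rather than tuned.

```python
import math
import random

def predict(p, q, user, item):
    """Predicted score: dot product of user and item vectors."""
    return sum(a * b for a, b in zip(p[user], q[item]))

def train_mf(ratings, n_factors=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Fit user and item embeddings with SGD on observed triples only."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)  # visit observed triples in random order
        for u, i, r in order:
            err = r - predict(p, q, u, i)
            for f in range(n_factors):
                pu_f, qi_f = p[u][f], q[i][f]
                # Gradient step on squared error plus L2 penalty.
                p[u][f] += lr * (err * qi_f - reg * pu_f)
                q[i][f] += lr * (err * pu_f - reg * qi_f)
    return p, q

def rmse(ratings, p, q):
    """Root mean squared error on the given triples."""
    sq = [(r - predict(p, q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical explicit ratings: (user, item, stars).
ratings = [
    ("u1", "a", 5.0), ("u1", "b", 3.0),
    ("u2", "a", 4.0), ("u2", "c", 1.0),
    ("u3", "b", 2.0), ("u3", "c", 5.0),
]
p, q = train_mf(ratings)
train_rmse = rmse(ratings, p, q)
print(f"training RMSE: {train_rmse:.3f}")
print(f"predicted score for unseen pair (u1, c): {predict(p, q, 'u1', 'c'):.2f}")
```

Only observed cells contribute to the loss, so the unseen pair (u1, c) is scored purely from the learned vectors.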

Matrix factorization captures hidden taste dimensions (“prefers atmospheric indie films”) without hand-labeling them. It struggles when interactions are extremely sparse or when item catalogs shift rapidly — which is why large platforms moved to neural extensions while keeping the embedding intuition from deep learning fundamentals.

Content-based filtering: item features and user profiles

Content-based recommenders describe each item with features (genre tags, price band, text embeddings of descriptions, image CNN features) and build a user profile as a weighted average of the items the user engaged with. Recommendations score unseen items by similarity to that profile, using cosine similarity, a learned linear model, or a comparable scorer.
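
A minimal sketch of the profile-and-similarity idea, assuming hypothetical three-dimensional item features (sci-fi, comedy, documentary) and hypothetical engagement weights. Real systems would use far richer feature vectors.

```python
import math

# Hypothetical item features: [sci-fi, comedy, documentary].
item_features = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.8, 0.2, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [0.9, 0.0, 0.1],
}
# Hypothetical engagement weights (for example, number of plays).
engagements = {"A": 2.0, "B": 1.0}

def user_profile(engagements, item_features):
    """Weighted average of the feature vectors of engaged items."""
    total = sum(engagements.values())
    dim = len(next(iter(item_features.values())))
    profile = [0.0] * dim
    for item, weight in engagements.items():
        for d, value in enumerate(item_features[item]):
            profile[d] += weight * value / total
    return profile

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(engagements, item_features, k=2):
    """Rank items the user has not engaged with by profile similarity."""
    profile = user_profile(engagements, item_features)
    unseen = [item for item in item_features if item not in engagements]
    return sorted(
        unseen,
        key=lambda item: cosine(profile, item_features[item]),
        reverse=True,
    )[:k]

recs = recommend(engagements, item_features)
print(recs)  # the sci-fi-heavy item D ranks above the comedy C
```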

Content-based methods shine when:

  • The catalog is new or niche and interaction data is thin (cold start).
  • Explainability matters (“because you read articles about Kubernetes”).
  • You must avoid recommending items a user has already consumed.

Weakness: filter bubbles. If a reader only sees more of what they already like, discovery stalls. Hybrid systems inject diversity, trending items, or exploration slots to counter this — a product decision as much as a modeling one.

Hybrid and deep learning recommenders

Modern production stacks rarely pick one family. A typical architecture:

  1. Candidate generation — cheaply retrieve hundreds or thousands of plausible items (collaborative embeddings, content similarity, co-visit graphs, geographic proximity).
  2. Ranking — a heavier model scores and orders the shortlist using rich features: user demographics, session context, item freshness, cross-features (“user segment × category”).
  3. Re-ranking — business rules, diversity constraints, deduplication, and exploration slots applied last.
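
The three stages above can be sketched as plain functions. Everything here is a toy stand-in: the co-visit table, the score dictionary that plays the role of the heavier ranking model, and the per-category cap used as a diversity rule are all assumptions for illustration.

```python
def generate_candidates(user_history, co_visits, limit=100):
    """Stage 1: cheap retrieval from a precomputed co-visit table."""
    seen = set(user_history)
    candidates = []
    for item in user_history:
        for neighbor in co_visits.get(item, []):
            if neighbor not in seen:
                seen.add(neighbor)
                candidates.append(neighbor)
    return candidates[:limit]

def rank(candidates, score_fn):
    """Stage 2: order the shortlist with a (heavier) scoring function."""
    return sorted(candidates, key=score_fn, reverse=True)

def rerank(ranked, category_of, max_per_category=1, k=3):
    """Stage 3: apply a diversity rule that caps items per category."""
    counts = {}
    result = []
    for item in ranked:
        category = category_of[item]
        if counts.get(category, 0) < max_per_category:
            counts[category] = counts.get(category, 0) + 1
            result.append(item)
        if len(result) == k:
            break
    return result

# Hypothetical data for one request.
co_visits = {"shoes": ["socks", "laces", "boots"], "hat": ["scarf", "socks"]}
model_scores = {"socks": 0.9, "boots": 0.8, "laces": 0.7, "scarf": 0.4}
category_of = {
    "socks": "accessories", "laces": "accessories",
    "boots": "footwear", "scarf": "outerwear",
}

candidates = generate_candidates(["shoes", "hat"], co_visits)
ranked = rank(candidates, model_scores.get)
final = rerank(ranked, category_of)
print(final)  # laces is dropped by the per-category cap
```

Keeping the stages as separate functions also makes it straightforward to time each one against its own latency budget.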

Two-tower models

The two-tower architecture is the workhorse of large-scale retrieval. One neural network encodes the user (history, profile, context); another encodes the item (attributes, text, image). Training pulls positive pairs together and pushes negatives apart (contrastive or sampled-softmax loss). At serving time, item vectors are precomputed and stored in a vector database for approximate nearest-neighbor lookup — the same retrieval primitive used in hybrid search and RAG pipelines.
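
The training objective and the serving-time lookup can be illustrated without a deep learning framework. In this sketch the tower outputs are given directly as small hand-written vectors (an assumption; real towers are neural networks), the loss is an in-batch softmax where other items in the batch act as negatives, and `top_k` is an exact brute-force search standing in for an approximate nearest-neighbor index.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs):
    """Row j of user_vecs is the positive partner of row j of item_vecs;
    every other item in the batch serves as a negative for that user."""
    total = 0.0
    for j, user in enumerate(user_vecs):
        logits = [dot(user, item) for item in item_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[j]
    return total / len(user_vecs)

def top_k(user_vec, item_index, k=2):
    """Exact dot-product search over precomputed item vectors."""
    return sorted(
        item_index,
        key=lambda item: dot(user_vec, item_index[item]),
        reverse=True,
    )[:k]

# Stand-in tower outputs for a batch of two (user, item) positive pairs.
users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[3.0, 0.0], [0.0, 3.0]]     # positives point the same way
misaligned_items = [[0.0, 3.0], [3.0, 0.0]]  # positives point the wrong way

aligned_loss = in_batch_softmax_loss(users, aligned_items)
misaligned_loss = in_batch_softmax_loss(users, misaligned_items)
print(f"aligned: {aligned_loss:.3f}, misaligned: {misaligned_loss:.3f}")

# Serving: item vectors are precomputed; only the user vector is fresh.
item_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
retrieved = top_k([1.0, 0.2], item_index)
print(retrieved)
```

The loss is lower when each user vector scores its own positive item above the other items in the batch, which is the behavior training pushes toward.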

Sequential and session models

For feeds where order matters (e-commerce sessions, short-video apps), recurrent or transformer encoders model the sequence of recent events. The next-item prediction objective resembles language modeling: given items i_1, i_2, …, i_t, predict i_{t+1}. Session-aware models outperform static user embeddings when tastes shift within a single visit.
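
The next-item objective implies a particular way of slicing sessions into training examples. The sketch below shows only that data preparation step, not the encoder itself; the function name and the `max_context` window are assumptions for illustration.

```python
def next_item_examples(session, max_context=3):
    """Turn one ordered session into (context, target) training pairs.

    Each pair asks the model to predict the item at step t from the
    (at most max_context) items that came immediately before it.
    """
    examples = []
    for t in range(1, len(session)):
        context = session[max(0, t - max_context):t]
        examples.append((context, session[t]))
    return examples

# Hypothetical session of item ids in click order.
session = ["a", "b", "c", "d", "e"]
examples = next_item_examples(session)
for context, target in examples:
    print(context, "->", target)
```

A recurrent or transformer encoder would consume each context and be trained to score the matching target above other items.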

The cold-start problem

Cold start hits when a new user, new item, or new platform lacks interaction history:

  • New user — fall back to popularity baselines, onboarding preference picks (“pick three genres”), or demographic defaults until enough events accumulate.
  • New item — lean on content features and editorial placement; boost fresh items temporarily (exploration) so they can earn impressions and gather data.
  • New platform — import transfer signals (social graph, search logs) or seed with curated lists until the interaction matrix densifies.
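
The new-user fallback can be sketched as a simple switch on history length. The `min_events` threshold, the `fake_personalized` stub, and the event data are all hypothetical; the sketch covers only the new-user case, not new items or new platforms.

```python
from collections import Counter

def popularity_ranking(events):
    """Rank items by interaction count across all users."""
    counts = Counter(item for _, item in events)
    return [item for item, _ in counts.most_common()]

def recommend(user, histories, personalized_fn, popular, min_events=3, k=2):
    """Use the personalized model only when enough history exists."""
    history = histories.get(user, [])
    if len(history) >= min_events:
        ranked = personalized_fn(user)
    else:
        ranked = popular  # cold-start fallback
    seen = set(history)
    return [item for item in ranked if item not in seen][:k]

# Hypothetical (user, item) interaction log and derived histories.
events = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"),
]
histories = {"u1": ["a", "b", "c"], "u2": ["a", "b"], "u3": ["a"]}
popular = popularity_ranking(events)

def fake_personalized(user):
    """Stand-in for a trained model's ranked output."""
    return ["d", "c", "a"]

warm = recommend("u1", histories, fake_personalized, popular)
cold = recommend("new_user", histories, fake_personalized, popular)
print(warm, cold)
```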

A common mistake is training only on power users. Their behavior dominates gradients; casual visitors get poor recommendations and churn before you collect data. Stratified sampling and propensity scoring help correct popularity bias — techniques aligned with validation discipline in classical ML.

Evaluation: ranking metrics that matter

Recommenders are ranking problems, not single-label classifiers. Accuracy on a held-out click is misleading if the model only recommends blockbusters everyone would click anyway. Standard offline metrics:

  • Precision@K — fraction of the top-K recommendations that were relevant in the test set.
  • Recall@K — fraction of all relevant items captured in the top-K list.
  • NDCG@K (normalized discounted cumulative gain) — rewards relevant items appearing higher in the ranked list; the same family of metrics used in information retrieval evaluation.
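  • MAP (mean average precision) — for each user, averages the precision measured at every rank where a relevant item appears, then takes the mean over users; common when multiple relevant items exist per user.
  • Coverage and diversity — coverage is the fraction of the catalog that ever gets recommended, and low coverage means a few hits dominate the feed; diversity is often measured through intra-list similarity, where lower similarity means a more varied list.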
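
The metrics above can be written out for a single user with binary relevance. This is a minimal sketch with hand-picked data; MAP would be the mean of `average_precision` over all users, and the NDCG variant here uses the common log2 discount.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def average_precision(ranked, relevant):
    """Mean of precision values at each rank holding a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def catalog_coverage(rec_lists, catalog_size):
    """Share of the catalog that appears in at least one list."""
    return len({item for recs in rec_lists for item in recs}) / catalog_size

# Hypothetical ranked list and held-out relevant items for one user.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
ap = average_precision(ranked, relevant)
cov = catalog_coverage([["a", "b"], ["a", "c"]], catalog_size=10)
print(p3, r3, n3, ap, cov)
```

Item "f" is relevant but never recommended, so it lowers recall, NDCG, and average precision without affecting precision@3.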

Offline metrics guide iteration, but online A/B tests decide launch. Watch click-through rate, conversion, session length, and long-term retention — a model that maximizes clicks can degrade satisfaction if it recommends clickbait. Holdout buckets and interleaving experiments compare rankers with live traffic.

Exploration, exploitation, and feedback loops

A recommender that only shows proven winners never learns about new inventory — the explore/exploit trade-off. Techniques include:

  • Epsilon-greedy — reserve a small fraction of slots for random or diverse items.
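  • Thompson sampling / contextual bandits — maintain uncertainty estimates for each item's reward. Thompson sampling draws a reward from each item's posterior and shows the item with the best draw; upper-confidence-bound (UCB) variants instead rank by expected reward plus an exploration bonus.
  • Position bias correction — items shown at the top get more clicks regardless of quality; training should either reweight clicks by the propensity of their position or model position as a feature.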
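
Two of the techniques above fit in a short sketch. The epsilon-greedy function fills each slot from an exploration pool with probability `epsilon`, and the Thompson sampling function assumes click/no-click rewards with a uniform Beta(1, 1) prior per item. Both the function names and the click statistics are hypothetical.

```python
import random

def epsilon_greedy_slate(ranked, explore_pool, k, epsilon, rng):
    """Fill k slots: explore with probability epsilon, else exploit."""
    exploit = list(ranked)
    ranked_set = set(ranked)
    pool = [item for item in explore_pool if item not in ranked_set]
    slate = []
    while len(slate) < k and (exploit or pool):
        if pool and (not exploit or rng.random() < epsilon):
            # Exploration slot: a random item from the pool.
            slate.append(pool.pop(rng.randrange(len(pool))))
        else:
            # Exploitation slot: the next-best ranked item.
            slate.append(exploit.pop(0))
    return slate

def thompson_pick(stats, rng):
    """Beta-Bernoulli Thompson sampling over items.

    stats maps item -> (clicks, impressions). Each item gets one draw
    from Beta(1 + clicks, 1 + non-clicks); the best draw wins.
    """
    best_item, best_draw = None, -1.0
    for item, (clicks, impressions) in stats.items():
        draw = rng.betavariate(1 + clicks, 1 + impressions - clicks)
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item

rng = random.Random(0)
pure_exploit = epsilon_greedy_slate(["a", "b", "c"], ["x", "y"], 2, 0.0, rng)
pure_explore = epsilon_greedy_slate(["a", "b", "c"], ["x", "y"], 2, 1.0, rng)

# A well-measured strong item against a well-measured weak one.
confident = {"proven": (900, 1000), "weak": (10, 1000)}
confident_picks = [thompson_pick(confident, rng) for _ in range(50)]

# A brand-new item has a wide posterior, so it still wins some draws.
uncertain = {"ok": (50, 100), "new": (0, 0)}
uncertain_picks = [thompson_pick(uncertain, rng) for _ in range(200)]
print(pure_exploit, sorted(pure_explore), uncertain_picks.count("new"))
```

The contrast between the two stat tables shows why Thompson sampling explores: an item with no impressions keeps getting occasional slots until its posterior narrows.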

Feedback loops amplify bias: if the model never shows niche content, niche creators leave; the training data becomes even more homogeneous. Periodic audits for demographic parity, catalog coverage, and stale embeddings are part of responsible MLOps, not an afterthought.

Decision table: which approach when?

Scenario | Starting approach | Upgrade path
Small catalog, sparse data | Popularity + content-based | Item-based CF when co-visits accumulate
Medium catalog, rich implicit feedback | Matrix factorization (ALS) | Two-tower retrieval + gradient-boosted ranker
Session-heavy feed (video, e-commerce) | Sequential transformer encoder | Multi-stage retrieve-then-rank with real-time features
Search + recommend blend | Content embeddings + BM25 hybrid | Unified embedding index with query-aware re-ranking
Strict explainability requirement | Content-based with explicit feature weights | Hybrid with post-hoc explanations on shortlists

Anti-patterns to avoid

  • Random train/test splits on temporal data — future clicks leak into training; use time-based cutoffs.
  • Treating missing interactions as negatives — most users never saw most items; implicit ALS and sampled softmax handle this differently.
  • Optimizing only offline AUC while ignoring catalog coverage, latency, and business guardrails.
  • Stale item embeddings after catalog updates — schedule nightly or streaming refresh; stale vectors recommend discontinued SKUs.
  • Ignoring latency budgets — a perfect ranker that takes 800 ms loses to a good-enough one at 40 ms; precompute item sides of two-tower models.
  • Recommending already-consumed items without explicit “watch again” intent — filter history unless repetition is the goal (music playlists vs news articles).
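
The first anti-pattern in the list above has a simple remedy that can be sketched directly. The event records and cutoff dates below are hypothetical; the point is that every training event precedes every validation event, which in turn precedes every test event.

```python
from datetime import datetime

def time_based_split(events, train_end, valid_end):
    """Split events by timestamp so no future data reaches training."""
    train = [e for e in events if e["ts"] < train_end]
    valid = [e for e in events if train_end <= e["ts"] < valid_end]
    test = [e for e in events if e["ts"] >= valid_end]
    return train, valid, test

# Hypothetical interaction log with timestamps.
events = [
    {"user": "u1", "item": "a", "ts": datetime(2024, 1, 5)},
    {"user": "u2", "item": "b", "ts": datetime(2024, 2, 11)},
    {"user": "u1", "item": "c", "ts": datetime(2024, 3, 2)},
    {"user": "u3", "item": "a", "ts": datetime(2024, 3, 20)},
    {"user": "u2", "item": "d", "ts": datetime(2024, 4, 8)},
]

train_set, valid_set, test_set = time_based_split(
    events,
    train_end=datetime(2024, 3, 1),
    valid_end=datetime(2024, 4, 1),
)
print(len(train_set), len(valid_set), len(test_set))
```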

Production checklist

  • Positive and negative labels defined with product input; implicit vs explicit signals documented.
  • Time-based train/validation/test splits; no future leakage.
  • Baseline: popularity and/or content-based before claiming model lift.
  • Offline metrics: NDCG@K and coverage reported alongside precision@K.
  • Candidate generation separated from ranking; latency measured per stage.
  • Cold-start fallbacks tested for new users and new items.
  • Exploration slots or bandit layer for fresh inventory.
  • Position bias and popularity bias audited in training data.
  • Online A/B framework with guardrail metrics (retention, complaints).
  • Embedding and feature pipelines monitored for drift in production.

Key takeaways

  • Recommendation systems rank items from sparse user-item interactions — explicit ratings and implicit events.
  • Collaborative filtering learns from co-occurrence; matrix factorization compresses taste into embeddings.
  • Content-based models use item features — essential for cold start and explainability.
  • Production stacks use retrieve-then-rank pipelines; two-tower models power large-scale retrieval.
  • Measure with NDCG@K and online experiments; manage exploration, bias, and feedback loops deliberately.
