Recommendation systems explained
A recommendation system predicts which items a user is most likely to engage with next — a product to buy, a video to watch, a song to play, or an article to read. The engine behind Netflix rows, Amazon “customers also bought,” and TikTok feeds is not one algorithm but a pipeline: collect interaction signals, build user and item representations, score millions of candidates, rank the top few, and continuously measure whether clicks and dwell time improve. This guide walks through collaborative and content-based filtering, matrix factorization, modern two-tower deep models, the cold-start problem, ranking metrics like precision@K and NDCG, exploration vs exploitation, and the production checklist teams use before shipping a recommender.
The problem: sparse user-item interactions
At the core is a user-item interaction matrix: rows are users, columns are items, and cells hold signals such as explicit ratings (1–5 stars), implicit events (clicks, purchases, watch time), or both. Real matrices are enormous and sparse: a shopper who browsed 200 products out of a catalog of two million leaves 99.99% of their row empty. The job is to fill in plausible scores for unseen pairs and return the highest-ranked items per user.
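The sparsity figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `sparsity` and the toy `interactions` dictionary are assumptions, and the dictionary simply shows one way to keep observed cells without allocating a dense matrix.

```python
def sparsity(num_observed: int, num_users: int, num_items: int) -> float:
    """Fraction of user-item cells with no recorded interaction."""
    total_cells = num_users * num_items
    return 1.0 - num_observed / total_cells

# One shopper who browsed 200 of 2,000,000 products.
single_user = sparsity(num_observed=200, num_users=1, num_items=2_000_000)
print(f"{single_user:.2%}")  # 99.99%

# Sparse storage: only observed (user, item) cells are kept (hypothetical toy data).
interactions = {("u1", "i7"): 1.0, ("u1", "i9"): 1.0, ("u2", "i7"): 1.0}
print(sparsity(len(interactions), num_users=2, num_items=10))
```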
Signals fall into two families:
- Explicit feedback — deliberate ratings or likes. High signal quality but sparse; most users never rate anything.
- Implicit feedback — clicks, add-to-cart, play counts, scroll depth. Abundant but noisy: a click does not always mean satisfaction, and lack of a click does not mean dislike.
Production systems almost always blend both. Before modeling, engineers define a positive event (purchase, 80% video completion) and often down-weight or ignore negatives to avoid treating “never seen” as “disliked.” That label design step is where many recommenders fail before any algorithm runs — the same discipline applies in feature engineering for tabular ML.
Collaborative filtering: wisdom of the crowd
Collaborative filtering (CF) assumes users who agreed in the past will agree in the future. It ignores item attributes entirely and learns from co-occurrence patterns in the interaction matrix.
User-based and item-based CF
User-based CF finds neighbors with similar rating histories and recommends items those neighbors liked. Item-based CF flips the perspective: “users who bought X also bought Y,” computed via cosine similarity between item columns. Item-based CF usually scales better when users far outnumber items, because item-to-item similarities change slowly and can be precomputed offline.
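As a minimal sketch of the item-based variant, the code below computes cosine similarity between binary item columns and returns the nearest items for a given product. The `purchases` data and the function names are hypothetical, and the all-pairs loop is only practical for toy catalogs.

```python
import math
from collections import defaultdict

# Hypothetical toy data: user -> set of purchased items (implicit feedback).
purchases = {
    "alice": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove"},
    "carol": {"tent", "novel"},
    "dave": {"novel", "bookmark"},
}

def item_similarities(purchases):
    """Cosine similarity between binary item columns of the interaction matrix."""
    item_users = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            item_users[item].add(user)
    sims = {}
    for a in item_users:
        for b in item_users:
            if a == b:
                continue
            overlap = len(item_users[a] & item_users[b])
            if overlap:
                norm = math.sqrt(len(item_users[a]) * len(item_users[b]))
                sims[(a, b)] = overlap / norm
    return sims

def also_bought(item, sims, k=2):
    """Top-k items most similar to `item`, ties broken alphabetically."""
    scored = [(other, s) for (a, other), s in sims.items() if a == item]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]

sims = item_similarities(purchases)
print(also_bought("tent", sims))
```

Because the similarity table depends only on item columns, it can be rebuilt on a schedule and served as a lookup.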
Matrix factorization
Latent-factor models decompose the sparse matrix into low-dimensional user embeddings and item embeddings. If user u has vector p_u and item i has vector q_i, the predicted score is the dot product p_u · q_i (plus biases). Training minimizes squared error on observed entries, usually with L2 regularization; classic algorithms include alternating least squares (ALS) and stochastic gradient descent on sampled (user, item, rating) triples.
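The stochastic gradient descent variant can be sketched in plain Python. This is a simplified illustration, not a production trainer: bias terms are omitted, and the hyperparameters (`dim`, `lr`, `reg`, `epochs`) and the toy `ratings` are assumptions chosen for a tiny example.

```python
import random

def predict(P, Q, u, i):
    """Predicted score: dot product of user and item embeddings."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    squared = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

def train_mf(ratings, num_users, num_items, dim=2, lr=0.05, reg=0.02,
             epochs=500, seed=0):
    """SGD on observed (user, item, rating) triples with L2 regularization."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_items)]
    triples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(triples)
        for u, i, r in triples:
            err = r - predict(P, Q, u, i)
            for f in range(dim):
                p_uf, q_if = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
    return P, Q

# Hypothetical toy ratings: (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = train_mf(ratings, num_users=3, num_items=3)
print("train RMSE:", round(rmse(P, Q, ratings), 3))
print("score for unseen pair (user 0, item 2):", round(predict(P, Q, 0, 2), 2))
```

Only observed cells contribute to the loss; the unseen pair at the end is scored from the learned embeddings alone.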
Matrix factorization captures hidden taste dimensions (“prefers atmospheric indie films”) without hand-labeling them. It struggles when interactions are extremely sparse or when item catalogs shift rapidly — which is why large platforms moved to neural extensions while keeping the embedding intuition from deep learning fundamentals.
Content-based filtering: item features and user profiles
Content-based recommenders describe each item with features (genre tags, price band, text embeddings of descriptions, image CNN features) and build a user profile as a weighted average of the items the user engaged with. Recommendations score unseen items by similarity to that profile (cosine similarity, a learned linear model, etc.).
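A minimal sketch of that profile-and-similarity loop follows. The three-dimensional feature vectors, item names, and engagement weights are made up for illustration; real systems would use far richer features.

```python
import math

# Hypothetical item features: [sci-fi, drama, documentary].
item_features = {
    "space_saga": [1.0, 0.2, 0.0],
    "alien_docs": [0.6, 0.0, 0.9],
    "court_drama": [0.0, 1.0, 0.1],
    "nebula_wars": [0.9, 0.3, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profile(engaged):
    """Weighted average of the feature vectors of engaged items.

    `engaged` maps item -> weight (for example, watch time)."""
    dim = len(next(iter(item_features.values())))
    total = sum(engaged.values())
    profile = [0.0] * dim
    for item, weight in engaged.items():
        for f, value in enumerate(item_features[item]):
            profile[f] += weight * value / total
    return profile

def recommend(engaged, k=2):
    """Rank unseen items by cosine similarity to the user profile."""
    profile = build_profile(engaged)
    unseen = [item for item in item_features if item not in engaged]
    unseen.sort(key=lambda item: cosine(profile, item_features[item]), reverse=True)
    return unseen[:k]

print(recommend({"space_saga": 2.0}))  # ['nebula_wars', 'alien_docs']
```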
Content-based methods shine when:
- The catalog is new or niche and interaction data is thin (cold start).
- Explainability matters (“because you read articles about Kubernetes”).
- You must avoid recommending items a user has already consumed.
Weakness: filter bubbles. If a reader only sees more of what they already like, discovery stalls. Hybrid systems inject diversity, trending items, or exploration slots to counter this — a product decision as much as a modeling one.
Hybrid and deep learning recommenders
Modern production stacks rarely pick one family. A typical architecture:
- Candidate generation — cheaply retrieve hundreds or thousands of plausible items (collaborative embeddings, content similarity, co-visit graphs, geographic proximity).
- Ranking — a heavier model scores and orders the shortlist using rich features: user demographics, session context, item freshness, cross-features (“user segment × category”).
- Re-ranking — business rules, diversity constraints, deduplication, and exploration slots applied last.
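The three stages above can be sketched as plain functions. Everything here is a hypothetical stand-in: the `catalog` scores play the role of a learned ranker, and the per-category cap is one simple example of a diversity constraint.

```python
from collections import Counter

# Hypothetical catalog: item -> (category, score from a stand-in ranking model).
catalog = {
    "a": ("shoes", 0.91), "b": ("shoes", 0.88), "c": ("shoes", 0.85),
    "d": ("hats", 0.70), "e": ("bags", 0.65), "f": ("hats", 0.40),
}

def generate_candidates(sources, limit=100):
    """Union cheap retrieval sources, keeping first-seen order without duplicates."""
    seen, merged = set(), []
    for source in sources:
        for item in source:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged[:limit]

def rank(candidates):
    """Order the shortlist by the (stand-in) heavy model score."""
    return sorted(candidates, key=lambda item: catalog[item][1], reverse=True)

def rerank(ranked, already_seen, max_per_category=2, k=4):
    """Apply business rules last: drop consumed items and cap each category."""
    counts, final = Counter(), []
    for item in ranked:
        category = catalog[item][0]
        if item in already_seen or counts[category] >= max_per_category:
            continue
        counts[category] += 1
        final.append(item)
        if len(final) == k:
            break
    return final

cf_hits = ["a", "b", "c"]            # e.g. from collaborative embeddings
content_hits = ["b", "d", "e", "f"]  # e.g. from content similarity
shortlist = generate_candidates([cf_hits, content_hits])
feed = rerank(rank(shortlist), already_seen={"e"})
print(feed)  # ['a', 'b', 'd', 'f']
```

Keeping the stages separate makes it possible to measure latency and quality per stage.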
Two-tower models
The two-tower architecture is the workhorse of large-scale retrieval. One neural network encodes the user (history, profile, context); another encodes the item (attributes, text, image). Training pulls positive pairs together and pushes negatives apart (contrastive or sampled-softmax loss). At serving time, item vectors are precomputed and stored in a vector database for approximate nearest-neighbor lookup — the same retrieval primitive used in hybrid search and RAG pipelines.
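To make the training objective and the serving path concrete, here is a small sketch that assumes the two towers have already produced embeddings. It shows an in-batch softmax loss (one common form of the contrastive objective) and an exact top-k lookup standing in for the approximate nearest-neighbor index; all vectors and names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs):
    """Row r's positive is item r; the other items in the batch act as negatives."""
    total = 0.0
    for r, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[r]
    return total / len(user_vecs)

def retrieve(user_vec, item_index, k=2):
    """Exact top-k by dot product; a production system would query an ANN index."""
    scored = sorted(item_index.items(), key=lambda kv: dot(user_vec, kv[1]), reverse=True)
    return [item for item, _ in scored[:k]]

# Loss is low when each user embedding points at its own positive item.
aligned = in_batch_softmax_loss([[2.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 2.0]])
mismatched = in_batch_softmax_loss([[2.0, 0.0], [0.0, 2.0]], [[0.0, 2.0], [2.0, 0.0]])
print(round(aligned, 3), round(mismatched, 3))

# Precomputed item vectors (hypothetical) and one user vector at serving time.
item_index = {
    "hiking_boots": [0.9, 0.1],
    "trail_map": [0.8, 0.3],
    "cookbook": [0.1, 0.9],
}
print(retrieve([1.0, 0.2], item_index))  # ['hiking_boots', 'trail_map']
```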
Sequential and session models
For feeds where order matters (e-commerce sessions, short-video apps), recurrent or transformer encoders model the sequence of recent events. The next-item prediction objective resembles language modeling: given items i_1, i_2, …, i_t, predict i_{t+1}. Session-aware models outperform static user embeddings when tastes shift within a single visit.
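The next-item objective can be illustrated with a much simpler stand-in than a recurrent or transformer encoder: a first-order Markov baseline that counts which item follows which. It conditions only on the last item, so it is a baseline for the same prediction task rather than an example of the sequence models themselves. The `sessions` data is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical browsing sessions, each an ordered list of item IDs.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case", "screen_protector"],
    ["laptop", "mouse"],
    ["phone", "case", "charger"],
]

def fit_transitions(sessions):
    """Count how often each item is immediately followed by each other item."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, last_item, k=1):
    """Most frequent successors of the last item; empty list if it was never seen."""
    successors = transitions.get(last_item, Counter())
    return [item for item, _ in successors.most_common(k)]

transitions = fit_transitions(sessions)
print(predict_next(transitions, "case"))   # ['charger']
print(predict_next(transitions, "tablet")) # []
```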
The cold-start problem
Cold start hits when a new user, new item, or new platform lacks interaction history:
- New user — fall back to popularity baselines, onboarding preference picks (“pick three genres”), or demographic defaults until enough events accumulate.
- New item — lean on content features and editorial placement; boost fresh items temporarily (exploration) so they can earn impressions and gather data.
- New platform — import transfer signals (social graph, search logs) or seed with curated lists until the interaction matrix densifies.
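The new-user fallbacks above can be sketched as a small dispatch function. The threshold `MIN_EVENTS`, the argument names, and the stand-in `personalized_model` are assumptions for illustration; a real system would tune the threshold and blend strategies rather than switch hard.

```python
MIN_EVENTS = 5  # hypothetical threshold for "enough history to personalize"

def recommend(history, onboarding_genres, personalized_model,
              popular_by_genre, popular_overall, k=3):
    """Choose a strategy based on how much is known about the user."""
    if len(history) >= MIN_EVENTS:
        candidates = personalized_model(history)
    elif onboarding_genres:
        candidates = [item for genre in onboarding_genres
                      for item in popular_by_genre.get(genre, [])]
    else:
        candidates = popular_overall
    # Drop duplicates and anything the user already consumed.
    seen, picks = set(history), []
    for item in candidates:
        if item not in seen:
            seen.add(item)
            picks.append(item)
    return picks[:k]

# Hypothetical fallback data and a stand-in for a trained model.
popular_overall = ["hit1", "hit2", "hit3", "hit4"]
popular_by_genre = {"jazz": ["j1", "j2"], "rock": ["r1", "j1"]}
personalized_model = lambda history: ["p1", "p2", "p3", "p4"]

print(recommend([], [], personalized_model, popular_by_genre, popular_overall))
print(recommend(["j1"], ["jazz", "rock"], personalized_model,
                popular_by_genre, popular_overall))
```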
A common mistake is training only on power users. Their behavior dominates gradients; casual visitors get poor recommendations and churn before you collect data. Stratified sampling across activity levels keeps casual users represented, and propensity weighting helps correct popularity bias; both are aligned with validation discipline in classical ML.
Evaluation: ranking metrics that matter
Recommenders are ranking problems, not single-label classifiers. Accuracy on a held-out click is misleading if the model only recommends blockbusters everyone would click anyway. Standard offline metrics:
- Precision@K — fraction of the top-K recommendations that were relevant in the test set.
- Recall@K — fraction of all relevant items captured in the top-K list.
- NDCG@K (normalized discounted cumulative gain) — rewards relevant items appearing higher in the ranked list; the same family of metrics used in information retrieval evaluation.
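- MAP (mean average precision) — for each user, averages precision@k over the ranks k where a relevant item appears, then takes the mean across users; common when multiple relevant items exist per user.
- Coverage and diversity — coverage is the fraction of the catalog that ever gets recommended, and low coverage means a few hits dominate the feed; diversity is often measured as intra-list similarity, where lower values mean more varied lists.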
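The ranking metrics above can be sketched for a single user as follows, assuming binary relevance (an item is either relevant or not). The function names and the toy `ranked` and `relevant` values are illustrative; averaging `average_precision` over users gives MAP.

```python
import math

def precision_at_k(ranked, relevant, k):
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: gain 1 per hit, discounted by log2(position + 1)."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0

def catalog_coverage(all_recommendations, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list."""
    distinct = {item for rec_list in all_recommendations for item in rec_list}
    return len(distinct) / catalog_size

ranked = ["a", "b", "c", "d", "e"]   # model output, best first
relevant = {"a", "c", "f"}           # held-out positives for this user
print(round(precision_at_k(ranked, relevant, 3), 3))
print(round(recall_at_k(ranked, relevant, 3), 3))
print(round(ndcg_at_k(ranked, relevant, 3), 3))
print(round(average_precision(ranked, relevant), 3))
```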
Offline metrics guide iteration, but online A/B tests decide launch. Watch click-through rate, conversion, session length, and long-term retention — a model that maximizes clicks can degrade satisfaction if it recommends clickbait. Holdout buckets and interleaving experiments compare rankers with live traffic.
Exploration, exploitation, and feedback loops
A recommender that only shows proven winners never learns about new inventory — the explore/exploit trade-off. Techniques include:
- Epsilon-greedy — reserve a small fraction of slots for random or diverse items.
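- Thompson sampling / contextual bandits — maintain uncertainty estimates per item or context; Thompson sampling draws a plausible reward for each action from its posterior and plays the best draw, while upper-confidence-bound (UCB) methods pick the action with the highest expected reward plus an exploration bonus.
- Position bias correction — items shown at the top get more clicks regardless of quality; training must either reweight clicks by position (for example, with inverse propensity weighting) or include position as a model feature.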
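Two of the techniques above can be sketched with the standard library alone. The epsilon-greedy function fills a slate mostly from the ranked list and occasionally from an exploration pool; the Thompson sampling function assumes click feedback and a Beta(1, 1) prior per item. Function names, priors, and the toy statistics are assumptions for illustration.

```python
import random

def epsilon_greedy_slate(ranked, explore_pool, k, epsilon, rng):
    """Fill k slots: with probability epsilon take a random exploration item,
    otherwise take the next best item from the ranked list."""
    exploit, pool, slate = list(ranked), list(explore_pool), []
    while len(slate) < k and (exploit or pool):
        explore = bool(pool) and (not exploit or rng.random() < epsilon)
        item = pool.pop(rng.randrange(len(pool))) if explore else exploit.pop(0)
        if item not in slate:
            slate.append(item)
    return slate

def thompson_pick(stats, rng):
    """stats maps item -> (clicks, impressions). Draw a click rate from each
    item's Beta posterior and return the item with the highest draw."""
    def draw(item):
        clicks, impressions = stats[item]
        return rng.betavariate(1 + clicks, 1 + impressions - clicks)
    return max(stats, key=draw)

rng = random.Random(42)
print(epsilon_greedy_slate(["a", "b", "c"], ["new1", "new2"], k=3, epsilon=0.2, rng=rng))

# A proven item versus a brand-new one with no impressions yet (hypothetical numbers).
stats = {"proven": (900, 1000), "fresh": (0, 0)}
picks = [thompson_pick(stats, rng) for _ in range(500)]
print(picks.count("fresh"), "of 500 picks went to the fresh item")
```

The fresh item keeps receiving some impressions while its posterior is wide, and that share shrinks as evidence accumulates.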
Feedback loops amplify bias: if the model never shows niche content, niche creators leave; the training data becomes even more homogeneous. Periodic audits for demographic parity, catalog coverage, and stale embeddings are part of responsible MLOps, not an afterthought.
Decision table: which approach when?
| Scenario | Starting approach | Upgrade path |
|---|---|---|
| Small catalog, sparse data | Popularity + content-based | Item-based CF when co-visits accumulate |
| Medium catalog, rich implicit feedback | Matrix factorization (ALS) | Two-tower retrieval + gradient-boosted ranker |
| Session-heavy feed (video, e-commerce) | Sequential transformer encoder | Multi-stage retrieve-then-rank with real-time features |
| Search + recommend blend | Content embeddings + BM25 hybrid | Unified embedding index with query-aware re-ranking |
| Strict explainability requirement | Content-based with explicit feature weights | Hybrid with post-hoc explanations on shortlists |
Anti-patterns to avoid
- Random train/test splits on temporal data — future clicks leak into training; use time-based cutoffs.
- Treating missing interactions as negatives — most users never saw most items; implicit ALS treats unobserved cells as low-confidence zeros, while sampled softmax draws a subset of unobserved items as negatives.
- Optimizing only offline AUC while ignoring catalog coverage, latency, and business guardrails.
- Stale item embeddings after catalog updates — schedule nightly or streaming refresh; stale vectors recommend discontinued SKUs.
- Ignoring latency budgets — a perfect ranker that takes 800 ms loses to a good-enough one at 40 ms; precompute item sides of two-tower models.
- Recommending already-consumed items without explicit “watch again” intent — filter history unless repetition is the goal (music playlists vs news articles).
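Two of these anti-patterns have short, mechanical fixes. The sketch below shows a time-based split and a consumed-item filter; the event tuple layout `(user, item, timestamp)`, the integer day timestamps, and the function names are assumptions for illustration.

```python
def time_based_split(events, cutoff):
    """Hold out every event at or after `cutoff` so no future data reaches training."""
    train = [event for event in events if event[2] < cutoff]
    test = [event for event in events if event[2] >= cutoff]
    return train, test

def drop_consumed(candidates, history):
    """Remove items the user already consumed; skip this when repetition is the goal."""
    consumed = set(history)
    return [item for item in candidates if item not in consumed]

# Hypothetical events: (user, item, day number).
events = [
    ("u1", "i1", 1), ("u2", "i3", 2), ("u1", "i2", 5),
    ("u2", "i1", 8), ("u1", "i4", 9),
]
train, test = time_based_split(events, cutoff=6)
print(len(train), "train events,", len(test), "test events")

u1_history = [item for user, item, _ in train if user == "u1"]
print(drop_consumed(["i1", "i4", "i2", "i5"], u1_history))  # ['i4', 'i5']
```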
Production checklist
- Positive and negative labels defined with product input; implicit vs explicit signals documented.
- Time-based train/validation/test splits; no future leakage.
- Baseline: popularity and/or content-based before claiming model lift.
- Offline metrics: NDCG@K and coverage reported alongside precision@K.
- Candidate generation separated from ranking; latency measured per stage.
- Cold-start fallbacks tested for new users and new items.
- Exploration slots or bandit layer for fresh inventory.
- Position bias and popularity bias audited in training data.
- Online A/B framework with guardrail metrics (retention, complaints).
- Embedding and feature pipelines monitored for drift in production.
Key takeaways
- Recommendation systems rank items from sparse user-item interactions — explicit ratings and implicit events.
- Collaborative filtering learns from co-occurrence; matrix factorization compresses taste into embeddings.
- Content-based models use item features — essential for cold start and explainability.
- Production stacks use retrieve-then-rank pipelines; two-tower models power large-scale retrieval.
- Measure with NDCG@K and online experiments; manage exploration, bias, and feedback loops deliberately.
Related reading
- Machine learning fundamentals explained — supervised learning, evaluation splits, and when classical models suffice
- Information retrieval explained — BM25, inverted indexes, and NDCG ranking evaluation
- Vector databases explained — approximate nearest-neighbor search for embedding retrieval
- Deep learning explained — neural networks, embeddings, and training optimizers behind modern rankers