Explainer · 7 June 2026

How recommendation algorithms work

Open a shopping app, a music player, or a short-video feed and the first screen is rarely the same for two people. Behind that surface is a recommendation system: software that scores millions of candidate items and ranks the few it thinks you will engage with next. The techniques range from simple counting to deep neural networks, but the core problem is always the same — predict preference from incomplete, biased history.

The prediction problem

Formally, a recommender estimates a function score(user, item) → number. Higher scores surface first. The training signal is almost always implicit feedback (clicks, watch time, purchases, skips) rather than explicit star ratings. Implicit data is abundant but noisy: a click might mean interest, an accidental tap, or curiosity about a thumbnail.

Production systems split the work into stages for latency reasons:

Candidate generation — narrow billions of items to hundreds using cheap heuristics or approximate nearest neighbors.
Ranking — score and sort those hundreds with a heavier model.
Re-ranking — apply business rules (diversity, freshness, policy filters) before display.

You rarely see one algorithm; you see a pipeline where each stage can fail independently. A brilliant ranker cannot fix a candidate generator that never surfaces new creators.
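The three stages can be sketched in a few lines of Python. This is a toy illustration, not a production design: the item records, the category-based candidate rule, and the function names are all assumptions made for the example.

```python
from typing import Callable

Item = dict  # hypothetical record: {"id": str, "category": str, "popularity": float}


def generate_candidates(catalog: list[Item], liked_categories: set[str], limit: int) -> list[Item]:
    """Cheap first pass: keep items from categories the user has touched, most popular first."""
    pool = [item for item in catalog if item["category"] in liked_categories]
    pool.sort(key=lambda item: item["popularity"], reverse=True)
    return pool[:limit]


def rank(candidates: list[Item], score: Callable[[Item], float]) -> list[Item]:
    """Heavier second pass: sort the short list by a model score."""
    return sorted(candidates, key=score, reverse=True)


def rerank(ranked: list[Item], blocked_ids: set[str], max_per_category: int) -> list[Item]:
    """Business rules: drop blocked items and cap how many share a category."""
    shown, per_category = [], {}
    for item in ranked:
        if item["id"] in blocked_ids:
            continue
        count = per_category.get(item["category"], 0)
        if count >= max_per_category:
            continue
        per_category[item["category"]] = count + 1
        shown.append(item)
    return shown


catalog = [
    {"id": "a", "category": "jazz", "popularity": 0.9},
    {"id": "b", "category": "jazz", "popularity": 0.8},
    {"id": "c", "category": "jazz", "popularity": 0.7},
    {"id": "d", "category": "rock", "popularity": 0.6},
    {"id": "e", "category": "news", "popularity": 0.99},
]

candidates = generate_candidates(catalog, liked_categories={"jazz", "rock"}, limit=4)
ranked = rank(candidates, score=lambda item: item["popularity"])  # stand-in for a real model
feed = rerank(ranked, blocked_ids={"b"}, max_per_category=1)
feed_ids = [item["id"] for item in feed]
```

Item "e" has the highest popularity in the catalog but never reaches the ranker, because the candidate stage filtered it out. No amount of ranking quality downstream can recover it.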

Collaborative filtering: wisdom of the crowd

Collaborative filtering ignores item descriptions and uses only interaction history. The intuition: users who agreed in the past will agree again.

User-user CF finds people similar to you and recommends what they liked that you have not seen. It struggles when you are unique (sparse neighborhood) and when the catalog is huge (comparing you to every user is expensive).

Item-item CF is often more stable: "people who bought A also bought B." Amazon's classic approach precomputes similar-item lists offline. At query time you look up items in the user's history and merge their neighbors — fast enough for web-scale traffic.
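A minimal sketch of the offline and query-time split, assuming cosine similarity over shared buyers as the similarity measure. The purchase log and function names are hypothetical, and this is not a description of any particular retailer's implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical interaction log: user -> set of items they bought.
histories = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C"},
    "u4": {"D"},
}


def build_neighbors(histories):
    """Offline step: cosine similarity between items based on shared buyers."""
    buyers = defaultdict(int)    # item -> number of users who bought it
    together = defaultdict(int)  # (item, item) -> number of users who bought both
    for items in histories.values():
        for item in items:
            buyers[item] += 1
        for left, right in combinations(sorted(items), 2):
            together[(left, right)] += 1
    neighbors = defaultdict(dict)
    for (left, right), count in together.items():
        similarity = count / sqrt(buyers[left] * buyers[right])
        neighbors[left][right] = similarity
        neighbors[right][left] = similarity
    return neighbors


def recommend(user_items, neighbors, k=3):
    """Query-time step: merge neighbor lists of the user's items, skipping items already owned."""
    scores = defaultdict(float)
    for item in user_items:
        for other, similarity in neighbors.get(item, {}).items():
            if other not in user_items:
                scores[other] += similarity
    # Sort by score, breaking ties alphabetically so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


neighbors = build_neighbors(histories)
suggestions = recommend({"A"}, neighbors)
```

Item "D" was bought by a single user who bought nothing else, so it has no neighbors and can never be suggested this way, a small-scale version of the sparsity problem.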

Both variants suffer when interactions are sparse. A niche documentary watched by twelve people will have weak similarity estimates. Matrix sparsity is why modern systems moved toward learned embeddings.

Content-based filtering and hybrid models

Content-based recommenders describe each item with features — genre tags, artist, price band, text embeddings of a product description — and build a profile of what you tend to like. They excel at explaining recommendations ("because you listen to jazz") and at the cold start for new items when you already know the metadata.
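As a rough sketch, a content-based profile can be the average of the feature vectors of items a user liked, with unseen items scored by cosine similarity to that profile. The tags, weights, and item names below are invented for illustration.

```python
from math import sqrt

TAGS = ["jazz", "rock", "vocal"]

# Hypothetical item features, one weight per tag in TAGS.
item_features = {
    "jazz_trio": [1.0, 0.0, 0.0],
    "jazz_vocal": [0.8, 0.0, 0.6],
    "rock_live": [0.0, 1.0, 0.0],
    "rock_ballad": [0.0, 0.8, 0.6],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_profile(liked_ids):
    """Average the feature vectors of the items the user liked."""
    vectors = [item_features[item_id] for item_id in liked_ids]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def recommend(liked_ids, k=1):
    """Score every unseen item against the profile; no interaction data is needed."""
    profile = build_profile(liked_ids)
    unseen = [item_id for item_id in item_features if item_id not in liked_ids]
    return sorted(unseen, key=lambda item_id: cosine(profile, item_features[item_id]), reverse=True)[:k]


def explain(profile):
    """Name the strongest tag in the profile, e.g. for a 'because you listen to ...' label."""
    return TAGS[max(range(len(TAGS)), key=lambda index: profile[index])]


top_pick = recommend(["jazz_trio"])
reason = explain(build_profile(["jazz_trio"]))
```

Because scoring depends only on metadata, a listing added a minute ago can be recommended as soon as its features exist.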

Pure content models tend to over-specialize: they recommend more of what you already consumed. Real products blend collaborative and content signals in hybrid architectures: CF for discovery, content features for new inventory, hand-tuned weights per surface (home vs search vs email).

Text and image embeddings from large models blurred the line between content and collaborative approaches. A video's visual embedding can sit in the same vector space as user taste vectors learned from watch history — see retrieval-augmented generation for how similar embedding retrieval powers Q&A systems.

Matrix factorization and embeddings

Matrix factorization (popularized by the Netflix Prize era) represents each user and each item as a low-dimensional vector. The predicted rating is the dot product of those vectors. Training minimizes error on known interactions while regularizing so vectors generalize to missing cells.

Geometrically, users and items live in the same space: items you like should lie in the direction your user vector points. Similar items cluster; similar users cluster. The dimension count (50, 128, 512) is a bias-variance knob — too few dimensions underfit taste; too many memorize noise.
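The training loop can be sketched with plain stochastic gradient descent on squared error with L2 regularization. The ratings, the two-dimensional vectors, and the learning-rate and regularization values are assumptions chosen to keep the example small; real systems tune these and often add bias terms.

```python
import random
from math import sqrt

# Hypothetical observed ratings: (user index, item index, rating).
ratings = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 2, 1.0),
    (2, 2, 5.0), (2, 3, 4.0),
    (3, 2, 4.0), (3, 3, 5.0), (3, 0, 1.0),
]
NUM_USERS, NUM_ITEMS, DIM = 4, 4, 2


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def init_vectors(seed=0):
    """Small random starting vectors for every user and item."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(NUM_USERS)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(NUM_ITEMS)]
    return users, items


def rmse(ratings, users, items):
    """Root mean squared error on the known cells only."""
    squared = sum((r - dot(users[u], items[i])) ** 2 for u, i, r in ratings)
    return sqrt(squared / len(ratings))


def train(ratings, users, items, epochs=500, lr=0.02, reg=0.02):
    """SGD: nudge each user and item vector to shrink the error on each known rating."""
    for _ in range(epochs):
        for u, i, r in ratings:
            error = r - dot(users[u], items[i])
            for d in range(DIM):
                pu, qi = users[u][d], items[i][d]
                users[u][d] += lr * (error * qi - reg * pu)
                items[i][d] += lr * (error * pu - reg * qi)


users, items = init_vectors()
rmse_before = rmse(ratings, users, items)
train(ratings, users, items)
rmse_after = rmse(ratings, users, items)

# Score for a cell that was never observed: user 0 on item 3.
unseen_score = dot(users[0], items[3])
```

The point of the exercise is the last line: once vectors are learned, any user and item pair gets a score, including pairs with no recorded interaction.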

Two-tower neural models extend this idea. One neural network encodes the user from history and context (time of day, device); another encodes the candidate item. Dot product or a small MLP produces the score. Towers train on billions of (user, item, label) tuples with negative sampling — random items the user did not click, treated as negatives.
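Negative sampling itself is simple to sketch. The helper below builds (user, item, label) tuples by pairing each click with random items the user never clicked. The function name, the uniform sampling, and the tiny catalog are assumptions for illustration; production systems often sample by popularity or reuse other rows in the batch as negatives.

```python
import random


def build_training_tuples(clicks, catalog, negatives_per_positive=2, seed=0):
    """Turn (user, item) clicks into (user, item, label) tuples with sampled negatives."""
    rng = random.Random(seed)
    clicked_by_user = {}
    for user, item in clicks:
        clicked_by_user.setdefault(user, set()).add(item)

    tuples = []
    for user, item in clicks:
        tuples.append((user, item, 1))  # observed click: positive label
        pool = [other for other in catalog if other not in clicked_by_user[user]]
        count = min(negatives_per_positive, len(pool))
        for negative in rng.sample(pool, count):
            tuples.append((user, negative, 0))  # unclicked item: assumed negative
    return tuples


catalog = ["v1", "v2", "v3", "v4", "v5"]
clicks = [("ana", "v1"), ("ana", "v2"), ("ben", "v3")]
training_tuples = build_training_tuples(clicks, catalog)
```

The label 0 is an assumption, not an observation: the user may simply never have been shown the item, which is one reason sampled negatives add noise to training.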

Serving uses approximate nearest neighbor indexes (HNSW, ScaNN) to find high-scoring items without evaluating the full catalog. Index freshness — how quickly a viral upload enters the candidate pool — is often the difference between feeling "live" and feeling stale.
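HNSW and ScaNN are too involved to reproduce here, so the sketch below swaps in a much simpler approximate technique, random-hyperplane hashing, to show the general trade: search only one bucket instead of the whole catalog and accept that some good items will be missed. The class and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt


def normalize(vector):
    length = sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class HyperplaneIndex:
    """Buckets unit vectors by which side of each random hyperplane they fall on."""

    def __init__(self, dim, num_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)

    def _key(self, unit_vector):
        return tuple(dot(plane, unit_vector) >= 0 for plane in self.planes)

    def add(self, item_id, vector):
        """Insert one item; a new upload becomes searchable as soon as this returns."""
        unit = normalize(vector)
        self.buckets[self._key(unit)].append((item_id, unit))

    def search(self, query, k=3):
        """Rank only the items that share the query's bucket, not the full catalog."""
        unit = normalize(query)
        bucket = self.buckets.get(self._key(unit), [])
        ranked = sorted(bucket, key=lambda entry: dot(unit, entry[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]


item_vectors = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [-1.0, 0.3], "d": [0.1, -1.0]}
index = HyperplaneIndex(dim=2)
for item_id, vector in item_vectors.items():
    index.add(item_id, vector)

nearest_to_c = index.search(item_vectors["c"], k=1)
```

A near neighbor that lands just across a hyperplane is invisible to the query. That is the "approximate" part, and real indexes spend most of their complexity reducing how often it happens.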

Cold start: new users and new items

User cold start — first visit, empty history. Mitigations: onboarding quizzes, demographic defaults, trending/popular fallbacks, or context (locale, referral source). Solana Garden's homepage collectibles hub uses similar ideas: show broadly appealing defaults until wallet history and local preferences exist.

Item cold start — brand-new listing with zero clicks. Content features and creator reputation help; some platforms give a small exploration budget of impressions to measure true interest before burying the item.
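An exploration budget can be sketched as a reserved slot in the slate for new items that have not yet reached an impression threshold. The function name, the budget of 100 impressions, and the one-slot reservation are assumptions for illustration only.

```python
def fill_slate(ranked, fresh, impressions, slate_size=4, explore_slots=1, budget=100):
    """Reserve a few slots for new items that have not used up their impression budget."""
    eligible = [item for item in fresh if impressions.get(item, 0) < budget]
    explore = eligible[:explore_slots]
    exploit = [item for item in ranked if item not in explore][: slate_size - len(explore)]
    return exploit + explore


ranked = ["a", "b", "c", "d", "e"]       # items ordered by the ranker
fresh = ["new1", "new2"]                 # recent listings with little or no history
impressions = {"new1": 150, "new2": 10}  # new1 has already spent its budget

slate = fill_slate(ranked, fresh, impressions)
```

Once an item exhausts its budget it competes on measured engagement like everything else, which keeps the cost of exploration bounded.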

Cold start is where recommendation quality feels most unfair. A great video with no initial push may never escape the long tail. Creators optimize thumbnails and titles because the ranker often sees metadata before engagement.

Exploration, exploitation, and bandits

Recommenders face a reinforcement-learning tension: exploit what already works for you vs explore to learn if you might like something new. Pure exploitation converges to a narrow feed; pure exploration feels random.

Multi-armed bandit algorithms allocate a fraction of slots to uncertain items and shift traffic toward winners. Epsilon-greedy, Thompson sampling, and contextual bandits (conditioned on user context) appear in ad placement, notification timing, and "try this creator" modules.
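Thompson sampling is compact enough to sketch in full. The version below assumes click-or-skip feedback and a Beta posterior per item; the item names and click rates in the simulation are made up, and the sampler never sees those rates directly.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a small set of items."""

    def __init__(self, item_ids, seed=0):
        self.rng = random.Random(seed)
        self.clicks = {item_id: 0 for item_id in item_ids}
        self.skips = {item_id: 0 for item_id in item_ids}

    def choose(self):
        # Draw a plausible click rate for each item from its Beta posterior,
        # then show the item with the highest draw. Uncertain items get wide
        # posteriors, so they occasionally win and get explored.
        draws = {
            item_id: self.rng.betavariate(self.clicks[item_id] + 1, self.skips[item_id] + 1)
            for item_id in self.clicks
        }
        return max(draws, key=draws.get)

    def update(self, item_id, clicked):
        if clicked:
            self.clicks[item_id] += 1
        else:
            self.skips[item_id] += 1


# Simulation with hypothetical click rates.
true_click_rate = {"familiar": 0.05, "new_creator": 0.5}
sampler = ThompsonSampler(true_click_rate)
world = random.Random(1)
impressions = {item_id: 0 for item_id in true_click_rate}
for _ in range(2000):
    shown = sampler.choose()
    impressions[shown] += 1
    sampler.update(shown, clicked=world.random() < true_click_rate[shown])
```

Early rounds split traffic roughly evenly; as evidence accumulates, the posterior for the weaker item narrows around a low rate and it is shown less and less often.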

Exploration must be bounded. Showing irrelevant items trains the model on negative feedback and erodes trust. Product teams cap exploration rate per session and exclude sensitive categories from experiments.

Feedback loops and filter bubbles

Recommendations change behavior; behavior becomes tomorrow's training data. That feedback loop can amplify early randomness: a slight initial boost snowballs into dominance. It also creates filter bubbles — feeds that reinforce existing views because outrage and comfort both drive engagement.

Mitigations include diversity constraints in re-ranking ("do not show five identical thumbnails"), periodic injection of out-of-cluster items, and separate models optimized for long-term satisfaction rather than immediate clicks. None is perfect, and each typically gives up some short-term engagement or revenue in exchange for breadth.
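One common way to express a diversity constraint is maximal marginal relevance (MMR), which is named here as an example technique rather than something the text above prescribes. Each pick balances relevance against similarity to what is already selected. The relevance scores, the prefix-based similarity, and the 0.7 trade-off are all invented for the sketch.

```python
def mmr_rerank(candidates, relevance, similarity, k, trade_off=0.7):
    """Greedy MMR: repeatedly pick the item with the best relevance-minus-redundancy score."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(item):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity(item, chosen) for chosen in selected), default=0.0)
            return trade_off * relevance[item] - (1 - trade_off) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical ranker scores; the prefix stands in for a visual or topical cluster.
relevance = {"cat1": 0.9, "cat2": 0.85, "cat3": 0.8, "dog1": 0.6}


def same_cluster(a, b):
    return 1.0 if a[:3] == b[:3] else 0.0


by_relevance = sorted(relevance, key=relevance.get, reverse=True)[:3]
diversified = mmr_rerank(list(relevance), relevance, same_cluster, k=3)
```

Sorting by relevance alone fills all three slots from the same cluster; the MMR pass promotes the lower-scored "dog1" into second place, which is exactly the revenue-for-breadth trade described above.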

Privacy interacts here. Cross-site tracking and browser fingerprinting let ad networks enrich user profiles without explicit consent — a different threat model from first-party recommendation on data you volunteered on one service.

Production realities

Research papers optimize offline metrics (NDCG, recall@k). Production optimizes latency, infrastructure cost, and guardrails:

Latency budgets — ranking must finish in tens of milliseconds; heavy transformers run offline or on small candidate sets.
Position bias — users click top results because they are on top, not only because they are best. Training debiases clicks or uses randomized experiments.
Gaming and fraud — fake engagement, click farms, and SEO-for-marketplaces poison labels; anomaly detection is part of the stack.
Freshness vs stability — hourly retrains catch trends but can drift from one run to the next; daily retrains are more stable but slow to react to breaking news.
Explainability — regulators and users ask "why this?" Sparse linear models and attention highlights help; deep nets often do not.
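Position bias in particular lends itself to a small worked example. One standard correction is inverse propensity weighting: each click is weighted by one over the probability that its position was examined at all. The propensities below are assumed values; in practice they are usually estimated from randomized experiments.

```python
from collections import defaultdict

# Assumed probability that a user even looks at each position.
examine_propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# Hypothetical log rows: (item, position shown, clicked).
logs = [
    ("top", 1, True), ("top", 1, True), ("top", 1, False), ("top", 1, False),
    ("low", 3, True), ("low", 3, False), ("low", 3, False), ("low", 3, False),
]


def click_rates(logs, propensity):
    """Return (raw, debiased) click-through rates per item."""
    shown = defaultdict(int)
    raw_clicks = defaultdict(float)
    weighted_clicks = defaultdict(float)
    for item, position, clicked in logs:
        shown[item] += 1
        if clicked:
            raw_clicks[item] += 1.0
            weighted_clicks[item] += 1.0 / propensity[position]
    raw = {item: raw_clicks[item] / shown[item] for item in shown}
    debiased = {item: weighted_clicks[item] / shown[item] for item in shown}
    return raw, debiased


raw_ctr, debiased_ctr = click_rates(logs, examine_propensity)
```

On raw click-through the item in slot one looks twice as good. After weighting, the item buried in slot three comes out ahead, because its single click was earned despite rarely being seen.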

Large language models add a new layer: natural-language interfaces that retrieve and summarize items (context windows limit how much catalog metadata fits in one prompt). The ranker still decides what enters that context, so recommendation does not disappear; it moves upstream.

Practical checklist

When you build or audit a personalized surface:

Define the objective explicitly — clicks, watch time, revenue, retention — and know they conflict.
Measure cold-start cohorts separately from power users.
Log not just impressions and clicks but position, surface, and experiment arm.
Cap exploration and enforce diversity where homogeneity harms trust.
Test offline metrics against live A/B results — offline winners often fail online.
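For the offline side of that comparison, recall@k and NDCG@k with binary relevance can be computed as follows. This is a minimal sketch; the function names and the example lists are illustrative, and graded relevance would need a different gain term.

```python
from math import log2


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the top k, normalized by the best achievable ordering."""
    dcg = sum(1.0 / log2(rank + 2) for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


ranked = ["a", "b", "c", "d"]  # model output, best first
relevant = {"a", "d"}          # items the user actually engaged with in held-out data

recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Both numbers are computed from logged behavior, which was itself shaped by the previous model's rankings. That is one reason an offline win may not carry over to a live A/B test.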
