Guide

Multi-armed bandits explained

You run an online store and want to know which homepage banner converts best. A classic A/B test splits traffic 50/50 until you have enough data — but half your visitors see the worse variant the entire time. A multi-armed bandit (MAB) takes a smarter approach: it learns while earning, gradually shifting traffic toward the better option while still probing alternatives. Named after a row of slot machines ("one-armed bandits") in a casino, the MAB problem is the cleanest formalization of the exploration vs exploitation tradeoff in reinforcement learning. This guide explains the problem setup, regret as the objective, core algorithms (epsilon-greedy, UCB, Thompson sampling), contextual bandits for personalization, when bandits beat fixed splits, and how to ship them in production.

The problem: K slot machines, unknown payouts

Imagine K arms (variants, ads, recommendations, drug doses). Each arm i has an unknown true mean reward μ_i — click-through rate, revenue per user, survival probability. At each time step t, you pull one arm and observe a stochastic reward drawn from that arm's distribution (Bernoulli click/no-click, Gaussian revenue, etc.).

Your goal: maximize cumulative reward over T pulls. The difficulty is partial observability — you only see the reward of the arm you chose, never the counterfactual "what would have happened if I showed variant B." Every pull spent on a suboptimal arm is regret: the gap between what you earned and what the best arm would have earned.

Formally, regret after T steps is:

R(T) = T · μ* − ∑_t=1^T r_t, where μ* is the mean of the best arm and r_t is the reward at step t. A good algorithm keeps R(T) growing slowly — ideally O(log T) for stationary Bernoulli arms. Bad algorithms (always exploit the first lucky arm, or always explore uniformly) accumulate linear regret.

Exploration vs exploitation

Exploitation means pulling the arm that looks best from past data — maximize immediate expected reward. Exploration means pulling uncertain arms to learn their true means — invest now for better decisions later. Pure exploitation gets stuck on early noise (an arm that got lucky in the first 10 pulls). Pure exploration wastes traffic on arms you already know are weak.

Bandit algorithms encode a principled balance. Unlike offline supervised learning, where you train on a fixed dataset then deploy, bandits are online: the policy that collects data is the same policy that optimizes reward. That makes them ideal when data arrives sequentially and delaying the winner has a real cost.

Epsilon-greedy: the baseline

Epsilon-greedy is the simplest bandit policy. With probability 1 − ε, pull the arm with the highest sample mean (empirical best). With probability ε, pull a uniformly random arm.

Typical starting values: ε = 0.1 (10% exploration). You can decay ε over time — high early (learn fast), low late (earn more). Epsilon-greedy is easy to implement and explain to stakeholders, but it explores uniformly among all suboptimal arms, wasting pulls on arms that are clearly bad. Regret is O(T^2/3) with fixed ε — acceptable for small K but suboptimal at scale.

Implementation sketch: maintain count n_i and sum of rewards S_i per arm. Sample mean μ̂_i = S_i / n_i. On each request, with prob 1−ε choose argmax μ̂_i; else choose random i. Update counters after observing reward.

Upper Confidence Bound (UCB): optimism under uncertainty

UCB1 selects the arm with the highest upper confidence bound on its mean:

UCB_i = μ̂_i + √(2 ln t / n_i)

The second term is an exploration bonus — large when an arm has been pulled rarely (small n_i) or when total time t is large. UCB is "optimistic in the face of uncertainty": it pretends each arm might be the best until evidence proves otherwise. Arms with high uncertainty get tried; arms with consistently low sample means stop receiving bonus and fade out.

UCB1 achieves O(log T) regret for Bernoulli rewards — near-optimal without distributional assumptions beyond bounded rewards. Variants like UCB-V incorporate reward variance for tighter bounds. UCB is deterministic given history (no random tie-breaking), which simplifies debugging and audit logs.

Thompson sampling: Bayesian probability matching

Thompson sampling (Bayesian bandits) maintains a posterior distribution over each arm's mean. For Bernoulli rewards, use a Beta(α, β) prior per arm — start with α = β = 1 (uniform). After observing successes and failures, update α += successes, β += failures. Each round, sample one draw from each arm's Beta posterior and pull the arm with the highest sample.

Thompson sampling explores proportionally to the probability an arm is optimal — if arm B has a 30% chance of being best, it gets roughly 30% of exploratory traffic. Empirically, Thompson sampling often outperforms UCB on finite horizons and handles delayed feedback more gracefully. It is the default choice in many recommendation and ad-ranking systems.

For Gaussian rewards with unknown variance, use Normal-Gamma conjugate priors. For non-conjugate cases, approximate posteriors with MCMC or variational inference — heavier compute, rarely needed for click/conversion bandits.

Contextual bandits: personalization with side information

Plain MABs assume every user is identical — one global best arm. Real products have context: device type, geography, time of day, user history. Contextual bandits choose an arm as a function of context vector x, learning separate reward models per arm or a shared model with arm indicators.

Common approaches:

LinUCB — assume expected reward is linear in x; maintain ridge-regression estimates per arm with UCB-style confidence ellipsoids.
Thompson sampling with logistic regression — Bayesian linear models for click probability given features.
Neural bandits — embed context with a small network; explore via last-layer uncertainty or ensemble disagreement.

Contextual bandits sit between multi-armed bandits and full reinforcement learning: state is observed once per decision (no sequential state transitions), but policies can personalize. They power news feeds, email subject lines, and homepage layouts where user features are rich but the decision is instantaneous.

Bandits vs A/B tests vs full RL

Approach	Traffic split	Best when	Weakness
Fixed A/B test	Constant (e.g. 50/50)	Need clean causal read; few variants; long horizon OK	Wastes traffic on losers during test
Multi-armed bandit	Adaptive toward winner	Many variants; opportunity cost of losers is high; continuous optimization	Harder to report fixed p-values; non-stationary arms confuse it
Contextual bandit	Adaptive per segment	Rich user features; personalization at decision time	Cold-start users; feature pipeline must be live
Full RL (MDP)	Sequential state-action	Decisions affect future states (games, robotics, bidding)	Sample-hungry; engineering complexity

Use a fixed A/B test when legal, compliance, or science demands a pre-registered split with classical hypothesis testing. Use bandits when the cost of showing a suboptimal variant is measurable in revenue or engagement and you have ongoing traffic. Many teams run bandits in production and occasional holdout A/B tests to validate the bandit's winner.

Non-stationarity and delayed rewards

Real systems drift: seasonality, competitor moves, model updates. Stationary bandits assume μ_i is fixed; when it is not, add exponential decay to sufficient statistics (recent pulls weigh more) or use sliding windows. Monitor concept drift signals on arm means.

Delayed feedback breaks the simple update loop — a click may arrive minutes later, a purchase days later. Techniques: update on attribution window close, use propensity-weighted imputation for pending events, or batch Thompson sampling updates on a schedule. Never update an arm's posterior on rewards that belong to a previous policy version without importance weighting.

Algorithm selection guide

Scenario	Recommended algorithm	Why
2–5 arms, MVP, explainability matters	Epsilon-greedy (decaying ε)	Dead simple; stakeholders understand random exploration
5–50 arms, Bernoulli rewards, auditability	UCB1	Logarithmic regret; deterministic selection
Production CTR optimization, finite horizon	Thompson sampling	Strong empirical performance; natural uncertainty quantification
User features available at decision time	LinUCB or contextual Thompson	Personalized arms without full RL stack
Arm means change weekly	Discounted Thompson or sliding-window UCB	Forgets stale evidence

Common mistakes

Peeking and re-randomizing — resetting the experiment when one arm leads mid-test destroys validity; bandits avoid this by design but manual overrides reintroduce bias.
Ignoring multiple comparisons — with 20 arms, one will look good by chance early; UCB/Thompson mitigate but do not eliminate false discovery.
Correlated arms — variants that share users (same session sees multiple) violate independence; use cluster-level randomization.
Survivorship in logging — only logging rewarded arms skews offline analysis; log context, arm, propensity score, and outcome always.
Treating bandits as free lunch — they reduce regret vs fixed splits but still need minimum traffic per arm for stable estimates.
Wrong reward definition — optimizing clicks while the business cares about margin trains a bandit on a proxy that diverges from revenue.

Production checklist

Define the reward metric aligned with business outcome (not just a proxy).
Choose stationary vs discounted updates based on how fast arms drift.
Enforce minimum pulls per arm before allowing full exploitation (warm-up).
Log arm, context features, propensity, reward, and timestamp on every decision.
Cap maximum traffic to any single arm during exploration (safety rails).
Run periodic holdout A/B tests against the bandit's current champion.
Handle delayed conversions with explicit attribution windows.
Monitor arm means and regret estimates on a dashboard; alert on sudden inversions.
Version bandit state — rollback if a deploy corrupts posteriors.
Document exploration rate for compliance (financial product disclosures, etc.).

Key takeaways

Multi-armed bandits formalize learning which option is best while minimizing the cost of trying worse options.
Regret is the right objective — cumulative reward lost vs always picking the true best arm.
Epsilon-greedy is the teaching baseline; UCB and Thompson sampling are production-grade with logarithmic regret.
Contextual bandits add personalization without full MDP reinforcement learning.
Choose bandits over fixed A/B when opportunity cost of suboptimal traffic is high and exploration can be continuous.