Mixture of experts explained: sparse LLMs, routing, and inference trade-offs
A mixture of experts (MoE) model replaces each dense feed-forward block in a transformer with a bank of parallel expert subnets plus a router that picks which experts run for each token. The headline promise: many more total parameters without proportionally more compute per forward pass — because only a small subset of experts activates on any given token. Models like Mixtral, DeepSeek-MoE, and early Switch Transformer variants popularized the pattern for open-weight LLMs. This guide explains how routing works, why training MoE is harder than dense transformers, what changes at inference time (memory vs FLOPs), and how MoE fits next to quantization and standard scaling laws.
Dense vs sparse: the core idea
In a standard decoder-only LLM, every transformer layer has a multi-head self-attention block followed by a dense feed-forward network (FFN) — typically two linear projections with a GELU or SwiGLU activation in between. Every parameter in that FFN participates in every token’s computation.
An MoE layer keeps attention unchanged but swaps the single FFN for N smaller expert FFNs (often 8, 16, or more) and a gating network. For each token, the gate outputs a score per expert; the top-k experts (often k = 1 or k = 2) run, and their outputs are weighted and summed. If you have 8 experts but activate 2 per token, you use roughly 25% of the expert FLOPs while still storing all 8 expert weight matrices in memory.
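A minimal sketch of that top-k forward pass for a single token, in plain Python. The toy experts, the gate logits, and the name `moe_forward` are illustrative assumptions; real layers use learned FFNs and batched tensor kernels. The softmax here is taken over the selected logits only, which is one common choice; some designs normalize over all experts before selecting.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, gate_logits, experts, k=2):
    """Run only the top-k experts for one token and blend their outputs."""
    # Indices of the k highest-scoring experts (ties keep the lower index first).
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        y = experts[i](token)  # the other N - k experts are never evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out, top

# Eight toy "experts": expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

output, chosen = moe_forward([1.0, -1.0], gate_logits, experts, k=2)
active_fraction = len(chosen) / len(experts)
print(chosen, output, active_fraction)  # [1, 4] [3.5, -3.5] 0.25
```

All eight experts stay in the `experts` list (memory), but only two are called per token (compute), which is the 25% figure above.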
Parameters are not compute
Parameter count and inference cost decouple. A 47B-parameter MoE that activates ~13B parameters per token can match a 13B dense model on latency while carrying more representational capacity — if your hardware can feed weights fast enough and routing overhead stays low. That “if” is where production gets interesting.
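The stored-versus-active split is simple arithmetic. The sketch below uses round, illustrative numbers chosen to land near the 47B / 13B figures above; the shared-parameter count, the per-expert size, and the layer count are assumptions, not a published configuration.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_layers):
    """Return (total stored params, params touched per token)."""
    # Shared parameters (attention, embeddings, norms) run for every token.
    total = shared_params + num_layers * num_experts * params_per_expert
    active = shared_params + num_layers * top_k * params_per_expert
    return total, active

total, active = moe_param_counts(
    shared_params=1_600_000_000,    # assumed: attention + embeddings
    params_per_expert=176_000_000,  # assumed: one expert FFN in one layer
    num_experts=8,
    top_k=2,
    num_layers=32,
)
print(total, active)  # 46656000000 12864000000
```

Memory has to hold `total`; per-token FLOPs track `active`.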
How routing works
The router is usually a small linear projection from the hidden state to N logits, one score per expert. Common patterns:
- Top-1 routing — send the token to exactly one expert (Switch Transformer style). Lowest compute, highest risk of imbalance.
- Top-2 routing — activate two experts and blend outputs (Mixtral 8x7B uses this). Better quality and smoother gradients; ~2x expert FLOPs vs top-1.
- Soft routing — weighted combination of all experts (rare at scale due to cost).
Scores often pass through softmax to get weights. Some designs add noise to logits during training to encourage exploration so experts do not collapse into a single overloaded subnet.
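As a rough illustration of why noise helps, the sketch below routes the same nearly tied logits with and without Gaussian noise. The function name `route`, the noise scale, and the logits are assumptions made for the example; production routers typically learn or schedule the noise rather than using a fixed standard deviation.

```python
import math
import random

def route(logits, k, noise_std=0.0, rng=None):
    """Return {expert_index: weight} for the top-k experts of one token."""
    rng = rng or random.Random()
    # Training-time exploration: perturb logits before selecting experts.
    if noise_std > 0:
        scores = [x + rng.gauss(0.0, noise_std) for x in logits]
    else:
        scores = list(logits)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores so the weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

logits = [1.0, 0.9, 0.8, 0.7]  # nearly tied experts
rng = random.Random(0)         # fixed seed so the demo is repeatable

# Without noise, top-1 routing picks the same expert every time.
quiet = {next(iter(route(logits, k=1))) for _ in range(1000)}
# With noise, the other experts also receive tokens (and therefore gradients).
explored = {next(iter(route(logits, k=1, noise_std=1.0, rng=rng))) for _ in range(1000)}
# Top-2 routing blends two experts with weights that sum to 1.
top2 = route(logits, k=2)
```

Here `quiet` contains only expert 0, while `explored` contains several experts, which is the exploration effect described above.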
Token-level vs expert parallelism
In distributed training, expert parallelism shards different experts across GPUs. A token routed to expert 5 may need an all-to-all communication step to reach the GPU holding expert 5’s weights. Routing patterns therefore affect network traffic, not just FLOPs. Inference stacks (vLLM, TensorRT-LLM, etc.) implement fused MoE kernels and expert-aware batching to hide this latency.
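To make the traffic effect concrete, here is a small sketch that counts how many tokens each GPU would send to each other GPU in one all-to-all step. The layout (experts placed contiguously, two per GPU) and the function name are assumptions for illustration; real frameworks handle placement and dispatch inside their collective-communication layers.

```python
def all_to_all_counts(token_gpu, token_expert, experts_per_gpu, num_gpus):
    """counts[src][dst] = tokens that GPU src must send to GPU dst."""
    counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src, expert in zip(token_gpu, token_expert):
        dst = expert // experts_per_gpu  # assumed contiguous expert placement
        counts[src][dst] += 1
    return counts

# 8 tokens spread over 4 GPUs; 8 experts, 2 per GPU. Expert 5 is popular.
token_gpu = [0, 0, 1, 1, 2, 2, 3, 3]
token_expert = [5, 5, 5, 0, 5, 4, 1, 5]

counts = all_to_all_counts(token_gpu, token_expert, experts_per_gpu=2, num_gpus=4)
received = [sum(row[dst] for row in counts) for dst in range(4)]
cross_gpu = sum(counts[s][d] for s in range(4) for d in range(4) if s != d)
print(received, cross_gpu)  # [2, 0, 6, 0] 6
```

GPU 2 (holding experts 4 and 5) receives six of the eight tokens, and six tokens cross the interconnect, so a skewed router shows up as both a compute hotspot and a network hotspot.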
Training challenges
MoE adds optimization problems dense models avoid:
- Load imbalance — if the router sends 80% of tokens to expert 0, that GPU saturates while others idle. Training slows; experts specialize poorly.
- Expert collapse — some experts receive almost no tokens and stop learning useful features.
- Auxiliary load-balancing loss — most MoE training recipes add a penalty encouraging uniform expert utilization across a batch. Tune too aggressively and quality drops; too weak and routing collapses.
- Capacity factor — limit how many tokens each expert may accept per batch; overflow tokens get dropped or rerouted, trading strict balance for throughput.
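The last two items can be sketched in a few lines. The loss below follows the Switch Transformer form, N times the sum over experts of (fraction of tokens dispatched) times (mean router probability), which equals 1.0 under perfectly uniform routing; the scaling coefficient applied to it in training is omitted. The capacity rule and the drop-on-overflow policy are one simple variant, and all function names are assumptions for this example.

```python
import math

def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i."""
    n_tokens = len(assignments)
    # f_i: fraction of tokens actually dispatched to expert i.
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

def apply_capacity(assignments, num_experts, capacity_factor):
    """Keep tokens until an expert is full; overflow tokens are dropped."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped, capacity

balanced = load_balance_loss([[0.25] * 4] * 8, [0, 1, 2, 3, 0, 1, 2, 3], 4)
skewed_assignments = [0, 0, 0, 0, 0, 0, 1, 2]
skewed = load_balance_loss([[0.7, 0.1, 0.1, 0.1]] * 8, skewed_assignments, 4)
kept, dropped, capacity = apply_capacity(skewed_assignments, 4, capacity_factor=1.25)
```

With these inputs the balanced batch scores 1.0 and the skewed batch about 2.2, so the penalty grows as routing concentrates. At a capacity factor of 1.25 each expert accepts 3 tokens, and tokens 3, 4, and 5 overflow expert 0.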
From a deep learning perspective, MoE is still backprop-through-router — gradients flow to both experts and gate. The gate learns which specialization to invoke; experts learn what specialization they represent (syntax, code, math, multilingual subspaces, etc.), though interpretability is imperfect.
Notable MoE models and naming
Marketing names can mislead. Mixtral 8x7B does not hold eight independent 7B models: each layer has eight expert FFNs on a 7B-style backbone, and attention and embeddings are shared, so the total is roughly 47B stored parameters rather than 8 × 7B = 56B. Only two experts activate per token, so per-token compute sits closer to a ~13B-class dense network.
DeepSeek-MoE and similar architectures refine routing, shared experts, and fine-grained expert splits to push parameter efficiency further. Closed API models (GPT-4-class systems, Gemini) are widely assumed to use MoE or hybrid sparse patterns, though vendors rarely publish full architectural detail.
Fine-tuning trade-offs also differ: a sparse base may fine-tune differently from a dense one, because adapter layers attach per expert or only to shared components, depending on framework support.
Inference: memory bandwidth is the bottleneck
Dense model inference on a single GPU is often memory-bandwidth bound: loading weights dominates latency more than arithmetic. MoE amplifies this tension:
- All experts must fit (or stream from host) even though only k run per token: VRAM footprint tracks total parameters, not active parameters.
- Active FLOPs scale with k, not N, which is good for throughput if weights are already resident.
- Batching helps: routing different tokens in a batch to different experts improves GPU utilization vs batch-1 serial routing.
- Quantization matters more: INT4/INT8 weight formats shrink the all-experts memory tax; see quantization and inference for GPTQ/AWQ trade-offs after you shrink expert matrices.
For long contexts, MoE changes nothing about attention: the savings apply only to the FFN portion of each layer, so quadratic attention compute and the KV cache, which grows linearly with context length, cost the same as in a dense model.
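A back-of-envelope sizing sketch for the points above. It counts weight bytes only (ignoring quantization scales, activations, and runtime overhead) and uses a hypothetical attention configuration for the KV cache; every number passed in is an assumption, not a measured value.

```python
def weight_gb(total_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes); tracks TOTAL params, not active."""
    return total_params * bits_per_weight / 8 / 1e9

def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache for one sequence in GiB; independent of the expert count."""
    # The leading 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

moe_fp16 = weight_gb(47_000_000_000, 16)    # 94.0 GB for a 47B-total MoE
moe_int4 = weight_gb(47_000_000_000, 4)     # 23.5 GB after 4-bit weights
dense_fp16 = weight_gb(13_000_000_000, 16)  # 26.0 GB for a 13B dense model

# Hypothetical attention config: 32 layers, 8 KV heads of dim 128, 32k context.
kv = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)  # 4.0 GiB
```

The MoE needs several times the weight memory of a dense model with similar active compute, while the KV cache term is identical for both because it never references the number of experts.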
MoE vs dense: when each wins
| Concern | Dense model | MoE model |
|---|---|---|
| VRAM for weights | Lower total params | Higher — all experts resident |
| Per-token FLOPs | All FFN params active | Only top-k experts active |
| Training complexity | Simpler, mature tooling | Routing balance, expert parallelism |
| Latency at batch 1 | Predictable | Routing + expert dispatch overhead |
| Quality per dollar (API) | Well understood | Often better if provider absorbs routing cost |
Choose dense when you self-host on tight VRAM and need predictable single-user latency. Choose MoE when you want frontier-class quality at moderate active compute, especially via hosted APIs that amortize expert parallelism across many concurrent requests.
MoE in production systems
If you operate LLM apps (RAG chatbots, agents, codegen), MoE mostly appears as a model selection decision rather than something you implement yourself:
- Capacity planning — self-hosting Mixtral-class weights needs multi-GPU or aggressive quantization; do not assume 7B VRAM budgets.
- Eval before swap — MoE and dense models with similar active params can differ on reasoning, tool use, and multilingual tasks. Run your eval suite on real prompts, not just MMLU headlines.
- Routing is opaque — unlike RAG where you control retrieval, you cannot easily force “use the math expert” on a MoE API — specialization is learned, not user-addressable.
- Cost models — API pricing may reflect total model size, active compute, or both; read provider docs instead of inferring from parameter marketing.
Common misconceptions
- “8x7B means 56B active”: false; only top-k experts run per token.
- “MoE is always faster”: false at batch 1 if memory bandwidth or routing overhead dominates.
- “More experts always help”: without load balancing and sufficient data, extra experts stay undertrained.
- “MoE replaces RAG”: orthogonal; MoE is intra-model sparsity, RAG is external knowledge injection.
- “Experts are human-interpretable modules”: sometimes loosely true, never guaranteed; do not build security policies assuming clean expert boundaries.
Production checklist
- Confirm whether your target model is MoE or dense before sizing GPU memory.
- Benchmark end-to-end latency at your real batch size and context length — not just tokens/sec marketing charts.
- After quantization, re-run task-specific evals; expert layers can quantize differently than attention.
- For self-hosting, verify inference engine MoE support (fused kernels, expert parallelism) before buying hardware.
- Document model choice in agent architectures so cost and quality regressions are traceable when you swap checkpoints.
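For the latency item in the checklist, a minimal timing harness might look like the sketch below. `fake_generate` is a hypothetical stand-in for whatever client or engine call you actually use, and the nearest-rank percentile is one simple choice among several.

```python
import math
import time

def measure_latency(generate, prompts, batch_size, repeats=5):
    """Time generate() on fixed-size batches; return sorted latencies in seconds."""
    batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
    latencies = []
    for _ in range(repeats):
        for batch in batches:
            start = time.perf_counter()
            generate(batch)
            latencies.append(time.perf_counter() - start)
    return sorted(latencies)

def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def fake_generate(batch):
    """Hypothetical stand-in: replace with a real call to your model endpoint."""
    return [prompt.upper() for prompt in batch]

prompts = [f"prompt {i}" for i in range(8)]  # use real prompts at real lengths
latencies = measure_latency(fake_generate, prompts, batch_size=4, repeats=5)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

Run it once per candidate checkpoint with the batch size and context lengths you expect in production, and compare tail latency (p95) as well as the median.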
MoE is one of the main tricks the industry uses to keep scaling machine learning models when dense FFNs become too expensive to run at every layer for every token. Understanding the split between stored parameters and activated parameters keeps you from mis-sizing infrastructure — or overpaying for dense compute when a sparse API would match your workload.
Related reading
- Transformer architecture explained — self-attention, FFN blocks, and where MoE layers plug in
- LLM quantization and inference explained — shrinking expert weights for VRAM-constrained deployment
- LLM context windows explained — KV cache costs MoE does not reduce
- LLM evaluation and benchmarking explained — compare MoE and dense checkpoints on your tasks