LLM scaling laws explained

A team trains a 7B model on 200B tokens and wonders why a competitor’s 3B model scores higher on their benchmark. The answer is rarely “magic architecture” alone — it is how parameters, data, and compute were balanced against predictable scaling curves. Scaling laws are empirical power-law relationships showing that, within broad regimes, language-model loss improves smoothly as you add model size, training tokens, or FLOPs. Kaplan et al. (2020) showed that bigger models beat bigger datasets alone; Hoffmann et al.’s Chinchilla paper (2022) flipped the recipe toward compute-optimal training; and the post-2023 era added over-training, data quality, and inference economics as first-class variables. This guide explains what scaling laws predict, what they do not, how practitioners budget pre-training and fine-tuning, emergent-ability hype vs measurement, a Harbor Analytics training budget worked example, scaling decision tables, common pitfalls, and a production checklist. For architecture internals see transformer architecture; for deploying smaller models see small language models.

What scaling laws measure

Scaling-law research fits curves of the form L ≈ C^α or L ≈ N^β, where L is validation cross-entropy loss (proxy for perplexity), N is non-embedding parameter count, D is training tokens, and C is total training compute in FLOPs. The exponents are negative: doubling parameters or data reduces loss by a predictable fraction until you hit data exhaustion, optimization limits, or evaluation saturation.
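
To make the power-law form concrete, here is a minimal sketch that fits an exponent with least squares in log-log space. The data points are synthetic and invented for illustration, and `fit_power_law` is a hypothetical helper, not code from any paper. Published fits often add an irreducible-loss term, which this sketch leaves out.

```python
import math

def fit_power_law(xs, losses):
    """Fit loss = a * x**alpha by linear regression on (log x, log loss)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    alpha = sxy / sxx                        # slope in log-log space
    a = math.exp(mean_y - alpha * mean_x)    # intercept, back in linear space
    return a, alpha

# Synthetic (made-up) parameter counts and losses that follow an exact power law.
params = [1e7, 1e8, 1e9, 1e10]
losses = [12.0 * n ** -0.08 for n in params]

a, alpha = fit_power_law(params, losses)
# With a negative exponent, each doubling of N multiplies loss by 2**alpha (< 1).
print(f"alpha = {alpha:.3f}, loss multiplier per doubling = {2 ** alpha:.3f}")
```

On real runs the points scatter around the line, so the fitted exponent is an estimate with uncertainty rather than a constant of nature.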

Three quantities interact:

  • Model size (N): width and depth of the transformer; larger models are more sample-efficient, reaching a given loss with fewer tokens seen.
  • Dataset size (D): unique tokens (not epochs over the same corpus); repetition eventually yields diminishing returns.
  • Compute (C): approximately 6ND FLOPs for standard dense training; achieved throughput, and therefore wall-clock cost, varies with hardware and with attention optimizations like FlashAttention.
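
The 6ND rule is easy to turn into a budget estimate. The sketch below applies it to the 7B-parameter, 200B-token run from the introduction. The peak throughput (312 TFLOP/s, the vendor's dense BF16 figure for an A100) and the 40% utilization are assumptions chosen for illustration; substitute measured numbers for your own hardware.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training compute: C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def gpu_hours(flops: float, peak_flops_per_s: float, utilization: float) -> float:
    """Convert a FLOP budget to GPU-hours at an assumed sustained utilization."""
    return flops / (peak_flops_per_s * utilization) / 3600.0

c = training_flops(7e9, 200e9)   # the 7B model on 200B tokens from the introduction
hours = gpu_hours(c, peak_flops_per_s=312e12, utilization=0.4)  # both values assumed
print(f"{c:.2e} FLOPs, roughly {hours:,.0f} GPU-hours under these assumptions")
```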

Scaling laws describe pre-training loss trends, not guaranteed downstream task scores. A model with lower perplexity can still fail coding or reasoning benchmarks if the data mix or alignment stage is wrong — but loss curves remain the planning tool labs use before spending millions on GPU hours.

Kaplan scaling: when bigger models won

OpenAI’s Kaplan et al. (2020) study on transformer language models established that increasing parameters reduced loss faster than increasing dataset size alone for a fixed compute budget. Under their assumptions, if you had 10× more compute, you should allocate most of it to a larger model rather than training a smaller model on proportionally more data. That recipe influenced GPT-3-era training: very large models, sometimes trained on fewer tokens per parameter than later work would recommend.

Key Kaplan takeaways

  • Loss scales as a power law in N, D, and C across many orders of magnitude.
  • For a fixed compute budget, optimal N and D follow predictable ratios — but those ratios shifted when later work measured iso-loss more carefully.
  • Early stopping and width/depth trade-offs matter; scaling laws are smooth on average, not noise-free for individual training runs.

Kaplan laws explained why frontier labs chased hundred-billion-parameter models. They did not yet answer how many tokens per parameter were optimal — that gap Chinchilla closed.
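
A small sketch shows how differently the two recipes spend extra compute. It assumes the commonly cited approximate allocation exponents: optimal N growing as C^0.73 and D as C^0.27 in Kaplan et al., versus roughly C^0.5 for both in Hoffmann et al. Treat these values as approximations; the papers report several fits.

```python
def scale_allocation(compute_multiplier: float, n_exponent: float, d_exponent: float):
    """Return (growth in N, growth in D) when compute grows by compute_multiplier."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

# Approximate exponents as commonly cited (assumed here, not re-derived).
KAPLAN = (0.73, 0.27)
CHINCHILLA = (0.50, 0.50)

for name, (a, b) in (("Kaplan", KAPLAN), ("Chinchilla", CHINCHILLA)):
    n_growth, d_growth = scale_allocation(10.0, a, b)
    print(f"{name}: 10x compute -> N x{n_growth:.2f}, D x{d_growth:.2f}")
```

Because the exponents in each pair sum to 1, the product of the two growth factors equals the compute multiplier, which is consistent with C ≈ 6ND.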

Chinchilla scaling: compute-optimal balance

DeepMind’s Hoffmann et al. (2022) Chinchilla paper re-derived optimal allocation for a fixed compute budget and found that many contemporary large models were under-trained: too many parameters relative to tokens seen. Chinchilla-70B, trained on ~1.4T tokens, outperformed much larger models trained on less data per parameter. The rule of thumb that emerged: for compute-optimal pre-training, train on roughly 20 tokens per parameter (order-of-magnitude; exact optimum depends on architecture and data).
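
Combining C ≈ 6ND with a target ratio D = r·N gives N = sqrt(C / (6r)). The sketch below uses r = 20 as the rule-of-thumb default; `compute_optimal_split` is a hypothetical helper name, and the result is an order-of-magnitude guide rather than a fitted optimum.

```python
import math

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = r * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against the figures quoted above: 70B parameters on ~1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"N = {n / 1e9:.1f}B parameters, D = {d / 1e12:.2f}T tokens")
```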

Chinchilla implications for practitioners

  • Smaller + longer training can beat bigger + shorter at equal compute: this is the foundation of efficient open models (Llama, Mistral, Phi) that punch above naive parameter comparisons.
  • Data quality and deduplication become central: at 20 unique tokens per parameter, even a 70B model needs about 1.4T tokens, and repeating low-quality web text does not substitute.
  • Evaluation must track tokens per parameter when comparing checkpoints; leaderboard rank without training-recipe context misleads buyers.

Chinchilla shifted the industry from “parameters as marketing” toward tokens seen and compute disclosed (where vendors publish them). It also motivated small language models trained on curated, high-quality data rather than raw scale alone.

Post-Chinchilla era: over-training and inference scaling

After Chinchilla, frontier labs often train beyond the compute-optimal point for a given N, deliberately over-training smaller, inference-friendly models on many more tokens than 20×N. Llama 3-class models exemplify the pattern: accept a less compute-efficient pre-training run at the largest N that is still feasible to deploy, then recover capability via massive token exposure and multi-stage alignment.

Training scaling vs inference scaling

Training scales with 6ND; inference scales with active parameters per token (all parameters for dense models, only the routed experts for MoE), memory bandwidth, and KV cache size. A model optimal for training FLOPs may be too large to serve profitably. Production teams therefore separate:

  • Pre-training scale: what the lab can afford once.
  • Serving scale: latency, cost per 1M tokens, batch size.
  • Distillation / quantization: compress teacher capability into student models (knowledge distillation, quantization).
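
A rough way to see why serving can dominate is to compare one-off training compute with cumulative inference compute. The sketch assumes about 2 FLOPs per active parameter per generated or processed token (forward pass only) and ignores memory bandwidth and KV cache, which often set real latency and cost. The 1T served-token volume is a made-up figure, and the model sizes are illustrative.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float):
    """Return (training FLOPs, serving FLOPs) for a dense model.

    Training uses the 6*N*D approximation; serving assumes ~2*N FLOPs per token.
    """
    train = 6.0 * n_params * train_tokens
    serve = 2.0 * n_params * served_tokens
    return train, serve

SERVED_TOKENS = 1e12  # hypothetical lifetime traffic

for label, n, d in (("1.3B on 40B tokens", 1.3e9, 40e9), ("7B on 10B tokens", 7e9, 10e9)):
    train, serve = lifetime_flops(n, d, SERVED_TOKENS)
    print(f"{label}: train {train:.2e}, serve {serve:.2e}, serve/train = {serve / train:.1f}")
```

Under this approximation, serving compute overtakes training compute once served tokens exceed three times the training tokens, so a model that will see heavy traffic is judged mostly on its serving cost.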

Inference-time scaling (more tokens generated at test time, verifier passes, search) is a different axis from parameter scaling — see test-time compute for that curve.

Emergent abilities: hype vs measurement

Papers and blog posts claim that capabilities like multi-step arithmetic emerge suddenly at certain scales. Follow-up work argues many “emergent” jumps are metric artifacts: discontinuous task scores (exact match, pass@1) can look like phase transitions when underlying loss improves smoothly. Sharp capability gains still happen, but often correlate with data mix changes, instruction tuning, or evaluation thresholds — not magic at 100B parameters alone.
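
The metric-artifact argument can be illustrated with a toy model. Assume, purely for illustration, that each answer token is correct independently with probability p, and that exact match requires all tokens of a 10-token answer to be correct. Real errors are correlated, so this is a caricature, not a measurement.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves smoothly; exact match stays near zero, then climbs fast.
for p in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")
```

The smooth column is the kind of quantity scaling laws track; the exact-match column is what a pass/fail benchmark reports.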

Practical read for builders:

  • Do not assume your 13B fine-tune will spontaneously gain reasoning at 70B without data and alignment investment.
  • Use smooth metrics (log likelihood, Brier score) alongside pass/fail benchmarks in evaluation.
  • Treat scaling laws as budget planners, not capability oracles.

Scaling laws for fine-tuning and domain adaptation

Scaling-law literature focuses on pre-training, but analogous trade-offs appear in fine-tuning:

  • More domain data usually helps until overfitting; smaller models need fewer examples to saturate narrow tasks.
  • LoRA rank is not “more is always better” — rank scales effective capacity; match rank to task complexity and base model size.
  • Alignment compute (SFT + preference optimization) can dominate user-visible quality after Chinchilla-optimal pre-training plateaus; see RLHF and DPO.
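
To see how LoRA rank scales capacity: each adapted weight matrix of shape d_out × d_in gains two low-rank factors with rank × (d_in + d_out) trainable parameters in total. The model dimensions below (hidden size 4096, 32 layers, two adapted projections per layer) are hypothetical and chosen only to make the arithmetic concrete.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_adapted_matrices: int) -> int:
    """Trainable parameters added by LoRA: factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out) * n_adapted_matrices

HIDDEN = 4096            # hypothetical hidden size
ADAPTED = 32 * 2         # hypothetical: 32 layers, two projections adapted per layer

for rank in (4, 8, 16, 64):
    count = lora_trainable_params(HIDDEN, HIDDEN, rank, ADAPTED)
    print(f"rank {rank:>2}: {count:,} trainable parameters")
```

Trainable parameters grow linearly with rank, so a higher rank on a small, narrow dataset mostly adds room to overfit.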

For most product teams, scaling retrieval and context (RAG, context engineering) beats scaling parameters when knowledge is volatile or proprietary.

Worked example: Harbor Analytics pre-training budget

Harbor Analytics plans an internal 1.3B-parameter coder model for SQL and spreadsheet formula assistance. They have 3×10^21 FLOPs (~8,000 A100-hours equivalent) and proprietary + open code corpora. Applying Chinchilla-style balancing:

  1. Compute-optimal tokens — target ~20×1.3B ≈ 26B unique tokens at iso-compute optimum; they budget 40B tokens (modest over-training) because inference at 1.3B is cheap and they want lower deployment loss.
  2. Data mix — 55% filtered GitHub/Python/SQL, 25% documentation and Stack Overflow dumps (deduplicated), 15% synthetic examples from their tutor pipeline, 5% held-out enterprise schemas (licensed). Quality filtering beats adding 200B raw crawl tokens.
  3. Baseline comparison — a 7B model trained on 10B tokens would consume similar FLOPs but serve 5× slower; Harbor chooses 1.3B + longer training + later LoRA on customer schemas.
  4. Evaluation gates — track validation loss weekly; run HumanEval-style SQL tasks only after loss flattening; abort if loss diverges (learning rate or data bug).
  5. Serving path — INT8 quantization and batch-8 GPU serving; cascade route hard queries to a frontier API via model routing.

Outcome framing: scaling laws told them 1.3B × 40B tokens fits the FLOP box; product requirements (latency, fine-tune surface) picked the exact N; data curation picked effective D.
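
The arithmetic behind the Harbor plan can be checked with the 6ND approximation. This is a sketch under that approximation only; the budget constant assumes the figure of 3×10^21 FLOPs stated in the example, and the plan labels are just names for the options discussed above.

```python
BUDGET_FLOPS = 3e21  # compute budget as stated in the worked example

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense training compute: C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

plans = {
    "1.3B at 20 tokens/param": (1.3e9, 26e9),
    "1.3B over-trained (chosen)": (1.3e9, 40e9),
    "7B baseline": (7e9, 10e9),
}

for name, (n, d) in plans.items():
    c = training_flops(n, d)
    print(f"{name}: {c:.2e} FLOPs, {d / n:.1f} tokens/param, "
          f"{c / BUDGET_FLOPS:.0%} of budget")
```

Under this estimate the chosen plan and the 7B baseline land within the same order of magnitude of compute and both fit inside the budget, so the decision between them rests on serving cost and tokens per parameter rather than on training FLOPs.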

Scaling decision table

Goal | Scale what | Caveat
Lower pre-training loss per FLOP | Balance N and D near Chinchilla 20:1 tokens/param | Unique tokens; dedupe matters.
Best benchmark at fixed serve cost | Over-train smaller N; distill from teacher | Alignment stage may dominate scores.
Proprietary knowledge without retrain | RAG + context, not parameter scale | Retrieval quality caps answers.
Multilingual coverage | Scale data per language, not just total D | English-heavy mix hides weak locales.
Reasoning-heavy tasks | Data mix (code, math) + test-time compute | Params alone rarely fix reasoning.
Edge deployment | Smaller N + quantization + SLM training | Memory bandwidth bound on phone NPUs.
Fast iteration for startups | Fine-tune open Chinchilla-style base | Do not pre-train without unique data moat.
Regulated attribution | Scale curated docs in RAG, log sources | Bigger model increases hallucination risk without grounding.

Common pitfalls

  • Parameter count marketing: a 70B model trained on 2T tokens and an 8B model trained on 8T tokens cannot be compared without tokens per parameter and the data recipe.
  • Epoching tiny data: repeating 1B tokens 100 times is not equivalent to 100B unique tokens; training loss may keep improving while memorization hurts generalization.
  • Ignoring inference economics: training-optimal giants can bankrupt serving; plan quantization and routing early.
  • Assuming emergent jumps: buying 2× parameters and expecting automatic coding gains without code-heavy training data.
  • Single benchmark optimization: lower loss on Pile validation does not guarantee a win on your customer’s JSON extraction task.
  • Undisclosed compute: comparing models without FLOPs or tokens seen invites false conclusions in procurement.
  • Skipping alignment budget: a Chinchilla-optimal base model still needs SFT/DPO for assistant behavior.
  • Data pollution: benchmark leakage inflates scores without true capability scaling.

Production checklist

  • Document target compute budget in FLOPs or GPU-hours before choosing N.
  • Estimate unique tokens available; deduplicate and filter before scaling epochs.
  • Compute tokens-per-parameter ratio; compare to Chinchilla ~20:1 baseline.
  • Plot validation loss vs tokens; stop or adjust LR when the curve reaches its knee and flattens.
  • Benchmark on smooth metrics (loss, calibrated probs) plus task pass rates.
  • Estimate serving cost per 1M tokens at expected QPS before committing to N.
  • Plan distillation or quantization if training N exceeds serve N.
  • Reserve alignment compute (SFT, preferences) in the project plan.
  • For product teams: default to fine-tune + RAG unless data moat justifies pre-train.
  • Log data mix percentages and revision hashes for reproducibility audits.

Key takeaways

  • Scaling laws are smooth power-law relationships between parameters, data, compute, and pre-training loss.
  • Kaplan favored larger models; Chinchilla rebalanced toward more tokens per parameter at fixed compute.
  • Post-Chinchilla practice often over-trains smaller deployable models and invests in data quality.
  • Inference and test-time compute are separate scaling axes from pre-training FLOPs.
  • Most teams should scale retrieval, fine-tuning, and routing before attempting frontier pre-training.
