Cheatsheet: Scaling laws

loss(compute) ~ C^(-alpha_C)
loss(parameters) ~ N^(-alpha_N)
loss(data) ~ D^(-alpha_D)

Each relation is a straight line on a log-log plot, with slope equal to minus the exponent; doubling the input reduces loss by a predictable, modest factor.
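
To make the "predictable, modest factor" concrete: if loss ~ x^(-alpha), doubling x multiplies loss by 2^(-alpha). The snippet below is a minimal sketch of that arithmetic. The exponent values are illustrative placeholders assumed for this example, not fitted values from either paper.

```python
def doubling_factor(alpha: float) -> float:
    """Multiplier applied to loss when x doubles, for loss ~ x^(-alpha)."""
    return 2.0 ** (-alpha)


# Illustrative exponents only (assumed for this sketch, not taken from a fit).
for alpha in (0.05, 0.10, 0.30):
    factor = doubling_factor(alpha)
    reduction_pct = (1.0 - factor) * 100.0
    print(f"alpha={alpha:.2f}: loss x{factor:.3f} ({reduction_pct:.1f}% lower)")
```

With small exponents the factor stays close to 1, which is why each doubling buys only a few percent of loss.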

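|                      | Kaplan (2020)                      | Chinchilla (2022)                                                       |
|----------------------|------------------------------------|-------------------------------------------------------------------------|
| Recommendation       | Spend on parameters; moderate data | N and D scale equally with compute                                      |
| Rule of thumb        | Big model, moderate tokens         | D* ~= 20 * N* (~20 tokens per parameter)                                |
| Canonical model      | GPT-3: 175B / 300B tokens          | Chinchilla: 70B / 1.4T tokens                                           |
| Same-compute outcome |                                    | Chinchilla (70B) beat the larger Gopher (280B); also outperformed GPT-3 |
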
Given budget C:
6 * N * D = C (from lesson 2)
D / N = 20 (Chinchilla)
-> 6 * N * (20 * N) = 120 * N^2 = C
-> N ~= sqrt(C / 120)
D ~= 20 * N

Example: C = 6e22 FLOPs -> N^2 = C / 120 = 5e20 -> N ~= 22.4B parameters, D ~= 447B tokens.
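
The two constraints above (6 * N * D = C and D / N = 20) can be solved directly in code. This is a minimal sketch; the function and constant names are hypothetical, chosen for illustration.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # training cost approximation: C = 6 * N * D
TOKENS_PER_PARAM = 20.0      # Chinchilla rule of thumb: D ~= 20 * N


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (N, D) satisfying 6 * N * D = C and D = 20 * N."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


n, d = chinchilla_optimal(6e22)
print(f"N ~= {n / 1e9:.1f}B parameters, D ~= {d / 1e9:.0f}B tokens")
```

Because N grows as the square root of C, a 100x larger budget implies roughly a 10x larger model and 10x more tokens under this rule.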

Chinchilla minimizes: training loss for a training budget
Reality minimizes: total cost = training + inference-over-lifetime

For models that will serve many inference tokens:

  • Push past Chinchilla-optimal on data (over-train).
  • Reach the same final loss with a smaller model.
  • Pay more upfront training compute; save on every served token.
  • Most 7-14B modern open models are over-trained for exactly this.
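
The trade-off in the list above can be sketched as a break-even calculation. Two assumptions here are not from this cheatsheet: inference is costed at roughly 2 * N FLOPs per served token, and the two example configurations are hypothetical and simply assumed (not shown) to reach the same final loss.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training (6 * N * D) plus inference (assumed ~2 * N FLOPs per served token)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens


def break_even_served_tokens(big: tuple[float, float], small: tuple[float, float]) -> float:
    """Served tokens beyond which the smaller, over-trained model is cheaper overall.

    big and small are (n_params, train_tokens) pairs.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_served_token = 2.0 * (n_big - n_small)
    return extra_training / saving_per_served_token


# Hypothetical pair, assumed to reach the same final loss.
chinchilla_like = (22e9, 440e9)  # 20 tokens per parameter
over_trained = (8e9, 2e12)       # 250 tokens per parameter

t_star = break_even_served_tokens(chinchilla_like, over_trained)
print(f"break-even at ~{t_star:.2e} served tokens")
```

Below the break-even point the compute-optimal model is cheaper in total; above it, the extra training compute has paid for itself in inference savings.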

| Use                          | How                                                        |
|------------------------------|------------------------------------------------------------|
| Pick (N, D)                  | Apply the Chinchilla rule + 6ND = C                        |
| Predict loss before spending | Fit on small/medium, extrapolate                           |
| Sanity-check a run           | If loss diverges from the fit, look at data/optimizer/code |
| Judge architecture changes   | A real win improves the exponent, not just the prefactor   |

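
The "fit on small/medium, extrapolate" row can be sketched as an ordinary least-squares line in log-log space. The data points below are synthetic, generated from a known power law purely for illustration. The fit also assumes a pure power law with no irreducible-loss offset, which is a simplification.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x^(-alpha) as a straight line in log-log space.

    Returns (a, alpha): the prefactor and the exponent.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, alpha, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-alpha)


# Synthetic (compute, loss) points from a known law; not real training runs.
true_a, true_alpha = 30.0, 0.05
compute = [1e18, 3e18, 1e19, 3e19, 1e20]
losses = [true_a * c ** (-true_alpha) for c in compute]

a, alpha = fit_power_law(compute, losses)
target = 1e22  # extrapolate two orders of magnitude past the largest point
print(f"fitted prefactor={a:.2f}, exponent={alpha:.3f}")
print(f"predicted loss at C={target:.0e}: {predict(a, alpha, target):.3f}")
```

The same fit doubles as a sanity check: a run whose measured loss sits well off the predicted value is a prompt to inspect data, optimizer, or code before spending more compute.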
  • Scaling laws predict cross-entropy loss, not downstream task performance.
  • They assume clean data; bad data shifts the curves.
  • Exponents have been re-fit (Kaplan -> Chinchilla -> further refinement).
  • Extrapolating far beyond the fitted range is a strong claim; re-check the fit at very large scale.
  • Power law: y proportional to x^(-alpha); straight on log-log.
  • Compute-optimal: the (N, D) minimizing loss for a fixed compute budget.
  • Chinchilla-optimal: the specific compute-optimal answer (D ~ 20N).
  • Over-training: training past compute-optimal on data, trading training cost for inference savings.
  • Exponent vs prefactor: a scaling-law fit has both; only an exponent improvement survives scaling.
  • Stanford CS336, Lectures 9 and 11 (Scaling laws), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.