
Cheatsheet: Why scale matters: scaling laws and Chinchilla

Scaling laws (Kaplan 2020):
loss falls predictably with more compute, more params, more data.
Chinchilla rule (Hoffmann 2022):
with a fixed compute budget, train on ~20 tokens per parameter.
Together they explain why GPT-3 was undertrained
and why frontier labs after 2022 cite trillions of training tokens.
| Property | Detail |
| --- | --- |
| What a FLOP is | Floating-point operation: a single arithmetic operation, such as one multiplication or one addition, on floating-point (real-valued) numbers |
| Why this unit | Pretraining cost (dollars and time) scales with the total operation count |
| Order of magnitude | Training a frontier-class LLM takes on the order of 10^25 FLOPs |
| Rough rule | Training cost scales like O(parameters × tokens) |
| Watch for | FLOPs is a count; FLOP/s (with a slash) is a rate (operations per second). Same letters, different concepts. |
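
The rough rule is often made concrete as C ≈ 6 × parameters × tokens. The factor of 6 is a common approximation from the scaling-law literature, not something this cheatsheet states, so treat it as an assumption. The sketch below uses it, together with a purely hypothetical sustained throughput, to show how a FLOPs count and a FLOP/s rate combine into wall-clock time.

```python
# Rough sketch: training FLOPs (a count) vs. FLOP/s (a rate).
# Assumption: cost is about 6 FLOPs per parameter per training token.

FLOPS_PER_PARAM_PER_TOKEN = 6  # assumed constant, not taken from the cheatsheet


def training_flops(params: float, tokens: float) -> float:
    """Approximate total floating-point operations for one training run."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def training_days(total_flops: float, sustained_flop_per_s: float) -> float:
    """Wall-clock days = count / rate, converted from seconds to days."""
    seconds = total_flops / sustained_flop_per_s
    return seconds / 86_400


if __name__ == "__main__":
    # GPT-3-sized run: 175B parameters, 300B tokens.
    flops = training_flops(params=175e9, tokens=300e9)
    # Hypothetical cluster sustaining 1e18 FLOP/s (illustrative only).
    days = training_days(flops, sustained_flop_per_s=1e18)
    print(f"total FLOPs: {flops:.2e}")
    print(f"days at 1e18 FLOP/s: {days:.1f}")
```

Dividing the count by the rate is the only place the two units meet; confusing them gives answers that are off by a factor of seconds.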
| Knob | Empirical claim | Caveat |
| --- | --- | --- |
| More compute | Loss falls predictably | Smooth power-law curve, not flat |
| More parameters | Loss falls predictably | Same shape |
| More tokens | Loss falls predictably | Same shape |
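
A power law means every doubling of a knob multiplies the loss by the same factor, wherever you start on the curve. The sketch below illustrates that shape only; `scale` and `alpha` are made-up placeholder constants, not fitted values from Kaplan et al.

```python
# Illustrative power law: loss(x) = (scale / x) ** alpha.
# `scale` and `alpha` are hypothetical placeholders, not fitted values.


def power_law_loss(x: float, scale: float = 1e13, alpha: float = 0.08) -> float:
    """Loss as a smooth power law in one knob x (compute, params, or tokens)."""
    return (scale / x) ** alpha


def loss_ratio_per_doubling(alpha: float = 0.08) -> float:
    """Each doubling of x multiplies the loss by 2 ** -alpha."""
    return 2.0 ** -alpha


if __name__ == "__main__":
    for x in (1e9, 2e9, 4e9, 8e9):
        print(f"x={x:.0e}  loss={power_law_loss(x):.4f}")
    print(f"loss ratio per doubling: {loss_ratio_per_doubling():.4f}")
```

The constant ratio per doubling is what makes the curve predictable: it keeps falling, but each doubling buys a smaller absolute improvement than the last.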

Two practical takeaways from the Kaplan era:

| Recommendation | What it implied |
| --- | --- |
| Bigger is reliably better | Scale up parameters; performance keeps improving |
| Bigger is more sample-efficient per token | At equal data, a larger model extracts more signal per token |

Both implicitly assumed effectively unlimited data. That assumption is what Chinchilla tested.

Constraint: fixed compute budget
Optimization question: best balance of model size and training data?
Answer: ~20 training tokens per parameter
10B params → ~200B tokens
70B params → ~1.4T tokens
175B params → ~3.5T tokens
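
The three rows above come from a single multiplication. A minimal sketch, where `chinchilla_tokens` is a hypothetical helper name and 20 is the approximate ratio quoted above rather than an exact constant:

```python
# Token target under the ~20 tokens-per-parameter rule of thumb.

TOKENS_PER_PARAM = 20  # approximate ratio from the text; not an exact constant


def chinchilla_tokens(params: float) -> float:
    """Approximate compute-optimal training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * params


if __name__ == "__main__":
    for params in (10e9, 70e9, 175e9):
        tokens = chinchilla_tokens(params)
        print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.1f}T tokens")
```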

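The ratio stays roughly constant across the orders of magnitude of compute the paper tested. “Chinchilla” is the name of the 70B-parameter model the paper trained to demonstrate the result, and it has become the working name of the rule.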

| Paper | Question | Constraint | Answer |
| --- | --- | --- | --- |
| Kaplan (2020) | “What model size is most sample-efficient?” | Implicitly unlimited data | Bigger |
| Chinchilla (2022) | “What’s the compute-optimal split?” | Explicitly fixed compute | ~20 tokens per parameter |

Both rest on the same empirical scaling-law foundation but answer different optimization questions. The apparent conflict disappears once you see that each one optimizes a different target.
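
The fixed-compute question can be sketched as a small calculation. This assumes the common approximation C ≈ 6 × N × D (N parameters, D tokens), which the cheatsheet does not state, combined with the ~20 tokens-per-parameter ratio; `compute_optimal_split` is a hypothetical helper name.

```python
import math

# Sketch of the compute-optimal split under two assumptions:
#   1. C = k * N * D, with k = 6 FLOPs per parameter per token (assumed)
#   2. D = r * N, with r = 20 tokens per parameter (approximate rule)
# Substituting gives C = k * r * N**2, so N = sqrt(C / (k * r)).


def compute_optimal_split(
    budget_flops: float,
    tokens_per_param: float = 20.0,
    flops_per_param_token: float = 6.0,
) -> tuple[float, float]:
    """Return (params, tokens) that spend the whole budget at the target ratio."""
    params = math.sqrt(budget_flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    for budget in (5.88e23, 5.88e25):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Under these assumptions, parameters and tokens each grow with the square root of compute, so a 100× larger budget implies roughly 10× the parameters and 10× the tokens.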

GPT-3:
parameters: 175 billion
training tokens: 300 billion
actual ratio: 300B / 175B ≈ 1.7 tokens per parameter
Chinchilla target:
20 tokens per parameter
for 175B params, target = 175B × 20 = 3.5 trillion tokens
Verdict: GPT-3 received less than one-tenth (about 9%) of the data Chinchilla recommends.
"Really undertrained," in the lecturer's words.
| Phenomenon | What it tells you |
| --- | --- |
| Press release: “N billion parameters” | Half a description. Ask: how many tokens? |
| Press release: “trained on T trillion tokens” | The Chinchilla rebalancing showing up. The data side is no longer hidden. |
| Two models, same params, different feel | Tokens-per-parameter is one variable that explains some of the difference. |
| Pre-2022 frontier model with low token count | Probably data-undertrained by the Chinchilla standard. (The rule was published in 2022, so older teams were not ignoring it; they did not yet have it.) |
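
The "how many tokens?" question reduces to one ratio. A minimal sketch that checks a model against the ~20 tokens-per-parameter target, using the GPT-3 figures quoted earlier; the function names are hypothetical and the target ratio is the approximate rule, not an exact threshold.

```python
# Compare a model's tokens-per-parameter ratio with the ~20 target.


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def fraction_of_target(tokens: float, params: float, target_ratio: float = 20.0) -> float:
    """Share of the recommended token count the model actually received."""
    return tokens / (target_ratio * params)


if __name__ == "__main__":
    params, tokens = 175e9, 300e9  # GPT-3 figures quoted in this cheatsheet
    print(f"ratio: {tokens_per_param(tokens, params):.1f} tokens per parameter")
    print(f"share of target: {fraction_of_target(tokens, params):.1%}")
```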
| Pitfall | Reality |
| --- | --- |
| Scaling laws give a specific loss number | No. They predict the shape of the curve. The constants depend on architecture, data, tokenizer, and optimization setup. |
| Chinchilla means smaller models are always better | No. It says that for a fixed compute budget, a smaller model trained on more data is more efficient. With more compute, the optimal model is still larger; the data target grows in proportion. |
| FLOPs and FLOP/s are interchangeable | No. FLOPs is a count of operations performed; FLOP/s is the rate at which a GPU performs them. Use context to disambiguate. |
| Lower pretraining loss = better at my task | Correlated, not equal. Pretraining loss is next-token-prediction loss on the corpus distribution. Task-specific quality is downstream and depends on tuning and evaluation. |
  • FLOPs (lowercase s): floating-point operations, the count-based unit of pretraining cost. The all-caps spelling FLOPS is ambiguous: it often means operations per second, so check the context.
  • FLOP/s: floating-point operations per second; the rate at which hardware such as a GPU performs them.
  • Scaling laws: the empirical finding (Kaplan et al. 2020) that loss falls predictably with more compute, more parameters, and more training data, in smooth power-law curves.
  • Sample efficiency (in this context): how much loss reduction per training token a model produces. Bigger models are more sample-efficient per token (Kaplan).
  • Chinchilla rule: the compute-optimal balance of roughly 20 training tokens per model parameter (Hoffmann et al. 2022).
  • Compute-optimal: the model-and-data combination that minimizes loss for a fixed FLOPs budget.
  • Data-undertrained: a model trained on fewer tokens than the Chinchilla rule recommends for its parameter count. Most pre-2022 frontier models fit this description.
  • Pre-Chinchilla era: the period (roughly 2019-2022) before the Chinchilla rule was widely adopted, characterized by parameter-scaling outpacing data-scaling.

Scale produces predictable improvement.
Chinchilla pinned the optimal split at roughly 20 tokens per parameter.
Pre-Chinchilla pretraining spent its compute inefficiently: too many parameters, too few tokens.