Scaling laws and Chinchilla: cheatsheet

The one idea that matters

Scaling laws (Kaplan 2020):
  loss falls predictably with more compute, more params, more data.

Chinchilla rule (Hoffmann 2022):
  with a fixed compute budget, train ~20 tokens per parameter.

Together they explain why GPT-3 was undertrained
and why frontier labs after 2022 cite trillions of training tokens.

FLOPs, the unit of pretraining cost

Property	Detail
What it is	Floating-point operation: a single multiplication or addition on floating-point numbers (the computer's approximation of real numbers)
Why this unit	Pretraining cost (dollars and time) scales with the total operation count
Order of magnitude	Training a frontier-class LLM is on the order of `10^25` FLOPs
Rough rule	Training cost scales like `O(parameters × tokens)`
Watch for	FLOPs is a count; FLOP/s (with a slash) is a rate (operations per second). Same letters, different concepts.
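
To make the count-versus-rate distinction concrete, here is a minimal sketch. It assumes the widely used approximation of about 6 FLOPs per parameter per training token, which is a rule of thumb rather than an exact figure, and the cluster throughput is a hypothetical placeholder, not a measurement.

```python
# Rough training-cost estimate. The factor 6 (FLOPs per parameter per token,
# forward plus backward pass) is a common approximation and an assumption here.
FLOPS_PER_PARAM_PER_TOKEN = 6


def training_flops(params: float, tokens: float) -> float:
    """Total operation COUNT (FLOPs) for one training run."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def training_days(total_flops: float, sustained_flop_per_s: float) -> float:
    """Wall-clock days given a sustained RATE (FLOP/s)."""
    seconds = total_flops / sustained_flop_per_s
    return seconds / 86_400


# GPT-3-sized run: 175B parameters, 300B tokens.
gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e} FLOPs")  # about 3.15e+23

# Hypothetical cluster sustaining 1e18 FLOP/s (illustrative only).
print(f"{training_days(gpt3_flops, 1e18):.1f} days")
```

The count (FLOPs) depends only on the model and the data; the rate (FLOP/s) depends only on the hardware. Dividing one by the other gives time.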

The Kaplan scaling laws (2020)

Knob	Empirical claim	Caveat
More compute	Loss falls predictably	Smooth power-law curve, not flat
More parameters	Loss falls predictably	Same shape
More tokens	Loss falls predictably	Same shape
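
The "smooth power-law" shape can be illustrated with a short sketch. The form loss = (x_c / x) ** alpha is the kind of curve the scaling-law literature fits; the constants below are made up for illustration and are not values from Kaplan et al.

```python
import math

# Hypothetical constants, chosen only to show the curve's shape.
X_C = 1.0e13   # scale constant (assumed)
ALPHA = 0.08   # power-law exponent (assumed)


def power_law_loss(x: float, x_c: float = X_C, alpha: float = ALPHA) -> float:
    """Loss as a power law in a resource x (compute, parameters, or tokens)."""
    return (x_c / x) ** alpha


def log_log_slope(x1: float, x2: float) -> float:
    """Slope of log(loss) against log(x) between two resource levels."""
    dy = math.log(power_law_loss(x2)) - math.log(power_law_loss(x1))
    dx = math.log(x2) - math.log(x1)
    return dy / dx


# Each 10x increase in x multiplies loss by the same factor, 10 ** -ALPHA,
# so the curve is a straight line on log-log axes with slope -ALPHA.
for x in (1e8, 1e9, 1e10, 1e11):
    print(f"x={x:.0e}  loss={power_law_loss(x):.3f}")
```

"Predictably" here means the slope on log-log axes stays constant, so a fit on small runs can be extrapolated to larger ones.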

Two practical takeaways from the Kaplan era:

Recommendation	What it implied
Bigger is reliably better	Scale up parameters; performance keeps improving
Bigger is more sample-efficient per token	At equal data, a larger model extracts more signal per token

Both implicitly assumed effectively unlimited data. That assumption is what Chinchilla tested.

The Chinchilla rule (2022)

Constraint:           fixed compute budget
Optimization question: best balance of model size and training data?
Answer:                ~20 training tokens per parameter

  10B params  →  ~200B tokens
  70B params  →  ~1.4T tokens
  175B params →  ~3.5T tokens

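The rule holds approximately across orders of magnitude of model size. “Chinchilla” is the name of the 70B-parameter model that Hoffmann et al. trained on about 1.4T tokens to demonstrate it; the name has since stuck to the rule itself.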
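
A minimal sketch of the rule as arithmetic. The first function is just the ~20 tokens-per-parameter ratio. The second inverts it for a given budget, and it additionally assumes the common approximation compute ≈ 6 × parameters × tokens, which is a rule of thumb and not part of the Chinchilla rule itself.

```python
import math

TOKENS_PER_PARAM = 20          # the approximate Chinchilla ratio
FLOPS_PER_PARAM_PER_TOKEN = 6  # assumed cost approximation


def chinchilla_tokens(params: float) -> float:
    """Training tokens recommended for a model of the given size."""
    return TOKENS_PER_PARAM * params


def compute_optimal_split(budget_flops: float) -> tuple[float, float]:
    """Return (params, tokens) for a budget, under both assumptions above.

    budget = 6 * N * D and D = 20 * N  =>  N = sqrt(budget / 120).
    """
    params = math.sqrt(
        budget_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    return params, chinchilla_tokens(params)


for n in (10e9, 70e9, 175e9):
    print(f"{n / 1e9:.0f}B params -> {chinchilla_tokens(n) / 1e12:.1f}T tokens")
# 10B params -> 0.2T tokens
# 70B params -> 1.4T tokens
# 175B params -> 3.5T tokens
```

Because both parameters and tokens grow with the square root of the budget under these assumptions, a 100x larger budget implies a model about 10x larger trained on about 10x more tokens.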

Kaplan vs Chinchilla, the reconciliation

Paper	Question	Constraint	Answer
Kaplan (2020)	“What model size is most sample-efficient?”	Implicitly unlimited data	Bigger
Chinchilla (2022)	“What’s the compute-optimal split?”	Explicitly fixed compute	~20 tokens per parameter

Same empirical scaling-law foundation, different optimization questions. The two results reconcile once you notice that they optimize for different targets.

The GPT-3 worked check

GPT-3:
  parameters: 175 billion
  training tokens: 300 billion
  actual ratio: 300B / 175B ≈ 1.7 tokens per parameter

Chinchilla target:
  20 tokens per parameter
  for 175B params, target = 175 × 20 = 3.5 trillion tokens
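
The check above can be reproduced in a few lines. This is a sketch using only the figures quoted here; the function name and the returned keys are illustrative choices.

```python
TOKENS_PER_PARAM_TARGET = 20  # approximate Chinchilla ratio


def undertraining_report(params: float, tokens: float) -> dict[str, float]:
    """Compare a model's actual tokens-per-parameter with the ~20x target."""
    actual_ratio = tokens / params
    target_tokens = TOKENS_PER_PARAM_TARGET * params
    return {
        "actual_ratio": actual_ratio,
        "target_tokens": target_tokens,
        "fraction_of_target": tokens / target_tokens,
    }


gpt3 = undertraining_report(params=175e9, tokens=300e9)
print(f"ratio: {gpt3['actual_ratio']:.1f} tokens per parameter")  # 1.7
print(f"target: {gpt3['target_tokens'] / 1e12:.1f}T tokens")      # 3.5T
print(f"fraction of target: {gpt3['fraction_of_target']:.1%}")    # 8.6%
```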

Verdict: GPT-3 received less than one-tenth (about 9%) of the data Chinchilla recommends.
"Really undertrained," in the lecturer's words.

Why this matters when you use AI

Phenomenon	What it tells you
Press release: “N billion parameters”	Half a description. Ask: how many tokens?
Press release: “trained on T trillion tokens”	The Chinchilla rebalancing showing up. The data side is no longer hidden.
Two models, same params, different feel	Tokens-per-parameter is one variable that explains some of the difference.
Pre-2022 frontier model with low token count	Probably data-undertrained by the Chinchilla rule. (The rule was published in 2022, so earlier teams were not ignoring it; they did not yet have it.)

Pitfalls to dodge

Pitfall	Reality
Scaling laws give a specific loss number	No, they predict the curve shape. Constants depend on architecture, data, tokenizer, optimization setup.
Chinchilla means smaller models are always better	No. It says that for a fixed compute budget, a smaller model trained on more data is more efficient than the Kaplan-era allocation. With more compute, the optimal model is still larger; the data target grows in proportion.
FLOPs and FLOP/s are interchangeable	No. FLOPs is a count of operations performed; FLOP/s is the rate at which hardware performs them. Use context to disambiguate.
Lower pretraining loss = better at my task	Correlated, not equal. Pretraining loss is next-token-prediction loss on the corpus distribution. Task-specific quality is downstream and depends on tuning + evaluation.

Glossary

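FLOPs (lowercase s): floating-point operations, the count-based unit of pretraining cost. FLOPS with a capital S usually denotes the per-second rate instead, so check the context when you see it.
FLOP/s: floating-point operations per second; the rate at which hardware such as a GPU performs them.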
Scaling laws: the empirical finding (Kaplan et al. 2020) that loss falls predictably with more compute, more parameters, and more training data, in smooth power-law curves.
Sample efficiency (in this context): how much loss reduction per training token a model produces. Bigger models are more sample-efficient per token (Kaplan).
Chinchilla rule: the compute-optimal balance of roughly 20 training tokens per model parameter (Hoffmann et al. 2022).
Compute-optimal: the model-and-data combination that minimizes loss for a fixed FLOPs budget.
Data-undertrained: a model trained on fewer tokens than the Chinchilla rule recommends for its parameter count. Most pre-2022 frontier models fit this description.
Pre-Chinchilla era: the period (roughly 2019-2022) before the Chinchilla rule was widely adopted, characterized by parameter-scaling outpacing data-scaling.

Scale produces predictable improvement.
Chinchilla pinned the optimal split at roughly 20 tokens per parameter.
Pre-Chinchilla pretraining left performance on the table for the compute it spent.