Summary: Why scale matters: scaling laws and Chinchilla
Pretraining works because of scale. That was the load-bearing claim of the previous lesson, and this lesson is its justification. Two papers, taken together, explain why scale specifically is what makes next-token prediction produce general capability: Kaplan et al. 2020 established that loss falls predictably with more compute, more parameters, and more training data; Hoffmann et al. 2022 (the Chinchilla paper) added the constraint of a fixed compute budget (treating data as something you actually have to allocate, not assume infinite) and pinned the compute-optimal balance at roughly 20 tokens per parameter. Together they explain why the 2019-2024 era of “just build bigger” was leaving compute on the table, and why GPT-3 (175 billion parameters, 300 billion tokens, a 1.7-to-1 ratio) was undertrained by the new standard.
This summary is the scan-it-in-five-minutes version. The full lesson covers FLOPs as the cost unit, the Kaplan scaling laws, the Kaplan-vs-Chinchilla reconciliation, and the GPT-3 worked check.
Core ideas
- FLOPs is the unit of pretraining cost. Floating-point operations: how many multiplications and additions over decimal numbers the training loop has to perform. Training a frontier-class LLM is on the order of 10^25 FLOPs, the lecturer’s anchor for “extraordinary.” (FLOPs is a count; FLOP/s with a slash is a rate, the operations-per-second a GPU can perform. Easy to confuse.)
- FLOPs scales like the product of parameters and tokens. Doubling either makes training roughly twice as expensive. Once you have a fixed budget, you have to decide how to split it between bigger model and more data.
- Scaling laws are an empirical finding (Kaplan et al. 2020). Three claims, all “predictably” qualified. (1) More compute, less loss. (2) More parameters, less loss. (3) More training tokens, less loss. The relationships are smooth power laws, not flat lines or cliffs.
- The smoothness was the surprise. Before this paper, scaling neural networks was a hope-it-works activity. After this paper, scaling became closer to engineering: double the parameters and you can predict roughly how much loss falls.
- Two practical recommendations flowed from scaling laws. Bigger is reliably better (the 2019-2024 framing) and bigger is more sample-efficient per token (Kaplan’s observation that for an equal amount of training data, a larger model gets more out of it). Both recommendations implicitly assumed effectively unlimited data.
- Chinchilla rebalanced once data was treated as finite. Hoffmann et al. 2022 fixed compute budgets and found the optimal model-and-data combination within each budget. The result: roughly 20 training tokens per parameter is compute-optimal across orders of magnitude.
- The two papers do not contradict each other. Kaplan answers “with unlimited data, what’s the most sample-efficient model size?” and gets bigger. Chinchilla answers “with a fixed compute budget, what’s the optimal split?” and gets 20 tokens per parameter. Same scaling-laws empirical foundation, different optimization questions.
- GPT-3 is the worked undertrained example. 175 billion parameters and 300 billion tokens makes a 1.7-to-1 ratio. The Chinchilla rule says it should have been around 20-to-1, or about 3.5 trillion tokens for that parameter count (nearly twelve times the data it actually received). The lecturer calls GPT-3 “really undertrained” by this standard.
- The pre-Chinchilla era left compute on the table. Most large pretraining runs of that period had similar problems to GPT-3: too many parameters relative to training data. The same compute budget could have produced a better model with smaller parameter counts and much larger training sets.
- “Y billion parameters” is half a description. Tokens-per-parameter is the other half. A 70B-parameter model trained on 1.4 trillion tokens (Chinchilla-aligned) is not the same as a 70B-parameter model trained on 300 billion tokens. The shorthand of citing parameter count alone hides the data-side variable.
- Modern releases reflect the Chinchilla rebalancing. Many post-Chinchilla frontier-model releases now cite training-token counts in the trillions (a much larger figure than the pre-Chinchilla norm). The data side is no longer hidden.
- Pitfall: scaling laws guarantee curve shape, not a specific loss number. They predict the smooth-power-law shape, not the exact constant. The constants depend on architecture, data, tokenizer, and optimization setup.
- Pitfall: Chinchilla does not say “smaller is always better.” Chinchilla says that for a given compute budget, splitting into a smaller model with more data is more efficient than the reverse. With more compute, the optimal model is still larger; the data target just grows in proportion.
- Pitfall: lower pretraining loss is correlated with capability, not equal to it. Pretraining loss is next-token-prediction loss on text from the corpus distribution. Task-specific quality is downstream and depends on tuning and evaluation choices.
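The budget-splitting idea in the list above can be sketched in a few lines of Python. This is a rough illustration, not the method of either paper: the summary only says that FLOPs scale like parameters times tokens, so the constant `6` used here is an assumption (a commonly quoted rule of thumb for training cost), and the function names `training_flops` and `compute_optimal_split` are hypothetical. The 20-tokens-per-parameter ratio is the figure quoted above.

```python
import math

# Assumption: training cost ~ 6 * params * tokens. The factor 6 is a common
# rule of thumb and is NOT stated in this summary; only the proportionality is.
FLOPS_PER_PARAM_TOKEN = 6
# The compute-optimal ratio quoted in this summary.
TOKENS_PER_PARAM = 20


def training_flops(params: float, tokens: float) -> float:
    """Approximate training cost, proportional to params * tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(budget_flops: float) -> tuple[float, float]:
    """Split a fixed budget so that tokens = 20 * params.

    Substituting tokens = r * params into C = k * params * tokens gives
    C = k * r * params**2, so params = sqrt(C / (k * r)).
    """
    params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


# GPT-3's figures from this summary: 175B parameters, 300B tokens.
gpt3_budget = training_flops(175e9, 300e9)
opt_params, opt_tokens = compute_optimal_split(gpt3_budget)

print(f"GPT-3 budget under the assumed constant: {gpt3_budget:.2e} FLOPs")
print(f"20:1 split of that budget: {opt_params / 1e9:.0f}B params, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions, the same budget at a 20-to-1 ratio lands at roughly 51 billion parameters and about 1 trillion tokens, which is the "smaller model, more data" trade the list describes. Doubling the budget does not double the model: because cost grows with the square of the parameter count at a fixed ratio, both parameters and tokens grow by a factor of about 1.4.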
What changes for you
When you read about a model “with N billion parameters,” you now ask the second question: how many training tokens? When you read a model release citing trillions of training tokens, you recognize the Chinchilla rebalancing in production. When two models with similar parameter counts feel very different in capability, you have one variable (tokens-per-parameter at training time) that explains some of the difference. The next lesson takes up the question of how to actually run a Chinchilla-aligned pretraining run on real hardware: parallelism, ZeRO, Flash Attention, and the engineering tricks that make trillions of tokens of training tractable in practice.
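Asking "how many training tokens?" can be made into a quick habit. The sketch below is only an illustration: the helper names `tokens_per_parameter` and `describe` are hypothetical, and the model figures are the ones used as examples in this summary, not measurements of any particular release.

```python
# The compute-optimal ratio quoted in this summary.
CHINCHILLA_RATIO = 20


def tokens_per_parameter(params: float, tokens: float) -> float:
    """The 'other half' of a model description: training tokens per parameter."""
    return tokens / params


def describe(name: str, params: float, tokens: float) -> str:
    """Report the ratio and how much of the 20:1 data target was actually used."""
    ratio = tokens_per_parameter(params, tokens)
    target_tokens = CHINCHILLA_RATIO * params
    share = tokens / target_tokens
    return f"{name}: {ratio:.1f} tokens/param, {share:.0%} of the 20:1 data target"


# Example figures taken from this summary.
print(describe("GPT-3 (175B, 300B tokens)", 175e9, 300e9))
print(describe("70B on 1.4T tokens", 70e9, 1.4e12))
print(describe("70B on 300B tokens", 70e9, 300e9))
```

The two 70B entries share a parameter count but sit at very different points relative to the 20-to-1 target, which is why parameter count alone is only half a description.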
Scale produces predictable improvement.
Chinchilla pinned the optimal split at 20 tokens per parameter.
Pre-Chinchilla pretraining left compute on the table.