Why scale matters: scaling laws and the Chinchilla rule
What you’ll learn
The previous lesson ended on one claim it did not justify: pretraining works because of scale. This lesson gives that claim its empirical foundation. The 2020 Kaplan paper (Scaling Laws for Neural Language Models) found that loss falls predictably with more compute, more parameters, and more training data, along smooth power-law curves. The 2022 Chinchilla paper added the constraint that nobody actually has unlimited data, and pinned the compute-optimal balance at roughly 20 tokens per parameter. Together those two results explain a confusing fact about the field: between 2019 and 2024, almost everyone scaled up parameters faster than data, leaving compute on the table. GPT-3 is the worked example you finish with: 175 billion parameters trained on 300 billion tokens, or about 1.7 tokens per parameter where the rule says about 20.
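The GPT-3 arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reproduction of either paper's method: it treats the 20-tokens-per-parameter figure as a fixed rule of thumb, and the function and variable names are invented here for the example.

```python
# Rule of thumb from the Chinchilla result: about 20 training tokens per parameter.
# Treating it as an exact constant is a simplifying assumption for this sketch.
TOKENS_PER_PARAM = 20


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Ratio of training tokens to model parameters."""
    return tokens / params


def compute_optimal_tokens(params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Training tokens the rule of thumb suggests for a model of this size."""
    return params * ratio


# GPT-3 figures as quoted in this lesson.
gpt3_params = 175e9
gpt3_tokens = 300e9

ratio = tokens_per_parameter(gpt3_params, gpt3_tokens)
target = compute_optimal_tokens(gpt3_params)
shortfall = target / gpt3_tokens

print(f"actual ratio: {ratio:.1f} tokens per parameter")
print(f"rule-of-thumb target: {target / 1e12:.1f} trillion tokens")
print(f"shortfall factor: {shortfall:.1f}x")
```

Under these assumptions the rule suggests about 3.5 trillion tokens for a 175-billion-parameter model, roughly 12 times what GPT-3 was trained on. That gap is what "data-undertrained" means in the rest of this lesson.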
Where this fits
This is lesson 2 of Phase 3, How models are trained at scale. Phase 3 builds toward describing what it takes to train a frontier model and why most organizations cannot. This lesson takes the claim that pretraining works because of scale and gives it its empirical machinery: FLOPs as the cost unit, the Kaplan scaling laws, and the Chinchilla compute-optimal rule. The previous lesson was the Phase 3 opener on pretraining itself.
Before you start
Prerequisites: the pretraining lesson (Phase 3 lesson 1). You should be comfortable with what next-token-prediction pretraining is and why it works. No math beyond reading exponents.
By the end, you’ll be able to
- Describe scaling laws as the empirical finding that loss falls predictably with more compute, more parameters, and more training data
- Explain how the Kaplan and Chinchilla results reconcile once you see them as answers to different optimization questions
- Apply the 20-tokens-per-parameter Chinchilla rule to identify whether a real model was data-undertrained
- Recognize why citing parameter count alone underdescribes a modern language model
Time and difficulty
- Read time: about 20 minutes
- Practice time: about 15 minutes (a worked compute-optimal calculation on a fresh model, plus flashcards)
- Difficulty: standard