Why scale matters: scaling laws and the Chinchilla rule

The previous lesson ended on one claim it did not justify: pretraining works because of scale. This lesson gives that claim its empirical foundation. The 2020 Kaplan paper (Scaling Laws for Neural Language Models) found that loss falls predictably with more compute, more parameters, and more training data, along smooth power-law curves. The 2022 Chinchilla paper added the constraint that nobody actually has unlimited data, and pinned the compute-optimal balance at roughly 20 tokens per parameter. Together those two results explain a confusing fact about the field: between 2019 and 2024, almost everyone scaled up parameters faster than data, leaving compute on the table. GPT-3 (175 billion parameters, 300 billion tokens, a 1.7-to-1 ratio where the rule says 20-to-1) is the worked example you finish with.
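The GPT-3 arithmetic above can be checked in a few lines. This is a minimal sketch, not a method from either paper: it uses only the figures quoted in this lesson (175 billion parameters, 300 billion tokens) and treats the 20-tokens-per-parameter rule as a rough rule of thumb. The function names are made up for illustration.

```python
# Rule-of-thumb ratio quoted in this lesson (an approximation, not an exact law).
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def compute_optimal_tokens(params: float,
                           ratio: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """Token budget the rule of thumb suggests for a model of this size."""
    return params * ratio


# GPT-3 figures as quoted in this lesson.
gpt3_params = 175e9
gpt3_tokens = 300e9

actual_ratio = tokens_per_param(gpt3_params, gpt3_tokens)
suggested_tokens = compute_optimal_tokens(gpt3_params)
shortfall = suggested_tokens / gpt3_tokens

print(f"actual tokens per parameter: {actual_ratio:.1f}")
print(f"suggested token budget: {suggested_tokens / 1e12:.1f} trillion")
print(f"suggested / actual tokens: {shortfall:.1f}x")
```

Under these assumptions the rule suggests about 3.5 trillion tokens for a 175-billion-parameter model, roughly 12 times what GPT-3 was trained on, which is what "data-undertrained" means here.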

This is lesson 2 of Phase 3, How models are trained at scale. Phase 3 builds toward describing what it takes to train a frontier model and why most organizations cannot. This lesson takes the previous lesson’s “pretraining works because of scale” claim and gives it the empirical machinery (FLOPs as the cost unit, the Kaplan scaling laws, the Chinchilla compute-optimal rule). The previous lesson in the phase was the Phase 3 opener on pretraining itself.

Prerequisites: the pretraining lesson (Phase 3 lesson 1). You should be comfortable with what next-token-prediction pretraining is and why it works. No math beyond reading exponents.

  • Describe scaling laws as the empirical finding that loss falls predictably with more compute, more parameters, and more training data
  • Explain how the Kaplan and Chinchilla results reconcile once you see them as answers to different optimization questions
  • Apply the 20-tokens-per-parameter Chinchilla rule to identify whether a real model was data-undertrained
  • Recognize why citing parameter count alone underdescribes a modern language model
  • Read time: about 20 minutes
  • Practice time: about 15 minutes (a worked compute-optimal calculation on a fresh model, plus flashcards)
  • Difficulty: standard