Cheatsheet: Scaling laws
The empirical form
- loss(compute) ~ C^(-alpha_C)
- loss(parameters) ~ N^(-alpha_N)
- loss(data) ~ D^(-alpha_D)

Each is a straight line on a log-log plot; doubling any one input reduces loss by a predictable, modest factor.
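To make "predictable, modest factor" concrete: under a pure power law, scaling the input by a factor k multiplies the loss by k^(-alpha). The sketch below is a minimal illustration; the exponent values are placeholders chosen for the example, not fitted values from this cheatsheet.

```python
def loss_ratio_after_scaling(alpha: float, factor: float = 2.0) -> float:
    """Multiplicative change in loss when the input grows by `factor`,
    assuming loss ~ x^(-alpha) with no irreducible term."""
    return factor ** (-alpha)


# Illustrative exponents (assumed for this example, not a specific fit).
for alpha in (0.05, 0.10):
    ratio = loss_ratio_after_scaling(alpha)
    print(f"alpha={alpha:.2f}: doubling multiplies loss by {ratio:.4f} "
          f"({(1 - ratio) * 100:.1f}% lower)")
```

With exponents in this range, a doubling buys a loss reduction of only a few percent, which is why the gains are described as modest.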
Kaplan -> Chinchilla shift
| | Kaplan (2020) | Chinchilla (2022) |
|---|---|---|
| Recommendation | Spend on parameters; moderate data | N and D scale equally with compute |
| Rule of thumb | Big model, moderate tokens | D* ~= 20 * N* (~20 tokens per parameter) |
| Canonical model | GPT-3: 175B / 300B tokens | Chinchilla: 70B / 1.4T tokens |
| Same-compute outcome | | Chinchilla (70B) beat the larger Gopher (280B); also outperformed GPT-3 |
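The "Canonical model" row can be checked with a few lines of arithmetic. This sketch computes tokens per parameter for the two models in the table and a rough training-compute estimate using the 6 * N * D approximation used elsewhere in this cheatsheet; the function names are hypothetical.

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Ratio D / N for a given model size and training-token count."""
    return n_tokens / n_params


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, assuming C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) as listed in the table above.
models = {"GPT-3": (175e9, 300e9), "Chinchilla": (70e9, 1.4e12)}

for name, (n, d) in models.items():
    print(f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
          f"~{approx_train_flops(n, d):.2e} training FLOPs")
```

GPT-3 lands below 2 tokens per parameter, while Chinchilla sits at the 20 tokens per parameter given in the rule-of-thumb row.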
Budget calculation (compute-optimal)
Given budget C:

- 6 * N * D = C (from lesson 2)
- D / N = 20 (Chinchilla)
- Therefore N ~= sqrt(C / 120) and D ~= 20 * N

Example: C = 6e22 FLOPs -> N^2 = 5e20 -> N ~= 22B, D ~= 440B tokens.
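The budget calculation above can be sketched as a small helper. The function name and keyword arguments are hypothetical; the constants 6 and 20 come from the two relations above.

```python
import math


def compute_optimal(c_flops: float,
                    tokens_per_param: float = 20.0,
                    flops_per_param_token: float = 6.0) -> tuple[float, float]:
    """Solve 6 * N * D = C together with D = 20 * N for (N, D)."""
    if c_flops <= 0:
        raise ValueError("compute budget must be positive")
    n = math.sqrt(c_flops / (flops_per_param_token * tokens_per_param))
    d = tokens_per_param * n
    return n, d


n, d = compute_optimal(6e22)
print(f"N ~= {n / 1e9:.1f}B params, D ~= {d / 1e9:.0f}B tokens")
```

For C = 6e22 this gives roughly 22.4B parameters and 447B tokens, consistent with the rounded figures in the worked example.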
Inference-cost adjustment (over-training)
- Chinchilla minimizes: training loss for a training budget
- Reality minimizes: total cost = training + inference-over-lifetime

For models that will serve many inference tokens:
- Push past Chinchilla-optimal on data (over-train).
- Same final loss with a smaller model.
- Pay more upfront training compute; save on every served token.
- Most modern open models in the 7-14B range are over-trained for exactly this reason.
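The trade-off in the list above can be illustrated numerically. The sketch below rests on loud assumptions: the loss surface `E + A / N**a + B / D**a` and all of its constants are invented for illustration (B is chosen so that D / N = 20 is exactly compute-optimal under 6 * N * D = C), and inference is assumed to cost about 2 * N FLOPs per served token. It is not a fitted scaling law.

```python
import math

# Hypothetical loss surface; every constant here is made up for illustration.
E, A, EXP = 1.7, 400.0, 0.3
B = A * 20.0 ** EXP  # makes D / N = 20 the compute-optimal ratio


def loss(n: float, d: float) -> float:
    return E + A / n ** EXP + B / d ** EXP


def tokens_for_loss(n: float, target_loss: float) -> float:
    """Training tokens a model of size n needs to reach target_loss."""
    remaining = target_loss - E - A / n ** EXP
    if remaining <= 0:
        raise ValueError("model too small to reach this loss at any D")
    return (B / remaining) ** (1.0 / EXP)


def lifetime_flops(n: float, d: float, served_tokens: float) -> float:
    """Training (6 * N * D) plus inference (assumed ~2 * N per served token)."""
    return 6.0 * n * d + 2.0 * n * served_tokens


# Compute-optimal reference point for C = 6e22.
n_opt = math.sqrt(6e22 / 120.0)
d_opt = 20.0 * n_opt
target = loss(n_opt, d_opt)

# Half-size model, over-trained until it matches the same loss.
n_small = n_opt / 2.0
d_small = tokens_for_loss(n_small, target)

for served in (0.0, 1e12, 1e13):
    big = lifetime_flops(n_opt, d_opt, served)
    small = lifetime_flops(n_small, d_small, served)
    print(f"served={served:.0e}: optimal={big:.2e}, over-trained={small:.2e}")
```

Under these assumptions the smaller model costs more to train but becomes cheaper overall once enough tokens have been served; where the break-even falls depends entirely on the assumed constants.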
What scaling laws let you do
| Use | How |
|---|---|
| Pick (N, D) | Apply the Chinchilla rule + 6ND = C |
| Predict loss before spending | Fit on small/medium, extrapolate |
| Sanity-check a run | If loss diverges from the fit, look at data/optimizer/code |
| Judge architecture changes | A real win improves the exponent, not just the prefactor |
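The "fit on small/medium, extrapolate" and "sanity-check a run" rows can be sketched with an ordinary least-squares fit in log-log space. The data below is synthetic, generated from a known power law so the fit can be checked; the helper names and the 5% tolerance are assumptions for the example. A pure power law with no irreducible-loss term is assumed.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x); returns (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope


def predict(a: float, alpha: float, x: float) -> float:
    return a * x ** (-alpha)


def diverges(observed: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Flag a run whose loss is more than rel_tol away from the fitted curve."""
    return abs(observed - expected) / expected > rel_tol


# Synthetic small/medium runs from a known law (a = 30, alpha = 0.05).
compute = [1e17, 1e18, 1e19, 1e20]
losses = [30.0 * c ** -0.05 for c in compute]

a, alpha = fit_power_law(compute, losses)
predicted = predict(a, alpha, 1e22)  # extrapolate two orders of magnitude
print(f"fitted a={a:.2f}, alpha={alpha:.3f}, predicted loss at 1e22: {predicted:.3f}")
print("run looks off:", diverges(observed=predicted * 1.2, expected=predicted))
```

On real runs the points will be noisy, so the tolerance and the range of scales used for the fit are judgment calls rather than fixed values.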
Limits
- Predict cross-entropy loss, not downstream task performance.
- Assume clean data; bad data shifts the curves.
- Exponents have been re-fit (Kaplan -> Chinchilla -> further refinement).
- Extrapolating far beyond the fitted range is a strong claim; re-check the fit at very large scale.
Words to use precisely
- Power law: y proportional to x^(-alpha); straight on log-log.
- Compute-optimal: the `(N, D)` minimizing loss for a fixed compute budget.
- Chinchilla-optimal: the specific compute-optimal answer (`D ~ 20N`).
- Over-training: training past compute-optimal on data, trading training cost for inference savings.
- Exponent vs prefactor: a scaling-law fit has both; only an exponent improvement survives scaling.
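The exponent-versus-prefactor distinction can be made concrete by comparing two hypothetical variants of a baseline `loss = a * C^(-alpha)`: one that lowers the prefactor and one that raises the exponent. All numbers below are invented for illustration, and compute is in arbitrary units.

```python
def power_law(a: float, alpha: float, c: float) -> float:
    return a * c ** (-alpha)


def crossover_compute(a1: float, alpha1: float, a2: float, alpha2: float) -> float:
    """Compute C* at which a1 * C^(-alpha1) equals a2 * C^(-alpha2)."""
    if alpha1 == alpha2:
        raise ValueError("equal exponents: the curves never cross")
    return (a1 / a2) ** (1.0 / (alpha1 - alpha2))


# Hypothetical variants of a baseline with a = 30, alpha = 0.05.
prefactor_win = (27.0, 0.05)   # 10% lower prefactor, same exponent
exponent_win = (30.0, 0.055)   # same prefactor, slightly better exponent

c_star = crossover_compute(*prefactor_win, *exponent_win)
print(f"curves cross at C ~= {c_star:.2e}")

for c in (c_star / 100, c_star * 100):
    lp = power_law(*prefactor_win, c)
    lx = power_law(*exponent_win, c)
    print(f"C={c:.2e}: prefactor variant={lp:.3f}, exponent variant={lx:.3f}")
```

Below the crossover the prefactor variant has the lower loss; above it the exponent variant wins, and its advantage keeps growing with compute.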
Source
- Stanford CS336, Lectures 9 and 11 (Scaling laws), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.