Summary: Scaling laws

Phase 3 opens with the bridge from “how do I build and run an LLM” to “how good can it get for this budget?” Scaling laws are the answer: training loss falls as a power law in compute, parameters, and data, tracing remarkably straight lines on log-log axes across orders of magnitude. Kaplan et al. (2020) first formalized this and recommended spending compute disproportionately on parameters (the GPT-3 era). Chinchilla (Hoffmann et al., 2022) corrected it: model size and data should scale roughly equally, with a rule of thumb of D* ~= 20 * N*, about 20 training tokens per parameter. Chinchilla itself (70B parameters, 1.4T tokens) used the same compute as the much larger Gopher (280B) and beat it, and it also outperformed GPT-3; the field shifted accordingly. Inference cost has since pushed many modern open models past Chinchilla-optimal, over-training a smaller model so it is cheaper to serve. The laws let you pick (N, D) from a budget, predict performance before spending, sanity-check runs, and judge architectural changes by whether they improve the exponent. This is the scan version; the lesson works the numbers.

Core ideas

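Power-law form. loss ~ C^(-alpha_C), loss ~ N^(-alpha_N), loss ~ D^(-alpha_D), each holding when the other quantities are not the bottleneck. These are straight lines on log-log axes: doubling a quantity multiplies loss by a fixed, predictable factor of 2^(-alpha).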
Kaplan to Chinchilla. Kaplan et al. (2020) recommended a big model on moderate data (the GPT-3 recipe). Chinchilla (2022) found that N* and D* both scale roughly as C^0.5, giving the rule of thumb D* ~= 20 * N*. At the same compute, a smaller model trained on more data wins: Chinchilla (70B) beat the larger Gopher (280B).
The rule for (N, D) from a compute budget. Set 6ND ~= C and D / N ~= 20. Substituting D = 20N gives 120 * N^2 ~= C, so N ~= sqrt(C / 120) and D ~= 20N. This is the Chinchilla-optimal starting point.
Over-training for inference economics. Modern open models often go past Chinchilla-optimal on data: a smaller model trained longer is cheaper to serve, trading extra upfront training compute for per-token inference savings.
Predictive power. Fit at small and medium scales, then extrapolate; judge architecture changes by whether they improve the exponent, not by a win at a single size; sanity-check runs in flight against the predicted curve.
Honest limits. The laws predict cross-entropy loss, not benchmark scores (capability sometimes appears in jumps); they assume clean data; the exponents have been re-fit before and may be again; far extrapolation is a strong claim.
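
The (N, D) rule above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive recipe: the function name chinchilla_optimal and the tokens_per_param knob are illustrative names, and the over-training ratio of 100 tokens per parameter is a hypothetical value chosen only to show the direction of the adjustment. The compute budget used is the one implied by Chinchilla's own 70B parameters and 1.4T tokens under C ~= 6ND.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # from the approximation C ~= 6 * N * D
CHINCHILLA_RATIO = 20      # rule of thumb: about 20 tokens per parameter


def chinchilla_optimal(compute_flops, tokens_per_param=CHINCHILLA_RATIO):
    """Solve 6*N*D = C together with D = tokens_per_param * N.

    Returns (n_params, n_tokens). With the default ratio of 20 this is
    N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget implied by Chinchilla itself: 6 * 70B params * 1.4T tokens.
budget = 6 * 70e9 * 1.4e12

n_opt, d_opt = chinchilla_optimal(budget)
print(f"Chinchilla-optimal: N = {n_opt:.2e} params, D = {d_opt:.2e} tokens")

# Hypothetical over-training: same budget, 100 tokens per parameter.
# The ratio 100 is an assumption for illustration, not a recommendation.
n_over, d_over = chinchilla_optimal(budget, tokens_per_param=100)
print(f"Over-trained:       N = {n_over:.2e} params, D = {d_over:.2e} tokens")
```

Fed Chinchilla's own budget, the default call recovers roughly 70B parameters and 1.4T tokens, which is a quick check that the two equations are being solved consistently. Raising tokens_per_param at a fixed budget shrinks N and grows D, which is the over-training trade described above.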

What changes for you

This lesson is the bridge that turns architecture + cost accounting + systems into an answerable budget question. With Chinchilla’s D ~= 20N and lesson 2’s 6ND ~= C, “how big should I build it” becomes a one-line calculation, and the inference-cost adjustment explains why real open models often train past that line. It also reframes the architecture work: a new optimizer, attention variant, or normalization is only a meaningful win if it improves the scaling exponent; constant-factor improvements at one scale are weaker evidence. Carry that exponent-first instinct into every “we beat baseline” claim you read. The next lesson, evaluation, takes the same critical eye to what the loss-on-paper is actually measuring.
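
The exponent-first instinct can be made concrete with a small fit. The sketch below fits a straight line in log-log space and compares a constant-factor improvement against an exponent improvement. Everything in it is assumed for illustration: the helper names fit_power_law and predict are hypothetical, the loss values are synthetic and generated from made-up prefactors and exponents, and the pure power-law form ignores any irreducible-loss term that a real fit would likely need.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares line through (log x, log loss).

    Returns (alpha, a) for the model loss ~= a * x ** (-alpha).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


def predict(x, alpha, a):
    """Extrapolate the fitted power law to a new scale x."""
    return a * x ** (-alpha)


# SYNTHETIC data: compute budgets (FLOPs) and losses from invented curves.
compute = [1e17, 1e18, 1e19, 1e20]
baseline = [10.0 * c ** -0.050 for c in compute]
const_variant = [9.0 * c ** -0.050 for c in compute]    # same exponent, lower prefactor
expo_variant = [11.0 * c ** -0.055 for c in compute]    # better exponent, higher prefactor

fits = {
    "baseline": fit_power_law(compute, baseline),
    "constant-factor variant": fit_power_law(compute, const_variant),
    "exponent variant": fit_power_law(compute, expo_variant),
}

target = 1e24  # far beyond the fitted range, so treat the numbers as a strong claim
for name, (alpha, a) in fits.items():
    print(f"{name}: alpha = {alpha:.4f}, predicted loss at 1e24 = "
          f"{predict(target, alpha, a):.4f}")
```

In this synthetic setup the constant-factor variant has the lowest loss at the smallest budget, yet the exponent variant is predicted to be ahead once the curves are extrapolated. A single-scale comparison would have picked the wrong one, which is why a "we beat baseline" claim is stronger when it reports fitted exponents across several scales.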

Scaling laws turn “how big” from folklore into arithmetic: 6ND ~= C and D / N ~= 20 give the Chinchilla starting point, adjusted upward in tokens when inference cost is a serious factor. The rest of Phase 3 refines what counts as “good” in those equations.