Scaling laws, in brief

What you’ll learn

Phase 3 opens with the bridge from “how do I build and run an LLM?” to “how good can it get for this budget?” The source curriculum is Stanford CS336, Lectures 9 and 11, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube and the course site is cs336.stanford.edu. Per Phase 0, the two scaling-laws lectures are collapsed into this single lesson.

You will learn the empirical power-law form of LLM loss as a function of compute, parameters, and data; the Kaplan-to-Chinchilla shift and the resulting rule of thumb D* ~= 20 * N* (about 20 training tokens per parameter at the compute-optimal point); how to compute a Chinchilla-optimal (N, D) from a compute budget C using lesson 2’s C ~= 6ND, where N is the parameter count and D is the number of training tokens; why inference cost pushes many modern open models past Chinchilla-optimal in practice; and how to judge architectural changes by whether they improve the scaling exponent, not just the prefactor at one size.

Where this fits

This is lesson 9 of 14, the first lesson of Phase 3 (scale, data, and alignment). It uses lesson 2’s compute accounting and lesson 3’s parameter sizing as inputs, and it sets up the rest of Phase 3 by reframing “how big and how long?” as an answerable budget calculation. The next lesson (evaluation) takes a critical look at how well the loss predicted on paper actually correlates with downstream capability.

Before you start

Prerequisites: lesson 2 (the 6ND training-compute estimate, used directly in the Chinchilla calculation) and lesson 3 (the parameter-count formula and the size dials this lesson allocates). Comfort with power laws (straight lines on a log-log plot) helps; the lesson explains them, but having the intuition already makes for a faster read.

About the math

Light but real. The calculations are arithmetic with large numbers: substituting D ~= 20N into C = 6ND gives C ~= 120 * N^2, a quadratic in N that you solve directly (N ~= sqrt(C/120), then D ~= 20N). The “power law” is described by what it predicts, not derived from physics; it is an empirical regularity that holds across runs.
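The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name chinchilla_optimal, the tokens_per_param parameter, and the 1e21 FLOP budget are assumptions chosen for the example, and the 20-tokens-per-parameter ratio is the rule of thumb, not an exact constant.

```python
import math


def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (N, D) satisfying C = 6*N*D with D = tokens_per_param * N.

    N is the parameter count, D is the number of training tokens.
    tokens_per_param defaults to the ~20 rule of thumb (an approximation).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r)); with r = 20 this is sqrt(C / 120).
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Hypothetical budget, chosen only to make the numbers easy to read.
budget = 1e21
n, d = chinchilla_optimal(budget)
print(f"N ~ {n:.3g} parameters, D ~ {d:.3g} tokens")
```

For this illustrative budget the result is roughly 2.9 billion parameters trained on roughly 58 billion tokens. Changing tokens_per_param is one way to explore what happens when a model is deliberately trained past the compute-optimal ratio.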

By the end, you’ll be able to

The single capability this lesson builds: explain what scaling laws are and how they guide compute, model-size, and data decisions. Concretely, you will be able to:

Describe the power-law form of LM scaling (loss vs compute, parameters, data)
Distinguish Kaplan from Chinchilla and state the D ~= 20N rule
Compute compute-optimal (N, D) from a budget using 6ND and Chinchilla
Explain why inference cost pushes models past Chinchilla-optimal
Judge architectural changes by whether they improve the scaling exponent
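The last objective in the list above, exponent versus prefactor, can be made concrete with a toy comparison. The sketch below assumes a simplified loss of the form L(C) = a * C^(-b) with no irreducible-loss term; the constants for both variants are invented for illustration and are not fitted to any real model.

```python
def power_law_loss(compute_flops, prefactor, exponent):
    """Simplified power law: loss = prefactor * compute ** (-exponent)."""
    return prefactor * compute_flops ** (-exponent)


# Hypothetical (prefactor, exponent) pairs; NOT measured values.
VARIANTS = {
    "lower prefactor": (9.5, 0.050),    # looks better at small scale
    "steeper exponent": (12.0, 0.055),  # improves faster as compute grows
}

for compute in (1e18, 1e21, 1e24):
    losses = {
        name: power_law_loss(compute, a, b) for name, (a, b) in VARIANTS.items()
    }
    best = min(losses, key=losses.get)
    summary = ", ".join(f"{name}: {loss:.3f}" for name, loss in losses.items())
    print(f"C = {compute:.0e} FLOPs -> {summary} (best: {best})")
```

With these made-up constants, the lower-prefactor variant wins at 1e18 FLOPs and the steeper-exponent variant wins at 1e24 FLOPs, which is why a comparison at a single small size can point the wrong way.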

Time and difficulty

Read time: about 14 minutes
Practice time: about 12 minutes (compute a Chinchilla-optimal split + inference-cost reasoning, plus flashcards)
Difficulty: deep (Stage C; arithmetic-heavy but no calculus; requires reasoning through compute and inference economics)