Scaling laws: predicting what bigger gets you

Phase 3 opens by bridging from “how do I build and run an LLM?” to “how good can it get for this budget?” The source curriculum is Stanford CS336, Lectures 9 and 11, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube and the course site is cs336.stanford.edu. Per Phase 0, the two scaling-laws lectures are collapsed into this one lesson.

You will learn the empirical power-law form of LLM loss (in compute, parameters, and data); the Kaplan-to-Chinchilla shift and the resulting D* ~= 20 * N* rule of thumb; how to compute a Chinchilla-optimal (N, D) from a compute budget using lesson 2’s 6ND; why inference cost pushes many modern open models past Chinchilla-optimal in practice; and how to judge architectural changes by whether they improve the scaling exponent, not just the prefactor at one size.

This is lesson 9 of 14, the first lesson of Phase 3 (scale, data, and alignment). It uses lesson 2’s compute accounting and lesson 3’s parameter sizing as inputs, and it sets up the rest of Phase 3 by reframing “how big and how long?” as an answerable budget calculation. The next lesson (evaluation) takes a critical look at how well loss on paper actually correlates with downstream capability.

Prerequisites: lesson 2 (the 6ND training-compute estimate, used directly in the Chinchilla calculation) and lesson 3 (the parameter-count formula and the size dials this lesson allocates). Comfort with power laws (straight lines on a log-log plot) helps; the lesson explains them, but having the intuition already speeds up the read.
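To make the "log-log straight line" intuition concrete, here is a minimal sketch that fits a pure power law y = a * x^(-b) by ordinary least squares on log(x) and log(y). The function name `fit_power_law`, the synthetic data, and the constants 20.0 and 0.05 are all made up for illustration; they are not measurements from any real training run. The sketch also assumes a pure power law with no constant offset, whereas fits in the literature often add an irreducible-loss term.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y).

    Taking logs gives log y = log a - b * log x, a straight line whose
    slope is -b and whose intercept is log a.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (prefactor a, exponent b)

# Synthetic "loss vs compute" points generated from a known power law
# (illustrative values only).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [20.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate one order of magnitude beyond the largest "run".
predicted = a * 1e22 ** (-b)
print(f"a ~= {a:.3f}, b ~= {b:.3f}, predicted loss at 1e22 FLOPs ~= {predicted:.3f}")
```

The extrapolation in the last step is what makes a scaling law useful in practice: fit on small, cheap runs, then predict the loss of a run you have not paid for yet.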

The math is light but real. The calculations are arithmetic with large numbers: substituting D ~= 20 * N into 6ND = C gives 120 * N^2 ~= C, a quadratic in N that you solve directly (N ~= sqrt(C/120), then D ~= 20 * N). The “power law” is described by what it predicts, not derived from physics; it is an empirical regularity that holds across runs.
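The arithmetic above can be sketched in a few lines. This is an illustrative helper, not code from the course: the name `chinchilla_optimal`, the parameter `tokens_per_param`, and the 1e21 FLOP budget are assumptions chosen for the example. It takes the two relations stated above (6ND = C and D / N ~= 20) at face value.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (N, D) satisfying 6 * N * D = C with D = tokens_per_param * N.

    Substituting D into the compute estimate gives
    6 * tokens_per_param * N**2 = C, so N = sqrt(C / (6 * tokens_per_param)).
    With the default ratio of 20 this is N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical budget of 1e21 training FLOPs.
budget = 1e21
n, d = chinchilla_optimal(budget)
print(f"N ~= {n:.3g} parameters, D ~= {d:.3g} tokens")
```

For this budget the split comes out to roughly 2.9 billion parameters trained on roughly 58 billion tokens. Keeping the ratio as a parameter makes it easy to see how the split moves if you deliberately train a smaller model on more tokens than the rule of thumb suggests.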

The single capability this lesson builds: explain what scaling laws are and how they guide compute, model-size, and data decisions. Concretely, you will be able to:

  • Describe the power-law form of LM scaling (loss vs compute, parameters, data)
  • Distinguish Kaplan from Chinchilla and state the D ~ 20N rule
  • Compute compute-optimal (N, D) from a budget using 6ND and Chinchilla
  • Explain why inference cost pushes models past Chinchilla-optimal
  • Judge architectural changes by whether they improve the scaling exponent

  • Read time: about 14 minutes
  • Practice time: about 12 minutes (compute a Chinchilla-optimal split and reason through inference cost, plus flashcards)
  • Difficulty: deep (Stage C; arithmetic-heavy but no calculus, reasoning through compute and inference economics)