Lesson: Scaling laws, predicting what bigger gets you

Lesson 2 gave you the cost (6ND for training compute) and lesson 3 gave you the size dials (N is set by d-model and n-layers). The natural question is the one every LLM team eventually asks: for a fixed compute budget, how should I spend it, on a bigger model or on more data? Phase 3 opens with the answer, scaling laws, which turn that question from folklore into arithmetic. Per the Phase 0 mirror, this lesson collapses the two scaling-laws lectures of the course into one.

The starting fact is striking and not at all obvious: when you train a transformer on language and plot the final cross-entropy loss against compute (or against parameters, or against data), you get a remarkably straight line on a log-log plot. The loss falls as a power law in each of these:

loss(compute) ~ C^(-alpha_C)
loss(parameters) ~ N^(-alpha_N)
loss(data) ~ D^(-alpha_D)

where each alpha is a small positive exponent. Doubling compute (or parameters, or data) reduces the loss by a predictable, modest factor. This is not a theorem; it is what hundreds of training runs across orders of magnitude actually show. The smoothness is what lets you do the rest of the lesson.
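To make "predictable, modest factor" concrete: under loss ~ X^(-alpha), doubling X multiplies the loss by 2^(-alpha). The sketch below is a minimal illustration; the exponent values are assumptions chosen for the example, not fitted values from any paper.

```python
def doubling_factor(alpha: float) -> float:
    """Multiplier applied to the loss when the resource X doubles,
    under the power law loss ~ X^(-alpha)."""
    return 2.0 ** (-alpha)


# Illustrative exponents only (assumed, not taken from a specific fit).
for alpha in (0.05, 0.10, 0.30):
    factor = doubling_factor(alpha)
    print(
        f"alpha={alpha:.2f}: doubling multiplies loss by {factor:.3f} "
        f"({(1 - factor) * 100:.1f}% lower)"
    )
```

With a small exponent, each doubling buys only a few percent, which is why the gains are described as modest and why they are usually discussed across orders of magnitude.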

Kaplan: the first scaling laws, and their answer

The empirical study that introduced this framing was Kaplan et al. (2020). They trained many transformers across a wide range of model sizes and token counts and fit the curves. The headline finding: for a fixed compute budget, you should spend it disproportionately on parameters rather than on tokens. Roughly, their guidance was: train a big model on a moderate amount of data.

This recommendation shaped a generation of models. GPT-3 (175 billion parameters, 300 billion tokens) is the canonical Kaplan-era choice: very large, comparatively under-trained on data.

In 2022, Hoffmann et al. (“Training Compute-Optimal Large Language Models,” the Chinchilla paper) redid the analysis at larger scale with cleaner methodology and got a different answer. They held a compute budget fixed (using the 6ND rule) and asked: which (N, D) pair actually minimizes loss?

Their finding upended the field: model size and training data should scale roughly equally with compute. Optimal model size and optimal data both scale as roughly the square root of compute. Concretely, the Chinchilla-optimal ratio is about:

D* ~= 20 * N* (about 20 training tokens per parameter)

To validate, they trained Chinchilla at 70B parameters on 1.4T tokens, using about the same compute budget as the much larger Gopher (280B). Chinchilla beat Gopher, and also outperformed GPT-3 and other large models across a wide range of benchmarks, despite having far fewer parameters; what it had instead was far more training data. The lesson was sharp: the previous generation of models was over-parameterized and under-trained. A smaller model trained on much more data was a better use of the compute.
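The comparison can be checked with the 6ND rule from lesson 2. In this sketch the GPT-3 and Chinchilla figures come from the text; Gopher's 300B-token count is not stated in this lesson and is added here as an assumption.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Training compute under the 6ND approximation."""
    return 6.0 * n_params * n_tokens


# name -> (parameters, training tokens).
# Gopher's token count (300B) is an assumption added for this example.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    print(
        f"{name:<11} tokens/param = {d / n:5.1f}   "
        f"6ND = {train_flops(n, d):.2e} FLOPs"
    )
```

Under these numbers Chinchilla and Gopher land within roughly 20% of each other in 6ND compute, while their tokens-per-parameter ratios differ by more than an order of magnitude.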

The corrected guidance: given a compute budget C, pick N and D so that 6ND equals C and D divided by N is about 20. That single rule, paired with the lesson-3 architecture and lesson-2 accounting, gives you a defensible starting point for a training run.
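The two constraints can be solved in closed form: substituting D = 20N into 6ND = C gives 120 N^2 = C, so N = sqrt(C / 120). A minimal sketch, where `chinchilla_split` and its `tokens_per_param` argument are names invented for this example:

```python
import math


def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve 6*N*D = C together with D = tokens_per_param * N.

    Substituting gives 6 * tokens_per_param * N^2 = C, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget equal to 6 * 70e9 * 1.4e12, the Chinchilla run from the text.
n, d = chinchilla_split(5.88e23)
print(f"N = {n:.3g} params, D = {d:.3g} tokens")
```

Feeding in Chinchilla's own 6ND budget recovers about 70B parameters and 1.4T tokens, which is a useful round-trip check on the rule. The ratio is left as a parameter because 20 is a rule of thumb, not a constant.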

Post-Chinchilla, the field rapidly moved toward smaller models trained on more tokens. But a second consideration crept in that the original analysis did not address: inference cost. A model is trained once but served forever; serving cost scales with parameters (per-token compute and KV-cache memory). For a model that will run a huge number of inference tokens, it can be economically optimal to over-train (training past Chinchilla-optimal on more data, producing a smaller model than compute-optimal at the same final loss), because the per-token saving from a smaller model is collected on every token served, and that can repay a lot of extra upfront training compute.

This is why many modern open models are trained well beyond the 20-tokens-per-parameter line: a 7-to-8-billion-parameter model trained on a few trillion tokens is not Chinchilla-optimal for training, but is excellent for inference economics. Chinchilla minimizes training loss for a training budget; reality minimizes total cost including serving. Both are scaling-law arguments; they just optimize different things.
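The trade-off can be sketched as a break-even calculation. Two assumptions are added here that the lesson does not state: serving is approximated as 2N FLOPs per token (forward pass only), and the two example configurations are hypothetical and simply assumed to reach the same final loss.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training (6ND) plus serving, with serving approximated as
    2N FLOPs per token (an assumption: forward pass only)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens


def break_even_served_tokens(big, small) -> float:
    """Served-token count above which the smaller, over-trained model has
    lower lifetime FLOPs. Each argument is (n_params, train_tokens);
    `big` must have strictly more parameters than `small`."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    serving_saving_per_token = 2.0 * (n_big - n_small)
    return extra_training / serving_saving_per_token


# Hypothetical pair, ASSUMED to reach the same final loss.
compute_optimal = (13e9, 260e9)  # 20 tokens per parameter
over_trained = (7e9, 1.0e12)     # about 143 tokens per parameter

t_star = break_even_served_tokens(compute_optimal, over_trained)
print(f"break-even at {t_star:.2e} served tokens")
```

Below the break-even point the compute-optimal model is cheaper overall; above it the over-trained model wins, and the gap keeps growing with every additional token served. This counts FLOPs only and ignores memory, latency, and hardware pricing.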

Beyond the recipe, scaling laws give you a tool that is unusual in machine learning: prediction before you spend. You fit the curves on a handful of small and medium runs, then extrapolate. With that you can:

  • Pick a compute-optimal N and D for a target compute budget, as above.
  • Predict the loss of a target-scale run before committing to it, and decide whether the predicted loss is worth the cost.
  • Sanity-check that a training run is on track: if a 70B run’s loss curve diverges from what the law predicts, something is wrong (data, optimizer, code) and you should look before burning the rest of the budget.
  • Reason about architectural changes (a new attention variant, a different optimizer): if the law’s exponent improves, the change is genuinely beneficial at scale; if only the prefactor moves, the gain may not survive scaling.
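The fit-then-extrapolate workflow in the list above can be sketched as a straight-line fit in log-log space. The "small runs" here are synthetic points generated from a known law, not real measurements; the pure power-law form (no irreducible-loss term) and the 5% tolerance in `off_track` are simplifying assumptions for illustration.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares line through (log x, log loss).

    Returns (prefactor, alpha) for the model loss = prefactor * x^(-alpha).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / sum(
        (a - mean_x) ** 2 for a in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(prefactor, alpha, x):
    """Loss predicted by the fitted power law at resource level x."""
    return prefactor * x ** (-alpha)


def off_track(measured_loss, predicted_loss, tolerance=0.05):
    """Flag a run whose loss deviates from the prediction by more than
    `tolerance` (relative). The threshold is an arbitrary example value."""
    return abs(measured_loss - predicted_loss) / predicted_loss > tolerance


# Synthetic small/medium runs generated from a known law (not real data).
true_prefactor, true_alpha = 20.0, 0.05
compute = [1e17, 1e18, 1e19, 1e20]
observed = [true_prefactor * c ** (-true_alpha) for c in compute]

prefactor, alpha = fit_power_law(compute, observed)
predicted = predict(prefactor, alpha, 1e23)  # extrapolate three decades out
print(f"fitted alpha = {alpha:.4f}, predicted loss at 1e23 FLOPs = {predicted:.3f}")
```

On noiseless synthetic data the fit recovers the generating exponent exactly; on real runs the points scatter, and the uncertainty in the exponent grows with the distance you extrapolate.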

That predictive power is why scaling laws went from a curiosity to one of the most-cited results in the field.

The honest picture, as the course also gives it: scaling laws are empirical regularities, not physical laws. They have known and important limits.

  • They predict cross-entropy loss, not downstream task performance. A model with lower loss is usually better on benchmarks, but the relationship is noisy, and some downstream capabilities appear in jumps rather than smoothly with loss.
  • They assume clean, representative data. A worse data distribution shifts the whole curve up (a larger prefactor, so higher loss at every scale); mixing toxic or low-quality data can break the assumed regularity. The data lessons that come next are where this matters.
  • They have been re-fit and refined many times (Chinchilla corrected Kaplan; later work has refined Chinchilla in turn). Treat published exponents as approximate; refit on your own data if you are running at scale.
  • The implicit assumption that “loss continues to decrease smoothly forever” is itself a strong claim. At very large scale, new bottlenecks appear (data exhaustion, optimizer regime changes, evaluation saturation), and any extrapolation should be reviewed when you reach a regime far from where the law was fit.

None of these limits make scaling laws less useful; they make them more honest to apply.

Scaling laws are the bridge between the systems half of this track (Phases 1 and 2: build, count, run fast) and the next set of decisions (how good can this get, with this much compute, this much data?). They give you a defensible answer to the most important budget question (how do I split a fixed budget between bigger and longer?), and the Chinchilla rule plus the inference-cost adjustment is the framework most modern open-model teams reason with. They also reframe the architecture work: a new tweak is only meaningfully better if it improves the scaling-law exponent, not just the loss at one size; that is why “we beat baseline at our scale” is a weaker claim than “we improved the exponent.” The next lesson, evaluation, takes the same critical eye to the metrics scaling laws are predicting against.
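The exponent-versus-prefactor distinction can be illustrated by comparing hypothetical variants against a baseline at several scales. Every coefficient below is invented for the example; the third variant is an added illustration of a change that wins at small scale but has a worse exponent.

```python
def loss(prefactor: float, alpha: float, compute: float) -> float:
    """Pure power-law loss model: prefactor * compute^(-alpha)."""
    return prefactor * compute ** (-alpha)


def ratio(variant, compute, baseline=(20.0, 0.050)):
    """Variant loss divided by baseline loss at one compute level.
    Values below 1 mean the variant is better at that scale."""
    return loss(*variant, compute) / loss(*baseline, compute)


# All (prefactor, alpha) pairs are hypothetical.
prefactor_only = (19.0, 0.050)   # same exponent, lower prefactor
better_exponent = (20.0, 0.055)  # same prefactor, steeper exponent
small_scale_win = (15.0, 0.045)  # lower prefactor, worse exponent

for c in (1e18, 1e21, 1e24, 1e27):
    print(
        f"C={c:.0e}  prefactor-only={ratio(prefactor_only, c):.3f}  "
        f"better-exponent={ratio(better_exponent, c):.3f}  "
        f"small-scale-win={ratio(small_scale_win, c):.3f}"
    )
```

In this toy model the prefactor-only variant keeps a constant relative gain, the better-exponent variant pulls further ahead as compute grows, and the third variant beats the baseline at small scale and then falls behind. A single-scale comparison cannot tell these three cases apart.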

  • Loss follows a power law in compute, parameters, and data: on log-log plots, the curves are remarkably straight. Doubling each reduces loss by a predictable, modest factor.
  • Kaplan (2020) suggested spending compute disproportionately on parameters; Chinchilla (2022) corrected this: optimal model size and data should scale roughly equally with compute. Rule of thumb: about 20 training tokens per parameter.
  • The Chinchilla shift moved the field toward smaller, more-tokens models; inference cost then pushes many modern open models past Chinchilla-optimal (over-trained for cheaper serving). Both are scaling-law arguments optimizing different things.
  • Scaling laws give predictive power: fit curves at small/medium scales, extrapolate to the target. Use them to choose N and D, predict loss before committing, and sanity-check runs in flight.
  • Architectural changes should improve the exponent, not just the prefactor at one scale. A win that does not survive scaling is a weaker result.
  • Limits to apply honestly: loss is not exactly downstream capability, data quality moves the curve, exponents have been re-fit and may again, and any extrapolation should be reviewed at very large scale.

Scaling laws turn “how big should I make it?” from folklore into arithmetic: setting 6ND equal to your compute budget C with about 20 tokens per parameter (D divided by N about 20) is the Chinchilla starting point, adjusted upward in tokens when inference cost is a serious factor. The Phase 3 work that follows, evaluation, data, and post-training, refines what counts as “good” in those equations.