Lesson: Why scale matters: scaling laws and Chinchilla

The previous lesson ended on one claim that did not get justified: pretraining works because of scale. Predict the next token, repeated billions of times, on the open internet: somehow, the result is something fluent in language and broadly knowledgeable about the world. Why does that work? And, harder: how much scale is enough? Once you have decided to throw compute at next-token prediction, what is the right way to spend that compute?

The short version: scale works, and for a few years the field was scaling the wrong axis.

Pretraining cost is measured in FLOPs (floating-point operations): how many multiplications and additions over floating-point numbers the training loop has to perform. Training a frontier-class LLM takes on the order of 10^25 FLOPs, the lecturer’s anchor for “extraordinary.”

FLOPs matter because the count folds both the model size (parameters) and the data size (tokens) into a single number. As a rough rule of thumb, training cost scales like the product of parameters and tokens. More of either makes training more expensive. Once you have a fixed FLOPs budget, you have to decide how to split it between “bigger model” and “more data,” and that turns out to be the central question of this lesson.
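The rule of thumb can be made concrete with a small sketch. It assumes the commonly used approximation of about 6 FLOPs per parameter per training token; that constant is not stated in this lesson, so treat it as an assumption, and treat `training_flops` as a hypothetical helper name.

```python
def training_flops(params: float, tokens: float, flops_per_param_token: float = 6.0) -> float:
    """Estimate total training compute as k * parameters * tokens.

    k = 6 is a widely used approximation and an assumption here,
    not a figure taken from the lesson.
    """
    return flops_per_param_token * params * tokens


if __name__ == "__main__":
    base = training_flops(70e9, 1.4e12)
    print(f"70B params on 1.4T tokens: {base:.2e} FLOPs")
    # Doubling either factor doubles the estimated cost.
    print("2x params:", training_flops(140e9, 1.4e12) / base)
    print("2x tokens:", training_flops(70e9, 2.8e12) / base)
```

Because the estimate is a plain product, a fixed budget forces a trade: every extra parameter has to be paid for with fewer training tokens, and the reverse.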

In 2020, a team at OpenAI published a paper titled “Scaling Laws for Neural Language Models” (Kaplan et al.). The paper ran a large number of experiments varying three knobs: model size (parameters), training-set size (tokens), and compute (FLOPs). The headline result was empirical and simple to state.

  • More compute reliably reduces loss on next-token prediction. Throw more FLOPs at the same model and data shape, and the model gets predictably better at predicting the next token. The improvement is not a step function or a plateau; it follows a smooth curve.
  • More parameters reliably reduce loss. Make the model bigger, holding everything else, and loss goes down predictably.
  • More training tokens reliably reduce loss. Train on more data, holding everything else, and loss goes down predictably.

The key word is “predictably.” Before this paper, scaling neural networks was a hope-it-works kind of activity. The Kaplan paper turned it into something closer to engineering: if you double the parameters, you can predict roughly how much loss will fall. If you double the data, the same holds. The relationships are smooth power laws, not flat lines or cliff edges.
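A toy power law shows what “predictably” means in practice. The constants below (`n_c`, `alpha`) are hypothetical values chosen for illustration, not the fitted values from the paper; the point is only that under a power law, doubling the input multiplies the loss by the same factor no matter where you start.

```python
def power_law_loss(n: float, n_c: float = 1e14, alpha: float = 0.08) -> float:
    """Toy power law L(n) = (n_c / n) ** alpha.

    n_c and alpha are made-up illustration constants, not fitted values.
    """
    return (n_c / n) ** alpha


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):
        ratio = power_law_loss(2 * n) / power_law_loss(n)
        # The ratio is 2 ** -alpha at every scale: same relative gain per doubling.
        print(f"n = {n:.0e}: loss after doubling / loss before = {ratio:.4f}")
```

With this exponent each doubling lowers the toy loss by roughly 5 percent, whether the model starts at 100 million or 10 billion parameters. That constant relative gain is what makes extrapolation possible.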

Two consequences flowed from this for the field.

The first: bigger is reliably better. Between roughly 2019 and 2024, the public face of LLM progress was a series of model releases, each substantially larger than the last. Companies built bigger and bigger models because the empirical scaling-law evidence said performance would keep improving. The Stanford lecturer flags this as the period’s defining trend.

The second: bigger is more “sample-efficient” per token. The Kaplan paper observed that for an equal amount of training data, a larger model gets more out of it. The lecturer’s framing: “for an equal amount of tokens that is processed, you will have a better performance with a bigger model compared to a smaller one.” If you have a fixed amount of data to train on, scaling up the model wins.

These two consequences, taken together, point in one direction: build bigger.

Here is where the picture gets more interesting. The two consequences above are answers to a particular question: given unlimited data, what is the best thing to do with my compute? The answer is: more parameters, every time.

But unlimited data is not the situation anyone actually has. Pretraining-quality text is finite. Compute budgets are finite. Money is finite. Time is finite. The real question is the constrained one: given a fixed compute budget, how should I split it between model size and training data to get the best model possible?

The Kaplan paper did not answer that question directly. Asking it produced a different paper, two years later, with a different headline.

In 2022, a team at DeepMind published a paper that asked the constrained version of the question. They fixed compute budgets and tried many model-and-data combinations within each budget, then plotted the resulting loss to find the minimum. The minimum, across a wide range of compute budgets, fell at a remarkably consistent ratio.

The ratio: roughly 20 tokens per parameter.
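The procedure (fix the budget, sweep model sizes, find the loss minimum) can be sketched as follows. Everything numeric here is an assumption: the loss surface `toy_loss` is invented, its constants were picked so that the minimum sits near 20 tokens per parameter, and cost is approximated as 6 × parameters × tokens. The sketch illustrates the method, not the paper's fitted curves.

```python
def toy_loss(params: float, tokens: float) -> float:
    """Hypothetical loss surface: one term shrinks with parameters, one with tokens."""
    exponent = 0.5               # made-up exponent
    coeff_params = 1.0           # made-up constant
    coeff_tokens = 20.0 ** 0.5   # chosen so the toy optimum lands near 20 tokens/param
    return coeff_params / params ** exponent + coeff_tokens / tokens ** exponent


def best_split(budget_flops: float) -> tuple[float, float]:
    """Sweep model sizes at a fixed budget; return the (params, tokens) with lowest loss."""
    candidates = []
    for i in range(61):                          # 1e8 .. 1e11 parameters, log-spaced
        params = 10 ** (8 + 0.05 * i)
        tokens = budget_flops / (6.0 * params)   # assumes cost ~ 6 * params * tokens
        candidates.append((toy_loss(params, tokens), params, tokens))
    _, params, tokens = min(candidates)
    return params, tokens


if __name__ == "__main__":
    for budget in (1e20, 1e21, 1e22):
        params, tokens = best_split(budget)
        print(f"budget {budget:.0e}: {params:.2e} params, {tokens / params:.1f} tokens/param")
```

On this toy surface the best ratio stays the same as the budget grows by a factor of 100, which mirrors the consistency described above: a larger budget buys a larger model and proportionally more data, not a different ratio.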

If your model has 10 billion parameters, the compute-optimal training set size is around 200 billion tokens. If your model has 70 billion parameters, the compute-optimal set is around 1.4 trillion tokens. The pattern holds across orders of magnitude. The Stanford lecturer summarizes this as: “if you have an amount of training set size that’s about 20 times the model size, then you’re spending your compute in an optimal way.”

The DeepMind team trained a model called Chinchilla with these proportions and used it as the demonstration. The 20:1 token-to-parameter ratio became a working rule of thumb the field still references by that name.
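The rule of thumb is simple enough to write as a one-line helper. `compute_optimal_tokens` is a hypothetical name, and the ratio of 20 is the approximate figure from this lesson rather than an exact constant.

```python
TOKENS_PER_PARAM = 20  # approximate rule of thumb from the lesson


def compute_optimal_tokens(params: float) -> float:
    """Return the rough Chinchilla-style token target for a given parameter count."""
    return TOKENS_PER_PARAM * params


if __name__ == "__main__":
    for params in (10e9, 70e9):
        target = compute_optimal_tokens(params)
        print(f"{params / 1e9:.0f}B params -> about {target / 1e9:.0f}B tokens")
```

For 10 billion and 70 billion parameters this reproduces the 200 billion and 1.4 trillion token targets quoted above.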

The two papers do not contradict each other

This is where careful readers stall. Kaplan said “bigger is better, including per token.” Chinchilla said “more tokens with a smaller model is optimal.” These look like opposite claims. They are not.

They are answers to different optimization questions.

  • Kaplan’s question: with unlimited data, what is the most sample-efficient model size? Answer: bigger, all the way up.
  • Chinchilla’s question: with a fixed compute budget, what is the optimal balance of model size and training data? Answer: roughly 20 tokens per parameter.

The two questions rest on the same empirical scaling-law foundation. They lead to different practical recommendations because they optimize under different constraints. The 2019-2024 “bigger is reliably better” period happened because people were implicitly optimizing under the Kaplan question, often with the unstated assumption that data was effectively unlimited. The Chinchilla paper rebalanced the picture by making the data constraint explicit.

A useful way to hold both claims in your head: scaling laws are the empirical foundation; Kaplan and Chinchilla are different optimization recipes on top of that foundation.

A worked check: was GPT-3 compute-optimal?

The Stanford lecturer uses GPT-3 as the worked example. GPT-3 had roughly 175 billion parameters and was trained on roughly 300 billion tokens. Apply the Chinchilla rule:

parameters: 175,000,000,000
tokens: 300,000,000,000
actual ratio: 300B / 175B ≈ 1.7 tokens per parameter
target ratio: 20 tokens per parameter

GPT-3 was trained on roughly 1.7 tokens per parameter, where the Chinchilla rule says it should have been trained on roughly 20. The Chinchilla-aligned target for a 175B-parameter model is around 175B × 20 = 3.5 trillion tokens, more than ten times what GPT-3 actually received. By that standard, GPT-3 was, in the lecturer’s words, “really undertrained.” The same compute budget could in principle have produced a better model with a smaller parameter count and a much larger training set. A similarly sized model trained on substantially more data would also have been better, but that would have required a larger compute budget.
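The check, and the “smaller model, more data” alternative, can be worked through in a few lines. The parameter and token counts are the approximate figures quoted above; the cost model of 6 × parameters × tokens is an assumption that does not come from this lesson, so the reallocated numbers are a rough illustration only.

```python
import math

GPT3_PARAMS = 175e9    # approximate figure quoted in the lesson
GPT3_TOKENS = 300e9    # approximate figure quoted in the lesson
TARGET_RATIO = 20.0    # Chinchilla rule of thumb

actual_ratio = GPT3_TOKENS / GPT3_PARAMS      # tokens per parameter actually used
target_tokens = TARGET_RATIO * GPT3_PARAMS    # tokens the rule asks for at 175B params
shortfall = target_tokens / GPT3_TOKENS       # how many times more data was needed

# Assumed cost model: FLOPs ~ 6 * params * tokens.
budget = 6.0 * GPT3_PARAMS * GPT3_TOKENS
# Same budget at 20 tokens/param: solve 6 * N * (20 * N) = budget for N.
realloc_params = math.sqrt(budget / (6.0 * TARGET_RATIO))
realloc_tokens = TARGET_RATIO * realloc_params

if __name__ == "__main__":
    print(f"actual ratio: {actual_ratio:.1f} tokens/param")
    print(f"target tokens at 175B params: {target_tokens:.2e} ({shortfall:.1f}x actual)")
    print(f"same budget, 20:1 split: {realloc_params:.2e} params, {realloc_tokens:.2e} tokens")
```

Under these assumptions, the same budget spent at a 20:1 ratio would go to a model of roughly 51 billion parameters trained on about 1 trillion tokens.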

The same check applies to the broader pre-Chinchilla era: under the new rule of thumb, most large pretraining runs of that period were trained on too little data for their parameter count. The “bigger is reliably better” period was leaving compute on the table.

Kaplan and Chinchilla are about training compute. By 2024, a third axis became impossible to ignore: inference-time scaling, which asks how a model’s output quality changes when you spend more compute per query at inference time rather than during training.

The clearest reference is Snell, Lee, Xu, and Kumar (Google DeepMind, 2024), arXiv:2408.03314. Their headline: for many tasks, you can match the quality of a much larger model by spending more inference compute on a smaller one. Best-of-N sampling (generate N candidates, pick the best via a scorer), majority voting, sequential refinement, and the long internal reasoning chains that define modern reasoning models (covered in Phase 6) are all instances of trading inference compute for output quality.
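Two of these strategies are easy to sketch. The code below is a minimal illustration, not the method from the cited paper: `generate` and `score` are hypothetical stand-ins for a model call and a scoring function, and `noisy_model` is a fake model that returns the right answer only some of the time.

```python
import random
from collections import Counter
from typing import Callable


def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n candidates and keep the one the scorer rates highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


def majority_vote(generate: Callable[[], str], n: int) -> str:
    """Draw n answers and return the most frequent one."""
    answers = [generate() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_model() -> str:
        # Stand-in for a model call: usually "42", sometimes a near miss.
        return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

    print("single sample:", noisy_model())
    print("majority of 15:", majority_vote(noisy_model, 15))
    # Hypothetical scorer that prefers answers close to 42.
    print("best of 5:", best_of_n(noisy_model, lambda a: -abs(int(a) - 42), 5))
```

Both functions call the model n times, so the per-query cost grows linearly with n. That extra cost is the inference compute being traded for output quality.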

This shifts the framing of the scaling story. Training-side scaling (Kaplan, Chinchilla) tells you what model to build given a compute budget at training time. Inference-time scaling tells you that you can sometimes choose to deploy a smaller well-trained model and spend the saved budget on per-query inference compute. The two axes are complementary, not competing. When you read about a model that “thinks longer before answering,” you are reading about inference-time scaling. When a 2026 model card cites both training tokens and per-query reasoning depth, you now know why both numbers matter.

The scaling-laws picture is invisible at runtime, but it shapes what you experience.

  • “X has Y billion parameters” is not a complete description of capability. If two models have the same parameter count but were trained on very different amounts of data, they will behave differently. A 70B-parameter model trained on 1.4T tokens (Chinchilla-aligned) is not the same as a 70B-parameter model trained on 300B tokens. The shorthand of citing parameter count alone hides the data-side variable.
  • “Bigger model” stopped being the obvious next move after Chinchilla. Frontier labs did not stop scaling, but the kind of scaling rebalanced toward more data per parameter. When you read a model release citing many trillions of training tokens (a much larger figure than the pre-Chinchilla norm), that is the data side of the rebalancing showing up; the parameter count is no longer the whole story.
  • A model that feels “well-trained” probably had the data to support it. Some assistants give the impression of being broad and reliable while others feel narrower and more brittle even at the same parameter scale. The data side of training is one of the variables behind that perception, alongside the post-training tuning we will cover in Phase 4.

A few mistakes Daniel-shaped readers tend to make on this material. Naming them up front is faster than catching them later.

“Scaling laws guarantee a particular loss number for a particular size.” They predict the shape of the curve (smooth, predictable, power-law-ish), not the exact value. In practice the constant factors depend on the architecture, the data, the tokenizer, and the optimization setup. The qualitative claim (“more compute, predictably less loss”) is robust; the precise quantitative claim is not portable across setups.

“Chinchilla means smaller models are always better.” No. Chinchilla means that for a given compute budget, splitting the budget into a smaller model and more data is more efficient than the reverse. If you have more compute, the optimal model is still larger; the data target just grows in proportion. Chinchilla does not say “stop building large models.” It says “if you build a large model, train it on enough tokens.”

“FLOPs and FLOP/s are the same thing.” The first is an amount of compute used (a count). The second is a rate of compute available (operations per second); you will see the rate version in GPU spec sheets and hardware comparisons. They use the same letters in roughly the same order, which is unfortunate. When you read a paper or a model card, identify which one is meant from context.
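The two quantities meet when you estimate how long a run takes: time is the amount divided by the rate. In the sketch below every hardware number is hypothetical (chip throughput, chip count, and utilization are invented for illustration); only the 10^25 FLOPs figure comes from this lesson.

```python
def training_days(total_flops: float, flops_per_second_per_chip: float,
                  num_chips: int, utilization: float) -> float:
    """Convert an amount of compute (FLOPs) into wall-clock days using a rate (FLOP/s)."""
    effective_rate = flops_per_second_per_chip * num_chips * utilization  # FLOP/s
    seconds = total_flops / effective_rate
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    days = training_days(
        total_flops=1e25,                 # amount: a count of operations
        flops_per_second_per_chip=1e15,   # rate: hypothetical peak per accelerator
        num_chips=10_000,                 # hypothetical cluster size
        utilization=0.4,                  # hypothetical fraction of peak achieved
    )
    print(f"about {days:.0f} days")
```

With these made-up numbers the run takes about 29 days. Mixing up the count and the rate would be a units error, in the same way that confusing distance with speed would be.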

“Loss falling means the model is getting better at the task I care about.” Pretraining loss is next-token-prediction loss on text from the corpus distribution. Falling pretraining loss correlates with general capability, but task-specific quality (how well the model summarizes a legal document, for instance) is a downstream measurement the scaling-laws curve does not directly predict. Phase 4’s tuning and Phase 7’s evaluation methodology cover the gap.

  • Scaling laws are an empirical finding, not a theory. Kaplan et al. 2020 ran experiments and found that loss falls smoothly and predictably with more compute, more parameters, and more training data. The smoothness is the surprise; before this paper, scaling neural networks was less predictable.
  • Two practical recommendations flowed from scaling laws. Bigger is reliably better (the 2019-2024 framing) and bigger is more sample-efficient per token (the Kaplan-era observation). Both assumed effectively unlimited data.
  • Chinchilla rebalanced the picture once data was treated as finite. With a fixed compute budget, the compute-optimal balance is roughly 20 training tokens per parameter. A 175B-parameter model deserves about 3.5T tokens; GPT-3 had only 300B, making it undertrained by this standard.
  • The two papers are reconcilable, not contradictory. Kaplan answers “with unlimited data, what’s the most sample-efficient model size?” Chinchilla answers “with a fixed compute budget, what’s the optimal split?” Same empirical foundation, different optimization questions.
  • “Y billion parameters” is half a description. Tokens-per-parameter is the other half. Modern model releases citing 15-trillion-token training sets are the visible side of the Chinchilla rebalancing in production.

Once you have settled the question of how much compute and data, the next question is how you actually run that training loop on real hardware. Trillions of tokens, billions of parameters, weeks to months of training: there is no single GPU that fits the model, and there is no naive setup that runs efficiently across many GPUs. The next lesson covers the engineering tricks (parallelism, ZeRO, Flash Attention) that make compute-optimal pretraining tractable in practice.

Scale produces predictable improvement.
Chinchilla pinned the optimal split at roughly 20 tokens per parameter.
Pre-Chinchilla pretraining left compute on the table.