References: Why scale matters: scaling laws and Chinchilla

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 4, LLM training):
    https://www.youtube.com/watch?v=VlA_jt_3Qc4
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the FLOPs + scaling-laws + Chinchilla section of
Stanford CME 295 Lecture 4 (~13m50s to ~20m14s, the central pedagogical
arc of the lecture). The lecture continues into parallelism + Flash
Attention (covered in Phase 3, lesson 3) and quantization (Phase 3,
lesson 4). Clawdemy provides original notes, summaries, and quizzes
derived from this material for educational purposes. All rights to the
original lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“Scaling Laws for Neural Language Models”, Kaplan et al., 2020. The original scaling-laws paper. The smooth-power-law claim, the sample-efficiency-of-larger-models observation, and the framing of compute, parameters, and data as the three knobs. Sections 3 and 4 are the empirical core.
“Training Compute-Optimal Large Language Models”, Hoffmann et al., 2022. The Chinchilla paper. Asks the constrained-compute version of the scaling-laws question and finds that roughly 20 training tokens per parameter is compute-optimal (the 20-to-1 rule). Trained a smaller model (Chinchilla, 70B parameters) on more data (1.4 trillion tokens) to demonstrate the rule. The “GPT-3 is undertrained” finding comes from this paper.
Andrej Karpathy, “Intro to Large Language Models”. One-hour video that covers the Kaplan-vs-Chinchilla picture in lay terms, with a clear explanation of why “trillions of tokens” became the press-release shorthand it is today. Pairs well with this lesson if you want to hear the same picture in a different voice.
“The Llama 3 Herd of Models”, Grattafiori et al., 2024. A modern frontier-model paper that explicitly reports its training-token count (about 15 trillion). The data side of the Chinchilla rebalancing showing up in production. Section 3 on training data is the relevant part.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style.

Adjacent topics

Topics that build on or sit beside this one.

The compute-optimal-vs-inference-optimal distinction. Chinchilla optimizes training compute only. In production, models are also constrained by inference cost (smaller models are faster and cheaper to serve). Some modern frontier models deliberately train past the Chinchilla optimum (more tokens than 20-to-1), spending extra training compute to get a smaller, more inference-efficient model at a given target loss. The Llama 3 paper above is one example.
Why scaling laws hold (theoretical work). The empirical Kaplan finding has prompted theoretical work attempting to predict the constants and the exponents from architectural and statistical principles. Search terms: “neural scaling laws theory,” “scaling-law derivations.” Mostly outside this lesson’s scope but worth knowing the field exists.
The data-quality side of pretraining. Chinchilla treats tokens as fungible. They are not: deduplicated high-quality tokens are worth more than scraped low-quality tokens. The RefinedWeb paper (cited in lesson 1’s references) is a starting point. This is one variable that complicates the otherwise clean Chinchilla rule in practice.
Phase 3 lesson 3 preview. Once you know the model size and the token target, the next question is how to actually run that training across many GPUs. Parallelism (data, model, pipeline), ZeRO, Flash Attention. That is the next lesson.
Phase 4 preview: tuning is small relative to pretraining. The Chinchilla rule and the FLOPs anchor in this lesson are about pretraining specifically. Post-pretraining stages (instruction tuning, RLHF, DPO, all of Phase 4) cost orders of magnitude less compute. The expensive part of training is the part this lesson is about.
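
The compute-optimal split mentioned in the first item above can be sketched numerically. This is a minimal sketch, not the paper's fitting procedure: it assumes the common approximation that training compute is about 6 × parameters × tokens FLOPs (an assumption here, not a figure stated on this page) and treats the 20-to-1 rule as exact. The function name compute_optimal_split, the 1e24 budget, and the 200-to-1 over-training ratio are hypothetical, chosen only for illustration.

```python
import math

# Assumed approximation: training FLOPs C ≈ 6 * N * D,
# where N is the parameter count and D is the training-token count.
FLOPS_PER_PARAM_TOKEN = 6.0
# Chinchilla rule of thumb, treated as exact for this sketch.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOPs budget into (parameters, tokens) at a fixed tokens-per-parameter ratio.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 1e24  # hypothetical training budget in FLOPs

# Compute-optimal under the 20-to-1 rule.
n_opt, d_opt = compute_optimal_split(budget)

# Same budget, deliberately over-trained at a hypothetical 200-to-1 ratio.
n_over, d_over = compute_optimal_split(budget, tokens_per_param=200.0)

print(f"20-to-1:  {n_opt:.3e} params, {d_opt:.3e} tokens")
print(f"200-to-1: {n_over:.3e} params, {d_over:.3e} tokens")
```

At a fixed budget, raising the ratio tenfold shrinks the model by a factor of about the square root of ten. The sketch only shows the size-versus-data trade; it does not model the loss each configuration reaches.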

Original sources

The primary papers, in chronological order.

“Scaling Laws for Neural Language Models”, Kaplan et al., 2020. The empirical foundation.
“Language Models are Few-Shot Learners”, Brown et al., 2020. GPT-3. The 175B-parameter, 300B-token model used as the worked undertrained example here.
“Training Compute-Optimal Large Language Models”, Hoffmann et al., 2022. The Chinchilla rebalancing.
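
The "undertrained" label on GPT-3 comes down to simple arithmetic, sketched here in Python. This is a minimal sketch that treats the 20-to-1 rule as an exact ratio (the paper presents it as an approximation); the function and constant names are hypothetical.

```python
# Chinchilla rule of thumb, treated as exact here (a simplifying assumption).
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def tokens_per_param(n_params, n_tokens):
    """Return how many training tokens the model saw per parameter."""
    return n_tokens / n_params


def chinchilla_token_target(n_params):
    """Return the token count the 20-to-1 rule suggests for this model size."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


# GPT-3 figures as listed above: 175B parameters, 300B training tokens.
gpt3_params = 175e9
gpt3_tokens = 300e9

ratio = tokens_per_param(gpt3_params, gpt3_tokens)
target = chinchilla_token_target(gpt3_params)

print(f"GPT-3 tokens per parameter: {ratio:.2f}")
print(f"20-to-1 token target:       {target:.2e}")
print(f"Shortfall factor:           {target / gpt3_tokens:.1f}x")
```

GPT-3 saw roughly 1.7 tokens per parameter, while the rule would suggest about 3.5 trillion tokens for a model of that size.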

Community discussion

None selected for this lesson. The scaling-laws + Chinchilla space at the level of this lesson is consolidated in the academic literature and well-summarized in the Karpathy video above. Durable community references will be added at a future quarterly review if any emerge.