References: Scaling laws
Source material
Source curriculum (structural mirror, cited as further study):

• Stanford CS336, "Language Modeling from Scratch", Lectures 9 and 11: Scaling laws
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson collapses the two scaling-laws lectures (9 and 11) per the Phase 0 mirror. Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials.
Watch this next
- Stanford CS336, Lectures 9 and 11: Scaling laws by Hashimoto and Liang. The two lectures this lesson collapses, with worked fits and the Chinchilla derivation in more depth.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Scaling Laws for Neural Language Models” by Kaplan et al. (2020). The first systematic scaling-laws paper for transformers. Useful both for the methodology and as the recommendation Chinchilla corrected.
- “Training Compute-Optimal Large Language Models” by Hoffmann et al. (2022), the Chinchilla paper. The correction that shifted the field. The figure showing parameter-count vs token-count for optimal training is worth memorizing.
- Compute-optimal vs production-optimal models by Harm de Vries (2023). A clear blog argument for why open models often train past Chinchilla-optimal, with the inference-cost math worked out.
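The inference-cost argument in the last entry can be sketched with two common approximations: training costs about 6ND FLOPs, and serving costs about 2N FLOPs per generated token. The function name and every model size and token count below are hypothetical, chosen only for illustration. The sketch does not check that the two models reach the same loss; that would need a fitted scaling law.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Approximate lifetime FLOPs for one model.

    Assumes ~6*N*D FLOPs for training and ~2*N FLOPs per token served.
    Both are rough rules of thumb, not measured costs.
    """
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * inference_tokens
    return training + inference


# Hypothetical comparison: a larger model trained near 20 tokens per parameter
# versus a smaller model trained well past that ratio. Same serving volume.
served_tokens = 2e12
large = lifetime_flops(n_params=70e9, train_tokens=1.4e12, inference_tokens=served_tokens)
small = lifetime_flops(n_params=30e9, train_tokens=3.0e12, inference_tokens=served_tokens)

print(f"large model lifetime FLOPs: {large:.3e}")
print(f"small model lifetime FLOPs: {small:.3e}")
```

With these made-up numbers the smaller model is cheaper over its lifetime because every served token costs less, and the gap widens as `served_tokens` grows.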
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). The `6ND` compute estimate is the input to the Chinchilla budget calculation. Scaling laws make the accounting actionable.
- The Transformer architecture (lesson 3). Architectural changes (different attention, optimizer, normalization) are judged at scale by whether they improve the scaling exponent.
- Evaluation (lesson 10). Scaling laws predict cross-entropy loss; the next lesson is the critical look at what that loss actually correlates with for downstream tasks.
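To make the link between the 6ND estimate and the Chinchilla budget calculation concrete, here is a minimal sketch. It assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter; that ratio is an approximation of the paper's fits, not an exact constant, and the function name is hypothetical.

```python
import math


def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget C into parameters N and tokens D.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget implied by a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = chinchilla_allocation(budget)

print(f"budget: {budget:.3e} FLOPs")
print(f"parameters: {n:.3e}, tokens: {d:.3e}")
```

Feeding the budget back in recovers about 70B parameters and 1.4T tokens, which is only a consistency check on the algebra. Changing `tokens_per_param` shows how sensitive the split is to the assumed ratio.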