References: Scaling laws

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lectures 9 and 11:
    Scaling laws
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides are public on GitHub
    without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson collapses the two scaling-laws lectures (9 and 11) per the Phase 0
mirror. Clawdemy's lessons are original prose that follows the pedagogical arc
of the course. Because the source publishes no explicit license, we cite it
as a recommended companion and reproduce none of its materials.

Watch this next

Stanford CS336, Lectures 9 and 11: Scaling laws by Hashimoto and Liang. The two lectures this lesson collapses, with worked fits and the Chinchilla derivation in more depth.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Scaling Laws for Neural Language Models” by Kaplan et al. (2020). The first systematic scaling-laws paper for transformers. Useful both for its methodology and as the source of the recommendation that Chinchilla later corrected.
“Training Compute-Optimal Large Language Models” by Hoffmann et al. (2022), the Chinchilla paper. The correction that shifted the field. The figure showing parameter-count vs token-count for optimal training is worth memorizing.
Compute-optimal vs production-optimal models by Harm de Vries (2023). A clear blog argument for why open models often train past Chinchilla-optimal, with the inference-cost math worked out.

Adjacent topics

Where this connects inside the track.

Counting the cost (lesson 2). The 6ND compute estimate is the input to the Chinchilla budget calculation. Scaling laws make the accounting actionable.
The Transformer architecture (lesson 3). Architectural and training changes (different attention, optimizer, normalization) are judged at scale by whether they improve the scaling curve: a better exponent, or a lower loss at the same compute.
Evaluation (lesson 10). Scaling laws predict cross-entropy loss; the next lesson is the critical look at what that loss actually correlates with for downstream tasks.
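
As a quick refresher on how the 6ND estimate feeds the Chinchilla budget calculation mentioned above, here is a minimal sketch. It assumes training compute C ≈ 6·N·D and the commonly quoted rule of thumb of roughly 20 training tokens per parameter; that ratio is an approximation, not a fitted constant, and the function names are illustrative, not from the course or the papers.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C = 6 * N * D estimate."""
    return 6.0 * n_params * n_tokens


def chinchilla_budget(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)). The default ratio of 20 is
    a rule-of-thumb assumption, not an exact result.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Chinchilla itself: 70B parameters trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)   # about 5.88e23 FLOPs
n, d = chinchilla_budget(budget)        # recovers roughly 7e10 params, 1.4e12 tokens
print(f"C = {budget:.3g} FLOPs -> N = {n:.3g} params, D = {d:.3g} tokens")
```

Raising tokens_per_param above 20 for the same budget gives a smaller model trained on more tokens, which is the direction the de Vries post argues for when inference cost matters.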