References: Counting the cost

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 2:
PyTorch (einops), resource accounting (FLOPs, memory, arithmetic intensity)
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist
https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture
videos are on YouTube under standard terms; slides are public on GitHub
without a stated license.
Required attribution: "Based on the structure of Stanford CS336,
'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
(cs336.stanford.edu). This is an independent structural mirror in
original prose; it reproduces no course materials, and Stanford does
not endorse it."
This lesson mirrors the structure of Lecture 2 (resource accounting and
einops). Clawdemy's lessons are original prose that follows the pedagogical
arc of the course. Because the source publishes no explicit license, we cite
it as a recommended companion and reproduce none of its materials. All rights
to the original course materials remain with their creators.

A short, durable list. Each link is a specific next step, not a generic pile.

  • “Scaling Laws for Neural Language Models” by Kaplan et al. (2020). The paper behind this lesson's parameter-and-FLOP accounting (training compute of roughly 6ND FLOPs for N parameters and D training tokens), and the bridge to the scaling-laws lesson later in this track.

  • The einops documentation. The reference for rearrange, reduce, and einsum-style operations. Short, example-driven, and the fastest way to make your tensor code readable.

  • NVIDIA's GPU Performance Background guide, part of its deep learning performance documentation. A clear primer on compute-bound versus memory-bound operations and arithmetic intensity (the roofline idea), straight from the hardware side.
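As a quick anchor for the two quantities these references cover, here is a minimal sketch in plain Python. The function names, the example model size, and the 2-bytes-per-element (bf16-style) default are illustrative assumptions, not taken from any of the sources; the intensity formula assumes each input is read once and the output written once, which ignores caching and tiling.

```python
def training_flops(num_params: int, num_tokens: int) -> float:
    """Rough training compute: about 6 FLOPs per parameter per token (6ND)."""
    return 6.0 * num_params * num_tokens


def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes one read of each input and one write of the output.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical model: 1e9 parameters trained on 2e10 tokens.
print(f"{training_flops(10**9, 2 * 10**10):.2e} FLOPs")

# A large square matmul does far more arithmetic per byte than a matrix-vector product.
print(matmul_arithmetic_intensity(4096, 4096, 4096))  # about 1365 FLOPs per byte
print(matmul_arithmetic_intensity(1, 4096, 4096))     # just under 1 FLOP per byte
```

Comparing these numbers against a GPU's ratio of peak FLOP/s to memory bandwidth is what tells you whether an operation is likely compute-bound or memory-bound.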

Where this connects inside the track.

  • What “from scratch” means, and the tokenizer (lesson 1). That lesson named efficiency as the through-line and left the vocabulary-size trade-off to be quantified; this lesson is the accounting that quantifies it.

  • Writing fast kernels: Triton and XLA (lesson 6). The arithmetic-intensity idea here is exactly what kernel fusion exploits; that lesson turns “raise intensity by fusing” into real code.

  • Scaling laws (lesson 9). The 6ND compute estimate is the input to scaling-law reasoning, which decides how to spend a fixed compute budget across model size and data.
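To make the “raise intensity by fusing” connection concrete before lesson 6, here is a back-of-the-envelope sketch. It is a counting model, not a kernel: the function name is hypothetical, and it assumes each elementwise op costs 1 FLOP per element, that an unfused op reads one element and writes one element from main memory, and that a fused chain reads the input once and writes the result once.

```python
def elementwise_chain_intensity(num_ops: int, fused: bool, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte, per element, for a chain of unary elementwise ops.

    Assumption: 1 FLOP per op per element. Unfused, every op makes its own
    read-and-write pass over memory; fused, the whole chain makes one pass.
    """
    flops = num_ops
    memory_passes = 1 if fused else num_ops
    bytes_moved = memory_passes * 2 * bytes_per_elem  # one read plus one write per pass
    return flops / bytes_moved


# Three chained ops (for example scale, shift, activation) on 2-byte elements.
print(elementwise_chain_intensity(3, fused=False))  # 0.25
print(elementwise_chain_intensity(3, fused=True))   # 0.75
```

Under these assumptions the arithmetic is unchanged and only the memory traffic shrinks, so intensity grows with the number of ops fused.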