References: Counting the cost
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 2: PyTorch (einops), resource accounting (FLOPs, memory, arithmetic intensity)
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."
This lesson mirrors the structure of Lecture 2 (resource accounting and einops). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.
Watch this next
- Stanford CS336, Lecture 2: resource accounting by Hashimoto and Liang. The lecture this lesson mirrors. It works through FLOP and memory accounting in detail and uses einops throughout, the moving version of the counting here.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Scaling Laws for Neural Language Models” by Kaplan et al. (2020). The paper behind the FLOP-and-parameters accounting (the 6ND-style estimates), and the bridge to the scaling-laws lesson later in this track.
- The einops documentation. The reference for rearrange, reduce, and einsum-style operations. Short, example-driven, and the fastest way to make your tensor code readable.
- The NVIDIA GPU performance background. A clear primer on compute-bound versus memory-bound operations and arithmetic intensity (the roofline idea), straight from the hardware side.
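As a reminder of what the roofline idea in the last item measures, here is a minimal sketch in plain Python. It assumes a single dense matrix multiply of shapes (m, k) by (k, n), counts 2 FLOPs per multiply-add, and assumes each matrix crosses memory exactly once. The function names and the hardware peaks in the example are hypothetical placeholders, not figures from any vendor datasheet.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m, k) @ (k, n) product: one multiply and one add per term."""
    return 2 * m * k * n


def matmul_bytes(m: int, k: int, n: int, bytes_per_element: int = 2) -> int:
    """Bytes moved, assuming both inputs are read once and the output is written once.

    bytes_per_element=2 corresponds to a 16-bit format such as bf16 (an assumption).
    """
    return bytes_per_element * (m * k + k * n + m * n)


def arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """FLOPs performed per byte moved."""
    return matmul_flops(m, k, n) / matmul_bytes(m, k, n, bytes_per_element)


def is_compute_bound(intensity: float, peak_flops_per_s: float, peak_bytes_per_s: float) -> bool:
    """Compute-bound when intensity reaches the hardware's FLOPs-per-byte ratio."""
    return intensity >= peak_flops_per_s / peak_bytes_per_s


# Hypothetical accelerator: 100 TFLOP/s and 1 TB/s, so the ridge point is 100 FLOPs/byte.
PEAK_FLOPS = 100e12
PEAK_BYTES = 1e12

big = arithmetic_intensity(4096, 4096, 4096)   # large square matmul
thin = arithmetic_intensity(1, 4096, 4096)     # matrix-vector product

print(f"square matmul: {big:.1f} FLOPs/byte, compute-bound: "
      f"{is_compute_bound(big, PEAK_FLOPS, PEAK_BYTES)}")
print(f"matrix-vector: {thin:.3f} FLOPs/byte, compute-bound: "
      f"{is_compute_bound(thin, PEAK_FLOPS, PEAK_BYTES)}")
```

Under these assumptions the square product lands far above the ridge point, while the matrix-vector product does roughly one FLOP per byte and is limited by memory bandwidth.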
Adjacent topics
Where this connects inside the track.
- What “from scratch” means, and the tokenizer (lesson 1). That lesson named efficiency as the through-line and left the vocabulary-size trade-off to be quantified; this lesson is the accounting that quantifies it.
- Writing fast kernels: Triton and XLA (lesson 6). The arithmetic-intensity idea here is exactly what kernel fusion exploits; that lesson turns “raise intensity by fusing” into real code.
- Scaling laws (lesson 9). The 6ND compute estimate is the input to scaling-law reasoning, which decides how to spend a fixed compute budget across model size and data.
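For orientation, the 6ND estimate mentioned above can be sketched in a few lines of Python. It approximates training compute as 6 FLOPs per parameter per token (roughly 2 for the forward pass and 4 for the backward pass) and ignores attention and other non-parameter costs. The model size, token count, hardware peak, and utilization below are hypothetical numbers chosen only to make the arithmetic concrete.

```python
SECONDS_PER_DAY = 86_400


def training_flops(num_params: int, num_tokens: int) -> int:
    """Approximate total training compute: C = 6 * N * D."""
    return 6 * num_params * num_tokens


def training_days(total_flops: float, peak_flops_per_s: float, utilization: float) -> float:
    """Wall-clock days at a given hardware peak and achieved utilization (0 to 1)."""
    seconds = total_flops / (peak_flops_per_s * utilization)
    return seconds / SECONDS_PER_DAY


# Hypothetical run: a 1B-parameter model trained on 20B tokens.
N = 10**9
D = 20 * 10**9
C = training_flops(N, D)

# Hypothetical hardware: 100 TFLOP/s peak at 50% utilization.
days = training_days(C, peak_flops_per_s=100e12, utilization=0.5)

print(f"compute: {C:.2e} FLOPs")
print(f"time on one such device: {days:.1f} days")
```

The same two functions can be run in reverse for budgeting: fix the compute you can afford, then ask which pairs of N and D fit inside it.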