References: How models run on hardware
Source material
Source curriculum (structural mirror, cited as further study):

• Stanford CS336, "Language Modeling from Scratch", Lecture 5: GPUs, TPUs
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 5 (GPU/TPU execution model and memory hierarchy). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.
Watch this next
- Stanford CS336, Lecture 5: GPUs and TPUs by Hashimoto and Liang. The lecture this lesson mirrors. It goes further into bandwidth numbers and the precise mapping of compute units, useful when you start writing code that has to hit those tiles.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- NVIDIA’s GPU performance background. A clear primer on memory-bound vs compute-bound operations, tensor-core arithmetic, and the roofline model. The vendor’s own explanation of the picture this lesson gives.
- “In-Datacenter Performance Analysis of a Tensor Processing Unit” by Jouppi et al. (2017). The original TPU paper, with the systolic-array diagram and the engineering case behind the matmul bet. Surprisingly readable.
- “Anatomy of High-Performance Matrix Multiplication” by Goto and van de Geijn (2008). Written for CPU cache hierarchies rather than GPUs, but it is the canonical explanation of why matmul is tiled the way it is. Read it as the abstract version of the memory-hierarchy story told here.
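The memory-bound vs compute-bound distinction these readings share can be put into a few lines of arithmetic. The sketch below is a minimal roofline model, not any vendor's tool: the function names are made up for illustration, and the peak throughput and bandwidth figures are round hypothetical numbers, not the specifications of a real chip. It also assumes each matrix moves through memory exactly once, which is the best case that tiling tries to approach.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline: throughput is capped by compute or by memory traffic,
    whichever limit is lower at this arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, intensity * mem_bandwidth)


def matmul_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B with A (m x k) and B (k x n).

    Assumption: A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical accelerator: round numbers chosen for illustration only.
PEAK_FLOPS = 100e12    # 100 TFLOP/s
MEM_BANDWIDTH = 1e12   # 1 TB/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # intensity at which the two limits meet

for size in (64, 4096):
    ai = matmul_intensity(size, size, size)
    perf = attainable_flops(ai, PEAK_FLOPS, MEM_BANDWIDTH)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"n={size}: {ai:.1f} FLOPs/byte, {perf / 1e12:.1f} TFLOP/s attainable, {regime}")
```

For square matrices with 2-byte elements the intensity works out to n/3 FLOPs per byte, so under these assumed numbers the small matmul sits on the bandwidth slope and the large one reaches the compute ceiling.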
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). Arithmetic intensity, introduced there, is the lens this lesson applies to physical hardware: the bandwidth gap between HBM and the compute units is what makes the number matter.
- Writing fast kernels: Triton and XLA (lesson 6). The next lesson turns “stage data into fast memory and reuse it” into code, with custom kernels that fuse operations to raise intensity. This is the idea behind FlashAttention.
- Parallelism (lesson 7). Once a single GPU is fully used, the next move is many of them; that lesson uses this hardware picture to explain why each parallelism scheme exists.
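To see why fusing operations raises intensity, it can help to count bytes for a chain of elementwise operations. The sketch below is a toy counting model under stated assumptions: one FLOP per element per operation, 2-byte elements, and every unfused operation reading its input from slow memory and writing its output back. The function name is hypothetical, and this is not how FlashAttention or any real kernel is implemented; it only illustrates the accounting.

```python
def elementwise_chain_cost(num_elements, num_ops, bytes_per_element=2, fused=False):
    """Return (flops, bytes_moved) for num_ops elementwise ops applied in sequence.

    Assumptions: one FLOP per element per op; an unfused op reads the whole
    tensor from slow memory and writes it back; a fused kernel reads once,
    keeps intermediates in fast memory, and writes once.
    """
    flops = num_elements * num_ops
    passes = 1 if fused else num_ops
    bytes_moved = passes * 2 * num_elements * bytes_per_element  # read + write per pass
    return flops, bytes_moved


n = 1_000_000
for fused in (False, True):
    flops, moved = elementwise_chain_cost(n, num_ops=3, fused=fused)
    label = "fused" if fused else "unfused"
    print(f"{label}: {moved / 1e6:.0f} MB moved, {flops / moved:.2f} FLOPs/byte")
```

The FLOP count is identical in both cases; only the memory traffic changes, so in this model fusing three operations triples the intensity.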