How models run on hardware, GPUs and TPUs
What you’ll learn
Phase 1 built the model; Phase 2 makes it run fast, and this opener takes apart the machine that runs it. The source curriculum is Stanford CS336, Lecture 5, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.
You will see how a GPU executes math (SIMT, streaming multiprocessors, tensor cores) and what it takes to reach its peak FLOPs; the memory hierarchy (HBM, SRAM, registers) that decides whether the math units stay fed; the physical reason arithmetic intensity from lesson 2 matters; how a TPU’s systolic array makes the same bet on matmul in a different way; and how to diagnose a job that runs far below its GPU’s published peak.
Where this fits
This is lesson 5 of 14, the first lesson of Phase 2 (systems and efficiency). It bridges Phase 1’s accounting to the hardware that makes the accounting concrete. The next lesson (kernels) turns “stage data into fast memory and reuse it” into code; the parallelism lesson uses this hardware picture to explain why each scheme exists.
Before you start
Prerequisites: lesson 2 (the FLOP and memory accounting and the arithmetic-intensity idea this lesson grounds in real chips). Familiarity with the architecture from lesson 3 helps for the “why hardware shapes design” section. No installs are needed; this is a conceptual lesson on the chip.
About the math
None. The lesson explains hardware structure and the physical reason arithmetic intensity matters, without new formulas. Order-of-magnitude comparisons (HBM bandwidth vs compute throughput) are the only quantitative arguments.
By the end, you’ll be able to
The single capability this lesson builds: explain at a working level how GPUs and TPUs execute model computation, and why that shapes design choices. Concretely, you will be able to:
- Describe the GPU execution model (SIMT, SMs, warps, tensor cores)
- Name the GPU memory hierarchy and what each tier holds
- Explain why arithmetic intensity is a physical property of memory speeds
- Describe how a TPU’s systolic array differs from a GPU
- Diagnose why a job runs below the GPU’s published peak FLOPs
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 10 minutes (diagnose-the-slow-job + TPU/mixed-precision reasoning, plus flashcards)
- Difficulty: deep (Stage C; conceptual hardware lesson, reads through lesson 2’s accounting)