GPUs and TPUs: brief

What you’ll learn

Phase 1 built the model; Phase 2 makes it run fast, and this opener takes apart the machine that runs it. The source curriculum is Stanford CS336, Lecture 5, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will see how a GPU executes math (SIMT, streaming multiprocessors, tensor cores) and what it takes to reach its peak FLOPs; the memory hierarchy that decides whether the compute units stay fed (HBM, SRAM, registers); the physical reason arithmetic intensity from lesson 2 matters; how a TPU’s systolic array makes the same matmul bet differently; and how to diagnose a job that runs far below its GPU’s published peak.

Where this fits

This is lesson 5 of 14, the first lesson of Phase 2 (systems and efficiency). It bridges Phase 1’s accounting to the hardware that makes the accounting concrete. The next lesson (kernels) turns “stage data into fast memory and reuse it” into code; the parallelism lesson uses this hardware picture to explain why each scheme exists.

Before you start

Prerequisites: lesson 2 (the FLOP and memory accounting and the arithmetic-intensity idea this lesson grounds in real chips). Familiarity with the architecture from lesson 3 helps for the “why hardware shapes design” section. No installs are needed; this is a conceptual lesson on the chip.

About the math

None. The lesson explains hardware structure and the physical reason arithmetic intensity matters, without new formulas. Order-of-magnitude comparisons (HBM bandwidth vs compute throughput) are the only quantitative arguments.

By the end, you’ll be able to

The single capability this lesson builds: explain at a working level how GPUs and TPUs execute model computation, and why that shapes design choices. Concretely, you will be able to:

Describe the GPU execution model (SIMT, SMs, warps, tensor cores)
Name the GPU memory hierarchy and what each tier holds
Explain the physical reason arithmetic intensity matters (memory bandwidth lags compute throughput)
Describe how a TPU’s systolic array differs from a GPU’s approach to matmul
Diagnose why a job runs below the GPU’s published peak FLOPs
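
The diagnosis in the last objective usually starts with a back-of-the-envelope comparison between a kernel's arithmetic intensity and the chip's ratio of compute throughput to memory bandwidth. The sketch below is one way to write that comparison down, not the course's own code. The `Chip` class, the helper names, and the hardware numbers are all hypothetical round figures chosen for illustration, and the intensity estimate assumes each matrix crosses HBM exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """A hypothetical accelerator. Numbers are placeholders, not a datasheet."""
    peak_flops: float      # peak compute throughput, FLOP per second
    mem_bandwidth: float   # HBM bandwidth, bytes per second


def ridge_point(chip: Chip) -> float:
    """FLOPs per byte a kernel needs before compute, not memory, is the limit."""
    return chip.peak_flops / chip.mem_bandwidth


def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity of (m x k) @ (k x n), assuming each matrix
    crosses HBM exactly once (perfect on-chip reuse)."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(chip: Chip, intensity: float) -> float:
    """Upper bound on throughput: the lower of the compute and memory ceilings."""
    return min(chip.peak_flops, intensity * chip.mem_bandwidth)


def diagnose(chip: Chip, intensity: float) -> str:
    """Label a kernel by which ceiling it hits first."""
    return "memory-bound" if intensity < ridge_point(chip) else "compute-bound"


if __name__ == "__main__":
    # Assumed round numbers: 1e15 FLOP/s peak, 2e12 bytes/s of HBM bandwidth.
    chip = Chip(peak_flops=1e15, mem_bandwidth=2e12)
    for name, shape in [("square matmul", (4096, 4096, 4096)),
                        ("matrix-vector", (1, 4096, 4096))]:
        ai = matmul_intensity(*shape)
        frac = attainable_flops(chip, ai) / chip.peak_flops
        print(f"{name}: {ai:.1f} FLOP/byte, {diagnose(chip, ai)}, "
              f"at most {frac:.1%} of peak")
```

With these assumed numbers the ridge point is 500 FLOP per byte. The large square matmul sits well above it, while the matrix-vector product does roughly one FLOP per byte and is capped at a fraction of a percent of peak no matter how many tensor cores the chip has.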

Time and difficulty

Read time: about 13 minutes
Practice time: about 10 minutes (diagnose-the-slow-job + TPU/mixed-precision reasoning, plus flashcards)
Difficulty: deep (Stage C; conceptual hardware lesson that builds on lesson 2’s accounting)