Counting the cost: FLOPs, memory, and arithmetic intensity
What you’ll learn
Lesson 1 named efficiency as the through-line of this track; this lesson is the accounting that makes it real. You will learn to estimate a model’s cost on paper before spending any compute. The source curriculum is Stanford CS336, Lecture 2, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.
You will learn to count matmul FLOPs and apply the 6ND rule to scope a training run; estimate training memory (parameters, gradients, and optimizer states, the 16N rule of thumb, plus activations); understand arithmetic intensity and the compute-bound versus memory-bound distinction; see why a GPU often runs far below its advertised peak FLOPs; and read the tensor-reshaping code that dominates model implementations, written legibly with einops.
Where this fits
This is lesson 2 of 14, the second lesson of Phase 1 (the model). It deliberately comes before the architecture lesson, because every later design decision (model size, training length, kernels, parallelism, scaling) is argued in terms of FLOPs and memory. It quantifies the vocabulary-size trade-off lesson 1 left open, and its arithmetic-intensity idea sets up the kernels and parallelism lessons in Phase 2.
Before you start
Prerequisites: lesson 1 of this track (the from-scratch overview, where efficiency was named the through-line). You should be comfortable with basic arithmetic with large numbers (scientific notation) and with reading PyTorch tensor operations. No new installs are needed to follow along; the einops examples are short.
About the math
Real but light: arithmetic, not calculus. You will multiply and divide large numbers (FLOPs, bytes, seconds) to estimate cost, and reason about ratios (arithmetic intensity). Every formula is a counting argument (2mnk for a matmul, 6ND for training, 16N for memory), derived from first principles rather than asserted.
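As a minimal sketch of the three counting rules named above, the following Python computes each one. The function names are illustrative. The 16-bytes-per-parameter breakdown (fp32 parameters, fp32 gradients, and two fp32 Adam moment buffers, with activations excluded) is an assumption about how the 16N rule is commonly derived, not something this page states. The 7-billion-parameter, 1-trillion-token run is a made-up example.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul.

    Each of the m*n output entries is a dot product of length k:
    k multiplies plus k adds, so roughly 2*m*n*k in total.
    """
    return 2 * m * n * k


def training_flops(num_params: int, num_tokens: int) -> int:
    """6ND rule: about 2 FLOPs per parameter per token for the forward
    pass and about 4 for the backward pass."""
    return 6 * num_params * num_tokens


def training_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """16N rule (ASSUMED breakdown, activations excluded):
    4 bytes parameters + 4 bytes gradients + 8 bytes Adam moments."""
    return bytes_per_param * num_params


if __name__ == "__main__":
    N = 7 * 10**9   # hypothetical parameter count
    D = 10**12      # hypothetical number of training tokens
    print(f"training FLOPs : {training_flops(N, D):.2e}")
    print(f"training state : {training_state_bytes(N) / 1e9:.0f} GB")
```

Under these assumptions the run costs about 4.2e22 FLOPs and needs about 112 GB for parameters, gradients, and optimizer state, before any activations are counted.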
By the end, you’ll be able to
The single capability this lesson builds: account for a model’s compute and memory cost (FLOPs, memory, arithmetic intensity), and read PyTorch tensor operations with einops. Concretely, you will be able to:
- Estimate matmul FLOPs and apply the 6ND rule to scope a training run
- Estimate training memory (parameters, gradients, optimizer states; the 16N estimate)
- Explain arithmetic intensity and the compute-bound vs memory-bound distinction
- Explain why a GPU often runs below its peak FLOPs
- Read tensor operations written with einops
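To make the third and fourth objectives concrete, here is a rough sketch of arithmetic intensity for a matmul. It uses a simplified model in which each input matrix is read from memory once and the output is written once. The accelerator figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical round numbers chosen for illustration, not the specification of any real GPU, and the function names are likewise made up.

```python
PEAK_FLOPS = 1e15      # HYPOTHETICAL peak throughput, FLOP/s
MEM_BANDWIDTH = 2e12   # HYPOTHETICAL memory bandwidth, bytes/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs per byte needed to reach peak


def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte) of an (m x k) @ (k x n) matmul,
    assuming both inputs are read once and the output is written once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bound(intensity: float) -> str:
    """Classify a workload against the hardware's FLOPs-per-byte ratio."""
    return "compute-bound" if intensity >= RIDGE else "memory-bound"


def attainable_flops(intensity: float) -> float:
    """Throughput ceiling: limited by either peak compute or by how fast
    memory can feed the arithmetic units."""
    return min(PEAK_FLOPS, intensity * MEM_BANDWIDTH)


if __name__ == "__main__":
    for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
        ai = matmul_intensity(*shape)
        share = attainable_flops(ai) / PEAK_FLOPS
        print(shape, f"intensity={ai:.2f}", bound(ai), f"ceiling={share:.1%} of peak")
```

In this model the large square matmul has an intensity of about 1365 FLOPs per byte and is compute-bound, while the matrix-vector case (m = 1) sits near 1 FLOP per byte and is memory-bound, so its ceiling is a small fraction of the advertised peak no matter how well the kernel is written.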
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 12 minutes (scope a training run’s FLOPs and memory by hand, plus flashcards)
- Difficulty: deep (Stage C; arithmetic-heavy but no calculus; the reasoning is counting)