Why precision matters: quantization and mixed precision
What you’ll learn
A weight stored in 16 bits takes half the memory of the same weight in 32 bits, and a GPU can often process it roughly twice as fast. The natural question is whether you actually need the extra precision. The answer turns out to be a careful no, and the technique that exploits that answer closes Phase 3. This lesson covers what floating-point precision means in practice (sign, exponent, mantissa), what you save by using fewer bits, what quantization is as a post-training operation, and how mixed precision training keeps weights at high precision while running the heavy operations at lower precision in the same step. The intuition behind the asymmetry is the load-bearing idea: weights accumulate many small updates across steps (so they need to stay precise), while activations and gradients are temporary and recomputed every step (so they can tolerate the noise).
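As a rough preview of the bit structure and the memory arithmetic, here is a minimal sketch using only Python's standard library. It assumes the IEEE 754 layouts for 32-bit floats (1 sign, 8 exponent, 23 mantissa bits) and 16-bit half-precision floats (1 sign, 5 exponent, 10 mantissa bits). The helper names `float_fields` and `weight_memory_gb` and the 7-billion-parameter model size are hypothetical, chosen only for illustration.

```python
import struct

# struct code -> (total bits, exponent bits, same-width unsigned int code)
LAYOUTS = {
    "f": (32, 8, "I"),  # float32: 1 sign, 8 exponent, 23 mantissa bits
    "e": (16, 5, "H"),  # float16: 1 sign, 5 exponent, 10 mantissa bits
}

def float_fields(value, fmt):
    """Return (sign, exponent, mantissa) bit strings for value stored in fmt."""
    total_bits, exp_bits, int_code = LAYOUTS[fmt]
    # Pack as a float, then reinterpret the same bytes as an unsigned integer.
    (raw,) = struct.unpack(">" + int_code, struct.pack(">" + fmt, value))
    bits = format(raw, "0{}b".format(total_bits))
    return bits[0], bits[1:1 + exp_bits], bits[1 + exp_bits:]

def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7_000_000_000  # hypothetical model size, for illustration only

print(float_fields(0.15625, "f"))  # same value, 32-bit layout
print(float_fields(0.15625, "e"))  # same value, 16-bit layout
print(weight_memory_gb(N_PARAMS, 32), weight_memory_gb(N_PARAMS, 16))
```

The same number occupies half as many bits in the 16-bit layout, mostly by giving up mantissa bits, and the weight memory halves with it. This counts weights only; gradients, optimizer states, and activations are extra.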
Where this fits
This is lesson 4 of Phase 3, How models are trained at scale, and the phase closer. The previous three lessons covered pretraining as one objective at extraordinary scale, the Chinchilla rule for spending compute optimally, and the parallelism + Flash Attention engineering that distributes memory across and inside GPUs. This lesson adds the fourth memory lever: the precision of the numbers themselves. The previous lesson in the phase was Why pretraining is a memory engineering problem (parallelism and Flash Attention).
Before you start
Prerequisites: the parallelism and Flash Attention lesson (Phase 3 lesson 3). You should be comfortable with the memory-during-training picture (parameters + gradients + Adam optimizer states + activations) and the GPU memory hierarchy (HBM vs SRAM). No floating-point background needed; the bit structure is introduced from scratch.
By the end, you’ll be able to
- Describe how a floating-point number is represented as sign, exponent, and mantissa bits
- Explain why lower-precision representations save both memory and compute time
- Distinguish quantization (post-training conversion) from mixed precision training (per-step precision allocation)
- Recognize the asymmetry behind mixed precision: weights stay in high precision, activations and gradients can tolerate lower precision
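The asymmetry in the last objective can be previewed with a small experiment. This is a sketch under stated assumptions, not how a real training framework stores weights: it uses the standard library's `struct` module to round values to 16-bit and 32-bit floats, and the starting weight, update size, and step count are hypothetical. Near 1.0, adjacent 16-bit values are about 0.001 apart, so an update of 0.0001 is rounded away every time.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest 16-bit (half-precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest 32-bit (single-precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

UPDATE = 1e-4  # hypothetical per-step weight update
STEPS = 1000   # hypothetical number of training steps

w16 = to_fp16(1.0)  # weight kept in 16 bits
w32 = to_fp32(1.0)  # weight kept in 32 bits
for _ in range(STEPS):
    # Each step adds the update, then rounds back to the storage format.
    w16 = to_fp16(w16 + UPDATE)
    w32 = to_fp32(w32 + UPDATE)

print("16-bit weight after training:", w16)
print("32-bit weight after training:", w32)
```

The 16-bit weight never moves, while the 32-bit weight ends close to 1.1. A single noisy activation or gradient costs one step a little accuracy, but a weight that silently drops every update stops learning.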
Time and difficulty
- Read time: about 18 minutes (shorter than lessons 1-3 because the topic is narrower)
- Practice time: about 12 minutes (worked memory-savings calculation plus flashcards)
- Difficulty: standard