Weight update arithmetic: FP32 (so the update itself is precise)
The asymmetry has a clean intuition. Weights accumulate over training; precision errors compound across millions of steps. Activations and gradients are temporary (recomputed each step) and the underlying data is itself statistically noisy, so they can tolerate the noise.
Result per the original paper: not noticeably degraded performance, substantial memory savings, faster training.
Same memory footprint, different bit allocation. BF16 has FP32’s exponent range; FP16 does not. BF16 is more robust at scale.
Mixed precision means each layer is at a different precision
No. Mixed precision means some operations (forward / backward arithmetic) run in low precision while master weight storage and weight updates run in high precision. Same parameter exists at multiple precisions during one step.
Quantization always works
Often it works at FP32 → FP16. Below that, results depend on the model. INT4 quantization that works for a strong frontier model may noticeably degrade a smaller one.
Quantize-then-fine-tune is the same as fine-tune-then-quantize
They produce different artifacts. Quantize-then-fine-tune lets fine-tuning compensate for the rounding. Fine-tune-then-quantize may degrade more under the rounding.
Floating-point number: a number stored as bits divided into sign, exponent, and mantissa.
FP32: single-precision floating point, 32 bits per number. Historical default.
FP16: half-precision floating point, 16 bits per number.
FP64: double-precision floating point, 64 bits per number. Rare in LLMs.
BF16 (brain float 16): 16-bit format with FP32’s exponent range and fewer mantissa bits than FP16.
FP8: 8-bit floating-point format. Newest of the four flavors covered.
INT8 / INT4: integer quantization formats (no fractional bits). Used for aggressive compression at deployment time.
Quantization: post-training conversion from one precision representation to another (almost always higher to lower).
Mixed precision training: a training technique using different precisions in different parts of one training step (weights at FP32, operations at FP16, updates at FP32).
Precision is the third memory lever. Quantization shrinks deployed models; mixed precision shrinks training. Weights stay sharp; activations and gradients tolerate noise.