
Cheatsheet: Why precision matters: quantization and mixed precision

Lower precision = less memory per number + faster compute on the GPU.
Quantization (post-training): shrink a trained model
Mixed precision training: shrink training itself
Weights stay precise (drift accumulates).
Activations and gradients tolerate noise (data is noisy too).

| Component | What it encodes |
| --- | --- |
| Sign | 1 bit. Positive or negative. |
| Exponent | A small group of bits. Magnitude (like scientific notation). |
| Mantissa | The rest. The digits, i.e. precision/granularity. |

More mantissa bits → more precision. Fewer mantissa bits → more rounding: numbers that are close together end up represented identically.
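To make the three fields concrete, here is a minimal Python sketch using only the standard library. The field widths it relies on (8 exponent and 23 mantissa bits for FP32, 5 and 10 for FP16) come from IEEE 754 rather than from this cheatsheet, and the helper names `fp32_fields`, `round_to_fp32`, and `round_to_fp16` are illustrative.

```python
import struct


def fp32_fields(x: float) -> tuple[int, int, int]:
    """Return the (sign, exponent, mantissa) bit fields of x stored as FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (stored with a bias of 127)
    mantissa = bits & 0x7FFFFF        # 23 mantissa bits
    return sign, exponent, mantissa


def round_to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(fp32_fields(1.0))    # (0, 127, 0)
print(fp32_fields(-2.0))   # (1, 128, 0)

# Two close numbers: FP32 keeps them apart, FP16 stores them identically.
a, b = 1.0, 1.0001
print(round_to_fp32(a) == round_to_fp32(b))   # False
print(round_to_fp16(a) == round_to_fp16(b))   # True
```

With only 10 mantissa bits, FP16 values near 1.0 are spaced about 0.001 apart, so 1.0001 collapses onto 1.0.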

| Format | Bits | Memory | Notes |
| --- | --- | --- | --- |
| FP32 (single precision) | 32 | Baseline | The historical default for general numerical computing. |
| FP16 (half precision) | 16 | 0.5x FP32 | Half memory, less precise, smaller numerical range than FP32. |
| FP64 (double precision) | 64 | 2x FP32 | More precision than FP32. Rare in LLM training. |
| BF16 (brain float 16) | 16 | 0.5x FP32 | Same memory as FP16; bits divided differently (more exponent, fewer mantissa). FP32-like range. Preferred over FP16 for training large models. |
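The FP16 versus BF16 trade-off in the table can be sketched in a few lines of Python. This is an emulation under stated assumptions, not a library API: `to_bf16` is a hypothetical helper that keeps the top 16 bits of the FP32 encoding (1 sign, 8 exponent, 7 mantissa bits) with round-to-nearest-even, and `to_fp16` maps values too large for FP16 to infinity, as hardware conversion typically does.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Emulate BF16 by rounding the FP32 encoding of x to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp16(x: float) -> float:
    """Round x to FP16; values beyond the FP16 range become +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values, so map them to infinity here.
        return math.copysign(math.inf, x)


# Range: BF16 keeps a large value finite, FP16 overflows.
print(to_bf16(1e10), to_fp16(1e10))

# Precision: FP16 keeps more digits near 1.0 than BF16 does.
print(to_bf16(1.001), to_fp16(1.001))
```

The first line shows BF16's FP32-like range; the second shows what it gives up for that range, since 1.001 survives (approximately) in FP16 but rounds to 1.0 in BF16.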

| Lever | Detail |
| --- | --- |
| Memory per parameter | Each step down in precision halves storage. 70B-parameter model: 280 GB in FP32, 140 GB in FP16, 70 GB in FP8/INT8. |
| Compute throughput | Lecturer’s anchor: FP64 at 34 teraflops, FP32 doubling, lower precisions continuing. Each step down ≈ 2x speedup. |
| Both stack | Smaller and faster on the same hardware. |
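The memory figures above are just parameter count times bytes per parameter. A small sketch of that arithmetic follows; it counts weights only (no activations, optimizer state, or runtime overhead), takes 1 GB as 10^9 bytes, and the function name `weight_memory_gb` is illustrative.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {
    "FP64": 8,
    "FP32": 4,
    "FP16": 2,
    "BF16": 2,
    "FP8": 1,
    "INT8": 1,
    "INT4": 0.5,
}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in ("FP32", "FP16", "INT8", "INT4"):
    print(fmt, weight_memory_gb(70e9, fmt), "GB")
```

For a 70B-parameter model this reproduces the 280 / 140 / 70 GB sequence, with INT4 continuing the halving to 35 GB.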
Train at high precision (typically FP32 or BF16).
Convert weights to lower precision before deploying.
Hope that the rounding error does not change behavior noticeably.

| Step | Difficulty |
| --- | --- |
| FP32 → FP16 / BF16 | Usually essentially free for modern LLMs |
| FP32 → FP8 | Requires more care; calibration helps |
| FP32 → INT8 | Common for inference on consumer hardware |
| FP32 → INT4 | Aggressive; sometimes degrades long-form reasoning or math |
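As a rough illustration of the convert-then-hope steps above, here is a sketch of one simple INT8 scheme: symmetric per-tensor "absmax" scaling. This particular scheme is an assumption chosen for brevity, not something the cheatsheet prescribes; production quantizers typically add per-channel scales and calibration data. The names `quantize_int8` and `dequantize` are illustrative.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    # Largest magnitude maps to 127; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, worst_error)
```

The "hope" in step three is that errors of this size, spread over every weight, do not change the model's behavior noticeably; the table above indicates how that gets harder as the target format gets smaller.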

Mixed precision training (during training)

Master copy of weights: FP32 (precision protects against drift)
Forward + backward pass ops: FP16 (data is noisy; tolerates rounding)
Weight update arithmetic: FP32 (so the update itself is precise)

The asymmetry has a clean intuition. Weights persist and accumulate updates throughout training, so precision errors in them compound across millions of steps. Activations and gradients are temporary (recomputed each step), and the underlying data is itself statistically noisy, so they can tolerate rounding noise.

Result, per the original paper: no noticeable loss in model performance, substantial memory savings, and faster training.
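A toy simulation can show why the master copy stays in FP32. It is not a real training loop: a single weight receives the same small update 1,000 times, the update is computed in FP16 (standing in for forward/backward arithmetic), and only the storage precision of the weight differs. The helpers `fp16`, `fp32`, and `run_updates` are illustrative, and rounding is emulated with the standard `struct` module.

```python
import struct


def fp16(x: float) -> float:
    """Round to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp32(x: float) -> float:
    """Round to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def run_updates(steps: int, update: float, store) -> float:
    """Apply `update` to a weight `steps` times, storing it via `store` each step."""
    weight = store(1.0)
    for _ in range(steps):
        low_precision_update = fp16(update)   # result of FP16 forward/backward ops
        weight = store(weight + low_precision_update)
    return weight


with_fp32_master = run_updates(1000, 1e-4, store=fp32)
with_fp16_master = run_updates(1000, 1e-4, store=fp16)

print(with_fp32_master)   # close to 1.1: the small updates accumulate
print(with_fp16_master)   # 1.0: every update is rounded away
```

Each update is smaller than half the gap between adjacent FP16 values near 1.0, so an FP16-stored weight never moves, while the FP32 master copy absorbs all 1,000 updates.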

| Phenomenon | What it tells you |
| --- | --- |
| Model running on a laptop or phone | Almost certainly quantized below FP16 (often INT8 or INT4). FP32 would not fit. |
| Locally-downloaded quantized model behaves differently from the hosted version | This is the precision side of the “same model” not actually being the same artifact. |
| A quantized model “feels dumber” on hard tasks | Aggressive low-precision quantization (INT4 and below) sometimes degrades long arithmetic and careful reasoning more than other capabilities. |
| Press release citing FP8 / BF16 / INT4 support | This concerns the bytes-per-number side of training and deployment economics. |

| Pitfall | Reality |
| --- | --- |
| FP16 and BF16 are interchangeable | Same memory footprint, different bit allocation. BF16 has FP32’s exponent range; FP16 does not. BF16 is more robust at scale. |
| Mixed precision means each layer is at a different precision | No. Mixed precision means some operations (forward / backward arithmetic) run in low precision while master weight storage and weight updates run in high precision. The same parameter exists at multiple precisions during one step. |
| Quantization always works | Often it works at FP32 → FP16. Below that, results depend on the model. INT4 quantization that works for a strong frontier model may noticeably degrade a smaller one. |
| Quantize-then-fine-tune is the same as fine-tune-then-quantize | They produce different artifacts. Quantize-then-fine-tune lets fine-tuning compensate for the rounding. Fine-tune-then-quantize may degrade more under the rounding. |
  • Floating-point number: a number stored as bits divided into sign, exponent, and mantissa.
  • FP32: single-precision floating point, 32 bits per number. Historical default.
  • FP16: half-precision floating point, 16 bits per number.
  • FP64: double-precision floating point, 64 bits per number. Rare in LLMs.
  • BF16 (brain float 16): 16-bit format with FP32’s exponent range and fewer mantissa bits than FP16.
  • FP8: 8-bit floating-point format. Newest of the floating-point formats covered.
  • INT8 / INT4: integer quantization formats (no fractional bits). Used for aggressive compression at deployment time.
  • Quantization: post-training conversion from one precision representation to another (almost always higher to lower).
  • Mixed precision training: a training technique using different precisions in different parts of one training step (weights at FP32, operations at FP16, updates at FP32).

Precision is the third memory lever.
Quantization shrinks deployed models; mixed precision shrinks training.
Weights stay sharp; activations and gradients tolerate noise.