References: Why precision matters: quantization and mixed precision

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 4, LLM training):
https://www.youtube.com/watch?v=VlA_jt_3Qc4
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the quantization + mixed-precision section of
Stanford CME 295 Lecture 4 (~52m25s to ~58m14s, the closing engineering
arc of the lecture before the Q&A). With this lesson, our adaptation of
Lecture 4 is complete and Phase 3 closes. Clawdemy provides original
notes, summaries, and quizzes derived from this material for educational
purposes. All rights to the original lectures remain with Stanford and
the instructors.

A short list, chosen for durability.

  • “Mixed Precision Training”, Micikevicius et al., 2017. The original mixed-precision paper. The FP32-master-weights, FP16-forward/backward, FP32-weight-update pattern this lesson covers comes from here. Section 3 walks through the loss-scaling trick that makes the technique stable in practice.

  • “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”, Dettmers et al., 2022. The first widely deployed INT8 quantization scheme for transformer language models that did not noticeably degrade capability. Section 3 has the outlier-feature-handling trick. Pairs with the body’s “INT8 is a popular choice for inference on consumer hardware” claim.

  • “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”, Frantar et al., 2022. The GPTQ scheme used widely in open-source quantized model releases. Drives much of the INT4 quantization you see on Hugging Face today.

  • “FP8 Formats for Deep Learning”, Micikevicius et al., 2022. Two FP8 variants (E4M3 and E5M2) used in modern frontier training. The newest of the precision rungs the Stanford lecturer pointed at on the GPU spec sheet.

  • NVIDIA’s mixed-precision training guide. The practical-engineering side of this lesson. APIs, loss scaling, BF16 vs FP16 selection, common bugs. Useful if you ever read or write training code.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style. The cheatsheet’s “optimizations” section names quantization alongside distillation.

  • Quantization formats in the open-weights ecosystem. GGUF, AWQ, GPTQ, and EXL2 are the names you will see attached to downloadable quantized models. They are not all the same kind of thing: GGUF and EXL2 are file formats tied to particular runtimes (llama.cpp and ExLlamaV2), while AWQ and GPTQ are quantization algorithms. A model quantized one way cannot simply be loaded as if it were another. Practical search terms: “quantized LLM formats comparison.”

  • Why frontier training uses BF16 over FP16. The BF16 vs FP16 comparison this lesson hinted at is treated more deeply in NVIDIA’s training documentation. The short version: both formats use 16 bits, but FP16 has only 5 exponent bits while BF16 keeps FP32’s 8 exponent bits and gives up mantissa precision instead. At the gradient scales that frontier training reaches, FP16 underflows or overflows on too many values, while BF16’s wider range stays stable. BF16 became the default for training large models because of this.

  • Calibration in INT8/INT4 quantization. Aggressive quantization typically requires running representative data through the model before quantizing, so the algorithm can pick the integer scaling factors that preserve the most information. The LLM.int8() and GPTQ papers above are good entry points.

  • Mixed precision plus ZeRO plus Flash Attention. Frontier training uses all of Phase 3’s techniques together. The interaction between mixed precision and ZeRO is non-trivial: ZeRO partitions weights, gradients, and optimizer states across GPUs, and the mixed-precision math (which has FP32 master weights plus FP16 working copies) needs to be threaded through the partitioning carefully. The PyTorch FSDP documentation covers this.

  • Phase 4 preview. Tuning a base model into a usable assistant happens at much smaller scale than pretraining. The precision tricks in this lesson still apply but the dollar stakes drop dramatically. Phase 4 starts with instruction tuning, the simplest of the post-training stages.
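
To make the FP16 range problem and the loss-scaling trick from the items above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any training framework implements mixed precision: the helper names to_fp16 and to_bf16, the gradient value 1e-8, and the scale factor 65536 are all hypothetical choices for this example. BF16 is approximated by truncating an FP32 value to its top 16 bits, whereas real hardware typically rounds to nearest. The gradient is scaled directly here; in the Micikevicius et al. scheme the loss is scaled before the backward pass, which scales every gradient by the same factor.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range;
        # we return infinity here to mimic an overflow.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Approximate bfloat16 (8 exponent bits, 7 mantissa bits).

    Simplification: truncate the FP32 bit pattern to its top 16 bits.
    Only valid for inputs inside the FP32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


tiny_grad = 1e-8       # hypothetical small gradient value
loss_scale = 65536.0   # hypothetical scale factor (2**16)

# Without scaling, the gradient is below FP16's smallest subnormal.
fp16_plain = to_fp16(tiny_grad)

# With scaling, the value lands in FP16's normal range; dividing by the
# scale afterwards (in higher precision) recovers the gradient.
fp16_scaled = to_fp16(tiny_grad * loss_scale)
recovered = fp16_scaled / loss_scale

# BF16 has enough exponent range to hold the value without scaling.
bf16_plain = to_bf16(tiny_grad)

# The same range gap shows up at the large end.
fp16_big = to_fp16(70000.0)
bf16_big = to_bf16(70000.0)

print("FP16, no scaling:     ", fp16_plain)
print("FP16, scaled/unscaled:", recovered)
print("BF16, no scaling:     ", bf16_plain)
print("70000 in FP16 / BF16: ", fp16_big, "/", bf16_big)
```

The trade-off is visible in the BF16 results: the values survive, but with only about two to three significant decimal digits, which is why the FP32 master copy of the weights still matters.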

The primary papers, in chronological order.

None selected for this lesson. The precision-and-quantization space at the level of this lesson is consolidated in the academic literature and in the documentation of the major training frameworks. Durable references will be added at a future quarterly review if any consolidate.