References: Why pretraining is a memory engineering problem (parallelism and Flash Attention)
Source material
Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 4, LLM training): https://www.youtube.com/watch?v=VlA_jt_3Qc4
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the parallelism + ZeRO + Flash Attention section of Stanford CME 295 Lecture 4 (~26m34s to ~50m02s, the central engineering arc of the lecture). The lecture continues into quantization (covered in Phase 3, lesson 4). Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the instructors.
Going deeper
A short list, chosen for durability.
- “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, Dao, Fu, Ermon, Rudra, Ré, 2022. The original Flash Attention paper. Section 3 covers the tiling and the block-by-block softmax math; section 4 has the speedup benchmarks. The “IO-awareness” framing in the title is exactly the SRAM-vs-HBM data-movement story this lesson covers.
- “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, Rajbhandari et al., 2020. The ZeRO paper. It lays out the three-stage ZeRO-1/2/3 stack and the memory-vs-communication trade-offs between the stages. The companion to the Stanford lecturers’ discussion.
- “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, Shoeybi et al., 2019. The canonical tensor-parallelism paper for transformer training. Section 3 walks through the matrix-multiplication splitting in practice. Megatron-LM has since become a widely used training framework for large-scale runs.
- The PyTorch FSDP documentation. Fully Sharded Data Parallel (FSDP) is PyTorch’s implementation of ZeRO-3-style sharding. Useful as the practical-engineering side of what the Stanford lecture covers conceptually. If you ever read or write training code, this is the API you will see.
- Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style. The Amidi cheatsheet treats parallelism briefly; this lesson goes deeper.
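The block-by-block softmax mentioned in the FlashAttention entry above can be illustrated with a small sketch. It is not the paper's kernel: it handles a single query row, treats each value as a scalar rather than a vector, and runs on Python lists instead of SRAM tiles. The function names and the `block_size` parameter are hypothetical, chosen for illustration. It only shows the rescaling trick that lets a softmax-weighted sum be accumulated one block at a time without ever holding the full row of scores.

```python
import math


def softmax_weighted_sum_blockwise(scores, values, block_size=2):
    """Accumulate sum_i softmax(scores)_i * values_i one block at a time."""
    running_max = float("-inf")  # largest score seen so far
    running_denom = 0.0          # sum of exp(score - running_max) so far
    running_out = 0.0            # sum of exp(score - running_max) * value so far

    for start in range(0, len(scores), block_size):
        score_block = scores[start:start + block_size]
        value_block = values[start:start + block_size]

        new_max = max(running_max, max(score_block))
        # Rescale the old accumulators to the new reference maximum.
        # On the first block this factor is exp(-inf) = 0.0, which is harmless
        # because both accumulators are still zero.
        scale = math.exp(running_max - new_max)
        running_denom *= scale
        running_out *= scale

        for s, v in zip(score_block, value_block):
            weight = math.exp(s - new_max)
            running_denom += weight
            running_out += weight * v

        running_max = new_max

    return running_out / running_denom


def softmax_weighted_sum_reference(scores, values):
    """Same quantity computed the usual way, with all scores held at once."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Both functions should agree up to floating-point rounding for any block size. The blockwise version only ever needs one block of scores plus three running scalars, which is the property that lets the real kernel keep its working set in fast on-chip memory.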
Adjacent topics
Topics that build on or sit beside this one.
- Mixed precision training and quantization. This lesson assumed each parameter takes 2 bytes (FP16) or 4 bytes (FP32). The next lesson (Phase 3, lesson 4) covers what happens when you store weights in even lower precision (FP8, INT8, INT4). The two axes compose: parallelism distributes bytes across GPUs, and quantization reduces the bytes each parameter needs. A 70-billion-parameter model in INT8 with ZeRO-3 across 64 GPUs is the kind of stack a serious open-weights run might use.
- Why frontier training takes weeks to months. This lesson’s “weeks to months on large GPU clusters” framing is mostly a function of compute-throughput math (FLOPs per token times trillions of tokens). The communication overhead of ZeRO-3 + tensor parallelism + pipeline parallelism is a non-trivial fraction of that wall time. Frontier-class engineering teams spend significant effort tuning the parallelism topology to keep the GPUs as busy as possible.
- Multi-query attention and grouped-query attention. Phase 2’s attention efficiency tricks lesson covers MQA / GQA, which reduce the KV-cache memory at inference time. Flash Attention reduces attention’s training-time memory cost; MQA/GQA reduce its inference-time memory cost. They address different stages but stack cleanly.
- The KV cache. During inference (not training), the keys and values for previous tokens are cached so each new token does not require recomputing the whole context. The KV cache itself can become a major memory consumer for long-context inference. Phase 6 lessons on inference and serving will revisit this directly.
- Phase 4 preview: tuning is much smaller. All the techniques in this lesson are about pretraining specifically. Post-pretraining stages (instruction tuning, RLHF, DPO, all of Phase 4) cost orders of magnitude less compute and typically run on much smaller hardware setups. Plain data parallelism is often enough for a tuning run.
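To make the bytes-per-parameter and sharding arithmetic from the first item above concrete, here is a rough back-of-the-envelope sketch. It assumes the mixed-precision Adam accounting used in the ZeRO paper (2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state per parameter), not INT8 storage, and it ignores activations and temporary buffers entirely. The function name and its arguments are hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model state under ZeRO stage 0 to 3.

    Assumption: mixed-precision Adam, so each parameter costs
    2 bytes (FP16 weight) + 2 bytes (FP16 gradient) + 12 bytes (FP32 optimizer state).
    Activations and temporary buffers are not counted.
    """
    param_bytes = 2 * num_params
    grad_bytes = 2 * num_params
    optim_bytes = 12 * num_params

    if stage >= 1:  # ZeRO-1 shards the optimizer state
        optim_bytes /= num_gpus
    if stage >= 2:  # ZeRO-2 also shards the gradients
        grad_bytes /= num_gpus
    if stage >= 3:  # ZeRO-3 also shards the parameters
        param_bytes /= num_gpus

    return param_bytes + grad_bytes + optim_bytes


if __name__ == "__main__":
    params = 70_000_000_000  # 70 billion parameters
    for stage in range(4):
        gigabytes = zero_bytes_per_gpu(params, num_gpus=64, stage=stage) / 1e9
        print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

Under these assumptions, a 70-billion-parameter model needs about 1,120 GB of model state on every GPU with plain data parallelism, and about 17.5 GB per GPU with ZeRO-3 across 64 GPUs. Lower-precision storage would shrink the per-parameter constants further.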
Original sources
The primary papers, in chronological order.
- “GPipe”, Huang et al., 2018. The earlier reference for pipeline parallelism (predates the LLM era but the techniques carry forward).
- “Megatron-LM”, Shoeybi et al., 2019. Tensor parallelism for transformer training.
- “ZeRO”, Rajbhandari et al., 2020. Redundancy elimination across data-parallel GPUs.
- “FlashAttention”, Dao et al., 2022. The IO-aware exact attention.
- “FlashAttention-2”, Dao, 2023. Refinements that cut non-matmul work, parallelize a single attention head across thread blocks, and partition work between warps to reduce shared-memory traffic. Worth reading after the original.
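As a small illustration of the tensor-parallelism idea behind the Megatron-LM entry above, the sketch below splits a weight matrix by columns, lets each simulated "device" multiply the input by its own shard, and concatenates the partial results. This is a toy on nested Python lists, not Megatron-LM code: there are no real devices and no communication calls, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists: (n x k) times (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def split_columns(weight, num_shards):
    """Split a weight matrix into equal-width column shards, one per simulated device."""
    num_cols = len(weight[0])
    if num_cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = num_cols // num_shards
    return [
        [row[s * width:(s + 1) * width] for row in weight]
        for s in range(num_shards)
    ]


def column_parallel_matmul(x, weight, num_shards):
    """Each shard computes x @ W_shard; the outputs are concatenated column-wise."""
    partials = [matmul(x, shard) for shard in split_columns(weight, num_shards)]
    # Stands in for the gather step that would collect results across devices.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

The sharded result matches the unsharded `matmul(x, weight)` exactly, while each simulated device only ever holds a fraction of the weight columns. That is the memory saving tensor parallelism is after.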
Community discussion
None selected for this lesson. The parallelism + ZeRO + Flash Attention space is consolidated in the academic literature and in the production code of the major training frameworks (Megatron-LM, DeepSpeed, FSDP). Durable references will be added at a future quarterly review if any consolidate.