References: Why pretraining is a memory engineering problem (parallelism and Flash Attention)

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 4, LLM training):
    https://www.youtube.com/watch?v=VlA_jt_3Qc4
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the parallelism + ZeRO + Flash Attention section of
Stanford CME 295 Lecture 4 (~26m34s to ~50m02s, the central engineering
arc of the lecture). The lecture continues into quantization (covered in
Phase 3, lesson 4). Clawdemy provides original notes, summaries, and
quizzes derived from this material for educational purposes. All rights
to the original lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, Dao, Fu, Ermon, Rudra, Ré, 2022. The original Flash Attention paper. Section 3 covers the tiling and the block-by-block softmax computation; Section 4 has the speedup benchmarks. The “IO-awareness” framing in the title is exactly the SRAM-vs-HBM data-movement story this lesson covers.
“ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, Rajbhandari et al., 2020. The ZeRO paper. It lays out the three-stage ZeRO-1/2/3 stack and the memory-vs-communication trade-offs across the stages. A natural companion to the lecture’s discussion of ZeRO.
“Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, Shoeybi et al., 2019. The canonical tensor-parallelism paper for transformer training. Section 3 walks through the matrix-multiplication splitting in practice. Megatron-LM has become the de facto training framework for many frontier-scale runs.
The PyTorch FSDP documentation. Fully Sharded Data Parallel (FSDP) is PyTorch’s implementation of ZeRO-3-style sharding. Useful as the practical-engineering side of what the Stanford lecture covers conceptually. If you ever read or write training code, this is an API you are likely to see.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style. The Amidi cheatsheet treats parallelism briefly; this lesson goes deeper.
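
The block-by-block softmax mentioned in the FlashAttention entry above can be sketched in a few lines. This is a minimal illustration, not the paper's kernel: it handles a single query row, uses scalar values instead of vectors, and runs in plain Python rather than on GPU SRAM tiles. The function names and the `block_size` parameter are made up for this example. The point is only that a running max and a running normalizer let you process the scores one block at a time and still get the exact softmax-weighted result.

```python
import math

def softmax_full(scores):
    """Reference softmax that sees the whole row at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row_blockwise(scores, values, block_size):
    """Softmax-weighted sum of `values`, reading `scores` one block at a time.

    Keeps only a running max, a running normalizer, and a running
    unnormalized output, rescaling them whenever a new block raises the max.
    (Hypothetical helper for illustration; scalar values for simplicity.)
    """
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated under the old max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reference = sum(p * v for p, v in zip(softmax_full(scores), values))
blockwise = attention_row_blockwise(scores, values, block_size=2)
print(abs(reference - blockwise) < 1e-12)
```

The full row of scores is never materialized inside `attention_row_blockwise`; only three running numbers survive between blocks, which is the property that lets the real kernel avoid writing the full attention matrix to HBM.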

Adjacent topics

Topics that build on or sit beside this one.

Mixed precision training and quantization. This lesson assumed each parameter takes 2 bytes (FP16) or 4 bytes (FP32). The next lesson (Phase 3, lesson 4) covers what happens when you store weights in even lower precision (FP8, INT8, INT4). The two axes compose: parallelism distributes the bytes across GPUs, and quantization makes each parameter take fewer bytes. A 70-billion-parameter model stored in INT8 and sharded with ZeRO-3 across 64 GPUs is one example of the two being stacked.
Why frontier training takes weeks to months. The “weeks to months on large GPU clusters” figure from this lesson is mostly a function of compute-throughput math (FLOPs per token times trillions of tokens). The communication overhead of ZeRO-3 + tensor parallelism + pipeline parallelism is a non-trivial fraction of that wall-clock time. Frontier-class engineering teams spend significant effort tuning the parallelism topology to keep the GPUs as busy as possible.
Multi-query attention and grouped-query attention. Phase 2’s attention efficiency tricks lesson covers MQA / GQA, which reduce the KV-cache memory at inference time. Flash Attention reduces attention’s training-time memory cost; MQA/GQA reduce its inference-time memory cost. They address different stages but stack cleanly.
The KV cache. During inference (not training), the keys and values for previous tokens are cached so each new token does not require recomputing the whole context. The KV cache itself can become a major memory consumer for long-context inference. Phase 6 lessons on inference and serving will revisit this directly.
Phase 4 preview: tuning is much smaller. All the techniques in this lesson are about pretraining specifically. Post-pretraining stages (instruction tuning, RLHF, DPO, all of Phase 4) cost orders of magnitude less compute and typically run on much smaller hardware setups. Plain data parallelism is often enough for a tuning run.
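
The bytes-per-parameter arithmetic behind the first topic above can be sketched as follows. This is a rough, weights-only estimate under stated assumptions: it assumes the weights are sharded evenly across GPUs (as ZeRO-3 does for parameters) and ignores gradients, optimizer states, and activations, which dominate real training memory. The function name and the dtype table are illustrative only.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_gib_per_gpu(num_params, dtype, num_gpus):
    """Weights-only memory per GPU in GiB, assuming an even shard per GPU.

    Hypothetical helper: excludes gradients, optimizer states, activations.
    """
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / num_gpus / 2**30

params = 70_000_000_000  # the 70-billion-parameter example from the text
for dtype in ("FP32", "FP16", "INT8"):
    gib = weight_gib_per_gpu(params, dtype, num_gpus=64)
    print(f"{dtype}: {gib:.2f} GiB of weights per GPU")
```

For the 70-billion-parameter, 64-GPU example, the INT8 weights come to roughly 1 GiB per GPU, with FP16 and FP32 at two and four times that. The two levers multiply: halving bytes per parameter and doubling the GPU count each halve the per-GPU share.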

Original sources

The primary papers, in chronological order.

“GPipe”, Huang et al., 2018. The earlier reference for pipeline parallelism (predates the LLM era but the techniques carry forward).
“Megatron-LM”, Shoeybi et al., 2019. Tensor parallelism for transformer training.
“ZeRO”, Rajbhandari et al., 2020. Redundancy elimination across data-parallel GPUs.
“FlashAttention”, Dao et al., 2022. The IO-aware exact attention.
“FlashAttention-2”, Dao, 2023. Refinements that cut non-matmul work and improve how the computation is parallelized across thread blocks and warps. Worth reading after the original.
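
As a concrete picture of the tensor parallelism that the Megatron-LM entry refers to, here is a minimal sketch of one of the splits involved: cutting a weight matrix into column blocks, letting each "device" multiply the full input by its own block, and concatenating the results. It uses plain Python lists in a single process, so there is no real communication; the function names are hypothetical, and the paper also uses a complementary row-wise split that this sketch does not show.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, num_shards):
    """Split weight matrix `w` into `num_shards` equal column blocks."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]

def column_parallel_matmul(x, w, num_shards):
    """Compute x @ w by giving each simulated device one column block of w."""
    # Each "device" multiplies the full input by its own column block.
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Gathering the result = concatenating each device's output columns, row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, num_shards=2) == matmul(x, w))
```

Each shard holds only a fraction of the weight matrix, which is the memory saving; the price is the gather step at the end, which on real hardware is inter-GPU communication.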

Community discussion

None selected for this lesson. The parallelism + ZeRO + Flash Attention space is consolidated in the academic literature and in the production code of the major training frameworks (Megatron-LM, DeepSpeed, FSDP). Durable references will be added at a future quarterly review if any consolidate.