
References: Why pretraining is a memory engineering problem (parallelism and Flash Attention)

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 4, LLM training):
https://www.youtube.com/watch?v=VlA_jt_3Qc4
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the parallelism + ZeRO + Flash Attention section of
Stanford CME 295 Lecture 4 (~26m34s to ~50m02s, the central engineering
arc of the lecture). The lecture continues into quantization (covered in
Phase 3, lesson 4). Clawdemy provides original notes, summaries, and
quizzes derived from this material for educational purposes. All rights
to the original lectures remain with Stanford and the instructors.

A short list, chosen for durability.

Topics that build on or sit beside this one.

  • Mixed precision training and quantization. This lesson assumed each parameter takes 2 bytes (FP16) or 4 bytes (FP32). The next lesson (Phase 3, lesson 4) covers what happens when you store weights in even lower precision (FP8, INT8, INT4). The two axes compose: parallelism distributes bytes across GPUs, while quantization reduces the number of bytes each parameter needs. A 70-billion-parameter model in INT8 with ZeRO-3 across 64 GPUs is the kind of stack a serious open-weights run might use.

  • Why frontier training takes weeks to months. This lesson’s “weeks to months on large GPU clusters” figure comes mostly from compute-throughput math (FLOPs per token times trillions of tokens). The communication overhead of ZeRO-3 + tensor parallelism + pipeline parallelism adds a non-trivial fraction to that wall-clock time, which is why frontier-class engineering teams spend significant effort tuning the parallelism topology to keep the GPUs as busy as possible.

  • Multi-query attention and grouped-query attention. Phase 2’s lesson on attention efficiency tricks covers MQA and GQA, which reduce KV-cache memory at inference time. Flash Attention reduces attention’s training-time memory cost; MQA and GQA reduce its inference-time memory cost. They address different stages but stack cleanly.

  • The KV cache. During inference (not training), the keys and values for previous tokens are cached so each new token does not require recomputing the whole context. The KV cache itself can become a major memory consumer for long-context inference. Phase 6 lessons on inference and serving will revisit this directly.

  • Phase 4 preview: tuning is much smaller. All the techniques in this lesson are about pretraining specifically. Post-pretraining stages (instruction tuning, RLHF, DPO, all of Phase 4) cost orders of magnitude less compute and typically run on much smaller hardware setups. Plain data parallelism is often enough for a tuning run.
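The byte-counting arguments in the topics above (bytes per parameter, sharding across GPUs, KV-cache size under MQA/GQA) can be checked with a few lines of arithmetic. The following is a minimal sketch, not a memory profiler. The function names are made up for illustration, the transformer shape (80 layers, 64 heads, head dimension 128, 8,192-token context) is a hypothetical example rather than a figure from the lecture, and the weight estimate assumes even sharding while ignoring gradients, optimizer states, and activations, which matter a great deal during training.

```python
# Standard storage widths for the formats named in this lesson (bytes per value).
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}


def weight_bytes_per_gpu(n_params, dtype, n_gpus=1):
    """Bytes of weights held by each GPU.

    Assumption: weights are split evenly across n_gpus (ZeRO-3 style sharding).
    Gradients, optimizer states, and activations are NOT counted.
    """
    return n_params * BYTES_PER_PARAM[dtype] / n_gpus


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for one batch at inference time.

    The leading factor 2 accounts for storing both K and V in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


GB = 1e9  # decimal gigabytes, for readable printing

n_params = 70_000_000_000  # the 70-billion-parameter example from the text

# Quantization shrinks bytes per parameter; sharding divides the total by GPU count.
print(weight_bytes_per_gpu(n_params, "FP16") / GB)        # 140.0
print(weight_bytes_per_gpu(n_params, "INT8") / GB)        # 70.0
print(weight_bytes_per_gpu(n_params, "INT8", 64) / GB)    # 1.09375

# Hypothetical shape: 80 layers, 64 query heads, head_dim 128, 8192 tokens, FP16 cache.
mha = kv_cache_bytes(80, n_kv_heads=64, head_dim=128, seq_len=8192)  # one KV head per query head
gqa = kv_cache_bytes(80, n_kv_heads=8, head_dim=128, seq_len=8192)   # 8 shared KV heads
mqa = kv_cache_bytes(80, n_kv_heads=1, head_dim=128, seq_len=8192)   # a single shared KV head
print(mha / GB, gqa / GB, mqa / GB)
```

The KV-cache size scales linearly with the number of KV heads, so in this hypothetical shape moving from 64 to 8 KV heads cuts the cache by a factor of 8, independent of how the weights themselves are stored or sharded.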

The primary papers, in chronological order.

  • “GPipe”, Huang et al., 2018. The earlier reference for pipeline parallelism (it predates the LLM era, but the techniques carry forward).
  • “Megatron-LM”, Shoeybi et al., 2019. Tensor parallelism for transformer training.
  • “ZeRO”, Rajbhandari et al., 2020. Redundancy elimination across data-parallel GPUs.
  • “FlashAttention”, Dao et al., 2022. IO-aware exact attention.
  • “FlashAttention-2”, Dao, 2023. Refinements that reduce non-matmul work and improve parallelism across thread blocks and warps. Worth reading after the original.

None selected for this lesson. The parallelism + ZeRO + Flash Attention space is consolidated in the academic literature and in the production code of the major training frameworks (Megatron-LM, DeepSpeed, FSDP). Durable references will be added at a future quarterly review if any consolidate.