Training across many devices: parallelism
What you’ll learn
Lesson 2’s memory accounting (16N bytes for parameters, gradients, and optimizer states in fp32, where N is the parameter count) already exceeds a single GPU for a 7-billion-parameter model: 16 × 7 billion bytes is about 112 GB. Frontier models are vastly larger. This lesson covers how to spread the work across many devices. The source curriculum is Stanford CS336, Lectures 7 and 8, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube and the course site is cs336.stanford.edu. Per the Phase 0 mirror, the two parallelism lectures are collapsed into one lesson.
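As a rough check of that claim, the 16N rule can be turned into a few lines of arithmetic. This is a minimal sketch, not the course's own code: the 4 + 4 + 4 + 4 byte breakdown assumes an Adam-style optimizer with two fp32 moments per parameter, the 80 GB device size is an assumed example, and activation memory is ignored. The name `training_state_gb` is made up for illustration.

```python
# Back-of-the-envelope memory check using the 16N rule from lesson 2.
# Assumed breakdown: fp32 weights (4) + gradients (4) + two Adam moments (4 + 4).
BYTES_PER_PARAM = 16

# Assumption for illustration: a single accelerator with 80 GB of memory.
GPU_MEMORY_GB = 80


def training_state_gb(num_params: float) -> float:
    """Memory for parameters, gradients, and optimizer state, in GB (10^9 bytes).

    Activations are deliberately left out, so the real footprint is larger.
    """
    return num_params * BYTES_PER_PARAM / 1e9


needed_gb = training_state_gb(7e9)
print(f"7B parameters need about {needed_gb:.0f} GB of training state")
print(f"fits on one {GPU_MEMORY_GB} GB device: {needed_gb <= GPU_MEMORY_GB}")
```

Even before counting activations, the training state alone is larger than the assumed device, which is the trigger for everything that follows in this lesson.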
You will distinguish the three classic schemes (data, tensor, and pipeline parallelism) by what they split and what they communicate; understand FSDP / ZeRO sharding and its trade-off; learn the within-node vs across-nodes placement rule that decides where TP and PP each fit; see 3D parallelism as the combination used at frontier scale; and map a model that does not fit one device or one node to an actionable parallelism configuration.
Where this fits
This is lesson 7 of 14, the third lesson of Phase 2 (systems and efficiency). It makes the lesson-2 accounting actionable at scale and uses lesson 5’s hardware picture (within-node vs across-nodes interconnect speeds) to motivate the placement rules. The next lesson closes Phase 2 with inference, where parallelism returns in a different shape (splitting the serving load) and the KV cache from lesson 4 is the central concern.
Before you start
Prerequisites: lesson 2 (the 16N memory accounting that triggers parallelism in the first place) and lesson 5 (the within-node vs across-nodes interconnect picture that decides where TP lives). Familiarity with the basic collective operations (all-reduce, all-gather) helps but is not strictly required; this lesson explains them by what they do.
About the math
No new math. The lesson uses lesson 2’s 16N accounting and reasons about which axis of the work each scheme splits. There are no formulas to derive.
By the end, you’ll be able to
The single capability this lesson builds: distinguish the main forms of parallelism (data, tensor, pipeline) and when each is used to train a large model. Concretely, you will be able to:
- Distinguish data, tensor, and pipeline parallelism by what they split and what they communicate
- Explain FSDP / ZeRO sharding and its trade-off
- Apply the within-node-vs-across-nodes rule to place TP and PP
- Describe 3D parallelism and when it is needed
- Map a model that doesn’t fit one device or one node to a parallelism configuration
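To make the last two objectives concrete, here is a hedged sketch of a sanity check on a 3D parallelism layout. It assumes the common convention that the data-parallel (DP), tensor-parallel (TP), and pipeline-parallel (PP) degrees multiply to the total device count, and that a TP group stays inside one node because it relies on the faster within-node interconnect. The function `check_layout` and its parameters are hypothetical names for illustration, not part of any library.

```python
def check_layout(world_size: int, gpus_per_node: int,
                 dp: int, tp: int, pp: int) -> list[str]:
    """Return a list of problems with a DP x TP x PP layout (empty if none found).

    Assumed rules (illustrative, not exhaustive):
      1. dp * tp * pp must equal the total number of devices.
      2. A tensor-parallel group must fit inside a single node and divide it evenly.
    """
    problems = []
    if dp * tp * pp != world_size:
        problems.append(
            f"dp*tp*pp = {dp * tp * pp}, but the cluster has {world_size} devices"
        )
    if tp > gpus_per_node or gpus_per_node % tp != 0:
        problems.append(
            f"tp = {tp} does not fit evenly inside a node of {gpus_per_node} devices"
        )
    return problems


# Hypothetical cluster: 8 nodes with 8 devices each.
print(check_layout(world_size=64, gpus_per_node=8, dp=2, tp=8, pp=4))
print(check_layout(world_size=64, gpus_per_node=8, dp=1, tp=16, pp=4))
```

The first layout keeps each TP group inside one node and spreads pipeline stages and data-parallel replicas across nodes, so no problems are reported. The second uses every device but stretches TP across two nodes, which the check flags.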
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 10 minutes (pick-the-scheme exercise for several cluster scenarios, plus flashcards)
- Difficulty: deep (Stage C; a broad systems lesson that builds on lesson 2’s accounting and lesson 5’s hardware picture)