
Training across many devices: parallelism

Lesson 2’s memory accounting (16N bytes for parameters, gradients, and optimizer states in fp32, where N is the parameter count) already exceeds a single GPU’s memory for a 7-billion-parameter model: 16 × 7 billion bytes is 112 GB of training state. Frontier models are vastly larger. This lesson covers how to spread the work across many devices. The source curriculum is Stanford CS336, Lectures 7 and 8, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube and the course is at cs336.stanford.edu. Per the Phase 0 mirror, the two parallelism lectures are collapsed into one lesson.
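
The 16N figure can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the source lectures: the 80 GB device size is an assumed example value, and the breakdown of the 16 bytes (4 for the parameter, 4 for the gradient, 8 for two Adam-style optimizer moments) is the usual fp32 reading of lesson 2’s accounting.

```python
GB = 10**9  # decimal gigabytes


def training_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Bytes for parameters, gradients, and optimizer states (activations excluded)."""
    return num_params * bytes_per_param


num_params = 7 * 10**9   # a 7-billion-parameter model
device_gb = 80           # ASSUMPTION: an 80 GB accelerator, for illustration only

state_gb = training_state_bytes(num_params) / GB
fits = state_gb <= device_gb

print(f"{state_gb:.0f} GB needed vs {device_gb} GB available, fits: {fits}")
# prints: 112 GB needed vs 80 GB available, fits: False
```

Activations are left out here, so the real footprint is larger than this estimate.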

You will distinguish the three classic schemes (data, tensor, and pipeline parallelism) by what they split and what they communicate; understand FSDP / ZeRO sharding and its trade-off; learn the within-node vs across-nodes placement rule that decides where TP and PP each fit; see 3D parallelism as the combination used at frontier scale; and map a model that does not fit one device or one node to an actionable parallelism configuration.
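
As a preview of the final objective, the mapping from "does not fit" to a configuration can be sketched as a toy heuristic. Everything in this block is hypothetical and illustrative: the names `ParallelConfig`, `min_model_shards`, and `suggest_config` are invented here, the model state is assumed to split evenly across tensor-parallel and pipeline-parallel devices, activations are ignored, FSDP / ZeRO sharding is not modeled, and `gpus_per_node` is assumed to be a power of two. It only encodes the placement rule: grow TP inside a node first, use PP once a node is exhausted, and spend the remaining devices on data parallelism.

```python
from dataclasses import dataclass

BYTES_PER_PARAM = 16  # lesson 2's 16N accounting; activations are ignored


@dataclass
class ParallelConfig:
    tp: int  # tensor-parallel degree, kept inside one node
    pp: int  # pipeline-parallel degree, allowed to span nodes
    dp: int  # data-parallel replicas over the remaining devices


def min_model_shards(num_params: int, device_gb: int) -> int:
    """Smallest number of devices whose combined memory holds 16N bytes."""
    state_bytes = num_params * BYTES_PER_PARAM
    device_bytes = device_gb * 10**9
    return -(-state_bytes // device_bytes)  # integer ceiling division


def suggest_config(num_params: int, device_gb: int,
                   gpus_per_node: int, total_gpus: int) -> ParallelConfig:
    """Toy rule: TP within a node, PP across nodes, DP over what is left."""
    needed = min_model_shards(num_params, device_gb)
    tp = 1
    while tp < needed and tp < gpus_per_node:
        tp *= 2  # double TP until the model fits or the node is full
    pp = -(-needed // tp)  # remaining split goes to pipeline stages
    dp = total_gpus // (tp * pp)
    if dp < 1:
        raise ValueError("not enough devices to hold one model replica")
    return ParallelConfig(tp=tp, pp=pp, dp=dp)


print(suggest_config(70 * 10**9, device_gb=80, gpus_per_node=8, total_gpus=64))
print(suggest_config(7 * 10**9, device_gb=80, gpus_per_node=8, total_gpus=64))
```

Under these assumptions, the 70-billion-parameter case needs 14 devices' worth of memory, so the sketch fills a node with TP = 8, adds PP = 2 across nodes, and leaves DP = 4. The 7-billion-parameter case needs only TP = 2 and no pipeline stages. A real configuration would also weigh activation memory, batch size, and communication cost.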

This is lesson 7 of 14, the third lesson of Phase 2 (systems and efficiency). It makes the lesson-2 accounting actionable at scale and uses lesson 5’s hardware picture (within-node vs across-nodes interconnect speeds) to motivate the placement rules. The next lesson closes Phase 2 with inference, where parallelism returns in a different shape (splitting the serving load) and the KV cache from lesson 4 is the central concern.

Prerequisites: lesson 2 (the 16N memory accounting that triggers parallelism in the first place) and lesson 5 (the within-node vs across-nodes interconnect picture that decides where TP lives). Familiarity with the basic collective operations (all-reduce, all-gather) helps but is not strictly required; this lesson explains them by what they do.
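
For readers who have not met the two collectives, here is a single-process sketch of what each one does. It simulates ranks as entries in a Python list and performs no real communication; the function names mirror the operation names but are not calls into any distributed library.

```python
def all_reduce(per_rank: list[list[float]]) -> list[list[float]]:
    """Every rank ends with the elementwise sum of all ranks' vectors."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank: list[list[float]]) -> list[list[float]]:
    """Every rank ends with all ranks' shards concatenated in rank order."""
    gathered = [x for shard in per_rank for x in shard]
    return [list(gathered) for _ in per_rank]


# Two simulated ranks, each holding a length-2 vector.
local = [[1.0, 2.0], [3.0, 4.0]]

print(all_reduce(local))  # [[4.0, 6.0], [4.0, 6.0]]
print(all_gather(local))  # [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
```

In data parallelism, all-reduce is how per-device gradients are summed so every replica applies the same update. In sharded schemes such as FSDP / ZeRO, all-gather is how a device reassembles parameters that are stored in pieces across devices.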

No new math is introduced. The lesson reuses lesson 2’s 16N accounting and reasons about which axis of the work each scheme splits; there are no formulas to derive.

The single capability this lesson builds: distinguish the main forms of parallelism (data, tensor, pipeline) and when each is used to train a large model. Concretely, you will be able to:

  • Distinguish data, tensor, and pipeline parallelism by what they split and what they communicate
  • Explain FSDP / ZeRO sharding and its trade-off
  • Apply the within-node-vs-across-nodes rule to place TP and PP
  • Describe 3D parallelism and when it is needed
  • Map a model that doesn’t fit one device or one node to a parallelism configuration

  • Read time: about 14 minutes
  • Practice time: about 10 minutes (pick-the-scheme exercise for several cluster scenarios, plus flashcards)
  • Difficulty: deep (Stage C; broad systems lesson, reads through lesson 2’s accounting and lesson 5’s hardware picture)