References: Parallelism
Source material
Source curriculum (structural mirror, cited as further study):

• Stanford CS336, "Language Modeling from Scratch", Lectures 7-8: Parallelism
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lectures 7 and 8 (parallelism). Two lectures are collapsed here because they cover the same material across schemes. Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.

Watch this next
- Stanford CS336, Lectures 7 and 8: Parallelism by Hashimoto and Liang. The two lectures this lesson collapses. They walk the schemes in more depth, with the communication patterns drawn explicitly.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism” by Shoeybi et al. (2019). The classic tensor-parallelism paper, with the FFN-split and attention-head-split recipes. The clearest worked example of TP.
- “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” by Rajbhandari et al. (2019). The paper behind the sharded-optimizer idea (ZeRO-1/2/3, the basis of FSDP). Reads cleanly; the staged-sharding section is the key picture.
- “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel” by Zhao et al. (2023). The PyTorch FSDP paper, with the practical implementation choices that make sharded DP work end to end.
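As a rough illustration of the staged-sharding picture in the ZeRO paper listed above, the sketch below estimates model-state memory per GPU at each stage. It assumes mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state), which sums to 16 bytes per parameter when nothing is sharded. The function name `zero_bytes_per_gpu` and the example sizes are hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_gpu(n_params: int, n_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU under ZeRO stage 0-3.

    Assumption: mixed-precision Adam with 2 bytes/param of fp16 weights,
    2 bytes/param of fp16 gradients, and 12 bytes/param of fp32 optimizer
    state (master weights, momentum, variance). Activations are ignored.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    weights = 2 * n_params
    grads = 2 * n_params
    optim = 12 * n_params

    # Each stage shards one more component across the data-parallel group.
    if stage >= 1:
        optim /= n_gpus    # stage 1: shard optimizer state
    if stage >= 2:
        grads /= n_gpus    # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus  # stage 3: also shard the weights themselves

    return float(weights + grads + optim)


if __name__ == "__main__":
    # Illustrative sizes only: 7.5B parameters on 64 data-parallel GPUs.
    n_params, n_gpus = 7_500_000_000, 64
    for stage in range(4):
        gb = zero_bytes_per_gpu(n_params, n_gpus, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, stage 0 is plain data parallelism with a full replica on every GPU, and each later stage divides one more component by the number of GPUs. This is only the static memory side; the extra communication that sharding introduces is not modeled.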
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). The 16N memory rule is what triggers parallelism in the first place. This lesson is how that accounting becomes an actionable cluster configuration.
- How models run on hardware (lesson 5). The within-node-vs-across-nodes interconnect speeds are the physical reason TP lives within nodes and PP crosses them.
- Inference (lesson 8). Closes Phase 2 by serving the trained model fast; parallelism returns in a different shape (splitting the serving load), with the KV cache from lesson 4 as the central concern.