Cheatsheet: Parallelism

| Scheme | What it splits | Communication | Fits where |
|---|---|---|---|
| Data (DP) | The batch (full model on each device) | All-reduce gradients once per step | Anywhere; needs model to fit one device |
| Tensor (TP) | Each layer’s tensors (FFN width, heads) | All-reduce/all-gather inside every layer | Within a node (fast interconnect) |
| Pipeline (PP) | The layers (stages on different devices) | Activations at stage boundaries only | Across nodes (tolerates slower fabric) |
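As a rough illustration of the "what it splits" column, the sketch below simulates two devices in plain Python. The helper `matmul`, the toy matrices, and the two-device split are assumptions made for illustration. Only the forward pass is shown; the gradient all-reduce of DP is not simulated.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows (illustrative helper)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # batch of 4 examples, 2 features
W = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 3.0]]     # one layer: 2 inputs, 4 outputs

full_out = matmul(X, W)

# Data parallel: each simulated device holds the full W and half of the batch rows.
# Concatenating the two lists stacks the per-device outputs back into one batch.
dp_out = matmul(X[:2], W) + matmul(X[2:], W)

# Tensor parallel: each simulated device holds half of W's columns and the full batch.
W_left = [row[:2] for row in W]
W_right = [row[2:] for row in W]
# Joining the partial outputs column-wise stands in for the all-gather.
tp_out = [left + right for left, right in zip(matmul(X, W_left), matmul(X, W_right))]

print(dp_out == full_out, tp_out == full_out)
```

Both splits reproduce the single-device result; what differs is which tensor is cut and therefore what has to be communicated.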
  • FSDP / ZeRO shards parameters, gradients, and optimizer states across the DP ranks.
  • For each layer’s forward/backward, the ranks gather its parameters, compute, and discard the gathered weights.
  • Memory savings of model-splitting without per-layer tensor surgery (TP) or pipeline scheduling (PP).
  • Cost: extra communication (param gather + release), largely overlapped with compute.
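The gather, compute, discard cycle in the bullets above can be sketched with simulated ranks. Everything here is a toy assumption: `NUM_RANKS`, the flat per-layer parameter lists, and the "layer" that simply scales its input by the sum of its weights. The point is only that a rank never holds more than its own shards plus one fully gathered layer.

```python
NUM_RANKS = 2  # hypothetical number of DP ranks

# Three toy layers, each a flat list of 4 parameters (12 parameters in total).
layers = [[0.5, 2.0, 1.0, 3.0], [1.0, 1.0, 2.0, 0.5], [2.0, 0.25, 4.0, 1.0]]

# Each rank permanently stores only its slice of every layer.
shard_len = len(layers[0]) // NUM_RANKS
shards = [
    [layer[r * shard_len:(r + 1) * shard_len] for layer in layers]
    for r in range(NUM_RANKS)
]

def forward(x, rank):
    """Run the toy forward pass on one rank; return the output and the peak parameter count held."""
    resident = sum(len(s) for s in shards[rank])  # parameters this rank always holds
    peak = resident
    for i in range(len(layers)):
        # Stand-in for the all-gather: collect every rank's shard of layer i.
        full = [w for r in range(NUM_RANKS) for w in shards[r][i]]
        peak = max(peak, resident - len(shards[rank][i]) + len(full))
        x = x * sum(full)  # toy "layer": scale the input by the sum of its weights
        del full           # discard the gathered weights before the next layer
    return x, peak

print(forward(1.0, rank=0))  # (212.0625, 8): peak of 8 parameters instead of all 12
```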
Does the model fit on ONE GPU (per lesson 2's 16N)?
  Yes -> DP alone. Done.
  No  -> Add FSDP/ZeRO (or TP within node).
         Does it fit across ONE node?
           Yes -> TP within node + DP across nodes (or FSDP).
           No  -> 3D: TP within nodes + PP across nodes + DP on top.
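The decision tree can be written as a small function. This is a sketch under stated assumptions: it reads "16N" as roughly 16 bytes of training state per parameter, ignores activation memory, and uses the hypothetical names `choose_parallelism`, `gpu_mem_bytes`, and `gpus_per_node`.

```python
BYTES_PER_PARAM = 16  # assumption: the "16N" rule read as 16 bytes of state per parameter

def choose_parallelism(num_params, gpu_mem_bytes, gpus_per_node):
    """Walk the decision tree using a parameter-state estimate only (activations ignored)."""
    need = BYTES_PER_PARAM * num_params
    if need <= gpu_mem_bytes:
        return "DP alone"
    if need <= gpu_mem_bytes * gpus_per_node:
        return "TP within node + DP across nodes (or FSDP)"
    return "3D: TP within nodes + PP across nodes + DP on top"

GPU_MEM = 80 * 10**9  # hypothetical 80 GB device
for n in (1 * 10**9, 20 * 10**9, 70 * 10**9):
    print(n, "->", choose_parallelism(n, GPU_MEM, gpus_per_node=8))
```

With these illustrative numbers, a 1B-parameter model lands on the first branch, 20B on the second, and 70B on the third. Real thresholds shift once activations and batch size are counted.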
TP_size x PP_size x DP_size = total devices
   |         |         |
   |         |         +-- batch scale
   |         +-- number of pipeline stages (across nodes)
   +-- one node's GPUs (where the fast interconnect is)
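A quick arithmetic check of the device factorization: once TP and PP sizes are fixed, the DP size is whatever is left over. The function name `dp_size` and the example cluster (512 devices, 8 GPUs per node, 4 pipeline stages) are hypothetical.

```python
def dp_size(total_devices, tp_size, pp_size):
    """Return the data-parallel degree implied by TP_size x PP_size x DP_size = total devices."""
    model_parallel = tp_size * pp_size  # devices needed to hold one model replica
    if total_devices % model_parallel != 0:
        raise ValueError("total_devices must be a multiple of tp_size * pp_size")
    return total_devices // model_parallel

# 512 devices, TP across the 8 GPUs of a node, 4 pipeline stages across nodes.
print(dp_size(512, tp_size=8, pp_size=4))  # 16 model replicas
```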

Each axis attacks a different bottleneck (per-layer memory / cross-node memory / batch scale).

A pipeline stalls at the start and end of each batch as it fills and drains; these idle periods are the pipeline bubbles. Fix: split each batch into many microbatches that overlap in the pipeline (assembly-line style). Smaller microbatches -> smaller bubbles but more communication overhead; tune to balance.
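To put a number on the bubble, the sketch below uses the standard idealized estimate for a fill-and-drain schedule: with p stages and m equally sized microbatches, each stage is idle for (p - 1) of the (m + p - 1) time slots. This assumes every microbatch takes the same time on every stage and ignores communication; the function name `bubble_fraction` is illustrative.

```python
from fractions import Fraction

def bubble_fraction(stages, microbatches):
    """Idle fraction of an idealized fill-and-drain pipeline: (p - 1) / (m + p - 1)."""
    return Fraction(stages - 1, microbatches + stages - 1)

for m in (1, 4, 16):
    print(m, "microbatches ->", bubble_fraction(stages=4, microbatches=m))
```

With 4 stages, a single microbatch leaves each stage idle 3/4 of the time, while 16 microbatches bring that down to 3/19. The estimate says nothing about the added communication overhead, which is the other side of the trade-off.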

| If lesson-2 says… | Add |
|---|---|
| Doesn’t fit on one GPU | FSDP/ZeRO or TP |
| Doesn’t fit on one node | TP within + PP across |
| Want more batch scale | DP on top |
  • All-reduce / all-gather: collective communication ops; how often they run and how much data they move determine the communication cost of TP and of gradient sync in DP.
  • Stage / microbatch: a contiguous group of layers on one device / a small piece of the batch flowing through the pipeline.
  • FSDP / ZeRO: sharded data parallel; sharded params/grads/optimizer states.
  • 3D parallelism: DP × TP × PP combination for frontier-scale training.
  • Stanford CS336, Lectures 7+8 (Parallelism), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.