Cheatsheet: Parallelism
The three classic schemes
| Scheme | What it splits | Communication | Fits where |
|---|---|---|---|
| Data (DP) | The batch (full model on each device) | All-reduce gradients once per step | Anywhere; needs the model to fit on one device |
| Tensor (TP) | Each layer’s tensors (FFN width, heads) | All-reduce/all-gather inside every layer | Within a node (fast interconnect) |
| Pipeline (PP) | The layers (stages on different devices) | Activations at stage boundaries only | Across nodes (tolerates slower fabric) |
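
As a rough illustration of the data-parallel row, the sketch below simulates DP in plain Python: each rank computes a gradient on its own shard of the batch, and an all-reduce (here just an average over a list) recovers the full-batch gradient. The toy model `y = w * x`, the helper names, and the equal-sized shards are assumptions made for the example, not something taken from the lectures.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every rank ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


def data_parallel_grad(w, xs, ys, num_ranks):
    """Split the batch evenly across ranks, then average the local gradients."""
    assert len(xs) % num_ranks == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_ranks
    local = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_ranks)
    ]
    return all_reduce_mean(local)


# Hypothetical toy batch of four examples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
per_rank = data_parallel_grad(0.5, xs, ys, num_ranks=2)
```

With equal shards, every entry of `per_rank` matches `full`, which is why one gradient all-reduce per step is enough to keep the replicas in sync.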
FSDP / ZeRO (the modern default)
- Shards parameters, gradients, and optimizer states across the DP ranks.
- For each layer’s forward/backward, the ranks gather its parameters, compute, and discard the gathered weights.
- Memory savings of model-splitting without per-layer tensor surgery (TP) or pipeline scheduling (PP).
- Cost: extra communication (param gather + release), largely overlapped with compute.
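
To make the memory saving concrete, here is a back-of-the-envelope sketch. It assumes the "16N" rule referenced in this cheatsheet means roughly 16 bytes of training state per parameter, and that full sharding divides that state evenly across the DP ranks. It ignores activations and the temporarily gathered layer weights, and the 7B-parameter model and 8 ranks are hypothetical.

```python
BYTES_PER_PARAM = 16  # assumption: the "16N" rule of thumb, in bytes per parameter


def state_bytes_per_device(num_params, dp_ranks, sharded):
    """Model-state memory per device, ignoring activations and buffers."""
    total = BYTES_PER_PARAM * num_params
    return total / dp_ranks if sharded else total


GB = 1e9
# Hypothetical 7B-parameter model on 8 data-parallel ranks.
plain_dp_gb = state_bytes_per_device(7e9, dp_ranks=8, sharded=False) / GB
fsdp_gb = state_bytes_per_device(7e9, dp_ranks=8, sharded=True) / GB
```

Under these assumptions, plain DP holds about 112 GB of state on every device, while the sharded version holds about 14 GB per device.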
Decision tree
Does the model fit on ONE GPU (per lesson 2's 16N)?
- Yes -> DP alone. Done.
- No -> Add FSDP/ZeRO (or TP within node). Does it fit across ONE node?
  - Yes -> TP within node + DP across nodes (or FSDP).
  - No -> 3D: TP within nodes + PP across nodes + DP on top.

3D parallelism (largest scale)
TP_size x PP_size x DP_size = total devices
- TP_size: one node's GPUs (where the fast interconnect is)
- PP_size: number of pipeline stages (across nodes)
- DP_size: batch scale

Each axis attacks a different bottleneck (per-layer memory / cross-node memory / batch scale).
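
The factorization can be sketched as a small helper that derives the DP degree from the device count. The function names and the rank ordering (TP varying fastest, so that a TP group stays inside one node) are assumptions for illustration; real frameworks choose their own layouts.

```python
def layout_3d(total_devices, gpus_per_node, pp_size):
    """Return (tp_size, pp_size, dp_size) with TP filling one node."""
    tp_size = gpus_per_node
    model_replica = tp_size * pp_size  # devices needed for one model copy
    if total_devices % model_replica != 0:
        raise ValueError("total_devices must be a multiple of tp_size * pp_size")
    dp_size = total_devices // model_replica
    return tp_size, pp_size, dp_size


def rank_to_coords(rank, tp_size, pp_size):
    """Map a flat rank to (dp, pp, tp) indices, with TP varying fastest."""
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp


# Hypothetical cluster: 64 GPUs, 8 per node, 4 pipeline stages.
tp_size, pp_size, dp_size = layout_3d(64, gpus_per_node=8, pp_size=4)
coords = rank_to_coords(13, tp_size, pp_size)
```

For this hypothetical cluster the layout is TP=8, PP=4, DP=2, and rank 13 lands on TP index 5 of pipeline stage 1 in the first data-parallel replica.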
Pipeline bubbles -> microbatching
A pipeline stalls at the start/end of a batch as it fills and drains. Fix: split each batch into many microbatches that overlap in the pipeline (assembly-line style). Smaller microbatches -> smaller bubbles + more communication overhead; tune to balance.
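
A common way to quantify the bubble, assuming a simple fill-and-drain schedule in which every stage takes the same time per microbatch and communication is free, is the idle fraction (p - 1) / (m + p - 1) for p stages and m microbatches. The sketch below just evaluates that expression; the function name and the 4-stage example are hypothetical.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a fill-and-drain pipeline with equal-cost stages."""
    idle_slots = num_stages - 1
    total_slots = num_microbatches + num_stages - 1
    return idle_slots / total_slots


# Hypothetical 4-stage pipeline with an increasing number of microbatches.
fractions = {m: bubble_fraction(4, m) for m in (1, 4, 16, 64)}
```

With a single microbatch, a 4-stage pipeline idles 75% of the time. The fraction shrinks as the microbatch count grows, which is the trade-off against per-microbatch overhead described above.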
Mapping to lesson 2’s accounting
| If lesson-2 says… | Add |
|---|---|
| Doesn’t fit on one GPU | FSDP/ZeRO or TP |
| Doesn’t fit on one node | TP within + PP across |
| Want more batch scale | DP on top |
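
The decision tree and the mapping table can be written as one small function. This is only a sketch: it assumes the "16N" accounting means about 16 bytes of state per parameter, ignores activations, and uses hypothetical hardware numbers (80 GB GPUs, 8 per node).

```python
def choose_parallelism(num_params, gpu_bytes, gpus_per_node, bytes_per_param=16):
    """Pick a scheme from the model-state footprint alone (activations ignored)."""
    need = bytes_per_param * num_params
    if need <= gpu_bytes:
        return "DP alone"
    if need <= gpu_bytes * gpus_per_node:
        return "TP within node + DP across nodes (or FSDP)"
    return "3D: TP within nodes + PP across nodes + DP on top"


# Hypothetical hardware: 80 GB per GPU, 8 GPUs per node.
small = choose_parallelism(1e9, gpu_bytes=80e9, gpus_per_node=8)
medium = choose_parallelism(20e9, gpu_bytes=80e9, gpus_per_node=8)
large = choose_parallelism(175e9, gpu_bytes=80e9, gpus_per_node=8)
```

In practice the thresholds are softer than this, since activation memory and communication cost also influence the choice.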
Words to use precisely
- All-reduce / all-gather: collective communication ops; their frequency and message size determine the communication cost of TP and of gradient sync in DP.
- Stage / microbatch: a contiguous group of layers on one device / a small piece of the batch flowing through the pipeline.
- FSDP / ZeRO: sharded data parallel; sharded params/grads/optimizer states.
- 3D parallelism: DP × TP × PP combination for frontier-scale training.
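
To pin down the two collectives named above, here is a minimal single-process simulation in which a list of per-rank values stands in for the devices. The function names are hypothetical and no real communication library is involved.

```python
def all_gather(per_rank_chunks):
    """Every rank receives the concatenation of all ranks' chunks."""
    gathered = [x for chunk in per_rank_chunks for x in chunk]
    return [list(gathered) for _ in per_rank_chunks]


def all_reduce_sum(per_rank_vectors):
    """Every rank receives the elementwise sum of all ranks' vectors."""
    summed = [sum(vals) for vals in zip(*per_rank_vectors)]
    return [list(summed) for _ in per_rank_vectors]


# Two simulated ranks, each holding a length-2 list.
gathered = all_gather([[1, 2], [3, 4]])
reduced = all_reduce_sum([[1, 2], [3, 4]])
```

All-gather makes the output larger than each rank's input (useful for reassembling sharded parameters), while all-reduce keeps the shape and combines the values (useful for summing gradients).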
Source
- Stanford CS336, Lectures 7+8 (Parallelism), by Hashimoto and Liang.
cs336.stanford.edu. Independent structural mirror in original prose; see references.