Cheatsheet: How models run on hardware
GPU execution model (SIMT)
- SMs (streaming multiprocessors): tens to hundreds per chip, each with many arithmetic units.
- Warps: groups of 32 threads (on NVIDIA GPUs) that execute the same instruction on different data, in lockstep.
- Tensor cores: per-SM units doing a small matmul per cycle in fp16/bf16; the bulk of the published peak FLOPs.
Reach peak FLOPs only with large, tile-aligned matmuls in mixed precision.
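A minimal sketch of how warp granularity plays out, assuming NVIDIA's warp width of 32. The function names are illustrative, not part of any GPU API. A thread block is scheduled in whole warps, so a block size that is not a multiple of the warp width leaves lanes unused in the last warp.

```python
WARP_SIZE = 32  # assumption: NVIDIA's warp width


def warps_needed(threads_per_block: int) -> int:
    """Number of whole warps scheduled for one thread block (ceiling division)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def lane_utilization(threads_per_block: int) -> float:
    """Fraction of scheduled lanes that carry a real thread."""
    return threads_per_block / (warps_needed(threads_per_block) * WARP_SIZE)


if __name__ == "__main__":
    for block in (100, 256):
        print(block, warps_needed(block), lane_utilization(block))
```

A block of 100 threads occupies 4 warps (128 lanes), so about a fifth of the lanes do no work; a block of 256 fills its 8 warps exactly.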
Memory hierarchy (the bottleneck)
| Tier | Size | Speed | Holds |
|---|---|---|---|
| HBM (main) | tens of GB | ~TB/s (slow relative to compute) | weights, gradients, optimizer states, activations |
| SRAM / shared | tens to hundreds of KB per SM | ~10x HBM bandwidth | tiles staged for tensor cores |
| Registers | tiny, per thread | fastest | per-thread scalars |
Arithmetic intensity = FLOPs per byte moved.
- Low (read once, multiply once) -> memory-bound; tensor cores sit idle waiting on HBM.
- High (tile reused in SRAM across many multiplies) -> compute-bound, near peak.
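The low/high split above can be worked through numerically. This is a rough sketch, not a profiler: it assumes each operand crosses HBM exactly once, 2-byte (bf16/fp16) elements, and the usual roofline bound `min(peak, intensity * bandwidth)`. `PEAK_FLOPS` and `HBM_BANDWIDTH` are hypothetical round numbers, not the spec of any real chip.

```python
PEAK_FLOPS = 300e12    # hypothetical peak, FLOP/s
HBM_BANDWIDTH = 2e12   # hypothetical HBM bandwidth, bytes/s


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for (m x k) @ (k x n), each matrix moved once."""
    flops = 2 * m * n * k                              # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_add_intensity(bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for c = a + b: one add, two reads and one write per element."""
    return 1 / (3 * bytes_per_elem)


def attainable_flops(intensity: float, peak: float, bandwidth: float) -> float:
    """Roofline bound: limited by compute or by how fast memory can feed it."""
    return min(peak, intensity * bandwidth)


if __name__ == "__main__":
    ridge = PEAK_FLOPS / HBM_BANDWIDTH  # intensity needed to become compute-bound
    cases = {
        "matmul 4096^3": matmul_intensity(4096, 4096, 4096),
        "matrix-vector": matmul_intensity(1, 4096, 4096),
        "elementwise add": elementwise_add_intensity(),
    }
    for name, ai in cases.items():
        bound = "compute-bound" if ai >= ridge else "memory-bound"
        print(f"{name}: {ai:.2f} FLOP/byte, {bound}")
```

Under these assumptions the large square matmul sits far above the ridge point, while the matrix-vector product and the elementwise add sit far below it and are limited by HBM bandwidth.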
TPU: a systolic-array bet
2D grid of multiply-accumulate units (e.g. 128x128 or 256x256). Inputs and partial sums march across the grid each cycle; each loaded value is reused across many multiplies as it moves. A hardware bet that matmul is the dominant op. Used in pods with fast interconnect. Same principle as GPU SRAM staging: reuse data in fast memory.
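A toy cycle-by-cycle simulation can make the data flow concrete. It assumes an output-stationary layout (each cell keeps one output element while rows of A flow right and columns of B flow down); real hardware may use a different dataflow, such as keeping weights stationary. `systolic_matmul` and its counters are illustrative names only.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on a grid of MAC cells, one cell per output element.

    Returns (C, edge_loads, macs): the result, how many values entered the
    grid from its edges, and how many multiply-accumulates were performed.
    """
    rows, depth, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    a_reg = [[0] * cols for _ in range(rows)]  # value each cell passes to the right
    b_reg = [[0] * cols for _ in range(rows)]  # value each cell passes downward
    edge_loads = 0
    macs = 0
    for t in range(rows + cols + depth - 2):
        new_a = [[0] * cols for _ in range(rows)]
        new_b = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                # Left edge: row i of A enters skewed by i cycles.
                if j == 0:
                    k = t - i
                    a_in = A[i][k] if 0 <= k < depth else 0
                    edge_loads += 1 if 0 <= k < depth else 0
                else:
                    a_in = a_reg[i][j - 1]
                # Top edge: column j of B enters skewed by j cycles.
                if i == 0:
                    k = t - j
                    b_in = B[k][j] if 0 <= k < depth else 0
                    edge_loads += 1 if 0 <= k < depth else 0
                else:
                    b_in = b_reg[i - 1][j]
                # Cell (i, j) sees A[i][k] and B[k][j] with k = t - i - j.
                if 0 <= t - i - j < depth:
                    C[i][j] += a_in * b_in
                    macs += 1
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return C, edge_loads, macs


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(systolic_matmul(A, B))
```

In this sketch, an n x n problem loads 2n² values at the edges but performs n³ multiply-accumulates, so each loaded value is reused n times as it moves through the grid.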
How hardware shapes architecture (lesson 3 in this light)
- Big matmuls in attention (Q,K,V,O) and FFN: where peak FLOPs lives.
- Mixed precision (bf16/fp16): how peak FLOPs is reached.
- Tile-friendly hidden sizes (multiples of 64/128): align to tensor-core tiles.
- The choice of architecture is co-designed with the chip.
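A small helper for the tile-alignment point above: round a dimension up to the next multiple of a tile width. The multiples 64 and 128 come from the list; the function names and the example sizes are illustrative, and the right multiple for a given chip and dtype should be checked against vendor guidance.

```python
def round_up(size: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= size."""
    return ((size + multiple - 1) // multiple) * multiple


def is_tile_aligned(size: int, multiple: int = 64) -> bool:
    """True if `size` divides evenly into tiles of width `multiple`."""
    return size % multiple == 0


if __name__ == "__main__":
    for dim in (4096, 5000, 50257):  # example sizes, chosen for illustration
        print(dim, is_tile_aligned(dim), round_up(dim, 64), round_up(dim, 128))
```

Padding a dimension this way costs a few unused rows or columns but lets every tile be full-sized.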
Diagnostic: “running at X% of peak”
When a job runs far below the GPU spec:
- Precision wrong (fp32 instead of bf16/fp16) -> tensor cores not engaged.
- Matmul shapes small or misaligned -> miss tensor-core tiles.
- Memory-bound surroundings (many small ops between matmuls) -> stall on HBM.
- Batch too small to amortize data movement.
Compute is rarely the bottleneck; feeding it is.
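To put a number on "X% of peak" for a single matmul, divide the achieved FLOP/s by the published peak. A sketch under stated assumptions: the timing and the peak figure below are hypothetical placeholders, and a real measurement needs a synchronized GPU timer, not a guess.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs in (m x k) @ (k x n): one multiply and one add per MAC."""
    return 2 * m * n * k


def fraction_of_peak(flops_done: float, seconds: float, peak_flops_per_s: float) -> float:
    """Achieved throughput as a fraction of the published peak."""
    return (flops_done / seconds) / peak_flops_per_s


if __name__ == "__main__":
    measured_seconds = 1.0e-3   # hypothetical timing for one 4096^3 matmul
    peak = 300e12               # hypothetical published peak, FLOP/s
    frac = fraction_of_peak(matmul_flops(4096, 4096, 4096), measured_seconds, peak)
    print(f"{frac:.1%} of peak")
```

A low fraction is the cue to walk the checklist above: precision, shapes, surrounding ops, batch size.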
Words to use precisely
- SIMT / SM / warp: GPU’s lockstep execution model.
- Tensor core: per-SM matmul unit (fp16/bf16); source of most published peak FLOPs.
- HBM / SRAM / registers: GPU memory tiers; speed and size trade off.
- Systolic array: TPU’s 2D MAC grid; data flows through, reusing values.
- Mixed precision: math in bf16/fp16, sensitive accumulations in fp32; near-free speedup.
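Why accumulations stay in fp32 can be shown with a software emulation. This is a sketch, not GPU code: it rounds a running sum to half or single precision after every add using Python's `struct` formats `"e"` (IEEE binary16) and `"f"` (binary32). bf16 is not emulated here, and the helper names are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn) -> float:
    """Sum `values`, rounding the running total after every add."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


if __name__ == "__main__":
    values = [1.0] * 4096
    print("fp16 accumulator:", accumulate(values, to_fp16))
    print("fp32 accumulator:", accumulate(values, to_fp32))
```

The fp16 accumulator stalls at 2048: above that value adjacent fp16 numbers are 2 apart, so adding 1 rounds back down. The fp32 accumulator reaches 4096.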
Source
- Stanford CS336, Lecture 5 (GPUs, TPUs), by Hashimoto and Liang.
cs336.stanford.edu. Independent structural mirror in original prose; see references.