Cheatsheet: How models run on hardware

  • SMs (streaming multiprocessors): tens to hundreds per chip, each with many arithmetic units.
  • Warps: groups of 32 threads (on NVIDIA GPUs) executing the same instruction on different data, in lockstep.
  • Tensor cores: per-SM units doing a small matmul per cycle in fp16/bf16; the bulk of the published peak FLOPs.

Reach peak FLOPs only with large, tile-aligned matmuls in mixed precision.
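Mixed precision usually means multiplying in fp16/bf16 while keeping sensitive accumulations in fp32. The sketch below is a minimal illustration of why the accumulator matters, assuming IEEE half precision as emulated by Python's `struct` module; the helper names `to_fp16`, `to_fp32`, and `accumulate` are hypothetical, and real hardware accumulates inside the matmul unit rather than in a Python loop.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(values, round_fn):
    """Sum values, rounding the running total after every addition."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc

# The inputs are stored in fp16 in both cases; only the accumulator differs.
x = to_fp16(0.001)
values = [x] * 10_000

fp16_sum = accumulate(values, to_fp16)  # low-precision accumulator
fp32_sum = accumulate(values, to_fp32)  # fp32 accumulator
exact = x * len(values)

print("exact   :", exact)
print("fp32 acc:", fp32_sum)
print("fp16 acc:", fp16_sum)
```

With the fp16 accumulator the running total stops growing once each addend falls below half the spacing between representable values, so the result lands far below the true sum. The fp32 accumulator stays close to it.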

| Tier | Size | Speed | Holds |
| --- | --- | --- | --- |
| HBM (main) | tens of GB | ~TB/s (slow vs compute) | weights, gradients, optimizer states, activations |
| SRAM / shared | tens to hundreds of KB per SM | ~10x HBM | tiles staged for tensor cores |
| Registers | tiny, per thread | fastest | per-thread scalars |

Arithmetic intensity = FLOPs per byte moved.

  • Low (read once, multiply once) -> memory-bound; tensor cores sit idle waiting on HBM.
  • High (tile reused in SRAM across many multiplies) -> compute-bound, near peak.
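
The low and high cases can be made concrete with a small roofline-style calculation. This is a sketch under stated assumptions: each matrix is moved to or from HBM exactly once, elements are 2 bytes (bf16/fp16), and the peak and bandwidth figures are illustrative placeholders, not the spec of any real chip. The function names are hypothetical.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix moved once."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, hbm_bandwidth):
    """Roofline bound: limited by compute or by bytes/s times FLOPs/byte."""
    return min(peak_flops, intensity * hbm_bandwidth)

# Illustrative placeholder numbers only.
PEAK_FLOPS = 1e15     # FLOP/s
HBM_BANDWIDTH = 3e12  # bytes/s

for m in (1, 64, 4096):
    ai = matmul_intensity(m, 4096, 4096)
    bound = attainable_flops(ai, PEAK_FLOPS, HBM_BANDWIDTH)
    print(f"m={m:5d}  intensity={ai:8.1f} FLOP/B  "
          f"bound={bound / PEAK_FLOPS:6.1%} of peak")
```

With `m=1` (a matrix-vector product) every weight is read once and used once, so intensity is about 1 FLOP per byte and the bound sits far below peak. With `m=4096` each loaded value is reused thousands of times and the bound reaches the compute roof.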

TPU systolic array: a 2D grid of multiply-accumulate (MAC) units (e.g. 128x128 or 256x256).
Inputs and partial sums march across the grid each cycle;
each loaded value is reused across many multiplies as it moves.

A hardware bet that matmul is the dominant op. Used in pods with fast interconnect. Same principle as GPU SRAM staging: reuse data in fast memory.
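
The data movement in a systolic array can be sketched with a toy cycle-by-cycle simulation. The version below assumes an output-stationary dataflow (each cell keeps its own partial sum while inputs flow right and down), which is one of several possible designs and not necessarily what any particular TPU uses. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate C = A @ B on an n x m grid of MAC cells, one step per cycle."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # partial sum held in each cell
    a_reg = [[0] * m for _ in range(n)]  # A values flowing left to right
    b_reg = [[0] * m for _ in range(n)]  # B values flowing top to bottom
    cycles = n + m + k - 2               # time for the last value to arrive

    for t in range(cycles):
        # Shift A one cell to the right; feed row i from the left edge,
        # delayed by i cycles so matching operands meet in the same cell.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B one cell down; feed column j from the top, delayed by j.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles

C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, "in", cycles, "cycles")
```

Each element of `A` is loaded once at the left edge and then used by every cell in its row, which is the reuse the text describes: one load, `m` multiplies.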

How hardware shapes architecture (lesson 3 in this light)

  • Big matmuls in attention (Q,K,V,O) and FFN: where peak FLOPs lives.
  • Mixed precision (bf16/fp16): how peak FLOPs is reached.
  • Tile-friendly hidden sizes (multiples of 64/128): align to tensor-core tiles.
  • The choice of architecture is co-designed with the chip.
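
Tile-friendly sizing often comes down to rounding a dimension up to the next multiple of the tile width. A minimal sketch, assuming a tile width of 64 or 128 as in the list above; `pad_to_multiple` is a hypothetical helper, and the right multiple depends on the chip and data type.

```python
def pad_to_multiple(size, multiple=128):
    """Round size up to the nearest multiple (ceiling division, then scale)."""
    return -(-size // multiple) * multiple

for hidden in (1000, 4096, 5000):
    print(hidden, "->", pad_to_multiple(hidden, 128))
```

A size that is already aligned is returned unchanged, so the helper can be applied to every dimension without special cases.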

When a job runs far below the GPU spec:

  1. Precision wrong (fp32 instead of bf16/fp16) -> tensor cores not engaged.
  2. Matmul shapes small or misaligned -> miss tensor-core tiles.
  3. Memory-bound surroundings (many small ops between matmuls) -> stall on HBM.
  4. Batch too small to amortize data movement.
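
The checklist can be turned into a rough static check on a single matmul. This is only a sketch: `diagnose` is a hypothetical helper, the tile width and the peak and bandwidth figures are illustrative assumptions, and cause 3 (small ops between matmuls) cannot be seen from one matmul's shape, so it is not covered here.

```python
def diagnose(dtype, batch, d_in, d_out, peak_flops, hbm_bandwidth, tile=64):
    """Return likely reasons a [batch, d_in] @ [d_in, d_out] matmul runs slow."""
    findings = []

    # Cause 1: fp32 math does not engage the fp16/bf16 tensor-core path.
    if dtype not in ("bf16", "fp16"):
        findings.append("precision: use bf16/fp16 to engage tensor cores")

    # Cause 2: dimensions that are not multiples of the tile width.
    if any(dim % tile for dim in (batch, d_in, d_out)):
        findings.append(f"shape: a dimension is not a multiple of {tile}")

    # Cause 4: too little reuse per byte moved, so HBM sets the pace.
    bytes_per_elem = {"fp32": 4, "bf16": 2, "fp16": 2}[dtype]
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    if flops / bytes_moved < peak_flops / hbm_bandwidth:
        findings.append("memory-bound: batch too small to amortize data movement")

    return findings

# Illustrative placeholder numbers, not a real chip's spec.
print(diagnose("fp32", 8, 4096, 4096, peak_flops=1e15, hbm_bandwidth=3e12))
print(diagnose("bf16", 4096, 4096, 4096, peak_flops=1e15, hbm_bandwidth=3e12))
```

In practice a profiler is the authoritative source; a check like this only narrows down which of the listed causes to look at first.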

Compute is rarely the bottleneck; feeding it is.

  • SIMT / SM / warp: GPU’s lockstep execution model.
  • Tensor core: per-SM matmul unit (fp16/bf16); source of most published peak FLOPs.
  • HBM / SRAM / registers: GPU memory tiers; speed and size trade off.
  • Systolic array: TPU’s 2D MAC grid; data flows through, reusing values.
  • Mixed precision: math in bf16/fp16, sensitive accumulations in fp32; near-free speedup.
  • Stanford CS336, Lecture 5 (GPUs, TPUs), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.