The lecturer is brief on why the field moved from post-norm to pre-norm. The widely cited explanation in the literature is that pre-norm leaves the residual path itself unnormalized, so gradients flow through an identity connection and training stays stable as networks get deeper.
LayerNorm: per-token normalization that subtracts the mean of a vector’s components, divides by the standard deviation, then applies learnable rescale (gamma) and shift (beta).
RMSNorm (Root Mean Square Normalization): per-token normalization that divides a vector by the root mean square of its components, then applies learnable rescale (gamma) only. No mean subtraction, no shift.
BatchNorm: normalization across the batch dimension; each component is normalized against the same component in the other vectors in the batch. Common in CV; rarely used in transformers.
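Post-norm: the original 2017 transformer’s placement, LayerNorm(x + SubLayer(x)). LayerNorm sits after the residual addition, so the residual stream is renormalized at every block.
Pre-norm: the modern placement, x + SubLayer(LayerNorm(x)). LayerNorm sits before the sub-layer, inside the residual branch; the residual path itself is left untouched.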
Sub-layer: in a transformer block, either the attention layer or the feed-forward network.
Internal covariate shift: the term the BatchNorm literature introduced for the problem normalization was proposed to address. The distribution of each layer’s input activations shifts as the earlier layers train, making the next layer’s job harder.
Gamma, beta: learnable per-component parameters in LayerNorm (rescale and shift). RMSNorm keeps gamma, drops beta.
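The three normalizations defined above differ mainly in which axis they reduce over and whether they subtract a mean. What follows is a minimal NumPy sketch, not any particular library's implementation. The function names, the eps stabilizer (and its value), and the tiny 2-token input are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Per-token LayerNorm: statistics are taken over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """Per-token RMSNorm: divide by the root mean square; no mean, no beta."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def batch_norm(x, gamma, beta, eps=1e-5):
    """BatchNorm (training-mode statistics only): reduce over the batch axis."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Hypothetical input: 2 tokens, 4 components each.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 2.0, 6.0, 6.0]])
gamma = np.ones(4)   # learnable rescale, left at its identity value here
beta = np.zeros(4)   # learnable shift, left at its identity value here

ln = layer_norm(x, gamma, beta)
rn = rms_norm(x, gamma)
bn = batch_norm(x, gamma, beta)
```

With gamma and beta at these identity values, each row of ln has mean close to zero and standard deviation close to one. Each row of rn has mean square close to one but keeps a nonzero mean, because RMSNorm never subtracts it. bn is centered per column rather than per row, which is the axis difference the definitions describe.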
LayerNorm rescales the activations. Pre-norm moves where it sits. RMSNorm changes what it computes.
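The difference between the two placements is easiest to see with both blocks written side by side. What follows is a minimal NumPy sketch under stated assumptions: layer_norm omits gamma and beta for brevity, and toy_sublayer (a random linear map followed by tanh) is a hypothetical stand-in for attention or the feed-forward network, not a real transformer sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token LayerNorm without learnable parameters (assumed for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original 2017 placement: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern placement: normalize only the sub-layer's input;
    # x itself passes through the residual path unchanged.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d = 16                                           # hypothetical model width
W = rng.normal(scale=d ** -0.5, size=(d, d))     # hypothetical sub-layer weights

def toy_sublayer(h):
    """Stand-in for attention or the feed-forward network."""
    return np.tanh(h @ W)

x = rng.normal(size=(4, d))                      # 4 tokens
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In the post-norm block every output token is renormalized, so the input x does not pass through unchanged. In the pre-norm block the output is exactly x plus the sub-layer's contribution, which is the identity path the residual connection provides.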