Skip to content

Cheatsheet: Why transformers need stability to learn: LayerNorm, pre-norm, and RMSNorm

The "Add & Norm" boxes rescale activations into a usable range
between sub-layers. Modern transformers changed two things:
WHERE the LayerNorm sits → pre-norm (before each sub-layer)
WHAT the LayerNorm computes → RMSNorm (no mean, no shift)
mean(x) = (x_1 + x_2 + ... + x_d) / d
std(x) = sqrt( sum((x_i - mean(x))^2) / d )
normalized = (x - mean(x)) / std(x)
output = gamma * normalized + beta

Per-token. Two learnable parameters: gamma (rescale), beta (shift).

rms(x) = sqrt( (x_1^2 + x_2^2 + ... + x_d^2) / d )
normalized = x / rms(x)
output = gamma * normalized

Per-token. One learnable parameter: gamma. No mean subtraction. No shift.

LayerNormBatchNorm
Axis of normalizationAcross the feature dimension (one vector at a time)Across the batch dimension (one feature across many vectors)
Depends on batch composition?No (per-token)Yes
Train vs inference statisticsSameDiffer (the batch is different at inference)
CV intuition”One vector across many components""One component across many vectors”
Used in transformers?Yes (default)No
Post-norm (original 2017)Pre-norm (modern)
Formulaoutput = LayerNorm(x + SubLayer(x))output = x + SubLayer(LayerNorm(x))
LayerNorm placementAfter the residual additionBefore the sub-layer
Used in modern LLMs?RareDefault
Lecture’s framing”What the original transformer paper used""Nowadays we use a prenorm version”

The lecturer is brief on why the field moved. The widely-cited explanation in the literature is that pre-norm keeps the residual stream’s magnitude better controlled as networks get deeper.

LayerNormRMSNorm
Mean subtraction?YesNo
DivisorStandard deviationRoot mean square of components
Learnable rescale (gamma)?YesYes
Learnable shift (beta)?YesNo
Convergence propertiesBaselineComparable to LayerNorm (per the lecturer)
Parameter countTwo learnable vectors per norm layerOne learnable vector per norm layer
Compute costHigher (more arithmetic steps)Lower
Used in modern open-weight LLMs?Less commonDefault
Phrase in a model cardWhat it means
Pre-LayerNormPre-norm placement + LayerNorm computation. Older modern style; less common today.
Pre-RMSNormPre-norm placement + RMSNorm computation. The current default for most modern open-weight LLMs.
LayerNorm with learnable affineStandard LayerNorm with both gamma and beta. The 2017 original.
RMSNormThe simpler scheme; usually paired with pre-norm.
PitfallReality
LayerNorm and BatchNorm are the same thingNo. Same general idea (rescale activations), different axis. LayerNorm is per-token across features; BatchNorm is per-feature across the batch.
RMSNorm is fundamentally different from LayerNormNo. It is a simplification: skip mean subtraction, skip shift. Same per-token normalization.
Pre-norm is universally betterTrue for modern LLM-scale networks; less clearly true for shallow networks or specific architectures tuned for post-norm.
The “Add & Norm” boxes are decorationNo. They improve convergence and shorten training time at every model size. The mechanism is small; removing it makes training noticeably worse.
BatchNorm “works” if you tune itIt introduces train-vs-inference statistics differences that LayerNorm avoids. Cleaner to use the right tool.
  • LayerNorm: per-token normalization that subtracts the mean of a vector’s components, divides by the standard deviation, then applies learnable rescale (gamma) and shift (beta).
  • RMSNorm (Root Mean Square Normalization): per-token normalization that divides a vector by the root mean square of its components, then applies learnable rescale (gamma) only. No mean subtraction, no shift.
  • BatchNorm: normalization across the batch dimension; each component normalized against the same component in other vectors in the batch. Common in CV; not used in transformers.
  • Post-norm: the original 2017 transformer’s placement: LayerNorm(x + SubLayer(x)). LayerNorm sits after the residual addition.
  • Pre-norm: the modern placement: x + SubLayer(LayerNorm(x)). LayerNorm sits before the sub-layer.
  • Sub-layer: in a transformer block, either the attention layer or the feed-forward network.
  • Internal covariate shift: the keyword the literature uses for the underlying problem normalization addresses. The distribution of activations shifts as the network trains, making the next layer’s job harder.
  • Gamma, beta: learnable per-component parameters in LayerNorm (rescale and shift). RMSNorm keeps gamma, drops beta.

LayerNorm rescales the activations.
Pre-norm moves where it sits.
RMSNorm changes what it computes.