
Why transformers need stability to learn: LayerNorm, pre-norm, and RMSNorm

This is lesson 5 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The previous lesson assembled everything into a complete transformer block, including the “Add & Norm” boxes that appear in the original architecture diagram. Course materials are at cme295.stanford.edu.

This lesson zooms in on those boxes. It explains what normalization actually does, and it corrects the original internal-covariate-shift story, which did not hold up empirically; the currently accepted explanation, per Santurkar et al. 2018, is that normalization smooths the loss landscape. It then walks through the two ways the field has changed normalization since 2017. The first change is where the normalization sits relative to the sub-layer: post-norm in the 2017 paper, pre-norm in modern LLMs. The second change is what the normalization computes: full LayerNorm, which uses the mean and standard deviation plus a learned scale and shift, or RMSNorm, which skips the mean subtraction and drops the learned shift. Both show up in modern model cards (you’ll see “Pre-LayerNorm” or “Pre-RMSNorm” listed as architecture features).
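As a preview, the two changes can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular model: the function names (layer_norm, rms_norm, post_norm_block, pre_norm_block), the eps value, and the toy sub-layer are all hypothetical choices made for this sketch, and a real transformer would apply these operations to tensors of token vectors rather than a single Python list.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm on one token vector: subtract the mean, divide by the
    standard deviation, then apply a learned scale (gamma) and shift (beta)."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n  # identity scale by default
    beta = beta if beta is not None else [0.0] * n     # zero shift by default
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma=None, eps=1e-5):
    """RMSNorm on one token vector: no mean subtraction and no learned shift,
    only division by the root-mean-square and a learned scale (gamma)."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    mean_square = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

def post_norm_block(x, sublayer, norm):
    """2017-style ordering: add the residual first, then normalize the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm):
    """Modern ordering: normalize the sub-layer input, leave the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def toy_sublayer(v):
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
print("LayerNorm:", layer_norm(x))
print("RMSNorm:  ", rms_norm(x))
print("post-norm:", post_norm_block(x, toy_sublayer, layer_norm))
print("pre-norm: ", pre_norm_block(x, toy_sublayer, rms_norm))
```

Note the structural difference in the two block functions: in the post-norm version the residual sum itself passes through the normalization, while in the pre-norm version the input x reaches the output unchanged and only the sub-layer sees the normalized vector.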

The previous lesson (How modern models inject position into attention (RoPE)) covered the structural shift from input-added to attention-injected position embeddings. This lesson gives normalization its full treatment. The next lesson, How transformers scale to real-world data: sliding windows and KV-cache savings, covers the attention-efficiency improvements that make long-context inference tractable. Position, normalization, and attention efficiency are the three places where modern transformers genuinely diverge from the 2017 architecture; everything else is mostly intact.

Prerequisites: the transformer block lesson is required. We assume you understand what a sub-layer is (attention or a feed-forward network), what residual connections do, and what the “Add & Norm” box represents in the original architecture diagram. If those terms feel unfamiliar, read the transformer block lesson first.

  • Explain in plain language what layer normalization does to a vector and why it helps training, including the corrected modern framing (loss-landscape smoothing) versus the original internal-covariate-shift story that did not hold up empirically
  • Distinguish LayerNorm from BatchNorm and explain why transformers use the former (per-token, no batch dependence)
  • Identify the structural shift from post-norm to pre-norm and the reason modern transformers use pre-norm (stable training at depth)
  • Describe how RMSNorm differs from LayerNorm (skip mean subtraction, drop the learned shift) and why most modern LLMs use it
  • Read time: about 18 minutes
  • Practice time: about 12 minutes (a worked LayerNorm computation by hand on a small vector, plus a comparison exercise on the post-norm vs pre-norm equations)
  • Difficulty: standard