References: Why transformers need stability to learn: LayerNorm, pre-norm, and RMSNorm

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the layer-normalization section of Stanford CME 295
Lecture 2 (~2620s-3030s), which sits between the position-embeddings
section (covered in our previous lesson) and the attention-efficiency
tricks section (covered in our next lesson). The lecturer treats this
as a brief tour of the parts of the original transformer that have
changed; this lesson preserves that brevity. Clawdemy provides original
notes, summaries, and quizzes derived from this material for educational
purposes. All rights to the original lectures remain with Stanford and
the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

“Layer Normalization”, Ba et al., 2016. The original LayerNorm paper. Introduces the per-token normalization scheme as an alternative to BatchNorm for sequence models. Section 3 derives the formula; Section 6 reports the empirical comparisons. Read after this lesson; the formula will already be familiar.
“Root Mean Square Layer Normalization”, Zhang & Sennrich, 2019. The RMSNorm paper. Argues that the recentering step in LayerNorm (the mean subtraction) is not the load-bearing part; only the rescaling matters. Empirical results show comparable convergence with fewer parameters and faster computation. Short and readable.
“On Layer Normalization in the Transformer Architecture”, Xiong et al., 2020. The post-norm vs pre-norm paper. Provides the theoretical analysis of why pre-norm trains more stably than post-norm. The intuition: with post-norm, the expected gradients of the parameters near the output layer are large at initialization, so training needs careful learning-rate warmup; with pre-norm, gradients are well-behaved at initialization and the warmup stage can be dropped. Read after this lesson; it formalizes what the lecture leaves implicit.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The cheatsheet’s transformer-tweaks section gives a one-page reference for normalization and the other small-but-real architecture changes.
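
To make the RMSNorm and pre-norm entries above concrete before reading the papers, here is a minimal pure-Python sketch of the two normalizers and the two block orderings. It is an illustration under assumptions, not any library's implementation: the function names, the `eps` default, and the list-of-floats representation of a single token's feature vector are all chosen here for readability.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one token's features: recenter, rescale, then apply gain and bias."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: rescale by the root mean square only (no mean subtraction, no bias)."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def post_norm_block(x, sublayer, norm):
    """Post-norm (original transformer): normalize after the residual add."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm):
    """Pre-norm: normalize only the sublayer input; the residual path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

# Illustrative usage with a hypothetical stand-in sublayer (not real attention or an FFN).
x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
toy_sublayer = lambda h: [0.5 * v for v in h]
norm = lambda h: rms_norm(h, ones)

print(layer_norm(x, ones, zeros))
print(rms_norm(x, ones))
print(post_norm_block(x, toy_sublayer, norm))
print(pre_norm_block(x, toy_sublayer, norm))
```

On zero-mean input with unit gain and zero bias, the two normalizers return the same values, which is one way to see that the mean subtraction is the only structural difference between them. In the pre-norm ordering, the input `x` reaches the output through the residual add without passing through a normalizer.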

Adjacent topics

Topics that build on or sit beside this one.

DeepNorm and other normalization variants. The post-norm/pre-norm choice is not the only normalization decision a transformer architect makes. DeepNorm (Wang et al., 2022) proposes a hybrid scheme designed for very deep networks (1000+ layers). Useful context if you read about normalization tricks beyond the LayerNorm/RMSNorm pairing this lesson covers.
Why LayerNorm works (theory). A growing literature studies the actual mechanism by which normalization helps training. The original “internal covariate shift” framing, proposed for BatchNorm (Ioffe & Szegedy, 2015), has been challenged; more recent papers attribute the effect to smoother loss landscapes or implicit gradient regularization. Search terms (both are BatchNorm papers, useful as entry points): “How Does Batch Normalization Help Optimization” (Santurkar et al.), “Understanding the Disharmony Between Dropout and Batch Normalization.”
Activation function changes that often pair with normalization changes. The shift to pre-norm and RMSNorm in modern LLMs often comes alongside a shift in the FFN’s activation function (from ReLU to GELU to SwiGLU). The choices are independent but tend to co-occur. Worth knowing this exists; we cover it briefly in the transformer block lesson from Lecture 1.
Where to go next. The next lesson in this lecture covers attention efficiency tricks (sliding window attention and the MQA/GQA progression). That closes the three-lesson “post-2017 changes that stuck” arc that this lecture opens with position embeddings.

Original sources

The primary papers for the techniques covered, in chronological order.

“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Ioffe & Szegedy, 2015. The BatchNorm paper. Introduces “internal covariate shift” as the framing for why normalization helps. Read this for the historical context, even though the explanation has been challenged in later work.
“Layer Normalization”, Ba et al., 2016. The LayerNorm paper.
“Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper, which used post-norm + LayerNorm.
“Root Mean Square Layer Normalization”, Zhang & Sennrich, 2019. The RMSNorm paper.
“On Layer Normalization in the Transformer Architecture”, Xiong et al., 2020. The post-norm vs pre-norm analysis.

Community discussion

None selected for this lesson. Normalization is one of the more settled corners of the transformer architecture conversation; the relevant literature has consolidated and the practitioner community has largely landed on pre-norm with RMSNorm as the default. Durable references will be added here at a future quarterly review if the situation changes.