References: Why transformers need stability to learn: LayerNorm, pre-norm, and RMSNorm
Source material
Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks): https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the layer-normalization section of Stanford CME 295 Lecture 2 (~2620s-3030s), which sits between the position-embeddings section (covered in our previous lesson) and the attention-efficiency tricks section (covered in our next lesson). The lecturer treats this as a brief tour of the parts of the original transformer that have changed; this lesson preserves that brevity. Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the instructors.
Going deeper
A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”
-
“Layer Normalization”, Ba et al., 2016. The original LayerNorm paper. Introduces the per-token normalization scheme as an alternative to BatchNorm for sequence models. Section 3 derives the formula; section 6 reports the empirical comparisons. Read after this lesson; the formula will already be familiar.
-
“Root Mean Square Layer Normalization”, Zhang & Sennrich, 2019. The RMSNorm paper. Argues that the recentering step in LayerNorm (the mean subtraction) is not the load-bearing part; only the rescaling matters. Empirical results show comparable convergence with fewer parameters and faster computation. Short and readable.
-
“On Layer Normalization in the Transformer Architecture”, Xiong et al., 2020. The post-norm vs pre-norm paper. Provides the theoretical analysis of why pre-norm is more stable at depth than post-norm. The intuition: at initialization, post-norm leaves large gradients in the layers near the output, so training needs careful learning-rate warmup; pre-norm keeps those gradients well-behaved and sidesteps the problem. Read after this lesson; it formalizes what the lecture leaves implicit.
-
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The cheatsheet’s transformer-tweaks section gives a one-page reference for normalization and the other small-but-real architecture changes.
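As a rough companion to the RMSNorm and pre-norm entries above, here is a minimal pure-Python sketch of the two ideas: LayerNorm recenters and rescales, RMSNorm only rescales, and the pre-norm/post-norm choice is about where the normalization sits relative to the residual add. It is not code from the lecture or the papers. The function names, the `eps` value, and `toy_sublayer` (a stand-in for attention or the FFN) are assumptions made for illustration.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """LayerNorm over one token's feature vector: recenter, then rescale."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    gain = gain or [1.0] * d   # learned scale; identity here
    bias = bias or [0.0] * d   # learned shift; zero here
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain=None, eps=1e-5):
    """RMSNorm: rescale by the root mean square; no mean subtraction, no bias."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    gain = gain or [1.0] * d
    return [g * v / rms for v, g in zip(x, gain)]

def post_norm_block(x, sublayer, norm):
    """Post-norm (the 2017 layout): normalize after the residual add."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm):
    """Pre-norm: normalize only the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN, just to exercise the wiring."""
    return [0.5 * v for v in x]

token = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(token)    # zero mean, roughly unit variance
rms_out = rms_norm(token)     # roughly unit RMS, mean left as it was
post_out = post_norm_block(token, toy_sublayer, layer_norm)
pre_out = pre_norm_block(token, toy_sublayer, rms_norm)
```

In the pre-norm block the input `x` reaches the output through the residual add without passing through any normalization, which is the structural difference the Xiong et al. analysis builds on.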
Adjacent topics
Topics that build on or sit beside this one.
-
DeepNorm and other normalization variants. The post-norm/pre-norm choice is not the only normalization decision a transformer architect makes. DeepNorm (Wang et al., 2022) is a hybrid scheme designed for very deep networks (up to 1,000 layers). Useful context if you read about normalization tricks beyond the LayerNorm/RMSNorm pairing this lesson covers.
-
Why LayerNorm works (theory). A growing literature studies the actual mechanism by which normalization helps training. The original “internal covariate shift” framing, proposed for BatchNorm (Ioffe & Szegedy, 2015), has been challenged; more recent papers attribute the effect to smoother loss landscapes or implicit gradient regularization. Search terms: “How Does Batch Normalization Help Optimization” (Santurkar et al.), “Understanding the Disharmony Between Dropout and Batch Normalization.”
-
Activation function changes that often pair with normalization changes. The shift to pre-norm and RMSNorm in modern LLMs often comes alongside a shift in the FFN’s activation function (from ReLU to GELU to SwiGLU). The two choices are independent, but they tend to appear together. Worth knowing this exists; we cover it briefly in the transformer block lesson from Lecture 1.
-
Where to go next. The next lesson in this lecture covers attention efficiency tricks (sliding window attention and the MQA/GQA progression). That closes the three-lesson “post-2017 changes that stuck” arc, which this lecture opened with position embeddings.
Original sources
The primary papers for the techniques covered, in chronological order.
-
“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Ioffe & Szegedy, 2015. The BatchNorm paper. Introduces “internal covariate shift” as the framing for why normalization helps. Read this for the historical context, even though the explanation has been challenged in later work.
-
“Layer Normalization”, Ba et al., 2016. The LayerNorm paper.
-
“Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper, which used post-norm + LayerNorm.
-
“Root Mean Square Layer Normalization”, Zhang & Sennrich, 2019. The RMSNorm paper.
-
“On Layer Normalization in the Transformer Architecture”, Xiong et al., 2020. The post-norm vs pre-norm analysis.
Community discussion
None selected for this lesson. Normalization is one of the more settled corners of the transformer architecture conversation; the relevant literature has consolidated and the practitioner community has largely landed on the Pre-RMSNorm default. Durable references will be added here at a future quarterly review if the situation changes.