References: activations, gradients, and BatchNorm
Source material
Source curriculum (structural mirror, cited as further study):

• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 4: "Building makemore Part 3: Activations & Gradients, BatchNorm"
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=P6sfmUTpUmc
  Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
  Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: makemore and the series code are MIT-licensed; the video is under the Standard YouTube License.

This lesson covers Lecture 4, where Karpathy diagnoses the initial-loss and saturation problems with activation/gradient histograms, fixes them with scaled initialization, and introduces batch normalization. Clawdemy's lessons are original prose following the pedagogical arc of this series; we do not reproduce or transcribe the video or code. The tanh-derivative table and the 3.30 loss baseline here are ours, built to be checkable by hand. All rights to the original video and code remain with the creator.

Watch this next
- Building makemore Part 3: Activations & Gradients, BatchNorm, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy fixes the hockey-stick initial loss, then plots histograms of the activations and gradients to make saturation visible, scales the initialization to cure it, and builds batch normalization step by step. Watching the saturated histogram (everything pinned at the tails) turn healthy after rescaling is the clearest picture of what this lesson describes.
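The 3.30 loss baseline mentioned above can be checked in a few lines. The sketch below is ours, not code from the lecture, and it assumes a 27-symbol vocabulary (26 letters plus one boundary token, as in the makemore setup); `cross_entropy` and the "confidently wrong" logits are illustrative names and values chosen for this example.

```python
import math

# Assumption: 27 symbols (26 letters plus one boundary token).
vocab_size = 27

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# All-equal logits give a uniform distribution, so the loss is ln(vocab_size).
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)

# Illustrative "confidently wrong" logits: one large logit on a wrong class.
bad_logits = [0.0] * vocab_size
bad_logits[1] = 10.0
confident_wrong_loss = cross_entropy(bad_logits, target=0)

print(f"uniform baseline:  {uniform_loss:.4f}")  # ln(27), about 3.2958
print(f"confidently wrong: {confident_wrong_loss:.4f}")
```

An initial loss far above the uniform baseline suggests the output layer starts out confidently wrong, which is the situation behind the hockey-stick loss curve.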
Going deeper
- Batch Normalization (Ioffe & Szegedy, 2015) (arXiv). The original paper introducing batch normalization. Worth a skim to see the problem it was framed around (internal covariate shift) and how widely it spread afterward.
- Delving Deep into Rectifiers (He et al., 2015) (arXiv). The paper behind “Kaiming initialization,” the 1/sqrt(fan-in)-style scaling that keeps activations healthy through depth.
- makemore on GitHub (MIT License) and the Zero to Hero series. The next lecture removes the autograd engine entirely and backpropagates through this network by hand.
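To make the two papers' ideas concrete, here is a minimal sketch in plain Python rather than either paper's method in full. It assumes unit-Gaussian inputs, a single neuron, a gain of 1 (no tanh- or ReLU-specific gain), and batch normalization without the learned scale and shift; the sizes and the helper names `preactivations`, `batchnorm`, and `std` are hypothetical choices for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

fan_in, batch = 200, 500  # illustrative sizes, not taken from the lecture

def std(values):
    """Population standard deviation of a list of floats."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def preactivations(weight_scale):
    """One neuron's pre-activation over a batch of unit-Gaussian inputs."""
    w = [random.gauss(0, 1) * weight_scale for _ in range(fan_in)]
    out = []
    for _ in range(batch):
        x = [random.gauss(0, 1) for _ in range(fan_in)]
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out

def batchnorm(values, eps=1e-5):
    """Normalize a batch to zero mean, unit variance (no learned gain/bias)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

raw = preactivations(weight_scale=1.0)                        # unscaled init
scaled = preactivations(weight_scale=1.0 / math.sqrt(fan_in))  # 1/sqrt(fan-in)
normed = batchnorm(raw)                                        # fix it after the fact

raw_std, scaled_std, normed_std = std(raw), std(scaled), std(normed)
print(f"unscaled std: {raw_std:.2f}")    # roughly sqrt(fan_in)
print(f"scaled std:   {scaled_std:.2f}")  # roughly 1
print(f"batchnorm std: {normed_std:.4f}")
```

The unscaled pre-activations are wide enough to push a tanh deep into its flat tails, while both the scaled initialization and the normalized batch stay near unit spread. Scaling sets a good starting point; batch normalization enforces the spread at every step of training.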
Adjacent topics
Where this sits in the curriculum.
- The autograd engine (lesson 1). This lesson leans directly on tanh’s local derivative 1 - tanh^2 from that lesson: saturation is exactly that derivative going to zero, which starves backprop. If the saturation argument felt fast, rereading how local derivatives drive the backward pass grounds it.
- The MLP language model (lesson 4). This lesson is about making that network train well. The initialization and normalization here are what turn the deeper MLP from “barely learns” into “learns smoothly,” and they apply to any network with hidden layers, not just the language model.
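The saturation argument in the first item above can be checked numerically. This is a small sketch of our own; the input values in `xs` are arbitrary sample points and `tanh_local_grad` is a name chosen for this example.

```python
import math

def tanh_local_grad(x):
    """Local derivative of tanh at x: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# Arbitrary sample points, moving from the linear region into the flat tail.
xs = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0]
grads = [tanh_local_grad(x) for x in xs]

for x, g in zip(xs, grads):
    print(f"x={x:>4}  tanh={math.tanh(x):.4f}  1-tanh^2={g:.6f}")
```

The local derivative is 1 at the origin and shrinks toward zero as the input grows, so any upstream gradient multiplied by it during the backward pass is almost erased once the unit saturates.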