References: activations, gradients, and BatchNorm

Source material

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 4:
  "Building makemore Part 3: Activations & Gradients, BatchNorm"
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=P6sfmUTpUmc
  Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
  Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: makemore and the series code are MIT-licensed; the video is under the Standard YouTube License.

This lesson covers Lecture 4, where Karpathy diagnoses the initial-loss and
saturation problems with activation/gradient histograms, fixes them with scaled
initialization, and introduces batch normalization. Clawdemy's lessons are
original prose following the pedagogical arc of this series; we do not reproduce
or transcribe the video or code. The tanh-derivative table and the 3.30 loss
baseline here are ours, built to be checkable by hand. All rights to the
original video and code remain with the creator.
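
The 3.30 baseline can be checked in a few lines. The sketch below assumes a 27-symbol vocabulary (26 letters plus one boundary token, as in a character-level names model); that count and the helper name `cross_entropy` are assumptions for illustration, not taken from the lecture code. It shows that equal logits give a loss of ln(27), and that a confidently wrong start costs far more.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

VOCAB_SIZE = 27  # assumption: 26 letters plus one boundary token

# All logits equal: the model spreads probability uniformly.
uniform_logits = [0.0] * VOCAB_SIZE
baseline = cross_entropy(uniform_logits, target=0)
print(f"uniform baseline: {baseline:.2f}")  # 3.30

# Hypothetical bad start: one wrong class gets a large logit.
confident_wrong = [0.0] * VOCAB_SIZE
confident_wrong[5] = 10.0
bad_start = cross_entropy(confident_wrong, target=0)
print(f"confident and wrong: {bad_start:.2f}")
```

An initial loss well above the uniform baseline is a sign that the output layer starts out too confident, which is the first problem the lecture addresses.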

Watch this next

Building makemore Part 3: Activations & Gradients, BatchNorm, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy fixes the inflated initial loss that gives the training curve its hockey-stick shape, then plots histograms of the activations and gradients to make saturation visible, scales the initialization to cure it, and builds batch normalization step by step. Watching the saturated histogram (everything pinned at the tails) turn healthy after rescaling is one of the clearest pictures of what this lesson describes.
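
The histogram effect described above can be approximated without plotting. This is a minimal sketch under stated assumptions: a single random tanh layer, hypothetical sizes (`FAN_IN`, `HIDDEN`, `BATCH`), unit-Gaussian inputs, a saturation threshold of 0.97, and a gain of 1 on the 1/sqrt(fan-in) scale. None of these values come from the lecture code; the function name `saturated_fraction` is also ours.

```python
import math
import random

random.seed(0)

FAN_IN, HIDDEN, BATCH = 30, 50, 100  # hypothetical sizes for illustration

def saturated_fraction(weight_scale, threshold=0.97):
    """Fraction of tanh outputs with |h| > threshold for one random layer."""
    weights = [[random.gauss(0.0, 1.0) * weight_scale for _ in range(HIDDEN)]
               for _ in range(FAN_IN)]
    saturated = 0
    for _ in range(BATCH):
        x = [random.gauss(0.0, 1.0) for _ in range(FAN_IN)]
        for j in range(HIDDEN):
            # Pre-activation of hidden unit j for this example.
            pre = sum(x[i] * weights[i][j] for i in range(FAN_IN))
            if abs(math.tanh(pre)) > threshold:
                saturated += 1
    return saturated / (BATCH * HIDDEN)

unscaled = saturated_fraction(weight_scale=1.0)
scaled = saturated_fraction(weight_scale=1.0 / math.sqrt(FAN_IN))
print(f"unscaled: {unscaled:.1%} saturated")
print(f"scaled:   {scaled:.1%} saturated")
```

With unit-variance weights the pre-activations have a standard deviation near sqrt(fan-in), so most tanh outputs land in the tails; dividing by sqrt(fan-in) brings that spread back to about 1 and the saturated fraction drops sharply.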

Going deeper

Batch Normalization (Ioffe & Szegedy, 2015) (arXiv). The original paper introducing batch normalization. Worth a skim to see the problem it was framed around (internal covariate shift) and how widely it spread afterward.
Delving Deep into Rectifiers (He et al., 2015) (arXiv). The paper behind “Kaiming initialization”: scaling the initial weights by roughly 1/sqrt(fan-in), with a gain that depends on the nonlinearity, so that activations keep a stable spread through depth.
makemore on GitHub (MIT License) and the Zero to Hero series. The next lecture removes the autograd engine entirely and backpropagates through this network by hand.
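
For readers who want the core of the batch normalization paper listed above in concrete form, here is a minimal sketch of the training-time forward pass in plain Python. It normalizes each feature over the batch with the biased batch variance, then applies a scale and shift. The function name `batchnorm_forward`, the `eps` value, and the toy batch are our assumptions; running statistics for inference and the backward pass are deliberately left out.

```python
import math

def batchnorm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learned in a real network).
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out

# Toy batch: two features on very different scales.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]]
normalized = batchnorm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization both features have roughly zero mean and unit variance across the batch regardless of their original scale, which is why the layer makes training less sensitive to the exact initialization.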

Adjacent topics

Where this sits in the curriculum.

The autograd engine (lesson 1). This lesson leans directly on tanh’s local derivative 1 - tanh^2 from that lesson: saturation is exactly that derivative going to zero, which starves backprop. If the saturation argument felt fast, rereading how local derivatives drive the backward pass grounds it.
The MLP language model (lesson 4). This lesson is about making that network train well. The initialization and normalization here are what turn the deeper MLP from “barely learns” into “learns smoothly,” and they apply to any network with hidden layers, not just the language model.
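
The saturation point in the autograd item above can be checked numerically. This small sketch evaluates tanh's local derivative, 1 - tanh^2, at a few inputs chosen for illustration; the function name `tanh_local_grad` is ours.

```python
import math

def tanh_local_grad(x):
    """Local derivative of tanh at x: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# Sample inputs chosen for illustration, from the linear region to the tail.
for x in [0.0, 1.0, 2.0, 3.0, 5.0]:
    print(f"x={x:>4}  tanh={math.tanh(x):.4f}  local grad={tanh_local_grad(x):.4f}")
```

The local gradient is 1 at x = 0 and falls below 0.001 by x = 5. Since backprop multiplies the incoming gradient by this factor, a unit sitting in the tail passes almost nothing back to the weights that feed it.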