References: activations, gradients, and BatchNorm

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 4:
"Building makemore Part 3: Activations & Gradients, BatchNorm"
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=P6sfmUTpUmc
Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
Series page: https://karpathy.ai/zero-to-hero.html
License: makemore and the series code are MIT-licensed; the video is under the standard YouTube license.
This lesson covers Lecture 4, where Karpathy diagnoses the initial-loss and
saturation problems with activation/gradient histograms, fixes them with scaled
initialization, and introduces batch normalization. Clawdemy's lessons are
original prose following the pedagogical arc of this series; we do not reproduce
or transcribe the video or code. The tanh-derivative table and the 3.30 loss
baseline here are ours, built to be checkable by hand. All rights to the
original video and code remain with the creator.
  • Building makemore Part 3: Activations & Gradients, BatchNorm, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy first fixes the inflated initial loss that gives the loss curve its hockey-stick shape, then plots histograms of the activations and gradients to make saturation visible, scales the initialization to cure it, and builds batch normalization step by step. Watching the saturated histogram (values pinned at the tails, near -1 and +1) turn healthy after rescaling is a clear picture of what this lesson describes.
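
The initial-loss fix has a baseline that can be checked by hand: a network that spreads probability evenly over V classes has a cross-entropy loss of ln V. The sketch below assumes a 27-token character vocabulary (26 letters plus one boundary token), which is our assumption about the setup rather than something stated on this page; the function name `uniform_loss` is ours.

```python
import math


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy of a uniform prediction over vocab_size classes."""
    # Every class gets probability 1/V, so the loss is -ln(1/V) = ln(V).
    return -math.log(1.0 / vocab_size)


# Assumed vocabulary size: 26 letters plus one boundary token.
baseline = uniform_loss(27)
print(f"expected initial loss: {baseline:.2f}")  # prints 3.30
```

An initial loss far above this value suggests the network starts out confidently wrong, and the first steps of training are spent only on shrinking the logits.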

Where this sits in the curriculum.

  • The autograd engine (lesson 1). This lesson leans directly on tanh’s local derivative, 1 - tanh^2(x), from that lesson: saturation is that derivative going to zero, which starves the backward pass of gradient. If the saturation argument felt fast, rereading how local derivatives drive the backward pass will help ground it.

  • The MLP language model (lesson 4). This lesson is about making that network train well. The initialization and normalization here are what turn the deeper MLP from “barely learns” into “learns smoothly,” and they apply to any network with hidden layers, not just the language model.
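
Both points above can be checked numerically in a few lines. The sketch below is a toy, not the lecture's code: it draws Gaussian inputs and weights for one hidden layer, then compares unit-variance weights, weights scaled by gain / sqrt(fan_in), and a batch-norm-style standardization of the unscaled pre-activations. The layer sizes, the 0.97 saturation cutoff, and all function names are our assumptions; the gain of 5/3 is the conventional tanh gain; the standardization omits the learned gain and bias and the running statistics that a full batch normalization layer keeps.

```python
import math
import random


def tanh_local_grad(x):
    """Local derivative of tanh at pre-activation x: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def preactivations(batch, fan_in, hidden, weight_std, rng):
    """Return a batch x hidden list of pre-activations x @ W (Gaussian x and W)."""
    W = [[rng.gauss(0.0, weight_std) for _ in range(hidden)] for _ in range(fan_in)]
    out = []
    for _ in range(batch):
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        out.append([sum(x[i] * W[i][j] for i in range(fan_in)) for j in range(hidden)])
    return out


def batch_standardize(h, eps=1e-5):
    """Give each hidden unit (column) zero mean and unit variance over the batch."""
    n, cols = len(h), len(h[0])
    out = [row[:] for row in h]
    for j in range(cols):
        mean = sum(row[j] for row in h) / n
        var = sum((row[j] - mean) ** 2 for row in h) / n
        for r in range(n):
            out[r][j] = (h[r][j] - mean) / math.sqrt(var + eps)
    return out


def summarize(h, threshold=0.97):
    """Return (fraction of saturated tanh outputs, mean local gradient)."""
    flat = [v for row in h for v in row]
    saturated = sum(abs(math.tanh(v)) > threshold for v in flat) / len(flat)
    mean_grad = sum(tanh_local_grad(v) for v in flat) / len(flat)
    return saturated, mean_grad


rng = random.Random(0)  # fixed seed so the run is repeatable
batch, fan_in, hidden = 64, 100, 50  # toy sizes, chosen for speed

naive = preactivations(batch, fan_in, hidden, 1.0, rng)
scaled = preactivations(batch, fan_in, hidden, (5 / 3) / math.sqrt(fan_in), rng)
normed = batch_standardize(naive)

sat_naive, grad_naive = summarize(naive)
sat_scaled, grad_scaled = summarize(scaled)
sat_normed, grad_normed = summarize(normed)

for name, sat, grad in [
    ("unit-variance weights", sat_naive, grad_naive),
    ("scaled weights", sat_scaled, grad_scaled),
    ("standardized batch", sat_normed, grad_normed),
]:
    print(f"{name:22s} saturated={sat:.2f}  mean local grad={grad:.2f}")
```

With unit-variance weights, most tanh outputs sit near the tails and the mean local gradient is small; either scaling the weights or standardizing the batch pulls the pre-activations back into the range where 1 - tanh^2(x) is far from zero.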