Summary: activations, gradients, and BatchNorm

TL;DR. A correct deep network can still fail to train, and it fails in two diagnosable ways. First, it starts confidently wrong: random weights produce overconfident logits, so the initial loss sits far above the uniform baseline of -log(1/27) ≈ 3.30. Second, its tanh neurons saturate: large pre-activations push outputs into the flat tails, where the local derivative 1 - tanh^2 is near zero, so almost no gradient flows and the neurons go numb. Fix both with good initialization (small output weights, and each layer’s weights scaled by 1 / sqrt(fan-in)), or make the fix automatic with batch normalization (normalize each layer’s pre-activations across the batch, then apply a learned gain and bias). These are the techniques that make deep networks, transformers included, trainable.
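
The two numbers in the TL;DR can be checked directly. The sketch below is a minimal illustration in plain Python, not code from the lesson: the helper names `cross_entropy` and `tanh_local_grad`, and the logit value 10.0 used for the "confidently wrong" case, are assumptions chosen for the example. It computes the loss for all-equal logits (the uniform baseline), the loss when a single wrong class gets a large logit, and the tanh derivative at a few inputs.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 27  # 27 classes, matching the -log(1/27) baseline

# All-equal logits give a uniform distribution, so the loss is log(27).
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)

# Hypothetical "confidently wrong" start: one wrong class gets a large logit.
bad_logits = [0.0] * vocab_size
bad_logits[5] = 10.0
confident_wrong_loss = cross_entropy(bad_logits, target=0)

def tanh_local_grad(x):
    """Derivative of tanh at x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

print(f"uniform baseline loss:  {uniform_loss:.4f}")
print(f"confidently wrong loss: {confident_wrong_loss:.4f}")
for x in (0.0, 2.0, 3.0):
    print(f"local tanh gradient at {x}: {tanh_local_grad(x):.4f}")
```

If the first loss you log during training is well above the uniform baseline, the output layer is a likely place to look first.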

Core ideas

A naive deep net starts confidently wrong. Random weights produce large, widely spread logits, and softmax turns them into a confident distribution that is usually confident in the wrong class, so the initial loss is very high. A model that knew nothing should guess uniformly and start at -log(1/27) = log(27) ≈ 3.30. Shrinking the output weights so initial logits are near zero fixes the start.
tanh neurons saturate, and that kills the gradient. When a pre-activation is large in magnitude, the output sits in tanh’s flat tail, where 1 - tanh^2 is nearly zero (at input 2 it is about 0.07; at 3, about 0.01). Backprop multiplies by this factor, so a saturated neuron passes almost no gradient and stops learning. The most direct diagnostic is a histogram of a layer’s activations: mass piled up near -1 and +1 means trouble.
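Initialization fixes it at the source. With unit-variance inputs and weights, the standard deviation of a pre-activation grows like the square root of the number of inputs summed, so scale each layer’s weights by about 1 / sqrt(number of inputs). That keeps activations at a healthy spread from layer to layer, so signals neither explode into saturation nor vanish through depth. This is the idea behind Kaiming initialization, which also multiplies by a gain that depends on the nonlinearity (PyTorch uses 5/3 for tanh).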
Batch normalization makes it automatic. Normalize each layer’s pre-activations to zero mean and unit variance across the minibatch, then apply a learned gain and bias. Activations stay healthy largely regardless of initialization, at the cost of coupling the examples in a batch and needing running statistics at inference.
This is why deep models are hard to train, and how they get trained anyway. In a deep network, signals can explode or vanish as they propagate, and a network whose gradients vanish learns nothing. Initialization and normalization keep the gradients alive across depth.
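
The ideas in the list above can be seen on a single layer. The following is a rough sketch in plain Python, not the lesson’s code: the helper names (`randn_matrix`, `matmul`, `batchnorm`, `saturated_fraction`), the layer sizes, and the 0.97 saturation threshold are all assumptions made for illustration. The `batchnorm` here uses a scalar gain and bias fixed at their usual starting values and keeps no running statistics, so it shows only the training-time normalization step.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def randn_matrix(rows, cols, scale=1.0):
    """rows x cols matrix of Gaussian samples with standard deviation `scale`."""
    return [[random.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def batchnorm(h, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize each feature (column) across the batch, then scale and shift.

    gain and bias are scalars here for brevity; in practice they are
    learned per-feature parameters.
    """
    out_cols = []
    for col in zip(*h):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        out_cols.append([gain * (v - mean) / math.sqrt(var + eps) + bias for v in col])
    return [list(row) for row in zip(*out_cols)]

def std(h):
    """Standard deviation over every entry of the matrix."""
    flat = [v for row in h for v in row]
    mean = sum(flat) / len(flat)
    return math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))

def saturated_fraction(h, threshold=0.97):
    """Fraction of tanh outputs whose magnitude exceeds `threshold`."""
    flat = [math.tanh(v) for row in h for v in row]
    return sum(abs(t) > threshold for t in flat) / len(flat)

batch_size, fan_in, fan_out = 64, 100, 50
x = randn_matrix(batch_size, fan_in)  # unit-variance inputs

w_naive = randn_matrix(fan_in, fan_out)                                   # no scaling
w_scaled = randn_matrix(fan_in, fan_out, scale=1.0 / math.sqrt(fan_in))   # 1 / sqrt(fan-in)

h_naive = matmul(x, w_naive)    # pre-activation std near sqrt(fan_in) = 10
h_scaled = matmul(x, w_scaled)  # pre-activation std near 1
h_bn = batchnorm(h_naive)       # std near 1 even though the weights are unscaled

for name, h in [("naive", h_naive), ("scaled", h_scaled), ("batchnorm", h_bn)]:
    print(f"{name:>9}: pre-activation std {std(h):6.2f}, "
          f"saturated fraction {saturated_fraction(h):.2f}")
```

The saturated fraction is a one-number stand-in for the activation histogram: most of the naive layer’s tanh outputs land in the tails, while the scaled and batch-normalized versions leave only a small share there.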

What changes for you

“Training was unstable” or “the model didn’t converge” stops being a mystery and becomes a specific, diagnosable situation: somewhere the activations saturated or the gradients vanished. You also understand why normalization layers appear in nearly every modern architecture. Every transformer has them (as layer normalization), not as decoration but as load-bearing parts that keep signals flowing through depth. The next lesson takes away the autograd engine and has you backpropagate through this whole network by hand, so the gradient flow you have been diagnosing becomes something you can compute and debug yourself, not something a library does behind a curtain.