
Cheatsheet: activations, gradients, and BatchNorm

Symptom | Cause | Tell
--- | --- | ---
Loss starts absurdly high | Random weights make large, spread logits -> over-confident softmax | First loss far above the uniform baseline
Network barely learns | tanh neurons saturated -> near-zero gradient | Activation histogram piled at -1 / +1

A model that knows nothing should guess uniformly: 1/27 per character. Its loss is the baseline:

-log(1/27) = log(27) ≈ 3.30 (natural log)

The first training loss should land near there, not at 20+. A much higher start = an initialization problem, not an architecture problem.
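The baseline can be checked numerically. The sketch below is an illustration only: it uses the Python standard library, and the logit scale of 10 is a made-up stand-in for a badly initialized output layer, not a value from any particular model.

```python
import math
import random

VOCAB_SIZE = 27  # number of possible characters, as in the text


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


baseline = math.log(VOCAB_SIZE)

# All-equal logits give a uniform softmax, so the loss equals the baseline.
uniform_loss = cross_entropy([0.0] * VOCAB_SIZE, target=3)

# Large, spread logits (assumed scale of 10) with random targets:
# the softmax is confidently wrong most of the time.
rng = random.Random(0)
trials = 2000
spread_loss = sum(
    cross_entropy(
        [rng.gauss(0.0, 10.0) for _ in range(VOCAB_SIZE)],
        rng.randrange(VOCAB_SIZE),
    )
    for _ in range(trials)
) / trials

print(f"baseline     {baseline:.2f}")
print(f"uniform loss {uniform_loss:.2f}")
print(f"spread loss  {spread_loss:.2f}")
```

The uniform case reproduces the baseline exactly, while the spread case comes out several times higher, which is the "absurdly high" starting loss from the table.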

tanh’s local derivative is 1 - tanh(x)^2, and it collapses as the input grows:

x = 0: deriv = 1.00 (responsive)
x = 1: deriv = 0.42
x = 2: deriv = 0.07
x = 3: deriv = 0.01 (numb)

Backprop multiplies the incoming gradient by this, so a neuron in the flat tails passes almost no gradient and stops learning. Saturated for every example = a dead neuron.
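The numbers above can be reproduced with a few lines. This is a minimal sketch using only the standard library; the helper names `tanh_grad` and `backward_through_tanh` are chosen here for illustration.

```python
import math


def tanh_grad(x):
    """Local derivative of tanh at x: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def backward_through_tanh(x, upstream_grad):
    """Chain rule: gradient that reaches the pre-activation x."""
    return upstream_grad * tanh_grad(x)


for x in (0.0, 1.0, 2.0, 3.0):
    print(f"x = {x:.0f}: deriv = {tanh_grad(x):.2f}")

# A unit gradient arriving at a neuron with pre-activation 3 is almost erased.
print(f"gradient passed at x = 3: {backward_through_tanh(3.0, 1.0):.4f}")
```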

Pre-activation size grows with the weight scale and the fan-in (number of inputs summed). Counteract it:

scale each layer's initial weights by about 1 / sqrt(number of inputs)

Keeps activations at a healthy, roughly unit spread layer to layer, neither exploding into saturation nor vanishing. Also: start the output weights small so the loss begins at 3.30.
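A quick simulation shows the effect of the scaling. This is a sketch under stated assumptions: inputs and weights are drawn as independent Gaussians, fresh weights are sampled for every trial (a simplification that leaves the expected spread unchanged), and the fan-in of 200 is an arbitrary example value.

```python
import math
import random

rng = random.Random(0)


def preactivation_std(fan_in, weight_scale, samples=1000):
    """Empirical std of sum_i(w_i * x_i), x_i ~ N(0, 1), w_i ~ N(0, weight_scale^2)."""
    values = []
    for _ in range(samples):
        total = sum(
            rng.gauss(0.0, weight_scale) * rng.gauss(0.0, 1.0)
            for _ in range(fan_in)
        )
        values.append(total)
    mean = sum(values) / samples
    return math.sqrt(sum((v - mean) ** 2 for v in values) / samples)


fan_in = 200  # hypothetical number of inputs to each neuron

naive_std = preactivation_std(fan_in, weight_scale=1.0)
scaled_std = preactivation_std(fan_in, weight_scale=1.0 / math.sqrt(fan_in))

print(f"unit-scale weights:       std = {naive_std:.2f}")
print(f"1/sqrt(fan_in) weights:   std = {scaled_std:.2f}")
```

With unit-scale weights the spread grows like sqrt(fan_in), deep into tanh's flat tails; with the scaled weights it stays close to 1.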

Before tanh, normalize the layer’s pre-activations across the minibatch to zero mean and unit variance, then apply a learned gain and bias so the network can still represent any spread it needs. This keeps activations well-behaved during training and makes the network far less sensitive to its initialization.

  • Couples examples in a batch (each output depends on its batchmates): mild regularizer, common source of bugs.
  • Inference: uses a running average of mean/variance from training, since there is no batch.
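The training and inference paths can be sketched as follows. This is a minimal forward-pass illustration, not a library implementation: the class name `TinyBatchNorm`, the `momentum` and `eps` values, and the example batch are all assumptions, and gradient updates to the gain and bias are left out.

```python
import math


class TinyBatchNorm:
    """Forward pass of batch normalization over a batch of feature rows."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.gain = [1.0] * num_features  # learned scale (training of it omitted)
        self.bias = [0.0] * num_features  # learned shift (training of it omitted)
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def __call__(self, batch, training=True):
        n = len(batch)
        out = [[0.0] * len(self.gain) for _ in range(n)]
        for j in range(len(self.gain)):
            if training:
                # Statistics come from the batch itself: this couples the examples.
                column = [row[j] for row in batch]
                mean = sum(column) / n
                var = sum((v - mean) ** 2 for v in column) / n
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Inference: no batch to rely on, so use the running averages.
                mean, var = self.running_mean[j], self.running_var[j]
            inv_std = 1.0 / math.sqrt(var + self.eps)
            for i in range(n):
                out[i][j] = self.gain[j] * (batch[i][j] - mean) * inv_std + self.bias[j]
        return out


bn = TinyBatchNorm(num_features=2)
batch = [[10.0, -3.0], [14.0, -1.0], [18.0, 1.0], [22.0, 3.0]]
normalized = bn(batch, training=True)      # zero mean, unit variance per column
single = bn([[16.0, 0.0]], training=False)  # a batch of one still works
```

Forgetting to switch to the inference path is one of the bugs the first bullet hints at: a batch of one has zero variance, so batch statistics are useless there.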

A histogram of each layer’s activations (and of the gradients) during training. Piled at the tails = saturation; spread across the middle = healthy. Turns an invisible failure into a picture.
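A text-only version of this diagnostic might look like the sketch below. The 0.97 saturation threshold, the bin count, and the two input scales are illustrative choices, and the simulated activations stand in for values that would normally be captured from a real layer.

```python
import math
import random


def saturation_fraction(activations, threshold=0.97):
    """Share of tanh outputs sitting in the flat tails."""
    return sum(abs(a) > threshold for a in activations) / len(activations)


def text_histogram(activations, bins=10, width=40):
    """Crude histogram of values in [-1, 1], one line of '#' per bin."""
    counts = [0] * bins
    for a in activations:
        idx = min(int((a + 1.0) / 2.0 * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for k, c in enumerate(counts):
        left_edge = -1.0 + 2.0 * k / bins
        lines.append(f"{left_edge:+.1f} {'#' * round(width * c / peak)}")
    return "\n".join(lines)


rng = random.Random(0)
healthy = [math.tanh(rng.gauss(0.0, 1.0)) for _ in range(5000)]
saturated = [math.tanh(rng.gauss(0.0, 15.0)) for _ in range(5000)]

healthy_frac = saturation_fraction(healthy)
saturated_frac = saturation_fraction(saturated)

print(f"healthy layer: {healthy_frac:.1%} saturated")
print(text_histogram(healthy))
print(f"saturated layer: {saturated_frac:.1%} saturated")
print(text_histogram(saturated))
```

The saturated layer shows two tall bars at the outer bins and almost nothing in between, the "piled at the tails" picture described above.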

Practically no deep network, large language models included, trains well without solving these two problems. Sane initialization is standard and normalization layers are everywhere: transformers use a cousin called layer normalization to keep activations and gradients alive across many layers. Reports that “training diverged” or “was unstable” usually point to this territory.

Naive deep nets start over-confident and saturate their neurons. Fix the start with small output weights, fix saturation by scaling weights by 1/sqrt(fan-in), and make it automatic with batch normalization. This init-plus-normalization recipe is what makes deep networks trainable.