
Cheatsheet: Weights, biases, and the squish

activation = squish( weighted sum of inputs + bias )
= squish( w1·a1 + w2·a2 + ... + wn·an + bias )

Same computation in every hidden and output neuron. The network’s power is in how many there are and the exact numbers in each.
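The formula above can be sketched as a few lines of Python. This is a minimal illustration, not a reference implementation: the names `neuron`, `sigmoid`, and the `squish` parameter are chosen here for the example and do not come from any library.

```python
import math

def sigmoid(x):
    """One possible squish: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias, squish=sigmoid):
    """activation = squish(w1*a1 + w2*a2 + ... + wn*an + bias)"""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    # Multiply each input by its weight and add the products up.
    weighted_sum = sum(w * a for w, a in zip(weights, inputs))
    # Add the bias, then squash.
    return squish(weighted_sum + bias)
```

Every hidden and output neuron would call the same function; only the numbers passed as `weights` and `bias` differ.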

| Part | What it is | What it does |
|---|---|---|
| Weight | A number on each connection | Scales one input: + boosts, - dampens, ~0 ignores |
| Bias | A number on the neuron itself | Shifts default eagerness: - cautious, + eager |
| Activation function | A fixed squashing function | Maps the unbounded result into a usable range |

The weights coming into a neuron act like a template; the weighted sum scores how well the input matches it.
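To make the template idea concrete, here is a small sketch with made-up numbers. The four-element `template` and the two inputs are hypothetical, chosen only so the scores are easy to read: the weights "want" the first two inputs on and the last two off.

```python
def weighted_sum(inputs, weights):
    """Score how well the inputs line up with the weights."""
    return sum(w * a for w, a in zip(weights, inputs))

# Hypothetical weights: positive where the template expects activity,
# negative where it expects none.
template = [1.0, 1.0, -1.0, -1.0]

matching = [1.0, 1.0, 0.0, 0.0]   # active exactly where the template is positive
opposite = [0.0, 0.0, 1.0, 1.0]   # active exactly where the template is negative

score_match = weighted_sum(matching, template)     # 2.0
score_opposite = weighted_sum(opposite, template)  # -2.0
```

The matching input gets a high score and the opposite input a low one, before any bias or squish is applied.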

| Function | Formula | Shape | Note |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^(-x)) | Smooth S-curve, maps to (0, 1) | Traditional default |
| ReLU | max(0, x) | 0 for negative, x for positive | Common modern default; fast, trains well |

Which one to use is a design choice. Both just keep activations in a usable range; neither is where learning happens.
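As a rough sketch, both functions fit in one line each using only Python's standard `math` module. The function names `sigmoid` and `relu` are picked for this example.

```python
import math

def sigmoid(x):
    """Smooth S-curve: 1 / (1 + e^(-x)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """0 for negative inputs, x itself for positive inputs."""
    return max(0.0, x)

# Compare the two on a negative, a zero, and a positive input.
for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  relu={relu(x):.1f}")
```

Neither function has any adjustable numbers inside it, which is why swapping one for the other changes the network's design but not what gets learned.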

Inputs 0.5, 0.8, 0.2; weights 0.3, -0.2, 0.5; bias 0.1.

weighted sum + bias = (0.5·0.3) + (0.8·(-0.2)) + (0.2·0.5) + 0.1
= 0.15 - 0.16 + 0.10 + 0.10 = 0.19
sigmoid(0.19) ≈ 0.547
ReLU(0.19) = 0.19
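The arithmetic above can be checked with a short script. The variable names are chosen for this sketch; the numbers are the ones from the worked example.

```python
import math

inputs = [0.5, 0.8, 0.2]
weights = [0.3, -0.2, 0.5]
bias = 0.1

# Weighted sum of inputs plus bias.
z = sum(w * a for w, a in zip(weights, inputs)) + bias

sigmoid_out = 1.0 / (1.0 + math.exp(-z))
relu_out = max(0.0, z)

# Rounding hides tiny floating-point error in the sum.
print(round(z, 2), round(sigmoid_out, 3), round(relu_out, 2))  # 0.19 0.547 0.19
```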

Counting the knobs (the 784-16-16-10 network)

| Connection | Weights + biases | Parameters |
|---|---|---|
| Input → hidden 1 | 784·16 + 16 | 12,560 |
| Hidden 1 → hidden 2 | 16·16 + 16 | 272 |
| Hidden 2 → output | 16·10 + 10 | 170 |
| Total | | 13,002 |

All weights and biases together are the parameters. Small network: ~13K. Modern networks: billions.
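The same count can be reproduced for any stack of fully connected layers. This is a small sketch; `count_parameters` is a name invented for the example, and it assumes every neuron connects to every neuron in the previous layer, as in the 784-16-16-10 network.

```python
def count_parameters(sizes):
    """Total weights and biases for fully connected layers of the given sizes."""
    total = 0
    # Pair each layer with the next one: (784, 16), (16, 16), (16, 10).
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per receiving neuron
    return total

layer_sizes = [784, 16, 16, 10]
print(count_parameters(layer_sizes))  # 13002
```

The input layer contributes no parameters of its own, since its values come straight from the data.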

  • “Each neuron is complicated.” No. Multiply, add, add bias, squash. Always the same.
  • “Weight equals bias.” No. Weight scales one input (on a connection); bias shifts the whole neuron.
  • “The smarts are in the activation function.” No. Sigmoid and ReLU never change while learning. The smarts are in the weights and biases.
  • “Billions of parameters means something exotic.” No. Billions of ordinary numbers in the same simple formula. Bigger, not different.

Each neuron is almost embarrassingly simple; the power is in how many there are and the exact numbers tuned into every one.