Summary: Weights, biases, and the squish

Lesson 2 said hidden neurons “get their number from the layer before” and “do the work,” but never said how. This lesson is the how, and it is one small computation that every neuron in the network runs: multiply each incoming activation by a weight, add them up, add a bias, and squash the result into a usable range. Learn it once and you understand every neuron there is. The surprise at the end is the count: even a tiny digit network has about thirteen thousand of these weights and biases, and modern networks have billions. This is the scan-it-in-five-minutes version.

  • Every neuron runs the same three-step computation. Take a weighted sum (each incoming activation times its connection’s weight, all added up), add the neuron’s bias, then pass the total through an activation function. In one line: activation = squish(weighted sum + bias).
  • Weights set attention. Each weight sits on a connection: a large positive weight boosts that input, a negative one dampens it, a near-zero one ignores it. The pattern of weights into a neuron acts like a template, and the weighted sum scores how well the input matches it.
  • The bias sets eagerness. Added after the weighted sum, the bias shifts when the neuron wakes up: negative makes it cautious, positive makes it eager. Think of it as the neuron’s default lean before it looks at any input.
  • The squish keeps numbers in range. The weighted sum plus bias can be any number; the activation function squashes it back into a usable range. Sigmoid (1 / (1 + e^(-x))) is a smooth S-curve into (0, 1); ReLU (max(0, x)) is zero for negatives and the input itself for positives. Which one to use is a design choice, and the function itself does not change as the network learns.
  • Worked once: inputs 0.5, 0.8, 0.2, weights 0.3, -0.2, 0.5, bias 0.1 give a weighted-sum-plus-bias of 0.19. Sigmoid turns that into about 0.547; ReLU leaves it at 0.19. Same neuron, two squishes, two activations.
  • Weights and biases together are the parameters. Count them for the 784-16-16-10 network, layer by layer, as weights plus biases: (784 × 16 + 16) + (16 × 16 + 16) + (16 × 10 + 10) = 12,560 + 272 + 170 = 13,002. A small digit recognizer already needs about thirteen thousand numbers set just right; modern networks have billions.
  • A network’s behavior lives in its parameter values, not in the structure (just layers and connections) or the formula (never changes). Two identically built networks behave differently purely because their parameters differ. The power comes from quantity, not from any single neuron being clever.
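
The points above can be checked with a few lines of Python. This is a minimal sketch, not code from the lesson: the function names (sigmoid, relu, neuron, count_parameters) are illustrative choices, and it assumes fully connected layers where every neuron has one weight per incoming activation plus one bias. It reruns the worked example and recounts the parameters of the 784-16-16-10 network.

```python
import math


def sigmoid(x):
    # Smooth S-curve that maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    # Zero for negative inputs, the input itself otherwise.
    return max(0.0, x)


def neuron(inputs, weights, bias, squish):
    # Step 1: weighted sum (each incoming activation times its weight).
    weighted_sum = sum(a * w for a, w in zip(inputs, weights))
    # Step 2: add the bias. Step 3: apply the activation function.
    return squish(weighted_sum + bias)


def count_parameters(layer_sizes):
    # Assumes fully connected layers: each neuron in a layer has one
    # weight per neuron in the previous layer, plus one bias.
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# The worked example from the summary.
inputs = [0.5, 0.8, 0.2]
weights = [0.3, -0.2, 0.5]
bias = 0.1

print(round(neuron(inputs, weights, bias, sigmoid), 3))  # 0.547
print(round(neuron(inputs, weights, bias, relu), 2))     # 0.19
print(count_parameters([784, 16, 16, 10]))               # 13002
```

Swapping sigmoid for relu changes only the last step, which is why the same neuron yields two different activations from the same weights and bias.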

“Billions of parameters” stops being a vague boast and becomes something exact: billions of plain numbers, each plugged into the same multiply-add-squash you can now run by hand. That reframing also tells you where a model’s apparent intelligence actually sits: not in the architecture and not in the math, but in the specific values tuned into every connection. That raises the question the rest of the track answers: nobody types thirteen thousand numbers by hand, let alone billions, so where do the right values come from? They are found automatically from examples, a process called training. Lesson 4 first zooms out to see the whole network as one big function with all those knobs, and then Phase 2 opens up how the knobs get set.