Practice: Weights, biases, and the squish

Self-check

Six short questions. Answer each one in your head (or on paper) before revealing the answer. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.

1. State, in order, the three steps every neuron runs to compute its activation.

Show answer

(1) Take a weighted sum: multiply each incoming activation by its connection’s weight and add them all up. (2) Add the neuron’s bias. (3) Pass the result through an activation function (the “squish”) to bring it into a usable range. In one line: activation = squish(weighted sum + bias). Every hidden and output neuron runs exactly this.
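If it helps to see the three steps as code, here is a minimal sketch in plain Python. The function names (`sigmoid`, `neuron_activation`) and the example numbers are made up for illustration; they are not from the lesson.

```python
import math

def sigmoid(x):
    """Smooth S-curve squish into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_activation(inputs, weights, bias, squish=sigmoid):
    """activation = squish(weighted sum + bias)."""
    # Step 1: weighted sum over the incoming activations.
    weighted_sum = sum(a * w for a, w in zip(inputs, weights))
    # Step 2: add the neuron's own bias.
    pre_activation = weighted_sum + bias
    # Step 3: squish into a usable range.
    return squish(pre_activation)

# Illustrative values only: weighted sum is 2.0, bias cancels it to 0.0.
activation = neuron_activation([1.0, 0.0], [2.0, -1.0], bias=-2.0)
print(activation)
```

The squish is passed in as an argument here only to underline that it is a separate, swappable step; the weighted sum and the bias work the same way whichever squish you pick.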

2. What is the difference between a weight and a bias?

Show answer

A weight sits on a connection and scales one specific input: a large positive weight boosts that input, a negative weight makes that input count against the neuron activating, a near-zero weight ignores it. A bias belongs to the neuron itself and shifts its overall eagerness to activate: negative makes it cautious (hard to wake up), positive makes it eager. Different jobs, different homes.
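A small sketch can make the two jobs visible. It assumes a sigmoid squish and uses made-up inputs, weights, and biases; the helper name `activation` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activation(inputs, weights, bias):
    """squish(weighted sum + bias) for one neuron."""
    return sigmoid(sum(a * w for a, w in zip(inputs, weights)) + bias)

inputs = [1.0, 1.0]

# Changing a weight changes how much ONE input counts.
ignores_second = activation(inputs, [1.0, 0.0], bias=0.0)  # second input has no effect
boosts_second = activation(inputs, [1.0, 3.0], bias=0.0)   # second input counts a lot

# Changing the bias shifts the neuron as a whole, whatever the inputs are.
cautious = activation(inputs, [1.0, 1.0], bias=-4.0)
eager = activation(inputs, [1.0, 1.0], bias=4.0)

print(ignores_second, boosts_second, cautious, eager)
```

With a zero weight on the second connection, you can change the second input to anything and the activation will not move. The bias, by contrast, moves the result no matter which inputs arrive.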

3. Sigmoid and ReLU are different functions. What do they have in common, and where do they sit in the computation?

Show answer

Both are activation functions: they take the unbounded “weighted sum plus bias” and turn it into a well-behaved activation for the next layer. Sigmoid is a smooth S-curve mapping into (0, 1); ReLU is max(0, x). Which one a network uses is a design choice. Crucially, both are fixed functions that never change as the network learns.
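To compare the two side by side, here is a short sketch that evaluates both on a few sample values. The sample values are arbitrary, chosen only to show the shapes.

```python
import math

def sigmoid(x):
    """Smooth S-curve: always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Zero for negatives, the input itself for positives."""
    return max(0.0, x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  relu={relu(x):.1f}")
```

Neither function has any number inside it that training could adjust, which is the sense in which both are fixed.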

4. A network has “billions of parameters.” What, concretely, are those parameters?

Show answer

The weights and biases, all of them, counted together. A parameter is just one ordinary number plugged into the same multiply-add-squash formula. “Billions of parameters” means billions of those numbers; it is bigger, not different or more exotic. Even the small 784-16-16-10 digit network has about 13,000.

5. If the structure is just layers and the formula never changes, where does a network’s behavior actually live?

Show answer

Entirely in the specific values of the weights and biases. Two networks with identical structure and the same activation function can behave completely differently purely because their parameters are set to different numbers. Finding the right values is what training does, and nobody types them by hand; they are found automatically from examples.
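Here is a sketch of that idea under some assumptions: two hypothetical 2-2-1 networks, both using sigmoid, with hand-picked parameters chosen purely for illustration (a real network would get its values from training). The helper names `layer` and `forward` are invented for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: each row of weights belongs to one neuron."""
    return [sigmoid(sum(a * w for a, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, params):
    """Run the inputs through every layer in turn."""
    activations = inputs
    for weights, biases in params:
        activations = layer(activations, weights, biases)
    return activations

# Same structure (2-2-1), same squish, different numbers.
net_a = [([[4.0, 4.0], [-4.0, -4.0]], [0.0, 0.0]), ([[5.0, -5.0]], [0.0])]
net_b = [([[-4.0, -4.0], [4.0, 4.0]], [0.0, 0.0]), ([[5.0, -5.0]], [0.0])]

x = [1.0, 1.0]
out_a = forward(x, net_a)[0]
out_b = forward(x, net_b)[0]
print(out_a, out_b)
```

Both networks run exactly the same code on exactly the same input; one output lands near 1 and the other near 0, and the only thing that differs is the parameter values.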

6. Why does the lesson say the network’s power comes from quantity, not from any single neuron?

Show answer

Because each individual neuron does something almost trivial: multiply, add, add a bias, squash. No single step is clever. The impressive behavior emerges from the sheer number of these simple units, all with their parameters tuned to just the right values. Complexity comes from how many, not from how complicated each one is.

Try it yourself, part 1: run a neuron by hand

Pen and paper (a calculator helps for the squish), about 8 minutes. This is the lesson’s core computation with fresh numbers, done both ways.

Setup. A demo neuron with three inputs. The incoming activations are 0.4, 1.0, 0.6, the weights on those connections are 0.5, -0.3, 0.2, and the bias is -0.1.

Step 1. Compute the weighted sum, then add the bias.

Step 2. Squish with ReLU (max(0, x)). What is the activation?

Step 3. Squish with sigmoid (1 / (1 + e^(-x))) instead. Roughly what is the activation? (Hint: sigmoid of a small negative number is a little below 0.5.)

Show answer

Step 1. Weighted sum, term by term:

(0.4 · 0.5) + (1.0 · -0.3) + (0.6 · 0.2) = 0.20 - 0.30 + 0.12 = 0.02
then add the bias:  0.02 + (-0.1) = -0.08

So the value going into the squish is -0.08.

Step 2. ReLU: max(0, -0.08) = 0. The activation is exactly 0. Because the weighted-sum-plus-bias came out negative, a ReLU neuron stays completely quiet for this input.

Step 3. Sigmoid: with x = -0.08, 1 / (1 + e^(-x)) = 1 / (1 + e^(0.08)), which is about 0.480, just below the halfway mark of 0.5. Same neuron, same inputs and parameters, but sigmoid still gives a mid-range activation where ReLU gives a flat zero. For a negative input, that is the difference between the two squishes, made concrete.
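If you want to check your hand calculation, the sketch below redoes the same arithmetic in plain Python with the numbers from the setup. The variable names are just for this example.

```python
import math

inputs = [0.4, 1.0, 0.6]
weights = [0.5, -0.3, 0.2]
bias = -0.1

# Step 1: weighted sum, then add the bias.
weighted_sum = sum(a * w for a, w in zip(inputs, weights))
z = weighted_sum + bias

# Step 2: ReLU squish.
relu_activation = max(0.0, z)

# Step 3: sigmoid squish.
sigmoid_activation = 1.0 / (1.0 + math.exp(-z))

print(round(weighted_sum, 2), round(z, 2), relu_activation, round(sigmoid_activation, 3))
```

The rounding is there because floating-point arithmetic can leave tiny stray digits (for example 0.020000000000000018 rather than 0.02); the values match the hand calculation to the precision that matters.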

Try it yourself, part 2: count the parameters

About 4 minutes, arithmetic only. Take the network you sized in the last lesson’s practice: input layer of 256 neurons (a 16 by 16 image), two hidden layers of 20 neurons each, and an output layer of 10. How many parameters (weights plus biases) does it have? Remember: each neuron in a layer has one weight per neuron in the previous layer, plus one bias of its own.

Show answer

Go layer by layer (weights = neurons-in-this-layer times neurons-in-previous-layer; biases = neurons-in-this-layer):

input(256) → hidden1(20):   256·20 + 20  = 5,140
hidden1(20) → hidden2(20):    20·20 + 20  =   420
hidden2(20) → output(10):     20·10 + 10  =   210
total parameters             = 5,140 + 420 + 210 = 5,770

So a network with only 306 neurons (from last lesson) carries 5,770 parameters. Notice there are far more parameters than neurons: 5,720 of the 5,770 are weights, and weights live on the connections. A fully connected layer has a connection between every pair of neurons in adjacent layers, which is why parameter counts explode so much faster than neuron counts.
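The same layer-by-layer count can be written as a short function. This is a sketch that assumes every layer is fully connected to the one before it; the name `count_parameters` is made up for this example.

```python
def count_parameters(layer_sizes):
    """Total weights plus biases for a fully connected network."""
    total = 0
    # Pair each layer with the one before it.
    for previous, current in zip(layer_sizes, layer_sizes[1:]):
        weights = previous * current  # one weight per connection
        biases = current              # one bias per neuron in this layer
        total += weights + biases
    return total

practice_total = count_parameters([256, 20, 20, 10])
digit_total = count_parameters([784, 16, 16, 10])
print(practice_total, digit_total)
```

Running it on the 784-16-16-10 digit network as well gives 13,002, the figure behind the “about 13,000” quoted in the self-check.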

Flashcards

Nine cards.

Q. What is the one computation every neuron runs?

activation = squish(weighted sum of inputs + bias). Multiply each incoming activation by its weight, add them up, add a bias, then pass through an activation function. Same move in every hidden and output neuron.

Q. What does a weight do?

It sits on a connection and scales one specific input. Large positive weight boosts that input, negative weight makes it count against activating, near-zero weight ignores it. Weights are how a neuron decides what to pay attention to.

Q. What does a bias do?

It belongs to the neuron itself and shifts its default eagerness to activate. Negative bias makes the neuron cautious (hard to wake up); positive bias makes it eager. It is added after the weighted sum.

Q. What is an activation function, and name two common ones.

A fixed function that squashes the unbounded “weighted sum plus bias” into a usable range. Sigmoid: 1 / (1 + e^(-x)), a smooth S-curve into (0, 1). ReLU: max(0, x), zero for negatives, the input itself for positives.

Q. Does the activation function change as the network learns?

No. Sigmoid and ReLU are fixed and never change. They only keep activations in a usable range. The learning happens in the weights and biases, not in the squish.

Q. What are a network's 'parameters'?

All its weights and biases, counted together. Each is one ordinary number in the multiply-add-squash formula. The small 784-16-16-10 digit network has exactly 13,002; modern networks have billions.

Q. How do you count the parameters between two fully connected layers?

Weights = (neurons in this layer) times (neurons in the previous layer); biases = (neurons in this layer). Add them. Example: 784 inputs into 16 neurons = 784·16 + 16 = 12,560 parameters.

Q. Where does a network's behavior actually live?

In the specific values of its weights and biases, not in the structure (just layers) or the formula (never changes). Two identically structured networks behave differently purely because their parameters differ.

Q. Why does the lesson say power comes from quantity, not single neurons?

Each neuron does something almost trivial: multiply, add, add bias, squash. No single step is clever. The impressive behavior comes from the sheer number of simple units with parameters tuned to the right values.