What learning really means

Lesson 4 left us with a cliffhanger. We saw that a network with random weights produces random nonsense, and that the very same network with well-tuned weights reads handwritten digits reliably. Everything depends on landing on good values for those roughly 13,000 weights and biases. So the question that has been building for four lessons is finally unavoidable: how do you find the good numbers?

You cannot search for “better” if you have no way to tell better from worse. So before any clever method, we need something simpler and more basic: a way to measure how wrong the network is right now, as a single number. Get that number, and “improve the network” turns into “make that number smaller,” which is something we can actually chase.

A scorecard for wrongness

The tool that gives us that number is called the cost function. Sometimes it is called the loss, but cost is the name we will use. Its whole job is to look at how the network is currently behaving and return one number that says, in effect, “this is how badly you are doing.” High cost means the network is far from what we want. Low cost means it is close. Zero cost would mean it is perfect on everything we tested.

To build it, we first need to say what the right answer even looks like in the network’s own terms. Recall that the output layer is 10 neurons, one per digit. For an image that is actually a 3, the answer we want is for the “3” neuron to be fully on and every other neuron to be fully off. Written as ten numbers, the desired output is:

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

A list that is 1 in the correct slot and 0 everywhere else is called a one-hot output. It is just a tidy way to write “this answer, none of the others.”
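A minimal sketch of building that list in Python. The function name one_hot and its num_classes parameter are our own labels for illustration, not something from an earlier lesson.

```python
def one_hot(digit, num_classes=10):
    """Return a list that is 1 at index `digit` and 0 everywhere else."""
    if not 0 <= digit < num_classes:
        raise ValueError(f"digit must be in 0..{num_classes - 1}, got {digit}")
    return [1 if i == digit else 0 for i in range(num_classes)]


print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```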

Computing the cost, by hand

Now compare what we want against what the network actually gave. The recipe (the one the 3Blue1Brown series uses, and a common, intuitive choice) is: for each of the ten outputs, take the difference between what the network said and what we wanted, square it, and add all ten squares together.

Suppose for our “3” image the network outputs:

network: [0.02, 0.01, 0.05, 0.92, 0.03, 0.04, 0.01, 0.02, 0.01, 0.02]
desired: [0,    0,    0,    1,    0,    0,    0,    0,    0,    0   ]

Take each difference, square it, and sum:

(0.02)² + (0.01)² + (0.05)² + (0.92 - 1)² + (0.03)² + (0.04)²
       + (0.01)² + (0.02)² + (0.01)² + (0.02)²
= 0.0004 + 0.0001 + 0.0025 + 0.0064 + 0.0009 + 0.0016
       + 0.0001 + 0.0004 + 0.0001 + 0.0004
= 0.0129

A cost of about 0.013, which is small. That is the scorecard telling us the network did well on this image: it put 0.92 on the right answer and kept everything else near zero, so each difference was tiny and the squares were tinier.
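If you would rather let a computer do the arithmetic, the same recipe fits in a few lines of Python. This is a sketch; the name image_cost is ours, chosen for illustration.

```python
def image_cost(network_output, desired_output):
    """Sum of squared differences between what we got and what we wanted."""
    if len(network_output) != len(desired_output):
        raise ValueError("both lists must have the same length")
    return sum((a - y) ** 2 for a, y in zip(network_output, desired_output))


network = [0.02, 0.01, 0.05, 0.92, 0.03, 0.04, 0.01, 0.02, 0.01, 0.02]
desired = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

cost = image_cost(network, desired)
print(round(cost, 4))  # 0.0129
```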

Now watch what a bad answer does to the number. Suppose instead the network were hopelessly unsure and output 0.1 for every digit, a perfect shrug:

network: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
desired: [0,   0,   0,   1,   0,   0,   0,   0,   0,   0  ]

Nine of the outputs are 0.1 where we wanted 0, and one is 0.1 where we wanted 1:

9 · (0.1)² + (0.1 - 1)² = 9 · 0.01 + 0.81 = 0.09 + 0.81 = 0.90

A cost of 0.90, far higher than 0.013. The squaring is doing deliberate work here: the one badly wrong output, 0.1 where we needed 1, contributes 0.81 all by itself, because squaring a big miss makes it loom large. The cost is a wrongness budget: confident-and-correct spends almost none of it, while uncertain-or-wrong runs it way up.
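To see how lopsided that is, here is a small sketch that splits the shrug's cost into its ten squared pieces and checks what fraction comes from the single missed slot. The variable names are ours.

```python
network = [0.1] * 10
desired = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# One squared difference per output neuron.
contributions = [(a - y) ** 2 for a, y in zip(network, desired)]
total = sum(contributions)

# Fraction of the whole cost that comes from the "3" slot alone.
share_from_miss = contributions[3] / total

print(round(total, 2))            # 0.9
print(round(share_from_miss, 2))  # 0.9
```

One output out of ten accounts for about 90 percent of the cost, because 0.81 of the 0.90 total comes from that one slot.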

One image is not the whole story, though. The real cost is this same calculation run across the entire pile of labeled training images and then averaged. A network that nails one image but flubs the other 59,999 has a high average cost. The cost function scores the network’s behavior over everything we are training it on, in one number.
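A sketch of the averaging step, under a loud assumption: we pretend the two outputs worked through above came from two different images that are both 3s, giving a toy training set of size two. A real training set would have tens of thousands of entries. The function names are ours.

```python
def image_cost(network_output, desired_output):
    """Sum of squared differences for one image."""
    return sum((a - y) ** 2 for a, y in zip(network_output, desired_output))


def average_cost(outputs, desired_outputs):
    """Mean of the per-image costs over a whole labeled set."""
    if not outputs or len(outputs) != len(desired_outputs):
        raise ValueError("need one desired output per network output")
    per_image = [image_cost(o, d) for o, d in zip(outputs, desired_outputs)]
    return sum(per_image) / len(per_image)


three = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# Toy set (hypothetical): the confident answer and the shrug from the text.
toy_outputs = [
    [0.02, 0.01, 0.05, 0.92, 0.03, 0.04, 0.01, 0.02, 0.01, 0.02],
    [0.1] * 10,
]
toy_desired = [three, three]

toy_average = average_cost(toy_outputs, toy_desired)
print(round(toy_average, 3))  # 0.456
```

One good answer does not rescue the average: the shrug drags it up to roughly half of 0.90.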

The reframe: cost is a function of the knobs

Here is the move that turns this from bookkeeping into the engine of the whole track. Ask yourself: for a fixed pile of training images, what is the cost actually a function of?

Not the images. Those are fixed; we are not changing the training set. The only things free to change are the network’s weights and biases. Nudge a weight and the network’s outputs shift, so the differences shift, so the cost shifts. The cost depends entirely on the choice of weights and biases.

So we can write the cost as a function of the knobs, the weights and biases:

C(w, b)

Its inputs are every weight and every bias in the network, and its output is that single wrongness number. For our small digit network, the cost takes about 13,000 numbers in and returns one number out. It is a function from the entire parameter space to a single score.

Sit with how strange and useful that is. In lesson 4 we said the network itself is a function that turns an image into a guess. Now we have a second function stacked on top, the cost, that turns a whole setting of the knobs into a grade. The input to the network is an image; the input to the cost is an entire network.
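Here is that idea as a sketch, shrunk to a size you can hold in your head. Everything in it is an assumption made for illustration: a "network" with one neuron, one weight w and one bias b, a sigmoid squashing function, and a made-up training set of two examples. The point is only the shape of the thing: the data is frozen, and the cost takes the knobs as its arguments.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


# Fixed toy training set (hypothetical): (input, desired output) pairs.
TRAINING_SET = [(0.0, 0.0), (1.0, 1.0)]


def cost(w, b):
    """Average squared error of the one-neuron network over the fixed set."""
    total = 0.0
    for x, desired in TRAINING_SET:
        output = sigmoid(w * x + b)
        total += (output - desired) ** 2
    return total / len(TRAINING_SET)


# Same data both times; only the knobs differ.
print(cost(0.0, 0.0))                     # 0.25
print(cost(6.0, -3.0) < cost(0.0, 0.0))   # True
```

With both knobs at zero the neuron outputs 0.5 for everything and the cost is 0.25. A different setting of the same two knobs scores much better, and the training set never changed.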

Learning is just making the cost small

And now learning is no longer mysterious at all. Training the network means finding values of the weights and biases that make the cost as small as possible. That is the entire conceptual goal. Not “teaching the network about threes,” not “showing it what digits mean,” just: search the space of possible knob settings for one that drives the wrongness number down.

In plain terms, learning is an optimization problem. We have a number we want to minimize and about 13,000 dials we can turn. Find the setting of the dials that makes the number smallest, and you have a trained network. There is no comprehension being installed and no magic happening. There is a number going down.
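To make "a number going down" concrete, here is a deliberately naive sketch. It is not the method this track goes on to build. It uses a hypothetical one-neuron network with two knobs and a made-up two-example training set, tries small random nudges, and keeps a nudge only when the cost drops. All names and settings are ours.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Fixed toy training set (hypothetical): (input, desired output) pairs.
TRAINING_SET = [(0.0, 0.0), (1.0, 1.0)]


def cost(w, b):
    """Average squared error of a one-neuron network over the fixed set."""
    errors = [(sigmoid(w * x + b) - desired) ** 2 for x, desired in TRAINING_SET]
    return sum(errors) / len(errors)


def random_nudge_search(steps=200, step_size=0.5, seed=0):
    """Try random nudges to (w, b); keep one only if it lowers the cost."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best = cost(w, b)
    history = [best]
    for _ in range(steps):
        candidate_w = w + rng.uniform(-step_size, step_size)
        candidate_b = b + rng.uniform(-step_size, step_size)
        candidate_cost = cost(candidate_w, candidate_b)
        if candidate_cost < best:
            w, b, best = candidate_w, candidate_b, candidate_cost
        history.append(best)
    return w, b, history


w, b, history = random_nudge_search()
print("starting cost:", history[0])
print("final cost:   ", history[-1])
```

Nothing in that loop knows what the examples mean. It only watches one number and keeps whatever makes it smaller. Blind nudging like this is workable with two knobs; it is not a serious plan for thousands.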

The catch, and the reason the next two lessons exist, is that doing this well is genuinely hard. There are about 13,000 dials, not two. The cost function is a complicated, bumpy thing, not a tidy bowl. You cannot just try every combination; the number of possibilities is beyond astronomical. So you need a method that, from wherever you currently stand in this vast space, can figure out which way is downhill and take a sensible step. That is exactly what the next lessons build.
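A quick back-of-the-envelope sketch of "beyond astronomical." Assume, very generously, that each dial could only be set to a handful of positions. The count of combinations is then positions raised to the power of the number of dials, and we can ask how many decimal digits that count has. The function name and the choice of 2 or 10 positions are ours.

```python
import math


def digits_in_combination_count(num_dials, positions_per_dial):
    """Number of decimal digits in positions_per_dial ** num_dials."""
    return math.floor(num_dials * math.log10(positions_per_dial)) + 1


# Even with only two positions per dial, the count is a 3,914-digit number.
print(digits_in_combination_count(13_000, 2))   # 3914

# With ten positions per dial it is a 13,001-digit number.
print(digits_in_combination_count(13_000, 10))  # 13001
```

And real weights are not limited to two or ten positions; they can take any value. Exhaustive search is off the table.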

Why this matters when you use AI

Once you see that training is “minimize a wrongness score,” a lot of how AI behaves becomes legible. When people say a model was “trained,” they mean its parameters were adjusted to push down a cost like this one. When they talk about “loss going down” during training, that falling number is literally this score. The whole multi-week, expensive process of training a large model is, at heart, this same chase: make the wrongness number smaller.

It also tells you where a model’s quality comes from. A model is only ever optimized to do well on what its cost was measured against. Train on a narrow or biased pile of examples and the network will happily minimize its cost on exactly that pile, blind spots and all, because the cost never penalized what it never saw. The score is the only thing the network is chasing, so what you put in the score is what you get. That single idea, that a model becomes good at precisely what it was graded on, explains a surprising amount of both the power and the failures of AI systems you use.
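A toy sketch of that blind spot. The "network" here is a hypothetical stand-in that ignores its input and always answers 3 with full confidence, and the piles of labeled images are made up (the images themselves are just placeholders). All names are ours.

```python
def one_hot(digit, num_classes=10):
    return [1 if i == digit else 0 for i in range(num_classes)]


def image_cost(output, desired):
    """Sum of squared differences for one image."""
    return sum((a - y) ** 2 for a, y in zip(output, desired))


def always_three(image):
    """Hypothetical 'network' that ignores the image and answers 3."""
    return one_hot(3)


def average_cost(model, labeled_images):
    """Mean per-image cost of `model` over (image, label) pairs."""
    costs = [image_cost(model(image), one_hot(label))
             for image, label in labeled_images]
    return sum(costs) / len(costs)


# Placeholder images (None) paired with labels.
narrow_pile = [(None, 3), (None, 3), (None, 3)]
broader_pile = [(None, 3), (None, 7), (None, 1), (None, 3)]

print(average_cost(always_three, narrow_pile))   # 0.0
print(average_cost(always_three, broader_pile))  # 1.0
```

Scored only against a pile of threes, this useless model earns a perfect cost of zero. The score cannot complain about digits it was never shown.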

Common pitfalls

Thinking learning means the network “understands” digits. It does not. Learning is turning knobs to make a number smaller. Any sense of understanding is something we read into a low cost, not something installed in the network.

Confusing the cost with the output. The network’s output is 10 numbers per image. The cost is one number that scores how far those outputs are from what we wanted, averaged over the whole training set. Different things.

Thinking low cost on training images means the network is good, full stop. It means the network does well on what it was scored against. How it does on images it never saw is a separate question this lesson does not settle.

Picturing the cost as a function of the image. For a fixed training set, the cost is a function of the weights and biases, not the images. That reframe is the entire point; miss it and the next two lessons will not land.

What you should remember

The cost function returns one number for how wrong the network currently is. For one image: take each output’s difference from the one-hot desired answer, square it, sum the ten. Then average over the whole training set.
Confident-and-correct gives low cost (our worked example, about 0.013); uncertain-or-wrong gives high cost (the all-0.1 shrug, 0.90). Squaring makes big misses dominate.
The cost is a function of the knobs (the weights and biases), mapping the roughly 13,000 of them to a single score. For a fixed training set, only the parameters are free to move.
Learning is minimizing the cost. Training is an optimization problem: find the knob settings that make the wrongness number small. No magic, just a number going down.

Learning is not the network coming to understand anything. It is a search for the knob settings that make one wrongness number as small as it can go.

Next: the cheatsheet puts the cost recipe and the worked numbers on one page. Then lesson 6 gives this search a shape. If the cost is a function over a vast space of possible knob settings, we can picture a landscape across that space, where height is cost, and minimizing cost becomes finding the lowest valley.