Practice: What learning really means

Self-check

Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.

1. What does the cost function give you, and what do high and low values mean?

Show answer

One number that says how wrong the network currently is. High cost means it is far from what we want; low cost means it is close; zero would mean perfect on everything tested. It is a scorecard for wrongness, computed over the whole training set.

2. What is a one-hot output, and why use it?

Show answer

A list of outputs that is 1 in the correct slot and 0 everywhere else, like [0,0,0,1,0,0,0,0,0,0] for a “3.” It is a tidy way to write “this answer, none of the others,” giving the cost function a clear target to measure the network’s actual output against.
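As a minimal sketch of the idea, assuming ten digit classes numbered 0 through 9, a one-hot target can be built like this. The helper name `one_hot` is made up for illustration and does not come from the lesson.

```python
def one_hot(digit, num_classes=10):
    """Return a list that is 1 in slot `digit` and 0 everywhere else."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be between 0 and num_classes - 1")
    return [1 if slot == digit else 0 for slot in range(num_classes)]

print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```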

3. State the cost recipe for a single image.

Show answer

For each of the ten outputs, take the difference between the network’s value and the one-hot desired value, square it, and add all ten squares together. That sum is the cost for that image. The total cost is this same calculation averaged over the entire pile of training images.
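The recipe can also be written as a formula. The symbols here are notation assumed for this sketch, not taken from the lesson: a_j is the network's value in slot j, y_j is the one-hot desired value in that slot, and N is the number of training images.

```latex
C_{\text{image}} = \sum_{j=0}^{9} (a_j - y_j)^2,
\qquad
C(w, b) = \frac{1}{N} \sum_{n=1}^{N} C_{\text{image}}^{(n)}
```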

4. The lesson writes cost as C(w, b), not C(image). Why does that matter?

Show answer

For a fixed training set, the images do not change, so the only things free to move are the weights and biases. Nudging a parameter shifts the outputs, which shifts the differences, which shifts the cost. So cost is a function of the roughly 13,000 parameters, mapping a whole network to one wrongness score. Missing this reframe is fatal for the next two lessons.
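A toy sketch of that reframe, under loud assumptions: a "network" with one weight and one bias instead of roughly 13,000 parameters, a sigmoid output, and a made-up two-example training set. The point is only that the data stays fixed while the cost changes when a parameter is nudged.

```python
import math

# Fixed, made-up training set: (input, desired output) pairs.
TRAINING_SET = [(0.0, 0.0), (1.0, 1.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b):
    """Average squared error of a one-weight, one-bias 'network' on TRAINING_SET."""
    total = 0.0
    for x, desired in TRAINING_SET:
        output = sigmoid(w * x + b)
        total += (output - desired) ** 2
    return total / len(TRAINING_SET)

before = cost(w=1.0, b=0.0)
after = cost(w=1.5, b=0.0)  # same data, one parameter nudged
print(before, after)
```

Nothing about `TRAINING_SET` changed between the two calls; only `w` moved, and the cost moved with it.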

5. In one sentence, what is “learning”?

Show answer

Finding the values of the weights and biases that make C(w, b) as small as possible. It is an optimization problem (about 13,000 dials, one number to minimize), not the network coming to “understand” anything. There is no comprehension installed; there is a number going down.

6. Why does the lesson say a model becomes good at “precisely what it was graded on”?

Show answer

Because the cost is the only thing training pushes down, and the cost is measured against the training examples. The network minimizes its cost on exactly that pile, blind spots and all, because the cost never penalized what it never saw. What you put in the score is what you get, which explains a lot of both the power and the failures of AI systems.

Try it yourself, part 1: compute cost for a good and a bad answer

Pen and paper, about 8 minutes. The image is actually a 7, so the desired one-hot output is [0,0,0,0,0,0,0,1,0,0] (a 1 in slot 7, counting slots from 0, and 0 everywhere else). You will score two different network outputs.

Output A (the network is mostly right):

network: [0, 0, 0.3, 0, 0, 0, 0, 0.8, 0, 0.1]

Output B (the network confidently shouts “2”):

network: [0, 0, 0.9, 0, 0, 0, 0, 0.1, 0, 0]

For each, compute the cost: take each output’s difference from the desired value, square it, and sum the ten squares. Then compare.

Show answer

Output A. Only three slots differ from the target, so only three terms are nonzero:

slot 2: (0.3 - 0)² = 0.09
slot 7: (0.8 - 1)² = 0.04
slot 9: (0.1 - 0)² = 0.01
cost = 0.09 + 0.04 + 0.01 = 0.14

A cost of 0.14: low-ish. The network put most of its weight on the right slot (7) and the misses were small.

Output B. Two big misses:

slot 2: (0.9 - 0)² = 0.81   (loud about the wrong digit)
slot 7: (0.1 - 1)² = 0.81   (quiet about the right one)
cost = 0.81 + 0.81 = 1.62

A cost of 1.62: very high, and notably higher than a total shrug (the all-0.1 output from the lesson scored 0.90). That is the squaring at work: squaring shrinks small misses far more than big ones (a miss of 0.1 becomes 0.01, while a miss of 0.9 only drops to 0.81), so a confident wrong answer is penalized harder than honest uncertainty. The cost function actively prefers “unsure” over “confidently wrong.”
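To check the arithmetic, here is a short sketch of the single-image recipe applied to Output A, Output B, and the all-0.1 shrug. The function and variable names are chosen for illustration only.

```python
def image_cost(network_output, desired_output):
    """Sum of squared differences between network output and one-hot target."""
    return sum((a - y) ** 2 for a, y in zip(network_output, desired_output))

target_7 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
output_a = [0, 0, 0.3, 0, 0, 0, 0, 0.8, 0, 0.1]
output_b = [0, 0, 0.9, 0, 0, 0, 0, 0.1, 0, 0]
shrug = [0.1] * 10

# Round to two places to hide floating-point noise in the sums.
for name, output in [("A", output_a), ("B", output_b), ("shrug", shrug)]:
    print(name, round(image_cost(output, target_7), 2))
# A 0.14
# B 1.62
# shrug 0.9
```

If the recipe summed plain absolute differences instead of squares, Output B and the shrug would tie at 1.8, so the gap between them comes entirely from the squaring.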

Try it yourself, part 2: graded on what?

About 3 minutes, reasoning only. A team trains a digit recognizer only on tidy, centered, computer-printed digits, and its training cost drops to nearly zero. They are thrilled. Then they deploy it on real handwritten digits and it performs badly. Using only this lesson’s ideas, explain what happened.

Show answer

The network did exactly what training asks: it drove C(w, b) toward zero on the pile it was scored against, which was tidy printed digits. The cost never measured performance on messy handwriting, so nothing pushed the parameters to handle it. The model became good at precisely what it was graded on. Low training cost only certifies performance on the training set; behavior on inputs unlike that set is a separate question the cost never asked. The fix is to grade on (train on) examples that look like what it will actually face.
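The same effect shows up in a deliberately tiny toy, sketched here with made-up numbers rather than real digit images. The one-parameter threshold "model" and both data sets are hypothetical; "training" is just trying a few thresholds and keeping the one with the lowest cost on the tidy set.

```python
# Made-up 1-D toy data: (feature, label). Not real digit images.
tidy_training = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
messy_deployment = [(0.35, 0), (0.45, 0), (0.55, 1), (0.65, 1)]

def predict(x, threshold):
    """Hypothetical one-parameter model: output 1 above the threshold, else 0."""
    return 1 if x > threshold else 0

def average_cost(threshold, examples):
    """Average squared error of the threshold model on the given examples."""
    return sum((predict(x, threshold) - y) ** 2 for x, y in examples) / len(examples)

# "Training": try thresholds 0.0, 0.1, ..., 1.0 and keep the first one
# with the lowest cost on the tidy training set.
candidates = [i / 10 for i in range(11)]
best = min(candidates, key=lambda t: average_cost(t, tidy_training))

print(average_cost(best, tidy_training))     # 0.0
print(average_cost(best, messy_deployment))  # 0.5
```

The search stops caring as soon as the tidy cost hits zero. Nothing in that score tells it the chosen threshold sits in the wrong place for the messier examples.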

Flashcards

Nine cards.

Q. What is the cost function?

A single number for how wrong the network currently is. High = far from what we want, low = close, zero = perfect on everything tested. Computed over the whole training set.

Q. What is a one-hot output?

A list of outputs that is 1 in the correct slot and 0 everywhere else, like [0,0,0,1,0,0,0,0,0,0] for a “3.” It is the desired-answer target the cost measures against.

Q. State the cost recipe for one image.

For each of the 10 outputs, take (network value minus desired value), square it, and sum all 10 squares. The total cost averages this over the entire training set.

Q. Why is the cost written C(w, b) rather than C(image)?

For a fixed training set, the images do not change; only the weights and biases are free to move. So cost is a function of the ~13,000 parameters, mapping a whole network to one wrongness score.

Q. What is “learning,” in one sentence?

Finding the weights and biases that make C(w, b) as small as possible. It is an optimization problem (minimize one number by turning ~13,000 dials), not the network understanding anything.

Q. Why does squaring the differences matter?

It makes big misses dominate. A single badly wrong output (say 0.1 where 1 was wanted) contributes 0.81 on its own, while a miss of 0.1 contributes only 0.01, so the cost punishes confident errors far more than small ones.

Q. Which costs more: a total shrug or a confident wrong answer?

A confident wrong answer. The all-0.1 shrug scores 0.90, but confidently shouting the wrong digit can score higher (e.g. 1.62), because squaring punishes its two big misses hardest.

Q. Does low training cost mean the network is good, full stop?

No. It means good on what it was scored against. Performance on images it never saw is a separate question the cost never measured.

Q. Why does a model become good at “what it was graded on”?

Because the cost is the only thing training pushes down, and the cost is measured against the training examples. The network minimizes cost on exactly that pile, including its blind spots. What you put in the score is what you get.