Summary: What learning really means

Phase 1 ended on a cliffhanger: a network only works once its roughly 13,000 weights and biases are set well, so how do we find good values? You cannot chase “better” without a way to tell better from worse, so this lesson builds that measure: the cost function, a single number for how wrong the network is right now. Once you have it, the whole of learning collapses into one clean idea: adjust the knobs to make that number small. No understanding gets installed; a wrongness number goes down. This is the scan-it-in-five-minutes version.

  • The cost function is a scorecard for wrongness. It returns one number: high means the network is far from what we want, low means close, zero would mean perfect on everything tested.
  • The desired answer is written one-hot. For a “3,” the target output is [0,0,0,1,0,0,0,0,0,0]: 1 in the correct slot, 0 elsewhere. It says “this answer, none of the others.”
  • The recipe: for each of the ten outputs, take the difference from the one-hot target, square it, and sum the ten squares. That is the cost for one image; average it over the whole training set to get the overall cost. Worked through once on an image of a “3”: a confident-correct output [.02,.01,.05,.92,...] scores about 0.0129 (low), while an all-0.1 shrug scores 0.90 (high). Squaring keeps every miss positive and makes big misses dominate.
  • Cost is a function of the knobs, C(w, b). For a fixed training set the images do not change, so only the weights and biases are free to move. C maps the roughly 13,000 parameters to one score. It sits one level above the network function f(x; w, b): the network’s input is an image, but the cost’s input is an entire network, meaning one full setting of all its weights and biases.
  • Learning is minimizing C(w, b). Training is an optimization problem: find the knob settings that make the wrongness number smallest. Not “teaching it about threes,” just a number going down. It is hard because there are ~13,000 dials, the cost surface is bumpy, and brute force is impossible, which is exactly why the next two lessons exist.
  • A model is only as good as what it was graded on. The network minimizes cost on precisely the examples it was scored against, blind spots and all, because the cost never penalized what it never saw. What you put in the score is what you get.
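
The recipe in the bullets above is short enough to write out. This is a minimal sketch in plain Python, not the lesson's own code: the names one_hot, example_cost and average_cost are made up for illustration, and the two-image "training set" at the bottom is a stand-in for the real one.

```python
def one_hot(label, num_classes=10):
    """Target vector: 1.0 in the slot for `label`, 0.0 everywhere else."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def example_cost(output, label):
    """Cost for one image: sum of squared differences from the one-hot target."""
    target = one_hot(label, len(output))
    return sum((o - t) ** 2 for o, t in zip(output, target))


def average_cost(outputs, labels):
    """Overall cost: the per-image costs averaged over the whole set."""
    costs = [example_cost(o, l) for o, l in zip(outputs, labels)]
    return sum(costs) / len(costs)


shrug = [0.1] * 10        # the all-0.1 "no idea" output
perfect = one_hot(3)      # exactly the target for a "3"

print(round(example_cost(shrug, 3), 4))    # 0.9, the high score for the shrug
print(example_cost(perfect, 3))            # 0.0, perfect on this image
# Hypothetical two-image training set, both labeled "3".
print(round(average_cost([shrug, perfect], [3, 3]), 4))   # 0.45
```

The shrug's 0.90 comes from nine slots that each miss by 0.1 (nine times 0.01) plus the correct slot missing by 0.9 (0.81), which is the squaring effect at work: one big miss outweighs nine small ones.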

“Training” and “loss going down” stop being jargon. When people say a model was trained, they mean its parameters were adjusted to push down a cost like this one; the falling “loss” number during training is literally this score. A whole multi-week, expensive training run is, at heart, this same chase: make the wrongness number smaller. It also tells you where a model’s quality and its failures come from, since a model becomes good at exactly what its cost measured and stays blind to what it did not. The lesson leaves the hard part open: with roughly 13,000 dials and a bumpy cost, how do you actually find the downhill direction? Lesson 6 gives the search a shape, picturing the cost as a landscape, and Lesson 7 gives the method for walking down it.
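
To see "cost is a function of the knobs" in action, here is a rough sketch with a toy stand-in for the network. Everything in it is an assumption for illustration: a single sigmoid layer with two inputs and two outputs instead of the real ~13,000-parameter network, a two-example dataset, and hand-picked parameter values. It does no searching at all; it only shows that with the data held fixed, different knob settings produce different scores.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def network(x, w, b):
    """Toy f(x; w, b): one layer, output j = sigmoid(w[j] . x + b[j])."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)]


def cost(w, b, inputs, targets):
    """C(w, b): summed squared error per example, averaged over the fixed set."""
    total = 0.0
    for x, t in zip(inputs, targets):
        out = network(x, w, b)
        total += sum((o - ti) ** 2 for o, ti in zip(out, t))
    return total / len(inputs)


# Fixed (hypothetical) dataset: two 2-pixel "images" with one-hot targets.
inputs = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]

# Two hand-picked settings of the knobs. Only these change between calls.
zero_w, zero_b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
tuned_w, tuned_b = [[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]

print(cost(zero_w, zero_b, inputs, targets))     # 0.5: every output is a 0.5 shrug
print(cost(tuned_w, tuned_b, inputs, targets))   # far smaller than 0.5
```

The images and targets never change between the two calls; only w and b do. Training is the search for the setting that makes that printed number smallest, and how to run that search is what the next lessons cover.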