What learning really means, in brief

What you’ll learn

Phase 1 ended on a cliffhanger: the same network reads digits or spews nonsense depending entirely on its roughly 13,000 weights and biases, so everything comes down to finding good values for them. But you cannot chase “better” without a way to tell better from worse. This lesson builds that measure.

You will meet the cost function, whose whole job is to return one number for how wrong the network is right now. You will write the desired answer as a one-hot output (a 1 in the correct slot, 0 elsewhere), then compute the cost by hand: take each output’s difference from the target, square it, and sum. You will see a confident-correct output score about 0.0129 and a total shrug score 0.90, and feel why squaring makes big misses dominate. Then comes the reframe that powers the rest of the track: for a fixed training set, the cost is a function of the parameters, written C(w, b), and learning is just finding the (w, b) that makes C small. The lesson closes on why this is hard (about 13,000 dials, a bumpy surface, brute force impossible), which is precisely what lessons 6 and 7 take on.
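The hand computation described above (difference from the target, square, sum) can be sketched in a few lines of Python. This is only an illustration, not part of the lesson's practice: the helper names one_hot and squared_cost and all output values are hypothetical. The "shrug" is assumed here to mean every output sitting at 0.1, which happens to give the 0.90 quoted above; the lesson's own example outputs may differ.

```python
def one_hot(label, num_classes=10):
    """Target vector: 1 in the correct slot, 0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def squared_cost(output, target):
    """Sum of squared differences between each output and its target."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


target = one_hot(3)  # the correct digit is assumed to be 3

# Assumption: a "total shrug" means every output is 0.1.
shrug = [0.1] * 10

# Hypothetical outputs, made up for illustration only.
confident_correct = [0.02, 0.01, 0.03, 0.95, 0.01, 0.02, 0.01, 0.03, 0.02, 0.01]
confident_wrong = [0.02, 0.01, 0.03, 0.01, 0.01, 0.02, 0.01, 0.03, 0.95, 0.01]

print(round(squared_cost(shrug, target), 4))  # 0.9
print(squared_cost(confident_correct, target))  # small: every miss is tiny
print(squared_cost(confident_wrong, target))    # large: two big misses dominate
```

In the shrug case, nine small misses of 0.1 contribute 0.01 each, while the single miss of 0.9 contributes 0.81 on its own, which is the sense in which squaring lets big misses dominate.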

Where this fits

This is lesson 5, the first of Phase 2 (How a network learns). Phase 1 built the network as a function and reframed the goal as a search through parameter space; this lesson supplies the thing that makes a search possible, a score to chase. Lesson 6 turns that score into a picture, a cost landscape over the space of all parameter settings, and lesson 7 gives the method, gradient descent, for actually walking downhill in it. After this lesson you know what learning is aiming at; the next two show how it gets there.

Before you start

Prerequisite (within this track): lesson 4, The whole network as one function, especially the f(x; w, b) idea that the weights and biases are fixed numbers defining the network. This lesson stacks a second function on top of that one, so the distinction between the per-use input and the fixed parameters needs to be solid. The math is just subtraction, squaring, and adding; a calculator helps in the practice, and no coding is required.

By the end, you’ll be able to

Explain what a cost function is and how a one-hot target is used to score the network’s output
Compute the squared-difference cost of a network output by hand and explain why squaring makes big misses dominate
Explain why, for a fixed training set, the cost is a function of the weights and biases, written C(w, b)
Define learning as finding the weights and biases that minimize the cost, an optimization problem rather than comprehension
Explain why a model becomes good only at what its cost was measured against
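The idea behind the third and fourth objectives can be previewed with a toy sketch. Everything here is an assumption made for illustration: the model is a one-weight, one-bias line rather than the digit network, the training set is invented, and the cost is averaged over the examples (the lesson may define the total differently). The point is only that once the training set is fixed, the cost depends on nothing but (w, b), and learning means searching for the (w, b) that makes it small.

```python
# Hypothetical fixed training set of (input, target) pairs.
TRAINING_SET = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]


def model(x, w, b):
    """Toy stand-in for f(x; w, b): a single weight and a single bias."""
    return w * x + b


def cost(w, b):
    """C(w, b): average squared difference over the fixed training set."""
    total = sum((model(x, w, b) - y) ** 2 for x, y in TRAINING_SET)
    return total / len(TRAINING_SET)


# Brute force is feasible with two dials: try every setting on a coarse grid.
grid = [i * 0.5 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
best = min(((w, b) for w in grid for b in grid), key=lambda p: cost(*p))

print(cost(0.0, 0.0))  # a poor setting of the dials
print(best, cost(*best))  # the grid point with the lowest cost
```

With two dials and nine values each, the grid has 81 settings. With about 13,000 dials, even two values per dial would give 2 to the power 13,000 settings, which is why the brute-force approach in this sketch does not scale to the real network.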

Time and difficulty

Read time: about 10 minutes
Practice time: about 14 minutes (computing the cost for a good output and a confidently wrong one, a short reasoning drill, and flashcards)
Difficulty: standard