Cheatsheet: What learning really means
The one idea that matters
cost C(w, b) = one number for how wrong the network is right now

learning = adjust the weights and biases to make C as small as possible

No understanding installed. Just a wrongness number going down.
The cost recipe (one image)
- Write the desired output as one-hot: 1 in the correct slot, 0 elsewhere. (For a “3”: [0,0,0,1,0,0,0,0,0,0].)
- For each of the 10 outputs, take (network value minus desired value).
- Square each difference.
- Sum the 10 squares. That is the cost for this image.
- Average over the whole training set for the total cost.
(Sum of squared differences is the choice the 3B1B series uses; other cost functions exist.)
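The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the series; the function names `one_hot`, `image_cost`, and `total_cost` are assumptions made for this example.

```python
def one_hot(label, num_classes=10):
    """Desired output: 1.0 in the correct slot, 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def image_cost(output, label):
    """Sum of squared differences between network output and one-hot target."""
    desired = one_hot(label, len(output))
    return sum((o - d) ** 2 for o, d in zip(output, desired))


def total_cost(outputs, labels):
    """Average the per-image costs over a non-empty training set."""
    costs = [image_cost(o, y) for o, y in zip(outputs, labels)]
    return sum(costs) / len(costs)
```

Calling `image_cost(output, 3)` scores one image whose correct label is “3”; `total_cost` gives the single number that learning tries to push down.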
Worked costs (same “3” image)
| Network output | Reading | Cost |
|---|---|---|
| [.02,.01,.05,.92,.03,.04,.01,.02,.01,.02] | confident, correct | ≈ 0.0129 (low) |
| [.1,.1,.1,.1,.1,.1,.1,.1,.1,.1] | total shrug | 0.90 (high) |
The math for the high one: 9·(0.1)² + (0.1 - 1)² = 0.09 + 0.81 = 0.90. The single big miss (0.1 where 1 was wanted) contributes 0.81 of that 0.90, because squaring makes big misses dominate.
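Both table rows can be checked directly. The sketch below assumes the sum-of-squared-differences cost described earlier; `squared_error_cost` is a name chosen for this example.

```python
def squared_error_cost(output, correct_digit):
    # Desired value is 1 at the correct slot and 0 everywhere else.
    return sum(
        (value - (1.0 if i == correct_digit else 0.0)) ** 2
        for i, value in enumerate(output)
    )


confident = [.02, .01, .05, .92, .03, .04, .01, .02, .01, .02]
shrug = [0.1] * 10

confident_cost = squared_error_cost(confident, 3)
shrug_cost = squared_error_cost(shrug, 3)

print(round(confident_cost, 4))  # 0.0129
print(round(shrug_cost, 2))      # 0.9

# Share of the shrug's cost that comes from the one big miss (slot 3).
big_miss_share = (0.1 - 1.0) ** 2 / shrug_cost
print(round(big_miss_share, 2))  # 0.9
```

The last line shows that the single missed slot accounts for about 90% of the shrug's cost.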
The reframe that powers the track
network: f(x ; w, b)   input = an image, output = a guess

cost: C(w, b)   input = a whole network, output = a wrongness score

For a fixed training set, only w and b are free to move. C maps the ~13,000 parameters to one number. Learning = find the (w, b) that minimizes C.
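One way to see the reframe in code is to fix the training set inside a closure, so that the only arguments left are the weights and biases. This is a toy sketch under stated assumptions: a single layer of sigmoid neurons, two made-up 3-pixel "images", and the hypothetical helper names `network` and `make_cost`. It is not the network from the series.

```python
import math


def network(x, w, b):
    """f(x; w, b): input = an image (a short list of pixels), output = a guess."""
    # One layer of sigmoid neurons; w[j] holds the weights feeding output j.
    return [
        1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w[j], x)) + b[j])))
        for j in range(len(b))
    ]


def make_cost(images, labels):
    """Fix the training set and return C(w, b): a whole network in, one number out."""
    def cost(w, b):
        total = 0.0
        for x, label in zip(images, labels):
            out = network(x, w, b)
            total += sum(
                (o - (1.0 if j == label else 0.0)) ** 2 for j, o in enumerate(out)
            )
        return total / len(images)
    return cost


# Hypothetical toy data: two 3-pixel images, two classes.
images = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
labels = [0, 1]
C = make_cost(images, labels)

b = [0.0, 0.0]
w_bad = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]       # every output is 0.5
w_better = [[4.0, 0.0, -4.0], [-4.0, 0.0, 4.0]]  # outputs lean the right way
print(C(w_bad, b) > C(w_better, b))  # True
```

Note that `C` never takes an image as an argument: the images are baked in, and only the knob settings change between calls.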
Why it is hard (sets up L6, L7)
- ~13,000 dials, not 2.
- C is bumpy and complicated, not a tidy bowl.
- Brute force is impossible (the combinations are beyond astronomical).
- Need a method that finds “downhill” from wherever you stand. That is L6 (the landscape) and L7 (gradient descent).
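To put rough numbers on the first and third bullets, the sketch below counts the dials and the size of a brute-force grid. The layer sizes 784-16-16-10 are an assumption (they do not appear in this cheatsheet), as is the choice of only 10 candidate values per dial; `count_parameters` and `grid_exponent` are names made up for this example.

```python
import math


def count_parameters(layer_sizes):
    """Weights (one per connection) plus biases (one per non-input neuron)."""
    return sum(
        n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def grid_exponent(num_dials, values_per_dial):
    """log10 of how many settings a brute-force grid search would have to try."""
    return num_dials * math.log10(values_per_dial)


dials = count_parameters([784, 16, 16, 10])  # assumed layer sizes
print(dials)                     # 13002
print(grid_exponent(2, 10))      # 2.0, i.e. 100 settings for two dials
print(grid_exponent(dials, 10))  # 13002.0, i.e. 10**13002 settings
```

Even with this very coarse grid, the number of settings has over thirteen thousand digits, which is why trying them all is not an option.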
Pitfalls to dodge
- “Learning means the network understands.” No. Knobs turn; a number drops.
- “Cost is the output.” No. Output is 10 numbers per image; the total cost is one score averaged over the whole set.
- “Low training cost means good, period.” No. It means good on what it was scored against; unseen images are a separate question.
- “Cost is a function of the image.” No. For a fixed training set, cost is a function of the weights and biases.
The one-line version
Learning is a search for the knob settings that make one wrongness number as small as it can go.