Summary: How models actually learn: gradient descent
Gradient descent is how a model finds its best parameters when no formula can: treat the error as a landscape and keep stepping downhill until you reach the bottom. It is the search that lesson 2 left unanswered, and with minor variations it is how nearly every modern model learns, from a two-parameter line to a billion-parameter network. This summary is the scan version of the full lesson.
Core ideas
- The error is a landscape. Every setting of the parameters produces some total error, called the loss. Lay the possibilities out as terrain: parameters are the horizontal directions, loss is the height. Training means finding the lowest point.
- You can only feel the local slope. You cannot see the whole landscape, but you can sense which way the ground slopes under your feet. That is enough.
- The downhill rule. Feel which way is downhill, take a step that way, repeat until the ground is flat. The flat bottom is the low-loss point you wanted.
- The gradient is the local slope: which way the error changes if you nudge a parameter, and how steeply. It points uphill, so you step against it. No calculus needed for the intuition.
- The learning rate is the step size. Too large overshoots and may diverge; too small crawls. Choosing it well is an everyday craft.
- The loop: guess the parameters, compute the loss, compute the gradient, step against it (new = old - learning_rate * gradient), repeat until the steps stop helping.
- It finds a minimum, not always the global one. On a bumpy landscape it can settle in a local minimum. At scale, each step estimates the gradient from a small random sample of the data instead of the full dataset (stochastic gradient descent).
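The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own code: the toy data, the choice of mean squared error as the loss, and the names `loss`, `gradient`, `descend`, and `tolerance` are all assumptions made for the example. It fits the two-parameter line y = w*x + b and stops once a step no longer lowers the loss by a meaningful amount.

```python
# Minimal sketch: fit a line y = w*x + b by gradient descent.
# The data, the mean-squared-error loss, and every name below are
# illustrative assumptions, not taken from the lesson.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1


def loss(w, b):
    """Mean squared error of the line over the data (the 'height')."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, b):
    """Slope of the loss with respect to w and b (it points uphill)."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def descend(learning_rate=0.05, max_steps=10_000, tolerance=1e-12):
    w, b = 0.0, 0.0                      # initial guess
    previous = loss(w, b)
    for _ in range(max_steps):
        dw, db = gradient(w, b)          # feel the local slope
        w = w - learning_rate * dw       # new = old - learning_rate * gradient
        b = b - learning_rate * db
        current = loss(w, b)
        if previous - current < tolerance:
            break                        # the steps have stopped helping
        previous = current
    return w, b


w, b = descend()
print(f"w={w:.3f}, b={b:.3f}, loss={loss(w, b):.2e}")
```

On this toy data the parameters should settle very close to w = 2 and b = 1, the line the points were generated from. Because this particular loss surface is a single smooth bowl, there is no local minimum to get stuck in; a bumpier landscape would not offer that guarantee.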
What changes for you
“The model is training” stops being a mystery and becomes a picture you can hold: a point rolling downhill on a loss landscape, one step at a time, for millions of steps. Every weight in every large model you have heard of got its value this way. It also demystifies a few headlines: the “learning rate” people tune is literally the step size from this lesson, and a common reason training “blows up” is a step size set too large. You now have the engine of machine learning in hand: a way to measure error and a way to drive it down. The next phase builds on exactly that as it turns from predicting numbers to predicting categories.
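To make the step-size point concrete, here is a small sketch of what a learning rate that is too small, reasonable, or too large does on the simplest possible landscape. The one-parameter loss p squared, the function name `run`, and the three learning rates are assumptions chosen for illustration only.

```python
# Minimal sketch: the same descent loop with three different step sizes.
# The loss p**2 (gradient 2*p, minimum at p = 0) and the chosen learning
# rates are illustrative assumptions.

def run(learning_rate, steps=20, start=1.0):
    p = start
    for _ in range(steps):
        grad = 2 * p                     # slope of p**2 at the current point
        p = p - learning_rate * grad     # step against the gradient
    return p


for lr in (0.001, 0.1, 1.1):
    final = run(lr)
    print(f"learning_rate={lr}: distance from the minimum = {abs(final):.4f}")
```

Each step multiplies p by (1 - 2 * learning_rate). At 0.001 that factor is 0.998, so the point barely moves; at 0.1 it is 0.8, so the point closes in on the minimum; at 1.1 it is -1.2, so every step overshoots and lands farther away than it started, which is the "blow up" in miniature.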