
Summary: How models actually learn: gradient descent

Gradient descent is how a model finds its best parameters when no formula can: treat the error as a landscape and keep stepping downhill until you reach the bottom. It answers the question lesson 2 left open, and with minor variations it is how nearly every modern model learns, from a two-parameter line to a billion-parameter network. This summary is the quick-scan version of the full lesson.

  • The error is a landscape. Every setting of the parameters produces some total error, called the loss. Lay the possibilities out as terrain: parameters are the horizontal directions, loss is the height. Training means finding the lowest point.
  • You can only feel the local slope. You cannot see the whole landscape, but you can sense which way the ground slopes under your feet. That is enough.
  • The downhill rule. Feel which way is downhill, take a step that way, repeat until the ground is flat. The flat bottom is the low-loss point you wanted.
  • The gradient is the local slope: which way the error changes if you nudge a parameter, and how steeply. It points uphill, so you step against it. No calculus needed for the intuition.
  • The learning rate is the step size. Too large overshoots and may diverge; too small crawls. Choosing it well is an everyday craft.
  • The loop: guess the parameters, compute the loss, compute the gradient, step against it (new = old - learning_rate * gradient), repeat until the steps stop helping.
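  • It finds a minimum, not always the global one. On a bumpy landscape it can settle in a local minimum, a dip that is lower than its surroundings but not the lowest point overall. At scale, each step estimates the gradient from a small random sample of the data instead of the whole dataset (stochastic gradient descent).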
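
The loop in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own code: it assumes a two-parameter line y = w * x + b, mean squared error as the loss, and a tiny made-up dataset. The function names, the learning rate of 0.05, and the step count are all choices made for this example.

```python
# Minimal gradient descent sketch for a two-parameter line y = w * x + b.
# Assumptions for illustration: mean squared error as the loss, and a
# small made-up dataset generated from y = 2x + 1.

def loss(w, b, xs, ys):
    """Mean squared error of the line over the data (the 'height')."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

def gradient(w, b, xs, ys):
    """Slope of the loss with respect to w and b (points uphill)."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0                      # initial guess
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ys)  # feel the local slope
        w = w - learning_rate * dw       # new = old - learning_rate * gradient
        b = b - learning_rate * db
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # exactly y = 2x + 1

w, b = gradient_descent(xs, ys)
print(w, b, loss(w, b, xs, ys))          # w and b end up close to 2 and 1
```

This sketch runs a fixed number of steps for simplicity. A loop that follows the "repeat until the steps stop helping" rule would instead stop once the loss no longer improves by a meaningful amount.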

“The model is training” stops being a mystery and becomes a picture you can hold: a point rolling downhill on a loss landscape, one step at a time, for millions of steps. Every weight in every large model you have heard of got its value this way. It also demystifies a few headlines: the “learning rate” people tune is literally the step size from this lesson, and a common reason training “blows up” is a step size set too large. You now have the engine of machine learning in hand: a way to measure error and a way to drive it down. That is exactly what the next phase builds on as it turns from predicting numbers to predicting categories.
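
To make the step-size point concrete, here is a small sketch under simple assumptions: a one-parameter loss, loss(w) = w², whose gradient is 2w and whose minimum sits at w = 0. The helper name `final_w` and the three learning rates are hypothetical choices for illustration only.

```python
# How the learning rate changes the outcome on a toy loss: loss(w) = w ** 2.
# The gradient of w ** 2 is 2 * w, and the lowest point is at w = 0.

def final_w(learning_rate, steps=50, start=1.0):
    """Run gradient descent on w ** 2 and return the final parameter."""
    w = start
    for _ in range(steps):
        grad = 2 * w                   # local slope at the current point
        w = w - learning_rate * grad   # step against the gradient
    return w

good = final_w(0.1)        # each step shrinks w, so it settles near 0
too_small = final_w(0.001) # moves the right way but has barely left the start
too_large = final_w(1.1)   # each step overshoots further, so w grows

print(good, too_small, too_large)
```

With the largest rate, every step lands farther from the bottom than the one before, on alternating sides of it. That is the toy version of training "blowing up".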