Lesson: How models actually learn: gradient descent
The last lesson left a question hanging. We said the best-fit line is the one that makes the sum of squared residuals as small as possible, but we only ever compared two lines we already had. How do you actually find the best one, out of the infinitely many you never wrote down?
For a straight line, it turns out there is a direct formula that hands you the answer. But that formula is a luxury that almost nothing else enjoys. The moment a model gets more complicated than a line (logistic regression, a decision boundary, a neural network), there is no formula to solve. You have to search for the answer. Gradient descent is that search, and it is worth understanding deeply, because it is, with minor variations, how essentially every modern machine learning model learns, all the way up to the largest ones.
The error is a landscape
Start with a picture. Every possible setting of a model’s parameters produces some total error on the data: a big error for a bad fit, a small error for a good one. Imagine laying all those possibilities out as a landscape. The horizontal directions are the parameter values you could choose. The height at each spot is the error those parameters produce. We call that height the loss.
Training the model means finding the lowest point in this landscape, the parameter values where the loss is smallest. The catch is that you cannot see the whole landscape. There are too many possibilities to check them all, and with many parameters the landscape has too many dimensions to picture. You are standing somewhere on a foggy hillside, and all you can sense is the ground right under your feet: which way it slopes, and how steeply.
The downhill rule
That turns out to be enough. If you can feel which way is downhill, you can take a step in that direction, and the loss goes down a little. Then you feel the slope again at your new spot and take another step. Repeat, and you walk steadily down toward a valley. That is the whole idea of gradient descent:
Stand somewhere. Feel which way is downhill. Take a step that way. Repeat until the ground is flat.
The flat ground at the bottom is a place where the loss stops decreasing, which is exactly the low point you were looking for.
The gradient and the step size
Two pieces make the downhill rule precise.
The gradient is the slope of the loss under your feet. It answers one question: if I nudge a parameter a little, which way does the error change, and how steeply? The gradient points in the uphill direction, so to go downhill you step against it. You do not need calculus to hold the intuition: the gradient is just “which way, and how hard, the error pushes back when you wiggle a parameter.” A steep slope means a big gradient and a big correction; near the bottom the slope flattens, the gradient shrinks toward zero, and your steps naturally get smaller.
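To make "wiggle a parameter and see how the error responds" concrete, here is a minimal sketch that estimates a slope by nudging a single parameter a tiny amount in each direction. The bowl-shaped `loss` and the helper name `numerical_gradient` are assumptions made for illustration, not part of any library.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its lowest point at w = 3.
    return (w - 3.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    # Nudge w a little in each direction and measure how the loss changes.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


slope_left = numerical_gradient(loss, 0.0)    # negative: uphill is toward smaller w
slope_right = numerical_gradient(loss, 5.0)   # positive: uphill is toward larger w
slope_bottom = numerical_gradient(loss, 3.0)  # roughly zero: flat ground

print(slope_left, slope_right, slope_bottom)
```

The sign says which way is uphill, and the size says how steep the ground is. Real training code computes gradients exactly rather than by nudging, but the meaning is the same.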
The learning rate is how big a step you take each time. It is a knob you choose, and it matters more than it looks:
- Too large, and you overshoot the bottom of the valley, landing on the far slope, then overshooting back. You bounce around and may never settle, or even climb out and diverge.
- Too small, and you do reach the bottom, but you crawl, taking thousands of tiny steps when dozens would have done.
Picking a learning rate that is large enough to be quick but small enough to be stable is one of the everyday crafts of training models.
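The effect of the knob is easy to see on a toy problem. The sketch below assumes a bowl-shaped loss, the square of (w - 3), whose slope at w is 2 * (w - 3), and runs the same 50 steps with three learning rates. The function name `run` and the specific rates are illustrative choices.

```python
def run(learning_rate, steps=50, w=0.0):
    # Assumed loss: (w - 3)^2, so the gradient at w is 2 * (w - 3).
    for _ in range(steps):
        w = w - learning_rate * 2.0 * (w - 3.0)
    return w


too_small = run(0.001)   # barely moves from the starting guess of 0
reasonable = run(0.1)    # settles very close to 3
too_large = run(1.1)     # every step overshoots by more; w runs away

print(too_small, reasonable, too_large)
```

With the small rate, w is still far from 3 after 50 steps. With the large rate, each step lands farther up the opposite wall than the last, so the distance from 3 grows instead of shrinking.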
The loop, step by step
Putting it together, gradient descent is a short loop:
1. Start with a guess for the parameters (often random).
2. Compute the loss: how wrong is the model right now?
3. Compute the gradient: which way is downhill for each parameter?
4. Take a step: nudge each parameter a little against its gradient.
   new value = old value - (learning rate * gradient)
5. Repeat from step 2 until the steps stop improving the loss.
The stopping point is when the gradient is near zero and the steps no longer lower the loss. You have reached the bottom of a valley.
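The loop above can be sketched in a few lines of Python for a single parameter. This is a minimal illustration under stated assumptions, not a production optimizer: the function name `gradient_descent`, the tolerance `tol`, the step cap `max_steps`, and the example loss (a bowl with its bottom at w = 5) are all hypothetical.

```python
def gradient_descent(loss_fn, grad_fn, w, learning_rate=0.1,
                     tol=1e-12, max_steps=10_000):
    loss = loss_fn(w)                       # step 2: how wrong are we now?
    for _ in range(max_steps):
        g = grad_fn(w)                      # step 3: which way is uphill?
        w_new = w - learning_rate * g       # step 4: step against the gradient
        new_loss = loss_fn(w_new)
        if loss - new_loss < tol:           # step 5: stop once steps stop helping
            break
        w, loss = w_new, new_loss
    return w, loss


# Hypothetical example loss: a bowl whose lowest point is at w = 5, height 1.
w_best, final_loss = gradient_descent(
    loss_fn=lambda w: (w - 5.0) ** 2 + 1.0,
    grad_fn=lambda w: 2.0 * (w - 5.0),
    w=-4.0,                                 # step 1: an arbitrary starting guess
)
print(w_best, final_loss)
```

Stopping on "the loss no longer improves" is one reasonable reading of step 5. Checking that the gradient is near zero is another, and in a smooth bowl the two tend to agree.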
Worked example: rolling down a bowl
Take the simplest possible landscape: one parameter, call it the weight w, and a loss that forms a bowl with its lowest point at w equal to 3. (Concretely, the loss is the square of w minus 3, but you do not need the formula, only the picture of a bowl bottoming out at 3.) We will use a learning rate of 0.1 and start at a bad guess of w equal to 0.
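At each step we feel the slope, then move against it. For this bowl the slope at w is 2 times (w minus 3), so at w equal to 0 the slope is -6, and the first step moves w by 0.1 times 6, from 0 to 0.6. Here is the path: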
start:  w = 0.000   loss = 9.000   (far up the left wall of the bowl)
step 1: w = 0.600   loss = 5.760
step 2: w = 1.080   loss = 3.686
step 3: w = 1.464   loss = 2.359
...     (each step moves w toward 3 and the loss keeps dropping)
end:    w approaches 3.000   loss approaches 0   (the bottom; slope is flat)
Watch two things. The parameter w marches steadily from its bad starting guess of 0 toward the true best value of 3. And the loss falls at every step, fast at first where the slope is steep, then slower as the ground flattens near the bottom. Nobody told the procedure that 3 was the answer. It found the bottom by repeatedly stepping downhill, which is the only thing gradient descent ever does.
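The first few rows of that path can be reproduced with a short script. It assumes only what the example states (loss equal to the square of w minus 3, learning rate 0.1, start at 0) plus the slope of that loss, 2 * (w - 3). The names `loss`, `gradient`, and `history` are illustrative.

```python
def loss(w):
    return (w - 3.0) ** 2


def gradient(w):
    # Slope of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0
learning_rate = 0.1
history = [(w, loss(w))]

for _ in range(3):
    w = w - learning_rate * gradient(w)   # step against the slope
    history.append((w, loss(w)))

for i, (w_i, loss_i) in enumerate(history):
    label = "start" if i == 0 else f"step {i}"
    print(f"{label}: w = {w_i:.3f}  loss = {loss_i:.3f}")
```

Raising the number of iterations shows w creeping ever closer to 3 while each step gets smaller, because the slope shrinks as the bottom approaches.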
Why this matters when you use AI
This is not a toy that gets replaced by something fancier later. When you hear that a large language model is “training,” this is the procedure running: gradient descent over billions of parameters, for millions of steps, on a loss that measures how wrong the model’s next-word predictions are. The learning rate is a real setting that engineers tune and that can make or break a training run. Every weight in every neural network you have heard of arrived at its value by rolling downhill on a loss landscape, exactly as the weight w rolled toward 3.
It also explains a quiet limitation. Gradient descent finds a valley, not necessarily the deepest one. On a bumpy landscape it can settle into a local minimum, a low spot that is not the lowest spot overall, because from inside it every direction is uphill. Much of the practical art of training lies in shaping the landscape so that this matters less than you would fear.
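A small sketch makes the local-minimum trap visible. The bumpy one-parameter loss here, w^4 - 3w^2 + w, is a hypothetical example chosen because it has two valleys of different depth. The function names, starting points, and learning rate are likewise illustrative.

```python
def loss(w):
    # Hypothetical bumpy loss with two valleys of different depth.
    return w ** 4 - 3.0 * w ** 2 + w


def gradient(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


from_left = descend(-2.0)   # rolls into the deeper valley, near w = -1.30
from_right = descend(2.0)   # rolls into the shallower valley, near w = 1.13

print(from_left, loss(from_left))
print(from_right, loss(from_right))
```

Both runs stop on flat ground, but only the one started on the left reaches the lower of the two valleys. Where you start decides which valley you find.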
Common pitfalls
- Learning rate too large. A very common training failure: steps overshoot, and the loss bounces or explodes instead of settling.
- Learning rate too small. Training technically works but crawls, wasting time and compute.
- Expecting the global best. Gradient descent reaches a minimum, but not necessarily the global minimum. A good result is usually a good-enough valley, not provably the deepest.
- Forgetting it needs a slope. Gradient descent only works when the loss changes smoothly as you nudge parameters. If the loss is flat or jagged, there is no reliable downhill to follow.
A note on scale: stochastic gradient descent
One practical wrinkle worth naming. Computing the exact gradient means looking at every data point each step, which is prohibitively slow on a large dataset. The standard fix is stochastic gradient descent: estimate the slope from a small random sample of the data (often called a mini-batch) each step instead of all of it. The steps get a little noisy, but they are far cheaper, and the noise often helps the search shake loose from shallow valleys. Nearly all large-scale training uses this variant.
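Here is a minimal sketch of that idea, fitting a straight line with mean squared error. Everything specific in it is an assumption for illustration: the synthetic data (points near the line y = 2x + 1), the batch size of 16, the learning rate, and the step count.

```python
import random

random.seed(0)

# Hypothetical data: points near the line y = 2x + 1, with a little noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.1) for x in xs]

slope, intercept = 0.0, 0.0
learning_rate = 0.1
batch_size = 16

for _ in range(2000):
    # Estimate the gradient from a small random sample, not all 1000 points.
    batch = random.sample(range(len(xs)), batch_size)
    grad_slope = 0.0
    grad_intercept = 0.0
    for i in batch:
        error = (slope * xs[i] + intercept) - ys[i]
        grad_slope += 2.0 * error * xs[i] / batch_size
        grad_intercept += 2.0 * error / batch_size
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(slope, intercept)
```

Each step looks at 16 points instead of 1000, so it costs a small fraction of a full-gradient step, and the fitted slope and intercept still end up close to the values used to generate the data.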
What you should remember
- Gradient descent is how models search for their best parameters when no formula hands them the answer, which is almost always.
- The loss is a landscape; the gradient is the local slope. Step against the gradient to go downhill, and repeat.
- The learning rate sets the step size: too large overshoots, too small crawls.
- It finds a minimum, not always the global one, and at scale it runs on small random samples (stochastic gradient descent).
We now hold the two halves of how a model learns: a loss that measures how wrong it is, and gradient descent to drive that loss down. That pairing is the engine under everything that follows. With it in hand, we leave regression behind and turn to the other half of supervised learning. The next phase is classification, and it opens by bending the line you already know how to fit into something that predicts a probability: logistic regression.