Practice: How models actually learn: gradient descent
Self-check
Seven short questions. Try to answer each one before opening the collapsible.
1. Why do we need gradient descent at all, if linear regression has a formula?
Show answer
The formula is special to the straight line. Almost every other model (logistic regression, neural networks) has no formula that hands you the best parameters, so you have to search for them. Gradient descent is that general-purpose search.
2. In the landscape picture, what are the horizontal directions and what is the height?
Show answer
The horizontal directions are the parameter values you could choose. The height is the loss: how much total error those parameters produce. Training means finding the lowest point.
3. What is the gradient, in plain language?
Show answer
The slope of the loss under your feet: which way the error changes if you nudge a parameter, and how steeply. It points uphill, so you step against it to go down.
4. State the downhill rule as an update equation.
Show answer
new value = old value minus (learning rate times gradient). You move each parameter a little in the opposite direction of its gradient, then repeat.
5. What goes wrong if the learning rate is too large? Too small?
Show answer
Too large: you overshoot the valley, bounce around, and may diverge. Too small: you reach the bottom but crawl, taking far more steps than necessary.
6. How do you know when to stop?
Show answer
When the gradient is near zero and the steps no longer lower the loss. The ground has gone flat, which usually means you are at the bottom of a valley.
7. What is a local minimum, and why is it a limitation?
Show answer
A valley that is low but not the lowest overall. From inside it, every direction is uphill, so gradient descent settles there even though a deeper valley exists elsewhere. Gradient descent finds a minimum, not always the global one.
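The local-minimum answer can be seen in a few lines of Python. This is a minimal sketch, not part of the original exercise set: the two-valley loss `w**4 - 3*w**2 + w`, the learning rate, the tolerance, and the starting points are all assumptions chosen for illustration. The loop applies the update rule from question 4 and stops with the rule from question 6.

```python
def loss(w):
    # Hypothetical loss with two valleys, chosen only for illustration.
    return w**4 - 3 * w**2 + w


def gradient(w):
    # Derivative of the hypothetical loss above.
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w, learning_rate=0.01, tolerance=1e-6, max_steps=10_000):
    """Step against the gradient until the gradient is nearly zero."""
    for _ in range(max_steps):
        g = gradient(w)
        if abs(g) < tolerance:  # the ground has gone flat: stop
            break
        w = w - learning_rate * g  # new value = old value - lr * gradient
    return w


# Same procedure, two starting points, two different valleys.
w_from_right = gradient_descent(w=2.0)
w_from_left = gradient_descent(w=-2.0)

print(f"start  2.0 -> w = {w_from_right:.4f}, loss = {loss(w_from_right):.4f}")
print(f"start -2.0 -> w = {w_from_left:.4f}, loss = {loss(w_from_left):.4f}")
```

Starting on the right, the run settles in the shallower valley near w = 1.13. Starting on the left, it reaches the deeper valley near w = -1.30. Neither run has any way of knowing that the other valley exists.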
Try it yourself: take two steps downhill
You are minimizing a bowl-shaped loss whose lowest point is at w = 5. You are told the slope (gradient) at any point is 2 * (w - 5). Start at w = 1 with a learning rate of 0.25. Take two steps using new w = old w - (learning rate * gradient), and check that the loss (w - 5) squared goes down.
Show answer
start: w = 1, gradient = 2*(1 - 5) = -8, loss = (1 - 5)^2 = 16
step 1: new w = 1 - (0.25 * -8) = 1 + 2 = 3, loss = (3 - 5)^2 = 4
step 2: gradient at w = 3 is 2*(3 - 5) = -4, new w = 3 - (0.25 * -4) = 3 + 1 = 4, loss = (4 - 5)^2 = 1
Two steps moved w from 1 to 3 to 4, marching toward the true minimum at 5, and the loss fell 16 to 4 to 1. Notice the gradient is negative (the slope tilts down to the right), so stepping against it moves w to the right, which is downhill. The procedure never knew 5 was the answer; it just kept stepping downhill.
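To check the arithmetic by machine, here is a minimal Python sketch of the same exercise. The gradient, loss, starting point, and learning rate come from the exercise; the function names `gradient`, `loss`, and `descend` are assumptions made for this illustration.

```python
def gradient(w):
    # Slope of the bowl-shaped loss, as given in the exercise.
    return 2 * (w - 5)


def loss(w):
    # The loss from the exercise: (w - 5) squared.
    return (w - 5) ** 2


def descend(w, learning_rate, steps):
    """Apply new w = old w - (learning rate * gradient) `steps` times.

    Returns a list of (w, loss) pairs, starting with the initial point.
    """
    history = [(w, loss(w))]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        history.append((w, loss(w)))
    return history


history = descend(w=1.0, learning_rate=0.25, steps=2)
for step, (w, current_loss) in enumerate(history):
    print(f"step {step}: w = {w}, loss = {current_loss}")
# step 0: w = 1.0, loss = 16.0
# step 1: w = 3.0, loss = 4.0
# step 2: w = 4.0, loss = 1.0
```

Raising `steps` shows w creeping closer to 5 while each step gets smaller, because the gradient shrinks as the bowl flattens near the bottom.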
Try it yourself: diagnose the training run
You start training and watch the loss after each step. Instead of going down, it does this: 8, 20, 55, 160, ... (bigger every step, as the parameters swing from one side of the valley to the other). What is almost certainly wrong, and what do you change?
Show answer
The learning rate is too large. The steps overshoot the bottom of the valley and land farther up the opposite slope each time, so the loss grows instead of shrinking, which is divergence. The fix is to reduce the learning rate (often by a factor of 10) and try again. A loss that climbs or oscillates wildly is the classic signature of too big a step.
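The same symptom can be reproduced on the bowl-shaped loss (w - 5)^2 from the previous exercise. This is a minimal sketch under assumed values: the learning rates 1.1 and 0.11 are illustrative choices, and the losses it prints are not the 8, 20, 55, 160 sequence from the question.

```python
def run(learning_rate, steps=4, w=1.0):
    """Return the loss before the first step and after each step
    of gradient descent on the bowl (w - 5)^2."""
    losses = [(w - 5) ** 2]
    for _ in range(steps):
        grad = 2 * (w - 5)            # slope of the bowl at w
        w = w - learning_rate * grad  # the downhill update rule
        losses.append((w - 5) ** 2)
    return losses


too_large = run(learning_rate=1.1)   # overshoots farther each step
reduced = run(learning_rate=0.11)    # the same run with a 10x smaller rate

print("too large:", [round(value, 2) for value in too_large])
print("reduced:  ", [round(value, 2) for value in reduced])
```

With the larger rate, each step lands on the opposite side of the valley and farther from w = 5, so the loss rises every step. With the rate cut by a factor of 10, the loss falls every step.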
Flashcards
Ten cards.
Q. Why is gradient descent needed if linear regression has a formula?
The formula is special to the straight line. Most models have no such formula, so their best parameters must be searched for. Gradient descent is that general search.
Q. What is the loss?
The total error a given set of parameters produces on the data. Pictured as a landscape, it is the height; training means finding the lowest point.
Q. What is the gradient?
The slope of the loss under your feet: which way the error changes if you nudge a parameter, and how steeply. It points uphill, so you step against it.
Q. State the gradient descent update rule.
new value = old value minus (learning rate times gradient). Repeat until the steps stop lowering the loss.
Q. What is the learning rate?
The size of each step. Too large overshoots and may diverge; too small reaches the bottom but crawls.
Q. When does gradient descent stop?
When the gradient is near zero and the steps no longer lower the loss: the ground is flat, which usually means you are at the bottom of a valley.
Q. What is a local minimum?
A valley that is low but not the lowest overall. Gradient descent can settle there because every nearby direction is uphill, even though a deeper valley exists elsewhere.
Q. What is the classic sign the learning rate is too large?
The loss climbs or swings wildly instead of falling. The steps overshoot the valley each time. The fix is to reduce the learning rate.
Q. What is stochastic gradient descent?
Estimating the gradient from a small random sample of the data each step instead of all of it. Steps are noisier but far cheaper; it is how large-scale training runs.
Q. How does gradient descent connect to large AI models?
Training a neural network is gradient descent, usually in its stochastic form, run for many steps over as many as billions of parameters, lowering a loss that measures prediction error. Every weight got its value by rolling downhill.
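The stochastic gradient descent card can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the toy data (y = 3 * x with no noise), the one-parameter model y = w * x, the batch size of 10, the learning rate, and the fixed random seed.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical toy data: y = 3 * x exactly, so the best slope is w = 3.
data = [(i / 100, 3.0 * i / 100) for i in range(1, 101)]


def minibatch_gradient(w, batch):
    # Gradient of the mean squared error for the model y = w * x,
    # estimated from the sampled batch only, not the full data set.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.0
learning_rate = 0.5
for _ in range(500):
    batch = random.sample(data, 10)  # small random sample each step
    w = w - learning_rate * minibatch_gradient(w, batch)

print(f"learned w = {w:.6f}")
```

Each step looks at 10 of the 100 points, so it costs about a tenth of a full-data step. The step size varies with whichever points were sampled, but the run still heads toward w = 3.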