
Lesson: Gradient descent, step by step

We have been circling one idea for three lessons, and it is time to land it. Lesson 5 turned learning into a goal: make the cost small. Lesson 6 gave that goal a shape: stand on the cost landscape, and the negative gradient always points in the steepest-downhill direction. Put those together and the method almost writes itself. If downhill is where you want to go, and the negative gradient points downhill, then take a step that way. Then look again and take another. That repeated downhill walk is gradient descent, and it is the algorithm that trains almost every neural network in use.

Here is the whole algorithm in one line, the rule for updating a single knob:

new value = old value − (learning rate) × (that knob's slope)

Or in the symbols from lesson 6, for the full set of parameters at once:

w_new = w_old − learning_rate × ∇C(w_old)

A few things to unpack. The gradient is the bundle of slopes that says, for every one of the roughly 13,000 knobs, which way is uphill and how steep. We subtract it (rather than add) because we want to go downhill, against the uphill direction. And the learning rate is a single number we choose that controls how big each step is.

The crucial part: this rule is applied to all 13,000 knobs at the same time, on every step. Each knob moves a little, in its own best downhill direction, by an amount proportional to its own slope. Knobs that are steeply affecting the cost move more; knobs that barely matter right now move less. One step nudges the entire network at once.
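As a minimal sketch of "all knobs at once", here is the update rule applied to a short list of parameters. The function name `gradient_step`, the three knobs, and their slopes are made up for illustration; a real network would have thousands of entries and would get its slopes from the cost, not from a hand-typed list.

```python
def gradient_step(params, grads, learning_rate):
    """Return new parameter values after one gradient descent step."""
    # Every knob moves against its own slope, scaled by the same learning rate.
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Three hypothetical knobs with made-up slopes, purely for illustration.
params = [0.5, -1.2, 3.0]
grads = [4.0, -0.1, 0.0]   # steep, nearly flat, completely flat

new_params = gradient_step(params, grads, learning_rate=0.1)
print(new_params)
```

The first knob has the steepest slope and moves the most, the second barely shifts, and the third stays put because its slope is zero.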

Let us run it by hand on the simplest landscape, the one-knob parabola from lesson 6 whose cost is the knob squared. Its slope at any point is twice the knob’s value (steeper the further you are from zero). We will start at the value 5 and use a learning rate of 0.1.

start:   w = 5                                        C = 25
step 1:  w = 5 − 0.1 × (2·5) = 5 − 1.0 = 4.0          C = 16.0
step 2:  w = 4 − 0.1 × (2·4) = 4 − 0.8 = 3.2          C = 10.24
step 3:  w = 3.2 − 0.1 × (2·3.2) = 3.2 − 0.64 = 2.56  C = 6.55
step 4:  w = 2.56 − ... = 2.048                       C = 4.19
step 5:  w = ... = 1.6384                             C = 2.68
step 6:  w = ... = 1.31072                            C = 1.72
step 7:  w = ... = 1.04858                            C = 1.10

Watch the cost column: 25, 16, 10.24, 6.55, 4.19, 2.68, 1.72, 1.10. It is sliding steadily toward zero, exactly where the bottom of the parabola is. The steps get smaller as we go, and that is automatic, not something we arranged: as the knob shrinks, its slope (twice the knob) shrinks too, so the rule naturally takes gentler steps as it nears the bottom. Keep going and the knob creeps toward 0 and the cost toward its minimum. That is gradient descent doing its entire job: read the slope, step against it, repeat.
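If you would rather let a machine do the arithmetic, the hand run above can be reproduced with a short sketch like this one. The function names `cost`, `slope`, and `descend` are our own choices, not anything standard.

```python
def cost(w):
    """The one-knob parabola: cost is the knob squared."""
    return w ** 2


def slope(w):
    """Slope of the parabola: twice the knob's value."""
    return 2 * w


def descend(w, learning_rate, steps):
    """Run gradient descent on the parabola and return the (w, cost) history."""
    history = [(w, cost(w))]
    for _ in range(steps):
        w = w - learning_rate * slope(w)   # the one-line update rule
        history.append((w, cost(w)))
    return history


history = descend(w=5.0, learning_rate=0.1, steps=7)
for step, (w, c) in enumerate(history):
    print(f"step {step}: w = {w:.5f}  C = {c:.2f}")
```

Raising `steps` lets you watch the knob keep creeping toward 0 well past the seven steps worked by hand.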

The learning rate looks like a small detail. It is not. Pick it badly and training fails. Watch the same parabola, same start at the value 5, with two bad choices.

Too large (learning rate 2.0). The steps are so big they leap clean over the bottom and land higher than they started:

start:   w = 5                                      C = 25
step 1:  w = 5 − 2.0 × (2·5) = 5 − 20 = −15         C = 225
step 2:  w = −15 − 2.0 × (2·(−15)) = −15 + 60 = 45  C = 2025

The cost is going up, not down: 25, then 225, then 2025. The walk is bouncing across the valley and flying apart. This is called divergence, and it is a real failure mode. Too big a step does not just slow you down; it can blow the whole thing up.
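A small sketch makes the blow-up easy to see over a few more steps. On this particular parabola, a learning rate of 2.0 multiplies the knob by 1 − 2 × 2.0 = −3 on every step, so the cost grows ninefold each time. The helper name `run` is ours.

```python
def run(w, learning_rate, steps):
    """Gradient descent on C(w) = w**2, returning the cost after each step."""
    costs = [w ** 2]
    for _ in range(steps):
        w = w - learning_rate * (2 * w)   # slope of w**2 is 2*w
        costs.append(w ** 2)
    return costs


costs = run(w=5.0, learning_rate=2.0, steps=4)
print(costs)
# Each cost is larger than the one before: the walk is diverging.
print(all(later > earlier for earlier, later in zip(costs, costs[1:])))
```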

Too small (learning rate 0.001). Now the opposite problem:

start:   w = 5                                    C = 25
step 1:  w = 5 − 0.001 × (2·5) = 5 − 0.01 = 4.99  C ≈ 24.90

After a full step the cost barely twitched, from 25 to about 24.9. It will get to the bottom eventually, but at this rate it takes roughly 780 steps to reach the point the 0.1 rate reached in seven. Safe, but painfully slow.
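To put a number on "painfully slow", this sketch counts how many steps each learning rate needs to bring the knob from 5 down to about where the 0.1 run stood after seven steps. The function name `steps_to_reach`, the target of 1.05, and the `max_steps` cap are illustrative choices of ours.

```python
def steps_to_reach(target_w, learning_rate, w=5.0, max_steps=100_000):
    """Count steps until |w| <= target_w on C(w) = w**2, or None if never reached."""
    for step in range(max_steps + 1):
        if abs(w) <= target_w:
            return step
        w = w - learning_rate * (2 * w)
    return None


target = 1.05
fast = steps_to_reach(target, learning_rate=0.1)
slow = steps_to_reach(target, learning_rate=0.001)
print(f"learning rate 0.1:   {fast} steps")
print(f"learning rate 0.001: {slow} steps")
```

The slow run does arrive, but it needs on the order of a hundred times as many steps to cover the same ground.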

So the learning rate has a sweet spot: large enough to make real progress, small enough not to overshoot. Finding it is part science and part craft. In practice, modern training methods adjust the step size automatically as they go, and often per knob, but that is a refinement on top of this same core idea, not a different idea.

Gradient descent is therefore a loop, and a short one:

  1. At your current position, compute the gradient (the slope in every direction).
  2. Take one step: update every knob by the rule above.
  3. You are now at a new position, where the slopes are different, so go back to step 1.
  4. Stop when the cost stops dropping meaningfully, or after a fixed number of steps.

Step 3 is worth dwelling on. The gradient is a local reading; it only tells you the slope right where you are standing. Once you move, you are somewhere new with a new slope, so you must recompute. That is why training is many small steps rather than one big calculated jump: each step buys you a fresh, accurate downhill direction for the next.
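The four-step loop can be sketched in code roughly as follows. Everything here is an illustrative assumption: the function name `gradient_descent`, the `tolerance` used to decide when the cost has "stopped dropping meaningfully", and the two-knob bowl whose slopes are written out by hand.

```python
def gradient_descent(grad_fn, cost_fn, params, learning_rate,
                     max_steps=1000, tolerance=1e-6):
    """Minimal gradient descent loop with two stopping rules."""
    previous_cost = cost_fn(params)
    for step in range(1, max_steps + 1):
        grads = grad_fn(params)                        # 1. gradient at the current position
        params = [p - learning_rate * g                # 2. update every knob
                  for p, g in zip(params, grads)]
        current_cost = cost_fn(params)                 # 3. new position; slopes are recomputed next pass
        if previous_cost - current_cost < tolerance:   # 4. cost no longer dropping meaningfully
            return params, current_cost, step
        previous_cost = current_cost
    return params, previous_cost, max_steps            # 4. or the step budget ran out


# Hypothetical two-knob bowl: C(a, b) = a**2 + 3 * b**2.
def bowl_cost(p):
    return p[0] ** 2 + 3 * p[1] ** 2


def bowl_grad(p):
    return [2 * p[0], 6 * p[1]]


params, final_cost, steps_used = gradient_descent(
    bowl_grad, bowl_cost, params=[5.0, -2.0], learning_rate=0.1)
print(params, final_cost, steps_used)
```

Notice that `grad_fn` is called fresh on every pass through the loop. That is the "local reading" point in code form: the slopes from the previous position are never reused.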

And what is the gradient actually telling each knob? It is a list with one entry per parameter. The sign of each entry says the direction (turn this knob up or down), and the size of each entry says the amount (turn it a lot or a little). Training the whole network is just this, repeated: nudge 13,000 knobs at once, each in the direction and by the amount that most reduces the cost from where things currently stand.
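Here is a tiny sketch of reading a gradient the way the paragraph above describes, one entry per knob. The function name `read_gradient` and the sample slopes are invented for illustration.

```python
def read_gradient(gradient):
    """Turn each gradient entry into (knob index, direction to turn, slope size)."""
    instructions = []
    for index, slope in enumerate(gradient):
        if slope > 0:
            direction = "down"     # cost rises as the knob rises, so lower it
        elif slope < 0:
            direction = "up"       # cost falls as the knob rises, so raise it
        else:
            direction = "nowhere"  # flat: this knob does not affect the cost right now
        instructions.append((index, direction, abs(slope)))
    return instructions


# Made-up gradient for four hypothetical knobs.
for index, direction, size in read_gradient([4.0, -0.1, 0.0, -2.5]):
    print(f"knob {index}: turn {direction}, slope size {size}")
```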

The one thing this lesson does not explain

You may have noticed a sleight of hand. The whole algorithm leans on having the gradient, the slope of the cost with respect to every one of the 13,000 knobs. We have been assuming we can just ask for it and get it. But the cost is a complicated function: it runs an image all the way through the network, compares the output to the desired answer, and the result depends on every weight and bias along the way. Computing how the cost would change if you wiggled one particular weight buried deep in the network is not obvious, and doing it for all 13,000 knobs without it taking forever is a genuine problem.
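To see why "just wiggle each knob" gets expensive, here is a sketch of the brute-force approach: nudge one knob at a time and measure how the cost changes. This is not the solution the next lesson covers. It is the naive baseline, and the names `numerical_gradient`, `wiggle`, and the three-knob toy cost are assumptions made for illustration.

```python
def numerical_gradient(cost_fn, params, wiggle=1e-6):
    """Estimate each slope by nudging one knob at a time (naive approach)."""
    base_cost = cost_fn(params)
    gradient = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += wiggle                       # wiggle only knob i
        gradient.append((cost_fn(nudged) - base_cost) / wiggle)
    return gradient


# A toy three-knob cost that counts how often it is evaluated.
calls = {"count": 0}

def counted_cost(p):
    calls["count"] += 1
    return p[0] ** 2 + 3 * p[1] ** 2 + p[2] ** 2


grad = numerical_gradient(counted_cost, [5.0, -2.0, 1.0])
print(grad)
print("cost evaluations:", calls["count"])
```

Three knobs needed four cost evaluations: one baseline plus one per knob. By the same count, 13,000 knobs would need 13,001 full passes through the network for a single step of gradient descent, which is the "taking forever" problem in concrete terms.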

That problem has a famous solution, and it is the subject of the next lesson. For now, treat the gradient as something you can obtain; gradient descent is what you do with it.

One real-world refinement worth naming so it is not a surprise later. Computing the gradient using every single training image on every step is expensive when there are tens of thousands of images. So in practice, each step uses the gradient computed from just a small random handful of examples, which is a good-enough estimate of the true downhill direction and far cheaper. This is called stochastic gradient descent, and it is what almost all real training uses. The conceptual story does not change at all: step downhill, repeat. You are just reading the slope from a sample instead of the whole pile.
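A rough sketch of the sampling idea, on a one-knob problem small enough to follow. The toy data (points on the line y = 3x), the batch size of 8, the learning rate of 0.5, and the function name `minibatch_slope` are all assumptions chosen for illustration.

```python
import random


def minibatch_slope(w, data, batch_size, rng):
    """Estimate the slope of the cost in w from a random handful of (x, y) examples."""
    batch = rng.sample(data, batch_size)
    # Cost per example is (w*x - y)**2, so its slope in w is 2*x*(w*x - y).
    return sum(2 * x * (w * x - y) for x, y in batch) / batch_size


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical toy data generated from y = 3x, so the best knob value is 3.
data = [(x / 100, 3 * x / 100) for x in range(1, 101)]

w = 0.0
for _ in range(200):
    # Same update rule as before; only the slope comes from a sample.
    w -= 0.5 * minibatch_slope(w, data, batch_size=8, rng=rng)
print(w)
```

Each step looks at only 8 of the 100 examples, yet the knob still settles near 3. The update line is the same rule as in the full-data case.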

This loop is, quite literally, what is happening when a model is “training.” Every headline about a model that took weeks and a fortune in compute to train is describing this walk: compute a gradient, nudge billions of knobs, repeat, for an enormous number of steps. The “training loss curve” people post is the cost falling step by step, just like the column of numbers you watched slide from 25 toward 1.

It also explains some failures you may hear about. When training “blows up” or “diverges,” that is often a learning rate set too high, the steps overshooting like our 2.0 example. When training is frustratingly slow, the step size may be too timid. And because each run takes many local steps from its own starting point, often a randomly chosen one, the same recipe can produce slightly different models on different runs. None of it is mysterious: it is a long, careful walk downhill, and the size of the steps matters enormously.

Common mistakes

Thinking one step finishes the job. It does not. Gradient descent is many small steps, each reading a fresh slope at the new position. There is no single leap to the bottom.

Thinking a bigger learning rate is always faster. Past a point it overshoots and the cost climbs instead of falls. Bigger can mean diverge, not converge.

Thinking the gradient knows where the bottom is. It does not. It only knows the slope right where you stand. That local reading is why you must step, recompute, and step again.

Thinking gradient descent computes the gradient. It uses the gradient. Computing it efficiently is a separate problem, and it is what the next lesson is about.

  • The update rule is one line: new value equals old value minus the learning rate times the slope, applied to all roughly 13,000 knobs at once, every step. Subtract, because we go downhill.
  • The learning rate is the step size, and it matters. Too large overshoots and can diverge (our 2.0 run: 25, then 225, then 2025); too small crawls (the 0.001 run moved the cost only from 25 to about 24.9 in one step); a good rate of 0.1 slid the cost from 25 to about 1.10 in seven steps.
  • Training is a loop: compute the gradient here, step, recompute at the new spot, repeat, stop when the cost flattens. The gradient is local, so each step needs a fresh one.
  • The gradient says direction and amount per knob (sign and size). Gradient descent assumes you can get the gradient; actually computing it is the next lesson’s job.

Gradient descent is almost embarrassingly simple: read the slope, step against it, repeat. The hard part was never the walking. It is figuring out the slope.

Next: the cheatsheet puts the update rule and the worked runs on one page. Then lesson 8 answers the question this one leaned on the whole way through. How do you actually compute the gradient of a 13,000-knob network without it taking forever? That is backpropagation.