Practice: Gradient descent, step by step

Self-check

Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.

1. State the gradient descent update rule for one knob, and explain why it subtracts.

Show answer

new value = old value − (learning rate) × (that knob's slope). It subtracts because the gradient points in the steepest uphill direction and we want to go downhill, so we move against it. The rule is applied to all roughly 13,000 knobs at once, on every step.
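As a minimal sketch of that rule for a single knob, assuming the simple cost C(w) = w² with slope 2w (the function names here are made up for illustration):

```python
def gd_step(w, slope, learning_rate):
    """One gradient descent update for a single knob."""
    # Subtract: the slope points uphill, so we move against it.
    return w - learning_rate * slope


def slope_of_square(w):
    """Slope of the illustrative cost C(w) = w**2."""
    return 2 * w


# Right of the minimum the slope is positive, so w decreases toward 0.
right = gd_step(10.0, slope_of_square(10.0), learning_rate=0.1)
# Left of the minimum the slope is negative, so w increases toward 0.
left = gd_step(-10.0, slope_of_square(-10.0), learning_rate=0.1)
print(right, left)
```

On either side of the valley the same subtraction moves w toward the bottom, which is the point of the minus sign.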

2. What is the learning rate, and what goes wrong if it is too large or too small?

Show answer

It is a single number we choose that sets how big each step is. Too large and the steps overshoot the bottom and the cost can climb instead of fall (divergence). Too small and each step barely moves, so it can take thousands of steps to make progress (painfully slow). There is a sweet spot: big enough to progress, small enough not to overshoot.
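One way to see the sweet spot is to count how many steps each rate needs. The sketch below assumes the cost C(w) = w² again; the three rates, the target cost, and the divergence cutoff are arbitrary values chosen for illustration.

```python
def steps_to_reach(learning_rate, start=10.0, target_cost=1e-6, max_steps=100_000):
    """Count updates until C(w) = w**2 drops below target_cost.

    Returns None if the cost blows up or max_steps runs out first.
    """
    w = start
    for step in range(1, max_steps + 1):
        w = w - learning_rate * (2 * w)  # slope of w**2 is 2w
        cost = w * w
        if cost < target_cost:
            return step
        if cost > 1e12:  # arbitrary "this has diverged" cutoff
            return None
    return None


for rate in (0.0001, 0.1, 1.5):  # illustrative values, not recommendations
    print(rate, steps_to_reach(rate))
```

With these settings the tiny rate needs tens of thousands of steps, the middle rate a few dozen, and the large rate never gets there.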

3. Why is gradient descent a repeated loop rather than one big calculated jump?

Show answer

Because the gradient is local: it only tells you the slope right where you are standing. Once you take a step you are somewhere new, where the slopes are different, so you must recompute the gradient and step again. Each step buys a fresh, accurate downhill direction for the next one.

4. The gradient is a list with one entry per knob. What do the sign and the size of each entry tell you?

Show answer

The sign tells the direction (turn this knob up or down); the size tells the amount (turn it a lot or a little). Knobs that steeply affect the cost get moved more; knobs that barely matter right now get moved less. One step nudges the whole network at once.
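A small two-knob sketch can make the sign and size reading concrete. The cost C(a, b) = 10a² + 0.1b² is invented for this example so that one knob matters a lot and the other barely at all.

```python
def gradient(knobs):
    """Gradient of a made-up cost C(a, b) = 10*a**2 + 0.1*b**2."""
    a, b = knobs
    return [20 * a, 0.2 * b]


def step_all(knobs, learning_rate):
    """Apply the update rule to every knob at once."""
    grad = gradient(knobs)
    return [w - learning_rate * g for w, g in zip(knobs, grad)]


knobs = [1.0, -1.0]
grad = gradient(knobs)  # large positive entry, small negative entry
new_knobs = step_all(knobs, learning_rate=0.01)

for old, g, new in zip(knobs, grad, new_knobs):
    direction = "down" if g > 0 else "up"
    print(f"slope {g:+.2f}: turn {direction} by {abs(new - old):.4f}")
```

The first knob has a steep positive slope, so it is turned down by a lot; the second has a shallow negative slope, so it is turned up only slightly.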

5. What does gradient descent assume it already has, and which lesson supplies it?

Show answer

It assumes you can obtain the gradient ∇C, the slope of the cost with respect to every one of the ~13,000 knobs. Gradient descent is only what you do with the gradient. Actually computing it efficiently for a deep network is a separate problem, solved by backpropagation (lesson 8).

6. What is stochastic gradient descent, in one sentence?

Show answer

Instead of computing the gradient from every training image on every step (expensive), each step estimates it from just a small random handful of examples, which is good enough and far cheaper. The conceptual loop is identical: step downhill and repeat. You are just reading the slope from a sample.
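The idea can be sketched on a toy problem. Everything below is an assumption made for illustration: a single knob w, a made-up data set of 1,000 numbers, and a cost equal to the average of (w − x)² over the examples, whose minimum sits at the data's mean.

```python
import random


def full_gradient(w, data):
    """Slope of the mean of (w - x)**2 over every example."""
    return sum(2 * (w - x) for x in data) / len(data)


def minibatch_gradient(w, data, batch_size, rng):
    """Same slope, estimated from a small random handful of examples."""
    batch = rng.sample(data, batch_size)
    return sum(2 * (w - x) for x in batch) / batch_size


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]  # made-up "training set"

w = 0.0
for _ in range(500):
    w = w - 0.05 * minibatch_gradient(w, data, batch_size=10, rng=rng)

best_w = sum(data) / len(data)  # the exact minimizer of this toy cost
print(f"SGD ended near {w:.2f}; the true minimum is at {best_w:.2f}")
```

Each step here reads 10 examples instead of 1,000. The estimate is noisy, so w jitters around the minimum rather than landing on it exactly, but it gets close at a fraction of the cost per step.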

Try it yourself, part 1: run gradient descent, then break it

Pen and paper (a calculator helps), about 9 minutes. Same simple cost as the lesson: C(w) = w², whose slope at w is 2w.

Part A (a good rate). Start at w = 10 with learning rate 0.1. Apply the update rule three times. Write w and the cost C = w² after each step.

Part B (too large a rate). Start again at w = 10, but now with learning rate 1.0. Apply the rule twice and describe what happens to w and the cost.

Show answer

Part A (learning rate 0.1). Each step is w − 0.1·(2w):

start:  w = 10                                 C = 100
step 1: 10  − 0.1·(2·10)  = 10  − 2    = 8     C = 64
step 2: 8   − 0.1·(2·8)   = 8   − 1.6  = 6.4   C = 40.96
step 3: 6.4 − 0.1·(2·6.4) = 6.4 − 1.28 = 5.12  C = 26.21

The cost slides down 100 → 64 → 40.96 → 26.21, heading toward 0. (Notice each step just multiplies w by 0.8, and the steps shrink as the slope shrinks near the bottom.)

Part B (learning rate 1.0). Each step is w − 1.0·(2w) = w − 2w = −w:

start:  w = 10                 C = 100
step 1: 10  − 2·10    = −10    C = 100
step 2: −10 − 2·(−10) = 10     C = 100

The value just flips between 10 and −10 forever, and the cost stays stuck at 100, never improving. This is a second way a too-large rate can fail. It does not always explode (like the lesson’s 2.0 run that blew up to 2025); here it bounces endlessly across the valley and never settles. Either way, too big a step means the walk does not converge.
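To check the hand calculation, both parts can be replayed in a few lines. This is a sketch for the exercise's cost C(w) = w² only; the helper name trace is made up.

```python
def trace(learning_rate, start=10.0, steps=3):
    """Return [(w, cost), ...] for gradient descent on C(w) = w**2."""
    w = start
    history = [(w, w * w)]
    for _ in range(steps):
        w = w - learning_rate * (2 * w)  # same as w * (1 - 2 * learning_rate)
        history.append((w, w * w))
    return history


for label, rate, steps in (("Part A", 0.1, 3), ("Part B", 1.0, 2)):
    print(f"{label}: each step multiplies w by {1 - 2 * rate:.1f}")
    for w, cost in trace(rate, steps=steps):
        print(f"  w = {w:7.2f}   C = {cost:7.2f}")
```

Seen this way, the rule on this cost is w ← (1 − 2·rate)·w. A factor strictly between −1 and 1 shrinks w, a factor of exactly −1 bounces forever, and anything beyond −1 (such as −3 for a rate of 2.0) makes w grow.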

Try it yourself, part 2: diagnose the training run

About 3 minutes, reasoning only. For each symptom, name the most likely learning-rate problem and the fix.

1. The training loss shoots upward and quickly becomes a huge number.
2. After thousands of steps the loss has barely moved from where it started.
3. The loss never settles, bouncing up and down around the same level without trending down.

Show answer

1. Learning rate too large (diverging). The steps overshoot so badly the cost explodes. Fix: lower the learning rate.
2. Learning rate too small (crawling). Each step moves almost nothing, so progress is painfully slow. Fix: raise the learning rate.
3. Learning rate too large (oscillating). The steps are big enough to keep leaping across the valley without descending into it (like Part B above). Fix: lower the learning rate.

In all three, the step size is the dial to turn, which is why modern training methods adjust it automatically as they go.
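The same reasoning can be written as a rough rule of thumb over a recorded list of losses. This is only a sketch: the function name, the thresholds (10×, 30%, 0.9, 0.99), and the sample loss lists are all invented for illustration, and it expects at least two recorded values.

```python
import math


def diagnose(losses):
    """Rough learning-rate diagnosis from a list of recorded losses.

    All thresholds are arbitrary choices for illustration.
    """
    first, last = losses[0], losses[-1]
    if not math.isfinite(last) or last > 10 * first:
        return "too large (diverging): lower the learning rate"

    # Fraction of steps on which the loss went up instead of down.
    rises = sum(1 for a, b in zip(losses, losses[1:]) if b > a)
    rise_fraction = rises / (len(losses) - 1)
    if rise_fraction >= 0.3 and last > 0.9 * first:
        return "too large (oscillating): lower the learning rate"

    if last > 0.99 * first:
        return "too small (crawling): raise the learning rate"

    return "looks fine: loss is trending down"


print(diagnose([100, 900, 8100, 72900]))
print(diagnose([100, 99.9, 99.8, 99.7]))
print(diagnose([100, 104, 97, 103, 98, 102]))
```

A perfectly flat loss, like Part B's 100, 100, 100, would be read as crawling by this check, so rules like these are a prompt to look closer rather than a verdict.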

Flashcards

Nine cards.

Q. What is the gradient descent update rule?

new value = old value − (learning rate) × (that knob's slope), or w_new = w_old − learning_rate × ∇C. Applied to all ~13,000 knobs at once, every step. It subtracts because we go downhill, against the uphill gradient.

Q. What is the learning rate?

A single number we choose that sets how big each step is. It has a sweet spot: big enough to make real progress, small enough not to overshoot the bottom.

Q. What happens if the learning rate is too large?

The steps overshoot the bottom and the cost can climb instead of fall. It may explode (diverge) or bounce back and forth across the valley without ever settling. Either way it fails to converge.

Q. What happens if the learning rate is too small?

Each step barely moves, so the cost drops only a tiny amount per step. It will reach the bottom eventually, but may take many thousands of steps. Safe but painfully slow.

Q. Why is gradient descent a repeated loop, not one jump?

Because the gradient is local: it only gives the slope right where you stand. After a step you are somewhere new with different slopes, so you recompute the gradient and step again. Each step buys a fresh downhill direction.

Q. What do the sign and size of a gradient entry tell each knob?

Sign gives the direction (turn the knob up or down); size gives the amount (turn it a lot or a little). Knobs that strongly affect the cost move more; knobs that barely matter right now move less.

Q. What does gradient descent assume it already has?

The gradient ∇C, the slope of the cost with respect to every knob. Gradient descent is only what you do with the gradient; computing it efficiently for a deep network is backpropagation’s job (lesson 8).

Q. What is stochastic gradient descent?

Estimating the gradient from a small random batch of examples each step instead of the whole training set. Far cheaper, good enough, and the same conceptual loop: step downhill, repeat.

Q. What is a 'training loss curve' showing?

The cost falling step by step as gradient descent runs, exactly like the column 25, 16, 10.24, … sliding toward the minimum. A model “converging” is that curve flattening out at a valley floor.