Cheatsheet: Gradient descent, step by step
The one idea that matters
new value = old value − learning_rate × (that knob's slope)

w_new = w_old − learning_rate × ∇C(w_old)

Applied to all ~13,000 knobs at once, every step. Subtract because the gradient points uphill and the goal is downhill. Repeat until the cost flattens.
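A minimal sketch of one update, assuming the slopes are already known. The function name `gradient_step` and the three example knobs are made up for illustration; the slopes are those of a hypothetical cost equal to the sum of squared knobs, where each slope is 2w.

```python
def gradient_step(knobs, slopes, learning_rate):
    """Return the knob values after one gradient descent step."""
    # Every knob gets the same rule: new = old - learning_rate * its own slope.
    return [w - learning_rate * g for w, g in zip(knobs, slopes)]

# Hypothetical values: three knobs and their slopes (2w for a sum-of-squares cost).
knobs = [5.0, -2.0, 0.5]
slopes = [2 * w for w in knobs]

new_knobs = gradient_step(knobs, slopes, learning_rate=0.1)
print(new_knobs)
```

Each knob moves toward 0 here: the positive knobs have positive slopes and are lowered, the negative knob has a negative slope and is raised.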
The training loop
1. At the current position, compute the gradient (slope in every direction).
2. Step: update every knob by the rule above.
3. New position → new slopes → go back to step 1.
4. Stop when cost stops dropping much, or after a set number of steps.
The gradient is local (only the slope right where you stand), so each step needs a fresh one. That is why training is many small steps, not one jump.
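The loop above can be sketched in a few lines. This is an illustration under stated assumptions, not the lesson's implementation: `gradient_descent`, `tolerance`, and `max_steps` are hypothetical names, and the bowl-shaped example cost (sum of squared knobs) is chosen only because its gradient, 2w per knob, is easy to write by hand.

```python
def gradient_descent(cost, gradient, start, learning_rate,
                     max_steps=1000, tolerance=1e-9):
    """Repeat small steps until the cost stops dropping or steps run out."""
    w = list(start)
    previous_cost = cost(w)
    for step in range(1, max_steps + 1):
        slopes = gradient(w)  # fresh gradient at the current position
        w = [wi - learning_rate * gi for wi, gi in zip(w, slopes)]  # take the step
        current_cost = cost(w)
        # Stop when the drop is tiny (this also stops if the cost went up).
        if previous_cost - current_cost < tolerance:
            return w, current_cost, step
        previous_cost = current_cost
    return w, previous_cost, max_steps  # stopped by the step limit

# Hypothetical example cost: a bowl with its bottom at the origin.
def bowl(w):
    return sum(wi ** 2 for wi in w)

def bowl_gradient(w):
    return [2 * wi for wi in w]

final_w, final_cost, steps_used = gradient_descent(
    bowl, bowl_gradient, start=[5.0, -3.0], learning_rate=0.1)
print(steps_used, final_cost)
```

The gradient is passed in as a function: the loop only uses it, it does not know how it was computed.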
Worked run: C(w) = w², start w = 5, learning rate 0.1
| Step | w | C |
|---|---|---|
| 0 | 5 | 25 |
| 1 | 4.0 | 16.0 |
| 2 | 3.2 | 10.24 |
| 3 | 2.56 | 6.55 |
| 4 | 2.048 | 4.19 |
| 7 | 1.049 | 1.10 |
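Cost slides toward 0. With slope 2w and learning rate 0.1, each step is w − 0.1 × 2w = 0.8w, so w shrinks by 20% per step (the table skips steps 5 and 6). Steps shrink automatically as the slope 2w shrinks near the bottom.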
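The table can be reproduced with a short script. This is a sketch: the function name `run` is made up, and the slope 2w is written by hand because the cost is simply w².

```python
def run(start, learning_rate, steps):
    """Gradient descent on C(w) = w**2, recording (step, w, C) after each step."""
    w = start
    history = [(0, w, w ** 2)]
    for step in range(1, steps + 1):
        slope = 2 * w                  # derivative of w**2
        w = w - learning_rate * slope  # the update rule
        history.append((step, w, w ** 2))
    return history

history = run(start=5.0, learning_rate=0.1, steps=7)
for step, w, c in history:
    print(f"step {step}: w = {w:.3f}, C = {c:.2f}")
```

Printing all eight rows also fills in steps 5 and 6, which the table leaves out.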
The learning rate tradeoff
| Rate | First step from w=5 | Result |
|---|---|---|
| 0.1 (good) | 5 → 4.0 | converges smoothly |
| 2.0 (too big) | 5 → −15 (C 225), then 45 (C 2025) | diverges, cost explodes |
| 0.001 (too small) | 5 → 4.99 (C 24.9) | crawls; thousands of steps needed |
Big enough to progress, small enough not to overshoot. Modern methods adapt it automatically (and per knob).
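A quick way to see the three rows of the table side by side, again on C(w) = w². The helper name `descend` is made up for this sketch.

```python
def descend(learning_rate, start=5.0, steps=2):
    """Return the positions visited by gradient descent on C(w) = w**2."""
    w = start
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # slope of w**2 is 2w
        path.append(w)
    return path

for rate in (0.1, 2.0, 0.001):
    path = descend(rate)
    costs = [round(w ** 2, 2) for w in path]
    print(f"rate {rate}: w = {[round(w, 3) for w in path]}, C = {costs}")
```

For this particular cost each step multiplies w by (1 − 2 × rate), so the run converges only when that factor is between −1 and 1, that is, for rates between 0 and 1. The threshold is specific to C(w) = w²; other cost surfaces have other limits.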
What the gradient tells each knob
- Sign of the component → direction (turn this knob up or down).
- Size of the component → amount (turn it a lot or a little).
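A small sketch of reading a gradient knob by knob. The two-knob cost C(a, b) = a² + 10b² is hypothetical, picked so that the two slopes (2a and 20b) differ in both sign and size; `read_gradient` is an illustrative name.

```python
def read_gradient(slopes, learning_rate):
    """For each knob, report which way to turn it and by how much."""
    advice = {}
    for name, slope in slopes.items():
        if slope > 0:
            direction = "down"  # cost rises as the knob rises, so lower it
        elif slope < 0:
            direction = "up"    # cost falls as the knob rises, so raise it
        else:
            direction = "hold"  # flat in this direction, nothing to change
        advice[name] = (direction, learning_rate * abs(slope))
    return advice

# Hypothetical cost C(a, b) = a**2 + 10 * b**2, with slopes 2a and 20b.
a, b = 3.0, -0.5
slopes = {"a": 2 * a, "b": 20 * b}

advice = read_gradient(slopes, learning_rate=0.01)
for name, (direction, amount) in advice.items():
    print(f"turn {name} {direction} by {amount:.2f}")
```

Knob b sits closer to its best value than a does, yet it gets the larger adjustment, because the cost is steeper in that direction.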
What this lesson does not do
It assumes you can get ∇C. Actually computing the slope of a 13,000-knob network efficiently is a separate problem: backpropagation (lesson 8).
Practical note: real training usually uses stochastic gradient descent (SGD), which estimates the gradient from a small random batch of training examples at each step. Each step is cheaper and noisier; the loop is conceptually the same.
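A minimal sketch of the batch idea, under loud assumptions: the one-knob model y = w × x, the noise-free toy data (generated with w = 3), the batch size, and the names `batch_gradient` and `sgd` are all invented for illustration and do not come from the lesson.

```python
import random

def batch_gradient(w, batch):
    """Slope of the mean squared error of y = w * x, measured on one batch only."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd(data, start, learning_rate, steps, batch_size, seed=0):
    """Same loop as before, but each slope comes from a small random batch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = start
    for _ in range(steps):
        batch = rng.sample(data, batch_size)              # small random batch
        w = w - learning_rate * batch_gradient(w, batch)  # usual update rule
    return w

# Hypothetical toy data: 20 points lying exactly on the line y = 3x.
data = [(x, 3 * x) for x in (i / 10 for i in range(1, 21))]

w = sgd(data, start=0.0, learning_rate=0.1, steps=200, batch_size=4)
print(round(w, 4))
```

Each step looks at 4 points instead of all 20. The update line is unchanged; only the source of the slope differs.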
Pitfalls to dodge
- “One step finishes it.” No. Many small steps, fresh slope each time.
- “Bigger learning rate is always faster.” No. Too big overshoots and diverges.
- “The gradient knows where the bottom is.” No. It only knows the local slope.
- “Gradient descent computes the gradient.” No. It uses it. Computing it is the job of backpropagation (lesson 8).
The one-line version
Read the slope, step against it, repeat. The hard part was never the walking; it is figuring out the slope.