Summary: Gradient descent, step by step
Three lessons built to this. Learning is minimizing the cost; the negative gradient points downhill; this lesson finally takes the walk. The whole algorithm is one line, applied to all the knobs at once and repeated: step against the gradient, look again, step again. That is gradient descent, the method that trains almost every neural network in use. The catch it leaves open, how you actually get the gradient for a 13,000-knob network, is the subject of the next lesson. This is the scan-it-in-five-minutes version.
Core ideas
- The update rule is one line: new value = old value − (learning rate) × (that knob's slope), or w_new = w_old − learning_rate × ∇C. It is applied to all roughly 13,000 knobs at once, every step, and subtracts because we move downhill against the uphill gradient.
- Worked run. On C(w) = w² from w = 5 with learning rate 0.1, the cost slides 25 → 16 → 10.24 → 6.55 → 4.19 → 2.68 → 1.72 → 1.10, toward the bottom at 0. The steps shrink on their own, because the slope 2w shrinks as w nears zero.
- The learning rate is a real choice. Too large overshoots: at rate 2.0 the same run explodes (25 → 225 → 2025), and a rate that merely bounces back and forth never settles. Too small crawls: at rate 0.001 a full step moves the cost only from 25 to about 24.9. The sweet spot is big enough to progress, small enough not to overshoot, and modern methods adjust it automatically.
- Training is a loop. Compute the gradient where you stand, step every knob by the rule, arrive somewhere new with different slopes, and repeat; stop when the cost flattens. The gradient is local, so each step needs a fresh one, which is why training is many small steps rather than one jump.
- The gradient says direction and amount per knob. Its sign points each knob up or down; its size says by how much. Training is just this, repeated: nudge thousands (or billions) of knobs at once, each toward lower cost.
- A practical shortcut: stochastic gradient descent. Real training estimates the gradient from a small random batch of examples at each step instead of from the whole set. Each step is much cheaper, and the loop is otherwise the same.
- The open question. Gradient descent assumes you can get ∇C. Computing the slope of a deep network with respect to all its knobs, efficiently, is a separate problem, solved by backpropagation in the next lesson.
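The worked run and the learning-rate comparison from the list above can be reproduced in a few lines. This is a minimal sketch for the one-knob toy cost C(w) = w² only; the function names `cost`, `slope`, and `descend` are illustrative choices, not part of the lesson.

```python
def cost(w):
    """Toy cost from the worked run: C(w) = w^2."""
    return w * w


def slope(w):
    """Slope of the toy cost: dC/dw = 2w."""
    return 2.0 * w


def descend(w, learning_rate, steps):
    """Apply w_new = w_old - learning_rate * slope(w_old) for `steps` steps.

    Returns the cost at the start followed by the cost after each step.
    """
    costs = [cost(w)]
    for _ in range(steps):
        w = w - learning_rate * slope(w)  # step against the gradient
        costs.append(cost(w))
    return costs


if __name__ == "__main__":
    # The three rates discussed above, all starting from w = 5.
    for rate, steps in [(0.1, 7), (2.0, 2), (0.001, 1)]:
        rounded = [round(c, 2) for c in descend(5.0, rate, steps)]
        print(f"rate {rate}: {rounded}")
```

On this particular cost, a rate of 1.0 gives the bounce described above: each step flips w between 5 and −5, so the cost stays at 25 and never settles.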
What changes for you
This loop is, literally, what “training a model” means. Every headline about a model that took weeks and a fortune in compute is describing this walk: compute a gradient, nudge billions of knobs, repeat, billions of times. The “training loss curve” people share is the cost falling step by step, the same column of numbers you watched slide from 25 toward 1. It even demystifies the failures: training that “blows up” is usually a learning rate set too high, training that crawls is one set too low, and because each run takes many local steps from a random start, the same recipe can yield slightly different models. The walking turns out to be the easy part. The hard part, getting the slope of a 13,000-knob network in the first place, is exactly where Phase 3 begins, with backpropagation.
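To make the loss curve, the random start, and the mini-batch shortcut concrete, here is a rough sketch of stochastic gradient descent on a two-knob model. Everything specific in it is an assumption made for illustration: the toy data (y = 3x + 1 plus noise), the batch size of 8, the 500 steps, the learning rate of 0.1, and all function names. The slopes are written out by hand here, which only works because the model is tiny.

```python
import random


def make_data(n, rng):
    """Hypothetical toy data: y = 3x + 1 plus a little noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data


def mean_squared_error(w, b, examples):
    """Cost: average squared gap between the prediction w*x + b and y."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


def gradient(w, b, examples):
    """Slopes of the cost with respect to the two knobs, w and b."""
    n = len(examples)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in examples) / n
    db = sum(2.0 * (w * x + b - y) for x, y in examples) / n
    return dw, db


def train(data, learning_rate=0.1, batch_size=8, steps=500, seed=0):
    """Mini-batch SGD: estimate the gradient from a random batch, then step."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)  # random start
    losses = []
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # small random batch
        dw, db = gradient(w, b, batch)
        w -= learning_rate * dw  # same update rule, one knob at a time
        b -= learning_rate * db
        losses.append(mean_squared_error(w, b, data))  # the loss curve
    return w, b, losses


if __name__ == "__main__":
    data = make_data(200, random.Random(42))
    for seed in (0, 1):
        w, b, losses = train(data, seed=seed)
        print(f"seed {seed}: w={w:.3f} b={b:.3f} "
              f"loss {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Under these assumptions, both seeds should end near w = 3 and b = 1 but not at identical values, which is the "slightly different models" effect in miniature. The `losses` list is the training loss curve for that run.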