Cheatsheet: How models actually learn: gradient descent
The landscape picture

| Term | Meaning |
|---|---|
| Loss | total error a given set of parameters produces |
| Landscape | parameters are horizontal directions, loss is the height |
| Goal | find the lowest point (smallest loss) |
| What you sense | only the local slope under your feet |
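A minimal sketch of "sensing only the local slope", assuming the one-parameter bowl loss (w - 5)^2 from the worked trace further down. The helper names `loss` and `local_slope` are illustrative, and the slope is estimated by probing a tiny distance on either side of the current position rather than by a formula.

```python
def loss(w: float) -> float:
    """Height of the landscape at parameter value w (illustrative bowl)."""
    return (w - 5.0) ** 2


def local_slope(f, w: float, eps: float = 1e-6) -> float:
    """Estimate the slope at w from two nearby probes (central difference)."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


# Negative slope: the ground falls away to the right. Positive: to the left.
for w in (1.0, 5.0, 8.0):
    print(f"w={w}: loss={loss(w):.2f}, slope={local_slope(loss, w):.4f}")
```

The slope is close to zero at w = 5, which is how the bottom of the bowl shows up when only local information is available.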
The two pieces

| Piece | What it is | Effect |
|---|---|---|
| Gradient | slope of the loss; which way and how steeply error changes | points uphill; step against it |
| Learning rate | size of each step | too big overshoots; too small crawls |
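To see the learning-rate effect in numbers, here is a small sketch on the same bowl loss (w - 5)^2 with gradient 2*(w - 5). The three learning rates, the starting point w = 1, and the step count of 10 are arbitrary choices for illustration.

```python
def gradient(w: float) -> float:
    """Slope of the bowl loss (w - 5)^2."""
    return 2.0 * (w - 5.0)


def run(learning_rate: float, steps: int = 10, w: float = 1.0) -> float:
    """Take `steps` gradient descent steps from w and return the final w."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)  # step against the gradient
    return w


for lr in (0.01, 0.25, 1.1):
    w_final = run(lr)
    print(f"learning_rate={lr}: w={w_final:.4f}, "
          f"distance from 5 = {abs(w_final - 5.0):.4f}")
```

On this bowl, each step multiplies the distance to the minimum by (1 - 2 * learning_rate). So 0.01 barely moves, 0.25 halves the distance every step, and 1.1 overshoots by more than it corrects and ends up further away than it started.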
The update loop

| Step | Action |
|---|---|
| 1 | Guess the parameters (often random) |
| 2 | Compute the loss (how wrong now?) |
| 3 | Compute the gradient (which way is downhill?) |
| 4 | Step: new = old - (learning_rate * gradient) |
| 5 | Repeat until steps stop lowering the loss |
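The five steps above can be sketched as one loop. This is an illustrative version for a single parameter, not a reference implementation: the function name `gradient_descent`, the `tolerance` and `max_steps` arguments, and the random starting range are all assumptions made for the example.

```python
import random


def gradient_descent(loss, gradient, learning_rate=0.1,
                     tolerance=1e-9, max_steps=10_000, seed=0):
    """Minimise a one-parameter loss; returns (w, loss at w)."""
    rng = random.Random(seed)
    w = rng.uniform(-10.0, 10.0)            # 1. guess the parameter
    current = loss(w)                       # 2. how wrong now?
    for _ in range(max_steps):
        g = gradient(w)                     # 3. slope at the current w
        w_new = w - learning_rate * g       # 4. new = old - lr * gradient
        new = loss(w_new)
        if current - new < tolerance:       # 5. step no longer lowers the loss
            break
        w, current = w_new, new
    return w, current


# Usage on the bowl with its minimum at w = 5.
w, final_loss = gradient_descent(lambda w: (w - 5) ** 2,
                                 lambda w: 2 * (w - 5))
print(f"w = {w:.5f}, loss = {final_loss:.2e}")
```

The stopping test compares the loss before and after each step, so a step that would raise the loss (for example with too large a learning rate) also ends the loop and is not kept.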
Worked trace (bowl with minimum at w = 5, learning rate 0.25, gradient = 2*(w-5))

| Step | w | gradient | loss = (w-5)^2 |
|---|---|---|---|
| start | 1 | -8 | 16 |
| 1 | 3 | -4 | 4 |
| 2 | 4 | -2 | 1 |
| … | toward 5 | toward 0 | toward 0 |
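The rows above can be reproduced with a few lines of code. The function name `trace` and the choice to print five steps are illustrative; the starting point, learning rate, and gradient formula are the ones given in the heading.

```python
def trace(w: float = 1.0, learning_rate: float = 0.25, steps: int = 5):
    """Return (step, w, gradient, loss) rows for the bowl loss (w - 5)^2."""
    rows = []
    for step in range(steps + 1):
        grad = 2 * (w - 5)
        rows.append((step, w, grad, (w - 5) ** 2))
        w = w - learning_rate * grad  # update rule from the loop above
    return rows


for step, w, grad, loss in trace():
    label = "start" if step == 0 else str(step)
    print(f"{label:>5} | w = {w:<6g} | gradient = {grad:<6g} | loss = {loss:g}")
```

With learning rate 0.25, each update moves w halfway to 5 (1 - 0.25 * 2 = 0.5), so the next values of w after 4 are 4.5, 4.75, and so on, and the loss shrinks by a factor of four per step.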
Failure modes and notes

| Symptom / idea | Meaning |
|---|---|
| Loss climbs or swings wildly | learning rate too large; reduce it |
| Loss falls painfully slowly | learning rate too small; increase it |
| Settles above the true best | may be stuck in a local minimum (a dip that is not the lowest point overall) |
| Stochastic gradient descent | estimate the gradient from a random sample of the data at each step; this is how large models are trained |
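A toy sketch of the stochastic variant, under stated assumptions: the data are 1000 synthetic numbers drawn around 5, the loss is the mean of (w - x)^2 over the data (so the best w is the data mean), and each step uses a random batch of 10 points. The function name `sgd` and all numeric settings are made up for illustration.

```python
import random

rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(1000)]  # synthetic data
full_mean = sum(data) / len(data)  # minimiser of the full-data loss


def sgd(data, learning_rate=0.05, batch_size=10, steps=2000, w=0.0, rng=rng):
    """Gradient descent where each gradient comes from a random batch."""
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # random sample, not all data
        # Gradient of the batch loss: mean of 2 * (w - x) over the batch.
        grad = sum(2.0 * (w - x) for x in batch) / batch_size
        w = w - learning_rate * grad
    return w


w_sgd = sgd(data)
print(f"SGD estimate: {w_sgd:.3f}, full-data minimiser: {full_mean:.3f}")
```

Each step touches 10 points instead of 1000, which is the reason for using it. The price is a noisy slope estimate, so with a fixed learning rate w keeps jittering around the minimum instead of settling exactly on it.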