Gradient descent: brief

What you’ll learn

Three lessons have been circling one idea, and this is where it lands. Lesson 5 made learning a goal (minimize the cost), and lesson 6 gave it a shape (the negative gradient points downhill). Put them together and the method almost writes itself: step downhill, look again, step again. That repeated downhill walk is gradient descent, the algorithm that trains almost every neural network in use.

You will learn the update rule in one line (new value = old value − learning_rate × slope), applied to all roughly 13,000 knobs at once. You will run it by hand on C(w) = w² and watch the cost slide from 25 toward zero, then see how the learning rate can break everything: too large, and the steps overshoot and either diverge (one run explodes to 2025) or bounce without settling; too small, and training crawls. You will frame training as a short loop (compute the gradient here, step, recompute, repeat) and see why the local gradient must be refreshed each step. The lesson names stochastic gradient descent (estimating the gradient from a small random batch) as the real-world shortcut, and ends by flagging its one open assumption: it takes the gradient as given. Computing it efficiently is backpropagation, the start of Phase 3.
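
For readers who want an optional preview (the lesson itself needs no code), the update rule can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own material: the starting point w = 5 follows from the starting cost of 25, but the function names and the four learning rates are assumptions chosen here to show the behaviors described above.

```python
def cost(w):
    """The lesson's example cost, C(w) = w squared."""
    return w ** 2


def slope(w):
    """Derivative of w squared, which is 2w."""
    return 2 * w


def descend(w, learning_rate, steps):
    """Repeat the update rule, recording the cost after every step."""
    history = [cost(w)]
    for _ in range(steps):
        # new value = old value - learning_rate * slope,
        # with the slope recomputed at the current position each time
        w = w - learning_rate * slope(w)
        history.append(cost(w))
    return w, history


start = 5.0  # cost(5) = 25, matching the starting cost in the text

# Learning rates below are illustrative choices, not the lesson's settings.
_, steady = descend(start, 0.1, 5)       # cost falls at every step
_, crawling = descend(start, 0.001, 5)   # cost barely moves
_, bouncing = descend(start, 1.0, 4)     # w flips between 5 and -5; cost stays 25
_, diverging = descend(start, 2.0, 2)    # cost: 25 -> 225 -> 2025

for name, run in [("steady", steady), ("crawling", crawling),
                  ("bouncing", bouncing), ("diverging", diverging)]:
    print(name, run)
```

With a learning rate of 2.0, two steps take the cost from 25 to 2025. That happens to match the figure quoted above, though the lesson's run may use different settings to get there.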

Where this fits

This is lesson 7, the last of Phase 2 (How a network learns). Lesson 5 defined the cost, lesson 6 turned it into a landscape with the negative gradient as a downhill compass, and this lesson walks the landscape with the update rule. That completes the learning loop except for one piece: where the gradient comes from. Phase 3 supplies it, with lesson 8 explaining backpropagation intuitively and lesson 9 connecting it to the chain rule (cross-referencing Track 8, Calculus). After this lesson, you know how a network learns, given a gradient; the final phase shows how the gradient is computed.

Before you start

Prerequisite (within this track): lesson 6, The cost landscape, since this lesson walks the terrain that lesson built, stepping along the negative gradient it introduced. If “the negative gradient points steepest downhill” is solid, you are ready. The math is multiplication and subtraction, run a few times; a calculator helps in the practice, and no coding or installation is required.

By the end, you’ll be able to

State the gradient descent update rule and explain why it subtracts the gradient
Run gradient descent by hand on a simple cost and watch the cost fall toward the minimum
Explain the learning rate and how a value that is too large (diverging or oscillating) or too small (crawling) breaks training
Describe training as a repeated loop and explain why the local gradient must be recomputed each step
Recognize that gradient descent assumes the gradient is available, and name stochastic gradient descent as the practical sampling shortcut
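
The last objective mentions stochastic gradient descent. As an optional illustration of the sampling idea, the sketch below estimates the slope from a small random batch instead of the full data set. Everything in it is an assumption made for the example: the toy cost (the average of (w − x)² over some data points), the synthetic data, the batch size, and the function names are not taken from the lesson.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: 1000 points scattered around 3.
data = [random.gauss(3.0, 1.0) for _ in range(1000)]


def mean_gradient(w, points):
    """Slope of the average of (w - x)**2 over the given points."""
    return sum(2 * (w - x) for x in points) / len(points)


def sgd(w, learning_rate, steps, batch_size):
    """Gradient descent where each slope comes from a small random batch."""
    for _ in range(steps):
        batch = random.sample(data, batch_size)  # small random subset
        # Same update rule as before, with an estimated slope
        w = w - learning_rate * mean_gradient(w, batch)
    return w


w_final = sgd(0.0, 0.1, 200, 10)
print(w_final, sum(data) / len(data))  # the two values should be close
```

Each step looks at 10 points rather than 1000, so the slope is noisier but far cheaper to compute. The final value hovers near the true minimum (the data's average) without landing on it exactly.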

Time and difficulty

Read time: about 11 minutes
Practice time: about 15 minutes (running gradient descent by hand, breaking it with a bad learning rate, a diagnosis drill, and flashcards)
Difficulty: standard