The cost landscape
What you’ll learn
Lesson 5 left a clean goal and an awkward silence: learning is minimizing the cost C(w, b), but with about 13,000 knobs and a complicated cost, which way do you even step? This lesson hands you a way to see the problem, by turning that abstract space of knob settings into a landscape.
You will picture the cost as terrain: every setting of the weights and biases is a point, and its cost is the height above it, so bad regions rise as hills and good regions sink into valleys. You will see why a real network’s 13,000-dimensional landscape cannot be drawn, yet every idea (slopes, downhill, valleys) carries over as exact math. You will meet the gradient ∇C, the direction of steepest uphill, and the negative gradient -∇C, the compass that points steepest downhill. You will then work through by hand why stepping along the negative gradient lowers the cost: first on C(w) = w² at w = 3, then on the bowl C(w1, w2) = w1² + w2² at (3, 4), where the negative gradient [-6, -8] points back to the bottom. The lesson closes on an honest caveat: downhill walking reaches a local minimum, a valley bottom, which may not be the deepest valley anywhere.
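As a preview of those two worked examples, here is a minimal sketch that estimates the slope numerically instead of by hand. The helper names (`cost_1d`, `cost_bowl`, `numerical_gradient`), the nudge size `h`, and the step size of 0.1 are all assumptions made for illustration; the lesson itself does not prescribe them.

```python
# Sketch only: check the lesson's two hand-worked examples numerically.
# All function names and the constants h and step are illustrative assumptions.

def cost_1d(w):
    # C(w) = w^2, with the single knob stored in a one-element list
    return w[0] ** 2

def cost_bowl(w):
    # C(w1, w2) = w1^2 + w2^2
    return w[0] ** 2 + w[1] ** 2

def numerical_gradient(cost, point, h=1e-6):
    """Estimate the gradient by nudging one knob at a time (central differences)."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        down = list(point)
        down[i] -= h
        grad.append((cost(up) - cost(down)) / (2 * h))
    return grad

start = [3.0, 4.0]
grad_1d = numerical_gradient(cost_1d, [3.0])    # close to [6.0]
grad_bowl = numerical_gradient(cost_bowl, start)  # close to [6.0, 8.0]
downhill = [-g for g in grad_bowl]              # close to [-6.0, -8.0]

# Take one small step along the negative gradient and compare heights.
step = 0.1  # assumed step size, chosen only for this illustration
new_point = [p + step * d for p, d in zip(start, downhill)]

print("gradient at w = 3:", grad_1d)
print("negative gradient at (3, 4):", downhill)
print("cost before:", cost_bowl(start), "cost after:", cost_bowl(new_point))
```

The printed cost after the step should be lower than the cost before it, which is the whole point of following the negative gradient. Nudging each knob in turn is slow, but it makes the idea of "slope in each direction" concrete without any calculus.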
Where this fits
This is lesson 6, the second of Phase 2 (How a network learns). Lesson 5 defined the cost we are now picturing as terrain; this lesson supplies the shape and the compass (the negative gradient); and lesson 7 takes the actual walk, repeatedly reading the downhill direction and stepping, which is the algorithm the whole chapter is named for, gradient descent. The slope idea this lesson leans on is made precise in Track 8 (Visual Math: Calculus), cross-referenced for anyone who wants the derivative underneath the metaphor.
Before you start
Prerequisite (within this track): lesson 5, What learning really means, which defines the cost C(w, b) that this lesson pictures as height. If “learning is minimizing a wrongness number over the weights and biases” is solid, you are ready. The only math is squaring and a little multiplication; the gradient is introduced from intuition, with no calculus background assumed (Track 8 is there if you want the precise derivative).
By the end, you’ll be able to
- Picture the cost as a landscape where each setting of the weights and biases is a point and its cost is the height
- Explain why a 13,000-dimensional landscape cannot be drawn but the slope-and-valley reasoning carries over unchanged
- Define the gradient as the direction of steepest uphill and the negative gradient as steepest downhill
- Explain why stepping along the negative gradient lowers the cost fastest, and compute the gradient on a simple bowl-shaped cost
- Distinguish a local minimum from the global minimum and explain why downhill walking only reaches a local one
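To make the last objective concrete, here is a small sketch of downhill walking on a one-knob cost with two valleys. The cost function, its slope formula, the step size, and the step count are all hypothetical choices made for this illustration, not part of the lesson; the slope formula comes from calculus and can be taken on trust here.

```python
# Sketch only: the same downhill rule ends in different valleys
# depending on where the walk starts. Everything here is an illustrative assumption.

def cost(w):
    # A hypothetical cost with two valleys of different depth
    return w**4 - 3 * w**2 + w

def slope(w):
    # Slope of the cost above: 4w^3 - 6w + 1
    return 4 * w**3 - 6 * w + 1

def walk_downhill(w, step_size=0.01, steps=2000):
    """Repeatedly take a small step against the slope until the walk settles."""
    for _ in range(steps):
        w -= step_size * slope(w)
    return w

from_right = walk_downhill(2.0)   # settles near w = 1.13, the shallower valley
from_left = walk_downhill(-2.0)   # settles near w = -1.30, the deeper valley

print("start at  2.0 ->", from_right, "cost", cost(from_right))
print("start at -2.0 ->", from_left, "cost", cost(from_left))
```

Both walks stop where the slope is flat, so each one has found a valley bottom. Only the walk that started on the left reaches the deeper of the two, and nothing in the downhill rule tells the other walk that a better valley exists.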
Time and difficulty
- Read time: about 11 minutes
- Practice time: about 14 minutes (finding the downhill direction by hand in 1D and 2D, a local-versus-global reasoning drill, and flashcards)
- Difficulty: standard