Practice: The cost landscape

Self-check

Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.

1. What does the “cost landscape” actually mean?

Show answer

It is the cost function pictured as terrain. Every possible setting of the weights and biases is a point, and the cost at that setting is the height above it. High cost is high ground; low cost is low ground. The best setting of the knobs is the bottom of the lowest valley. The landscape is not stored anywhere; it is a way of seeing C(w, b).
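To make "not stored anywhere" concrete, here is a minimal sketch that computes the height on demand at a few knob settings. The one-weight, one-bias model and the three data points are made up for illustration; a real network's cost is evaluated the same way, just with far more knobs.

```python
# Hypothetical toy model: prediction = w * x + b.
# The (x, target) pairs are invented and lie on the line y = 2x + 1.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]


def cost(w, b):
    """Mean squared error of the toy model at the knob setting (w, b)."""
    return sum((w * x + b - y) ** 2 for x, y in DATA) / len(DATA)


# Sample the "height" of the landscape at a small grid of knob settings.
for w in (0.0, 1.0, 2.0, 3.0):
    heights = [round(cost(w, b), 2) for b in (0.0, 1.0, 2.0)]
    print(f"w={w}: heights at b=0,1,2 -> {heights}")
```

The lowest point of this small grid is at w = 2, b = 1, where the cost is 0 because the model reproduces the data exactly.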

2. A real network has about 13,000 knobs, so its landscape lives in 13,000+ dimensions. How do you reason about that?

Show answer

You do not try to picture it. Build the intuition on a 2-knob landscape (a drawable 3D surface) and trust that every idea (slopes, downhill, valleys, lowest points) carries over to high dimensions as exact mathematics. The picture is a crutch for intuition; the math works the same whether or not you can draw it.

3. What is the gradient, and is it a number or a direction?

Show answer

The gradient ∇C is a vector, not a single number. It points the way cost increases fastest (the steepest uphill), and its length says how steep that climb is. It is assembled from every knob’s individual slope (“how fast does cost change if I nudge just this one?”), bundled into one arrow. In one dimension it collapses to a single signed slope.
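The "nudge just this one knob" idea can be sketched as a finite-difference estimate. The bowl cost w1² + w2², the point (3, 4), and the helper name numerical_gradient are illustrative assumptions, not part of any library.

```python
def cost(w1, w2):
    """Assumed example cost: a round bowl with its bottom at the origin."""
    return w1 ** 2 + w2 ** 2


def numerical_gradient(f, point, h=1e-6):
    """Estimate each knob's slope by nudging that knob alone (central difference)."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h      # nudge knob i slightly up
        down[i] -= h    # nudge knob i slightly down
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad


grad = numerical_gradient(cost, [3.0, 4.0])
print(grad)  # approximately [6, 8], matching the slopes 2*w1 and 2*w2
```

Each entry of the result is one knob's slope; the list as a whole is the arrow.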

4. Why do we step in the direction of the negative gradient?

Show answer

Because we want cost to drop, and the negative gradient -∇C is the gradient turned around to point steepest downhill. A small step that way lowers the cost more than an equally small step in any other direction. The gradient points uphill; we go the opposite way.
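A quick way to test the "steepest downhill" claim is to take the same small step in many directions and compare the resulting costs. This is a sketch on an assumed bowl cost w1² + w2² at the point (3, 4), with an arbitrary step length of 0.1.

```python
import math


def cost(w1, w2):
    """Assumed example cost: a round bowl with its bottom at the origin."""
    return w1 ** 2 + w2 ** 2


def gradient(w1, w2):
    """Exact gradient of the bowl: each component is that knob's slope."""
    return (2 * w1, 2 * w2)


point = (3.0, 4.0)
step = 0.1  # arbitrary small step length

# Unit vector along the negative gradient.
g = gradient(*point)
length = math.hypot(*g)
downhill = (-g[0] / length, -g[1] / length)


def cost_after(direction):
    """Cost after one step of fixed length in the given unit direction."""
    return cost(point[0] + step * direction[0], point[1] + step * direction[1])


# Unit directions every 10 degrees around the circle, for comparison.
candidates = [
    (math.cos(math.radians(a)), math.sin(math.radians(a)))
    for a in range(0, 360, 10)
]
best_candidate = min(candidates, key=cost_after)

print("negative gradient:", cost_after(downhill))
print("best of the others:", cost_after(best_candidate))
```

None of the sampled directions beats the negative gradient, which takes the cost from 25 to about 24.01.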

5. What is the difference between a local minimum and the global minimum?

Show answer

A local minimum is a valley bottom: a point where every nearby step leads uphill, so there is no downhill move available from where you stand. The global minimum is the deepest valley anywhere on the whole landscape. Walking downhill guarantees only that you reach a local minimum, which may be shallower than the global one.

6. Using the landscape, why is training described as iterative and run for hours or weeks?

Show answer

Because reaching a valley means taking many small downhill steps, not solving an equation in one shot. You repeatedly read the local downhill direction and step a little. The same picture also explains why two runs of the same model can differ (they start at different random points and roll into different valleys) and what “converged” means (the walk reached a valley floor where steps stop lowering cost much).
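The loop of "read the local downhill direction, step a little" can be sketched in a few lines. The cost C(w) = w², the starting point w = -4, the step size of 0.1, and the 50 repetitions are all assumptions chosen for illustration.

```python
def cost(w):
    """Assumed example cost: a parabola with its bottom at w = 0."""
    return w ** 2


def slope(w):
    """Slope of w**2 at w."""
    return 2 * w


w = -4.0          # assumed starting point
step_size = 0.1   # assumed: how far to move per step, as a fraction of the slope
costs = [cost(w)]

for _ in range(50):
    w -= step_size * slope(w)   # step against the slope, i.e. downhill
    costs.append(cost(w))

print("start cost:", costs[0])
print("cost after 1 step:", costs[1])
print("cost after 50 steps:", costs[-1])
print("final w:", w)
```

No single step lands on the bottom. Each one only shrinks the cost, and the walk has "converged" once further steps barely change it.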

Try it yourself, part 1: find the downhill direction

Pen and paper, about 8 minutes. You will compute the gradient and the downhill direction at a point, on simple bowl-shaped costs. (Recall: for w² the slope at w is 2w.)

Step 1 (one knob). Cost C(w) = w², a parabola with its bottom at 0. You are standing at w = -4. What is the slope there, which way is downhill (toward larger or smaller w), and does moving that way lower the cost?

Step 2 (two knobs). Cost C(w1, w2) = w1² + w2², a round bowl with its bottom at the origin. You are at the point (-2, 5). Write the gradient and the negative gradient, and say what cost you are currently sitting at.

Show answer

Step 1. The slope is 2w = 2·(-4) = -8. A negative slope means cost falls as w increases, so downhill is toward larger w (toward 0). Stepping that way takes the cost from (-4)² = 16 down toward 0 (for example, at w = -3 the cost is 9, already lower). Note that the negative gradient handles the sign automatically: it is -(-8) = +8, pointing in the +w direction, straight toward 0, even though you started on the left side of the valley rather than the right.

Step 2. Each component of the gradient is that knob’s own slope:

gradient ∇C     = [2·(-2), 2·5] = [-4, 10]
negative gradient = [4, -10]            (points back toward the origin)
current cost      = (-2)² + 5² = 4 + 25 = 29

The negative gradient [4, -10] says “increase w1 toward 0, decrease w2 toward 0,” which is exactly the way back to the bottom of the bowl, and a small step that way lowers the cost below 29. Two knobs or thirteen thousand, the downhill direction is always the negative gradient.
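If you want to check your pen-and-paper numbers for Step 2, a short script along these lines reproduces them. The step size of 0.1 used for the trial step is an assumption; the exercise itself does not specify one.

```python
def cost(w1, w2):
    """The round bowl from Step 2."""
    return w1 ** 2 + w2 ** 2


def gradient(w1, w2):
    """Each component is that knob's own slope, 2*w."""
    return [2 * w1, 2 * w2]


point = [-2.0, 5.0]
grad = gradient(*point)
negative_grad = [-g for g in grad]

# Take one small trial step along the negative gradient.
step_size = 0.1  # assumed for illustration
new_point = [p + step_size * n for p, n in zip(point, negative_grad)]

print("gradient:", grad)
print("negative gradient:", negative_grad)
print("current cost:", cost(*point))
print("cost after one small step:", cost(*new_point))
```

The trial step moves the point to roughly (-1.6, 4.0), where the cost is about 18.56, lower than 29.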

Try it yourself, part 2: stuck in a shallow valley

About 3 minutes, reasoning only. A network is trained by only ever stepping downhill. It settles at a point where every direction nearby leads uphill, and the cost stops dropping, but a colleague proves that a very different setting of the weights would give a much lower cost. Two questions: what is the network sitting in, and why did downhill-only stepping fail to find the better setting?

Show answer

The network is sitting in a local minimum: a valley bottom where no nearby step goes downhill, so the walk halts. It failed to reach the colleague’s better setting (a deeper valley, closer to the global minimum) because downhill-only stepping can never climb out of the valley it is in. To leave a valley you would have to go uphill first, which pure downhill stepping refuses to do. This is the genuine limitation of the method: it finds a good (locally lowest) solution, not a provably best one. In practice this is often less damaging than the two-valley picture suggests, because high-dimensional landscapes behave in their own surprising ways, but it is real and worth knowing.
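The scenario can be reproduced on a one-knob cost with two valleys. The function below, (w² - 1)² + 0.3w, is a made-up example chosen because its left valley is deeper than its right one. The step size and step count are also assumptions.

```python
def cost(w):
    """Made-up two-valley cost: the left valley is deeper than the right."""
    return (w ** 2 - 1) ** 2 + 0.3 * w


def slope(w):
    """Derivative of the cost above."""
    return 4 * w ** 3 - 4 * w + 0.3


def walk_downhill(w, step_size=0.01, steps=2000):
    """Downhill-only stepping from a given starting point."""
    for _ in range(steps):
        w -= step_size * slope(w)
    return w


right_end = walk_downhill(2.0)    # starts on the right-hand slope
left_end = walk_downhill(-2.0)    # starts on the left-hand slope

print("from w=2:  settles at", round(right_end, 3), "cost", round(cost(right_end), 3))
print("from w=-2: settles at", round(left_end, 3), "cost", round(cost(left_end), 3))
```

The walk that starts at w = 2 halts in the shallower right valley, even though the left valley has a lower cost, because getting there would mean climbing over the hump between them.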

Flashcards

Nine cards. Try to answer each question before reading the answer beneath it.

Q. What is the cost landscape?

The cost function pictured as terrain: every setting of the weights and biases is a point, and its cost is the height there. High cost is high ground; the lowest valley is the best setting. It is a way of seeing C(w, b), not something stored.

Q. How do you reason about a 13,000-dimensional landscape?

Do not try to picture it. Build intuition on a 2-knob surface (drawable in 3D) and trust that slopes, downhill, and valleys carry over to high dimensions as exact math. The picture is a crutch; the math works regardless.

Q. What is the gradient of the cost?

The vector ∇C that points the way cost increases fastest (the steepest uphill). It is assembled from every knob’s individual slope. In one dimension it is just a single signed slope.

Q. Why step along the negative gradient?

Because we want cost to drop. The negative gradient -∇C is the gradient turned around to point steepest downhill, so a small step that way lowers cost more than an equally small step in any other direction.

Q. For C(w) = w², what is the slope at a point w?

The slope is 2w. At w = 3 it is 6 (uphill to the right); at w = -4 it is -8 (downhill to the right). Stepping against the slope, toward 0, lowers the cost.

Q. For C(w1,w2)=w1²+w2² at (3,4), what is the gradient?

Each component is that knob’s slope: [2·3, 2·4] = [6, 8], pointing away from the bottom. The negative gradient [-6, -8] points back toward the origin, and the cost there is 9 + 16 = 25.

Q. Local minimum versus global minimum?

A local minimum is a valley bottom where every nearby step goes uphill. The global minimum is the deepest valley anywhere. Downhill walking guarantees only that you reach a local minimum, not the global one.

Q. Why can downhill-only training miss the best solution?

Because it can never climb out of the valley it is in, and leaving a valley requires going uphill first. So it settles at a local minimum that may be shallower than the deepest valley elsewhere.

Q. Why is training iterative rather than one calculation?

Reaching a valley means taking many small downhill steps, repeatedly reading the local downhill direction and stepping. “Converged” means the walk reached a valley floor where steps stop lowering cost much.