Gradient descent, in brief

What you’ll learn

This is lesson 3 of Track 10, the close of Phase 1 (What learning from data means). By the end you will be able to trace, step by step, how gradient descent nudges a model’s parameters downhill to lower its error, and explain why this one procedure is how nearly every model learns, from the two-parameter line of lesson 2 to a network with billions of weights. The single mental image to walk away with: a point on a foggy hillside, feeling the local slope and stepping downhill until the ground goes flat.

The track structurally mirrors StatQuest’s intuition-first machine learning videos, with Microsoft’s “ML For Beginners” as the hands-on companion for readers who want to build the models in code. Full attribution is in this lesson’s references.

Where this fits

Lesson 2 ended on an open question: we defined the best-fit line as the one that minimizes the sum of squared residuals, but never showed how to find it. This lesson answers that, and the answer is bigger than regression. Gradient descent is the engine under classification, neural networks, and large language models alike. This lesson closes Phase 1 by handing you the second half of how a model learns: lesson 2 gave you a way to measure error (the loss), and this lesson gives you a way to drive it down (gradient descent). The next phase opens with logistic regression, the first model we actually fit using this search.

Before you start

Prerequisite: Lesson 2, Fitting a line: linear regression. You need the idea of a loss to minimize (the sum of squared residuals) and of parameters (slope and intercept), because gradient descent is the procedure that searches for the parameter values that make the loss small. No calculus required; the lesson builds the intuition with a hillside, not derivatives.

By the end, you’ll be able to

Explain why a general search is needed when no formula gives the answer
Describe the loss as a landscape and the gradient as the local slope
Trace the update loop and apply new = old - learning_rate * gradient by hand
Explain how the learning rate affects training, and recognize when it is set too large
Connect gradient descent to how large models train
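
As a preview of the update rule named in the objectives, here is a minimal sketch in Python. The one-parameter loss, its slope formula, the starting value, and the learning rate are all assumptions chosen so the arithmetic is easy to check by hand; they are not the lesson's own data, and the function names are hypothetical.

```python
# Toy one-parameter loss (an assumption for illustration):
# loss(w) = (w - 3)^2, which is lowest at w = 3.
def loss(w):
    return (w - 3) ** 2


def gradient(w):
    # Local slope of the toy loss at w, taken as given here (no calculus needed to use it).
    return 2 * (w - 3)


def descend(w, learning_rate, steps):
    """Apply new = old - learning_rate * gradient, recording each position."""
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        path.append(w)
    return path


# Two steps from w = 0 with a learning rate of 0.25.
path = descend(w=0.0, learning_rate=0.25, steps=2)
for step, w in enumerate(path):
    print(step, w, loss(w))
# prints:
# 0 0.0 9.0
# 1 1.5 2.25
# 2 2.25 0.5625
```

Each step moves w toward 3 and the loss shrinks. Calling descend(0.0, 1.5, 2) instead gives 0.0, 9.0, -9.0: with the learning rate set too large, every step overshoots the low point and the loss grows.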

Time and difficulty

Read time: about 12 minutes
Practice time: about 15 minutes (a by-hand two-step descent trace, a learning-rate diagnosis, and flashcards)
Difficulty: standard