How models actually learn: gradient descent
What you’ll learn
This is lesson 3 of Track 10, the close of Phase 1 (What learning from data means). By the end you will be able to trace, step by step, how gradient descent nudges a model’s parameters downhill to lower its error, and explain why this one procedure is how nearly every model learns, from the two-parameter line of lesson 2 to a network with billions of weights. The single mental image to walk away with: a point on a foggy hillside, feeling the local slope and stepping downhill until the ground goes flat.
The track structurally mirrors StatQuest’s intuition-first machine learning videos, with Microsoft’s “ML For Beginners” as the hands-on companion for readers who want to build the models in code. Full attribution is in this lesson’s references.
Where this fits
Lesson 2 ended on an open question: we defined the best-fit line as the one that minimizes the sum of squared residuals, but never showed how to find it. This lesson answers that, and the answer is bigger than regression. Gradient descent is the engine under classification, neural networks, and large language models alike. This lesson therefore closes Phase 1 by handing you the second half of how a model learns: lesson 2 gave you a way to measure error (the loss), and this lesson gives you a way to drive it down (gradient descent). The next phase opens with logistic regression, the first model we actually fit using this search.
Before you start
Prerequisite: Lesson 2, Fitting a line: linear regression. You need the idea of a loss to minimize (the sum of squared residuals) and of parameters (slope and intercept), because gradient descent is the procedure that searches for the parameter values that make the loss small. No calculus required; the lesson builds the intuition with a hillside, not derivatives.
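As a refresher on the prerequisite, here is a minimal sketch of the loss as a function of the two parameters. The function name `sum_squared_residuals` and the four data points are made up for illustration; they are not taken from lesson 2.

```python
def sum_squared_residuals(slope, intercept, xs, ys):
    """Loss: sum of (observed - predicted)^2 over all points."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = slope * x + intercept
        residual = y - predicted
        total += residual ** 2
    return total

# Hypothetical toy data, chosen to lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

good_loss = sum_squared_residuals(2.0, 1.0, xs, ys)  # the line that fits exactly
poor_loss = sum_squared_residuals(1.0, 0.0, xs, ys)  # a worse guess
print(good_loss, poor_loss)  # prints 0.0 30.0
```

Different parameter values give different losses. Gradient descent is the search for the pair that makes this number small.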
By the end, you’ll be able to
- Explain why a general search is needed when no formula gives the answer
- Describe the loss as a landscape and the gradient as the local slope
- Trace the update loop and apply new = old - learning_rate * gradient by hand
- Explain how the learning rate affects training, and spot one set too large
- Connect gradient descent to how large models train
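To make the update rule in these objectives concrete, here is a small sketch of two steps of new = old - learning_rate * gradient on a single parameter. The loss (w - 3)**2, the starting value 0, and the two learning rates are assumptions chosen for illustration, and the slope formula is simply given so that no calculus is needed to follow the loop.

```python
def gradient(w):
    """Local slope of the hypothetical loss (w - 3)**2 at w (given, not derived here)."""
    return 2.0 * (w - 3.0)

def descend(w, learning_rate, steps):
    """Apply new = old - learning_rate * gradient repeatedly; return every value visited."""
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        path.append(w)
    return path

# Modest learning rate: each step moves closer to the minimum at w = 3.
small = descend(0.0, 0.1, 2)   # roughly 0 -> 0.6 -> 1.08

# Learning rate set too large: each step overshoots and lands farther from 3.
large = descend(0.0, 1.1, 2)   # roughly 0 -> 6.6 -> -1.32

print(small)
print(large)
```

With the smaller rate the distance to the minimum shrinks on every step. With the larger one it grows, which is the symptom to look for when diagnosing a learning rate that is set too large.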
Time and difficulty
- Read time: about 12 minutes
- Practice time: about 15 minutes (a by-hand two-step descent trace, a learning-rate diagnosis, and flashcards)
- Difficulty: standard