References: How models actually learn: gradient descent

Source material

Conceptual spine:
• StatQuest with Josh Starmer: "Gradient Descent, Step by Step"
  Creator: Josh Starmer
  YouTube: https://www.youtube.com/watch?v=sDv4f4s2SB8
  Channel / site: https://statquest.org/
  License: as published on StatQuest's public YouTube channel (link-out only)

Related StatQuest video:
• "Stochastic Gradient Descent"
  YouTube: https://www.youtube.com/watch?v=vMh0zPT0tLI

Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original videos remain with the creator.

What this lesson draws from each source

StatQuest’s “Gradient Descent, Step by Step” anchors the procedure: the loss surface, stepping against the slope, and the role of the step size. StatQuest works the gradient with calculus on a sum-of-squared-residuals example; this lesson deliberately keeps the no-calculus intuition (the foggy hillside) and traces a single-parameter bowl by hand, so the mechanism is clear before any derivatives. If you want the calculus-level derivation, the StatQuest video is the place to go.
The “loss as a landscape” framing and the worked bowl trace are Clawdemy’s own simplifications, built to make the downhill loop concrete without notation.
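
The downhill loop described above can be sketched in a few lines of Python. This is a minimal sketch, not the lesson's own trace: the bowl (w - 3)^2, the starting point, the step size of 0.1, and the function names are all assumptions chosen for illustration. To stay with the no-calculus intuition, the slope is estimated by probing a tiny distance to each side rather than by taking a derivative.

```python
def loss(w):
    # Hypothetical bowl-shaped loss; its lowest point is at w = 3 (an assumption).
    return (w - 3.0) ** 2


def slope(f, w, eps=1e-6):
    # Estimate the local slope by probing a tiny distance to each side,
    # like feeling the ground underfoot on a foggy hillside.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


def descend(f, w, learning_rate=0.1, steps=50):
    # Repeatedly step against the slope, recording every position visited.
    trace = [w]
    for _ in range(steps):
        w = w - learning_rate * slope(f, w)
        trace.append(w)
    return trace


trace = descend(loss, w=0.0)
for step, w in enumerate(trace[:5]):
    print(f"step {step}: w = {w:.4f}, loss = {loss(w):.4f}")
```

On this particular bowl, raising learning_rate above 1.0 makes each step overshoot by more than it corrects, so the trace moves away from the minimum instead of toward it.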

Going deeper

StatQuest with Josh Starmer. The gradient descent and stochastic gradient descent videos pair directly with this lesson. StatQuest also covers the chain rule and backpropagation, the method that computes the gradients gradient descent needs across the layers of a neural network.
3Blue1Brown: "Gradient descent, how neural networks learn" by Grant Sanderson. A visual, geometry-first walk through gradient descent in the context of a neural network learning to recognize digits. The single best companion video if you want to see the landscape and the steps.

Adjacent topics

Logistic regression (the next lesson). The first place we use gradient descent for real: there is no neat formula that gives the best-fitting logistic regression directly, so it is fit step by step, with gradient descent.
Backpropagation. The method that computes the gradient efficiently across the many layers of a neural network. It is the reason gradient descent scales to billions of parameters. A natural next step once this lesson is solid (and the subject of other Clawdemy tracks).
Learning-rate schedules. In practice the learning rate is often changed during training (large at first, smaller later). A refinement of the single fixed rate used here.
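
As a rough illustration of a learning-rate schedule, the sketch below shrinks the rate by a fixed factor at every step (exponential decay, one of several common schedules). Everything specific here is an assumption made for illustration: the bowl (w - 3)^2, the starting rate of 0.4, the decay factor of 0.9, and the function names. Unlike the rest of this lesson, it uses the calculus result that the slope of (w - 3)^2 is 2(w - 3).

```python
def decayed_rate(initial_rate, decay, step):
    # Exponential decay: the rate is multiplied by `decay` once per step,
    # so early steps are large and later steps are small.
    return initial_rate * (decay ** step)


def descend_with_schedule(w, initial_rate=0.4, decay=0.9, steps=30):
    # Gradient descent on the hypothetical bowl (w - 3)^2 with a shrinking rate.
    for step in range(steps):
        gradient = 2.0 * (w - 3.0)  # slope of the assumed bowl at w
        w = w - decayed_rate(initial_rate, decay, step) * gradient
    return w


final_w = descend_with_schedule(0.0)
print(f"final w = {final_w:.4f}")
```

The only change from a fixed-rate loop is that the step size is looked up from the schedule on each pass instead of being held constant.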

Community discussion

None selected for this lesson. Gradient descent is thoroughly covered by the StatQuest and 3Blue1Brown resources above. If a canonical discussion surfaces, it will be added at the next review.