Building and training a net: brief

What you’ll learn

This is lesson 2 of Phase 1 (The autograd engine) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Lesson 1 built the autograd engine: give it any expression and it returns the gradient of the output with respect to every input.

This lesson uses that engine to build and train an actual network. A neuron is the expression tanh(w1*x1 + ... + wn*xn + b), with the weights and bias as its trainable parameters. Stack neurons into a layer and layers into a multilayer perceptron. Measure how wrong the network is with a single loss number (mean squared error), call loss.backward() to give every parameter its gradient, and apply gradient descent: nudge each parameter opposite its gradient by a small step, the learning rate times the gradient. The lesson works a full gradient-descent step by hand on a single weight and then repeats it (watching the loss fall from 16 to 0.64 to 0.0256), lays out the four-step training loop, flags the classic forgot-to-zero-the-gradients bug, and shows that the same loop is how networks of every size, GPT included, learn.
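To make the falling loss concrete, here is a minimal sketch of gradient descent on a single weight. The brief does not give the lesson's actual numbers, so the model (prediction = w * x with squared-error loss), the input x = 2, the target y = 4, the starting weight w = 0, and the learning rate 0.1 are all assumptions, chosen only because they reproduce the quoted losses of 16, 0.64, and 0.0256.

```python
# Hypothetical single-weight model: prediction = w * x, loss = (prediction - y) ** 2.
# x, y, the starting w, and lr are assumed values, not taken from the lesson.
x, y = 2.0, 4.0
w = 0.0
lr = 0.1

losses = []
for step in range(3):
    pred = w * x                # forward pass
    loss = (pred - y) ** 2      # squared error for the single example
    losses.append(loss)
    grad = 2 * (pred - y) * x   # d(loss)/dw via the chain rule
    w -= lr * grad              # step opposite the gradient

print([round(l, 4) for l in losses])  # [16.0, 0.64, 0.0256]
```

With these values each update shrinks the error by a factor of 0.2, so the loss shrinks by 0.04 per step. A much larger learning rate would overshoot the target instead of approaching it.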

Where this fits

This is lesson 2 of Phase 1, The autograd engine, and it closes the micrograd arc. Lesson 1 built the engine that computes gradients; this lesson assembles that engine into a trainable network and runs the learning loop. Together they cover Karpathy’s Lecture 1 end to end. The next lesson opens Phase 2 (Building a language model) by leaving micrograd behind and starting makemore, a character-level model that learns the statistics of names and generates new ones. That is the first step from a toy network toward a model that produces language.

Before you start

Prerequisite (within this track): lesson 1, Building an autograd engine: micrograd. This lesson assumes you know what a Value object and a computational graph are, that each operation has a local derivative, and that backward() walks the chain rule to give every node its gradient. If “loss.backward() fills in every parameter’s gradient” reads as a procedure rather than magic, you are ready. A working sense of the chain rule and of gradient descent (a function decreases fastest in the direction opposite its gradient) helps; both are covered in the calculus track. No coding is required to follow the lesson.

By the end, you’ll be able to

Describe how a neuron, a layer, and a multilayer perceptron are built as expressions out of weights, biases, and tanh
Explain why training needs a single loss number, and how mean squared error produces one from the network’s predictions
State the gradient-descent update rule and explain why each parameter steps opposite its gradient, scaled by a small learning rate
Run a full gradient-descent step by hand on a single-weight network and confirm the loss decreases
List the four steps of the training loop in order and explain why forgetting to zero the gradients breaks training
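
The last two objectives can be sketched in plain Python without micrograd installed. This is an illustration under stated assumptions, not the lesson's code: Param is a hypothetical stand-in for a Value object, the three data points are made up, and the four steps are shown in one common order (forward pass, zero the gradients, backward pass, update). The backward function accumulates gradients with +=, which is the behavior that makes the zeroing step necessary.

```python
import math

class Param:
    """Hypothetical stand-in for a micrograd Value: a number plus its gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

XS = [-1.0, 0.5, 2.0]   # made-up inputs
YS = [-0.5, 0.2, 0.8]   # made-up targets

def forward(w, b):
    """Predictions of one tanh neuron and their mean squared error."""
    preds = [math.tanh(w.data * x + b.data) for x in XS]
    loss = sum((p - y) ** 2 for p, y in zip(preds, YS)) / len(XS)
    return preds, loss

def backward(w, b, preds):
    """Accumulate d(loss)/dw and d(loss)/db with +=, never overwriting."""
    for x, y, p in zip(XS, YS, preds):
        # Chain rule through the mean squared error, then through tanh.
        dpre = 2 * (p - y) / len(XS) * (1 - p ** 2)
        w.grad += dpre * x
        b.grad += dpre

def train(steps=50, lr=0.1):
    w, b = Param(0.0), Param(0.0)
    history = []
    for _ in range(steps):
        preds, loss = forward(w, b)   # 1. forward pass
        history.append(loss)
        w.grad, b.grad = 0.0, 0.0     # 2. zero the gradients
        backward(w, b, preds)         # 3. backward pass
        w.data -= lr * w.grad         # 4. update each parameter
        b.data -= lr * b.grad
    return history

history = train()
print(history[-1] < history[0])  # True: the loss went down

# The bug in isolation: a second backward pass without zeroing in between.
w, b = Param(0.0), Param(0.0)
preds, _ = forward(w, b)
backward(w, b, preds)
first = w.grad
backward(w, b, preds)
print(math.isclose(w.grad, 2 * first))  # True: stale gradient was added in
```

Without step 2, every update would use the sum of all gradients computed so far rather than the current one, so the steps grow larger than the learning rate intends.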

Time and difficulty

Read time: about 12 minutes
Practice time: about 20 minutes (a gradient-descent step by hand, optionally confirmed in micrograd’s demo, plus flashcards)
Difficulty: standard