Building and training a net: micrograd

This is lesson 2 of Phase 1 (The autograd engine) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Lesson 1 built the autograd engine: give it any expression and it returns the gradient of the output with respect to every input.

This lesson uses that engine to build and train an actual network. A neuron is the expression tanh(w1*x1 + ... + wn*xn + b), with the weights and bias as its trainable parameters. Stack neurons into a layer and layers into a multilayer perceptron. Measure how wrong the network is with a single loss number (mean squared error), call loss.backward() to give every parameter its gradient, and apply gradient descent: move each parameter opposite its gradient, by the gradient times a small learning rate. The lesson works one full gradient-descent step by hand on a single weight (watching the loss fall from 16 to 0.64 to 0.0256), lays out the four-step training loop, flags the classic forgot-to-zero-the-gradients bug, and shows that the same loop is how much larger networks, GPT included, learn.
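The hand-worked step can be previewed in a few lines of plain Python. This is a minimal sketch, not the lesson's own worked example: the input `x`, target `y`, starting weight `w`, and `learning_rate` are assumed values, picked only because they reproduce the loss sequence 16, 0.64, 0.0256 quoted above. The model is also assumed to be a bare single weight (`pred = w * x`) with no bias and no tanh.

```python
# Assumed values, chosen so the losses match the 16 -> 0.64 -> 0.0256 sequence.
x, y = 2.0, 4.0        # one training example: input and target (hypothetical)
w = 0.0                # initial weight (hypothetical)
learning_rate = 0.1    # step size (hypothetical)

losses = []
for step in range(3):
    pred = w * x                    # forward pass: single weight, no bias, no tanh
    loss = (pred - y) ** 2          # squared error on the one example
    losses.append(loss)
    grad = 2 * (pred - y) * x       # d(loss)/dw via the chain rule
    w -= learning_rate * grad       # step opposite the gradient
    print(f"step {step}: loss={loss:.4f}  grad={grad:.4f}  new w={w:.4f}")
```

The gradient is negative at every step here, so subtracting `learning_rate * grad` raises `w` toward the value that makes `w * x` equal the target. That is the "opposite its gradient" rule in action.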

This lesson closes the micrograd arc. Lesson 1 built the engine that computes gradients; this lesson assembles that engine into a trainable network and runs the learning loop. Together they cover Karpathy’s Lecture 1 end to end. The next lesson opens Phase 2 (Building a language model) by leaving micrograd behind and starting makemore, a character-level model that learns the statistics of names and generates new ones. It is the first step from a toy network toward a model that produces language.

Prerequisite (within this track): lesson 1, Building an autograd engine: micrograd. This lesson assumes you know what a Value object and a computational graph are, that each operation has a local derivative, and that backward() walks the chain rule to give every node its gradient. If “loss.backward() fills in every parameter’s gradient” reads as a procedure rather than magic, you are ready. A working sense of the chain rule and of gradient descent (a function decreases fastest opposite its gradient) helps; both are covered in the calculus track. No coding is required to follow the lesson.

  • Describe how a neuron, a layer, and a multilayer perceptron are built as expressions out of weights, biases, and tanh
  • Explain why training needs a single loss number, and how mean squared error produces one from the network’s predictions
  • State the gradient-descent update rule and explain why each parameter steps opposite its gradient by a small learning rate
  • Run a full gradient-descent step by hand on a single-weight network and confirm the loss decreases
  • List the four steps of the training loop in order and explain why forgetting to zero the gradients breaks training
  • Read time: about 12 minutes
  • Practice time: about 20 minutes (a gradient-descent step by hand, optionally confirmed in micrograd’s demo, plus flashcards)
  • Difficulty: standard
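The last objective, the training loop and the zero-the-gradients bug, can be sketched without the full engine. The example below rests on several assumptions: `Param` is a hypothetical stand-in for a micrograd `Value` (just a number and its gradient), the gradient is written out by hand instead of calling `backward()`, and the four steps are taken to be forward pass, zero the gradients, backward pass, update. The one behaviour it borrows from micrograd is that gradients accumulate with `+=`, which is why skipping the zeroing step matters.

```python
class Param:
    """Hypothetical stand-in for a micrograd Value: a number plus its gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0


def train(zero_grads, steps=3, lr=0.1):
    x, y = 2.0, 4.0            # assumed training example (input, target)
    w = Param(0.0)             # assumed initial weight
    losses = []
    for _ in range(steps):
        # 1. forward pass: prediction and loss
        pred = w.data * x
        loss = (pred - y) ** 2
        losses.append(loss)
        # 2. zero the gradients (skipped in the buggy run)
        if zero_grads:
            w.grad = 0.0
        # 3. backward pass: gradients accumulate with +=
        w.grad += 2 * (pred - y) * x
        # 4. update: step opposite the gradient
        w.data -= lr * w.grad
    return losses


correct = train(zero_grads=True)
buggy = train(zero_grads=False)
print("with zeroing:   ", [round(l, 4) for l in correct])
print("without zeroing:", [round(l, 4) for l in buggy])
```

Both runs agree on the first update, because the gradient starts at zero either way. From the second update on, the buggy run steps along the sum of all past gradients, so it overshoots the target and the loss climbs again instead of continuing to shrink.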