Autograd engine, micrograd: brief

What you’ll learn

This is the opener of Phase 1 (The autograd engine) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. The track’s contract is simple: by the end, nothing inside a neural network is a mystery, because you will have built every piece from nothing.

This lesson builds the component that lets a network learn: an autograd engine. A network learns by nudging its parameters to make the loss smaller, and to know which way to nudge each one it needs the gradient of the loss with respect to that parameter. micrograd, the smallest honest version (about 150 lines), computes those gradients automatically using three ideas: wrap every number in a Value that records how it was computed (building the computational graph), give each operation a local derivative known in advance, and backpropagate by applying the chain rule backward through the graph until every node holds its gradient. The lesson works one full backward pass by hand on L = (a*b + c) * f, adds tanh so the engine can express a neuron, and shows that loss.backward() in PyTorch is this same procedure applied to tensors rather than single numbers.
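The three ideas above can be sketched in a few dozen lines of Python. This is a minimal illustration in the spirit of micrograd, not a copy of its source: the attribute names (_prev, _op, _backward) and the input values a=2, b=-3, c=10, f=-2 are assumptions chosen for the example.

```python
import math

class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, children=(), op=""):
        self.data = data
        self.grad = 0.0                  # d(output)/d(this node), filled in by backward()
        self._prev = set(children)       # the nodes this one was computed from
        self._op = op                    # label of the operation that produced it
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            # local derivative of addition is 1 for each input
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            # local derivative of a*b is b with respect to a, and a with respect to b
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")
        def _backward():
            # local derivative of tanh(x) is 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # order the graph so each node comes after everything it depends on
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                  # d(output)/d(output) = 1
        for v in reversed(topo):         # walk the chain rule backward
            v._backward()

# Example inputs are assumptions, not values fixed by this brief.
a, b, c, f = Value(2.0), Value(-3.0), Value(10.0), Value(-2.0)
L = (a * b + c) * f
L.backward()
print(L.data, a.grad, b.grad, c.grad, f.grad)  # -8.0 6.0 -4.0 -2.0 4.0
```

Each operation only stores its own local rule; backward() supplies the ordering, which is why adding a new operation such as tanh does not require touching the rest of the engine.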

Where this fits

This is lesson 1 of Phase 1, The autograd engine, and the opener of the whole track. There is no prior lesson in this track; it starts from zero. The next lesson takes the engine built here and does what it was built for: assembling neurons into a network and training it by repeatedly computing gradients and stepping the parameters downhill. Phase 2 then moves from this single-value engine to a language model, and Phase 3 to a transformer. Every later lesson rests on the gradient mechanism this one establishes.

Before you start

Prerequisites (conceptual, not lessons in this track): you’ll get the most from this lesson if you already know the rough shape of a neural network (an input passed through layers of weighted sums and nonlinearities, learning by adjusting its weights) and the chain rule from calculus (rates of change multiply through a composition of functions). If “the chain rule sends a rate backward through a composition” reads cleanly, you have the one piece of math this lesson leans on; everything else is built inline. No coding is required to follow the lesson, though reading the micrograd repo (MIT-licensed, around 150 lines) afterward is the fastest way to make it concrete.
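As a quick self-check on the chain rule prerequisite, the sketch below compares the product of two rates with a numerical slope. The functions g and h and the point x = 2 are hypothetical, chosen only to keep the arithmetic easy.

```python
def g(x):
    return x * x            # inner function, dg/dx = 2x

def h(y):
    return 3.0 * y + 1.0    # outer function, dh/dy = 3

x = 2.0

# Chain rule: the rate through the composition is the product of the two rates.
analytic = 3.0 * (2.0 * x)  # dh/dy * dg/dx = 3 * 4 = 12

# Numerical slope of h(g(x)) from a small symmetric nudge around x.
eps = 1e-6
numeric = (h(g(x + eps)) - h(g(x - eps))) / (2 * eps)

print(analytic, numeric)    # the two values should agree to several decimal places
```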

By the end, you’ll be able to

Explain why a neural network needs the gradient of the loss with respect to every parameter, and what a gradient tells you about how to nudge a parameter
Describe how wrapping each number in a Value object records the computation as a graph during the forward pass
State the local derivatives of addition, multiplication, and tanh, and explain why each one depends only on its own operation
Run a full backward pass by hand on a small expression, computing the gradient of the output with respect to every input via the chain rule
Recognize that loss.backward() in real frameworks is this same procedure (recorded graph, local derivatives, chain rule walked backward), just on tensors instead of single numbers
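The local derivatives named in these objectives can be checked numerically before relying on them. The sketch below is one way to do that, assuming a hypothetical helper central_diff and example inputs a = 2 and b = -3; each rule uses only the operation’s own inputs, never the rest of the graph.

```python
import math

def central_diff(fn, x, eps=1e-6):
    """Numerical slope of fn at x (hypothetical helper for this check)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

a, b = 2.0, -3.0  # example inputs, chosen for illustration

# Local derivatives with respect to a, written from the rules:
add_local = 1.0                        # d(a + b)/da = 1
mul_local = b                          # d(a * b)/da = b
tanh_local = 1.0 - math.tanh(a) ** 2   # d tanh(a)/da = 1 - tanh(a)^2

checks = {
    "add": (add_local, central_diff(lambda v: v + b, a)),
    "mul": (mul_local, central_diff(lambda v: v * b, a)),
    "tanh": (tanh_local, central_diff(math.tanh, a)),
}
for name, (analytic, numeric) in checks.items():
    print(name, analytic, numeric)     # each pair should agree closely
```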

Time and difficulty

Read time: about 18 minutes
Practice time: about 20 minutes (a full backward pass by hand, optionally confirmed against micrograd’s visualizer, plus flashcards)
Difficulty: standard