Summary: Building and training a net: micrograd

TL;DR. The autograd engine from last lesson computes the gradient of any expression built from its operations. This lesson uses it to build and train a network. A neuron is the expression tanh(weighted sum + bias); stack neurons into layers and layers into a multilayer perceptron (MLP). Measure how wrong the network is with a single loss number, call loss.backward() to get every parameter’s gradient, and nudge each parameter opposite its gradient, scaled by a small learning rate. Repeat that forward / zero-gradients / backward / update loop and the network learns. The same loop, scaled up, is how networks like GPT are trained.
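As a rough illustration of "neurons into layers, layers into an MLP", here is a minimal forward-pass sketch. It uses plain floats instead of the lesson's Value objects, so it tracks no gradients. The class names (Neuron, Layer, MLP), the random initialization in [-1, 1], and the 3-input network with layers of 4, 4 and 1 neurons are all assumptions chosen for the example, not details taken from the lesson.

```python
import math
import random


class Neuron:
    """One neuron: tanh(w1*x1 + ... + wn*xn + b)."""

    def __init__(self, n_inputs):
        # Assumed initialization: uniform random weights and bias in [-1, 1].
        self.w = [random.uniform(-1, 1) for _ in range(n_inputs)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        weighted_sum = sum(wi * xi for wi, xi in zip(self.w, x))
        return math.tanh(weighted_sum + self.b)

    def parameters(self):
        return self.w + [self.b]


class Layer:
    """Several neurons that all read the same inputs."""

    def __init__(self, n_inputs, n_outputs):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        return [neuron(x) for neuron in self.neurons]

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]


class MLP:
    """Stacked layers: each layer's outputs feed the next layer."""

    def __init__(self, n_inputs, layer_sizes):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(0)
net = MLP(3, [4, 4, 1])            # hypothetical shape: 3 inputs, layers of 4, 4, 1
out = net([2.0, 3.0, -1.0])        # a list with one prediction, between -1 and 1
n_params = len(net.parameters())   # 4*(3+1) + 4*(4+1) + 1*(4+1) = 41
print(out, n_params)
```

Swapping the floats for Value objects is what would turn this forward pass into a recorded computational graph that loss.backward() can walk.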

Core ideas

A network is one big expression. A neuron is tanh(w1*x1 + ... + wn*xn + b), with the weights and bias as trainable parameters. A layer is several neurons; an MLP is stacked layers. Running an input through records the whole network as one computational graph.
The loss is a single number at the top of the graph. The squared-error loss sums (pred - target)^2 over the training examples (dividing by the number of examples gives the mean squared error). Because the loss is a Value, loss.backward() gives every parameter dLoss/d(itself), the derivative of the loss with respect to that parameter: the signal for which way to nudge it.
Gradient descent nudges every parameter downhill. The update rule is parameter = parameter - learning_rate * gradient. The gradient points uphill, so the minus sign is essential, and the small learning rate keeps each step short. In the lesson's worked example on a single weight, one step took the loss from 16 to 0.64 and a second took it to 0.0256, converging toward zero.
Training is a four-step loop, repeated. Forward pass, zero the gradients, backward pass, update. Forgetting to zero the gradients is the classic bug that quietly breaks training: gradients accumulate with +=, so stale values from earlier iterations get added to the new ones.
This is how real models train. Forward, loss, backward, step is the heartbeat of training GPT and every other network. At scale the loss is cross-entropy, the optimizer is Adam, the forward pass is a transformer on tensors across many GPUs, but the loop is the same. A trained model is just its parameters after many downhill nudges.
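The four-step loop and the update rule can be sketched on a single weight. This is a minimal sketch with a hand-derived gradient standing in for loss.backward(). The specific numbers (x = 2, y = 4, a starting weight of 0, a learning rate of 0.1) are assumptions picked because they reproduce the 16, 0.64, 0.0256 loss sequence; they are not necessarily the values the lesson used.

```python
# Assumed toy problem: one weight w, one training example (x, y),
# prediction w * x, loss (w * x - y) ** 2.
x, y = 2.0, 4.0
w = 0.0
learning_rate = 0.1

grad = 0.0
losses = []
for step in range(3):
    # 1. Forward pass: compute the prediction and the loss.
    pred = w * x
    loss = (pred - y) ** 2
    losses.append(loss)

    # 2. Zero the gradient. Without this reset, the += below would add
    #    this step's gradient on top of the previous step's.
    grad = 0.0

    # 3. Backward pass, done by hand here: d(loss)/dw = 2 * (pred - y) * x.
    #    It accumulates with +=, as the autograd engine does.
    grad += 2.0 * (pred - y) * x

    # 4. Update: step opposite the gradient.
    w -= learning_rate * grad

    print(f"step {step}: loss = {loss:.4f}, w = {w:.4f}")
```

With these assumed values the recorded losses come out to roughly 16, 0.64 and 0.0256, and each update shrinks the prediction error by a factor of five.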

What changes for you

“Training a neural network” stops being a black box and becomes a loop you could write: build the expression, score it, nudge the parameters downhill, repeat. When you hear a model was “trained on trillions of tokens,” you now know what happened: this loop ran an astronomical number of times. The next lesson points the machinery at something real: makemore, a character-level language model that learns the statistics of names and generates new ones. It is the first step from a toy network toward a model that produces language.