Summary: Building and training a net: micrograd

TL;DR. The autograd engine from last lesson computes the gradient of any expression. This lesson uses it to build and train a network. A neuron is the expression tanh(weighted sum + bias); stack neurons into layers and layers into a multilayer perceptron. Measure how wrong the network is with a single loss number, call loss.backward() to get every parameter’s gradient, and nudge each parameter opposite its gradient by a small learning rate. Repeat that forward / zero-gradients / backward / update loop and the network learns. The same loop trains essentially every modern network, GPT included.
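To make "a neuron is tanh(weighted sum + bias)" concrete, here is a minimal sketch of the forward pass only, written with plain Python floats and no autograd. The function names (neuron, layer, mlp, random_layer) and the 3-4-4-1 layer sizes are illustrative assumptions, not the lesson's own code.

```python
import math
import random

def neuron(weights, bias, inputs):
    """One neuron: tanh(w1*x1 + ... + wn*xn + b)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return math.tanh(weighted_sum + bias)

def layer(neurons, inputs):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(weights, bias, inputs) for (weights, bias) in neurons]

def mlp(layers, inputs):
    """An MLP feeds each layer's outputs into the next layer."""
    for neurons in layers:
        inputs = layer(neurons, inputs)
    return inputs

def random_layer(n_in, n_out, rng):
    """Hypothetical helper: n_out neurons, each with n_in weights and one bias."""
    return [([rng.uniform(-1, 1) for _ in range(n_in)], rng.uniform(-1, 1))
            for _ in range(n_out)]

rng = random.Random(0)  # fixed seed so the run is repeatable
net = [random_layer(3, 4, rng), random_layer(4, 4, rng), random_layer(4, 1, rng)]
out = mlp(net, [2.0, 3.0, -1.0])
print(out)  # a list holding one number between -1 and 1 (the final tanh neuron)
```

In the lesson the weights and biases are Value objects rather than floats, which is what lets the same forward pass also record a graph that backward() can walk.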

  • A network is one big expression. A neuron is tanh(w1*x1 + ... + wn*xn + b), with the weights and bias as trainable parameters. A layer is several neurons; an MLP is stacked layers. Running an input through records the whole network as one computational graph.

  • The loss is a single number at the top of the graph. The squared-error loss sums (pred - target)^2 over the training examples; mean squared error is that sum divided by the number of examples. Because the loss is a Value, loss.backward() gives every parameter its own gradient, dLoss/dParameter: the signal for which way to nudge it.

  • Gradient descent nudges every parameter downhill. The update rule is parameter = parameter - learning_rate * gradient: step opposite the gradient (the minus sign is essential) by a small learning rate. In one worked example on a single weight, the first step took the loss from 16 to 0.64 and the second took it to 0.0256, converging toward zero.

  • Training is a four-step loop, repeated. Forward pass, zero the gradients, backward pass, update. Forgetting to zero the gradients (they accumulate with +=) is the classic bug that quietly breaks training.

  • This is how real models train. Forward, loss, backward, step is the heartbeat of training GPT and every other network. At scale the loss is cross-entropy, the optimizer is Adam, the forward pass is a transformer on tensors across many GPUs, but the loop is the same. A trained model is just its parameters after many downhill nudges.
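The single-weight numbers quoted above (16, then 0.64, then 0.0256) can be reproduced with a few lines of arithmetic. The summary does not state the exact setup, so the one below is an assumption chosen to match those numbers: loss(w) = (w - target)^2, a starting weight of 0, a target of 4, and a learning rate of 0.4.

```python
# Assumed setup (not given in the summary): loss(w) = (w - target)**2,
# starting at w = 0.0 with target = 4.0 and learning_rate = 0.4.
target = 4.0
learning_rate = 0.4
w = 0.0

def loss(w):
    return (w - target) ** 2

def grad(w):
    # d/dw (w - target)^2 = 2 * (w - target)
    return 2.0 * (w - target)

losses = [loss(w)]
for step in range(2):
    w = w - learning_rate * grad(w)  # step opposite the gradient
    losses.append(loss(w))

print([round(l, 4) for l in losses])  # [16.0, 0.64, 0.0256]

# Flipping the sign steps uphill instead: the loss rises rather than falls.
w_wrong = 0.0 + learning_rate * grad(0.0)
print(loss(w_wrong) > losses[0])  # True
```

Under these assumptions each step shrinks the error (w - target) by a factor of 0.2, so the loss shrinks by 0.04 per step, which is where the 16, 0.64, 0.0256 sequence comes from.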

“Training a neural network” stops being a black box and becomes a loop you could write: build the expression, score it, nudge the parameters downhill, repeat. When you hear a model was “trained on trillions of tokens,” you now know precisely what happened: this loop ran an astronomical number of times. The next lesson points the machinery at something real: makemore, a character-level language model that learns the statistics of names and generates new ones, the first step from a toy network toward a model that produces language.
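As a rough illustration of "a loop you could write", here is a self-contained sketch of the four-step loop. It is not micrograd itself: the Value class is a stripped-down stand-in supporting only +, * and tanh, and the helper names (make_neuron, run_neuron, forward), the 2-3-1 network, the toy dataset, the learning rate of 0.05 and the 100 steps are all assumptions made for this example.

```python
import math
import random

class Value:
    """Tiny scalar autograd node: just enough (+, *, tanh) for this sketch."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # note the +=: gradients accumulate
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in topological order so each grad is complete before use.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for child in v._children:
                    visit(child)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

random.seed(0)

def make_neuron(n_in):
    # n_in weights followed by one bias, all trainable.
    return [Value(random.uniform(-1, 1)) for _ in range(n_in + 1)]

def run_neuron(params, xs):
    act = params[-1]  # start from the bias
    for w, x in zip(params[:-1], xs):
        act = act + w * x
    return act.tanh()

hidden = [make_neuron(2) for _ in range(3)]  # 2 inputs -> 3 hidden neurons
output = make_neuron(3)                      # 3 hidden -> 1 output neuron
parameters = [p for n in hidden for p in n] + output

def forward(xs):
    return run_neuron(output, [run_neuron(n, xs) for n in hidden])

# Made-up toy dataset: four 2-D inputs with targets of +1 or -1.
xs = [[2.0, 3.0], [3.0, -1.0], [0.5, 1.0], [1.0, 1.0]]
ys = [1.0, -1.0, -1.0, 1.0]

losses = []
for step in range(100):
    loss = Value(0.0)                        # 1. forward: sum of squared errors
    for x, y in zip(xs, ys):
        diff = forward(x) + (-y)
        loss = loss + diff * diff
    for p in parameters:                     # 2. zero the gradients
        p.grad = 0.0
    loss.backward()                          # 3. backward
    for p in parameters:                     # 4. update: step opposite the gradient
        p.data -= 0.05 * p.grad
    losses.append(loss.data)

print(losses[0], losses[-1])  # first and last loss; the last should be smaller
```

Removing step 2 leaves each parameter's grad holding the running total from all earlier steps, because every _backward uses +=. That is the quiet failure mode the summary warns about.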