Lesson: Seeing it whole, and where next
We started this track with a handful of messy handwritten threes and an uncomfortable observation. You could read them all instantly, but you could not write down the rule you used to do it. That gap, between recognizing and explaining, was the whole reason we needed something other than ordinary rule-writing. The fix, we said, was a paradigm shift: stop writing rules, start showing examples, and let a machine find the pattern.
Back then, “let a machine find the pattern” was a promise with a sealed box behind it. Over nine lessons we opened that box completely. This last lesson is for stepping back and seeing the whole thing at once, walking one full training step from start to finish, and then sending you off toward whatever you want to do next. No new machinery here. Just the view from the top of the hill we climbed.
The whole story, in one breath
Here is everything, in order, each piece clicking into the next.
A neural network is a function that turns 784 numbers (the pixels of an image) into 10 numbers (a score per digit). That was lesson 1’s promise and lesson 4’s assembly.
That function is built from layers of neurons, where a neuron is nothing but a container holding one number between 0 and 1, its activation. The input layer holds the raw pixels; the output layer holds the ten scores; the hidden layers do the work in between. That was lesson 2.
Each neuron computes its number the same way: take a weighted sum of the previous layer’s activations, add a bias, and pass the result through a squish to keep it in range. That was lesson 3.
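As a reminder of how little is going on inside one neuron, here is a minimal sketch in Python. The sigmoid used for the squish is an assumption (this lesson only says the squish keeps the number in range), and the inputs, weights, and bias are made up for illustration.

```python
import math

def squish(z):
    """Assumed squish: the sigmoid, which maps any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_activation(prev_activations, weights, bias):
    """Weighted sum of the previous layer, plus a bias, then the squish."""
    weighted_sum = sum(w * a for w, a in zip(weights, prev_activations))
    return squish(weighted_sum + bias)

# Three made-up activations from a previous layer feeding one neuron.
activation = neuron_activation([0.0, 0.5, 1.0], [0.2, -0.4, 0.6], bias=-0.1)
print(round(activation, 3))  # 0.574
```

The weighted sum here is 0.4, the bias brings it to 0.3, and the squish turns that into an activation a little above one half.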
Stack that computation across every neuron and the whole network is one big function with about 13,000 adjustable knobs, its weights and biases. Set them randomly and it outputs nonsense; set them well and it reads digits. The capability lives entirely in those numbers. That was lesson 4.
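The "about 13,000" figure can be checked by counting. The sketch below assumes two hidden layers of 16 neurons each, which this lesson does not state; only the 784 inputs and 10 outputs come from the text above. Under that assumption the count lands close to the number quoted.

```python
# Assumed layer sizes: 784 inputs and 10 outputs are from the lesson;
# the two hidden layers of 16 neurons are an illustrative assumption.
layer_sizes = [784, 16, 16, 10]

def count_parameters(sizes):
    """Count every weight and bias in a fully connected network."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

total_knobs = count_parameters(layer_sizes)
print(total_knobs)  # 13002
```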
To find good numbers, we needed a way to measure wrongness, so we defined a cost function: one number for how far the network’s outputs are from the answers we wanted, averaged over the training set. Learning, we said, is just making that number small. That was lesson 5.
We pictured the cost as a landscape over the space of all possible knob settings, where height is cost, and the goal is to reach a low valley. That was lesson 6.
To get downhill, we follow the negative gradient, taking small steps in the steepest-downhill direction, over and over. That is gradient descent, the algorithm. That was lesson 7.
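Here is a small sketch of that downhill walk, on a made-up bowl-shaped cost with just two knobs instead of thousands. The cost function, the starting point, and the step size are all hypothetical choices for illustration; the point is only the repeated "step against the gradient" pattern.

```python
def cost(x, y):
    """A made-up bowl whose lowest point sits at x = 3, y = -1."""
    return (x - 3) ** 2 + (y + 1) ** 2

def gradient(x, y):
    """Slope of the bowl in each direction (the uphill direction)."""
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0      # arbitrary starting spot on the landscape
learning_rate = 0.1  # how big each step is (illustrative value)

for step in range(100):
    grad_x, grad_y = gradient(x, y)
    # Step against the gradient, which is the steepest-downhill direction.
    x -= learning_rate * grad_x
    y -= learning_rate * grad_y

print(round(x, 4), round(y, 4), cost(x, y))
```

After a hundred small steps the two knobs sit very close to the bottom of the bowl, and the cost is close to zero.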
And to compute the downhill direction efficiently, we use backpropagation: each output neuron’s desire to change ripples backward through the layers, and the whole gradient falls out of a single backward sweep. That was lesson 8’s intuition and lesson 9’s chain-rule arithmetic.
Put the last pieces in motion and you get the training loop: forward pass to get an output, compute the cost, backward pass to get the gradient, nudge every knob downhill, then do it again with the next image. That loop, run enough times, is how a pile of random numbers becomes a network that recognizes handwriting.
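To see the loop as something you could run, here is a minimal sketch with a "network" of exactly one neuron and four made-up examples. Everything specific in it is an assumption for illustration: the sigmoid squish, the squared-difference cost, the toy data, the step size, and the number of passes. The shape of the loop (forward, cost, backward, nudge, repeat) is the part that carries over.

```python
import math
import random

def squish(z):
    """Assumed squish: the sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (made up): the target is 1 for bright inputs, 0 for dark ones.
examples = [(0.0, 0.0), (0.2, 0.0), (0.8, 1.0), (1.0, 1.0)]

random.seed(0)                  # fixed seed so the run is repeatable
weight = random.uniform(-1, 1)  # start from random knob settings
bias = random.uniform(-1, 1)
learning_rate = 1.0             # illustrative step size

def average_cost(w, b):
    """Squared difference between output and target, averaged over the data."""
    return sum((squish(w * x + b) - t) ** 2 for x, t in examples) / len(examples)

cost_before = average_cost(weight, bias)

for epoch in range(2000):              # one epoch = one pass over all examples
    for x, target in examples:
        # Forward pass: compute the neuron's output.
        a = squish(weight * x + bias)
        # Backward pass: chain rule for the cost (a - target) ** 2.
        dcost_da = 2 * (a - target)
        da_dz = a * (1 - a)            # slope of the sigmoid at this point
        grad_w = dcost_da * da_dz * x
        grad_b = dcost_da * da_dz
        # Update: nudge each knob a little in its downhill direction.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b

cost_after = average_cost(weight, bias)
print(cost_before, cost_after)
```

The second printed number is smaller than the first. A real network does the same thing with thousands of knobs and a backward pass that runs through several layers.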
One training step, start to finish
Let us watch a single turn of that loop on the very digit we opened with: a messy handwritten 3.
The image enters as 784 brightness numbers, filling the input layer. The forward pass runs: layer by layer, each neuron takes the previous activations, does its weighted-sum-plus-bias-plus-squish, and passes its number forward, until the ten output neurons light up. Say they come out like this:
digit:   0    1     2    3    4    5    ...
output:  0.1  0.05  0.0  0.2  0.5  0.0  ...

The network’s tallest output is the “4” neuron at 0.5, so right now it thinks this 3 is a 4. It is wrong. We compute the cost against the answer we wanted (a 1 in the “3” slot, 0 everywhere else), and it comes out high, around 0.90. The network is being told, in one number, that it did badly.
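You can check that cost by hand. The sketch below assumes the cost for one image is the sum of squared differences between each output and the value we wanted, which this lesson does not spell out. It uses only the six outputs shown; the elided ones are left out rather than guessed, so this is a partial sum.

```python
# Only the six outputs shown in the text; the elided ones are not guessed.
shown_outputs = [0.1, 0.05, 0.0, 0.2, 0.5, 0.0]
correct_digit = 3

# What we wanted: a 1 in the "3" slot, 0 everywhere else.
wanted = [1.0 if digit == correct_digit else 0.0
          for digit in range(len(shown_outputs))]

# Assumed cost: sum of squared differences between output and wanted value.
partial_cost = sum((out - want) ** 2
                   for out, want in zip(shown_outputs, wanted))

# The network's guess is the digit with the tallest output.
guess = max(range(len(shown_outputs)), key=lambda d: shown_outputs[d])

print(guess, round(partial_cost, 4))  # 4 0.9025
```

Most of that total comes from two places: the "3" neuron sitting at 0.2 when we wanted 1, and the "4" neuron sitting at 0.5 when we wanted 0.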
Now the backward pass. Backpropagation sweeps from the output back to the front, and for every one of the roughly 13,000 weights and biases it computes the same thing: which way, and how much, should this knob change to make the cost smaller? The “3” output neuron wants to be higher, the “4” wants to be lower, and those desires propagate back into a precise nudge for every knob in the network.
Then the update: each knob takes one small step in its downhill direction. None of them moves far. After this single step, the network is only very slightly less wrong about this one 3. That is all one step does.
But now do it again, with a different image, and again, and again, across the whole training set. One full pass through all the training images is called an epoch, and training runs for many epochs. Step by tiny step, averaged across thousands of examples, the knobs settle into values that work, not just for this 3 but for threes and sevens and every digit it was shown. The pile of random numbers becomes a digit reader. Nothing magic happened. A number went down, many times.
What we did not cover
This track was the foundation, and a foundation is honest about being one. Plenty sits on top of what you now understand, waiting in other tracks or further reading:
- Specialized architectures. The network here was fully connected, the simplest kind. Real systems use structures tuned to their data: convolutional networks for images, and transformers for language, which Track 5 covers in depth. They are not new first principles; they are clever arrangements of the same neurons, weights, and backprop.
- Smarter optimizers. We used plain gradient descent. In practice, methods like momentum and Adam train faster: momentum carries some of each step's direction into the next, and Adam also adapts the step size for each knob. Same downhill idea, better footwork.
- Training niceties. Regularization, dropout, and batch normalization are techniques that help big networks train well and generalize. They tune the process; they do not change the story.
- Working with trained networks. Fine-tuning and transfer learning reuse an already-trained network for a new task, which is how most practical AI is built today.
- Actual code. We stayed in intuition and arithmetic. Building a working network in real code is its own satisfying step.
None of these are over your head now. Each one is a refinement of, or a structure built on, the machinery you just learned.
Where to go next
Three honest paths, depending on what you are itching to do.
If you want to build it yourself, go to Track 13 (Build Neural Networks from Scratch). This is the natural next move if the arithmetic in lessons 7 and 9 made you want to type it into a computer and watch it learn. That track builds a working network in Python from first principles, and it maps almost one to one onto what you just learned: the gradient descent and backprop of this track, written as code you run.
If you want to understand modern AI models, go to Track 5 (AI Foundations). This is the path if your real question all along was “so how do today’s large language models actually work?” It covers transformers and the machinery behind large language models. A transformer is a particular kind of neural network, and every foundation from this track, the forward pass, the cost, gradient descent, backprop, carries straight over. Track 5 adds the specific architecture that makes language models work.
If you want to use AI to build things, look toward Track 20 (AI Agents and Tool Use). This is for the reader who now understands what is under the hood and wants to drive. Agents are built on top of trained networks, wiring them up to take actions and use tools. A different altitude, resting on the same foundation you now have.
Why this matters when you use AI
You came into this track able to say “neural network” in a sentence without being able to picture one. That is most people, including plenty who work near this technology. You leave able to picture it: layers of numbers, a function with thousands of knobs, trained by walking downhill on a cost landscape using gradients that backpropagation computes in a single backward sweep.
That picture is worth more than trivia. It is what lets you read AI news without being dazzled or frightened by it, judge a confident claim about what a model “knows,” understand why these systems are brilliant at fuzzy pattern tasks and brittle at the edges, and ask sharper questions about any AI tool you are handed. You are no longer on the outside of the box looking at the label. You have seen the gears.
The one picture to keep
If you forget every detail of this track, keep this single image, because everything else can be rebuilt from it. A neural network is a long row of dials, and behind the dials is a landscape. Where the dials currently sit puts you at some spot on that landscape, and the height where you stand is how wrong the network is. Training is nothing more than feeling which way is downhill from where you stand, turning all the dials a hair in that direction, and doing it again and again until you settle into a low valley. The forward pass is how you read your current height; backpropagation is how you feel the slope; gradient descent is the step. That is the whole of it: dials, a landscape, and a patient walk downhill.
What you should remember
- A neural network is a function with many knobs. It maps inputs to outputs through layers of simple neurons, and everything it can do lives in the specific values of its weights and biases.
- Training is minimizing a cost by walking downhill. Measure wrongness (cost), find the downhill direction (gradient, via backpropagation), take a small step (gradient descent), and repeat across many examples and epochs.
- Every piece is simple; the power is in the scale and the tuning. One neuron is multiply-add-squish. One step barely moves anything. In today's largest models, billions of simple pieces and millions of tiny steps are where the capability comes from.
- This is the foundation, not the ceiling. Architectures, optimizers, and applications build on exactly what you now hold. Track 13 builds it in code, Track 5 reaches transformers, Track 20 reaches agents.
Nine lessons ago, a handwritten 3 was a small mystery you could solve instantly but not explain. Now you can explain it, all the way down to the arithmetic, and you know how a machine learns to do the same. That is the whole of this track, and it is yours to build on.