Practice: Seeing it whole, and where next

Self-check

Six short questions, this time spanning the whole track. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.

1. In one breath, what is a neural network, and where does its capability live?

Show answer

A function that turns inputs (784 numbers for a digit image) into outputs (10 scores), built from layers of simple neurons, each computing a weighted sum plus a bias passed through a squish. Its capability lives entirely in the specific values of its weights and biases (about 13,000 for the digit network, billions for a modern model), not in the structure or the formula.
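To make "a function built from layers of simple neurons" concrete, here is a minimal sketch of a forward pass through a 784-16-16-10 network in plain Python. The random starting values, the sigmoid squish, and the names make_layer and forward are assumptions chosen for illustration, not something the track prescribes.

```python
import math
import random

def sigmoid(z: float) -> float:
    """The 'squish': maps any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in: int, n_out: int, rng: random.Random):
    """One layer: n_out neurons, each with n_in weights and one bias (random values)."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return weights, biases

def forward(layers, inputs):
    """Feed the inputs through every layer in turn and return the final activations."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

rng = random.Random(0)
sizes = [784, 16, 16, 10]
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

image = [rng.random() for _ in range(784)]  # stand-in for 784 brightness values
scores = forward(layers, image)             # 10 scores, one per digit
print(len(scores))
```

With random weights the 10 scores are meaningless. The structure is already complete, though, which is the point of the answer above: only the values of the weights and biases separate this from a working digit reader.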

2. List the four phases of one training step, in order.

Show answer

(1) Forward pass: run the image through the network to get its 10 outputs. (2) Cost: compare the outputs to the desired answer, getting one wrongness number. (3) Backward pass (backpropagation): sweep backward to find every knob’s slope (the gradient), which tells you which way is downhill for that knob. (4) Update (gradient descent): step every knob a hair in its downhill direction. Then repeat with the next image.
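The four phases can be sketched in a few lines for the smallest possible case, a single sigmoid neuron. This is an illustration under assumptions: the squared-error cost, the learning rate of 0.5, the example numbers, and the name training_step are all made up here, and a real digit network repeats the same pattern across all of its layers.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def training_step(w, b, x, target, learning_rate=0.5):
    """One training step for a single sigmoid neuron. Returns (new_w, new_b, cost)."""
    # (1) Forward pass: weighted sum, plus bias, squished.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)

    # (2) Cost: one wrongness number (squared error, an assumed choice).
    cost = (a - target) ** 2

    # (3) Backward pass: chain rule from the cost back to each knob.
    dcost_da = 2.0 * (a - target)
    da_dz = a * (1.0 - a)            # derivative of the sigmoid
    dcost_dz = dcost_da * da_dz
    grad_w = [dcost_dz * xi for xi in x]
    grad_b = dcost_dz

    # (4) Update: step every knob a little against its slope.
    new_w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
    new_b = b - learning_rate * grad_b
    return new_w, new_b, cost

w, b = [0.1, -0.2, 0.05], 0.0
x, target = [0.9, 0.1, 0.4], 1.0

w, b, cost_before = training_step(w, b, x, target)
_, _, cost_after = training_step(w, b, x, target)  # cost measured after one update
print(cost_before, cost_after)
```

The second call is there only to read off the cost after the first update, which comes out slightly lower than before.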

3. What does a single training step accomplish, and why does training need so many?

Show answer

A single step makes the network only very slightly less wrong, on essentially one image. Training needs many steps (across many images, over many epochs) because the knobs settle into good values only by averaging tiny nudges over the whole training set, so the consistent patterns survive and one-image quirks cancel. A pile of random numbers becomes a digit reader one hair-sized step at a time.

4. What is an epoch?

Show answer

One full pass through all the training images. Training runs for many epochs, repeating the forward-cost-backward-update loop across the whole dataset over and over until the cost stops dropping much.
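As a rough sketch of what "many epochs" looks like in code, the loop below fits a single knob w to a toy dataset. The data (points on the line y = 2x), the learning rate, and the name train are hypothetical, chosen only to keep the example tiny.

```python
# Hypothetical toy data: every target is exactly twice its input.
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def train(epochs: int, learning_rate: float = 0.05):
    """Fit prediction = w * x, returning the final w and the average cost of each epoch."""
    w = 0.0
    cost_per_epoch = []
    for _ in range(epochs):
        total = 0.0
        for x, y in training_set:                 # one epoch = one pass over every example
            error = w * x - y                     # forward pass and error
            total += error ** 2                   # squared-error cost for this example
            w -= learning_rate * 2.0 * error * x  # slope of the cost, then update
        cost_per_epoch.append(total / len(training_set))
    return w, cost_per_epoch

w, costs = train(epochs=20)
print(w, costs[0], costs[-1])
```

The inner loop is one epoch and the outer loop repeats it. In practice you would stop once the per-epoch cost stops dropping much, rather than after a fixed count.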

5. The “one picture to keep” is a row of dials and a landscape. Map each part of training onto it.

Show answer

The dials are the weights and biases; where the dials sit is your current position on the landscape; your height there is the cost (how wrong you are). The forward pass reads your current height, backpropagation feels which way is downhill (the slope), and gradient descent takes the step. Training is feeling downhill and turning every dial a hair that way, again and again, until you settle in a low valley.

6. Name two things this track did not cover, and where they fit relative to what you learned.

Show answer

Any two of: specialized architectures (convolutional nets, transformers), smarter optimizers (momentum, Adam), training niceties (regularization, dropout, batch norm), working with trained networks (fine-tuning, transfer learning), or actual code. None of these are new first principles; they are refinements of, or structures built on, the same neurons-weights-cost-gradient-backprop machinery you now hold.

Try it yourself, part 1: trace one full training step

About 5 minutes, no calculation. The network is shown an image that is actually a 7, but its tallest output right now is the “1” neuron, so it currently calls the image a 1. Walk through one training step: for each of the four phases, say what happens and what it produces.

Show answer

Forward pass. The 7 image enters as 784 brightness numbers in the input layer. Each layer computes its activations (weighted sum, plus bias, squished) and passes them forward, until the ten output neurons hold their scores. Produces: the output, with the “1” neuron tallest (a wrong guess).
Cost. Compare the output to the desired answer (a 1 in the “7” slot, 0 elsewhere). Since the “7” neuron is too low and the “1” neuron is too high, the cost comes out high. Produces: one wrongness number.
Backward pass (backpropagation). Starting from the output’s desires (the “7” wants up, the “1” wants down), sweep backward through the layers, computing for every one of the ~13,000 weights and biases which way and how much it should change. Produces: the gradient.
Update (gradient descent). Nudge every knob one small step in its downhill direction. Produces: a network that is very slightly less wrong about this 7. Then move to the next image and repeat.
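If you want to see the cost and the output "desires" from this trace as numbers, here is a small sketch. The ten output values are invented for illustration, and the squared-error cost is an assumed choice.

```python
# Hypothetical output activations for digits 0..9; the "1" neuron is tallest.
output = [0.05, 0.80, 0.10, 0.02, 0.03, 0.04, 0.01, 0.30, 0.06, 0.02]

# Desired answer for a 7: a 1 in the "7" slot, 0 elsewhere.
target = [1.0 if digit == 7 else 0.0 for digit in range(10)]

# The network's current guess is the digit with the tallest output.
guess = max(range(10), key=lambda d: output[d])

# Cost: one wrongness number (sum of squared differences, an assumed choice).
cost = sum((o - t) ** 2 for o, t in zip(output, target))

# Slope of the cost with respect to each output.
# Negative slope: raising that output lowers the cost, so it "wants up".
slopes = [2.0 * (o - t) for o, t in zip(output, target)]
desires = ["up" if s < 0 else "down" for s in slopes]

print(guess, desires[7], desires[1])
```

These output-layer slopes are only the starting point of the backward pass. Backpropagation carries them back through the earlier layers to reach every weight and bias.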

Try it yourself, part 2: map the metaphor

About 3 minutes. Match each piece of the “dials and a landscape” picture (left) to the technical thing it stands for (right).

Metaphor                          Technical counterpart
A. The dials                      1. The cost (how wrong the network is)
B. Where the dials currently sit  2. Backpropagation
C. Your height on the landscape   3. The weights and biases
D. Feeling which way is downhill  4. A gradient-descent step
E. Turning every dial a hair      5. The current setting of all the parameters

Show answer

A → 3. The dials are the weights and biases (the knobs you can turn).
B → 5. Where the dials sit is the current setting of all the parameters, your position on the landscape.
C → 1. Your height is the cost: how wrong the network is at that setting.
D → 2. Feeling which way is downhill is backpropagation computing the gradient (the slope).
E → 4. Turning every dial a hair downhill is one gradient-descent step.

Put together: the forward pass reads your height, backprop feels the slope, gradient descent takes the step, and repeating it is the whole of training. If you keep one thing from this track, keep this picture; everything else can be rebuilt from it.

Flashcards

Ten cards spanning the whole track.

Q. What is a neural network, in one sentence?

A function that maps inputs to outputs through layers of simple neurons (each computing weighted sum + bias + squish), whose entire capability lives in the specific values of its weights and biases.

Q. What is a neuron, and what does it compute?

A container holding one number between 0 and 1 (its activation). It computes that number as a weighted sum of the previous layer’s activations, plus a bias, passed through an activation function (the squish).

Q. What are a network's parameters, and how many in the digit network?

Its weights and biases, all counted together. The small 784-16-16-10 digit network has exactly 13,002 (about 13,000); modern networks have billions. Behavior lives in these values.
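The count for the 784-16-16-10 network can be checked with a few lines of arithmetic: every neuron after the input layer has one weight per neuron in the previous layer, plus one bias.

```python
layer_sizes = [784, 16, 16, 10]

# One weight for every connection between consecutive layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# One bias for every neuron outside the input layer.
biases = sum(layer_sizes[1:])

total = weights + biases
print(weights, biases, total)  # 12960 42 13002
```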

Q. What is the cost function, and what is learning?

The cost is one number for how wrong the network is, averaged over the training set. Learning is finding the weights and biases that make the cost as small as possible, which makes it an optimization problem.

Q. What is gradient descent?

Repeatedly stepping every knob a little in the negative-gradient (steepest-downhill) direction to lower the cost. new = old − learning_rate × slope, applied to all knobs, many times.
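Here is the update rule applied to a toy cost with just two knobs. The bowl-shaped cost function, its low point at (3, -1), and the learning rate are made-up assumptions; a real network's cost comes from its training data and has thousands of knobs instead of two.

```python
def cost(w1: float, w2: float) -> float:
    """A hypothetical bowl-shaped cost whose lowest point is at w1 = 3, w2 = -1."""
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def slopes(w1: float, w2: float):
    """The gradient: the slope of the cost along each knob."""
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)

w1, w2 = 0.0, 0.0
learning_rate = 0.1
history = [cost(w1, w2)]

for _ in range(50):
    s1, s2 = slopes(w1, w2)
    # new = old - learning_rate * slope, applied to every knob
    w1 = w1 - learning_rate * s1
    w2 = w2 - learning_rate * s2
    history.append(cost(w1, w2))

print(w1, w2, history[-1])
```

Both knobs move on every step, each by an amount proportional to its own slope, so the steps shrink as the bottom of the bowl gets closer.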

Q. What is backpropagation?

The method that computes the gradient: each output neuron’s desire propagates backward through the layers, and one backward sweep yields every knob’s slope. Underneath, it is the chain rule run backward.
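A minimal sketch of "the chain rule run backward", using a chain of two sigmoid neurons (four knobs in total). The parameter values, the squared-error cost, and the function names are assumptions for illustration. The numerical check at the end is not part of backpropagation; it is a slow but simple way to confirm that the backward sweep produced the right slopes.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Two neurons in a chain: x -> a1 -> a2."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2

def cost(params, x, target):
    _, a2 = forward(params, x)
    return (a2 - target) ** 2

def backprop(params, x, target):
    """One backward sweep that yields the slope for every knob."""
    w1, b1, w2, b2 = params
    a1, a2 = forward(params, x)
    # Output neuron: how the cost changes with its weighted sum.
    d_z2 = 2.0 * (a2 - target) * a2 * (1.0 - a2)
    grad_w2 = d_z2 * a1
    grad_b2 = d_z2
    # Pass the desire backward through w2 to the first neuron.
    d_a1 = d_z2 * w2
    d_z1 = d_a1 * a1 * (1.0 - a1)
    grad_w1 = d_z1 * x
    grad_b1 = d_z1
    return [grad_w1, grad_b1, grad_w2, grad_b2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Check only: wiggle each knob slightly and measure how the cost changes."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(up, x, target) - cost(down, x, target)) / (2.0 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]
analytic = backprop(params, 1.5, 1.0)
numeric = numerical_gradient(params, 1.5, 1.0)
print(analytic)
print(numeric)
```

The wiggle method needs two forward passes per knob, while the backward sweep gets every slope at once. That efficiency is why backpropagation is used for networks with thousands or billions of knobs.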

Q. What are the four phases of one training step?

Forward pass (get the output), cost (measure wrongness), backward pass (backprop to get the gradient), update (gradient-descent step on every knob). Then repeat with the next image.

Q. What is an epoch?

One full pass through all the training images. Training runs for many epochs, repeating the forward-cost-backward-update loop across the whole dataset until the cost stops dropping much.

Q. The one picture to keep?

A row of dials and a landscape behind them. The dials are the weights and biases; your height is the cost; training is feeling downhill (backprop), turning every dial a hair that way (gradient descent), and repeating until you settle in a low valley.

Q. Where do you go next after this track?

Track 13 (Build Neural Networks from Scratch) to build it in code; Track 5 (AI Foundations) for transformers and LLMs; Track 20 (AI Agents and Tool Use) to use trained networks to build things. All rest on this foundation.