Lesson: The whole network as one function

Back in lesson 1, we set out to find a function: something that takes 784 pixel-brightness numbers in and gives 10 scores out, one per digit. Then we got busy with parts. Lesson 2 showed the parts are arranged in layers of neurons. Lesson 3 showed how each neuron computes its number from the activations of the layer before it. We have been staring at the gears. This lesson steps back so you can see the whole machine, and the whole machine turns out to be exactly the function we promised.

Here is the claim, stated plainly: the entire neural network is one big function. You feed it 784 numbers and it returns 10 numbers. Everything in between, all the layers, all the neurons, all the weighted sums and squishes, is the inner workings of that single function.

That is not a metaphor or a simplification. It is literally true. A function is any rule that takes inputs and gives back outputs, the same outputs every time for the same inputs, and that is precisely what the network does. It looks complicated only because this particular function has a lot of internal steps. From the outside, it is as ordinary as any function: numbers in, numbers out.

Watching the function run: the forward pass

The process of feeding an input through the network to get an output has a name: the forward pass. It is just lesson 3’s neuron formula, applied layer by layer, with each layer’s outputs becoming the next layer’s inputs.

Let us run a complete one by hand. A full 784-to-10 network would bury us in arithmetic, so here is a miniature with the same shape: 3 inputs, one hidden layer of 2 neurons, and 2 outputs. We will use ReLU (the lesson-3 squish that zeroes out anything negative and passes positives through unchanged) because it keeps the numbers clean.

The input is three numbers: 1.0, 0.5, and 0.0.

Hidden layer. Two neurons, each with its own weights and bias, each running the lesson-3 formula on the three inputs.

h1: weights [0.5, -0.4, 0.2], bias 0.1
    sum = 0.5·1.0 + (-0.4)·0.5 + 0.2·0.0 + 0.1 = 0.4
    ReLU(0.4) = 0.4
h2: weights [-0.3, 0.8, 0.5], bias -0.2
    sum = -0.3·1.0 + 0.8·0.5 + 0.5·0.0 - 0.2 = -0.1
    ReLU(-0.1) = 0.0 (negative, so squished to zero)

So the hidden layer’s activations are 0.4 and 0.0. Notice h2 went completely dark; its weighted sum landed negative and ReLU clamped it to zero. That is normal.

Output layer. Two neurons, now reading the hidden layer’s 0.4 and 0.0 as their inputs.

o1: weights [0.6, 0.9], bias 0.0
    sum = 0.6·0.4 + 0.9·0.0 + 0.0 = 0.24
    ReLU(0.24) = 0.24
o2: weights [-0.5, 0.3], bias 0.05
    sum = -0.5·0.4 + 0.3·0.0 + 0.05 = -0.15
    ReLU(-0.15) = 0.0

The output is 0.24 and 0.0. The first output neuron has the larger score, so this little network’s answer is “the first class.” And that is a full forward pass: we turned an input into an output by doing nothing but the lesson-3 computation, once per neuron, one layer after another. Scale this up to 784 inputs, two hidden layers of 16, and 10 outputs, and it is the identical procedure, just with more multiply-add-squash steps.
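The hand calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a real library: the helper names `relu` and `layer` are made up for this example, and the weights, biases, and input are the ones from the miniature network.

```python
def relu(z):
    """ReLU: negatives become 0.0, positives pass through unchanged."""
    return max(0.0, z)


def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs, adds its bias, then applies ReLU."""
    return [
        relu(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


x = [1.0, 0.5, 0.0]

hidden_weights = [[0.5, -0.4, 0.2], [-0.3, 0.8, 0.5]]  # h1, h2
hidden_biases = [0.1, -0.2]
output_weights = [[0.6, 0.9], [-0.5, 0.3]]             # o1, o2
output_biases = [0.0, 0.05]

# The forward pass: the hidden layer's outputs become the output layer's inputs.
hidden = layer(x, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)
winner = output.index(max(output))  # position of the largest score

print([round(v, 2) for v in hidden])  # [0.4, 0.0]
print([round(v, 2) for v in output])  # [0.24, 0.0]
print(winner)                         # 0, i.e. the first class
```

The rounding is only there to hide floating-point dust; the underlying values match the hand-computed ones to many decimal places.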

Now the key idea that ties the track together. A working network has two very different kinds of numbers flowing around in it, and keeping them straight is the whole point of this lesson.

  • The input changes every time you use the network. Feed it a different image and the 784 input numbers are different.
  • The weights and biases do not change as you use the network. They are fixed numbers baked into the function, the same for every image you show it.

Mathematicians have a compact way to write this. They treat the whole network as a single function with two kinds of inputs, the image you feed in plus the fixed weights and biases that define the network:

f(x; w, b)

Reading that left to right: f is the network, x is the input (the 784 pixel values), and w and b together are the whole collection of weights and biases. The semicolon is doing real work. It separates the thing that varies from one use to the next, the input image, from the thing that defines which network this is, the weights and biases.

Change the input and the output changes because you showed it a different picture. Change a weight or a bias and the output changes because you are now holding a different network. That second kind of change is the one to sit with: the architecture (784, 16, 16, 10) is just a skeleton, and the weights and biases are what turn that skeleton into a specific, behaving thing.
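One way to make the two kinds of change concrete is to write f so that the input and the parameters are separate arguments. The sketch below reuses the miniature 3-2-2 network from the worked example; the function name `f`, the nested-list layout of `w` and `b`, and the second input are assumptions made for illustration.

```python
import copy
from functools import partial


def relu(z):
    return max(0.0, z)


def f(x, w, b):
    """Evaluate the network on input x with parameters (w, b).

    w[k][j] is the weight list of neuron j in layer k; b[k][j] is that neuron's bias.
    """
    activations = x
    for layer_w, layer_b in zip(w, b):
        activations = [
            relu(sum(wi * a for wi, a in zip(neuron_w, activations)) + bias)
            for neuron_w, bias in zip(layer_w, layer_b)
        ]
    return activations


# Parameters of the miniature network from the worked example.
w = [[[0.5, -0.4, 0.2], [-0.3, 0.8, 0.5]], [[0.6, 0.9], [-0.5, 0.3]]]
b = [[0.1, -0.2], [0.0, 0.05]]

# Fixing w and b picks out one specific network; only x is left to vary.
network = partial(f, w=w, b=b)

out_same_net_input_a = network([1.0, 0.5, 0.0])
out_same_net_input_b = network([0.0, 1.0, 1.0])  # a different "picture"

# Change one bias (h2: -0.2 -> 0.2) and you are holding a different network.
b_changed = copy.deepcopy(b)
b_changed[0][1] = 0.2
other_network = partial(f, w=w, b=b_changed)
out_other_net_input_a = other_network([1.0, 0.5, 0.0])

print(out_same_net_input_a)
print(out_same_net_input_b)
print(out_other_net_input_a)
```

The first two results differ because the input changed; the first and third differ because the parameters changed while the input stayed the same.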

To feel how completely the parameters control the network, imagine the same 784-16-16-10 skeleton with three different sets of weights.

All zeros. Set every weight and bias to 0. Now every weighted sum is zero regardless of the input, so every neuron in a layer produces the identical activation (whatever your squish returns for zero). That sameness cascades forward and leaves the network with one fixed output no matter what image you feed in. A 3, a 7, or a photo of your cat all get the same dead answer. Same skeleton, useless function.

Random values. Fill the weights and biases with random numbers. Now the network produces outputs, but they are noise: it “guesses” digits with no relationship to the actual image. Same skeleton, still useless, just noisily so.

Well-tuned values. Set the weights and biases to the right numbers and the very same skeleton reliably reads handwritten digits. Same skeleton, finally a useful function.

The skeleton never changed. Only the parameters did. Everything the network can or cannot do is sitting in those numbers.
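The first two settings are easy to try for yourself. The following is a rough sketch under stated assumptions: ReLU as the squish, random values drawn uniformly from -1 to 1, and random pixel lists standing in for real images. The helper names `make_params` and `forward` are invented for this example. Well-tuned values are not shown because they have to be found by training, which has not been covered yet.

```python
import random

SIZES = [784, 16, 16, 10]  # the skeleton: layer widths from input to output


def relu(z):
    return max(0.0, z)


def make_params(sizes, init):
    """Build one (weights, biases) pair per layer; init() supplies each number."""
    return [
        ([[init() for _ in range(n_in)] for _ in range(n_out)],
         [init() for _ in range(n_out)])
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(x, params):
    """Forward pass through every layer, ReLU after each weighted sum."""
    activations = x
    for weights, biases in params:
        activations = [
            relu(sum(w * a for w, a in zip(neuron_w, activations)) + bias)
            for neuron_w, bias in zip(weights, biases)
        ]
    return activations


rng = random.Random(0)  # seeded so the run is repeatable

zero_params = make_params(SIZES, lambda: 0.0)
random_params = make_params(SIZES, lambda: rng.uniform(-1.0, 1.0))

# Two stand-in "images": 784 brightness values each, not real digits.
image_a = [rng.random() for _ in range(784)]
image_b = [rng.random() for _ in range(784)]

# All zeros: both images get the identical output.
print(forward(image_a, zero_params) == forward(image_b, zero_params))  # True

# Random values: outputs depend on the image, but carry no meaning about it.
print(forward(image_a, random_params))
print(forward(image_b, random_params))
```

Both parameter sets were built from the same `SIZES` list, so the only difference between the two networks is the numbers filled into it.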

It helps to picture the choice in front of you. Every possible setting of the network’s roughly 13,000 weights and biases is one specific function, one specific network. The all-zeros setting is one point in that vast space of possibilities; each random setting is another; the well-tuned setting we want is somewhere in there too. So building a working digit-recognizer is really a search problem: out of an unimaginably large space of possible parameter settings, find one that makes the function behave. That reframe, from “how do neurons work” to “which point in parameter space do we want,” is the bridge into the rest of the track.
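The “roughly 13,000” figure can be checked by counting: each layer has one weight per connection to the previous layer, plus one bias per neuron. A quick sketch, with variable names chosen just for this example:

```python
sizes = [784, 16, 16, 10]  # layer widths of the digit network

# One weight for every (previous-layer neuron, this-layer neuron) pair.
weight_count = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))

# One bias for every neuron outside the input layer.
bias_count = sum(sizes[1:])

total = weight_count + bias_count
print(weight_count, bias_count, total)  # 12960 42 13002
```

Each of those 13,002 numbers is one coordinate of the point in parameter space.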

This is the honest payoff of the whole chapter. A trained digit-recognizer’s ability to recognize digits is not in the idea of layers, and not in the neuron formula, both of which are identical between the useless network and the useful one. It is entirely in the specific values of its roughly 13,000 weights and biases. For a modern model, it is in the specific values of billions of them.

There is no understanding in there, no awareness, no little reader looking at the picture. “Neural network” is just a name for a particular kind of mathematical function: many layers of weighted sums and squishes, parameterized by a great many numbers. When those numbers are set well, the function gives useful answers. That is the entire trick, and it is enough to power most of modern AI.

Which leaves exactly one question, the one this whole chapter has been walking toward: nobody sits down and types 13,000 numbers by hand, never mind billions. So how do the right values get found?

The single most useful mental model you can carry out of this chapter is this: an AI model is a function, and its behavior is fixed by its parameters. That reframing cuts through a lot of confusion.

It explains why a model gives the same answer to the same input every time its settings are held steady; the function has not changed. It explains why “fine-tuning” a model means adjusting its parameters rather than teaching it in any human sense. It explains why two models can share an architecture yet behave nothing alike: different parameters, different function. And it explains why a model has no thoughts about your question. It is evaluating a function, the way a calculator evaluates one, just with billions of internal numbers instead of a few. Holding that picture keeps you grounded about what these systems are and are not.

Thinking the network “decides” or “understands.” It evaluates a function. Numbers in, numbers out, by fixed arithmetic. There is no comprehension step hidden anywhere in the layers.

Confusing the input with the parameters. The input changes every time you use the network. The weights and biases are fixed and define the network. Mixing these up is the most common confusion in the whole chapter.

Thinking the architecture is what makes a network smart. The architecture is a skeleton. The all-zeros and the well-trained networks have the same architecture and behave completely differently. The smarts are in the parameter values.

Expecting something more than arithmetic. The forward pass is multiply, add, squash, repeated. No step is more mysterious than the one you did by hand above. Scale is the only thing that changes.

  • The whole network is one function from 784 numbers to 10. The layers and neurons are its inner workings; from outside it is just numbers in, numbers out.
  • Running it is the forward pass: apply the lesson-3 neuron formula layer by layer, each layer’s outputs feeding the next, until the output layer’s activations are your answer.
  • The parameters are the function. Written compactly as f(x; w, b), a function of the input given the weights and biases: the input varies per use; the weights and biases are fixed and define which network you have. Same skeleton plus different parameters equals different behavior.
  • All the capability lives in the specific parameter values, about 13,000 for the digit network and billions for a modern model. Not in the structure, not in the formula, just the numbers.

A neural network is a function whose behavior is written entirely in its numbers. Set the numbers well and it works; set them badly and it does not. Everything else is just bookkeeping.

Next: the cheatsheet puts the forward pass and the function framing on one page. Then lesson 5 takes on the question this whole chapter has been circling. How do you find the right thirteen thousand numbers? That search has a name, and it is called learning.