Lesson: The handwritten-digit problem

Look at a few handwritten threes. One is round and careful. One is sharp and slanted. One was dashed off so fast the top loop barely closes. Your eyes read all of them as “3” before you have even finished noticing they are different shapes. You did not think about it. You did not run down a checklist. The answer just arrived.

Now try to write down what you did. Not the answer, the method. What exactly makes a shape a 3 and not an 8, or a 5, or a hurried 2? The moment you try to put it into words, the easy thing turns surprisingly hard. That gap, between how effortlessly you recognize a digit and how impossible it is to explain the recognition, is the whole reason neural networks exist. This lesson is about why that gap is the right place to start.

To you, a handwritten digit is a shape. To a computer, it is not a shape at all. It is a grid of pixels, and each pixel is just a number saying how bright that little square is, from 0 for a fully black square up to 1 for a fully white one. A common setup uses a 28 by 28 grid, which is 784 little brightness numbers and nothing more. There is no “curve” and no “loop” in there anywhere. There is only a long list of numbers.

[Figure: A handwritten 3 becomes a 28 by 28 grid of brightness numbers. On the left, a handwritten digit 3 drawn as a single stroke. An arrow labeled "becomes" points right to a 28 by 28 grid of small square cells. The cells trace out the same 3 on a black canvas: cells on the stroke are near white (brightness near 1.0), cells at the stroke edges are mid-gray (around 0.7), and background cells are black (0.0). Three sample cells are outlined and labeled with their brightness values, illustrating that to a computer the digit is just 28 x 28 = 784 numbers, one per cell.]
To you, a 3 is a shape. To a computer it is this: a grid of cells, each holding one number for how bright it is, 0.0 for black up to 1.0 for a fully lit white pixel. There are no curves or loops in there anywhere, only 784 brightness numbers. Shift the same 3 a few cells over and the bright numbers land in entirely different positions, which is what makes the digit so hard to pin down with rules.
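To make the grid concrete, here is a minimal sketch in plain Python. The stroke it draws is a made-up vertical bar, not a real digit, and the names `SIZE`, `image`, and `pixels` are chosen only for illustration. The brightness values 1.0, 0.7, and 0.0 follow the figure above.

```python
SIZE = 28

# Start with a black canvas: 28 rows of 28 brightness values, all 0.0.
image = [[0.0 for _ in range(SIZE)] for _ in range(SIZE)]

# Light up a short vertical stroke (an illustrative stand-in for handwriting).
for row in range(8, 20):
    image[row][14] = 1.0  # centre of the stroke: fully bright
    image[row][13] = 0.7  # edge of the stroke: mid-gray

# Flatten row by row into the one long list the computer works with.
pixels = [value for row in image for value in row]

print(len(pixels))            # 784
print(pixels[8 * SIZE + 14])  # 1.0, the cell at row 8, column 14
```

Nothing in `pixels` records that the bright cells form a line. That structure exists only in how the list would be laid back out as a grid.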

Here is what makes that brutal. Your 3 and my 3 land on completely different pixels. Shift the digit a few squares to the right and the bright values move to different positions in the list, so a number-by-number comparison barely matches, even though the digit is obviously the same. Make it bigger, thinner, more slanted, and the list changes again. The thing that is “the same” to your eye is, at the level of raw numbers, wildly different every time.
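The effect of a shift can be checked directly. The sketch below draws the same shape twice, four cells apart, and compares the two lists position by position. The shape is a hypothetical three-cell-wide bar standing in for a digit, and the helper names (`blank`, `draw_bar`, `flatten`) are invented for this example.

```python
SIZE = 28

def blank():
    """Return a 28 by 28 black canvas."""
    return [[0.0] * SIZE for _ in range(SIZE)]

def draw_bar(image, left_col):
    """Draw a 3-cell-wide vertical bar starting at left_col (toy stand-in for a digit)."""
    for row in range(6, 22):
        for col in range(left_col, left_col + 3):
            image[row][col] = 1.0
    return image

def flatten(image):
    """Turn the grid into one list of 784 brightness numbers."""
    return [value for row in image for value in row]

original = flatten(draw_bar(blank(), left_col=10))
shifted = flatten(draw_bar(blank(), left_col=14))  # same shape, 4 cells to the right

lit_original = sum(1 for v in original if v > 0)
lit_shifted = sum(1 for v in shifted if v > 0)
shared_lit = sum(1 for a, b in zip(original, shifted) if a > 0 and b > 0)
changed = sum(1 for a, b in zip(original, shifted) if a != b)

print(lit_original, lit_shifted)  # 48 48
print(shared_lit)                 # 0
print(changed)                    # 96
```

In this toy case the two bars have the same number of lit cells yet share none of them. The background stays black in both, but every position that carried the shape in one list is dark in the other, so a rule keyed to fixed positions sees two unrelated inputs.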

The natural instinct, especially if you write code, is to reach for rules. So let us try. Here is one honest attempt at a rule for the digit 3:

A 3 has two rounded bumps on its right side, stacked one above the other, and an open left side.

For a tidy, upright, textbook 3, that works. Now meet three real handwritten threes.

[Figure: One rule for the digit 3 fails on three real handwritten threes. At the top, the stated rule: two rounded bumps, stacked on the right, open left. Below are three handwritten threes that each break it, with a violet ring highlighting the offending feature on each. Slanted: the two bumps sit off to the side, not cleanly stacked. Flat top: the upper section is more flat than round. Fast: the lower bump is really just a straight flick. The rule that seemed reasonable misses all three.]
One honest rule, three real threes, three misses. You could patch the rule for slant, then for flat tops, then for flicks, but every patch invites a new 3 you did not plan for. The rules pile up and never quite close the gap. That is the signal you are reaching for the wrong tool.

The slanted one has its bumps off to the side, not cleanly stacked. The fast one has a lower “bump” that is really just a straight flick. The careful one has a top that is more flat than round. Your rule, which felt reasonable thirty seconds ago, already misses three out of three real examples.

You could patch it. Add a clause for slant. Add a clause for flat tops. Add a clause for flicks. But every patch invites a new 3 you did not plan for, and the rules pile up without ever quite covering reality. You would be writing rules forever and still run into a 3 that breaks them. This is not a failure of effort. It is a sign you are using the wrong tool.
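The brittleness shows up even in a few lines of code. The sketch below assumes toy 7 by 7 glyphs drawn with `#` for lit cells (hypothetical stand-ins, far coarser than real 28 by 28 handwriting) and a hypothetical `looks_like_three` function that encodes the rule about two stacked bumps and an open left side as literally as it can.

```python
# Toy 7x7 glyphs, "#" = lit cell. Illustrative stand-ins, not real handwriting.
GLYPHS = {
    "tidy": [
        "..###..",
        ".....#.",
        ".....#.",
        "..###..",
        ".....#.",
        ".....#.",
        "..###..",
    ],
    "slanted": [
        "...###.",
        "......#",
        ".....#.",
        "..###..",
        "....#..",
        "...#...",
        "###....",
    ],
    "flat top": [
        ".#####.",
        "....#..",
        "...#...",
        "..###..",
        ".....#.",
        ".....#.",
        "..###..",
    ],
    "fast": [
        "..###..",
        ".....#.",
        ".....#.",
        "..###..",
        "...#...",
        "..#....",
        ".#.....",
    ],
}

def looks_like_three(glyph):
    """Hand-written rule: two bumps stacked on the right, open left side."""
    bump_rows = glyph[1:3] + glyph[4:6]  # rows where the two bumps should sit
    right_edges = [row.rfind("#") for row in bump_rows]
    # "Stacked on the right": every bump row ends in the same right-hand column.
    stacked = right_edges[0] >= 4 and len(set(right_edges)) == 1
    # "Open left": nothing lit in the left three columns of those rows.
    open_left = all("#" not in row[:3] for row in bump_rows)
    return stacked and open_left

results = {name: looks_like_three(glyph) for name, glyph in GLYPHS.items()}
print(results)
# {'tidy': True, 'slanted': False, 'flat top': False, 'fast': False}
```

Each failure could be patched with another clause, but the patches would be tuned to these four glyphs and nothing else.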

Why this is the right problem to learn from

If handwritten digits are so awkward, why is this the problem that nearly every introduction to neural networks opens with? Because it sits in an unusually useful sweet spot.

  • The input is small and fixed. Every image is 784 numbers. Not a paragraph, not a video, just a tidy, predictable list. That keeps the problem small enough to reason about.
  • The output is small too. There are only ten possible answers, the digits 0 through 9. The computer is not writing an essay; it is picking one of ten boxes.
  • It is genuinely hard, but clearly solvable. Rule-writing falls apart, yet a six-year-old reads these digits without breaking stride. When something is effortless for a human but resists every obvious rule, that is a strong hint that a smarter approach exists and is worth finding.
  • The approach travels. Reading a digit, recognizing a face, spotting a tumor on a scan, sorting a photo by what is in it: under the hood, these are the same shape of problem. Numbers in, a label out. Crack handwritten digits and you have a template that scales to all of them.

So the digit problem is not the point. It is the smallest honest example of a much larger pattern, which is exactly what you want when you are learning the idea for the first time.

The shift: stop writing rules, start showing examples

Here is the move that changes everything, and it is more of an attitude than a technique.

Instead of trying to tell the computer what a 3 is, you show it. You gather thousands of images that people have already labeled (this one is a 3, this one is a 7, this one is a 0), and you hand the computer the examples instead of the rules. Then you let it find the pattern on its own. You stop being the author of the answer and become the curator of the examples.

[Figure: Rule-based programming versus learning from examples. A two-column contrast. On the left, rule-based programming: a human writes a growing pile of if/then rules (if two round bumps stacked, return 3; else if slanted; else if flat top; else if fast flick; and on, forever). An arrow leads down to a brittle program marked with a crack, which breaks on the first 3 nobody planned for. The bottom label reads "You describe the answer." On the right, learning from examples: a row of labeled digit images (a 3, a 7, a 0, a 2), standing for thousands of labeled images, feeds an arrow down into a system that shapes its own pattern and improves as it sees more examples. The bottom label reads "You demonstrate the answer."]
The shift that makes modern AI work. Instead of writing the logic for every case, which breaks on the first case nobody anticipated, you hand the computer thousands of labeled examples and let it find the pattern itself. You stop being the author of the answer and become the curator of the examples.
Rule-based programming | Learning from examples
A human writes the logic for every case | A human provides labeled examples
Breaks on the first case nobody anticipated | Improves as it sees more examples
You describe the answer | You demonstrate the answer
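To get a feel for “demonstrating the answer,” here is a deliberately simple sketch. It is not a neural network and not how later lessons build the function. It uses nearest-neighbour matching, which labels a new image with the label of the most similar stored example. The 5 by 5 glyphs and the names `to_pixels`, `distance`, and `predict` are all hypothetical, chosen for illustration.

```python
def to_pixels(rows):
    """Flatten a toy glyph into brightness numbers (1.0 lit, 0.0 dark)."""
    return [1.0 if cell == "#" else 0.0 for row in rows for cell in row]

def distance(a, b):
    """Count how many cells differ between two equally sized pixel lists."""
    return sum(1 for x, y in zip(a, b) if x != y)

def predict(pixels, examples):
    """Return the label of the closest labeled example. No digit-specific rules."""
    _, label = min(examples, key=lambda ex: distance(pixels, ex[0]))
    return label

tidy_three = to_pixels(["####.", "....#", ".###.", "....#", "####."])
tidy_seven = to_pixels(["#####", "....#", "...#.", "..#..", "..#.."])
flat_top_three = to_pixels(["#####", "...#.", "..##.", "....#", "####."])

# A new, hurried 3 with a flat top that nobody has labeled yet.
new_image = to_pixels(["#####", "...#.", "..##.", "...#.", ".##.."])

few_examples = [(tidy_three, "3"), (tidy_seven, "7")]
more_examples = few_examples + [(flat_top_three, "3")]

print(predict(new_image, few_examples))   # 7
print(predict(new_image, more_examples))  # 3
```

With only two tidy examples the hurried 3 is misread as a 7. Adding one more labeled example fixes it, and no clause about flat tops was ever written. The code stayed the same and only the examples changed.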

It helps to name what we are actually after. We want a function: something that takes those 784 brightness numbers in and gives back ten numbers out, one score per possible digit, with the highest score being the answer. The twist is that we are not going to write that function by hand. We are going to let the computer build it from the labeled examples.
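The shape of that function can be written down before anything is known about its insides. In the sketch below, `placeholder_scorer` is a stand-in that returns made-up scores so the interface can be exercised. It learns nothing and is not the function later lessons will build. The names `DigitScorer`, `placeholder_scorer`, and `pick_digit` are assumptions made for this example.

```python
from typing import Callable, List, Sequence

# The shape we are after: 784 brightness numbers in, 10 scores out.
DigitScorer = Callable[[Sequence[float]], List[float]]

def placeholder_scorer(pixels: Sequence[float]) -> List[float]:
    """Stand-in only: ignores the image and returns fixed, made-up scores."""
    if len(pixels) != 784:
        raise ValueError("expected 784 brightness numbers")
    return [0.01, 0.02, 0.05, 0.80, 0.01, 0.03, 0.02, 0.01, 0.04, 0.01]

def pick_digit(scores: Sequence[float]) -> int:
    """The answer is the digit (0 through 9) whose score is highest."""
    if len(scores) != 10:
        raise ValueError("expected one score per digit")
    return max(range(10), key=lambda digit: scores[digit])

pixels = [0.0] * 784          # any 28 by 28 image, flattened
scores = placeholder_scorer(pixels)
answer = pick_digit(scores)

print(len(scores))  # 10
print(answer)       # 3
```

Everything interesting lives in what replaces `placeholder_scorer`. The input size, the output size, and the "highest score wins" step stay exactly as they are here.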

What is inside that function, how it is structured, and how the computer actually shapes it from examples, is the work of the next several lessons. For now, hold on to just the reframe: we moved from “describe the answer” to “show examples and learn the answer.” That single shift is the door into everything else in this track.

Almost every AI tool you have touched, the chat assistants, the photo search, the voice transcription, the spam filter quietly working in the background, is built on this same idea. Not one of them is a giant pile of rules a person sat down and wrote. They are all, underneath, systems that were shown enormous numbers of examples and learned the patterns themselves.

That one fact explains a lot of what feels strange about modern AI. It is uncannily good at fuzzy, human things, like telling a cat from a dog or catching the tone of a sentence, precisely because those are things we could never have written clean rules for anyway. And it can be oddly brittle at the edges, confidently wrong on an example unlike anything it was shown, because it only ever knew the examples, never a rule. Once you see that these systems learned from examples rather than followed instructions, their strengths and their blind spots stop being mysterious and start making sense.

The common mistakes at this stage are not technical, because there is no technique yet. They are misconceptions about the framing itself.

Thinking modern AI is a huge list of human-written rules. It is the opposite. The whole point of the shift is that nobody wrote the rules for recognizing a 3. The system found the pattern from examples.

Thinking the hard part is the seeing. The seeing is the easy part; your eyes do it instantly. The hard part is the specifying, putting into exact words what makes a 3 a 3. That is what defeats the rule-writer.

Thinking “just write more rules” would eventually work. It feels like you are one clause away from a complete rule, always. You are not. Real handwriting has endless variation, and a finite list of rules will never close the gap.

Underestimating what a pile of labeled examples can do. It is tempting to assume examples alone could not possibly be enough and that real intelligence must need hand-coded knowledge. The surprising lesson of the field is how far examples alone can take you.

  • Recognizing a handwritten digit is effortless for you and brutally hard to write as rules. That gap between doing and explaining is the reason neural networks exist.
  • To a computer, an image is just a list of brightness numbers (often 784 of them for a 28 by 28 image), with no shapes or curves anywhere inside, and the same digit lands on wildly different numbers each time.
  • Handwritten digits are the classic first problem because the input and output are small, the task is genuinely hard but clearly solvable, and the approach scales to faces, scans, and far beyond.
  • The paradigm shift is the whole point: stop writing rules, start showing labeled examples, and let the system find the pattern. We are after a function from 784 numbers to 10, built from examples rather than written by hand.

Modern AI exists because we stopped writing rules and started showing examples.

Next: the cheatsheet puts this opener on one page, and the references link Grant Sanderson’s video if you want to watch the idea unfold. Then lesson 2 cracks open that function from 784 numbers to 10 and shows what is actually inside it.