WaveNet hierarchical model, in brief

What you’ll learn

This is lesson 5 of Phase 2 (Building a language model) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series, and it closes the phase. The MLP language model fused all of its context in one crude step: it concatenated every context character’s embedding and crushed the result through a single hidden layer.

This lesson restructures that model into a hierarchy, in the style of DeepMind’s WaveNet. Instead of fusing everything at once, it fuses two neighboring groups at a time, in levels: pairs of characters, then pairs of those pairs, and so on up a tree. Each fusion is a tiny linear-plus-tanh layer, but the span each output covers doubles at every level, so after k levels each output sees 2^k characters. The lesson works the receptive-field doubling on numbers (three levels cover eight characters; ten levels cover 1,024), shows what one fusion computes, and draws out the big idea: context grows with depth instead of width, and the network builds representations in stages. This staged composition through depth is the central structural idea of modern AI, and the same idea that makes transformers work.
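The fusion and the doubling described above can be sketched in plain Python. This is an illustrative sketch, not the makemore implementation: the names fuse and build_tree, the embedding size of 4, the hidden size of 6, and the random weights are all assumptions chosen for the example, and it assumes the number of input vectors is a power of two.

```python
import math
import random


def fuse(left, right, weights, bias):
    """One fusion: concatenate two neighbors, apply a linear map, then tanh."""
    x = left + right  # list concatenation, length 2 * d_in
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def build_tree(vectors, hidden, rng):
    """Fuse neighbors pairwise, level by level, until one vector remains.

    Assumes len(vectors) is a power of two. Returns the root vector and a
    trace of (level, outputs_at_level, characters_covered_per_output).
    """
    trace = []
    level, span = 0, 1
    while len(vectors) > 1:
        d_in = len(vectors[0])
        # Assumption: one weight matrix per level, shared by every pair in it.
        weights = [
            [rng.uniform(-0.5, 0.5) for _ in range(2 * d_in)]
            for _ in range(hidden)
        ]
        bias = [0.0] * hidden
        vectors = [
            fuse(vectors[i], vectors[i + 1], weights, bias)
            for i in range(0, len(vectors), 2)
        ]
        level += 1
        span *= 2  # each output now covers twice as many characters
        trace.append((level, len(vectors), span))
    return vectors[0], trace


rng = random.Random(0)
# Eight hypothetical character embeddings of size 4 (random stand-ins).
embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
root, trace = build_tree(embeddings, hidden=6, rng=rng)
for level, count, span in trace:
    print(f"level {level}: {count} outputs, each covering {span} characters")
```

With eight inputs the trace shows the count of outputs halving (4, 2, 1) while the span per output doubles (2, 4, 8), which is the three-level case mentioned above.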

Where this fits

As the closer of Phase 2, Building a language model, this lesson directly restructures the MLP from lesson 4: the embeddings, the tanh layers, and the softmax output all stay; what changes is how the context is combined, gradually up a tree instead of all at once. With this, the language-model phase is complete: real context (embeddings), stable training (initialization and normalization), and now staged understanding (a deep hierarchy). The final phase builds the transformer from scratch, starting with self-attention, which replaces WaveNet’s fixed tree with a scheme where each position chooses which others to draw from.

Before you start

Prerequisite (within this track): lesson 4, Giving the model memory: the MLP language model. This lesson restructures that exact network, so you need to know its shape (embeddings looked up and concatenated, a tanh hidden layer, a softmax output trained on negative log likelihood). The critique that opens this lesson, that the flat MLP fuses all its context in one step, only lands if that model is familiar. If you can picture the MLP taking a few context characters and producing next-character probabilities, you are ready. No coding is required to follow along, though running Karpathy’s makemore repo (MIT-licensed) and watching the tensor shapes halve at each level makes the hierarchy concrete.

By the end, you’ll be able to

Explain why the flat MLP’s one-step fusion of context is crude and scales poorly
Describe the WaveNet hierarchy and what a single fusion (concatenate two neighbors, linear, tanh) computes
Explain why the receptive field doubles per level and compute the depth needed for a given context length n as log2(n), rounded up
Contrast “context grows with depth” (the tree) against “context grows with width” (the flat model)
Recognize staged composition through depth as the central structural idea of modern AI, including transformers
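The depth-versus-context arithmetic behind these objectives can be checked with a few lines of Python. This is a minimal sketch: levels_needed is a hypothetical helper name, and the embedding size of 10 is an arbitrary assumption used only to show how the flat model’s first-layer input widens with context while each pairwise fusion at the first level keeps a fixed input width.

```python
def levels_needed(context_length: int) -> int:
    """Smallest k with 2**k >= context_length, i.e. log2 rounded up."""
    if context_length < 1:
        raise ValueError("context_length must be at least 1")
    k, span = 0, 1
    while span < context_length:
        span *= 2  # one more level doubles the receptive field
        k += 1
    return k


emb_dim = 10  # assumed embedding size, for illustration only
for n in (8, 1000, 1024):
    flat_input_width = n * emb_dim    # flat MLP: all embeddings concatenated
    first_fusion_width = 2 * emb_dim  # tree: two neighbors at a time
    print(
        f"context {n}: {levels_needed(n)} levels; "
        f"flat input width {flat_input_width}, "
        f"first-level fusion input width {first_fusion_width}"
    )
```

The loop uses integer doubling rather than a floating-point logarithm so that exact powers of two, such as 1024, come out without rounding concerns.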

Time and difficulty

Read time: about 12 minutes
Practice time: about 18 minutes (computing depth-versus-context and tracing one fusion by hand, optionally confirmed in the makemore repo, plus flashcards)
Difficulty: standard