Skip to content

Cheatsheet: a WaveNet-style hierarchical model

It concatenates all context embeddings and crushes them through one hidden layer: every position fused at once, in a single step, with no staged structure. To see more context you widen the input. It does not scale.

Fuse context two groups at a time, in levels:

level 1: fuse adjacent pairs of characters
level 2: fuse adjacent pairs of those results
level 3: fuse adjacent pairs of those... (and so on up the tree)

Information merges gradually instead of being dumped into one layer. Borrowed from DeepMind’s WaveNet (which generated raw audio); originally built with dilated causal convolutions.

Each fusion is a tiny MLP layer: take two neighboring vectors (two embeddings, or two summaries from the level below), concatenate them, pass through a small linear layer + tanh, out comes one vector summarizing the pair. The whole tree is this one operation repeated.

depth (levels) context the top sees
1 2
2 4
3 8
4 16
10 1024

After k levels each output sees 2^k characters: one more layer doubles the context. Context grows exponentially with depth, not linearly with width. Going from 8 to 16 characters costs one layer in the tree, versus doubling the input width in the flat model.

  • Staged representations: pairs -> fours -> eights, progressively higher-level chunks (like edges -> shapes -> objects in a vision network).
  • Simple layers: each fuses only two groups; nothing learns a giant tangled mapping at once.
  • Cheap long context: paid for in depth, not width.

The lecture reorganizes the net into reusable layer modules (Linear, Tanh, BatchNorm) snapped into a Sequential container, the API real frameworks use. Takeaway: networks are built by composing simple reusable layers; deeper = stack more.

“Stack simple layers, build understanding with depth” is the central structural idea of modern AI. Audio (WaveNet), images (convolutional nets), and text (this model, and transformers) all use a hierarchy of simple local operations whose reach compounds with depth. A transformer is a stack of identical refining layers with attention as the per-layer operation; “96 layers” means 96 rounds of refinement.

Replace the flat MLP’s one-step fusion with a tree of small linear+tanh fusions that combine two neighbors per level, so the receptive field doubles with depth and the network builds understanding in stages, the principle behind every deep architecture, transformers included.