Summary: a WaveNet-style hierarchical model

TL;DR. The MLP language model fused all of its context in one crude step (concatenate every character’s embedding, crush through one hidden layer). This lesson restructures it into a hierarchy, in the style of DeepMind’s WaveNet, that fuses context two neighboring groups at a time, level by level. Each fusion is a tiny linear-plus-tanh layer, but the span each output covers doubles per level, so after k levels each output sees 2^k characters. Context grows exponentially with depth instead of linearly with width, and the network builds representations in stages. Staged composition through depth is the central structural idea of modern AI, the same idea that makes transformers work.

Core ideas

The flat MLP fused context crudely. Concatenate all context embeddings, crush through one hidden layer, everything mixed at once with no staged structure. To see more, widen the input. It does not scale.
WaveNet fuses gradually, in a tree. Combine adjacent pairs at level 1, pairs of those at level 2, and so on. Each fusion is a tiny MLP layer (concatenate two neighbors, linear, tanh), and the whole tree is that one operation repeated up the levels.
The receptive field doubles per level. A context of 8 characters takes 3 fusing levels (2 -> 4 -> 8); after k levels each output sees 2^k characters, so one more layer doubles the context. Reaching N characters needs only log2(N) levels, versus an N-wide input in the flat model.
The hierarchy builds understanding in stages. Pairs become fours become eights, progressively higher-level chunks, the way a vision network goes edges -> shapes -> objects. Each layer stays simple, and long context becomes cheap (paid for in depth, not width).
This is the core idea of modern AI. Audio (WaveNet), images (convolutional nets), and text (transformers) all stack simple local operations whose reach compounds with depth. A transformer is a stack of identical refining layers with attention as the per-layer operation.

What changes for you

The word “deep” stops being a vague honorific and becomes concrete: layers that each do something simple, stacked so the whole does something rich, with each layer refining what the one below produced. When you hear a model has “96 layers,” you know that means 96 rounds of refinement. This completes the language-model phase: you have given the model real context (embeddings), stable training (initialization and normalization), and now staged understanding (a deep hierarchy). The final phase builds the architecture that currently dominates AI, the transformer, starting from its core mechanism, self-attention, which replaces WaveNet’s fixed tree with a more flexible scheme where each position chooses for itself which others to draw from.