Summary: the MLP language model

TL;DR. The bigram model saw only one character of context, and counting over longer contexts does not fix that: the table explodes (three characters of context need over half a million mostly-empty entries). The fix is learned embeddings: a short vector per character, stored in a lookup table and trained by gradient descent, so similar characters can land near each other and the model generalizes. The architecture (from Bengio et al., 2003) looks up the context characters’ embeddings, concatenates them, runs them through a tanh hidden layer, and produces next-character probabilities via softmax. It is trained by minimizing negative log likelihood. Embeddings are the front end of every large language model.

  • Counting cannot scale with context. Each extra character multiplies the count table by 27 and empties it out: three characters of context need 27^4 = 531,441 entries, most never seen in training. A representation that generalizes is required.

  • Embeddings are the fix. Give each character a short learned vector in a lookup table. Because the vectors are trained by gradient descent, similar characters can end up close together, and the model generalizes from contexts it has seen to nearby ones it has not.

  • The architecture is an MLP with an embedding front end. Look up each context character’s embedding, concatenate, pass through a hidden layer (linear + tanh), then an output layer to 27 logits, then softmax. Only the lookup-and-concatenate part is new; the rest is the Phase 1 network. A 3-character, 2-dimensional-embedding, 100-hidden model has 3,481 parameters (54 in the embedding table, 700 in the hidden layer, 2,727 in the output layer, biases included), versus 531,441 count entries, and it generalizes.

  • Training is the same loop, with practical wrinkles. Compute the negative-log-likelihood loss, call backward(), take a gradient-descent step; the only change is that the embeddings are now learned along with the weights. Real training adds minibatches (for speed), a learning-rate sweep (to find a workable step size), and a train/dev/test split (to catch overfitting).

  • This is the front end of every large language model. Each token is looked up in a giant learned embedding table and turned into a vector the network processes. Swap the single hidden layer for a transformer and a few characters for thousands of tokens, and the makemore MLP becomes, in outline, a modern language model.
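The forward pass and the size comparison from the bullets above can be sketched in a few lines. This is a minimal illustration in plain Python, not the lesson's own code: the names (C, W1, b1, W2, b2, forward, nll), the random initialization scale, and the example context indices are all assumptions chosen for the sketch, and the weights are untrained.

```python
import math
import random

VOCAB = 27     # 26 letters plus one boundary token
CONTEXT = 3    # characters of context
EMB_DIM = 2    # embedding dimensions per character
HIDDEN = 100   # hidden units

rng = random.Random(0)


def matrix(rows, cols):
    """Small random weights; the 0.1 scale is an arbitrary choice for this sketch."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


C = matrix(VOCAB, EMB_DIM)              # embedding lookup table, one row per character
W1 = matrix(CONTEXT * EMB_DIM, HIDDEN)  # hidden layer weights
b1 = [0.0] * HIDDEN
W2 = matrix(HIDDEN, VOCAB)              # output layer weights
b2 = [0.0] * VOCAB


def affine(x, W, b):
    """Compute x @ W + b for a single input vector x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def forward(context):
    """Map CONTEXT character indices to a list of next-character probabilities."""
    x = [v for idx in context for v in C[idx]]     # look up and concatenate
    h = [math.tanh(a) for a in affine(x, W1, b1)]  # hidden layer: linear + tanh
    logits = affine(h, W2, b2)                     # output layer: 27 logits
    m = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]               # softmax


def nll(context, target):
    """Negative log likelihood of one target character given its context."""
    return -math.log(forward(context)[target])


# Parameter count: embedding table + hidden layer + output layer (biases included).
n_params = (VOCAB * EMB_DIM
            + CONTEXT * EMB_DIM * HIDDEN + HIDDEN
            + HIDDEN * VOCAB + VOCAB)
# Count-table size: every 3-character context times every possible next character.
table_entries = VOCAB ** (CONTEXT + 1)

probs = forward([0, 5, 13])   # arbitrary example indices
print(n_params)               # 3481
print(table_entries)          # 531441
```

Training would repeat this forward pass over many examples and nudge C, W1, b1, W2, and b2 down the gradient of the average nll; in practice a framework with automatic differentiation handles the backward pass.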

The word “embedding,” which comes up everywhere in discussions of large language models, is now concrete rather than jargon: a lookup table of vectors the network learned, where similar things land near each other. You can now picture where text first becomes numbers a model can reason with, the very first step inside any language model. The generated names are noticeably better here because the model finally has real context. But training this deeper network well is finicky: the next lesson opens the network up during training to diagnose what goes wrong (saturated neurons, exploding or vanishing gradients) and fixes it with careful initialization and batch normalization.