Giving the model memory: the MLP language model

This is lesson 2 of Phase 2 (Building a language model) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. The bigram model from the previous lesson predicted the next character from only the one before it. This lesson gives the model real memory.

You cannot fix the bigram’s short context by counting over longer contexts. With 27 characters, a three-character context has 27^3 = 19,683 possible contexts, and each one needs a count for all 27 possible next characters, so the table grows to 27^4 = 531,441 entries, most of them never seen in training and therefore blank. The fix is learned embeddings: give each character a short vector of numbers, stored in a lookup table and trained by gradient descent, so that similar characters land near each other and the model generalizes. The architecture (from Bengio’s 2003 neural language model) looks up the embeddings of the context characters, concatenates them, passes the result through a tanh hidden layer, and applies softmax to produce next-character probabilities. It is trained on negative log likelihood. The lesson works the embed-and-concatenate step and a full parameter count by hand, and shows that learned embeddings are the front end of every large language model.
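The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the lesson’s or makemore’s actual code: the context length of 3, embedding size of 2, hidden width of 100, the random initialization scale, the index mapping (`.` = 0, `a` = 1, and so on), and the helper names `rand_matrix`, `affine`, and `forward` are all chosen here for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

VOCAB = 27      # 26 letters plus the "." token, as in the lesson
CONTEXT = 3     # assumed context length (characters of memory)
EMB_DIM = 2     # assumed embedding size per character
HIDDEN = 100    # assumed hidden-layer width


def rand_matrix(rows, cols, scale=0.1):
    """Hypothetical helper: a rows x cols matrix of small random numbers."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Parameters: all of these would be trained by gradient descent.
C = rand_matrix(VOCAB, EMB_DIM)               # embedding lookup table
W1 = rand_matrix(CONTEXT * EMB_DIM, HIDDEN)   # hidden-layer weights
b1 = [0.0] * HIDDEN                           # hidden-layer biases
W2 = rand_matrix(HIDDEN, VOCAB)               # output-layer weights
b2 = [0.0] * VOCAB                            # output-layer biases


def affine(x, W, b):
    """Compute x @ W + b for a single input vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def forward(context_ids):
    """Map a list of CONTEXT character indices to next-character probabilities."""
    # 1. Look up each context character's embedding and concatenate them.
    x = [value for idx in context_ids for value in C[idx]]
    # 2. tanh hidden layer.
    h = [math.tanh(a) for a in affine(x, W1, b1)]
    # 3. Output layer produces one logit per character.
    logits = affine(h, W2, b2)
    # 4. Softmax (subtracting the max logit for numerical stability).
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed mapping: "." = 0, "a" = 1, ..., "z" = 26.
context = [0, 5, 13]   # ".", "e", "m"
target = 13            # "m", the next character in a name such as "emma"

probs = forward(context)
loss = -math.log(probs[target])   # negative log likelihood of the true next character
print(len(probs), loss)
```

With untrained weights the probabilities are close to uniform, so the loss sits near ln 27. Training would adjust `C`, `W1`, `b1`, `W2`, and `b2` together to push that loss down.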

The previous lesson built the bigram model and exposed its one real weakness: a single character of context. This lesson fixes that with more context and the idea that makes more context affordable, learned embeddings. It reuses the Phase 1 training loop unchanged (the embeddings simply learn alongside the weights). The next lesson looks inside this deeper network while it trains, diagnoses what makes it stall or saturate, and stabilizes it with careful initialization and batch normalization.

Prerequisite (within this track): lesson 3, Your first language model: makemore (the bigram model). This lesson builds directly on it: the characters, the . token, softmax, and negative log likelihood all carry over, and the training loop is the one from Phase 1 that lesson 3 already used. If “predict the next character, softmax the outputs into probabilities, train on negative log likelihood” reads as a familiar procedure, you are ready. A sense of what an embedding is from the AI Foundations track helps but is not required; this lesson builds the idea from scratch. No coding is required to follow along, though running Karpathy’s makemore repo (MIT-licensed) is the best way to make it concrete.

  • Explain why the counting approach cannot scale with context, using the table-size explosion
  • Describe what a learned embedding is and how it lets the model generalize across similar characters
  • Walk the MLP language model architecture end to end, from embedding lookup and concatenation through the hidden layer to softmax probabilities
  • Size a model by hand (embedding table, hidden, output) and compare its parameter count to the count table it replaces
  • Recognize that learned embeddings are the front end of every large language model
  • Read time: about 12 minutes
  • Practice time: about 20 minutes (sizing a model by hand and comparing to the count table, optionally confirmed in the makemore repo, plus flashcards)
  • Difficulty: standard
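
The sizing exercise named in the objectives comes down to a few multiplications, sketched here in Python. The layer sizes are assumptions chosen for illustration (a 3-character context, 2-dimensional embeddings, and 100 hidden units); the lesson’s own worked numbers may differ.

```python
VOCAB = 27      # 26 letters plus the "." token
CONTEXT = 3     # assumed context length
EMB_DIM = 2     # assumed embedding size per character
HIDDEN = 100    # assumed hidden-layer width

# Count table: one entry per (context, next character) combination.
count_table = VOCAB ** (CONTEXT + 1)                    # 27^4 = 531,441

# MLP language model, layer by layer.
embedding_params = VOCAB * EMB_DIM                      # 27 * 2 = 54
hidden_params = CONTEXT * EMB_DIM * HIDDEN + HIDDEN     # 6 * 100 + 100 = 700
output_params = HIDDEN * VOCAB + VOCAB                  # 100 * 27 + 27 = 2,727
total_params = embedding_params + hidden_params + output_params  # 3,481

print(f"count table entries: {count_table:,}")
print(f"MLP parameters:      {total_params:,}")
```

Under these assumed sizes the network needs 3,481 parameters where the count table needs 531,441 entries. Lengthening the context by one character multiplies the table by 27 but adds only `EMB_DIM * HIDDEN` weights to the hidden layer.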