Your first language model: makemore (the bigram model)
What you’ll learn
This is the opener of Phase 2 (Building a language model) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Phase 1 left you able to build a network from nothing and train it by gradient descent. This lesson points that machinery at language.
You will build makemore, a character-level model that reads a list of names and generates new ones. The model is a bigram model: it predicts the next character from only the current one, with a special “.” token marking the start and end of each name. You build it two equivalent ways: by counting character pairs and normalizing each row into probabilities, and as a one-layer neural network (one-hot input, a single linear layer, softmax) trained on negative log likelihood with the engine from Phase 1. The two converge to the same answer. The lesson works a count-to-probability-to-NLL example by hand, then shows that this toy is a large language model in miniature: assign a probability to what comes next, sample, append, repeat.
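The count-to-probability-to-NLL path can be sketched in a few lines of plain Python. This is a minimal illustration, not the makemore repo's code: the three-name dataset and the names `counts`, `probs`, and `average_nll` are assumptions made for the example, and no smoothing is applied, so a pair that never appears in the data has no probability at all.

```python
import math
from collections import defaultdict

# Hypothetical toy dataset; the real lesson uses a much larger list of names.
names = ["ava", "eva", "ana"]

# Count character pairs, wrapping each name in the "." start/end token.
counts = defaultdict(lambda: defaultdict(int))
for name in names:
    chars = ["."] + list(name) + ["."]
    for cur, nxt in zip(chars, chars[1:]):
        counts[cur][nxt] += 1

# Normalize each row of counts into a probability distribution.
probs = {
    cur: {nxt: n / sum(row.values()) for nxt, n in row.items()}
    for cur, row in counts.items()
}


def average_nll(name):
    """Average negative log likelihood of one name under the bigram model.

    Raises KeyError for a pair never seen in training (no smoothing here).
    """
    chars = ["."] + list(name) + ["."]
    pairs = list(zip(chars, chars[1:]))
    log_likelihood = sum(math.log(probs[cur][nxt]) for cur, nxt in pairs)
    return -log_likelihood / len(pairs)


print(probs["."])          # distribution over first characters
print(average_nll("ava"))  # lower is better; 0 would mean certainty
```

For "ava" the four pair probabilities are 2/3, 1/5, 1, and 3/5, so the likelihood is their product, 0.08, and the average NLL is -ln(0.08) / 4, which is about 0.63.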
Where this fits
This is lesson 1 of Phase 2, Building a language model, and the first lesson to produce something that looks like AI. Phase 1 (lessons 1 and 2) built the autograd engine and the training loop on toy numbers. This lesson reuses that loop to train a real, if simple, language model, and introduces the ideas the rest of the phase builds on: characters, a probability over what comes next, sampling, and negative log likelihood. The next lesson keeps the predict-the-next-character framing but feeds several previous characters through a multilayer perceptron with learned character embeddings, so the model uses real context and the generated names improve.
Before you start
Prerequisites (within this track): lessons 1 and 2, Building an autograd engine: micrograd and Building and training a net: micrograd. The neural-network version of the bigram model is trained with exactly the engine and gradient-descent loop from those lessons: a one-hot input, a single linear layer, a loss, backward(), and a downhill step. If “compute a loss, call backward(), nudge the parameters downhill” reads as a procedure, you are ready. No coding is required to follow the lesson, though running Karpathy’s makemore repo (MIT-licensed) on the real names dataset is the best way to make it concrete.
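As a rough picture of that procedure applied to the bigram model, here is a sketch of the one-layer softmax network in plain Python. It makes several assumptions: the dataset is a made-up three-name list, the learning rate and step count are arbitrary, and the gradient is written out by hand (softmax output minus the one-hot target) in place of the Phase 1 engine's backward(), so the example runs with the standard library alone.

```python
import math
from collections import Counter

names = ["ava", "eva", "ana"]  # hypothetical toy dataset
vocab = ["."] + sorted(set("".join(names)))
index = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# Training pairs: (current character index, next character index).
pairs = []
for name in names:
    chars = ["."] + list(name) + ["."]
    pairs += [(index[a], index[b]) for a, b in zip(chars, chars[1:])]

# One linear layer. W[i][j] is the logit for next char j given current char i.
# Multiplying a one-hot input by W simply selects row i.
W = [[0.0] * V for _ in range(V)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def loss():
    """Average negative log likelihood over all training pairs."""
    return -sum(math.log(softmax(W[i])[j]) for i, j in pairs) / len(pairs)


initial_loss = loss()
learning_rate = 1.0  # arbitrary choice for this sketch
for step in range(500):
    grad = [[0.0] * V for _ in range(V)]
    for i, j in pairs:
        p = softmax(W[i])
        for k in range(V):
            # Gradient of NLL with respect to logit k: softmax minus one-hot.
            grad[i][k] += (p[k] - (1.0 if k == j else 0.0)) / len(pairs)
    for i in range(V):
        for k in range(V):
            W[i][k] -= learning_rate * grad[i][k]  # downhill step
final_loss = loss()

# Loss of the counting model on the same pairs, for comparison.
pair_counts = Counter(pairs)
row_totals = Counter(i for i, _ in pairs)
count_loss = -sum(
    n * math.log(n / row_totals[i]) for (i, _), n in pair_counts.items()
) / len(pairs)

print(initial_loss, final_loss, count_loss)
```

With all logits at zero the network starts from a uniform guess, so the initial loss is ln(V). Training pushes the loss down toward the counting model's loss, which it approaches from above because the normalized counts are the minimum of this objective.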
By the end, you’ll be able to
- Explain what a language model does (assign probabilities to the next piece of text) and how generating text is a predict-sample-append loop
- Describe the bigram simplification and the role of the start/end token in modeling names character by character
- Build a bigram model by counting character pairs and normalizing each row into probabilities, and sample from it
- Explain negative log likelihood as the quality measure, and compute it by hand on a small worked example
- Recognize that the same bigram model can be built as a one-layer softmax network trained by gradient descent, and that this is a large language model in miniature
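The predict-sample-append loop named in the first objective can be sketched as follows. The dataset, the `generate` helper, and the fixed seed are illustrative assumptions rather than code from the makemore repo; the sketch samples straight from raw counts, relying on `random.choices` to normalize the weights.

```python
import random
from collections import Counter, defaultdict

names = ["ava", "eva", "ana"]  # hypothetical toy dataset

# Bigram counts, with "." marking the start and end of each name.
counts = defaultdict(Counter)
for name in names:
    chars = ["."] + list(name) + ["."]
    for cur, nxt in zip(chars, chars[1:]):
        counts[cur][nxt] += 1


def generate(rng):
    """Generate one name: predict a distribution, sample, append, repeat."""
    out = []
    cur = "."  # begin at the start token
    while True:
        row = counts[cur]  # predict: counts for what follows `cur`
        # Sample the next character in proportion to its count.
        nxt = rng.choices(list(row.keys()), weights=list(row.values()))[0]
        if nxt == ".":  # the end token closes the name
            return "".join(out)
        out.append(nxt)  # append, then condition on the new character
        cur = nxt


rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(rng) for _ in range(5)]
print(samples)
```

Because the model sees only the current character, the samples are plausible pair by pair but carry no longer-range structure.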
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 20 minutes (a bigram row built and scored by hand, optionally confirmed in the makemore repo, plus flashcards)
- Difficulty: standard