
Your first language model: makemore (the bigram model)

This is the opener of Phase 2 (Building a language model) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Phase 1 left you able to build a network from nothing and train it by gradient descent. This lesson points that machinery at language.

You will build makemore, a character-level model that reads a list of names and generates new ones. The model is a bigram model: it predicts the next character from only the current one, with a special “.” token marking the start and end of each name. You build it two equivalent ways: first by counting character pairs and normalizing each row of counts into probabilities, then as a one-layer neural network (one-hot input, a single linear layer, softmax) trained on negative log likelihood (NLL) with the engine from Phase 1. Given enough training, the network’s predicted probabilities approach the normalized counts, so both routes arrive at essentially the same model. The lesson works a count-to-probability-to-NLL example by hand, then shows that this toy is a large language model in miniature: assign a probability to what comes next, sample, append, repeat.
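The counting route can be sketched in a few lines of plain Python. This is a minimal illustration, not the makemore code: the three-name list and the names `counts`, `probs`, and `average_nll` are assumptions made for this example, and no smoothing is applied.

```python
import math
from collections import Counter

# Hypothetical toy dataset; the real lesson uses a much larger list of names.
names = ["emma", "ava", "mia"]

# Count every adjacent character pair, with "." marking start and end.
counts = Counter()
for name in names:
    chars = ["."] + list(name) + ["."]
    for current, nxt in zip(chars, chars[1:]):
        counts[(current, nxt)] += 1

# Normalize each row: P(next | current) = count(current, next) / row total.
row_totals = Counter()
for (current, _), n in counts.items():
    row_totals[current] += n
probs = {pair: n / row_totals[pair[0]] for pair, n in counts.items()}

def average_nll(words):
    """Mean negative log likelihood per bigram; lower is better."""
    total, n = 0.0, 0
    for word in words:
        chars = ["."] + list(word) + ["."]
        for pair in zip(chars, chars[1:]):
            # A pair never seen in training has probability zero and
            # would raise KeyError here; this sketch does not smooth.
            total += -math.log(probs[pair])
            n += 1
    return total / n

print(probs[(".", "e")])   # one of three equally likely first letters
print(probs[("a", ".")])   # "a" ends a name 3 times out of 4
print(average_nll(names))
```

On this toy data the row for “a” has four counts (three “a.” and one “av”), so the model gives 3/4 to ending the name after “a”. Scoring the training names themselves gives the lowest average NLL a bigram table can reach on this data.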

As lesson 1 of Phase 2, this is the first lesson to produce something that looks like AI. Phase 1 (lessons 1 and 2) built the autograd engine and the training loop on toy numbers. This lesson reuses that loop to train a real, if simple, language model, and introduces the ideas the rest of the phase builds on: characters, a probability distribution over what comes next, sampling, and negative log likelihood. The next lesson keeps the predict-the-next-character framing but feeds several previous characters through a multilayer perceptron with learned character embeddings, so the model uses real context and the generated names improve.

Prerequisites (within this track): lessons 1 and 2, “Building an autograd engine: micrograd” and “Building and training a net: micrograd”. The neural-network version of the bigram model is trained with exactly the engine and gradient-descent loop from those lessons: a one-hot input, a single linear layer, a loss, backward(), and a downhill step. If “compute a loss, call backward(), nudge the parameters downhill” reads as a familiar procedure, you are ready. No coding is required to follow the lesson, though running Karpathy’s makemore repo (MIT-licensed) on the real names dataset is the best way to make it concrete.
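The neural-network route can be sketched without any library. To stay dependency-free, this example writes the gradient out by hand instead of calling backward() on the Phase 1 engine, using the standard result that the gradient of NLL with respect to softmax logits is the predicted probabilities minus the one-hot target. The toy dataset, learning rate, and step count are assumptions chosen for illustration.

```python
import math

# Hypothetical toy dataset; the real lesson trains on a large list of names.
names = ["emma", "ava", "mia"]
vocab = sorted(set("".join(names)) | {"."})
stoi = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# Training pairs (current index, next index), with "." as start/end.
pairs = []
for name in names:
    chars = ["."] + list(name) + ["."]
    pairs += [(stoi[a], stoi[b]) for a, b in zip(chars, chars[1:])]

# One linear layer. Multiplying a one-hot vector by W just selects row W[x],
# so row x holds the logits for "what follows character x".
W = [[0.0] * V for _ in range(V)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def loss():
    """Mean negative log likelihood over all training pairs."""
    return sum(-math.log(softmax(W[x])[y]) for x, y in pairs) / len(pairs)

lr = 1.0  # assumed learning rate for this toy problem
for step in range(2000):
    # Hand-written gradient: (softmax - one_hot(target)) / N per example.
    grad = [[0.0] * V for _ in range(V)]
    for x, y in pairs:
        p = softmax(W[x])
        for j in range(V):
            grad[x][j] += (p[j] - (1.0 if j == y else 0.0)) / len(pairs)
    # Downhill step.
    for i in range(V):
        for j in range(V):
            W[i][j] -= lr * grad[i][j]

print(loss())
print(softmax(W[stoi["a"]])[stoi["."]])  # approaches 3/4, the count-based value
```

Pairs that never occur in the data keep a small nonzero probability at any finite number of steps, so the network approaches the count-based table rather than matching it exactly.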

  • Explain what a language model does (assign probabilities to the next piece of text) and how generating text is a predict-sample-append loop
  • Describe the bigram simplification and the role of the start/end token in modeling names character by character
  • Build a bigram model by counting character pairs and normalizing each row into probabilities, and sample from it
  • Explain negative log likelihood as the quality measure, and compute it by hand on a small worked example
  • Recognize that the same bigram model can be built as a one-layer softmax network trained by gradient descent, and that this is a large language model in miniature
  • Read time: about 13 minutes
  • Practice time: about 20 minutes (a bigram row built and scored by hand, optionally confirmed in the makemore repo, plus flashcards)
  • Difficulty: standard
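
The predict-sample-append loop named in the objectives above can be sketched as follows. This is an illustrative example on a hypothetical three-name dataset, sampling directly from raw counts with the standard library’s `random.choices`. The `generate` helper and its `max_len` cap are assumptions for this sketch, not part of makemore.

```python
import random
from collections import Counter, defaultdict

names = ["emma", "ava", "mia"]  # hypothetical toy dataset

# counts[current][next] = how often `next` follows `current`.
counts = defaultdict(Counter)
for name in names:
    chars = ["."] + list(name) + ["."]
    for current, nxt in zip(chars, chars[1:]):
        counts[current][nxt] += 1

def generate(rng, max_len=20):
    """Predict, sample, append, repeat until the end token is drawn."""
    out, current = [], "."
    while len(out) < max_len:  # cap guards against long loops such as "mmmm..."
        options = list(counts[current])
        weights = [counts[current][c] for c in options]
        # random.choices accepts unnormalized weights, so raw counts work.
        current = rng.choices(options, weights=weights)[0]
        if current == ".":
            break
        out.append(current)
    return "".join(out)

rng = random.Random(0)  # seeded so runs are repeatable
samples = [generate(rng) for _ in range(5)]
print(samples)
```

Every generated name is stitched together from character pairs seen in training, which is why bigram samples look name-like locally but often drift over longer spans.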