Summary: makemore, the bigram model

TL;DR. A language model assigns a probability to each possible next piece of text; generating is predict, sample, append, repeat. This lesson builds the simplest one, a character-level bigram model that predicts the next character from only the current one, and trains it on a list of names. You build it two equivalent ways: by counting character pairs and normalizing each row into probabilities, and as a one-layer neural network (one-hot input, linear layer, softmax) trained on negative log likelihood with the engine from Phase 1. Both reach the same answer. Swap characters for tokens, one character of context for thousands, and one linear layer for a transformer, and this toy becomes, in outline, a modern large language model.
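The counting route can be sketched in a few lines of plain Python. This is a minimal illustration and not the lesson's own code: the four-name list and the helper names (`count_bigrams`, `normalize`, `sample_name`) are assumptions made for the example, and the lesson itself trains on a much larger list of names.

```python
import random
from collections import Counter

# Hypothetical toy dataset (assumption); the lesson uses a far larger list of names.
names = ["ava", "emma", "anna", "eva"]

# 27 symbols: "." marks the start and end of a name, followed by 26 letters.
chars = ["."] + [chr(ord("a") + i) for i in range(26)]

def count_bigrams(words):
    """Tally how often each character is followed by each other character."""
    counts = {c: Counter() for c in chars}
    for w in words:
        padded = "." + w + "."          # ava becomes .ava.
        for cur, nxt in zip(padded, padded[1:]):
            counts[cur][nxt] += 1
    return counts

def normalize(counts):
    """Turn each non-empty row of counts into a probability distribution."""
    probs = {}
    for cur, row in counts.items():
        total = sum(row.values())
        if total:                       # rows for unseen characters are skipped
            probs[cur] = {nxt: n / total for nxt, n in row.items()}
    return probs

def sample_name(probs, rng, max_len=20):
    """Predict, sample, append, repeat until the end token is drawn."""
    out, cur = [], "."
    while len(out) < max_len:
        row = probs[cur]
        cur = rng.choices(list(row), weights=list(row.values()))[0]
        if cur == ".":
            break
        out.append(cur)
    return "".join(out)

counts = count_bigrams(names)
probs = normalize(counts)
rng = random.Random(0)                  # fixed seed so runs are repeatable
print([sample_name(probs, rng) for _ in range(5)])
```

With only four names the samples are mostly recombinations of the inputs; the point is the shape of the loop, not the quality of the output.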

  • A language model assigns probabilities to what comes next. Given the text so far, it scores each possible next piece; generating text is sampling from those scores in a loop. That single ability is the whole job.

  • The bigram model is the hand-buildable version. It predicts the next character from only the current character, with a . token marking the start and end of each name (ava becomes .ava.). One character of context is crude but complete.

  • Build it by counting. Tally a 27x27 table of how often each character is followed by each other character (26 letters plus .), normalize each row into probabilities, and sample from the current character's row to generate the next character.

  • Build the same model as a neural network. One-hot the current character, pass it through a single linear layer, softmax the outputs into probabilities, and train on negative log likelihood with the autograd engine. It converges to the same probabilities that counting gives directly; the network is the version that extends to longer contexts and larger models.

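  • Quality is measured by the average negative log likelihood; lower is better. Each bigram costs loss = -log(probability), using the natural log. If a is followed by n four times and by b once, then P(n|a)=0.8 (loss 0.223) and P(b|a)=0.2 (loss 1.609), and the average over those five bigrams is about 0.500. A confident, correct prediction is cheap; an unlikely one is expensive.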
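The loss arithmetic and the claim that the network converges to the counting probabilities can both be checked with a small sketch. It rests on stated assumptions: the three-character vocabulary is a toy, and the gradient is written out by hand (for softmax with negative log likelihood it is the predicted probabilities minus the one-hot target) instead of coming from the Phase 1 autograd engine the lesson uses.

```python
import math

# Worked numbers from the summary: after "a" we saw "n" four times and "b" once.
counts_after_a = {"n": 4, "b": 1}
total = sum(counts_after_a.values())
p = {c: k / total for c, k in counts_after_a.items()}    # {"n": 0.8, "b": 0.2}
loss = {c: -math.log(q) for c, q in p.items()}           # per-bigram loss
avg_nll = sum(counts_after_a[c] * loss[c] for c in p) / total

# The same data as (current, next) index pairs for a one-layer network.
vocab = ["a", "b", "n"]                                   # toy vocabulary (assumption)
idx = {c: i for i, c in enumerate(vocab)}
pairs = [(idx["a"], idx["n"])] * 4 + [(idx["a"], idx["b"])]

def softmax(logits):
    m = max(logits)                                       # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# A one-hot input times the weight matrix just selects one row,
# so W[i] holds the logits used when the current character is i.
W = [[0.0] * len(vocab) for _ in vocab]

def train(W, pairs, steps=2000, lr=1.0):
    """Plain gradient descent on the average negative log likelihood."""
    for _ in range(steps):
        grad = [[0.0] * len(vocab) for _ in vocab]
        for cur, nxt in pairs:
            probs = softmax(W[cur])
            for j in range(len(vocab)):
                # d(-log probs[nxt]) / d logit_j = probs[j] - 1[j == nxt]
                target = 1.0 if j == nxt else 0.0
                grad[cur][j] += (probs[j] - target) / len(pairs)
        for i in range(len(vocab)):
            for j in range(len(vocab)):
                W[i][j] -= lr * grad[i][j]
    return W

train(W, pairs)
learned = softmax(W[idx["a"]])
print({c: round(v, 3) for c, v in loss.items()}, round(avg_nll, 3))
print({c: round(learned[idx[c]], 3) for c in vocab})
```

The learned row for a approaches 0.8 for n and 0.2 for b. The probability of the unseen pair aa shrinks toward zero but never reaches it exactly, because softmax outputs are always positive.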

“A model that generates text” stops being mysterious and becomes a probability table you could fill in by hand. When a chatbot streams a reply, you can picture the same loop running underneath: predict the next piece, sample, append, repeat. The difference between this bigram model and ChatGPT is scale and reach (tokens, more context, a transformer), not a different idea. Because the idea is the same, a large model also samples its output, which is why it can surprise you rather than looking up a fixed answer. The next lesson attacks the bigram model’s one real weakness, its single character of context, by feeding several previous characters through a multilayer perceptron so the generated names finally start to look like names.