Cheatsheet: makemore, the bigram model

Given the text so far, assign a probability to each possible next piece. Generating text = predict next, sample one, append, repeat.

| Choice | Value |
| --- | --- |
| Unit | a single character (26 letters + .) = 27 characters |
| Context | only the current character (one character of memory) |
| . token | marks both the start and the end of a name; ava -> .ava. |
| Corpus | a list of ~30k names |

A bigram is a pair of adjacent characters. The model predicts the next character from the current one alone.
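As a small illustration of the definition above, the sketch below wraps a name in the `.` marker and lists its adjacent pairs. The helper name `bigrams` is made up for this example and is not taken from the lesson's code.

```python
def bigrams(name):
    """Return the adjacent character pairs of a name wrapped in '.' markers."""
    chars = ["."] + list(name) + ["."]      # ava -> . a v a .
    return list(zip(chars, chars[1:]))      # pair each character with its successor


print(bigrams("ava"))
# [('.', 'a'), ('a', 'v'), ('v', 'a'), ('a', '.')]
```

A name of n letters yields n + 1 bigrams, because the start and end markers each contribute one pair.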

By counting. Make a 27x27 table of how often each character follows each other character. Normalize each row (divide by its total) to get probabilities. Sample from a row to pick the next character.
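The counting recipe can be sketched in plain Python as follows. This is a minimal version under stated assumptions: the three-name list is a stand-in for the real ~30k-name corpus, and the function names (`count_bigrams`, `normalize`, `sample_name`) are illustrative, not the lesson's own.

```python
import random
import string

CHARS = "." + string.ascii_lowercase            # index 0 is the '.' marker
STOI = {ch: i for i, ch in enumerate(CHARS)}    # character -> table index


def count_bigrams(names):
    """Build the 27x27 table: counts[cur][nxt] = how often nxt follows cur."""
    counts = [[0] * 27 for _ in range(27)]
    for name in names:
        chars = "." + name + "."
        for cur, nxt in zip(chars, chars[1:]):
            counts[STOI[cur]][STOI[nxt]] += 1
    return counts


def normalize(counts):
    """Divide each row by its total; rows never seen stay all zero."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


def sample_name(probs, rng):
    """Start at '.', sample the next character from the current row, stop at '.'."""
    ix, out = 0, []
    while True:
        ix = rng.choices(range(27), weights=probs[ix])[0]
        if ix == 0:
            return "".join(out)
        out.append(CHARS[ix])


names = ["ava", "anna", "eva"]                  # assumed toy corpus
probs = normalize(count_bigrams(names))
print(probs[STOI["a"]][STOI["."]])              # 0.6: 'a' ends a name 3 times out of 5
print(sample_name(probs, random.Random(0)))     # one sampled name
```

Sampling never lands on an all-zero row here, because every character that can be sampled also appears as a current character somewhere in the corpus.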

As a neural network. One-hot the current character (length-27 vector, single 1) -> single linear layer (27x27 weights, no bias) -> 27 outputs read as log-counts -> softmax (exponentiate, then normalize) -> probabilities. Train on negative log likelihood with the engine from Phase 1. The trained softmax converges to the same probabilities the counting method gives directly.
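The network version can be sketched without any library. Two assumptions to note: the gradient is written out by hand (for softmax with negative log likelihood it is the predicted probabilities minus the one-hot target) rather than computed by the Phase 1 engine, and the vocabulary is shrunk to three characters to keep the run short. Multiplying a one-hot vector by the weight matrix simply selects one row, so the code indexes the row directly.

```python
import math


def softmax(logits):
    """Exponentiate, then normalize (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(pairs, vocab_size, steps=1000, lr=1.0):
    """Gradient descent on average negative log likelihood for a bias-free linear layer."""
    # W[i][j] is the logit ("log-count") for next = j given current = i.
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    for _ in range(steps):
        grad = [[0.0] * vocab_size for _ in range(vocab_size)]
        for cur, nxt in pairs:
            p = softmax(W[cur])                 # one-hot(cur) @ W == row W[cur]
            for j in range(vocab_size):
                target = 1.0 if j == nxt else 0.0
                grad[cur][j] += (p[j] - target) / len(pairs)
        for i in range(vocab_size):
            for j in range(vocab_size):
                W[i][j] -= lr * grad[i][j]
    return W


# Assumed toy data: indices 0='a', 1='n', 2='b'; 'a' is followed by 'n' four times, 'b' once.
pairs = [(0, 1)] * 4 + [(0, 2)]
W = train(pairs, vocab_size=3)
p = softmax(W[0])
print([round(x, 2) for x in p])                 # close to the counting answer [0, 0.8, 0.2]
```

The softmax can only approach a probability of exactly 0, so the match with the counting table is a limit rather than an exact equality after finitely many steps.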

For one bigram: loss = -log(probability the model assigned it), using the natural log. Model quality = the average over every bigram; lower is better. Log because a product of many small probabilities underflows; negate so “better” means “smaller.”
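The underflow argument is easy to check directly. In the sketch below the 1000 probabilities of 0.05 are made-up values chosen only to force the effect.

```python
import math

probs = [0.05] * 1000                           # assumed per-bigram probabilities

product = 1.0
for p in probs:
    product *= p                                # likelihood as a raw product

log_sum = sum(math.log(p) for p in probs)       # same quantity in log space
avg_nll = -log_sum / len(probs)                 # average negative log likelihood

print(product)                                  # 0.0: the product underflowed
print(round(avg_nll, 3))                        # 2.996, i.e. -log(0.05)
```

The raw product is too small for a 64-bit float and collapses to zero, while the sum of logs stays an ordinary number that can be averaged and minimized.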

Counts after a: n four times, b once (row total 5).

P(n|a) = 4/5 = 0.8
P(b|a) = 1/5 = 0.2
loss on a -> n: -log(0.8) = 0.223
loss on a -> b: -log(0.2) = 1.609
average of these two losses: (0.223 + 1.609)/2 = 0.916

Confident-correct (0.8) -> small loss; unlikely (0.2) -> bigger loss; “impossible” (probability 0) -> infinite loss.
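The arithmetic of this worked example can be reproduced in a few lines. The names `counts_after_a` and `nll` are illustrative. The last average is an extra not in the example itself: it weights each loss by how often the bigram occurred, which is what averaging over all five observed bigrams would give.

```python
import math

counts_after_a = {"n": 4, "b": 1}               # counts from the worked example
total = sum(counts_after_a.values())
probs = {ch: c / total for ch, c in counts_after_a.items()}


def nll(p):
    """Negative log likelihood of one bigram; probability 0 gives infinite loss."""
    return -math.log(p) if p > 0 else math.inf


loss_n = nll(probs["n"])
loss_b = nll(probs["b"])
print(round(loss_n, 3), round(loss_b, 3))       # 0.223 1.609
print(round((loss_n + loss_b) / 2, 3))          # 0.916, average of the two losses

# Average over all five observed bigrams (a->n four times, a->b once).
weighted = sum(c * nll(probs[ch]) for ch, c in counts_after_a.items()) / total
print(round(weighted, 3))                       # 0.5

print(nll(0.0))                                 # inf
```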

One character of context throws away almost everything before it. That is the motivation for the next lesson (an MLP fed several previous characters), not a bug to fix here.

A large language model does exactly this: assign a probability over the next piece of text, sample, append, repeat. The differences are scale and reach, not kind: tokens instead of characters, thousands of tokens of context instead of one character, a transformer instead of a single linear layer. The predict-and-sample core is unchanged, which is why a model samples (and can surprise you) rather than looking up a fixed answer.

A bigram model assigns a probability to each next character from the current one; build it by counting-and-normalizing or as a one-layer softmax network trained on negative log likelihood, then sample to generate names. It is a large language model in miniature.