Cheatsheet: makemore, the bigram model
What a language model does
Given the text so far, assign a probability to each possible next piece. Generating text = predict next, sample one, append, repeat.
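The predict, sample, append, repeat loop can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `generate`, `toy_next_probs`, and the `TOY` table are hypothetical names, and the toy model only looks at the last character of the text so far.

```python
import random

def generate(next_probs, rng, end=".", max_len=20):
    """Predict next, sample one, append, repeat until `end` is drawn."""
    text = ""
    while len(text) < max_len:
        probs = next_probs(text)                  # predict: piece -> probability
        pieces = list(probs)
        weights = [probs[p] for p in pieces]
        piece = rng.choices(pieces, weights=weights, k=1)[0]  # sample one
        if piece == end:
            break
        text += piece                             # append, then repeat
    return text

# Hypothetical toy model: a hand-written table keyed by the last character.
TOY = {
    "":  {"a": 1.0},
    "a": {"b": 0.5, ".": 0.5},
    "b": {".": 1.0},
}

def toy_next_probs(text):
    return TOY[text[-1:]]   # text[-1:] is "" for empty text

print(generate(toy_next_probs, random.Random(0)))
```

Because the step after `a` is a coin flip, different seeds can give `a` or `ab`; that randomness is the sampling step at work.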
The bigram setup
| Choice | Value |
|---|---|
| Unit | a single character (26 letters + .) = 27 characters |
| Context | only the current character (one character of memory) |
| . token | marks both the start and the end of a name; ava -> .ava. |
| Corpus | a list of ~30k names |
A bigram is a pair of adjacent characters. The model predicts the next character from the current one alone.
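As a small sketch of the setup above, the following builds the 27-character vocabulary and splits a name into bigrams. The names `CHARS`, `STOI`, and `bigrams` are illustrative, and placing `.` at index 0 is an assumed convention, not something the table specifies.

```python
import string

CHARS = ["."] + list(string.ascii_lowercase)   # assumed order: '.' first, then a-z
STOI = {ch: i for i, ch in enumerate(CHARS)}   # character -> integer id

def bigrams(name):
    """Wrap a name in '.' markers and return its adjacent character pairs."""
    padded = "." + name + "."
    return list(zip(padded, padded[1:]))

print(len(CHARS))       # 27
print(bigrams("ava"))   # [('.', 'a'), ('a', 'v'), ('v', 'a'), ('a', '.')]
```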
Two ways to build it (same model)
By counting. Make a 27x27 table of how often each character follows each other character. Normalize each row (divide by its total) to get probabilities. Sample from a row to pick the next character.
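A minimal sketch of the counting method, under two assumptions: the three-name corpus is a made-up stand-in for the real list, and the table is stored as nested dictionaries holding only the observed pairs rather than a dense 27x27 array.

```python
import random
from collections import Counter, defaultdict

names = ["ava", "anna", "nan"]   # hypothetical stand-in for the ~30k-name corpus

# 1. Count how often each character follows each other character.
counts = defaultdict(Counter)
for name in names:
    padded = "." + name + "."
    for cur, nxt in zip(padded, padded[1:]):
        counts[cur][nxt] += 1

# 2. Normalize each row by its total to get probabilities.
probs = {
    cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
    for cur, row in counts.items()
}

# 3. Sample from the current character's row until '.' comes up again.
def sample_name(rng, max_len=20):
    out, cur = "", "."
    while len(out) < max_len:
        row = probs[cur]
        cur = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        if cur == ".":
            break
        out += cur
    return out

print(probs["."])
print([sample_name(random.Random(seed)) for seed in range(5)])
```

With this corpus, two of the three names start with `a`, so the `.` row assigns `a` a probability of 2/3 and `n` a probability of 1/3.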
As a neural network. One-hot the current character (length-27 vector, single 1) -> single linear layer (27x27 weights, no bias) -> 27 outputs read as log-counts -> softmax (exponentiate, then normalize) -> probabilities. Train on negative log likelihood with the engine from Phase 1. The trained softmax converges to the same probabilities the counting method gives directly.
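The network version can be sketched without any framework. This example does not use the Phase 1 engine: it swaps in a hand-derived gradient (softmax output minus the one-hot target) and plain full-batch gradient descent. The tiny corpus, the zero initialization, the learning rate, and the step count are all assumptions chosen to keep the run short and repeatable.

```python
import math

names = ["ava", "anna", "nan"]                 # hypothetical tiny corpus
chars = ["."] + sorted(set("".join(names)))    # ['.', 'a', 'n', 'v']
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)                                 # 4 here, 27 for the full alphabet

# Training pairs: (current character id, next character id).
pairs = []
for name in names:
    padded = "." + name + "."
    pairs += [(stoi[a], stoi[b]) for a, b in zip(padded, padded[1:])]

def softmax(logits):
    m = max(logits)                            # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]   # exponentiate ...
    total = sum(exps)
    return [e / total for e in exps]           # ... then normalize

def forward(W, i):
    # A one-hot input times W just selects row i, so the logits are W[i].
    return softmax(W[i])

W = [[0.0] * V for _ in range(V)]              # V x V weights, no bias
lr, steps = 1.0, 3000                          # assumed hyperparameters

for _ in range(steps):
    grad = [[0.0] * V for _ in range(V)]
    for i, j in pairs:
        p = forward(W, i)
        for k in range(V):
            # Gradient of -log p[j] w.r.t. W[i][k]: p[k] - 1 when k == j, else p[k].
            grad[i][k] += (p[k] - (1.0 if k == j else 0.0)) / len(pairs)
    for i in range(V):
        for k in range(V):
            W[i][k] -= lr * grad[i][k]

loss = -sum(math.log(forward(W, i)[j]) for i, j in pairs) / len(pairs)
print(f"mean NLL: {loss:.3f}")
print("learned P(a|.):", forward(W, stoi["."])[stoi["a"]], "counted:", 2 / 3)
```

The learned probability approaches the counted 2/3 but does not reach it exactly: pairs that never occur in the data keep a small leftover probability that shrinks only gradually as training continues.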
Negative log likelihood (the loss)
For one bigram: loss = -log(probability the model assigned it). Model quality = the average over every bigram; lower is better. Log because a product of many small probabilities underflows; negate so “better” means “smaller.”
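The underflow argument is easy to see numerically. In this sketch the 200 probabilities of 0.01 are made-up values: their true product is 1e-400, which is smaller than any positive double-precision float, while the sum of logs stays an ordinary number.

```python
import math

probs = [0.01] * 200            # hypothetical per-bigram probabilities

product = 1.0
for p in probs:
    product *= p                # true value 1e-400 underflows to zero

log_likelihood = sum(math.log(p) for p in probs)
nll = -log_likelihood / len(probs)   # average negative log likelihood

print(product)                  # 0.0
print(log_likelihood)           # a finite negative number
print(nll)                      # about 4.605, i.e. -log(0.01)
```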
Worked example
Counts after a: n four times, b once (row total 5).
P(n|a) = 4/5 = 0.8
P(b|a) = 1/5 = 0.2
loss on a -> n: -log(0.8) = 0.223
loss on a -> b: -log(0.2) = 1.609
average of these two losses: (0.223 + 1.609)/2 = 0.916
average over all five bigrams in the row: (4 x 0.223 + 1.609)/5 = 0.500
Confident-correct (0.8) -> small loss; unlikely (0.2) -> bigger loss; “impossible” (probability 0) -> infinite loss. Here log is the natural logarithm.
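The arithmetic in the worked example can be checked with a short script. The helper `nll` and the variable names are illustrative. It computes the two-loss average and also the count-weighted average over all five bigrams in the row, which is the figure the "average over every bigram" definition would give on this data.

```python
import math

counts_after_a = {"n": 4, "b": 1}
total = sum(counts_after_a.values())
probs = {ch: c / total for ch, c in counts_after_a.items()}

def nll(p):
    """Loss for one bigram; a zero probability gives infinite loss."""
    return -math.log(p) if p > 0 else math.inf

loss_n = nll(probs["n"])
loss_b = nll(probs["b"])
avg_two = (loss_n + loss_b) / 2     # average of the two distinct bigram losses
avg_row = sum(c * nll(probs[ch]) for ch, c in counts_after_a.items()) / total

print(round(loss_n, 3), round(loss_b, 3), round(avg_two, 3))   # 0.223 1.609 0.916
print(round(avg_row, 3))            # 0.5
print(nll(0.0))                     # inf
```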
Why bigram is weak
One character of context throws away almost everything before it. That is the motivation for the next lesson (an MLP fed several previous characters), not a bug to fix here.
Why it matters for AI
A large language model does exactly this: assign a probability over the next piece of text, sample, append, repeat. The differences are scale and reach, not kind: tokens instead of characters, thousands of tokens of context instead of one character, a transformer instead of a single linear layer. The predict-and-sample core is unchanged, which is why a model samples (and can surprise you) rather than looking up a fixed answer.
The one-line version
A bigram model assigns a probability to each next character from the current one; build it by counting-and-normalizing or as a one-layer softmax network trained on negative log likelihood, then sample to generate names. It is a large language model in miniature.