Practice: makemore, the bigram model
Self-check
Five short questions. Try to answer each in your head (or on paper) before reading its answer. Active retrieval is where the learning sticks; rereading is comfortable but does much less.
1. What is the one narrow thing a language model does, and what does generating text amount to?
It assigns a probability to each possible next piece of text, given the text so far. Generating is just that ability applied in a loop: predict the next piece, sample one according to the probabilities, append it, and predict again. A chatbot streaming a reply token by token is running exactly this loop.
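The loop described above can be sketched in a few lines. This is a minimal illustration, not makemore's code: the table `NEXT_CHAR_PROBS`, its made-up probabilities, and the helper `generate` are all hypothetical, and `.` is used as the start and end marker.

```python
import random

# Hypothetical next-character table over a tiny alphabet; "." marks start and end.
# The probabilities are invented for illustration, not learned from data.
NEXT_CHAR_PROBS = {
    ".": {"a": 0.6, "n": 0.4},
    "a": {"n": 0.5, ".": 0.5},
    "n": {"a": 0.7, ".": 0.3},
}

def generate(rng, max_len=20):
    """Predict, sample, append, repeat until the end marker is drawn."""
    out = []
    current = "."
    while len(out) < max_len:
        dist = NEXT_CHAR_PROBS[current]            # predict: distribution over the next character
        chars = list(dist.keys())
        weights = list(dist.values())
        current = rng.choices(chars, weights=weights, k=1)[0]  # sample one
        if current == ".":
            break                                  # end marker: stop generating
        out.append(current)                        # append, then predict again
    return "".join(out)

rng = random.Random(0)  # seed chosen arbitrarily so runs are repeatable
names = [generate(rng) for _ in range(5)]
print(names)
```

Swapping the hand-written table for a learned one changes the quality of the output, not the shape of the loop.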
2. What does “bigram” mean here, and what is the role of the `.` token?
A bigram is a pair of adjacent characters; a bigram model predicts the next character from only the current one. The `.` token marks both the start and the end of a name (so `ava` becomes `.ava.`). The starting `.` lets the model learn which characters tend to begin a name; the ending `.` lets it learn when to stop.
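As a small sketch of the idea, the pairs for one name can be listed directly. The helper name `bigrams` is an assumption made for this example.

```python
def bigrams(name):
    """Return the adjacent character pairs of a name wrapped in '.' boundary markers."""
    chars = ["."] + list(name) + ["."]
    return list(zip(chars, chars[1:]))

pairs = bigrams("ava")
print(pairs)  # [('.', 'a'), ('a', 'v'), ('v', 'a'), ('a', '.')]
```

The first pair is what teaches the model how names start, and the last pair is what teaches it when to stop.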
3. How do you turn a row of raw counts into predictions?
Normalize it: divide each count by the row’s total, so the row becomes probabilities that sum to 1. The cell at row `a`, column `n` counts how often `n` followed `a`; dividing by the `a` row total gives `P(n | a)`, the probability the model assigns to `n` coming after `a`.
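A minimal sketch of counting and normalizing, assuming a hypothetical three-name dataset (the real model uses a 27x27 table over a much larger list of names). The helper `normalize` is named here for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset, for illustration only.
names = ["ana", "anna", "nan"]

# counts[cur][nxt] = how often nxt followed cur
counts = defaultdict(Counter)
for name in names:
    chars = ["."] + list(name) + ["."]
    for cur, nxt in zip(chars, chars[1:]):
        counts[cur][nxt] += 1

def normalize(row):
    """Divide each count by the row total so the row sums to 1."""
    total = sum(row.values())
    return {ch: c / total for ch, c in row.items()}

probs = {cur: normalize(row) for cur, row in counts.items()}

print(dict(counts["a"]))  # 'n' followed 'a' 3 times, '.' followed 'a' 2 times
print(probs["a"])         # so P(n | a) = 3/5 and P(. | a) = 2/5
```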
4. Why is the loss the negative log likelihood, rather than just the likelihood?
The likelihood of the data is a product of many small probabilities, which underflows to zero and is awkward to work with. Taking the log turns the product into a sum; negating makes “better” correspond to “smaller,” which is what a loss needs. So the loss on one bigram is -log(probability the model gave it), and model quality is the average over all bigrams.
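The underflow is easy to reproduce. The sketch below assumes, purely for illustration, a model that gave probability 0.05 to each of 1000 observed bigrams.

```python
import math

# Hypothetical: the model assigned probability 0.05 to each of 1000 observed bigrams.
probs = [0.05] * 1000

likelihood = math.prod(probs)                      # product of many small numbers
log_likelihood = sum(math.log(p) for p in probs)   # the same quantity as a sum of logs
nll = -log_likelihood / len(probs)                 # average negative log likelihood

print(likelihood)  # 0.0, because the true value (about 1e-1301) is below float range
print(nll)         # about 3.0, i.e. -log(0.05)
```

The product carries no usable information once it hits zero, while the averaged log form stays in a comfortable numeric range regardless of dataset size.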
5. What are the two ways to build the bigram model, and why prefer the neural-network one despite counting being simpler?
Counting: tally character pairs and normalize each row. Neural network: one-hot the current character, pass it through a single linear layer, softmax the outputs into probabilities, and train on negative log likelihood. For a bigram they converge to the same probabilities, and counting is faster. The network is preferred because it generalizes: you can grow it with more context and more layers, where a count table cannot follow, because its size grows exponentially with the length of the context.
Try it yourself
Build one row of a bigram model by hand, score two of its predictions, and (optionally) confirm against the real makemore.
Setup. In a small dataset, the character s is followed by h three times and by a once, and never by anything else.
Steps.
- Write the row total for `s`.
- Normalize the row into probabilities: `P(h | s)` and `P(a | s)`.
- Compute the negative log likelihood loss on the bigram `s -> h` and on `s -> a`, using `loss = -log(probability)` (natural log).
- Average the two losses.
- Describe how you would sample the character after `s`: which letter comes up more often, and roughly in what proportion?
Expected outcome.
row total for 's': 3 + 1 = 4
P(h|s) = 3/4 = 0.75    P(a|s) = 1/4 = 0.25
loss on s -> h: -log(0.75) = 0.288
loss on s -> a: -log(0.25) = 1.386
average: (0.288 + 1.386)/2 = 0.837

The likely, correct prediction (h at 0.75) earns a small loss; the less likely one (a at 0.25) earns a bigger one. To sample the next character you would draw at random weighted by the row, so about three times out of four you would get h and about one time in four a. That weighted draw, repeated character by character, is how the model generates a whole name.
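If you want to check the hand calculation, a short script reproduces it. This is a sketch using only the counts from the setup; the seed and the number of draws are arbitrary choices.

```python
import math
import random

row = {"h": 3, "a": 1}                    # counts after 's', taken from the setup
total = sum(row.values())
probs = {ch: c / total for ch, c in row.items()}
losses = {ch: -math.log(p) for ch, p in probs.items()}   # natural log
average = sum(losses.values()) / len(losses)

print(total, probs)                                        # 4 {'h': 0.75, 'a': 0.25}
print({ch: round(v, 3) for ch, v in losses.items()})       # {'h': 0.288, 'a': 1.386}
print(round(average, 3))                                   # 0.837

# Weighted sampling: 'h' should come up in roughly three draws out of four.
rng = random.Random(0)                    # seed chosen arbitrarily
draws = rng.choices(list(probs), weights=list(probs.values()), k=10_000)
share_h = draws.count("h") / len(draws)
print(share_h)
```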
Confirm it against the real thing (optional). Andrej Karpathy’s makemore repo builds this exact model on a real list of ~30k names. Run the bigram section, look at a row of the normalized probability matrix, and generate a few names; then read off the model’s average negative log likelihood over the whole dataset. Seeing your by-hand row and loss match the code’s, and watching the sampled names come out roughly name-like, makes the model concrete.
Flashcards
Eight cards.
Q. What does a language model do, in one sentence?
Given the text so far, it assigns a probability to each possible next piece of text. Generating text is predict-next, sample one, append, repeat.
Q. What is a bigram model?
A model that predicts the next character from only the single current character (a bigram is a pair of adjacent characters). Crude (one character of context) but a complete, hand-buildable language model.
Q. What is the `.` token for?
It marks both the start and the end of a name: `ava` becomes `.ava.`. The starting dot teaches which characters begin names; the ending dot teaches when to stop.
Q. How do you build the bigram model by counting?
Tally a 27x27 table of how often each character follows each other character, then normalize each row (divide by its total) into probabilities. Sample from a row to pick the next character.
Q. How do you build the same model as a neural network?
One-hot the current character (length-27 vector), pass through a single linear layer (27x27, no bias), softmax the 27 outputs into probabilities, and train on negative log likelihood with the autograd engine. It converges to the same table as counting.
Q. What is softmax?
Exponentiate a set of numbers (making them positive) then normalize them to sum to 1, turning any vector into a probability distribution. It turns the network’s raw outputs (log-counts) into next-character probabilities.
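A minimal softmax sketch, using the log-count reading from this card. Subtracting the maximum before exponentiating is a common stability trick added here; it does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate, then normalize so the outputs sum to 1."""
    m = max(logits)                       # shifting by the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Reading logits as log-counts: log(3) and log(1) recover a 3:1 row.
probs = softmax([math.log(3), math.log(1)])
print(probs)  # approximately [0.75, 0.25]
```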
Q. What is negative log likelihood, and why log?
loss = -log(probability the model gave each real bigram), averaged over all bigrams; lower is better. Log turns a product of many tiny probabilities into a sum (avoiding underflow); negating makes “better” mean “smaller”, as a loss should.
Q. How does the bigram model relate to a large language model?
Same skeleton: assign a probability over the next piece of text, sample, append, repeat. An LLM differs in scale and reach (tokens instead of characters, thousands of tokens of context instead of one character, a transformer instead of one linear layer), but the predict-and-sample core is identical.