makemore bigram model: cheatsheet

What a language model does

Given the text so far, assign a probability to each possible next piece. Generating text = predict next, sample one, append, repeat.
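
A minimal sketch of that loop in Python. `next_probs` is a hypothetical stand-in for any model, with invented probabilities, and `.` is assumed to be the stop symbol; only the predict, sample, append, repeat structure is the point.

```python
import random

# Hypothetical toy "model": maps the text so far to a distribution over next pieces.
# The probabilities are invented for illustration, not learned from data.
def next_probs(text):
    if text.endswith("a"):
        return {"n": 0.7, "b": 0.2, ".": 0.1}
    return {"a": 0.6, ".": 0.4}

def generate(rng, max_len=20):
    text = ""
    for _ in range(max_len):
        probs = next_probs(text)                          # predict
        pieces = list(probs)
        weights = [probs[p] for p in pieces]
        piece = rng.choices(pieces, weights=weights)[0]   # sample
        if piece == ".":                                  # assumed stop symbol
            break
        text += piece                                     # append, then repeat
    return text

rng = random.Random(0)
samples = [generate(rng) for _ in range(5)]
print(samples)
```

Because each step samples rather than taking the most likely piece, repeated calls give different outputs.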

The bigram setup

Choice	Value
Unit	a single character; the vocabulary is 26 letters + `.` = 27 symbols
Context	only the current character (one character of memory)
`.` token	marks both the start and the end of a name; `ava` -> `.ava.`
Corpus	a list of ~30k names

A bigram is a pair of adjacent characters. The model predicts the next character from the current one alone.
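
As a small illustration, the bigrams of one name can be extracted like this. The helper name `bigrams` is an assumption for this sketch, not something from the original code.

```python
def bigrams(name):
    # Wrap the name in the "." marker so start and end become ordinary bigrams.
    chars = ["."] + list(name) + ["."]
    return list(zip(chars, chars[1:]))

pairs = bigrams("ava")
print(pairs)  # [('.', 'a'), ('a', 'v'), ('v', 'a'), ('a', '.')]
```

A name of n characters yields n + 1 bigrams, because the start and end markers each contribute one.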

Two ways to build it (same model)

By counting. Make a 27x27 table of how often each character follows each other character. Normalize each row (divide by its total) to get probabilities. Sample from a row to pick the next character.
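
The counting recipe can be sketched in plain Python. The three-name list is a hypothetical stand-in for the real corpus, and putting `.` at index 0 is an assumption of this sketch.

```python
import random
import string

vocab = "." + string.ascii_lowercase          # 27 symbols; "." at index 0 is an assumption
stoi = {ch: i for i, ch in enumerate(vocab)}

names = ["ava", "anna", "ab"]                 # hypothetical stand-in for the ~30k-name corpus

# counts[i][j] = how often symbol j follows symbol i
counts = [[0] * 27 for _ in range(27)]
for name in names:
    chars = "." + name + "."
    for cur, nxt in zip(chars, chars[1:]):
        counts[stoi[cur]][stoi[nxt]] += 1

# Normalize each row; rows never seen as a context are left as all zeros here.
probs = []
for row in counts:
    total = sum(row)
    probs.append([c / total if total else 0.0 for c in row])

# Sample the character that follows "a" from its row.
rng = random.Random(0)
row_a = probs[stoi["a"]]
next_char = rng.choices(vocab, weights=row_a)[0]
print(next_char)
```

Any bigram that never occurs keeps a count of 0 and so gets probability exactly 0 under this scheme.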

As a neural network. One-hot the current character (length-27 vector, single 1) -> single linear layer (27x27 weights, no bias) -> 27 outputs read as log-counts -> softmax (exponentiate, then normalize) -> probabilities. Train by minimizing the average negative log likelihood with the engine from Phase 1. With enough training and no regularization, the softmax output approaches the same probabilities the counting method gives directly.
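
A minimal sketch of that pipeline in plain Python, under stated assumptions: a toy 3-symbol vocabulary instead of 27, a hand-written gradient in place of the Phase 1 engine, and an arbitrary learning rate and step count. The toy data (a -> n four times, a -> b once) is invented for illustration.

```python
import math

vocab = ["a", "b", "n"]                       # toy vocabulary (the real model uses 27 symbols)
V = len(vocab)
# Toy training bigrams as (current index, next index): a->n four times, a->b once.
data = [(0, 2)] * 4 + [(0, 1)]

W = [[0.0] * V for _ in range(V)]             # V x V weights, no bias

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(cur):
    # one_hot(cur) times W simply selects row `cur` of W: those are the log-counts.
    return softmax(W[cur])

lr = 1.0                                      # arbitrary choice for this sketch
for step in range(2000):
    grad = [[0.0] * V for _ in range(V)]
    for cur, nxt in data:
        p = forward(cur)
        for j in range(V):
            # Gradient of -log p[nxt] with respect to the logits is p - one_hot(nxt).
            grad[cur][j] += (p[j] - (1.0 if j == nxt else 0.0)) / len(data)
    for i in range(V):
        for j in range(V):
            W[i][j] -= lr * grad[i][j]

p_a = forward(0)
print([round(x, 3) for x in p_a])
```

The row for `a` ends up close to the count-based values (0 for `a`, 0.2 for `b`, 0.8 for `n`). The probability of the unseen bigram shrinks toward 0 but never reaches it exactly, since a softmax output is always positive.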

Negative log likelihood (the loss)

For one bigram: loss = -log(probability the model assigned it), where log is the natural logarithm. Model quality = the average over every bigram; lower is better. Log because a product of many small probabilities underflows; negate so “better” means “smaller.”
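
A short sketch of the loss and of the underflow argument. The function name `average_nll` and the list of per-bigram probabilities are assumptions made for illustration.

```python
import math

def average_nll(probs):
    # probs: the probability the model assigned to each observed bigram
    return -sum(math.log(p) for p in probs) / len(probs)

# Why logs: multiplying many small probabilities underflows to 0.0 in floating point,
# while the sum of their logs stays finite.
probs = [0.05] * 1000                         # hypothetical per-bigram probabilities
product = math.prod(probs)                    # 0.05**1000 is far below the smallest float
log_sum = sum(math.log(p) for p in probs)

print(product)                                # 0.0
print(log_sum)
print(average_nll(probs))
```

Averaging rather than summing keeps the loss comparable across datasets of different sizes.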

Worked example

Counts after a: n four times, b once (row total 5).

P(n|a) = 4/5 = 0.8        P(b|a) = 1/5 = 0.2
loss on a -> n:  -log(0.8) = 0.223
loss on a -> b:  -log(0.2) = 1.609
average of these two:   (0.223 + 1.609)/2 = 0.916
average over all five:  (4*0.223 + 1.609)/5 = 0.500

Confident-correct (0.8) -> small loss; unlikely (0.2) -> bigger loss; “impossible” (probability 0) -> infinite loss.
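
The arithmetic can be checked in a few lines of Python. The helper `nll` is an assumption of this sketch; it returns infinity for probability 0 because `math.log(0)` raises an error rather than returning `-inf`.

```python
import math

counts_after_a = {"n": 4, "b": 1}
total = sum(counts_after_a.values())
probs = {ch: c / total for ch, c in counts_after_a.items()}

def nll(p):
    # math.log(0) raises ValueError, so the zero case is handled explicitly.
    return math.inf if p == 0 else -math.log(p)

loss_n = nll(probs["n"])
loss_b = nll(probs["b"])
print(round(loss_n, 3), round(loss_b, 3))     # 0.223 1.609

# Average of the two losses, one bigram of each kind.
print(round((loss_n + loss_b) / 2, 3))        # 0.916

# Average over all five observed bigrams (four a->n, one a->b).
print(round((4 * loss_n + loss_b) / 5, 3))    # 0.5

# A bigram the model considers impossible.
print(nll(0.0))                               # inf
```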

Why bigram is weak

One character of context throws away almost everything before it. That is the motivation for the next lesson (an MLP fed several previous characters), not a bug to fix here.

Why it matters for AI

A large language model does exactly this: assign a probability over the next piece of text, sample, append, repeat. The differences are scale and reach, not kind: tokens instead of characters, thousands of tokens of context instead of one character, a transformer instead of a single linear layer. The predict-and-sample core is unchanged, which is why a model samples (and can surprise you) rather than looking up a fixed answer.

The one-line version

A bigram model assigns a probability to each next character from the current one; build it by counting-and-normalizing or as a one-layer softmax network trained on negative log likelihood, then sample to generate names. It is a large language model in miniature.