Cheatsheet: building the GPT tokenizer
What the tokenizer does
It is the translator between raw text and the integer token IDs a GPT consumes (and back). A separate stage with its own training, sitting in front of the model.
Why subword tokens
| Unit | Problem |
|---|---|
| Characters | sequences far too long, wastes the limited context window |
| Words | vocabulary explodes; unseen words have no token |
| Subwords (BPE) | common chunks get one token, rare strings spelled from pieces, anything representable |
Byte-pair encoding (BPE) loop
- Start with the basic units (bytes / characters) as the vocabulary.
- Find the most frequent adjacent pair of tokens in the training text.
- Merge it into a new token; add to vocabulary; record the merge.
- Repeat until the vocabulary hits a target size.
Common patterns become single tokens. The ordered list of merges, together with the base vocabulary, is the tokenizer.
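The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer: it starts from characters rather than bytes, names each new token by its spelled-out text, and the helper names (`count_pairs`, `merge_pair`, `train_bpe`) are invented for this sketch.

```python
from collections import Counter


def count_pairs(tokens):
    """Count every adjacent pair of tokens in the sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_token):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`, starting from characters."""
    tokens = list(text)
    merges = []  # ordered list of (pair, new_token)
    for _ in range(num_merges):
        pairs = count_pairs(tokens)
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        # Most frequent pair; ties go to the pair encountered first.
        pair = max(pairs, key=pairs.get)
        new_token = pair[0] + pair[1]  # name the token by the text it covers
        tokens = merge_pair(tokens, pair, new_token)
        merges.append((pair, new_token))
    return merges, tokens


merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Tie-breaking is a choice this sketch makes on its own: when two pairs are equally frequent it takes the one seen first, so the intermediate merges can differ from a hand-worked example even when the final segmentation matches.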
Worked merge (the classic example)

    aaabdaaabac (11 symbols)
    merge aa->Z: ZabdZabac
    merge ab->Y: ZYdZYac
    merge ZY->X: XdXac (X = ZY = aaab; 11 symbols -> 5)

Merges are layered: X is built from Z and Y, which are built from base symbols.
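To check the worked example mechanically, the three merges can be replayed in order. This is a small sketch that assumes left-to-right, non-overlapping replacement; `merge_pair` and `steps` are names made up for the illustration.

```python
def merge_pair(tokens, pair, new_token):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = list("aaabdaaabac")
# The merges from the worked example, in the order they were made.
steps = [(("a", "a"), "Z"), (("a", "b"), "Y"), (("Z", "Y"), "X")]

history = ["".join(tokens)]
for pair, new_token in steps:
    tokens = merge_pair(tokens, pair, new_token)
    history.append("".join(tokens))
    print(f"merge {pair[0]}{pair[1]}->{new_token}: {''.join(tokens)} ({len(tokens)} symbols)")
# merge aa->Z: ZabdZabac (9 symbols)
# merge ab->Y: ZYdZYac (7 symbols)
# merge ZY->X: XdXac (5 symbols)
```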
Encode / decode

    encode "aaab": aaab -> (aa->Z) Zab -> (ab->Y) ZY -> (ZY->X) X = one token
    decode X: X -> ZY -> aa,ab -> "aaab"
    encode "dac": no learned merges apply -> d, a, c = three tokens

Common gets short; rare stays spelled out. Trained once on a corpus, separate from the model.
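A minimal sketch of encode and decode, assuming the three merges from the worked example are the whole learned merge list. The letters Z, Y, X stand in for new tokens as in the example. Real tokenizers use integer IDs, which avoids any clash with text that already contains those letters. The names `MERGES`, `encode` and `decode` are illustrative.

```python
# Learned merges in training order: (pair, new_token).
MERGES = [(("a", "a"), "Z"), (("a", "b"), "Y"), (("Z", "Y"), "X")]


def merge_pair(tokens, pair, new_token):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def encode(text, merges=MERGES):
    """Split into base symbols, then apply every merge in the order it was learned."""
    tokens = list(text)
    for pair, new_token in merges:
        tokens = merge_pair(tokens, pair, new_token)
    return tokens


def decode(tokens, merges=MERGES):
    """Expand merged tokens back into base symbols and join them."""
    expansion = {new_token: pair for pair, new_token in merges}
    out = []
    stack = list(reversed(tokens))  # pop() then yields tokens left to right
    while stack:
        tok = stack.pop()
        if tok in expansion:
            left, right = expansion[tok]
            stack.append(right)  # pushed first so `left` is popped first
            stack.append(left)
        else:
            out.append(tok)
    return "".join(out)


print(encode("aaab"))   # ['X']
print(decode(["X"]))    # aaab
print(encode("dac"))    # ['d', 'a', 'c']
```

Decoding needs no frequency information at all: it only undoes the recorded merges, which is why the round trip is exact for any input built from the base symbols.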
Byte-level
GPT-style tokenizers (GPT-2 onward) run BPE over UTF-8 bytes, so any character (any language, emoji) is representable, with no “unknown token” gaps. Typical target vocabulary size: tens of thousands of tokens (GPT-2 uses 50,257). A page of text (thousands of characters) becomes hundreds of tokens.
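The byte-level starting point is easy to inspect directly. In this sketch the base vocabulary is assumed to be the 256 possible byte values, with learned merges adding further IDs on top. The function names are made up for the illustration.

```python
def to_byte_ids(text):
    """Base token IDs for byte-level BPE: the UTF-8 bytes of the text (0..255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Inverse of to_byte_ids: reassemble the bytes and decode them as UTF-8."""
    return bytes(ids).decode("utf-8")


for sample in ["hello", "héllo", "你好", "🙂"]:
    ids = to_byte_ids(sample)
    print(repr(sample), len(sample), "chars ->", len(ids), "bytes:", ids)
    assert from_byte_ids(ids) == sample  # lossless round trip
```

ASCII characters take one byte each, while other characters take two to four, so text in some scripts starts from a longer byte sequence before any merges are applied.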
Quirks that come from tokenization (not the model)
- Spelling / letter-counting (“how many r’s in strawberry”): the model sees tokens, not letters.
- Arithmetic: numbers chunked into tokens inconsistently.
- Whitespace sensitivity: a leading space makes a different token.
- Weaker in some languages: more tokens per word, less effective context.
The whole pipeline (now complete)

    text -> tokenizer -> token IDs -> token + position embeddings
      -> stack of transformer blocks (attention + feed-forward, residual + norm)
      -> softmax -> next-token probability -> sample -> tokenizer (decode) -> text

Every piece built from scratch across this track.
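The tail of the pipeline (softmax, sample, decode) can be sketched with the standard library alone. The four-entry vocabulary and the scores below are invented for illustration. In a real model the scores come out of the transformer stack, with one score per vocabulary entry.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(probs, rng):
    """Draw one token ID at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical vocabulary and made-up model scores, for illustration only.
vocab = ["a", "b", "c", "d"]
scores = [2.0, 1.0, 0.1, -1.0]

probs = softmax(scores)
rng = random.Random(0)  # seeded so the run is repeatable
next_id = sample_next(probs, rng)
print([round(p, 3) for p in probs])
print(next_id, "->", vocab[next_id])  # the vocabulary lookup is the decode step
```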
The one-line version
A tokenizer uses byte-pair encoding (repeatedly merge the most frequent adjacent pair) to turn text into subword token IDs and back; it is a separate trained stage, and a surprising share of language-model quirks (spelling, arithmetic, whitespace) trace to it, not to the model.