Summary: building the GPT tokenizer
TL;DR. Every prior lesson assumed text arrives split into tokens; this final lesson builds the tokenizer that does the splitting. Characters make sequences too long and words make the vocabulary explode, so tokenizers use subword units built by byte-pair encoding: repeatedly merge the most frequent adjacent pair into a new token until the vocabulary hits a target size, so common chunks become single tokens. It runs over UTF-8 bytes (so any text is representable), is trained separately from the model, and explains a surprising share of language-model quirks. With it, the whole pipeline from raw text to generated text is complete, and built from nothing.
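The byte-level claim is easy to check directly. The snippet below is a small illustration, not part of the lesson's code: the sample string and the variable names are assumptions chosen for the demo. It shows that accented, CJK, and emoji characters all reduce to integers below 256, and that raw bytes alone give a longer sequence than characters do, which is the cost the merges then win back.

```python
# Sample text written with escapes so the byte counts are unambiguous:
# "café", two CJK characters, and a waving-hand emoji.
text = "caf\u00e9 \u6771\u4eac \U0001F44B"

# Base tokens before any merges: one integer per UTF-8 byte.
raw = list(text.encode("utf-8"))

# Every symbol fits in a fixed alphabet of 256 values...
fits_in_256 = all(0 <= b < 256 for b in raw)

# ...and the original text is recovered exactly.
round_trips = bytes(raw).decode("utf-8") == text

print(len(text), "characters ->", len(raw), "bytes")  # 9 characters -> 17 bytes
print(fits_in_256, round_trips)                       # True True
```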
Core ideas
-
The tokenizer is a separate stage. It translates raw text to token IDs and back, with its own training set and procedure (BPE merges, not gradient descent). The model is trained on whatever tokens it produces.
-
Subword tokens are the middle ground. Characters make sequences too long (wasting context); words make the vocabulary explode and break on unseen words. Subwords give common chunks single tokens and spell rare strings from pieces.
-
Byte-pair encoding builds the vocabulary by merging. Repeatedly find the most frequent adjacent pair and merge it into a new token. Worked once: aaabdaaabac becomes XdXac in three merges, and abababab trains to merges that encode abab as a single token. The merges are the tokenizer.
-
Encode, decode, and bytes. Encode by applying the learned merges in the order they were learned; decode by expanding each token back into the bytes it stands for and reading those bytes as UTF-8 text. Real tokenizers run BPE over UTF-8 bytes so any character is representable, with a target vocabulary of tens of thousands of tokens.
-
Many quirks come from tokenization, not the model. Spelling and letter-counting trouble, shaky arithmetic, whitespace sensitivity, and weaker performance in languages that get split into many small tokens all trace in large part to the tokenizer, since the model sees tokens, not letters.
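The merge loop, encoding, and decoding summarized in the list above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lesson's or minbpe's actual code: the function names (pair_counts, merge, train, encode, decode) are illustrative, new token IDs are assumed to start at 256 (right after the raw byte values), and ties between equally frequent pairs are broken by whichever pair appears first.

```python
from collections import Counter

def pair_counts(ids):
    """Count every adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, so nothing to merge
        pair = max(counts, key=counts.get)  # assumption: first-seen pair wins ties
        new_id = 256 + step                 # assumption: new IDs follow the 256 bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Text -> token IDs: start from bytes, apply merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Token IDs -> text: expand each token to its bytes, then read as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# The two worked examples from the list above.
m1 = train("aaabdaaabac", 3)
print(encode("aaabdaaabac", m1))  # [258, 100, 258, 97, 99], i.e. X d X a c

m2 = train("abababab", 2)
print(encode("abab", m2))         # [257], a single token
print(decode([257], m2))          # abab
```

In the first example 258 plays the role of X, and 100, 97, and 99 are the byte values of d, a, and c. The intermediate merges may differ from a hand-worked version depending on how ties are broken, but the final result is the same. Note that the tokenizer is nothing more than the `merges` table: both `encode` and `decode` are derived from it.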
What changes for you
You can now explain a whole class of language-model behavior that puzzles most users (why a model miscounts the letters in a word, or stumbles on arithmetic) by pointing at the tokenizer rather than the model. And you have closed the loop. You have now built the full pipeline end to end: text into the tokenizer, IDs into embeddings, through a stack of transformer blocks, a softmax predicting the next token, a sample, and the tokenizer turning it back into text. That is the whole of the track: you began with a single number and an autograd engine, and you finish holding the complete blueprint of a large language model, every piece built by hand. Nothing inside is a mystery, because you built all of it. To go further, read Karpathy’s minbpe and nanoGPT end to end; you will recognize every line.