GPT tokenizer, in brief

What you’ll learn

This is the final lesson of Phase 3 (Building a transformer) and the close of the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. You have built the entire GPT, but every step assumed text arrives already split into tokens. This lesson builds the piece that does the splitting: the tokenizer.

You will learn why tokenizers use subword units: characters make sequences too long, while whole words make the vocabulary explode and fail on unseen words. You will then build byte-pair encoding (BPE) from scratch: repeatedly merge the most frequent adjacent pair of tokens into a new token until the vocabulary reaches a target size, so that common chunks become single tokens. The lesson works a BPE training run and an encode/decode by hand, explains why real tokenizers run over UTF-8 bytes (so any text is representable), and shows that a surprising share of language-model quirks (trouble spelling words, shaky arithmetic, whitespace sensitivity, weaker performance in some languages) comes from this stage rather than from the model. With the tokenizer in place, the whole pipeline from raw text to generated text is complete, and built from nothing.
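As a preview of the training loop described above, here is a minimal Python sketch. It is an illustration under stated assumptions, not the lesson's or minbpe's reference code: the function names (pair_counts, merge, train_bpe) are made up for this example, the base vocabulary is assumed to be the 256 possible byte values, and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter

def pair_counts(ids):
    """Count every adjacent pair of token IDs in the sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or no pairs remain."""
    ids = list(text.encode("utf-8"))  # assumption: start from raw bytes, IDs 0..255
    merges = {}                       # (left_id, right_id) -> new_id, in learned order
    for new_id in range(256, vocab_size):
        counts = pair_counts(ids)
        if not counts:
            break                     # fewer than two tokens left, nothing to merge
        pair = counts.most_common(1)[0][0]  # most frequent pair; tie-breaking is arbitrary
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

# Three merges on a toy string: 256 byte tokens + 3 new tokens = vocabulary of 259.
merges, ids = train_bpe("aaabdaaabac", vocab_size=259)
print(len("aaabdaaabac"), "bytes ->", len(ids), "tokens")
```

On this toy string the first merge is the byte pair for "aa", and three merges shrink the 11 bytes to 5 tokens. A real tokenizer runs the same loop over a large corpus with a far larger target vocabulary.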

Where this fits

This is lesson 3 of Phase 3 and the last lesson of the track. The previous lessons built the full GPT that consumes tokens; this lesson builds the tokenizer that produces them, the front door to the entire pipeline. It closes the arc that began with a single Value and an autograd engine. Tokenizer, embeddings, attention, the transformer block, and the training loop: every piece is now built by hand. There is no next lesson; this is the finish line.

Before you start

Prerequisite (within this track): lesson 9, Assembling and training the full GPT, so that “the model consumes tokens” is concrete and you can see where the tokenizer sits in the pipeline. The tokenizer itself is independent of the neural-network machinery (it is trained by merging frequent pairs, not by gradient descent), so this lesson is lighter on the earlier math than its neighbors. If you know that a GPT takes token IDs in and predicts the next token, you are ready. No coding is required to follow along, though Karpathy’s minbpe is the minimal implementation to read or build afterward.

By the end, you’ll be able to

Explain what a tokenizer does and why it is a separate stage from the model
Explain why subword tokens beat both characters and words
State the byte-pair-encoding training loop and run it by hand on a small string
Encode and decode with a trained set of BPE merges
Recognize that many language-model quirks (spelling, arithmetic, whitespace, language coverage) come from tokenization, not the model
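To make the encode and decode objective above concrete, here is a minimal Python sketch. The merge table is hypothetical, written by hand as if it had been learned from a tiny corpus, and the function names are illustrative. It assumes a byte-level base vocabulary (IDs 0 to 255) and applies merges in the order they were learned.

```python
# Hypothetical merge table: (left_id, right_id) -> new_id, in learned order.
# 97 is the byte for "a" and 98 the byte for "b".
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def build_vocab(merges):
    """Map every token ID to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dict order = learned order
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab

def encode(text, merges):
    """Turn text into token IDs by replaying merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Among the pairs present, pick the one learned earliest (lowest new ID).
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, vocab):
    """Turn token IDs back into text by concatenating their bytes."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

vocab = build_vocab(merges)
print(encode("aaab", merges))            # the whole string collapses to one token
text = "héllo aaab"
print(decode(encode(text, merges), vocab) == text)
```

Because the base vocabulary covers every byte, text the merges never saw (such as "é", which is two UTF-8 bytes) still encodes and round-trips; it just costs more tokens.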

Time and difficulty

Read time: about 13 minutes
Practice time: about 18 minutes (training a tiny BPE tokenizer by hand and encoding/decoding, optionally reading minbpe, plus flashcards)
Difficulty: standard