Practice: building the GPT tokenizer

Self-check

Five short questions. Try to answer each in your head (or on paper) before reading the answer. Active retrieval is where the learning sticks; rereading is comfortable but does much less.

1. What does a tokenizer do, and is it part of the model?

Answer.

It is the translator between raw text and the integer token IDs a GPT consumes (and back again). It is not part of the model: it is a separate stage with its own training set and its own training procedure (byte-pair-encoding merges, not gradient descent). The model is then trained on whatever tokens the tokenizer produces.

2. Why use subword tokens instead of characters or words?

Answer.

Characters make sequences far too long, wasting the model’s limited context window. Words make the vocabulary explode and break on any unseen word (no token for it). Subword tokens are the middle ground: common chunks get a single token, while rare strings are spelled out from smaller pieces, so text is compact and anything is still representable.
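A small sketch may make the trade-off concrete. The training sentence, the new sentence, and the naive splitting (list() for characters, str.split() for words) are all illustrative assumptions, not how a real tokenizer works.

```python
# Toy comparison of character-level and word-level tokenization.
# Both sentences are made up for illustration.
train_text = "the cat sat on the mat"
new_text = "the cat ate"

char_tokens = list(train_text)    # one token per character
word_tokens = train_text.split()  # one token per whitespace-separated word
print(len(char_tokens), len(word_tokens))  # 22 6

# Word-level: any word absent from the training vocabulary has no token.
word_vocab = set(word_tokens)
unseen = [w for w in new_text.split() if w not in word_vocab]
print(unseen)  # ['ate']

# Character-level: the same new text is fully covered, just with a longer sequence.
char_vocab = set(train_text)
missing_chars = [c for c in new_text if c not in char_vocab]
print(missing_chars)  # []
```

The character sequence is almost four times longer than the word sequence, while the word vocabulary fails on "ate" even though every letter in it was seen during training. Subword tokens aim to sit between these two extremes.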

3. State the byte-pair-encoding training loop.

Answer.

Start with the basic units (bytes or characters) as the vocabulary. Find the most frequent adjacent pair of tokens in the training text, merge it into a single new token, add it to the vocabulary, and record the merge. Repeat until the vocabulary reaches a target size. The recorded merges are the tokenizer.
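The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names merge and train_bpe are hypothetical, the base vocabulary is assumed to be the 256 byte values, and ties between equally frequent pairs are broken by whichever pair appears first in the text.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return the merge table and the compressed text."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (IDs 0..255)
    merges = {}                       # (left_id, right_id) -> new_id, in learned order
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))  # count every adjacent pair
        if not counts:
            break                            # fewer than two tokens left
        pair, _ = counts.most_common(1)[0]   # most frequent pair (first seen wins ties)
        new_id = 256 + step                  # next unused token ID
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id                # record the merge
    return merges, ids

merges, ids = train_bpe("aaabdaaabac", 3)
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(ids)     # [258, 100, 258, 97, 99]
```

Here the stopping rule is a fixed number of merges; stopping at a target vocabulary size is the same thing, since each merge adds exactly one token to the 256 starting bytes.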

4. Why do real tokenizers operate on bytes, and what does that buy them?

Answer.

Because every character, in any language or script, including emoji, can be written as a sequence of one to four UTF-8 bytes, and a byte has only 256 possible values. Starting BPE from those 256 byte values means a tokenizer can represent any text with no “unknown token” gaps, while still merging common byte sequences into efficient single tokens.
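The byte-level claim is easy to check with Python's built-in str.encode. The sample strings below are arbitrary choices for illustration.

```python
# Each string becomes a sequence of byte values in 0..255, whatever the script.
samples = ["hello", "h\u00e9llo", "こんにちは", "👋"]  # "\u00e9" is e with an acute accent

byte_lengths = {}
for s in samples:
    ids = list(s.encode("utf-8"))              # raw bytes as integers
    assert all(0 <= i < 256 for i in ids)      # always inside the 256-value base vocabulary
    assert bytes(ids).decode("utf-8") == s     # decoding the bytes round-trips exactly
    byte_lengths[s] = len(ids)
    print(len(s), "characters ->", len(ids), "bytes")

# 5 characters -> 5 bytes
# 5 characters -> 6 bytes
# 5 characters -> 15 bytes
# 1 characters -> 4 bytes
```

The longer byte sequences for non-ASCII text are the cost of this guarantee; BPE merges are what win the compactness back for byte sequences that occur often in the training text.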

5. Name two language-model quirks that come from tokenization, not the model.

Answer.

Any two of: trouble spelling words or counting their letters (the model sees tokens, not letters, so “how many r’s in strawberry” is genuinely hard); shaky arithmetic (numbers get chunked into tokens inconsistently); whitespace sensitivity (a leading space makes a different token); and weaker performance in languages that BPE splits into more tokens per word (less effective context). All are artifacts of the tokenizer, not the transformer.

Try it yourself

Train a tiny byte-pair-encoding tokenizer by hand, then use it to encode and decode.

Setup. Your training text is the string abababab (eight characters). You will do two merges, then encode a new string and decode a token.

Steps.

1. Find the most frequent adjacent pair in abababab and merge it into a new token Z. Write the result.
2. Find the most frequent adjacent pair in that result and merge it into Y. Write the result.
3. Using your two merges, encode the new string abab into tokens.
4. Decode the token Y back into characters.

Expected outcome.

1. most frequent pair "ab" (4 times, vs. 3 for "ba") -> Z:   abababab -> ZZZZ
2. only remaining pair "ZZ" (3 overlapping occurrences, 2 merges left to right) -> Y:   ZZZZ -> YY      (Y = ZZ = abab)
3. encode "abab":  ab->Z gives ZZ,  ZZ->Y gives Y      -> one token: Y
4. decode Y:  Y -> ZZ -> ab,ab -> "abab"

The training string of eight characters compressed to two tokens (YY), and the common chunk abab became a single token Y. A string with no learned merges would simply stay as its individual characters. That is byte-pair encoding in miniature: frequent patterns earn short tokens, everything else is spelled out.
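If you want to check your hand-worked answers, the exercise can be replayed in a short script. This is a sketch under stated assumptions: it works on characters with the symbolic token names Z and Y from the exercise rather than on bytes and integer IDs, it assumes the input text never contains a literal Z or Y, and the helper names apply_merge, encode, and decode are hypothetical.

```python
# The two merges from the exercise, in the order they were learned.
merges = [(("a", "b"), "Z"), (("Z", "Z"), "Y")]

def apply_merge(tokens, pair, new_token):
    """Replace each non-overlapping occurrence of `pair` with `new_token`, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(text):
    """Split into characters, then apply every merge in learned order."""
    tokens = list(text)
    for pair, new_token in merges:
        tokens = apply_merge(tokens, pair, new_token)
    return tokens

def decode(tokens):
    """Expand merged tokens back into their parts until only characters remain."""
    expansion = {new_token: pair for pair, new_token in merges}
    out, stack = [], list(reversed(tokens))
    while stack:
        token = stack.pop()
        if token in expansion:
            stack.extend(reversed(expansion[token]))  # push the parts, left part on top
        else:
            out.append(token)
    return "".join(out)

print(encode("abababab"))  # ['Y', 'Y']
print(encode("abab"))      # ['Y']
print(decode(["Y"]))       # abab
print(encode("bbaa"))      # ['b', 'b', 'a', 'a']  (no learned merge applies)
```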

Confirm it against the real thing (optional). Andrej Karpathy’s minbpe is a minimal byte-pair-encoding tokenizer with train, encode, and decode. Train it on a paragraph of text, then encode a sentence and watch common words collapse to single tokens while rare ones split into pieces, and confirm that decoding the tokens reproduces your original text exactly.

Flashcards

Seven cards.

Q. What does a tokenizer do, and is it part of the model?

It translates raw text into the integer token IDs a GPT consumes, and back. It is not part of the model: a separate stage with its own training set and procedure (BPE merges, not gradient descent). The model is trained on whatever tokens it produces.

Q. Why subword tokens instead of characters or words?

Characters make sequences too long (wasting context); words make the vocabulary explode and break on unseen words. Subword tokens are the middle ground: common chunks get one token, rare strings are spelled from pieces, and anything is representable.

Q. State the byte-pair-encoding training loop.

Start with basic units (bytes/characters). Repeatedly: find the most frequent adjacent pair, merge it into a new token, add it to the vocabulary, record the merge. Stop at a target vocabulary size. The recorded merges are the tokenizer.

Q. How do you encode and decode with a trained BPE tokenizer?

Encode: apply the learned merges (in learned order) to new text until none apply, then read off the token IDs. Decode: expand each token back through its merges into characters. Example: with ab->Z and ZZ->Y, abab encodes to one token Y, and Y decodes back to abab.

Q. Why do real tokenizers run BPE over bytes?

Every character in any language or script (including emoji) is some sequence of UTF-8 bytes, so a byte-level tokenizer can represent any text with no unknown-token gaps, while still merging common byte sequences into efficient tokens.

Q. Name quirks that come from tokenization, not the model.

Trouble spelling/counting letters (the model sees tokens, not letters); shaky arithmetic (numbers chunked inconsistently); whitespace sensitivity (a leading space is a different token); weaker performance in languages split into more tokens per word.

Q. What is the full text-to-text pipeline of a GPT?

text -> tokenizer -> token IDs -> token + position embeddings -> stack of transformer blocks (attention + feed-forward, residual + layer norm) -> linear projection to vocabulary logits -> softmax -> next-token probabilities -> sample -> tokenizer decodes -> text. Every piece was built across this track.
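The last stages of that pipeline (scores over the vocabulary, softmax, sample) can be sketched without a model. The four scores below are made-up numbers standing in for what a trained network would produce, and the fixed random seed is only there to make the run repeatable.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 4-token vocabulary (not from a real model).
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

# Sample the next token ID in proportion to its probability.
rng = random.Random(0)
next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_id)
```

The sampled ID is then appended to the sequence and fed back in, and the tokenizer decodes the accumulated IDs into text.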