Tokenization, in brief

What you’ll learn

This is the opener of Phase 1 (How models read text) and the Track 5 entry point. A transformer never sees the text you typed. It sees a sequence of integer IDs from a fixed vocabulary, called tokens, and every word, code snippet, or piece of punctuation has been sliced into those IDs before any of the rest of the architecture runs. The Stanford CME 295 course materials (syllabus, schedule, the Amidi cheatsheets) are at cme295.stanford.edu.

The lesson opens with a classic AI failure (older models could not count the Rs in strawberry; newer frontier models like Llama 4, o3, and GPT-5.x have largely patched the specific case but the structural reason still bites slightly weirder versions). Then it builds the intuition for why neither whole-word vocabularies nor character-level vocabularies work, walks one byte-pair encoding merge by hand on a tiny corpus, pays off the strawberry mystery (a token is atomic to the model, so spelling questions live on the wrong side of the boundary), and closes with special tokens (BOS, EOS, chat-role markers, mask, pad) and the security surface they create.

Where this fits

This is lesson 1 of 3 in Phase 1, How models read text, and the Track 5 entry point. There is no previous lesson in the track. The next lesson is How words become vectors with meaning (embeddings), which picks up directly where this one leaves off (you have a sequence of token IDs; what does the model actually do with each one). Phase 1 closes with How models know word order, after which Phase 2 builds out attention and the rest of the transformer architecture.

Before you start

Prerequisites: none. This is the Track 5 entry point. If you have never seen the word vocabulary used in the AI sense, this lesson defines it as we go.

About the math

Track 5 covers AI Fundamentals. Most lesson bodies require no math beyond high-school algebra. Practice exercises occasionally use trigonometry (in radians), 2D rotation matrices, square roots, and softmax by hand. A calculator and comfortable algebra and trig will help you fully complete the practice sections. If math isn’t your strength, the worked solutions show every step, so you can read the answer to see the pattern even if you don’t compute it yourself.

By the end, you’ll be able to

Explain in plain language why AI models read tokens, not whole words or individual letters
Identify the tradeoff between vocabulary size and sequence length that tokenization is designed to manage
Walk through one byte-pair encoding (BPE) merge by hand on a small input
Predict what a real tokenizer will output for a given string before checking, and explain why letter-counting questions are a structural weak spot regardless of which model has patched the famous strawberry case
Recognize the role of special tokens (BOS, EOS, chat-role markers) in structuring model input and how they create a prompt-injection surface

Time and difficulty

Read time: about 20 minutes
Practice time: about 15 minutes (predict-and-check using tiktokenizer.vercel.app on five strings, plus flashcards)
Difficulty: standard