Cheatsheet: How AI reads: turning text into tokens
The one idea that matters
Section titled “The one idea that matters”text → tokenizer → integer IDs → model ↑ the only thing the model ever seesThe vocabulary defines the mapping. Every input is sliced into vocab entries before the model touches it. Every output is built from the same vocab on the way out.
The vocabulary tradeoff
Section titled “The vocabulary tradeoff”| Scheme | Vocab size | Sequence length | OOV behavior | Verdict |
|---|---|---|---|---|
| One per word | Millions of entries | Short | ”Unknown” placeholder | ✗ |
| One per character | About 200 entries | 5 to 10 times longer | Never | ✗ |
| Subword (BPE) | 30k to 100k entries | Moderate | Reassembled from pieces | ✓ |
BPE algorithm in four steps
Section titled “BPE algorithm in four steps”- Start. Every character is its own token.
- Count. Find the most frequent adjacent pair across the training corpus.
- Merge. Add that pair as a new token in the vocabulary; rewrite the corpus to use it.
- Loop. Repeat steps 2 and 3 until the vocabulary reaches the target size.
Mental model: BPE is a compression strategy. Common patterns get short codes (one token); rare patterns spell out across many tokens. The vocabulary is the codebook.
One worked merge step
Section titled “One worked merge step”Tiny corpus: low (5x), lower (2x), newer (6x), wider (3x), new (1x).
| Before | After | |
|---|---|---|
| Most frequent pair | (e, r) with 11 occurrences | merged |
lower representation | l o w e r </w> | l o w er </w> |
| Vocabulary | characters only | + new entry er |
Why “how many Rs in strawberry” fails
Section titled “Why “how many Rs in strawberry” fails”A token is atomic. To the model, it is an integer ID with no inside; no letters, no structure. “Strawberry” is one or two tokens depending on the tokenizer. Either way, the model cannot count letters inside a token because there are no letters at that level. Spelling questions live on the wrong side of the token boundary.
Special tokens
Section titled “Special tokens”- Boundaries:
<|begin_of_text|>,<|end_of_text|>(BOS, EOS). Mark start and end of text. - Chat roles:
<|im_start|>,<|im_end|>, plussystem/user/assistant. Mark whose turn is whose. - Housekeeping:
<pad>,<unk>,<mask>. Padding, unknown-token fallback, fill-in-the-blank training objective.
Special tokens are also the surface for prompt-injection attacks. If the tokenizer parses a malicious user-string as a special token instead of as plain text, the attacker has just inserted a fake conversation turn into the model’s input.
Practical numbers (rules of thumb)
Section titled “Practical numbers (rules of thumb)”- Pricing is per token, not per character. Two prompts with identical character counts can differ by 10 to 20 percent in token count.
- English prose: roughly 4 characters per token.
- Code: roughly 2 characters per token (twice as expensive per character).
- Non-Latin scripts (CJK, Cyrillic, Arabic): tokenize sparsely; expect more tokens per visual character.
- Whitespace matters:
hello(with leading space) andhelloare different tokens with different IDs.
Pitfalls to dodge
Section titled “Pitfalls to dodge”- Token is not word. Often a word; often a fragment; sometimes both in the same sentence.
- Token is not character. Usually shorter sequences than character count suggests.
- Character count does not predict token count. Code, non-Latin, and unusual names break the ratio.
- The model does not “know the letters” of a word. It knows associations for the token; spelling is on the wrong side of the wall.
- The famous strawberry case got patched via training, but slightly weirder spelling questions (“count the Ms in mellifluous”) still fail for the same structural reason.
Words to use precisely
Section titled “Words to use precisely”- Token: a vocabulary entry; the unit the model processes.
- Vocabulary: the fixed precomputed list of tokens (typically 30k to 100k entries).
- BPE: byte-pair encoding; the algorithm that builds the vocabulary by repeatedly merging the most frequent adjacent pair.
- Atomic token: a token the model cannot look inside; it is just an integer ID.
- Special token: a vocabulary entry that is not a subword fragment (BOS, EOS, chat-role markers, etc.).
AI does not read text. It reads tokens.