Summary: How AI reads: turning text into tokens

Tokenization is the bridge from raw text to the integers a neural network can actually process. The model never sees letters or words. It sees a sequence of integer IDs drawn from a fixed, precomputed vocabulary. Understanding this one move explains a long list of AI quirks (why GPT-style models have struggled to count the Rs in “strawberry”, why prompt costs scale with token count, why small wording changes can shift the bill). The lesson walks you through why tokens exist, builds the intuition for byte-pair encoding by hand, and pays off the strawberry mystery. This summary is the scan-it-in-five-minutes version.

  • A neural network is a machine that turns numbers into other numbers. Words don’t go in; integers do. Tokenization is the conversion from one to the other, and the vocabulary is the fixed table that defines it.
  • Two obvious answers fail. One token per word explodes vocabulary size (English has 600,000+ words plus names, typos, variants, neologisms; covering them needs millions of entries, which makes the model slow to train and bad at anything not in the vocabulary). One token per character explodes sequence length (every word becomes 5 to 10 tokens instead of one, so sequences get 5 to 10 times longer and every transformer layer does at least 5 to 10 times more work).
  • The pragmatic answer is subword tokens: a fixed vocabulary of roughly 30,000 to 100,000 fragments. Common whole words (“the”, “and”) are single tokens. Common affixes and syllables (“ing”, “tion”, “un”) are tokens. Rare or unfamiliar words decompose into known fragments. Out-of-vocabulary failures essentially go away.
  • The standard algorithm for building that vocabulary is byte-pair encoding (BPE). Start from characters. Repeatedly merge the most frequent adjacent pair into a new token. Add it to the vocabulary. Run the loop until the vocabulary reaches the target size. GPT-style models and many modern LLMs use variants of this idea.
  • Mental model for BPE: it is a compression strategy. Common patterns get short codes (one token); rare patterns get spelled out across many tokens. The vocabulary is the codebook.
  • The lesson works one BPE merge by hand on the canonical “low / lower / newer / wider / new” toy corpus. The pair (e, r) appears 11 times across the corpus, more than any other pair, so it merges first into a new token er. That is one step. Run 50,000 of them and you have a real vocabulary.
  • A token is atomic. To the model, a token is indivisible. There is no inside, no letters, no structure, just an ID. This is why letter-counting and spelling questions are a structural weak spot. In some GPT-family tokenizers, “strawberry” is a single token; in others it splits into a couple of pieces. Either way, the model cannot see the three Rs because it cannot see inside any of those tokens. Recent models have been retrained to handle the famous strawberry case, but slightly weirder spelling questions (“count the Ms in mellifluous”) still expose the same hole.
  • Special tokens (BOS, EOS, chat-role markers like <|im_start|>, plus <pad>, <unk>, <mask>) are not subword fragments. They are housekeeping markers the model has been trained to recognize. They are also a key surface for prompt-injection attacks: if a malicious input contains the literal special-token string and the tokenizer parses it as the marker rather than as text, the attacker has just inserted a fake conversation turn. The full treatment is its own future lesson.
  • Three real-world implications worth holding in your head when you start using AI in earnest: long prompts cost more than character count suggests, because pricing is per-token; small wording changes can shift the total token count by 10 to 20 percent with no semantic change; code consumes roughly twice as many tokens as the same idea expressed in English.
  • Pitfalls worth naming: assuming token equals word (often wrong); assuming token equals character (also wrong, in the other direction); assuming whitespace is handled predictably (modern tokenizers attach the leading space to the next word, so “ cat” and “cat” are different tokens); assuming character count predicts token count (it does not, especially for code or non-Latin scripts); believing the model “knows the letters” of any word (it knows associations for the token, not the spelling).
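
The single BPE merge described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer. The word frequencies are an assumption (the summary does not list them); they were picked so that the pair (e, r) totals 11, matching the count quoted above. The helper names `count_pairs` and `merge_pair` are hypothetical.

```python
from collections import Counter

# Assumed word frequencies (not given in the summary), chosen so (e, r) totals 11.
corpus = {"low": 5, "lower": 2, "newer": 6, "wider": 3, "new": 2}

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Start from characters: each word is a tuple of single-character symbols.
words = {tuple(w): f for w, f in corpus.items()}

pairs = count_pairs(words)
best_pair, best_count = pairs.most_common(1)[0]
words_after = merge_pair(words, best_pair)

print(best_pair, best_count)
print(words_after)
```

Under these assumed frequencies, the most frequent pair is (e, r), and after the merge “newer” is represented as n, e, w, er. Repeating the count-and-merge loop until the vocabulary reaches its target size is the whole training procedure in miniature.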

Before this lesson, you might have thought of an AI model as something that reads your sentences. After it, you know the actual interface: your text becomes a sequence of integer IDs, the model does all of its work on those IDs, and the IDs get converted back to text on the way out. That single fact reshapes how you think about prompts (you are paying per token, not per character), about debugging (a model’s wrongness on letter-counting and spelling questions is structural, not a reasoning failure), and about security (prompt-injection attacks live at the token boundary). The next lesson, on embeddings, picks up where this one stops: you have the integer IDs; how does the model give them meaning?
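
To make the text-to-IDs-and-back interface concrete, here is a small sketch with a toy vocabulary. Everything in it is an assumption for illustration: the vocabulary is hypothetical, and the greedy longest-match rule is a simplification that stands in for the way a real BPE tokenizer applies its learned merges.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = ["the", " the", "cat", " cat", " sat", "s", "a", "t", "c", "h", "e", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text):
    """Greedy longest-match encoding (a simplification of real BPE merging)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink until one matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def decode(ids):
    """Map IDs back to their text fragments and join them."""
    return "".join(VOCAB[i] for i in ids)

ids = encode("the cat sat")
print(ids)                           # the integer sequence the model would see
print(decode(ids))                   # round-trips back to the original text
print(encode("cat"), encode(" cat")) # the leading space changes the token ID
```

The model in this picture only ever receives the list of integers. Nothing in an ID reveals which letters its fragment contains, and “cat” with and without a leading space maps to two unrelated IDs.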

AI does not read text. It reads tokens.