Skip to content

Summary: Tokenizers up close

The tokenizer is the bridge between text and the numbers a model reads, and it runs a four-stage pipeline: normalization (clean), pre-tokenization (split into words), the model (the subword algorithm that makes tokens), and postprocessing (add special tokens). Fast tokenizers are Rust-backed and the default; beyond speed they track offsets and word IDs that map each token back to the source text. Three subword algorithms dominate: BPE (GPT family), WordPiece (BERT), and Unigram (T5/SentencePiece), all learning a middle vocabulary of word-pieces so rare words split into familiar parts. You can train a new tokenizer on your own corpus with train_new_from_iterator, which is a deterministic statistical scan (not model training), and a tokenizer matched to your domain produces far shorter sequences. This is the scan version; the lesson opens the box and trains a code tokenizer.

  • A tokenizer is a four-stage pipeline. Normalization, pre-tokenization, the subword model, and postprocessing. Inspect the first two with backend_tokenizer.normalizer.normalize_str and .pre_tokenizer.pre_tokenize_str.
  • Fast tokenizers are Rust-backed and the default. AutoTokenizer picks them automatically. They are fast and, crucially, track offsets and word IDs that align tokens to the original text.
  • Three subword algorithms cover the field. BPE (GPT), WordPiece (BERT), Unigram (T5/SentencePiece). Subword vocabularies are why a model can represent words it never saw, by splitting them into known pieces.
  • Markers preserve spacing. GPT-2 marks spaces with Ġ, SentencePiece with a special underscore, so the original text is recoverable; BERT drops repeated spaces, so its tokenization is not reversible.
  • Training a tokenizer is not training a model. It is deterministic statistics over a corpus (no gradient descent, same result every time), and it is fast and needs no GPU.
  • train_new_from_iterator retrains the vocabulary from an existing fast tokenizer (keeping its algorithm and special tokens). Feed a generator of text batches and a vocab size.

This lesson exposes the most overlooked lever in the stack. The tokenizer learns nothing during model training and gets no attention, yet it fixes the units everything else is measured in. Because cost and context limits are counted in tokens, a tokenizer poorly matched to your text is a direct tax: more tokens per document means higher bills, slower runs, and less content per context window. The Python-code example in the lesson cut token counts by about a quarter, which on a real codebase is a quarter off the budget. Most of the time you will use the tokenizer that ships with your model and never think about it, which is right; but knowing what it does, and that you can retrain it for an unusual domain or language, is what lets you diagnose why text is unexpectedly expensive instead of just paying for it. Next, the task lesson puts tokenizer, model, and data together across the common NLP jobs.

Every model starts by tokenizing, and the tokenizer decides the units the rest of the work is measured in. Knowing how it splits text, and that you can retrain it, turns a silent black box into a lever you can pull.