Tokenizers up close

This lesson opens the tokenizer, the component you have called in nearly every lesson since lesson 2 without looking inside. You will see how a fast tokenizer turns text into tokens and train a new one on your own corpus. The source curriculum is the Hugging Face LLM Course, Chapter 6, freely available and Apache-2.0 licensed at huggingface.co/learn/llm-course/chapter6.

You will walk the four-stage pipeline (normalization, pre-tokenization, the subword model, postprocessing) and inspect the first two stages directly; understand why fast tokenizers are fast and what their offsets and word IDs make possible; meet the three subword algorithms (BPE, WordPiece, Unigram) and learn which model families use each; and train a brand-new tokenizer on a corpus of Python code with train_new_from_iterator, then see it cut token counts on code by about a quarter.
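To make the first two stages concrete, here is a minimal standard-library sketch. It imitates, rather than reproduces, what a lowercasing, accent-stripping normalizer and a whitespace-and-punctuation pre-tokenizer do; the function names normalize and pre_tokenize are hypothetical and exist only for this illustration. On a real fast tokenizer the corresponding calls are tokenizer.backend_tokenizer.normalizer.normalize_str(...) and tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(...), and their exact behavior depends on the checkpoint.

```python
import re
import unicodedata


def normalize(text):
    """Stage 1 (sketch): decompose accents, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining accent marks produced by NFD.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()


def pre_tokenize(text):
    """Stage 2 (sketch): split into words and punctuation, keeping offsets.

    Offsets are (start, end) character positions in the string passed in.
    """
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\w+|[^\w\s]", text)]


normalized = normalize("Héllò hôw are ü?")
print(normalized)  # hello how are u?

pieces = pre_tokenize("Hello, how are you?")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)),
#  ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]
```

The (start, end) pairs are the same idea as the offsets a fast tokenizer reports: each piece can be traced back to the exact characters it came from.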

This is lesson 6 of 12, the second lesson of Phase 2 (data, tokenizers, and tasks). It opens up the AutoTokenizer you first used in lesson 2 and explains why the batched=True speedup from lesson 5 works. It also sets up lesson 7: the offsets and word IDs introduced here are what make the token-level tasks (named-entity recognition, question answering) work cleanly.

Prerequisites: lesson 2 of this track (AutoTokenizer, input_ids, the three steps of a pipeline), since this lesson explains what was happening inside that tokenizer call. Lesson 5 (the Datasets library) helps, as you will load a corpus to train on. You should be comfortable with Python generators, which the training corpus uses. Install with pip install transformers datasets.
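As a refresher on the generator pattern mentioned above, the sketch below yields a corpus in batches instead of building every batch in memory at once. The name iter_batches and the batch size are illustrative assumptions, and a plain list of strings stands in for a real dataset column.

```python
def iter_batches(texts, batch_size=1000):
    """Yield successive slices of `texts`, each at most `batch_size` long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# A stand-in corpus; in practice this would come from a loaded dataset.
corpus = ["def a(): pass", "def b(): pass", "def c(): pass"]

batches = iter_batches(corpus, batch_size=2)
first = next(batches)      # nothing is computed until a batch is requested
remaining = list(batches)  # consumes whatever is left

print(first)      # ['def a(): pass', 'def b(): pass']
print(remaining)  # [['def c(): pass']]
```

A generator can be consumed only once, so create a fresh one each time you need to iterate over the corpus again.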

Math prerequisites: none. The subword algorithms are described by what they do (merge frequent pairs, or prune a large vocabulary), not derived. The hands-on work is inspecting a tokenizer’s stages and calling one training method; the only Python concept worth a refresher is the generator (yield).
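"Merge frequent pairs" can be shown in a few lines. The sketch below performs a single BPE-style training step on a toy word-frequency table: count adjacent symbol pairs, pick the most frequent, and merge it everywhere. It is a simplified illustration, not the library's implementation; the helper names and the toy corpus are assumptions.

```python
from collections import Counter


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word starts as a tuple of single characters.
word_freqs = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
              tuple("bun"): 4, tuple("hugs"): 5}

pairs = count_pairs(word_freqs)
best = max(pairs, key=pairs.get)
print(best, pairs[best])  # ('u', 'g') 20

word_freqs = merge_pair(best, word_freqs)
print(word_freqs[("h", "ug")])  # 10
```

Repeating this step until the vocabulary reaches a target size is the core of BPE training. WordPiece and Unigram choose what to merge or prune by different criteria.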

The single capability this lesson builds: explain how a fast tokenizer turns text into tokens, and train a new tokenizer on a corpus. Concretely, you will be able to:

  • Describe the four-stage tokenizer pipeline (normalization, pre-tokenization, model, postprocessing)
  • Inspect normalization and pre-tokenization via backend_tokenizer
  • Explain why fast tokenizers are fast and what offsets and word IDs provide
  • Name the three subword algorithms (BPE, WordPiece, Unigram) and a model family for each
  • Train a new tokenizer on a corpus with train_new_from_iterator

Read time: about 12 minutes

Practice time: about 15 minutes (inspect two tokenizers, then train a code tokenizer, plus flashcards)

Difficulty: standard (conceptual pipeline plus one short training run; no math)
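As a preview of the training objective, here is a hedged sketch of how train_new_from_iterator is typically called on a fast tokenizer and how the token counts of the old and new tokenizers can be compared. The helper name train_and_compare, the "gpt2" checkpoint, and the vocabulary size of 52000 are assumptions for illustration, not requirements. The function takes the existing tokenizer as an argument, so loading it (which needs transformers and a download) stays in the commented usage lines.

```python
def train_and_compare(old_tokenizer, corpus_batches, sample, vocab_size=52000):
    """Train a new tokenizer from `old_tokenizer` and compare token counts.

    `corpus_batches` is any iterator that yields lists of strings.
    Returns (new_tokenizer, old_token_count, new_token_count) for `sample`.
    """
    # Keeps the old tokenizer's pipeline but learns a fresh vocabulary.
    new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_batches, vocab_size)
    old_count = len(old_tokenizer.tokenize(sample))
    new_count = len(new_tokenizer.tokenize(sample))
    return new_tokenizer, old_count, new_count


# Possible usage (assumes `pip install transformers` and network access;
# `code_batches` is a hypothetical generator over your own corpus):
#
#   from transformers import AutoTokenizer
#   old = AutoTokenizer.from_pretrained("gpt2")
#   new, n_old, n_new = train_and_compare(old, code_batches, "def add(a, b):\n    return a + b")
#   print(n_old, n_new)
```

How much the count drops depends on the corpus and on the sample you measure.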