Tokenizers up close
What you’ll learn
This lesson opens the tokenizer, the component you have called in nearly every lesson since lesson 2 without looking inside. You will see how a fast tokenizer turns text into tokens and train a new one on your own corpus. The source curriculum is the Hugging Face LLM Course, Chapter 6, freely available and Apache-2.0 licensed at huggingface.co/learn/llm-course/chapter6.
You will walk through the four-stage pipeline (normalization, pre-tokenization, the subword model, postprocessing) and inspect the first two stages directly; understand why fast tokenizers are fast and what their offsets and word IDs make possible; meet the three subword algorithms (BPE, WordPiece, Unigram) and learn which model families use each; and train a brand-new tokenizer on a corpus of Python code with train_new_from_iterator, then see it cut token counts by about a quarter.
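To make the four stages concrete before opening a real tokenizer, here is a toy sketch in plain Python. It is not how the Hugging Face library implements anything: the function names, the tiny TOY_VOCAB, and the greedy WordPiece-style split are all assumptions chosen for illustration. It does show the shape of the pipeline, including the character offsets that pre-tokenization keeps.

```python
import re
import unicodedata

# Hypothetical miniature vocabulary, invented for this sketch.
TOY_VOCAB = {"token", "##izer", "##s", "are", "fast", "!", "[UNK]"}


def normalize(text):
    # Stage 1: lowercase, then strip accents via NFD decomposition.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pre_tokenize(text):
    # Stage 2: split into words and punctuation, keeping (start, end) offsets.
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def subword_model(word):
    # Stage 3: greedy longest-match-first split, in the style of WordPiece.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            chunk = word[start:end]
            piece = chunk if start == 0 else "##" + chunk
            if piece in TOY_VOCAB:
                break
            end -= 1
        if end == start:
            # No vocabulary entry matches at this position.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def post_process(tokens):
    # Stage 4: add the special tokens a BERT-like model expects.
    return ["[CLS]"] + tokens + ["[SEP]"]


def tokenize(text):
    tokens = []
    for word, _offsets in pre_tokenize(normalize(text)):
        tokens.extend(subword_model(word))
    return post_process(tokens)


print(pre_tokenize(normalize("Tokénizers are FAST!")))
print(tokenize("Tokénizers are FAST!"))
```

The offsets returned by pre_tokenize are the same kind of information a fast tokenizer tracks for every token, which is what lets you map a token back to the exact span of the original text.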
Where this fits
This is lesson 6 of 12, the second lesson of Phase 2 (data, tokenizers, and tasks). It opens up the AutoTokenizer you first used in lesson 2 and explains why the batched=True speedup from lesson 5 works. It also sets up lesson 7: the offsets and word IDs introduced here are what make the token-level tasks (named-entity recognition, question answering) work cleanly.
Before you start
Prerequisites: lesson 2 of this track (AutoTokenizer, input_ids, the three steps of a pipeline), since this lesson explains what was happening inside that tokenizer call. Lesson 5 (the Datasets library) helps, as you will load a corpus to train on. You should be comfortable with Python generators, which the training corpus uses. Install with pip install transformers datasets.
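If generators are rusty, the pattern needed here is small. The sketch below yields a corpus in batches instead of building one big list of batches up front. The names batch_iterator and corpus, and the five fake code snippets, are made up for illustration; the real lesson loads its corpus with the Datasets library.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive lists of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        # Execution pauses here until the consumer asks for the next batch.
        yield texts[start:start + batch_size]


# Hypothetical stand-in for a corpus of Python source code.
corpus = [f"def f{i}(): return {i}" for i in range(5)]

batches = batch_iterator(corpus, batch_size=2)
for batch in batches:
    print(len(batch), batch[0])
```

A generator like this is the kind of object train_new_from_iterator expects as its training corpus: an iterator over batches of texts, passed together with a target vocabulary size. Because batches are produced on demand, the whole corpus never needs to be held in memory as a list of lists.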
About the math
None. The subword algorithms are described by what they do (merge frequent pairs, or prune a large vocabulary), not derived. The hands-on work is inspecting a tokenizer’s stages and calling one training method; the only Python concept worth a refresher is the generator (yield).
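As a preview of what "merge frequent pairs" means in practice, here is a minimal sketch of a single BPE-style training step in plain Python. The word counts and the helper names count_pairs and merge_pair are assumptions for illustration; a real BPE trainer repeats this step until it reaches the target vocabulary size and handles many details omitted here.

```python
from collections import Counter


def count_pairs(word_freqs):
    # Count each adjacent symbol pair, weighted by how often the word occurs.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(word_freqs, pair):
    # Replace every occurrence of the chosen pair with one merged symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word starts as a tuple of single characters.
word_freqs = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

pairs = count_pairs(word_freqs)
best = max(pairs, key=pairs.get)  # the most frequent adjacent pair
word_freqs = merge_pair(word_freqs, best)
print(best, pairs[best])
print(word_freqs)
```

In this toy corpus the pair ("u", "g") occurs 20 times, more than any other, so it is merged into the new symbol "ug". Nothing beyond counting and replacing is involved, which is why the lesson needs no math.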
By the end, you’ll be able to
The single capability this lesson builds: explain how a fast tokenizer turns text into tokens, and train a new tokenizer on a corpus. Concretely, you will be able to:
- Describe the four-stage tokenizer pipeline (normalization, pre-tokenization, model, postprocessing)
- Inspect normalization and pre-tokenization via backend_tokenizer
- Explain why fast tokenizers are fast and what offsets and word IDs provide
- Name the three subword algorithms (BPE, WordPiece, Unigram) and a model family for each
- Train a new tokenizer on a corpus with train_new_from_iterator
Time and difficulty
- Read time: about 12 minutes
- Practice time: about 15 minutes (inspect two tokenizers, then train a code tokenizer, plus flashcards)
- Difficulty: standard (conceptual pipeline plus one short training run; no math)