Tokenizers up close: brief

What you’ll learn

This lesson opens the tokenizer, the component you have called in nearly every lesson since lesson 2 without looking inside. You will see how a fast tokenizer turns text into tokens and train a new one on your own corpus. The source curriculum is the Hugging Face LLM Course, Chapter 6, freely available and Apache-2.0 licensed at huggingface.co/learn/llm-course/chapter6.

You will walk through the four-stage pipeline (normalization, pre-tokenization, the subword model, postprocessing) and inspect the first two stages directly; understand why fast tokenizers are fast and what their offsets and word IDs make possible; meet the three subword algorithms (BPE, WordPiece, Unigram) and learn which model families use each; and train a brand-new tokenizer on a corpus of Python code with train_new_from_iterator, then see it encode code in about a quarter fewer tokens than the tokenizer you started from.
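
To make the four stages concrete before the lesson proper, here is a toy sketch in plain Python. It is not how the tokenizers library is implemented: the vocabulary, the function names, and the regex are all made up for illustration, and the offsets point into the normalized string (a real fast tokenizer maps them back to the original text). It only shows the shape of the pipeline and where offsets and word IDs come from.

```python
import re
import unicodedata

# Hypothetical toy vocabulary; a trained tokenizer has tens of thousands of entries.
TOY_VOCAB = {"token", "##izer", "##s", "are", "fast", "[UNK]"}

def normalize(text):
    """Stage 1: lowercase and strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def pre_tokenize(text):
    """Stage 2: split into words and punctuation, keeping (start, end) offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def wordpiece(word, vocab):
    """Stage 3: greedy longest-match-first split (WordPiece-style inference)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: the whole word becomes unknown
            return [("[UNK]", (0, len(word)))]
        pieces.append((piece, (start, end)))
        start = end
    return pieces

def tokenize(text, vocab=TOY_VOCAB):
    normalized = normalize(text)
    tokens, offsets, word_ids = [], [], []
    for word_id, (word, (w_start, _)) in enumerate(pre_tokenize(normalized)):
        for piece, (p_start, p_end) in wordpiece(word, vocab):
            tokens.append(piece)
            offsets.append((w_start + p_start, w_start + p_end))
            word_ids.append(word_id)
    # Stage 4: postprocessing adds special tokens, which belong to no word.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    offsets = [(0, 0)] + offsets + [(0, 0)]
    word_ids = [None] + word_ids + [None]
    return tokens, offsets, word_ids

tokens, offsets, word_ids = tokenize("Tokenizers are fást!")
print(tokens)    # ['[CLS]', 'token', '##izer', '##s', 'are', 'fast', '[UNK]', '[SEP]']
print(word_ids)  # [None, 0, 0, 0, 1, 2, 3, None]
```

The three pieces of "tokenizers" share word ID 0, which is the information token-level tasks later rely on to map predictions back to whole words.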

Where this fits

This is lesson 6 of 12, the second lesson of Phase 2 (data, tokenizers, and tasks). It opens up the AutoTokenizer you first used in lesson 2 and explains why the batched=True speedup from lesson 5 works. It also sets up lesson 7: the offsets and word IDs introduced here are what make the token-level tasks (named-entity recognition, question answering) work cleanly.

Before you start

Prerequisites: lesson 2 of this track (AutoTokenizer, input_ids, the three steps of a pipeline), since this lesson explains what was happening inside that tokenizer call. Lesson 5 (the Datasets library) helps, as you will load a corpus to train on. You should be comfortable with Python generators, which the training corpus uses. Install with pip install transformers datasets.
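
If generators are rusty, the sketch below is roughly the pattern the training corpus relies on: a function that yields one batch of texts at a time instead of building every batch up front. The name batch_iterator and the stand-in corpus are assumptions for illustration; in the lesson the texts come from a dataset rather than a Python list.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield lists of up to batch_size texts, one batch at a time."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Stand-in corpus (hypothetical); the lesson uses real Python source files.
corpus = [f"def f{i}(): return {i}" for i in range(2500)]

batches = batch_iterator(corpus, batch_size=1000)
first = next(batches)              # nothing is computed until a batch is requested
print(len(first))                  # 1000
print([len(b) for b in batches])   # [1000, 500], the batches that remain
```

A generator can be consumed only once, so a fresh call to batch_iterator is needed each time the corpus is iterated.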

About the math

None. The subword algorithms are described by what they do (BPE and WordPiece build a vocabulary up by merging pairs, Unigram prunes a large vocabulary down), not derived. The hands-on work is inspecting a tokenizer’s stages and calling one training method; the only Python concept worth a refresher is the generator (yield).
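
As an illustration of "merge frequent pairs", here is a minimal sketch of BPE-style training on a five-word toy corpus. It is a teaching sketch, not the library's implementation: the function names are made up, words start as single characters, and production tokenizers add details such as byte-level alphabets and special tokens.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(word_freqs, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_counts, num_merges):
    """Repeatedly merge the most frequent adjacent pair."""
    word_freqs = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(word_freqs)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        word_freqs = merge_pair(word_freqs, best)
    return merges, word_freqs

toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = train_bpe(toy_counts, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each merge adds one entry to the vocabulary, so the number of merges is what ultimately sets the vocabulary size.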

By the end, you’ll be able to

The single capability this lesson builds: explain how a fast tokenizer turns text into tokens, and train a new tokenizer on a corpus. Concretely, you will be able to:

Describe the four-stage tokenizer pipeline (normalization, pre-tokenization, model, postprocessing)
Inspect normalization and pre-tokenization via backend_tokenizer
Explain why fast tokenizers are fast and what offsets and word IDs provide
Name the three subword algorithms (BPE, WordPiece, Unigram) and a model family for each
Train a new tokenizer on a corpus with train_new_from_iterator
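
As a preview of the two hands-on objectives, the sketch below wraps the calls involved in two small helpers. The helper names (inspect_stages, retrain_on_corpus) are hypothetical; the attributes and methods they call (backend_tokenizer, normalize_str, pre_tokenize_str, train_new_from_iterator) belong to fast tokenizers in transformers. Treat it as an outline under those assumptions rather than the lesson's exact code.

```python
def inspect_stages(tokenizer, text):
    """Return (normalized text, pre-tokenized pieces with offsets) for a fast tokenizer."""
    backend = tokenizer.backend_tokenizer
    # Some tokenizers have no normalizer at all; in that case the text passes through unchanged.
    if backend.normalizer is not None:
        normalized = backend.normalizer.normalize_str(text)
    else:
        normalized = text
    pieces = backend.pre_tokenizer.pre_tokenize_str(text)
    return normalized, pieces

def retrain_on_corpus(tokenizer, texts, vocab_size, batch_size=1000):
    """Train a new tokenizer with the same pipeline as `tokenizer` on `texts`."""
    batches = (
        texts[start:start + batch_size]
        for start in range(0, len(texts), batch_size)
    )
    return tokenizer.train_new_from_iterator(batches, vocab_size)

# Example usage, assuming transformers is installed and the checkpoint can be downloaded:
#   from transformers import AutoTokenizer
#   old = AutoTokenizer.from_pretrained("bert-base-uncased")
#   print(inspect_stages(old, "Héllo, how are ü?"))
#   new = retrain_on_corpus(old, python_source_files, vocab_size=52000)
```

Here python_source_files and the vocabulary size of 52000 are placeholders; any list of strings and any size suited to the corpus would do. The retrained tokenizer keeps the original's normalization, pre-tokenization, and special tokens and learns only a new vocabulary.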

Time and difficulty

Read time: about 12 minutes
Practice time: about 15 minutes (inspect two tokenizers, then train a code tokenizer, plus flashcards)
Difficulty: standard (conceptual pipeline plus one short training run; no math)