References: Tokenizers up close

Source material

Source curriculum (structural mirror, cited as further study):
• Hugging Face, "LLM Course", Chapter 6: "The Tokenizers library"
  Authors: the Hugging Face team (Lewis Tunstall, Leandro von Werra,
    Lysandre Debut, Sylvain Gugger, Merve Noyan, and others)
  Course page: https://huggingface.co/learn/llm-course/chapter6
  Code and notebooks: https://github.com/huggingface/course
  License: Apache 2.0 (prose and code)
  Required attribution: "Based on the Hugging Face LLM Course
    (huggingface.co/learn/llm-course), © Hugging Face, used under the
    Apache 2.0 license. This is an independent structural mirror;
    Hugging Face does not endorse it."
This lesson mirrors the structure of Chapter 6 (training a new tokenizer,
fast-tokenizer features, normalization and pre-tokenization, and the three
subword algorithms). Clawdemy's lessons are original prose that follows the
pedagogical arc of the course. We do not reproduce or transcribe the
course; we cite it as the recommended companion. Course materials are used
under the Apache 2.0 license with the attribution above. That license
requires passing along a copy of the license, keeping attribution notices,
and marking any changes, and it grants no right to use Hugging Face's names
or marks to imply endorsement.

Read this next

Hugging Face LLM Course, Chapter 6: The Tokenizers library. The chapter this lesson mirrors. It goes through each subword algorithm (BPE, WordPiece, Unigram) step by step with worked examples, and shows how to build a tokenizer block by block with the tokenizers library. It is the natural deep dive once the pipeline here makes sense.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

The tokenizers library documentation. The Rust-backed library behind fast tokenizers, with its own Python API for assembling a tokenizer from normalizer, pre-tokenizer, model, and post-processor components. For when you need a custom pipeline.
Summary of the tokenizers. A concise docs overview of BPE, WordPiece, and Unigram side by side. The quickest reference when you need to remember which algorithm a model family uses.
SentencePiece (Google). The library behind the Unigram-based tokenizers (T5 and others). Its README explains reversible tokenization and why treating text as raw Unicode helps languages that do not use spaces.
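
To make the four pipeline stages named above concrete before you open those docs, here is a minimal standard-library sketch. It is a toy, not the tokenizers library API: the function names, the tiny VOCAB, and the sample sentence are all hypothetical, and the greedy longest-match split only imitates WordPiece-style inference on a vocabulary that a real tokenizer would learn from a corpus.

```python
import re
import unicodedata

# Hypothetical toy vocabulary; a real one is learned from a corpus.
VOCAB = {"[CLS]", "[SEP]", "[UNK]", "token", "##izer", "##s", "up", "close"}


def normalize(text):
    """Normalizer: lowercase, then strip accents (NFD, drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pre_tokenize(text):
    """Pre-tokenizer: split into words and punctuation.

    Each item is (word, (start, end)), with offsets into the normalized text.
    """
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def wordpiece(word):
    """Model: greedy longest-match-first split in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                break
            end -= 1
        if end == start:
            # No known piece starts here, so the whole word is unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def post_process(tokens):
    """Post-processor: wrap the sequence in special tokens."""
    return ["[CLS]"] + tokens + ["[SEP]"]


def tokenize(text):
    tokens = []
    for word, _span in pre_tokenize(normalize(text)):
        tokens.extend(wordpiece(word))
    return post_process(tokens)


print(tokenize("Tokenizers up clôse"))
# ['[CLS]', 'token', '##izer', '##s', 'up', 'close', '[SEP]']
```

The library documentation covers the real components that play these roles, including trainers that learn the vocabulary instead of hard-coding it.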

Adjacent topics

Where this connects inside the track.

Run a model in a few lines (lesson 2). You first met AutoTokenizer and input_ids there. This lesson explains what was happening inside that call.
Wrangling data with the Datasets library (lesson 5). The batched=True speedup there comes from fast tokenizers, which process a whole batch of texts in parallel; this lesson explains why “fast” tokenizers earn the name (a Rust backend and parallelization).
The main NLP tasks, end to end (lesson 7). The offsets and word IDs introduced here are what make token-level tasks (named-entity recognition, extractive question answering) work in the next lesson.
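
As a preview of why word IDs matter for token-level tasks, the sketch below spreads word-level labels across subword tokens. It assumes a word_ids list shaped like what a fast tokenizer reports (None for special tokens, otherwise the index of the source word). The tokens, labels, and the align_labels helper are hypothetical, and -100 is used because it is the default ignore index of PyTorch's cross-entropy loss.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give every token the label of the word it came from.

    Special tokens (word id None) get ignore_index so a loss can skip them.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
    return aligned


# Hypothetical example: three words, the first split into three subword tokens.
tokens = ["[CLS]", "token", "##izer", "##s", "up", "close", "[SEP]"]
word_ids = [None, 0, 0, 0, 1, 2, None]
word_labels = [1, 0, 0]  # one label per word, e.g. 1 = "entity", 0 = "other"

print(align_labels(word_ids, word_labels))
# [-100, 1, 1, 1, 0, 0, -100]
```

Other labeling schemes exist, such as labeling only the first subword of each word; this sketch shows the simplest one.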