References: Tokenizers up close

Source curriculum (structural mirror, cited as further study):
• Hugging Face, "LLM Course", Chapter 6: "The Tokenizers library"
Authors: the Hugging Face team (Lewis Tunstall, Leandro von Werra,
Lysandre Debut, Sylvain Gugger, Merve Noyan, and others)
Course page: https://huggingface.co/learn/llm-course/chapter6
Code and notebooks: https://github.com/huggingface/course
License: Apache 2.0 (prose and code)
Required attribution: "Based on the Hugging Face LLM Course
(huggingface.co/learn/llm-course), © Hugging Face, used under the
Apache 2.0 license. This is an independent structural mirror;
Hugging Face does not endorse it."
This lesson mirrors the structure of Chapter 6 (training a new tokenizer,
fast-tokenizer features, normalization and pre-tokenization, and the three
subword algorithms). Clawdemy's lessons are original prose that follows the
pedagogical arc of the course. We do not reproduce or transcribe the
course; we cite it as the recommended companion. Course materials are used
under the Apache 2.0 license with the attribution above, which requires a
link to the license and an indication of changes, and does not permit
implying endorsement.
A short, durable list. Each link is a specific next step, not a generic pile.

  • Hugging Face LLM Course, Chapter 6: The Tokenizers library. The chapter this lesson mirrors. It walks through each subword algorithm (BPE, WordPiece, Unigram) step by step with worked examples, and shows how to build a tokenizer block by block with the tokenizers library. It is the natural deep dive once the pipeline here makes sense.

  • The tokenizers library documentation. The Rust-backed library behind fast tokenizers, with its own Python API for assembling a tokenizer from normalizer, pre-tokenizer, model, and post-processor components. For when you need a custom pipeline.

  • Summary of the tokenizers. A concise docs overview of BPE, WordPiece, and Unigram side by side. The quickest reference when you need to remember which algorithm a model family uses.

  • SentencePiece (Google). The library behind the Unigram-based tokenizers (T5 and others). Its README explains reversible tokenization and why treating text as raw Unicode helps languages that do not use spaces.
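
To make the four components named in the tokenizers library entry above concrete (normalizer, pre-tokenizer, model, post-processor), here is a minimal plain-Python sketch of the same pipeline shape. It is not the tokenizers library API: every function name, the toy vocabulary, and the "##" continuation prefix are assumptions chosen for illustration, and the model stage is a simple greedy longest-match split in the style of WordPiece inference.

```python
import re
import unicodedata

# Toy vocabulary invented for this sketch; "##" marks a piece that continues a word.
VOCAB = {"hello", "token", "##izer", "##s", "!"}


def normalize(text):
    """Normalizer: lowercase, then strip accents via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pre_tokenize(text):
    """Pre-tokenizer: split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_word(word, vocab, unk="[UNK]"):
    """Model: greedy longest-match-first split of one word into known pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def post_process(tokens):
    """Post-processor: wrap the sequence in special tokens."""
    return ["[CLS]"] + tokens + ["[SEP]"]


def tokenize(text, vocab=VOCAB):
    tokens = []
    for word in pre_tokenize(normalize(text)):
        tokens.extend(split_word(word, vocab))
    return post_process(tokens)


tokens = tokenize("Héllo tokenizers!")
print(tokens)
# ['[CLS]', 'hello', 'token', '##izer', '##s', '!', '[SEP]']
```

Each stage is a separate function so that any one of them can be swapped without touching the others, which is the design idea behind assembling a custom pipeline from components.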

Where this connects inside the track.

  • Run a model in a few lines (lesson 2). You first met AutoTokenizer and input_ids there. This lesson explains what was happening inside that call.

  • Wrangling data with the Datasets library (lesson 5). The batched=True speedup there comes from handing the tokenizer whole batches, which fast tokenizers process in parallel; this lesson explains why “fast” tokenizers earn the name (Rust, parallelization).

  • The main NLP tasks, end to end (lesson 7). The offsets and word IDs introduced here are what make token-level tasks (named-entity recognition, extractive question answering) work in the next lesson.
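
As a rough illustration of why offsets and word IDs matter for token-level tasks, the sketch below tracks both while splitting text into subwords, then uses them to map a word back to its exact character span. It is a toy in plain Python, not the fast-tokenizer implementation: the vocabulary and the helper names encode and word_span are hypothetical, and the splitting rule (longest known prefix, falling back to a single character) is a simplification.

```python
import re

# Toy lowercase vocabulary, invented for this sketch.
VOCAB = {"hugg", "ing", "face", "builds", "token", "izers"}


def encode(text, vocab=VOCAB):
    """Return (tokens, offsets, word_ids) for text.

    offsets[i] is the (start, end) character span of token i in the original text;
    word_ids[i] is the index of the word that token i came from.
    """
    tokens, offsets, word_ids = [], [], []
    for word_id, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        word, base = match.group(), match.start()
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is in the vocabulary,
            # falling back to a single character if nothing matches.
            while end > start + 1 and word[start:end].lower() not in vocab:
                end -= 1
            tokens.append(word[start:end])
            offsets.append((base + start, base + end))
            word_ids.append(word_id)
            start = end
    return tokens, offsets, word_ids


def word_span(word_id, offsets, word_ids):
    """Merge the spans of all tokens belonging to one word."""
    spans = [span for span, wid in zip(offsets, word_ids) if wid == word_id]
    return spans[0][0], spans[-1][1]


text = "Hugging Face builds tokenizers"
tokens, offsets, word_ids = encode(text)
start, end = word_span(3, offsets, word_ids)
recovered = text[start:end]

print(tokens)     # ['Hugg', 'ing', 'Face', 'builds', 'token', 'izers']
print(word_ids)   # [0, 0, 1, 2, 3, 3]
print(recovered)  # tokenizers
```

A per-token prediction such as an entity tag can be grouped by word ID and then projected onto the original string through the offsets, which is the kind of bookkeeping named-entity recognition and extractive question answering depend on.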