References: Tokenizers up close
Source material
Source curriculum (structural mirror, cited as further study):
• Hugging Face, "LLM Course", Chapter 6: "The Tokenizers library"
  Authors: the Hugging Face team (Lewis Tunstall, Leandro von Werra, Lysandre Debut, Sylvain Gugger, Merve Noyan, and others)
  Course page: https://huggingface.co/learn/llm-course/chapter6
  Code and notebooks: https://github.com/huggingface/course
  License: Apache 2.0 (prose and code)
  Required attribution: "Based on the Hugging Face LLM Course (huggingface.co/learn/llm-course), © Hugging Face, used under the Apache 2.0 license. This is an independent structural mirror; Hugging Face does not endorse it."
This lesson mirrors the structure of Chapter 6 (training a new tokenizer, fast-tokenizer features, normalization and pre-tokenization, and the three subword algorithms). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. We do not reproduce or transcribe the course; we cite it as the recommended companion. Course materials are used under the Apache 2.0 license with the attribution above, which requires a link to the license and an indication of changes, and does not permit implying endorsement.
Read this next
- Hugging Face LLM Course, Chapter 6: The Tokenizers library. The chapter this lesson mirrors. It goes through each subword algorithm (BPE, WordPiece, Unigram) step by step with worked examples, and shows how to build a tokenizer block by block with the tokenizers library. It is the natural deep dive once the pipeline here makes sense.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- The tokenizers library documentation. The Rust-backed library behind fast tokenizers, with its own Python API for assembling a tokenizer from normalizer, pre-tokenizer, model, and post-processor components. For when you need a custom pipeline.
- Summary of the tokenizers. A concise docs overview of BPE, WordPiece, and Unigram side by side. The quickest reference when you need to remember which algorithm a model family uses.
- SentencePiece (Google). The library behind the Unigram-based tokenizers (T5 and others). Its README explains reversible tokenization and why treating text as raw Unicode helps languages that do not use spaces.
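As a reminder of what the normalizer, pre-tokenizer, model, and post-processor components mentioned above each do, here is a minimal standard-library sketch. It does not use the tokenizers library API. The function names, the `TOY_VOCAB` set, and the greedy longest-match split (a WordPiece-style stand-in) are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re
import unicodedata

# Hypothetical toy vocabulary; a real one is learned from a corpus.
TOY_VOCAB = {"token", "##izer", "##s", "are", "fun", "!"}


def normalize(text: str) -> str:
    """Normalizer stage: Unicode NFKC normalization plus lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text: str) -> list[str]:
    """Pre-tokenizer stage: split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def model(word: str, vocab: set[str]) -> list[str]:
    """Model stage: greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no piece matched: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def post_process(tokens: list[str]) -> list[str]:
    """Post-processor stage: wrap the sequence in special tokens."""
    return ["[CLS]", *tokens, "[SEP]"]


def encode(text: str) -> list[str]:
    """Run the four stages in order."""
    tokens: list[str] = []
    for word in pre_tokenize(normalize(text)):
        tokens.extend(model(word, TOY_VOCAB))
    return post_process(tokens)


print(encode("Tokenizers are fun!"))
# ['[CLS]', 'token', '##izer', '##s', 'are', 'fun', '!', '[SEP]']
```

Each stage is a separate function so that any one can be swapped without touching the others, which is the same idea behind assembling a custom pipeline from components.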
Adjacent topics
Where this connects inside the track.
- Run a model in a few lines (lesson 2). You first met AutoTokenizer and input_ids there. This lesson explains what was happening inside that call.
- Wrangling data with the Datasets library (lesson 5). The batched=True speedup there is exactly what fast tokenizers exploit; this lesson explains why “fast” tokenizers earn the name (Rust, parallelization).
- The main NLP tasks, end to end (lesson 7). The offsets and word IDs introduced here are what make token-level tasks (named-entity recognition, extractive question answering) work in the next lesson.
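To make the role of offsets and word IDs concrete, the sketch below shows how they let word-level labels reach every token of a word. It is a toy under stated assumptions: a hypothetical fixed-width splitter stands in for a real subword model, and `tokenize_with_alignment` and `align_labels` are illustrative names, not library functions.

```python
from __future__ import annotations

import re


def tokenize_with_alignment(text: str, max_piece_len: int = 4):
    """Toy tokenizer: split on whitespace, then chop each word into
    pieces of at most max_piece_len characters.

    Returns a list of (token, (start, end), word_id) triples, where the
    offsets index into the original text.
    """
    out = []
    for word_id, match in enumerate(re.finditer(r"\S+", text)):
        start, end = match.span()
        for i in range(start, end, max_piece_len):
            j = min(i + max_piece_len, end)
            out.append((text[i:j], (i, j), word_id))
    return out


def align_labels(word_labels: list[str], word_ids: list[int]) -> list[str]:
    """Copy each word-level label onto every token of that word."""
    return [word_labels[w] for w in word_ids]


text = "Sylvain lives in Brooklyn"
encoding = tokenize_with_alignment(text)
word_ids = [w for _, _, w in encoding]

print(word_ids)
# [0, 0, 1, 1, 2, 3, 3]
print(align_labels(["PER", "O", "O", "LOC"], word_ids))
# ['PER', 'PER', 'O', 'O', 'O', 'LOC', 'LOC']

# Offsets map any token back to its exact span in the original text.
token, (start, end), _ = encoding[-1]
print(token, text[start:end])
# klyn klyn
```

Word IDs carry labels from words to tokens, and offsets carry predictions from tokens back to character spans, which is what extractive question answering needs to return an answer string.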