Skip to content

Cheatsheet: Tokenizers up close

StageWhat it does
1. NormalizationClean text: lowercasing, strip accents, Unicode tidy-up
2. Pre-tokenizationSplit cleaned text into words (the subword boundaries)
3. ModelSubword algorithm breaks words into tokens
4. PostprocessingAdd special tokens ([CLS], [SEP], etc.)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.backend_tokenizer.normalizer.normalize_str("Héllò?") # -> 'hello?'
tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hi, you?")
# -> [('hi', (0,2)), (',', (2,3)), ('you', (4,7)), ('?', (7,8))] (with offsets)
FastSlow
Backed byRust (tokenizers lib)Pure Python
Default?Yes (AutoTokenizer picks it)Only if no fast version
Offsets / word IDsYes (align tokens to source text)No
Speed~30x with batched=TrueSlow

Offsets/word IDs are what make token classification (NER) and extractive QA work.

AlgorithmDirectionLearnsUsed by
BPEMerge up from charactersMerge rules + vocabGPT family
WordPieceMerge up (frequency-scored)VocabBERT
UnigramPrune down from large vocabVocab + token scoresT5 (SentencePiece)

All produce a middle vocabulary of word-pieces, so rare words split into known parts.

  • GPT-2 (BPE): Ġ marks a space
  • SentencePiece (Unigram, T5): a special underscore marks a space
  • BERT: drops repeated spaces (tokenization not reversible)
from datasets import load_dataset
from transformers import AutoTokenizer
raw = load_dataset("code_search_net", "python", split="train")
old = AutoTokenizer.from_pretrained("gpt2") # start from a fast tokenizer
def corpus(): # generator: never all in RAM
for i in range(0, len(raw), 1000):
yield raw[i : i + 1000]["whole_func_string"]
new = old.train_new_from_iterator(corpus(), 52000) # 52000 = vocab size
new.save_pretrained("my-tokenizer")
new.push_to_hub("my-tokenizer")
  • Not model training: deterministic statistics over the corpus, no gradient descent, no GPU.
  • Inherits the old tokenizer’s algorithm, normalization, and special tokens; only the vocabulary changes.
  • Fast tokenizers only.
  • Subword tokenization: a vocabulary of word-pieces between whole-word and character level.
  • Offsets / word IDs: the source character span / original word each token maps to.
  • backend_tokenizer: handle to the underlying Rust tokenizer (normalizer, pre_tokenizer).
  • Vocabulary size: how many distinct tokens the tokenizer learns.
  • Hugging Face LLM Course, Chapter 6: “The Tokenizers library.” huggingface.co/learn/llm-course/chapter6. Released under Apache 2.0; this lesson mirrors its structure with original prose.