References: building the GPT tokenizer
Source material
Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 8: "Let's build the GPT Tokenizer"
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=zduSFxRajkE
  Code repo (minbpe): https://github.com/karpathy/minbpe (MIT License)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: minbpe is MIT-licensed; the video is under the standard YouTube license.
This lesson covers Lecture 8, where Karpathy builds a byte-pair-encoding tokenizer from scratch over UTF-8 bytes and explains the language-model quirks that trace to tokenization. Clawdemy's lessons are original prose following the pedagogical arc of this series; we do not reproduce or transcribe the video or code. The aaabdaaabac merge example is the classic public BPE illustration; the encode/decode walkthrough here is ours, built to be checkable by hand. All rights to the original video and code remain with the creator.
Watch this next
- Let’s build the GPT Tokenizer by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds BPE over raw bytes step by step, trains it, and spends real time on the tokenization-driven quirks (spelling, arithmetic, whitespace, non-English text), showing concrete failures and tracing each back to the tokenizer. If the “quirks come from tokenization” claim surprised you, watching the live demonstrations is the most convincing follow-up.
Going deeper
- minbpe on GitHub (MIT License). Karpathy’s minimal, clean byte-pair-encoding implementation: the train/encode/decode loop from this lesson in a few readable files. Building or reading it is the way to make BPE concrete.
- Neural Machine Translation of Rare Words with Subword Units (Sennrich, Haddow, Birch, 2016) (arXiv). The paper that brought byte-pair encoding (originally a compression algorithm) into neural machine translation as a subword tokenization method. GPT tokenizers use a byte-level variant of the same approach.
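For readers who want something to run before opening either resource, here is a minimal sketch of the train/decode loop on the aaabdaaabac example. It is our own illustration and not minbpe's code or API: the function names (`most_frequent_pair`, `merge`, `train`, `decode`) are hypothetical, and the tie-breaking rule (the earliest-seen pair wins) is an assumption, so the intermediate merges may differ from other presentations of the same example.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair (assumed rule: earliest-seen pair wins ties)."""
    counts = Counter(zip(ids, ids[1:]))
    return max(counts, key=counts.get)


def merge(ids, pair, new_id):
    """Replace each left-to-right, non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges = {}                       # (left_id, right_id) -> new token id
    for step in range(num_merges):
        if len(ids) < 2:
            break                     # nothing left to pair
        pair = most_frequent_pair(ids)
        new_id = 256 + step           # new tokens are numbered after the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token ids back to text by rebuilding each token's byte string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dict order is merge order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids, merges = train("aaabdaaabac", 3)
print(ids)                  # 11 bytes compressed to 5 tokens
print(decode(ids, merges))  # round-trips to the original string
```

Three merges shrink the 11 input bytes to 5 tokens, and decoding reverses the merges to recover the original string, which is the property to check by hand.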
Adjacent topics
Where this sits in the curriculum.
- How AI reads tokens (AI Foundations track). That lesson describes tokenization from the user’s side: what a token is and why text is chunked into them. This lesson builds the tokenizer that produces those tokens, via byte-pair encoding. Read together, one gives the what-and-why and the other the how.
- The whole of this track. The tokenizer is the front door to everything you built: its tokens feed the embedding tables (the MLP and GPT lessons), which feed the transformer blocks (the self-attention and GPT-assembly lessons), trained with the engine and loop from Phase 1. This lesson completes the pipeline from raw text to generated text.