References: building the GPT tokenizer

Source material

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 8:
  "Let's build the GPT Tokenizer"
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=zduSFxRajkE
  Code repo (minbpe): https://github.com/karpathy/minbpe (MIT License)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: minbpe is MIT-licensed; the video is under the Standard YouTube License.
This lesson covers Lecture 8, where Karpathy builds a byte-pair-encoding
tokenizer from scratch over UTF-8 bytes and explains the language-model quirks
that trace to tokenization. Clawdemy's lessons are original prose following the
pedagogical arc of this series; we do not reproduce or transcribe the video or
code. The aaabdaaabac merge example is the classic public BPE illustration; the
encode/decode walkthrough here is ours, built to be checkable by hand. All
rights to the original video and code remain with the creator.
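
For readers who want to check the aaabdaaabac example by machine as well as by hand, here is a minimal sketch of byte-level BPE in Python. It is our own illustration, not code from the lecture or from minbpe. The function names (pair_counts, merge, train, encode, decode) are hypothetical, and breaking frequency ties by first-seen pair is an assumption; a different tie-break gives a different merge order but, for this string, the same final tokens.

```python
from collections import Counter

def pair_counts(ids):
    """Count each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in training order
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        # Assumption: ties go to the pair that was seen first.
        pair = max(counts, key=counts.get)
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def encode(text, merges):
    """Apply learned merges to new text, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pair = min(pair_counts(ids), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Expand token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # training order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

ids, merges = train("aaabdaaabac", 3)
print(ids)                  # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

Three merges shrink the 11 input bytes to 5 tokens, and decoding recovers the original string exactly. Token ids start at 256 because the first 256 ids stand for the raw byte values.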

Watch this next

Let’s build the GPT Tokenizer, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds BPE over raw bytes step by step, trains it, and spends real time on the tokenization-driven quirks (spelling, arithmetic, whitespace, non-English text), showing concrete failures and tracing each back to the tokenizer. If the “quirks come from tokenization” claim surprised you, watching the live demonstrations is the most convincing follow-up.

Going deeper

minbpe on GitHub (MIT License). Karpathy’s minimal, clean byte-pair-encoding implementation: the train/encode/decode loop from this lesson in a few readable files. Reading it, or building your own version alongside it, is the most direct way to make BPE concrete.
Neural Machine Translation of Rare Words with Subword Units (Sennrich, Haddow, Birch, 2016) (arXiv). The paper that brought byte-pair encoding (originally a data-compression algorithm) into neural machine translation as a subword tokenization method. GPT tokenizers use a byte-level descendant of this approach.

Adjacent topics

Where this sits in the curriculum.

How AI reads tokens (AI Foundations track). That lesson describes tokenization from the user’s side: what a token is and why text is chunked into tokens. This lesson builds the tokenizer that produces those tokens, via byte-pair encoding. Read together, one gives the what-and-why and the other the how.
The whole of this track. The tokenizer is the front door to everything you built: its tokens feed the embedding tables (the MLP and GPT lessons), which feed the transformer blocks (the self-attention and GPT-assembly lessons), trained with the engine and loop from Phase 1. This lesson completes the pipeline from raw text to generated text.