
References: building the GPT tokenizer

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 8:
"Let's build the GPT Tokenizer"
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=zduSFxRajkE
Code repo (minbpe): https://github.com/karpathy/minbpe (MIT License)
Series page: https://karpathy.ai/zero-to-hero.html
License: minbpe is MIT-licensed; the video is under the Standard YouTube License.
This lesson covers Lecture 8, where Karpathy builds a byte-pair-encoding
tokenizer from scratch over UTF-8 bytes and explains the language-model quirks
that trace to tokenization. Clawdemy's lessons are original prose following the
pedagogical arc of this series; we do not reproduce or transcribe the video or
code. The aaabdaaabac merge example is the classic public BPE illustration; the
encode/decode walkthrough here is ours, built to be checkable by hand. All
rights to the original video and code remain with the creator.
  • “Let’s build the GPT Tokenizer”, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds BPE over raw bytes step by step, trains it, and spends substantial time on the tokenization-driven quirks (spelling, arithmetic, whitespace, non-English text), showing concrete failures and tracing each back to the tokenizer. If the “quirks come from tokenization” claim surprised you, watching the live demonstrations is the most convincing follow-up.
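
For readers who want to check the aaabdaaabac merge example by hand before watching, here is a minimal sketch of byte-level BPE training and decoding. It is our own illustration, not code from minbpe or the lecture: the function names (pair_counts, merge, train, decode) are hypothetical, new token ids are assumed to start at 256, and ties between equally frequent pairs are broken by first appearance, which is one arbitrary choice among several valid ones.

```python
from collections import Counter


def pair_counts(ids):
    """Count every adjacent pair of token ids (overlapping pairs included)."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges = {}                       # (left_id, right_id) -> new_id
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        # Assumption: ties go to the pair seen first (max keeps the first maximum).
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # insertion order = training order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


ids, merges = train("aaabdaaabac", 3)
print(ids)                  # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

With this tie-breaking rule the three merges are (a, a), then (aa, a), then (aaa, b), so the 11 input bytes shrink to 5 tokens. The commonly quoted version of the example picks (a, b) as its second merge instead and ends at the same 5-token length; both orders are valid outcomes of the same greedy procedure.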

Where this sits in the curriculum.

  • How AI reads tokens (AI Foundations track). That lesson describes tokenization from the user’s side: what a token is and why text is chunked into tokens. This lesson builds the tokenizer that produces those tokens, via byte-pair encoding. Read together, one gives the what-and-why and the other the how.

  • The whole of this track. The tokenizer is the front door to everything you built: its tokens feed the embedding tables (the MLP and GPT lessons), which feed the transformer blocks (the self-attention and GPT-assembly lessons), trained with the engine and loop from Phase 1. This lesson completes the pipeline from raw text to generated text.