
References: How AI reads: turning text into tokens

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025, Lecture 1
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
YouTube: https://www.youtube.com/watch?v=Ub3GoFaUcds
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
License (lecture video): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lecture remain with Stanford and
the instructors.

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more” pile.

  • “Let’s build the GPT Tokenizer” by Andrej Karpathy. The deepest publicly available walkthrough of BPE specifically. Roughly two hours on the same algorithm we covered in this lesson, with Karpathy implementing it from scratch in Python and explaining every choice. Watch this when you want to understand exactly what tokenizer code looks like, not just the algorithm.

  • Tiktokenizer by a Karpathy student. The interactive tokenizer we used in the practice section. Useful as a debugging tool: when an AI prompt is consuming more tokens than you expected, paste it here and find out where the cost is going.

  • OpenAI’s tiktoken library. The actual tokenizer code OpenAI ships, open-source. The cl100k_base and o200k_base encoder names you see referenced elsewhere are specific versions of the BPE vocabulary. If you ever need to count tokens programmatically before sending a prompt, this is the canonical Python library.

  • Hugging Face tokenizers library. The cross-vendor equivalent of tiktoken. Supports BPE, WordPiece (the BERT variant), and Unigram (the algorithm popularized by Google’s SentencePiece). Best resource if you ever need to train your own tokenizer or understand a non-OpenAI model’s tokenization choices.

  • Stanford CME 295 cheatsheet by the Amidi twins. Their MIT-licensed cheatsheet covers tokenization alongside the rest of the course. Especially good for visual learners; their typesetting is cleaner than most ML reference material.

Topics that build on or sit beside this one. Some are upcoming Clawdemy lessons; some are pointers outside the course.

  • Embeddings, attention, multi-head attention, the full transformer block. The remaining lessons in our Lecture 1 adaptation. Each one builds on tokenization: embeddings give the integer IDs their first numeric form (a high-dimensional vector); attention reads those vectors and produces context-mixed versions of them; multi-head runs that attention mechanism in parallel many times per layer; the transformer block wraps attention with feed-forward networks and normalization. None of them work without the bridge that tokenization provides.

  • “The Bitter Lesson” by Rich Sutton (2019). One page. Argues that the methods that win in AI are the ones that scale with compute, not the ones that encode human cleverness. Tokenization is an interesting borderline case. BPE is a clever engineering compromise (not pure scaling), and there is real research arguing that we should drop tokenization entirely for byte-level or even pixel-level models. Worth reading before you read any of the “tokenization is a hack” arguments online.

  • Prompt injection. The future security lesson that picks up where this one’s “Special tokens” section stopped. The tokenizer is the structural surface where injection attacks live; the full story belongs in its own lesson, but you have already seen the foundation.

The primary sources this lesson draws from.

  • “Neural Machine Translation of Rare Words with Subword Units”, Sennrich et al., 2015 (published in ACL 2016). The paper that adapted byte-pair encoding from a 1994 compression algorithm into the dominant tokenization strategy for neural language models. Section 3.2 (“Byte Pair Encoding (BPE)”) is the merge loop we covered. If you read only one paper from this lesson, read this one.

  • Gage 1994, “A New Algorithm for Data Compression” (The C Users Journal; no stable open URL). The original byte-pair encoding paper, predating its NLP adoption by roughly twenty years. Gage was solving compression, not language; Sennrich’s contribution was recognizing that the algorithm transferred. The relevant context: BPE spent those two decades as a compression curiosity before anyone used it to tokenize text for neural models.

  • “Language Models are Unsupervised Multitask Learners” (GPT-2), Radford et al., 2019. The GPT-2 paper introduces byte-level BPE, the variant most modern LLMs use. Section 2.2 (“Input Representation”) is the relevant section; it explains why operating on bytes (rather than Unicode characters) gives you a tokenizer that handles every possible input string with a fixed vocabulary, including emoji and non-Latin scripts.
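
The merge loop and the byte-level idea from the papers above can be sketched together in a few lines of Python. This is a minimal illustration, not the code from Sennrich et al. or from GPT-2: the function names (`train_bpe`, `encode`, `decode`, and the helpers) are made up for this sketch, it trains on one string rather than a corpus, and it skips the word-boundary and pre-tokenization rules that production tokenizers add.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}  # (left_id, right_id) -> new_id, kept in the order learned
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token IDs back into bytes, then into a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Hypothetical toy input, chosen only to make the merges easy to follow.
merges = train_bpe("low lower lowest", num_merges=3)
ids = encode("low lower lowest", merges)
print(len(ids), decode(ids, merges))
```

Because the base vocabulary is the 256 possible byte values, `encode` never meets an unknown symbol: an emoji or a non-Latin character simply arrives as several bytes, which is the property the GPT-2 paper highlights. Each merge adds exactly one new ID, so the vocabulary size is 256 plus the number of merges learned.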

No community threads selected for this lesson. The public discussion of tokenization has consolidated into the Karpathy video and the OpenAI / Hugging Face docs above; the marginal Reddit or Hacker News thread does not add durable value over those. If a canonical thread surfaces, it will be added at the next quarterly review.