References: What "from scratch" means, and the tokenizer

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 1: Overview, tokenization
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  Assignment 1 (Basics): https://github.com/stanford-cs336/assignment1-basics
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides and assignment code
    are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson mirrors the structure of Lecture 1 (the from-scratch overview and
tokenization). Clawdemy's lessons are original prose that follows the
pedagogical arc of the course. Because the source publishes no explicit
license, we take the conservative posture: we cite the course as a
recommended companion and reproduce none of its materials (no slides, code,
or assignment text). All rights to the original course materials remain with
their creators.

Watch this next

Stanford CS336, Lecture 1: Overview and tokenization by Tatsunori Hashimoto and Percy Liang. The lecture this lesson mirrors. It motivates the whole from-scratch, efficiency-first approach and walks through tokenization in depth. Pair it with this lesson for the full version of the road map.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

CS336 Assignment 1: Basics. The hands-on counterpart: implement a byte-level BPE tokenizer, a Transformer, the loss and optimizer, and a training loop from scratch. The place to actually build what this lesson describes.
“Neural Machine Translation of Rare Words with Subword Units” by Sennrich, Haddow, and Birch (2016). The paper that introduced BPE to NLP. Short and readable; the original source for using the merge-the-most-frequent-pair idea to build subword vocabularies (the underlying compression algorithm is due to Gage, 1994).
The Hugging Face tokenizers course chapter. The companion that builds intuition for tokenizers from the using side, including the normalization and pre-tokenization steps that surround the BPE core covered here.

Adjacent topics

Where this connects inside the track and the wider curriculum.

Counting the cost: FLOPs, memory, and arithmetic intensity (lesson 2). The next lesson introduces the efficiency accounting this lesson named as the track’s through-line, and it lets you quantify the vocabulary-size trade-off concretely.
Track 14, Tokenizers up close. The practical-track companion: it uses and trains a fast (byte-level BPE) tokenizer through the Hugging Face library. Same algorithm, approached from the using side rather than the building side.
Track 13 (Build Neural Networks from Scratch). The other from-scratch track: it builds the conceptual engine (autograd, a small GPT). The current track, by contrast, builds the full production pipeline; the two are complementary.