
What "from scratch" means, and the tokenizer

This is the first lesson of Track 15, the deepest tier on Clawdemy: a track that builds a language model from scratch, the way the people who train frontier models do. This opener gives you the map of the whole endeavor and then builds the first concrete component. The source curriculum is Stanford CS336, “Language Modeling from Scratch,” by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will learn what building an LLM from scratch actually entails (tokenizer, architecture, loss, optimizer, and training loop, plus the systems, scaling, data, and post-training work that follows); why efficiency (accounting for FLOPs, memory, and hardware use) is the through-line of the whole track; what a tokenizer does as the model’s first component; why subword tokenization beats character- and word-level tokenization; and how byte-level BPE works: the merge-the-most-frequent-pair procedure you will later implement by hand.
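To make the character-, word-, and byte-level options concrete, here is a minimal sketch in plain Python. The sample strings and variable names are illustrative assumptions, and the whitespace split is a deliberately naive stand-in for a word-level tokenizer, not how any production tokenizer works.

```python
# Illustrative sketch: sample text and names are assumptions, not course code.
text = "Tokenizers turn text into integers. Tokenization matters."

# Character-level: one token per Unicode code point.
char_tokens = list(text)

# Word-level (naive): split on whitespace, so punctuation sticks to words.
word_tokens = text.split()

# Byte-level: one token per UTF-8 byte, so the base vocabulary is fixed at 256.
byte_tokens = list(text.encode("utf-8"))

print(len(char_tokens), len(word_tokens), len(byte_tokens))

# A word-level vocabulary built from `text` has no entry for an unseen word.
word_vocab = set(word_tokens)
print("tokenizer" in word_vocab)

# Bytes can represent any string and decode back losslessly.
unseen = "na\u00efve tokenizer"  # contains a non-ASCII character
unseen_bytes = list(unseen.encode("utf-8"))
roundtrip = bytes(unseen_bytes).decode("utf-8")
print(roundtrip == unseen)
```

Word-level sequences are short but break on anything outside the vocabulary; character- and byte-level sequences cover everything but are long. Subword tokenization sits between the two.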

This is lesson 1 of 14, opening Phase 1 (the model). It is the orientation lesson: it frames the from-scratch project and builds the tokenizer, the component every later lesson’s tokens flow from. The next lesson introduces the efficiency accounting (FLOPs and memory) that the whole track runs on. Track 13 (Build Neural Networks from Scratch) builds the conceptual engine; this track builds the full production pipeline, so they are complementary rather than overlapping.

Prerequisites: none within the track (this is the opener), but the track as a whole is the deep end and assumes real background: comfortable Python and PyTorch, familiarity with training a neural network, and at least some sense of how code runs on a GPU. Track 13 (from-scratch neural networks) or Track 14 (practical transformers) is a good on-ramp, as is equivalent experience gained elsewhere. This first lesson is conceptual and reads without a notebook; later lessons are implementation-heavy.

The math is light in this lesson. The from-scratch overview and the tokenizer are conceptual; the only procedure is the BPE merge loop, which is counting and merging, with no calculus. The track as a whole involves real math and heavy code, but this opener is about the map and the first component, not derivations.
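The counting-and-merging loop can be sketched in a few lines of Python. This is a toy version under stated assumptions, not the course's reference implementation: the function names `train_bpe`, `merge`, and `decode` are hypothetical, the tie-breaking rule (prefer the larger pair) is an arbitrary choice, and the pre-splitting of text that production tokenizers typically apply before counting is skipped.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Toy byte-level BPE trainer: returns (merges, final token ids)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges = {}                       # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pair_counts:
            break
        # Most frequent pair; ties go to the larger pair (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_id = 256 + len(merges)    # new tokens are numbered after the bytes
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids

def decode(ids, merges):
    """Expand token ids back to text by replaying the merges in order."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(ids)
print(decode(ids, merges))
```

Each merge adds one vocabulary entry and shortens the sequence, so the number of merges is the knob that trades vocabulary size against sequence length.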

The single capability this lesson builds: explain what building an LLM from scratch entails end to end, and describe what a tokenizer does as the model’s first component. Concretely, you will be able to:

  • Explain what building an LLM from scratch entails end to end
  • Explain why efficiency (FLOPs, memory, hardware) is the track’s through-line
  • Describe what a tokenizer does as the model’s first component
  • Explain why subword tokenization beats character- and word-level
  • Describe the byte-level BPE training procedure and its trade-offs

  • Read time: about 13 minutes
  • Practice time: about 12 minutes (trace BPE merges by hand, plus flashcards)
  • Difficulty: deep (Stage C, the deepest tier; this opener is conceptual, but the track assumes Python/PyTorch fluency)