Data, part 1: sources and datasets
What you’ll learn
Scaling laws give the rule of thumb D ~= 20N (about 20 training tokens D per model parameter N), which at frontier-model size means trillions of training tokens. This lesson covers where those tokens actually come from. The source curriculum is Stanford CS336, Lecture 13, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.
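To make the D ~= 20N arithmetic concrete, here is a minimal sketch. The model sizes are hypothetical and chosen only for illustration, and the function name `token_budget` is not from the course.

```python
# Rule-of-thumb token budget: D ~= 20 * N (training tokens per parameter).
TOKENS_PER_PARAM = 20  # the ratio quoted in this lesson


def token_budget(n_params: float) -> float:
    """Return the approximate number of training tokens for n_params parameters."""
    return TOKENS_PER_PARAM * n_params


# Hypothetical model sizes (assumption, for illustration only).
for n in (1e9, 7e9, 70e9):
    print(f"{n / 1e9:>5.0f}B params -> {token_budget(n) / 1e12:.2f}T tokens")
```

Under this rule, the 70B-parameter case already works out to 1.4 trillion tokens, which is why the question of where the text comes from becomes unavoidable at scale.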
You will name the six categories of pretraining text source (web crawls, wikis, books, code, math/academic, social/forum); recognize the major open pretraining datasets (The Pile, RedPajama, FineWeb, RefinedWeb) and what each provides; describe the 50-to-1000x raw-to-final funnel that turns raw Common Crawl into a usable corpus; understand why sampling weights differ from raw byte fractions; and reason about how the mix shapes downstream capability.
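The 50-to-1000x funnel can be read as simple division. The sketch below assumes a hypothetical raw crawl size (not a measured Common Crawl figure) and works in any consistent unit, bytes or tokens; `funnel_range` is an illustrative helper, not part of any library.

```python
def funnel_range(raw_size: float, min_shrink: float = 50, max_shrink: float = 1000):
    """Return (smallest, largest) final corpus size for the given shrink factors."""
    return raw_size / max_shrink, raw_size / min_shrink


# Hypothetical raw size in arbitrary units (assumption, for illustration only).
raw = 1e15
low, high = funnel_range(raw)
print(f"raw {raw:.1e} -> final between {low:.1e} and {high:.1e}")
```

The point of the sketch is the spread: the same raw input yields final corpora that differ by a factor of 20 depending on how aggressive the pipeline is.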
§6 framing note: where training data comes from is also a legal and policy question with active debate. This lesson does not take a position on those debates; it teaches the data-engineering mechanics. This is the same technical-not-legal discipline applied elsewhere in the fleet.
Where this fits
This is lesson 11 of 14, the third lesson of Phase 3 (scale, data, and alignment). It builds on lesson 9 (where scaling laws made D ~= 20N central) and lesson 10 (the eval discipline that decides whether changes to the mix actually help). The next lesson opens the later stages of the funnel introduced here: filtering, deduplication, mixing, and synthetic data.
Before you start
Prerequisites: lesson 9 (the scaling-laws context that makes trillions of tokens unavoidable). Track 14 lesson 5 (the datasets library) is a helpful user-side analogue, though this lesson works at the level of building the corpus rather than consuming it.
About the math
None. This lesson is structural data engineering: categories, scales, funnel proportions, sampling weights. Expect reasoning about percentages and shrink factors, not formulas.
By the end, you’ll be able to
The single capability this lesson builds: explain where LLM training data comes from and how a training corpus is assembled. Concretely, you will be able to:
- Name the six categories of pretraining text source
- Recognize the major open pretraining datasets and what they provide
- Describe the raw-to-final corpus funnel
- Explain why sampling weights differ from raw byte fractions
- Reason about how the mix shapes downstream capability
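As a rough illustration of why sampling weights differ from raw byte fractions, the sketch below compares the two for a three-source mix. Every number here (source sizes, weights, and budget) is hypothetical, and the helper names are illustrative only.

```python
# Hypothetical source sizes in GB and hand-picked sampling weights (assumptions).
sizes = {"web": 900.0, "code": 60.0, "wiki": 40.0}
weights = {"web": 0.70, "code": 0.20, "wiki": 0.10}


def raw_fractions(sizes):
    """Share of each source if you sampled in proportion to its raw size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


def passes(sizes, weights, budget):
    """Passes over each source when `budget` GB of text is drawn using `weights`."""
    return {name: weights[name] * budget / sizes[name] for name in sizes}


fractions = raw_fractions(sizes)
epochs = passes(sizes, weights, budget=1000.0)
for name in sizes:
    print(f"{name:>4}: raw {fractions[name]:.2f}, weight {weights[name]:.2f}, "
          f"passes {epochs[name]:.2f}")
```

In this made-up mix the web source is 90% of the bytes but only 70% of the samples, so it is seen less than once, while the smaller code and wiki sources are repeated several times. That trade-off between upweighting and repetition is what a sampling-weight allocation has to balance.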
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 10 minutes (sketch a sampling-weight allocation, reason about pure-web pretraining, and review flashcards)
- Difficulty: deep (Stage C; conceptual data-engineering, no math, kept strictly technical)