References: Data sources and datasets
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 13: Data, sources and datasets
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."
This lesson mirrors the structure of Lecture 13 (data, part 1). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. This lesson is taught at a strictly technical (data-engineering) level; legal and policy questions about training data are out of scope here.
Watch this next
- Stanford CS336, Lecture 13: Data, sources and datasets by Hashimoto and Liang. The lecture this lesson mirrors. It walks the funnel and mixing decisions with worked examples and numbers, the natural next step once the picture here is clear.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” by Gao et al. (2020). The introduction of The Pile, including its 22 sub-datasets and their motivations. The clearest worked example of “what does a real pretraining corpus look like” at the time, and still a useful template.
- The FineWeb dataset and report by Hugging Face (2024). The current reference open-web corpus, with its filtering and dedup recipes published. Skim the dataset card for the funnel decisions modern recipes actually make.
- Common Crawl. The public archive that almost every modern web-pretraining dataset is built from. Worth reading the overview to understand the raw shape, scale, and update cadence.
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2) and Scaling laws (lesson 9). The D ~= 20N rule from scaling laws is what makes data volume the central concern: trillions of tokens, the funnel, and the mix.
- Tokenizers up close (Track 14 Lesson 6). The tokenizer is trained on a corpus drawn from these same sources; the corpus and the tokenizer co-determine the units the model sees.
- Data filtering, deduplication, and mixing (lesson 12). The next lesson opens the later stages of the funnel from this lesson, where most of the data engineering actually happens.
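The D ~= 20N rule cited in the adjacent topics above can be turned into quick arithmetic. The following is a minimal sketch, not a definitive implementation: the function name `token_budget` and the example model sizes are hypothetical, and the factor of 20 is treated as a rule of thumb rather than an exact constant.

```python
def token_budget(n_params: int, tokens_per_param: int = 20) -> int:
    """Approximate training-token count D for a model with N parameters.

    Applies the D ~= 20N rule of thumb; tokens_per_param is an assumed,
    adjustable ratio, not a fixed law.
    """
    return tokens_per_param * n_params


if __name__ == "__main__":
    # Illustrative model sizes (hypothetical, chosen only for the arithmetic).
    for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
        d = token_budget(n_params)
        print(f"N = {n_params / 1e9:.0f}B params -> D ~= {d / 1e12:.2f}T tokens")
```

Under these assumptions, a 70-billion-parameter model lands at roughly 1.4 trillion tokens, which is why the list above speaks of trillions of tokens once the funnel has discarded most of the raw crawl.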