References: Data sources and datasets

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 13:
Data, sources and datasets
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist
https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture
videos are on YouTube under standard terms; slides are public on GitHub
without a stated license.
Required attribution: "Based on the structure of Stanford CS336,
'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
(cs336.stanford.edu). This is an independent structural mirror in
original prose; it reproduces no course materials, and Stanford does
not endorse it."
This lesson mirrors the structure of Lecture 13 (data, part 1). Clawdemy's
lessons are original prose that follows the pedagogical arc of the course.
Because the source publishes no explicit license, we cite it as a recommended
companion and reproduce none of its materials. This lesson is taught at a
strictly technical (data-engineering) level; legal and policy questions
about training data are out of scope here.

A short, durable list. Each link is a specific next step, not a generic pile.

  • “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” by Gao et al. (2020). The introduction of The Pile, including its 22 sub-datasets and their motivations. The clearest worked example of “what does a real pretraining corpus look like” at the time, and still a useful template.

  • The FineWeb dataset and report by Hugging Face (2024). The current reference open-web corpus, with its filtering and deduplication recipes published. Skim the dataset card to see the funnel decisions that modern recipes actually make.

  • Common Crawl. The public archive that almost every modern web-pretraining dataset is built from. Worth reading the overview to understand the raw shape, scale, and update cadence.

Where this connects inside the track.

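  • Counting the cost (lesson 2) and Scaling laws (lesson 9). The D ~= 20N rule of thumb from scaling laws (roughly 20 training tokens, D, per model parameter, N) is what makes data volume the central concern: it is why corpora run to trillions of tokens, and why the funnel and the mix matter.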

  • Tokenizers up close (Track 14 Lesson 6). The tokenizer is trained on a corpus drawn from these same sources; the corpus and the tokenizer co-determine the units the model sees.

  • Data filtering, deduplication, and mixing (lesson 12). The next lesson opens the later stages of the funnel from this lesson, where most of the data engineering actually happens.
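
To make the D ~= 20N rule of thumb from the first bullet concrete, here is a minimal sketch of the arithmetic. The function name `token_budget` and the example model sizes are illustrative assumptions, not taken from the lesson or the course, and the ratio of 20 is a heuristic rather than an exact law.

```python
# Rule-of-thumb ratio from scaling laws: about 20 training tokens per parameter.
# This is a heuristic; real training runs may deliberately depart from it.
TOKENS_PER_PARAM = 20


def token_budget(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Return the approximate training-token count D for a model with N parameters."""
    return tokens_per_param * n_params


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show the scale involved.
    for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
        d = token_budget(n_params)
        print(f"N = {n_params / 1e9:.0f}B parameters -> D ~ {d / 1e12:.2f}T tokens")
```

Under this heuristic a 70-billion-parameter model calls for about 1.4 trillion tokens, which is why data volume, and the funnel that produces it, becomes the central concern.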