References: Data sources and datasets

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 13:
    Data, sources and datasets
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides are public on GitHub
    without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson mirrors the structure of Lecture 13 (data, part 1). Clawdemy's
lessons are original prose that follows the pedagogical arc of the course.
Because the source publishes no explicit license, we cite it as a recommended
companion and reproduce none of its materials. This lesson is taught at a
strictly technical (data-engineering) level; legal and policy questions
about training data are out of scope here.

Watch this next

Stanford CS336, Lecture 13: Data, sources and datasets by Hashimoto and Liang. The lecture this lesson mirrors. It walks through the funnel and mixing decisions with worked examples and numbers, which makes it the natural next step once the picture here is clear.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“The Pile: An 800GB Dataset of Diverse Text for Language Modeling” by Gao et al. (2020). The paper that introduced The Pile, describing its 22 sub-datasets and the motivation for each. At the time, the clearest worked example of what a real pretraining corpus looks like, and still a useful template.
The FineWeb dataset and report by Hugging Face (2024). A widely used open-web reference corpus whose filtering and deduplication recipes are published. Skim the dataset card for the funnel decisions modern recipes actually make.
Common Crawl. The public archive that almost every modern web-pretraining dataset is built from. Worth reading the overview to understand the raw shape, scale, and update cadence.

Adjacent topics

Where this connects inside the track.

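Counting the cost (lesson 2) and Scaling laws (lesson 9). The D ≈ 20N rule of thumb from scaling laws (about 20 training tokens, D, for each model parameter, N) is what makes data volume the central concern: it is why training runs need trillions of tokens, and why the funnel and the mix matter.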
Tokenizers up close (Track 14 Lesson 6). The tokenizer is trained on a corpus drawn from these same sources; the corpus and the tokenizer co-determine the units the model sees.
Data filtering, deduplication, and mixing (lesson 12). The next lesson opens the later stages of the funnel from this lesson, where most of the data engineering actually happens.
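
To make the 20-tokens-per-parameter rule of thumb cited above concrete, here is a minimal Python sketch of the arithmetic. The function name token_budget and the constant TOKENS_PER_PARAM are hypothetical, chosen for illustration; the ratio of 20 is the approximation quoted in this lesson, not an exact law, and real training runs often deviate from it.

```python
# Illustrative sketch only: names here are assumptions, not part of any library.
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio D / N; an approximation


def token_budget(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate training-token count D for a model with N parameters."""
    return tokens_per_param * n_params


if __name__ == "__main__":
    # Example model sizes (1B, 7B, 70B parameters) picked for illustration.
    for n in (1e9, 7e9, 70e9):
        d = token_budget(n)
        print(f"N = {n / 1e9:.0f}B params -> D ~ {d / 1e12:.2f}T tokens")
```

Under this assumption a 70B-parameter model calls for roughly 1.4 trillion tokens of post-funnel data, which is the scale that makes sourcing, filtering, and mixing the central engineering problem.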