LLM data sources, part 1: brief

What you’ll learn

Scaling laws say D ~= 20N (training tokens D at roughly 20 times the parameter count N), which at frontier-model size means trillions of training tokens. This lesson covers where those tokens actually come from. The source curriculum is Stanford CS336, Lecture 13, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube and the course site is cs336.stanford.edu.
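
To make the scale concrete, the D ~= 20N rule of thumb can be turned into a few lines of arithmetic. This is a minimal sketch: the function name chinchilla_tokens and the three model sizes are illustrative assumptions, not figures from the lecture.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate training-token budget D for N parameters, using D ~= 20N."""
    return tokens_per_param * n_params


# Hypothetical model sizes, chosen only to show the order of magnitude.
for n_params in (7e9, 70e9, 400e9):
    d_tokens = chinchilla_tokens(n_params)
    print(f"N = {n_params / 1e9:.0f}B params -> D ~= {d_tokens / 1e12:.2f}T tokens")
```

Under this rule, any model in the tens of billions of parameters already needs more than a trillion tokens, which is why the rest of the lesson is about sourcing text at that scale.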

You will name the six categories of pretraining text source (web crawls, wikis, books, code, math/academic, social/forum); recognize the major open pretraining datasets (The Pile, RedPajama, FineWeb, RefinedWeb) and what each provides; describe the 50-to-1000x raw-to-final funnel that turns raw Common Crawl into a usable corpus; understand why sampling weights differ from raw byte fractions; and reason about how the mix shapes downstream capability.
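
The 50-to-1000x funnel is easiest to reason about as a simple division. The sketch below assumes a made-up raw crawl size and uses hypothetical helper names (final_size, raw_needed); only the 50x and 1000x endpoints come from the lesson.

```python
def final_size(raw_size: float, shrink_factor: float) -> float:
    """Size left after a funnel that shrinks its input by shrink_factor (50 means 50x)."""
    if shrink_factor < 1:
        raise ValueError("shrink_factor must be >= 1")
    return raw_size / shrink_factor


def raw_needed(target_size: float, shrink_factor: float) -> float:
    """Raw input required to end up with target_size after the same funnel."""
    if shrink_factor < 1:
        raise ValueError("shrink_factor must be >= 1")
    return target_size * shrink_factor


RAW_TB = 1000.0  # hypothetical raw crawl size in TB, for illustration only

for shrink in (50, 1000):
    kept = final_size(RAW_TB, shrink)
    print(f"{shrink}x funnel: {RAW_TB:.0f} TB raw -> {kept:.0f} TB usable")
```

Read in the other direction, raw_needed shows why the raw crawl has to be so much larger than the corpus you finally train on.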

§6 framing note: where training data comes from is also a legal and policy question with active debate. This lesson does not take a position on those debates; it teaches the data-engineering mechanics. It follows the same technical-not-legal discipline as elsewhere in the fleet.

Where this fits

This is lesson 11 of 14, the third lesson of Phase 3 (scale, data, and alignment). It builds on lesson 9 (scaling laws, which made D ~= 20N central) and lesson 10 (the eval discipline that decides whether changes to the mix actually help). The next lesson opens the later stages of the funnel introduced here: filtering, deduplication, mixing, and synthetic data.

Before you start

Prerequisites: lesson 9 (the scaling-laws context that makes trillions of tokens unavoidable). Track 14 lesson 5 (the datasets library) is a helpful analogue from the dataset-user side, though this lesson works at the corpus-building level.

About the math

None. This lesson is structural data engineering: categories, scales, funnel proportions, and sampling weights. It asks you to reason about percentages and shrink factors, not formulas.

By the end, you’ll be able to

The single capability this lesson builds: explain where LLM training data comes from and how a training corpus is assembled. Concretely, you will be able to:

Name the six categories of pretraining text source
Recognize the major open pretraining datasets and what they provide
Describe the raw-to-final corpus funnel
Explain why sampling weights differ from raw byte fractions
Reason about how the mix shapes downstream capability
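
One way to see why sampling weights and raw byte fractions diverge is to compute both for the same corpus. In the sketch below every number is made up for illustration (SIZES_GB, SAMPLING_WEIGHTS, and the training budget are assumptions, not values from any real dataset); the six keys mirror the source categories named in this lesson.

```python
from typing import Dict

# Hypothetical corpus sizes per source category, in GB. Made-up numbers.
SIZES_GB: Dict[str, float] = {
    "web": 8000.0, "wikis": 20.0, "books": 100.0,
    "code": 600.0, "math_academic": 200.0, "social_forum": 80.0,
}

# Hypothetical sampling weights chosen by the corpus builder; they sum to 1.
SAMPLING_WEIGHTS: Dict[str, float] = {
    "web": 0.70, "wikis": 0.04, "books": 0.05,
    "code": 0.12, "math_academic": 0.06, "social_forum": 0.03,
}


def byte_fractions(sizes: Dict[str, float]) -> Dict[str, float]:
    """Share of the raw corpus that each source occupies by size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


def effective_epochs(sizes: Dict[str, float], weights: Dict[str, float],
                     budget: float) -> Dict[str, float]:
    """How many times each source is seen if `budget` is split by `weights`."""
    return {name: weights[name] * budget / sizes[name] for name in sizes}


budget_gb = sum(SIZES_GB.values())  # assume training consumes one corpus-worth of data
fractions = byte_fractions(SIZES_GB)
epochs = effective_epochs(SIZES_GB, SAMPLING_WEIGHTS, budget_gb)

for name in SIZES_GB:
    print(f"{name:14s} bytes={fractions[name]:6.2%} "
          f"weight={SAMPLING_WEIGHTS[name]:6.2%} epochs={epochs[name]:5.2f}")
```

With these invented numbers, a small source that is upweighted well above its byte fraction is repeated many times, while the web crawl is seen less than once. That trade-off is the thing to weigh when sketching an allocation.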

Time and difficulty

Read time: about 13 minutes
Practice time: about 10 minutes (sketch a sampling-weight allocation, reason about pure-web pretraining, and review flashcards)
Difficulty: deep (Stage C; conceptual data-engineering, no math, kept strictly technical)