Summary: Data sources and datasets

Scaling laws demand trillions of tokens, and a pretraining corpus is a deliberate mix of several large slices. The categories are web crawls (the bulk by bytes, almost always Common Crawl), wikis (Wikipedia and related), books and long-form text, code (public repositories), math and academic text (arXiv-class), and social/forum text (Reddit, Stack Exchange). The reference open datasets that make a from-scratch run possible are The Pile, RedPajama, FineWeb, and RefinedWeb. Going from raw web crawl to a final training corpus is a 50-to-1000x funnel of HTML extraction, language and quality filtering, deduplication, and mixing; this funnel is where most of the data-engineering work happens. Sources are mixed by sampling weights, not byte fractions: code is up-weighted past its raw share because it lifts coding and reasoning; high-density slices (wikis, books) are over-weighted; small high-quality slices may be repeated for multiple epochs while the large web slice is seen once or less. This lesson is technical, not legal: the legal and policy debates around the data are out of scope here.
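The funnel arithmetic can be sketched in a few lines: each stage keeps some fraction of what it receives, and the overall shrink factor is the reciprocal of the product of those fractions. The stage names follow the text, but the keep rates below are hypothetical placeholders chosen only to land inside the 50-to-1000x range; they are not measurements from any real pipeline.

```python
from math import prod

# Hypothetical keep rates (fraction of bytes surviving each stage).
# Illustrative placeholders only, not figures from a real pipeline.
STAGES = [
    ("html_extraction", 0.10),
    ("language_filter", 0.40),
    ("quality_filter", 0.30),
    ("deduplication", 0.50),
]


def funnel(raw_bytes, stages):
    """Return a list of (stage_name, bytes_remaining) after each stage."""
    remaining = float(raw_bytes)
    trace = []
    for name, keep in stages:
        if not 0.0 < keep <= 1.0:
            raise ValueError(f"keep rate for {name} must be in (0, 1]")
        remaining *= keep
        trace.append((name, remaining))
    return trace


def shrink_factor(stages):
    """Overall raw-to-final ratio implied by the per-stage keep rates."""
    return 1.0 / prod(keep for _, keep in stages)


if __name__ == "__main__":
    raw = 100e12  # hypothetical 100 TB of raw crawl
    for name, remaining in funnel(raw, STAGES):
        print(f"{name:16s} {remaining / 1e12:8.2f} TB")
    print(f"shrink factor: {shrink_factor(STAGES):.0f}x")
```

Because the rates multiply, a modest change at any single stage moves the final corpus size by the same proportion, which is one reason each stage is worth measuring separately.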

  • A corpus is a mix, not one source. Web crawls (Common Crawl, CC) dominate by bytes; wikis, books, code, math/academic, and social/forum text are the other categories. Pick combinations deliberately.
  • Reference open datasets. The Pile (~800GB, 22 sub-datasets), RedPajama (~1.2T tokens, LLaMA-style), FineWeb (~15T tokens of web text), RefinedWeb (an earlier cleaned web corpus). These make from-scratch training feasible.
  • The raw-to-final funnel (50-1000x shrink): HTML-to-text extraction and language filtering, deeper quality filtering, deduplication, then mixing with non-web sources. Most of the data work happens here.
  • Sampling weights are not byte fractions. Code, wikis, books, and math are up-weighted relative to their byte share for density and capability lift.
  • Code in the mix lifts reasoning, not just coding. This is a well-documented effect on math and reasoning benchmarks.
  • Technical-not-legal. Where the data comes from is also a legal and policy question with active debate; that is beyond this track’s scope. This lesson is the data-engineering mechanics.
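The gap between sampling weights and byte fractions is easiest to see as arithmetic. The sketch below assumes a token budget and compares each source's share of the raw data with its share of the samples, then derives how many times each source is seen. The source sizes, weights, and budget are hypothetical numbers picked for readability, not the mix of any published model.

```python
def normalize(values):
    """Scale a dict of non-negative numbers so they sum to 1."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return {name: v / total for name, v in values.items()}


def mix_report(sizes_tokens, sampling_weights, budget_tokens):
    """Compare each source's size share with its sampling share.

    upweight = sampling share / size share (1.0 means sampled by size)
    epochs   = tokens drawn from the source / tokens available in it
    """
    size_share = normalize(sizes_tokens)
    sample_share = normalize(sampling_weights)
    report = {}
    for name, size in sizes_tokens.items():
        tokens_drawn = sample_share[name] * budget_tokens
        report[name] = {
            "size_share": size_share[name],
            "sample_share": sample_share[name],
            "upweight": sample_share[name] / size_share[name],
            "epochs": tokens_drawn / size,
        }
    return report


# Hypothetical corpus: sizes in tokens, weights as sampling probabilities.
SIZES = {"web": 1000e9, "code": 100e9, "wiki": 10e9}
WEIGHTS = {"web": 0.70, "code": 0.20, "wiki": 0.10}
BUDGET = 500e9  # hypothetical training budget in tokens

if __name__ == "__main__":
    for name, row in mix_report(SIZES, WEIGHTS, BUDGET).items():
        print(
            f"{name:5s} size={row['size_share']:.3f} "
            f"sample={row['sample_share']:.3f} "
            f"upweight={row['upweight']:.2f}x epochs={row['epochs']:.2f}"
        )
```

With these placeholder numbers the web slice is seen about 0.35 times, the code slice once, and the wiki slice five times, which is the "small slices repeated, web seen once or less" pattern described above.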

This lesson gives the structural picture: a frontier LLM corpus is a deliberate mix of several large categories, drawn from a few dominant sources, with the bulk of the bytes coming from a web crawl funnel and the rest from curated additions. The mixing weights and the funnel are real research dials that affect what the model becomes good at. Treating data sourcing as a pipeline you understand and tune (sources, scale, funnel, mix) is what separates a training run that produces a competent model from one that produces an opinionated one with mysterious gaps. The next lesson opens the filtering, deduplication, and mixing decisions in detail.

A frontier LLM is trained on a deliberate mix of several large text slices, with web crawls as the bulk and curated sources tuning the rest. The funnel from raw crawl to final tokens is most of the data work, and the mixing weights are a real research dial. The next lesson covers filtering, deduplication, and mixing in detail.