Cheatsheet: Data sources and datasets

| Category | Source examples | Why include |
|---|---|---|
| Web crawls | Common Crawl | Bulk by bytes; multi-language; general domain |
| Wikis | Wikipedia, related wikis | High density, well-edited, broad coverage |
| Books / long-form | Project Gutenberg, licensed corpora | Multi-paragraph coherence, formal register |
| Code | Public repositories (Stack-class) | Lifts both code AND reasoning |
| Math / academic | arXiv-class, LaTeX corpora | Dense formal-reasoning text |
| Social / forum | Reddit, Stack Exchange | Conversational, dialogue-shaped |

| Dataset | Year | Size | Notes |
|---|---|---|---|
| The Pile | 2020 | ~800 GB / 22 sub-datasets | Earliest big open mix |
| RedPajama | 2023 | ~1.2T tokens | LLaMA-style reproduction |
| RefinedWeb | 2023 | (large) | Earlier clean open web |
| FineWeb | 2024 | ~15T tokens | Current reference open-web corpus |

These open mixes come with public, documented funnels, which is what makes training from scratch feasible.

Raw Common Crawl (petabytes)
-> HTML-to-text + language + basic quality filter (~5-10x)
-> deeper quality filter (heuristics, classifiers) (~5-10x)
-> deduplication (across + within documents) (~2-10x)
-> mix with non-web sources (wikis, code, books, math, social)
Final training corpus (trillions of tokens)
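
The filter and deduplication stages of the funnel can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names (`passes_quality`, `dedup_key`, `run_funnel`), the thresholds, and the sample documents are hypothetical, and real pipelines use far richer heuristics, classifiers, and near-duplicate detection rather than the exact-match hashing shown.

```python
import hashlib

def passes_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Toy heuristic filter (hypothetical thresholds): enough words, mostly letters/spaces."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio

def dedup_key(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text. Catches exact duplicates only."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def run_funnel(docs):
    """Apply the quality filter, then drop documents whose key was already seen."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_quality(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept

# Made-up sample documents, for illustration only.
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",   # duplicate after normalization
    "$$$ 1234 !!! ### 5678",                           # mostly non-alphabetic
    "Too short",                                       # under the word minimum
    "Common Crawl is a long-running public web archive.",
]
kept = run_funnel(docs)
print(f"kept {len(kept)} of {len(docs)} documents")
```

Even in this toy version, each stage removes a large share of the input, which is the shape the funnel describes.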

A 1T-token final corpus typically starts from tens of trillions of raw tokens. Most of the data-engineering work lives in the funnel.
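
The arithmetic behind that estimate can be sketched by multiplying the per-stage shrink factors from the funnel. This is a rough back-of-envelope calculation, assuming the factors apply to token counts and ignoring the non-web sources mixed in at the end; `raw_needed` is a hypothetical helper name.

```python
def raw_needed(final_tokens: int, stage_factors: list[int]) -> int:
    """Raw tokens required if each stage shrinks the data by the given factor."""
    raw = final_tokens
    for factor in stage_factors:
        raw *= factor
    return raw

final_tokens = 10**12  # 1T-token final corpus

# Low and high ends of the quoted ranges: (~5-10x), (~5-10x), (~2-10x).
low = raw_needed(final_tokens, [5, 5, 2])
high = raw_needed(final_tokens, [10, 10, 10])

print(f"low estimate:  {low / 1e12:.0f}T raw tokens")
print(f"high estimate: {high / 1e12:.0f}T raw tokens")
```

The low end of the ranges gives about 50T raw tokens, in line with "tens of trillions"; the high end is much larger, so the real ratio depends heavily on how aggressive each stage is.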

| Source | Typical posture |
|---|---|
| Web | Bulk; single-epoch or fewer |
| Wiki, books, curated | Up-weighted; often multi-epoch |
| Code | Up-weighted past raw bytes; lifts code AND reasoning |
| Math / academic | Up-weighted for density |
| Social / forum | Moderate; shape for dialogue tasks |
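
How a mix weight turns into "single-epoch or fewer" versus "multi-epoch" can be shown with a small calculation. All numbers below (available tokens per source, mix weights, training budget) are hypothetical values chosen for illustration, not measurements of any real corpus, and `epochs_per_source` is an assumed helper name.

```python
def epochs_per_source(
    available_tokens: dict[str, float],
    mix_weights: dict[str, float],
    total_train_tokens: float,
) -> dict[str, float]:
    """Epochs over each source = tokens drawn from it / tokens it contains."""
    total_weight = sum(mix_weights.values())
    return {
        name: (mix_weights[name] / total_weight) * total_train_tokens / available_tokens[name]
        for name in mix_weights
    }

# Hypothetical post-funnel source sizes, in tokens.
available_tokens = {"web": 10e12, "wiki_books": 0.05e12, "code": 0.5e12, "math": 0.05e12}
# Hypothetical sampling weights (share of the training stream).
mix_weights = {"web": 0.70, "wiki_books": 0.10, "code": 0.15, "math": 0.05}

epochs = epochs_per_source(available_tokens, mix_weights, total_train_tokens=1e12)
for name, value in epochs.items():
    print(f"{name:>10}: {value:.2f} epochs")
```

With these made-up numbers the web source is seen for well under one epoch while the small curated source is repeated twice, which is the pattern the table describes.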

The exact mix is chosen empirically: fit scaling laws to small-scale runs across candidate mixes, then pick the mix that extrapolates best.
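
A minimal sketch of that selection loop follows, assuming a pure power law `loss = a * scale^(-b)` fitted by least squares in log-log space. This is a simplification (real fits usually include an irreducible-loss term), and the two candidate mixes and their "measurements" are synthetic values generated inside the code, not real results.

```python
import math

def fit_power_law(scales, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(scale); returns (a, b)."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(a, b, scale):
    """Loss predicted by the fitted power law at a given scale."""
    return a * scale ** (-b)

# Small-scale runs (hypothetical scales, e.g. training tokens).
scales = [1e7, 3e7, 1e8, 3e8]
# Synthetic losses generated from known power laws, standing in for measurements.
synthetic = {
    "mix_A": [predict(20.0, 0.10, s) for s in scales],
    "mix_B": [predict(30.0, 0.12, s) for s in scales],
}

target_scale = 1e10
predictions = {}
for name, losses in synthetic.items():
    a, b = fit_power_law(scales, losses)
    predictions[name] = predict(a, b, target_scale)

best = min(predictions, key=predictions.get)
print(predictions, "->", best)
```

In this synthetic setup `mix_A` has the lower loss at every small scale that was run, yet `mix_B` has the lower extrapolated loss at the target scale, which is why the choice is made on the extrapolation rather than on the small-scale numbers themselves.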

| Lever | Effect |
|---|---|
| More code | Higher coding + reasoning scores |
| More web | Web-style prose; more noise to filter |
| More books / academic | Long-form coherence; formal register |
| Broader languages | Better non-English; long-tail domain coverage |

Where the data comes from is also a legal and policy question under active debate. This lesson does not take a position on those debates; it teaches the data-engineering mechanics, following the same discipline as the bias and privacy lessons elsewhere in the fleet.

  • Common Crawl: long-running public web archive; bulk source for most LLMs.
  • Mix / sampling weight: per-source proportion in the training stream (often different from raw bytes share).
  • Funnel: the multi-step shrink from raw crawl to final corpus.
  • Open mix: a publicly-released pretraining dataset (Pile, RedPajama, FineWeb, etc.).
  • Stanford CS336, Lecture 13 (Data, sources and datasets), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.