Data sources and datasets: cheatsheet

The six source categories

Category	Source examples	Why include
Web crawls	Common Crawl	Bulk by bytes; multi-language; general domain
Wikis	Wikipedia, related wikis	High density, well-edited, broad coverage
Books / long-form	Project Gutenberg, licensed corpora	Multi-paragraph coherence, formal register
Code	Public repositories (Stack-class)	Lifts both code AND reasoning
Math / academic	arXiv-class, LaTeX corpora	Dense formal-reasoning text
Social / forum	Reddit, Stack Exchange	Conversational, dialogue-shaped

Reference open datasets

Dataset	Year	Size	Notes
The Pile	2020	~800 GB / 22 sub-datasets	Earliest big open mix
RedPajama	2023	~1.2T tokens	LLaMA-style reproduction
RefinedWeb	2023	~5T tokens (~600B publicly released)	Earlier filtered, deduplicated open-web corpus
FineWeb	2024	~15T tokens	Current reference open-web corpus

All are public, with documented funnels; that is what makes from-scratch training feasible.

The raw-to-final funnel (50-1000x shrink)

Raw Common Crawl                                (petabytes)
  -> HTML-to-text + language + basic quality filter   (~5-10x)
  -> deeper quality filter (heuristics, classifiers)  (~5-10x)
  -> deduplication (across + within documents)        (~2-10x)
  -> mix with non-web sources (wikis, code, books, math, social)
Final training corpus                            (trillions of tokens)
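
A toy version of the filter and dedup stages above, as a minimal sketch. The thresholds, the helper names (`passes_heuristics`, `exact_dedup`, `run_funnel`), and the normalize-then-hash dedup key are illustrative assumptions, not the rules of any published pipeline. Real funnels also add language identification, classifier-based filters, and near-duplicate (fuzzy) dedup.

```python
import hashlib

def passes_heuristics(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical quality heuristic: enough words, and not mostly symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_alnum = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return non_alnum / len(text) <= max_symbol_ratio

def exact_dedup(docs):
    """Keep the first copy of each document, comparing case- and whitespace-normalized text."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def run_funnel(docs):
    """Quality filter first, then dedup, mirroring the order of the stages above."""
    filtered = [d for d in docs if passes_heuristics(d)]
    return exact_dedup(filtered)

raw_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "$$$ ### !!! >>> <<< ***",                         # mostly symbols
    "Too short.",                                      # under the word minimum
    "Common Crawl snapshots are the bulk source for most open corpora.",
]
final_docs = run_funnel(raw_docs)
print(len(raw_docs), "->", len(final_docs))  # 5 -> 2
```

Filtering before dedup keeps the hash set small; production pipelines may order or interleave these steps differently.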

At a 50-1000x shrink, a 1T-token final corpus typically starts from tens to hundreds of trillions of raw tokens. Most data-engineering work lives in the funnel.
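
The 50-1000x headline is the product of the per-stage ranges in the funnel. A short sketch of that arithmetic follows; the stage labels and function names are made up for illustration, and it assumes the shrink factors apply uniformly to token counts and ignores the tokens added back by the non-web mix step.

```python
from math import prod

# Per-stage shrink ranges (low, high), copied from the funnel above.
STAGES = {
    "extract + language + basic filter": (5, 10),
    "deeper quality filter": (5, 10),
    "deduplication": (2, 10),
}

def total_shrink(stages):
    """Multiply the low ends and the high ends to get the overall range."""
    low = prod(lo for lo, _ in stages.values())
    high = prod(hi for _, hi in stages.values())
    return low, high

def raw_tokens_needed(final_tokens, stages):
    """Raw tokens required to end up with `final_tokens` after the funnel."""
    low, high = total_shrink(stages)
    return final_tokens * low, final_tokens * high

low, high = total_shrink(STAGES)
print(f"overall shrink: {low}x to {high}x")  # overall shrink: 50x to 1000x

raw_lo, raw_hi = raw_tokens_needed(1e12, STAGES)
print(f"raw tokens for a 1T-token corpus: {raw_lo / 1e12:.0f}T to {raw_hi / 1e12:.0f}T")
```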

Sampling weights (not byte fractions)

Source	Typical posture
Web	Bulk; single-epoch or fewer
Wiki, books, curated	Up-weighted; often multi-epoch
Code	Up-weighted past raw bytes; lifts code AND reasoning
Math / academic	Up-weighted for density
Social / forum	Moderate; shape for dialogue tasks
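
To see why weights and byte fractions diverge, the sketch below converts sampling weights into effective epochs per source. All corpus sizes and weights are hypothetical numbers chosen for illustration, not taken from any real mix; `effective_epochs` is an assumed helper name.

```python
def effective_epochs(source_tokens, weights, train_tokens):
    """Number of passes over each source when the stream is sampled by `weights`."""
    total_w = sum(weights.values())
    return {
        name: (weights[name] / total_w) * train_tokens / source_tokens[name]
        for name in source_tokens
    }

# Hypothetical available tokens per source (illustrative only).
source_tokens = {"web": 900e9, "wiki": 5e9, "code": 60e9, "math": 10e9, "social": 25e9}
# Hypothetical sampling weights; note they differ from the raw token shares.
weights = {"web": 0.70, "wiki": 0.03, "code": 0.17, "math": 0.05, "social": 0.05}

epochs = effective_epochs(source_tokens, weights, train_tokens=1e12)
total = sum(source_tokens.values())
for name in source_tokens:
    raw_share = source_tokens[name] / total
    print(f"{name:7s} raw share {raw_share:6.1%}  weight {weights[name]:6.1%}  epochs {epochs[name]:.2f}")
```

With these made-up numbers, web stays under one epoch while wiki is repeated about six times, which matches the postures in the table: bulk sources are sampled once or less, small curated sources are repeated.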

The exact mix is empirical: train small models on each candidate mix, fit scaling laws to the results, and pick the mix that extrapolates best.
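
One way to picture "pick what extrapolates best", as a minimal sketch: fit a pure power law, loss = a * C^(-b), to small-scale runs of each candidate mix and compare predictions at a larger compute budget. The loss values, mix names, and target budget are synthetic and made up for illustration; real fits usually include an irreducible-loss term and many more runs.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope

def predict(a, b, compute):
    """Extrapolated loss at a given compute budget."""
    return a * compute ** (-b)

# Synthetic validation losses at three small compute budgets (FLOPs)
# for two hypothetical candidate mixes.
compute = [1e17, 1e18, 1e19]
runs = {
    "mix_A_more_web": [3.90, 3.50, 3.14],
    "mix_B_more_code": [3.95, 3.50, 3.10],
}

target_compute = 1e22  # hypothetical full-scale budget
predictions = {}
for name, losses in runs.items():
    a, b = fit_power_law(compute, losses)
    predictions[name] = predict(a, b, target_compute)
    print(f"{name}: b = {b:.4f}, predicted loss at target = {predictions[name]:.3f}")

best = min(predictions, key=predictions.get)
print("selected:", best)
```

In this synthetic data mix B is slightly worse at the smallest budget but has the steeper slope, so it wins after extrapolation; that is the reason to compare fitted curves and not single small-scale runs.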

What different mixes change

Lever	Effect
More code	Higher coding + reasoning scores
More web	Web-style prose; more noise to filter
More books / academic	Long-form coherence; formal register
Broader languages	Better non-English; long-tail domain coverage

Technical-not-legal note

Where the data comes from is also a legal and policy question under active debate. This lesson does not take a position on those debates; it teaches the data-engineering mechanics. It follows the same discipline as the bias and privacy lessons elsewhere in the fleet.

Words to use precisely

Common Crawl: long-running public web archive; bulk source for most LLMs.
Mix / sampling weight: per-source proportion in the training stream (often different from the source's share of raw bytes).
Funnel: the multi-step shrink from raw crawl to final corpus.
Open mix: a publicly-released pretraining dataset (Pile, RedPajama, FineWeb, etc.).

Source

Stanford CS336, Lecture 13 (Data, sources and datasets), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.