Cheatsheet: Data sources and datasets
The six source categories
| Category | Source examples | Why include |
|---|---|---|
| Web crawls | Common Crawl | Bulk by bytes; multi-language; general domain |
| Wikis | Wikipedia, related wikis | High density, well-edited, broad coverage |
| Books / long-form | Project Gutenberg, licensed corpora | Multi-paragraph coherence, formal register |
| Code | Public repositories (Stack-class) | Lifts both code AND reasoning |
| Math / academic | arXiv-class, LaTeX corpora | Dense formal-reasoning text |
| Social / forum | Reddit, Stack Exchange | Conversational, dialogue-shaped |
Reference open datasets
| Dataset | Year | Size | Notes |
|---|---|---|---|
| The Pile | 2020 | ~800 GB / 22 sub-datasets | Earliest big open mix |
| RedPajama | 2023 | ~1.2T tokens | LLaMA-style reproduction |
| RefinedWeb | 2023 | ~5T tokens (~600B-token public extract) | Earlier filtered open-web corpus |
| FineWeb | 2024 | ~15T tokens | Current reference open-web corpus |
These datasets come with public, documented funnels, which is what makes training from scratch feasible.
The raw-to-final funnel (50-1000x shrink)
Raw Common Crawl (petabytes)

- -> HTML-to-text + language + basic quality filter (~5-10x)
- -> deeper quality filter (heuristics, classifiers) (~5-10x)
- -> deduplication (across + within documents) (~2-10x)
- -> mix with non-web sources (wikis, code, books, math, social)

Final training corpus (trillions of tokens)

A 1T-token final corpus typically starts from tens of trillions of raw tokens, and more at the high end of the shrink range. Most data-engineering work lives in the funnel.
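The per-stage factors in the funnel multiply, which is where the 50-1000x range in the heading comes from. A minimal sketch of that arithmetic, assuming the three filtering stages are independent and ignoring the tokens added back by non-web sources; the names `STAGE_SHRINK`, `total_shrink`, and `raw_tokens_needed` are illustrative, not from the lecture:

```python
from math import prod

# Per-stage shrink ranges (low, high), copied from the funnel above.
STAGE_SHRINK = {
    "extract_language_basic_filter": (5, 10),
    "deeper_quality_filter": (5, 10),
    "deduplication": (2, 10),
}


def total_shrink(stages):
    """Return (low, high) overall shrink by multiplying the per-stage factors."""
    low = prod(lo for lo, _ in stages.values())
    high = prod(hi for _, hi in stages.values())
    return low, high


def raw_tokens_needed(final_tokens, stages):
    """Raw web tokens implied by a target final size (non-web sources ignored)."""
    low, high = total_shrink(stages)
    return final_tokens * low, final_tokens * high


low, high = total_shrink(STAGE_SHRINK)
print(f"overall shrink: {low}x to {high}x")

raw_low, raw_high = raw_tokens_needed(1e12, STAGE_SHRINK)
print(f"raw tokens for a 1T-token corpus: {raw_low:.0e} to {raw_high:.0e}")
```

Under these assumptions the low end of the range (50x) corresponds to the "tens of trillions" figure; a more aggressive funnel implies proportionally more raw input.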
Sampling weights (not byte fractions)
| Source | Typical posture |
|---|---|
| Web | Bulk; single-epoch or fewer |
| Wiki, books, curated | Up-weighted; often multi-epoch |
| Code | Up-weighted past raw bytes; lifts code AND reasoning |
| Math / academic | Up-weighted for density |
| Social / forum | Moderate; shape for dialogue tasks |
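One way to see why sampling weights differ from byte fractions is to convert a mix into effective epochs per source. The sketch below does that under stated assumptions: the source sizes and weights are made-up illustrative numbers, not figures from any real dataset, and `effective_epochs` is a hypothetical helper.

```python
def effective_epochs(source_tokens, sampling_weights, train_tokens):
    """Epochs per source = tokens drawn from the source / tokens it contains."""
    total_weight = sum(sampling_weights.values())
    return {
        name: (sampling_weights[name] / total_weight) * train_tokens / size
        for name, size in source_tokens.items()
    }


# Hypothetical source sizes in tokens (illustrative only).
SOURCE_TOKENS = {"web": 900e9, "wiki": 5e9, "code": 80e9, "math": 15e9}
# Hypothetical sampling weights (illustrative only).
WEIGHTS = {"web": 0.70, "wiki": 0.03, "code": 0.20, "math": 0.07}

total = sum(SOURCE_TOKENS.values())
raw_share = {name: size / total for name, size in SOURCE_TOKENS.items()}
epochs = effective_epochs(SOURCE_TOKENS, WEIGHTS, train_tokens=1e12)

for name in SOURCE_TOKENS:
    print(
        f"{name:5s} raw share {raw_share[name]:.3f}  "
        f"weight {WEIGHTS[name]:.2f}  epochs {epochs[name]:.2f}"
    )
```

With these numbers the web source is seen for less than one epoch while the small wiki source is repeated several times, which matches the postures in the table.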
The exact mix is empirical: fit scaling laws to small-scale runs across candidate mixes, then pick the mix that extrapolates best.
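A minimal sketch of that selection loop, assuming each candidate mix follows a pure power law `loss = a * size^(-b)` with no irreducible-loss term (a simplification). The data is synthetic, generated from known power laws purely to exercise the code; `fit_power_law`, `pick_mix`, and the mix names are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def pick_mix(runs, target_size):
    """Fit each mix, extrapolate to target_size, return (best_name, predictions)."""
    predictions = {}
    for name, (sizes, losses) in runs.items():
        a, b = fit_power_law(sizes, losses)
        predictions[name] = a * target_size ** (-b)
    return min(predictions, key=predictions.get), predictions


# Synthetic small-scale runs (illustrative only, not measured results).
SIZES = [1e7, 3e7, 1e8, 3e8]
RUNS = {
    "mix_a": (SIZES, [8.0 * s ** -0.05 for s in SIZES]),
    "mix_b": (SIZES, [14.0 * s ** -0.08 for s in SIZES]),
}

best, predictions = pick_mix(RUNS, target_size=1e10)
print("best at target scale:", best)
```

In this synthetic setup `mix_a` has the lower loss at the smallest scale but `mix_b` has the steeper slope, so the extrapolation favors `mix_b`. That is the reason to compare extrapolated values rather than small-scale losses directly.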
What different mixes change
| Lever | Effect |
|---|---|
| More code | Higher coding + reasoning scores |
| More web | Web-style prose; more noise to filter |
| More books / academic | Long-form coherence; formal register |
| Broader languages | Better non-English; long-tail domain coverage |
Technical-not-legal note
Where the data comes from is also a legal and policy question with active debate. This lesson does not take a position on those debates; it teaches the data-engineering mechanics. Same discipline as the bias and privacy lessons elsewhere in the fleet.
Words to use precisely
- Common Crawl: long-running public web archive; bulk source for most LLMs.
- Mix / sampling weight: per-source proportion in the training stream (often different from the source's share of raw bytes).
- Funnel: the multi-step shrink from raw crawl to final corpus.
- Open mix: a publicly-released pretraining dataset (Pile, RedPajama, FineWeb, etc.).
Source
- Stanford CS336, Lecture 13 (Data, sources and datasets), by Hashimoto and Liang.
cs336.stanford.edu. Independent structural mirror in original prose; see references.