Practice: Data sources and datasets

Self-check

Seven short questions. Answer each before opening the collapsible.

1. Name the six categories of source a pretraining corpus typically draws from.

Show answer

Web crawls (the largest by bytes, almost always derived from Common Crawl), wikis (Wikipedia and related), books and long-form text, code (public repositories), math and academic text (arXiv-class), and social/forum text (Reddit, Stack Exchange). A corpus is a deliberate mix of these, not any single one.

2. Why is web crawl the dominant source even though most teams add other categories?

Show answer

Sheer volume: Common Crawl provides petabytes of multilingual, general-domain text, far more than any other single source. Its quality varies widely, though, so other categories (wikis, books, code) are sampled at weights above their byte share to raise information density and to lift specific capabilities (coding, reasoning, formal register).

3. Name three open pretraining datasets and what each contributes.

Show answer

Any three of: The Pile (EleutherAI, ~800 GB across 22 sub-datasets; the earliest widely-used open mix), RedPajama (Together, ~1.2T tokens; reproduction of LLaMA’s data mix), FineWeb (Hugging Face, ~15T tokens of public web from CC; current reference open-web corpus), RefinedWeb (earlier clean open web; influential for filtering recipes).

4. Describe the raw-to-final data funnel.

Show answer

(1) Raw Common Crawl at petabyte scale. (2) HTML-to-text extraction plus language and basic quality filtering shrinks it ~5-10x. (3) Deeper quality filtering (heuristics, classifiers) removes another ~5-10x. (4) Deduplication within and across documents removes another ~2-10x. (5) The final training corpus is mixed with non-web sources. The stages compound to roughly 50-1000x, so a 1T-token web corpus starts from tens to hundreds of trillions of raw tokens.
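The funnel arithmetic can be checked with a few lines of Python. This is a minimal sketch: the stage names are labels invented for the illustration, the shrink ranges are the approximate ones quoted in the answer, and it treats the whole 1T-token target as web-derived (it ignores the non-web sources mixed in at the end).

```python
# Approximate (low, high) shrink factors per funnel stage, from the answer above.
# The stage names are illustrative labels, not names from any real pipeline.
STAGES = {
    "extract_and_language_filter": (5, 10),
    "quality_filter": (5, 10),
    "dedup": (2, 10),
}


def total_shrink(stages):
    """Multiply the per-stage ranges into an overall (low, high) shrink factor."""
    low, high = 1, 1
    for stage_low, stage_high in stages.values():
        low *= stage_low
        high *= stage_high
    return low, high


def raw_tokens_needed(final_tokens, stages):
    """Raw tokens required to end up with `final_tokens`, as a (low, high) range."""
    low, high = total_shrink(stages)
    return final_tokens * low, final_tokens * high


low, high = raw_tokens_needed(1_000_000_000_000, STAGES)
print(f"overall shrink: {total_shrink(STAGES)}")                    # (50, 1000)
print(f"raw needed: {low / 1e12:.0f}T to {high / 1e12:.0f}T tokens")  # 50T to 1000T
```

The wide range is the point: small changes in how aggressive each stage is compound into an order-of-magnitude difference in how much raw crawl is needed.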

5. Why are sampling weights different from “fraction of bytes”?

Show answer

Different sources have very different information density per token. Code is a small share of crawl bytes but improves both coding and reasoning when up-weighted in training; wikis and books are smaller still but high-density and broadly useful. Sampling weights up-weight high-value categories relative to their byte fraction. The empirically chosen mix matters more than raw proportions.
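To make the difference between byte share and sampling weight concrete, here is a small sketch. All numbers are hypothetical and chosen only for illustration; they are not taken from any published data mix. The helper names `upweight_factors` and `sample_sources` are likewise made up for this example.

```python
import random

# Hypothetical shares, for illustration only (each dict sums to 1.0).
BYTE_SHARE = {"web": 0.90, "code": 0.05, "wiki": 0.01, "books": 0.04}
SAMPLING_WEIGHT = {"web": 0.60, "code": 0.25, "wiki": 0.05, "books": 0.10}


def upweight_factors(byte_share, sampling_weight):
    """Sampling weight divided by byte share; a value above 1 means up-weighted."""
    return {name: sampling_weight[name] / byte_share[name] for name in byte_share}


def sample_sources(sampling_weight, k, seed=0):
    """Draw the source of each of k training documents according to the weights."""
    rng = random.Random(seed)
    names = list(sampling_weight)
    weights = [sampling_weight[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


for name, factor in upweight_factors(BYTE_SHARE, SAMPLING_WEIGHT).items():
    print(f"{name:6s} sampled at x{factor:.2f} of its byte share")

draws = sample_sources(SAMPLING_WEIGHT, k=10_000)
print({name: draws.count(name) for name in SAMPLING_WEIGHT})
```

In this made-up mix, web is down-weighted (factor below 1) while code, wiki, and books are drawn several times more often than their bytes alone would suggest.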

6. What is one regularly observed effect of including code in the mix?

Show answer

Better code generation (expected) and better reasoning on non-code tasks (more surprising). Adding code noticeably lifts scores on code benchmarks such as HumanEval and on math-reasoning benchmarks such as GSM8K; this is a well-documented effect, even though the precise mechanism (structured/formal text, planning patterns, longer-range coherence) is still being studied.

7. How does this lesson stay technical-not-legal, and why?

Show answer

Where training data comes from is also a legal and policy question with active debate, but this lesson does not take a position on it. It describes the data-engineering mechanics: sources, scale, filtering funnel, mixing weights. The technical-not-legal discipline (same as in the bias / privacy lessons elsewhere in the fleet) is to teach the mechanics clearly and route any legal/policy framing to a different forum.

Try it yourself: design a corpus

About 10 minutes, no setup. Practice the data-engineering instincts.

Part A: a domain-shifted corpus. You are training a 3B-parameter model intended for use as a Python coding assistant on internal documentation, with general English assistant fallback. Sketch a sampling-weight allocation across the categories from this lesson. Aim for a 5-line table.

What a reasonable answer looks like

Public code (e.g. The Stack-class filtered open code)      40-50%
Internal Python docs + tutorials (small, high-quality)     10-15% (multiple epochs)
General web (filtered, e.g. FineWeb sample)                25-30%
Wikipedia + technical wikis                                 5-10%
Math/academic (arXiv-class)                                 2-5%

The key choices: code is heavily up-weighted past its raw byte share because of the use case; internal docs are a small slice that you would repeat for multiple epochs so the model sees enough of it; general web stays substantial to keep “ordinary English assistant” capability; math gets some weight to support reasoning. Exact ratios are empirical and would be tuned at small scale with scaling-law-style fits before committing.
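One way to sanity-check a table like this is to turn the ranges into normalized weights and see how many passes each slice would get. The sketch below assumes midpoints of the ranges above; the token budget and the slice sizes are hypothetical placeholders, not figures from the lesson.

```python
# Percent ranges from the table above.
RANGES = {
    "public_code": (40, 50),
    "internal_docs": (10, 15),
    "general_web": (25, 30),
    "wikis": (5, 10),
    "math_academic": (2, 5),
}

# Hypothetical values: total training tokens and unique tokens available per slice.
BUDGET_TOKENS = 60e9
SLICE_TOKENS = {
    "public_code": 200e9,
    "internal_docs": 0.5e9,
    "general_web": 1000e9,
    "wikis": 5e9,
    "math_academic": 20e9,
}


def midpoint_weights(ranges):
    """Take each range's midpoint and normalize so the weights sum to 1."""
    mids = {name: (low + high) / 2 for name, (low, high) in ranges.items()}
    total = sum(mids.values())
    return {name: mid / total for name, mid in mids.items()}


def epochs(weights, budget_tokens, slice_tokens):
    """Passes over each slice: tokens drawn from it divided by its unique tokens."""
    return {name: weights[name] * budget_tokens / slice_tokens[name] for name in weights}


weights = midpoint_weights(RANGES)
for name, passes in epochs(weights, BUDGET_TOKENS, SLICE_TOKENS).items():
    print(f"{name:14s} weight={weights[name]:.3f} epochs={passes:.2f}")
```

With these made-up sizes, the internal docs slice would be repeated roughly 15 times while the web slice is only partially sampled. That kind of check tells you whether a chosen weight implies more repetition of a small slice than you intended.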

Part B (reasoning). A teammate proposes “just train on Common Crawl, it’s huge and free.” Explain three things they’d lose, in technical terms, by skipping curation and other sources.

What you should notice

(1) Density. Wikipedia, books, and academic text have far more information per token than average web text; skipping them means more tokens to reach the same loss, costing compute (lesson 2). (2) Capability lift. Code in the mix raises both coding and reasoning evaluation scores; pure web underperforms on math and code. (3) Filtering still needed. Raw CC is mostly boilerplate, duplicates, and low-quality content; without the funnel from this lesson (and the filtering and dedup from lesson 12), the model wastes capacity on patterns you do not want it to learn. “Huge and free” is a strong starting point but not a corpus.

Part C (reasoning). Why is the funnel from raw crawl to final tokens often described as “where most of the data work happens”?

What you should notice

The 50-1000x shrink from raw to final represents most of the engineering: HTML extraction, language ID, quality filtering at multiple levels, deduplication across the corpus. Each step is a real algorithmic and infrastructure decision that affects the final model. Models trained on the same “Common Crawl” through different funnels behave very differently, which is exactly why the open datasets (The Pile, RedPajama, FineWeb) are valued: their funnel decisions are reproducible. Most of the data-engineering work is the funnel, not the sourcing.
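The funnel stages can be sketched end to end with the standard library alone. This is a toy illustration, not a production recipe: the ASCII-ratio check is a crude stand-in for a real language-ID model, the word-count and repetition thresholds are arbitrary, and exact-hash dedup is the simplest possible variant (real pipelines typically add near-duplicate detection). All function names are invented for this example.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def looks_english(text, min_ascii_ratio=0.9):
    """Crude language filter (assumption): share of ASCII letters among all letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    return sum(c.isascii() for c in letters) / len(letters) >= min_ascii_ratio


def passes_quality(text, min_words=5, min_unique_ratio=0.5):
    """Toy heuristics: long enough, and not dominated by repeated words."""
    words = text.split()
    if len(words) < min_words:
        return False
    return len(set(words)) / len(words) >= min_unique_ratio


def funnel(html_docs):
    """Run extraction, language filter, quality filter, and exact dedup in order."""
    seen = set()
    counts = {"raw": 0, "language": 0, "quality": 0, "dedup": 0}
    kept = []
    for html in html_docs:
        counts["raw"] += 1
        text = html_to_text(html)
        if not looks_english(text):
            continue
        counts["language"] += 1
        if not passes_quality(text):
            continue
        counts["quality"] += 1
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        counts["dedup"] += 1
        kept.append(text)
    return kept, counts


PAGE = "<html><body><p>Common Crawl pages need extraction before any filtering can happen.</p></body></html>"
DOCS = [
    PAGE,
    PAGE,                                                   # exact duplicate
    "<p>buy buy buy buy buy buy buy buy</p>",               # repetitive spam
    "<p>\u042d\u0442\u043e \u0442\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</p>",  # non-English
    "<script>var x = 1;</script><p>ok</p>",                 # too short after extraction
]
kept, counts = funnel(DOCS)
print(counts)  # documents surviving each stage
```

Changing any one threshold changes which documents survive, which is a small-scale version of why two funnels over the same raw crawl yield different corpora.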

Flashcards

Nine cards.

Q. The six pretraining source categories?

Web crawls (Common Crawl, the bulk by bytes), wikis (Wikipedia and related), books and long-form, code (public repositories), math/academic text (arXiv-class), and social/forum text (Reddit, Stack Exchange).

Q. Three open pretraining datasets?

The Pile (EleutherAI; ~800GB; earliest big open mix); RedPajama (~1.2T tokens; LLaMA-style reproduction); FineWeb (~15T tokens web from CC; current reference); plus RefinedWeb (earlier clean open web).

Q. The raw-to-final data funnel?

Raw CC -> HTML-to-text + language filter (~5-10x shrink) -> deeper quality filter (~5-10x) -> dedup (~2-10x) -> mix with non-web sources. The stages compound to roughly 50-1000x, so a 1T-token web corpus starts from tens to hundreds of trillions of raw tokens.

Q. Why aren't sampling weights just byte fractions?

Different sources have very different information density per token. Code, wikis, and books are up-weighted beyond their byte share because each token is more valuable; web stays substantial, and in general-purpose mixes is usually still the largest slice, but its sampled share is below its byte share.

Q. What does including code in the mix do?

Lifts coding evaluations (expected) and lifts non-code reasoning (more surprising; well-documented effect on math/QA). Code is up-weighted past its raw byte share for exactly this reason.

Q. A common 'pass-count' choice across slices?

Multiple epochs on small, high-quality slices (wikis, curated books); one epoch or less (a subsample) on the large web slice. This gives density-aware exposure without inflating corpus size.

Q. What does pure-web pretraining give up?

Density (web has lower information density than wikis/books, so it takes more tokens to reach the same loss) and specific capability lifts (no code means worse coding and reasoning). It also saves no work: the raw crawl still needs filtering and dedup. “Huge and free” is a starting point, not a corpus.

Q. Why are the open datasets valued?

Their funnel decisions (filtering, dedup, mixing) are reproducible and public. Models trained on the same raw source through different funnels behave very differently, so a documented funnel matters as much as the source.

Q. How does this lesson stay technical?

Describes data-engineering mechanics (sources, scale, funnel, mixing weights) without taking a legal/policy position on training-data debates. Technical-not-legal discipline: teach the mechanics clearly; route policy framing elsewhere.