Data filtering and deduplication: brief

What you’ll learn

This lesson opens the later stages of the data funnel from lesson 11, where most of the data-engineering work actually lives. The source curriculum is Stanford CS336, Lecture 14, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will describe the two layers of filtering (heuristic rules and a quality-classifier model); distinguish the three levels of deduplication (exact, near-duplicate via MinHash + LSH, substring/n-gram); explain how mixing weights are tuned, including DoReMi-class methods that learn the mix at small scale; categorize the main uses of synthetic data (distillation, textbook-style, instruction/dialogue, self-improvement) and the shared caveat that synthetic data carries the teacher’s blind spots; and diagnose an underperforming pretraining corpus using these tools.
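To make the difference between exact and near-duplicate detection concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the lecture's implementation: the helper names (shingles, exact_key, minhash_signature, estimated_jaccard, lsh_band_keys), the 3-word shingle size, the 128 hash functions, and the 32 bands are all hypothetical choices made for this example.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams from lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def exact_key(text):
    """Exact-duplicate key: a hash of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def _seeded_hash(seed, shingle):
    """One 64-bit hash per (seed, shingle); each seed acts as a separate hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """MinHash signature: the minimum hash value over the set, per hash function."""
    return [min(_seeded_hash(seed, s) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature, bands=32):
    """Split the signature into bands; documents sharing any band key become candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


doc_a = ("the quick brown fox jumps over the lazy dog while the farmer "
         "watches from the old wooden porch near the barn")
doc_b = doc_a.replace("barn", "shed")  # near-duplicate: one word changed
doc_c = "gradient descent updates parameters using the loss surface slope estimate"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("exact match a/b:", exact_key(doc_a) == exact_key(doc_b))
print("estimated Jaccard a/b:", estimated_jaccard(sig_a, sig_b))
print("estimated Jaccard a/c:", estimated_jaccard(sig_a, sig_c))
print("LSH candidates a/b:", bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_b)))
```

Exact hashing misses the one-word edit, while the MinHash estimate stays high for that pair and the banding step flags it as a candidate without comparing every pair of documents. Substring/n-gram deduplication, the third level, works inside documents and is not covered by this sketch.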

§6 framing note: filtering, dedup, mixing, and synthetic data are taught here as data-engineering decisions. Legal and policy debates about training data are out of scope, following the same technical-not-legal discipline as elsewhere in the fleet.

Where this fits

This is lesson 12 of 14, the fourth lesson of Phase 3 (scale, data, and alignment). It is the direct continuation of lesson 11 (which left off at sources); together they describe the full data side of pretraining. The next lesson turns from pretraining data to post-training data, where the corpus is much smaller and much more curated.

Before you start

Prerequisites: lesson 11 (sources, funnel introduction, mixing intuition). The diagnostics in the practice section reuse the scaling-laws and evaluation discipline from lessons 9 and 10.

About the math

None. The lesson quotes approximate shrink factors (5-10x from filtering; 2-10x from dedup) and treats the DoReMi-class fit conceptually; there are no formulas to derive.
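The only arithmetic worth keeping in mind is how the two shrink factors compound. The sketch below assumes the factors simply multiply, which is a simplification, and uses a hypothetical raw corpus size that is not a figure from the lesson; the function name surviving_range is also invented for this example.

```python
def surviving_range(raw_tokens, filter_shrink=(5, 10), dedup_shrink=(2, 10)):
    """Return (low, high) surviving token counts.

    Assumes the filtering and dedup shrink factors multiply independently.
    Each argument is a (min_factor, max_factor) pair.
    """
    low = raw_tokens / (filter_shrink[1] * dedup_shrink[1])   # most aggressive case
    high = raw_tokens / (filter_shrink[0] * dedup_shrink[0])  # least aggressive case
    return low, high


raw = 100 * 10**12  # hypothetical raw size in tokens, for illustration only
low, high = surviving_range(raw)
print(f"{low:.2e} to {high:.2e} tokens survive")
```

Under these assumptions the combined reduction is between 10x and 100x, which is why the later funnel stages dominate the final corpus size.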

By the end, you’ll be able to

The single capability this lesson builds: explain the data-processing steps (filtering, deduplication, mixing, synthetic data) that turn raw data into a training corpus. Concretely, you will be able to:

Describe the two layers of filtering (heuristic, classifier)
Distinguish the three levels of deduplication and why each matters
Explain how mixing weights are tuned, including DoReMi-class methods
Categorize the main uses of synthetic data and their shared caveat
Diagnose an underperforming pretraining corpus using this lesson’s tools

Time and difficulty

Read time: about 13 minutes
Practice time: about 10 minutes (diagnose a pipeline + reason about synthetic-data caution, plus flashcards)
Difficulty: deep (Stage C; data-engineering, no math, technical-not-legal scope)