Data filtering and dedup: cheatsheet

Filtering (two layers)

Layer	What it does	Catches
Heuristic	Cheap rules: length, letter ratio, repetition, stop-words, language ID, URL/domain	Obvious junk; placeholder pages; scraping artifacts
Classifier	A small model scoring “high-quality text?”, trained on curated +/- sets	Subtler quality issues heuristics miss

Combined: the two layers shrink the raw corpus by roughly 5-10x.
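The heuristic layer can be sketched as a handful of cheap per-document rules. This is a minimal illustration, not a production recipe: the thresholds, the tiny stop-word list, and the function name heuristic_reasons are all assumptions made up for the example, and language ID and URL/domain rules are left out.

```python
import re

# Hypothetical thresholds; tune them on a labelled sample of your own corpus.
MIN_WORDS = 20
MIN_LETTER_RATIO = 0.6
MAX_REPEATED_LINE_FRACTION = 0.3
MIN_STOP_WORDS = 2
# Deliberately tiny English stop-word list, for illustration only.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}


def heuristic_reasons(text: str) -> list[str]:
    """Return the names of the rules a document fails (empty list = keep)."""
    reasons = []

    # Length rule: very short pages are usually placeholders or stubs.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if len(words) < MIN_WORDS:
        reasons.append("too_short")

    # Letter-ratio rule: mostly symbols or digits suggests scraping artifacts.
    non_space = [c for c in text if not c.isspace()]
    letters = sum(c.isalpha() for c in non_space)
    if not non_space or letters / len(non_space) < MIN_LETTER_RATIO:
        reasons.append("low_letter_ratio")

    # Repetition rule: a large share of repeated lines suggests boilerplate.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        repeated = len(lines) - len(set(lines))
        if repeated / len(lines) > MAX_REPEATED_LINE_FRACTION:
            reasons.append("repetitive")

    # Stop-word rule: natural prose almost always contains common function words.
    if sum(w in STOP_WORDS for w in words) < MIN_STOP_WORDS:
        reasons.append("few_stop_words")

    return reasons
```

Returning the list of failed rules, rather than a bare keep/drop flag, makes it easier to audit which rule removes how much of the corpus before a classifier layer is added on top.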

Deduplication (three levels, used together)

Level	Method	Catches
Exact	Document hashes	Identical mirrors
Near-duplicate	MinHash + LSH (modern standard)	Re-published articles with edits; templates; shuffled content
Substring / n-gram	Long-span match removal	Partial duplication across many docs

Combined: ~2-10x further shrink on top of filtering. A smaller corpus of unique, clean text beats a larger one padded with duplicates.
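The first two levels can be sketched together: exact hashes drop identical copies, then MinHash signatures with LSH banding find near-duplicate candidates without comparing every pair. Everything below is a toy under stated assumptions: word 3-gram shingles, 64 hash functions split into 16 bands, a 0.8 similarity threshold, and the function names are all choices made for this example. The substring / n-gram level is not covered.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64   # signature length (assumed)
BANDS = 16        # 16 bands of 4 rows each (assumed)
ROWS = NUM_HASHES // BANDS
SHINGLE = 3       # word 3-grams (assumed)
THRESHOLD = 0.8   # estimated Jaccard at or above which a doc is a near-duplicate


def seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def shingles(text: str) -> set[str]:
    words = text.lower().split()
    if len(words) <= SHINGLE:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE]) for i in range(len(words) - SHINGLE + 1)}


def minhash(text: str) -> list[int]:
    """Signature = minimum hash of the shingle set under each hash function."""
    sh = shingles(text)
    return [min(seeded_hash(s, seed) for s in sh) for seed in range(NUM_HASHES)]


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(docs: list[str]) -> list[int]:
    """Return indices of documents kept; the first copy of each cluster wins."""
    seen_exact = set()
    buckets = defaultdict(list)   # (band index, band values) -> kept doc indices
    signatures = {}
    kept = []
    for i, text in enumerate(docs):
        # Level 1: exact duplicates via a document hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue
        seen_exact.add(digest)

        # Level 2: LSH banding proposes candidates; the signature confirms them.
        sig = minhash(text)
        keys = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        candidates = {j for key in keys for j in buckets.get(key, [])}
        if any(estimated_jaccard(sig, signatures[j]) >= THRESHOLD for j in candidates):
            continue

        for key in keys:
            buckets[key].append(i)
        signatures[i] = sig
        kept.append(i)
    return kept
```

Calling dedup on a list containing a document, an identical copy, a lightly edited copy, and an unrelated document should keep only the first and the last. More bands with fewer rows each makes the candidate stage more permissive; fewer bands with more rows makes it stricter.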

Mixing

Source weight != source byte-fraction
Code:      up-weighted (lifts code + reasoning)
Wiki/books: up-weighted (high density)
Web:        bulk, but rarely dominates the sampled stream

Pass count asymmetry: once over the large web slice (or less), multi-epoch on small high-quality slices.
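The gap between sampling weight and byte fraction, and the resulting pass-count asymmetry, is simple arithmetic. The sketch below uses made-up source sizes, weights, and token budget purely to show the calculation; none of the numbers come from a real training run.

```python
def natural_fractions(sizes: dict[str, float]) -> dict[str, float]:
    """Share of each source if you sampled in proportion to raw size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


def epochs_per_source(sizes: dict[str, float],
                      weights: dict[str, float],
                      budget: float) -> dict[str, float]:
    """Passes over each source = tokens drawn from it / tokens it contains."""
    total_w = sum(weights.values())
    return {name: (weights[name] / total_w) * budget / sizes[name] for name in sizes}


# Hypothetical corpus, in billions of tokens (illustrative numbers only).
sizes = {"web": 900.0, "code": 60.0, "wiki_books": 40.0}
# Hypothetical sampling weights: code and wiki/books up-weighted vs raw size.
weights = {"web": 0.5, "code": 0.3, "wiki_books": 0.2}
budget = 600.0  # total training tokens, in billions (assumed)

fractions = natural_fractions(sizes)
epochs = epochs_per_source(sizes, weights, budget)
for name in sizes:
    print(f"{name:11s} raw share {fractions[name]:.2f}  "
          f"weight {weights[name]:.2f}  epochs {epochs[name]:.2f}")
```

With these toy numbers the web slice is seen for about a third of one pass while the two small slices are each repeated about three times, which is the asymmetry the line above describes.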

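Learn the mix (DoReMi-class): train tiny proxy models on candidate mixes, fit loss as a function of the source weights, and propose the mix that minimizes the fitted loss. Same idea as scaling laws: measure at small scale, extrapolate to the full run.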
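The fit-then-pick loop can be illustrated with a deliberately tiny case: two sources, one free weight, and three proxy runs. This is not the DoReMi procedure itself, only a toy of the idea in the line above. The function proxy_loss is a hypothetical stand-in for "train a small proxy model on this mix and report its validation loss"; its formula is invented so the example is checkable, and a real setup would fit many weights over many more runs.

```python
def proxy_loss(code_weight: float) -> float:
    """Hypothetical stand-in for training a proxy model on the given mix.

    The formula is made up (minimum at code_weight = 0.3) so the fit can be
    verified; real losses would come from actual small-scale training runs.
    """
    return 2.0 + (code_weight - 0.3) ** 2


def fit_quadratic(points: list[tuple[float, float]]) -> tuple[float, float, float]:
    """Exact parabola a*w^2 + b*w + c through three (weight, loss) points."""
    (x0, y0), (x1, y1), (x2, y2) = points
    slope01 = (y1 - y0) / (x1 - x0)
    slope02 = (y2 - y0) / (x2 - x0)
    a = (slope02 - slope01) / (x2 - x1)   # second divided difference
    b = slope01 - a * (x0 + x1)
    c = y0 - a * x0 * x0 - b * x0
    return a, b, c


def propose_weight(points: list[tuple[float, float]]) -> float:
    """Minimizer of the fitted curve, clamped to a valid weight in [0, 1]."""
    a, b, _ = fit_quadratic(points)
    if a <= 0:
        # No interior minimum: fall back to the best weight actually measured.
        return min(points, key=lambda p: p[1])[0]
    return min(1.0, max(0.0, -b / (2 * a)))


# Three candidate mixes (code weight; the rest goes to web), one proxy run each.
candidates = [0.1, 0.4, 0.7]
measurements = [(w, proxy_loss(w)) for w in candidates]
best = propose_weight(measurements)
print(f"proposed code weight: {best:.2f}")
```

The proposed weight was never one of the candidates that was trained; it comes from the fitted curve, which is what makes the approach cheaper than a grid search at full scale.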

Synthetic data (the newer category)

Use	What it is
Teacher-student distillation	Strong teacher generates target outputs; student trains on the pairs
Textbook-style (Phi-class)	Deliberately clean, structured pretraining text on chosen subjects
Instruction / dialogue pairs	Generated (prompt, response) for SFT (next lesson)
Self-improvement loops	Keep only the model’s best outputs, retrain on them (as in reasoning RL)

Caveat (applies to all): synthetic data carries the teacher’s blind spots and characteristic phrasing into the student. Filter + dedup synthetic; mix with non-synthetic.

Diagnose: underperforming corpus

If a model lags a baseline at similar token count, check:

Dedup level (exact + near-duplicate + substring?): single biggest silent failure.
Mixing weights (code/math/high-density sources up-weighted relative to their raw byte share?).
Filtering depth (classifier-quality layer on top of heuristics?).

Fix before retraining at larger scale; bigger scale on the same flawed corpus compounds the problem.

The “data is the moat” framing

Sources (CC, Wikipedia, GitHub, arXiv) are largely shared. Moat = pipeline: filter recipes + dedup methods + learned mixing + synthetic-data strategy. The competitive surface in pretraining data is engineering, not collection.

Technical-not-legal scope

Filtering, dedup, mixing, and synthetic data are data-engineering decisions here. Legal and policy debates about training data are out of scope.

Words to use precisely

Heuristic / classifier filter: rule-based vs learned quality scoring.
MinHash + LSH: the standard near-duplicate dedup method.
DoReMi-class: small-scale fit of source mixing weights.
Distillation: teacher generates outputs the student trains on.
Textbook-style synthetic: deliberately clean generated pretraining text.

Source

Stanford CS336, Lecture 14 (Data, filtering and deduplication, mixing, synthetic), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.