Cheatsheet: Data filtering, deduplication, mixing, synthetic
Filtering (two layers)
| Layer | What it does | Catches |
|---|---|---|
| Heuristic | Cheap rules: length, letter ratio, repetition, stop-words, language ID, URL/domain | Obvious junk; placeholder pages; scraping artifacts |
| Classifier | A small model scoring “high-quality text?”, trained on curated +/- sets | Subtler quality issues heuristics miss |
Combined: ~5-10x shrink.
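A minimal sketch of the heuristic layer, assuming plain-text English documents. Every threshold and the tiny stop-word set are hypothetical values chosen for illustration, not numbers from the lecture; language ID and URL/domain rules are omitted, and the classifier layer would run on whatever survives this pass.

```python
import re

# Hypothetical thresholds for illustration only; real pipelines tune these.
MIN_WORDS = 50
MIN_LETTER_RATIO = 0.6
MAX_REPEATED_LINE_FRACTION = 0.3
MIN_STOPWORD_HITS = 2
STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "that"}


def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def passes_heuristics(text: str) -> bool:
    """Apply the cheap rules in order; reject on the first failure."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if len(words) < MIN_WORDS:                      # length rule
        return False
    if letter_ratio(text) < MIN_LETTER_RATIO:       # letter-ratio rule
        return False
    if repeated_line_fraction(text) > MAX_REPEATED_LINE_FRACTION:  # repetition rule
        return False
    if len(STOPWORDS.intersection(words)) < MIN_STOPWORD_HITS:     # stop-word rule
        return False
    return True
```

A page that repeats "click here" on every line fails the repetition rule; a page of digits fails the length rule because it contains no words.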
Deduplication (three levels, used together)
| Level | Method | Catches |
|---|---|---|
| Exact | Document hashes | Identical mirrors |
| Near-duplicate | MinHash + LSH (modern standard) | Re-published articles with edits; templates; shuffled content |
| Substring / n-gram | Long-span match removal | Partial duplication across many docs |
Combined: ~2-10x shrink on top of filtering. A smaller corpus of unique, clean text beats a larger one padded with duplicates.
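The near-duplicate level can be sketched in pure Python as follows. This is an illustrative toy, not a production implementation: the signature length, band count, and 3-word shingles are assumed values, and seeded BLAKE2b stands in for the faster hash families real pipelines use.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64     # signature length (assumed value)
BANDS = 16          # LSH bands; 64 / 16 = 4 rows per band (assumed value)
SHINGLE_WORDS = 3   # word n-gram size used as the shingle (assumed value)


def shingles(text: str) -> set[str]:
    """Set of overlapping word n-grams for one document."""
    words = text.lower().split()
    if len(words) < SHINGLE_WORDS:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE_WORDS])
            for i in range(len(words) - SHINGLE_WORDS + 1)}


def _hash(seed: int, shingle: str) -> int:
    """One member of a seeded hash family, built from BLAKE2b's salt."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> tuple[int, ...]:
    """For each seeded hash, keep the minimum value over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """LSH step: documents sharing any full band land in the same bucket."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Candidate pairs would then be confirmed with `estimated_jaccard` (or exact Jaccard) against a chosen threshold before one copy is dropped. More bands with fewer rows catch looser matches at the cost of more false candidates.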
Mixing
- Source weight != source byte fraction.
- Code: up-weighted (lifts code + reasoning).
- Wiki/books: up-weighted (high density).
- Web: bulk of the bytes, but rarely dominates the sampled stream.
- Pass count asymmetry: once over the large web slice (or less), multi-epoch on small high-quality slices.
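The gap between sampling weight and natural share, and the pass-count asymmetry it produces, is simple arithmetic. The sketch below uses hypothetical source names and takes sizes and budget in any consistent token unit; none of the example numbers come from the lecture.

```python
def natural_fractions(source_tokens: dict[str, float]) -> dict[str, float]:
    """Share of the corpus each source holds by raw size."""
    total = sum(source_tokens.values())
    return {name: size / total for name, size in source_tokens.items()}


def epochs_per_source(source_tokens: dict[str, float],
                      sampling_weights: dict[str, float],
                      training_tokens: float) -> dict[str, float]:
    """Passes made over each source when sampling by weight for a fixed budget."""
    total_weight = sum(sampling_weights.values())
    return {
        name: training_tokens * (sampling_weights[name] / total_weight) / size
        for name, size in source_tokens.items()
    }
```

With made-up sizes of 900 (web), 80 (code), and 20 (wiki/books) billion tokens, weights of 0.45 / 0.30 / 0.25, and a 1000-billion-token budget, web holds 90% of the corpus but gets 45% of the stream and is seen for half an epoch, while code is seen 3.75 times and wiki/books 12.5 times.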
Learn the mix (DoReMi-class): train tiny proxy models on candidate mixes, fit loss vs weights, propose optimal mix. Same idea as scaling laws: fit at small scale, then carry the result to the large run.
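A toy, one-dimensional version of the fit-then-propose idea: three proxy runs vary a single weight (say, the code fraction), a parabola is fitted through their losses, and its minimum is proposed. This is not the procedure from the DoReMi paper; it is a hedged illustration only, and the function name and any losses fed to it are hypothetical.

```python
def propose_weight(runs: list[tuple[float, float]]) -> float:
    """Given three (weight, proxy_loss) pairs, return the weight that
    minimises the parabola through them, clamped to the explored range."""
    (x1, y1), (x2, y2), (x3, y3) = runs
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    if denom == 0:
        raise ValueError("the three candidate weights must be distinct")
    # Coefficients of loss ~ a*w^2 + b*w + c through the three points.
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    if a <= 0:
        # No interior minimum: fall back to the best observed candidate.
        return min(runs, key=lambda r: r[1])[0]
    lo, hi = min(x1, x2, x3), max(x1, x2, x3)
    # Do not extrapolate beyond the weights that were actually tried.
    return min(max(-b / (2 * a), lo), hi)
```

With more than one free weight, the same pattern needs a regression over many proxy runs and a search over the weight simplex; the three-point fit only shows the shape of the idea.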
Synthetic data (the newer category)
| Use | What it is |
|---|---|
| Teacher-student distillation | Strong teacher generates target outputs; student trains on the pairs |
| Textbook-style (Phi-class) | Deliberately clean, structured pretraining text on chosen subjects |
| Instruction / dialogue pairs | Generated (prompt, response) for SFT (next lesson) |
| Self-improvement loops | Filter model’s best outputs, re-train (reasoning RL) |
Caveat (applies to all): synthetic data carries the teacher’s blind spots and characteristic phrasing into the student. Filter + dedup synthetic; mix with non-synthetic.
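The caveat's three steps (filter, dedup, mix with non-synthetic) can be sketched as below. `teacher` and `keep` are hypothetical callables standing in for a real teacher model and a real quality filter, and the cap on the synthetic share is an assumed knob, not a recommendation from the source.

```python
import hashlib
import random


def build_distillation_pairs(prompts, teacher, keep):
    """Generate (prompt, response) pairs, filter them, and drop responses
    that are identical after case and whitespace normalisation."""
    seen = set()
    pairs = []
    for prompt in prompts:
        response = teacher(prompt)           # hypothetical teacher call
        if not keep(prompt, response):       # hypothetical quality filter
            continue
        normalised = " ".join(response.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key in seen:                      # exact dedup on the response
            continue
        seen.add(key)
        pairs.append((prompt, response))
    return pairs


def mix_with_real(synthetic, real, max_synthetic_fraction, seed=0):
    """Combine real and synthetic examples, capping the synthetic share."""
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # synthetic / (synthetic + real) <= f  implies  synthetic <= f * real / (1 - f)
    cap = int(max_synthetic_fraction * len(real) / (1 - max_synthetic_fraction))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(cap, len(synthetic)))
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed
```

Exact hashing only removes verbatim repeats; a teacher's characteristic phrasing usually shows up as near-duplicates, so a MinHash-style pass would normally follow.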
Diagnose: underperforming corpus
If a model lags a baseline at similar token count, check:
- Dedup level (exact + near-duplicate + substring?): single biggest silent failure.
- Mixing weights (code/math/high-density up-weighted relative to raw byte share?).
- Filtering depth (classifier-quality layer on top of heuristics?).
Fix before retraining at larger scale; bigger scale on the same flawed corpus compounds the problem.
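For the first check, a rough audit on a sample of the corpus might look like the sketch below. Both metrics and the default n-gram length are assumptions for illustration; they flag that duplication exists but do not replace a full dedup pass.

```python
import hashlib
from collections import Counter


def exact_duplicate_rate(docs: list[str]) -> float:
    """Fraction of documents that are byte-identical repeats of another."""
    if not docs:
        return 0.0
    hashes = {hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs}
    return 1.0 - len(hashes) / len(docs)


def shared_ngram_rate(docs: list[str], n: int = 8) -> float:
    """Fraction of distinct word n-grams that occur in more than one document,
    a crude signal for substring-level duplication."""
    doc_count = Counter()
    for d in docs:
        words = d.split()
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_count.update(grams)   # each document counts an n-gram once
    if not doc_count:
        return 0.0
    return sum(1 for c in doc_count.values() if c > 1) / len(doc_count)
```

A high exact-duplicate rate means the exact level was skipped; a low exact rate with a high shared n-gram rate points at missing near-duplicate or substring dedup.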
The “data is the moat” framing
Sources (CC, Wikipedia, GitHub, arXiv) are largely shared. Moat = pipeline: filter recipes + dedup methods + learned mixing + synthetic-data strategy. The competitive surface in pretraining data is engineering, not collection.
Technical-not-legal scope
Filtering, dedup, mixing, and synthetic data are data-engineering decisions here. Legal and policy debates about training data are out of scope.
Words to use precisely
- Heuristic / classifier filter: rule-based vs learned quality scoring.
- MinHash + LSH: the standard near-duplicate dedup method.
- DoReMi-class: small-scale fit of source mixing weights.
- Distillation: teacher generates outputs the student trains on.
- Textbook-style synthetic: deliberately clean generated pretraining text.
Source
- Stanford CS336, Lecture 14 (Data, filtering and deduplication, mixing, synthetic), by Hashimoto and Liang.
cs336.stanford.edu. Independent structural mirror in original prose; see references.