Summary: Data filtering, deduplication, mixing, synthetic

Lesson 11 ended at raw sources; this lesson opens the funnel. Filtering has two layers: heuristic (cheap rules on length, ratios, repetition, language) and classifier (a small model that scores “is this high-quality text?”, trained on a curated positive/negative set). Together they shrink the raw corpus by roughly 5-10x. Deduplication runs at three levels: exact (hash), near-duplicate (MinHash + LSH, the standard), and substring/n-gram. Together these remove roughly another 2-10x; a smaller corpus of unique, clean text beats a larger one padded with duplicates. Mixing turns sources into a sampling stream: sampling weights are not byte fractions, and small high-quality slices are passed multiple times. Modern recipes increasingly learn the mix from small-scale fits (DoReMi-class). Synthetic data is a fast-growing source: teacher-student distillation, textbook-style synthetic text, instruction/dialogue pairs, and self-improvement loops, with the caveat that synthetic data carries the teacher’s blind spots. Less data, cleaner and well-mixed, often beats more data poorly handled. This is the scan version, and it stays technical, not legal, throughout.
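
To make the heuristic layer concrete, here is a minimal sketch of rule-based filtering on length, a character ratio, and line repetition. The function name `passes_heuristics` and every threshold are illustrative assumptions, not values from any published recipe; language identification is left out because it normally needs a separate model.

```python
from collections import Counter

def passes_heuristics(text, min_words=50, max_words=100_000,
                      min_alpha_ratio=0.7, max_dup_line_ratio=0.3):
    """Return True if a document survives three cheap rules.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # Length rule: drop documents that are too short or too long.
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False

    # Ratio rule: share of alphabetic characters among non-whitespace ones.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    if alpha_ratio < min_alpha_ratio:
        return False

    # Repetition rule: share of non-empty lines that repeat an earlier line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    counts = Counter(lines)
    dup_lines = sum(n - 1 for n in counts.values())
    if dup_lines / len(lines) > max_dup_line_ratio:
        return False

    return True


if __name__ == "__main__":
    prose = " ".join(["The quick brown fox jumps over the lazy dog."] * 10)
    menu = "\n".join(["Click here to subscribe to our newsletter today"] * 20)
    print(passes_heuristics(prose))  # survives all three rules
    print(passes_heuristics(menu))   # rejected by the repetition rule
```

Rules like these are cheap enough to run over the whole raw corpus, which is why they sit in front of the more expensive classifier layer.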

  • Filtering (two layers): heuristic (cheap rules) + classifier (a small “quality” model). ~5-10x shrink.
  • Deduplication (three levels): exact (hash), near-duplicate (MinHash + LSH; the standard), substring/n-gram. A further ~2-10x shrink; at the same final token count, the content is cleaner and the model is often better.
  • Mixing. Sampling weights, not byte fractions. Small high-quality slices passed multiple times, large web slice once or less. Learn the mix (DoReMi-class) at small scale; extrapolate, like scaling laws.
  • Synthetic data. Teacher-student distillation, textbook-style (Phi-class), instruction/dialogue pairs (for SFT), self-improvement loops (for RL). Carries the teacher’s blind spots; same filtering/dedup ideas apply to synthetic.
  • Data engineering is the modern moat. Sources are largely shared; the pipeline (filter recipes, dedup, learned mixing, synthetic strategy) is the competitive surface.
  • Technical-not-legal. Filtering, dedup, mixing, synthetic are data-engineering decisions here; legal/policy debates about training data are out of scope.
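
The near-duplicate level named in the list above (MinHash + LSH) can be sketched in plain Python. This is a toy version under stated assumptions: word 3-gram shingles, 128 seeded hashes built from `hashlib.blake2b`, and 32 LSH bands. The helper names and parameter values are hypothetical, and a production pipeline would use an optimized library rather than this loop.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams; the unit over which similarity is measured."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(seed, item):
    """64-bit hash of an item under a given seed (stands in for a permutation)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=128):
    """For each seeded hash, keep the minimum value over the set."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_band_keys(signature, bands=32):
    """Split the signature into bands; documents sharing any band key
    become candidate near-duplicate pairs."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


if __name__ == "__main__":
    doc_a = "the cat sat on the mat and looked at the dog in the yard"
    doc_b = "the cat sat on the mat and looked at the dog in the garden"
    doc_c = "completely unrelated sentence about training language models on filtered data"

    sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
    print("a vs b:", estimated_jaccard(sig_a, sig_b))  # high: one word differs
    print("a vs c:", estimated_jaccard(sig_a, sig_c))  # near zero: no shared shingles
    print("a, b share a band:", bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_b)))
```

The banding step is what makes this scale: instead of comparing every pair of documents, only documents that collide in at least one band are compared, and the band/row split sets the similarity level at which collisions become likely.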

This lesson connects the prior data-sourcing picture to what actually separates a strong model from an average one at equal compute: the pipeline. The same compute spent on a better corpus produces a noticeably better model, and a given D of clean, unique content beats a larger D padded with duplicates. The trend toward learning the mix with small proxy runs (DoReMi-class) is the same evidence-first discipline that scaling laws and the evaluation lesson encouraged: do not hand-tune; fit small, then extrapolate. Synthetic data is the newer lever and worth tracking, with the standard caveats. The next two lessons turn from pretraining data to post-training, where a separate kind of data (much smaller, much higher quality, much more curated) transforms a pretrained model into the assistant users actually talk to.
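
The point that sampling weights are not byte fractions comes down to simple arithmetic: the number of passes over a source is its share of the training stream divided by its size. The sketch below uses made-up source sizes, weights, and token budget purely for illustration; `effective_epochs` is a hypothetical helper, not part of any library.

```python
def effective_epochs(source_tokens, sampling_weights, train_tokens):
    """Passes over each source = (weight share * training budget) / source size."""
    total_weight = sum(sampling_weights.values())
    return {
        name: (sampling_weights[name] / total_weight) * train_tokens / size
        for name, size in source_tokens.items()
    }


if __name__ == "__main__":
    # Illustrative numbers only, in billions of tokens.
    source_tokens = {"web": 900, "code": 80, "curated": 20}
    sampling_weights = {"web": 0.60, "code": 0.25, "curated": 0.15}
    train_tokens = 1000

    total = sum(source_tokens.values())
    epochs = effective_epochs(source_tokens, sampling_weights, train_tokens)
    for name in source_tokens:
        size_share = source_tokens[name] / total
        print(f"{name:8s} size share {size_share:.2f}  "
              f"weight {sampling_weights[name]:.2f}  epochs {epochs[name]:.2f}")
```

With these assumed numbers the large web slice is seen less than once while the small curated slice is repeated several times, which is the pattern the mixing summary describes. A learned mix would replace the hand-set `sampling_weights` with values fitted on small proxy runs.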

Pretraining data is not “Common Crawl”; it is what comes out of a deliberate funnel of heuristic filtering, classifier filtering, multi-level deduplication, mixing, and increasingly synthetic generation. The funnel is most of the data work, and increasingly the difference at equal compute.