Cheatsheet: Data filtering, deduplication, mixing, synthetic

| Layer | What it does | Catches |
| --- | --- | --- |
| Heuristic | Cheap rules: length, letter ratio, repetition, stop-words, language ID, URL/domain | Obvious junk; placeholder pages; scraping artifacts |
| Classifier | A small model scoring “high-quality text?”, trained on curated +/- sets | Subtler quality issues heuristics miss |

Combined: the two layers shrink the raw corpus by roughly 5-10x.
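A minimal sketch of the heuristic layer, assuming English text. Every threshold, the stop-word list, and the function name `passes_heuristics` are illustrative assumptions, not values from any particular pipeline. Language ID and URL/domain rules are left out because they need external resources.

```python
import re
from collections import Counter

# Illustrative thresholds only (assumptions); real pipelines tune these per source.
MIN_WORDS = 20
MAX_WORDS = 100_000
MIN_LETTER_RATIO = 0.6
MAX_TOP_LINE_FRACTION = 0.3
MIN_STOP_WORDS = 2
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def passes_heuristics(text: str) -> bool:
    """Return True if the document survives all cheap rule-based checks."""
    words = re.findall(r"\S+", text)

    # Length rule: drop near-empty pages and absurdly long dumps.
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False

    # Letter-ratio rule: drop pages that are mostly digits or symbols.
    letters = sum(ch.isalpha() for ch in text)
    if letters / max(len(text), 1) < MIN_LETTER_RATIO:
        return False

    # Repetition rule: drop pages where one line makes up too much of the text.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 4:
        top_count = Counter(lines).most_common(1)[0][1]
        if top_count / len(lines) > MAX_TOP_LINE_FRACTION:
            return False

    # Stop-word rule: natural prose contains common function words.
    vocabulary = {w.lower().strip(".,;:!?") for w in words}
    if len(vocabulary & STOP_WORDS) < MIN_STOP_WORDS:
        return False

    return True
```

The rules run cheapest-first and exit early, which matters when the filter runs over billions of documents. A classifier layer would then score only the survivors.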

Deduplication (three levels, used together)

| Level | Method | Catches |
| --- | --- | --- |
| Exact | Document hashes | Identical mirrors |
| Near-duplicate | MinHash + LSH (modern standard) | Re-published articles with edits; templates; shuffled content |
| Substring / n-gram | Long-span match removal | Partial duplication across many docs |

Combined: ~2-10x shrink on top of filtering. A smaller corpus of unique, clean text beats a larger one padded with duplicates.
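The first two levels can be sketched with the standard library alone. This is an illustration under assumptions, not a production implementation: the signature length, band count, shingle size, and all function names are hypothetical choices, and a real pipeline would typically use an optimized MinHash library.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # MinHash signature length (assumed)
BANDS = 16       # LSH bands; rows per band = NUM_HASHES // BANDS
SHINGLE = 3      # word n-gram size used as the set element


def exact_key(text):
    """Exact level: identical documents map to the same hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text):
    words = text.lower().split()
    count = max(len(words) - SHINGLE + 1, 1)
    return {" ".join(words[i:i + SHINGLE]) for i in range(count)}


def _hash(shingle, seed):
    # One salted 64-bit hash per seed stands in for a random permutation.
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def minhash(text):
    """Near-duplicate level: keep the minimum hash under each seed."""
    items = shingles(text)
    return [min(_hash(s, seed) for s in items) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES


def candidate_pairs(docs):
    """LSH banding: documents sharing any full band become candidates."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Banding is what keeps this tractable: only documents that collide in at least one band are compared, instead of all pairs. More bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates.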

Source weight != source byte-fraction
Code: up-weighted (lifts code + reasoning)
Wiki/books: up-weighted (high density)
Web: bulk, but rarely dominates the sampled stream

Pass count asymmetry: once over the large web slice (or less), multi-epoch on small high-quality slices.
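The gap between sampling weight and byte share, and the pass-count asymmetry it produces, can be made concrete with a little arithmetic. The corpus sizes, weights, and token budget below are made-up numbers for illustration only, and `effective_epochs` is a hypothetical helper name.

```python
def effective_epochs(source_tokens, sampling_weights, training_tokens):
    """Number of passes over each source implied by the sampling weights."""
    total_weight = sum(sampling_weights.values())
    epochs = {}
    for name, size in source_tokens.items():
        share = sampling_weights[name] / total_weight
        epochs[name] = training_tokens * share / size
    return epochs


# Hypothetical corpus, sizes in billions of tokens (not real measurements).
source_tokens = {"web": 900.0, "code": 80.0, "wiki_books": 20.0}
# Hypothetical sampling weights: web is 90% of the tokens but 45% of the stream.
sampling_weights = {"web": 0.45, "code": 0.35, "wiki_books": 0.20}

epochs = effective_epochs(source_tokens, sampling_weights, training_tokens=1000.0)
for name, value in epochs.items():
    print(f"{name}: {value:.2f} epochs")
```

With these numbers the web slice is seen for half a pass, while the small high-density slices are repeated several times.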

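Learn the mix (DoReMi-class): train tiny proxy models on candidate mixes, fit loss as a function of the mixing weights, and propose the mix that minimizes the fitted loss. This is the same idea as scaling laws: fit at small scale, then extrapolate.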
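A toy sketch of the fit-then-propose step, reduced to two sources and a quadratic loss curve. This is a regression-style simplification for illustration, not the published DoReMi procedure. The proxy losses are fabricated placeholder numbers, and `fit_quadratic` and `propose_weight` are hypothetical names.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the normal equations."""
    powers = [sum(x ** k for x in xs) for k in range(5)]
    A = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]

    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, 3):
            factor = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= factor * A[col][c]
            rhs[r] -= factor * rhs[col]

    # Back-substitution.
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(A[r][c] * coef[c] for c in range(r + 1, 3))
        coef[r] = (rhs[r] - tail) / A[r][r]
    return coef


def propose_weight(weights, losses):
    """Return the weight in [0, 1] that minimizes the fitted loss curve."""
    a, b, c = fit_quadratic(weights, losses)
    if c > 0:
        return min(max(-b / (2 * c), 0.0), 1.0)
    # No interior minimum: fall back to the better endpoint.
    return 0.0 if a <= a + b + c else 1.0


# Placeholder proxy results: code weight (rest is web) vs. validation loss.
code_weights = [0.1, 0.3, 0.5, 0.7, 0.9]
proxy_losses = [2.06, 2.00, 2.06, 2.24, 2.54]
best_code_weight = propose_weight(code_weights, proxy_losses)
```

With more than two sources the weights live on a simplex and the fitted model needs more terms, but the loop is the same: a handful of cheap proxy runs, one fit, one proposed mix to validate at larger scale.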

| Use | What it is |
| --- | --- |
| Teacher-student distillation | Strong teacher generates target outputs; student trains on the pairs |
| Textbook-style (Phi-class) | Deliberately clean, structured pretraining text on chosen subjects |
| Instruction / dialogue pairs | Generated (prompt, response) for SFT (next lesson) |
| Self-improvement loops | Filter model’s best outputs, re-train (reasoning RL) |

Caveat (applies to all): synthetic data carries the teacher’s blind spots and characteristic phrasing into the student. Filter + dedup synthetic; mix with non-synthetic.
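The caveat translates into three small steps: filter, dedup, then cap the synthetic share of the stream. The sketch below assumes synthetic samples arrive as plain strings. The banned phrase, the length threshold, and the function names are illustrative assumptions.

```python
import hashlib
import random


def clean_synthetic(samples, min_chars=20,
                    banned_phrases=("as an ai language model",)):
    """Drop too-short, boilerplate-phrased, and repeated synthetic samples."""
    seen = set()
    kept = []
    for text in samples:
        normalized = " ".join(text.lower().split())
        if len(normalized) < min_chars:
            continue
        if any(phrase in normalized for phrase in banned_phrases):
            continue
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


def mix(real, synthetic, synthetic_fraction, n, seed=0):
    """Sample n examples with a fixed share drawn from the synthetic pool."""
    rng = random.Random(seed)
    n_synthetic = round(n * synthetic_fraction)
    batch = [rng.choice(synthetic) for _ in range(n_synthetic)]
    batch += [rng.choice(real) for _ in range(n - n_synthetic)]
    rng.shuffle(batch)
    return batch
```

Exact-match dedup after whitespace and case normalization is the weakest useful check. Teacher outputs often repeat with small edits, so a near-duplicate pass is usually worth adding on top.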

If a model lags a baseline at similar token count, check:

  1. Dedup level (exact + near-duplicate + substring?): single biggest silent failure.
  2. Mixing weights (code/math/high-density sources up-weighted relative to their raw byte share?).
  3. Filtering depth (classifier-quality layer on top of heuristics?).

Fix before retraining at larger scale; bigger scale on the same flawed corpus compounds the problem.
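For the first check, a quick audit on a corpus sample can show whether duplication survived the pipeline. This is a rough sketch: `duplicate_audit` is a hypothetical helper, the 8-word span length is an arbitrary assumption, and the in-memory sets only work for a sample, not a full corpus.

```python
import hashlib


def duplicate_audit(docs, span_words=8):
    """Report exact-duplicate and shared-long-span rates over a sample."""
    seen_docs = set()
    seen_spans = set()
    exact = 0
    span_hits = 0
    for text in docs:
        words = text.lower().split()
        doc_key = hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()
        spans = {" ".join(words[i:i + span_words])
                 for i in range(len(words) - span_words + 1)}
        if doc_key in seen_docs:
            exact += 1          # whole document already seen
        elif spans & seen_spans:
            span_hits += 1      # shares a long span with an earlier document
        seen_docs.add(doc_key)
        seen_spans |= spans
    n = max(len(docs), 1)
    return {"exact_duplicate_rate": exact / n,
            "shared_span_rate": span_hits / n}
```

A non-trivial rate on either number after dedup has supposedly run points at a missing level rather than a tuning problem.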

Sources (CC, Wikipedia, GitHub, arXiv) are largely shared. Moat = pipeline: filter recipes + dedup methods + learned mixing + synthetic-data strategy. The competitive surface in pretraining data is engineering, not collection.

Filtering, dedup, mixing, and synthetic data are data-engineering decisions here. Legal and policy debates about training data are out of scope.

  • Heuristic / classifier filter: rule-based vs learned quality scoring.
  • MinHash + LSH: the standard near-duplicate dedup method.
  • DoReMi-class: small-scale fit of source mixing weights.
  • Distillation: teacher generates outputs the student trains on.
  • Textbook-style synthetic: deliberately clean generated pretraining text.
  • Stanford CS336, Lecture 14 (Data, filtering and deduplication, mixing, synthetic), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.