References: Data filtering, deduplication, mixing, synthetic
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 14: Data (filtering, deduplication, mixing, synthetic)
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 14 (data, part 2). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. This lesson is taught at a strictly technical (data-engineering) level; legal and policy questions about training data are out of scope here.
Watch this next
- Stanford CS336, Lecture 14: Data, filtering and deduplication by Hashimoto and Liang. This is the lecture this lesson mirrors. It walks through the filtering recipes and the MinHash + LSH dedup mechanics with worked examples.
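For orientation before watching, the MinHash half of that dedup mechanic can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size, and the 128 salted hashes standing in for random permutations are all assumptions made here. The LSH banding step is omitted; it only decides which signature pairs get compared.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts collapse to a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _salted_hash(shingle: str, seed: int) -> int:
    """Deterministic 64-bit hash of a shingle; each seed acts as one 'permutation'."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each seed, keep the minimum hash value over all shingles."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min(_salted_hash(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, for comparison on small inputs."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
    doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
    sa, sb = shingles(doc_a), shingles(doc_b)
    print("exact:    ", exact_jaccard(sa, sb))
    print("estimated:", estimated_jaccard(minhash_signature(sa), minhash_signature(sb)))
```

The point of the signature is that two documents can be compared in time proportional to `num_perm` rather than to their length, and the estimate tightens as `num_perm` grows. A production pipeline would typically use a tuned library and character or token shingles rather than this word-level toy.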
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Deduplicating Training Data Makes Language Models Better” by Lee et al. (2021). The empirical paper that established just how much deduplication matters, including its discussion of exact versus near-duplicate detection and the effect of each.
- “DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining” by Xie et al. (2023). The first widely cited “learn the mix” method, showing that small proxy runs can pick mixing weights that beat hand-tuned ratios on downstream loss.
- “Textbooks Are All You Need” (Phi) by Gunasekar et al. (2023). The textbook-style synthetic data paper that started the Phi line. It is useful as the clearest worked example of small models, trained on highly curated synthetic data, rivaling much larger models on specific evals.
Adjacent topics
Where this connects inside the track.
- Data, part 1 (lesson 11). The previous lesson opened the funnel at sources; this lesson runs through the rest of it.
- Scaling laws (lesson 9). The “learn the mix” idea (fit at small scale, then extrapolate) is the same discipline scaling laws established for choosing model size and token count (N, D). Both are evidence-first replacements for hand-tuned intuition.
- Post-training: SFT and RLHF (lesson 13). A separate, much smaller, and much higher-quality body of data (instruction/dialogue data plus preference data) transforms the pretrained model into an assistant. Synthetic-data techniques from this lesson are commonly used to generate that post-training data.