
References: Data filtering, deduplication, mixing, synthetic

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 14:
Data (filtering, deduplication, mixing, synthetic)
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist
https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture
videos are on YouTube under standard terms; slides are public on GitHub
without a stated license.
Required attribution: "Based on the structure of Stanford CS336,
'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
(cs336.stanford.edu). This is an independent structural mirror in
original prose; it reproduces no course materials, and Stanford does
not endorse it."
This lesson mirrors the structure of Lecture 14 (data, part 2). Clawdemy's
lessons are original prose that follows the pedagogical arc of the course.
Because the source publishes no explicit license, we cite it as a recommended
companion and reproduce none of its materials. This lesson is taught at a
strictly technical (data-engineering) level; legal and policy questions
about training data are out of scope here.

A short, durable list. Each link is a specific next step, not a generic pile.

Where this connects inside the track.

  • Data, part 1 (lesson 11). The previous lesson opened the funnel at sources; this lesson runs through the rest of it.

  • Scaling laws (lesson 9). The “learn the mix” idea (fit on small-scale runs, then extrapolate) applies the same discipline that scaling laws established for choosing parameter count and training-token count (N, D). Both replace hand-tuned intuition with evidence.

  • Post-training: SFT and RLHF (lesson 13). A separate body of data, much smaller and much higher in quality (instruction and dialogue examples plus preference data), turns the pretrained model into an assistant. The synthetic-data techniques from this lesson are commonly used to generate that post-training data.