References: Curating high-quality datasets

Source material

Source curriculum (structural mirror, cited as further study):
• Hugging Face, "LLM Course", Chapter 10: "Introduction to Argilla"
  Authors: the Hugging Face team (Lewis Tunstall, Leandro von Werra,
    Lysandre Debut, Sylvain Gugger, Merve Noyan, and others)
  Course page: https://huggingface.co/learn/llm-course/chapter10
  Code and notebooks: https://github.com/huggingface/course
  License: Apache 2.0 (prose and code)
  Required attribution: "Based on the Hugging Face LLM Course
    (huggingface.co/learn/llm-course), © Hugging Face, used under the
    Apache 2.0 license. This is an independent structural mirror;
    Hugging Face does not endorse it."
This lesson mirrors the structure of the course's Argilla chapter (why data
quality matters, setting up Argilla, defining and annotating a dataset, and
exporting it to the Hub). Clawdemy's lessons are original prose that follows
the pedagogical arc of the course. We do not reproduce or transcribe the
course; we cite it as the recommended companion. Course materials are used
under the Apache 2.0 license with the attribution above. That license
requires passing along a copy of the license and marking any changes, and
it grants no right to use Hugging Face's trademarks or imply endorsement.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

• The Argilla documentation. The reference for the argilla SDK and UI: settings, questions, fields, and the annotation and feedback workflows. The canonical guide once you stand up your own instance.
• The Hugging Face datasets Hub. Browse well-documented datasets to see what good structure and documentation look like, including how datasets describe their splits, fields, and intended use.
• Data-centric AI. The broader movement behind this lesson’s thesis: improving the data, not just the model, is the higher-leverage path. Useful framing for why curation deserves the effort.

Adjacent topics

Where this connects inside the track.

• Wrangling data with the Datasets library (lesson 5). That lesson covered the mechanical cleaning (map, filter); this one adds the human-judgment curation that turns clean data into good data.
• Fine-tuning LLMs (lesson 10). Supervised fine-tuning (SFT) is only as good as its instruction data; this lesson shows how to build and curate that data well. The curated dataset exported from Argilla feeds straight into TRL's SFTTrainer.
• Reasoning models and the road ahead (lesson 12). The final lesson closes the track by zooming out to the reasoning-model frontier and where the ecosystem is heading.
