Cheatsheet: Curating high-quality datasets
Why data quality dominates
- A model faithfully learns its data’s patterns, including the bad ones.
- In SFT the model imitates the responses shown: bad examples teach bad behavior.
- Quality > quantity: a smaller curated set often beats a larger noisy one.
- Representativeness > volume: more data helps only if it makes the set more representative.
- Consistency: contradictory labels teach a muddle.
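
The consistency point can be checked mechanically before any training run. The following is a minimal sketch, assuming each example is a dict with hypothetical `text` and `label` keys (the toy data is invented for illustration). It groups examples by lightly normalized text and reports any text that carries more than one label.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Return {normalized_text: sorted labels} for texts labeled inconsistently."""
    labels_by_text = defaultdict(set)
    for example in examples:
        # Normalize lightly so trivial case/whitespace variants are grouped.
        key = " ".join(example["text"].lower().split())
        labels_by_text[key].add(example["label"])
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Toy data, invented for illustration.
examples = [
    {"text": "Great battery life", "label": "positive"},
    {"text": "great battery  life", "label": "negative"},
    {"text": "Screen cracked on day one", "label": "negative"},
]
conflicts = find_conflicting_labels(examples)
print(conflicts)  # {'great battery life': ['negative', 'positive']}
```

Conflicts found this way are candidates for human review, not automatic deletion: deciding which label is right is a curation judgment.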
Cleaning (lesson 5) vs curation (this lesson)
| Cleaning (map / filter) | Curation (judgment) |
|---|---|
| Drop nulls, normalize, dedupe | Structure raw data into labeled examples |
| Mechanical, in code | Human labeling (often experts) |
| | Quality filtering of ambiguous/wrong examples |
| | Gathering human feedback |
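
The cleaning column is the part that reduces to code. The following is a minimal pure-Python sketch of those mechanical steps, assuming rows are dicts with a hypothetical `text` key. With the `datasets` library, the same logic would typically live in `map` and `filter` callbacks.

```python
def clean(rows):
    """Drop null/empty rows, normalize whitespace, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        text = row.get("text")
        if text is None:
            continue  # drop nulls
        text = " ".join(text.split())  # normalize whitespace
        if not text or text in seen:
            continue  # drop empties and duplicates
        seen.add(text)
        cleaned.append({**row, "text": text})
    return cleaned


# Toy rows, invented for illustration.
rows = [
    {"text": "  Hello   world "},
    {"text": None},
    {"text": "Hello world"},
    {"text": ""},
    {"text": "Second example"},
]
cleaned = clean(rows)
print(cleaned)  # [{'text': 'Hello world'}, {'text': 'Second example'}]
```

None of these steps needs to understand what the text means. Deciding whether an example is ambiguous or mislabeled does, and that is where curation starts.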
Argilla: connect
```python
import argilla as rg

client = rg.Argilla(api_url="https://your-space.hf.space", api_key="...")
client.me  # confirm connection
```

The easiest way to deploy Argilla is via a Hugging Face Space (enable persistent storage).
Argilla: the curation workflow
| Step | What you do |
|---|---|
| 1. Define structure | Declare fields (content shown) + questions (label, rating, free text) |
| 2. Load records | Push raw examples in (often from a HF dataset via load_dataset) |
| 3. Annotate | You / experts / a crowd answer the questions in the web UI |
| 4. Export | Send the curated dataset back to the Hub, ready for Trainer/SFTTrainer |
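
The four steps can be sketched with the Argilla 2.x Python SDK. This is an illustration under assumptions, not the course's exact code: the dataset name, guidelines, label set, and the helper functions `to_records`, `push_for_annotation`, and `export_curated` are all hypothetical, and `to_hub` availability depends on the installed Argilla version. Step 3 happens in the web UI, so it appears only as a comment.

```python
def to_records(rows, text_key="text"):
    """Map raw rows to dicts whose keys match the Argilla field names."""
    return [{"text": row[text_key]} for row in rows if row.get(text_key)]


def push_for_annotation(rows, api_url, api_key, name="sentiment_curation"):
    # Imported lazily so to_records stays usable without Argilla installed.
    import argilla as rg

    client = rg.Argilla(api_url=api_url, api_key=api_key)

    # 1. Define structure: fields are shown, questions are answered.
    settings = rg.Settings(
        guidelines="Label the sentiment of each text.",  # hypothetical
        fields=[rg.TextField(name="text")],
        questions=[
            rg.LabelQuestion(name="label", labels=["positive", "negative"]),
        ],
    )
    dataset = rg.Dataset(name=name, settings=settings, client=client)
    dataset.create()

    # 2. Load records: each dict key must match a field name.
    dataset.records.log(to_records(rows))
    return dataset


def export_curated(dataset, repo_id):
    # 3. Annotation happens in the web UI before this is called.
    # 4. Export the curated dataset to the Hub (assumes a version with to_hub).
    dataset.to_hub(repo_id=repo_id)


# The pure helper runs without a server; empty texts are skipped.
records = to_records([{"text": "Loved it"}, {"text": ""}, {"text": "Too slow"}])
print(records)  # [{'text': 'Loved it'}, {'text': 'Too slow'}]
```

Splitting upload and export into separate functions reflects the workflow: annotation sits between them and can take days.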
Evaluate the dataset (before training)
| Check | What it catches |
|---|---|
| Annotator agreement | Ambiguous task / unclear guidelines (model learns contradictions) |
| Coverage + diversity | Gaps vs the real input distribution |
| Balance | Overrepresented labels (skew in, skew out) |
Checking the dataset early is cheaper than finding its flaws in the model.
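
Two of these checks, agreement and balance, reduce to a few lines of arithmetic. The following is a minimal sketch, assuming two annotators labeled the same items in the same order; the label lists are toy data invented for illustration. Cohen's kappa is used here as one common chance-corrected agreement measure, not as the method the lesson prescribes.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def label_shares(labels):
    """Share of each label, to spot overrepresented classes."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


# Toy annotations, invented for illustration.
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann_b = ["pos", "neg", "neg", "neg", "pos", "pos"]

agreement = percent_agreement(ann_a, ann_b)
kappa = cohen_kappa(ann_a, ann_b)
shares = label_shares(ann_a)
print(round(agreement, 2), round(kappa, 2), shares)
```

Here the annotators agree on 4 of 6 items, but kappa is only about 0.33 because half of that agreement would be expected by chance. A low value like this points at the guidelines or the task definition before it points at the annotators.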
The instinct to rewire
Model underperforms? Before reaching for a bigger model, ask: is the data consistent, representative, correctly labeled, and broad enough? Better data usually beats brute force.
Words to use precisely
- Curation: the judgment-heavy work of structuring, labeling, filtering, and gathering feedback (beyond mechanical cleaning).
- Fields / questions (Argilla): the content shown to annotators / what they answer about it.
- Annotator agreement: how often labelers agree; a dataset-quality (ambiguity) signal.
- Representativeness: how well the data covers the real input distribution.
Recommended further study
- Hugging Face LLM Course, Chapter 10: “Introduction to Argilla.”
huggingface.co/learn/llm-course/chapter10. Released under Apache 2.0; this lesson mirrors its structure with original prose.