
Cheatsheet: Curating high-quality datasets

  • A model faithfully learns its data’s patterns, including the bad ones.
  • In SFT the model imitates the responses shown: bad examples teach bad behavior.
  • Quality > quantity: a smaller curated set often beats a larger noisy one.
  • Representativeness > volume: more data helps only if it makes the set more representative.
  • Consistency: contradictory labels teach a muddle.

Cleaning (lesson 5) vs curation (this lesson)

| Cleaning (map / filter) | Curation (judgment) |
| --- | --- |
| Drop nulls, normalize, dedupe | Structure raw data into labeled examples |
| Mechanical, in code | Human labeling (often experts) |
| | Quality filtering of ambiguous/wrong examples |
| | Gathering human feedback |
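The mechanical side of the comparison can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's own code: the `clean` helper, the `text` key, and the choice to lowercase during normalization are all assumptions made for the example.

```python
def clean(examples):
    """Drop nulls, normalize whitespace and case, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for ex in examples:
        text = ex.get("text")
        if text is None or not text.strip():
            continue  # drop null or empty examples
        # Normalize: collapse runs of whitespace and lowercase (an assumed policy).
        normalized = " ".join(text.split()).lower()
        if normalized in seen:
            continue  # dedupe on the normalized form
        seen.add(normalized)
        cleaned.append({**ex, "text": normalized})
    return cleaned


# Hypothetical raw examples.
raw = [
    {"text": "  Great  product! "},
    {"text": None},
    {"text": "great product!"},
    {"text": "Arrived late."},
]
cleaned = clean(raw)
print(cleaned)
```

None of these steps requires judgment about what an example means, which is why they fall under cleaning rather than curation.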
```python
import argilla as rg

client = rg.Argilla(api_url="https://your-space.hf.space", api_key="...")
client.me  # confirm connection
```

The easiest way to deploy Argilla is on a Hugging Face Space; enable persistent storage so the data is not lost when the Space restarts.

| Step | What you do |
| --- | --- |
| 1. Define structure | Declare fields (content shown) + questions (label, rating, free text) |
| 2. Load records | Push raw examples in (often from a HF dataset via load_dataset) |
| 3. Annotate | You / experts / a crowd answer the questions in the web UI |
| 4. Export | Send the curated dataset back to the Hub, ready for Trainer/SFTTrainer |

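The four steps above might look roughly like this with the Argilla 2.x Python SDK. Treat it as a sketch under assumptions: the dataset name, the `text` field, the question names, and the label set are hypothetical, and `client` is assumed to be an already connected `rg.Argilla` instance.

```python
def curate_with_argilla(client, records, dataset_name="reviews_curation"):
    """Steps 1-2: define the structure, then push raw records to Argilla."""
    # Imported inside the function so this file loads even where Argilla is not installed.
    import argilla as rg

    # 1. Define structure: fields are what annotators see, questions are what they answer.
    settings = rg.Settings(
        fields=[rg.TextField(name="text")],
        questions=[
            rg.LabelQuestion(name="sentiment", labels=["positive", "negative", "neutral"]),
            rg.RatingQuestion(name="quality", values=[1, 2, 3, 4, 5]),
            rg.TextQuestion(name="notes", required=False),
        ],
    )
    dataset = rg.Dataset(name=dataset_name, settings=settings, client=client)
    dataset.create()

    # 2. Load records: a list of dicts whose keys match the field names.
    dataset.records.log(records)

    # 3. Annotate: happens in the web UI, not in code.
    return dataset


def export_curated(dataset, repo_id):
    """Step 4: send the annotated dataset back to the Hub."""
    dataset.to_hub(repo_id=repo_id)
```

A call would look like `curate_with_argilla(client, [{"text": "Great product!"}])`, followed by `export_curated(dataset, "your-username/your-dataset")` once annotation is done; both the record and the repo id here are placeholders.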

| Check | What it catches |
| --- | --- |
| Annotator agreement | Ambiguous task / unclear guidelines (model learns contradictions) |
| Coverage + diversity | Gaps vs the real input distribution |
| Balance | Overrepresented labels (skew in, skew out) |
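Two of these checks are easy to approximate in plain Python. The sketch below computes raw agreement between two annotators, Cohen's kappa (a common chance-corrected agreement measure, brought in here as an illustration rather than something the lesson prescribes), and per-label shares for the balance check. The function names and the toy labels are assumptions.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if each annotator labeled at random with their own label rates.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def label_shares(labels):
    """Share of each label, to spot overrepresented classes."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical annotations of the same eight records by two annotators.
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
ann_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

print(percent_agreement(ann_a, ann_b))  # 0.875
print(cohen_kappa(ann_a, ann_b))        # 0.75
print(label_shares(ann_a))              # {'pos': 0.625, 'neg': 0.375}
```

Low agreement usually points at the guidelines rather than the annotators, so it is worth inspecting the disputed records before collecting more labels.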

Checking the dataset early is cheaper than finding its flaws in the model.

Model underperforms? Before reaching for a bigger model, ask: is the data consistent, representative, correctly labeled, and broad enough? Better data usually beats brute force.

  • Curation: the judgment-heavy work of structuring, labeling, filtering, and gathering feedback (beyond mechanical cleaning).
  • Fields / questions (Argilla): the content shown to annotators / what they answer about it.
  • Annotator agreement: how often labelers agree; a dataset-quality (ambiguity) signal.
  • Representativeness: how well the data covers the real input distribution.
  • Hugging Face LLM Course, Chapter 10: “Introduction to Argilla.” huggingface.co/learn/llm-course/chapter10. Released under Apache 2.0; this lesson mirrors its structure with original prose.