Curating datasets: cheatsheet

Why data quality dominates

A model faithfully learns its data’s patterns, including the bad ones.
In SFT the model imitates the responses shown: bad examples teach bad behavior.
Quality > quantity: a smaller curated set often beats a larger noisy one.
Representativeness > volume: more data helps only if it makes the set more representative of real inputs.
Consistency: contradictory labels teach a muddle.
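
One way to spot the contradictory labels mentioned above is to group examples by their text and flag any text that carries more than one label. The sketch below is a minimal illustration in plain Python; the `text` and `label` keys, the sample rows, and the helper name `find_label_conflicts` are all assumptions made for the example, not part of any library.

```python
from collections import defaultdict


def find_label_conflicts(examples):
    """Return {normalized_text: sorted labels} for texts that carry more than one label."""
    labels_by_text = defaultdict(set)
    for example in examples:
        # Light normalization so trivial case/whitespace differences still match.
        key = " ".join(example["text"].lower().split())
        labels_by_text[key].add(example["label"])
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical sample data: the first two rows are the same sentence with different labels.
examples = [
    {"text": "Refund took three weeks.", "label": "negative"},
    {"text": "refund took  three weeks.", "label": "neutral"},
    {"text": "Great support team!", "label": "positive"},
]

conflicts = find_label_conflicts(examples)
print(conflicts)
```

Exact-match grouping only catches literal duplicates; near-duplicates with conflicting labels would need fuzzy matching or human review.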

Cleaning (lesson 5) vs curation (this lesson)

Cleaning (`map` / `filter`)	Curation (judgment)
Drop nulls, normalize, dedupe	Structure raw data into labeled examples
Mechanical, in code	Human labeling (often experts)
	Quality filtering of ambiguous/wrong examples
	Gathering human feedback
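
To make the left-hand column concrete, here is a rough sketch of mechanical cleaning in plain Python: drop nulls, normalize whitespace, and remove exact duplicates. With the `datasets` library the same steps would typically be expressed through `map` and `filter`; this version avoids the dependency so it runs anywhere. The `text` key, the sample rows, and the `clean` helper are assumptions for illustration.

```python
def clean(rows):
    """Drop null/empty texts, normalize whitespace, and remove exact duplicates.

    Order is preserved: the first occurrence of each text is kept.
    """
    seen = set()
    cleaned = []
    for row in rows:
        text = row.get("text")
        if text is None:
            continue  # drop nulls
        text = " ".join(text.split())  # collapse runs of whitespace, strip ends
        if not text or text in seen:
            continue  # drop empties and duplicates
        seen.add(text)
        cleaned.append({**row, "text": text})
    return cleaned


# Hypothetical raw rows with a null, an empty string, and a whitespace-only duplicate.
raw_rows = [
    {"text": "  Hello   world "},
    {"text": None},
    {"text": "Hello world"},
    {"text": ""},
    {"text": "Second example"},
]

cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

Nothing here requires judgment, which is the point of the contrast: everything in the right-hand column still needs a person deciding what a good example looks like.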

Argilla: connect

import argilla as rg
client = rg.Argilla(api_url="https://your-space.hf.space", api_key="...")
client.me   # returns the current user; confirms the URL and API key work

The easiest way to deploy Argilla is on a Hugging Face Space; enable persistent storage so annotations survive restarts.

Argilla: the curation workflow

Step	What you do
1. Define structure	Declare fields (content shown) + questions (label, rating, free text)
2. Load records	Push raw examples in (often from a HF dataset via `load_dataset`)
3. Annotate	You / experts / a crowd answer the questions in the web UI
4. Export	Send the curated dataset back to the Hub, ready for `Trainer`/`SFTTrainer`
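
The four steps above might look roughly like this with the Argilla 2.x Python SDK. This is a sketch under assumptions, not a definitive recipe: the dataset name, field and question names, label set, and the helper functions are invented for illustration, and the calls should be checked against the Argilla version you have installed. Step 3 happens in the web UI, so it appears only as a comment.

```python
def build_settings():
    """Step 1: declare what annotators see (fields) and what they answer (questions)."""
    # Imported inside the function so this file can be loaded without Argilla installed.
    import argilla as rg

    return rg.Settings(
        guidelines="Label the sentiment of each review.",  # hypothetical guideline text
        fields=[rg.TextField(name="text")],
        questions=[
            rg.LabelQuestion(name="sentiment", labels=["positive", "negative", "neutral"]),
            rg.RatingQuestion(name="clarity", values=[1, 2, 3, 4, 5]),
            rg.TextQuestion(name="notes", required=False),
        ],
    )


def create_and_load(client, raw_examples, name="reviews-curation"):
    """Steps 1-2: create the dataset on the server and push raw records into it.

    `raw_examples` is assumed to be a list of dicts whose keys match the field
    names, e.g. [{"text": "Great support team!"}].
    """
    import argilla as rg

    dataset = rg.Dataset(name=name, settings=build_settings(), client=client)
    dataset.create()
    dataset.records.log(records=raw_examples)
    return dataset


# Step 3: annotators answer the questions in the Argilla web UI.


def export_to_hub(dataset, repo_id):
    """Step 4: convert the records to a Hugging Face dataset and push it to the Hub."""
    hf_dataset = dataset.records.to_datasets()
    hf_dataset.push_to_hub(repo_id)  # repo_id such as "your-username/your-dataset"
    return hf_dataset
```

Usage would be along the lines of `dataset = create_and_load(client, rows)` with the `client` from the connection snippet, followed later by `export_to_hub(dataset, "your-username/your-dataset")` once annotation is done.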

Evaluate the dataset (before training)

Check	What it catches
Annotator agreement	Ambiguous task / unclear guidelines (model learns contradictions)
Coverage + diversity	Gaps vs the real input distribution
Balance	Overrepresented labels (skew in, skew out)

Checking the dataset early is cheaper than finding its flaws in the model.
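
Two of these checks are easy to sketch in plain Python. The example below computes raw agreement and Cohen's kappa for two annotators, plus label shares as a quick balance check. The label lists and function names are hypothetical; coverage and diversity are not shown because they depend on having a sample of real inputs to compare against.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def label_shares(labels):
    """Share of each label in the dataset, for spotting overrepresented classes."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


# Hypothetical annotations for six items from two annotators.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "pos"]

print(round(percent_agreement(annotator_a, annotator_b), 3))  # 0.667
print(round(cohen_kappa(annotator_a, annotator_b), 3))        # 0.333
print(label_shares(annotator_a))                              # {'pos': 0.5, 'neg': 0.5}
```

Raw agreement of about 0.67 looks acceptable until chance is factored in; a kappa near 0.33 suggests the guidelines may need tightening before more labels are collected.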

The instinct to rewire

Model underperforms? Before reaching for a bigger model, ask: is the data consistent, representative, correctly labeled, and broad enough? Better data usually beats brute force.

Words to use precisely

Curation: the judgment-heavy work of structuring, labeling, filtering, and gathering feedback (beyond mechanical cleaning).
Fields / questions (Argilla): the content shown to annotators / what they answer about it.
Annotator agreement: how often labelers agree; a dataset-quality (ambiguity) signal.
Representativeness: how well the data covers the real input distribution.

Recommended further study

Hugging Face LLM Course, Chapter 10: “Introduction to Argilla.” huggingface.co/learn/llm-course/chapter10. Released under Apache 2.0; this lesson mirrors its structure with original prose.