Practice: Curating high-quality datasets

Self-check

Seven short questions. Answer each before opening the collapsible.

1. Why does data quality dominate LLM results, especially in fine-tuning?

Show answer

A model faithfully learns the patterns in its training data, including the bad ones. In supervised fine-tuning the model imitates the responses it is shown, so inconsistent, low-quality, or unrepresentative examples teach inconsistent, low-quality, or skewed behavior. You cannot fine-tune or prompt your way out of a bad dataset.

2. Why does a smaller curated dataset often beat a larger noisy one?

Show answer

Because a bad example is not neutral: it actively teaches the wrong thing. Noise, wrong labels, and contradictions degrade what the model learns. A smaller, carefully curated set gives consistent, correct signal, which often produces a better model than a bigger set full of problems.

3. What does “representativeness matters more than volume” mean for a dataset?

Show answer

More data helps only if it is more representative of what the model will actually encounter. Ten thousand examples of one narrow case do not teach the cases you left out. Coverage of the real input distribution matters more than raw count.
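A coverage check along these lines can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the topic names, the `coverage_report` helper, and the idea that each example carries a topic tag are all assumptions made for the example.

```python
from collections import Counter

def coverage_report(train_topics, expected_topics):
    """Return per-topic counts and the expected topics that have no examples."""
    counts = Counter(train_topics)
    missing = sorted(t for t in expected_topics if counts[t] == 0)
    return counts, missing

# Hypothetical data: ten thousand examples of one narrow case.
train_topics = ["billing"] * 10_000
# Hypothetical list of topics the deployed model is expected to meet.
expected_topics = {"billing", "shipping", "returns", "account"}

counts, missing = coverage_report(train_topics, expected_topics)
print(counts["billing"])  # 10000
print(missing)            # ['account', 'returns', 'shipping']
```

The raw count looks healthy, but three of the four expected topics have no examples at all, which is the gap that adding more billing data would never close.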

4. How is curation more than the map/filter cleaning from lesson 5?

Show answer

map/filter handle mechanical problems (nulls, normalization, duplicates) in code. Curation adds the judgment parts: turning unstructured data into structured labeled examples, human labeling and annotation (sometimes by domain experts), quality filtering of ambiguous or wrong examples, and gathering human feedback. Humans are in the loop.
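To make the boundary concrete, here is a small sketch of the mechanical side using Python's built-in `map` and `filter` on a list of dicts (the same shape of operation as the dataset-level map/filter from lesson 5). The rows and helper names are hypothetical.

```python
def has_text(row):
    """Mechanical filter: drop rows whose text is missing or blank."""
    return bool(row.get("text") and row["text"].strip())

def normalize(row):
    """Mechanical fix: trim the text and collapse runs of whitespace."""
    return {**row, "text": " ".join(row["text"].split())}

def dedupe(rows):
    """Keep the first occurrence of each text, compared case-insensitively."""
    seen, unique = set(), []
    for row in rows:
        key = row["text"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical raw rows.
raw = [
    {"text": "  Refund  my order "},
    {"text": None},
    {"text": "refund my order"},
    {"text": "Great, another delay."},
]

cleaned = dedupe(list(map(normalize, filter(has_text, raw))))
print(len(cleaned))  # 2
```

Code removes the null and the duplicate, but it cannot decide whether "Great, another delay." should be labeled positive or negative. That decision is the judgment part that curation adds.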

5. What is Argilla, and what is the four-step workflow?

Show answer

Argilla is an open-source annotation and feedback platform from Hugging Face for human-in-the-loop curation. The four steps: define the dataset structure (the fields shown to annotators and the questions they answer), load records in (often from a Hugging Face dataset), annotate in the web UI (you, experts, or a crowd), and export the curated dataset back to the Hub. You deploy it (for example, as a Hugging Face Space) and drive it with the argilla SDK.
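The four steps might look roughly like this in code. Treat it as a sketch under assumptions: it assumes the Argilla 2.x Python SDK (`rg.Argilla`, `rg.Settings`, `rg.Dataset`, `dataset.records.log`, `dataset.to_hub`), whose names may differ in other versions, and the dataset name, labels, guidelines, URL, key, and repo id are placeholders invented for illustration.

```python
def to_records(rows, text_key="text"):
    """Shape plain rows into record dicts, skipping rows with no text."""
    return [{"text": row[text_key]} for row in rows if row.get(text_key)]

def create_and_load(rows, api_url, api_key):
    """Steps 1 and 2: define the structure, then load records in."""
    # Imported here so the pure helper above works without the SDK installed.
    import argilla as rg

    client = rg.Argilla(api_url=api_url, api_key=api_key)

    # Step 1: fields shown to annotators and the questions they answer.
    settings = rg.Settings(
        guidelines="Label the sentiment of each review.",  # placeholder text
        fields=[rg.TextField(name="text")],
        questions=[
            rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])
        ],
    )
    dataset = rg.Dataset(name="reviews-curation", settings=settings, client=client)
    dataset.create()

    # Step 2: load records in.
    dataset.records.log(to_records(rows))
    return dataset

def export_to_hub(dataset, repo_id):
    """Step 4: export once annotation (step 3, in the web UI) is done."""
    dataset.to_hub(repo_id=repo_id)
```

Step 3 has no code on purpose: annotation happens in the web UI, by people, between loading and exporting.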

6. Name two ways to evaluate a dataset before training on it.

Show answer

Any two of: annotator agreement (do multiple labelers agree? low agreement signals an ambiguous task or unclear guidelines), coverage and diversity (does it span the inputs the model will meet?), and balance (are some labels or categories vastly overrepresented?). Checking these before training is cheaper than discovering the flaws in the model’s behavior afterward.
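Annotator agreement is simple to compute for two labelers. The sketch below shows raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. Kappa is one common choice rather than something this lesson prescribes, and the labels are made up for the example.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two annotators on the same eight examples.
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
ann_b = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "neg"]

print(percent_agreement(ann_a, ann_b))  # 0.625
print(cohens_kappa(ann_a, ann_b))       # 0.25
```

Here 62.5% raw agreement shrinks to a kappa of 0.25 because, with two labels used about equally, half of the agreement would be expected by chance alone.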

7. When a fine-tuned model underperforms, why is “look at the data” often a better move than “use a bigger model”?

Show answer

Because the data is frequently the actual constraint: inconsistent labels, gaps in coverage, skew, or wrong examples cap what any model can learn. A bigger model trained on the same flawed data often just learns the flaws more confidently. Improving the dataset is usually the higher-leverage fix, and far cheaper than scaling up.

Try it yourself: diagnose the dataset

About 10 minutes, no setup required. The skill here is curation judgment, so practice spotting data problems.

Part A: what is wrong with each dataset? For each, name the data-quality problem and what you would do about it.

a. An instruction dataset where two annotators labeled the same 200 examples and agreed on only 55% of them.
b. A sentiment dataset that is 92% positive reviews, 8% negative.
c. A support-bot training set built entirely from billing questions, to be deployed for all support topics.
d. A QA dataset scraped automatically, never reviewed, with many answers that don't actually answer the question.

What you’ll get

a. Low annotator agreement. The task is ambiguous or the guidelines are unclear. Fix the guidelines and re-annotate; the model would otherwise learn contradictory signals.
b. Imbalance. The model will be confident on positives and weak on negatives. Rebalance (gather more negatives) or account for the skew when you train and evaluate, for example by reporting results per class; skew in, skew out.
c. Coverage gap. It is unrepresentative of the deployment distribution (all topics, not just billing). Broaden the data to cover the real range of questions.
d. Quality/labeling failure. Unreviewed wrong answers teach wrong behavior. This needs human curation (for example, in Argilla) to filter and correct before it is usable.

The pattern: the fix is almost always better data, not a bigger model.
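For dataset b, a balance check is a few lines of Python. This sketch uses a hypothetical 100-example sample with the same 92/8 split; the helper name is made up for the example.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical sample with the same split as dataset b.
labels = ["positive"] * 92 + ["negative"] * 8

shares = label_shares(labels)
majority_label, majority_share = max(shares.items(), key=lambda kv: kv[1])
print(majority_label, majority_share)  # positive 0.92
```

A model that always answers "positive" would score 92% accuracy on data with this split, so overall accuracy alone can hide how weak it is on negatives.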

Part B (reasoning). Your teammate wants to improve a fine-tuned model by doubling the training set, scraping twice as many examples with no review. Why might this make the model worse, and what is the alternative?

What you should notice

Doubling with unreviewed data doubles the noise, wrong labels, and skew along with the good examples, and the model learns all of it faithfully. More data helps only if it is more representative and at least as clean; otherwise it can degrade results. The alternative is curation: review and label for quality, fill coverage gaps deliberately, and check consistency, often with a smaller set that is actually good.

Part C (reasoning). Why is measuring annotator agreement a check on the dataset and not just on the annotators?

What you should notice

Low agreement usually means the task itself is ambiguous or the labeling guidelines are unclear, not that the annotators are careless. That ambiguity is a property of the dataset’s design, and it will pass straight into the model as contradictory training signal. Fixing it (clearer questions, better guidelines, sharper label definitions) improves the dataset, which is why agreement is a dataset-quality metric.

Flashcards

Nine cards.

Q. Why does data quality dominate LLM results?

A model faithfully learns its training data’s patterns, including the bad ones. In SFT it imitates the responses shown, so low-quality or inconsistent examples teach low-quality or inconsistent behavior. You can’t fine-tune or prompt past a bad dataset.

Q. Why does a smaller curated dataset often beat a larger noisy one?

Every bad example actively teaches the wrong thing. Noise, wrong labels, and contradictions degrade learning. Consistent, correct signal from a curated set often produces a better model than a bigger problematic one.

Q. Representativeness vs volume?

More data helps only if it is more representative of what the model will actually see. Coverage of the real input distribution matters more than raw count; many examples of one narrow case don’t teach the rest.

Q. How is curation more than map/filter cleaning?

map/filter handle mechanical issues (nulls, normalization, duplicates) in code. Curation adds judgment: structuring, human labeling (sometimes by experts), quality filtering, and gathering feedback. Humans are in the loop.

Q. What is Argilla?

An open-source human-in-the-loop annotation and feedback platform from Hugging Face. You deploy it (for example, as a Hugging Face Space) and drive it with the argilla SDK to label, curate, and collect feedback on data.

Q. What is the Argilla curation workflow?

Define the dataset structure (fields shown + questions to answer), load records in (often from a HF dataset), annotate in the web UI (you/experts/crowd), and export the curated dataset back to the Hub.

Q. How do you evaluate a dataset before training?

Check annotator agreement (ambiguity signal), coverage and diversity (does it span real inputs?), and balance (overrepresented labels?). Catching data flaws early beats finding them in the model.

Q. What does low annotator agreement tell you?

The task is ambiguous or the guidelines are unclear. That is a dataset-design problem, not just careless annotators, and it passes into the model as contradictory signal. Fix it with clearer questions and guidelines.

Q. Model underperforms: bigger model or better data?

Usually better data. Inconsistent labels, coverage gaps, skew, and wrong examples cap what any model can learn; a bigger model on the same data often just learns the flaws more confidently. Improving data is higher-leverage and cheaper.