Lesson: Curating high-quality datasets

The last lesson ended on a line worth taking seriously: a fine-tuned model is only as good as the data it was tuned on. This lesson is about that data. It is the least glamorous part of building with LLMs and, increasingly, the part that decides whether you succeed. You cannot fine-tune your way out of a bad dataset, and you cannot prompt your way out of it either. When teams compare notes on what actually moved their results, the answer is more often “we cleaned and curated the data” than “we used a bigger model.” Curation is the work, and this lesson is how it is done.

For the hands-on parts you will want a notebook, plus a free Hugging Face account if you want to deploy Argilla, the annotation tool introduced below.

A model learns the patterns in its training data faithfully, including the bad ones. In fine-tuning the effect is especially sharp: in supervised fine-tuning the model imitates the responses it is shown, so inconsistent, low-quality, or unrepresentative examples teach inconsistent, low-quality, or skewed behavior. Three consequences follow:

  • Quality often beats quantity. A smaller, carefully curated dataset frequently produces a better model than a larger noisy one. A bad example is never neutral; it actively teaches the wrong thing.
  • Representativeness, not just volume, is what helps. As lesson 1 put it, more data helps only if it is more representative of what the model will actually see. Ten thousand examples of one narrow case do not teach the cases you left out.
  • Consistency is a feature of the data, not just the model. If two annotators label similar examples differently, the model receives contradictory signals and learns a muddle. A consistent dataset is a teachable one.

This is why curation is not a preliminary chore you rush through to get to the “real” work. For LLM results, it often is the real work.

Curation is more than the programmatic cleaning you did in lesson 5 (map, filter). That handles the mechanical problems: dropping nulls, normalizing text, removing duplicates. Curation adds the parts that need judgment:

  • Turning unstructured data into structured data a model can train on (raw text into labeled examples).
  • Labeling and annotating examples, which often requires a human (and sometimes a domain expert) to decide the correct answer.
  • Filtering for quality, removing examples that are wrong, ambiguous, or off-distribution.
  • Gathering human feedback on model outputs, the raw material for instruction tuning and preference data.

The through-line is that humans are in the loop. Code gets you a clean dataset; people get you a good one. That is why there is tooling built specifically for the human side of curation.
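To make the split concrete, here is a minimal sketch of the mechanical half only: dropping empty rows, normalizing text, and removing duplicates. It uses plain Python as a stand-in for the map and filter steps from lesson 5; the function names, the "text" key, and the sample rows are all illustrative assumptions, not part of any library.

```python
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize the text, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def clean(rows: list[dict]) -> list[dict]:
    """Drop empty rows, normalize text, and remove duplicates (ignoring case)."""
    seen = set()
    cleaned = []
    for row in rows:
        text = row.get("text")
        # Drop nulls and whitespace-only strings.
        if not text or not text.strip():
            continue
        text = normalize(text)
        # Treat rows that differ only in case or spacing as duplicates.
        key = text.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**row, "text": text})
    return cleaned


# Hypothetical sample data for illustration.
raw_rows = [
    {"text": "Great  product,\twould buy again."},
    {"text": None},
    {"text": "great product, would buy again."},
    {"text": "   "},
    {"text": "Arrived broken."},
]
cleaned_rows = clean(raw_rows)
```

Nothing in this sketch can tell you whether "Arrived broken." is labeled correctly or whether the examples cover the cases that matter. Those questions are the judgment half, and they are where people come in.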

Argilla is an open-source annotation and feedback platform from Hugging Face, built for exactly this human-in-the-loop curation. You deploy it (the easiest path is a Hugging Face Space), then drive it from Python with the argilla SDK. Connecting is a few lines:

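import argilla as rg

# api_url is the address of your deployed Argilla instance (here, a Hugging Face Space)
client = rg.Argilla(api_url="https://your-space.hf.space", api_key="...")
client.me  # returns the current user, which confirms you are connected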

The workflow from there has a clear shape:

  1. Define the dataset’s structure. You declare the fields (the content shown to the annotator, e.g. the text of a review) and the questions (what you want answered about it: a label, a rating, a free-text correction). This is the schema of the annotation task.
  2. Load records in. Push your raw examples, often loaded straight from a Hugging Face dataset with the load_dataset function from lesson 5, into Argilla.
  3. Annotate in the UI. You, or invited domain experts, or a crowd, work through the records in Argilla’s web interface, answering the questions. This is where human judgment enters the data.
  4. Export back to the Hub. The curated, annotated dataset goes back out to the Hugging Face Hub, ready to feed the Trainer or SFTTrainer from earlier lessons.

That loop (define, load, annotate, export) turns a pile of raw text into a structured, labeled dataset you can actually train on, with the human judgment recorded as data.
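The four steps above might look roughly like the sketch below. It assumes the Argilla 2.x Python SDK (rg.Settings, rg.TextField, rg.LabelQuestion, rg.Dataset, records.log, to_hub); check the names against the version you have installed. The dataset name, the "sentiment" question, its labels, and the helper function names are hypothetical choices for a review-labeling task. Step 3 happens in the web UI, so it appears here only as a comment.

```python
def to_records(rows, text_key="text"):
    """Shape raw rows into dicts keyed by the Argilla field name ("text")."""
    return [{"text": row[text_key]} for row in rows if row.get(text_key)]


def push_for_annotation(rows, api_url, api_key, dataset_name="reviews"):
    """Steps 1 and 2: define the schema, then load records (assumes Argilla 2.x)."""
    import argilla as rg  # imported lazily; requires `pip install argilla`

    client = rg.Argilla(api_url=api_url, api_key=api_key)

    # Step 1: fields are what the annotator sees, questions are what they answer.
    settings = rg.Settings(
        guidelines="Label the sentiment of each review.",
        fields=[rg.TextField(name="text")],
        questions=[
            rg.LabelQuestion(
                name="sentiment",
                labels=["positive", "negative", "neutral"],
            )
        ],
    )
    dataset = rg.Dataset(name=dataset_name, settings=settings, client=client)
    dataset.create()

    # Step 2: push the raw examples in as records.
    dataset.records.log(to_records(rows))
    return dataset


def export_to_hub(dataset, repo_id):
    """Step 4: run this after annotation (step 3) is finished in the Argilla UI."""
    dataset.to_hub(repo_id=repo_id)
```

Keeping to_records separate from the network calls means the record shaping can be checked locally before anything is sent to the server.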

Evaluating a dataset, not just building it

Building a dataset is half the job; knowing whether it is any good is the other half. A few checks earn their keep:

  • Annotator agreement. When multiple people label the same examples, do they agree? Low agreement means the task is ambiguous or the guidelines are unclear, and the model will inherit that confusion. Argilla lets multiple annotators work the same records so you can measure this.
  • Coverage and diversity. Does the dataset span the range of inputs the model will meet, or is it bunched on a few easy cases? Gaps in the data become gaps in the model.
  • Balance. Are some labels or categories vastly overrepresented? Skew in, skew out, and the model will be most confident exactly where your data was densest, which may not be where it matters.
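One common way to put a number on annotator agreement is Cohen's kappa, which corrects the raw agreement rate between two annotators for the agreement expected by chance. The sketch below is a plain-Python illustration with made-up labels. It is one possible metric, not something the lesson or Argilla prescribes, and it covers only the two-annotator case.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Need two non-empty label lists of equal length.")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # given each annotator's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same six records.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on four of six records, but half of that agreement would be expected by chance, so kappa comes out at about 0.33. A value that low usually points at the guidelines rather than the annotators.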

Evaluating the dataset before training is far cheaper than discovering its flaws in the model’s behavior afterward. It is the same “check it early” logic from the debugging lesson, applied upstream to the data.
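A balance check is one of the cheapest of these to run before training. The sketch below computes each label's share of the dataset and flags labels that fall under a threshold. The 10% cutoff, the function names, and the sample labels are arbitrary assumptions for illustration; the right threshold depends on your task.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def underrepresented(labels, min_share=0.10):
    """List labels whose share of the dataset falls below min_share."""
    return [
        label
        for label, share in label_shares(labels).items()
        if share < min_share
    ]


# Hypothetical label column from a support-ticket dataset.
labels = ["billing"] * 16 + ["shipping"] * 3 + ["refund"] * 1
shares = label_shares(labels)
rare = underrepresented(labels, min_share=0.10)
```

With these counts, "billing" makes up 80% of the data and "refund" is flagged as underrepresented. Whether that skew is a problem depends on whether it matches what the model will see in production.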

There is a persistent instinct, when a model underperforms, to reach for a bigger model or more training. Often the higher-leverage move is to look at the data: is it consistent, representative, correctly labeled, and broad enough? The history of practical LLM work keeps returning to this lesson, that careful data curation outperforms brute force, and the teams that internalize it move faster because they spend their effort where it actually pays.

This also closes a loop across the whole track: lesson 5 cleaned data mechanically, lesson 10 fine-tuned on it, and this lesson is about making that data genuinely good, with humans in the loop, before it ever reaches the model. The unglamorous truth of applied AI is that the dataset is the product as much as the model is, and curation, not just modeling, is the craft.

  • Data quality, not model size, is increasingly the lever. A model faithfully learns its data’s patterns, good and bad; you cannot fine-tune or prompt your way out of a bad dataset.
  • Quality often beats quantity, representativeness matters more than raw volume, and consistency in the data is what makes it teachable.
  • Curation goes beyond mechanical cleaning (map/filter): it adds structuring, human labeling, quality filtering, and gathering feedback, all of which need judgment.
  • Argilla is the human-in-the-loop tool: deploy it (a Hugging Face Space), connect with the argilla SDK, define fields and questions, load records, annotate in the UI, and export the curated dataset to the Hub.
  • Evaluate the dataset, not just build it: check annotator agreement, coverage and diversity, and balance before training. Catching data flaws early is far cheaper than finding them in the model.
  • The dataset is the product as much as the model. Curation is the highest-leverage and most overlooked work in applied LLM development.

When a model disappoints, the instinct is a bigger model; the better move is usually a better dataset. Curating data well, with people in the loop and the quality actually checked, is the craft this lesson is about, and it is where most of the real gains hide.