
Wrangling data with the Datasets library

This lesson opens Phase 2 by turning to the data side, because every lesson so far quietly assumed the data was clean. You will use the datasets library to load, clean, and transform real, messy data at scale. The source curriculum is the Hugging Face LLM Course, Chapter 5, freely available and Apache-2.0 licensed at huggingface.co/learn/llm-course/chapter5.

You will load datasets from the Hub and from your own files into a DatasetDict; transform data with Dataset.map (overwriting a column or adding a new one); remove rows with Dataset.filter; use batched=True to process many examples at once for a large speedup; understand why the Arrow backend lets you work with data larger than RAM; and carve a validation set with train_test_split before saving the result.

This is lesson 5 of 12 and the first lesson of Phase 2 (data, tokenizers, and tasks). Phase 1 ran, adapted, and shared models; Phase 2 turns to the data and tokenizers they all depend on. This lesson builds on lesson 3, where you first used map to tokenize, and sets up lesson 6, which opens up the fast tokenizers that make batched=True so fast.

Prerequisites: lesson 3 of this track (fine-tuning), where you first used load_dataset and Dataset.map and learned the validation discipline this lesson extends. You should be comfortable with basic Python, including lambda functions and list comprehensions, which the cleaning examples use. Install with pip install datasets transformers.

No math is needed. This is a data-engineering lesson: loading, cleaning, transforming, and splitting. Everything is short Python (functions passed to map and filter), and the one performance idea (batched=True, and why Arrow scales past RAM) is explained rather than derived.

The single capability this lesson builds: load, filter, and transform a dataset efficiently with the datasets library. Concretely, you will be able to:

  • Load datasets from the Hub and from local files into a DatasetDict
  • Transform data with Dataset.map (update or add columns)
  • Remove rows with Dataset.filter
  • Use batched=True to process examples in batches for a large speedup
  • Carve a validation set with train_test_split and save/reload a dataset

  • Read time: about 12 minutes
  • Practice time: about 12 minutes (load a dataset, add a column, filter, and split, plus flashcards)
  • Difficulty: standard (lots of small code, but each operation is one line)