Wrangling data with Datasets: brief

What you’ll learn

This lesson opens Phase 2 by turning to the data side, because every lesson so far quietly assumed the data was clean. You will use the datasets library to load, clean, and transform real, messy data at scale. The source curriculum is the Hugging Face LLM Course, Chapter 5, freely available and Apache-2.0 licensed at huggingface.co/learn/llm-course/chapter5.

You will load datasets from the Hub and from your own files into a DatasetDict; transform data with Dataset.map (overwriting a column or adding a new one); remove rows with Dataset.filter; use batched=True to process many examples at once for a large speedup; understand why the Arrow backend lets you work with data larger than RAM; and carve a validation set with train_test_split before saving the result.

Where this fits

This is lesson 5 of 12, the first lesson of Phase 2 (data, tokenizers, and tasks). In Phase 1 you ran, adapted, and shared models; Phase 2 turns to the data and tokenizers those models depend on. This lesson builds on lesson 3, where you first used map to tokenize, and sets up lesson 6, which opens up the fast tokenizers that make batched=True so fast.

Before you start

Prerequisites: lesson 3 of this track (fine-tuning), where you first used load_dataset and Dataset.map and learned the validation discipline this lesson extends. You should be comfortable with basic Python, including lambda functions and list comprehensions, which the cleaning examples use. Install with pip install datasets transformers.

About the math

None. This is a data-engineering lesson: loading, cleaning, transforming, and splitting. Everything is short Python (functions passed to map and filter), and the one performance idea (batched=True, and why Arrow scales past RAM) is explained rather than derived.

By the end, you’ll be able to

The single capability this lesson builds: load, filter, and transform a dataset efficiently with the datasets library. Concretely, you will be able to:

Load datasets from the Hub and from local files into a DatasetDict
Transform data with Dataset.map (update or add columns)
Remove rows with Dataset.filter
Use batched=True to process examples in batches for a large speedup
Carve a validation set with train_test_split and save/reload a dataset

Time and difficulty

Read time: about 12 minutes
Practice time: about 12 minutes (load a dataset, add a column, filter, and split, plus flashcards)
Difficulty: standard (lots of small code, but each operation is one line)