
Summary: Wrangling data with the Datasets library

Phase 2 opens on the data side, because real data is never as tidy as load_dataset("glue", "mrpc") made it look. The datasets library loads data from the Hub or from your own files. The result is a DatasetDict of splits, each a Dataset with features and num_rows. The data lives in Apache Arrow files that are memory-mapped from disk, so you can work with datasets larger than your RAM. You clean and transform with two methods: map applies a function that updates or adds columns, and filter keeps only the rows that pass a test. The key speedup is batched=True, which hands your function a batch of examples (1,000 by default) instead of one at a time and unlocks the fast tokenizers of the next lesson. You carve a validation set with train_test_split (leaving the test set untouched), drop to pandas when convenient, and save with save_to_disk. This is the scan version; the lesson cleans a messy real dataset end to end.

  • load_dataset loads Hub datasets and local files. Use a loader ("csv", "json") plus a data_files mapping for your own data. You get a DatasetDict of splits, each a Dataset with named features and a num_rows count.
  • Datasets are Arrow-backed on disk, not in RAM. The Arrow files are memory-mapped rather than loaded whole, which is why dataset size is bounded by disk, not memory.
  • map transforms, filter removes. map takes a function returning a dict, whose keys overwrite existing columns or add new ones; filter takes a function returning True/False per row and keeps the rows that return True.
  • batched=True is the big speedup. It passes a batch of examples per call (1,000 by default, adjustable with batch_size), so each column value becomes a list. It is usually much faster, and it is essential for fast tokenizers, which parallelize in Rust.
  • Tools you reach for: rename_column, sort, unique, and train_test_split (carve a validation set, protect the test set until the end).
  • Pandas interop and saving. set_format("pandas") changes only the output format, not the underlying Arrow data; reset_format restores the default output, and Dataset.from_pandas builds a Dataset from a DataFrame. Save and reload with save_to_disk / load_from_disk.

This lesson reframes where the real work of applied AI lives. The model is usually the easy, solved part; the difference between a project that works and one that does not is almost always the quality and preparation of the data. The datasets library makes that work expressible (cleaning and transformation as small functions over map and filter), fast (batched=True), and unbounded by memory (the Arrow backend). The habit to carry forward is the same evaluation honesty from lesson 3, now applied to data: carve a validation set and leave the test set untouched, so your final number actually means something. With clean, well-split data in hand, the next lesson opens up the tokenizer, the component that turns that text into the numbers every model in this track has quietly depended on.

The model is usually the easy part; the data is where the work lives. map, filter, and batched=True are how you do that work at scale without drowning in it.