Summary: Wrangling data with the Datasets library
Phase 2 opens on the data side, because real data is never as tidy as load_dataset("glue", "mrpc") made it look. The datasets library loads data from the Hub or from your own files and returns a DatasetDict of splits, each a Dataset with features and num_rows. It is backed by Apache Arrow on disk, so you can work with datasets larger than your RAM. You clean and transform with two methods: map applies a function that updates or adds columns, and filter keeps only the rows for which a function returns True. The key speedup is batched=True, which hands your function 1,000 examples at a time by default and unlocks the fast tokenizers of the next lesson. You carve a validation set with train_test_split (leaving the test set untouched), drop to pandas when convenient, and save with save_to_disk. This is the scan version; the lesson cleans a messy real dataset end to end.
Core ideas
- load_dataset loads Hub datasets and local files. Use a loader ("csv", "json") plus a data_files mapping for your own data. You get a DatasetDict of splits, each a Dataset with named features and a num_rows count.
- Datasets are Arrow-backed on disk, not in RAM. That is why dataset size is bounded by disk, not memory.
- map transforms, filter removes. map takes a function returning a dict (overwrite a column or add a new one); filter takes a function returning True/False per row.
- batched=True is the big speedup. It passes 1,000 examples at once by default (values become lists), runs dramatically faster, and is essential for fast tokenizers, which parallelize in Rust.
- Tools you reach for: rename_column, sort, unique, and train_test_split (carve a validation set, protect the test set until the end).
- Pandas interop and saving. set_format("pandas") changes only the output format; Dataset.from_pandas converts a DataFrame back to a Dataset. Save and reload with save_to_disk / load_from_disk.
What changes for you
This lesson reframes where the real work of applied AI lives. The model is usually the easy, solved part; the difference between a project that works and one that does not is almost always the quality and preparation of the data. The datasets library makes that work expressible (cleaning and transformation as small functions over map and filter), fast (batched=True), and unbounded by memory (the Arrow backend). The habit to carry forward is the same evaluation honesty from lesson 3, now applied to data: carve a validation set and leave the test set untouched, so your final number actually means something. With clean, well-split data in hand, the next lesson opens up the tokenizer, the component that turns that text into the numbers every model in this track has quietly depended on.
The model is usually the easy part; the data is where the work lives. map, filter, and batched=True are how you do that work at scale without drowning in it.