Cheatsheet: Wrangling data with the Datasets library
Load data

    from datasets import load_dataset

    # From the Hub
    ds = load_dataset("glue", "mrpc")

    # From local files (csv/json/text)
    ds = load_dataset("csv", data_files={"train": "train.tsv", "test": "test.tsv"}, delimiter="\t")

Returns a DatasetDict (splits) of Dataset objects, each with features and num_rows. Arrow-backed on disk, so it scales past RAM.

Inspect + sample

    print(ds)                                          # splits, features, num_rows
    ds["train"][0]                                     # one row as a dict
    ds["train"].shuffle(seed=42).select(range(1000))   # reproducible sample
    ds["train"].unique("column")                       # distinct values

Transform: map (add/change columns)

    # clean and tokenizer are placeholders: your own cleaning function and a loaded tokenizer

    # Add a column (new key) or overwrite (existing key)
    ds = ds.map(lambda x: {"length": len(x["text"].split())})

    # Faster: process 1,000 rows per call by default (values become lists)
    ds = ds.map(lambda x: {"text": [clean(t) for t in x["text"]]}, batched=True)

    # Tokenize a whole dataset (fast tokenizer + batched = ~30x)
    ds = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

batched=True is usually the biggest speedup; num_proc=N adds multiprocessing, which helps most for functions that a Rust-backed fast tokenizer is not already parallelizing.

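The difference between the per-example and batched forms is easiest to see by calling the mapped functions directly on plain dicts. The sketch below is a plain-Python stand-in, not the library itself: clean, add_length, and clean_batch are hypothetical helpers, and the hand-built example and batch dicts only imitate the shapes that map passes in.

```python
def clean(text: str) -> str:
    """Hypothetical cleaner: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def add_length(example: dict) -> dict:
    # Per-example form: each value is a single item.
    # The returned keys become new (or overwritten) columns.
    return {"length": len(example["text"].split())}


def clean_batch(batch: dict) -> dict:
    # Batched form: each value is a list with one entry per row,
    # and the function returns lists of the same length.
    return {"text": [clean(t) for t in batch["text"]]}


# Hand-built inputs that imitate what map would pass in (assumption).
example = {"text": "a  tiny   example"}
batch = {"text": ["a  tiny   example", " second\trow "]}

print(add_length(example))
print(clean_batch(batch))
```

With a real dataset, these would be passed as ds.map(add_length) and ds.map(clean_batch, batched=True).
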
Transform: filter (keep/drop rows)

    ds = ds.filter(lambda x: x["condition"] is not None)   # drop None
    ds = ds.filter(lambda x: x["length"] > 30)             # drop short rows

map changes columns; filter removes rows. Both take a per-example function (lambda is idiomatic).

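The two filters above can also be folded into one named predicate, so the rows are scanned once and the rule is easy to test on its own. A minimal sketch, assuming a hypothetical keep function and hand-built rows in place of a real dataset:

```python
def keep(example: dict) -> bool:
    """Hypothetical predicate: True keeps the row, False drops it."""
    # Both conditions must hold for the row to survive.
    return example["condition"] is not None and example["length"] > 30


# Hand-built rows standing in for dataset examples (assumption).
rows = [
    {"condition": "asthma", "length": 45},
    {"condition": None, "length": 80},
    {"condition": "migraine", "length": 12},
]

kept = [row for row in rows if keep(row)]
print(kept)
```

Passed to the library, this would be ds.filter(keep).
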
Handy methods

| Method | Does |
|---|---|
| rename_column(old, new) | Rename a column across all splits |
| sort("col") | Order rows by a column |
| unique("col") | Distinct values in a column |
| train_test_split(train_size=0.8, seed=42) | Carve a validation set out of a single split (returns train and test) |

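What train_test_split(train_size=0.8, seed=42) does conceptually can be sketched with the standard library: shuffle the row indices with a fixed seed, then cut at 80%. This is a stand-in to show the idea, not the library's implementation, and it will not reproduce the library's row order; split_indices is a hypothetical helper.

```python
import random


def split_indices(num_rows: int, train_size: float, seed: int) -> tuple[list[int], list[int]]:
    """Hypothetical helper: seeded shuffle of row indices, then a single cut."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    cut = int(num_rows * train_size)
    return indices[:cut], indices[cut:]


train_idx, valid_idx = split_indices(num_rows=10, train_size=0.8, seed=42)
print(len(train_idx), len(valid_idx))
```

In the library, the returned DatasetDict names the two parts train and test; the held-out part is commonly renamed to validation so the real test split stays untouched.
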
Pandas interop + saving

    from datasets import Dataset, load_from_disk

    ds.set_format("pandas")           # output as DataFrame (Arrow untouched)
    df = ds["train"][:]
    # ... pandas work ...
    ds.reset_format()                 # back to Arrow
    new = Dataset.from_pandas(df)     # DataFrame -> Dataset

    ds.save_to_disk("dir")            # Arrow
    ds = load_from_disk("dir")        # reload
    ds["train"].to_csv(...)           # per-split export; to_json(...) works the same way

Words to use precisely

- DatasetDict / Dataset: the splits container, and a single split (columns = features, rows = num_rows).
- Arrow: the columnar format, memory-mapped from disk; why datasets scale past RAM.
- batched=True: process a batch of examples per call; the key speedup, and what lets fast tokenizers work on many texts in parallel.
- Validation set: a slice carved from training data to tune on, keeping the test set untouched.

Recommended further study

- Hugging Face LLM Course, Chapter 5: “The Datasets library.” huggingface.co/learn/llm-course/chapter5. Released under Apache 2.0; this lesson mirrors its structure with original prose.