Debug your training: cheatsheet

Reading a traceback

Read bottom to top.
Last line = exception type + message (usually the key info).
Climb up to find the line in your code (vs library internals) that triggered it.
If the message names a fix, do that first.

The debugging loop

Read the error message.
Form a hypothesis: what kind of thing is wrong (type, shape, missing file, device)?
Check it directly (type(x), x.shape, list_repo_files(...)).
Search the exact message (search engine / Stack Overflow).
Still stuck? Minimize to a reproducible example, then ask.

Two classic errors and their fixes

Error	Likely cause	Fix
`OSError: Can't load config for '...'`	Bad model ID (typo) or missing `config.json`	Fix the ID; `list_repo_files(repo_id)` to inspect
`AttributeError: 'list' object has no attribute 'size'`	Tokenizer returned lists; model wants tensors	Add `return_tensors="pt"`

Where training pipelines break (check in order)

Area	Typical bug
Data + labels	Wrong columns, string labels, `num_labels` mismatch
Tokenization	Missing `return_tensors`, inconsistent lengths
Collator	Wrong collator for the task (labels not padded right)
Shapes + devices	Model-vs-batch mismatch; CPU vs GPU tensors

Systematic pipeline check

# Test one stage at a time before a full run:
ex = tokenizer(dataset[0]["text"], return_tensors="pt")   # 1. one example
batch = data_collator([dataset[i] for i in range(4)])      # 2. one batch: shapes/types
out = model(**batch)                                       # 3. push it through
# 4. only now: trainer.train()

Catching an error on one batch beats catching it after 20 minutes of training.

Minimal reproducible example

Smallest code that still triggers the error.
One example, not the whole dataset; bare load, not the full script.
Minimizing often reveals the cause; if not, it is what others need to help you.

Help-request template (forum or GitHub issue)

What I'm trying to do: <one line>
Minimal code to reproduce: <shortest runnable snippet>
Full traceback: <as text, not a screenshot>
Environment: <output of `transformers-cli env`>
Expected vs actual: <what I expected / what happened>
What I tried: <searches, fixes attempted>

Forums (discuss.huggingface.co) for “how/why”; GitHub issue when you believe it is a real bug.

Words to use precisely

Traceback / stack trace: the chain of calls to the error; read bottom-up.
Minimal reproducible example: the smallest code that reproduces the problem.
transformers-cli env: prints your library versions and platform for bug reports.

Recommended further study

Hugging Face LLM Course, Chapter 8: “How to ask for help.” huggingface.co/learn/llm-course/chapter8. Released under Apache 2.0; this lesson mirrors its structure with original prose.