Skip to content

Cheatsheet: Debug your training and get unstuck

  • Read bottom to top.
  • Last line = exception type + message (usually the key info).
  • Climb up to find the line in your code (vs library internals) that triggered it.
  • If the message names a fix, do that first.
  1. Read the error message.
  2. Form a hypothesis: what kind of thing is wrong (type, shape, missing file, device)?
  3. Check it directly (type(x), x.shape, list_repo_files(...)).
  4. Search the exact message (search engine / Stack Overflow).
  5. Still stuck? Minimize to a reproducible example, then ask.
ErrorLikely causeFix
OSError: Can't load config for '...'Bad model ID (typo) or missing config.jsonFix the ID; list_repo_files(repo_id) to inspect
AttributeError: 'list' object has no attribute 'size'Tokenizer returned lists; model wants tensorsAdd return_tensors="pt"

Where training pipelines break (check in order)

Section titled “Where training pipelines break (check in order)”
AreaTypical bug
Data + labelsWrong columns, string labels, num_labels mismatch
TokenizationMissing return_tensors, inconsistent lengths
CollatorWrong collator for the task (labels not padded right)
Shapes + devicesModel-vs-batch mismatch; CPU vs GPU tensors
# Test one stage at a time before a full run:
ex = tokenizer(dataset[0]["text"], return_tensors="pt") # 1. one example
batch = data_collator([dataset[i] for i in range(4)]) # 2. one batch: shapes/types
out = model(**batch) # 3. push it through
# 4. only now: trainer.train()

Catching an error on one batch beats catching it after 20 minutes of training.

  • Smallest code that still triggers the error.
  • One example, not the whole dataset; bare load, not the full script.
  • Minimizing often reveals the cause; if not, it is what others need to help you.

Help-request template (forum or GitHub issue)

Section titled “Help-request template (forum or GitHub issue)”
What I'm trying to do: <one line>
Minimal code to reproduce: <shortest runnable snippet>
Full traceback: <as text, not a screenshot>
Environment: <output of `transformers-cli env`>
Expected vs actual: <what I expected / what happened>
What I tried: <searches, fixes attempted>

Forums (discuss.huggingface.co) for “how/why”; GitHub issue when you believe it is a real bug.

  • Traceback / stack trace: the chain of calls to the error; read bottom-up.
  • Minimal reproducible example: the smallest code that reproduces the problem.
  • transformers-cli env: prints your library versions and platform for bug reports.
  • Hugging Face LLM Course, Chapter 8: “How to ask for help.” huggingface.co/learn/llm-course/chapter8. Released under Apache 2.0; this lesson mirrors its structure with original prose.