Practice: Debug your training and get unstuck

Self-check

Seven short questions. Answer each before opening the collapsible.

1. Which direction do you read a Python traceback, and where is the most useful information?

Show answer

Read it from the bottom up. The last line names the exception type and gives the error message, which is almost always where the useful information is. The lines above show the chain of calls that led there; work upward to find the line in your own code that triggered it.
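As a small illustration of reading from the bottom up, the sketch below triggers an error on purpose, captures the traceback as text, and pulls out the last line. The helpers `parse_label`, `load_labels`, and `capture_traceback` are hypothetical names made up for this example, not part of any library.

```python
import traceback


def parse_label(raw):
    # Fails when raw is not a number-like string.
    return int(raw)


def load_labels(rows):
    # Hypothetical helper: converts every raw label to an integer.
    return [parse_label(r) for r in rows]


def capture_traceback(func, *args):
    """Run func and return the traceback text if it raises, else None."""
    try:
        func(*args)
    except Exception:
        return traceback.format_exc()
    return None


tb_text = capture_traceback(load_labels, ["0", "1", "positive"])
lines = tb_text.strip().splitlines()

# The last line names the exception type and the message: start here.
last_line = lines[-1]
print(last_line)

# The lines above are the call chain, outermost call first.
frames = [line.strip() for line in lines if line.strip().startswith("File ")]
print(len(frames), "frames in the call chain")
```

Reading the printed last line first tells you a string label could not be converted; the frames above it then show which of your own functions passed that value in.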

2. You get OSError: Can't load config for '...'. What is the systematic first move?

Show answer

Read what the message suggests: check the model ID is correct (a typo is common), and check the repo actually contains a config.json. Inspect the repo directly with list_repo_files(repo_id). The error message itself often points straight at the fix; act on it before guessing.
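The file check can be sketched as below. To keep the example runnable offline, the file listing is a made-up stand-in; on a real repo it would come from `list_repo_files`, as noted in the comment. The helper `check_repo_files` and the repo ID `my-user/my-model` are hypothetical.

```python
def check_repo_files(files, required=("config.json",)):
    """Return the required file names that are missing from a repo listing."""
    present = set(files)
    return [name for name in required if name not in present]


# Illustrative listing; on a real repo it would come from the Hub, e.g.
#   from huggingface_hub import list_repo_files
#   files = list_repo_files("my-user/my-model")  # needs network access
files = ["README.md", "model.safetensors", "tokenizer.json"]

missing = check_repo_files(files)
print("missing:", missing)
```

If the Hub call itself fails because the repo cannot be found, that already points at the model ID rather than at the files.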

3. You get AttributeError: 'list' object has no attribute 'size' when calling model(**inputs). What kind of problem is this, and how do you confirm it?

Show answer

A type mismatch: the model expects tensors but got a Python list. Confirm by checking type(inputs["input_ids"]). The fix is to call the tokenizer with return_tensors="pt" so it returns PyTorch tensors instead of lists. The general move is to form a hypothesis about what kind of thing is wrong and check it directly.
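A minimal sketch of that check, using plain Python so it runs without any model: the `inputs` dictionary stands in for a tokenizer output produced without `return_tensors="pt"`, and the token IDs are made up. `describe_inputs` is a hypothetical helper.

```python
def describe_inputs(inputs):
    """Map each input name to the type name of its value."""
    return {name: type(value).__name__ for name, value in inputs.items()}


# Stand-in for a tokenizer output produced without return_tensors="pt":
# plain Python lists (the token ids here are made up).
inputs = {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]]}

summary = describe_inputs(inputs)
print(summary)

# Lists have no .size attribute, which is exactly what the AttributeError says.
suspects = [name for name, value in inputs.items() if not hasattr(value, "size")]
print("not tensor-like:", suspects)
```

Seeing `list` where a tensor type was expected confirms the hypothesis before you change anything.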

4. When the traceback and the message do not solve it, what is the next high-value move?

Show answer

Copy the error message into a search engine or Stack Overflow. You are rarely the first to hit a given error, and someone has usually posted the cause and fix. It takes seconds, has a high hit rate, and is not cheating.

5. Name three common places a training pipeline breaks.

Show answer

Any three of: data formatting and labels (wrong columns, string labels where integers are expected, mismatched num_labels); tokenization and tensors (missing return_tensors); the data collator (wrong collator for the task, so labels are not padded correctly); and shapes/devices (model-vs-batch mismatch, or tensors on CPU while the model is on GPU).
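The first item, data formatting and labels, is easy to check before training starts. The sketch below assumes single-label classification with integer class IDs in the range 0 to num_labels - 1. `check_labels` is a hypothetical helper and the sample labels are made up.

```python
def check_labels(labels, num_labels):
    """Return a list of human-readable problems found in a label column."""
    problems = []
    for i, label in enumerate(labels):
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(label, int) or isinstance(label, bool):
            problems.append(
                f"row {i}: label {label!r} is {type(label).__name__}, expected int"
            )
        elif not 0 <= label < num_labels:
            problems.append(f"row {i}: label {label} is outside 0..{num_labels - 1}")
    return problems


labels = [0, 1, "positive", 3]
problems = check_labels(labels, num_labels=2)
for problem in problems:
    print(problem)
```

Running a check like this on a slice of the dataset catches string labels and a mismatched num_labels in seconds, before any model is loaded.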

6. What is a minimal reproducible example, and why build one?

Show answer

The smallest piece of code that still triggers the error (one example instead of the whole dataset, a bare model load instead of the full script). Building it often reveals the cause on its own, and when it does not, it is exactly what someone else needs to help you, since no one can debug a long script they cannot run.

7. What should a good help request (forum post or GitHub issue) include?

Show answer

A minimal reproducible example, the full traceback as text (not a screenshot), your environment (library versions and platform, via transformers-cli env), and what you expected versus what happened plus what you already tried. The principle is to give the helper everything they need to reproduce it without a back-and-forth.
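If the CLI is not available, a rough environment summary can be assembled with the standard library alone. This is a sketch, not a replacement for the official report. `environment_report` is a hypothetical helper, and the default package list is an assumption you should adjust to your project.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("transformers", "datasets", "torch")):
    """Collect Python, platform, and installed package versions as text lines."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            lines.append(f"{name}: {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name}: not installed")
    return lines


report = environment_report()
print("\n".join(report))
```

Paste the printed lines into the post as text, next to the traceback.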

Try it yourself: diagnose the error

About 10 minutes, no setup. Read each error and name the likely cause and first fix.

Part A: match the error to its cause. For each, state what kind of problem it is and the first thing you would check or change.

a. OSError: Can't load config for 'my-user/my-modle'.
b. AttributeError: 'list' object has no attribute 'size' (on model(**inputs)).
c. ValueError: expected input_ids to have N labels but got M (during training).
d. RuntimeError: Expected all tensors to be on the same device.

What you’ll get

a. A bad model ID (note the typo “modle”) or a repo missing config.json. First move: fix the ID, then list_repo_files to check the files exist.
b. A type mismatch: tokenizer returned lists, model wants tensors. Fix: add return_tensors="pt".
c. A label/config mismatch: num_labels does not match the data, or labels are formatted incorrectly. Check how the model was loaded and the label column.
d. A device mismatch: some tensors are on CPU, the model on GPU (or vice versa). Move the inputs to the model’s device.

The pattern across all four: read the message, name the kind of error, check that one thing directly.

Part B (reasoning). You hit an error, search it, find nothing, and your training script is 250 lines. What do you do before posting on the forums, and why?

What you should notice

Build a minimal reproducible example: strip the script down to the fewest lines that still trigger the error. Often the minimizing itself reveals the cause (you delete the irrelevant parts until only the broken one remains). And if it does not, the minimal example is what makes your forum post answerable, because no one will debug 250 lines they cannot run. Then post it with the full traceback and your environment.

Part C (reasoning). Why are the skills in this lesson called more durable than any specific API in the track?

What you should notice

APIs change: the classes and arguments you learned will be renamed or replaced. But reading a traceback, forming and checking a hypothesis, searching an error, minimizing a repro, and writing a clear bug report work on every library and every version, and on open-source projects far beyond Hugging Face. The debugging method outlives the tools it is applied to.

Flashcards

Nine cards.

Q. Which way do you read a Python traceback?

Bottom to top. The last line names the exception and the message (the most useful info); the lines above show the call chain. Work upward to the line in your own code that triggered it.

Q. First move on OSError: Can't load config for '...'?

Read the message: check the model ID for typos, and check the repo has a config.json. Inspect with list_repo_files(repo_id). The message often names the fix directly.

Q. AttributeError: 'list' object has no attribute 'size' on model(**inputs)?

A type mismatch: the model wants tensors, got Python lists. Confirm with type(…); fix by calling the tokenizer with return_tensors="pt".

Q. What is the highest-value move when the traceback doesn't crack it?

Paste the error message into a search engine or Stack Overflow. You are rarely the first to hit it; someone has usually posted the cause and fix.

Q. Where do training pipelines commonly break?

Data and labels (wrong columns, string vs int labels, num_labels mismatch), tokenization/tensors (missing return_tensors), the collator (wrong one for the task), and shapes/devices (CPU vs GPU).

Q. How do you debug a training pipeline systematically?

Test one stage at a time: tokenize one example, check one batch from the collator for shape/type, push that batch through the model, then launch the full run. Catch errors on one batch, not after 20 minutes.
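The one-stage-at-a-time idea can be sketched as a tiny harness that pushes a single example through named stages and reports the first one that fails. The stages below are toy stand-ins for a real tokenizer, collator, and model forward pass. `run_stages` and the stage functions are hypothetical.

```python
def run_stages(example, stages):
    """Run named stages in order on one example; stop at the first failure.

    Returns (name_of_failed_stage, exception), or (None, None) if all pass.
    """
    value = example
    for name, stage in stages:
        try:
            value = stage(value)
        except Exception as exc:
            return name, exc
    return None, None


# Toy stand-ins for the real stages (tokenizer, collator, model forward pass).
def tokenize(text):
    return {"input_ids": [len(word) for word in text.split()]}


def collate(features):
    # Wrap the single example into a batch of size one.
    return {"input_ids": [features["input_ids"]]}


def forward(batch):
    # Mimics a model that needs tensors: plain lists have no .size attribute.
    return batch["input_ids"].size()


failed, error = run_stages(
    "a short example",
    [("tokenize", tokenize), ("collate", collate), ("forward", forward)],
)
print(failed, "->", type(error).__name__)
```

Knowing which stage failed narrows the search to one function instead of the whole script.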

Q. What is a minimal reproducible example and why build one?

The smallest code that still triggers the error. Building it often reveals the cause, and it is what others need to help you, since no one can debug a long script they cannot run.

Q. What makes a good help request?

A minimal reproducible example, the full traceback as text, your environment (transformers-cli env), and expected-vs-actual plus what you tried. Give the helper everything to reproduce it without back-and-forth.

Q. Why are these debugging skills more durable than any API?

APIs get renamed and replaced; reading tracebacks, checking hypotheses, searching errors, minimizing repros, and writing clear bug reports work on every library, every version, and every open-source project.