Why tool-using models fail

This is lesson 3 of Phase 7, How we judge models and where they’re going, in Track 5 (AI Foundations). The previous lesson covered benchmark literacy and the lecturer’s “categorize errors before chasing them” methodology. This lesson applies that methodology to a specific kind of failure: tool-using models that misbehave. When a chat app gives a wrong answer to a query that a tool could have answered, the failure is almost always one of a small set of named sub-failures, organized into three buckets that match the three stages of a function call (covered in Phase 6’s function-calling lesson):

  • Stage 1 (tool prediction) failures: the LLM punted, hallucinated a function name, picked the wrong tool, or filled in the wrong arguments.
  • Stage 2 (tool execution) failures: the tool returned an error or returned nothing useful.
  • Stage 3 (response synthesis) failures: the LLM didn’t ground its answer on the tool response, the response was buried under irrelevant data, or the response wasn’t structured meaningfully.

This lesson walks through each bucket and sub-failure with its matching debug recipe, so that the next time you see a “the AI did X wrong” bug report you can place it in seconds. Course materials are at cme295.stanford.edu.
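The three buckets and their named sub-failures can be sketched as a small triage function. This is a minimal illustration, not the course's own tooling: the `Trace` record, its field names, and the `triage` helper are all hypothetical, and the sketch assumes you already have a logged trace of a turn that produced a wrong answer. It checks the stages in order, because an earlier-stage failure makes the later stages moot.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set, Tuple


class Bucket(Enum):
    TOOL_PREDICTION = "stage 1: tool prediction"
    TOOL_EXECUTION = "stage 2: tool execution"
    RESPONSE_SYNTHESIS = "stage 3: response synthesis"


@dataclass
class Trace:
    """Hypothetical log of one failed turn; all field names are illustrative."""
    expected_tool: str                 # the tool a correct run would have called
    known_tools: Set[str]              # tools actually exposed to the model
    called_tool: Optional[str] = None  # None means the model never called a tool
    args_correct: bool = True
    tool_error: Optional[str] = None
    tool_output: Optional[str] = None
    answer_uses_output: bool = True


def triage(t: Trace) -> Tuple[Bucket, str]:
    """Place a failed turn into a bucket and a named sub-failure."""
    # Stage 1: did the model predict the right call?
    if t.called_tool is None:
        return Bucket.TOOL_PREDICTION, "punt"
    if t.called_tool not in t.known_tools:
        return Bucket.TOOL_PREDICTION, "tool hallucination"
    if t.called_tool != t.expected_tool:
        return Bucket.TOOL_PREDICTION, "wrong-tool"
    if not t.args_correct:
        return Bucket.TOOL_PREDICTION, "wrong-arguments"
    # Stage 2: did the tool itself do its job?
    if t.tool_error is not None:
        return Bucket.TOOL_EXECUTION, "tool-error"
    if not t.tool_output:
        return Bucket.TOOL_EXECUTION, "tool-returned-nothing"
    # Stage 3: the call and the tool were fine, so the synthesis went wrong.
    if not t.answer_uses_output:
        return Bucket.RESPONSE_SYNTHESIS, "didn't-ground"
    # Telling these two apart requires reading the raw tool response.
    return Bucket.RESPONSE_SYNTHESIS, "buried-in-noise or poorly-structured"
```

For example, `triage(Trace("get_weather", {"get_weather"}, called_tool="get_wether"))` lands in the tool-prediction bucket as a tool hallucination, because the misspelled name is not among the tools exposed to the model.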

The previous lesson (Why benchmarks can mislead) introduced the lecturer’s “categorize before chasing” methodology in the context of benchmark literacy, and this lesson applies the same discipline to debugging tool-use failures. The next two lessons cover frontier directions in the field (transformers beyond text, new generation methods), and the track closes with a safety-lens recap that pulls together every safety thread woven through Phases 4-7.

Prerequisites: the function-calling lesson is required, because this lesson maps tool-use failures directly onto the three-stage mechanism that lesson covered. The agent loops lesson is useful for understanding multi-turn agent contexts, where tool-use failures compound.

  • Identify the three buckets where tool-use failures cluster (tool prediction, tool execution, response synthesis)
  • Recognize the named sub-failures inside each bucket (punt, tool hallucination, wrong-tool, wrong-arguments, tool-error, tool-returned-nothing, didn’t-ground, buried-in-noise, poorly-structured)
  • Apply the categorize-before-chasing discipline when debugging a tool-use failure
  • Distinguish failure modes that need AI work (model upgrade, SFT) from those that need engineering work (tool API design, structured outputs, error handling)
  • Recognize tool-implementation quality (naming, docstrings, structured responses) as high-leverage work for making tool-using AI features reliable
  • Read time: about 13 minutes
  • Practice time: about 12 minutes (a self-check on the three buckets and named sub-failures, a hands-on triage exercise on real-style failure descriptions, and flashcards)
  • Difficulty: standard
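The objective on tool-implementation quality (naming, docstrings, structured responses) can be made concrete with a small sketch. Everything here is hypothetical: the `get_order_status` tool, the in-memory `_ORDERS` table standing in for a real backend, and the `ok`/`status`/`error` response shape are assumptions chosen for illustration, not an API from the course. The idea is that a descriptive name and docstring help at the tool-prediction stage, while an explicit, structured result makes tool-error and tool-returned-nothing cases visible instead of silent.

```python
from typing import Any, Dict

# Stand-in data for illustration; a real tool would query a service or database.
_ORDERS = {"A100": "shipped", "A101": "processing"}


def get_order_status(order_id: str) -> Dict[str, Any]:
    """Look up the shipping status of a single order.

    Args:
        order_id: Order identifier, for example "A100".

    Returns:
        A dict with three keys: `ok` (bool), `status` (str or None),
        and `error` (str or None). Failures and empty lookups are
        reported explicitly rather than as an empty response.
    """
    if not order_id.strip():
        # Bad input is reported as a structured error, not raised or swallowed.
        return {"ok": False, "status": None, "error": "order_id must be non-empty"}
    status = _ORDERS.get(order_id)
    if status is None:
        # "Nothing found" is stated in words the model can relay to the user.
        return {"ok": False, "status": None, "error": f"no order found for id {order_id!r}"}
    return {"ok": True, "status": status, "error": None}
```

With this shape, a failed lookup hands the model a short, specific message to ground on, which is an engineering fix rather than a model upgrade.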