References: Why tool-using models fail
Source material
Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 8, LLM Evaluation): see course site at https://cme295.stanford.edu/ for the lecture URL
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the tool-use failure-modes section of Stanford CME 295 Lecture 8, covering [01:01:19-01:06:16] Bucket 1 sub-failures (punt, tool hallucination, wrong-tool selection, wrong-arguments), [01:12:00-01:18:23] Bucket 2 sub-failures (tool errors, tool returned nothing), and [01:18:23-01:21:16] Bucket 3 sub-failures (didn't ground, buried in noise, poorly structured). The taxonomy itself is the lecturer's; the lecture notes that the cheatsheet does not separately cover this taxonomy. Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the instructors.
Foundational papers
The published research behind tool-use evaluation and debugging.
- “Toolformer: Language Models Can Teach Themselves to Use Tools”, Schick et al., 2023. Already cited in the function-calling lesson; relevant here for the framing of how tool-use is taught and where it fails.
- “Gorilla: Large Language Model Connected with Massive APIs”, Patil et al., 2023. Demonstrates fine-tuning an LLM specifically for tool-calling across thousands of available APIs. Relevant for understanding the tool-router problem at scale (mentioned in this lesson’s Bucket 1 sub-failures).
- “A Comprehensive Evaluation of Tool-Assisted Generation Strategies”, Jacovi et al., 2023. Empirical evaluation of where tool-using LLMs fail across multiple benchmarks. Useful for seeing the failure-mode taxonomy applied at scale to real model outputs.
Practical references
- OpenAI’s function-calling guide. Working reference for production-grade tool-use, including structured-output enforcement and error-handling patterns.
- Anthropic’s tool use documentation. Covers the same ground for a different vendor; useful for comparing how tool-use protocols differ between vendors.
Going deeper
A short list, chosen for durability.
- “Survey on Evaluation of Large Language Models with Tool Use”, Wang et al., 2024. Surveys evaluation methodology specifically for tool-using LLMs. Useful for understanding how the field formally measures the failure modes this lesson categorizes.
- “AgentBench”, Liu et al., 2023. A benchmark suite for evaluating LLM agents (tool-using systems running observe-plan-act loops, Phase 6’s territory). Useful for connecting this lesson’s failure-mode taxonomy to systematic agent evaluation.
Adjacent topics
- Production tool-use observability. The lesson’s “categorize before chasing” methodology is harder to apply at scale without observability tooling that lets you see the structured Stage 1 calls and Stage 2 returns separately. Search terms: “LLM observability,” “agent tracing,” “OpenTelemetry for AI agents.” Most of this lives in vendor product docs (Helicone, Langfuse, OpenLLMetry, etc.) rather than academic papers.
- Argument validation patterns. Argument hallucination (Bucket 1’s wrong-arguments sub-failure) is heavily mitigated by JSON-schema validation between Stages 1 and 2. Search terms: “structured outputs validation,” “JSON schema enforcement in LLM APIs,” “Pydantic for LLM tool calls.” These practical engineering patterns are documented in vendor SDKs.
- The “AI bug report” as a categorization problem. This lesson’s framing applies more broadly: most production AI bugs benefit from being categorized before being chased. Search terms: “LLM evaluation pipelines,” “production AI debugging,” “AI feature observability.”
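To make the "categorize before chasing" idea and the Stage 1 / Stage 2 validation step concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `TOOLS` registry, the `get_weather` tool, the dict shape of a call, and the function names are all hypothetical, and the type-map check is a hand-rolled stand-in for real JSON Schema validation. It covers only the sub-failures that can be detected mechanically. Wrong-tool selection passes these checks (the call is well-formed, just aimed at the wrong tool), and Bucket 3 still requires reading the final answer.

```python
from typing import Any, Optional

# Hypothetical registry: tool name -> {argument name: expected Python type}.
# A production system would more likely store JSON Schema documents here.
TOOLS: dict[str, dict[str, type]] = {
    "get_weather": {"city": str, "days": int},
}


def check_call(call: Optional[dict[str, Any]]) -> Optional[str]:
    """Return a Bucket 1 sub-failure label, or None if the call passes these checks."""
    if call is None:
        return "punt"  # the model answered without calling any tool
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return "tool hallucination"  # the named tool is not in the registry
    args = call.get("arguments", {})
    if set(args) != set(spec):
        return "wrong arguments"  # missing or unexpected argument names
    for key, expected in spec.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            return "wrong arguments"
    return None


def check_return(result: Any, error: Optional[str]) -> Optional[str]:
    """Return a Bucket 2 sub-failure label, or None if the tool returned something."""
    if error is not None:
        return "tool error"
    if result is None or result == "" or result == [] or result == {}:
        return "tool returned nothing"
    return None


def categorize(call: Optional[dict[str, Any]], result: Any = None,
               error: Optional[str] = None) -> str:
    """Name the first stage that fails; later stages are checked only if earlier ones pass."""
    label = check_call(call)
    if label is not None:
        return f"Bucket 1: {label}"
    label = check_return(result, error)
    if label is not None:
        return f"Bucket 2: {label}"
    return "Bucket 3 candidate: inspect the final answer"
```

For example, `categorize({"name": "get_weather", "arguments": {"city": "Oslo", "days": "3"}})` is labeled a Bucket 1 wrong-arguments failure because `days` arrives as a string. Checking the stages in order reflects the lesson's point that a trace should be categorized before anyone starts debugging the final answer.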
Stanford CME 295 cheatsheet
- Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The lecturer notes that the cheatsheet does not separately cover this tool-use failure-mode taxonomy; for that material, the lecture itself (and this lesson) is the primary source.
Community discussion
None selected for this lesson. The literature on production tool-use debugging is largely in vendor docs and team writeups; durable academic-grade community references are still consolidating. The OpenTelemetry/observability-tooling ecosystem will likely produce shared resources here over time.