References: Why tool-using models fail

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 8, LLM Evaluation):
    see course site at https://cme295.stanford.edu/ for the lecture URL
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the tool-use failure-modes section of Stanford CME 295
Lecture 8, covering [01:01:19-01:06:16] Bucket 1 sub-failures (punt, tool
hallucination, wrong-tool selection, wrong-arguments), [01:12:00-01:18:23]
Bucket 2 sub-failures (tool errors, tool returned nothing), and
[01:18:23-01:21:16] Bucket 3 sub-failures (didn't ground, buried in noise,
poorly structured). The taxonomy itself is the lecturer's, and the lecturer
notes that the cheatsheet does not cover it separately. Clawdemy
provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

Foundational papers

The published research behind tool-use evaluation and debugging.

“Toolformer: Language Models Can Teach Themselves to Use Tools”, Schick et al., 2023. Already cited in the function-calling lesson; relevant here for the framing of how tool-use is taught and where it fails.
“Gorilla: Large Language Model Connected with Massive APIs”, Patil et al., 2023. Demonstrates fine-tuning an LLM specifically for tool-calling across thousands of available APIs. Relevant for understanding the tool-router problem at scale (mentioned in this lesson’s Bucket 1 sub-failures).
“A Comprehensive Evaluation of Tool-Assisted Generation Strategies”, Jacovi et al., 2023. Empirical evaluation of where tool-using LLMs fail across multiple benchmarks. Useful for seeing the failure-mode taxonomy applied at scale to real model outputs.

Practical references

OpenAI’s function-calling guide. Working reference for production-grade tool-use, including structured-output enforcement and error-handling patterns.
Anthropic’s tool use documentation. The equivalent reference from a different vendor. Useful for comparing how tool-use protocols differ between vendors.

Going deeper

A short list, chosen for durability.

“Survey on Evaluation of Large Language Models with Tool Use”, Wang et al., 2024. Surveys evaluation methodology specifically for tool-using LLMs. Useful for understanding how the field formally measures the failure modes this lesson categorizes.
“AgentBench: Evaluating LLMs as Agents”, Liu et al., 2023. A benchmark suite for evaluating LLM agents (tool-using systems running observe-plan-act loops, Phase 6’s territory). Useful for connecting this lesson’s failure-mode taxonomy to systematic agent evaluation.

Adjacent topics

Production tool-use observability. The lesson’s “categorize before chasing” methodology is harder to apply at scale without observability tooling that lets you see the structured Stage 1 calls and Stage 2 returns separately. Search terms: “LLM observability,” “agent tracing,” “OpenTelemetry for AI agents.” Most of this lives in vendor product docs (Helicone, Langfuse, OpenLLMetry, etc.) rather than academic papers.
Argument validation patterns. Argument hallucination (Bucket 1’s wrong-arguments sub-failure) is heavily mitigated by JSON-schema validation between Stages 1 and 2. Search terms: “structured outputs validation,” “JSON schema enforcement in LLM APIs,” “Pydantic for LLM tool calls.” Practical engineering patterns documented in vendor SDKs.
The “AI bug report” as a categorization problem. This lesson’s framing applies more broadly: most production AI bugs benefit from being categorized before being chased. Search terms: “LLM evaluation pipelines,” “production AI debugging,” “AI feature observability.”
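
As a rough illustration of the argument-validation and "categorize before chasing" ideas above, the sketch below gates a Stage 1 tool call before Stage 2 runs it and labels any rejection with one of the lesson's Bucket 1 sub-failures. Everything named here is an assumption made for illustration: the TOOL_SPECS registry, the get_weather tool, the Verdict type, and the check_stage1_call function are hypothetical, and the hand-rolled type check stands in for a real JSON Schema or Pydantic validator. Wrong-tool selection is deliberately not covered, since a call to the wrong tool can still be well-formed.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical tool registry: tool name -> {argument name: expected Python type}.
# A real system would hold full JSON Schemas here instead.
TOOL_SPECS: dict[str, dict[str, type]] = {
    "get_weather": {"city": str, "days": int},
}


@dataclass
class Verdict:
    ok: bool
    sub_failure: Optional[str] = None  # Bucket 1 label from the lesson's taxonomy
    detail: str = ""


def check_stage1_call(call: Optional[dict[str, Any]]) -> Verdict:
    """Gate a Stage 1 tool call before Stage 2 executes it."""
    # No call at all: a punt (only a failure if the request needed a tool).
    if call is None:
        return Verdict(False, "punt", "model answered without calling a tool")

    # Tool name not in the registry: the model invented a tool.
    spec = TOOL_SPECS.get(call.get("name"))
    if spec is None:
        return Verdict(False, "tool hallucination", f"unknown tool {call.get('name')!r}")

    # Known tool: check that every expected argument is present and typed correctly.
    args = call.get("arguments", {})
    for key, expected in spec.items():
        if key not in args:
            return Verdict(False, "wrong-arguments", f"missing argument {key!r}")
        if not isinstance(args[key], expected):
            return Verdict(False, "wrong-arguments", f"{key!r} should be {expected.__name__}")

    # Reject arguments the tool does not declare.
    extra = set(args) - set(spec)
    if extra:
        return Verdict(False, "wrong-arguments", f"unexpected arguments {sorted(extra)}")

    return Verdict(True)
```

Logging the returned sub_failure label alongside each rejected call is one way to get a per-category count of Bucket 1 failures before debugging any single case.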

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The lecturer notes that the cheatsheet does not separately cover this tool-use failure-mode taxonomy; for that material, the lecture itself (and this lesson) is the primary source.

Community discussion

None selected for this lesson. The literature on production tool-use debugging is largely in vendor docs and team writeups; durable academic-grade community references are still consolidating. The OpenTelemetry/observability-tooling ecosystem will likely produce shared resources here over time.