
References: Why tool-using models fail

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 8, LLM Evaluation):
see course site at https://cme295.stanford.edu/ for the lecture URL
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the tool-use failure-modes section of Stanford CME 295
Lecture 8, covering [01:01:19-01:06:16] Bucket 1 sub-failures (punt, tool
hallucination, wrong-tool selection, wrong-arguments), [01:12:00-01:18:23]
Bucket 2 sub-failures (tool errors, tool returned nothing), and
[01:18:23-01:21:16] Bucket 3 sub-failures (didn't ground, buried in noise,
poorly structured). The taxonomy is the lecturer's own, and the lecturer
notes that the cheatsheet does not cover it separately. Clawdemy
provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

The published research behind tool-use evaluation and debugging.

A short list, chosen for durability.

  • “Survey on Evaluation of Large Language Models with Tool Use”, Wang et al., 2024. Surveys evaluation methodology specifically for tool-using LLMs. Useful for understanding how the field formally measures the failure modes that this lesson's taxonomy organizes.

  • “AgentBench”, Liu et al., 2023. A benchmark suite for evaluating LLM agents (tool-using systems running observe-plan-act loops, Phase 6’s territory). Useful for connecting this lesson’s failure-mode taxonomy to systematic agent evaluation.

  • Production tool-use observability. The lesson’s “categorize before chasing” methodology is hard to apply at scale without observability tooling that lets you see the structured Stage 1 calls and Stage 2 returns separately. Search terms: “LLM observability,” “agent tracing,” “OpenTelemetry for AI agents.” Most of this material lives in vendor product docs (Helicone, Langfuse, OpenLLMetry, etc.) rather than in academic papers.

  • Argument validation patterns. Argument hallucination (Bucket 1’s wrong-arguments sub-failure) can be largely mitigated by JSON-schema validation between Stages 1 and 2. Search terms: “structured outputs validation,” “JSON schema enforcement in LLM APIs,” “Pydantic for LLM tool calls.” The practical engineering patterns are documented in vendor SDKs.

  • The “AI bug report” as a categorization problem. This lesson’s framing applies more broadly: most production AI bugs benefit from being categorized before being chased. Search terms: “LLM evaluation pipelines,” “production AI debugging,” “AI feature observability.”

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The lecturer notes that the cheatsheet does not separately cover this tool-use failure-mode taxonomy; for that material, the lecture itself (and this lesson) is the primary source.
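
To make the argument-validation entry in the list above concrete, here is a minimal sketch of a check that could sit between Stage 1 (the model proposes a call) and Stage 2 (the tool runs). It is not taken from the lecture or from any vendor SDK. The `WEATHER_SCHEMA` tool, its fields, and the `validate_arguments` helper are all hypothetical, and the hand-rolled checker only stands in for a real JSON Schema or Pydantic validator.

```python
from typing import Any

# Hypothetical tool schema in a small JSON-Schema-like subset.
# The tool and its fields are invented for illustration.
WEATHER_SCHEMA = {
    "required": ["city", "unit"],
    "properties": {
        "city": {"type": str},
        "unit": {"type": str, "enum": ["celsius", "fahrenheit"]},
    },
}


def validate_arguments(args: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    properties = schema["properties"]

    # Every required argument must be present.
    for name in schema["required"]:
        if name not in args:
            problems.append(f"missing required argument: {name}")

    # Every supplied argument must be known, well typed, and in range.
    for name, value in args.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        spec = properties[name]
        if not isinstance(value, spec["type"]):
            problems.append(
                f"wrong type for {name}: expected {spec['type'].__name__}"
            )
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


if __name__ == "__main__":
    # A model-proposed call with an invented unit and an invented argument.
    proposed = {"city": "Paris", "unit": "kelvin", "day": 3}
    for problem in validate_arguments(proposed, WEATHER_SCHEMA):
        print(problem)
```

Rejecting the call before the tool runs keeps a wrong-arguments failure in Bucket 1, where it belongs, so that it does not surface later as a confusing Bucket 2 tool error.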

None selected for this lesson. The literature on production tool-use debugging is largely in vendor docs and team writeups; durable academic-grade community references are still consolidating. The OpenTelemetry/observability-tooling ecosystem will likely produce shared resources here over time.
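
Until shared resources exist, the "categorize before chasing" step can be prototyped on top of whatever traces a team already logs. The sketch below is one possible shape, not the lecturer's method. The `Trace` fields, the `KNOWN_TOOLS` registry, and the `categorize` function are assumptions made for illustration. It covers only the sub-failures that can be detected mechanically from a single record; wrong-tool selection, buried in noise, and poorly structured usually need a human or a labeled expectation to judge.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry of tools the model is allowed to call.
KNOWN_TOOLS = {"get_weather", "search_docs"}


@dataclass
class Trace:
    """One logged interaction; every field name here is an assumption."""
    called_tool: Optional[str]         # Stage 1: chosen tool, None if the model punted
    args_valid: bool = True            # Stage 1: did the arguments pass validation?
    tool_error: Optional[str] = None   # Stage 2: error raised by the tool, if any
    tool_output: Optional[str] = None  # Stage 2: what the tool returned
    answer_used_output: bool = True    # Bucket 3 signal: did the answer use the output?


def categorize(trace: Trace) -> str:
    """Assign a trace to the first bucket whose check fails."""
    # Bucket 1: the call itself was missing or malformed.
    if trace.called_tool is None:
        return "bucket 1: punt"
    if trace.called_tool not in KNOWN_TOOLS:
        return "bucket 1: tool hallucination"
    if not trace.args_valid:
        return "bucket 1: wrong arguments"
    # Bucket 2: the call was fine but the tool did not deliver.
    if trace.tool_error is not None:
        return "bucket 2: tool error"
    if not trace.tool_output:
        return "bucket 2: tool returned nothing"
    # Bucket 3: the tool delivered but the answer ignored the result.
    if not trace.answer_used_output:
        return "bucket 3: didn't ground"
    return "no failure detected"


if __name__ == "__main__":
    print(categorize(Trace(called_tool="get_weather", tool_error="timeout")))
```

Checking the buckets in order means each trace gets exactly one label, which makes it straightforward to count failures per bucket before deciding which one to chase.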