
Cheatsheet: Why tool-using models fail

Tool-use failures fall into three buckets matching
the three stages of a function call.
Categorize before chasing. Most "AI is broken" cases
resolve cleanly once placed in the right bucket.

| Bucket | Stage of the function call | What goes wrong |
| --- | --- | --- |
| 1. Tool prediction | Stage 1 (LLM picks function + arguments) | LLM picked wrong (or picked nothing) |
| 2. Tool execution | Stage 2 (code runs the function) | Tool itself misbehaved |
| 3. Response synthesis | Stage 3 (LLM wraps response in natural language) | LLM mishandled a correct response |

Bucket 1 sub-failures (tool prediction)

| Sub-failure | Recognizable signature | Common cause | First fix |
| --- | --- | --- | --- |
| Punt | “I’m not sure” or “I can’t help” when a tool could answer | Tool router miss, or SFT/prompt issue | Check router; add SFT examples; tighten system prompt |
| Tool hallucination | Function name that doesn’t exist | Model too weak, or API poorly named | Upgrade model; rewrite docstring + names |
| Wrong-tool selection | Real function call, but wrong function | Tool descriptions overlap in scope | Rewrite descriptions to be precise about scope |
| Wrong-arguments | Right function, wrong values (fabricated ID, “0,0” coords) | Missing context, or argument-format confusion | Ensure context in prompt; tighten docstring with examples |

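Several of the Bucket 1 signatures can be detected mechanically before any tool runs. The following is a minimal Python sketch, assuming a predicted call arrives as a dict with `name` and `arguments` keys. `TOOLS`, `find_toy`, `get_store_hours`, and `check_tool_call` are hypothetical names used only for illustration, not part of any particular framework.

```python
import inspect
from typing import Optional


def find_toy(name: str) -> dict:
    """Look up a toy by its exact catalog name, e.g. name="Teddy Bear"."""
    return {"name": name, "in_stock": True}


def get_store_hours(store_id: str) -> dict:
    """Return opening hours for one store, e.g. store_id="store-042"."""
    return {"store_id": store_id, "hours": "09:00-18:00"}


# Hypothetical registry: the only tools the model is allowed to call.
TOOLS = {"find_toy": find_toy, "get_store_hours": get_store_hours}


def check_tool_call(call: Optional[dict]) -> str:
    """Classify a predicted call before running it."""
    if call is None:
        return "punt"  # the model produced no tool call at all
    func = TOOLS.get(call.get("name"))
    if func is None:
        return "tool_hallucination"  # function name is not in the registry
    try:
        # bind() raises TypeError on missing or unexpected argument names.
        inspect.signature(func).bind(**call.get("arguments", {}))
    except TypeError:
        return "wrong_arguments"
    return "ok"
```

A check like this only catches punts, nonexistent functions, and argument-shape errors. Wrong-tool selection and fabricated values (a made-up ID that is still a valid string) pass it and still need review against the user's request.
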

Bucket 2 sub-failures (tool execution)

| Sub-failure | Recognizable signature | Fix |
| --- | --- | --- |
| Tool returned an error | Function hit an exception | Sanitize raw errors into structured responses; distinguish actionable vs internal errors |
| Tool returned nothing | None or empty silent return | Always return structured output. Empty JSON {} beats None. |
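
Both Bucket 2 fixes can live in one wrapper around each tool. The sketch below is one possible shape, not a prescribed implementation: `structured_tool`, `ActionableToolError`, the `status`/`actionable`/`message` keys, and the sample `find_toy` tool are all assumptions made for illustration.

```python
import functools


class ActionableToolError(Exception):
    """Hypothetical error type for problems the user or model can act on."""


def structured_tool(func):
    """Wrap a tool so it always returns a dict, never None or a raw traceback."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except ActionableToolError as exc:
            # Safe to show the model: it can relay this or retry differently.
            return {"status": "error", "actionable": True, "message": str(exc)}
        except Exception:
            # Internal failure: hide the details but keep the response shape.
            return {"status": "error", "actionable": False,
                    "message": "Internal tool error."}
        # Normalize a silent None into an explicit empty result.
        return {"status": "ok", "result": {} if result is None else result}
    return wrapper


@structured_tool
def find_toy(name: str):
    inventory = {"Teddy Bear": {"name": "Teddy Bear", "aisle": 7}}  # sample data
    if not name:
        raise ActionableToolError("A toy name is required.")
    return inventory.get(name)  # None when the toy is unknown
```

With this wrapper, an unknown toy comes back as an explicit empty result rather than `None`, and an unexpected exception never reaches the model as a raw stack trace.
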

Bucket 3 sub-failures (response synthesis)

| Sub-failure | Recognizable signature | Fix |
| --- | --- | --- |
| Didn’t ground | Tool returned answer; model ignored it | Upgrade model (rare on modern frontier models) |
| Buried in noise | Answer present in tool response, but among many fields the model couldn’t filter | Trim tool output to only what’s needed; post-process inside function |
| Poorly structured | Tool output is flat or vaguely named ({"data": "Teddy Bear at coords"}) | Use named-field structured responses ({"name": "Teddy Bear", "coordinates": "..."}); Python dataclass or Pydantic |

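The "buried in noise" and "poorly structured" fixes combine naturally: post-process inside the function and return only named fields. Here is a small sketch using a standard-library dataclass. `ToyLocation`, `to_tool_response`, and every value in `raw_record` are made-up illustrations, including the coordinate string.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyLocation:
    """Only the fields the model needs, each with a semantic name."""
    name: str
    coordinates: str


def to_tool_response(raw: dict) -> dict:
    """Keep the named fields the model needs; drop everything else."""
    location = ToyLocation(name=raw["name"], coordinates=raw["coordinates"])
    return asdict(location)


# Hypothetical raw backend record with fields the model does not need.
raw_record = {
    "name": "Teddy Bear",
    "coordinates": "12.5,48.1",
    "sku": "TB-001",
    "warehouse_notes": "restocked",
    "internal_id": 9931,
}

response = to_tool_response(raw_record)  # only name and coordinates remain
```

A Pydantic model could play the same role as the dataclass; the dataclass is used here only to keep the sketch dependency-free.
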
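STEP 1: Did the model produce a correct tool call?
No call at all → Bucket 1 (sub-failure: punt)
Call to a nonexistent function, the wrong function, or with wrong arguments → Bucket 1
Correct call → continue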
STEP 2: Did the function run AND return useful data?
Function errored or returned nothing useful → Bucket 2
Function returned good data → continue
STEP 3: The model's final response is wrong.
It must be Bucket 3 (response synthesis).
WITHIN A BUCKET:
Match the failure to one of the named sub-failures.
Apply the corresponding fix.
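
The triage steps can be written down as a small function, which is handy when labelling failures in logs. This is a sketch under stated assumptions: `categorize_failure` and its return labels are hypothetical, `call_is_correct` is a judgment supplied by a reviewer or a validator, and tool results are assumed to be dicts that carry `"status": "error"` on failure. Treating a malformed call as Bucket 1 follows the Bucket 1 sub-failure list (tool hallucination, wrong-tool selection, wrong-arguments).

```python
from typing import Optional


def categorize_failure(tool_call: Optional[dict],
                       call_is_correct: bool,
                       tool_result: Optional[dict]) -> str:
    """Place a failed interaction into one of the three buckets."""
    # STEP 1: was there a tool call, and was it the right one?
    if tool_call is None:
        return "bucket_1_punt"
    if not call_is_correct:
        return "bucket_1_bad_call"
    # STEP 2: did the function run and return useful data?
    if not tool_result:  # None or an empty dict
        return "bucket_2"
    if tool_result.get("status") == "error":
        return "bucket_2"
    # STEP 3: the call and the data were fine, so the final answer is at fault.
    return "bucket_3"
```

For example, `categorize_failure({"name": "find_toy"}, True, None)` lands in Bucket 2, which points at the tool's return value and away from the model.
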
WITHOUT categorization:
→ "the AI is broken"
→ reach for AI fixes (more SFT, bigger model)
→ may not be the right fix at all
WITH categorization:
→ "this is a Bucket 2 fail (tool returned None)"
→ fix: have the tool return structured output
→ 5-line code change, no AI work needed

| Tool-implementation polish | Why it matters |
| --- | --- |
| Clear, specific function names | Reduces tool-hallucination + wrong-tool failures |
| Precise docstrings (with argument examples) | Reduces wrong-arguments failures |
| Always return structured responses | Eliminates Bucket 2 “returned nothing” failures |
| Distinguish actionable vs internal errors | Makes Bucket 2 errors useful instead of harmful |
| Trim tool outputs to what’s needed | Reduces Bucket 3 “buried in noise” failures |
| Use semantic field names | Reduces Bucket 3 “poorly structured” failures |

None of this is AI work. All of it makes AI features more reliable.

| Pitfall | Reality |
| --- | --- |
| “The AI hallucinated, period.” | Of the named sub-failures, only one (tool hallucination) is actually a hallucination. Most are non-hallucination failures with specific non-AI fixes. |
| “Upgrading the model fixes everything.” | Often it doesn’t. A Bucket 2 failure persists across model upgrades. Categorize first. |
| “Every failure needs a different fix.” | True only when failures are not categorized. Within a bucket, fixes follow a pattern. |
| “Tool implementation is just engineering glue.” | False. Tool quality is the highest-leverage work for making AI features reliable. |

  • Tool prediction: Stage 1 of the function-call mechanism; LLM picks function and arguments. Bucket 1 of failures.
  • Tool execution: Stage 2; regular code runs the function. Bucket 2 of failures.
  • Response synthesis: Stage 3; LLM wraps the structured response in natural language. Bucket 3 of failures.
  • Punt: assistant-design term for when the model says “I can’t help” instead of using an available tool.
  • Tool hallucination: model emits a call to a function name that doesn’t exist.
  • Tool router (or tool selector): intermediary system that filters the list of available tools before showing them to the LLM. Used at scale when the tool inventory is large.
  • Grounding: model’s ability to use information that’s present in its context (in this case, the structured tool response).
  • Structured output: API feature that guarantees the LLM (or tool) produces output matching a JSON schema. Required for production reliability.

Tool-use failures fall into three buckets: tool prediction, tool execution, response synthesis.
Categorize the failure before chasing the fix. Most “AI is broken” cases resolve cleanly once placed.
Tool quality (names, docstrings, structured outputs) is the high-leverage non-AI work that makes AI features reliable.