Summary: Why tool-using models fail

Most ‘the AI did X wrong’ bugs in tool-using systems are tool-use failures, and they cluster into three buckets. The buckets map onto the three stages of a function call from Phase 6: the LLM picks the function (Stage 1), code runs the function (Stage 2), and the LLM wraps the tool’s response in natural language (Stage 3). Each stage has its own characteristic failure modes.

The lecturer’s discipline runs through Phase 7: categorize errors before chasing them. Identify the bucket first, then the specific sub-failure, then the fix. Each bucket has named sub-failures with their own debug recipes; trying to fix tool-use bugs without categorizing usually wastes time.

Bucket 1, tool prediction (LLM picked wrong). Punt (didn’t use the tool when it should have), tool hallucination (called a function that doesn’t exist), wrong-tool selection (picked the wrong one from a set), wrong-arguments (right tool, wrong values).

Bucket 2, tool execution (tool itself misbehaved). Returned an error, returned nothing meaningful.

Bucket 3, response synthesis (LLM mishandled a correct response). Didn’t ground on the structured output, output buried under irrelevant data, output not structured meaningfully.
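
A minimal sketch of the categorize-first step, in Python. The Trace record, its field names, the "error" key convention, and the label strings are all assumptions made for illustration, not something the lecture defines. The checks walk the three stages in order and report the first one that went wrong.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Trace:
    """Hypothetical record of one tool-using turn (all field names are assumptions)."""
    tool_expected: Optional[str]  # tool a reviewer says should have been called, or None
    tool_called: Optional[str]    # tool the model actually named, or None
    args_ok: bool                 # did the arguments match what the reviewer expected?
    tool_result: Any              # whatever the tool returned
    answer_uses_result: bool      # does the final answer reflect the tool result?


def classify(trace: Trace, registry: set[str]) -> str:
    """Return a 'bucket/sub-failure' label for the first stage that failed."""
    if trace.tool_expected is None and trace.tool_called is None:
        return "no-tool-turn"  # nothing to triage in this sketch

    # Bucket 1: tool prediction (Stage 1).
    if trace.tool_called is None:
        return "prediction/punt"
    if trace.tool_called not in registry:
        return "prediction/tool-hallucination"
    if trace.tool_called != trace.tool_expected:
        return "prediction/wrong-tool"  # also covers calling a tool when none was needed
    if not trace.args_ok:
        return "prediction/wrong-arguments"

    # Bucket 2: tool execution (Stage 2).
    if isinstance(trace.tool_result, dict) and "error" in trace.tool_result:
        return "execution/error"
    if not trace.tool_result:  # None, {}, "" and similar
        return "execution/empty"

    # Bucket 3: response synthesis (Stage 3). Telling the three sub-failures
    # apart needs a human look at the tool output, so only the bucket is reported.
    if not trace.answer_uses_result:
        return "synthesis/needs-inspection"
    return "ok"
```

The ordering is the design choice here: a later-stage symptom is only trusted once every earlier stage has been ruled out, which is what keeps a tool-execution problem from being misread as a model problem.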

This summary is the scan-it-in-five-minutes version. The full lesson covers all sub-failures with their causes and fixes plus the practitioner debugging recipe.

Core ideas

Three buckets. Tool prediction (Stage 1), tool execution (Stage 2), response synthesis (Stage 3). Each maps to a stage of the function call from Phase 6.
Bucket 1 sub-failures. Punt: didn’t use the available tool. Hallucination: invented a function name. Wrong-tool: picked the wrong one. Wrong-arguments: right tool, wrong values.
Bucket 1 causes (each one points to its fix). Punt: a tool-router miss, or an SFT/prompt issue. Hallucination: the model is too weak, or the API is poorly named. Wrong-tool: tool descriptions overlap. Wrong-arguments: missing context, or confusion about the argument format.
Bucket 2 sub-failures. Tool returned an error (sometimes useful, often harmful). Tool returned nothing or None.
Bucket 2 fixes. Errors should be sanitized into structured responses; tools should always return meaningful structured output (empty JSON beats None).
Bucket 3 sub-failures. Didn’t ground (rare on modern models). Buried in noise (output too verbose for the model to parse). Poorly structured (no semantic field names).
Bucket 3 fixes. Trim tool outputs to what’s needed; use named-field structured responses; upgrade the model only as a last resort.
Pitfall: “the model hallucinated” is too coarse. Of the nine named sub-failures, only one (tool hallucination) is what people usually mean by hallucination. The other eight are non-hallucination failures with specific fixes.
Pitfall: fixing in the wrong bucket. Common pattern: a vague answer leads the team to upgrade the model when the real failure was a tool-execution issue. Categorizing first prevents this.
Tool-implementation quality is high-leverage. Names, docstrings, structured outputs, error handling. None of this is AI work; all of it makes AI features more reliable.
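
The Bucket 2 and Bucket 3 fixes listed above (sanitized errors, never returning None, trimmed output with named fields) can be sketched as one wrapper around a tool. This is an illustration under stated assumptions, not the lecture’s implementation: safe_tool, lookup_order, the "status"/"data" envelope, and the order fields are all hypothetical names.

```python
import json
from typing import Any, Callable, Optional


def safe_tool(fn: Callable[..., Optional[dict]], keep: tuple[str, ...]) -> Callable[..., str]:
    """Wrap a tool so it always returns a JSON object with named fields."""
    def wrapped(**kwargs: Any) -> str:
        try:
            raw = fn(**kwargs)
        except Exception as exc:
            # Sanitize: error type and a short message, never a raw traceback.
            return json.dumps({
                "status": "error",
                "error_type": type(exc).__name__,
                "message": str(exc)[:200],
            })
        if not raw:
            # An explicit empty result beats None.
            return json.dumps({"status": "empty", "data": {}})
        # Trim: pass on only the fields the model needs to answer.
        data = {key: raw[key] for key in keep if key in raw}
        return json.dumps({"status": "ok", "data": data})
    return wrapped


def lookup_order(order_id: str) -> Optional[dict]:
    """Hypothetical tool backed by an in-memory table."""
    if not order_id:
        raise ValueError("order_id is required")
    orders = {
        "A1": {"order_id": "A1", "state": "shipped", "eta_days": 2,
               "internal_notes": "warehouse 7, pallet 12"},
    }
    return orders.get(order_id)


tool = safe_tool(lookup_order, keep=("order_id", "state", "eta_days"))
print(tool(order_id="A1"))  # trimmed, named fields
print(tool(order_id="ZZ"))  # explicit empty result
print(tool(order_id=""))    # sanitized error
```

None of the wrapper touches the model, which is the point of the last item: this is ordinary engineering that changes what the model has to work with.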

What changes for you

After this lesson, “the AI is broken” stops being a useful frame. Bug reports against tool-using AI systems become triage-able: which bucket is this failure in, which sub-failure within that bucket, what’s the matching fix? You can also recognize when the right fix is engineering polish on the tool (better docstring, better structured response, meaningful error handling) rather than AI work (better model, more SFT data). The high-leverage work is often the unglamorous engineering work, and that’s a useful thing to know.
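
As an illustration of what “better docstring” can mean in practice, here is a hedged Python sketch of two tools whose names and docstrings are written so they do not overlap, plus a helper that turns a function into the description a model would see. The tool names, parameters, and the shape of the description dict are assumptions for this example; real function-calling APIs each define their own schema.

```python
import inspect
from typing import Callable


def get_order_by_id(order_id: str) -> dict:
    """Fetch one order by its order ID. Use only when the user supplies an order ID."""
    return {"order_id": order_id, "orders": []}  # placeholder body


def search_orders_by_customer(customer_email: str) -> dict:
    """List all orders for one customer. Use when the user gives an email, not an order ID."""
    return {"customer_email": customer_email, "orders": []}  # placeholder body


def describe(fn: Callable) -> dict:
    """Build a tool description from the function's own name, docstring, and parameters."""
    signature = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": list(signature.parameters),
    }


for tool in (get_order_by_id, search_orders_by_customer):
    print(describe(tool))
```

Each docstring says when to use the tool and, implicitly, when not to, which targets the wrong-tool sub-failure caused by overlapping descriptions.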

Tool-use failures fall into three buckets: tool prediction, tool execution, response synthesis.
Categorize the failure before chasing the fix. Most “AI is broken” cases resolve cleanly once placed.
Tool quality (names, docstrings, structured outputs) is the high-leverage non-AI work that makes AI features reliable.