Building trustworthy agents, in brief

What you’ll learn

This is lesson 10 of Track 20 (AI Agents and Tool Use) and the opener of Phase 3, Building agents you can trust and ship. The first nine lessons made an agent capable. This phase changes the question: capable is necessary, but it is not the same as ready. An agent that works in a demo, on the happy path, is not yet one you would let act on a real customer’s account.

The gap is failure modes. This lesson first holds a clean boundary: an agent does harm for one of two reasons, either it fails on its own (no adversary) or someone attacks it, and this lesson is strictly about the first kind. You will learn the six characteristic ways an agent fails on its own (hallucinated tool calls, runaway loops, confidently wrong answers, mishandled tool failures, silently incomplete work, and data over-exposure) and the guardrail that contains each. Most of those guardrails are pieces you have already met: tool-call validation and good tool descriptions, loop and retry budgets, a reflection step and output validation, proper tool-error handling, and input-completeness checks. The lesson then gives the organizing principle: match the guardrail to the blast radius, with human-in-the-loop confirmation reserved for hard-to-reverse actions. It also stays honest that guardrails reduce risk without erasing it.
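You do not need to read code to follow this lesson, but for readers who find it helpful, here is a minimal Python sketch of three of those guardrails working together: tool-call validation, a step budget, and a confirmation gate for hard-to-reverse actions. Everything in it is an assumption made for illustration: the tool names (`lookup_order`, `issue_refund`), the `reversible` flag, the `MAX_STEPS` value, and the `confirm` callback are hypothetical and do not come from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    run: Callable[..., str]
    required_args: frozenset
    reversible: bool  # False marks a hard-to-reverse action (large blast radius)


# Hypothetical tool registry, for illustration only.
TOOLS = {
    "lookup_order": Tool(
        run=lambda order_id: f"order {order_id}: shipped",
        required_args=frozenset({"order_id"}),
        reversible=True,
    ),
    "issue_refund": Tool(
        run=lambda order_id, amount: f"refunded {amount} on {order_id}",
        required_args=frozenset({"order_id", "amount"}),
        reversible=False,
    ),
}

MAX_STEPS = 5  # assumed loop budget; a real value depends on the task


def guarded_call(name, args, confirm):
    """Run one proposed tool call only if it passes the guardrails."""
    tool = TOOLS.get(name)
    if tool is None:
        # Guardrail against hallucinated tool calls: the tool must exist.
        return f"error: unknown tool {name!r}"
    if set(args) != tool.required_args:
        # Tool-call validation: argument names must match the tool's contract.
        return f"error: {name} expects arguments {sorted(tool.required_args)}"
    if not tool.reversible and not confirm(name, args):
        # Blast-radius rule: hard-to-reverse actions need a human's approval.
        return f"blocked: {name} needs human confirmation"
    return tool.run(**args)


def run_agent(proposed_calls, confirm):
    """Execute proposed calls in order, stopping when the step budget runs out."""
    results = []
    for step, (name, args) in enumerate(proposed_calls):
        if step >= MAX_STEPS:
            # Guardrail against runaway loops.
            results.append("stopped: step budget exhausted")
            break
        results.append(guarded_call(name, args, confirm))
    return results
```

In this sketch the read-only lookup runs without asking anyone, while the refund is blocked unless `confirm` returns true. That is the blast-radius idea in miniature: the check gets heavier only where a mistake would be hard to undo. The sketch reduces risk rather than removing it; a well-formed call with the right argument names can still carry the wrong values.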

The track structurally mirrors Microsoft’s “AI Agents for Beginners” (MIT-licensed), with the Berkeley CS294 LLM Agents course as a depth reference. This lesson’s six-mode failure taxonomy is Clawdemy framing, used to hold a clean boundary between trustworthiness and security; full attribution and a source-scope note are in this lesson’s references.

Where this fits

This lesson begins the shift from building agents that work to building agents that are safe to put in front of real users. It gathers guardrails from across the track: good tool descriptions (the tool-use lesson), tool-error handling (the tool-use-as-an-agent lesson), and the reflection step (the metacognition lesson) all return here as defenses against specific failure modes. It also sets up the next lesson by drawing the trustworthy-versus-secure line explicitly: everything here assumes no attacker, while the next lesson, on securing agents, takes up the adversarial threat (prompt injection, tool abuse, data exfiltration), which calls for different defenses.

Before you start

Prerequisites: the earlier lessons in the track, especially Agents that self-check (the immediately prior lesson; a reflection step is the guardrail against confidently wrong answers) and The tool-use design pattern in depth (good tool descriptions are the guardrail against hallucinated and misdirected tool calls). You do not need to code. If you understand an agent as a model in a loop with tools, you have the background; this lesson is about the ways that loop goes wrong and how to contain each one.

By the end, you’ll be able to

Distinguish trustworthiness (the agent failing on its own) from security (the agent under attack)
Name the six characteristic own-failure modes of an agent
Match each failure mode to the guardrail that contains it
Apply the blast-radius principle to decide when human-in-the-loop confirmation is warranted
Explain why guardrails reduce risk but do not erase it

Time and difficulty

Read time: about 11 minutes
Practice time: about 18 minutes (a self-check, a classify-the-failure-mode exercise across all six modes, a blast-radius judgment exercise, and flashcards)
Difficulty: standard