Summary: Building trustworthy agents

Capable is not the same as trustworthy, and this lesson is about the agent failing on its own, not under attack. An agent that works in a demo is not one you would let act on a real customer’s account. The gap is failure modes: the specific ways agents go wrong even with no adversary in sight. This lesson names six of them and the guardrail that contains each, then gives the organizing principle (match the guardrail to the blast radius). Adversarial attacks are a different problem, covered in the next lesson. This summary is the scan-in-five-minutes version of the full lesson.

  • Two different reasons an agent does harm. It fails on its own (a mistake, no adversary), or someone attacks it (a malicious input bends it). This lesson is the first kind; the next lesson is the second. They need different defenses, so keep the line clear: everything here assumes no attacker.
  • Failure mode 1, hallucinated tool calls. The agent calls a tool that does not exist, or a real tool with invented arguments. Guardrail: validate every call against real tools and expected argument shapes before running it; good tool descriptions (Lesson 4) reduce these at the source.
  • Failure mode 2, runaway loops. The loop that makes an agent powerful can trap it, retrying or replanning without progress, burning time and money. Guardrail: budget the loop by capping steps, retries, time, or cost, and fail cleanly when a cap is hit.
  • Failure mode 3, confidently wrong answers. A wrong result presented as correct, with no signal anything is off. The most dangerous mode because it is silent. Guardrail: a reflection step (Lesson 9) plus output validation; human review for high stakes.
  • Failure mode 4, mishandled tool failures. The agent ignores a tool error and proceeds as if it succeeded, or panics on a recoverable one. Guardrail: read and act on tool results, never assume success; the error flows back into the loop (Lesson 2).
  • Failure mode 5, missing context and silently incomplete work. The agent acts on partial information and reports the job done. Guardrail: require and validate inputs, and flag the gap instead of papering over it.
  • Failure mode 6, data over-exposure. A correct, complete answer that reveals too much, or reveals it to the wrong recipient, often because a retrieval over-fetched. This is not a wrong answer, not an incomplete one, and not the next lesson’s attacker-driven exfiltration; the agent over-shares on its own. Guardrail: scope output to the requester, give the agent least-privilege data access, and redact unneeded fields.
  • Match the guardrail to the blast radius. A read-only lookup that is wrong costs a re-query; a hard-to-reverse action (money, deletion, outbound messages) deserves a human-in-the-loop checkpoint. The strongest guardrail, human review, belongs on the actions whose mistakes you cannot take back, not on every action.
  • Guardrails reduce risk; they do not erase it. A trustworthy agent is one whose failures are rare, bounded, and visible, not one that cannot fail.
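
To make the first, second, and fourth guardrails concrete, here is a minimal Python sketch of a loop that validates each proposed tool call, feeds errors back to the agent, and stops at a step cap. Everything named here is hypothetical and for illustration only: the `TOOL_SCHEMAS` registry, the tool names, the `propose_call` and `execute` callables, and the default cap of five steps are assumptions, not part of any particular agent framework.

```python
# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount_cents": int},
}


def validate_call(name, args):
    """Return a list of problems; an empty list means the call may run."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        # Failure mode 1: the agent invented a tool.
        return [f"unknown tool: {name}"]
    problems = []
    for key in args:
        if key not in schema:
            problems.append(f"unexpected argument: {key}")
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


class BudgetExceeded(Exception):
    """Raised when the loop hits its step cap without finishing."""


def run_loop(propose_call, execute, max_steps=5):
    """Run until propose_call returns None, or fail cleanly at the cap.

    propose_call(feedback) stands in for the model: it returns
    (tool_name, args) or None when the agent considers the job done.
    execute(tool_name, args) returns a result dict that is passed back
    as feedback, so tool errors are seen rather than assumed away.
    """
    feedback = None
    for step in range(max_steps):
        call = propose_call(feedback)
        if call is None:
            return step  # number of steps actually used
        name, args = call
        problems = validate_call(name, args)
        if problems:
            # Rejected calls never run; the errors go back to the agent.
            feedback = {"ok": False, "errors": problems}
            continue
        feedback = execute(name, args)
    # Failure mode 2: stop instead of retrying forever.
    raise BudgetExceeded(f"no result after {max_steps} steps")
```

In this sketch a rejected call still consumes a step, so an agent that keeps hallucinating tools runs out of budget rather than looping indefinitely. A real system would likely also cap wall-clock time or cost, which is omitted here for brevity.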

Before this lesson, “trustworthy agent” was a vague reassurance. Now it is a concrete checklist: six characteristic own-failures, each with a guardrail you can name, and one principle (gate by blast radius) for deciding how much protection an action needs. When you meet an agent product, you can ask the sharper questions: how does it fail off the happy path, what contains each failure, and does the human checkpoint sit on the actions that are actually hard to reverse? And you can hold the line the marketing blurs: this is the agent failing on its own. What happens when someone attacks it on purpose is a different problem, covered in the next lesson.
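
The blast-radius principle can also be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed design: the three tiers, the action names, the `approve` callback standing in for a human reviewer, and the choice to treat unclassified actions as hard to reverse are all hypothetical.

```python
from enum import Enum


class BlastRadius(Enum):
    READ_ONLY = 1     # a wrong result costs a re-query
    REVERSIBLE = 2    # can be undone after the fact
    IRREVERSIBLE = 3  # money, deletion, outbound messages


# Hypothetical classification of actions; a real system defines its own.
ACTION_RADIUS = {
    "lookup_order": BlastRadius.READ_ONLY,
    "update_draft": BlastRadius.REVERSIBLE,
    "issue_refund": BlastRadius.IRREVERSIBLE,
    "delete_account": BlastRadius.IRREVERSIBLE,
    "send_email": BlastRadius.IRREVERSIBLE,
}


def needs_human_approval(action):
    """Only hard-to-reverse actions get the human checkpoint."""
    # Assumption: an action nobody classified is treated as the strictest tier.
    radius = ACTION_RADIUS.get(action, BlastRadius.IRREVERSIBLE)
    return radius is BlastRadius.IRREVERSIBLE


def dispatch(action, run, approve):
    """Run the action, pausing for approval when the blast radius demands it."""
    if needs_human_approval(action) and not approve(action):
        return {"ok": False, "error": "held for human review"}
    return run(action)
```

With this shape, `dispatch("lookup_order", run, approve)` never consults the reviewer, while `dispatch("issue_refund", run, approve)` runs only if `approve` returns true, which keeps human attention on the actions whose mistakes cannot be taken back.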