Cheatsheet: Building trustworthy agents

Capable is not trustworthy. Trustworthiness is about how an agent behaves when things go wrong, which a demo never shows. This lesson covers the agent failing on its own (no attacker); security (the agent under attack) is the next lesson.

|  | Trustworthy (this lesson) | Secure (next lesson) |
| --- | --- | --- |
| Threat | The agent fails on its own | An attacker bends the agent |
| Example | Hallucinated tool call, runaway loop | Prompt injection, tool abuse |
| Defense | Validation, budgets, reflection, human-in-loop | Adversarial defenses (L11) |

A reflection step does not stop a malicious input. Different problems, different defenses.

The six own-failure modes and their guardrails

| Failure mode | What it looks like | Guardrail |
| --- | --- | --- |
| Hallucinated tool call | Calls a tool that does not exist, or invents arguments | Validate the call against real tools + shapes; good tool descriptions (L4) |
| Runaway loop | Retries/replans without progress; never finishes | Cap steps, retries, time, cost; fail cleanly at the cap |
| Confidently wrong answer | Wrong result, presented as correct, no signal | Reflection step (L9) + output validation; human review for high stakes |
| Mishandled tool failure | Ignores an error, or panics on a recoverable one | Read + act on tool results (L2); never assume success |
| Missing context / silent partial | Acts on incomplete info, reports done | Require + validate inputs; flag gaps instead of papering over |
| Data over-exposure | Right answer, wrong recipient; surfaces fields the task did not need; echoes another user’s data | Scope output to the requester; least-privilege data access; redact unneeded fields; HITL for high-stakes recipients |
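
The first two guardrails (validating a tool call and capping a loop) can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `TOOLS` registry, the tool names, `validate_tool_call`, `StepBudget`, and `run_agent` are all hypothetical names invented for the example, and a real agent would likely use a proper schema library and also cap retries, time, and cost.

```python
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str},
    "convert_currency": {"amount": float, "from_code": str, "to_code": str},
}


def validate_tool_call(name, args):
    """Return a list of problems; an empty list means the call is well-formed."""
    if name not in TOOLS:
        return [f"unknown tool: {name}"]
    schema = TOOLS[name]
    problems = []
    for key in sorted(schema.keys() - args.keys()):
        problems.append(f"missing argument: {key}")
    for key in sorted(args.keys() - schema.keys()):
        problems.append(f"unexpected argument: {key}")
    for key in sorted(schema.keys() & args.keys()):
        if not isinstance(args[key], schema[key]):
            problems.append(f"wrong type for {key}: expected {schema[key].__name__}")
    return problems


class BudgetExceeded(Exception):
    """Raised when the agent tries to take a step beyond its cap."""


@dataclass
class StepBudget:
    max_steps: int
    used: int = 0

    def spend(self):
        if self.used >= self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} reached")
        self.used += 1


def run_agent(propose_call, max_steps=5):
    """Drive a hypothetical agent loop.

    propose_call(step) returns (tool_name, args), or None when the agent is done.
    """
    budget = StepBudget(max_steps)
    log = []
    try:
        while True:
            budget.spend()
            call = propose_call(budget.used)
            if call is None:
                return {"status": "done", "log": log}
            name, args = call
            problems = validate_tool_call(name, args)
            log.append((name, problems or "ok"))
    except BudgetExceeded as exc:
        # Fail cleanly at the cap: report what happened instead of looping on.
        return {"status": "stopped", "reason": str(exc), "log": log}
```

An agent that keeps proposing calls without finishing is stopped at the cap with a visible reason, and a call to a misspelled or invented tool is reported rather than executed.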
  • Validate tool calls (real tool, right argument shape).
  • Cap loops, retries, time, cost.
  • Validate outputs / use structured-output schemas.
  • Add a reflection step (L9) as a self-check.
  • Read and handle tool errors (L2), never assume success.
  • Require complete inputs; flag gaps.
  • Scope outputs to the requester; least-privilege data access; redact unneeded fields.
  • Human-in-the-loop for high-stakes actions.
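
As an illustration of the output-scoping item in the list above, the sketch below filters records down to the requester's own data and to the fields the task needs. The `TASK_FIELDS` allow-list, the task names, and the record layout (`owner_id`, `card_last4`, and so on) are assumptions made up for this example.

```python
# Hypothetical allow-list: which fields each task is permitted to surface.
TASK_FIELDS = {
    "shipping_status": {"order_id", "status", "eta"},
    "billing_summary": {"order_id", "total", "card_last4"},
}


def scope_output(records, requester_id, task):
    """Keep only the requester's own records and only the fields the task needs."""
    if task not in TASK_FIELDS:
        return []  # unknown task: surface nothing rather than everything
    allowed = TASK_FIELDS[task]
    scoped = []
    for record in records:
        if record.get("owner_id") != requester_id:
            continue  # never echo another user's data
        scoped.append({k: v for k, v in record.items() if k in allowed})
    return scoped
```

The design choice here is an allow-list rather than a block-list: a field that nobody thought to classify is dropped by default instead of leaked by default.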
read-only lookup wrong -> cheap, a re-query -> light guardrails
hard-to-reverse action -> money / deletion / outbound -> human-in-the-loop checkpoint

Gate by blast radius. Human review on every trivial action is friction nobody tolerates; human review on no action at all is negligence. Gate the actions whose mistakes you cannot take back.
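
Gating by blast radius could look roughly like the sketch below: reversible actions run directly, while hard-to-reverse ones wait for a human decision. The `BlastRadius` enum, the `ACTION_RADIUS` table, the action names, and the `approve` callback are hypothetical; how approval is actually collected (a ticket, a chat prompt, a review queue) is outside this example.

```python
from enum import Enum


class BlastRadius(Enum):
    REVERSIBLE = "reversible"            # e.g. a read-only lookup; a re-query fixes it
    HARD_TO_REVERSE = "hard_to_reverse"  # e.g. money, deletion, outbound messages


# Hypothetical classification of the actions an agent can take.
ACTION_RADIUS = {
    "lookup_order": BlastRadius.REVERSIBLE,
    "issue_refund": BlastRadius.HARD_TO_REVERSE,
    "delete_account": BlastRadius.HARD_TO_REVERSE,
    "send_email": BlastRadius.HARD_TO_REVERSE,
}


def execute(action, run, approve):
    """Run `run()` for `action`, asking `approve(action)` first if it is hard to reverse."""
    # Unclassified actions default to the stricter tier.
    radius = ACTION_RADIUS.get(action, BlastRadius.HARD_TO_REVERSE)
    if radius is BlastRadius.HARD_TO_REVERSE and not approve(action):
        return {"action": action, "status": "blocked", "reason": "approval denied"}
    return {"action": action, "status": "executed", "result": run()}
```

Note that the approval callback is never consulted for reversible actions, which keeps the human checkpoint reserved for the cases where attention matters.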

Guardrails reduce the rate and cost of failures; they do not erase them. A loop budget does not fix a wrong answer produced in three steps; output validation does not catch judgment calls; a human checkpoint is only as good as the human’s attention. Trustworthy = failures rare, bounded, and visible, not failures impossible.

Common mistakes

  • Assuming a capable agent is trustworthy (the happy path hides the failures).
  • Confusing trustworthiness with security (this lesson vs the next).
  • Trusting a confident answer because it is confident.
  • Gating everything or nothing with human review (gate by blast radius).
  • Treating guardrails as guarantees.

Key terms

  • Trustworthiness: how reliably an agent behaves when it fails on its own (no attacker).
  • Guardrail: a control that contains a failure mode (validation, budget, reflection, human checkpoint).
  • Blast radius: how hard an action is to reverse; sets how strong a guardrail it needs.
  • Human-in-the-loop: a person approves a high-stakes action before the agent takes it.