Cheatsheet: Building trustworthy agents
The one idea
Capable is not trustworthy. Trustworthiness is about how an agent behaves when things go wrong, which a demo never shows. This lesson covers the agent failing on its own (no attacker); security (the agent under attack) is the next lesson.
Trustworthy vs secure
| | Trustworthy (this lesson) | Secure (next lesson) |
|---|---|---|
| Threat | The agent fails on its own | An attacker bends the agent |
| Example | Hallucinated tool call, runaway loop | Prompt injection, tool abuse |
| Defense | Validation, budgets, reflection, human-in-loop | Adversarial defenses (L11) |
A reflection step does not stop a malicious input. Different problems, different defenses.
The six own-failure modes and their guardrails
| Failure mode | What it looks like | Guardrail |
|---|---|---|
| Hallucinated tool call | Calls a tool that does not exist, or invents arguments | Validate the call against real tools + shapes; good tool descriptions (L4) |
| Runaway loop | Retries/replans without progress; never finishes | Cap steps, retries, time, cost; fail cleanly at the cap |
| Confidently wrong answer | Wrong result, presented as correct, no signal | Reflection step (L9) + output validation; human review for high stakes |
| Mishandled tool failure | Ignores an error, or panics on a recoverable one | Read + act on tool results (L2); never assume success |
| Missing context / silent partial | Acts on incomplete info, reports done | Require + validate inputs; flag gaps instead of papering over |
| Data over-exposure | Right answer, wrong recipient; surfaces fields the task did not need; echoes another user’s data | Scope output to the requester; least-privilege data access; redact unneeded fields; human-in-the-loop for high-stakes recipients |
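As an illustration of the first guardrail in the table, here is a minimal sketch of validating a tool call before running it. The `TOOLS` registry, the tool names, and the argument shapes are all hypothetical, invented for this example; a real agent would derive them from its actual tool definitions, and might use a schema library rather than plain type checks.

```python
from typing import Any

# Hypothetical registry: tool name -> required argument names and their types.
TOOLS: dict[str, dict[str, type]] = {
    "get_order": {"order_id": str},
    "refund_order": {"order_id": str, "amount_cents": int},
}


def validate_tool_call(name: str, args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    # A tool the model invented is rejected before anything else is checked.
    if name not in TOOLS:
        return [f"unknown tool: {name!r}"]
    schema = TOOLS[name]
    problems = []
    # Arguments the tool needs but the model did not supply.
    for key in sorted(schema.keys() - args.keys()):
        problems.append(f"missing argument: {key!r}")
    # Arguments the model made up.
    for key in sorted(args.keys() - schema.keys()):
        problems.append(f"unexpected argument: {key!r}")
    # Arguments that are present but have the wrong type.
    for key, expected in schema.items():
        if key in args and not isinstance(args[key], expected):
            problems.append(f"argument {key!r} should be {expected.__name__}")
    return problems
```

Returning the list of problems, rather than raising on the first one, lets the caller feed every issue back to the model in a single correction turn.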
The guardrail toolkit
- Validate tool calls (real tool, right argument shape).
- Cap loops, retries, time, cost.
- Validate outputs / use structured-output schemas.
- Add a reflection step (L9) as a self-check.
- Read and handle tool errors (L2), never assume success.
- Require complete inputs; flag gaps.
- Scope outputs to the requester; least-privilege data access; redact unneeded fields.
- Human-in-the-loop for high-stakes actions.
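Two items from the toolkit above, capping the loop and reading tool results, can be sketched together. This is an illustrative sketch, not a reference implementation: `Budget`, `RunResult`, `run_with_budget`, and the shape of the dict returned by `step` are assumptions made up for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_steps: int = 10
    max_seconds: float = 30.0


@dataclass
class RunResult:
    status: str  # "done", "budget_exceeded", or "tool_error"
    steps: int
    detail: str = ""


def run_with_budget(step: Callable[[int], dict], budget: Budget) -> RunResult:
    """Call `step` until it reports done, reports an error, or the budget is spent.

    `step` is a hypothetical callable returning a dict such as
    {"ok": True, "done": False} or {"ok": False, "error": "timeout"}.
    """
    started = time.monotonic()
    for i in range(budget.max_steps):
        # Time cap: checked before each step so a slow run stops cleanly.
        if time.monotonic() - started > budget.max_seconds:
            return RunResult("budget_exceeded", i, "time limit reached")
        result = step(i)
        # Never assume success: inspect the tool result before moving on.
        if not result.get("ok", False):
            return RunResult("tool_error", i + 1, result.get("error", "unknown error"))
        if result.get("done", False):
            return RunResult("done", i + 1)
    # Step cap: fail cleanly with an explicit status instead of looping on.
    return RunResult("budget_exceeded", budget.max_steps, "step limit reached")
```

The explicit `status` field is the point of the design: a run that hits its cap is reported as such, so the caller cannot mistake it for a finished task.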
Match the guardrail to the blast radius
- read-only lookup wrong -> cheap, a re-query -> light guardrails
- hard-to-reverse action -> money / deletion / outbound -> human-in-the-loop checkpoint

Gate by blast radius. Human review on every trivial action is friction nobody tolerates; on no action is negligence. Gate the actions whose mistakes you cannot take back.
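Gating by blast radius might look like the sketch below. The action names, the `ACTION_RADIUS` mapping, and the `approve` callback (standing in for a real human review step) are hypothetical; how approval is actually collected depends on the system.

```python
from enum import Enum
from typing import Callable


class BlastRadius(Enum):
    LOW = "low"    # read-only, cheap to redo
    HIGH = "high"  # money, deletion, outbound messages


# Hypothetical mapping from action name to blast radius.
ACTION_RADIUS = {
    "lookup_order": BlastRadius.LOW,
    "issue_refund": BlastRadius.HIGH,
    "delete_account": BlastRadius.HIGH,
    "send_email": BlastRadius.HIGH,
}


def execute(action: str, run: Callable[[], str], approve: Callable[[str], bool]) -> str:
    """Run low-radius actions directly; ask a human before high-radius ones."""
    # Unknown actions are treated as high radius, which is the safe default.
    radius = ACTION_RADIUS.get(action, BlastRadius.HIGH)
    if radius is BlastRadius.HIGH and not approve(action):
        return f"blocked: {action} was not approved"
    return run()
```

Only the hard-to-reverse actions pay the cost of a checkpoint, so lookups stay fast while refunds, deletions, and outbound messages wait for a person.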
The honest limit
Guardrails reduce the rate and cost of failures; they do not erase them. A loop budget does not fix a wrong answer produced in three steps; output validation does not catch judgment calls; a human checkpoint is only as good as the human’s attention. Trustworthy = failures rare, bounded, and visible, not failures impossible.
Pitfalls to dodge
- Assuming a capable agent is trustworthy (the happy path hides the failures).
- Confusing trustworthiness with security (this lesson vs the next).
- Trusting a confident answer because it is confident.
- Gating everything or nothing with human review (gate by blast radius).
- Treating guardrails as guarantees.
Words to use precisely
- Trustworthiness: how reliably an agent behaves when it fails on its own (no attacker).
- Guardrail: a control that contains a failure mode (validation, budget, reflection, human checkpoint).
- Blast radius: how hard an action is to reverse; sets how strong a guardrail it needs.
- Human-in-the-loop: a person approves a high-stakes action before the agent takes it.