Trustworthy agents: cheatsheet

The one idea

Capable is not trustworthy. Trustworthiness is about how an agent behaves when things go wrong, which a demo never shows. This lesson covers the agent failing on its own, with no attacker; security, the agent under attack, is the next lesson.

Trustworthy vs secure

	Trustworthy (this lesson)	Secure (next lesson)
Threat	The agent fails on its own	An attacker bends the agent
Example	Hallucinated tool call, runaway loop	Prompt injection, tool abuse
Defense	Validation, budgets, reflection, human-in-loop	Adversarial defenses (L11)

A reflection step does not stop a malicious input. Different problems, different defenses.

The six own-failure modes and their guardrails

Failure mode	What it looks like	Guardrail
Hallucinated tool call	Calls a tool that does not exist, or invents arguments	Validate the call against real tools + shapes; good tool descriptions (L4)
Runaway loop	Retries/replans without progress; never finishes	Cap steps, retries, time, cost; fail cleanly at the cap
Confidently wrong answer	Wrong result, presented as correct, no signal	Reflection step (L9) + output validation; human review for high stakes
Mishandled tool failure	Ignores an error, or panics on a recoverable one	Read + act on tool results (L2); never assume success
Missing context / silent partial	Acts on incomplete info, reports done	Require + validate inputs; flag gaps instead of papering over
Data over-exposure	Right answer, wrong recipient; surfaces fields the task did not need; echoes another user’s data	Scope output to the requester; least-privilege data access; redact unneeded fields; human-in-the-loop for high-stakes recipients
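
Sketch: validating a tool call. A minimal illustration of the first guardrail in the table, checking a proposed call against the real tools and their argument shapes before running it. The registry TOOLS, the tool names get_weather and send_email, and the function validate_tool_call are hypothetical names invented for this example, not part of any library.

```python
# Hypothetical registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str},
    "send_email": {"to": str, "subject": str, "body": str},
}


def validate_tool_call(name, args):
    """Return a list of problems; an empty list means the call is well-formed."""
    if name not in TOOLS:
        return [f"unknown tool: {name!r}"]
    schema = TOOLS[name]
    problems = []
    # Every declared argument must be present and of the declared type.
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key!r}")
        elif not isinstance(args[key], expected):
            problems.append(f"argument {key!r} should be {expected.__name__}")
    # Invented arguments are rejected rather than silently dropped.
    for key in args:
        if key not in schema:
            problems.append(f"unexpected argument: {key!r}")
    return problems


print(validate_tool_call("get_weather", {"city": "Oslo"}))  # []
print(validate_tool_call("get_wether", {"city": "Oslo"}))   # unknown tool
print(validate_tool_call("get_weather", {"town": "Oslo"}))  # missing + unexpected
```

Returning the problems as a list, instead of raising on the first one, lets the caller hand the full set back to the agent so it can correct the call in one pass.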

The guardrail toolkit

Validate tool calls (real tool, right argument shape).
Cap loops, retries, time, cost.
Validate outputs / use structured-output schemas.
Add a reflection step (L9) as a self-check.
Read and handle tool errors (L2), never assume success.
Require complete inputs; flag gaps.
Scope outputs to the requester; least-privilege data access; redact unneeded fields.
Human-in-the-loop for high-stakes actions.
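
Sketch: a budgeted loop that reads tool errors. One possible way to combine two toolkit items, capping the loop and never assuming a tool call succeeded. The names run_with_budget, BudgetExceeded, and step are hypothetical; step stands in for one agent iteration. Only step and time caps are shown; retry and cost caps would follow the same pattern.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a cap is hit, so the caller sees a clean, visible failure."""


def run_with_budget(step, max_steps=5, max_seconds=10.0):
    """Call step(i) until it returns a non-None result or a cap is hit.

    step returns a result when done, None when it needs another iteration,
    and may raise if a tool call fails.
    """
    start = time.monotonic()
    errors = []
    for i in range(max_steps):
        if time.monotonic() - start > max_seconds:
            raise BudgetExceeded(f"time cap hit after {i} steps; errors: {errors}")
        try:
            result = step(i)
        except Exception as exc:
            # Record the tool error and retry within the budget.
            errors.append(str(exc))
            continue
        if result is not None:
            return result
    raise BudgetExceeded(f"step cap of {max_steps} hit; errors: {errors}")


def flaky(i):
    # Fails twice, then succeeds: a recoverable tool failure.
    if i < 2:
        raise RuntimeError("tool timeout")
    return "done"


print(run_with_budget(flaky))

try:
    run_with_budget(lambda i: None, max_steps=3)  # never makes progress
except BudgetExceeded as exc:
    print("stopped:", exc)
```

This sketch treats every exception as recoverable, which is a simplification; a fuller version would separate errors worth retrying from errors that should stop the run at once.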

Match the guardrail to the blast radius

wrong read-only lookup   -> cheap, fixed by a re-query  -> light guardrails
hard-to-reverse action   -> money / deletion / outbound -> human-in-the-loop checkpoint

Gate by blast radius. Human review on every trivial action is friction nobody tolerates; human review on no action is negligence. Gate the actions whose mistakes you cannot take back.
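
Sketch: a blast-radius gate. A minimal illustration of gating only the hard-to-reverse actions behind a human checkpoint. The action names, the IRREVERSIBLE set, and the run and approve callables are all hypothetical; which actions count as irreversible is a policy decision for each deployment.

```python
# Hypothetical set of actions whose mistakes cannot be taken back.
IRREVERSIBLE = {"send_payment", "delete_record", "send_email"}


def execute(action, args, run, approve):
    """Run cheap-to-undo actions directly; ask a human before the rest.

    run performs the action; approve asks a person and returns True or False.
    Both are supplied by the caller.
    """
    if action in IRREVERSIBLE and not approve(action, args):
        return {"status": "rejected", "action": action}
    return {"status": "done", "action": action, "result": run(action, args)}


def run(action, args):
    return f"ran {action}"


def deny(action, args):
    # Stand-in for a reviewer who declines every request.
    return False


# A read-only lookup skips the checkpoint; a payment does not.
print(execute("lookup_order", {"id": 7}, run, deny))
print(execute("send_payment", {"amount": 100}, run, deny))
```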

The honest limit

Guardrails reduce the rate and cost of failures; they do not erase them. A loop budget does not fix a wrong answer produced in three steps; output validation does not catch judgment calls; a human checkpoint is only as good as the human’s attention. Trustworthy = failures rare, bounded, and visible, not failures impossible.
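
Sketch: what output validation can and cannot see. A small illustration of the second limit above, assuming a hypothetical refund-decision output with the fields approved and amount. The function validate_refund is invented for this example. It checks shape and range only, so a well-formed but mistaken decision passes.

```python
def validate_refund(output):
    """Check shape and range only; return a list of problems."""
    problems = []
    if not isinstance(output.get("approved"), bool):
        problems.append("approved must be a bool")
    amount = output.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount must be a number")
    elif amount < 0:
        problems.append("amount must be non-negative")
    return problems


# Malformed output is caught.
print(validate_refund({"approved": "yes", "amount": -5}))
# Well-formed output passes, whether or not approving was the right call.
print(validate_refund({"approved": True, "amount": 40.0}))
```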

Pitfalls to dodge

Assuming a capable agent is trustworthy (the happy path hides the failures).
Confusing trustworthiness with security (this lesson vs the next).
Trusting a confident answer because it is confident.
Gating everything or nothing with human review (gate by blast radius).
Treating guardrails as guarantees.

Words to use precisely

Trustworthiness: how well an agent keeps its own failures (no attacker) rare, bounded, and visible.
Guardrail: a control that contains a failure mode (validation, budget, reflection, human checkpoint).
Blast radius: how hard an action is to reverse; sets how strong a guardrail it needs.
Human-in-the-loop: a person approves a high-stakes action before the agent takes it.