
References: Building trustworthy agents

Source curriculum (structural mirror, cited as further study):
• Microsoft, "Building Trustworthy AI Agents" (AI Agents for Beginners, Lesson 06)
Author: Microsoft Cloud Advocates
Lesson page: https://github.com/microsoft/ai-agents-for-beginners/tree/main/06-building-trustworthy-agents
License: MIT
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as the recommended
companion. All rights to the original materials remain with their authors.
Source-scope note: Microsoft Lesson 06 mixes trustworthiness (the agent's own
reliability) with security (adversarial threats), and its content leans toward
the adversarial side (threat understanding, input filtering, access control).
This lesson deliberately holds a clean boundary: it covers only the agent's OWN
failure modes and reserves the adversarial material (prompt injection, tool
abuse, exfiltration) for the next lesson on securing agents. The six-mode
failure taxonomy here is Clawdemy framing, supplied to keep that boundary clean
and to deliver the capability this lesson sets out to teach. The parts of Lesson 06 that fit
trustworthiness directly (human-in-the-loop, conversation/turn limits, input
validation) are reflected here as guardrails.
  • Building Trustworthy AI Agents (Microsoft Cloud Advocates). The practitioner version, strong on human-in-the-loop design and the threat-aware side of trust, with runnable samples. MIT-licensed. Read it alongside this lesson’s failure-mode taxonomy for both halves of the picture, and note that much of its threat content maps to our next lesson on securing agents.

A short, durable list.

  • A Practical Guide to Building Agents (OpenAI). A practitioner guide that covers guardrails, human oversight, and safe action-taking for agents. Good on the same blast-radius reasoning this lesson uses for high-stakes actions.
  • Building Effective Agents (Anthropic). Patterns for reliable agent design, including when to keep a human in the loop and how to bound agent behavior. Practical and provider-grounded.

Where this leads inside this track.

  • Securing agents. The next lesson. The other half of “agents you can trust and ship”: defending against an attacker who feeds the agent malicious input to hijack it, abuse its tools, or extract data. Different threat, different defenses.
  • The tool-use design pattern in depth. Earlier in the track. Good tool descriptions are a trustworthiness guardrail: clear definitions reduce hallucinated and misdirected tool calls at the source.
  • Agents that self-check: metacognition. The previous lesson. A reflection step is one of the guardrails against confidently wrong answers, so self-checking and trustworthiness are tightly linked.