Skip to content

Securing agents: defending against an attacker

This is lesson 11 of Track 20 (AI Agents and Tool Use) and the closer of Phase 3, Building agents you can trust and ship. The previous lesson named the ways an agent fails on its own and the guardrail that contains each. This lesson takes up the other half of the boundary it drew: an agent under attack. The threat model is different (a person is trying to make the agent do something it was not built to do), the defenses are different, and the lesson stays honest that defenses raise the cost of attack without eliminating it.

You will learn the structural fact that makes prompt injection a class of attack with no general solution, namely that text and data share one channel into the model (an attacker who can put text into anything the model reads can put instructions into the model). From that one fact you will see the three attack categories that follow (hijacking the agent’s goal, abusing the agent’s tools, exfiltrating data through the agent), the defense for each, and the sharper indirect variant where the attacker plants their text in a document the agent will later retrieve. The lesson then assembles the defense-in-depth toolkit (capability gating, input handling, output validation and routing, sandboxing, human-in-the-loop on high-stakes actions, and tamper-evident audit logs), and closes with the security-as-architecture principle and the Agents Rule of Two (never give one run both untrusted input and the ability to take high-stakes actions).

The track as a whole structurally mirrors Microsoft’s “AI Agents for Beginners” (MIT-licensed), with the Berkeley CS294 LLM Agents course as a depth reference. For this lesson specifically, the threat-model framework is anchored to OWASP’s Top 10 for LLM Applications 2025 (v2.0, released 2024-11-18), with Greshake et al. (2023) as the foundational source for indirect prompt injection and Simon Willison’s prompt-injection writing as the public-discourse anchor; Microsoft’s Lesson 18 contributes one specific defense pattern (cryptographic-receipt audit logs) but covers a narrower topic than this lesson does. Full attribution is in this lesson’s references.

This lesson closes the production-agents phase by taking the trust-and-ship question from “does the agent fail safely on its own?” (the previous lesson) to “can the agent be made to fail in ways that benefit an attacker?” Both questions matter to a team deciding whether to put an agent in front of real users; the lesson holds them as two distinct halves of the same shipping bar. It pulls forward the structural fact about text-only model input from Lesson 2 (tool calls were just text in an agreed shape; instructions in untrusted documents are too), the tool-definition discipline from Lesson 4 (vague tools widen the abuse surface), the agentic-retrieval surface from Lesson 6 (anything retrieved is, to the model, instructions), and the blast-radius principle from Lesson 10 (now applied with attackers as the new reason for human-in-the-loop gating). This is the closing lesson of the track.

Prerequisites: the earlier lessons in the track, especially Building trustworthy agents (the immediately prior lesson; this one is the security half of the trust-vs-security pair it sets up) and The tool-use design pattern in depth (the tool-abuse attack category turns directly on what tools an agent has and how their permissions are scoped). You do not need to code. If you understand an agent as a model in a loop with tools, you have the background; this lesson is about the threats that loop is exposed to once it is deployed, and the defenses that contain them.

  • Distinguish security (the agent under attack) from trustworthiness (the agent failing on its own)
  • Explain the structural fact that makes prompt injection a class of attack without a general solution (text and data share one channel into the model)
  • Name the three principal attack categories against an agent (hijacking the goal, abusing the tools, exfiltrating data) and the defense for each
  • Assemble a defense-in-depth toolkit (capability gating, input handling, output validation and routing, sandboxing, human-in-the-loop, tamper-evident audit logs) appropriate to a deployed agent
  • Apply the Agents Rule of Two and the security-as-architecture principle when designing an agent for deployment
  • Read time: about 11 minutes
  • Practice time: about 15 minutes (a self-check, two applied exercises on classifying the attack and designing a defense, and flashcards)
  • Difficulty: standard