References: monitoring and robustness
Primary source
Dan Hendrycks. Introduction to AI Safety, Ethics, and Society. Taylor & Francis, 2024. Center for AI Safety, free to read at aisafetybook.com. L3 draws from Chapter 3 (Single Agent Safety), specifically Sections 3.2 (Monitoring) and 3.3 (Robustness).
| Chapter section | Topic | URL |
|---|---|---|
| Ch 3.2 | Monitoring | aisafetybook.com/textbook/monitoring |
| Ch 3.3 | Robustness | aisafetybook.com/textbook/robustness |
Verbatim quotes used in the lesson
Every quote follows the A1 discipline: verbatim from the cited section, with no paraphrasing inside quote marks.
- §3.2 Monitoring, core framing: “Current AI systems lack transparency and can exhibit surprising emergent capabilities. Research is needed to ensure we can understand models’ internal representations, monitor anomalies, and evaluate hazardous capabilities.”
- §3.2 on mechanistic interpretability: “aims to identify and combine low-level components of the model in order to understand its behaviour.”
- §3.2 on representation engineering: “starts from identifying how models represent concepts…and uses this to analyse and control them.”
- §3.2 on confabulation: “confabulate by giving reasons not faithful to the model’s real internal processes.”
The Hanoi rat-tail Goodhart’s-law story used in the robustness section is referenced by Hendrycks in the chapter’s framing of proxy failures; the lesson attributes it inline without quote marks because it retells the story rather than quoting it.
Posture and license
Same posture as L1 and L2: the CAIS textbook is © 2026 Center for AI Safety, published by Taylor & Francis, free to read online with no explicit Creative Commons or reuse license. This lesson is a structural mirror with verbatim quotes anchored to specific chapter sections within fair-use limits, link-out only, no embed, no derivative runs.
Suggested companion reading
These are not required for L3; they extend each side of the robustness/monitoring distinction.
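- Adversarial robustness: Ian Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and Harnessing Adversarial Examples” (ICLR 2015), available at arxiv.org/abs/1412.6572. The seminal paper introducing the fast gradient sign method and adversarial training with it; its signature example is the “panda → gibbon” image. The “school bus → ostrich” example traces to the earlier Szegedy et al., “Intriguing Properties of Neural Networks” (ICLR 2014).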
- Distribution-shift robustness: John Miller et al., “The Effect of Natural Distribution Shift on Question Answering Models” (ICML 2020), for a measured study of accuracy degradation under realistic shift. The medical-imaging case study most often cited is Christian Leibig et al., “Leveraging Uncertainty Information from Deep Neural Networks for Disease Detection” (Scientific Reports, 2017), on diabetic retinopathy detection, though many follow-ups document the cross-hospital-machine effect.
- Prompt injection: Simon Willison’s running catalog of prompt-injection incidents and patterns at simonwillison.net/tags/prompt-injection/ is the most up-to-date public reference; his original write-up coined the term in September 2022.
- Interpretability: Anthropic’s research at transformer-circuits.pub for the mechanistic-interpretability lineage Hendrycks references; the representation-engineering lineage traces to Andy Zou and colleagues, “Representation Engineering: A Top-Down Approach to AI Transparency” (arXiv 2023).
- Capability evaluation and sandbagging: the METR (Model Evaluation and Threat Research) public methodology for capability evaluations gives the most concrete picture of how the evaluation pipeline is built and where its limits lie. Sandbagging as a research topic is covered in Apollo Research’s published reports on deceptive alignment evaluations.
- Goodhart’s law in AI: the canonical formulation is Charles Goodhart’s 1975 paper on monetary policy; the AI-deployment generalization is most often cited via Manheim and Garrabrant, “Categorizing Variants of Goodhart’s Law” (2018), available at arxiv.org/abs/1803.04585.
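For readers who want a concrete feel for the adversarial-robustness entry above, here is a minimal sketch of a fast-gradient-sign perturbation in the spirit of Goodfellow et al. It is an illustration under stated assumptions, not the paper's code: the logistic-regression model, the weights, the input, and the helper names (`predict`, `fgsm_perturb`) are all hypothetical and chosen only to keep the example dependency-free.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a toy logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def sign(g: float) -> int:
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def fgsm_perturb(weights, bias, x, y, epsilon):
    """Move each input feature by epsilon in the direction that raises the loss.

    For logistic regression with binary cross-entropy loss, the gradient of
    the loss with respect to feature x_i is (p - y) * w_i.
    """
    p = predict(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input (assumptions for illustration only).
weights = [2.0, -3.0, 1.0]
bias = 0.0
x = [0.5, 0.1, 0.2]
y = 1  # true label

x_adv = fgsm_perturb(weights, bias, x, y, epsilon=0.2)

print("clean probability of class 1:", predict(weights, bias, x))
print("adversarial probability of class 1:", predict(weights, bias, x_adv))
```

With these toy numbers, a change of at most 0.2 per feature is enough to push the predicted probability across the 0.5 decision threshold, which is the core phenomenon the adversarial-robustness literature studies at image scale.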
What L4 builds on from here
L4 enters Hendrycks Ch 3.4 (Alignment) and works through the failure modes that sit underneath both halves of L3: specification gaming, proxy gaming, and deceptive alignment. Goodhart’s-law reasoning from L3 is the direct on-ramp; the confabulation problem and the sandbagging illustration both foreshadow alignment-specific failure modes. By L4 the reader should be able to recognize that a robustness-and-monitoring story can be complete while the underlying alignment problem remains untouched.