References: monitoring and robustness

Primary source

Dan Hendrycks. Introduction to AI Safety, Ethics, and Society. Taylor & Francis, 2024. Center for AI Safety, free to read at aisafetybook.com. L3 draws from Chapter 3 (Single Agent Safety), specifically Sections 3.2 (Monitoring) and 3.3 (Robustness).

Chapter section	Topic	URL
Ch 3.2	Monitoring	aisafetybook.com/textbook/monitoring
Ch 3.3	Robustness	aisafetybook.com/textbook/robustness

Verbatim quotes used in the lesson

A1 discipline is preserved on every quote: each is verbatim from the cited section, with no paraphrasing inside quote marks.

§3.2 Monitoring, core framing: “Current AI systems lack transparency and can exhibit surprising emergent capabilities. Research is needed to ensure we can understand models’ internal representations, monitor anomalies, and evaluate hazardous capabilities.”
§3.2 on mechanistic interpretability: “aims to identify and combine low-level components of the model in order to understand its behaviour.”
§3.2 on representation engineering: “starts from identifying how models represent concepts…and uses this to analyse and control them.”
§3.2 on confabulation: “confabulate by giving reasons not faithful to the model’s real internal processes.”

Hendrycks references the Hanoi rat-tail story, the Goodhart’s-law illustration used in the robustness section, in the chapter’s framing of proxy failures. The lesson attributes it inline without quote marks because its prose retells the story rather than quoting it verbatim.

Posture and license

Same posture as L1 and L2: the CAIS textbook is © 2026 Center for AI Safety, published by Taylor & Francis, free to read online with no explicit Creative Commons or reuse license. This lesson is a structural mirror with verbatim quotes anchored to specific chapter sections within fair-use limits, link-out only, no embed, no derivative runs.

What L4 builds on from here

L4 enters Hendrycks Ch 3.4 (Alignment) and works the failure modes that sit underneath both halves of L3: specification gaming, proxy gaming, and deceptive alignment. Goodhart’s-law reasoning from L3 is the direct on-ramp; the confabulation problem and the sandbagging illustration both foreshadow alignment-specific failure modes. By L4 the reader should be able to recognize that a robustness-and-monitoring story can be complete while the underlying alignment problem remains untouched.
