References: the alignment problem

Primary source

Dan Hendrycks. Introduction to AI Safety, Ethics, and Society. Taylor & Francis, 2024. Center for AI Safety; free to read at aisafetybook.com. L4 draws from Chapter 3, Section 3.4 (Alignment).

Chapter section	Topic	URL
Ch 3.4	Alignment	aisafetybook.com/textbook/alignment

Verbatim quotes used in the lesson

A1 discipline preserved: verbatim from the cited section, no paraphrasing inside quote marks.

§3.4 chapter framing: “We need to develop better techniques to control AI systems and make them less hazardous. If we fail to do this, we face a number of risks from AI systems including deceptive or power-seeking tendencies.”
§3.4 on deceptive alignment: “Sophisticated systems could conceal their true intentions while being monitored, only taking a treacherous turn to pursue them once supervision is relaxed.”
§3.4 on the Stratego DeepNash example: “learned to bluff opponents, despite not being explicitly trained to do so.”

Posture and license

Same posture as L1, L2, L3: the CAIS textbook is © 2026 Center for AI Safety, published by Taylor & Francis, free to read online with no explicit Creative Commons or reuse license. This lesson is a structural mirror with verbatim quotes anchored to the chapter section within fair-use limits, link-out only, no embed, no derivative runs.

Suggested companion reading

These are not required for L4; they extend the failure modes for readers who want to go deeper before L5.

Specification gaming and the boat-racing example: OpenAI’s original 2016 write-up of the CoastRunners incident, “Faulty Reward Functions in the Wild”, is publicly available. DeepMind Safety Research’s post “Specification gaming: the flip side of AI ingenuity”, at deepmindsafetyresearch.medium.com/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4, links to its curated list of specification gaming examples in AI. The list has grown over the years and is the field’s go-to catalog.
Goodhart’s law in AI / proxy gaming: Manheim and Garrabrant, “Categorizing Variants of Goodhart’s Law” (arXiv 2018), at arxiv.org/abs/1803.04585, is the cleanest taxonomy. Also cited in L3; relevant again here.
Deceptive alignment and mesa-optimization: Hubinger et al., “Risks from Learned Optimization in Advanced Machine Learning Systems” (arXiv 2019), at arxiv.org/abs/1906.01820, is the foundational paper that names mesa-optimization and deceptive alignment as distinct concerns. Long and technical; the introduction alone gives the framing.
The Stratego DeepNash result: Perolat et al., “Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning” (Science 2022), at science.org/doi/10.1126/science.add4679. The paper describes the agent and its emergent bluffing behavior; useful for seeing the actual research framing rather than the safety-takeaway summary.
RLHF and constitutional AI as alignment techniques: Bai et al., “Constitutional AI: Harmlessness from AI Feedback” (Anthropic, 2022), at arxiv.org/abs/2212.08073, and Christiano et al., “Deep Reinforcement Learning from Human Preferences” (NeurIPS 2017), at arxiv.org/abs/1706.03741, are the canonical entries on the most-deployed family of alignment techniques. Both sit outside this track’s scope; they are included as orientation for readers who want to follow the alignment-techniques branch directly.
Scalable oversight and debate as alignment proposals: Irving, Christiano, and Amodei, “AI safety via debate” (arXiv 2018), at arxiv.org/abs/1805.00899, is a representative entry; the broader scalable-oversight literature has continued from there. Mostly relevant as orientation.
Goal misgeneralization as a related failure mode: Langosco et al., “Goal Misgeneralization in Deep Reinforcement Learning” (ICML 2022), at arxiv.org/abs/2105.14111, describes a failure mode that does not fit cleanly into specification gaming, proxy gaming, or deceptive alignment, but is in the same family. Useful for understanding that the three failure modes in L4 are not exhaustive.

What L5 builds on from here

L5 enters Hendrycks Chapter 4 (Safety Engineering) and brings the cross-disciplinary toolkit (nines of reliability, defense in depth, fault tree analysis, FMEA, Swiss-cheese model, normal-accident theory) into the AI safety discussion. The alignment slice from L4 is one of the slices the Swiss-cheese model has to compose; L5 will be explicit that alignment’s holes are the largest because the field has the fewest tools, which is why the other slices have to do more work. The path from L5 to L6 (complex systems) closes Phase 2.
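
The claim that a leaky alignment slice forces the other slices to do more work can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the slices fail independently (real layers often share failure causes), and the layer names and hole probabilities are invented for the example, not taken from the textbook.

```python
import math

def residual_risk(hole_probs):
    """Chance a hazard slips through every slice.

    Assumes the slices fail independently, so the per-slice
    hole probabilities simply multiply.
    """
    return math.prod(hole_probs.values())

def required_other_layers(alignment_hole, target, n_other):
    """Hole probability each of n_other equal slices needs
    so that the whole stack meets the target residual risk."""
    return (target / alignment_hole) ** (1 / n_other)

# Hypothetical numbers: alignment is the leakiest slice.
layers = {
    "alignment": 0.5,
    "monitoring": 0.1,
    "robustness": 0.1,
    "systemic": 0.1,
}

risk = residual_risk(layers)
print(f"residual risk: {risk:.4g}")
print(f"nines of reliability: {-math.log10(risk):.2f}")

# How tight must three other slices be to reach a 1e-4 target?
tight = required_other_layers(0.5, 1e-4, 3)   # weak alignment slice
loose = required_other_layers(0.1, 1e-4, 3)   # stronger alignment slice
print(f"per-slice hole needed, alignment at 0.5: {tight:.3f}")
print(f"per-slice hole needed, alignment at 0.1: {loose:.3f}")
```

Under these assumptions, the larger the alignment slice's hole, the smaller the hole each remaining slice is allowed to have for the same overall target, which is the compensation effect described above.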