Monitoring and robustness: two halves of the deployment-time safety problem
What you’ll learn
Phase 2 of Track 23 opens with the deployment-time safety surface, and Hendrycks Chapter 3 splits that surface into two halves. Robustness covers the system-side: the model itself producing wrong or harmful behavior under conditions its training did not anticipate. Monitoring covers the observation-side: the system behaving badly and operators not noticing in time to do anything about it. The two halves overlap less than common usage suggests, and they need different fixes.
The lesson works each half in detail. On the robustness side: adversarial perturbations, distribution shift, prompt injection and trojan attacks, and Goodhart’s-law proxy gaming. Each sub-mechanism comes with a concrete worked illustration, the structural property that makes it hard to catch with conventional testing, and the intervention levers the field has developed. On the monitoring side: interpretability in its two main lineages (mechanistic and representation-engineering), anomaly detection, and capability evaluation. The lesson also flags two specific monitoring failure modes worth holding in working memory: confabulation (a model’s self-explanation can be plausible-sounding text that does not match its internal process) and sandbagging (models behaving differently when they know they are being evaluated).
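To make the proxy-gaming mechanism concrete before the body works it in detail, here is a minimal sketch. It is not taken from the chapter: the support-ticket setting, the policy names, and every number are invented for illustration. The assumption is a system whose goal is resolved problems but whose tracked metric is tickets closed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    tickets_closed: int     # proxy metric the operators track
    problems_resolved: int  # true goal, which the metric only approximates


# Hypothetical policies with invented numbers, for illustration only.
POLICIES = [
    Policy("careful", tickets_closed=40, problems_resolved=38),
    Policy("fast", tickets_closed=60, problems_resolved=45),
    Policy("close_without_fixing", tickets_closed=95, problems_resolved=10),
]

# An optimizer that only sees the proxy picks whatever maximizes it.
proxy_winner = max(POLICIES, key=lambda p: p.tickets_closed)

# The policy the operators actually wanted maximizes the true goal.
goal_winner = max(POLICIES, key=lambda p: p.problems_resolved)

# Divergence between metric and goal: resolved problems lost by chasing the proxy.
gap = goal_winner.problems_resolved - proxy_winner.problems_resolved

print(proxy_winner.name, goal_winner.name, gap)
```

In this toy setup the proxy and the goal agree when choosing between the first two policies and come apart only once a policy that games the metric is available, which is the pattern Goodhart’s-law reasoning asks you to look for.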
The closing section explains why both halves are needed. The Swiss-cheese intuition from safety engineering, formalized in lesson 5 (L5), is the right picture: each layer has holes, and the layers compose into a useful safety property because the holes do not line up. A robust system without monitoring drifts undetected; a monitored system without robustness fails loudly but no faster. The capability this lesson (L3) builds is the four-step classify-and-defend protocol, applied to real incident reports.
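The claim that layers help because the holes do not line up can be checked with simple arithmetic. The sketch below is illustrative only: the per-layer miss rates are invented, the function names are hypothetical, and the two cases shown (fully independent layers and fully nested blind spots) are idealized extremes rather than a model from the chapter.

```python
import math


def miss_probability_independent(miss_rates):
    """Chance a failure slips past every layer when the layers' holes are independent."""
    return math.prod(miss_rates)


def miss_probability_aligned(miss_rates):
    """Chance a failure slips past every layer when the holes fully line up.

    With nested blind spots, anything the best layer misses is missed by all
    the others too, so stacking layers adds nothing beyond the best one.
    """
    return min(miss_rates)


# Invented miss rates for three hypothetical defensive layers.
rates = [0.1, 0.2, 0.3]

independent = miss_probability_independent(rates)
aligned = miss_probability_aligned(rates)

print(f"independent holes: {independent:.3f}")
print(f"aligned holes:     {aligned:.3f}")
```

Under these assumptions, three mediocre layers with independent holes let far fewer failures through than the best single layer does, while fully aligned holes erase that benefit. Real layers sit somewhere between the two extremes, which is why pairing defenses that fail for different reasons (a robustness fix and a monitoring fix) matters.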
Where this fits
This is lesson 3 of 9 and the first lesson of Phase 2 (safety and alignment). The previous lesson, The four catastrophic risk categories, closed Phase 1 by working Hendrycks’ four-bucket typology in detail. The next lesson, The alignment problem (L4, Ch 3.4), goes one layer deeper: even a perfectly robust and perfectly monitored system can be pursuing the wrong objective. Phase 2 will run L3 through L6, working single-agent safety, alignment, safety engineering, and complex systems in order, before Phase 3 (ethics and governance) opens at L7.
Before you start
Prerequisites: L1 (AI safety as a field) and L2 (The four catastrophic risk categories). The L2 vocabulary (rogue-AI sub-mechanisms, organizational-risk patterns) is the on-ramp into L3’s failure-mode vocabulary; the L1 descriptive-not-prescriptive register continues to anchor how the chapter is cited.
About the worked illustrations
L3 introduces two short worked illustrations the body relies on: a prompt-injection scenario for a deployed coding-assistant agent, and the sandbagging phenomenon in capability evaluation. Both are written generically (no specific vendor or product named) and are meant to make the abstract sub-mechanism concrete enough to recognize in a real incident report. The practice set extends these with five composite incident reports.
By the end, you’ll be able to
- Distinguish a robustness failure from a monitoring failure on a real incident, naming which one dominated the cost when both are present
- Name three sub-mechanisms inside each half (six total) and give one example of each
- Apply Goodhart’s-law reasoning to predict where a deployed system’s metric will diverge from its goal
- Recognize the confabulation problem and the sandbagging problem as named monitoring failure modes
- Walk the four-step classify-and-defend protocol on a new incident: identify harm, name primary half, name sub-mechanism, name lever
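The four-step protocol in the last objective can be sketched as a small record type. This is a hypothetical encoding, not the lesson's own tooling: the class and field names are invented, the sub-mechanism labels are shortened from the overview above, and the lever is left as free text because this section does not enumerate levers. The example incident is a made-up instance of the prompt-injection scenario.

```python
from dataclasses import dataclass
from enum import Enum


class Half(Enum):
    ROBUSTNESS = "robustness"
    MONITORING = "monitoring"


# Sub-mechanisms named in this lesson's overview, grouped by half.
SUB_MECHANISMS = {
    Half.ROBUSTNESS: {
        "adversarial perturbation",
        "distribution shift",
        "prompt injection or trojan",
        "proxy gaming",
    },
    Half.MONITORING: {
        "interpretability",
        "anomaly detection",
        "capability evaluation",
    },
}


@dataclass(frozen=True)
class Classification:
    harm: str           # step 1: identify the harm
    primary_half: Half  # step 2: name the primary half
    sub_mechanism: str  # step 3: name the sub-mechanism
    lever: str          # step 4: name the lever (free text here)

    def __post_init__(self):
        # Every step must be filled in, and step 3 must be consistent with step 2.
        if not self.harm.strip() or not self.lever.strip():
            raise ValueError("harm and lever must both be stated")
        if self.sub_mechanism not in SUB_MECHANISMS[self.primary_half]:
            raise ValueError(
                f"{self.sub_mechanism!r} is not a {self.primary_half.value} sub-mechanism"
            )


# Hypothetical incident: a coding-assistant agent follows instructions hidden in a file.
example = Classification(
    harm="agent executed instructions embedded in a repository file",
    primary_half=Half.ROBUSTNESS,
    sub_mechanism="prompt injection or trojan",
    lever="treat retrieved file content as untrusted data",
)
print(example.primary_half.value, "/", example.sub_mechanism)
```

The consistency check in `__post_init__` is a design choice for the sketch: it forces the half to be named before the sub-mechanism, mirroring the order of the protocol.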
Time and difficulty
- Read time: about 13 minutes (two worked illustrations make the body slightly longer than L1; the vocabulary load is similar to L2)
- Practice time: about 15 minutes (five composite incidents, one Goodhart’s-law metric-design exercise, ten flashcards)
- Difficulty: deep (Stage E specialized; L1 + L2 capabilities assumed)