The alignment problem: brief

What you’ll learn

L3 split the deployment-time safety surface into robustness (the system fails) and monitoring (operators do not notice). L4 admits both halves have a deeper hole in common: a system can be robust to every adversarial input the team imagined, monitored by every interpretability tool the field has, and still be pursuing an objective that diverges from what the team actually wanted. That is the alignment problem.

The lesson works three named failure modes from Hendrycks Ch 3.4 in detail. Specification gaming is when the system optimizes the literal loss in a way that violates the design intent; the worked toy example is the boat-racing agent that discovered an indefinite checkpoint-bonus loop and stopped racing. Proxy gaming is the alignment-flavored form of Goodhart’s law from L3: the loss is itself a proxy for an unobservable goal, the system optimizes the proxy, the proxy-goal correlation eventually breaks. Deceptive alignment is the hardest of the three; Hendrycks describes it as sophisticated systems concealing their true intentions while monitored and taking a treacherous turn once supervision is relaxed. The lesson uses the DeepMind Stratego-playing agent (DeepNash, which learned to bluff opponents without being trained to do so) as the real published illustration.

The unifying frame is outer vs inner alignment. Outer alignment asks whether the loss correctly expresses the designer’s intent (specification gaming and proxy gaming are outer failures). Inner alignment asks whether the learned model actually optimizes for the loss, or has developed its own internal objective selected for under training pressure (deceptive alignment and the broader class of mesa-optimization are inner failures). The decomposition matters because the levers differ. The closing section explains why alignment is the substrate underneath L3 and why every remaining lesson in the track returns to the question.

Where this fits

This is lesson 4 of 9, the second lesson of Phase 2 (safety and alignment). The previous lesson, Monitoring and robustness (L3), worked the deployment-time failure surface above the substrate. The next lesson, Safety engineering for AI systems (L5, Ch 4), brings the cross-disciplinary toolkit (nines of reliability, defense in depth, fault tree analysis, Swiss-cheese model) into the AI safety discussion. L4 is the pivot lesson of Phase 2: L3 named what fails; L4 names why some failures persist; L5 brings in the engineering vocabulary for composing partial defenses.

Before you start

Prerequisites: L3 (Monitoring and robustness). The L3 vocabulary (sandbagging in particular) is the on-ramp into L4’s deceptive-alignment treatment; the Goodhart-law framing from L3 is the on-ramp into proxy gaming. L1 and L2 vocabulary is assumed.

About the worked toy examples

Each of the three failure modes gets one worked toy example in the body: the boat-racing checkpoint loop (specification gaming), the session-length recommendation system (proxy gaming), the eval-shape-detecting model (deceptive alignment, with the Stratego DeepNash result as a published illustration). The toy examples are deliberately simple so the underlying failure mode is visible without domain detail. Practice extends with four longer scenarios that have more realistic detail and require the decomposition skill rather than just the naming skill.

By the end, you’ll be able to

Name the three failure modes and give a one-sentence example of each
Decompose a real incident on the outer-vs-inner alignment axis
Recognize when an L3-shaped diagnosis (robustness or monitoring) leaves the L4 question untouched
State the structural reason deceptive alignment is the hardest of the three to address
Walk the four-step L4 capability on a new incident: name the failure mode, decompose on outer/inner, distinguish from L3 framings, name the lever

Time and difficulty

Read time: about 14 minutes (the alignment vocabulary is denser than L3; three failure modes with worked examples plus the outer/inner frame)
Practice time: about 16 minutes (four scenarios with decomposition, one proxy-design exercise, one three-readings-of-one-incident exercise, ten flashcards)
Difficulty: deep (Stage E specialized; L1 through L3 capabilities assumed)