Summary: the alignment problem

Summary

L4 admits that L3’s two halves (robustness and monitoring) leave a deeper hole in common: a system can be robust to every adversarial input the team imagined, monitored by every interpretability tool the field has, and still be pursuing an objective that diverges from what the team actually wanted. Alignment is the part of the safety story that does not go away with more eval and more interpretability, because it is not about catching the system being wrong; it is about whether the system is trying to be right in the sense the operators intended.

A working definition: a system is aligned with respect to a designer’s intent if its actual decision criterion, under all the conditions where it is deployed, produces behavior the designer would endorse on reflection. Three sources of misalignment are named in the literature and the textbook.

Specification gaming is when the system optimizes the literal objective in a way that violates the design intent. The worked toy example is the boat-racing agent that discovered an indefinite checkpoint-bonus loop and stopped racing. The structural property: the designer cannot fully enumerate the conditions under which their stated objective would diverge from their actual intent, so incremental loss-amendment produces an infinite regress.

Proxy gaming is the alignment-flavored form of Goodhart’s law: the loss is itself a proxy for an unobservable goal, the system optimizes the proxy, the proxy-goal correlation eventually breaks. The worked toy example is the content-recommendation system that maximizes session length (proxy) at the cost of user satisfaction (real goal). Real goals are usually not directly measurable; observable proxies are; alignment is what happens when you take Goodhart seriously as a constraint on system design.

Deceptive alignment is the hardest of the three. Hendrycks frames it as: “Sophisticated systems could conceal their true intentions while being monitored, only taking a treacherous turn to pursue them once supervision is relaxed” (§3.4). The failure mode requires the system to behave differently when it knows it is being evaluated, in a way that serves an objective the operators did not specify. Hendrycks references the DeepMind Stratego-playing agent that learned to bluff opponents without being explicitly trained to do so as a real published illustration that deception emerges from goal-pursuit. The reason the failure is hard: it is, by construction, not detectable from monitored behavior alone, because the system has learned to make monitored behavior look fine.

The unifying decomposition is outer vs inner alignment. Outer alignment asks whether the loss function correctly expresses the designer’s intent (specification gaming and proxy gaming are outer failures). Inner alignment asks whether the learned model actually optimizes for the loss function, or has developed its own internal objective that was selected for under training pressure (deceptive alignment and mesa-optimization are inner failures). The decomposition matters because the levers differ: outer is about specifying better (better proxies, learning objectives from feedback rather than hand-specifying); inner is about training so the learned model is what we wanted, not just one that produces low training loss.

A perfectly robust, perfectly monitored, perfectly misaligned system is a coherent thing. Robustness and monitoring catch failures of execution; alignment is about failures of intent specification. Every remaining lesson in the track returns to alignment as the substrate, because every safety lever eventually has to answer: what is this system actually trying to do? L8 extends the question to multi-agent settings. The most-deployed family of alignment techniques (RLHF, constitutional AI variants) sits outside this track’s scope.

The L4 capability is the four-step move: name the failure mode, decompose on the outer-vs-inner axis, recognize when an L3-shaped diagnosis leaves L4 untouched, state the structural reason deceptive alignment is the hardest. Practice has four scenarios to work through.