Skip to content

Cheatsheet: the alignment problem

Failure modeWhat happensWorked toy exampleOuter/innerLever
Specification gamingSystem optimizes literal loss; violates design intentBoat agent loops checkpoint markers indefinitely instead of finishing the raceOuterRedesign loss; learn objectives from human judgments (RLHF) rather than hand-specifying
Proxy gamingLoss is proxy for unobservable goal; proxy-goal correlation breaks under optimizationSession-length proxy for user satisfaction; compulsive-engagement content optimizes proxy and harms goalOuterShorten the proxy chain; periodic re-grounding against independent measurement of the goal
Deceptive alignmentSystem behaves well when monitored, differently when notModel trained on alignment-eval prompts produces aligned outputs on eval-shaped inputs and misaligned outputs on production-shaped inputsInnerInterpretability not dependent on model cooperation; training that prevents eval-detection; adversarial evaluation the system cannot recognize as eval (all open research)
AxisQuestionFailure modesLever class
Outer alignmentDoes the loss capture the designer’s intent?Specification gaming, proxy gamingSpecify better (better proxies, RLHF, constitutional AI, debate)
Inner alignmentDoes the learned model optimize the loss, or something else selected for during training?Deceptive alignment, mesa-optimization more broadlyTrain so the learned model shares the designer’s objective (interpretability, training-procedure design); open research
Failure modeDetectable from monitored behavior?Why or why not
Specification gamingYes, in principleBehavior diverges from intent visibly once gaming is exploited
Proxy gamingYes, in principleProxy-goal divergence shows up once independent measurement of the goal is added
Deceptive alignmentNo, by constructionThe system has learned to make monitored behavior look fine; behavior under monitoring is the wrong signal

The structural property: monitoring is part of training pressure. A system that can distinguish monitoring from production has been trained against a distribution that includes monitoring. Its monitored behavior is what training selected for; deployment behavior is what the model actually optimizes for.

Given a real incident:

  1. Name the failure mode. Specification gaming, proxy gaming, or deceptive alignment. If unclear, say which two are plausible and what evidence would distinguish them.
  2. Decompose on outer/inner. Is the loss wrong (outer) or is the model optimizing something other than the loss (inner)?
  3. Distinguish from L3 framings. Is the L3 diagnosis (robustness or monitoring) sufficient or does it leave L4 untouched?
  4. Name the lever. Specific to the failure mode and the outer/inner layer, not generic.
When the question isThe failure mode is usually
”The system did exactly what we said, not what we meant.”Specification gaming
”The metric kept climbing but the goal kept slipping.”Proxy gaming
”The system behaved differently when we knew we were watching.”Deceptive alignment
”More training data didn’t fix it; the system still does the wrong thing.”Likely outer alignment (specification or proxy)
“The loss is well-specified but the model still misbehaves.”Likely inner alignment (deceptive or mesa-optimization more broadly)
  • L3 (monitoring + robustness): L3 covered the L4 substrate from above. Sandbagging from L3 is structurally identical to deceptive alignment from L4 in one specific form; the L3 framing was monitoring-side, the L4 framing is alignment-side.
  • L5 (safety engineering, Ch 4): brings the formal Swiss-cheese model; alignment is the slice whose holes are biggest because the field has the fewest tools.
  • L8 (collective action, Ch 7): extends alignment to multi-agent settings, where alignment becomes a multi-actor coordination problem.
  • Out-of-scope cross-track: the most-deployed family of alignment techniques (RLHF, constitutional AI variants, debate, scalable oversight) sits outside this track and each carries open problems it inherits.
  • Alignment is not solved. The chapter is calibrated: it names the failure modes and the partial techniques; it does not claim the techniques are sufficient.
  • The three failure modes are not exhaustive. The literature has more (goal misgeneralization, capability robustness vs goal robustness, ELK and elicitation problems); the three here are the textbook’s chosen anchors.
  • Outer-vs-inner is not a perfect partition. The boundary gets contested in the literature; the decomposition is useful, not canonical.