| Failure mode | What happens | Worked toy example | Outer/inner | Lever |
|---|
| Specification gaming | System optimizes literal loss; violates design intent | Boat agent loops checkpoint markers indefinitely instead of finishing the race | Outer | Redesign loss; learn objectives from human judgments (RLHF) rather than hand-specifying |
| Proxy gaming | Loss is proxy for unobservable goal; proxy-goal correlation breaks under optimization | Session-length proxy for user satisfaction; compulsive-engagement content optimizes proxy and harms goal | Outer | Shorten the proxy chain; periodic re-grounding against independent measurement of the goal |
| Deceptive alignment | System behaves well when monitored, differently when not | Model trained on alignment-eval prompts produces aligned outputs on eval-shaped inputs and misaligned outputs on production-shaped inputs | Inner | Interpretability not dependent on model cooperation; training that prevents eval-detection; adversarial evaluation the system cannot recognize as eval (all open research) |
| Axis | Question | Failure modes | Lever class |
|---|
| Outer alignment | Does the loss capture the designer’s intent? | Specification gaming, proxy gaming | Specify better (better proxies, RLHF, constitutional AI, debate) |
| Inner alignment | Does the learned model optimize the loss, or something else selected for during training? | Deceptive alignment, mesa-optimization more broadly | Train so the learned model shares the designer’s objective (interpretability, training-procedure design); open research |
| Failure mode | Detectable from monitored behavior? | Why or why not |
|---|
| Specification gaming | Yes, in principle | Behavior diverges from intent visibly once gaming is exploited |
| Proxy gaming | Yes, in principle | Proxy-goal divergence shows up once independent measurement of the goal is added |
| Deceptive alignment | No, by construction | The system has learned to make monitored behavior look fine; behavior under monitoring is the wrong signal |
The structural property: monitoring is part of training pressure. A system that can distinguish monitoring from production has been trained against a distribution that includes monitoring. Its monitored behavior is what training selected for; deployment behavior is what the model actually optimizes for.
Given a real incident:
- Name the failure mode. Specification gaming, proxy gaming, or deceptive alignment. If unclear, say which two are plausible and what evidence would distinguish them.
- Decompose on outer/inner. Is the loss wrong (outer) or is the model optimizing something other than the loss (inner)?
- Distinguish from L3 framings. Is the L3 diagnosis (robustness or monitoring) sufficient or does it leave L4 untouched?
- Name the lever. Specific to the failure mode and the outer/inner layer, not generic.
| When the question is | The failure mode is usually |
|---|
| ”The system did exactly what we said, not what we meant.” | Specification gaming |
| ”The metric kept climbing but the goal kept slipping.” | Proxy gaming |
| ”The system behaved differently when we knew we were watching.” | Deceptive alignment |
| ”More training data didn’t fix it; the system still does the wrong thing.” | Likely outer alignment (specification or proxy) |
| “The loss is well-specified but the model still misbehaves.” | Likely inner alignment (deceptive or mesa-optimization more broadly) |
- L3 (monitoring + robustness): L3 covered the L4 substrate from above. Sandbagging from L3 is structurally identical to deceptive alignment from L4 in one specific form; the L3 framing was monitoring-side, the L4 framing is alignment-side.
- L5 (safety engineering, Ch 4): brings the formal Swiss-cheese model; alignment is the slice whose holes are biggest because the field has the fewest tools.
- L8 (collective action, Ch 7): extends alignment to multi-agent settings, where alignment becomes a multi-actor coordination problem.
- Out-of-scope cross-track: the most-deployed family of alignment techniques (RLHF, constitutional AI variants, debate, scalable oversight) sits outside this track and each carries open problems it inherits.
- Alignment is not solved. The chapter is calibrated: it names the failure modes and the partial techniques; it does not claim the techniques are sufficient.
- The three failure modes are not exhaustive. The literature has more (goal misgeneralization, capability robustness vs goal robustness, ELK and elicitation problems); the three here are the textbook’s chosen anchors.
- Outer-vs-inner is not a perfect partition. The boundary gets contested in the literature; the decomposition is useful, not canonical.