| Sub-mechanism | What happens | Example | Lever |
|---|---|---|---|
| Adversarial robustness | Small input perturbations flip behavior | Image-classifier flips label under imperceptible noise | Adversarial training; expensive and partial |
| Distribution shift | Model trained on A fails on B | Hospital-A X-ray model degrades on Hospital-B machines | Distribution-shift testing pre-deploy; continuous drift monitoring post-deploy |
| Prompt injection (LLM) | Input contains hijacking instructions | Doc-page comment instructs an agent to exfiltrate secrets | Trust-boundary enforcement (system trusted, retrieved content not); output filtering |
| Trojan attacks (training-time) | Poisoned training data plants triggers | Sticker pattern flips classification of one species | Training-data integrity, post-hoc trojan detection |
| Goodhart / proxy gaming | Metric becomes target, stops measuring goal | Rat-tail bounty incentivizes rat farming | Multi-metric monitoring; periodic proxy/goal divergence audits |

| Sub-mechanism | Strength | Weakness | Lever |
|---|---|---|---|
| Mechanistic interpretability | Faithful when it succeeds | Rare, per-model, does not yet scale | Continued research investment; small-model debugging |
| Representation engineering | Scales to frontier models | Indirect (learns what is represented, not how computed) | Concept-vector probes as control surfaces |
| Anomaly detection | Scales, aggregate-level | Defines anomaly against past behavior; lags novel failures | Multi-baseline monitoring; periodic re-baselining |
| Capability evaluation | Surfaces latent capabilities | Limited by evaluator imagination; sandbagging | Pair benchmarks with red-team campaigns; format-diverse evals |
| Confabulation guard | (None, this is a failure mode) | Model explanations may not be faithful to internal process | Never use self-explanation as ground truth for accountability |
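
The anomaly-detection row says an anomaly is defined against past behavior and names multi-baseline monitoring as the lever. A minimal sketch of why more than one baseline helps, assuming a simple z-score test; the function names, the threshold of 3.0, and the error rates are all invented for illustration:

```python
from statistics import mean, stdev


def z_score(baseline, value):
    """How many baseline standard deviations `value` sits from the baseline mean."""
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    return (value - mean(baseline)) / sigma


def flagged_by(baselines, value, threshold=3.0):
    """Names of the baselines against which `value` looks anomalous."""
    return [
        name
        for name, data in baselines.items()
        if abs(z_score(data, value)) > threshold
    ]


# Illustrative daily error rates (made up for this sketch).
# The recent window has already absorbed a slow upward drift.
baselines = {
    "launch_reference": [0.050, 0.052, 0.049, 0.051, 0.050, 0.048],
    "last_week": [0.058, 0.060, 0.061, 0.059, 0.062, 0.060],
}

flags = flagged_by(baselines, 0.061)
print(flags)  # only the fixed launch reference flags today's value
```

Against the recent window alone, today's value looks normal because the drift is already part of the baseline. The fixed reference still flags it, which is the case for keeping both.
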

Given an incident report:
- Identify the harm. What specifically went wrong?
- Name the primary half. Robustness (the system broke), monitoring (no one noticed), or both? If both, say which one dominated the cost.
- Name the sub-mechanism. Pick from the Sub-mechanism column of that bucket's table above.
- Name a lever. From the bucket’s lever column. Be specific: “more monitoring” is incomplete; “continuous drift monitoring with a 72-hour escalation threshold” is complete.
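
The four steps above can be captured as a small record with a completeness check. This is a sketch only: the `Triage` class, its field names, and the "lever must contain a number" heuristic are hypothetical stand-ins for whatever review process is actually used.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Half(Enum):
    ROBUSTNESS = "robustness"  # the system broke
    MONITORING = "monitoring"  # no one noticed
    BOTH = "both"


@dataclass
class Triage:
    harm: str
    primary_half: Half
    sub_mechanism: str
    lever: str
    dominant_half: Optional[Half] = None  # required only when primary_half is BOTH

    def problems(self) -> List[str]:
        """Reasons this triage is incomplete; an empty list means it passes."""
        issues = []
        if not self.harm.strip():
            issues.append("harm is empty")
        if self.primary_half is Half.BOTH and self.dominant_half in (None, Half.BOTH):
            issues.append("both halves named but no dominant half given")
        # Crude, assumed heuristic: a specific lever states a number
        # (a threshold, a cadence, a budget), not just a direction.
        if not any(ch.isdigit() for ch in self.lever):
            issues.append("lever has no concrete threshold or cadence")
        return issues


vague = Triage(
    harm="X-ray model degraded on Hospital-B machines",
    primary_half=Half.BOTH,
    sub_mechanism="Distribution shift",
    lever="more monitoring",
)
specific = Triage(
    harm="X-ray model degraded on Hospital-B machines",
    primary_half=Half.BOTH,
    sub_mechanism="Distribution shift",
    lever="continuous drift monitoring with a 72-hour escalation threshold",
    dominant_half=Half.MONITORING,
)
print(vague.problems())
print(specific.problems())
```

The digit check is deliberately simple and will miss levers that are specific without a number; it only illustrates the difference between "more monitoring" and a lever with a stated threshold.
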

Any metric chosen as a measure of a goal will, once it becomes a target, eventually be optimized in ways that diverge from the goal. The robustness failure is that the system stays robust with respect to the proxy while producing outcomes the designer did not want.
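
A proxy/goal divergence audit can be sketched as a check for periods where the proxy improved while the goal did not. The function name and the monthly figures below are invented for illustration (loosely following the rat-tail bounty example), and "higher is better" is assumed for both series:

```python
def divergence_periods(proxy, goal):
    """Indices i where the proxy rose from i-1 to i but the goal did not.

    Assumes both series are aligned in time and that higher is better for each.
    """
    if len(proxy) != len(goal):
        raise ValueError("proxy and goal must cover the same periods")
    return [
        i
        for i in range(1, len(proxy))
        if proxy[i] > proxy[i - 1] and goal[i] <= goal[i - 1]
    ]


# Made-up monthly figures: tails turned in (proxy) vs. rat-free blocks (goal).
tails_turned_in = [100, 150, 220, 300]
rat_free_blocks = [40, 45, 44, 38]

print(divergence_periods(tails_turned_in, rat_free_blocks))  # [2, 3]
```

In month 1 the proxy and goal move together; from month 2 on the proxy keeps climbing while the goal stalls and then falls, which is the pattern an audit is meant to surface.
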

| Layer | Holes (failure mode it does NOT catch) |
|---|---|
| Robustness | Failure modes that emerge slowly under conditions seen in training |
| Monitoring | Failures that the observation infrastructure does not signal or that look normal at aggregate |

Robustness without monitoring: failures eventually happen and stay undetected for as long as the monitoring layer is missing. Monitoring without robustness: failures happen often and monitoring catches them, but the underlying system stays brittle. The layers compose because their holes do not line up. L5 develops the formal version (defense in depth, fault tree analysis).
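
The claim that layers compose because their holes do not line up can be made concrete with a little arithmetic. A rough sketch, assuming each layer misses a failure independently with some probability; the independence assumption, the function names, and the example miss rates are not from the text:

```python
def combined_miss_rate(miss_rates):
    """Chance a failure slips past every layer, assuming independent holes."""
    rate = 1.0
    for p in miss_rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each miss rate must be a probability in [0, 1]")
        rate *= p
    return rate


def aligned_miss_rate(miss_rates):
    """Upper bound when the holes line up: no better than the best single layer."""
    return min(miss_rates)


# Hypothetical miss rates for the robustness and monitoring layers.
layers = [0.10, 0.20]

independent = combined_miss_rate(layers)  # roughly 0.02
aligned = aligned_miss_rate(layers)       # 0.10
print(round(independent, 4), aligned)
```

With independent holes, the two layers together miss far less than either one alone. When the holes line up, stacking layers buys nothing beyond the best single layer, which is why the seams between layers matter.
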

| When the question is | The sub-mechanism is usually |
|---|---|
| “The system worked fine in testing but breaks on this specific input class.” | Adversarial robustness or trojan |
| “The system worked in lab but underperforms in production.” | Distribution shift |
| “Third-party content in the input is steering behavior.” | Prompt injection |
| “The metric is climbing but the goal is not being served.” | Goodhart / proxy gaming |
| “We cannot see what the model is doing internally.” | Interpretability (mechanistic or representation engineering) |
| “The explanation looks reasonable but does not match the decision driver.” | Confabulation |
| “We are not noticing the failure pattern in our aggregate metrics.” | Anomaly detection lag |
| “The model scores low on the benchmark but the capability shows up in production.” | Capability evaluation gap (possibly sandbagging) |

- L4 (alignment, Ch 3.4): the substrate underneath both halves. Even perfect robustness and perfect monitoring do not catch a system pursuing the wrong objective. Specification gaming, proxy gaming, and deceptive alignment all live here.
- L5 (safety engineering, Ch 4): the formal version of the Swiss-cheese intuition (defense in depth, fault tree analysis, FMEA, nines of reliability).
- L6 (complex systems, Ch 5): the framing for why correct components compose into incorrect systems; relevant when L3’s seams between robustness and monitoring become the failure mode.