Monitoring and robustness: cheatsheet

Robustness sub-mechanisms (Ch 3.3)

Sub-mechanism	What happens	Example	Lever
Adversarial robustness	Small input perturbations flip behavior	Image-classifier flips label under imperceptible noise	Adversarial training; expensive and partial
Distribution-shift	Model trained on A fails on B	Hospital-A X-ray model degrades on Hospital-B machines	Distribution-shift testing pre-deploy; continuous drift monitoring post-deploy
Prompt injection (LLM)	Input contains hijacking instructions	Doc-page comment instructs an agent to exfiltrate secrets	Trust-boundary enforcement (system trusted, retrieved content not); output filtering
Trojan attacks (training-time)	Poisoned training data plants triggers	Sticker pattern flips classification of one species	Training-data integrity, post-hoc trojan detection
Goodhart / proxy gaming	Metric becomes target, stops measuring goal	Rat-tail bounty incentivizes rat farming	Multi-metric monitoring; periodic proxy/goal divergence audits
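
The distribution-shift lever (continuous drift monitoring post-deploy) can be sketched with a population stability index (PSI) check on a single numeric feature. This is a minimal sketch, not a prescribed method: the function names, the 10-bin layout, and the 0.25 alert threshold (a commonly quoted rule of thumb) are assumptions for illustration, not taken from the chapter.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """PSI between a reference sample (e.g. Hospital A) and a live sample (e.g. Hospital B)."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Live values outside the reference range are clamped into the edge bins.
            idx = max(0, min(bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


def drift_alert(
    reference: Sequence[float], live: Sequence[float], threshold: float = 0.25
) -> bool:
    """True when PSI exceeds the (assumed) alert threshold."""
    return population_stability_index(reference, live) > threshold
```

Identical samples give a PSI of zero; a live sample shifted well outside the reference range gives a large PSI and trips the alert. The threshold and the escalation window should be set per deployment.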

Monitoring sub-mechanisms (Ch 3.2)

Sub-mechanism	Strength	Weakness	Lever
Mechanistic interpretability	Faithful when it succeeds	Rare, per-model, does not yet scale	Continued research investment; small-model debugging
Representation engineering	Scales to frontier models	Indirect (learns what is represented, not how computed)	Concept-vector probes as control surfaces
Anomaly detection	Scales, aggregate-level	Defines anomaly against past behavior; lags novel failures	Multi-baseline monitoring; periodic re-baselining
Capability evaluation	Surfaces latent capabilities	Limited by evaluator imagination; sandbagging	Pair benchmarks with red-team campaigns; format-diverse evals
Confabulation guard	(None, this is a failure mode)	Model explanations may not be faithful to internal process	Never use self-explanation as ground truth for accountability
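
The anomaly-detection weakness (anomaly is defined against past behavior) and its lever (multi-baseline monitoring) can be illustrated with a z-score check against more than one baseline. This is a sketch under assumptions: the function names, the baseline labels, the sample values, and the threshold of 3 standard deviations are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def is_anomalous(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """True when value lies more than `threshold` standard deviations from the baseline mean."""
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    return abs(value - mean(baseline)) / sigma > threshold


def multi_baseline_flags(
    baselines: Dict[str, Sequence[float]], value: float, threshold: float = 3.0
) -> Dict[str, bool]:
    """Check one observation against several named baselines."""
    return {name: is_anomalous(b, value, threshold) for name, b in baselines.items()}


# A metric that drifted slowly upward since launch (made-up values).
launch = [10.0, 10.2, 9.8, 10.1, 9.9]     # frozen baseline from launch
recent = [12.0, 12.2, 11.8, 12.1, 11.9]   # rolling baseline that absorbed the drift
flags = multi_baseline_flags({"launch": launch, "recent": recent}, 12.3)
```

Here the rolling baseline has absorbed the drift and sees nothing unusual, while the frozen launch baseline still flags the observation. Keeping both is one way to re-baseline periodically without losing sight of slow change.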

The four-step classify-and-defend protocol (L3 capability)

Given an incident report:

Identify the harm. What specifically went wrong?
Name the primary half. Robustness (system broke), monitoring (no one noticed), or both (if both, say which half dominated the cost)?
Name the sub-mechanism. Pick from the Sub-mechanism column of the matching table above.
Name a lever. Pick from the Lever column of the same table. Be specific: “more monitoring” is incomplete; “continuous drift monitoring with a 72-hour escalation threshold” is complete.
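
The four steps above can be captured as a small record that rejects incomplete classifications. This is a sketch only: the class name, field names, and lower-cased sub-mechanism labels are hypothetical, and the example incident is built from the Hospital-B row of the robustness table.

```python
from dataclasses import dataclass

# Sub-mechanism labels per half, lower-cased from the two tables in this cheatsheet.
SUB_MECHANISMS = {
    "robustness": {
        "adversarial robustness", "distribution shift", "prompt injection",
        "trojan attack", "goodhart / proxy gaming",
    },
    "monitoring": {
        "mechanistic interpretability", "representation engineering",
        "anomaly detection", "capability evaluation", "confabulation",
    },
}


@dataclass(frozen=True)
class IncidentClassification:
    harm: str                 # step 1: what specifically went wrong
    primary_half: str         # step 2: "robustness", "monitoring", or "both"
    sub_mechanism: str        # step 3: from the matching table
    lever: str                # step 4: a specific lever, not just "more monitoring"
    dominant_half: str = ""   # required only when primary_half == "both"

    def __post_init__(self) -> None:
        if self.primary_half not in ("robustness", "monitoring", "both"):
            raise ValueError("primary_half must be robustness, monitoring, or both")
        if self.primary_half == "both":
            if self.dominant_half not in SUB_MECHANISMS:
                raise ValueError("for 'both', name which half dominated the cost")
            allowed = set().union(*SUB_MECHANISMS.values())
        else:
            allowed = SUB_MECHANISMS[self.primary_half]
        if self.sub_mechanism not in allowed:
            raise ValueError(f"unknown sub-mechanism: {self.sub_mechanism!r}")
        if not self.harm.strip() or not self.lever.strip():
            raise ValueError("harm and lever must both be stated")


example = IncidentClassification(
    harm="X-ray model accuracy degraded on Hospital-B machines",
    primary_half="robustness",
    sub_mechanism="distribution shift",
    lever="continuous drift monitoring with a 72-hour escalation threshold",
)
```

The record cannot judge whether a lever is specific enough; that check stays with the reviewer.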

Goodhart’s law in one line

Any metric chosen as a measure of a goal will, when made a target, eventually be optimized against in a way that diverges from the goal. Seen as a robustness failure: the system keeps performing well on the proxy while producing outcomes the designer did not want.
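
A periodic proxy/goal divergence audit can be sketched as a trend comparison: flag the window when the proxy trends up while an independent measure of the goal does not. The function names and the numbers below are made up for illustration, and both series are assumed to be scored so that higher is better.

```python
from typing import Sequence


def trend(series: Sequence[float]) -> float:
    """Least-squares slope of the series against its index (0, 1, 2, ...)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def proxy_goal_diverges(proxy: Sequence[float], goal: Sequence[float]) -> bool:
    """True when the proxy is improving but the goal measure is flat or worsening."""
    if len(proxy) != len(goal):
        raise ValueError("proxy and goal must cover the same audit periods")
    return trend(proxy) > 0 and trend(goal) <= 0


# Rat-tail bounty, illustrative numbers: tails handed in vs. negated rat-population estimate.
tails_collected = [100, 180, 260, 400]
neg_rat_population = [-50, -55, -70, -90]
diverged = proxy_goal_diverges(tails_collected, neg_rat_population)
```

The audit only works if the goal measure is collected independently of the proxy; otherwise it inherits the same gaming.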

Why both halves are needed (Swiss-cheese intuition)

Layer	Holes (failure mode it does NOT catch)
Robustness	Failure modes that emerge slowly under conditions seen in training
Monitoring	Failures that the observation infrastructure does not signal or that look normal at aggregate

Robustness without monitoring: failures eventually happen and go undetected for as long as the monitoring layer is missing. Monitoring without robustness: failures happen often and the monitoring catches them, but the underlying system stays brittle. The layers compose because their holes do not line up. L5 works through the formal version (defense in depth, fault tree analysis).
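
The arithmetic behind the Swiss-cheese intuition can be sketched as follows, assuming the layers' holes are statistically independent (the quantitative reading of "do not line up"). The function name and the example probabilities are hypothetical.

```python
import math
from typing import Iterable


def combined_miss_probability(layer_miss_probs: Iterable[float]) -> float:
    """Probability that a failure slips through every layer, assuming independent layers."""
    probs = list(layer_miss_probs)
    if not probs:
        raise ValueError("need at least one layer")
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each miss probability must be in [0, 1]")
    return math.prod(probs)


# Illustrative only: robustness lets 10% of failures through, monitoring misses 20% of those.
both_layers = combined_miss_probability([0.1, 0.2])
robustness_only = combined_miss_probability([0.1])
```

With both layers the combined miss rate is roughly 2%, against 10% for robustness alone. If the holes are correlated (the same blind spot in both layers), the product understates the real risk.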

Sub-mechanism disambiguation cheatsheet

When the question is	The sub-mechanism is usually
“The system worked fine in testing but breaks on this specific input class.”	Adversarial robustness or trojan
“The system worked in lab but underperforms in production.”	Distribution shift
“Third-party content in the input is steering behavior.”	Prompt injection
“The metric is climbing but the goal is not being served.”	Goodhart / proxy gaming
“We cannot see what the model is doing internally.”	Interpretability (mechanistic or representation engineering)
“The explanation looks reasonable but does not match the decision driver.”	Confabulation
“We are not noticing the failure pattern in our aggregate metrics.”	Anomaly detection lag
“The model scores low on the benchmark but the capability shows up in production.”	Capability evaluation gap (possibly sandbagging)

What this lesson builds toward

L4 (alignment, Ch 3.4): the substrate underneath both halves. Even perfect robustness and perfect monitoring do not catch a system pursuing the wrong objective. Specification gaming, proxy gaming, and deceptive alignment all live here.
L5 (safety engineering, Ch 4): the formal version of the Swiss-cheese intuition (defense in depth, fault tree analysis, FMEA, nines of reliability).
L6 (complex systems, Ch 5): the framing for why correct components compose into incorrect systems; relevant when L3’s seams between robustness and monitoring become the failure mode.