Cheatsheet: monitoring and robustness

| Sub-mechanism | What happens | Example | Lever |
| --- | --- | --- | --- |
| Adversarial robustness | Small input perturbations flip behavior | Image-classifier flips label under imperceptible noise | Adversarial training; expensive and partial |
| Distribution-shift | Model trained on A fails on B | Hospital-A X-ray model degrades on Hospital-B machines | Distribution-shift testing pre-deploy; continuous drift monitoring post-deploy |
| Prompt injection (LLM) | Input contains hijacking instructions | Doc-page comment instructs an agent to exfiltrate secrets | Trust-boundary enforcement (system trusted, retrieved content not); output filtering |
| Trojan attacks (training-time) | Poisoned training data plants triggers | Sticker pattern flips classification of one species | Training-data integrity, post-hoc trojan detection |
| Goodhart / proxy gaming | Metric becomes target, stops measuring goal | Rat-tail bounty incentivizes rat farming | Multi-metric monitoring; periodic proxy/goal divergence audits |

| Sub-mechanism | Strength | Weakness | Lever |
| --- | --- | --- | --- |
| Mechanistic interpretability | Faithful when it succeeds | Rare, per-model, does not yet scale | Continued research investment; small-model debugging |
| Representation engineering | Scales to frontier models | Indirect (learns what is represented, not how computed) | Concept-vector probes as control surfaces |
| Anomaly detection | Scales, aggregate-level | Defines anomaly against past behavior; lags novel failures | Multi-baseline monitoring; periodic re-baselining |
| Capability evaluation | Surfaces latent capabilities | Limited by evaluator imagination; sandbagging | Pair benchmarks with red-team campaigns; format-diverse evals |
| Confabulation guard | (None, this is a failure mode) | Model explanations may not be faithful to internal process | Never use self-explanation as ground truth for accountability |
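
The "continuous drift monitoring" lever can be made concrete with a drift statistic computed on a rolling window of production inputs. The sketch below uses the Population Stability Index (PSI), which is one common choice and not something this cheatsheet prescribes. The function names, the bin count, and the 0.1 / 0.25 cut-offs are assumptions for illustration; the cut-offs are a widely quoted rule of thumb, not a standard.

```python
import math
from bisect import bisect_right


def psi(baseline, live, n_bins=10, eps=1e-6):
    """Population Stability Index between two 1-D samples.

    Bin edges are taken from baseline quantiles, so each bin holds
    roughly the same share of the baseline sample.
    """
    ordered = sorted(baseline)
    # n_bins - 1 interior edges at evenly spaced baseline quantiles.
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[bisect_right(edges, x)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(baseline), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def drift_status(score, watch=0.1, escalate=0.25):
    """Map a PSI score to an action label (thresholds are assumptions)."""
    if score < watch:
        return "stable"
    if score < escalate:
        return "watch"
    return "escalate"


# Synthetic data: the live sample is the baseline shifted upward by 5.
baseline = [i / 100 for i in range(1000)]
shifted = [x + 5 for x in baseline]
same_status = drift_status(psi(baseline, baseline))
shifted_status = drift_status(psi(baseline, shifted))
```

Because the statistic is defined against a stored baseline, it shares the anomaly-detection weakness listed in the table: a stale baseline hides slow change, which is why periodic re-baselining appears as a lever.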

The four-step classify-and-defend protocol (L3 capability)

Given an incident report:

  1. Identify the harm. What specifically went wrong?
  2. Name the primary half. Robustness (the system broke), monitoring (no one noticed), or both (and if both, which half dominated the cost)?
  3. Name the sub-mechanism. Pick from the Sub-mechanism column of that half’s table above.
  4. Name a lever. Take it from the Lever column of the same table. Be specific: “more monitoring” is incomplete; “continuous drift monitoring with a 72-hour escalation threshold” is complete.
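
The four steps can be captured as a structured record that refuses incomplete answers. This is a minimal sketch, not a prescribed tool: the class name, field names, the list of vague levers, and the four-word minimum for a lever are all hypothetical choices made for illustration. The sub-mechanism names are copied from the two tables above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sub-mechanisms per half, copied from the cheatsheet tables.
SUB_MECHANISMS = {
    "robustness": {
        "adversarial robustness", "distribution shift", "prompt injection",
        "trojan attack", "goodhart / proxy gaming",
    },
    "monitoring": {
        "mechanistic interpretability", "representation engineering",
        "anomaly detection", "capability evaluation", "confabulation",
    },
}
# Hypothetical examples of levers that are too generic to act on.
VAGUE_LEVERS = {"more monitoring", "more testing", "better robustness"}


@dataclass(frozen=True)
class IncidentClassification:
    harm: str
    primary_half: str                    # "robustness", "monitoring", or "both"
    sub_mechanism: str
    lever: str
    dominant_half: Optional[str] = None  # needed only when primary_half is "both"

    def problems(self) -> List[str]:
        """Return one message per protocol step that is incomplete."""
        issues = []
        if not self.harm.strip():
            issues.append("step 1: describe the specific harm")
        if self.primary_half not in ("robustness", "monitoring", "both"):
            issues.append("step 2: primary half must be robustness, monitoring, or both")
        elif self.primary_half == "both" and self.dominant_half not in SUB_MECHANISMS:
            issues.append("step 2: for 'both', name the half that dominated the cost")
        halves = list(SUB_MECHANISMS) if self.primary_half == "both" else [self.primary_half]
        allowed = set().union(*(SUB_MECHANISMS.get(h, set()) for h in halves))
        if self.sub_mechanism.lower() not in allowed:
            issues.append("step 3: sub-mechanism is not in the matching table")
        lever = self.lever.strip().lower()
        # Crude specificity heuristic (an assumption): reject known-vague
        # phrases and anything shorter than four words.
        if lever in VAGUE_LEVERS or len(lever.split()) < 4:
            issues.append("step 4: lever is too vague; add a mechanism or threshold")
        return issues


good = IncidentClassification(
    harm="Hospital-A X-ray model degrades on Hospital-B machines",
    primary_half="robustness",
    sub_mechanism="Distribution shift",
    lever="continuous drift monitoring with a 72-hour escalation threshold",
)
vague = IncidentClassification(
    harm="Hospital-A X-ray model degrades on Hospital-B machines",
    primary_half="robustness",
    sub_mechanism="Distribution shift",
    lever="more monitoring",
)
```

Calling `good.problems()` returns an empty list, while `vague.problems()` returns a single step-4 message. The word-count check is deliberately simple; a real review would judge specificity by whether the lever names a mechanism, an owner, or a threshold.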

This is Goodhart’s law: any metric chosen as a measure of a goal will, once made a target, eventually be optimized in ways that diverge from the goal. The robustness failure is that the system stays robust with respect to the proxy while producing outcomes the designer did not want.
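
A toy model shows what a periodic proxy/goal divergence audit looks for. Everything here is hypothetical: the `proxy` and `goal` functions are invented stand-ins (the proxy grows without limit, the goal peaks and then declines), and the trailing-window rule is one simple way to define divergence, not an established method.

```python
def proxy(effort: int) -> float:
    """The rewarded metric: keeps rising as effort increases."""
    return float(effort)


def goal(effort: int) -> float:
    """The outcome actually wanted: peaks at effort = 5, then declines."""
    return effort - effort * effort / 10


def first_divergence(history, window=3):
    """Return the first step at which, compared with `window` steps earlier,
    the proxy has risen while the goal has fallen. Return None if never."""
    for t in range(window, len(history)):
        proxy_then, goal_then = history[t - window]
        proxy_now, goal_now = history[t]
        if proxy_now > proxy_then and goal_now < goal_then:
            return t
    return None


# An optimizer that sees only the proxy keeps increasing effort.
history = [(proxy(e), goal(e)) for e in range(11)]
flagged_at = first_divergence(history)
```

In this toy run the goal peaks at step 5 but the audit does not flag until step 7, because the three-step window has to span the decline. The audit only works if the goal is measured independently of the proxy, which is the reason the lever calls for multi-metric monitoring.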

Why both halves are needed (Swiss-cheese intuition)

| Layer | Holes (failure mode it does NOT catch) |
| --- | --- |
| Robustness | Failure modes that emerge slowly under conditions seen in training |
| Monitoring | Failures that the observation infrastructure does not signal or that look normal at aggregate |

Robustness without monitoring: failures eventually happen and go undetected for as long as the monitoring layer is missing. Monitoring without robustness: failures happen often and the monitoring catches them, but the underlying system stays brittle. The layers compose because their holes do not line up. L5 develops the formal version (defense in depth, fault tree analysis).
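
The claim that the layers compose because their holes do not line up can be checked with a little arithmetic. This is a sketch under a simplifying assumption: each layer is reduced to a single miss probability, and the function names and example numbers are illustrative, not measured values.

```python
def undetected_if_independent(p_robustness_miss: float, p_monitoring_miss: float) -> float:
    """Probability that a failure passes both layers when their holes
    are statistically independent (they do not line up)."""
    return p_robustness_miss * p_monitoring_miss


def undetected_if_aligned(p_robustness_miss: float, p_monitoring_miss: float) -> float:
    """Worst case: the holes overlap as much as possible, so the second
    layer adds nothing beyond the smaller of the two miss rates."""
    return min(p_robustness_miss, p_monitoring_miss)


# Illustrative numbers only: robustness misses 10% of failure modes,
# monitoring misses 20%.
independent = undetected_if_independent(0.10, 0.20)
aligned = undetected_if_aligned(0.10, 0.20)
```

With these numbers the independent case lets through roughly 2% of failures, while the fully aligned case lets through 10%, the same as the better layer alone. The benefit of stacking layers comes from how uncorrelated their blind spots are, not from the number of layers.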

| When the question is | The sub-mechanism is usually |
| --- | --- |
| “The system worked fine in testing but breaks on this specific input class.” | Adversarial robustness or trojan |
| “The system worked in lab but underperforms in production.” | Distribution shift |
| “Third-party content in the input is steering behavior.” | Prompt injection |
| “The metric is climbing but the goal is not being served.” | Goodhart / proxy gaming |
| “We cannot see what the model is doing internally.” | Interpretability (mechanistic or representation engineering) |
| “The explanation looks reasonable but does not match the decision driver.” | Confabulation |
| “We are not noticing the failure pattern in our aggregate metrics.” | Anomaly detection lag |
| “The model scores low on the benchmark but the capability shows up in production.” | Capability evaluation gap (possibly sandbagging) |

  • L4 (alignment, Ch 3.4): the substrate underneath both halves. Even perfect robustness and perfect monitoring do not catch a system pursuing the wrong objective. Specification gaming, proxy gaming, deceptive alignment all live here.
  • L5 (safety engineering, Ch 4): the formal version of the Swiss-cheese intuition (defense in depth, fault tree analysis, FMEA, nines of reliability).
  • L6 (complex systems, Ch 5): the framing for why correct components compose into incorrect systems; relevant when L3’s seams between robustness and monitoring become the failure mode.