Over-refusal is its own failure mode (failing the user). Safety = serves user goals while avoiding active harm.
”Safety is somebody else’s problem.”
Every layer of the AI stack has safety-relevant decisions. Framework, API, application, you.
”AI safety = existential risks.”
Most safety-relevant failures are mundane: subtle misleading, unexpected actions, overstated capability. Everyday, not extreme.
”If it’s not visibly broken, it’s safe.”
Reward hacking and bias propagation are invisible failure modes. The “looks fine” check doesn’t catch them.
”Defenses can be added later.”
Not really. Capability-vs-trust mismatch in particular has to be designed in. Adding defenses to a deployed system is much harder than building them in from the start.
Reward hacking: model optimizing too hard against an imperfect proxy reward; produces shortcuts that score high without delivering what users want.
Prompt injection: untrusted text containing instructions that override user intent.
Data exfiltration: an AI system tricked into sending sensitive data to an attacker, typically via a tool with outbound access.
Tool misuse: AI system using a destructive tool (delete, send, pay) in unintended ways.
Prompt caching side-channel: cache hits leak information about previous queries; mitigated by per-user isolation.
LaaJ bias: position, verbosity, or self-enhancement bias in LLM-as-a-Judge evaluation.
Synthetic preference data: preference labels generated by LaaJ instead of human raters; feeds reward-model training.
Over-refusal: model refusing benign requests because the safety-tuned reward incentivized refusal too broadly.
Capability-vs-trust mismatch: AI system granted capabilities (via tools) whose worst-case exceeds the system’s safety guarantees.
The technical frame and the safety frame are the same picture viewed from different angles. Every AI system has reward signals, untrusted inputs, capabilities, and evaluations. Each is a safety-relevant axis. Asking the five questions consistently is what lets you reason about AI safety without being preachy or perfunctory.