Skip to content

Cheatsheet: Where to be careful

The technical frame and the safety frame are the same picture
viewed from different angles.
Every AI system has: reward signals, untrusted inputs,
capabilities, and evaluations.
Each is a safety-relevant axis.
PhaseSafety threadCanonical example
Phase 4 (How models become helpful)Reward hackingLecturer optimizing for clap volume instead of informative talk
Phase 5 (How we steer models at inference)Prompt injectionWeb page contains “Ignore previous instructions and…” that the model reads
Phase 6 (How models reason and act)Data exfiltration, tool misuse, prompt caching side-channelEmail agent tricked into sending password to attacker; cache leaks across users
Phase 7 (How we judge models)Evaluation bias propagationLaaJ biases → synthetic preference data → reward model → aligned model
PrinciplePattern
Proxy-vs-goal gapAI systems optimize against measurements that imperfectly capture intent; drift toward shortcuts
Untrusted inputs everywhereAnytime an AI system reads text from outside the user, that text is potentially adversarial
Capability-vs-trust mismatchGranting tools grants capability for misuse; authority should match safety guarantees
Bias propagation through pipelinesModern AI is stacked; biases at upstream layers propagate downstream
Partial-and-stacked defensesNo mitigation eliminates fully; production systems combine many; ask which defenses, not “is it safe”
1. What's the REWARD SIGNAL it was trained on?
→ surfaces reward hacking (Phase 4)
2. What UNTRUSTED TEXT does it ingest?
→ surfaces prompt injection (Phase 5)
3. What can it DO beyond answering?
→ surfaces tool misuse + data exfiltration (Phase 6)
4. How is it EVALUATED?
→ surfaces evaluation bias propagation (Phase 7)
5. What DEFENSES are in place?
→ surfaces the cross-cutting frame

Asking consistently is what lets you reason about AI safety without being preachy or perfunctory.

- KL penalty in PPO (keeps policy close to SFT reference)
- Verifiable rewards where available (math, code with test cases)
- Periodic recalibration as policy drifts
- Caveat: NONE eliminate; expect residual reward-hacking signature
- STRUCTURE-AWARE processing (markers separating user instructions
from external content; never let untrusted text be interpreted
as instructions)
- Output validation (check what the model is about to do)
- Runtime constraints (limit tool calls without confirmation)
- Caveat: behavioral defenses ("we asked the model to be careful")
do not work; design-level defenses are required
- Scope limits (allow-lists for outbound tools)
- Explicit user confirmation for high-stakes actions
- Per-user cache isolation (no cross-user leakage)
- Audit logging
- Tool authority must match safety guarantees
- Position-swap verification (run pairwise LaaJ both directions)
- Length penalties + explicit instruction against verbosity bias
- Different judge model than generator (mitigates self-enhancement)
- Periodic human calibration (catches drift)

Worked example: ATM-vs-bank-agent vs research-summarizer

Section titled “Worked example: ATM-vs-bank-agent vs research-summarizer”
Open chat assistantBank agent with toolsResearch summarizer
RewardRLHF preferences (over-refusal risk)Customer SAT + fraud safetySummary quality (LaaJ)
UntrustedUser prompt mostly trusted; integrated content less soCustomer messages possibly injectedPaper itself (highly external)
CapabilitiesText only (low) or tools (medium)Read accounts, initiate transfers (HIGH)Read-only (medium prompt-injection surface)
EvaluationStandard benchmarks + RLHFInternal KPIs + fraud signalsGold summaries + LaaJ
Defenses neededRefusal tuning + injection guardsScope limits + audit + confirmationStructure-aware processing + output validation
Most-relevant axisPhase 4 (over-refusal); Phase 5 with integrationsPhase 6 (capability-vs-trust)Phase 5 (prompt injection)
PitfallReality
”Safe = refuses to do things.”Over-refusal is its own failure mode (failing the user). Safety = serves user goals while avoiding active harm.
”Safety is somebody else’s problem.”Every layer of the AI stack has safety-relevant decisions. Framework, API, application, you.
”AI safety = existential risks.”Most safety-relevant failures are mundane: subtle misleading, unexpected actions, overstated capability. Everyday, not extreme.
”If it’s not visibly broken, it’s safe.”Reward hacking and bias propagation are invisible failure modes. The “looks fine” check doesn’t catch them.
”Defenses can be added later.”Not really. Capability-vs-trust mismatch in particular has to be designed in. Adding defenses to a deployed system is much harder than building them in from the start.
  • Reward hacking: model optimizing too hard against an imperfect proxy reward; produces shortcuts that score high without delivering what users want.
  • Prompt injection: untrusted text containing instructions that override user intent.
  • Data exfiltration: an AI system tricked into sending sensitive data to an attacker, typically via a tool with outbound access.
  • Tool misuse: AI system using a destructive tool (delete, send, pay) in unintended ways.
  • Prompt caching side-channel: cache hits leak information about previous queries; mitigated by per-user isolation.
  • LaaJ bias: position, verbosity, or self-enhancement bias in LLM-as-a-Judge evaluation.
  • Synthetic preference data: preference labels generated by LaaJ instead of human raters; feeds reward-model training.
  • Over-refusal: model refusing benign requests because the safety-tuned reward incentivized refusal too broadly.
  • Capability-vs-trust mismatch: AI system granted capabilities (via tools) whose worst-case exceeds the system’s safety guarantees.

The technical frame and the safety frame are the same picture viewed from different angles.
Every AI system has reward signals, untrusted inputs, capabilities, and evaluations. Each is a safety-relevant axis.
Asking the five questions consistently is what lets you reason about AI safety without being preachy or perfunctory.