AI safety threads: cheatsheet

The one idea that matters

The technical frame and the safety frame are the same picture
viewed from different angles.

Every AI system has: reward signals, untrusted inputs,
capabilities, and evaluations.

Each is a safety-relevant axis.

The threads, by phase

Phase	Safety thread	Canonical example
Phase 4 (How models become helpful)	Reward hacking	Lecturer optimizing for clap volume instead of informative talk
Phase 5 (How we steer models at inference)	Prompt injection	Web page contains “Ignore previous instructions and…” that the model reads
Phase 6 (How models reason and act)	Data exfiltration, tool misuse, prompt caching side-channel	Email agent tricked into sending password to attacker; cache leaks across users
Phase 7 (How we judge models)	Evaluation bias propagation	LaaJ biases → synthetic preference data → reward model → aligned model

Cross-cutting principles

Principle	Pattern
Proxy-vs-goal gap	AI systems optimize against measurements that imperfectly capture intent; drift toward shortcuts
Untrusted inputs everywhere	Anytime an AI system reads text from outside the user, that text is potentially adversarial
Capability-vs-trust mismatch	Granting tools grants capability for misuse; authority should match safety guarantees
Bias propagation through pipelines	Modern AI is stacked; biases at upstream layers propagate downstream
Partial-and-stacked defenses	No mitigation eliminates fully; production systems combine many; ask which defenses, not “is it safe”

The five-question frame

1. What's the REWARD SIGNAL it was trained on?
   → surfaces reward hacking (Phase 4)

2. What UNTRUSTED TEXT does it ingest?
   → surfaces prompt injection (Phase 5)

3. What can it DO beyond answering?
   → surfaces tool misuse + data exfiltration (Phase 6)

4. How is it EVALUATED?
   → surfaces evaluation bias propagation (Phase 7)

5. What DEFENSES are in place?
   → surfaces the cross-cutting frame

Asking consistently is what lets you reason about AI safety without being preachy or perfunctory.

Phase-specific defenses (recap)

Phase 4 (reward hacking)

- KL penalty in PPO (keeps policy close to SFT reference)
- Verifiable rewards where available (math, code with test cases)
- Periodic recalibration as policy drifts
- Caveat: NONE eliminate; expect residual reward-hacking signature

Phase 5 (prompt injection)

- STRUCTURE-AWARE processing (markers separating user instructions
  from external content; never let untrusted text be interpreted
  as instructions)
- Output validation (check what the model is about to do)
- Runtime constraints (limit tool calls without confirmation)
- Caveat: behavioral defenses ("we asked the model to be careful")
  do not work; design-level defenses are required

Phase 6 (agents, tools)

- Scope limits (allow-lists for outbound tools)
- Explicit user confirmation for high-stakes actions
- Per-user cache isolation (no cross-user leakage)
- Audit logging
- Tool authority must match safety guarantees

Phase 7 (evaluation)

- Position-swap verification (run pairwise LaaJ both directions)
- Length penalties + explicit instruction against verbosity bias
- Different judge model than generator (mitigates self-enhancement)
- Periodic human calibration (catches drift)

Worked example: ATM-vs-bank-agent vs research-summarizer

	Open chat assistant	Bank agent with tools	Research summarizer
Reward	RLHF preferences (over-refusal risk)	Customer SAT + fraud safety	Summary quality (LaaJ)
Untrusted	User prompt mostly trusted; integrated content less so	Customer messages possibly injected	Paper itself (highly external)
Capabilities	Text only (low) or tools (medium)	Read accounts, initiate transfers (HIGH)	Read-only (medium prompt-injection surface)
Evaluation	Standard benchmarks + RLHF	Internal KPIs + fraud signals	Gold summaries + LaaJ
Defenses needed	Refusal tuning + injection guards	Scope limits + audit + confirmation	Structure-aware processing + output validation
Most-relevant axis	Phase 4 (over-refusal); Phase 5 with integrations	Phase 6 (capability-vs-trust)	Phase 5 (prompt injection)

Pitfalls to dodge

Pitfall	Reality
”Safe = refuses to do things.”	Over-refusal is its own failure mode (failing the user). Safety = serves user goals while avoiding active harm.
”Safety is somebody else’s problem.”	Every layer of the AI stack has safety-relevant decisions. Framework, API, application, you.
”AI safety = existential risks.”	Most safety-relevant failures are mundane: subtle misleading, unexpected actions, overstated capability. Everyday, not extreme.
”If it’s not visibly broken, it’s safe.”	Reward hacking and bias propagation are invisible failure modes. The “looks fine” check doesn’t catch them.
”Defenses can be added later.”	Not really. Capability-vs-trust mismatch in particular has to be designed in. Adding defenses to a deployed system is much harder than building them in from the start.

Glossary

Reward hacking: model optimizing too hard against an imperfect proxy reward; produces shortcuts that score high without delivering what users want.
Prompt injection: untrusted text containing instructions that override user intent.
Data exfiltration: an AI system tricked into sending sensitive data to an attacker, typically via a tool with outbound access.
Tool misuse: AI system using a destructive tool (delete, send, pay) in unintended ways.
Prompt caching side-channel: cache hits leak information about previous queries; mitigated by per-user isolation.
LaaJ bias: position, verbosity, or self-enhancement bias in LLM-as-a-Judge evaluation.
Synthetic preference data: preference labels generated by LaaJ instead of human raters; feeds reward-model training.
Over-refusal: model refusing benign requests because the safety-tuned reward incentivized refusal too broadly.
Capability-vs-trust mismatch: AI system granted capabilities (via tools) whose worst-case exceeds the system’s safety guarantees.

The technical frame and the safety frame are the same picture viewed from different angles.
Every AI system has reward signals, untrusted inputs, capabilities, and evaluations. Each is a safety-relevant axis.
Asking the five questions consistently is what lets you reason about AI safety without being preachy or perfunctory.