
Summary: Where to be careful

This is the Track 5 closer. The track did not have a dedicated “AI safety” phase. Instead, safety considerations were woven into each lesson where they became relevant. This lesson names what was woven so the safety picture remains cohesive.

The threads, by phase. Phase 4: reward hacking. Models optimize against imperfect proxy rewards, and the gap between proxy and goal pulls them toward exploitable shortcuts. The lecturer’s clapping-volume analogy captures it: a lecturer rewarded for applause drifts toward jokes. Phase 5: prompt injection. Untrusted text the model ingests (web pages, documents, emails) can contain instructions that override the user’s goals. Phase 6: data exfiltration, tool misuse, and prompt caching as a side channel. Agents widen the attack surface, so the authority granted to each tool must match the safety guarantees behind it. Phase 7: evaluation biases propagate. Position, verbosity, and self-enhancement biases in LLM-as-a-judge (LaaJ) evaluators feed into reward models, which in turn feed into alignment.
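
As a rough illustration of the Phase 7 thread, one common mitigation for position bias is to run the judge twice with the answer order swapped and accept only verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; the two toy judges stand in for a real LLM call and are not from the lesson.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (prompt, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging in both presentation orders."""
    forward = judge(prompt, a, b)   # answer a is shown first
    backward = judge(prompt, b, a)  # answer b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped when the order flipped: do not trust it.
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first slot."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias: prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"
```

With these toy judges, `debiased_preference(always_first, "q", "x", "y")` comes back as a tie, because the verdict depends only on position. The verbosity-biased judge still returns a consistent winner after the swap, which shows why this is a partial defense: order swapping addresses position bias only.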

Cross-cutting principles. Proxy-vs-goal gap. Untrusted inputs everywhere. Capability-vs-trust mismatch in agents. Bias propagation through pipelines. Defenses are partial, so they must stack.
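
To make the capability-vs-trust principle concrete, here is a minimal sketch of a gate that decides whether an agent may call a tool. The `Tool` and `ToolGate` names, and the rule that side-effecting tools need explicit user confirmation, are assumptions chosen for illustration rather than a prescribed design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    name: str
    side_effect: bool  # True if the tool changes state outside the model


@dataclass
class ToolGate:
    """Hypothetical policy object: authority is granted per tool, not per agent."""
    allowed: set[str] = field(default_factory=set)

    def authorize(self, tool: Tool, user_confirmed: bool = False) -> bool:
        # Tools that are not explicitly allowlisted are always refused.
        if tool.name not in self.allowed:
            return False
        # Read-only tools pass; side-effecting tools need the user's say-so.
        if tool.side_effect:
            return user_confirmed
        return True


gate = ToolGate(allowed={"search", "send_email"})
search = Tool("search", side_effect=False)
send_email = Tool("send_email", side_effect=True)
delete_files = Tool("delete_files", side_effect=True)
```

Here `gate.authorize(search)` passes, `gate.authorize(send_email)` is refused until the user confirms, and `delete_files` is refused even with confirmation because it was never allowlisted. A gate like this is one layer among several; it does nothing about what an allowlisted tool is asked to do.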

The five questions to ask. What reward signal? What untrusted text does it ingest? What can it do beyond answering? How is it evaluated? What defenses are in place?
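
One way to keep the five questions in front of you during a review is to record the answers in a small structure and list whatever is still blank. This is only a sketch; the `SafetyReview` class and its field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class SafetyReview:
    """One field per question; an empty string means 'not answered yet'."""
    reward_signal: str = ""
    untrusted_inputs: str = ""
    capabilities: str = ""
    evaluation: str = ""
    defenses: str = ""

    def unanswered(self) -> list[str]:
        # Fields are reported in the order the questions are asked.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


review = SafetyReview(
    reward_signal="preference data from user ratings",
    capabilities="read-only web search",
)
```

Calling `review.unanswered()` on this partially filled review lists the untrusted-inputs, evaluation, and defenses questions as still open.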

This summary is the scan-it-in-five-minutes version. The full lesson walks each phase’s thread in detail, names the cross-cutting principles, and discusses why woven-then-recapped works better than standalone-safety lessons for foundational AI literacy.

  • The track was a safety education even though it did not have a dedicated safety phase. Safety threads were woven into each phase. This lesson names them.
  • Phase 4 thread: reward hacking. RLHF, RLAIF, and DPO all optimize against imperfect proxies for what people actually want: a learned reward model, or the preference data itself in DPO’s case. The lecturer’s clapping-volume analogy: a lecturer who optimizes for clap volume instead of informativeness ends up making jokes. Modern aligned LLMs all carry some reward-hacking signature.
  • Phase 5 thread: prompt injection. Untrusted external text can contain instructions designed to override user intent. Defenses are structural (marker-based separation, output validation, runtime constraints), not behavioral.
  • Phase 6 thread: data exfiltration, tool misuse, and prompt caching as a side channel. An agent granted tool access becomes a risk surface. Tool authority should match safety guarantees. Caches must be isolated per user.
  • Phase 7 thread: bias propagation through evaluation. LLM-as-a-judge (LaaJ) biases (position, verbosity, self-enhancement) feed into synthetic preference data, which feeds into reward models, which feed into aligned models. The chain is real and increasingly load-bearing.
  • Cross-cutting principle: the proxy-vs-goal gap. Reward hacking, biased evaluations, and prompt injection all share this shape: an AI system optimizes against a measurement that imperfectly captures human intent.
  • Cross-cutting principle: untrusted inputs everywhere. Any time an AI system reads text that did not come from the user, that text is potentially adversarial.
  • Cross-cutting principle: capability-vs-trust mismatch in agents. Granting tools grants capability for misuse. The defense discipline is matching authority to safety guarantees.
  • Cross-cutting principle: bias propagation through pipelines. Modern AI systems are stacked; biases at upstream layers propagate downstream.
  • Defenses are partial and must stack. No single mitigation eliminates any failure mode; production systems combine many.
  • Pitfall: conflating safety with refusing to be helpful. Over-refusal is itself a safety failure (failing the user).
  • Pitfall: treating safety as a separate concern from technical practice. They are the same picture from different angles.
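
Three of the structural defenses named in the bullets above (marker-based separation, output validation, and per-user cache isolation) can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the `<untrusted-…>` tag format, and the idea of validating output by checking link hosts against an allowlist. None of these is sufficient alone, in keeping with the point that defenses are partial and must stack.

```python
import hashlib
import re
import secrets


def wrap_untrusted(text: str) -> tuple[str, str]:
    """Fence untrusted text with a random per-request marker.

    The marker is random so the untrusted text cannot predict it
    and forge a closing tag.
    """
    marker = secrets.token_hex(8)
    wrapped = f"<untrusted-{marker}>\n{text}\n</untrusted-{marker}>"
    return wrapped, marker


# Captures the host part of http(s) URLs (everything up to "/" or whitespace).
URL_HOST = re.compile(r"https?://([^/\s]+)")


def output_is_safe(output: str, allowed_hosts: set[str]) -> bool:
    """Reject model output that links to any host outside the allowlist."""
    return all(host.lower() in allowed_hosts for host in URL_HOST.findall(output))


def cache_key(user_id: str, prompt: str) -> str:
    """Per-user cache key: the same prompt maps to different keys per user."""
    uid = user_id.encode("utf-8")
    h = hashlib.sha256()
    # Length-prefix the user id so (user_id, prompt) pairs cannot collide
    # by shifting characters between the two fields.
    h.update(len(uid).to_bytes(4, "big"))
    h.update(uid)
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()
```

For example, `output_is_safe("![x](https://evil.test/?q=secret)", {"docs.example.com"})` is rejected, which blocks one common exfiltration route (smuggling data out through a rendered link). The host check is deliberately conservative and will also reject some benign URLs, such as one followed directly by punctuation.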

After this lesson and the track, you have a working mental model of how AI systems are built, used, and evaluated. The five questions (reward signal, untrusted inputs, capabilities, evaluation, defenses) give you a structured frame for any AI system you encounter. You can ask informed questions about systems other people have built without being either credulous or paranoid.

The technical frame and the safety frame are the same picture viewed from different angles.
Every AI system has reward signals, untrusted inputs, capabilities, and evaluations. Each is a safety-relevant axis.
Asking the five questions consistently is what lets you reason about AI safety without being preachy or perfunctory.