AI safety threads, in brief

What you’ll learn

This is the closing lesson of Track 5: AI Foundations. Until now, the track had no dedicated “AI safety” phase and no standalone safety lesson. That was deliberate: standalone safety lessons tend to read as either preachy or perfunctory, and readers tend to skip them. Instead, safety considerations were woven into the lessons where they actually matter. Phase 4 covered alignment and reward hacking alongside RLHF. Phase 5 covered prompt injection alongside chain-of-thought prompting. Phase 6 covered data exfiltration and tool misuse alongside agent loops. Phase 7 covered evaluation biases alongside LLM-as-a-Judge. Each of these phases carried its own safety thread, introduced at the point in the technical material where it became relevant.

This lesson names those threads so that the safety picture is cohesive and you leave with a frame, not a list. We walk through each phase’s thread, surface the cross-cutting principles (proxy-vs-goal gap, untrusted inputs everywhere, capability-vs-trust mismatch in agents, bias propagation through pipelines, partial-and-stacked defenses), and give you five questions to ask when you encounter an AI system in the wild. By the end, you will be able to reason about AI safety in a structured way, without being preachy or perfunctory. Course materials are at cme295.stanford.edu.

Where this fits

This is the closing lesson of both Phase 7 and Track 5. The previous lesson (New ways to generate) covered speculative decoding and diffusion LLMs as alternatives to standard autoregressive generation. This lesson pulls together the safety threads woven through the track. After this, Track 5 is complete: you have walked from how models read text (Phase 1) through architecture, training, alignment, inference, reasoning, agents, and evaluation, with the safety frame named explicitly here at the end.

Before you start

Prerequisites: all preceding Phase 7 lessons are useful, since this lesson references their content directly, but the lesson is designed to read as a self-contained recap. The listed prerequisite (the immediately previous lesson, New ways to generate) is there for narrative continuity; the actual material draws from Phases 4 to 7.

By the end, you’ll be able to

Identify the safety thread woven through each phase (reward hacking from Phase 4; prompt injection from Phase 5; data exfiltration and tool misuse from Phase 6; evaluation biases from Phase 7)
Recognize the cross-cutting principles that show up across phases (proxy-vs-goal gap, untrusted inputs, capability-vs-trust mismatch, bias propagation, partial-and-stacked defenses)
Apply the five-question safety frame when evaluating an AI system (reward signal, untrusted inputs, capabilities, evaluation, defenses)
Distinguish “safe” from “refuses to be helpful” and recognize over-refusal as its own safety failure mode
Explain why the woven-then-recapped approach works better than standalone safety lessons for foundational AI literacy

Time and difficulty

Read time: about 13 minutes
Practice time: about 12 minutes (a self-check on each phase’s thread, a hands-on exercise applying the five-question frame to real-world AI systems, and flashcards)
Difficulty: standard