References: Where to be careful

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson is a synthesis across all of Stanford CME 295. It does not
correspond to a single lecture timestamp; instead it references safety-
relevant material from Lectures 5 (RLHF and reward hacking), 7 (agentic
LLMs and data exfiltration), 8 (LaaJ biases), and others. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

Cross-references to in-track lessons

This lesson is a recap. Each safety thread it surfaces was first introduced in a specific in-track lesson; the references for those threads live there.

Phase 4 thread (reward hacking): see How RLHF and DPO align models and its References page for the original PPO paper, DPO paper, and InstructGPT (the paper that brought RLHF to LLMs at scale).
Phase 5 thread (prompt injection): the prompt-injection material was woven into How chain of thought makes models think out loud and How few-shot examples teach in context. The canonical research framing is in the agent-loops references.
Phase 6 thread (data exfiltration, tool misuse, prompt caching): see How agent loops work and its References page for the agent-safety literature and the late-2025 Anthropic cyber attack disclosure.
Phase 7 thread (evaluation biases): see How we evaluate models, LLM-as-a-Judge and its References page for the LaaJ-bias literature.

Going deeper on AI safety

This curriculum did not have a dedicated safety reading list. If you want one, a few places to start:

“Concrete Problems in AI Safety”, Amodei et al., 2016. The classic introduction to AI safety framed as concrete engineering research problems. Covers avoiding negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift. Pre-LLM, but the principles transfer cleanly. A good first entry on any AI safety reading list.
“A Comprehensive Survey of LLM Alignment Techniques”, Wang et al., 2024. Surveys the alignment-techniques landscape. Useful as a one-stop overview of how RLHF, DPO, RLAIF, and Constitutional AI relate.
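“LLM Agents can Autonomously Hack Websites”, Fang et al., 2024. Measures how well LLM agents can carry out website attacks such as SQL injection without human guidance, which shows what the data-exfiltration concern looks like as a measured capability. Useful for grounding the agent-safety thread in empirical results.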
“Safety AI Lab” and similar research groups publish regularly on agent safety, evaluation, and alignment. Vendors (Anthropic, OpenAI, DeepMind) also publish safety-relevant work on their blogs that is worth tracking.

Adjacent topics

The relationship between AI safety and security. Many of the safety threads in this lesson (prompt injection, data exfiltration, tool misuse) are also security concerns. The two communities increasingly overlap. Search terms: “AI security,” “LLM security,” “agent security.” OWASP’s LLM Top 10 is a useful entry point.
Production safety frameworks. Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, and similar commitments are worth understanding if you build with frontier APIs. They constrain what kinds of capabilities the providers will deploy and under what conditions.
The “alignment-vs-control” debate. A research-community discussion about whether AI safety is best pursued through alignment (training models to want what humans want) or control (limiting what models can do regardless of what they want). Worth knowing as background for reading current research; both perspectives have substantial advocates.

Stanford CME 295 cheatsheet

The cheatsheet is by the Amidi twins and is MIT-licensed. Safety topics are folded into the lecture-specific sections; there is no dedicated “safety” section. The material this lesson recaps is distributed across the cheatsheet’s preference-tuning, agent, and evaluation sections.

Community discussion

The AI safety community has substantial public discussion across blogs, podcasts, and newsletters. A stable set of durable resources has not yet emerged; the field moves quickly enough that what’s worth reading today may be superseded next quarter. The references above are the ones most likely to age well.

This is the closing lesson of Track 5. Thanks for reading.