References: Where to be careful

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson is a synthesis across all of Stanford CME 295. It does not
correspond to a single lecture timestamp; instead it references safety-
relevant material from Lectures 5 (RLHF and reward hacking), 7 (agentic
LLMs and data exfiltration), 8 (LLM-as-a-judge, or LaaJ, biases), and others. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

This lesson is a recap. Each safety thread it surfaces was first introduced in a specific in-track lesson; the references for those threads live there.

This curriculum did not have a dedicated safety reading list. If you want one, a few places to start:

  • “Concrete Problems in AI Safety”, Amodei et al., 2016. The classic introduction to AI safety framed as a set of concrete engineering research problems. Covers negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift. Pre-LLM, but the principles transfer cleanly. Worth reading early in any AI-safety reading list.

  • “A Comprehensive Survey of LLM Alignment Techniques”, Wang et al., 2024. Surveys the alignment-techniques landscape. Useful as a one-stop overview of how RLHF, DPO, RLAIF, and Constitutional AI relate.

  • “LLM Agents can Autonomously Hack Websites”, Fang et al., 2024. Documents what the data-exfiltration concern looks like as a measured capability. Useful for grounding the agent-safety thread in empirical results.

  • “Safety AI Lab” and similar research groups publish regularly on agent safety, evaluation, and alignment. Vendors (Anthropic, OpenAI, DeepMind) also publish safety-relevant work on their blogs that’s worth tracking.

  • The relationship between AI safety and security. Many of the safety threads in this lesson (prompt injection, data exfiltration, tool misuse) are also security concerns, and the two communities increasingly overlap. Search terms: “AI security,” “LLM security,” “agent security.” OWASP’s Top 10 for LLM Applications is a useful entry point.

  • Production safety frameworks. Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, and similar commitments are worth understanding if you build with frontier APIs. They constrain what kinds of capabilities the providers will deploy and under what conditions.

  • The “alignment-vs-control” debate. A research-community discussion about whether AI safety is best pursued through alignment (training models to want what humans want) or control (limiting what models can do regardless of what they want). Worth knowing as background for reading current research; both perspectives have substantial advocates.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Safety topics are folded into the lecture-specific sections rather than having a dedicated “safety” cheatsheet section. The material this lesson recaps is distributed across the cheatsheet’s preference-tuning, agent, and evaluation sections.

The AI safety community has substantial public discussion across blogs, podcasts, and newsletters. Durable resources are still consolidating; the field moves quickly enough that what’s worth reading today may be superseded next quarter. The references above are the ones most likely to age well.

This is the closing lesson of Track 5. Thanks for reading.