
Cheatsheet: AI safety as a field

The four catastrophic-risk categories (Hendrycks, CAIS, 2024)

| Bucket | One-line definition | What changes the dial |
| --- | --- | --- |
| Malicious use | People intentionally using AI to cause harm | Liability rules, access controls, content provenance, abuse detection |
| AI race | Structural competitive pressure to ship before evaluation is done | Deployment moratoria, compute caps, shared eval benchmarks, regulatory deadlines |
| Organizational risks | Accidents inside the labs and companies building AI | Safety-engineering practices, audit functions, post-mortem culture, clear ownership |
| Rogue AIs | Systems pursuing objectives in ways designers did not intend and cannot correct | Alignment research, interpretability, oversight mechanisms, kill-switches |

The buckets are categorically distinct: an intervention that addresses one usually does not address another. The point of the typology is to refuse the single-bucket move where every AI concern collapses into “danger.”

| Property | Discipline | Stance |
| --- | --- | --- |
| Borrows tools | From neighboring fields (safety eng, complex systems, governance, ethics, game theory) | Accumulates supporters |
| Admits uncertainty | Distinguishes well-characterized from contested risks | Flattens the distinction |
| Produces vocabulary | Precise terms that let practitioners disagree productively | Banners that can only be taken or rejected |
| Editorial register | Descriptive, attributed (“the chapter argues”) | Prescriptive (“we should”) |
| Outlasts current debate | Yes (vocabulary is portable) | No (loyalties shift) |

The vocabulary pairs that show up later in the track

  • Robustness failure (model breaks under distribution shift) vs monitoring failure (operators do not notice). Different interventions. → L3
  • Specification gaming (literal-objective optimization that violates spirit) vs proxy gaming (measurable-proxy optimization that diverges from real goal). Distinct, both well-named. → L4
  • Deceptive alignment (the model behaves well in training because it knows it is being trained, not because it shares the intended objective) → L4
  • Defense in depth, nines of reliability, fault-tree analysis (safety-engineering borrowings) → L5
  • Normal accident theory (correct components can compose into incorrect systems) → L6
  • Moral uncertainty (designers may not know the right value-loading approach) → L7
  • Race to the bottom, free-rider, escalation (multi-agent failure modes) → L8
  • Corporate / national / international / compute governance (Hendrycks’ four-layer taxonomy) → L9
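
Two of the safety-engineering borrowings in the list above, nines of reliability and defense in depth, reduce to simple arithmetic, so a small worked sketch may help fix the vocabulary. The function names and the three-layer example are hypothetical and not taken from the track. The layered calculation assumes the layers fail independently, which real systems often violate.

```python
import math


def nines(reliability: float) -> float:
    """Express a reliability in 'nines': 0.9 is one nine, 0.999 is three."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return -math.log10(1.0 - reliability)


def layered_failure_probability(layer_failure_probs: list[float]) -> float:
    """Probability that every layer fails at once.

    ASSUMPTION: layer failures are independent. Correlated failures
    (a shared root cause, for example) make the true figure worse.
    """
    return math.prod(layer_failure_probs)


# Hypothetical example: three safeguards that each miss 10% of problems.
layers = [0.1, 0.1, 0.1]
p_all_fail = layered_failure_probability(layers)

print(f"single layer:  {nines(1 - layers[0]):.2f} nines")
print(f"three layers:  {nines(1 - p_all_fail):.2f} nines (if independent)")
```

Under the independence assumption, stacking three one-nine safeguards yields roughly three nines overall. That is the arithmetic behind defense in depth, and the assumption itself is what a fault-tree analysis would examine.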

A paragraph (6-8 sentences) that meets the L1 bar will:

  • Name the subject in the first sentence
  • Name at least two of the four risk categories
  • Name at least one cross-disciplinary tool the field borrows
  • Name the descriptive-not-prescriptive commitment
  • Close on why “discipline” is the right word and “stance” is the wrong one

The Practice section has a model paragraph; do not read it before writing your own.

| The track does | The track does not |
| --- | --- |
| Survey Hendrycks’ field-framing across 9 lessons / 3 phases | Cover the textbook’s appendices (long-tail distributions, evolutionary game theory, etc.) |
| Anchor each lesson to specific chapters with action-verb capabilities | Take a position on whether AI development should slow down or speed up |
| Use the descriptive, attributed register on every lesson | Substitute its own vocabulary for the field’s published terms |
| Foreshadow the alignment, safety-engineering, and governance lessons | Re-teach neural network mechanics (prereq: T11 + T12) |