AI safety as a field: cheatsheet

The four catastrophic-risk categories (Hendrycks, CAIS, 2024)

Bucket	One-line definition	What changes the dial
Malicious use	People intentionally using AI to cause harm	Liability rules, access controls, content provenance, abuse detection
AI race	Structural competitive pressure to ship before evaluation is done	Deployment moratoria, compute caps, shared eval benchmarks, regulatory deadlines
Organizational risks	Accidents inside the labs and companies building AI	Safety-engineering practices, audit functions, post-mortem culture, clear ownership
Rogue AIs	Systems pursuing objectives in ways designers did not intend and cannot correct	Alignment research, interpretability, oversight mechanisms, kill-switches

The buckets are distinct in the way that matters for intervention: a measure that addresses one usually does not address another. The point of the typology is to block the single-bucket move, in which every AI concern collapses into an undifferentiated “danger.”

Discipline vs stance

Property	Discipline	Stance
Borrows tools	From neighboring fields (safety eng, complex systems, governance, ethics, game theory)	Accumulates supporters
Admits uncertainty	Distinguishes well-characterized from contested risks	Flattens the distinction
Produces vocabulary	Precise terms that let practitioners disagree productively	Banners that can only be taken or rejected
Editorial register	Descriptive, attributed (“the chapter argues”)	Prescriptive (“we should”)
Outlasts current debate	Yes (vocabulary is portable)	No (loyalties shift)

The vocabulary pairs that show up later in the track

Robustness failure (model breaks under distribution shift) vs monitoring failure (operators do not notice). Different interventions. → L3
Specification gaming (optimizing the literal objective in a way that violates its spirit) vs proxy gaming (optimizing a measurable proxy that diverges from the real goal). Distinct, and both well named. → L4
Deceptive alignment (the model behaves well in training because it recognizes it is being trained, while retaining objectives it would pursue once oversight lapses) → L4
Defense in depth, nines of reliability, fault-tree analysis (safety-engineering borrowings) → L5
Normal accident theory (in complex, tightly coupled systems, individually correct components can interact to produce system-level accidents) → L6
Moral uncertainty (designers may not know which moral view is correct, and so which values to load or how to load them) → L7
Race to the bottom, free-rider, escalation (multi-agent failure modes) → L8
Corporate / national / international / compute governance (Hendrycks’ four-layer taxonomy) → L9
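
The safety-engineering borrowings in the list above (defense in depth, nines of reliability, fault-tree analysis) rest on simple arithmetic, which the following Python sketch tries to make concrete. It is illustrative only: the function names are hypothetical rather than taken from the textbook, and it assumes failures are independent. Real layers often share failure causes, so the independent-case numbers are optimistic.

```python
import math


def nines(reliability: float) -> float:
    """Count the 'nines' in a reliability figure, e.g. 0.999 -> about 3."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return -math.log10(1.0 - reliability)


def and_gate(failure_probs: list[float]) -> float:
    """Fault-tree AND gate: the top event needs every input to fail.

    This is the defense-in-depth case. ASSUMPTION: failures are independent.
    """
    return math.prod(failure_probs)


def or_gate(failure_probs: list[float]) -> float:
    """Fault-tree OR gate: any single input failure causes the top event.

    ASSUMPTION: failures are independent.
    """
    return 1.0 - math.prod(1.0 - p for p in failure_probs)


if __name__ == "__main__":
    # Three hypothetical safeguards, each missing 10% of problems.
    layers = [0.1, 0.1, 0.1]
    p_all_fail = and_gate(layers)
    print(round(p_all_fail, 6))               # 0.001
    print(round(nines(1.0 - p_all_fail), 6))  # 3.0
    # Two hypothetical single points of failure, each failing 10% of the time.
    print(round(or_gate([0.1, 0.1]), 6))      # 0.19
```

Under these assumptions, stacking three weak layers buys roughly three nines, while two single points of failure in series nearly double the chance of the top event. The contrast is why the choice between AND and OR structure matters more than the quality of any one component.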

The paragraph-writing capability checklist

A paragraph (6-8 sentences) that meets the L1 bar will:

Name the subject in the first sentence
Name at least two of the four risk categories
Name at least one cross-disciplinary tool the field borrows
Name the descriptive-not-prescriptive commitment
Close on why “discipline” is the right word and “stance” is the wrong one

The Practice section has a model paragraph; do not read it before writing your own.

What this track does and does not do

The track does	The track does not
Survey Hendrycks’ field-framing across 9 lessons / 3 phases	Cover the textbook’s appendices (long-tail distributions, evolutionary game theory, etc.)
Anchor each lesson to specific chapters with action-verb capabilities	Take a position on whether AI development should slow down or speed up
Use the descriptive, attributed register on every lesson	Substitute its own vocabulary for the field’s published terms
Foreshadow the alignment, safety-engineering, and governance lessons	Re-teach neural network mechanics (prereq: T11 + T12)