Cheatsheet: AI safety as a field
The four catastrophic-risk categories (Hendrycks, CAIS, 2024)
| Bucket | One-line definition | What changes the dial |
|---|---|---|
| Malicious use | People intentionally using AI to cause harm | Liability rules, access controls, content provenance, abuse detection |
| AI race | Structural competitive pressure to ship before evaluation is done | Deployment moratoria, compute caps, shared eval benchmarks, regulatory deadlines |
| Organizational risks | Accidents inside the labs and companies building AI | Safety-engineering practices, audit functions, post-mortem culture, clear ownership |
| Rogue AIs | Systems pursuing objectives in ways designers did not intend and cannot correct | Alignment research, interpretability, oversight mechanisms, kill-switches |
The buckets are categorically distinct: an intervention that addresses one usually does not address another. The point of the typology is to refuse the single-bucket move where every AI concern collapses into “danger.”
Discipline vs stance
| Property | Discipline | Stance |
|---|---|---|
| Borrows tools | From neighboring fields (safety engineering, complex systems, governance, ethics, game theory) | Accumulates supporters |
| Admits uncertainty | Distinguishes well-characterized from contested risks | Flattens the distinction |
| Produces vocabulary | Precise terms that let practitioners disagree productively | Banners that can only be taken or rejected |
| Editorial register | Descriptive, attributed (“the chapter argues”) | Prescriptive (“we should”) |
| Outlasts current debate | Yes (vocabulary is portable) | No (loyalties shift) |
The vocabulary pairs that show up later in the track
- Robustness failure (model breaks under distribution shift) vs monitoring failure (operators do not notice). Different interventions. → L3
- Specification gaming (optimizing the literal objective in a way that violates its spirit) vs proxy gaming (optimizing a measurable proxy that diverges from the real goal). Distinct, both well-named. → L4
- Deceptive alignment (the model behaves well in training because it knows it is being trained, and may behave differently once deployed) → L4
- Defense in depth, nines of reliability, fault-tree analysis (safety-engineering borrowings) → L5
- Normal accident theory (correct components can compose into incorrect systems) → L6
- Moral uncertainty (designers may not know the right value-loading approach) → L7
- Race to the bottom, free-rider, escalation (multi-agent failure modes) → L8
- Corporate / national / international / compute governance (Hendrycks’ four-layer taxonomy) → L9
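
The safety-engineering borrowings in the list above (defense in depth, nines of reliability, fault-tree analysis) all rest on simple probability arithmetic, which can be sketched as follows. This is a minimal illustration, not a method taken from the textbook: the function names `nines`, `and_gate`, and `or_gate` are hypothetical, the failure probabilities are made-up numbers, and every calculation assumes the failures are statistically independent.

```python
import math


def nines(reliability: float) -> float:
    """Convert a success probability into 'nines' (0.999 -> about 3)."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return -math.log10(1.0 - reliability)


def and_gate(failure_probs: list[float]) -> float:
    """Fault-tree AND gate: the top event needs every input to fail.

    ASSUMPTION: inputs fail independently. Under that assumption this is
    also the defense-in-depth calculation: harm gets through only if all
    layers fail at once.
    """
    return math.prod(failure_probs)


def or_gate(failure_probs: list[float]) -> float:
    """Fault-tree OR gate: the top event needs at least one input to fail.

    ASSUMPTION: inputs fail independently.
    """
    return 1.0 - math.prod(1.0 - p for p in failure_probs)


if __name__ == "__main__":
    layers = [0.1, 0.1, 0.1]  # illustrative numbers, not measured data
    p_all_fail = and_gate(layers)
    print(f"all three layers fail: {p_all_fail:.4f}")
    print(f"nines of the layered system: {nines(1.0 - p_all_fail):.2f}")
    print(f"at least one component fails: {or_gate(layers):.4f}")
```

The independence assumption does the heavy lifting here. If the layers share a common cause of failure, the product in `and_gate` understates the real risk, which is one reason layered defenses are usually built from dissimilar mechanisms.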
The paragraph-write capability checklist
A paragraph (6-8 sentences) that meets the L1 bar will:
- Name the subject in the first sentence
- Name at least two of the four risk categories
- Name at least one cross-disciplinary tool the field borrows
- Name the descriptive-not-prescriptive commitment
- Close on why “discipline” is the right word and “stance” is the wrong one
The Practice section has a model paragraph; do not read it before writing your own.
What this track does and does not do
| The track does | The track does not |
|---|---|
| Survey Hendrycks’ field-framing across 9 lessons / 3 phases | Cover the textbook’s appendices (long-tail distributions, evolutionary game theory, etc.) |
| Anchor each lesson to specific chapters with action-verb capabilities | Take a position on whether AI development should slow down or speed up |
| Use the descriptive, attributed register on every lesson | Substitute its own vocabulary for the field’s published terms |
| Foreshadow the alignment, safety-engineering, and governance lessons | Re-teach neural network mechanics (prereq: T11 + T12) |