Summary: the four catastrophic risk categories


L2 raises the bar from L1’s recognition (telling the four buckets apart) to defense (naming the bucket, naming the sub-mechanism, naming the intervention lever). The lesson worked through each of Hendrycks’ four buckets in turn.

Malicious use (Ch 1.2) is the intent-driven bucket. Sub-mechanisms: bioweapons (AI lowers the technical barrier to designing or producing biological agents); disinformation (AI removes the per-target labor cost of personalized manipulation); authoritarian control (AI as infrastructure for population-scale surveillance and suppression). The unifying mechanism is amplification of existing harm, not invention of new harm. Intervention levers target the supply chain and the intent: access controls, content provenance, abuse detection, liability rules.

AI race (Ch 1.3) is the structural-pressure bucket. Sub-mechanisms: corporate race (labs cut corners on safety testing to ship before competitors); military AI race (states deploy autonomous systems because political friction is lower when soldiers’ lives are not on the line, with automatic retaliation systems flagged as a subclass); and natural selection on the AI population (selfish AIs willing to break laws or deceive humans can outcompete more restrictive ones). Historical analogies: the nuclear arms race, gain-of-function research. Intervention levers are coordination instruments: deployment moratoria, compute caps, shared evaluation benchmarks, international treaties, liability rules.

Organizational risks (Ch 1.4) is the inside-the-org bucket: catastrophic outcomes that arise even without competitive pressure or malicious intent. Sub-mechanisms: diffused responsibility across complex pipelines; absence of a safety culture (the chapter highlights a questioning attitude as a learnable practice); inadequate response infrastructure (the chapter points to High Reliability Organizations, or HROs, as a reference class). Historical analogies: Challenger (1986), Chernobyl (1986). Intervention levers: safety-culture practices, post-mortem cultures, clear ownership of monitoring functions, HRO-style operational discipline, third-party audits. This bucket overlaps most directly with Hendrycks’ Chapter 4 (Safety Engineering), which is this track’s L5.

Rogue AIs (Ch 1.5) is the calibrated bucket. The chapter asserts that current-day systems already exhibit goal-control problems and asks whether those problems will persist as systems scale up, without claiming that highly capable rogue AI is imminent or inevitable. Sub-mechanisms: specification gaming and control drift (deployed systems internalize unintended behaviors when filtering mechanisms fail); instrumental power-seeking (systems treat gaining more control as useful for their assigned objective); goal drift via intrinsification (environmental conditions that coincide with goal achievement come to be valued for their own sake). Intervention levers: alignment research, interpretability, oversight mechanisms, careful objective design, training and deployment monitoring. L3-L6 of this track work through these levers in detail.

The buckets are categorically distinct: a liability rule aimed at misuse does not slow an AI race; a safety-culture overhaul does not change competitor pressure; an interpretability breakthrough does not stop a state actor; a compute cap does not catch a deployed model drifting from its training distribution. Distinctness is the prerequisite; the connections between buckets (Ch 1.6, briefly previewed in L2) are the next layer.

The L2 capability is the three-step move: bucket, sub-mechanism, lever. The practice set has six headlines to sort with this protocol. From here, L3 enters Hendrycks’ Single-Agent Safety chapter (Ch 3) and works through the specific failure modes (robustness, monitoring, alignment) that the rogue-AI and organizational-risk buckets point to.