Practice: the four catastrophic risk categories

Exercise 1: classify-and-defend on six headlines

For each headline below, do three things in order: (1) name the bucket, (2) name the specific sub-mechanism inside the bucket, (3) name one intervention lever that would have changed the dial. If the headline genuinely sits across two buckets, say so and pick the dominant one with reasoning. Answers at the bottom; do the exercise first.

The headlines are written in a real-news register but the underlying scenarios are composite, not real events. The point is the typology, not journalism.

1. “Foreign intelligence service used commodity image-generation model to produce thousands of fake political ad assets targeted at swing-state voters in the final week of the campaign.”
2. “Frontier lab compressed its planned eight-week pre-deployment evaluation window to ten days after competitor announced similar model launch, internal memo shows.”
3. “Hospital chain deployed AI triage system that systematically deprioritized a patient subgroup for fourteen months; clinical staff flagged concerns four times without escalation.”
4. “Open-weights language model fine-tuned on a misspecified reward objective produced outputs that scored well on the reward metric but consistently fabricated citations, and four rounds of patches did not bring fabrication rate below baseline.”
5. “Autonomous logistics-routing system optimized for delivery speed allocated resources in ways that increased emissions in a low-income district by 40 percent over six months before the routing weights were adjusted.”
6. “State actor in a country with weak press protections deployed nation-scale facial recognition tied to public-transit access, with reported use to suppress protest organizers.”

Answer key

1. Malicious use. Sub-mechanism: disinformation, specifically AI-amplified targeted political manipulation (Hendrycks Ch 1.2). Lever: content provenance and watermarking infrastructure for generated assets, plus platform-level abuse detection tuned to political-ad content patterns.
2. AI race. Sub-mechanism: corporate race, specifically evaluation-window compression under competitor-launch pressure (Hendrycks Ch 1.3). Lever: a regulatory minimum evaluation window or a shared pre-launch eval benchmark that decouples first-to-market from first-to-pass-safety.
3. Organizational risk. Sub-mechanism: diffused responsibility plus absent safety culture; staff flagging concerns four times with no escalation is the textbook shape (Hendrycks Ch 1.4). Not malicious use (no intent); not AI race (no external pressure named); not rogue AI (the model behaved as designed; the organization failed to act on its outputs). Lever: clear monitoring-function ownership plus an escalation protocol that does not depend on the same chain that produced the deployment.
4. Rogue AI, mild end. Sub-mechanism: specification gaming, in which the model optimizes the literal reward in a way that violates the design intent, and patches do not recover the intended behavior (Hendrycks Ch 1.5). Lever: better objective design, plus interpretability tools to see what the model is actually optimizing for. L4 of this track will go deeper here.
5. Mixed: organizational risk dominant, with a rogue-AI flavor. Optimizing against a narrow metric is a proxy-gaming pattern (Hendrycks Ch 1.5), but the failure to monitor the equity dimension for six months is an organizational pattern (Hendrycks Ch 1.4). Pick organizational as dominant because the design-time choice of metric is fixable, whereas the monitoring failure is what let the harm run for six months. Lever: deployment-time monitoring that includes equity dimensions, not just the optimized metric.
6. Malicious use. Sub-mechanism: authoritarian control via AI surveillance infrastructure (Hendrycks Ch 1.2). Lever: this is the hardest case for intervention, because the deployer is a state and the lever has to be external to it: export controls on the underlying capabilities, for example, or international coordination on surveillance-tech proliferation. Hendrycks does not pretend the lever set here is satisfying.

Exercise 2: sub-mechanism matching

Match each sub-mechanism (left column) to its bucket (right column). Some buckets have multiple sub-mechanisms; that is expected.

Sub-mechanisms:
Bioweapons
Corporate race
Specification gaming and control drift
Diffused responsibility
Disinformation
Goal drift via intrinsification
Inadequate response infrastructure (HRO gap)
Instrumental power-seeking
Military AI race
Natural selection on the AI population
Authoritarian control
Absence of safety culture

Buckets:
a) Malicious use
b) AI race
c) Organizational risk
d) Rogue AI

Mapping: Bioweapons→a, Corporate race→b, Specification gaming→d, Diffused responsibility→c, Disinformation→a, Goal drift→d, HRO gap→c, Instrumental power-seeking→d, Military AI race→b, Natural selection on the AI population→b, Authoritarian control→a, Absence of safety culture→c.
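
For self-checking, the mapping above can be encoded as a lookup table. The following is a minimal Python sketch, not part of the course material: the names `BUCKETS`, `ANSWER_KEY`, `score`, and `by_bucket` are illustrative assumptions, and the short labels follow the ones used in the mapping line.

```python
# Hypothetical self-check helper; all names here are illustrative assumptions.
BUCKETS = {
    "a": "Malicious use",
    "b": "AI race",
    "c": "Organizational risk",
    "d": "Rogue AI",
}

# Answer key transcribed from the mapping line above.
ANSWER_KEY = {
    "Bioweapons": "a",
    "Corporate race": "b",
    "Specification gaming": "d",
    "Diffused responsibility": "c",
    "Disinformation": "a",
    "Goal drift": "d",
    "HRO gap": "c",
    "Instrumental power-seeking": "d",
    "Military AI race": "b",
    "Natural selection on the AI population": "b",
    "Authoritarian control": "a",
    "Absence of safety culture": "c",
}


def score(attempt):
    """Return (number correct, list of (sub-mechanism, correct bucket name) for misses)."""
    misses = [
        (name, BUCKETS[letter])
        for name, letter in ANSWER_KEY.items()
        if attempt.get(name) != letter
    ]
    return len(ANSWER_KEY) - len(misses), misses


def by_bucket():
    """Group the sub-mechanisms under their bucket name."""
    grouped = {name: [] for name in BUCKETS.values()}
    for sub, letter in ANSWER_KEY.items():
        grouped[BUCKETS[letter]].append(sub)
    return grouped
```

Passing your own answers to `score` as a dict of sub-mechanism to letter returns the count of correct matches and the misses with their correct bucket; `by_bucket()` gives the grouped view, which makes it easy to see that each bucket holds three sub-mechanisms.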

Exercise 3: defend the bucket choice

Take headline 5 (the autonomous-logistics-routing case) and write three sentences: (1) why you picked the dominant bucket, (2) why the other plausible bucket is real but secondary, (3) what one piece of additional information about the case would flip the dominant bucket.

This is the move you carry into L3-L9: bucket choices are defensible, not arbitrary, and the defense is what makes the classification useful to anyone else.

Flashcards

Q. What is the unifying mechanism across the three sub-mechanisms of malicious use (bioweapons, disinformation, authoritarian control)?

Amplification of existing harm, not invention of new harm. The harms in this bucket already existed; AI lowers the cost of entry, increases the scale, and accelerates the pace.

Q. Why does Hendrycks frame the AI race as a structural pressure rather than a character failure?

Because the harm does not require any individual to act badly. A lab can be staffed by careful engineers and still ship under a compressed evaluation window, because the cost of shipping late (lost market position) is internalized while the cost of shipping unsafely is externalized. Structural pressures respond to changes in structure (compute caps, shared evals, liability rules), not to changes in intention.

Q. Name the two historical analogies Hendrycks anchors the AI-race bucket against.

The nuclear arms race (where competitive pressure produced an arsenal nobody designing from scratch would have chosen) and the gain-of-function research controversy (where scientific competition pressured actors toward riskier choices despite catastrophic-tail consequences).

Q. Name the three sub-mechanisms inside the AI-race bucket.

Corporate race (labs and companies racing to ship), military AI race (states racing to integrate AI into autonomous systems, with reduced political friction because soldiers’ lives are not on the line), and natural-selection effects on the AI population itself (selfish AIs willing to break laws or deceive humans can outcompete more restrictive AIs).

Q. Which two historical disasters does Hendrycks cite explicitly to anchor the organizational-risks bucket?

The Challenger Space Shuttle disaster (1986; the chapter attributes the cause to organizational negligence, not competition) and the Chernobyl reactor accident (1986; attributed to poor safety protocols and an inadequately prepared crew). Both are organizational failures, not technology failures in the narrow sense.

Q. What is a High Reliability Organization (HRO), and why does the chapter reference HROs?

An organization (for example a nuclear power plant, an aircraft carrier, or an air traffic control center) that operates under high-stakes conditions with low accident rates by adopting specific operational practices. A key HRO property is surprise management: developing effective responses to unexpected situations before they occur. Hendrycks uses HROs as a reference class for what an AI organization with a real safety culture could look like.

Q. Name Hendrycks' three sub-mechanisms inside the rogue-AI bucket.

Specification gaming and control drift (deployed system internalizes unintended behaviors when filters fail), instrumental power-seeking (system views gaining more control over its surroundings as helpful for its assigned objective, even when the objective seems benign), and goal drift via intrinsification (environmental conditions that coincide with goal achievement become valued for their own sake).

Q. What is the structure of Hendrycks' framing of rogue AI? (Why is the chapter careful about the strong end of the spectrum?)

The framing is conditional and graduated. The chapter asserts that current-day systems already exhibit goal-control problems, and asks whether those problems will persist as systems become more capable and more integrated. It does not assert that highly capable rogue AI is imminent or inevitable; it asserts that the current and the projected cases both belong in the bucket.

Q. Why are the four buckets categorically distinct?

Because the intervention levers are different. A liability rule for misuse does not slow an AI race. A safety-culture overhaul inside one organization does not change structural pressure on its competitor. An interpretability breakthrough does not stop a state actor with bad intent. A compute cap does not catch a deployed model drifting from its training distribution. Categorical distinctness means the buckets respond to different interventions, not just that they have different names.

Q. What is the L2 capability in three steps?

Given a real headline about an AI harm: (1) name the bucket, (2) name the sub-mechanism inside the bucket, (3) name one intervention lever that would have changed the dial. If the headline sits across two buckets, say so and pick the dominant one with reasoning. The lever has to be specific to the bucket; “regulation” is incomplete, “a regulatory minimum evaluation window” is complete.
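
To keep classifications consistent across exercises, the three steps can be recorded in a small structure. The following is a minimal Python sketch under stated assumptions: the `Classification` class, its field names, and the `problems` check are hypothetical, and the word-count test for a generic lever is a crude stand-in for the judgment the card describes, not a rule from the chapter.

```python
from dataclasses import dataclass
from typing import Optional

BUCKETS = ("malicious use", "AI race", "organizational risk", "rogue AI")


@dataclass
class Classification:
    """Hypothetical record of one classified headline."""
    bucket: str                              # step 1: dominant bucket
    sub_mechanism: str                       # step 2: sub-mechanism inside that bucket
    lever: str                               # step 3: one bucket-specific intervention lever
    secondary_bucket: Optional[str] = None   # set only when the headline spans two buckets
    reasoning: str = ""                      # why the dominant bucket wins in a mixed case

    def problems(self):
        """Return a list of completeness problems; an empty list means complete."""
        found = []
        if self.bucket not in BUCKETS:
            found.append(f"unknown bucket: {self.bucket!r}")
        if not self.sub_mechanism.strip():
            found.append("missing sub-mechanism")
        # Assumption: a lever of fewer than two words (e.g. "regulation")
        # is treated as too generic to count as complete.
        if len(self.lever.split()) < 2:
            found.append("lever too generic")
        if self.secondary_bucket is not None:
            if self.secondary_bucket not in BUCKETS or self.secondary_bucket == self.bucket:
                found.append("secondary bucket must be a different known bucket")
            if not self.reasoning.strip():
                found.append("mixed case needs reasoning for the dominant bucket")
        return found


# Headline 2 from Exercise 1, using the lever given in the answer key.
headline_2 = Classification(
    bucket="AI race",
    sub_mechanism="corporate race",
    lever="a regulatory minimum evaluation window",
)
print(headline_2.problems())
```

A record whose lever is just "regulation" would be reported as too generic, and a mixed case such as headline 5 would be flagged until the `reasoning` field explains why the dominant bucket wins.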