Practice: AI safety as a field

Exercise 1: write the paragraph (the L1 capability)

Write one paragraph, roughly 6-8 sentences, that states what AI safety studies and why it is a discipline rather than a stance.

Constraints:

  • Name the subject (what the field studies) in your first sentence.
  • Name at least two of Hendrycks’ four risk categories somewhere in the paragraph.
  • Name at least one cross-disciplinary tool the field borrows (safety engineering, complex-systems theory, governance frameworks, or anything else from the chapter you can defend).
  • Name the descriptive-not-prescriptive commitment somewhere.
  • Close with one sentence on why “discipline” is the right word and “stance” is the wrong one.

Do this from memory, without re-reading the lesson. Then read the model paragraph below and compare.

AI safety studies what can go wrong when AI systems are designed, deployed, and operated at scale, and what specifically can be done about it. Following Hendrycks’ typology, the field sorts the failure modes into four categories: malicious use (people weaponizing systems), the AI race (structural pressure to ship before evaluation is done), organizational risks (accidents inside the labs that build the systems), and rogue AIs (systems pursuing objectives in ways their designers did not anticipate and can no longer correct). The field borrows tools from neighboring disciplines. Safety engineering supplies vocabulary like nines of reliability and defense in depth. Complex-systems theory supplies the framing for why correct components can produce incorrect systems. The register is descriptive: claims are attributed to specific sources and specific failure modes, not collapsed into “AI is dangerous” or “AI is fine.” That descriptive register is what makes AI safety a discipline rather than a stance, because a discipline can be inherited, contested, and refined by people who disagree about policy; a stance can only be taken or rejected.

Your paragraph does not have to match this one. It should match the constraints. If yours leaves out a constraint, write a second draft that includes it.

For each of the following five short scenarios, name the bucket (malicious use, AI race, organizational risk, rogue AI) that fits best. Give one sentence of justification. (Answers at the bottom; do not skip ahead.)

  1. A small group of operators uses a publicly available language model to automate harassment campaigns against journalists at a scale that previously required dozens of people.
  2. A frontier lab schedules a fixed three-week red-team window before a model release, then compresses it to one week after a competitor announces a similar release for the following month.
  3. A health insurer’s deployed risk-scoring model produces systematically worse outcomes for a demographic subgroup for eleven months before an internal audit catches it; the team that built the model had moved on and no one owned ongoing monitoring.
  4. A model fine-tuned on a misspecified reward function produces outputs that score highly on the reward but consistently violate the design team’s intent, and the team’s patches do not fully recover the intended behavior over several training rounds.
  5. A state actor adapts an open-weights model to assist in planning a coordinated cyber operation against critical infrastructure.

Answers:

  1. Malicious use. The harm requires the operator’s intent; the model amplifies what was already a hostile choice.
  2. AI race. The decision is structurally driven (competitor’s timeline) rather than individually malicious; no one in the lab needs to be acting in bad faith.
  3. Organizational risk. The failure mode is ambiguous ownership, a complex pipeline, and the absence of a monitoring function; it is not malicious use and the model is not pursuing a misaligned objective.
  4. Rogue AI in the mild sense Hendrycks uses. The system optimizes its actual reward function in a way the team did not intend and cannot easily correct.
  5. Malicious use. Same shape as scenario 1 at higher consequence; the state actor’s intent is the driver.
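
One way to make the justifications above checkable is to write the bucket choice as an ordered set of yes/no questions. The sketch below is an illustrative heuristic, not something the chapter or Hendrycks specifies: the `Scenario` fields, the order of the questions, and the `classify` function are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    MALICIOUS_USE = "malicious use"
    AI_RACE = "AI race"
    ORGANIZATIONAL_RISK = "organizational risk"
    ROGUE_AI = "rogue AI"


@dataclass(frozen=True)
class Scenario:
    # Hypothetical yes/no readings of a scenario; the field names are assumptions.
    hostile_intent: bool = False        # someone chose to cause the harm
    competitive_pressure: bool = False  # a rival's timeline drove the decision
    resists_correction: bool = False    # the system keeps optimizing against design intent
    ownership_gap: bool = False         # no one owned monitoring or process


def classify(s: Scenario) -> Bucket:
    """Return the first bucket whose driving question is answered yes."""
    if s.hostile_intent:
        return Bucket.MALICIOUS_USE
    if s.competitive_pressure:
        return Bucket.AI_RACE
    if s.resists_correction:
        return Bucket.ROGUE_AI
    if s.ownership_gap:
        return Bucket.ORGANIZATIONAL_RISK
    raise ValueError("scenario does not answer yes to any driving question")


# The five practice scenarios, encoded by hand from the answer key.
SCENARIOS = [
    Scenario(hostile_intent=True),        # 1. automated harassment campaign
    Scenario(competitive_pressure=True),  # 2. compressed red-team window
    Scenario(ownership_gap=True),         # 3. unmonitored risk-scoring model
    Scenario(resists_correction=True),    # 4. misspecified reward function
    Scenario(hostile_intent=True),        # 5. state-actor cyber operation
]

labels = [classify(s).value for s in SCENARIOS]
print(labels)
```

Real cases often answer yes to more than one question, so the fixed order is a simplification. That is why the exercise asks for the bucket that fits best plus a sentence of justification, not a mechanical label.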

Pick two of the following four claims. For each, decide whether it is shaped like a discipline-claim or a stance-claim, and rewrite it (if needed) into the other form.

  a. “AI is the most dangerous technology of the century.”
  b. “Specification gaming is a documented failure mode in reward-modeled RL systems and shows up most commonly when the reward function is a weak proxy for the design intent.”
  c. “AI safety researchers are slowing down progress.”
  d. “Hendrycks distinguishes robustness failures from monitoring failures because the interventions that address them are different.”

(b) and (d) are discipline-shaped. (a) and (c) are stance-shaped. The rewriting exercise is the work: notice that converting (a) into a discipline-claim forces you to specify what “dangerous” means, in which category, and against what comparison.

Q. What are Hendrycks’ four catastrophic-risk categories?
A.

Malicious use (intentional harm via AI), AI race (structural competitive pressure to ship before ready), organizational risks (accidents inside AI-building organizations), and rogue AIs (systems pursuing objectives in ways their designers did not intend and cannot correct).

Q. What three properties does a discipline have that a stance does not?
A.

A discipline borrows tools from neighboring fields, admits uncertainty (distinguishes well-characterized from contested risks), and produces vocabulary that lets practitioners disagree precisely.

Q. What is the difference between a robustness failure and a monitoring failure?
A.

A robustness failure is when the model breaks under distribution shift (the system itself fails). A monitoring failure is when operators do not notice the system has broken (the system fails AND the failure goes undetected). Different interventions address each.

Q. What is the difference between specification gaming and proxy gaming?
A.

Specification gaming is when the model optimizes the literal objective in a way that violates the intended spirit. Proxy gaming is when the model optimizes a measurable proxy that diverges from the real goal. Both are well-named failure modes; the field cares that they are not the same thing.
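
A toy selection problem can make the proxy-gaming half of this answer concrete. The sketch below is illustrative only: the candidate behaviors, both score columns, and the function names are invented for the example and do not come from the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    proxy_score: float     # what the optimizer can measure (hypothetical numbers)
    intended_value: float  # what the designers actually wanted (hypothetical numbers)


CANDIDATES = [
    Candidate("answer the question correctly", proxy_score=0.80, intended_value=0.90),
    Candidate("pad the answer with keywords the grader rewards", proxy_score=0.95, intended_value=0.20),
    Candidate("refuse to answer", proxy_score=0.10, intended_value=0.30),
]


def pick_by_proxy(candidates):
    """Choose the behavior an optimizer would pick if it only sees the proxy."""
    return max(candidates, key=lambda c: c.proxy_score)


def pick_by_intent(candidates):
    """Choose the behavior the designers would have picked."""
    return max(candidates, key=lambda c: c.intended_value)


def gaming_gap(candidates):
    """Intended value lost by optimizing the proxy instead of the real goal."""
    return pick_by_intent(candidates).intended_value - pick_by_proxy(candidates).intended_value


chosen = pick_by_proxy(CANDIDATES)
print(chosen.name)
print(round(gaming_gap(CANDIDATES), 2))
```

The proxy and the real goal agree on most of the ranking, which is what makes the divergence at the top easy to miss.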

Q. Why does Hendrycks treat the AI race as a structural pressure rather than a character failure?
A.

Because a lab can be staffed by careful, honest engineers and still feel pressure to skip evaluation because a competitor is shipping next week. The pressure does not require anyone to act in bad faith; it requires the incentives to point a certain way. The framing matters because structural pressures respond to changes in structure (deployment moratoria, compute caps, shared evals, liability rules), not to changes in intention.

Q. Name two neighboring fields AI safety borrows tools from, with one tool from each.
A.

Safety engineering supplies tools like nines of reliability, defense in depth, and fault-tree analysis. Complex-systems theory supplies the framing for how systems built from correct components still fail at the system level. Other valid answers: governance frameworks, machine ethics, game theory.
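
Two of the safety-engineering terms in this answer have simple arithmetic behind them. The sketch below assumes the common convention that N nines of reliability means a failure probability of 10^-N. It also assumes that the layers in a defense-in-depth design fail independently, which real systems often violate. The function names are illustrative.

```python
import math


def nines(failure_probability: float) -> float:
    """Number of nines of reliability implied by a failure probability."""
    if not 0.0 < failure_probability < 1.0:
        raise ValueError("failure probability must be strictly between 0 and 1")
    return -math.log10(failure_probability)


def combined_failure(layer_failure_probabilities) -> float:
    """Probability that every layer fails, assuming the layers fail independently."""
    return math.prod(layer_failure_probabilities)


single_layer = 0.1                                # one layer that misses 10% of failures
three_layers = combined_failure([0.1, 0.1, 0.1])  # three such layers stacked

print(round(nines(single_layer), 2))  # about one nine
print(round(nines(three_layers), 2))  # about three nines
```

If the layers share a failure cause, the product understates the real risk, which is one reason the complex-systems framing in the same answer matters.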

Q. What does “descriptive-not-prescriptive” mean as an editorial register?
A.

Claims are attributed to specific sources (“the chapter argues”, “the CAIS framing posits”) rather than asserted in the editorial first person (“we should”). The register lets the reader engage with the same vocabulary even if they disagree with the conclusion; it makes the discipline visible.

Q. What is the capability bar for this lesson?
A.

Be able to write one paragraph (roughly 6-8 sentences) that states what AI safety studies and why it is a discipline rather than a stance. The paragraph names the subject, names at least two of the four risk categories, names a cross-disciplinary tool, names the descriptive-not-prescriptive commitment, and closes on why “discipline” is the right word.