
AI safety as a field: what it studies and why it is a discipline, not a stance

If you read about AI online for a week, “AI safety” arrives in your feed in three different costumes. It is a slogan (“AI safety, AI safety,” chanted by people who want a moratorium). It is a side in a fight (“AI safety types” vs. “accelerationists,” each accusing the other of bad faith). It is a vibe (the word lands in a paragraph the way “sustainability” lands in a corporate brochure, signaling care without committing to a method).

The textbook we are working through in this track, Dan Hendrycks’ Introduction to AI Safety, Ethics, and Society (Center for AI Safety, 2024), treats AI safety as none of these. It treats it as a field. A field has a subject (what it studies), a vocabulary (the technical terms that let practitioners disagree precisely rather than just loudly), a method (how claims are made and checked), and connections to neighboring fields (from which it borrows tools). This lesson is about what that subject is and why “field” is the right word for it.

The reason this distinction matters is practical, not philosophical. If you treat AI safety as a stance, every question about a deployed model collapses into “are you for or against AI?” If you treat it as a field, you can ask the question Hendrycks’ textbook is built to support: what specifically might fail, what specifically might go right, what specifically might we do about it? The rest of this track lives in that question.

Hendrycks opens the textbook by sorting the things that can go wrong with AI systems into four categories. We will spend the next lesson (L2) inside the four buckets in detail; here we just name them so the shape of the field is visible.

  • Malicious use. People intentionally using AI to cause harm. Includes the obvious (a state actor weaponizing a model) and the less obvious (a small group automating a kind of harm that previously required many hands).
  • The AI race. The competitive pressure on labs, companies, and countries to ship before they are ready. Hendrycks frames this as a structural pressure: it does not require any individual to act badly, only for incentives to point a certain way.
  • Organizational risks. Accidents that happen inside the labs and companies that build AI, not from anyone’s malice. The usual mechanisms: complex systems, ambiguous responsibility, decisions that look reasonable at each step and disastrous in aggregate.
  • Rogue AIs. The class of problems that emerges when an AI system pursues an objective in a way its designers did not intend, or develops capabilities its designers did not anticipate, in such a way that operators can no longer correct it.

This four-bucket typology is the field’s first deliverable. It is not the only framing in AI safety (the literature has plenty of others), but Hendrycks’ choice is deliberate: the four buckets are categorically distinct, in the sense that the interventions that help in one bucket typically do not help in another. A liability rule that constrains malicious use does not necessarily slow an AI race. An evaluation regime that catches organizational drift does not necessarily address a rogue-AI scenario. The point of the typology is to refuse the move where every AI concern gets dumped into one bucket called “danger.”

The other thing to notice about the four buckets: none of them are stances. “Malicious use is a risk category” is not the same kind of sentence as “AI is bad.” The first is descriptive, falsifiable in principle, and lives inside a framework that lets you argue about what counts and what doesn’t. The second is a banner.

To make the typology concrete, here are four hypothetical headlines, one per bucket. None of them are real incidents; they are deliberately generic so the exercise is about the typology, not the news.

  • “A foreign intelligence service used an open-weights model to generate targeted misinformation at scale during an election.” Malicious use. The harm exists because someone intended it; the model is the lever, not the actor.
  • “Lab X shipped a frontier model two weeks before its planned external-evaluation window because Lab Y announced a competitor.” AI race. No one inside Lab X is acting in bad faith; the incentives are structural.
  • “A production deployment of an AI-driven scheduling system at a hospital network produced quietly worse outcomes for a subset of patients for six months before anyone noticed.” Organizational risk. Complex system, ambiguous responsibility, decisions that looked reasonable at each step.
  • “A model fine-tuned on a misspecified objective optimized for the metric in a way the designers did not anticipate, and the workarounds the team tried did not bring its behavior back inside the intended envelope.” Rogue AI in Hendrycks’ sense, on the milder end of the spectrum.

The point of the sort is not to argue about which bucket is most important; it is to feel the difference between the four. By the time you finish L2, this sort should be quick rather than effortful.
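For readers who think in data structures, the sort above can be written down as a small lookup. This is only a sketch: the names `RiskCategory`, `LabeledHeadline`, and `group_by_category` are illustrative assumptions, not anything the textbook defines, and the headline summaries are the same hypothetical ones used above, shortened.

```python
from dataclasses import dataclass
from enum import Enum


class RiskCategory(Enum):
    """The four buckets named in this lesson (identifier names are our own)."""
    MALICIOUS_USE = "malicious use"
    AI_RACE = "AI race"
    ORGANIZATIONAL_RISK = "organizational risk"
    ROGUE_AI = "rogue AI"


@dataclass(frozen=True)
class LabeledHeadline:
    summary: str            # shortened hypothetical headline, not a real incident
    category: RiskCategory  # the bucket it was sorted into
    reason: str             # the "with reasons" part of the exercise


# The four hypothetical headlines from this lesson, one per bucket.
HEADLINES = [
    LabeledHeadline(
        "Intelligence service uses open-weights model for election misinformation",
        RiskCategory.MALICIOUS_USE,
        "The harm exists because someone intended it.",
    ),
    LabeledHeadline(
        "Lab X ships before its external-evaluation window after Lab Y announces",
        RiskCategory.AI_RACE,
        "No bad faith is needed; the incentives are structural.",
    ),
    LabeledHeadline(
        "Hospital scheduling system quietly worsens outcomes for six months",
        RiskCategory.ORGANIZATIONAL_RISK,
        "Complex system, ambiguous responsibility, stepwise-reasonable decisions.",
    ),
    LabeledHeadline(
        "Fine-tuned model optimizes a misspecified objective and resists correction",
        RiskCategory.ROGUE_AI,
        "Unintended pursuit of an objective that operators could not correct.",
    ),
]


def group_by_category(items):
    """Return a dict mapping every category to the headlines placed in it."""
    grouped = {category: [] for category in RiskCategory}
    for item in items:
        grouped[item.category].append(item)
    return grouped


if __name__ == "__main__":
    for category, items in group_by_category(HEADLINES).items():
        for item in items:
            print(f"{category.value}: {item.summary} ({item.reason})")
```

Keeping the `reason` field mandatory mirrors the exercise: a label without a reason is a guess, not a placement.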

A discipline has three things a stance does not: it borrows tools, it admits uncertainty, and it produces vocabulary that lets practitioners disagree precisely. Hendrycks builds all three into the textbook’s structure.

It borrows tools. A large fraction of the textbook (chapters 4 and 5, which become this track’s L5 and L6) is not about AI at all. Chapter 4 reaches into safety engineering, the field that developed around nuclear plants, aviation, and bridges; it pulls in concepts like nines of reliability, defense in depth, fault tree analysis, and black swans. Chapter 5 reaches into complex-systems theory, the body of work that studies how systems composed of correct components can still fail at the system level. The implicit argument: AI safety is not the first field to confront the problem of high-stakes systems whose failure modes are non-obvious, and a sensible field starts by reading the literature its predecessors left.

If AI safety were a stance, this would be a strange move. A stance does not borrow tools; it accumulates supporters. A field borrows the tools that work and discards the ones that don’t, and one mark of seriousness in a field is the willingness to write a chapter that imports the vocabulary of a neighbor field rather than reinventing it under new names.
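Two of the borrowed concepts named above, nines of reliability and defense in depth, reduce to short arithmetic. The sketch below is a minimal illustration, not the textbook's own formulation: the function names are ours, the numbers are made up for the example, and the defense-in-depth calculation assumes the layers fail independently, which real systems often violate.

```python
import math


def failure_probability(nines: int) -> float:
    """k nines of reliability means the system fails with probability 10**-k.

    Three nines (99.9% reliable) corresponds to a failure probability of 0.001.
    """
    return 10.0 ** -nines


def nines_of(reliability: float) -> float:
    """Inverse view: how many nines a given reliability corresponds to."""
    return -math.log10(1.0 - reliability)


def layered_failure_probability(layer_failure_probs) -> float:
    """Defense in depth: harm gets through only if every layer fails.

    ASSUMPTION: layers fail independently, so the probabilities multiply.
    Correlated failures (shared causes) make the true figure worse.
    """
    return math.prod(layer_failure_probs)


if __name__ == "__main__":
    # Illustrative numbers only: three safeguards that each miss 10% of cases.
    print(failure_probability(3))
    print(nines_of(0.999))
    print(layered_failure_probability([0.1, 0.1, 0.1]))
```

Under the independence assumption, three modest layers that each fail one time in ten combine to roughly three nines. That multiplication is the reason safety engineering stacks imperfect safeguards instead of hunting for a single perfect one.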

It admits uncertainty. Hendrycks repeatedly distinguishes between risks that are well-characterized today (a deployed model exhibits a specific failure under known conditions) and risks that are projected or contested (a sufficiently capable future system might develop misaligned objectives). The textbook treats both as worth studying, but it is careful not to collapse them. A stance flattens this distinction; a field preserves it, because the methods that work on one kind of risk are not the methods that work on the other.

It produces vocabulary. This is the part the rest of the track will lean on most. The field’s terms are precise and they distinguish things that look similar from a distance. Robustness failure (the model breaks under distribution shift) is not the same as monitoring failure (the operators do not notice the model broke). Specification gaming (the model optimizes the literal objective in a way that violates the spirit) is not the same as proxy gaming (the model optimizes a measurable proxy that diverges from the real goal). Deceptive alignment (the model behaves well in training because it knows it is being trained) is not the same as either. Three lessons from now, when we land in the alignment chapter (L4), these distinctions will be the work; for now, the point is that the field has them and the discourse around AI online generally does not.

Vocabulary does something a stance cannot: it lets two people who disagree continue to disagree productively. Two researchers can both accept that deceptive alignment names a real failure mode and disagree, sharply, about how likely it is in current systems or what the right mitigation would be. The disagreement stays inside the field because both sides are arguing about the same thing. By contrast, a disagreement between “AI doomer” and “AI accelerationist” cannot stay inside any field, because the terms do not point at distinct, checkable claims; they point at coalitions. This is the practical reason Hendrycks bothers with a textbook in the first place: a textbook is the kind of artifact that can transmit vocabulary, and vocabulary is the kind of asset that lets a field outlast its current debates.
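To see how a term like proxy gaming points at a checkable claim, here is a toy sketch of the definition given above: an optimizer that can only see a measurable proxy picks a different option from the one the designers actually wanted. Everything in it is hypothetical, including the candidate names and the scores, and it illustrates the definition only; it is not a model of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    proxy_score: float  # what the optimizer can measure
    true_value: float   # what the designers actually care about


# Hypothetical behaviors with made-up scores.
CANDIDATES = [
    Candidate("does the task well", proxy_score=0.80, true_value=0.90),
    Candidate("does the task adequately", proxy_score=0.70, true_value=0.70),
    Candidate("exploits the metric", proxy_score=0.99, true_value=0.10),
]


def pick_by_proxy(candidates):
    """What an optimizer selects when it only sees the proxy."""
    return max(candidates, key=lambda c: c.proxy_score)


def pick_by_true_value(candidates):
    """What the designers would select if the real goal were measurable."""
    return max(candidates, key=lambda c: c.true_value)


def proxy_gap(candidates) -> float:
    """True value lost by optimizing the proxy instead of the real goal."""
    return pick_by_true_value(candidates).true_value - pick_by_proxy(candidates).true_value


if __name__ == "__main__":
    print("chosen by proxy:     ", pick_by_proxy(CANDIDATES).name)
    print("chosen by true value:", pick_by_true_value(CANDIDATES).name)
    print("value lost:          ", proxy_gap(CANDIDATES))
```

A gap greater than zero is the checkable claim: two people can disagree about how large it is in a given deployment while agreeing on what is being measured.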

What you trade off when you treat it as a stance

The practical cost of the stance framing is precision. Once a question gets routed through “are you AI safety or AI acceleration,” it has stopped being a question about the world and started being a question about which team you are on. The answer can no longer be specific.

Hendrycks’ textbook does not make the case against the stance framing directly; it makes the case by simply not adopting it. There is no chapter where the book stakes a position on whether the field’s net policy posture should be “slow down” or “speed up.” There are chapters that name what specifically might go wrong under what specific conditions and what specific levers (regulatory, technical, organizational) might address those specifics. The implicit argument is that a position-free vocabulary is more useful than a position. People with very different policy preferences can use the same vocabulary; people with no vocabulary cannot make policy at all.

This descriptive register is the discipline we are carrying through the entire track. Every lesson is going to attribute claims to the textbook or to specific cited sources, not to “what the field believes.” When the book argues for a specific governance lever in L9, the lesson will say “the chapter argues” or “the CAIS framing posits,” not “we should.” This is not a hedge against controversy; it is a structural commitment to keeping the discipline visible. If a reader disagrees with Hendrycks’ framing, they should be able to do so with the same vocabulary.

Concretely, by the time you finish this track:

  • You can read a news headline about an AI deployment that went wrong and place it in Hendrycks’ four-bucket typology, with reasons. This is the L2 capability.
  • You can read an alignment paper or post that uses the field’s vocabulary (proxy gaming, specification gaming, deceptive alignment, etc.) and tell what is being claimed without the terms being a blocker. L3 and L4 deliver this.
  • You can read a safety-engineering argument about nines of reliability or defense-in-depth and explain how it constrains a specific AI deployment decision. L5 delivers this.
  • You can read a governance proposal (an EU AI Act article, a US executive order, a compute-cap argument) and locate it inside Hendrycks’ four-layer governance taxonomy (corporate, national, international, compute). L9 delivers this.

What this track does not deliver is a position on AI. The reason is not modesty; it is method. Positions are downstream of vocabulary, and the work this track is doing is upstream.

By the end of this lesson, you should be able to write one paragraph (roughly 6-8 sentences) that states what AI safety studies and why it is a discipline rather than a stance. The Practice section has a worked example to compare against, and the rough shape is: name the subject, name two or three of the four risk categories, name one cross-disciplinary tool the field borrows, name the descriptive-not-prescriptive commitment, close with one sentence on why “discipline” is the right word.

Do not write the paragraph yet. The practice exercise is more useful if you produce it under the constraint of the prompt, then compare. The rest of the track assumes you can hold the paragraph; if you cannot, return here.