Beneficial AI and machine ethics

Phase 3 opens

Phase 2 was about what fails when AI systems are deployed: the failure surface (L3), the alignment substrate (L4), the engineering toolkit (L5), and the complex-systems constraints on that toolkit (L6). Phase 3 changes the question entirely. The question is no longer what fails; it is what we are trying to do, and for whom. Once you have a robust, monitored, aligned system, you still have to specify what the system should be aligned with, and the answer is genuinely contested. The chapter that handles this is Hendrycks Chapter 6, Beneficial AI and Machine Ethics.

The first move in Ch 6 is to admit that the question does not have a single answer. There is no consensus ethical framework that a designer can simply hand to an AI system and treat as settled. The chapter’s name for this fact is moral uncertainty, and the lesson uses it as the substrate.

The L7 capability is explicit. By the end you should be able to explain what moral uncertainty is and why it changes how a designer picks a value-loading approach. Naming the question is most of the work; what you do about it (the strategies in §6.9) is the rest.

Moral uncertainty as the substrate (Ch 6.9)

The chapter defines moral uncertainty plainly: not knowing which moral beliefs are correct. The condition arises because different ethical frameworks and different stakeholders endorse different values, and the disagreements survive sustained reflection. Utilitarian and deontological frameworks reach different verdicts on familiar cases; consequentialist and rights-based reasoning reach different verdicts on trade-off situations; libertarian and egalitarian intuitions reach different verdicts on distributive questions. The disagreement is not a measurement problem one more experiment will resolve.

Why this matters for AI design: a system designed to optimize a specific moral framework can, with high confidence and high capability, produce outcomes that proponents of a different framework would judge as harm. The chapter is direct on this point. From Ch 6.9: “AI systems should represent moral uncertainty to avoid acting on overconfidence, which could lead to outcomes that humans consider morally reprehensible.” This takes the outer-alignment question from L4 seriously at the level of the loss function itself: outer alignment is hard not just because operationalizing intent is hard, but because there is no settled intent to operationalize.

The chapter names three strategies for acting under moral uncertainty. Each has a tradeoff worth holding.

My Favorite Theory. Pick the ethical framework you have the highest credence in and act consistently with it; treat the others as mistaken. The advantage is decisiveness: the AI system has a tractable specification. The cost is exactly the failure mode the framing is trying to prevent: high confidence in one framework, high capability, and outcomes other frameworks reject. The strategy fails the chapter’s stated bar.

Maximizing expected choiceworthiness. Treat ethical frameworks like uncertain hypotheses about the world; weight them by your credence in each; compute, for each possible action, its expected moral value across frameworks; pick the action with the highest expectation. The advantage is that the approach is principled: this is what Bayesian decision theory under uncertainty would do. The cost is that “expected moral value across frameworks” requires units that translate between frameworks (a utilitarian assessment in utils, a deontological assessment in something else); the translation is not obvious and may not even be coherent. The strategy is honest about the uncertainty but tractable only in narrow cases.
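The contrast between My Favorite Theory and maximizing expected choiceworthiness can be made concrete with a minimal sketch. Everything in it is hypothetical: the framework names, the credences, the action names, and the scores are illustrative and do not come from the chapter. The sketch also assumes the scores sit on one shared scale across frameworks, which is exactly the translation step the text flags as possibly incoherent.

```python
# Hypothetical credences over three ethical frameworks (they sum to 1).
credences = {"utilitarian": 0.40, "deontological": 0.35, "virtue": 0.25}

# Hypothetical choiceworthiness of each action under each framework.
# ASSUMPTION: all scores share one scale across frameworks. That is the
# intertheoretic comparison the lesson says may not be coherent.
choiceworthiness = {
    "action_a": {"utilitarian": 10.0, "deontological": -8.0, "virtue": -2.0},
    "action_b": {"utilitarian": 6.0, "deontological": 5.0, "virtue": 4.0},
}


def my_favorite_theory(credences, choiceworthiness):
    """Act on the single framework with the highest credence; ignore the rest."""
    favorite = max(credences, key=credences.get)
    return max(choiceworthiness, key=lambda a: choiceworthiness[a][favorite])


def expected_choiceworthiness(action, credences, choiceworthiness):
    """Credence-weighted average of an action's score across frameworks."""
    return sum(credences[f] * choiceworthiness[action][f] for f in credences)


def maximize_expected_choiceworthiness(credences, choiceworthiness):
    """Pick the action with the highest expected choiceworthiness."""
    return max(
        choiceworthiness,
        key=lambda a: expected_choiceworthiness(a, credences, choiceworthiness),
    )


print(my_favorite_theory(credences, choiceworthiness))
print(maximize_expected_choiceworthiness(credences, choiceworthiness))
```

With these made-up numbers the two strategies disagree: My Favorite Theory follows the utilitarian column alone and picks action_a, while the expectation picks action_b because the other two frameworks score action_a negatively. Changing the scale of any one column changes the second answer, which is the unit problem in miniature.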

Moral parliament. Simulate representatives of different moral perspectives and stakeholder viewpoints; have them deliberate and reach compromises; act on the compromise. The advantage is structural. The approach does not require a common unit across frameworks, and it can handle stakeholder heterogeneity that goes beyond formal ethical frameworks (different communities, different cultures, different futures). The chapter also notes two further properties: adaptability to evolving values (the parliament can change composition as values change) and a preference for compromise over extreme positions. The cost is that the parliament has to be designed, and the design choices (which perspectives get a seat, how votes are weighted, what compromise mechanism is used) are themselves load-bearing ethical decisions that the moral-uncertainty framing was supposed to defer.
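One way to see why a parliament needs no common unit is a small voting sketch. The chapter does not prescribe a compromise mechanism; the seat-weighted Borda count used here is a swapped-in stand-in chosen for brevity, and the delegates, seat counts, and action names are all hypothetical. Each delegate supplies only a ranking, never a score on a shared scale.

```python
# Hypothetical delegates. Seat counts and rankings are illustrative only.
# Each ranking lists actions from most preferred to least preferred.
delegates = [
    {"name": "utilitarian", "seats": 40, "ranking": ["expand", "pilot", "halt"]},
    {"name": "deontological", "seats": 35, "ranking": ["halt", "pilot", "expand"]},
    {"name": "community_rep", "seats": 25, "ranking": ["pilot", "halt", "expand"]},
]


def borda_scores(delegates):
    """Seat-weighted Borda count: top rank earns n-1 points, last rank earns 0."""
    scores = {}
    for delegate in delegates:
        n = len(delegate["ranking"])
        for position, action in enumerate(delegate["ranking"]):
            points = delegate["seats"] * (n - 1 - position)
            scores[action] = scores.get(action, 0) + points
    return scores


def parliament_choice(delegates):
    """Act on the option with the highest seat-weighted Borda score."""
    scores = borda_scores(delegates)
    return max(scores, key=scores.get)


print(borda_scores(delegates))
print(parliament_choice(delegates))
```

In this sketch a plurality vote would select "expand" (the largest bloc's first choice), while the Borda count selects "pilot", the option nobody ranks last. The seat counts and the choice of Borda over another rule are precisely the design decisions the text describes as load-bearing.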

The chapter does not declare one strategy correct; it names them as the current state of the art and notes that moral parliament has gathered the most attention in recent AI-ethics literature because it scales to stakeholder heterogeneity.

A different layer of the same problem. Even within a single framework, once the system is acting on behalf of a population rather than an individual, you need an aggregation rule that turns many individual welfares into one collective welfare measure. These are social welfare functions. The chapter introduces them directly: “Social welfare functions aggregate individual wellbeing into overall societal wellbeing.” The lesson works through two named families and one practical incarnation.

Utilitarian. Sum individual wellbeing directly. The collective welfare is the sum (or mean) of individual welfares. The advantage is mathematical tractability and treating each individual symmetrically. The cost is that a utilitarian function will accept a very bad outcome for a small minority if the gains to the majority outweigh it, and the function offers no internal way to refuse the tradeoff.

Prioritarian. Give extra weight to the wellbeing of worse-off individuals. The collective welfare is a weighted sum where the weight increases as individual welfare decreases. The advantage is that the function bakes in a preference against very-bad outcomes for any individual; the cost is that the weighting function (how much extra weight do worse-off individuals get?) is itself a parameter the designer must pick, and the choice is the ethical decision the function was supposed to formalize.
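A minimal sketch of the two families, under stated assumptions. The prioritarian function here applies a concave transform (a square root) to each person's wellbeing before summing, which is one common way to make a unit of wellbeing count for more when it goes to someone worse off. The square root and the wellbeing numbers are illustrative choices, not the chapter's; the transform is the designer-picked parameter the text describes.

```python
import math


def utilitarian_welfare(wellbeing):
    """Sum individual wellbeing directly."""
    return sum(wellbeing)


def prioritarian_welfare(wellbeing, transform=math.sqrt):
    """Sum a concave transform of wellbeing, so gains to the worse off count more.

    ASSUMPTION: the square root is an arbitrary illustrative transform. A more
    strongly concave function would give the worse off even more priority.
    """
    return sum(transform(w) for w in wellbeing)


# Hypothetical wellbeing levels for four people under two policies.
policy_x = [25.0, 25.0, 25.0, 0.0]   # higher total, one person left with nothing
policy_y = [16.0, 16.0, 16.0, 16.0]  # lower total, nobody left behind

for name, policy in (("policy_x", policy_x), ("policy_y", policy_y)):
    print(name, utilitarian_welfare(policy), prioritarian_welfare(policy))
```

With these numbers the utilitarian function prefers policy_x (75 against 64) and the prioritarian function prefers policy_y (16 against 15). Swapping in a different transform can flip the prioritarian verdict, which is why the weighting choice is itself an ethical decision.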

Cost-benefit analysis as the practical incarnation. Most real-world AI deployment decisions, when they engage with ethics at all, do so through cost-benefit analysis: estimate the expected monetary value of benefits and harms across the affected population, then compare. The chapter notes the limits: cost-benefit analysis “relies on financial proxies for wellbeing and neglects distributional impacts.” The financial-proxy assumption is load-bearing: a thousand dollars of harm to someone in the bottom decile is not equivalent in wellbeing terms to a thousand dollars of harm to someone in the top decile, but an unweighted dollar total treats them symmetrically. The distributional blindness is the structural property a prioritarian function would correct for. Most deployment decisions use cost-benefit analysis anyway because it is operationally cheap; the chapter’s note is the right hedge to apply when reading any such analysis.
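The two blind spots can be illustrated with a short sketch that compares an unweighted dollar total against one with distributional weights. The weighting scheme is an assumption introduced for illustration, not something the chapter specifies: it treats the value of a marginal dollar as inversely proportional to income (what a logarithmic utility of income would imply). The incomes, impacts, and reference income are hypothetical.

```python
# Hypothetical per-person impacts of one deployment: (annual income, dollar impact).
impacts = [
    (20_000, -1_000),  # a lower-income person bears a $1,000 harm
    (200_000, 1_200),  # a higher-income person receives a $1,200 benefit
]


def unweighted_net_benefit(impacts):
    """Standard cost-benefit total: every dollar counts the same for everyone."""
    return sum(dollars for _income, dollars in impacts)


def weighted_net_benefit(impacts, reference_income=50_000):
    """Dollar total with distributional weights.

    ASSUMPTION: the weight on a dollar is reference_income / income, i.e. the
    marginal value of money is taken to be inversely proportional to income.
    """
    return sum(dollars * reference_income / income for income, dollars in impacts)


print(unweighted_net_benefit(impacts))
print(weighted_net_benefit(impacts))
```

The unweighted total is positive (the deployment "passes"), while the weighted total is negative. Neither number is the right one by default; the point is that leaving the weights out is also a choice.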

Applied to AI deployment: the SWF choice changes which deployments look acceptable. When a loan-approval model produces different error rates across demographic groups, the same deployment can receive different ship/do-not-ship decisions under a utilitarian SWF and under a prioritarian SWF; the SWF is the thing that turns the disparity information into a decision. The chapter’s other sections on fairness, wellbeing, preferences, and happiness are each different angles on the same family of aggregation questions.

Worked illustration of the SWF choice. Consider a loan-approval AI evaluated on two metrics: total expected revenue from approved loans, and false-rejection rate disaggregated by demographic group. The model produces 89 percent accuracy overall, 92 percent accuracy on the majority demographic group, and 78 percent accuracy on a smaller demographic group, Group A. In this scenario the extra errors fall mostly on rejections: the false-rejection rate for Group A is higher than the population average, and the rejected Group A applicants are disproportionately creditworthy. Under a utilitarian SWF the ship decision is straightforward: total expected revenue is high and the model outperforms its predecessor on aggregate. Under a prioritarian SWF the deployment is harder to defend: the subgroup made worse off by rejection is being made systematically worse off, and the prioritarian weighting amplifies the cost of their false rejections relative to the gains for the majority. Different SWFs, same data, different ship/do-not-ship verdicts. The ethics question is not what the data says; the ethics question is which SWF you are willing to defend in front of the people whose loan applications you reject.
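The three accuracy figures also pin down how large Group A is, under two assumptions the lesson does not state: the applicant pool splits into exactly two groups, and all three accuracies are measured on the same evaluation set. Writing p for the majority group's share of applicants, overall accuracy is the share-weighted average of the group accuracies:

```latex
\begin{align*}
0.92\,p + 0.78\,(1 - p) &= 0.89 \\
0.14\,p &= 0.11 \\
p &= \tfrac{11}{14} \approx 0.786
\end{align*}
```

Under those assumptions Group A is roughly 21 percent of applicants. That is a sizeable minority, yet the aggregate figure still sits only three points below the majority figure, which is how an aggregate metric can mask a fourteen-point gap.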

The chapter’s section on fairness (Ch 6.3) extends this point with the field’s named criteria: demographic parity (rejection rates equal across groups), equalized odds (true-positive and false-positive rates equal across groups), and calibration (predicted probability matches realized rate across groups). The criteria are not jointly satisfiable in general, as the formal-fairness literature has shown: a model that satisfies demographic parity can fail equalized odds and vice versa, and all three can be satisfied simultaneously only in degenerate cases (perfect prediction, or equal base rates across groups). So even within a single SWF, picking which fairness criterion to enforce is another value-loading decision the chapter argues should be made transparently.
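A minimal sketch of how the three criteria can be checked from per-group confusion counts. The counts are hypothetical; they were chosen only so that group accuracies match the 92 percent and 78 percent quoted in the loan-approval illustration. Because the sketch works with binary approve/reject decisions rather than probability scores, it uses precision (the share of approved applicants who are creditworthy) as a stand-in for calibration; that substitution is an assumption of the sketch, not the chapter's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupCounts:
    """Confusion counts for one group; 'positive' means the loan is approved."""
    tp: int  # approved and creditworthy
    fp: int  # approved but not creditworthy
    fn: int  # rejected but creditworthy (a false rejection)
    tn: int  # rejected and not creditworthy

    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total()

    def approval_rate(self) -> float:
        # Demographic parity compares this across groups.
        return (self.tp + self.fp) / self.total()

    def true_positive_rate(self) -> float:
        # Equalized odds compares this and the false-positive rate.
        return self.tp / (self.tp + self.fn)

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    def precision(self) -> float:
        # Binary stand-in for calibration (an assumption of this sketch).
        return self.tp / (self.tp + self.fp)


# Hypothetical counts, chosen to give 92% and 78% accuracy respectively.
majority = GroupCounts(tp=600, fp=60, fn=28, tn=412)
group_a = GroupCounts(tp=120, fp=12, fn=54, tn=114)


def gap(metric: str) -> float:
    """Absolute difference between the two groups on the named metric."""
    return abs(getattr(majority, metric)() - getattr(group_a, metric)())


for metric in ("approval_rate", "true_positive_rate", "false_positive_rate", "precision"):
    print(f"{metric}: gap = {gap(metric):.3f}")
```

In this sketch the precision gap is zero, so the stand-in for calibration holds, while approval rates differ by 0.16 and true-positive rates by about 0.27. A model can therefore look fair under one criterion and unfair under the other two on the same data, which is why the choice of criterion has to be stated rather than assumed.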

The wider Ch 6 catalog (briefly)

The chapter does not stop at moral uncertainty and SWFs. Sections 6.2 through 6.7 work through the categories of value the literature has developed: law as the codified-institutional substrate, fairness as the framework for distributing benefits and harms, the economic engine as the framing of value-creation incentives, wellbeing as the unit aggregated by SWFs (not the same as preference satisfaction or happiness), preferences as the revealed-choice unit, happiness as the affective-state unit. The L7 reader does not need each section in detail; the move is to notice that wellbeing, preferences, and happiness are not interchangeable, and a value-loading approach that conflates them is making an ethical choice without naming it. If a deployed system optimizes for preferences (what users click), it can produce outcomes that reduce wellbeing (what makes lives go well) and reduce happiness (subjective affect). The L4 specification-gaming and proxy-gaming framings are operating on these distinctions when the proxy and the goal diverge.

The references file has reading suggestions on each section for the reader who wants deeper engagement; the lesson body is calibrated to the Ch 6 anchors most consequential for value-loading.

Why ethics is a multi-stakeholder coordination problem

A theme the chapter develops, and that L7 carries forward: ethics for AI is not a single-designer specification problem. It is a coordination problem among stakeholders with heterogeneous values. The moral parliament approach has the right shape for this reason: it admits stakeholder diversity into the value-loading process rather than pretending the designer can resolve the diversity ahead of time. L8 (collective action) will extend this framing to multi-agent settings; L9 (governance) will extend it to policy-layer levers.

The callback to L4: outer alignment is hard because the loss function does not capture the goal. L7 names a deeper reason: there is no single goal to capture; there are many stakeholders, many ethical frameworks, and any specification is implicitly choosing between them. The honest version of value-loading is to make that choice transparent and contestable, not to hide it inside a loss function.

The chapter is not arguing that machine ethics is impossible; it is arguing that machine ethics is unavoidable and that the question of whose ethics is the load-bearing part of value-loading. The strategies (moral parliament, expected choiceworthiness, principled SWFs with named criteria) are partial. They are also better than the alternative, which is to embed an unstated ethical commitment in the loss function and discover it at deployment.

The L7 capability

You should now be able to:

Explain moral uncertainty in two to three sentences without leaning on technical jargon: not knowing which moral beliefs are correct, where the disagreement survives reflection, and where AI systems that act with high confidence on any single framework can produce outcomes other frameworks reject.
Name the three strategies for acting under moral uncertainty (My Favorite Theory, expected choiceworthiness, moral parliament) and the tradeoff each entails.
Distinguish utilitarian from prioritarian social welfare functions and identify a deployment decision the choice changes.
Recognize cost-benefit analysis as an incomplete approximation and name two specific blind spots (financial-proxy assumption, distributional-impact neglect).
Connect to L4 outer alignment: the value-loading question Ch 6 is asking is the substrate question Ch 3.4 left open.

Practice has a deployment-design exercise using the moral parliament framing, a utilitarian-vs-prioritarian decision exercise on a worked loan-approval scenario, and a cost-benefit-analysis critique exercise.