
Summary: beneficial AI and machine ethics

Phase 3 of Track 23 opens at L7 by changing the question. Phase 2 worked through what fails when AI systems are deployed; Phase 3 asks what we are trying to do, and for whom. Hendrycks Chapter 6 names the foundational concept: moral uncertainty, defined as not knowing which moral beliefs are correct, where the disagreement survives sustained reflection. The chapter is direct on why it matters: “AI systems should represent moral uncertainty to avoid acting on overconfidence, which could lead to outcomes that humans consider morally reprehensible” (Hendrycks Ch 6.9).

Three strategies for acting under moral uncertainty. My Favorite Theory: pick the framework you trust most and act on it consistently. It is decisive, but it produces exactly the overconfident, bad-outcome failure mode that representing moral uncertainty was meant to prevent. Expected choiceworthiness: treat frameworks as uncertain hypotheses, weight each by credence, and maximize expected moral value across them. It is principled, but it requires a common unit for comparing value across frameworks, and such a unit may not be coherent. Moral parliament: simulate representatives of different moral perspectives and stakeholder viewpoints, let them deliberate to a compromise, and act on that compromise. It handles heterogeneity, but it relocates value judgments to the parliament-design layer rather than avoiding them. The parliament has attracted the most attention in recent AI-ethics literature because it scales to stakeholder diversity.
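The contrast between the first two strategies can be made concrete with a small sketch. Everything numeric here is hypothetical: the three theories, their credences, and the choiceworthiness scores are invented for illustration and do not come from the chapter. The sketch also assumes the scores sit on a shared scale, which is precisely the assumption expected choiceworthiness is criticized for.

```python
# Hypothetical credences over moral theories (illustrative only, sum to 1).
credences = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# Hypothetical choiceworthiness of each action under each theory.
# ASSUMPTION: all scores are on one comparable scale.
choiceworthiness = {
    "utilitarian":   {"deploy": 10.0,  "delay": 4.0},
    "deontological": {"deploy": -20.0, "delay": 5.0},
    "virtue":        {"deploy": -5.0,  "delay": 3.0},
}


def my_favorite_theory(credences, cw):
    """Act on the single most trusted theory and ignore the rest."""
    favorite = max(credences, key=credences.get)
    return max(cw[favorite], key=cw[favorite].get)


def expected_choiceworthiness(credences, cw):
    """Weight each theory's score by its credence, then pick the best action."""
    actions = next(iter(cw.values())).keys()
    scores = {
        action: sum(credences[theory] * cw[theory][action] for theory in credences)
        for action in actions
    }
    return max(scores, key=scores.get), scores


mft_choice = my_favorite_theory(credences, choiceworthiness)
mec_choice, mec_scores = expected_choiceworthiness(credences, choiceworthiness)

print("My Favorite Theory picks:", mft_choice)
print("Expected choiceworthiness picks:", mec_choice, mec_scores)
```

With these made-up numbers, the most trusted theory favors deployment, while the credence-weighted view is pulled toward delay by a strong objection from a less trusted theory. That is the overconfidence failure in miniature.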

Social welfare functions (Ch 6.8) aggregate individual welfare into a society-wide measure. Utilitarian: sum directly. Prioritarian: weight worse-off individuals more heavily. Same data, different SWFs, different deployment verdicts. Cost-benefit analysis is the practical incarnation: cheap, widely used, with two named blind spots. Financial-proxy assumption: a thousand dollars of harm to a low-income person is not equivalent to a thousand dollars of harm to a high-income person, but treating dollars as a proxy for utility counts the two as the same. Distributional-impact neglect: cost-benefit analysis aggregates costs and benefits without weighting by who bears them, so a deployment whose benefits go to one group and whose costs fall on another can show net-positive while being rejected under a prioritarian SWF.
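A minimal sketch of how the two SWFs can disagree on the same data. The welfare levels are hypothetical, and the square root is only one possible choice of concave transform for the prioritarian weighting; the chapter does not prescribe it.

```python
import math


def utilitarian_swf(welfare):
    """Sum individual welfare directly."""
    return sum(welfare)


def prioritarian_swf(welfare):
    """Sum a concave transform of welfare, so gains to the worse-off count more.

    ASSUMPTION: sqrt is an illustrative transform; welfare must be non-negative.
    """
    return sum(math.sqrt(w) for w in welfare)


# Hypothetical welfare levels for four people (first two are worse off).
status_quo = [9.0, 9.0, 16.0, 16.0]
# Hypothetical deployment: the better-off gain, the worse-off lose.
deployment = [1.0, 1.0, 25.0, 25.0]

ships_utilitarian = utilitarian_swf(deployment) > utilitarian_swf(status_quo)
ships_prioritarian = prioritarian_swf(deployment) > prioritarian_swf(status_quo)

print("Utilitarian:", utilitarian_swf(status_quo), "->", utilitarian_swf(deployment))
print("Prioritarian:", prioritarian_swf(status_quo), "->", prioritarian_swf(deployment))
print("Ships under utilitarian SWF:", ships_utilitarian)
print("Ships under prioritarian SWF:", ships_prioritarian)
```

The utilitarian total rises because the gains outweigh the losses in aggregate, while the prioritarian total falls because the losses land on the people who were already worse off. An unweighted cost-benefit tally behaves like the utilitarian sum here, which is the distributional blind spot.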

The worked loan-approval illustration: a model with 92 percent accuracy on the majority group and 78 percent on a minority Group A ships under a utilitarian SWF (aggregate revenue is positive) and does not ship under a prioritarian SWF (the higher false-rejection rate on Group A is amplified by the prioritarian weighting). The chapter’s section on fairness (Ch 6.3) adds three named criteria (demographic parity, equalized odds, calibration), which the formal-fairness literature has shown cannot all be satisfied at once in general. So even within a single SWF, picking which fairness criterion to enforce is another value-loading decision.
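The fairness criteria can be checked from per-group confusion counts. The sketch below uses hypothetical counts, chosen only so that the two accuracies match the 92 and 78 percent figures above; the group sizes and error breakdowns are invented. Equal precision across groups (predictive parity) is used here as a simple binary-decision stand-in for calibration, which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for one group; 'positive' means the loan is approved."""
    tp: int  # approved, would repay
    fp: int  # approved, would default
    fn: int  # rejected, would repay (a false rejection)
    tn: int  # rejected, would default

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    @property
    def selection_rate(self) -> float:
        return (self.tp + self.fp) / self.total

    @property
    def true_positive_rate(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    @property
    def false_rejection_rate(self) -> float:
        return self.fn / (self.tp + self.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)


# Hypothetical counts: only the resulting accuracies (0.92, 0.78) echo the text.
majority = Confusion(tp=460, fp=40, fn=40, tn=460)
group_a = Confusion(tp=33, fp=5, fn=17, tn=45)

gaps = {
    # Demographic parity: equal approval rates across groups.
    "demographic_parity": abs(majority.selection_rate - group_a.selection_rate),
    # Equalized odds: equal true-positive and false-positive rates.
    "equalized_odds_tpr": abs(majority.true_positive_rate - group_a.true_positive_rate),
    "equalized_odds_fpr": abs(majority.false_positive_rate - group_a.false_positive_rate),
    # Predictive parity (ASSUMED stand-in for calibration): equal precision.
    "predictive_parity": abs(majority.precision - group_a.precision),
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.3f}")
```

A gap of zero means the criterion holds exactly. With these counts every gap is nonzero, and shrinking one by adjusting approval thresholds will typically move the others, which is where the choice of criterion becomes a value judgment.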

The wider Ch 6 catalog (sections 6.2 through 6.7) covers law, fairness, the economic engine, wellbeing, preferences, and happiness. The lesson body notes that wellbeing, preferences, and happiness are not interchangeable: a system optimizing for preferences (what users click) can produce outcomes that reduce wellbeing (what makes lives go well) and reduce happiness (subjective affect). The proxy-gaming failures from L4 turn on these distinctions.

The L4 callback: outer alignment is hard because the loss function does not capture the goal. L7 names a deeper reason: there is no single goal to capture; there are many stakeholders, many ethical frameworks, and any specification is implicitly choosing between them. The honest version of value-loading is to make that choice transparent and contestable, not to hide it inside a loss function.

The L7 capability is the move from naming to defending: explain moral uncertainty without leaning on jargon; name the three strategies with their tradeoffs; distinguish utilitarian from prioritarian SWFs and identify a deployment decision the choice changes; recognize cost-benefit analysis as an incomplete approximation; and connect all of this to L4’s outer-alignment problem. Practice has a moral-parliament design exercise, a worked loan-approval SWF decision, and a cost-benefit-analysis critique.

L8 takes the multi-agent dynamics L6 previewed and works them at full depth (game theory, cooperation, conflict). L9 brings governance as the policy-layer instrument. The Phase 3 picture closes the track.