Cheatsheet: beneficial AI and machine ethics
Moral uncertainty: definition and strategies
Definition (Ch 6.9): not knowing which moral beliefs are correct, where the disagreement survives sustained reflection. It is not a measurement problem that one more experiment would resolve.
Why it matters for AI: “AI systems should represent moral uncertainty to avoid acting on overconfidence, which could lead to outcomes that humans consider morally reprehensible” (Hendrycks Ch 6.9).
Three strategies
| Strategy | How it works | Advantage | Cost |
|---|---|---|---|
| My Favorite Theory | Pick the framework you trust most, act on it consistently | Decisive; tractable specification | Reproduces the failure mode that representing moral uncertainty is meant to prevent: a bad outcome pursued with high confidence |
| Maximizing expected choiceworthiness | Weight frameworks by credence; pick the action that maximizes expected moral value across frameworks | Principled (Bayesian decision theory) | Requires a common unit for comparing moral value across frameworks; such a unit may not be coherent |
| Moral parliament | Simulate representatives of different perspectives + stakeholders; deliberate to compromise; act on compromise | Handles heterogeneity; adaptable; prefers compromise | Parliament design itself encodes value judgments (which seats, what weights, what decision rule) |
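
The contrast between the first two strategies can be sketched in a few lines. This is a minimal illustration, not the chapter's method: the theory names, credences, and scores below are hypothetical, and the scores are assumed to already sit on one shared scale, which is exactly the assumption the Cost column flags as contestable.

```python
# Hypothetical credences over frameworks (assumed to sum to 1).
credences = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# Hypothetical choiceworthiness[theory][action], assumed to be on one shared scale.
choiceworthiness = {
    "utilitarian": {"ship": 10.0, "delay": 4.0},
    "deontological": {"ship": -30.0, "delay": 5.0},
    "virtue": {"ship": -5.0, "delay": 6.0},
}


def my_favorite_theory(credences, choiceworthiness):
    """Act on the single highest-credence theory and ignore the rest."""
    favorite = max(credences, key=credences.get)
    scores = choiceworthiness[favorite]
    return max(scores, key=scores.get)


def expected_choiceworthiness(credences, choiceworthiness):
    """Credence-weighted score of each action across all theories."""
    actions = next(iter(choiceworthiness.values())).keys()
    return {
        action: sum(credences[t] * choiceworthiness[t][action] for t in credences)
        for action in actions
    }


def maximize_expected_choiceworthiness(credences, choiceworthiness):
    """Pick the action with the highest expected choiceworthiness."""
    expected = expected_choiceworthiness(credences, choiceworthiness)
    return max(expected, key=expected.get)


print(my_favorite_theory(credences, choiceworthiness))
print(maximize_expected_choiceworthiness(credences, choiceworthiness))
```

With these made-up numbers, My Favorite Theory ships because the top-credence theory favors it, while expected choiceworthiness delays because the strong objection from a lower-credence theory outweighs the gain.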
Social welfare functions (Ch 6.8)
Aggregate individual wellbeing into societal wellbeing. The chapter framing: “Social welfare functions aggregate individual wellbeing into overall societal wellbeing” (Ch 6.8).
| SWF | Rule | Effect on the deployment verdict |
|---|---|---|
| Utilitarian | Sum (or mean) individual welfares directly | Approves when the aggregate is positive, even if the distribution is skewed |
| Prioritarian | Weighted sum with extra weight to worse-off individuals | Can reject that same skewed case, because losses to the worse-off count for more |
| Egalitarian (referenced in the literature, not the chapter’s named pair) | Reduce inequality between individuals | Ships on equal-distribution criteria, even at aggregate cost |
| Maximin / Rawls | Maximize the welfare of the worst-off individual | Verdict turns on the worst-off individual alone |
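
To see the verdict flip on one dataset, here is a minimal sketch that applies three of the rules to the same welfare vectors. The numbers are hypothetical, and the square-root transform is only one assumed way to give extra weight to the worse-off; the chapter does not prescribe a specific weighting.

```python
import math


def utilitarian(welfares):
    """Plain sum of individual welfares."""
    return sum(welfares)


def prioritarian(welfares):
    """Sum of a concave transform (sqrt is an assumption; needs welfare >= 0).

    A concave transform makes a unit of welfare count for more
    when it goes to someone who is worse off.
    """
    return sum(math.sqrt(w) for w in welfares)


def maximin(welfares):
    """Welfare of the worst-off individual."""
    return min(welfares)


# Hypothetical welfare levels for four people under each option.
status_quo = [9.0, 9.0, 9.0, 9.0]
deployment = [1.0, 1.0, 16.0, 25.0]

for rule in (utilitarian, prioritarian, maximin):
    verdict = "ship" if rule(deployment) > rule(status_quo) else "reject"
    print(rule.__name__, rule(status_quo), rule(deployment), verdict)
```

On this toy data the utilitarian sum favors deployment (43 against 36), while the prioritarian score (11 against 12) and maximin (1 against 9) both reject it.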
Cost-benefit analysis: the two named blind spots
| Blind spot | Mechanism | Operational fix |
|---|---|---|
| Financial-proxy assumption | A thousand dollars of harm is not the same wellbeing impact across income strata, but unadjusted dollar figures treat them symmetrically | Convert dollar impact to utility-adjusted impact using subgroup-specific marginal utility; report the distribution explicitly |
| Distributional-impact neglect | Costs and benefits are aggregated across the population without weighting by who bears them | Report distributional impact by group alongside net; apply a non-utilitarian SWF to the same data |
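
Both operational fixes can be sketched together. This is an illustration under stated assumptions: the groups, incomes, and dollar changes are hypothetical, and logarithmic utility of income is an assumed stand-in for "subgroup-specific marginal utility", not something the chapter specifies.

```python
import math

# Hypothetical groups: (name, baseline income, dollar change per person, headcount).
groups = [
    ("low_income", 20_000.0, -1_000.0, 100),
    ("high_income", 200_000.0, 1_500.0, 100),
]


def net_dollars(groups):
    """Classic cost-benefit total: dollars summed without regard to who bears them."""
    return sum(delta * count for _, _, delta, count in groups)


def log_utility_change(income, delta):
    """Utility change under assumed log utility; needs income + delta > 0."""
    return math.log(income + delta) - math.log(income)


def net_utility(groups):
    """Utility-adjusted total across all groups."""
    return sum(
        log_utility_change(income, delta) * count
        for _, income, delta, count in groups
    )


def distributional_report(groups):
    """Per-group dollar and utility impact, reported alongside the net."""
    return {
        name: {
            "dollars": delta * count,
            "utility": log_utility_change(income, delta) * count,
        }
        for name, income, delta, count in groups
    }


print("net dollars:", net_dollars(groups))
print("net utility:", round(net_utility(groups), 3))
for name, row in distributional_report(groups).items():
    print(name, row)
```

With these made-up figures the net dollar impact is positive while the utility-adjusted net is negative, because the loss falls on the group for whom a dollar matters more.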
Fairness criteria (Ch 6.3)
| Criterion | What it requires |
|---|---|
| Demographic parity | Approval / rejection rates equal across groups |
| Equalized odds | True-positive AND false-positive rates equal across groups |
| Calibration | Predicted probability matches realized rate across groups |
Not-jointly-satisfiable result: the formal-fairness literature has shown that these three criteria cannot all hold simultaneously except in degenerate cases (perfect prediction, or no underlying difference between groups, i.e. equal base rates). Choosing which one to enforce is itself a value-loading decision.
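
The tension shows up even in a toy audit. The sketch below uses hypothetical confusion-matrix counts for two groups with different base rates. Since the predictions are binary, it uses positive predictive value (PPV) as an assumed stand-in for calibration. It also assumes every group contains positives, negatives, and at least one positive prediction, so no denominator is zero.

```python
def make_records(tp, fp, fn, tn):
    """Expand confusion-matrix counts into (y_true, y_pred) pairs."""
    return [(1, 1)] * tp + [(0, 1)] * fp + [(1, 0)] * fn + [(0, 0)] * tn


def group_metrics(records):
    """Per-group rates that the three fairness criteria compare."""
    n = len(records)
    tp = sum(1 for y, p in records if y == 1 and p == 1)
    fp = sum(1 for y, p in records if y == 0 and p == 1)
    fn = sum(1 for y, p in records if y == 1 and p == 0)
    tn = n - tp - fp - fn
    return {
        "base_rate": (tp + fn) / n,
        "selection_rate": (tp + fp) / n,  # demographic parity compares this
        "tpr": tp / (tp + fn),            # equalized odds compares this...
        "fpr": fp / (fp + tn),            # ...and this
        "ppv": tp / (tp + fp),            # binary stand-in for calibration
    }


# Hypothetical groups: same error rates, different base rates.
group_a = make_records(tp=4, fp=1, fn=1, tn=4)
group_b = make_records(tp=4, fp=4, fn=1, tn=16)

metrics = {"A": group_metrics(group_a), "B": group_metrics(group_b)}
for name, m in metrics.items():
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Here equalized odds holds (TPR 0.8 and FPR 0.2 in both groups), yet selection rates differ (0.5 against 0.32) and PPV differs (0.8 against 0.5), so demographic parity and the calibration proxy both fail once base rates diverge.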
Value-types: not interchangeable
| Value type | What it measures | Optimization risk if confused |
|---|---|---|
| Preferences | What users choose / click | Optimizing preferences alone can reduce wellbeing (engagement vs life going well) |
| Wellbeing | What makes lives go well (broad, includes flourishing, health, capability) | Hardest to measure; usually requires proxies |
| Happiness | Subjective affective state | Can be high while wellbeing is reduced (e.g., addiction) |
The L4 proxy-gaming failures often operate on these distinctions. A recommendation system optimizing for preferences (clicks) can reduce wellbeing even though its optimization target never changes.
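
A toy recommender makes the gap concrete. The catalog, click rates, and wellbeing deltas below are hypothetical. In practice wellbeing is not directly observed, which is why the table calls it the hardest value type to measure.

```python
# Hypothetical catalog; wellbeing_delta is an assumed, normally unobserved quantity.
items = [
    {"name": "outrage_clip", "click_rate": 0.30, "wellbeing_delta": -0.4},
    {"name": "tutorial", "click_rate": 0.12, "wellbeing_delta": 0.5},
    {"name": "news_digest", "click_rate": 0.18, "wellbeing_delta": 0.1},
]


def recommend(items, objective):
    """Return the item that maximizes the given objective field."""
    return max(items, key=lambda item: item[objective])


by_clicks = recommend(items, "click_rate")
by_wellbeing = recommend(items, "wellbeing_delta")

print("optimizing preferences:", by_clicks["name"], by_clicks["wellbeing_delta"])
print("optimizing wellbeing:", by_wellbeing["name"], by_wellbeing["wellbeing_delta"])
```

The click-maximizing choice is the item with the lowest wellbeing delta, so the preference metric looks healthy while the thing it was meant to track gets worse.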
The L7 capability (five-part protocol)
For a deployment that touches ethical judgment:
- Name moral uncertainty. Explain in plain language that we do not know which ethical framework is correct, that the disagreement survives reflection, and that an AI acting with high confidence in one framework risks outcomes other frameworks reject.
- Name the strategy you would use. My Favorite Theory, expected choiceworthiness, or moral parliament. State the tradeoff.
- Pick an SWF. Utilitarian, prioritarian, or something else. Defend the choice against the alternative.
- Recognize cost-benefit blind spots. If the deployment was justified via cost-benefit analysis, identify the financial-proxy assumption and distributional-impact neglect in the specific case.
- Connect to L4 outer alignment. The value-loading question Ch 6 is asking is the substrate question Ch 3.4 left open: the loss function cannot capture an intent that has not been chosen.
Cross-track and within-track pointers
- L4 (alignment): the substrate question Ch 3.4 left open is what Ch 6 asks. L7 is the deeper layer of the outer-alignment problem.
- L3 (monitoring + robustness): the proxy-gaming failure mode operates on the wellbeing-vs-preferences distinction L7 names. L7 is the layer that diagnoses the distinction.
- L8 (collective action, Ch 7): extends the multi-stakeholder framing to multi-agent dynamics. The moral-parliament logic becomes the multi-actor coordination problem at L8.
- L9 (governance, Ch 8): brings the policy-layer instrument that operates outside any individual deployment. The fairness-criterion choice (which L7 surfaces) becomes a regulatory choice in L9.