Beneficial AI and machine ethics: cheatsheet

Moral uncertainty: definition and strategies

Definition (Ch 6.9): not knowing which moral beliefs are correct, where the disagreement survives sustained reflection. Not a measurement problem one more experiment resolves.

Why it matters for AI: “AI systems should represent moral uncertainty to avoid acting on overconfidence, which could lead to outcomes that humans consider morally reprehensible” (Hendrycks Ch 6.9).

Three strategies

Strategy	How it works	Advantage	Cost
My Favorite Theory	Pick the framework you trust most, act on it consistently	Decisive; tractable specification	Risks exactly the failure mode that representing moral uncertainty is meant to prevent: a high-confidence action whose outcome other frameworks reject
Maximizing expected choiceworthiness	Weight frameworks by credence; pick the action that maximizes expected moral value across frameworks	Principled (Bayesian decision theory)	Requires a common unit for comparing value across frameworks; such cross-framework comparisons may not be coherent
Moral parliament	Simulate representatives of different perspectives + stakeholders; deliberate to compromise; act on compromise	Handles heterogeneity; adaptable; prefers compromise	Parliament design itself encodes value judgments (which seats, what weights, what decision rule)
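
The first two rows of the table can be contrasted with a small calculation. This is a minimal sketch, not a method from the chapter: the framework names, credences, actions, and scores are all hypothetical, and it assumes (contestably, per the Cost column) that scores from different frameworks sit on one shared scale.

```python
# All numbers below are illustrative assumptions, not values from the source.
credences = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# choiceworthiness[framework][action]; assumes one shared scale across frameworks.
choiceworthiness = {
    "utilitarian":   {"ship": 10.0,  "ship_with_limits": 6.0, "hold": 0.0},
    "deontological": {"ship": -20.0, "ship_with_limits": 4.0, "hold": 2.0},
    "virtue":        {"ship": -5.0,  "ship_with_limits": 5.0, "hold": 1.0},
}


def my_favorite_theory(credences, cw):
    """Act on the single framework with the highest credence."""
    favorite = max(credences, key=credences.get)
    return max(cw[favorite], key=cw[favorite].get)


def expected_choiceworthiness(credences, cw):
    """Credence-weighted score of each action across all frameworks."""
    actions = next(iter(cw.values())).keys()
    return {a: sum(credences[t] * cw[t][a] for t in credences) for a in actions}


def maximize_expected_choiceworthiness(credences, cw):
    """Pick the action with the highest credence-weighted score."""
    scores = expected_choiceworthiness(credences, cw)
    return max(scores, key=scores.get)


print(my_favorite_theory(credences, choiceworthiness))
print(maximize_expected_choiceworthiness(credences, choiceworthiness))
```

With these made-up numbers, My Favorite Theory follows the utilitarian scores and picks "ship", while the credence-weighted rule picks "ship_with_limits" because the other two frameworks score "ship" strongly negative.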

Social welfare functions (Ch 6.8)

Purpose: aggregate individual wellbeing into societal wellbeing. The chapter framing: “Social welfare functions aggregate individual wellbeing into overall societal wellbeing” (Ch 6.8).

SWF	Rule	When deployment verdict changes
Utilitarian	Sum (or mean) individual welfares directly	Aggregate metric is positive but distribution is skewed
Prioritarian	Weighted sum with extra weight to worse-off individuals	The same skewed-distribution case: the extra weight on the worse-off can flip the verdict from approve to reject
Egalitarian (referenced in the literature, not the chapter’s named pair)	Reduce inequality between individuals	Equal-distribution shipping criteria, even at aggregate cost
Maximin / Rawls	Maximize the welfare of the worst-off individual	Worst-off-individual shipping criteria
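
A toy comparison shows how the verdict can change with the SWF. This is a sketch under stated assumptions: the welfare numbers are hypothetical, the prioritarian rule is modeled as a sum of square-root-transformed welfares (one common way to give extra weight to the worse-off, not necessarily the chapter's formulation), and "approve" simply means the SWF value goes up.

```python
import math


def utilitarian(welfares):
    """Plain sum of individual welfares."""
    return sum(welfares)


def prioritarian(welfares, transform=math.sqrt):
    """Sum of concavely transformed welfares (assumed transform: sqrt).

    A concave transform makes a unit of welfare count for more when it
    goes to someone who is worse off.
    """
    return sum(transform(w) for w in welfares)


def maximin(welfares):
    """Welfare of the worst-off individual."""
    return min(welfares)


def approves(swf, before, after):
    """Deployment is approved if the SWF value strictly increases."""
    return swf(after) > swf(before)


# Hypothetical welfare levels for four individuals.
before = [4.0, 4.0, 4.0, 4.0]
after = [0.0, 4.0, 4.0, 9.0]  # higher total, but one person loses everything

verdicts = {
    swf.__name__: approves(swf, before, after)
    for swf in (utilitarian, prioritarian, maximin)
}
print(verdicts)
```

On this data the utilitarian sum rises (16 to 17) and approves, while the prioritarian score falls (8 to 7) and the worst-off welfare falls (4 to 0), so both reject the same deployment.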

Cost-benefit analysis: blind spots

Blind spot	Mechanism	Operational fix
Financial-proxy assumption	A thousand dollars of harm is not the same wellbeing impact across income strata, but dollars used as a stand-in for utility treat every stratum symmetrically	Convert dollar impact to utility-adjusted impact using subgroup-specific marginal utility; report distributional impact explicitly
Distributional-impact neglect	Costs and benefits are aggregated across the population without weighting by who bears them	Report distributional impact by group alongside net; apply a non-utilitarian SWF to the same data
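
Both operational fixes can be sketched together. This example assumes logarithmic utility of income (a standard textbook choice for diminishing marginal utility, not something the chapter specifies); the groups, incomes, and dollar impacts are hypothetical.

```python
import math


def log_utility_change(income, dollar_impact):
    """Change in log-utility for one person; assumes utility = ln(income)."""
    return math.log(income + dollar_impact) - math.log(income)


# Hypothetical groups: per-person income and per-person dollar impact.
groups = [
    {"name": "low_income", "income": 20_000, "dollar_impact": -1_000, "population": 100},
    {"name": "high_income", "income": 200_000, "dollar_impact": 1_500, "population": 100},
]


def distributional_report(groups):
    """Report dollar impact and utility-adjusted impact per group."""
    rows = []
    for g in groups:
        per_person = log_utility_change(g["income"], g["dollar_impact"])
        rows.append({
            "name": g["name"],
            "net_dollars": g["dollar_impact"] * g["population"],
            "utility_change": per_person * g["population"],
        })
    return rows


report = distributional_report(groups)
net_dollars = sum(r["net_dollars"] for r in report)
net_utility = sum(r["utility_change"] for r in report)

for row in report:
    print(row)
print("net dollars:", net_dollars, "net utility change:", net_utility)
```

Under these assumptions the net dollar figure is positive while the utility-adjusted total is negative, which is the case an aggregate cost-benefit number would hide and a per-group report would surface.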

Fairness criteria (Ch 6.3)

Criterion	What it requires
Demographic parity	Approval / rejection rates equal across groups
Equalized odds	True-positive AND false-positive rates equal across groups
Calibration	Predicted probability matches realized rate across groups

Not-jointly-satisfiable result: the formal-fairness literature has shown these three cannot all hold simultaneously except in degenerate cases (perfect prediction, or equal base rates across groups, i.e., no underlying difference in outcome rates). Picking which to enforce is itself a value-loading decision.
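
The tension can be seen on toy data. In this sketch the labels and predictions are hypothetical, and because the predictions are binary, calibration is approximated by positive predictive value (the realized positive rate among those predicted positive); that substitution is an assumption of the example, not the table's definition.

```python
def group_metrics(y_true, y_pred):
    """Per-group rates behind the three criteria (binary labels and predictions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "selection_rate": (tp + fp) / len(pairs),  # demographic parity
        "tpr": tp / (tp + fn),                     # equalized odds, part 1
        "fpr": fp / (fp + tn),                     # equalized odds, part 2
        "ppv": tp / (tp + fp),                     # calibration proxy (assumption)
    }


# Hypothetical groups with different base rates: 5/10 vs 5/20 true positives.
a_true = [1] * 5 + [0] * 5
a_pred = [1, 1, 1, 1, 0] + [1, 0, 0, 0, 0]
b_true = [1] * 5 + [0] * 15
b_pred = [1, 1, 1, 1, 0] + [1, 1, 1] + [0] * 12

group_a = group_metrics(a_true, a_pred)
group_b = group_metrics(b_true, b_pred)

for name in ("selection_rate", "tpr", "fpr", "ppv"):
    print(name, round(group_a[name], 3), round(group_b[name], 3))
```

Here the classifier has the same true-positive and false-positive rates in both groups, so equalized odds holds, yet selection rates and positive predictive values differ because the base rates differ.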

Value-types: not interchangeable

Value type	What it measures	Optimization risk if confused
Preferences	What users choose / click	Optimizing preferences alone can reduce wellbeing (engagement vs life going well)
Wellbeing	What makes lives go well (broad, includes flourishing, health, capability)	Hardest to measure; usually requires proxies
Happiness	Subjective affective state	Can be high while wellbeing is reduced (e.g., addiction)

The L4 proxy-gaming failures often operate on these distinctions. A recommendation system optimizing for preferences (clicks) can produce outcomes that reduce wellbeing without changing the optimization target.
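
A deliberately tiny illustration of the recommendation example: the item names, click probabilities, and wellbeing effects are invented for the sketch, and a real system would have no directly observed wellbeing column.

```python
# Hypothetical catalog: click_prob stands in for preferences,
# wellbeing_delta for the effect on how the user's life goes.
items = [
    {"id": "outrage_clip", "click_prob": 0.30, "wellbeing_delta": -0.4},
    {"id": "how_to_guide", "click_prob": 0.12, "wellbeing_delta": 0.3},
    {"id": "friend_update", "click_prob": 0.18, "wellbeing_delta": 0.1},
]


def top_by(items, key):
    """Return the item that maximizes the given field."""
    return max(items, key=lambda item: item[key])


recommended = top_by(items, "click_prob")       # what a click-optimizer serves
best_for_user = top_by(items, "wellbeing_delta")

print(recommended["id"], recommended["wellbeing_delta"])
print(best_for_user["id"], best_for_user["wellbeing_delta"])
```

The click-optimizer is working exactly as specified when it serves the item with a negative wellbeing effect; the failure sits in the choice of value type, not in the optimization.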

The L7 capability (five-part protocol)

For a deployment that touches ethical judgment:

1. Name moral uncertainty. Explain in plain language that there is no single correct ethical framework, the disagreement survives reflection, and AI acting on high confidence in one framework risks outcomes other frameworks reject.
2. Name the strategy you would use. My Favorite Theory, expected choiceworthiness, or moral parliament. State the tradeoff.
3. Pick an SWF. Utilitarian, prioritarian, or something else. Defend the choice against the alternative.
4. Recognize cost-benefit blind spots. If the deployment was justified via cost-benefit analysis, identify the financial-proxy assumption and distributional-impact neglect in the specific case.
5. Connect to L4 outer alignment. The value-loading question Ch 6 is asking is the substrate question Ch 3.4 left open: the loss function cannot capture an intent that has not been chosen.

Cross-track and within-track pointers

L4 (alignment): the substrate question Ch 3.4 left open is what Ch 6 asks. L7 is the deeper layer of the outer-alignment problem.
L3 (monitoring + robustness): the proxy-gaming failure mode operates on the wellbeing-vs-preferences distinction L7 names. L7 is the layer that diagnoses the distinction.
L8 (collective action, Ch 7): extends the multi-stakeholder framing to multi-agent dynamics. The moral-parliament logic becomes the multi-actor coordination problem at L8.
L9 (governance, Ch 8): adds policy-layer instruments that operate outside any individual deployment. The fairness-criterion choice (which L7 surfaces) becomes a regulatory choice in L9.