
Cheatsheet: beneficial AI and machine ethics

Moral uncertainty: definition and strategies


Definition (Ch 6.9): not knowing which moral beliefs are correct, where the disagreement survives sustained reflection. Not a measurement problem one more experiment resolves.

Why for AI: “AI systems should represent moral uncertainty to avoid acting on overconfidence, which could lead to outcomes that humans consider morally reprehensible” (Hendrycks Ch 6.9).

| Strategy | How it works | Advantage | Cost |
|---|---|---|---|
| My Favorite Theory | Pick the framework you trust most and act on it consistently | Decisive; tractable to specify | Produces exactly the high-confidence, bad-outcome failure mode the framing was meant to prevent |
| Maximizing expected choiceworthiness | Weight frameworks by credence; pick the action that maximizes expected moral value across frameworks | Principled (Bayesian decision theory) | Requires a common unit for comparing value across frameworks (intertheoretic comparison), which may not be coherent |
| Moral parliament | Simulate representatives of different perspectives and stakeholders; deliberate toward a compromise; act on the compromise | Handles heterogeneity; adaptable; favors compromise | The parliament's design itself encodes value judgments (which seats, what weights, what decision rule) |
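A minimal Python sketch of how the three strategies can disagree on the same decision. Everything in it is hypothetical: the framework names, credences, actions, and choiceworthiness scores are invented for illustration. Putting all scores on one shared scale is exactly the intertheoretic-comparison assumption flagged as the cost of expected choiceworthiness. The parliament is reduced to a credence-weighted Borda count, a crude stand-in for deliberation and bargaining, not a faithful model of it.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]

# Hypothetical credences in each framework (they sum to 1).
CREDENCES: Dict[str, float] = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# Hypothetical choiceworthiness of each action under each framework.
# Treating these numbers as one common scale is an assumption, not a given.
CHOICEWORTHINESS: Scores = {
    "deploy_now":         {"utilitarian": 10.0, "deontological": -20.0, "virtue": 0.0},
    "deploy_with_review": {"utilitarian": 6.0,  "deontological": 2.0,   "virtue": 4.0},
    "do_not_deploy":      {"utilitarian": 0.0,  "deontological": 5.0,   "virtue": 2.0},
}


def my_favorite_theory(cred: Dict[str, float], cw: Scores) -> str:
    """Act on the single most-credible framework and ignore the rest."""
    favorite = max(cred, key=cred.__getitem__)
    return max(cw, key=lambda action: cw[action][favorite])


def expected_choiceworthiness(cred: Dict[str, float], cw: Scores) -> str:
    """Pick the action with the highest credence-weighted score."""
    def expected(action: str) -> float:
        return sum(p * cw[action][theory] for theory, p in cred.items())
    return max(cw, key=expected)


def borda_parliament(cred: Dict[str, float], cw: Scores) -> str:
    """Crude parliament: each framework ranks the actions and its Borda
    points are weighted by credence. Only rankings are used, so no common
    scale is needed, but seat weights and the voting rule are value choices."""
    actions = list(cw)
    points = {a: 0.0 for a in actions}
    for theory, weight in cred.items():
        ranking = sorted(actions, key=lambda a: cw[a][theory], reverse=True)
        for rank, action in enumerate(ranking):
            points[action] += weight * (len(actions) - 1 - rank)
    return max(points, key=points.__getitem__)


for strategy in (my_favorite_theory, expected_choiceworthiness, borda_parliament):
    print(f"{strategy.__name__}: {strategy(CREDENCES, CHOICEWORTHINESS)}")
```

With these numbers, My Favorite Theory follows the 0.5-credence utilitarian view to `deploy_now`, while expected choiceworthiness and the Borda parliament both land on `deploy_with_review`, because `deploy_now` scores catastrophically under the deontological view.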

Social welfare functions (SWFs)

Chapter framing: “Social welfare functions aggregate individual wellbeing into overall societal wellbeing” (Ch 6.8). The choice of SWF decides whether a given distribution of gains and losses counts as an improvement.

| SWF | Rule | Effect on the deployment verdict |
|---|---|---|
| Utilitarian | Sum (or average) individual welfares directly | Approves when the aggregate metric is positive, even if the distribution is skewed |
| Prioritarian | Weighted sum giving extra weight to worse-off individuals | Can reject that same skewed-distribution case |
| Egalitarian (referenced in the literature, not one of the chapter’s named pair) | Reduce inequality between individuals | Applies equal-distribution shipping criteria, even at an aggregate cost |
| Maximin / Rawls | Maximize the welfare of the worst-off individual | Judges the deployment by the worst-off individual's welfare alone |
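A minimal sketch of these rules applied to one hypothetical deployment. The welfare numbers, the square-root transform used for prioritarianism, and the range-based inequality measure used for egalitarianism are illustrative assumptions, not the chapter's definitions.

```python
import math
from typing import Callable, Dict, Sequence

Welfare = Sequence[float]


def utilitarian(w: Welfare) -> float:
    # Plain sum: how welfare is distributed does not matter.
    return sum(w)


def prioritarian(w: Welfare) -> float:
    # Concave transform (square root, an illustrative choice) so that a unit
    # of welfare counts for more when it goes to someone worse off.
    # Assumes non-negative welfare values.
    return sum(math.sqrt(x) for x in w)


def egalitarian(w: Welfare) -> float:
    # Negative range: higher means more equal. One of many possible
    # inequality measures, picked only because it is simple.
    return -(max(w) - min(w))


def maximin(w: Welfare) -> float:
    # Rawlsian rule: only the worst-off individual counts.
    return min(w)


SWFS: Dict[str, Callable[[Welfare], float]] = {
    "utilitarian": utilitarian,
    "prioritarian": prioritarian,
    "egalitarian": egalitarian,
    "maximin": maximin,
}


def ship(swf: Callable[[Welfare], float], before: Welfare, after: Welfare) -> bool:
    """Ship only if the deployment strictly improves welfare under this SWF."""
    return swf(after) > swf(before)


# Hypothetical per-person welfare without and with the deployment.
BEFORE = [4.0, 4.0, 4.0, 4.0]
AFTER = [12.0, 6.0, 1.0, 1.0]  # higher total, skewed distribution

for name, swf in SWFS.items():
    print(f"{name}: ship={ship(swf, BEFORE, AFTER)}")
```

The deployment raises total welfare from 16 to 20, so the utilitarian rule ships it; the prioritarian, egalitarian, and maximin rules all reject it, because the gains go to two people while the other two fall from 4 to 1.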

Cost-benefit analysis: the two named blind spots

| Blind spot | Mechanism | Operational fix |
|---|---|---|
| Financial-proxy assumption | A thousand dollars of harm does not have the same wellbeing impact across income strata, but a dollar-denominated analysis treats those harms as equal | Convert dollar impact to utility-adjusted impact using subgroup-specific marginal utility; report distributional effects explicitly |
| Distributional-impact neglect | Costs and benefits are aggregated across the population without weighting by who bears them | Report distributional impact by group alongside the net figure; apply a non-utilitarian SWF to the same data |
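A minimal sketch of both fixes on one hypothetical deployment: per-group reporting, plus a conversion from dollars to utility. Log utility, u(c) = ln(c), is an illustrative assumption chosen because its marginal utility (1/c) falls with income; the group sizes, incomes, and dollar impacts are invented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str
    size: int             # number of people in the group
    income: float         # per-person income (hypothetical)
    dollar_change: float  # per-person impact: positive = benefit, negative = harm


def utility_change(income: float, dollar_change: float) -> float:
    # Log utility is an assumption: marginal utility is 1/income, so the
    # same dollar amount matters more at lower income.
    return math.log(income + dollar_change) - math.log(income)


# Hypothetical deployment: harms a low-income group, benefits a high-income one.
GROUPS = [
    Group("low_income", size=1_000, income=20_000.0, dollar_change=-1_000.0),
    Group("high_income", size=1_000, income=200_000.0, dollar_change=1_500.0),
]

net_dollars = 0.0
net_utility = 0.0
for g in GROUPS:
    dollars = g.size * g.dollar_change
    utility = g.size * utility_change(g.income, g.dollar_change)
    net_dollars += dollars
    net_utility += utility
    # Distributional reporting: show each group, not just the net.
    print(f"{g.name}: dollars={dollars:+,.0f} utility={utility:+.2f}")

print(f"net: dollars={net_dollars:+,.0f} utility={net_utility:+.2f}")
```

The dollar net is +500,000, so a plain cost-benefit analysis approves; the utility-adjusted net is about -43.8, because each low-income person's $1,000 loss costs roughly ten times the utility of the same loss at the higher income.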

Fairness criteria

| Criterion | What it requires |
|---|---|
| Demographic parity | Positive-decision rates (e.g., approval) equal across groups |
| Equalized odds | True-positive AND false-positive rates equal across groups |
| Calibration | Within each group, predicted probabilities match realized outcome rates |

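Impossibility result: the formal-fairness literature has shown that these three cannot all hold simultaneously except in degenerate cases. When base rates differ across groups, calibration and equalized odds are compatible only under perfect prediction, and perfect prediction then violates demographic parity, since each group's selection rate equals its base rate. Choosing which criterion to enforce is itself a value-loading decision.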
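A minimal sketch that measures all three criteria on hypothetical data: two groups receive the same calibrated scores but have different base rates (0.5 vs. 0.35). The records, the 0.5 decision threshold, and the helper names are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    group: str
    score: float  # predicted probability of a positive outcome
    label: int    # realized outcome: 1 = positive, 0 = negative


def make_group(name: str, bins: List[Tuple[float, int, int]]) -> List[Record]:
    """Build records from (score, n_positive, n_negative) triples."""
    records: List[Record] = []
    for score, n_pos, n_neg in bins:
        records += [Record(name, score, 1) for _ in range(n_pos)]
        records += [Record(name, score, 0) for _ in range(n_neg)]
    return records


def fairness_report(records: List[Record], threshold: float = 0.5) -> Dict[str, dict]:
    by_group: Dict[str, List[Record]] = defaultdict(list)
    for r in records:
        by_group[r.group].append(r)

    report: Dict[str, dict] = {}
    for group, rs in by_group.items():
        predicted = [r.score >= threshold for r in rs]
        on_pos = [p for p, r in zip(predicted, rs) if r.label == 1]
        on_neg = [p for p, r in zip(predicted, rs) if r.label == 0]
        # Calibration: realized positive rate among records sharing each score.
        labels_by_score: Dict[float, List[int]] = defaultdict(list)
        for r in rs:
            labels_by_score[r.score].append(r.label)
        report[group] = {
            "selection_rate": sum(predicted) / len(rs),  # demographic parity
            "tpr": sum(on_pos) / len(on_pos),            # equalized odds, part 1
            "fpr": sum(on_neg) / len(on_neg),            # equalized odds, part 2
            "calibration": {s: sum(ls) / len(ls) for s, ls in labels_by_score.items()},
        }
    return report


# Hypothetical data: same calibrated scores in both groups, different base rates.
RECORDS = make_group("A", [(0.8, 4, 1), (0.2, 1, 4)]) + make_group(
    "B", [(0.8, 4, 1), (0.2, 3, 12)]
)

for group, metrics in fairness_report(RECORDS).items():
    print(group, metrics)
```

Both groups are calibrated (realized rates of 0.8 and 0.2 at scores 0.8 and 0.2), yet at the same threshold group A has selection rate 0.5, TPR 0.8, and FPR 0.2, while group B has 0.25, about 0.57, and about 0.08. Calibration holds; demographic parity and equalized odds both fail.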

Preferences, wellbeing, and happiness

| Value type | What it measures | Optimization risk if confused |
|---|---|---|
| Preferences | What users choose or click | Optimizing preferences alone can reduce wellbeing (engagement vs. a life going well) |
| Wellbeing | What makes lives go well (broad: includes flourishing, health, capability) | Hardest to measure; usually requires proxies |
| Happiness | Subjective affective state | Can be high while wellbeing is reduced (e.g., addiction) |

The L4 proxy-gaming failures often operate on these distinctions. A recommendation system optimizing for preferences (clicks) can reduce wellbeing while doing exactly what it was built to do: nothing has to go wrong with the optimization itself, because the target measures preferences, not wellbeing.
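A minimal sketch of the preferences-vs-wellbeing gap, assuming a hypothetical catalog in which each item has an observable click probability and a hidden wellbeing effect (both invented). The recommender ranks correctly under whichever objective it is given; only the objective changes between the two runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    name: str
    click_prob: float       # revealed preference: what the system can observe
    wellbeing_delta: float  # effect on the user's life going well: rarely observed


# Hypothetical catalog; the numbers are invented for illustration.
CATALOG = [
    Item("outrage_clip", click_prob=0.9, wellbeing_delta=-0.3),
    Item("doomscroll_feed", click_prob=0.8, wellbeing_delta=-0.2),
    Item("friend_update", click_prob=0.5, wellbeing_delta=0.3),
    Item("tutorial", click_prob=0.4, wellbeing_delta=0.5),
]


def recommend(catalog: List[Item], objective: Callable[[Item], float], k: int) -> List[Item]:
    """Return the top-k items under the given objective."""
    return sorted(catalog, key=objective, reverse=True)[:k]


for label, objective in [("clicks", lambda i: i.click_prob),
                         ("wellbeing", lambda i: i.wellbeing_delta)]:
    slate = recommend(CATALOG, objective, k=2)
    clicks = sum(i.click_prob for i in slate)
    wellbeing = sum(i.wellbeing_delta for i in slate)
    print(f"optimize {label}: {[i.name for i in slate]} "
          f"expected clicks={clicks:.1f} wellbeing={wellbeing:+.1f}")
```

Optimizing clicks selects `outrage_clip` and `doomscroll_feed`, which earn more expected clicks but a negative net wellbeing effect; optimizing wellbeing selects `tutorial` and `friend_update`, with fewer clicks and a positive effect.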

For a deployment that touches ethical judgment:

  1. Name moral uncertainty. Explain in plain language that no ethical framework is known to be correct, that the disagreement survives reflection, and that an AI acting with high confidence on one framework risks outcomes the other frameworks reject.
  2. Name the strategy you would use. My Favorite Theory, expected choiceworthiness, or moral parliament. State the tradeoff.
  3. Pick an SWF. Utilitarian, prioritarian, or something else. Defend the choice against the alternatives.
  4. Recognize cost-benefit blind spots. If the deployment was justified via cost-benefit analysis, identify the financial-proxy assumption and distributional-impact neglect in the specific case.
  5. Connect to L4 outer alignment. The value-loading question Ch 6 is asking is the substrate question Ch 3.4 left open: the loss function cannot capture an intent that has not been chosen.
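One hedged way to make the checklist enforceable is a review record that a deployment cannot pass until every step has a written answer. The class, field, and method names below are hypothetical, not part of any existing tooling.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EthicsReview:
    # One free-text field per checklist step (hypothetical field names).
    moral_uncertainty: str = ""  # 1. plain-language statement of moral uncertainty
    strategy: str = ""           # 2. MFT / expected choiceworthiness / parliament, with tradeoff
    swf: str = ""                # 3. chosen SWF, defended against the alternatives
    cba_blind_spots: str = ""    # 4. financial-proxy and distributional-impact notes
    alignment_link: str = ""     # 5. connection to outer alignment (Ch 3.4)

    def missing_steps(self) -> List[str]:
        """Names of checklist fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready(self) -> bool:
        return not self.missing_steps()


review = EthicsReview(
    moral_uncertainty="No ethical framework is known to be correct.",
    strategy="Moral parliament; the seat weights are themselves a value choice.",
)
print(review.ready(), review.missing_steps())
```

Here the review is not ready: steps 3 to 5 (`swf`, `cba_blind_spots`, `alignment_link`) are still empty.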

Connections to other layers

  • L4 (alignment): the substrate question Ch 3.4 left open is what Ch 6 asks. L7 is the deeper layer of the outer-alignment problem.
  • L3 (monitoring + robustness): the proxy-gaming failure mode operates on the wellbeing-vs-preferences distinction L7 names. L7 is the layer that diagnoses the distinction.
  • L8 (collective action, Ch 7): extends the multi-stakeholder framing to multi-agent dynamics. The moral-parliament logic becomes the multi-actor coordination problem at L8.
  • L9 (governance, Ch 8): brings the policy-layer instrument that operates outside any individual deployment. The fairness-criterion choice (which L7 surfaces) becomes a regulatory choice in L9.