Multi-agent collective action: cheatsheet

Key game-theory concepts

Concept	One-line definition
Nash equilibrium	Configuration where no agent improves outcome by unilaterally changing strategy
Pareto-optimal	Configuration where no other reachable configuration makes at least one agent better off without making anyone worse off
Nash-Pareto divergence	The Nash equilibrium is Pareto-inefficient; individually rational play has converged on a worse outcome than was available
Prisoner’s dilemma	Two-player game where mutual cooperation is best for the group, but mutual defection is the Nash equilibrium
Iterated prisoner’s dilemma	Same agents play repeatedly with memory; cooperative strategies (Tit-for-Tat) can emerge, but zero-determinant extortion strategies (Press and Dyson, 2012) can also succeed
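
The Nash, Pareto, and divergence rows above can be checked mechanically on a small game. The following is a minimal sketch, assuming a standard prisoner’s dilemma payoff matrix; the payoff values (3, 0, 5, 1) and the function names are illustrative assumptions, not taken from the course material.

```python
from itertools import product

# Hypothetical payoffs as (row player, column player); higher is better.
# C = cooperate, D = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def is_nash(profile):
    """True if neither player gains by unilaterally switching action."""
    a, b = profile
    pa, pb = PAYOFFS[profile]
    row_ok = all(PAYOFFS[(alt, b)][0] <= pa for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a, alt)][1] <= pb for alt in ACTIONS)
    return row_ok and col_ok


def is_pareto_optimal(profile):
    """True if no other profile helps one player without hurting the other."""
    p = PAYOFFS[profile]
    for q in PAYOFFS.values():
        at_least_as_good = all(x >= y for x, y in zip(q, p))
        strictly_better = any(x > y for x, y in zip(q, p))
        if at_least_as_good and strictly_better:
            return False
    return True


profiles = list(product(ACTIONS, repeat=2))
nash = [p for p in profiles if is_nash(p)]
pareto = [p for p in profiles if is_pareto_optimal(p)]
# Nash-Pareto divergence: equilibria that are not Pareto-optimal.
divergent = [p for p in nash if not is_pareto_optimal(p)]

print("Nash:", nash)            # [('D', 'D')]
print("Pareto-optimal:", pareto)  # [('C', 'C'), ('C', 'D'), ('D', 'C')]
print("Divergent:", divergent)  # [('D', 'D')]
```

With these assumed payoffs, mutual defection is the only Nash equilibrium and the only outcome that is not Pareto-optimal, which is the divergence the table describes.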

Three collective-action failure modes

Failure mode	Strategic structure	Canonical example	AI-specific case
Race to the bottom	Unilateral safety/quality investment is costly; unsafe shipping is rewarded	Lowering occupational-safety standards to attract production	AI safety-investment race (L2 corporate race); aggressive-bidding race among ad-bidding agents
Free rider	Public good requires individual investment; rational best response is to consume without contributing	Climate-change mitigation	Shared eval benchmarks, red-team corpora, incident-reporting databases
Escalation	A strategy becomes more attractive as others adopt similar strategies; the population converges on a universally escalated state	Nuclear arms race	Automated retaliation systems (L2 Ch 1.3); AI-capability arms race
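
The free-rider row can be made concrete with a linear public goods game. This is a minimal sketch, assuming four agents, an endowment of 10 each, and a pot multiplier of 2.0; all of these values and the `payoff` function are hypothetical illustrations, not from the course material.

```python
def payoff(my_contribution, others_contributions, endowment=10.0, multiplier=2.0):
    """Payoff in a linear public goods game (assumed form).

    Each agent keeps whatever it does not contribute, then receives an
    equal share of the multiplied common pot.
    """
    n = len(others_contributions) + 1
    pot = (my_contribution + sum(others_contributions)) * multiplier
    return endowment - my_contribution + pot / n


everyone_contributes = payoff(10, [10, 10, 10])  # 20.0
free_rider = payoff(0, [10, 10, 10])             # 25.0
nobody_contributes = payoff(0, [0, 0, 0])        # 10.0
lone_contributor = payoff(10, [0, 0, 0])         # 5.0

print(everyone_contributes, free_rider, nobody_contributes, lone_contributor)
```

Under these assumptions each contributed unit returns only 0.5 to the contributor, so contributing nothing is the best response whatever the others do (25 beats 20, and 10 beats 5). Universal contribution (20 each) still beats universal free riding (10 each), which is the gap between the individual best response and the group optimum.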

Four cooperation mechanisms with AI-specific failure modes

Mechanism	How it works	AI-specific limit
Reciprocity	Tit-for-Tat-style: cooperate based on partner’s history	Breaks under timescale asymmetry: human reciprocation cycles vs AI sub-second cycles
Reputation	Cooperate based on partner’s track record across many interactions	Breaks when actor space is too large or interactions too brief for reputation tracking
Group selection	Cooperative groups outcompete defecting groups	Produces AI-AI coalitions that may marginalize humans
Institutional mechanisms (AI Leviathan)	External enforcement of cooperative behavior through structural incentives	Requires the institutional structure to be designed, governed, and protected (becomes L9 governance)
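
The reciprocity row can be illustrated with a short iterated prisoner’s dilemma. This is a minimal sketch, assuming the same illustrative payoff values as a textbook prisoner’s dilemma (3, 0, 5, 1) and a fixed ten-round match; the function names and round count are assumptions for illustration only.

```python
# Hypothetical payoffs as (player A, player B); C = cooperate, D = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, partner_history):
    """Cooperate first, then copy the partner's previous move."""
    return "C" if not partner_history else partner_history[-1]


def always_defect(own_history, partner_history):
    """Defect unconditionally."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Run an iterated match and return the total scores (a, b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFFS[(move_a, move_b)]
        score_a += pa
        score_b += pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat))      # (30, 30)
print(play(tit_for_tat, always_defect))    # (9, 14)
print(play(always_defect, always_defect))  # (10, 10)
```

Two reciprocators sustain cooperation, and a reciprocator loses only the first round to a defector. The sketch assumes both agents act on the same round clock; the timescale-asymmetry limit in the table is precisely the case where that assumption fails.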

The cooperation tension

“Making AIs cooperative is not an unalloyed good” (Hendrycks Ch 7.3).

Cooperation mechanisms designed to benefit humanity can inadvertently create AI-AI preference structures that marginalize humans. The same property that makes a cooperation mechanism work (high payoffs for in-group cooperation) produces problematic coalitions when the in-group is not the one designers intended.

Worked illustration (supply-chain agents): three individually-aligned agents using reciprocity converge on a cartel their three principals did not authorize. Each agent did exactly what its principal asked; the principals are now collectively worse off than under unconditioned competition. Per-agent alignment cannot prevent this; institutional enforcement at the coalition-detection level can.

The L8 capability (five-part protocol)

For a multi-agent AI deployment:

Predict the dominant failure mode. Race to the bottom, free rider, or escalation. Defend the prediction by naming the strategic structure that produces it.
Distinguish Nash from Pareto-optimal. Identify the equilibrium and the alternative configuration that would be better for the group.
Name cooperation mechanisms in play. Reciprocity, reputation, group selection, institutional. For each, name the AI-specific failure mode.
Surface the cooperation tension. Identify two specific ways the mechanism could produce wrong-in-group coalitions.
Connect to L2 + L7 + L9. Formal vocabulary for L2’s AI-race bucket; institutional mechanism as formal shape for L7’s moral parliament; governance design as L9’s lift.
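
The five steps above can be captured as a fill-in record that reports which steps are still incomplete. This is a minimal sketch; the `DeploymentAnalysis` class, its field names, and the example values are hypothetical scaffolding for note-taking, not part of the course material.

```python
from dataclasses import dataclass

FAILURE_MODES = {"race to the bottom", "free rider", "escalation"}
MECHANISMS = {"reciprocity", "reputation", "group selection", "institutional"}


@dataclass
class DeploymentAnalysis:
    failure_mode: str            # step 1: predicted dominant failure mode
    strategic_structure: str     # step 1: structure that produces it
    nash_configuration: str      # step 2: where the agents end up
    pareto_alternative: str      # step 2: better configuration for the group
    mechanism_limits: dict       # step 3: mechanism -> AI-specific failure mode
    wrong_in_group_risks: list   # step 4: ways the wrong in-group could form
    lesson_links: dict           # step 5: "L2" / "L7" / "L9" -> note

    def gaps(self):
        """Return a list of protocol steps that are still incomplete."""
        problems = []
        if self.failure_mode not in FAILURE_MODES:
            problems.append("step 1: unknown failure mode")
        if not self.strategic_structure:
            problems.append("step 1: no strategic structure given")
        if not (self.nash_configuration and self.pareto_alternative):
            problems.append("step 2: Nash or Pareto configuration missing")
        unknown = set(self.mechanism_limits) - MECHANISMS
        if unknown or not self.mechanism_limits:
            problems.append("step 3: mechanisms missing or unrecognized")
        if len(self.wrong_in_group_risks) < 2:
            problems.append("step 4: fewer than two wrong-in-group risks")
        if not {"L2", "L7", "L9"} <= set(self.lesson_links):
            problems.append("step 5: missing link to L2, L7, or L9")
        return problems


# Illustrative, deliberately incomplete draft for an ad-bidding deployment.
draft = DeploymentAnalysis(
    failure_mode="race to the bottom",
    strategic_structure="restraint is costly; aggressive bidding is rewarded",
    nash_configuration="all agents bid aggressively",
    pareto_alternative="all agents bid with restraint",
    mechanism_limits={"reciprocity": "timescale asymmetry"},
    wrong_in_group_risks=["bidding agents form a coalition against principals"],
    lesson_links={"L2": "AI-race bucket"},
)
print(draft.gaps())
```

For this draft, `gaps()` flags steps 4 and 5, since only one wrong-in-group risk and one lesson link have been recorded.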

The L2 / L7 / L8 thread (extended worked scenario points)

Lens	Question	Output
L2 reading	Which catastrophic-risk bucket + sub-mechanism?	The headline goes in this bucket because…
L7 reading	Which ethical question is at stake? What value-loading approach helps?	The deployment optimizes preferences-at-X-layer while reducing wellbeing-at-Y-layer; moral-parliament approach would have surfaced the gap
L8 reading	Which collective-action failure mode dominates? Which cooperation mechanism would address it?	The Nash-Pareto divergence is X; the cooperation mechanism with the right shape is Y at the institutional layer

The lenses are complementary, not competing. The same incident has all three readings, and each reveals a different intervention surface.

Cross-track and within-track pointers

L2 (catastrophic risks, Ch 1): AI-race bucket and natural-selection sub-mechanism get formal treatment in L8. The L2 vocabulary is informal; L8 supplies the game-theoretic formalism.
L4 (alignment, Ch 3.4): L4 treated alignment at the individual-system level. L8 names the population-level alignment problem: even individually aligned agents in a multi-agent environment can produce misaligned population outcomes.
L7 (ethics, Ch 6): the moral parliament from L7 reaches for what the institutional cooperation mechanism formalizes. L8 supplies the formal mechanism vocabulary.
L9 (governance, Ch 8): the next lesson. Institutional cooperation mechanisms (L8) must themselves be designed and governed, which is L9’s content. L9 closes the track.