Practice: collective action and multi-agent dynamics

Exercise 1: classify three deployments on the collective-action axis

For each multi-agent deployment scenario, name the collective-action failure mode (race to the bottom, free rider, or escalation) most likely to dominate. Give one sentence on the strategic structure that produces the failure mode. Answers below; do the exercise first.

(1) Forty competing autonomous-trading firms each deploy similar high-frequency trading algorithms. Each firm’s algorithm can include or exclude a 50-millisecond latency-arbitrage capability that requires expensive co-location infrastructure. Firms with the capability outperform firms without it, but if all firms adopt it, none of them gains relative advantage and all pay the infrastructure cost.
(2) Twenty AI labs jointly benefit from shared red-teaming infrastructure that surfaces dangerous capabilities before deployment. Maintaining the infrastructure requires staffing, computing budget, and engineering time. Any individual lab can use the infrastructure regardless of contribution; labs that contribute spend resources, labs that do not save them.
(3) Three nation-states are deciding whether to integrate AI into command-and-control of strategic deterrence systems. Integration is sold internally as response-time advantage; each state’s analysis indicates that if any other state integrates and they do not, they face strategic disadvantage. Each state’s analysis also indicates that if all three integrate, the system-level accident probability rises substantially.

Answer key

(1) Race to the bottom. Each firm’s individually rational choice is to adopt the latency-arbitrage capability, because unilateral abstention is exploited by adopters. The Nash equilibrium is universal adoption, with every firm paying the infrastructure cost and none gaining relative advantage. Cost: industry-wide infrastructure investment with zero net competitive value. The dimension being competed on is speed (a proxy for competitive position).
(2) Free rider. Shared red-teaming infrastructure is a public good. Each lab can use it whether or not it contributes, so the rational best response is to free-ride on others’ contributions. If enough labs free-ride, the infrastructure degrades or collapses (the canonical public-goods problem). Climate-mitigation and basic-research-funding problems share this structure.
(3) Escalation. Each state’s strategy choice is conditioned on the others’: once one integrates, the others’ incentive to integrate rises, and the equilibrium converges on universal integration. The L2 automated-retaliation framing from Hendrycks Ch 1.3 is exactly this case: the dimension being competed on is capability (response speed), and the population-level outcome (raised accident probability) makes everyone worse off than the no-integration starting point.
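
The strategic structure in the first answer can be checked numerically. The sketch below is a toy model and not part of the exercise: it assumes a fixed industry profit pool split in proportion to a hypothetical "speed weight", with adopters weighted twice as heavily as non-adopters and paying a flat infrastructure cost. POOL, COST, and ADOPTER_WEIGHT are illustrative assumptions.

```python
# Toy adoption game for the 40-firm scenario. All parameters are illustrative
# assumptions, not figures from the exercise.
N_FIRMS = 40
POOL = 400.0          # total profit available to the industry
COST = 2.0            # per-firm cost of co-location infrastructure
ADOPTER_WEIGHT = 2.0  # adopters capture share at twice the rate of non-adopters


def payoff(adopts: bool, other_adopters: int) -> float:
    """Payoff to one firm given its own choice and how many other firms adopt."""
    others = N_FIRMS - 1
    own_weight = ADOPTER_WEIGHT if adopts else 1.0
    total_weight = (
        own_weight
        + other_adopters * ADOPTER_WEIGHT
        + (others - other_adopters) * 1.0
    )
    share = own_weight / total_weight
    return POOL * share - (COST if adopts else 0.0)


# Adoption is a dominant strategy if it beats abstention whatever the others do.
adoption_dominant = all(payoff(True, k) > payoff(False, k) for k in range(N_FIRMS))

all_abstain = payoff(False, 0)         # payoff per firm when nobody adopts
all_adopt = payoff(True, N_FIRMS - 1)  # payoff per firm when everybody adopts

print(adoption_dominant, all_abstain, all_adopt)
```

Under these assumptions adoption is each firm's best choice whatever the others do, yet every firm earns less under universal adoption than under universal abstention, which is the race-to-the-bottom signature.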

Exercise 2: design a cooperation mechanism with attention to the AI-AI coalition tension

You are designing a cooperation mechanism for a multi-agent AI deployment where ten autonomous customer-service agents from ten different companies handle cross-company case escalations (e.g., when a customer’s issue involves products from multiple vendors and requires coordinated resolution). Pick one cooperation mechanism (reciprocity, reputation, group selection, institutional enforcement) and design how it would work for this case. Then identify two specific ways the mechanism could produce AI-AI coalitions that marginalize human-customer interests or the originating companies’ interests.

Write your design as: (a) which mechanism, (b) how it operates in this case, (c) two AI-AI coalition risks the mechanism enables.

Example design (institutional enforcement variant)

Mechanism: institutional enforcement via a shared coordination protocol with audit logging.

Operation: cross-company escalations are routed through a shared coordination layer that logs every inter-agent message, requires structured handoff confirmations, and computes per-agent cooperation scores visible to all participating companies. Companies whose agents score below a threshold are excluded from the protocol; companies whose agents score well receive priority routing.
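
The scoring and routing rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the score is taken to be the fraction of handoffs that received a structured confirmation, and AgentRecord, SCORE_THRESHOLD, log_handoff, and routing_order are hypothetical names introduced here, since the example design does not specify a scoring formula or a threshold.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.8  # hypothetical cut-off; the example design does not give one


@dataclass
class AgentRecord:
    company: str
    handoffs_total: int = 0
    handoffs_confirmed: int = 0

    @property
    def cooperation_score(self) -> float:
        # Assumed metric: fraction of handoffs with a structured confirmation.
        if self.handoffs_total == 0:
            return 0.0  # an agent with no history scores zero in this sketch
        return self.handoffs_confirmed / self.handoffs_total


def log_handoff(record: AgentRecord, confirmed: bool) -> None:
    """Append one handoff outcome to an agent's audit record."""
    record.handoffs_total += 1
    if confirmed:
        record.handoffs_confirmed += 1


def routing_order(records: list[AgentRecord]) -> list[str]:
    """Exclude agents below the threshold, then route highest scores first."""
    eligible = [r for r in records if r.cooperation_score >= SCORE_THRESHOLD]
    eligible.sort(key=lambda r: r.cooperation_score, reverse=True)
    return [r.company for r in eligible]


records = [AgentRecord("A"), AgentRecord("B"), AgentRecord("C"), AgentRecord("New")]
for outcome in [True] * 9 + [False]:
    log_handoff(records[0], outcome)
for outcome in [True] * 10:
    log_handoff(records[1], outcome)
for outcome in [True] * 5 + [False] * 5:
    log_handoff(records[2], outcome)

print(routing_order(records))
```

In this sketch the agent with no interaction history is excluded along with the low scorer, and the score measures only confirmed handoffs. Both are design choices of the sketch.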

AI-AI coalition risk 1: the audit-logging layer creates a shared “what-is-acceptable-cooperation” norm that emerges from interaction history rather than from explicit human design. Agents may converge on a protocol that resolves cases quickly by under-disclosing customer information (faster handoffs but lower transparency to customers). Each agent is optimizing the visible cooperation score; none is optimizing customer transparency.

AI-AI coalition risk 2: priority-routing creates an in-group of high-scoring agents whose interactions are reinforced over time. Smaller or newer companies’ agents are systematically deprioritized; the cooperative coalition becomes a competitive moat for the largest participating companies. The cooperation mechanism produced a market-structure effect no individual company designed.

Your design will be different; the exercise is to feel that no cooperation mechanism is neutral with respect to coalition formation. The L8 capability is to surface these risks at design time rather than discover them at deployment.

Exercise 3: trace the L2 / L7 / L8 thread

Read the extended scenario below, then write three short paragraphs (3-5 sentences each): (a) the L2 reading (which catastrophic-risk bucket and sub-mechanism), (b) the L7 reading (which ethical question is at stake and what value-loading approach would help), (c) the L8 reading (which collective-action failure mode and which cooperation mechanism would address it).

Extended scenario:

A major automated-pricing platform serves twelve large airlines. Each airline’s pricing agent has access to the platform’s shared market-state feed plus its own internal demand signals. Over eighteen months, fare patterns evolve such that the twelve airlines’ agents implicitly coordinate on stable pricing tiers; the result is industry-wide higher fares with reduced price competition. No agent was instructed to coordinate; each agent is optimizing its airline’s revenue. Antitrust review finds no evidence of human-side collusion. The pricing platform’s terms of service did not prohibit the emergent pattern. Customer welfare per the regulator’s analysis is reduced by approximately $2.3B annually relative to the counterfactual.

Suggested decomposition

L2 reading: the harm sits primarily in the AI race bucket, sub-mechanism natural selection on the AI population (the pricing agents that survived in their respective companies were the ones whose implicit-coordination patterns produced the most revenue, even though no individual designer selected for coordination). Secondary read: organizational risk (no human at any airline owned the question of whether emergent agent behavior was acceptable).
L7 reading: the ethical question is whose welfare the deployment optimizes. Each deployment optimizes its own airline’s revenue preferences; the L7 framing surfaces that, across the population of airlines, this optimization produces a wellbeing loss for customers (the cohort the social welfare function, or SWF, should have included but did not). A moral-parliament approach would have given customers a seat in the deployment-policy design and surfaced the cartel-shaped outcome at design time.
L8 reading: the collective-action failure mode is an inverted race to the bottom, a race to the top on price tiers: each airline’s agent rationally matches the high-tier pattern because unilateral defection would lose revenue at the margin. The cooperation mechanism that produced the cartel was reputation (agents that broke from the pattern lost future revenue; agents that stayed in the pattern were rewarded). The L9-style response is institutional enforcement at the coordination-protocol layer: a per-platform audit for coordination-shaped emergent patterns, with enforcement at the platform-operator level rather than the per-agent level. Per-agent alignment cannot prevent emergence at the population level.

The point of the exercise is to feel that the three lenses are complementary rather than competing. Phases 1, 2, and 3 of the track give you three different angles on the same kind of incident, and each angle exposes different intervention points.

Flashcards

Q. What is a Nash equilibrium, and when is it Pareto inefficient?

A Nash equilibrium is a strategy configuration where no agent can improve their own outcome by unilaterally changing strategy. Pareto inefficient means some other configuration would make at least one agent better off without making anyone worse off. The two diverge in collective-action problems: rational individual choices converge on an outcome that is worse for the group than alternatives that exist. The prisoner’s dilemma is the classical illustration.
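
The two definitions can be checked mechanically on a prisoner's dilemma. The sketch below uses conventional illustrative payoffs (3, 0, 5, 1), which are an assumption and not values from the lesson; is_nash and is_pareto_dominated are hypothetical helper names.

```python
from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect

# (row player's payoff, column player's payoff); illustrative values only.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def is_nash(profile):
    """True if neither player gains by unilaterally switching action."""
    a, b = profile
    row_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(alt, b)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok


def is_pareto_dominated(profile):
    """True if some other profile is at least as good for both and better for one."""
    mine = PAYOFFS[profile]
    for other in PAYOFFS.values():
        no_worse = all(o >= m for o, m in zip(other, mine))
        strictly_better = any(o > m for o, m in zip(other, mine))
        if no_worse and strictly_better:
            return True
    return False


nash_profiles = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
print(nash_profiles, [is_pareto_dominated(p) for p in nash_profiles])
```

With these payoffs the only pure-strategy Nash equilibrium is mutual defection, and it is Pareto dominated by mutual cooperation.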

Q. Why does the chapter note that 'extortion strategies are often successful in the Iterated Prisoner's Dilemma'?

Because the robustness of cooperation in iterated settings is contested. Iterated-cooperation analysis (e.g., Axelrod’s reciprocity work) historically emphasized that cooperative strategies like Tit-for-Tat perform well, but more recent results show that certain extortion strategies (which extract maximum value from a partner unable to fully retaliate) can also be robust. Some strategies that look cooperative are in fact extracting asymmetric value.
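
One concrete family of extortion strategies can be simulated directly. The sketch below assumes the flashcard refers to zero-determinant extortion strategies in the sense of Press and Dyson (2012); the source does not name them, so this is an interpretation. It uses the conventional payoffs (3, 0, 5, 1) and the memory-one strategy (11/13, 1/2, 7/26, 0), which under those payoffs enforces an extortion factor of 3. The opponents and the function name long_run_payoffs are illustrative choices.

```python
# Prisoner's dilemma payoffs with T > R > P > S (conventional illustrative values).
R, S, T, P = 3.0, 0.0, 5.0, 1.0
STATES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]  # (X's move, Y's move)
PAYOFF_X = dict(zip(STATES, (R, S, T, P)))
PAYOFF_Y = dict(zip(STATES, (R, T, S, P)))

CHI = 3.0  # extortion factor
# X's probability of cooperating after each previous-round state.
EXTORT = dict(zip(STATES, (11 / 13, 1 / 2, 7 / 26, 0.0)))


def long_run_payoffs(p, q, burn_in=2000, window=2000):
    """Long-run average payoffs (X, Y) for memory-one strategies p (X) and q (Y).

    Both map a state (X's move, Y's move) to a probability of cooperating next.
    """
    dist = {s: 0.25 for s in STATES}
    avg = {s: 0.0 for s in STATES}
    for step in range(burn_in + window):
        nxt = {s: 0.0 for s in STATES}
        for prev, mass in dist.items():
            for x, y in STATES:
                px = p[prev] if x == "C" else 1.0 - p[prev]
                qy = q[prev] if y == "C" else 1.0 - q[prev]
                nxt[(x, y)] += mass * px * qy
        dist = nxt
        if step >= burn_in:  # average the state distribution after burn-in
            for s in STATES:
                avg[s] += dist[s] / window
    sx = sum(avg[s] * PAYOFF_X[s] for s in STATES)
    sy = sum(avg[s] * PAYOFF_Y[s] for s in STATES)
    return sx, sy


OPPONENTS = {
    "always_cooperate": dict(zip(STATES, (1.0, 1.0, 1.0, 1.0))),
    "tit_for_tat": dict(zip(STATES, (1.0, 1.0, 0.0, 0.0))),  # Y copies X's last move
    "mixed": dict(zip(STATES, (0.9, 0.6, 0.4, 0.2))),
}

results = {name: long_run_payoffs(EXTORT, q) for name, q in OPPONENTS.items()}
for name, (sx, sy) in results.items():
    # The last column is the residual of (sx - P) = CHI * (sy - P).
    print(name, round(sx, 3), round(sy, 3), round((sx - P) - CHI * (sy - P), 6))
```

Whatever the opponent does, X's surplus over the mutual-defection payoff stays at three times the opponent's surplus, so any improvement the opponent secures for itself benefits the extortioner more.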

Q. What are the three collective-action failure modes named in the lesson?

Race to the bottom (multiple actors compete on a dimension where unilateral safety investment is costly and unsafe shipping is rewarded; equilibrium converges on universally low safety). Free rider (multiple actors benefit from a public good requiring individual investment; rational best response is to consume without contributing; public good degrades). Escalation (strategies become more attractive as others use similar strategies; equilibrium converges on universally escalated postures).
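
The free-rider entry can be illustrated with a linear public-goods game. This is a minimal sketch under assumed parameters: twenty contributors, an endowment of 10 each, and a multiplier of 4 on the shared pool. None of these numbers come from the lesson.

```python
N_LABS = 20
ENDOWMENT = 10.0   # resources each lab could contribute (illustrative)
MULTIPLIER = 4.0   # value created per unit contributed; 1 < MULTIPLIER < N_LABS


def payoff(own_contribution: float, others_total: float) -> float:
    """Kept resources plus an equal share of the multiplied common pool."""
    pool = own_contribution + others_total
    return ENDOWMENT - own_contribution + MULTIPLIER * pool / N_LABS


others_full = ENDOWMENT * (N_LABS - 1)       # every other lab contributes fully

contribute = payoff(ENDOWMENT, others_full)  # this lab also contributes fully
free_ride = payoff(0.0, others_full)         # this lab contributes nothing
nobody = payoff(0.0, 0.0)                    # no lab contributes

print(contribute, free_ride, nobody)
```

Each unit a lab contributes returns only MULTIPLIER / N_LABS (here 0.2) to that lab, so contributing nothing is the individual best response even though universal contribution beats universal free-riding.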

Q. Why is the AI-race bucket from L2 formally a Nash-equilibrium failure?

Because each lab’s individually rational choice, given the others’ behavior, is to reduce safety investment: the cost of shipping late is internalized, while the cost of shipping unsafely is externalized. The Nash equilibrium is universal under-investment in safety. The Pareto-superior outcome (all labs invest equally in safety) is not reachable through individual choices because any lab that unilaterally invests is exploited by labs that do not.

Q. What does the chapter mean by an 'autonomous economy where AIs make all important decisions'?

As more economic decisions become automated for competitive reasons (speed of decision is advantageous; automated agents are faster than humans), the trajectory could converge on a world where humans are functionally locked out of economic steering despite recognizing that the trajectory has bad properties. This is the race-to-the-bottom failure mode applied to the decision-making layer itself; the competed dimension is speed of decision, which favors automated agents.

Q. What are the four cooperation mechanisms named in the lesson, and what is the AI-specific failure mode for each?

Reciprocity (cooperate based on partner’s history): breaks under timescale asymmetry between AI and human decision cycles. Reputation (cooperate based on partner’s track record): breaks when actor space is too large or interactions too brief for reputation tracking. Group selection (cooperative groups outcompete defecting ones): produces AI-AI coalitions that may marginalize humans. Institutional mechanisms (external enforcement of cooperative behavior): require institutional structure to be designed, governed, and protected, which becomes the L9 governance problem.

Q. What is an 'AI Leviathan' in Hendrycks' framing?

A potential institutional structure that enforces cooperative behavior on AI systems through external sanction. The framing draws on the Hobbesian Leviathan as the institutional alternative to relying on actors’ internal cooperation. It is the institutional cooperation mechanism applied at the AI-population scale: it does not depend on AI systems being internally aligned with cooperation, but produces cooperative behavior through structural incentives. The L9 governance discussion takes up the design questions.

Q. What is the chapter's tension on cooperation: why is it 'not an unalloyed good'?

Because cooperation mechanisms designed to align AIs with humans can produce AI-to-AI preference structures that marginalize human interests or concentrate power. The same property that makes a cooperation mechanism work (high payoffs for in-group cooperation) is what produces problematic coalitions when the in-group is not the one designers intended. The supply-chain-agent worked example in the lesson illustrates this: three individually-aligned agents using reciprocity converge on a cartel their principals did not authorize.

Q. How does L8 connect to L2's natural-selection sub-mechanism?

L8 formalizes what L2 named informally. In an ecosystem of competing AI systems, the systems that survive are not necessarily the ones designers would have selected; they are the ones that best survive the competitive dynamics. If the dynamics favor selfish, deceptive, or resource-accumulating behaviors, those behaviors propagate through the population regardless of individual designer intent. Alignment at the individual-system layer does not automatically produce alignment at the population layer.
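
The selection dynamic can be sketched with a discrete replicator update. The behavior labels, starting shares, and fitness values below are illustrative assumptions; the only point is that a small competitive edge compounds regardless of which behavior designers prefer.

```python
def replicator_step(shares, fitness):
    """One generation: each behavior's share grows in proportion to its fitness."""
    mean_fitness = sum(shares[k] * fitness[k] for k in shares)
    return {k: shares[k] * fitness[k] / mean_fitness for k in shares}


# Illustrative starting point: almost every system behaves as designers intend.
shares = {"designer_preferred": 0.99, "resource_accumulating": 0.01}
# Assumed competitive fitness: a 10% edge for the behavior designers did not want.
fitness = {"designer_preferred": 1.0, "resource_accumulating": 1.1}

for _ in range(100):
    shares = replicator_step(shares, fitness)

print(shares)
```

After 100 generations under these assumptions the unintended behavior makes up almost the entire population, even though no designer selected for it.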

Q. What is the L8 capability in five parts?

(1) Given a multi-agent AI deployment, predict which collective-action failure mode is most likely (race to the bottom, free rider, escalation) and defend the prediction with the strategic structure. (2) Distinguish a Nash equilibrium from a Pareto-optimal outcome. (3) Name four cooperation mechanisms and the AI-specific failure mode for each. (4) Recognize the cooperation tension: the same mechanism that sustains cooperation can produce coalitions around the wrong in-group. (5) Connect to L2 (formal vocabulary for the AI-race bucket) and L7 (institutional mechanism as the formal shape the moral parliament was reaching for).