
Summary: GANs, the minimax game

Phase 2’s second paradigm. The previous lesson kept a likelihood objective (the ELBO); this one throws likelihood away entirely and trains by a two-network game. The whole lesson reduces to three claims. A GAN is two networks in a minimax game. At the optimal discriminator, the generator implicitly minimizes the Jensen-Shannon divergence between the data distribution and the generator’s distribution. Mode collapse and training instability are paradigm-level features of the minimax dynamics with an always-lagging discriminator, not bugs to tune away, and they do not come from the divergence choice itself: at the optimal discriminator, the JS minimum is p_G = p_data, which is full coverage. This is the scan-it-in-five-minutes version.

  • A GAN has two neural networks: a generator G that maps z ~ p(z) (Gaussian) to a sample x = G(z), and a discriminator D that classifies a sample as real (output near 1) or fake (output near 0). No encoder, no decoder distribution, no likelihood is ever computed.
  • The minimax objective is min_G max_D V(D, G) = E_{p_data}[log D(x)] + E_{p_z}[log(1 − D(G(z)))]. D maximizes (real → 1, fake → 0); G minimizes (try to fool D). Training alternates updates.
  • The optimal discriminator at fixed G is D*(x) = p_data(x) / (p_data(x) + p_G(x)), derived by pointwise maximization. At equilibrium p_G = p_data, so D*(x) = 1/2 everywhere (the discriminator cannot distinguish).
  • Substituting D = D* back into the objective gives V(D*, G) = −log 4 + 2 · JS(p_data || p_G), where JS is the Jensen-Shannon divergence JS(p || q) = 0.5·KL(p || m) + 0.5·KL(q || m) with m = (p + q)/2. Because −log 4 is a constant, minimizing V(D*, G) over G is the same as minimizing JS. So at the optimal discriminator GANs minimize JS, not the forward KL of Phase 1. JS is symmetric and bounded between 0 and log 2 (in nats).
  • Worked anchor: p_data = [0.5, 0.5], p_G = [0.7, 0.3] (skewed generator) → D*(0) = 0.5/1.2 ≈ 0.4167 (leans fake, because the generator over-produces outcome 0), D*(1) = 0.5/0.8 = 0.625 (leans real, because the generator under-produces outcome 1); JS ≈ 0.021 nats. At equilibrium (p_G = p_data), D* = 1/2 and JS = 0.
  • Paradigm-level failure modes: vanishing gradients (log(1 − D(G(z))) saturates when D confidently rejects fakes, i.e. D(G(z)) ≈ 0, so G receives almost no gradient; fixed by the non-saturating loss, where G maximizes log D(G(z)), the common practical loss); mode collapse (the generator finds a narrow output set that fools the current imperfect D; this arises from the minimax best-response dynamics against a lagging discriminator, not from the Jensen-Shannon objective itself); no clean stopping criterion (losses oscillate; use FID or inspect samples, not training loss). The next lesson, GAN training in practice, introduces Wasserstein-GAN with gradient penalty to address some of these.
  • What GANs are good at: inference speed (one forward pass), sharp samples in specific domains (StyleGAN-family for faces), latent-space editing. Not good at: density evaluation (no likelihood), stable training out of the box, broad-coverage generation (diffusion currently leads).
  • Cross-paradigm position: the L3 cheatsheet’s “GAN” row says “No likelihood” precisely because the GAN paradigm minimizes a different divergence (JS, not forward KL) and never evaluates p_model(x). The broader lesson: divergence choice is a paradigm-design parameter, and recognizing the divergence each method optimizes is the right organizing question.
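
The worked anchor and the saturation claim in the bullets above can be checked numerically. What follows is a minimal sketch, not part of the original lesson: it assumes discrete distributions given as plain Python lists, and every function name (optimal_discriminator, kl, js, value, saturating_grad, non_saturating_grad) is a hypothetical helper introduced for illustration. The gradient helpers assume the discriminator output is a sigmoid of a logit a, which is a common setup but an assumption here.

```python
import math

def optimal_discriminator(p_data, p_g):
    """Pointwise D*(x) = p_data(x) / (p_data(x) + p_G(x)) on a discrete support."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    """KL(p || q) in nats; outcomes with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(p_data, p_g, d):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_G}[log(1 - D(x))].

    The second term rewrites E_{p_z}[log(1 - D(G(z)))] as an expectation
    over the generator's output distribution p_G.
    """
    return sum(pd * math.log(di) + pg * math.log(1.0 - di)
               for pd, pg, di in zip(p_data, p_g, d))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)): the original generator loss, equals -D."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)): the non-saturating generator loss, equals -(1 - D)."""
    return -(1.0 - sigmoid(a))

# Worked anchor from the summary: balanced data, skewed generator.
p_data = [0.5, 0.5]
p_g = [0.7, 0.3]

d_star = optimal_discriminator(p_data, p_g)
print("D*          :", [round(d, 4) for d in d_star])
print("JS (nats)   :", round(js(p_data, p_g), 4))

# Identity check: V(D*, G) should equal -log 4 + 2 * JS(p_data || p_G).
print("V(D*, G)    :", value(p_data, p_g, d_star))
print("-log4 + 2JS :", -math.log(4) + 2 * js(p_data, p_g))

# Saturation: a discriminator logit of -6 means D(G(z)) is close to 0,
# i.e. D confidently rejects the fake sample.
a = -6.0
print("saturating gradient     :", saturating_grad(a))
print("non-saturating gradient :", non_saturating_grad(a))
```

The two V lines should agree to floating-point precision, which is the JS reduction from the bullets in numeric form. At a strongly negative logit the saturating gradient is close to zero while the non-saturating gradient is close to −1 in magnitude, which is why the non-saturating loss keeps the generator learning when the discriminator is ahead.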

GANs are the paradigm where the modern deepfake category originated. Four distinct policy and governance questions sit outside this mechanical lesson and belong in legal, governance, and ethics venues: when generating synthetic faces, voices, or video of identifiable people is appropriate; attribution and watermarking of synthesized content; sector-specific policies for journalism, politics, and legal evidence; and IP claims around training-data scraping. Treat the math and the policy as separate concerns.

Before this lesson, “GAN” was probably a label with a vague “two networks compete” intuition behind it. Now you have the math: minimax objective, optimal discriminator, JS-divergence reduction, and the precise paradigm-level reasons mode collapse and training instability happen. When you next read about a new GAN variant, you can read the changes as divergence-choice changes (JS to Wasserstein), regularization tricks (gradient penalty, spectral normalization), or architectural shifts (style-based generator) and place them in the right slot of the framework. The next lesson keeps the same minimax framework and swaps the JS divergence for the Wasserstein distance, giving Wasserstein-GAN, the production-grade variant that addresses many of the training pathologies you just saw.