Summary: GANs, the minimax game
Phase 2’s second paradigm. The previous lesson kept a likelihood objective (the ELBO); this one throws likelihood away entirely and trains by a two-network game. The whole lesson reduces to three claims. A GAN is two networks in a minimax game. At the optimal discriminator, the generator implicitly minimizes the Jensen-Shannon divergence between the data distribution and the generator’s distribution. Mode collapse and training instability are paradigm-level features of the minimax dynamics with an always-lagging discriminator: they are not bugs to tune away, and they do not come from the divergence choice itself, whose minimum at the optimal discriminator is full coverage. This is the scan-it-in-five-minutes version.
Core ideas
- A GAN has two neural networks: a generator `G` that maps `z ~ p(z)` (Gaussian) to a sample `x = G(z)`, and a discriminator `D` that classifies a sample as real (output near 1) or fake (output near 0). No encoder, no decoder distribution, no likelihood is ever computed.
- The minimax objective is `min_G max_D V(D, G) = E_{p_data}[log D(x)] + E_{p_z}[log(1 − D(G(z)))]`. `D` maximizes (real → 1, fake → 0); `G` minimizes (try to fool `D`). Training alternates updates.
- The optimal discriminator at fixed `G` is `D*(x) = p_data(x) / (p_data(x) + p_G(x))`, derived by pointwise maximization. At equilibrium `p_G = p_data`, so `D*(x) = 1/2` everywhere (the discriminator cannot distinguish).
- Substituting `D = D*` back into the objective gives `V(D*, G) = -log 4 + 2 · JS(p_data || p_G)`, where `JS` is the Jensen-Shannon divergence `JS(p || q) = 0.5·KL(p || m) + 0.5·KL(q || m)` with `m = (p + q)/2`. So GANs minimize JS, not the forward KL of Phase 1. JS is symmetric and bounded by `log 2`.
- Worked anchor: `p_data = [0.5, 0.5]`, `p_G = [0.7, 0.3]` (skewed generator) → `D*(0) ≈ 0.4167` (skeptical, generator over-produces), `D*(1) = 0.625` (confident real); JS ≈ 0.021 nats. At equilibrium (`p_G = p_data`), `D* = 1/2`, JS = 0.
- Paradigm-level failure modes: vanishing gradients (`log(1 − D(G(z)))` saturates when `D` is good; fixed by the non-saturating loss, where `G` maximizes `log D(G(z))`, the common practical loss); mode collapse (generator finds a narrow output set that fools the current imperfect `D`; arises from the minimax best-response dynamics against a lagging discriminator, not from the Jensen-Shannon objective itself); no clean stopping criterion (losses oscillate; use FID or inspect samples, not training loss). The next lesson, GAN training in practice, introduces Wasserstein-GAN with gradient penalty to address some of these.
- What GANs are good at: inference speed (one forward pass), sharp samples in specific domains (StyleGAN-family for faces), latent-space editing. Not good at: density evaluation (no likelihood), stable training out of the box, broad-coverage generation (diffusion currently leads).
- Cross-paradigm position: the L3 cheatsheet’s “GAN” row says “No likelihood” precisely because the GAN paradigm minimizes a different divergence (JS, not forward KL) and never evaluates `p_model(x)`. The broader lesson: divergence choice is a paradigm-design parameter, and recognizing the divergence each method optimizes is the right organizing question.
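
The worked anchor and the vanishing-gradient claim above can be checked numerically. The following is a minimal sketch in plain Python, not code from the lesson: the function names are illustrative, the distributions are the two-outcome ones from the worked anchor, and the gradient comparison assumes the discriminator is parameterized as `D = sigmoid(a)` for a logit `a` (a common choice, but an assumption here).

```python
import math

def optimal_discriminator(p_data, p_g):
    """Pointwise D*(x) = p_data(x) / (p_data(x) + p_G(x)) on a discrete support."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    """KL(p || q) in nats; outcomes with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(d, p_data, p_g):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_G}[log(1 - D(x))].

    The second expectation is the E_{p_z}[log(1 - D(G(z)))] term written
    over samples x = G(z) instead of over z.
    """
    return sum(pd * math.log(dx) + pg * math.log(1 - dx)
               for dx, pd, pg in zip(d, p_data, p_g))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the original generator loss on a fake sample."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))

# Worked anchor from the lesson: balanced data, skewed generator.
p_data, p_g = [0.5, 0.5], [0.7, 0.3]
d_star = optimal_discriminator(p_data, p_g)
js_val = js(p_data, p_g)
v_star = value(d_star, p_data, p_g)

print("D*            :", [round(d, 4) for d in d_star])
print("JS (nats)     :", round(js_val, 4))
print("V(D*, G)      :", round(v_star, 6))
print("-log 4 + 2 JS :", round(-math.log(4) + 2 * js_val, 6))

# A confident discriminator assigns fakes a very negative logit.
a = -5.0
print("saturating grad    :", round(grad_saturating(a), 4))
print("non-saturating grad:", round(grad_non_saturating(a), 4))
```

The last two lines illustrate the saturation argument under the logit assumption: when `D(G(z))` is near 0, the gradient of `log(1 − D(G(z)))` with respect to the logit is near 0, while the gradient of the non-saturating loss stays near −1 in magnitude, so the generator still receives a usable signal.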
A note on what this lesson does NOT cover
GANs are the paradigm where the modern deepfake category originated. Four distinct policy and governance questions sit outside this mechanical lesson and belong in legal, governance, and ethics venues: when generating synthetic faces, voices, or video of identifiable people is appropriate; attribution and watermarking of synthesized content; sector-specific policies for journalism, politics, and legal evidence; and IP claims around training-data scraping. Treat the math and the policy as separate concerns.
What changes for you
Before this lesson, “GAN” was probably a label with a vague “two networks compete” intuition behind it. Now you have the math: minimax objective, optimal discriminator, JS-divergence reduction, and the precise paradigm-level reasons mode collapse and training instability happen. When you next read about a new GAN variant, you can read the changes as divergence-choice changes (JS to Wasserstein), regularization tricks (gradient penalty, spectral normalization), or architectural shifts (style-based generator) and place them in the right slot of the framework. The next lesson keeps the same minimax framework and swaps the JS divergence for the Wasserstein distance, the basis of the production-grade GAN variant that addresses many of the training pathologies you just saw.