Cheatsheet: GANs, the minimax game
The setup (two networks)
| Network | Input | Output | Role |
|---|---|---|---|
| Generator G | latent z ~ p(z) (Gaussian) | sample x = G(z) | Sample directly; no encoder, no decoder distribution |
| Discriminator D | sample x | scalar in [0, 1] | Binary classifier: probability x is real |
No likelihood is computed anywhere. The training game replaces the likelihood objective.
The minimax objective
min over G max over D V(D, G) = E_{x ~ p_data}[ log D(x) ] + E_{z ~ p(z)}[ log(1 - D(G(z))) ]

D maximizes V (real → 1, fake → 0); G minimizes V (it tries to push D(G(z)) → 1). Training alternates between D updates and G updates.
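As a rough illustration of how the two expectations are estimated in practice, the sketch below computes a Monte Carlo estimate of V(D, G) from discriminator outputs on a batch of real samples and a batch of generated samples. The function names (`value_estimate`, `d_loss`, `g_loss_minimax`) and the hand-picked output values are assumptions for illustration only, not part of any library.

```python
import math

def value_estimate(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: D(x) on a batch of real samples.
    d_fake: D(G(z)) on a batch of generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def d_loss(d_real, d_fake):
    # D maximizes V, so as a loss to minimize it is -V.
    return -value_estimate(d_real, d_fake)

def g_loss_minimax(d_fake):
    # G only influences the second term of V, and it minimizes that term.
    return sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)

# Hypothetical discriminator outputs, chosen by hand for illustration.
confident = value_estimate(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = value_estimate(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, undecided)
```

A discriminator that separates the batches well scores a higher V than one that outputs 1/2 everywhere; the latter gives exactly -log 4, which is the value the game settles at when D cannot tell the two distributions apart.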
The optimal discriminator (chalkboard)
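Fix G and let p_G be the generator’s induced distribution. Then V(D, G) = ∫ [ p_data(x) log D(x) + p_G(x) log(1 - D(x)) ] dx, and the integrand can be maximized separately at each x. For a, b > 0, the function a log y + b log(1 - y) peaks at y = a / (a + b), so:

D*(x) = p_data(x) / ( p_data(x) + p_G(x) )

At equilibrium p_G = p_data, so D*(x) = 1/2 everywhere (the discriminator cannot distinguish real from generated samples).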
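The pointwise maximization can be sanity-checked numerically. The sketch below brute-forces the maximizer of a log y + b log(1 - y) over a grid and compares it with the closed form a / (a + b). The helper names and the density values a = 0.5, b = 0.7 are hypothetical, picked only for illustration.

```python
import math

def integrand(y, a, b):
    # Pointwise objective for D(x) = y, with a = p_data(x) and b = p_G(x).
    return a * math.log(y) + b * math.log(1.0 - y)

def grid_argmax(a, b, steps=9999):
    # Brute-force search over y in (0, 1); endpoints are excluded to avoid log(0).
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda y: integrand(y, a, b))

a, b = 0.5, 0.7  # hypothetical density values at a single point x
y_grid = grid_argmax(a, b)
y_closed = a / (a + b)
print(y_grid, y_closed)  # should agree to within the grid spacing
```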
The implicit divergence: Jensen-Shannon
Substitute D = D* back in:

V(D*, G) = -log 4 + 2 · JS( p_data || p_G )

JS(p || q) = 0.5 · KL(p || m) + 0.5 · KL(q || m) with m = (p + q) / 2

So, against an optimal discriminator, the generator minimizes the Jensen-Shannon divergence: symmetric, bounded by log 2, NOT the forward KL of Phase 1. This is the precise sense in which GANs “use a different divergence” than the likelihood-based paradigms.
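A small numerical check of the identity and of the stated JS properties, for discrete distributions. This is only a sketch: the three-outcome distributions are made up, full support is assumed when forming D*, and the helper names are not from any library.

```python
import math

def kl(p, q):
    # KL(p || q) in nats; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    # V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_G(x)); assumes full support.
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        total += pd * math.log(d_star) + pg * math.log(1.0 - d_star)
    return total

p_data = [0.2, 0.3, 0.5]  # hypothetical distributions on three outcomes
p_g = [0.6, 0.3, 0.1]

lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * js(p_data, p_g)
print(lhs, rhs)                                  # should agree up to rounding
print(js(p_data, p_g), js(p_g, p_data))          # symmetry
print(js([1.0, 0.0], [0.0, 1.0]), math.log(2))   # disjoint supports reach the bound
```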
Worked numerical example
Two outcomes x ∈ {0, 1}, with p_data = [0.5, 0.5] and p_G = [0.7, 0.3]. Mixture m = [0.6, 0.4]. Logarithms are natural, so values are in nats.
| Quantity | Value |
|---|---|
| D*(0) = 0.5 / 1.2 | ≈ 0.4167 (D thinks x=0 more likely fake) |
| D*(1) = 0.5 / 0.8 | = 0.625 (D thinks x=1 more likely real) |
| KL(p_data || m) = 0.5 ln(5/6) + 0.5 ln(5/4) | ≈ 0.0204 |
| KL(p_G || m) = 0.7 ln(7/6) + 0.3 ln(3/4) | ≈ 0.0216 |
| JS = 0.5 · 0.0204 + 0.5 · 0.0216 | ≈ 0.021 nats |
At p_G = p_data: D* = 1/2, both KLs = 0, JS = 0 (equilibrium).
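The table can be reproduced with a few lines of standard-library Python. This is a minimal sketch; the variable names are chosen here for readability, and the printed values should match the table up to rounding.

```python
import math

p_data = [0.5, 0.5]
p_g = [0.7, 0.3]

# Mixture m = (p_data + p_G) / 2 and the optimal discriminator D*.
m = [(a + b) / 2 for a, b in zip(p_data, p_g)]
d_star = [a / (a + b) for a, b in zip(p_data, p_g)]

def kl(p, q):
    # KL(p || q) in nats; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

kl_data = kl(p_data, m)
kl_g = kl(p_g, m)
js = 0.5 * kl_data + 0.5 * kl_g

print([round(v, 4) for v in m])
print([round(v, 4) for v in d_star])
print(round(kl_data, 4), round(kl_g, 4), round(js, 4))
```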
Paradigm-level failure modes (not bugs)
| Failure | Cause | Standard mitigation |
|---|---|---|
| Vanishing gradients | log(1 - D(G(z))) saturates when D confidently rejects fakes (D(G(z)) ≈ 0), so G receives almost no gradient | Non-saturating loss: G maximizes log D(G(z)) (same gradient direction, no saturation) |
| Mode collapse | JS minimization + minimax dynamics; G can fool D with a narrow output set | Wasserstein-GAN (next lesson), feature matching, mini-batch discrimination, unrolled GANs |
| No stopping criterion | Losses oscillate; not monotone | Inspect samples; use FID (lesson 9), not training loss |
Modern training uses the non-saturating loss; the minimax framework is the theoretical description.
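To make the saturation concrete, the sketch below compares the gradients of the two generator losses with respect to the discriminator's logit on a fake sample. It assumes D(G(z)) = sigmoid(a) for a scalar logit a, and it differentiates with respect to a rather than the generator's parameters; both are simplifications made here for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da of log(1 - sigmoid(a)), the minimax generator loss.
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da of -log(sigmoid(a)), the non-saturating generator loss.
    return -(1.0 - sigmoid(a))

# Negative logits mean D is confident the sample is fake.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(a, round(sigmoid(a), 4), grad_saturating(a), grad_non_saturating(a))
```

Both gradients are negative, so a descent step raises the logit in either case. When D is confident the sample is fake (large negative a), the saturating gradient shrinks toward 0 while the non-saturating one stays close to -1.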
GANs are good at / not good at
| Good at | Not good at |
|---|---|
| Inference speed (one forward pass through G) | Density evaluation (no likelihood, period) |
| Sharp samples in specific domains (StyleGAN faces, image-to-image) | Stable out-of-the-box training |
| Latent-space editing (semantic directions for “age,” “smile,” etc.) | Broad-coverage generation (diffusion now leads for general text-to-image) |
Why it matters for AI
- Divergence-choice as paradigm-design. Many later generative variants are framed as “which divergence?”: the GAN minimizes JS, WGAN minimizes the Wasserstein distance, Phase 1 paradigms minimize forward KL, and diffusion uses score matching. Recognizing which divergence a method has chosen is the first question to ask of it.
- Adversarial training as a tool. Even outside generative models, the adversarial pattern recurs (robustness research, some self-supervised methods, policy-vs-value in RL).
- Reading model releases that use GAN components. The mechanical math (this lesson) is separable from the policy framing (next section).
A note on what this lesson does NOT cover
GANs are the paradigm where the modern deepfake category originated. The framing for those use cases is a set of distinct policy/governance questions outside this lesson’s mechanical scope:
- When generating synthetic faces, voices, or video of identifiable people is appropriate and when it is not (use-case and consent policy);
- How to attribute or watermark synthesized content (provenance policy);
- Sector-specific policies for generated media in journalism, politics, and legal evidence (deployment policy);
- IP and licensing claims around training data scraped from named sources (data-licensing policy).
Each is a distinct forum with distinct stakeholders, evaluated by different methods. Treat the math and the policy as separate concerns.
Pitfalls to dodge
- Treating the original saturating loss as the training loss. In practice the generator is trained with the non-saturating loss: same gradient direction, no saturation.
- Mistaking mode collapse for a bug. Paradigm-level; fix it by changing divergence (Wasserstein-GAN) or with regularization tricks, not hyperparameter tuning.
- Expecting a likelihood number. GANs do not compute density anywhere.
- Using training loss for stopping. Losses oscillate; use FID or inspect samples.
The one-line version
A GAN is two networks in a minimax game; at optimal D*, the generator minimizes the Jensen-Shannon divergence (not forward KL); mode collapse and training instability are paradigm-level features of that divergence choice, not bugs to tune away.