GANs, the minimax game: brief

What you’ll learn

This is lesson 7 of Track 19 (Generative Models and Diffusion), and it introduces the first paradigm in the track that drops the likelihood objective entirely. By the end you will be able to state the minimax objective in one line, derive the optimal discriminator at a fixed generator (D*(x) = p_data(x) / (p_data(x) + p_G(x))), recognize the implicit divergence the generator ends up minimizing (Jensen-Shannon, not the forward KL of Phase 1), and explain why mode collapse and training instability are intrinsic to the paradigm rather than hyperparameter bugs. The source curricula are Stanford CS236 (Lecture 9) and Berkeley CS294-158 (Lecture 5).

Where this fits

This is lesson 7 of 15, the third step of Phase 2 (latent-variable and adversarial paradigms). It contrasts sharply with the previous lesson: VAEs keep a likelihood objective (the ELBO bound) but produce blurrier samples; GANs drop likelihood for sharper samples and pay with training instability. The next lesson, GAN training in practice, keeps the minimax framework but replaces the JS divergence with the Wasserstein distance and adds gradient-penalty regularization, addressing some of the pathologies introduced here. Lesson 9 then covers how to evaluate generative models when training loss is unreliable as a quality proxy.

Before you start

Prerequisites: the previous lesson, VAE training in practice, for the contrast (likelihood-bounded vs likelihood-free). The L3 KL/cross-entropy machinery is reused (the Jensen-Shannon divergence is built from KL divergences). Math background: comfort with expectations, KL divergence, and one calculus step (pointwise maximization of an integrand at each x to derive D*). No new technical machinery is introduced beyond JS divergence itself.
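
For orientation, the standard definition of the Jensen-Shannon divergence in terms of KL is sketched below. The symbols p, q, and the mixture m are notation assumed here for illustration; the lesson itself may use different letters.

```latex
% Jensen-Shannon divergence between distributions p and q,
% built from two KL terms against the equal-weight mixture m.
\[
  m(x) = \tfrac{1}{2}\bigl(p(x) + q(x)\bigr), \qquad
  \mathrm{JSD}(p \,\|\, q)
    = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m)
    + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m).
\]
% Unlike KL, JSD is symmetric in p and q, and with natural logs
% it is bounded: 0 \le JSD(p || q) \le \log 2.
```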

About the math

This lesson has two clean derivations: the optimal discriminator at fixed G (one line of pointwise calculus), and the substitution back into the minimax objective to reveal JS divergence (a few lines, mostly bookkeeping). A worked numerical example on a 2-outcome distribution pins down both the optimal-discriminator formula and the JS divergence; the practice extends it to a 3-outcome case. Unlike previous lessons in this track, this lesson cannot show a single training-step computation with small numbers, because the loss depends on two networks training against each other; the small-numbers work is done on the equilibrium analysis instead.
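
As a preview, the two derivations can be sketched as follows. This is the standard textbook form rather than necessarily the lesson's exact presentation; the names V(D, G) for the value function and p_z for the latent prior are notation assumed here.

```latex
% Minimax objective: D maximizes, G minimizes.
\[
  \min_G \max_D \; V(D, G)
    = \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
    + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr].
\]
% Write both expectations over x, using p_G for the density of G(z).
\[
  V(D, G) = \int \Bigl[\, p_{\mathrm{data}}(x)\,\log D(x)
    + p_G(x)\,\log\bigl(1 - D(x)\bigr) \Bigr]\, dx .
\]
% Pointwise step: for a, b > 0, f(y) = a log y + b log(1 - y) has
% f'(y) = a/y - b/(1 - y) = 0 at y = a/(a + b), which is a maximum.
\[
  D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_G(x)} .
\]
% Substitute D* back in, with m = (p_data + p_G)/2.
\[
  V(D^*, G)
    = \mathbb{E}_{p_{\mathrm{data}}}\!\left[\log \frac{p_{\mathrm{data}}}{2m}\right]
    + \mathbb{E}_{p_G}\!\left[\log \frac{p_G}{2m}\right]
    = -\log 4 + 2\,\mathrm{JSD}\bigl(p_{\mathrm{data}} \,\|\, p_G\bigr).
\]
```

The constant -log 4 comes from the factor of 2 in each denominator. It is also the value attained when p_G = p_data, where D*(x) = 1/2 everywhere and the JS term vanishes.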

By the end, you’ll be able to

State the GAN minimax objective and explain who maximizes and who minimizes
Derive the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)) at fixed generator
Recognize that the implicit divergence the generator minimizes is the Jensen-Shannon divergence, and contrast it with the forward KL of Phase 1
Explain why mode collapse and training instability are paradigm-level features (JS-divergence dynamics) rather than incidental bugs
Place GANs in the modern generative landscape on inference speed, sample sharpness, and the lack of a likelihood number
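
The second and third objectives can be checked numerically on a small discrete case. The sketch below is illustrative only: the two 2-outcome distributions are hypothetical values chosen here, not the lesson's worked example, and the helper names are made up for this snippet. It assumes every outcome has positive mass under at least one of the two distributions.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)), one value per outcome."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def kl(p, q):
    """KL(p || q) in nats; terms with p(x) = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the equal-weight mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(p_data, p_g, d):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_g}[log(1 - D(x))]."""
    real_term = sum(pd * math.log(di) for pd, di in zip(p_data, d) if pd > 0)
    fake_term = sum(pg * math.log(1 - di) for pg, di in zip(p_g, d) if pg > 0)
    return real_term + fake_term

# Hypothetical 2-outcome distributions (assumed for illustration).
p_data = [0.5, 0.5]
p_g = [0.9, 0.1]

d_star = optimal_discriminator(p_data, p_g)
jsd = js(p_data, p_g)
v_star = value(p_data, p_g, d_star)

print("D*          :", d_star)
print("JSD         :", jsd)
print("V(D*, G)    :", v_star)
print("-log4 + 2JSD:", -math.log(4) + 2 * jsd)
```

The last two printed values should agree up to floating-point error, which is the identity V(D*, G) = -log 4 + 2 JSD(p_data || p_G) in miniature. Setting p_g equal to p_data gives D* = 1/2 on every outcome and a JS divergence of zero.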

Time and difficulty

Read time: about 14 minutes
Practice time: about 18 minutes (a six-question self-check, an optimal-discriminator computation on a 3-outcome case, a JS-divergence computation on the same distributions, and flashcards)
Difficulty: standard (a Phase 2 lesson; two clean derivations, one new divergence definition, and an honest discussion of paradigm-level failure modes that are not fixable by tuning)