GAN training in practice, Wasserstein loss and gradient penalty

This is lesson 8 of Track 19 (Generative Models and Diffusion), the second of two lessons on adversarial training and the last GAN-focused lesson in the track. By the end you will be able to explain why the original GAN’s Jensen-Shannon objective gives no useful gradient when the data and generator distributions have disjoint supports (the typical early-training case), state the Wasserstein-distance alternative in both its Earth Mover’s intuition and its Kantorovich-Rubinstein dual form, write the WGAN-GP critic loss including the gradient penalty, and compute the Wasserstein-1 distance between two 1D distributions by hand using the CDF formula. You will also know which of L7’s paradigm-level pathologies WGAN-GP fixes and which it does not. The source curriculum is Stanford CS236 Lecture 10, with the same Berkeley CS294-158 Lecture 5 framing that is applied throughout the GAN family.
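The disjoint-support claim above can be checked numerically on the smallest possible case. The sketch below is illustrative only: it assumes two unit point masses, one fixed at 0 and one at a hypothetical offset `theta`, and the helper names `js_divergence` and `w1_point_masses` are not from the lesson.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions.

    p and q are dicts mapping a point on the real line to its probability mass.
    """
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = 0.5 * (px + qx)  # mixture distribution m = (p + q) / 2
        if px > 0:
            total += 0.5 * px * math.log(px / mx)
        if qx > 0:
            total += 0.5 * qx * math.log(qx / mx)
    return total


def w1_point_masses(a, b):
    """W_1 between unit point masses at a and b: all mass travels |a - b|."""
    return abs(a - b)


# Assumed toy setup: "data" is a point mass at 0, "generator" a point mass at theta.
for theta in [2.0, 1.0, 0.5, 0.0]:
    js = js_divergence({0.0: 1.0}, {theta: 1.0})
    w1 = w1_point_masses(0.0, theta)
    print(f"theta={theta:.1f}  JS={js:.4f}  W1={w1:.4f}")
```

In this toy case the JS divergence equals log 2 for every nonzero offset and drops to 0 only when the two masses coincide, so it carries no information about how far apart they are. W_1 equals the offset itself and shrinks steadily as the masses approach each other.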

Within the 15-lesson track, this is the fourth lesson of Phase 2 (latent-variable and adversarial paradigms). The remaining Phase 2 lesson (L9) is evaluation-focused. The GAN arc through the track runs: original GAN (L7) → Wasserstein GAN with gradient penalty (this lesson) → evaluation methods for generative samples (L9, which closes Phase 2 with a divergence-agnostic look at how all of Phase 2’s models are actually compared). After L9, Phase 3 opens with energy-based and score-based models, followed by full diffusion in lessons 12-14 and the synthesis capstone at L15.

Prerequisites: the previous lesson (L7, “GANs, the minimax game”), which supplies the framework and the paradigm-level problems this lesson addresses. The L3 KL/cross-entropy machinery is assumed implicitly (the gradient-penalty formulation uses standard expectation-and-derivative notation from those lessons), but no new probability identities are introduced. Math background: comfort with expectations, one calculus step (the norm of a gradient), and one transport-theory intuition (the Earth Mover’s distance), which the lesson builds up carefully.

The lesson introduces one new concept (the Wasserstein distance, with both an intuitive and a dual-form definition), one new architectural pattern (the gradient penalty), and one closed-form 1D formula (W_1 = ∫ |F_p(x) − F_q(x)| dx, where F_p and F_q are the two CDFs). A worked numerical example computes W_1 for two simple discrete distributions using the CDF formula; the practice section extends it to a fresh 3-mass case. The Lipschitz constraint is introduced symbolically and exercised on a simple linear function. The math density is comparable to L7’s optimal-discriminator derivation.
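The CDF formula mentioned above can be sketched in a few lines for discrete distributions. This is a minimal illustration, not the lesson's own worked example: the function name `w1_cdf` and the two distributions `p` and `q` are hypothetical choices made here.

```python
def w1_cdf(p, q):
    """Wasserstein-1 distance between two discrete 1D distributions.

    p and q are dicts mapping a point on the real line to its probability mass.
    For discrete distributions both CDFs are step functions, so the integral of
    |F_p - F_q| reduces to a sum over the gaps between consecutive support points.
    """
    points = sorted(set(p) | set(q))
    cdf_p = cdf_q = 0.0
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)  # F_p on the interval [left, right)
        cdf_q += q.get(left, 0.0)  # F_q on the interval [left, right)
        total += abs(cdf_p - cdf_q) * (right - left)
    return total


# Hypothetical example: q is p shifted one unit to the right.
p = {0.0: 0.5, 1.0: 0.5}
q = {1.0: 0.5, 2.0: 0.5}
print(w1_cdf(p, q))
```

For this pair the CDFs differ by 0.5 on both [0, 1) and [1, 2), so the sum is 1. That agrees with the Earth Mover's reading: moving every unit of mass one step to the right costs 1 in total.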

  • Explain the Earth Mover’s intuition for the Wasserstein distance and why it gives meaningful gradients on disjoint distributions where JS divergence saturates
  • State the Kantorovich-Rubinstein dual form of W_1 and explain the role of the 1-Lipschitz constraint
  • Write the WGAN-GP critic loss including the gradient penalty on interpolated samples and explain why interpolated samples (not real or fake alone) are the right evaluation points
  • Compute the Wasserstein-1 distance between two simple 1D discrete distributions using the CDF formula and verify against the Earth Mover’s interpretation
  • Identify which of the original-GAN problems WGAN-GP fixes (vanishing gradients, mode collapse, stopping criterion) and which it does not (no likelihood, architectural constraint, extra compute)
  • Read time: about 14 minutes
  • Practice time: about 16 minutes (a six-question self-check, a Wasserstein-1 computation on a fresh 3-mass 1D case, a Lipschitz-constraint verification on a simple linear function, and flashcards)
  • Difficulty: standard (a Phase 2 lesson; one new distance, one new constraint, one new architectural pattern; closed-form computations stay at the 1D CDF level)
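
For reference, the two formulas named in the objectives above are usually written as follows. The notation is an assumption made here and may differ from the lesson's own: p is the data distribution, q the generator distribution, D the critic, and λ the penalty weight (10 is the value commonly cited from the original WGAN-GP paper).

```latex
% Assumed notation: p = data distribution, q = generator distribution,
% D = critic, \lambda = gradient-penalty weight.
\begin{align*}
W_1(p, q) &= \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \\
\mathcal{L}_{\text{critic}} &= \mathbb{E}_{\tilde{x} \sim q}[D(\tilde{x})] - \mathbb{E}_{x \sim p}[D(x)]
  + \lambda \, \mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] \\
\hat{x} &= \epsilon x + (1 - \epsilon)\tilde{x}, \qquad \epsilon \sim \mathrm{U}[0, 1]
\end{align*}
```

The first line is the Kantorovich-Rubinstein dual, with the supremum taken over 1-Lipschitz functions f. The second is the critic loss, whose penalty term pushes the gradient norm toward 1 at points interpolated between a real sample x and a generated sample x̃. For a linear function f(x) = w·x the gradient is w everywhere, so f is 1-Lipschitz exactly when ‖w‖ ≤ 1.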