WGAN-GP and Wasserstein loss, in brief

What you’ll learn

This is lesson 8 of Track 19 (Generative Models and Diffusion), the second of two lessons on adversarial training and the last GAN-focused lesson in the track. By the end you will be able to explain why the original GAN’s Jensen-Shannon objective gives no useful gradient when the data and generator distributions have disjoint supports (the typical early-training case), state the Wasserstein-distance alternative in both its Earth Mover’s intuition and its Kantorovich-Rubinstein dual form, write the WGAN-GP critic loss including the gradient penalty, and compute the Wasserstein-1 distance between two 1D distributions by hand using the CDF formula. You will also know which of L7’s paradigm-level pathologies WGAN-GP fixes and which it does not. The source curriculum is Stanford CS236 Lecture 10; as elsewhere in the GAN lessons, the framing follows Berkeley CS294-158 Lecture 5.

Where this fits

This is lesson 8 of 15, the fourth lesson of Phase 2 (latent-variable and adversarial paradigms) and the last GAN-focused lesson in this track. The remaining Phase 2 lesson (L9) is evaluation-focused. The GAN arc through the track: original GAN (L7) → Wasserstein-GAN with gradient penalty (this lesson) → evaluation methods for generative samples (L9, which closes Phase 2 with a divergence-agnostic look at how all of Phase 2’s models are actually compared). After L9, Phase 3 opens with energy-based and score-based models, then full diffusion in lessons 12-14, then the synthesis capstone at L15.

Before you start

Prerequisites: the previous lesson (L7, “GANs, the minimax game”), which supplies the framework and the paradigm-level problems this lesson addresses. The L3 KL/cross-entropy machinery is implicit (the gradient-penalty formulation uses the standard expectation-and-derivative notation from that lesson), but no new probability identities are introduced. Math background: comfort with expectations, one calculus step (the norm of a gradient), and one transport-theory intuition (Earth Mover’s distance) that the lesson builds up carefully.

About the math

The lesson introduces one new concept (the Wasserstein distance, with both an intuitive and a dual-form definition), one new architectural pattern (the gradient penalty), and one closed-form 1D formula, W_1(p, q) = ∫ |F_p(x) − F_q(x)| dx, where F_p and F_q are the CDFs of the two distributions. A worked numerical example computes W_1 for two simple discrete distributions using the CDF formula; the practice extends it to a fresh 3-mass case. The Lipschitz constraint is introduced symbolically and exercised on a simple linear function. The math density is comparable to L7’s optimal-discriminator derivation.
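As a preview of the CDF formula, here is a minimal Python sketch. The function name `w1_discrete`, the dict-of-masses representation, and the two example distributions are illustrative assumptions, not the lesson’s own worked example. For discrete distributions the CDFs are step functions, so the integral reduces to a sum of |F_p − F_q| times the width of each gap between neighbouring support points.

```python
def w1_discrete(p, q):
    """Wasserstein-1 distance between two discrete 1D distributions.

    p and q map a location on the real line to the probability mass
    placed there (hypothetical representation chosen for this sketch).
    """
    points = sorted(set(p) | set(q))
    cdf_p = cdf_q = 0.0
    total = 0.0
    # Between two neighbouring support points both CDFs are constant,
    # so the integral of |F_p - F_q| is a sum of rectangle areas.
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        total += abs(cdf_p - cdf_q) * (right - left)
    return total


# Two disjoint distributions: q is p shifted right by 2 units.
p = {0: 0.5, 1: 0.5}
q = {2: 0.5, 3: 0.5}
print(w1_discrete(p, q))  # 2.0

# Two point masses 5 units apart: the distance is the separation itself.
print(w1_discrete({0: 1.0}, {5: 1.0}))  # 5.0
```

Both results match the Earth Mover’s reading: every unit of mass has to travel 2 units in the first case and 5 in the second. The distance keeps growing as the distributions move apart, which is the property that gives a usable training signal on disjoint supports.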

By the end, you’ll be able to

Explain the Earth Mover’s intuition for the Wasserstein distance and why it gives meaningful gradients on disjoint distributions where JS divergence saturates
State the Kantorovich-Rubinstein dual form of W_1 and explain the role of the 1-Lipschitz constraint
Write the WGAN-GP critic loss including the gradient penalty on interpolated samples and explain why interpolated samples (not real or fake alone) are the right evaluation points
Compute the Wasserstein-1 distance between two simple 1D discrete distributions using the CDF formula and verify against the Earth Mover’s interpretation
Identify which of the original-GAN problems WGAN-GP fixes (vanishing gradients, mode collapse, no meaningful stopping criterion) and which it does not (no likelihood, architectural constraint, extra compute)
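To make the critic-loss objective concrete, here is a rough 1D sketch in plain Python. It assumes the commonly stated form of the loss, the mean critic score on fake samples minus the mean score on real samples, plus λ times the mean of (|gradient of the critic at an interpolated point| − 1)², with λ = 10 as the usual default. The function name `wgan_gp_critic_loss`, the scalar critic, and the finite-difference gradient are simplifications made for this sketch; a real implementation would use a neural critic and automatic differentiation.

```python
import random


def wgan_gp_critic_loss(critic, real, fake, lam=10.0, h=1e-5, rng=None):
    """WGAN-GP critic loss for scalar samples (illustrative sketch only).

    critic: any function from float to float (hypothetical stand-in
            for a neural critic).
    real, fake: equal-length lists of scalar samples.
    """
    rng = rng or random.Random(0)
    n = len(real)
    # Wasserstein term: the critic wants high scores on real samples
    # and low scores on fake ones, so minimising this widens the gap.
    wasserstein_term = (sum(critic(x) for x in fake) / n
                        - sum(critic(x) for x in real) / n)
    penalty = 0.0
    for x_real, x_fake in zip(real, fake):
        # Evaluate the gradient at a random point on the segment
        # between a real sample and a fake sample.
        eps = rng.random()
        x_hat = eps * x_real + (1.0 - eps) * x_fake
        # Central finite difference stands in for autograd here.
        grad = (critic(x_hat + h) - critic(x_hat - h)) / (2.0 * h)
        penalty += (abs(grad) - 1.0) ** 2
    return wasserstein_term + lam * penalty / n


real = [2.0, 3.0]
fake = [0.0, 1.0]

# A linear critic with slope 1 is 1-Lipschitz: the penalty vanishes
# and the loss is close to -2.0.
print(wgan_gp_critic_loss(lambda x: x, real, fake))

# A linear critic with slope 3 widens the score gap but violates the
# constraint; the penalty term dominates and the loss is close to 34.0.
print(wgan_gp_critic_loss(lambda x: 3.0 * x, real, fake))
```

For a linear critic the gradient is the same everywhere, so the choice of evaluation point does not matter in this toy case. With a nonlinear critic it does, which is why the penalty samples along the lines connecting real and generated data rather than at either end alone.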

Time and difficulty

Read time: about 14 minutes
Practice time: about 16 minutes (a six-question self-check, a Wasserstein-1 computation on a fresh 3-mass 1D case, a Lipschitz-constraint verification on a simple linear function, and flashcards)
Difficulty: standard (a Phase 2 lesson; one new distance, one new constraint, one new architectural pattern; closed-form computations stay at the 1D CDF level)