Summary: GAN training in practice, Wasserstein loss and gradient penalty

This is Phase 2’s second and last GAN-focused lesson (only L7 and L8 are GAN-focused; L9 covers paradigm-agnostic evaluation). The previous lesson named three paradigm-level problems with the original JS-divergence-minimizing GAN: vanishing gradients, mode collapse, and no clean stopping criterion. This lesson fixes the first directly, reduces the second meaningfully, and addresses the third partially, by changing the divergence to the Wasserstein distance and adding a soft Lipschitz constraint. The whole lesson reduces to one line: WGAN-GP replaces the Jensen-Shannon divergence with the Wasserstein distance (the Earth Mover’s distance, which is geometric and gives meaningful gradients on disjoint distributions) and enforces the required 1-Lipschitz constraint on the critic via a gradient penalty on interpolated samples; the result is a production-grade GAN variant. This is the scan-it-in-five-minutes version.

Core ideas

The Wasserstein distance W_1(p, q) is the Earth Mover’s distance: the minimum cost to transport mass from p into the shape of q, where cost is mass times distance moved. It scales smoothly with geometric distance, unlike JS, which saturates at log 2 on disjoint supports. For point masses at x=0 and x=3: W_1 = 3, while JS = log 2 ≈ 0.693 regardless of how far apart the two points are.
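The saturation claim above can be checked numerically. This is a minimal sketch using only the standard library; the helper names `js_divergence` and `w1_point_masses` are illustrative, and distributions are assumed to be discrete pmfs stored as {point: mass} dicts.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete pmfs."""
    def kl(a, m):
        # KL(a || m); terms with zero mass in a contribute nothing.
        return sum(a_x * math.log(a_x / m[x]) for x, a_x in a.items() if a_x > 0)
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_point_masses(a, b):
    """W_1 between unit point masses at a and b: all mass moves |a - b|."""
    return abs(a - b)

# Move the second point mass further away and compare the two quantities.
for d in (1.0, 3.0, 10.0):
    p, q = {0.0: 1.0}, {d: 1.0}
    print(f"d={d}: W_1={w1_point_masses(0.0, d)}, JS={js_divergence(p, q):.3f}")
```

W_1 tracks the separation d, while JS prints 0.693 at every separation because the supports never overlap.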
Kantorovich-Rubinstein duality gives a trainable form: W_1(p, q) = sup over 1-Lipschitz f of (E_{p}[f] − E_{q}[f]). A neural network critic f (constrained 1-Lipschitz) approximates the supremum by maximizing E_{p_data}[f] − E_{p_G}[f]; the generator minimizes the same quantity. Same two-network minimax framework as L7, different objective and constraint.
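To make the dual form concrete, here is a small sketch for the point masses at x=0 and x=3. It assumes, purely for illustration, that the critic is restricted to linear functions f(x) = s·x, which are 1-Lipschitz exactly when |s| ≤ 1; the true supremum ranges over all 1-Lipschitz functions, but for this 1D case a linear critic is enough to attain it. The name `dual_objective` is hypothetical.

```python
def dual_objective(f, p, q):
    """E_p[f] - E_q[f] for discrete pmfs given as {point: mass} dicts."""
    return sum(m * f(x) for x, m in p.items()) - sum(m * f(x) for x, m in q.items())

p = {0.0: 1.0}  # point mass at x = 0
q = {3.0: 1.0}  # point mass at x = 3

# Sweep slopes in [-1, 1]: the 1-Lipschitz linear critics.
slopes = [i / 10 for i in range(-10, 11)]
best_slope = max(slopes, key=lambda s: dual_objective(lambda x: s * x, p, q))
best_value = dual_objective(lambda x: best_slope * x, p, q)
print(best_slope, best_value)  # -1.0 3.0

# Without the Lipschitz constraint the objective can be inflated at will.
print(dual_objective(lambda x: -2.0 * x, p, q))  # 6.0
```

The best constrained critic recovers W_1 = 3, and the unconstrained slope of −2 shows why the constraint is needed: without it the objective has no finite supremum.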
The Lipschitz constraint is enforced by a gradient penalty added to the critic’s loss: λ · E_{x̂}[(||∇_x f(x̂)|| − 1)²] with x̂ = α·x_real + (1 − α)·x_fake, α ~ Uniform(0, 1), λ ≈ 10. The penalty pushes the critic’s gradient norm toward 1 on samples interpolated between real and fake, because the optimal critic has unit gradient norm along straight lines joining coupled real and generated points. This is WGAN-GP; the soft constraint replaces the original WGAN’s weight clipping, which limited the critic’s expressiveness.
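The penalty term can be sketched in a few lines. This is a toy 1D illustration under stated assumptions, not a training implementation: the critic f(x) = a·tanh(b·x) is a hypothetical stand-in for a network, and its derivative is written by hand, whereas a real implementation would obtain ∇_x f from automatic differentiation. The function names are illustrative.

```python
import math
import random

def critic(x, a, b):
    """Toy 1D critic f(x) = a * tanh(b * x) (hypothetical stand-in for a network)."""
    return a * math.tanh(b * x)

def critic_grad(x, a, b):
    """Analytic derivative df/dx = a * b * (1 - tanh(b * x)^2)."""
    return a * b * (1.0 - math.tanh(b * x) ** 2)

def gradient_penalty(real, fake, a, b, lam=10.0, rng=random):
    """lam * mean((|f'(x_hat)| - 1)^2) with x_hat = alpha*real + (1 - alpha)*fake."""
    total = 0.0
    for x_real, x_fake in zip(real, fake):
        alpha = rng.random()  # alpha ~ Uniform(0, 1), one draw per pair
        x_hat = alpha * x_real + (1.0 - alpha) * x_fake
        total += (abs(critic_grad(x_hat, a, b)) - 1.0) ** 2
    return lam * total / len(real)

def critic_loss(real, fake, a, b, lam=10.0, rng=random):
    """Critic minimizes E_fake[f] - E_real[f] + penalty."""
    mean_real = sum(critic(x, a, b) for x in real) / len(real)
    mean_fake = sum(critic(x, a, b) for x in fake) / len(fake)
    return mean_fake - mean_real + gradient_penalty(real, fake, a, b, lam, rng)

rng = random.Random(0)
real = [rng.gauss(0.0, 0.1) for _ in range(256)]  # "data" near x = 0
fake = [rng.gauss(3.0, 0.1) for _ in range(256)]  # "generated" near x = 3
print(gradient_penalty(real, fake, a=1.0, b=1.0, rng=rng))
print(critic_loss(real, fake, a=1.0, b=1.0, rng=rng))
```

The penalty is zero only when the critic’s slope has magnitude 1 at every interpolated point, so it acts as a soft constraint rather than a hard one.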
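Worked anchor: 1D p with mass 1/3 at each of {0, 1, 2}, q shifted to {2, 3, 4}. The CDF differences |F_p − F_q| are 1/3, 2/3, 2/3, 1/3 on the unit-width intervals [0,1), [1,2), [2,3), [3,4), so W_1 = 2. Earth Mover’s check: three masses of 1/3, each moved 2 units, total cost 3 · (2/3) = 2. Matches. JS on the same case is (2/3)·log 2 ≈ 0.462, which reflects only the single overlapping point at x = 2; any shift of 3 or more gives log 2 regardless of distance, so JS does not encode the shift magnitude.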
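The worked anchor can be reproduced with the 1D identity W_1 = ∫ |F_p − F_q| dx. This is a minimal sketch assuming discrete pmfs stored as {point: mass} dicts; `w1_from_cdfs` is an illustrative name, and the transport check assumes the plan that shifts every mass by 2.

```python
def w1_from_cdfs(p, q):
    """W_1 between 1D discrete pmfs as the integral of |F_p - F_q|."""
    points = sorted(set(p) | set(q))
    cdf_p = cdf_q = 0.0
    total = 0.0
    pieces = []  # |F_p - F_q| on each interval between consecutive points
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        gap = abs(cdf_p - cdf_q)
        pieces.append(gap)
        total += gap * (right - left)
    return total, pieces

p = {0: 1/3, 1: 1/3, 2: 1/3}
q = {2: 1/3, 3: 1/3, 4: 1/3}

w1, pieces = w1_from_cdfs(p, q)
# Earth Mover's check: every mass of 1/3 moves 2 units to the right.
transport_cost = sum(mass * 2 for mass in p.values())
print(pieces, w1, transport_cost)
```

Up to floating-point rounding, the interval values are 1/3, 2/3, 2/3, 1/3 and both routes give W_1 = 2.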
What WGAN-GP fixes: vanishing gradients on disjoint distributions (meaningful gradient signal); training loss correlates with sample quality (the critic’s loss estimates W_1, a real similarity number); mode collapse reduced (Wasserstein penalizes missing mass, unlike JS). What it does not fix: no likelihood (the critic outputs an unbounded scalar, not a density); still requires careful architecture (Lipschitz constraint must be respected); adds compute cost (gradient penalty needs ∇_x f per step).
The critic is not a discriminator. It outputs an unbounded scalar, not a probability. Mixing the terms is a sign of conflating original-GAN and WGAN.
Production-grade GAN training uses stabilized variants, not the original GAN. This holds for domain-specific high-resolution generation (StyleGAN-family, typically non-saturating logistic loss with R1 regularization), hybrid systems (image-to-image translation, super-resolution, audio synthesis), and adversarial-training-as-regularization in robustness research. ProGAN was the canonical WGAN-GP-trained example, and WGAN-GP and its descendants remain a common choice.

A note on what this lesson does NOT cover

The §6 boundary from L7 carries through. Four distinct policy/governance forums lie outside this mechanical scope: when generating synthetic faces, voices, or video of identifiable people is appropriate and when it is not; attribution and watermarking of synthesized content; sector-specific policies for journalism, politics, and legal evidence; and IP and licensing claims around training-data scraping. The relevant evaluation methods for this lesson’s scope are training stability (the critic’s Wasserstein estimate, gradient norm on interpolated samples) and sample quality (FID, Inception Score, human preference studies; see lesson 9). If you are using those tools, you are in this lesson’s scope. If you are using policy-debate methods (stakeholder analysis, legal frameworks, regulatory comment), you are in a different conversation with different stakeholders.

What changes for you

Before this lesson, the practical question of “how do you actually train a GAN to convergence” probably had no precise answer beyond “carefully.” Now it does: use WGAN-GP, which means changing the divergence to Wasserstein, enforcing the Lipschitz constraint with the gradient penalty, and watching the critic’s loss as a real proxy for sample quality. When you next read a paper on a GAN variant, the dominant question to ask is “what divergence are they minimizing, and what Lipschitz enforcement are they using?” (Common answers: WGAN-GP, spectral normalization, hinge loss, or some combination.) The next lesson, Evaluating generative models, closes Phase 2 by covering the sample-based metrics (FID, Inception Score, Precision/Recall for distributions) that are the actual stand-in for likelihood when likelihood is available only as a bound (VAEs), unavailable (GANs), or just one of many possible quality measures.