
Cheatsheet: GAN training in practice, Wasserstein loss and gradient penalty

Original GAN: minimize Jensen-Shannon divergence (via discriminator outputting probabilities)
WGAN(-GP): minimize Wasserstein distance (via critic outputting unbounded scalar, 1-Lipschitz)

JS saturates at log 2 when distributions are disjoint (no gradient). Wasserstein scales smoothly with geometric distance (always meaningful gradient).

W_1(p, q) = minimum cost to transport mass from p into the shape of q, where cost = mass × distance moved.

Point-mass example: p at x=0, q at x=3 → W_1 = 3 (move all mass 3 units). JS would give log 2 regardless of distance.
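The point-mass comparison can be checked numerically. The following is a minimal sketch, assuming distributions are represented as `{point: mass}` dictionaries; the helper names `kl` and `js_divergence` are hypothetical and chosen only for this illustration.

```python
import math

def kl(a, b):
    """KL(a || b) in nats for discrete distributions given as {point: mass} dicts."""
    return sum(mass * math.log(mass / b[x]) for x, mass in a.items() if mass > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their 50/50 mixture."""
    support = set(p) | set(q)
    mix = {x: 0.5 * p.get(x, 0.0) + 0.5 * q.get(x, 0.0) for x in support}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

p = {0.0: 1.0}  # point mass at x = 0
for d in (0.5, 3.0, 100.0):
    q = {d: 1.0}            # point mass at x = d
    w1 = abs(d - 0.0)       # for two point masses, W_1 is just the distance between them
    print(f"d={d:6.1f}  JS={js_divergence(p, q):.4f}  W_1={w1:.1f}")
```

JS stays at log 2 ≈ 0.6931 for every nonzero d, while W_1 grows with d, which is the geometric sensitivity the cheatsheet refers to.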

Kantorovich-Rubinstein duality (trainable form)

W_1(p, q) = sup over 1-Lipschitz f ( E_{x ~ p}[f(x)] - E_{x ~ q}[f(x)] )

A function is 1-Lipschitz if |f(x) − f(y)| ≤ ||x − y|| for all x, y (equivalently ||∇f(x)|| ≤ 1 almost everywhere).
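To make the duality concrete, the sketch below evaluates the dual objective for a few candidate functions on the point-mass example (p at x=0, q at x=3). It is illustrative only: the helper names, the candidate functions, and the test grid are assumptions, and the pairwise ratio is an empirical lower bound on the Lipschitz constant, not a proof.

```python
import itertools
import math

def lipschitz_estimate(f, points):
    """Largest observed |f(x) - f(y)| / |x - y| over all pairs (a lower bound on the true constant)."""
    return max(abs(f(x) - f(y)) / abs(x - y)
               for x, y in itertools.combinations(points, 2))

def dual_value(f, p, q):
    """E_{x~p}[f(x)] - E_{x~q}[f(x)] for discrete {point: mass} distributions."""
    return sum(m * f(x) for x, m in p.items()) - sum(m * f(x) for x, m in q.items())

p, q = {0.0: 1.0}, {3.0: 1.0}              # point-mass example, W_1 = 3
grid = [i / 10 for i in range(-50, 51)]    # hypothetical test points in [-5, 5]

candidates = {
    "f(x) = -x": lambda x: -x,         # 1-Lipschitz, attains the supremum for this example
    "f(x) = sin(x)": math.sin,         # 1-Lipschitz, but far from optimal
    "f(x) = -2x": lambda x: -2 * x,    # not 1-Lipschitz, so its value overshoots W_1
}
for name, f in candidates.items():
    print(name, round(lipschitz_estimate(f, grid), 3), round(dual_value(f, p, q), 3))
```

Among the 1-Lipschitz candidates, f(x) = −x reaches the value 3 = W_1. The function with slope 2 reports 6, which shows why the Lipschitz constraint cannot be dropped.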

| Network | What it does |
| --- | --- |
| Critic f (not “discriminator”) | Outputs unbounded scalar; maximizes E_{p_data}[f] − E_{p_G}[f] |
| Generator G | Minimizes the same quantity (move output mass closer to data) |

| Approach | How | Status |
| --- | --- | --- |
| Weight clipping (original WGAN) | Clip each critic weight to [-c, c] after each step | Worked, but reduced expressiveness; hyperparameter sensitive |
| Gradient penalty (WGAN-GP) | Add `λ · E_x̂ [( \|\|∇_x f(x̂)\|\| − 1 )²]` to the critic loss | |
| Spectral normalization (SN-GAN) | Normalize the spectral norm of each weight matrix | Alternative; some advantages for certain architectures |
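The two weight-based approaches in the table can be sketched on a single weight matrix stored as a list of rows. This is a toy illustration under stated assumptions: the function names are hypothetical, the matrix is assumed nonzero, and the power iteration is run many times from scratch for clarity instead of being amortized across training steps.

```python
import math

def clip_weights(weights, c):
    """Original-WGAN style: clamp every weight into [-c, c]."""
    return [[max(-c, min(c, w)) for w in row] for row in weights]

def spectral_norm(weights, iters=50):
    """Estimate the largest singular value by power iteration on W^T W."""
    n_rows, n_cols = len(weights), len(weights[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(w * vj for w, vj in zip(row, v)) for row in weights]            # u = W v
        v = [sum(weights[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]                                                # keep v unit length
    u = [sum(w * vj for w, vj in zip(row, v)) for row in weights]
    return math.sqrt(sum(x * x for x in u))                                      # ||W v|| for unit v

def spectral_normalize(weights):
    """Divide the matrix by its spectral norm so the linear map is at most 1-Lipschitz."""
    sigma = spectral_norm(weights)
    return [[w / sigma for w in row] for row in weights]

W = [[3.0, 0.0], [0.0, 1.0]]
print(clip_weights(W, 0.01))
print(spectral_norm(W), spectral_norm(spectral_normalize(W)))
```

Clipping squashes every entry toward the same small range, which hints at the expressiveness cost noted in the table. Spectral normalization rescales the whole matrix and preserves the relative structure of its entries.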

The gradient penalty is a SOFT constraint (it pushes gradient norm toward 1 at sample points). In practice this produces near-1-Lipschitz critics that train far more stably than weight-clipped ones.

critic loss = -[ E_{x ~ p_data}[ f(x) ] - E_{x ~ p_G}[ f(x) ] ]
+ λ · E_{x̂}[ ( ||∇_x f(x̂)|| - 1 )² ]

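with x̂ = α · x_real + (1 − α) · x_fake, α ~ Uniform(0, 1). The generator loss is the first bracket with its sign flipped, E_{x ~ p_data}[ f(x) ] − E_{x ~ p_G}[ f(x) ]; only the second term depends on G, so it reduces to −E_{x ~ p_G}[ f(x) ].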
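The loss above can be traced end to end on scalar samples. The sketch below is a toy under explicit assumptions: inputs are 1D, the gradient is approximated by central finite differences as a stand-in for a framework's automatic differentiation, and λ = 10 is an assumed default that the cheatsheet does not specify. All function names are hypothetical.

```python
import random

def grad_1d(f, x, eps=1e-5):
    """Central finite-difference derivative; stands in for autograd in this toy."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def critic_loss(f, real, fake, lam=10.0, rng=random):
    """-(E_real[f] - E_fake[f]) + lam * E_xhat[(|f'(xhat)| - 1)^2] for scalar samples."""
    mean = lambda xs: sum(xs) / len(xs)
    wasserstein_estimate = mean([f(x) for x in real]) - mean([f(x) for x in fake])
    penalties = []
    for x_real, x_fake in zip(real, fake):
        alpha = rng.random()                            # alpha ~ Uniform(0, 1)
        x_hat = alpha * x_real + (1 - alpha) * x_fake   # interpolated point
        penalties.append((abs(grad_1d(f, x_hat)) - 1.0) ** 2)
    return -wasserstein_estimate + lam * mean(penalties)

def generator_loss(f, fake):
    """Only the fake term depends on G, so the generator minimizes -E_fake[f]."""
    return -sum(f(x) for x in fake) / len(fake)

real, fake = [2.0, 3.0], [0.0, 1.0]
print(critic_loss(lambda x: x, real, fake))      # slope 1: penalty is ~0
print(critic_loss(lambda x: 2 * x, real, fake))  # slope 2: penalty is ~lam * 1
```

With a slope-1 critic the penalty vanishes and the loss is the negated Wasserstein estimate. With slope 2 the estimate doubles but the penalty term dominates, which is how the soft constraint pushes the critic back toward unit gradient norm.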

For 1D: W_1(p, q) = integral |F_p(x) − F_q(x)| dx.

Take p: mass 0.5 at x=0 and x=2. Take q: mass 0.5 at x=1 and x=3 (shifted by 1).

CDFs: F_p = 0 for x<0, 0.5 for 0≤x<2, 1 for x≥2; F_q = 0 for x<1, 0.5 for 1≤x<3, 1 for x≥3.

| Interval | \|F_p − F_q\| |
| --- | --- |
| x < 0 | 0 |
| 0 ≤ x < 1 | 0.5 |
| 1 ≤ x < 2 | 0 |
| 2 ≤ x < 3 | 0.5 |
| x ≥ 3 | 0 |

W_1 = 0.5·1 + 0.5·1 = 1. Matches Earth Mover’s intuition (move 0.5 mass 1 unit, twice).
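The CDF computation can be reproduced in a few lines. This is a minimal sketch, assuming discrete 1D distributions given as `{point: mass}` dictionaries with equal total mass; `w1_discrete` is a hypothetical helper name.

```python
def w1_discrete(p, q):
    """W_1 between 1D discrete distributions via the integral of |F_p - F_q|."""
    points = sorted(set(p) | set(q))
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)   # both CDFs are constant on [left, right)
        cdf_q += q.get(left, 0.0)
        total += abs(cdf_p - cdf_q) * (right - left)
    return total

p = {0: 0.5, 2: 0.5}
q = {1: 0.5, 3: 0.5}
print(w1_discrete(p, q))                 # 1.0
print(w1_discrete({0: 1.0}, {3: 1.0}))   # 3.0, the point-mass example
```

The loop visits the same intervals as the table: only the stretches where the two CDFs disagree contribute, each weighted by its length.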

JS for these same distributions: disjoint support → JS = log 2 regardless. Wasserstein wins on geometric sensitivity.

| Problem | Original GAN | WGAN-GP |
| --- | --- | --- |
| Vanishing gradients on disjoint distributions | Yes (JS pinned at log 2) | No (meaningful gradients) |
| No clean stopping criterion | Yes (oscillating losses) | Partial; critic loss correlates with sample quality |
| Mode collapse | Yes (JS dynamics) | Reduced (Wasserstein penalizes missing mass; not eliminated) |
| No likelihood | Yes | Yes (no fix) |
| Requires careful architecture | Yes | Yes (Lipschitz constraint must be respected) |
| Extra compute | (baseline) | Gradient penalty adds ∇_x f(x̂) computation per step |

WGAN-GP is the GAN family member that survived the post-GAN era, and it is the default starting point for production-grade adversarial training.

The §6 boundary from L7 carries through this lesson. Four distinct policy/governance forums sit outside this mechanical scope:

  • When generating synthetic faces/voices/video of identifiable people is appropriate vs not (use-case + consent)
  • Attribution and watermarking of synthesized content (provenance)
  • Sector-specific policies for journalism, politics, legal evidence (deployment)
  • IP and licensing claims around training data scraped from named sources (data-licensing)

The relevant evaluation methods for this lesson’s scope are: training stability (the critic’s Wasserstein estimate, the gradient norm on interpolated samples) and sample quality (FID, Inception Score, human preference studies; covered in lesson 9). If you are using those tools, you are in this lesson’s scope. If you are using policy-debate methods (stakeholder analysis, legal frameworks, regulatory comment), you are in a different conversation.

  • Calling the critic a “discriminator.” Critic outputs unbounded scalar (Wasserstein estimate), not probability.
  • Skipping the gradient penalty. Without Lipschitz enforcement, the duality form does not equal W_1. Use GP or spectral normalization.
  • Computing GP on the wrong points. Use the interpolated samples x̂, not real or fake samples alone.
  • Treating WGAN-GP as a complete fix. Better, not perfect. No likelihood. Still can collapse under poor settings.

WGAN-GP replaces Jensen-Shannon with the Wasserstein distance (Earth Mover’s, geometric, gives meaningful gradients on disjoint distributions) and enforces the required 1-Lipschitz critic constraint via a gradient penalty on interpolated samples; the result is the production-grade GAN variant.