Cheatsheet: GAN training in practice, Wasserstein loss and gradient penalty
The core swap from L7
- Original GAN: minimize Jensen-Shannon divergence (via discriminator outputting probabilities)
- WGAN(-GP): minimize Wasserstein distance (via critic outputting unbounded scalar, 1-Lipschitz)

JS saturates at log 2 when distributions are disjoint (no gradient). Wasserstein scales smoothly with geometric distance (always meaningful gradient).
Earth Mover’s intuition
W_1(p, q) = minimum cost to transport mass from p into the shape of q, where cost = mass × distance moved.
Point-mass example: p at x=0, q at x=3 → W_1 = 3 (move all mass 3 units). JS would give log 2 regardless of distance.
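The point-mass contrast can be checked numerically. The following is a minimal sketch in plain Python; the helper names `js_divergence` and `w1_point_masses` are hypothetical, and distributions are represented as `{location: mass}` dicts purely for illustration.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete
    distributions given as {location: probability} dicts."""
    support = set(p) | set(q)

    def kl_to_mixture(a):
        # KL(a || m) where m is the 50/50 mixture of p and q
        total = 0.0
        for x in support:
            ax = a.get(x, 0.0)
            if ax > 0.0:
                m = 0.5 * (p.get(x, 0.0) + q.get(x, 0.0))
                total += ax * math.log(ax / m)
        return total

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

def w1_point_masses(a, b):
    """W_1 between a unit point mass at a and a unit point mass at b:
    all of the mass travels the distance |a - b|."""
    return abs(a - b)

# Move q further and further from p: JS stays pinned, W_1 keeps growing.
for d in (0.5, 3.0, 100.0):
    p = {0.0: 1.0}
    q = {d: 1.0}
    print(f"d={d}: JS={js_divergence(p, q):.4f}, W_1={w1_point_masses(0.0, d)}")
```

For every nonzero distance the JS value is log 2 ≈ 0.6931, while W_1 equals the distance itself, which is why only the latter gives the generator a signal about how far off it is.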
Kantorovich-Rubinstein duality (trainable form)
W_1(p, q) = sup over 1-Lipschitz f ( E_{x ~ p}[f(x)] - E_{x ~ q}[f(x)] )

A function is 1-Lipschitz if |f(x) − f(y)| ≤ ||x − y|| for all x, y (equivalently ||∇f(x)|| ≤ 1 almost everywhere).
| Network | What it does |
|---|---|
| Critic f (not “discriminator”) | Outputs unbounded scalar; maximizes E_{p_data}[f] − E_{p_G}[f] |
| Generator G | Minimizes the same quantity (move output mass closer to data) |
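To see the duality at work on the earlier point-mass example (p at x=0, q at x=3), the sketch below evaluates the dual objective for a handful of hand-picked 1-Lipschitz functions. The candidate set and the helper name `dual_objective` are assumptions made for illustration; a real critic searches over a neural-network family instead.

```python
def dual_objective(f, p, q):
    """E_{x~p}[f(x)] - E_{x~q}[f(x)] for discrete p, q given as {location: mass}."""
    e_p = sum(mass * f(x) for x, mass in p.items())
    e_q = sum(mass * f(x) for x, mass in q.items())
    return e_p - e_q

p = {0.0: 1.0}  # point mass at x = 0
q = {3.0: 1.0}  # point mass at x = 3

# Hypothetical candidates; each has slope magnitude at most 1, so each is 1-Lipschitz.
candidates = {
    "f(x) = -x": lambda x: -x,
    "f(x) = x": lambda x: x,
    "f(x) = -0.5x": lambda x: -0.5 * x,
    "f(x) = -|x - 1|": lambda x: -abs(x - 1.0),
    "f(x) = 0": lambda x: 0.0,
}

scores = {name: dual_objective(f, p, q) for name, f in candidates.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score}")
print("best:", best)
```

Among these candidates, f(x) = −x attains the value 3, which equals W_1. No 1-Lipschitz function can do better, because |f(0) − f(3)| ≤ 3 by the Lipschitz condition.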
Enforcing the 1-Lipschitz constraint
| Approach | How | Status |
|---|---|---|
| Weight clipping (original WGAN) | Clip each critic weight to [-c, c] after each step | Worked, but reduced expressiveness; hyperparameter sensitive |
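| Gradient penalty (WGAN-GP) | Add `λ · E_x̂[(‖∇_x f(x̂)‖ − 1)²]` to the critic loss, with x̂ interpolated between real and fake samples | Soft constraint; trains far more stably than weight clipping |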
| Spectral normalization (SN-GAN) | Normalize the spectral norm of each weight matrix | Alternative; some advantages for certain architectures |
The gradient penalty is a SOFT constraint (it pushes gradient norm toward 1 at sample points). In practice this produces near-1-Lipschitz critics that train far more stably than weight-clipped ones.
WGAN-GP critic loss
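critic loss = -[ E_{x ~ p_data}[ f(x) ] - E_{x ~ p_G}[ f(x) ] ] + λ · E_{x̂}[ ( ||∇_x f(x̂)|| - 1 )² ]

with x̂ = α · x_real + (1 − α) · x_fake, α ~ Uniform(0, 1).

Generator loss is the bracketed term with the critic's leading minus sign removed: E_{x ~ p_data}[ f(x) ] − E_{x ~ p_G}[ f(x) ]. Only the second expectation depends on G, so in practice the generator minimizes −E_{x ~ p_G}[ f(x) ].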
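The critic loss above might be written in PyTorch roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the function names are hypothetical, inputs are assumed to be batches with the batch dimension first, and `lam=10.0` is a commonly used default that this cheatsheet does not itself specify.

```python
import torch
from torch import nn

def critic_loss_wgan_gp(critic, x_real, x_fake, lam=10.0):
    """WGAN-GP critic loss for batches shaped (batch, ...).

    Returns (loss, wasserstein_estimate, mean_grad_norm); the last two are
    detached and intended for logging only.
    """
    x_real = x_real.detach()
    x_fake = x_fake.detach()  # critic step: no gradient flows into the generator

    # Critic's current estimate of E_{p_data}[f] - E_{p_G}[f]
    w_est = critic(x_real).mean() - critic(x_fake).mean()

    # One alpha per sample, broadcast over the remaining dimensions
    alpha_shape = [x_real.size(0)] + [1] * (x_real.dim() - 1)
    alpha = torch.rand(alpha_shape, device=x_real.device)
    x_hat = (alpha * x_real + (1.0 - alpha) * x_fake).requires_grad_(True)

    # Gradient of the critic output with respect to the interpolated inputs.
    # create_graph=True keeps the penalty differentiable w.r.t. critic weights.
    (grads,) = torch.autograd.grad(
        outputs=critic(x_hat).sum(),
        inputs=x_hat,
        create_graph=True,
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()

    loss = -w_est + lam * penalty
    return loss, w_est.detach(), grad_norm.mean().detach()

def generator_loss(critic, x_fake):
    """Generator minimizes -E_{p_G}[f]; here x_fake must NOT be detached."""
    return -critic(x_fake).mean()
```

A quick sanity check is a linear critic such as `nn.Linear(4, 1, bias=False)`: its input gradient is the weight vector everywhere, so the penalty reduces to (‖w‖ − 1)² regardless of which points are sampled. Logging the returned Wasserstein estimate and mean gradient norm gives the two training-stability signals most often watched during a run.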
Worked numerical: W_1 via CDF (1D)
For 1D: W_1(p, q) = integral |F_p(x) − F_q(x)| dx.
Take p: mass 0.5 at each of x=0 and x=2. Take q: mass 0.5 at each of x=1 and x=3 (p shifted right by 1).
CDFs: F_p = 0 for x<0, 0.5 for 0≤x<2, 1 for x≥2; F_q = 0 for x<1, 0.5 for 1≤x<3, 1 for x≥3.
| Interval | \|F_p − F_q\| |
| --- | --- |
| x < 0 | 0 |
| 0 ≤ x < 1 | 0.5 |
| 1 ≤ x < 2 | 0 |
| 2 ≤ x < 3 | 0.5 |
| x ≥ 3 | 0 |
W_1 = 0.5·1 + 0.5·1 = 1. Matches Earth Mover’s intuition (move 0.5 mass 1 unit, twice).
JS for these same distributions: disjoint support → JS = log 2 regardless. Wasserstein wins on geometric sensitivity.
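The CDF calculation above can be reproduced in a few lines of plain Python. This is a minimal sketch for discrete 1D distributions only; the function name `w1_discrete_1d` and the `{location: mass}` representation are illustrative choices.

```python
def w1_discrete_1d(p, q):
    """W_1 between two 1D discrete distributions {location: mass},
    computed as the integral of |F_p(x) - F_q(x)| dx."""
    points = sorted(set(p) | set(q))
    total = 0.0
    cdf_p = 0.0
    cdf_q = 0.0
    # Both CDFs are constant between consecutive support points,
    # so the integral is a sum of rectangle areas.
    for left, right in zip(points, points[1:]):
        cdf_p += p.get(left, 0.0)
        cdf_q += q.get(left, 0.0)
        total += abs(cdf_p - cdf_q) * (right - left)
    return total

p = {0.0: 0.5, 2.0: 0.5}
q = {1.0: 0.5, 3.0: 0.5}
print(w1_discrete_1d(p, q))  # 1.0
```

The same function returns 3.0 for the earlier point-mass pair (`{0.0: 1.0}` versus `{3.0: 1.0}`), matching the Earth Mover's reading.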
What WGAN-GP fixes vs what it does not
| Problem | Original GAN | WGAN-GP |
|---|---|---|
| Vanishing gradients on disjoint distributions | Yes (JS pinned at log 2) | No (gradients stay meaningful) |
| No clean stopping criterion | Yes (oscillating losses) | Partial; critic loss correlates with sample quality |
| Mode collapse | Yes (JS dynamics) | Reduced (Wasserstein penalizes missing mass; not eliminated) |
| No likelihood | Yes | Yes (no fix) |
| Requires careful architecture | Yes | Yes (Lipschitz constraint must be respected) |
| Extra compute | (baseline) | Gradient penalty adds ∇_x f(x̂) computation per step |
WGAN-GP is the GAN-family variant that stayed in use after the original GAN era, and it is the default starting point for production-grade adversarial training.
A note on what this lesson does NOT cover
The §6 boundary from L7 carries through this lesson. Four distinct policy/governance forums sit outside this mechanical scope:
- When generating synthetic faces/voices/video of identifiable people is appropriate vs not (use-case + consent)
- Attribution and watermarking of synthesized content (provenance)
- Sector-specific policies for journalism, politics, legal evidence (deployment)
- IP and licensing claims around training data scraped from named sources (data-licensing)
The relevant evaluation methods for this lesson’s scope are: training stability (the critic’s Wasserstein estimate, the gradient norm on interpolated samples) and sample quality (FID, Inception Score, human preference studies; covered in lesson 9). If you are using those tools, you are in this lesson’s scope. If you are using policy-debate methods (stakeholder analysis, legal frameworks, regulatory comment), you are in a different conversation.
Pitfalls to dodge
- Calling the critic a “discriminator.” Critic outputs unbounded scalar (Wasserstein estimate), not probability.
- Skipping the gradient penalty. Without Lipschitz enforcement, the duality form does not equal W_1. Use GP or spectral normalization.
- Computing GP on wrong points. Interpolated samples x̂, not real or fake alone.
- Treating WGAN-GP as a complete fix. Better, not perfect. No likelihood. Still can collapse under poor settings.
The one-line version
WGAN-GP replaces Jensen-Shannon with the Wasserstein distance (Earth Mover’s, geometric, gives meaningful gradients on disjoint distributions) and enforces the required 1-Lipschitz critic constraint via a gradient penalty on interpolated samples; the result is the production-grade GAN variant.