References: GAN training in practice, Wasserstein loss and gradient penalty

Source material

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 10: Generative Adversarial Networks (continued)
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (parallel framing where applicable; CS294-158's GAN lecture covers
WGAN-family briefly within the broader implicit-models lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 5: Generative Adversarial Networks / Implicit Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. Lecture 10 (GANs, continued) is the primary anchor; it covers the Wasserstein objective, the Kantorovich-Rubinstein duality, and the gradient-penalty derivation. The course notes at deepgenerativemodels.github.io/notes include the duality proof and the gradient-penalty motivation in more detail than the slides.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 5 (GANs / Implicit Models) covers the WGAN family alongside the original GAN in a single lecture; that side-by-side comparison is what it adds as a secondary source.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Wasserstein GAN” (Arjovsky, Chintala, Bottou, 2017). The original WGAN paper. Section 2 covers the Earth Mover’s distance and its advantages over JS; Section 3 covers the duality form and introduces the weight-clipping enforcement (which the next paper improves); Section 4 reports the empirical results. The introduction’s “geometric distance vs disjoint-support saturation” argument is the cleanest motivation you will find for swapping JS out.
“Improved Training of Wasserstein GANs” (Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville, 2017). The WGAN-GP paper. Replaces weight clipping with the gradient penalty; this is the recipe most production-grade WGAN implementations actually use. Section 4 derives the gradient-penalty form and explains why interpolated samples are the right evaluation points.
“Spectral Normalization for Generative Adversarial Networks” (Miyato, Kataoka, Koyama, Yoshida, 2018). An alternative Lipschitz-enforcement method (constrain the spectral norm of each weight matrix). Different mechanism from the gradient penalty, similar effect. Worth knowing about because some modern GAN-family architectures use spectral normalization in place of (or alongside) the gradient penalty.
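
The gradient-penalty recipe from the WGAN-GP entry above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes a hypothetical toy critic f(x) = v . tanh(W x) whose input gradient is written out by hand, so it runs on the Python standard library alone. The names `critic`, `critic_input_grad`, and `gradient_penalty` are made up for this sketch; the penalty coefficient of 10 is the default reported in the paper. Real implementations obtain the gradient from an automatic-differentiation library rather than a hand-derived formula.

```python
import math
import random


def critic(x, W, v):
    """Toy critic f(x) = v . tanh(W x). W is a list of rows, v a list of weights."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def critic_input_grad(x, W, v):
    """Hand-derived gradient of the toy critic with respect to its input x."""
    grad = [0.0] * len(x)
    for row, vi in zip(W, v):
        pre = sum(w * xj for w, xj in zip(row, x))
        scale = vi * (1.0 - math.tanh(pre) ** 2)  # derivative of tanh
        for j, w in enumerate(row):
            grad[j] += scale * w
    return grad


def gradient_penalty(real_batch, fake_batch, W, v, lam=10.0, rng=random):
    """Batch mean of lam * (||grad_x f(x_hat)||_2 - 1)^2 at interpolated points."""
    total = 0.0
    for real, fake in zip(real_batch, fake_batch):
        eps = rng.random()  # one mixing coefficient per real/fake pair
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
        grad_norm = math.sqrt(sum(g * g for g in critic_input_grad(x_hat, W, v)))
        total += lam * (grad_norm - 1.0) ** 2  # two-sided: penalize norms above or below 1
    return total / len(real_batch)


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0.5, -0.3], [0.2, 0.8]]  # illustrative weights, not trained
    v = [1.0, -0.7]
    real = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(4)]
    fake = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(4)]
    print("gradient penalty:", gradient_penalty(real, fake, W, v, rng=rng))
```

In a training loop this term would be added to the critic loss; the generator loss is unchanged. The penalty is evaluated only at the interpolated points, which is the design choice Section 4 of the paper argues for.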

Adjacent topics

Where this sits in the track.

GANs, the minimax game (previous lesson). L7 introduced the minimax framework with the original JS-divergence objective and named the paradigm-level pathologies (vanishing gradients, mode collapse, no stopping criterion). This lesson keeps the framework but changes the divergence, which addresses the first pathology (vanishing gradients) directly, the second (mode collapse) meaningfully, and the third (no stopping criterion) only partially. The improvement is incremental but practically large.
Evaluating generative models (next lesson, L9). Phase 2 closes with how to evaluate generative models when likelihood is available only as a bound (VAEs), unavailable (GANs), or just one of many possible quality measures. FID, Inception Score, and Precision/Recall for distributions are the standard tools. This lesson flagged FID and IS as the relevant evaluation methods for WGAN-GP; L9 develops them in full.
The four-paradigm map (lesson 1). This lesson sharpens the L1 placement of GANs by showing that the “implicit / no-likelihood” branch is itself parameterizable by which divergence you choose. JS (original GAN) and Wasserstein (WGAN-GP) are two answers to the divergence-choice question within the GAN family; other variants give further answers. Spectral normalization, by contrast, changes how the Lipschitz constraint is enforced rather than which divergence is used. The cross-paradigm map can be refined as “which divergence does this GAN use?” once you are inside the paradigm.
Lesson 14 (score-based diffusion via SDEs). The fact that “different divergence choices give different paradigms” appears again in Phase 3: diffusion can be viewed through the score-matching lens (an objective related to but distinct from forward KL) and through the SDE lens (a continuous-time view). Recognizing divergence-choice as a paradigm-design parameter, which this lesson sets up explicitly, makes the diffusion derivations easier to read.
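
The divergence-choice point that runs through this section can be made concrete with a small worked example. The sketch below compares JS and the 1-D Wasserstein-1 distance for two point masses on an integer grid, in the spirit of the disjoint-support argument cited above. The helper names (`js_divergence`, `wasserstein_1d`, `point_mass`) and the grid size are assumptions made for illustration; the Wasserstein computation relies on the 1-D identity that W1 equals the summed absolute difference of the two CDFs on a unit-spaced grid.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 wherever a_i > 0 here.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def wasserstein_1d(p, q):
    """W1 on the unit-spaced grid 0, 1, 2, ...: sum of absolute CDF differences."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def point_mass(position, size=8):
    """All probability on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]


if __name__ == "__main__":
    target = point_mass(0)
    for shift in (1, 3, 6):
        model = point_mass(shift)
        print(shift, js_divergence(target, model), wasserstein_1d(target, model))
```

For every nonzero shift the supports are disjoint, so JS stays at log 2 regardless of how far apart the masses are, while the Wasserstein distance equals the shift. That is the sense in which JS gives the generator no usable direction on disjoint supports and the Wasserstein distance does.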