GANs and VAEs: brief

What you’ll learn

This is lesson 11, part of Phase 3 (Generating and grounding vision) and the second lesson of the generative-modeling stretch. By the end you will be able to distinguish discriminative from generative modeling, describe the two pre-2020 generative-image-model families (VAEs and GANs) at intuition level along with their characteristic trade-off, apply the reparameterization trick by hand, and choose between VAE, GAN, diffusion, and self-supervised approaches for a given task. The source curriculum is Stanford CS231n, cs231n.stanford.edu; this lesson maps to Lecture 13 (Generative Models 1).

The lesson opens with the conceptual split between discriminative modeling, which learns P(label | image), and generative modeling, which learns P(image). It then walks through the VAE: the encoder maps an image to a distribution over latent codes, the reparameterization trick z = μ + σ · ε makes sampling differentiable, the decoder reconstructs the image, and the ELBO loss adds a reconstruction term to a KL-to-prior term. The resulting samples are smooth but blurry. Next it walks through the GAN: a generator G and a discriminator D are trained adversarially, which gives sharp samples but unstable training and no likelihood. The lesson closes by summarizing the trade-off and surveying vision use cases (image-to-image translation, super-resolution, inpainting, data augmentation, latent-space editing).
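
The "reconstruction + KL-to-prior" shape of the VAE loss can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation. It assumes a diagonal-Gaussian encoder output (μ, σ), a standard normal prior, and squared error as the reconstruction term. The function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - 1 - ln(sigma^2)).
    Assumes a standard normal prior (an assumption, not stated in the brief).
    """
    return sum(0.5 * (m * m + s * s - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))

def squared_error(x, x_hat):
    """Reconstruction term: sum of squared differences (assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def vae_loss(x, x_hat, mu, sigma):
    """Loss to minimize = reconstruction term + KL-to-prior term."""
    return squared_error(x, x_hat) + kl_to_standard_normal(mu, sigma)

# Toy numbers, chosen only for illustration.
x = [1.0, 0.0]          # "image" with two pixels
x_hat = [0.5, 0.0]      # decoder output
mu, sigma = [1.0, 0.0], [1.0, 1.0]

recon = squared_error(x, x_hat)            # 0.25
kl = kl_to_standard_normal(mu, sigma)      # 0.5 (only the shifted mean contributes)
total = vae_loss(x, x_hat, mu, sigma)      # 0.75
```

The KL term is zero exactly when the encoder outputs μ = 0 and σ = 1, which is one way to see why it pulls the latent distribution toward the prior.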

Where this fits

This is lesson 11 of 16, the second lesson of Phase 3. It builds on lesson 10 (self-supervised techniques, whose pretrained encoders are sometimes used as first-stage components in modern latent-diffusion systems). The next lesson, Generating images by denoising: diffusion, covers the technique that has largely replaced both VAEs and GANs at the high end since around 2020 and that underlies most of the well-known recent text-to-image systems.

Before you start

Prerequisites: lesson 10 of this track (self-supervised vision), which gives useful context for the encoder side of VAEs and for understanding how representations are learned without explicit labels. Lessons 3-4 (loss functions, gradient descent, and backpropagation) carry over; what changes is the loss formulation: the ELBO for VAEs and an adversarial min-max objective for GANs.
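
To make the adversarial min-max concrete, here is a minimal sketch of the per-sample losses in the standard GAN formulation. The values d_real and d_fake stand for the discriminator's outputs D(x) and D(G(z)), each a probability strictly between 0 and 1. The function names are hypothetical, and the non-saturating generator loss is a commonly used variant shown for comparison, not necessarily the one the lesson uses.

```python
import math

def discriminator_loss(d_real, d_fake):
    """D wants D(x) near 1 and D(G(z)) near 0: minimize -[log D(x) + log(1 - D(G(z)))]."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss_minimax(d_fake):
    """G's side of the min-max game: minimize log(1 - D(G(z)))."""
    return math.log(1.0 - d_fake)

def generator_loss_nonsaturating(d_fake):
    """Common practical variant (assumption): minimize -log D(G(z))."""
    return -math.log(d_fake)

# An undecided discriminator outputs 0.5 for everything.
undecided = discriminator_loss(0.5, 0.5)   # 2 * ln 2
# A confident, correct discriminator has a lower loss.
confident = discriminator_loss(0.9, 0.1)
# The generator does better the more D is fooled (higher d_fake).
fooled = generator_loss_nonsaturating(0.9)
caught = generator_loss_nonsaturating(0.1)
```

The two players pull in opposite directions on d_fake, which is the source of the training instability the lesson discusses.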

About the math

The math is light and stays at the level needed for the vision context. The body shows the reparameterization trick formula z = μ + σ · ε and works one numerical example by hand (μ = [0.5, -0.2], σ = [0.1, 0.3], ε = [0.5, -1.0] → z = [0.55, -0.5]). Practice repeats the computation with a 3-dimensional case (μ = [0.2, 0.7, -0.3], σ = [0.4, 0.1, 0.5], ε = [1.0, -0.5, 0.2] → z = [0.6, 0.65, -0.2]). The ELBO derivation and the GAN min-max convergence math are explicitly deferred to sister tracks (T19, T24); no calculus is required for this lesson.
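
The two hand computations above can be checked with a few lines of Python. This is a minimal sketch: the function name reparameterize is illustrative, and ε is fixed to the values given in the text rather than sampled (in training, ε would be drawn from a standard normal distribution).

```python
def reparameterize(mu, sigma, eps):
    """Return z = mu + sigma * eps, computed element by element."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# 2-dimensional example from the text.
z2 = reparameterize([0.5, -0.2], [0.1, 0.3], [0.5, -1.0])

# 3-dimensional practice case.
z3 = reparameterize([0.2, 0.7, -0.3], [0.4, 0.1, 0.5], [1.0, -0.5, 0.2])

# Round to two decimals to hide floating-point noise.
print([round(v, 2) for v in z2])  # [0.55, -0.5]
print([round(v, 2) for v in z3])  # [0.6, 0.65, -0.2]
```

Because the randomness lives entirely in ε, z is an ordinary differentiable function of μ and σ, which is what lets gradients flow back into the encoder.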

By the end, you’ll be able to

Distinguish discriminative from generative modeling
Describe the VAE architecture, the ELBO, and the reparameterization trick, and compute z by hand
Describe the GAN adversarial setup and its trade-offs
Choose VAE / GAN / diffusion / self-supervised for a given task
Recognize VAEs’ continuing production role as first-stage encoders in latent diffusion

Time and difficulty

Read time: about 14 minutes
Practice time: about 15 minutes (a fresh 3-dimensional reparameterization computation, a tool-choice exercise across 4 scenarios, a GAN-training-failure-mode reasoning question about mode collapse, plus flashcards)
Difficulty: standard (the math is multiplication and addition for reparameterization; the conceptual lift is holding both families’ trade-offs side by side and knowing when each is the right tool)