Summary: GANs and VAEs

Every architecture in this track so far has been discriminative (image in, label out). This lesson opens the generative side: produce a new image given some signal (random noise, a class, a caption). Two pre-2020 families: Variational Autoencoders (encoder maps image to latent distribution; decoder reconstructs; reparameterization trick z = μ + σ · ε makes sampling differentiable; ELBO loss = reconstruction + KL-to-prior; smooth latent space, slightly blurry outputs) and Generative Adversarial Networks (generator G maps noise to image, discriminator D classifies real-vs-fake, trained adversarially; sharp outputs, unstable training, no likelihood). Neither is perfect, which is why diffusion (next lesson) has largely replaced both at the high end since 2020. The full math derivations live in sister tracks (T19 for ELBO, T24 for GAN dynamics); this lesson stays at vision-applied intuition.
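For reference, the two objectives named above are usually written as follows. The notation is an assumption rather than something this lesson defines: q_φ(z | x) is the encoder, p_θ(x | z) the decoder, p(z) = N(0, I) the prior, and p_data the distribution of real images. The first line is the VAE training loss (the negative ELBO: reconstruction plus KL-to-prior); the second is the original GAN min-max game.

```latex
\mathcal{L}_{\text{VAE}}(x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```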

  • Discriminative vs generative. Discriminative learns P(label | image) and slices up image space (everything before Phase 3). Generative learns P(image) or how to sample from it; can synthesize new plausible images.
  • VAE (Kingma & Welling 2014). Encoder → distribution (μ, σ); sample z ~ N(μ, σ²); decoder → reconstruction. Reparameterization trick z = μ + σ · ε (with ε ~ N(0, I)) makes sampling differentiable so gradients flow back to μ, σ. Training loss is ELBO = reconstruction + KL-to-prior; the KL term keeps the latent space well-organized. Generate by sampling z from N(0, I), running decoder.
  • GAN (Goodfellow et al. 2014). Generator G: noise → image. Discriminator D: image → real-vs-fake probability. Trained adversarially: G fools D; D distinguishes. No encoder by default; no likelihood; sample by running G on random z.
  • The trade-off. VAE: smooth/well-organized latent (good for interpolation, latent arithmetic), stable principled training, slightly blurry outputs. GAN: sharp/photorealistic, unstable training (mode collapse, oscillation), no likelihood. Modern GAN landmarks: DCGAN, StyleGAN, BigGAN.
  • Worked reparameterization (body): μ = [0.5, -0.2], σ = [0.1, 0.3], ε = [0.5, -1.0] → z = μ + σ · ε = [0.55, -0.5]. Practice extends with 3-dim case → z = [0.6, 0.65, -0.2]. Reparameterization is short, elegant, and the reason VAEs train at all.
  • Vision use cases. Image-to-image translation (Pix2Pix, CycleGAN), super-resolution (SRGAN), inpainting, data augmentation by synthesis, latent-space semantic editing (smile/no-smile, age progression via StyleGAN). Many still in production for cost/latency reasons even as diffusion has displaced them for novel-image generation quality.
  • The training loop is unchanged in spirit; what changes is the loss (ELBO for VAEs; adversarial min-max for GANs) and the architecture (encoder-decoder vs two-network adversarial setup). Gradient descent + backprop carry the gradients through both.
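The worked reparameterization and the two losses from the bullets above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a training implementation: the function names are hypothetical, the reconstruction term is assumed to be squared error, the encoder is assumed to output a diagonal Gaussian, and the generator uses the common non-saturating loss rather than the literal min-max form.

```python
import math
import random


def reparameterize(mu, sigma, eps=None):
    """z = mu + sigma * eps, elementwise; eps ~ N(0, 1) if not supplied."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def vae_loss(x, x_hat, mu, sigma):
    """Negative ELBO: reconstruction (assumed squared error) + KL-to-prior."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, sigma)


def gan_losses(d_real, d_fake):
    """Losses for one real/fake pair.

    d_real = D(x) and d_fake = D(G(z)) are probabilities in (0, 1).
    """
    # Discriminator: push D(x) toward 1 and D(G(z)) toward 0.
    d_loss = -(math.log(d_real) + math.log(1.0 - d_fake))
    # Generator (non-saturating variant, an assumption): push D(G(z)) toward 1.
    g_loss = -math.log(d_fake)
    return d_loss, g_loss


# Worked example from the lesson body.
z = reparameterize([0.5, -0.2], [0.1, 0.3], eps=[0.5, -1.0])
print([round(v, 4) for v in z])  # [0.55, -0.5]
```

Because ε is sampled outside the computation of μ and σ, z is an ordinary differentiable function of both, which is what lets backprop reach the encoder. The KL term is zero exactly when μ = 0 and σ = 1, so it pulls every encoded distribution toward the prior that generation later samples from.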

When you see a system that generates novel images (photorealistic synthetic faces, sketch-to-photo translations, super-resolved satellite imagery, face-swap demos), it is most likely one of this lesson’s two families or a diffusion model from the next. The VAE-vs-GAN trade-off is a real engineering choice: smoothness and interpretability of latent space (semantic editing, interpolation, controlled generation) → VAE-family. Maximum photorealism, latency-tolerant → GAN-family or diffusion. Encode-only (no synthesis needed) → self-supervised encoders from lesson 10. VAEs also remain load-bearing as first-stage encoders in latent-diffusion architectures (the popular text-to-image systems often use a VAE to compress images first, then run diffusion in the compressed latent space). The VAE never went away; it moved into a different layer of the stack.
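The latent-space interpolation mentioned above amounts to walking a straight line between two latent codes and decoding each point. A minimal sketch, assuming plain Python lists as latent vectors; `lerp` and `interpolation_path` are hypothetical names, and the decoder that would turn each point into an image is not shown.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Evenly spaced latent codes from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Five latent codes between two 2-dimensional endpoints.
path = interpolation_path([0.0, 1.0], [1.0, -1.0], steps=5)
```

In a smooth, well-organized latent space, decoding consecutive points on `path` should give a gradual visual transition. That smoothness is the property the KL term is meant to encourage in a VAE.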

Discriminative models recognize; generative models imagine. VAEs and GANs were the first two ways the field learned to make networks imagine; diffusion (next lesson) is the third and the modern default for high quality.