GANs and VAEs: cheatsheet

Discriminative vs generative

Model type	Learns	Tasks
Discriminative	`P(label \| image)`; slices image space	Recognition tasks such as classification
Generative	`P(image)` or how to sample from it; describes image space	Synthesize new images, in/out-painting, super-resolution, latent editing

VAE (Variational Autoencoder)

Element	Detail
Architecture	Encoder → latent distribution `(μ, σ)` → sample z → Decoder → reconstruction
Sampling	`z ~ N(μ, σ²)`, made differentiable via reparameterization
Reparameterization trick	`z = μ + σ · ε` where `ε ~ N(0, I)` (parameter-free randomness)
Loss (negative ELBO)	Reconstruction (MSE / CE on decoded output) + KL(q(z\|x) \|\| N(0,I)), where q(z\|x) = N(μ, σ²) is the encoder distribution
Generation	Sample z ~ N(0,I), run decoder
Strength	Smooth/well-organized latent space; stable principled training
Weakness	Slightly blurry outputs (MSE averages plausible reconstructions)
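
As a rough illustration of the loss row above, the sketch below computes the two terms for toy numbers in plain Python. It assumes a diagonal-Gaussian encoder, for which the KL term has a closed form. The function names (`kl_to_standard_normal`, `mse`, `vae_loss`) and the toy inputs are hypothetical, and the relative scaling of the two terms is a modelling choice rather than something fixed by this cheatsheet.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction term plus KL term (the quantity minimized in training)."""
    return mse(x, x_hat) + kl_to_standard_normal(mu, sigma)

# Toy values standing in for encoder/decoder outputs (hypothetical).
x = [1.0, 0.0, 0.5]
x_hat = [0.9, 0.1, 0.5]
mu = [0.5, -0.2]
sigma = [0.1, 0.3]

# KL is zero when the encoder distribution already equals the prior N(0, I).
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
print(vae_loss(x, x_hat, mu, sigma))
```

The KL term pulls every encoded distribution toward N(0, I), which is what makes sampling z ~ N(0, I) at generation time land in a region the decoder has seen.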

GAN (Generative Adversarial Network)

Element	Detail
Architecture	Generator G: noise z → image; Discriminator D: image → P(real)
Training	Adversarial min-max: G fools D; D distinguishes real from fake
Equilibrium	G’s outputs indistinguishable from real; D outputs 0.5
Generation	Sample z, run G; one forward pass
Strength	Sharp, photorealistic outputs
Weakness	Unstable training (mode collapse, oscillation); no likelihood; no built-in encoder
Modern landmarks	DCGAN (stable recipe), StyleGAN (faces), BigGAN (class-conditional ImageNet)
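
To make the Training and Equilibrium rows concrete, here is a minimal sketch of the two losses evaluated on hand-picked discriminator outputs. The generator loss shown is the commonly used non-saturating variant, which is an assumption on my part and not something the table specifies. The function names are hypothetical, and no networks are trained here.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: real images labelled 1, generated images labelled 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: small when D scores fakes as real."""
    return -math.log(d_fake)

# Early in training: D confidently separates real from fake.
d_early = discriminator_loss(d_real=0.9, d_fake=0.1)
g_early = generator_loss(d_fake=0.1)
print(d_early, g_early)  # D's loss is small, G's loss is large

# Idealized equilibrium: D outputs 0.5 for every input.
d_eq = discriminator_loss(d_real=0.5, d_fake=0.5)
print(d_eq, 2 * math.log(2))  # the two numbers agree
```

At equilibrium the discriminator loss sits at 2·ln 2 instead of falling toward zero, which is one reason GAN loss curves cannot be read like supervised loss curves.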

Worked reparameterizations

Source	μ	σ	ε	z
Body	[0.5, -0.2]	[0.1, 0.3]	[0.5, -1.0]	[0.55, -0.5]
Practice	[0.2, 0.7, -0.3]	[0.4, 0.1, 0.5]	[1.0, -0.5, 0.2]	[0.6, 0.65, -0.2]

Trick: randomness in ε (no parameters); gradients flow through μ and σ.
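
The two rows above can be checked with a few lines of plain Python. This is a minimal sketch. `reparameterize` is a hypothetical helper name, and a real implementation would operate on tensors in an autodiff framework so that gradients reach μ and σ.

```python
import random

def reparameterize(mu, sigma, eps=None):
    """Return z = mu + sigma * eps elementwise; draw eps ~ N(0, 1) if not given."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Fixed eps values reproduce the table rows.
z_body = reparameterize([0.5, -0.2], [0.1, 0.3], [0.5, -1.0])
z_practice = reparameterize([0.2, 0.7, -0.3], [0.4, 0.1, 0.5], [1.0, -0.5, 0.2])

print([round(v, 2) for v in z_body])      # matches the Body row
print([round(v, 2) for v in z_practice])  # matches the Practice row

# Without eps, a fresh noise draw gives a different z on each call.
print(reparameterize([0.5, -0.2], [0.1, 0.3]))
```

Because z is a deterministic function of μ and σ once ε is drawn, the derivatives are simple: ∂z/∂μ = 1 and ∂z/∂σ = ε.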

The trade-off

Property	VAE	GAN
Output sharpness	Slightly blurry	Sharp/photorealistic
Latent space	Smooth, well-organized	Less structured by default (StyleGAN improved)
Training stability	Stable, principled (ELBO)	Unstable, requires engineering art
Likelihood	Approximate (ELBO bound)	None (sampler only)
Has encoder	Yes	No (separate inversion required)
Mode collapse risk	No	Yes

When to use what

Goal	Choice
Smooth interpolation / latent arithmetic	VAE-family
Maximum photorealism (latency OK)	Diffusion (next lesson) or modern GAN
Encode-only (no synthesis)	Self-supervised encoder (L10)
Single-pass on-device generation	VAE or GAN (diffusion’s iterative sampling is slower)
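
For the smooth-interpolation row, the sketch below shows what walking between two latent codes looks like. It assumes simple linear interpolation. The latent vectors are made-up values, and `lerp` is a hypothetical helper. In practice each intermediate z would be passed through a trained decoder, which is not included here.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Two latent codes, e.g. the encodings of two different images (made-up values).
z_a = [0.55, -0.5]
z_b = [-0.4, 0.8]

steps = 5
path = [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]

for z in path:
    print(z)  # each z would be decoded into one frame of the interpolation
```

A smooth, well-organized latent space is what makes the decoded frames along this path change gradually instead of jumping between unrelated images.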

Vision use cases

Use case	Family / example
Image-to-image translation	Pix2Pix, CycleGAN
Super-resolution	SRGAN, ESRGAN
Inpainting	Both VAE-family and GAN-based
Data augmentation by synthesis	Either; common in label-scarce domains (medical)
Latent-space semantic editing	StyleGAN’s structured latent; VAE-style interpolation

VAE-as-first-stage-encoder (modern reality)

Many production text-to-image systems use a VAE to compress images to a compact latent space, then run diffusion in that latent space (“latent diffusion”). The VAE never went away; it moved into a different layer of the stack.
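
To give a sense of how compact that latent space can be, here is a small arithmetic sketch. The 8× spatial downsampling and 4 latent channels are illustrative assumptions (they match the first-stage autoencoder used by Stable Diffusion v1), and other systems use different factors. The helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent grid for an image, under the assumed VAE settings."""
    return (height // downsample, width // downsample, latent_channels)

def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

image = (512, 512, 3)            # RGB image
latent = latent_shape(512, 512)  # latent grid the diffusion model works on

print(latent)                                  # (64, 64, 4)
print(num_values(image), num_values(latent))   # 786432 16384
print(num_values(image) / num_values(latent))  # 48.0
```

Under these assumptions the diffusion model processes 48 times fewer values per image than it would in pixel space.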

Pitfalls

Pitfall	Reality
Generative and discriminative models are interchangeable because both are deep networks	Different objectives, different training; one cannot do the other’s job well
GAN training = standard supervised training	It’s a min-max adversarial loop; loss values are not directly meaningful
VAEs are obsolete	For pure photorealism, diffusion wins, but VAEs remain ubiquitous as first-stage encoders in latent-diffusion systems
GAN = deepfake (only)	Deepfakes are one application; generative models have many neutral or beneficial uses

One-line takeaway

Discriminative models recognize; generative models imagine. VAEs (encoder-decoder with reparameterization trick + ELBO; smooth latent, blurry output) and GANs (adversarial G+D; sharp output, unstable training) were the first two ways the field learned to make networks imagine; diffusion (next lesson) is the modern default for high quality.