Cheatsheet: GANs and VAEs

| Model type | Learns | Tasks |
| --- | --- | --- |
| Discriminative | `P(label \| image)`; slices image space | Recognition (e.g., classification) |
| Generative | `P(image)` or how to sample from it; describes image space | Synthesize new images, in/out-painting, super-resolution, latent editing |

| VAE element | Detail |
| --- | --- |
| Architecture | Encoder → latent distribution (μ, σ) → sample z → Decoder → reconstruction |
| Sampling | z ~ N(μ, σ²), made differentiable via reparameterization |
| Reparameterization trick | z = μ + σ · ε where ε ~ N(0, I) (parameter-free randomness) |
| Loss (negative ELBO) | Reconstruction (MSE / CE on decoded output) + KL-divergence(encoder ‖ N(0, I)) |
| Generation | Sample z ~ N(0, I), run decoder |
| Strength | Smooth, well-organized latent space; stable, principled training |
| Weakness | Slightly blurry outputs (MSE averages plausible reconstructions) |

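
The loss row above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training-ready implementation: it assumes a diagonal Gaussian encoder, uses the standard closed form for the KL term against N(0, I), and picks MSE for reconstruction. The function names are made up for this example.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - 1 - ln(sigma^2)).
    """
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction term plus KL term (the quantity a VAE minimizes)."""
    return mse(x, x_hat) + kl_to_standard_normal(mu, sigma)

# The KL term vanishes only when the encoder outputs exactly N(0, I).
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

The KL term pulls every encoded distribution toward N(0, I), which is why sampling z ~ N(0, I) at generation time lands in a region the decoder has seen.
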
| GAN element | Detail |
| --- | --- |
| Architecture | Generator G: noise z → image; Discriminator D: image → P(real) |
| Training | Adversarial min-max: G fools D; D distinguishes real from fake |
| Equilibrium | G’s outputs indistinguishable from real; D outputs 0.5 |
| Generation | Sample z, run G; one forward pass |
| Strength | Sharp, photorealistic outputs |
| Weakness | Unstable training (mode collapse, oscillation); no likelihood; no built-in encoder |
| Modern landmarks | DCGAN (stable recipe), StyleGAN (faces), BigGAN (class-conditional ImageNet) |

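
The training and equilibrium rows above can be made concrete with the two losses written out on D's output probabilities. This is a sketch under assumptions: the function names are illustrative, inputs are assumed to lie strictly between 0 and 1, and the generator uses the common non-saturating variant (maximize log D(G(z))) rather than the literal min-max form.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: push D(real) toward 1 and D(fake) toward 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# At the ideal equilibrium D outputs 0.5 for every input.
at_equilibrium = [0.5, 0.5, 0.5, 0.5]
print(discriminator_loss(at_equilibrium, at_equilibrium))  # 2 * ln 2, about 1.386
print(generator_loss(at_equilibrium))                      # ln 2, about 0.693
```

Neither loss goes to zero at equilibrium, which is one reason GAN loss curves cannot be read like a supervised loss curve.
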
| Source | μ | σ | ε | z |
| --- | --- | --- | --- | --- |
| Body | [0.5, -0.2] | [0.1, 0.3] | [0.5, -1.0] | [0.55, -0.5] |
| Practice | [0.2, 0.7, -0.3] | [0.4, 0.1, 0.5] | [1.0, -0.5, 0.2] | [0.6, 0.65, -0.2] |

Trick: all the randomness lives in ε, which has no learnable parameters, so gradients flow through μ and σ.
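
The two worked rows above can be reproduced with a short plain-Python sketch of z = μ + σ · ε. The function name and the optional `eps` argument are illustrative choices for this example; passing `eps` explicitly just makes the result deterministic so it can be checked against the table.

```python
import random

def reparameterize(mu, sigma, eps=None):
    """Return z = mu + sigma * eps, elementwise.

    If eps is not given, draw it from N(0, 1) per dimension. The noise comes
    from a fixed distribution, so mu and sigma stay on a differentiable path.
    """
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

z_body = reparameterize([0.5, -0.2], [0.1, 0.3], eps=[0.5, -1.0])
z_practice = reparameterize([0.2, 0.7, -0.3], [0.4, 0.1, 0.5], eps=[1.0, -0.5, 0.2])

print([round(v, 2) for v in z_body])      # [0.55, -0.5]
print([round(v, 2) for v in z_practice])  # [0.6, 0.65, -0.2]
```
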

| Property | VAE | GAN |
| --- | --- | --- |
| Output sharpness | Blurry-ish | Sharp/photorealistic |
| Latent space | Smooth, well-organized | Less structured by default (StyleGAN improved) |
| Training stability | Stable, principled (ELBO) | Unstable, requires engineering art |
| Likelihood | Approximate (ELBO bound) | None (sampler only) |
| Has encoder | Yes | No (separate inversion required) |
| Mode collapse risk | No | Yes |

| Goal | Choice |
| --- | --- |
| Smooth interpolation / latent arithmetic | VAE-family |
| Maximum photorealism (latency OK) | Diffusion (next lesson) or modern GAN |
| Encode-only (no synthesis) | Self-supervised encoder (L10) |
| Single-pass on-device generation | VAE or GAN (diffusion’s iterative sampling is slower) |

| Use case | Family / example |
| --- | --- |
| Image-to-image translation | Pix2Pix, CycleGAN |
| Super-resolution | SRGAN, ESRGAN |
| Inpainting | Both VAE-family and GAN-based |
| Data augmentation by synthesis | Either; common in label-scarce domains (medical) |
| Latent-space semantic editing | StyleGAN’s structured latent; VAE-style interpolation |
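
Latent interpolation, mentioned in the last row above, amounts to walking a straight line between two latent codes and decoding each point. The sketch below shows only the latent-side arithmetic; the helper names are illustrative, and the decoder that would turn each point into an image is assumed to exist elsewhere.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors; t=0 gives z_a, t=1 gives z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

def interpolation_path(z_a, z_b, steps=5):
    """Evenly spaced latent codes from z_a to z_b, endpoints included (steps >= 2)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]

path = interpolation_path([0.0, 0.0], [1.0, -2.0], steps=5)
print(path[2])  # midpoint: [0.5, -1.0]
# Each entry of `path` would be passed to a (hypothetical) decoder to render one frame.
```

A smooth latent space is what makes the intermediate points decode to plausible images rather than artifacts.
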

VAE-as-first-stage-encoder (modern reality)

Many production text-to-image systems use a VAE to compress images to a compact latent space, then run diffusion in that latent space (“latent diffusion”). The VAE never went away; it moved into a different layer of the stack.
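
To get a feel for how much the first-stage VAE compresses, here is a small shape calculation. The numbers are assumptions chosen for illustration (a 512×512 RGB image, 8× spatial downsampling, 4 latent channels), not a description of any specific system, and the function names are made up for this example.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (H, W, C) of the latent grid for an image of the given size.

    downsample and latent_channels are illustrative defaults, not fixed constants.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, latent_channels)

def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the pixel grid."""
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return (height * width * image_channels) / (h * w * c)

print(latent_shape(512, 512))       # (64, 64, 4)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions the diffusion model works on 48 times fewer values per image than it would in pixel space, which is the practical motivation for running diffusion in the latent space.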

| Pitfall | Reality |
| --- | --- |
| Generative = discriminative just because both are deep | Different objectives, different training; one cannot do the other’s job well |
| GAN training = standard supervised training | It’s a min-max adversarial loop; loss values are not directly meaningful |
| VAEs are obsolete | For pure photorealism, diffusion wins, but VAEs are still ubiquitous as first-stage encoders in latent-diffusion systems |
| GAN = deepfake (only) | Deepfakes are one application; generative models have many neutral or beneficial uses |

Discriminative models recognize; generative models imagine. VAEs (encoder-decoder trained with the reparameterization trick and the ELBO; smooth latent space, blurry output) and GANs (adversarial generator plus discriminator; sharp output, unstable training) were the first two ways the field learned to make networks imagine. Diffusion (next lesson) is the modern default for high-quality generation.