| Model type | Learns | Tasks |
|---|---|---|
| Discriminative | `P(label \| image)`; slices image space | Recognition, e.g. classification |
| Generative | P(image) or how to sample from it; describes image space | Synthesize new images, in/out-painting, super-resolution, latent editing |

| Element | Detail |
|---|---|
| Architecture | Encoder → latent distribution (μ, σ) → sample z → Decoder → reconstruction |
| Sampling | z ~ N(μ, σ²), made differentiable via reparameterization |
| Reparameterization trick | z = μ + σ · ε where ε ~ N(0, I) (parameter-free randomness) |
| Loss (negative ELBO) | Reconstruction (MSE / CE on decoded output) + KL(encoder posterior ‖ N(0, I)) |
| Generation | Sample z ~ N(0,I), run decoder |
| Strength | Smooth/well-organized latent space; stable principled training |
| Weakness | Slightly blurry outputs (MSE averages plausible reconstructions) |
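
The loss row above can be made concrete with a small sketch. Assuming a diagonal-Gaussian encoder and an MSE reconstruction term, the KL term has the closed form 0.5 · Σ(μ² + σ² − 1 − ln σ²). The function names and the `beta` weight are illustrative assumptions, not any particular library's API, and real implementations differ in how they scale the two terms (sum vs. mean).

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mu, sigma, beta=1.0):
    """Reconstruction + beta * KL; beta is a hypothetical weight (1.0 = plain sum)."""
    return mse(x, x_hat) + beta * kl_to_standard_normal(mu, sigma)

# The KL term is zero exactly when the encoder outputs the prior N(0, I).
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

Any encoder output that drifts away from μ = 0, σ = 1 pays a positive KL penalty, which is what keeps the latent space organized around the prior.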

| Element | Detail |
|---|---|
| Architecture | Generator G: noise z → image; Discriminator D: image → P(real) |
| Training | Adversarial min-max: G fools D; D distinguishes real from fake |
| Equilibrium | G’s outputs indistinguishable from real; D outputs 0.5 |
| Generation | Sample z, run G; one forward pass |
| Strength | Sharp, photorealistic outputs |
| Weakness | Unstable training (mode collapse, oscillation); no likelihood; no built-in encoder |
| Modern landmarks | DCGAN (stable recipe), StyleGAN (faces), BigGAN (class-conditional ImageNet) |
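
The training and equilibrium rows above can be illustrated with the per-sample losses. This is a minimal sketch using binary cross-entropy and the common non-saturating generator loss, which is a practical variant rather than the literal min-max objective; the function names are assumptions made for this example.

```python
import math

def bce(p, target):
    """Binary cross-entropy for one predicted probability p in (0, 1)."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def discriminator_loss(d_real, d_fake):
    """D wants D(real) -> 1 and D(G(z)) -> 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """Non-saturating generator loss: G wants D(G(z)) -> 1."""
    return bce(d_fake, 1.0)

# At the idealized equilibrium D outputs 0.5 for every input.
print(round(discriminator_loss(0.5, 0.5), 4))  # 1.3863 (= 2 ln 2)
print(round(generator_loss(0.5), 4))           # 0.6931 (= ln 2)
```

Neither loss reaches zero at equilibrium, which is one reason GAN loss curves are hard to read as a progress signal.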

| Source | μ | σ | ε | z |
|---|---|---|---|---|
| Body | [0.5, -0.2] | [0.1, 0.3] | [0.5, -1.0] | [0.55, -0.5] |
| Practice | [0.2, 0.7, -0.3] | [0.4, 0.1, 0.5] | [1.0, -0.5, 0.2] | [0.6, 0.65, -0.2] |

Trick: the randomness lives in ε, which has no parameters, so gradients flow through μ and σ.
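
The two worked rows in the table can be checked with a few lines of plain Python. This is a sketch only: the helper name `reparameterize` and its optional `eps` argument are assumptions for illustration, and ε is fixed here so the output is reproducible.

```python
import random

def reparameterize(mu, sigma, eps=None):
    """Return z = mu + sigma * eps elementwise; draw eps ~ N(0, I) if not given."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Rows from the table, with eps supplied explicitly.
z_body = reparameterize([0.5, -0.2], [0.1, 0.3], [0.5, -1.0])
z_practice = reparameterize([0.2, 0.7, -0.3], [0.4, 0.1, 0.5], [1.0, -0.5, 0.2])
print([round(v, 2) for v in z_body])      # [0.55, -0.5]
print([round(v, 2) for v in z_practice])  # [0.6, 0.65, -0.2]
```

Because ε enters only as a multiplier, z is an ordinary differentiable function of μ and σ once ε has been drawn.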

| Property | VAE | GAN |
|---|---|---|
| Output sharpness | Blurry-ish | Sharp/photorealistic |
| Latent space | Smooth, well-organized | Less structured by default (StyleGAN improved) |
| Training stability | Stable, principled (ELBO) | Unstable, requires engineering art |
| Likelihood | Approximate (ELBO bound) | None (sampler only) |
| Has encoder | Yes | No (separate inversion required) |
| Mode collapse risk | No | Yes |

| Goal | Choice |
|---|---|
| Smooth interpolation / latent arithmetic | VAE-family |
| Maximum photorealism (latency OK) | Diffusion (next lesson) or modern GAN |
| Encode-only (no synthesis) | Self-supervised encoder (L10) |
| Single-pass on-device generation | VAE or GAN (diffusion’s iterative sampling is slower) |
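
The "smooth interpolation" goal in the table amounts to walking a straight line between two latent codes and decoding each point. The sketch below shows only the latent-space part; the vectors are hypothetical values and the decoder is not shown.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors, with t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]

z_a = [0.55, -0.5]   # hypothetical latent code of image A
z_b = [-0.45, 1.5]   # hypothetical latent code of image B

# Five evenly spaced latent codes from A to B, endpoints included.
path = [lerp(z_a, z_b, k / 4) for k in range(5)]
# Each point on `path` would be passed to a decoder (not shown) to render a frame.
```

In a well-organized latent space the decoded frames change gradually along the path, which is why the VAE family suits this goal.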

| Use case | Family / example |
|---|---|
| Image-to-image translation | Pix2Pix, CycleGAN |
| Super-resolution | SRGAN, ESRGAN |
| Inpainting | Both VAE-family and GAN-based |
| Data augmentation by synthesis | Either; common in label-scarce domains (medical) |
| Latent-space semantic editing | StyleGAN’s structured latent; VAE-style interpolation |

Many production text-to-image systems use a VAE to compress images to a compact latent space, then run diffusion in that latent space (“latent diffusion”). The VAE never went away; it moved into a different layer of the stack.
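
To give a feel for what "compact" means here, the sketch below computes the size reduction for one assumed configuration: 8× spatial downsampling and 4 latent channels. These defaults are illustrative assumptions in the style of commonly described latent-diffusion setups; actual systems vary, and the function names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent grid shape for an image, under the assumed VAE settings."""
    return (height // downsample, width // downsample, latent_channels)

def compression_ratio(height, width, image_channels=3, **vae):
    """Number of image values divided by number of latent values."""
    h, w, c = latent_shape(height, width, **vae)
    return (height * width * image_channels) / (h * w * c)

print(latent_shape(512, 512))       # (64, 64, 4)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions the diffusion model operates on 48× fewer values than the raw pixels, which is the main practical payoff of putting a VAE in front of it.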

| Pitfall | Reality |
|---|---|
| Generative and discriminative models are interchangeable because both are deep networks | Different objectives, different training; one cannot do the other’s job well |
| GAN training = standard supervised training | It’s a min-max adversarial loop; loss values are not directly meaningful |
| VAEs are obsolete | For pure photorealism, diffusion wins, but VAEs remain ubiquitous as first-stage encoders in latent-diffusion systems |
| GAN = deepfake (only) | Deepfakes are one application; generative models have many neutral or beneficial uses |

Discriminative models recognize; generative models imagine. VAEs (encoder-decoder with reparameterization trick + ELBO; smooth latent, blurry output) and GANs (adversarial G+D; sharp output, unstable training) were the first two ways the field learned to make networks imagine; diffusion (next lesson) is the modern default for high quality.