Cheatsheet: VAE training in practice, the reparameterization trick

| Component | Output | Role |
| --- | --- | --- |
| Prior | p(z) = N(0, I) (fixed) | Standard Gaussian over d-dim latent |
| Encoder | q(z\|x) = N(μ_x, diag σ²_x) | Two vectors μ_x, log σ²_x from network |
| Decoder | p(x\|z) (Bernoulli / Gaussian / softmax) | Distribution parameters from network |

Predict log σ² (not σ²); exponentiate for positivity.
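
A minimal sketch of why the log-variance parameterization is convenient: any real number the network outputs maps to a strictly positive variance after exponentiation, and σ follows from halving the log-variance first. The function names are illustrative and not taken from any library.

```python
import math

def variance_from_log_var(log_var):
    # exp(.) is positive for every real input, so no clamping is needed.
    return math.exp(log_var)

def sigma_from_log_var(log_var):
    # sigma = sqrt(sigma^2) = exp(0.5 * log sigma^2)
    return math.exp(0.5 * log_var)

# Unconstrained network outputs, including negative ones, all give valid variances.
for raw_output in (-10.0, 0.0, 3.0):
    assert variance_from_log_var(raw_output) > 0.0
```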

Problem: z ~ N(μ_x, σ²_x) is stochastic, so ordinary backprop cannot flow through the sampler.

Solution: write the sample as a deterministic function of the encoder parameters and an independent noise variable.

ε ~ N(0, I)          (sampled OUTSIDE the network)
z = μ_x + σ_x · ε    (deterministic in (μ_x, σ_x); same distribution as N(μ_x, σ²_x))

Backprop now flows freely from z to the encoder. The randomness lives in ε, treated as a constant input per step.
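
The trick can be sketched in a few lines of plain Python for a single latent dimension. The helper name `reparameterize` and the example values of μ and σ are assumptions chosen for illustration. The empirical check at the end only confirms that the transformed samples have roughly the intended mean and standard deviation.

```python
import math
import random

def reparameterize(mu, log_var, eps):
    # z is a deterministic function of (mu, log_var) once eps is fixed,
    # so dz/dmu = 1 and dz/dsigma = eps are well defined.
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

rng = random.Random(0)               # noise source lives outside the "network"
mu, log_var = 1.5, math.log(0.25)    # illustrative values: sigma = 0.5

samples = [reparameterize(mu, log_var, rng.gauss(0.0, 1.0)) for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"empirical mean={mean:.3f}, std={std:.3f}")
```

In an autodiff framework the same expression is written on tensors, and the gradient with respect to μ and log σ² is obtained automatically because ε enters only as an input.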

KL( N(μ, σ²) || N(0, 1) ) = 0.5 · ( σ² + μ² − 1 − log σ² )

Sum over latent dimensions for a d-dim diagonal-Gaussian encoder.

| μ, σ | KL |
| --- | --- |
| 0, 1 (match prior) | 0 |
| 1, 1 (shifted mean) | 0.5 |
| 0, 2 (wider) | ≈ 0.807 |

The KL is zero only when the encoder matches the prior (μ = 0, σ = 1), positive otherwise, and grows as the mean shifts away from 0 or the variance moves away from 1 in either direction.
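
The closed form can be checked numerically against the three cases listed above. This is a small sketch; the function names `kl_to_standard_normal` and `kl_diag` are illustrative.

```python
import math

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension.
    var = sigma ** 2
    return 0.5 * (var + mu ** 2 - 1.0 - math.log(var))

def kl_diag(mus, sigmas):
    # Diagonal-Gaussian encoder: sum the per-dimension terms.
    return sum(kl_to_standard_normal(m, s) for m, s in zip(mus, sigmas))

for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]:
    print(mu, sigma, round(kl_to_standard_normal(mu, sigma), 3))
```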

-ELBO(x; q) = -log p(x | z̃) + 0.5 · sum_dims ( σ²_x + μ²_x − 1 − log σ²_x )

  reconstruction:  -log p(x | z̃)
  closed-form KL:  0.5 · sum_dims ( σ²_x + μ²_x − 1 − log σ²_x )
  where z̃ = μ_x + σ_x · ε, ε ~ N(0, I)

Reconstruction needs ONE Monte Carlo sample. KL is exact, no sampling.

  1. Encoder forward on x → (μ_x, log σ²_x)
  2. Sample ε ~ N(0, I); compute z̃ = μ_x + σ_x · ε
  3. Decoder forward on z̃ → p(x | z̃); compute -log p(x | z̃) (reconstruction)
  4. Closed-form KL from (μ_x, log σ²_x)
  5. Sum: per-example loss = reconstruction + KL
  6. Backprop through everything (including to encoder); SGD step
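
The six steps above can be sketched end to end on a deliberately tiny model, with gradients written out by hand so that nothing beyond the standard library is needed. Everything about the model is an assumption made for illustration: scalar x and z, a linear encoder (parameters a, b, c, d), a linear unit-variance Gaussian decoder (parameters w, v), and the names `neg_elbo_and_grads` and `sgd_step`. A real VAE would use neural networks and an autodiff framework instead.

```python
import math
import random

def neg_elbo_and_grads(x, params, eps):
    """Per-example -ELBO (up to a constant) and its gradients for a toy 1-D VAE."""
    a, b, c, d, w, v = params
    # 1. Encoder forward: x -> (mu, log sigma^2)
    mu = a * x + b
    log_var = c * x + d
    # 2. Reparameterized sample; eps is drawn by the caller
    sigma = math.exp(0.5 * log_var)
    z = mu + sigma * eps
    # 3. Decoder forward; unit-variance Gaussian gives 0.5 * (x - x_hat)^2 + const
    x_hat = w * z + v
    recon = 0.5 * (x - x_hat) ** 2
    # 4. Closed-form KL from (mu, log sigma^2)
    kl = 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # 5. Per-example loss
    loss = recon + kl
    # 6. Backprop by hand: the decoder gradient reaches the encoder through z
    r = x_hat - x                      # d recon / d x_hat
    d_z = r * w
    d_mu = d_z + mu                    # path through z, plus KL term
    d_log_var = d_z * eps * 0.5 * sigma + 0.5 * (math.exp(log_var) - 1.0)
    grads = (d_mu * x, d_mu, d_log_var * x, d_log_var, r * z, r)
    return loss, grads

def sgd_step(x, params, lr, rng):
    eps = rng.gauss(0.0, 1.0)          # one Monte Carlo sample per example
    loss, grads = neg_elbo_and_grads(x, params, eps)
    return loss, tuple(p - lr * g for p, g in zip(params, grads))

rng = random.Random(0)
params = (0.1, 0.0, 0.0, 0.0, 0.1, 0.0)   # arbitrary starting values
for x in [0.5, -1.0, 2.0, 0.3]:
    loss, params = sgd_step(x, params, lr=0.01, rng=rng)
```

Note that `d_mu` and `d_log_var` each contain a term coming from the reconstruction loss via z. That term exists only because z is written as μ + σ·ε with ε held fixed.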

| Good at | Less competitive |
| --- | --- |
| Representation learning (latent code as structured representation) | State-of-the-art image sample quality (diffusion wins) |
| Compression as a component in larger systems (latent diffusion) | Sharp samples in raw pixel space (Gaussian decoder smooths) |
| Disentangled latents, controllable generation by latent arithmetic | Exact likelihood (only ELBO, a lower bound) |

  • Reading VAE training curves. Track the two terms separately: KL dropping to near zero early suggests posterior collapse; a stuck reconstruction term suggests the decoder is not using the latent; both shrinking is healthy.
  • Reparameterization trick recurs. Diffusion sampling, some RL policy-gradient methods, normalizing-flow extensions of q. “Differentiable sampling” = reparameterization.
  • Latent diffusion has a VAE inside. Stable Diffusion: VAE compresses image to latent, diffusion runs in latent space. Lesson 14 builds on this.

VAE-based components appear in many synthetic-media systems. Questions about synthetic-media use (when it is appropriate, watermarking, content policies, training-data licensing) are outside this mechanical lesson and belong in legal, governance, and ethics forums.

  • Sample from q directly instead of reparameterizing → gradients do not flow through the encoder.
  • Predict σ² instead of log σ² → instability + clamp boundaries; predict log σ², exponentiate.
  • Quote ELBO as likelihood → ELBO is a lower bound; cross-paradigm comparisons require care.
  • Conflate small KL with posterior collapse → small KL can also mean prior is just a good fit; collapse is when DIFFERENT x produce the SAME encoder output.
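
The last pitfall suggests a simple diagnostic: look at how much the encoder output varies across different inputs, rather than at the KL value alone. The sketch below is one possible way to do that. The function name `collapse_report`, the `[batch][latent_dim]` list layout, and the tolerance `tol` are all assumptions for illustration, and a sensible threshold depends on the model and data.

```python
import math

def collapse_report(mus, log_vars, tol=1e-3):
    """Per-dimension summary of encoder outputs over a batch of different inputs."""
    n, d = len(mus), len(mus[0])
    report = []
    for j in range(d):
        col_mu = [mus[i][j] for i in range(n)]
        col_lv = [log_vars[i][j] for i in range(n)]
        mean_mu = sum(col_mu) / n
        mean_lv = sum(col_lv) / n
        # Variance of the encoder outputs across inputs for this dimension.
        mu_spread = sum((m - mean_mu) ** 2 for m in col_mu) / n
        lv_spread = sum((lv - mean_lv) ** 2 for lv in col_lv) / n
        # Batch-averaged closed-form KL for this dimension.
        mean_kl = sum(
            0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(col_mu, col_lv)
        ) / n
        report.append({
            "dim": j,
            "mean_kl": mean_kl,
            "mu_spread": mu_spread,
            # Flag only when the output barely changes with the input.
            "input_independent": mu_spread < tol and lv_spread < tol,
        })
    return report

# Dimension 0 ignores the input; dimension 1 responds to it.
mus = [[0.0, 1.0], [0.0, -1.0], [0.0, 0.5]]
log_vars = [[0.0, -2.0], [0.0, -2.0], [0.0, -2.0]]
report = collapse_report(mus, log_vars)
```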

A VAE is two neural networks (Gaussian encoder + decoder) trained on the ELBO; the reparameterization trick z = μ + σ·ε makes the sample differentiable, the Gaussian KL has a closed form, and the per-example loss is one Monte Carlo reconstruction plus one closed-form KL.