References: VAE training in practice, the reparameterization trick

Source material

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 6: Variational Autoencoders (continued)
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (also contributed to this lesson's framing)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 4: Latent Variable Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. Lecture 6 (Variational Autoencoders, continued) is the primary anchor; it covers the reparameterization trick, the closed-form Gaussian KL, and the practical training loop. The course notes at deepgenerativemodels.github.io/notes include a worked derivation of the Gaussian KL and the per-example loss.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 4 (Latent Variable Models) covers the VAE alongside the ELBO derivation in one lecture; the section on training (reparameterization + closed-form KL) is the secondary anchor for this lesson.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Auto-Encoding Variational Bayes” (Kingma, Welling, 2013). The original VAE paper, the canonical reference for everything in this lesson. Section 2.4 (“The reparameterization trick”) introduces the move and coins the term. Appendix B derives the closed-form KL between a diagonal Gaussian posterior and the standard normal prior, and is the source for the formula used here.
“An Introduction to Variational Autoencoders” (Kingma, Welling, 2019). A book-length tutorial by the original VAE authors. Section 2.4 generalizes the reparameterization trick beyond Gaussians (location-scale families, normalizing-flow extensions of q), and Chapter 3 covers practical training details (warmup schedules, posterior collapse, the beta-VAE family). Read after the 2013 paper if you want to see how the trick generalizes.
“High-Resolution Image Synthesis with Latent Diffusion Models” (Rombach et al., 2022). The paper that introduced latent diffusion (the architecture behind Stable Diffusion). Section 3 covers the VAE-style “perceptual compression” encoder that maps images to a low-dimensional latent space before the diffusion process runs. Useful preview reading for the diffusion lessons in Phase 3; shows how the VAE machinery from this lesson plugs into a larger generative pipeline.

Adjacent topics

Where this sits in the track.

Latent variables and the ELBO (previous lesson). L5 derived the ELBO abstractly. This lesson takes the same ELBO to a concrete neural-network architecture and adds the one technical move (the reparameterization trick) that makes it trainable by SGD. The closed-form Gaussian KL lets the second term of the ELBO, the KL between the encoder's q(z|x) and the prior p(z), be computed exactly instead of estimated by sampling.
GANs, the minimax game (next lesson, L7). Phase 2 continues with the adversarial paradigm. GANs throw away the likelihood objective entirely (no ELBO, no NLL) and replace it with a two-network game. The contrast with VAEs is sharp: VAEs trade exact likelihood for the ELBO bound but keep a principled training objective; GANs trade likelihood entirely for sample quality and pay with training instability.
Score-based diffusion via SDEs, the unifying view (lesson 14). Diffusion models, when derived carefully, turn out to optimize a multi-step ELBO over a hierarchy of noisy intermediate latents. The reparameterization trick and the closed-form Gaussian KL reappear in the diffusion training derivation. The latent-diffusion variant additionally uses a VAE of the kind built in this lesson as a learned compression front-end.
Tracks 4 and 8 (Visual Math: Linear Algebra and Calculus). The closed-form Gaussian KL against a standard normal prior, 0.5·(σ² + μ² − 1 − log σ²) per dimension, uses derivatives and logarithms from T8; the multidimensional generalization uses the determinant of the covariance matrix from T4 (log det Σ). With a diagonal covariance, log det Σ reduces to a sum of per-dimension log σ² terms, which is why this lesson stays at the per-dimension formula throughout.
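
For readers who want to check the per-dimension KL formula numerically, here is a minimal sketch in plain Python (standard library only). It is an illustration, not the lesson's or any course's reference code. The function names are hypothetical, the prior is assumed to be a standard normal, and the variance is passed as log σ² (a common but assumed parameterization). The sketch computes the closed form, sums it over dimensions for the diagonal case, and compares it with a Monte Carlo estimate built from reparameterized samples z = μ + σ·ε.

```python
import math
import random


def gaussian_kl_per_dim(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single latent dimension.

    Implements 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2), where
    log_var = log sigma^2 (assumed parameterization).
    """
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)


def diagonal_gaussian_kl(mus, log_vars):
    """Diagonal covariance: total KL is the sum of per-dimension terms."""
    return sum(gaussian_kl_per_dim(m, lv) for m, lv in zip(mus, log_vars))


def reparameterized_sample(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives in eps, so z is a deterministic function
    of (mu, log_var) once eps is fixed.
    """
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


def monte_carlo_kl(mu, log_var, n_samples, rng):
    """Estimate the same KL as the mean of log q(z) - log p(z), z ~ q."""
    var = math.exp(log_var)
    log_two_pi = math.log(2.0 * math.pi)
    total = 0.0
    for _ in range(n_samples):
        z = reparameterized_sample(mu, log_var, rng)
        log_q = -0.5 * (log_two_pi + log_var + (z - mu) ** 2 / var)
        log_p = -0.5 * (log_two_pi + z * z)
        total += log_q - log_p
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    mu, log_var = 1.0, math.log(0.25)
    print("closed form :", gaussian_kl_per_dim(mu, log_var))
    print("monte carlo :", monte_carlo_kl(mu, log_var, 20000, rng))
```

The two printed numbers should agree up to sampling noise. In training, the closed form is used because it removes that noise from the KL term entirely; only the reconstruction term still needs a sample.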