Cheatsheet: Variational inference for RL
The central identity
`log p(x) = ELBO + KL(q(z|x) || p(z|x))`

The ELBO is a lower bound on log p(x); the gap is the KL divergence between the variational posterior and the true posterior. Because log p(x) does not depend on q, maximizing the ELBO over q minimizes posterior-approximation error, while maximizing it over the model parameters pushes up a lower bound on the likelihood.
The ELBO formula
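`ELBO = E_{z ~ q(z|x)} [ log p(x | z) ] - KL(q(z|x) || p(z))`

| Term | Role |
|---|---|
| `E_q[log p(x \| z)]` | Reconstruction term: expected log-likelihood of x under latents sampled from q |
| `KL(q(z \| x) \|\| p(z))` | Regularizer: penalizes q for moving away from the prior p(z) |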
Gaussian-to-Gaussian KL (closed form)
For q = N(μ_q, σ_q²), p = N(μ_p, σ_p²):

`KL(q || p) = log(σ_p / σ_q) + (σ_q² + (μ_q - μ_p)²) / (2 σ_p²) - 1/2`

For multivariate Gaussians with diagonal Σ, sum this expression over dimensions.
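A minimal sketch of the closed form above in plain Python. The function names `gaussian_kl` and `diag_gaussian_kl` are illustrative, not from any library, and the arguments are assumed to be variances (σ²) rather than standard deviations.

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for univariate Gaussians.

    Arguments are variances, so log(sigma_p / sigma_q) = 0.5 * log(var_p / var_q).
    """
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
            - 0.5)

def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Diagonal-covariance case: sum the per-dimension univariate KLs."""
    return sum(gaussian_kl(a, b, c, d)
               for a, b, c, d in zip(mu_q, var_q, mu_p, var_p))

# KL of a distribution to itself is zero; KL to a standard normal prior is not.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))
print(diag_gaussian_kl([0.3, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
```

Working with variances avoids a square root; if a model outputs log-variances instead, the log term can be computed directly from their difference.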
Reparameterization trick
`z = g_φ(ε, x)` where `ε ~ p(ε)` (fixed, no parameters)

For q_φ(z|x) = N(μ_φ(x), σ_φ²(x)):

`z = μ_φ(x) + σ_φ(x) · ε, ε ~ N(0, 1)`

This lets gradients flow through samples back to φ. Critical for trainable VAEs and the SAC actor.
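To make "gradient flows through samples" concrete, here is a small sketch without an autodiff library. It assumes a toy objective f(z) = z² (chosen only because E[z²] = μ² + σ² has known gradients 2μ and 2σ) and applies the chain rule by hand through z = μ + σ·ε. The function name `reparam_grads` is hypothetical.

```python
import random

def reparam_grads(mu, sigma, n_samples=100_000, seed=0):
    """Pathwise (reparameterized) gradient estimates of E[f(z)] for f(z) = z**2.

    With z = mu + sigma * eps, dz/dmu = 1 and dz/dsigma = eps, so
    d/dmu E[f] = E[f'(z)] and d/dsigma E[f] = E[f'(z) * eps].
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise source with no parameters
        z = mu + sigma * eps        # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z             # derivative of the toy objective
        grad_mu += df_dz            # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule with dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_grads(mu=0.5, sigma=1.5)
print(g_mu, g_sigma)  # should land near the analytic values 2*mu = 1.0 and 2*sigma = 3.0
```

In a real VAE or SAC actor an autodiff framework performs the same chain rule automatically; the point of the sketch is only that the randomness lives in ε, so μ and σ appear in a differentiable expression.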
Worked example (lesson body)
Model: p(z) = N(0, 1), p(x|z) = N(z, 1), where the second argument of N is the variance. Observe x = 1.
True posterior (closed form by Bayes + complete-the-square): p(z|x=1) = N(0.5, 0.5).
Marginal: p(x) = N(0, 2). So log p(x=1) = -0.5 log(4π) - 0.25 ≈ -1.516.
Mismatched variational: q(z|x=1) = N(0.3, 1). Compute:
- `E_q[(1-z)²] = (1-0.3)² + 1 = 1.49`
- `E_q[log p(x|z)] = -0.5 log(2π) - 0.5 · 1.49 ≈ -0.919 - 0.745 = -1.664`
- `KL(q || p(z)) = 0 + (1 + 0.09)/2 - 0.5 = 0.045`
- `ELBO = -1.664 - 0.045 = -1.709`

ELBO gap: log p(x) - ELBO = -1.516 - (-1.709) = 0.193.
Dual-path: compute KL(q(z|x) || p(z|x)) directly:
`KL(N(0.3, 1) || N(0.5, 0.5)) = log(√0.5 / 1) + (1 + 0.04)/1 - 0.5 = -0.347 + 1.04 - 0.5 = 0.193` ✓

Both paths arrive at gap = 0.193. The central identity holds to three decimal places.
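The arithmetic in this worked example can be reproduced with a few lines of standard-library Python. This is a sketch under the same model assumptions (unit-variance prior and likelihood, x = 1, q = N(0.3, 1)); the helper names are illustrative.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p))."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5)

x = 1.0
mu_q, var_q = 0.3, 1.0                      # mismatched variational posterior

log_px = log_normal_pdf(x, 0.0, 2.0)        # marginal p(x) = N(0, 2)

# Path 1: ELBO = E_q[log p(x|z)] - KL(q || prior), then gap = log p(x) - ELBO.
expected_sq_err = (x - mu_q) ** 2 + var_q   # E_q[(x - z)^2]
recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * expected_sq_err
kl_prior = gaussian_kl(mu_q, var_q, 0.0, 1.0)
elbo = recon - kl_prior
gap = log_px - elbo

# Path 2: KL from q to the true posterior N(0.5, 0.5), computed directly.
kl_posterior = gaussian_kl(mu_q, var_q, 0.5, 0.5)

print(f"log p(x) = {log_px:.3f}, ELBO = {elbo:.3f}")
print(f"gap = {gap:.3f}, KL(q || posterior) = {kl_posterior:.3f}")
```

Both printed gap values agree, which is the central identity checked numerically rather than by hand.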
Where VI shows up in RL
| Setting | Variational object | Algorithm |
|---|---|---|
| Partially observed RL, latent-state world models | Sequential q(s_t \| o_1:t) over latent states | Dreamer |
| Maximum-entropy RL | KL(π(· \| s) \|\| uniform) | SAC |
| KL-regularized RLHF | KL(π_θ \|\| π_pretrained) | KL-regularized PPO |
| Inverse RL with energy-based models | Variational max-likelihood of expert trajectories | MaxEnt IRL (Ziebart 2008) |
The unifying view: the entropy bonus, the KL-to-pretrained penalty, and the variational world model are all instances of the same machinery applied to different parts of the RL pipeline.
MaxEnt RL as variational (the SAC view)
Standard policy gradient maximizes E[r(s, a)]. SAC adds entropy:

`J_SAC(π) = E [ r(s, a) + α · H(π(·|s)) ] = E [ r(s, a) ] - α · KL(π(·|s) || uniform)` (up to constants)

The entropy bonus is a KL regularizer toward the uniform action prior. Picking a different prior gives a different objective:

| Prior | Objective | Use case |
|---|---|---|
| Uniform | SAC, MaxEnt RL | Exploration regularization |
| Pretrained model | KL-regularized PPO | RLHF |
| Demonstration policy | Behavioral-cloning regularizer | Imitation-bootstrap |
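The "up to constants" step in the SAC objective can be checked on a small discrete action set, where H(π) = log|A| - KL(π || uniform). The sketch below assumes a hypothetical three-action policy; SAC itself typically uses continuous actions, where the same identity holds with the volume of a bounded action space in place of |A|.

```python
import math

def entropy(probs):
    """Shannon entropy H(pi) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_to_uniform(probs):
    """KL(pi || uniform) over len(probs) actions."""
    n = len(probs)
    return sum(p * math.log(p * n) for p in probs if p > 0.0)

policy = [0.7, 0.2, 0.1]                     # illustrative action probabilities
lhs = entropy(policy)
rhs = math.log(len(policy)) - kl_to_uniform(policy)
print(lhs, rhs)                              # the two values coincide
```

The constant log|A| does not depend on the policy, so maximizing entropy and minimizing KL to the uniform prior give the same gradient.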
Common pitfalls
- Forgetting the ELBO is a lower bound, not the likelihood
- Treating KL as a “free” regularizer (it biases the solution toward the prior, and its weight has to be tuned)
- Confusing reparameterization with rejection sampling
- MC-estimating Gaussian-to-Gaussian KL when closed form exists
- Treating SAC’s entropy bonus as separate from variational inference (it is a KL term toward a uniform prior, up to a constant)
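The pitfall about Monte Carlo estimation of a Gaussian-to-Gaussian KL can be illustrated by comparing the closed form with a sample-based estimate. This is a sketch with illustrative helper names and the two Gaussians from the worked example; the seed and sample sizes are arbitrary choices.

```python
import math
import random

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)): exact, no sampling noise."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5)

def log_normal_pdf(z, mean, var):
    """Log density of N(mean, var) at z."""
    return -0.5 * math.log(2.0 * math.pi * var) - (z - mean) ** 2 / (2.0 * var)

def mc_gaussian_kl(mu_q, var_q, mu_p, var_p, n_samples, rng):
    """Monte Carlo estimate: average of log q(z) - log p(z) over z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, math.sqrt(var_q))
        total += log_normal_pdf(z, mu_q, var_q) - log_normal_pdf(z, mu_p, var_p)
    return total / n_samples

exact = gaussian_kl(0.3, 1.0, 0.5, 0.5)
rng = random.Random(0)
# Five small-sample estimates scatter around the exact value.
estimates = [mc_gaussian_kl(0.3, 1.0, 0.5, 0.5, 100, rng) for _ in range(5)]
print("exact:", round(exact, 4))
print("MC (100 samples each):", [round(e, 4) for e in estimates])
```

The Monte Carlo estimate is unbiased but noisy, and that noise feeds straight into the gradient; when both distributions are Gaussian the closed form costs less and has zero variance.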
What you should remember
- The ELBO is `E_q[log p(x|z)] - KL(q(z|x) || p(z))`; derivable from Jensen’s inequality.
- ELBO gap = `KL(q(z|x) || p(z|x))`; tight when q matches the true posterior.
- Reparameterization makes sampling differentiable.
- Two RL uses: latent-state world models (Dreamer), MaxEnt RL (SAC).
- L12 generalizes: full RL = variational inference with “optimality” as evidence.