Summary: Variational inference for RL
The one paragraph version
Variational inference provides the mathematical language for two distinct uses in modern deep RL. The evidence lower bound (ELBO), log p(x) ≥ E_q[log p(x|z)] - KL(q(z|x) || p(z)), is a tractable lower bound on the intractable log marginal likelihood, derived from Jensen’s inequality. The gap log p(x) - ELBO = KL(q(z|x) || p(z|x)) is the KL between the variational and true posterior, so maximizing the ELBO over the model parameters pushes up a lower bound on the likelihood, while maximizing it over q shrinks the posterior-approximation error. The reparameterization trick writes z = g_φ(ε, x) with ε drawn from a fixed distribution, which makes sampling differentiable for backpropagation. There are two main RL applications. Latent-state world models (PlaNet, Dreamer, RSSM) treat the world as a sequential latent-variable model and train it with the ELBO. MaxEnt RL (SAC) is variational policy optimization with a uniform action prior, where the famous entropy bonus is actually a KL regularizer toward uniform. The same machinery applies to RLHF (KL penalty against the pretrained-model prior) and imitation-bootstrapped RL (KL against a demonstration policy). Lesson 12 generalizes this all the way: the full RL problem can be cast as variational inference in a graphical model with “the trajectory was optimal” as the conditioning evidence.
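For reference, the two identities in the paragraph above can be written out in a few lines. This is a sketch of the standard derivation in the same notation (q is shorthand for q(z|x)); it assumes q(z|x) > 0 wherever p(x, z) > 0.

```latex
\begin{align*}
\log p(x)
  &= \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(x, z)}{q(z|x)}\right] \\
  &\ge \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x, z)}{q(z|x)}\right]
     && \text{(Jensen, since $\log$ is concave)} \\
  &= \mathbb{E}_{q}\left[\log p(x|z)\right]
     - \mathrm{KL}\!\left(q(z|x) \,\|\, p(z)\right)
     && \text{(using $p(x, z) = p(x|z)\,p(z)$)} \\[6pt]
\log p(x) - \mathrm{ELBO}
  &= \mathbb{E}_{q}\!\left[\log p(x) - \log \frac{p(x, z)}{q(z|x)}\right] \\
  &= \mathbb{E}_{q}\!\left[\log \frac{q(z|x)}{p(z|x)}\right]
   = \mathrm{KL}\!\left(q(z|x) \,\|\, p(z|x)\right) \ge 0
     && \text{(using $p(x, z) = p(z|x)\,p(x)$)}
\end{align*}
```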
Five things to remember
- The ELBO E_q[log p(x | z)] - KL(q || p) is a tractable lower bound on the log marginal log p(x). Derived by applying Jensen’s inequality to log E_q[p(x, z) / q(z|x)].
- The ELBO gap equals the KL between the variational and true posterior: log p(x) - ELBO = KL(q(z|x) || p(z|x)). Tight when q = p(z|x).
- Reparameterization writes z = g_φ(ε, x) with ε from a fixed distribution. Makes sampling differentiable; lets backpropagation flow through samples.
- MaxEnt RL is variational: SAC’s entropy bonus α H(π) is equivalent to -α KL(π || uniform) (up to constants). Picking a different prior gives a different objective: KL-regularized PPO uses π_pretrained; imitation-bootstrap uses a demonstration policy.
- L12 will generalize: take “optimality” as evidence and reformulate the entire RL problem as variational inference. Bellman backups, soft Q-learning, and KL-regularized policy gradient all fall out.
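The reparameterization item can be made concrete with a small sketch. It assumes a Gaussian q with z = μ + σ ε and a toy test function f(z) = z², chosen only because E[z²] = μ² + σ² has closed-form gradients to compare against. The function name `reparam_grads` and the sample count are illustrative, not taken from the lesson.

```python
import random


def reparam_grads(mu, sigma, n_samples=100_000, seed=0):
    """Pathwise gradient estimate of E_{z ~ N(mu, sigma^2)}[z^2].

    Uses z = mu + sigma * eps with eps ~ N(0, 1), so dz/dmu = 1 and
    dz/dsigma = eps. The toy objective f(z) = z^2 is an assumption made
    for illustration.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise from a fixed, parameter-free distribution
        z = mu + sigma * eps        # z = g_phi(eps): deterministic in (mu, sigma)
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule through dz/dmu
        grad_sigma += df_dz * eps   # chain rule through dz/dsigma
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = reparam_grads(mu=0.3, sigma=1.0)
# Closed form: E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

The Monte Carlo averages should land close to the exact values 2μ = 0.6 and 2σ = 2.0. In a deep-learning framework the same chain rule is applied automatically once the sample is written as a deterministic function of the parameters and the noise.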
Why this matters
Variational inference is the bridge between three sub-fields that otherwise speak different languages:
- Deep generative modeling: VAEs, normalizing flows, latent diffusion (Rombach et al., 2022).
- Model-based RL: PlaNet and Dreamer’s RSSM-style world models, trained with the ELBO.
- Maximum-entropy RL: SAC and the family of soft Q-learning algorithms.
Once you can read all three through the same machinery, innovations transfer. The RSSM is a sequential VAE; SAC’s soft Bellman backup is the variational message-passing in the right graphical model; RLHF’s KL penalty is a Bayesian prior; the temperature α in SAC and the β in RLHF play the same role (the KL-weight hyperparameter). The ELBO + reparameterization vocabulary is what makes the connections visible.
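The claim that the entropy bonus is a KL penalty toward a uniform prior can be checked numerically. The sketch below assumes a small discrete action set (so the uniform prior is well defined) and a made-up policy and temperature; it verifies α H(π) = -α KL(π || uniform) + α log |A|, where the last term is the constant that does not depend on π.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats for a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_to_uniform(p):
    """KL(p || uniform) over len(p) actions, in nats."""
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p if pi > 0.0)


policy = [0.7, 0.2, 0.1]   # hypothetical action probabilities
alpha = 0.2                # hypothetical temperature

entropy_bonus = alpha * entropy(policy)
kl_form = -alpha * kl_to_uniform(policy) + alpha * math.log(len(policy))

print(entropy_bonus, kl_form)  # the two values agree up to floating-point error
```

Replacing the uniform prior in `kl_to_uniform` with another reference distribution gives the KL-regularized variants, with α (or β in RLHF) acting as the weight on the KL term.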
This lesson is also the conceptual bridge to L12. The L4-L10 dispatch-table tour answered “what does each algorithm estimate?” (five families, one per L3 entry). The control-as-inference framing answers a different question: “is there a single unifying principle that produces the right algorithm given a problem specification?” The answer is variational inference; L12 makes the connection concrete.
Worked check (memory anchor)
Linear-Gaussian model: p(z) = N(0, 1), p(x | z) = N(z, 1), where the second argument of N is the variance throughout. Observe x = 1. True posterior p(z | x) = N(0.5, 0.5). Marginal p(x) = N(0, 2), so log p(x=1) ≈ -1.516.
Mismatched variational q(z | x) = N(0.3, 1):
- E_q[(1-z)²] = (1-0.3)² + 1 = 1.49 → E_q[log p(x|z)] ≈ -0.919 - 0.745 = -1.664
- KL(q || p) = 0 + 1.09/2 - 0.5 = 0.045
- ELBO ≈ -1.664 - 0.045 = -1.709
- Gap = -1.516 - (-1.709) = 0.193
Dual-path check via direct posterior KL:
- KL(N(0.3, 1) || N(0.5, 0.5)) = 0.5 log(0.5) + 1.04/1 - 0.5 = -0.347 + 0.54 = 0.193 ✓
Identity log p(x) - ELBO = KL(q(z|x) || p(z|x)) holds to the digit.
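The arithmetic in this worked check can be reproduced with closed-form Gaussian formulas. The helper names `log_normal_pdf` and `kl_normal` are illustrative; both take variances rather than standard deviations, matching the N(mean, variance) convention of the example.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def kl_normal(m_q, v_q, m_p, v_p):
    """KL(N(m_q, v_q) || N(m_p, v_p)) for univariate Gaussians (variances)."""
    return 0.5 * math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / (2.0 * v_p) - 0.5


x = 1.0
m_q, v_q = 0.3, 1.0                          # mismatched variational q(z | x)

log_px = log_normal_pdf(x, 0.0, 2.0)         # marginal p(x) = N(0, 2)
expected_sq_err = (x - m_q) ** 2 + v_q       # E_q[(x - z)^2]
expected_loglik = -0.5 * math.log(2.0 * math.pi) - 0.5 * expected_sq_err
kl_prior = kl_normal(m_q, v_q, 0.0, 1.0)     # KL(q || p(z))
elbo = expected_loglik - kl_prior
gap = log_px - elbo
kl_posterior = kl_normal(m_q, v_q, 0.5, 0.5) # KL(q || p(z | x))

print(round(log_px, 3), round(elbo, 3), round(gap, 3), round(kl_posterior, 3))
```

The last two printed numbers are the two sides of the identity; they agree to floating-point precision, not only to three digits.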
Where this fits
- Previous (Lesson 10): Planning with a learned model. Closed the P-branch.
- This lesson: Variational inference machinery. ELBO, reparameterization, the two RL applications (latent-state world models, MaxEnt RL).
- Next (Lesson 12): Control as inference. Apply the variational machinery to the full RL problem.
- Later (Lesson 13): RLHF. The KL-regularized PPO objective is a special case of the L12 framing.
What you should remember
The ELBO is the workhorse identity, reparameterization makes it backprop-compatible, and the two main RL applications (latent-state world models and MaxEnt RL) are different incarnations of the same machinery. Lesson 12 uses this language to derive the entire RL framework from a probabilistic-inference starting point.