Summary: Variational inference for RL

The one-paragraph version

Variational inference provides the mathematical language for two distinct uses in modern deep RL. The evidence lower bound (ELBO), log p(x) ≥ E_q[log p(x|z)] - KL(q(z|x) || p(z)), is a tractable lower bound on the intractable log-likelihood, derived from Jensen’s inequality. The gap log p(x) - ELBO = KL(q(z|x) || p(z|x)) is the KL between the variational and true posterior, so maximizing the ELBO over the model parameters pushes up a lower bound on the likelihood, while maximizing it over q shrinks the posterior-approximation error. The reparameterization trick writes z = g_φ(ε, x) with ε drawn from a fixed distribution, making sampling differentiable for backpropagation. There are two main RL applications: latent-state world models (PlaNet, Dreamer, RSSM) treat the world as a sequential latent-variable model and train it with the ELBO; MaxEnt RL (SAC) is variational policy optimization with a uniform action prior, where the entropy bonus is, up to a constant, a KL regularizer toward the uniform policy. The same machinery applies to RLHF (KL penalty against the pretrained-model prior) and imitation-bootstrapped RL (KL against a demonstration policy). Lesson 12 generalizes this all the way: the full RL problem can be cast as variational inference in a graphical model with “the trajectory was optimal” as the conditioning evidence.
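
The bound and the gap identity quoted above follow from a short calculation. As a sketch in the same symbols (q(z|x) is any distribution with support wherever p(x, z) > 0):

```latex
\begin{aligned}
\log p(x)
  &= \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(x,z)}{q(z|x)}\right]
  \;\ge\; \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]
  && \text{(Jensen)} \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
     - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)
  = \mathrm{ELBO}, \\[4pt]
\log p(x) - \mathrm{ELBO}
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{q(z|x)\,p(x)}{p(x,z)}\right]
  = \mathbb{E}_{q(z|x)}\!\left[\log \frac{q(z|x)}{p(z|x)}\right]
  = \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big).
\end{aligned}
```

The second line uses p(x, z) = p(x|z) p(z); the gap line uses p(x, z) = p(z|x) p(x).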

Five things to remember

The ELBO E_q[log p(x|z)] - KL(q(z|x) || p(z)) is a tractable lower bound on the log marginal likelihood log p(x). It follows from applying Jensen’s inequality to log p(x) = log E_q[p(x, z) / q(z|x)].
The ELBO gap equals the KL between the variational and true posterior: log p(x) - ELBO = KL(q(z|x) || p(z|x)). The bound is tight exactly when q(z|x) = p(z|x).
Reparameterization writes z = g_φ(ε, x) with ε drawn from a fixed distribution, e.g. z = μ_φ(x) + σ_φ(x) ε with ε ~ N(0, 1) for a Gaussian q. This makes sampling differentiable in φ, so backpropagation can flow through samples.
MaxEnt RL is variational: SAC’s entropy bonus α H(π) is equivalent to -α KL(π || uniform) (up to constants). Picking a different prior gives a different objective: KL-regularized PPO uses π_pretrained; imitation-bootstrap uses a demonstration policy.
L12 will generalize: take “optimality” as evidence and reformulate the entire RL problem as variational inference. Bellman backups, soft Q-learning, and KL-regularized policy gradient all fall out.
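
The reparameterization item can be made concrete with a small numerical sketch. It is illustrative only: the test function f(z) = z² and the name reparam_gradients are assumptions made here, chosen because the exact answer is known (E[z²] = μ² + σ², so the gradients are 2μ and 2σ). Derivatives are written out by hand where an autodiff library would normally apply the chain rule.

```python
import random


def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo pathwise gradients of E_{z ~ N(mu, sigma^2)}[z^2].

    f(z) = z^2 is a stand-in for a loss (an assumption for illustration).
    Returns estimates of (d/dmu, d/dsigma).
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise from a fixed distribution
        z = mu + sigma * eps        # z = g_phi(eps) with phi = (mu, sigma)
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz            # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = reparam_gradients(mu=0.3, sigma=1.0)
print(f"pathwise d/dmu    = {g_mu:.3f}  (exact {2 * 0.3:.3f})")
print(f"pathwise d/dsigma = {g_sigma:.3f}  (exact {2 * 1.0:.3f})")
```

Because the randomness lives in ε rather than in z, the parameters μ and σ enter only through a deterministic function, which is what lets the gradient pass through the sample.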

Why this matters

Variational inference is the bridge between three sub-fields that otherwise speak different languages:

Deep generative modeling: VAEs, normalizing flows, latent diffusion (Rombach et al., 2022).
Model-based RL: PlaNet and Dreamer’s RSSM-style world models, trained with the ELBO.
Maximum-entropy RL: SAC and the family of soft Q-learning algorithms.

Once you can read all three through the same machinery, innovations transfer. The RSSM is a sequential VAE; SAC’s soft Bellman backup is variational message passing in the right graphical model; RLHF’s KL penalty encodes a Bayesian prior; the temperature α in SAC and the β in RLHF play the same role (the weight on the KL term). The ELBO-plus-reparameterization vocabulary is what makes the connections visible.
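
The shared role of α and β can be checked on a toy case. The sketch below uses made-up numbers for a single state with three discrete actions; the function names and values are hypothetical. It shows that an entropy bonus and a KL penalty toward a uniform prior differ only by a policy-independent constant, and that swapping in a non-uniform reference policy gives the RLHF-style objective from the same function.

```python
import math


def entropy(pi):
    """Shannon entropy H(pi) in nats for a discrete distribution."""
    return -sum(p * math.log(p) for p in pi if p > 0)


def kl(pi, prior):
    """KL(pi || prior) in nats for discrete distributions on one support."""
    return sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0)


def kl_regularized_objective(pi, q_values, prior, weight):
    """E_pi[Q] - weight * KL(pi || prior) for a single state."""
    expected_q = sum(p * q for p, q in zip(pi, q_values))
    return expected_q - weight * kl(pi, prior)


# Hypothetical numbers for one state with three actions.
pi = [0.7, 0.2, 0.1]
q_values = [1.0, 0.5, -0.2]
alpha = 0.2
n_actions = len(pi)
uniform = [1.0 / n_actions] * n_actions

# MaxEnt form: expected value plus an entropy bonus.
maxent = sum(p * q for p, q in zip(pi, q_values)) + alpha * entropy(pi)

# KL form with a uniform prior; differs by the constant alpha * log(n_actions).
kl_form = kl_regularized_objective(pi, q_values, uniform, alpha)
constant = alpha * math.log(n_actions)
print("difference after removing the constant:", maxent - (kl_form + constant))

# Same function, different prior: a stand-in for a reference policy.
reference = [0.5, 0.3, 0.2]
print("KL-to-reference objective:",
      kl_regularized_objective(pi, q_values, reference, alpha))
```

The constant α log(n) does not depend on π, so both forms have the same maximizer.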

This lesson is also the conceptual bridge to L12. The L4-L10 dispatch-table tour answered “what does each algorithm estimate?” (five families, one per L3 entry). The control-as-inference framing answers a different question: “is there a single unifying principle that produces the right algorithm given a problem specification?” The answer is variational inference; L12 makes the connection concrete.

Worked check (memory anchor)

Linear-Gaussian model: p(z) = N(0, 1), p(x | z) = N(z, 1), where the second argument of N(·, ·) is the variance. Observe x = 1. The true posterior is p(z | x) = N(0.5, 0.5) and the marginal is p(x) = N(0, 2), so log p(x=1) = -0.5 log(4π) - 0.25 ≈ -1.516.

Mismatched variational q(z | x) = N(0.3, 1):

E_q[(1-z)²] = (1-0.3)² + 1 = 1.49 → E_q[log p(x|z)] = -0.5 log(2π) - 1.49/2 ≈ -0.919 - 0.745 = -1.664
KL(q(z|x) || p(z)) = log(1/1) + (1 + 0.3²)/2 - 0.5 = 0 + 1.09/2 - 0.5 = 0.045
ELBO ≈ -1.664 - 0.045 = -1.709
Gap = -1.516 - (-1.709) = 0.193

Dual-path check via direct posterior KL:

KL(N(0.3, 1) || N(0.5, 0.5)) = 0.5 log(0.5/1) + (1 + 0.2²)/(2·0.5) - 0.5 = -0.347 + 1.04 - 0.5 = 0.193 ✓

The identity log p(x) - ELBO = KL(q(z|x) || p(z|x)) holds to three decimal places.
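
The same check can be reproduced in a few lines of Python. This is a minimal sketch using closed-form Gaussian expressions; the helper names log_normal_pdf and kl_normal are introduced here for illustration.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a scalar Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def kl_normal(mean_q, var_q, mean_p, var_p):
    """KL(N(mean_q, var_q) || N(mean_p, var_p)) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mean_q - mean_p) ** 2) / var_p
                  - 1.0)


x = 1.0
q_mean, q_var = 0.3, 1.0                      # mismatched q(z | x)

log_px = log_normal_pdf(x, 0.0, 2.0)          # marginal p(x) = N(0, 2)

# E_q[log N(x; z, 1)] = -0.5 * log(2*pi) - 0.5 * E_q[(x - z)^2]
expected_sq_err = (x - q_mean) ** 2 + q_var
recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * expected_sq_err

kl_prior = kl_normal(q_mean, q_var, 0.0, 1.0)  # KL(q || p(z))
elbo = recon - kl_prior
gap = log_px - elbo
kl_post = kl_normal(q_mean, q_var, 0.5, 0.5)   # KL(q || p(z | x))

print(f"log p(x)        = {log_px:.3f}")
print(f"ELBO            = {elbo:.3f}")
print(f"gap             = {gap:.3f}")
print(f"KL to posterior = {kl_post:.3f}")
```

The two routes to the gap agree up to floating-point error, not just to the rounding shown in the hand calculation.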

Where this fits

Previous (Lesson 10): Planning with a learned model. Closed the P-branch.
This lesson: Variational inference machinery. ELBO, reparameterization, the two RL applications (latent-state world models, MaxEnt RL).
Next (Lesson 12): Control as inference. Apply the variational machinery to the full RL problem.
Later (Lesson 13): RLHF. The KL-regularized PPO objective is a special case of the L12 framing.

What you should remember

The ELBO is the workhorse identity, reparameterization makes it backprop-compatible, and the two main RL applications (latent-state world models and MaxEnt RL) are different incarnations of the same machinery. Lesson 12 uses this language to derive the entire RL framework from a probabilistic-inference starting point.