Variational inference for RL: the ELBO

What you’ll be able to do after this lesson

Lessons 4 through 10 covered the algorithmic families of deep RL through the lens of the Lesson 3 dispatch table: each algorithm estimates one of pi, V, Q, A, P. That tour is complete. This lesson and the next change the angle: instead of organizing algorithms by what they estimate, reformulate the RL problem itself as probabilistic inference.

Variational inference (VI) is the machinery for that reformulation. By the end of this lesson you will be able to:

Derive the evidence lower bound (ELBO) from Jensen’s inequality applied to log p of x.
Compute the ELBO for a small linear-Gaussian latent-variable model by hand, then verify two ways that the ELBO gap, log p of x minus the ELBO, equals the KL from the variational posterior to the true posterior.
Explain the reparameterization trick: writing the latent z as a deterministic function of a noise variable epsilon and the input x, with epsilon drawn from a fixed distribution such as a standard normal, lets you sample z while letting gradients flow through that function.
Identify the two main places VI shows up in modern RL: latent-state world models (Dreamer’s RSSM, PlaNet) for partial observability, and maximum-entropy RL (SAC) where the entropy bonus is a KL regularizer toward a uniform action prior.
Anticipate the Lesson 12 framing (control as inference): the entire RL problem is variational inference in a graphical model where “the trajectory was optimal” is the evidence we condition on.

This lesson is conceptual scaffolding. It does not introduce a new algorithm; it provides the mathematical language Lesson 12 needs.

Why VI matters for RL

Two specific RL settings need approximate inference, and the same machinery works for both.

Setting 1: latent-state world models. Real-world dynamics are partially observable: you see camera frames, not the full physical state. The “true” state is latent. To plan or learn a policy you need a representation of that latent state. The canonical algorithms (PlaNet, Hafner et al., 2019; Dreamer, Hafner et al., 2019, 2020, 2021, 2023) treat the world model as a sequential latent-variable model and train it by maximizing the ELBO. This approach underlies much of modern model-based RL for continuous control.

Setting 2: maximum-entropy RL. Algorithms like SAC (Haarnoja et al., 2018) add an entropy bonus to the policy objective: maximize the expected reward plus alpha times the policy entropy. The bonus is usually presented as “exploration regularization.” It is actually a variational quantity: maximizing the policy entropy (the negative expected log-probability of the actions) is equivalent to minimizing the KL divergence from the policy to a uniform distribution (up to a constant). The whole MaxEnt-RL family is variational inference with the prior set to the uniform action distribution.

Lesson 12 generalizes this: pick a different prior, get a different objective, and the full Bellman recursion follows from variational inference in the right graphical model.

Both settings require deriving and computing the ELBO. That is what this lesson covers.

The latent-variable model

A latent-variable model relates observations x and unobserved latent variables z through a joint distribution: p of x and z equals p of x given z times p of z. The quantity you actually want is the marginal p of x, the likelihood of the observed data under the model, which is the integral over z of p of x given z times p of z. For most models of interest that integral is intractable: z is continuous and the integral has no closed form.

The maximum-likelihood objective log p of x is therefore intractable too. In general you cannot compute log p of x directly; you can only bound it.

The variational trick: introduce a tractable proposal distribution q of z given x (the variational posterior). Then:

log p(x) = log ∫ p(x, z) dz
        = log ∫ q(z | x) · (p(x, z) / q(z | x)) dz
        = log E_{z ~ q(z | x)} [ p(x, z) / q(z | x) ]

Jensen’s inequality says the log of an expectation is at least the expectation of the log, for the concave log function. Apply it:

log p(x) ≥ E_{z ~ q(z | x)} [ log p(x, z) - log q(z | x) ]
       = E_q [ log p(x | z) ] - KL(q(z | x) || p(z))

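This is the evidence lower bound (ELBO). The second line follows by splitting log p of x and z into log p of x given z plus log p of z, then grouping log p of z with minus log q of z given x; the expectation of that pair under q is the negative KL from q to the prior. Two terms: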

Reconstruction: the expectation under q of log p of x given z, the expected log-likelihood of the data under the variational posterior.
KL regularizer: the KL divergence from q of z given x to the prior p of z, how far the variational posterior deviates from the prior on z. Acts as a regularization term keeping q from over-fitting to a single training point.
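
The Jensen step can be checked numerically. The sketch below is an illustration under assumptions, not part of the derivation: it assumes a toy model with p(z) = N(0, 1) and p(x | z) = N(z, 1) (the same model the worked example later in this lesson uses), an arbitrary proposal q, and hypothetical names such as `log_normal_pdf`. It draws z from q, forms the log importance weight log p(x, z) - log q(z | x), and compares the log of the average weight (an estimate of log p(x)) with the average of the log weights (an estimate of the ELBO).

```python
import math
import random

def log_normal_pdf(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at v."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

random.seed(0)
x = 1.0
q_mean, q_var = 0.3, 1.0  # arbitrary illustrative proposal q(z | x)

log_w = []
for _ in range(50_000):
    z = random.gauss(q_mean, math.sqrt(q_var))
    # log p(x, z) = log p(z) + log p(x | z) under the assumed toy model
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_w.append(log_joint - log_normal_pdf(z, q_mean, q_var))

# log of the mean weight estimates log p(x); mean of the log weights estimates the ELBO
log_of_mean = math.log(sum(math.exp(lw) for lw in log_w) / len(log_w))
mean_of_log = sum(log_w) / len(log_w)

print("log E_q[w] (estimate of log p(x)):", round(log_of_mean, 3))
print("E_q[log w] (estimate of the ELBO):", round(mean_of_log, 3))
```

The first number is never smaller than the second for any set of samples, because Jensen’s inequality also holds for the empirical average.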

The ELBO gap

The Jensen-inequality step has a known gap. Specifically:

log p(x) - ELBO = KL(q(z | x) || p(z | x))

The ELBO is tight (equal to log p of x) when q of z given x exactly matches the true posterior p of z given x. Since the true posterior is generally intractable, the gap is generally positive; minimizing the gap (or equivalently, maximizing the ELBO) over q brings the variational posterior closer to the truth.

This is the central identity of variational inference:

log p(x) = ELBO + KL(q(z | x) || p(z | x))

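Read the identity two ways. For a fixed model, log p of x does not depend on q, so maximizing the ELBO over q is exactly minimizing the posterior-approximation error (the KL gap). When the model’s own parameters are trained as well, raising the ELBO raises a lower bound on the model likelihood (left side).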
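
With a discrete latent, every quantity in the identity is a finite sum, so it can be checked exactly. The following sketch assumes a made-up three-state latent with hypothetical probability tables (`prior`, `likelihood`, `q`); none of these numbers come from the lesson.

```python
import math

prior = [0.5, 0.3, 0.2]        # p(z) for z in {0, 1, 2} (made-up values)
likelihood = [0.9, 0.4, 0.1]   # p(x | z) for one fixed observation x (made-up values)
q = [0.6, 0.3, 0.1]            # an arbitrary variational posterior q(z | x)

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                   # p(x)
posterior = [j / evidence for j in joint]               # p(z | x)

def elbo_of(dist):
    """ELBO = E_dist[ log p(x, z) - log dist(z) ]."""
    return sum(d * (math.log(j) - math.log(d)) for d, j in zip(dist, joint))

def kl(a, b):
    """KL(a || b) for two discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

gap = math.log(evidence) - elbo_of(q)
kl_to_posterior = kl(q, posterior)

print("log p(x) - ELBO     :", gap)
print("KL(q || p(z | x))   :", kl_to_posterior)
print("gap at q = posterior:", math.log(evidence) - elbo_of(posterior))
```

The first two printed values agree up to floating-point error, and the gap vanishes when q is set to the true posterior.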

Worked example: linear-Gaussian by hand

Take the simplest non-trivial latent-variable model. Latent z is a real number, observation x is a real number:

p(z)    = N(0, 1)          (prior)
p(x | z) = N(z, 1)          (likelihood)

Conjugate Gaussians admit a closed-form true posterior. The marginal p of x is also closed-form: the convolution of two Gaussians.

Closed-form true posterior p of z given x

By Bayes’ rule, p of z given x is proportional to p of x given z times p of z. The product of two Gaussians in z is another Gaussian; complete the square:

p(x | z) · p(z) ∝ exp(-(x - z)²/2) · exp(-z²/2)
              = exp(-(2z² - 2xz + x²) / 2)
              = exp(-(z - x/2)²) · exp(-x²/4)

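The first factor is an unnormalized Gaussian in z with mean x over 2 and variance one-half: a Gaussian with variance sigma squared has exponent minus (z minus the mean) squared over 2 sigma squared, and here the coefficient is 1, so 2 sigma squared equals 1. The second factor does not depend on z and is absorbed into the normalizing constant. So: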

p(z | x) = N(x/2, 1/2)

For x = 1: the true posterior is a Gaussian with mean 0.5 and variance 0.5. Its standard deviation is the square root of 0.5, about 0.707.
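
As a sanity check on the completed square, the posterior moments can be recovered by brute-force numerical integration. This is a minimal sketch, assuming a plain Riemann sum on a grid from -10 to 10 with step 0.001 (both choices are arbitrary).

```python
import math

x = 1.0
dz = 0.001
grid = [-10.0 + i * dz for i in range(20_001)]  # z from -10 to 10

# Unnormalized posterior p(x | z) * p(z), dropping constants that cancel on normalization
unnorm = [math.exp(-0.5 * (x - z) ** 2 - 0.5 * z ** 2) for z in grid]

norm = sum(unnorm) * dz
post_mean = sum(z * u for z, u in zip(grid, unnorm)) * dz / norm
post_var = sum((z - post_mean) ** 2 * u for z, u in zip(grid, unnorm)) * dz / norm

print("posterior mean    :", round(post_mean, 4))   # closed form: x / 2
print("posterior variance:", round(post_var, 4))    # closed form: 1 / 2
```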

Closed-form marginal p of x

Since x equals z plus independent unit-variance Gaussian noise, the marginal is a Gaussian with mean 0 and variance 1 plus 1, that is, variance 2 (the sum of the two Gaussians’ variances). For x = 1:

log p(x = 1) = -0.5 · log(2π · 2) - 0.5 · (1²/2)
           = -0.5 · log(4π) - 0.25
           ≈ -1.266 - 0.25
           = -1.516                        (numerically)
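
The variance-2 claim can be checked by ancestral sampling: draw z from the prior, then x from the likelihood, and look at the spread of x. This sketch assumes 100,000 samples and a fixed seed; both are arbitrary choices.

```python
import math
import random

random.seed(0)
n = 100_000
xs = []
for _ in range(n):
    z = random.gauss(0.0, 1.0)        # z ~ p(z) = N(0, 1)
    xs.append(random.gauss(z, 1.0))   # x ~ p(x | z) = N(z, 1)

sample_mean = sum(xs) / n
sample_var = sum((v - sample_mean) ** 2 for v in xs) / n

# Log density of N(0, 2) at x = 1, matching the hand calculation
log_p_x1 = -0.5 * math.log(2 * math.pi * 2.0) - 0.5 * 1.0 ** 2 / 2.0

print("sample mean of x    :", round(sample_mean, 3))
print("sample variance of x:", round(sample_var, 3))
print("log p(x = 1)        :", round(log_p_x1, 3))
```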

Pick a deliberately mismatched variational posterior

Suppose we use a poor variational fit: q of z given x, at x = 1, is a Gaussian with mean 0.3 and variance 1. The mean is wrong (0.3 instead of 0.5) and the variance is wrong (1 instead of 0.5). What ELBO do we get?

Term 1: the expectation under q of log p of x given z

The log-likelihood log p of x given z equals minus 0.5 times log of 2-pi, minus 0.5 times the squared error between x and z. For x = 1:

log p(1 | z) = -0.5 log(2π) - 0.5 · (1 - z)²

Take the expectation over z drawn from q, the Gaussian with mean 0.3 and variance 1. The squared term, one minus z, squared, has expectation:

E_q [(1 - z)²] = (1 - μ_q)² + σ_q² = (1 - 0.3)² + 1 = 0.49 + 1 = 1.49

So:

E_q [log p(1 | z)] = -0.5 log(2π) - 0.5 · 1.49
                 = -0.5 log(2π) - 0.745
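
The closed-form expectation can be compared against a Monte-Carlo average, which is what one would fall back on if the likelihood were not Gaussian. A minimal sketch, assuming 100,000 samples from q and a hypothetical helper `log_lik`:

```python
import math
import random

random.seed(0)
mu_q, var_q = 0.3, 1.0
x = 1.0

def log_lik(z):
    """log p(x | z) for the unit-variance Gaussian likelihood."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - z) ** 2

n = 100_000
mc_estimate = sum(log_lik(random.gauss(mu_q, math.sqrt(var_q))) for _ in range(n)) / n

# Closed form: E_q[(x - z)^2] = (x - mu_q)^2 + var_q
closed_form = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu_q) ** 2 + var_q)

print("Monte-Carlo estimate:", round(mc_estimate, 3))
print("closed form         :", round(closed_form, 3))
```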

Term 2: the KL divergence from q to p, where p is a standard normal

For two univariate Gaussians:

KL(N(μ_q, σ_q²) || N(μ_p, σ_p²))
  = log(σ_p / σ_q) + (σ_q² + (μ_q - μ_p)²) / (2 σ_p²) - 1/2

With q the Gaussian of mean 0.3 and variance 1, and p the standard normal:

KL(q || p) = log(1) + (1 + 0.09) / 2 - 0.5
          = 0 + 0.545 - 0.5
          = 0.045
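
The univariate formula is short enough to keep as a helper. The sketch below uses a hypothetical function name `kl_gauss` and takes variances rather than standard deviations, so the log-ratio term becomes half the log of the variance ratio.

```python
import math

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
            - 0.5)

kl_to_prior = kl_gauss(0.3, 1.0, 0.0, 1.0)
print("KL(q || prior):", round(kl_to_prior, 3))

# KL is not symmetric: swapping the arguments generally changes the value
forward = kl_gauss(0.3, 1.0, 0.5, 0.5)
reverse = kl_gauss(0.5, 0.5, 0.3, 1.0)
print("KL(a || b):", round(forward, 3), " KL(b || a):", round(reverse, 3))
```

The order of arguments matters: the ELBO uses the KL from q to the prior, not the other way around.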

ELBO

ELBO = E_q [log p(x | z)] - KL(q || p)
    = (-0.5 log(2π) - 0.745) - 0.045
    = -0.5 log(2π) - 0.790
    ≈ -0.5 · 1.838 - 0.790
    ≈ -0.919 - 0.790
    = -1.709

(Numerically, minus 0.5 times log of 2-pi is about minus 0.919.)

The ELBO gap

Carefully expand log p of x at x = 1, which is minus 0.5 times log of 4-pi, minus 0.25, term by term:

log p(x = 1) = -0.5 · log(2π) - 0.5 · log 2 - 0.25
           ≈ -0.919 - 0.347 - 0.25
           = -1.516

And from above, the ELBO is about minus 0.919 minus 0.790, which is minus 1.709. The gap:

log p(x) - ELBO ≈ -1.516 - (-1.709) = 0.193

Non-negative, as Jensen’s inequality requires, and strictly positive here because q differs from the true posterior. The ELBO sits below the true log-marginal by 0.193 nats.

Dual-path verification: compute the KL from q of z given x to p of z given x directly

Now compute the KL between the variational q, the Gaussian of mean 0.3 and variance 1, and the true posterior p of z given x, the Gaussian of mean 0.5 and variance 0.5:

KL(N(0.3, 1) || N(0.5, 0.5))
  = log(√0.5 / 1) + (1 + (0.3 - 0.5)²) / (2 · 0.5) - 0.5
  = 0.5 · log(0.5) + (1 + 0.04) / 1 - 0.5
  = -0.347 + 1.04 - 0.5
  = 0.193

Both paths arrive at 0.193. The ELBO-gap identity, that log p of x minus the ELBO equals the KL from q of z given x to p of z given x, holds to three decimal places. This is the dual-path verification of the ELBO machinery: one path goes through the ELBO definition, the other through the analytic KL between the variational and true posterior, and the two agree.
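
The whole hand calculation fits in a few lines of code, which also makes it easy to try other choices of q. A minimal sketch for this specific linear-Gaussian model; the function names (`kl_gauss`, `log_marginal`, `elbo`, `gap_two_ways`) are hypothetical.

```python
import math

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
            - 0.5)

def log_marginal(x):
    """log p(x) with p(x) = N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - 0.5 * x ** 2 / 2.0

def elbo(x, mu_q, var_q):
    """Reconstruction term minus KL to the N(0, 1) prior, both in closed form."""
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu_q) ** 2 + var_q)
    return recon - kl_gauss(mu_q, var_q, 0.0, 1.0)

def gap_two_ways(x, mu_q, var_q):
    """Return the gap via the ELBO definition and via KL to the true posterior N(x/2, 1/2)."""
    via_elbo = log_marginal(x) - elbo(x, mu_q, var_q)
    via_kl = kl_gauss(mu_q, var_q, x / 2.0, 0.5)
    return via_elbo, via_kl

mismatched = gap_two_ways(1.0, 0.3, 1.0)
matched = gap_two_ways(1.0, 0.5, 0.5)
print("mismatched q:", [round(g, 3) for g in mismatched])
print("q = true posterior:", [round(abs(g), 3) for g in matched])
```

Setting q to the true posterior drives both numbers to zero, which is the tight case of the bound.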

The reparameterization trick

When q of z given x is part of a deep-learning pipeline, you need gradients to flow through the sampling step z ~ q of z given x. Naively, sampling is non-differentiable: there is no gradient of “draw a random number” with respect to the parameters of q.

The reparameterization trick (Kingma & Welling, 2014) writes:

z = g_φ(ε, x)    where ε ~ p(ε)   (a fixed distribution, typically N(0, I))

The mapping g-phi is a deterministic function of epsilon, x, and the variational parameters phi. For a Gaussian variational posterior with mean mu-phi and variance sigma-phi squared, the standard choice sets z equal to the mean mu-phi plus the standard deviation sigma-phi times epsilon, with epsilon drawn from a standard normal. The randomness is “pushed out” into epsilon, which has no parameters. That expression, mu-phi plus sigma-phi times epsilon, is differentiable in phi.

The ELBO then becomes:

ELBO = E_{ε ~ N(0, I)} [ log p(x | g_φ(ε, x)) ] - KL(q_φ(z | x) || p(z))

The expectation is now over a fixed distribution over epsilon, so the gradient of the ELBO with respect to phi passes through the mapping g-phi. This is what makes variational autoencoders (VAEs) trainable by backpropagation.

For Gaussian q and p (the common case), the KL term has a closed-form expression (the formula above), so only the reconstruction term needs Monte-Carlo estimation.
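
To make the gradient flow concrete, here is a hedged sketch of reparameterized gradient ascent on the ELBO of the worked example, written without an autodiff library so every derivative is visible. Assumptions: the model is the linear-Gaussian one above with x = 1; the variational parameters are the mean mu and standard deviation sigma, optimized directly (practical code would usually optimize log sigma); the learning rate, sample count, and the name `elbo_grad_reparam` are arbitrary illustrative choices.

```python
import random

def elbo_grad_reparam(mu, sigma, x=1.0, n_samples=64):
    """Estimate d ELBO / d mu and d ELBO / d sigma using z = mu + sigma * eps."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)   # noise from a fixed, parameter-free distribution
        z = mu + sigma * eps           # z = g_phi(eps, x)
        dlogp_dz = x - z               # d/dz log N(x; z, 1)
        g_mu += dlogp_dz * 1.0         # chain rule: dz/dmu = 1
        g_sigma += dlogp_dz * eps      # chain rule: dz/dsigma = eps
    g_mu /= n_samples
    g_sigma /= n_samples
    # Closed-form gradient of KL(N(mu, sigma^2) || N(0, 1)), subtracted from the estimate
    g_mu -= mu
    g_sigma -= sigma - 1.0 / sigma
    return g_mu, g_sigma

random.seed(0)
mu, sigma = 0.3, 1.0                   # start from the mismatched q
for _ in range(2000):
    g_mu, g_sigma = elbo_grad_reparam(mu, sigma)
    mu += 0.01 * g_mu                  # gradient ascent on the ELBO
    sigma += 0.01 * g_sigma

print("learned mean    :", round(mu, 2))          # true posterior mean is 0.5
print("learned variance:", round(sigma ** 2, 2))  # true posterior variance is 0.5
```

Only the reconstruction term is estimated by sampling; the KL gradient is exact, in line with the closed-form remark above.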

Where this shows up in RL

Latent-state world models

A partially observable RL problem has observations (camera frames, sensor readings) but a latent state. A world model parameterizes the latent dynamics as:

p(s_1, ..., s_T, o_1, ..., o_T, r_1, ..., r_T | a_1, ..., a_T)
  = p(s_1) · Π_{t=1}^{T-1} p(s_{t+1} | s_t, a_t) · Π_{t=1}^{T} [ p(o_t | s_t) · p(r_t | s_t, a_t) ]

Train this by maximizing the variational ELBO over sequences. The variational posterior, the latent state given the observations and actions up to that time, runs forward through the trajectory; the prior, the latent state given the previous state and action, is the latent dynamics.

This is exactly what PlaNet (Hafner et al., 2019) and Dreamer (Hafner et al., 2019, 2020, 2021, 2023) do. The “world model” is a sequential latent-variable model trained with the ELBO. The latent representation is then used for planning or for training a model-free policy on imagined latent rollouts.
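
The sequence ELBO has the same two-term shape at every step: a reconstruction term for the observation, minus a KL from the step’s posterior to the dynamics prior. The sketch below is a toy illustration only, not PlaNet’s or Dreamer’s architecture (which use neural networks and an RSSM). It assumes a hypothetical scalar model with p(s_1) = N(0, 1), p(s_{t+1} | s_t, a_t) = N(0.9 s_t + a_t, 0.1), and p(o_t | s_t) = N(s_t, 1); it drops the reward term; and it takes the per-step Gaussian posteriors as given inputs rather than learning them.

```python
import math
import random

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
            - 0.5)

def sequence_elbo(obs, actions, post_means, post_vars, rng):
    """Single-sample ELBO estimate for the assumed scalar state-space model."""
    total = 0.0
    prev_sample = None
    for t, o in enumerate(obs):
        m, v = post_means[t], post_vars[t]
        # Expected log-likelihood of o_t under q(s_t), in closed form
        recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((o - m) ** 2 + v)
        if t == 0:
            prior_mean, prior_var = 0.0, 1.0                       # p(s_1)
        else:
            prior_mean = 0.9 * prev_sample + actions[t - 1]        # dynamics prior,
            prior_var = 0.1                                        # evaluated at a sampled s_{t-1}
        total += recon - kl_gauss(m, v, prior_mean, prior_var)
        prev_sample = rng.gauss(m, math.sqrt(v))                   # sample s_t ~ q for the next step
    return total

obs = [0.0, 1.0, 1.9]
actions = [1.0, 1.0]
good = sequence_elbo(obs, actions, [0.0, 1.0, 1.9], [0.1, 0.1, 0.1], random.Random(0))
bad = sequence_elbo(obs, actions, [3.0, -3.0, 3.0], [0.1, 0.1, 0.1], random.Random(0))
print("ELBO, posterior consistent with data and dynamics:", round(good, 2))
print("ELBO, inconsistent posterior                     :", round(bad, 2))
```

A posterior that tracks the observations and agrees with the dynamics scores much higher, which is the training signal a world model follows.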

The MuZero loss-function insight from Lesson 10 is the same idea, expressed differently: training the model for planning quality (policy + value + reward losses) is what makes the latent representation useful for control. Variational inference is the explicit probabilistic version; MuZero’s training is the implicit, end-to-end version. As of 2024 to 2025, the Dreamer line and the MuZero line are converging.

MaxEnt RL (SAC)

Soft Actor-Critic (SAC, Haarnoja et al., 2018) optimizes:

J_SAC(π) = E_{s, a ~ π} [ Σ_t γ^t · (r(s_t, a_t) + α · H(π(·|s_t))) ]

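The entropy bonus, alpha times the policy entropy, is presented in the SAC paper as an exploration regularizer. The variational view: maximum entropy is equivalent to minimum KL to a uniform prior on actions, because the policy entropy equals the log of the action-space size (or volume, for a bounded continuous space) minus the KL from the policy to uniform. Per step, the SAC objective can therefore be written as:

J(π) = E [ r(s, a) ] - α · KL(π(·|s) || uniform)

(up to an additive constant that does not depend on the policy). This is the same shape as an ELBO: reward term + KL regularizer toward a fixed prior. SAC is variational policy optimization with the uniform prior.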
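
For a finite action set the equivalence is a one-line identity: the entropy of a policy equals log of the number of actions minus its KL to uniform. The sketch below checks this on a made-up three-action policy; the rewards, alpha, and function names are illustrative assumptions.

```python
import math

def entropy(pi):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0)

def kl(pi, prior):
    """KL(pi || prior) for discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0)

pi = [0.7, 0.2, 0.1]            # made-up policy over three actions
uniform = [1.0 / 3.0] * 3
rewards = [1.0, 0.5, 0.0]       # made-up per-action rewards
alpha = 0.2

expected_reward = sum(p * r for p, r in zip(pi, rewards))
j_entropy = expected_reward + alpha * entropy(pi)       # entropy-bonus form
j_kl = expected_reward - alpha * kl(pi, uniform)        # KL-to-uniform form

print("H(pi)                   :", entropy(pi))
print("log|A| - KL(pi||uniform):", math.log(3) - kl(pi, uniform))
print("objective difference    :", j_entropy - j_kl, "vs alpha*log|A| =", alpha * math.log(3))
```

The two objectives differ by alpha times log 3, a constant that no policy can change, so they share the same maximizer.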

Picking a different prior gives a different objective. KL-regularized PPO in RLHF (Lesson 8) uses the KL divergence from the policy to the pretrained model instead: the prior is the pretrained model, not uniform. Same variational structure, different prior.
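
The effect of the prior is easiest to see in a single-state (bandit) setting, where the maximizer of expected reward minus alpha times KL to the prior has a standard closed form: the prior reweighted by exp(reward / alpha). The sketch below is an illustration with made-up rewards and a hypothetical `pretrained` prior; it is not taken from SAC or any RLHF implementation.

```python
import math
import random

def kl_objective(pi, rewards, prior, alpha):
    """Expected reward minus alpha * KL(pi || prior)."""
    reward_term = sum(p * r for p, r in zip(pi, rewards))
    kl_term = sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0)
    return reward_term - alpha * kl_term

def optimal_policy(rewards, prior, alpha):
    """Closed-form maximizer: pi*(a) proportional to prior(a) * exp(r(a) / alpha)."""
    weights = [q * math.exp(r / alpha) for q, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

rewards = [1.0, 0.5, 0.0]
alpha = 0.5
uniform = [1.0 / 3.0] * 3
pretrained = [0.1, 0.1, 0.8]    # hypothetical prior that favors the low-reward action

pi_uniform, _ = optimal_policy(rewards, uniform, alpha)
pi_pretrained, z_pretrained = optimal_policy(rewards, pretrained, alpha)
best = kl_objective(pi_pretrained, rewards, pretrained, alpha)

print("optimum under uniform prior   :", [round(p, 3) for p in pi_uniform])
print("optimum under pretrained prior:", [round(p, 3) for p in pi_pretrained])

# Spot check: no randomly drawn policy beats the closed-form optimum
rng = random.Random(0)
def random_policy(k):
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

beaten = any(kl_objective(random_policy(3), rewards, pretrained, alpha) > best + 1e-12
             for _ in range(1000))
print("any random policy better?", beaten)
```

With the uniform prior the optimum is a softmax of the rewards; with the pretrained prior the same rewards yield a policy that stays closer to what the prior already preferred.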

Lesson 12 generalizes this all the way: the full RL problem (not just the entropy bonus) is variational inference in the right graphical model.

Common pitfalls

Forgetting that the ELBO is a lower bound, not the likelihood. Maximizing the ELBO can fall short of maximum likelihood in two ways: a restrictive variational family (q can’t represent the true posterior) leaves a large gap, and a poor optimum (the optimizer settles in a local maximum of the ELBO) underperforms the true MLE.
Treating the KL regularizer as a “free” prior. The KL term is real regularization. Weighted too heavily, it pulls q toward the prior and keeps the model from fitting the data; weighted too lightly, q is free to fit each data point closely and the latent space loses the structure that supports generalization. Tune the prior or its weight (the beta in beta-VAE).
Confusing reparameterization with rejection sampling. Reparameterization is for differentiable sampling from a continuous distribution with closed-form mapping. Rejection sampling, importance sampling, and the score-function estimator are alternatives when reparameterization isn’t available (discrete distributions, non-differentiable transformations).
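Computing the KL term by MC when closed-form exists. Gaussian-to-Gaussian KL is closed-form. MC estimation here adds variance for no reason. Use the closed form whenever one exists for the pair of distributions, as it does for Gaussian q and p; sharing a parametric family is not by itself a guarantee (mixtures of Gaussians have no closed-form KL).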
Treating the entropy bonus in SAC as separate from variational inference. It is variational inference. The unified view explains why SAC’s alpha tuning matters and why the soft Q-update has its specific form.
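
The cost of estimating a closed-form KL by sampling is easy to measure. The sketch below reuses the worked example’s q (mean 0.3, variance 1) and the standard-normal prior, and assumes 32 samples per estimate and 200 repeated estimates; those counts and the helper names are arbitrary.

```python
import math
import random
import statistics

def log_normal_pdf(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at v."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

def kl_closed(mu_q, var_q, mu_p, var_p):
    """Closed-form KL between univariate Gaussians."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
            - 0.5)

def kl_mc(mu_q, var_q, mu_p, var_p, n, rng):
    """Monte-Carlo KL: average of log q(z) - log p(z) over z ~ q."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mu_q, math.sqrt(var_q))
        total += log_normal_pdf(z, mu_q, var_q) - log_normal_pdf(z, mu_p, var_p)
    return total / n

rng = random.Random(0)
exact = kl_closed(0.3, 1.0, 0.0, 1.0)
estimates = [kl_mc(0.3, 1.0, 0.0, 1.0, 32, rng) for _ in range(200)]
spread = statistics.stdev(estimates)

print("closed form             :", round(exact, 3))
print("mean of MC estimates    :", round(statistics.mean(estimates), 3))
print("std dev of a 32-sample MC estimate:", round(spread, 3))
```

In this setup the noise in a single 32-sample estimate is comparable in size to the KL value itself, noise the closed form avoids entirely.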

Why this matters when you use AI

The variational framework is the bridge between three sub-fields:

Deep generative modeling: VAEs (Kingma & Welling, 2014), normalizing flows (Rezende & Mohamed, 2015), latent diffusion models (Rombach et al., 2022).
Model-based RL: Dreamer, PlaNet, RSSM-style world models.
MaxEnt RL: SAC and the family that grew from it; KL-regularized RLHF.

All three speak the same language (ELBO + reparameterization), so cross-pollination is easy. Innovations in one sub-field transfer quickly. The RSSM latent-state model is structurally similar to a sequential VAE; SAC’s soft Q-update is the variational version of the standard Q-update; RLHF’s KL penalty is a Bayesian prior in disguise.

Lesson 12 builds on this language to recast the full RL problem as inference in a graphical model. That reformulation explains why MaxEnt RL has nice convergence properties (it is exact inference in the right model), where SAC comes from algorithmically (variational inference yields the soft Bellman backup), and why RLHF’s clipped surrogate plus KL penalty is the right combination (the clipped surrogate handles the trust region for the variational distribution, the KL penalty is the prior).

What you should remember from this lesson

The ELBO is a tractable lower bound on log p of x: the expectation under q of log p of x given z, minus the KL from q of z given x to the prior p of z. Derived by applying Jensen’s inequality to the log-marginal.
The ELBO gap, log p of x minus the ELBO, equals the KL from q of z given x to p of z given x, and measures how far the variational posterior is from the true posterior. Worked example: linear-Gaussian model with x = 1, a mismatched q (mean 0.3, variance 1), true posterior (mean 0.5, variance 0.5). Both paths to the KL gave 0.193.
Reparameterization writes the latent z as a deterministic mapping g-phi of a noise variable epsilon and the input x, with epsilon from a fixed distribution. Lets gradients flow through samples. Makes VAEs (and many RL algorithms) trainable by backpropagation.
Two main RL uses: latent-state world models (Dreamer, PlaNet) trained with the ELBO; MaxEnt RL (SAC) where the entropy bonus is a KL regularizer to uniform.
Lesson 12 will generalize: the full RL problem can be cast as variational inference in a graphical model with “optimality” as the conditioning evidence.

Next lesson: control as inference. Take the latent-variable / variational machinery developed here and apply it to the entire RL problem. The result: Bellman backups, MaxEnt RL, and inverse RL all fall out of a single variational framework.

References

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014. https://arxiv.org/abs/1312.6114 The VAE paper. Introduces the reparameterization trick in modern deep-learning form.
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML 2014. https://arxiv.org/abs/1401.4082 The parallel paper that also introduced the reparameterization trick in 2014.
Hafner, D., Lillicrap, T., Fischer, I., et al. (2019). Learning Latent Dynamics for Planning from Pixels. ICML 2019. https://arxiv.org/abs/1811.04551 PlaNet, the RSSM (recurrent state-space model) for partially observed control.
Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. https://arxiv.org/abs/1912.01603 DreamerV1.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018. https://arxiv.org/abs/1801.01290 SAC. MaxEnt RL as the soft-Bellman variational backup.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI 2008. The original MaxEnt-RL framing in inverse RL, predating SAC by a decade.
Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909 The canonical reference for Lesson 12’s control-as-inference framing. Read together with this lesson; that paper is Lesson 12’s source material.
Levine, S. (2023). CS285 lecture on Variational Inference and Generative Models. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/