Cheatsheet: Variational inference for RL

log p(x) = ELBO + KL(q(z|x) || p(z|x))

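The ELBO is a lower bound on log p(x); the gap is the KL divergence from the variational posterior q(z|x) to the true posterior p(z|x). Because log p(x) does not depend on q, maximizing the ELBO over q shrinks that gap; maximizing it over the model parameters pushes up a lower bound on the likelihood.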

ELBO = E_{z ~ q(z|x)} [ log p(x | z) ] - KL(q(z|x) || p(z))
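| Term | Role |
| --- | --- |
| `E_q [log p(x\|z)]` | Reconstruction: expected log-likelihood of x under latents sampled from q |
| `KL(q(z\|x) \|\| p(z))` | Regularization: keeps q close to the prior p(z) |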
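For reference, both forms of the identity above follow from expanding log p(x) under q. The derivation below is a sketch that uses only the symbols already defined (q(z|x), p(z), p(x|z), p(z|x)) and assumes all expectations are finite.

```latex
\begin{align*}
\log p(x)
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x)\big]
   = \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{p(z \mid x)}\right] \\
  &= \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]}_{\text{ELBO}}
   + \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right]}_{\mathrm{KL}(q(z \mid x)\,\|\,p(z \mid x)) \;\ge\; 0} \\[4pt]
\text{ELBO}
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
   + \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(z)}{q(z \mid x)}\right] \\
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
   - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
\end{align*}
```

Since the KL term in the second line is non-negative, dropping it gives the lower bound; it vanishes exactly when q(z|x) equals the true posterior.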

For q = N(μ_q, σ_q²), p = N(μ_p, σ_p²):

KL(q || p) = log(σ_p / σ_q) + (σ_q² + (μ_q - μ_p)²) / (2 σ_p²) - 1/2

For multivariate Gaussians with diagonal Σ, sum over dimensions.
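The closed form above translates directly into code. The following is a minimal sketch in plain Python; the function names `kl_gauss` and `kl_diag_gauss` are illustrative, and the arguments are standard deviations, not variances.

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for scalars."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL between diagonal Gaussians: the sum of per-dimension scalar KLs."""
    return sum(kl_gauss(mq, sq, mp, sp)
               for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p))

# Illustrative values: q = N(0.3, 1) against a standard normal prior.
kl_1d = kl_gauss(0.3, 1.0, 0.0, 1.0)   # (1 + 0.09)/2 - 0.5 = 0.045
# Second dimension has q equal to p, so it contributes zero.
kl_2d = kl_diag_gauss([0.3, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

The diagonal case reduces to a sum because a diagonal Gaussian factorizes over dimensions, and KL is additive over independent factors.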

z = g_φ(ε, x) where ε ~ p(ε) (fixed, no parameters)

For q_φ(z|x) = N(μ_φ(x), σ_φ²(x)):

z = μ_φ(x) + σ_φ(x) · ε, ε ~ N(0, 1)

This moves the randomness into ε, so gradients with respect to φ flow through the sample z. Critical for training VAEs and the SAC actor.
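To make the gradient path concrete, here is a small sketch that estimates the gradient of E[z²] with respect to μ and σ by differentiating through the reparameterized sample by hand. The test function f(z) = z², the helper names, and the sample count are assumptions chosen for illustration; in practice an autodiff library would carry out the chain rule.

```python
import random

def reparam_sample(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of noise eps."""
    eps = rng.gauss(0.0, 1.0)      # parameter-free noise, eps ~ N(0, 1)
    return mu + sigma * eps, eps

def pathwise_grads(mu, sigma, n_samples=200_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] via the reparameterization trick."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        z, eps = reparam_sample(mu, sigma, rng)
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        g_mu += df_dz * 1.0        # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps     # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

grad_mu, grad_sigma = pathwise_grads(mu=0.3, sigma=1.0)
# Analytic check: E[z^2] = mu^2 + sigma^2, so the exact gradients are
# 2*mu = 0.6 and 2*sigma = 2.0; the estimates should land close to these.
```

The point of the construction is that `eps` carries no parameters, so every dependence of the sample on μ and σ is an ordinary differentiable expression.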

Model: p(z) = N(0, 1), p(x|z) = N(z, 1). Observe x = 1.

True posterior (closed form by Bayes + complete-the-square): p(z|x=1) = N(0.5, 0.5).

Marginal: p(x) = N(0, 2). So log p(x=1) = -0.5 log(4π) - 0.25 ≈ -1.516.

Mismatched variational: q(z|x=1) = N(0.3, 1). Compute:

E_q[(1-z)²] = (1-0.3)² + 1 = 1.49
E_q[log p(x|z)] = -0.5 log(2π) - 0.5 · 1.49 ≈ -0.919 - 0.745 = -1.664
KL(q(z|x) || p(z)) = log(1/1) + (1 + 0.09)/2 - 0.5 = 0.045
ELBO = -1.664 - 0.045 = -1.709

ELBO gap: log p(x) - ELBO = -1.516 - (-1.709) = 0.193.

Dual-path: compute KL(q(z|x) || p(z|x)) directly:

KL(N(0.3, 1) || N(0.5, 0.5))
= log(√0.5 / 1) + (1 + 0.04)/(2 · 0.5) - 0.5
= -0.347 + 1.04 - 0.5
= 0.193 ✓

Both paths arrive at gap = 0.193. The central identity holds to the digit.
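The arithmetic in this worked example can be replayed in a few lines. The sketch below uses closed-form Gaussian expressions only and parameterizes distributions by variance to match the N(mean, variance) notation of the example; the helper names are illustrative.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)), parameterized by variances."""
    return (0.5 * math.log(var_p / var_q)
            + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
            - 0.5)

x = 1.0
mu_q, var_q = 0.3, 1.0                      # mismatched q(z|x=1) = N(0.3, 1)

log_px = log_normal_pdf(x, 0.0, 2.0)        # marginal p(x) = N(0, 2)

# Path 1: ELBO = E_q[log p(x|z)] - KL(q || prior), with p(x|z) = N(z, 1).
expected_sq_err = (x - mu_q) ** 2 + var_q   # E_q[(x - z)^2] = 1.49
recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * expected_sq_err
elbo = recon - kl_gauss(mu_q, var_q, 0.0, 1.0)
gap = log_px - elbo

# Path 2: KL from q to the true posterior N(0.5, 0.5).
kl_to_posterior = kl_gauss(mu_q, var_q, 0.5, 0.5)

print(f"log p(x) = {log_px:.3f}, ELBO = {elbo:.3f}")            # -1.516, -1.709
print(f"gap = {gap:.3f}, KL(q || posterior) = {kl_to_posterior:.3f}")  # 0.193, 0.193
```

The two quantities agree to floating-point precision, not just to three decimals, because the identity is exact.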

| Setting | Variational object | Algorithm |
| --- | --- | --- |
| Partially observed RL, latent-state world models | Sequential q(s_t \| o_{1:t}) over latent states | Dreamer |
| Maximum-entropy RL | KL(π(·\|s) \|\| uniform) | SAC |
| KL-regularized RLHF | KL(π_θ \|\| π_pretrained) | KL-regularized PPO |
| Inverse RL with energy-based models | Variational max-likelihood of expert trajectories | MaxEnt IRL (Ziebart 2008) |

The unifying view: the entropy bonus, the KL-to-pretrained penalty, and the variational world model are all instances of the same machinery applied to different parts of the RL pipeline.

Standard policy gradient maximizes E[r(s, a)]. SAC adds entropy:

J_SAC(π) = E [ r(s, a) + α · H(π(·|s)) ]
= E [ r(s, a) ] - α · KL(π(·|s) || uniform) (up to constants)

The entropy bonus IS a KL regularizer toward the uniform action prior. Picking a different prior gives a different objective:

| Prior | Objective | Use case |
| --- | --- | --- |
| Uniform | SAC, MaxEnt RL | Exploration regularization |
| Pretrained model | KL-regularized PPO | RLHF |
| Demonstration policy | Behavioral-cloning regularizer | Imitation-bootstrap |
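The equivalence between the entropy bonus and a KL penalty toward a uniform prior can be checked numerically. The sketch below assumes a single state with a finite action set, where the uniform prior is well defined and the constant is α · log|A|; the rewards, policies, and function names are made up for illustration.

```python
import math

def entropy(pi):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)

def kl(pi, prior):
    """KL(pi || prior) for discrete distributions over the same actions."""
    return sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0.0)

def kl_regularized_value(rewards, pi, prior, alpha):
    """One-state objective: E_pi[r] - alpha * KL(pi || prior)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - alpha * kl(pi, prior)

# Hypothetical 3-action example; all numbers are illustrative.
rewards = [1.0, 0.0, 0.5]
pi = [0.6, 0.1, 0.3]
alpha = 0.2
uniform = [1.0 / 3.0] * 3

expected_reward = sum(p * r for p, r in zip(pi, rewards))
entropy_form = expected_reward + alpha * entropy(pi)       # SAC-style objective
kl_form = kl_regularized_value(rewards, pi, uniform, alpha)
constant = alpha * math.log(len(pi))                       # the "up to constants" term

# Swapping the prior changes the objective, e.g. a hypothetical pretrained policy.
pretrained = [0.5, 0.2, 0.3]
rlhf_style = kl_regularized_value(rewards, pi, pretrained, alpha)
```

Here `entropy_form - kl_form` equals `constant` regardless of the policy, so both objectives rank policies identically. For continuous unbounded actions a uniform prior is improper, and the statement holds only in the "up to constants" sense.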

Common pitfalls:
  • Forgetting that the ELBO is a lower bound, not the likelihood
  • Treating KL as a “free” regularizer (its weight trades the main objective against closeness to the prior)
  • Confusing reparameterization with rejection sampling
  • MC-estimating Gaussian-to-Gaussian KL when closed form exists
  • Treating SAC’s entropy bonus as separate from variational inference (it is a KL regularizer toward a uniform prior)

Key takeaways:
  • The ELBO is E_q[log p(x|z)] - KL(q(z|x) || p(z)); derivable from Jensen’s inequality.
  • ELBO gap = KL(q(z|x) || p(z|x)); the bound is tight when q matches the true posterior.
  • Reparameterization makes sampling differentiable.
  • Two RL uses: latent-state world models (Dreamer), MaxEnt RL (SAC).
  • L12 generalizes: full RL = variational inference with “optimality” as evidence.