Variational inference for RL: brief

Capability gained

Derive the ELBO from Jensen’s inequality applied to log p(x). Compute the ELBO for a small linear-Gaussian latent-variable model by hand and verify via dual-path (definition vs analytic posterior-KL) that log p(x) - ELBO = KL(q(z|x) || p(z|x)). Explain the reparameterization trick. Identify variational inference at work inside SAC (MaxEnt RL via uniform prior) and KL-regularized PPO (RLHF via pretrained-model prior).

Why this lesson exists

The L4-L10 dispatch-table tour answered “what does each algorithm estimate?” L11 introduces a different angle: a single unifying probabilistic-inference framework that subsumes much of modern deep RL. L11 covers the variational inference machinery (ELBO, reparameterization, two RL applications); L12 applies it to the full RL problem (control as inference).

L11 is also the prerequisite for L12’s heavier derivation. Without ELBO fluency, L12’s reformulation of RL as inference in a graphical model will not land. L11 builds the language; L12 builds the construction.

Source

Berkeley CS285 lecture on Variational Inference and Generative Models (Sergey Levine, 2023). Primary papers: Kingma & Welling (2014) VAE; Rezende et al. (2014), the concurrent reparameterization paper; PlaNet (Hafner 2019); DreamerV1-V3 (Hafner 2020/2021/2023); SAC (Haarnoja 2018); Levine (2018) control-as-inference review (which is L12’s source material).

Phase advance

Phase 2 lesson 6 (phase_order: 6). First of the conceptual-reframing pair (L11 builds the language; L12 applies it). The dispatch-table tour closed at L10; L11/L12 add a different organizing axis (variational inference / probabilistic inference) that subsumes the algorithmic families. Sets up L12 directly; supports L13’s RLHF deep-dive (KL-regularized RLHF is a special case of the L12 framing).

Lesson body (lesson.mdx)

Why VI matters for RL: two settings (latent-state world models, MaxEnt RL) use the same machinery.
The latent-variable model and the intractable log-marginal.
ELBO derivation via Jensen’s inequality applied to log E_q[p(x,z)/q(z|x)]. Two terms (reconstruction + KL regularizer).
The ELBO gap: log p(x) - ELBO = KL(q(z|x) || p(z|x)) ≥ 0. Maximizing the ELBO over model parameters raises a lower bound on the log-likelihood; maximizing it over q shrinks the posterior-approximation error.
Worked example: linear-Gaussian model, with Gaussians written N(mean, variance). p(z) = N(0, 1), p(x|z) = N(z, 1), observe x = 1. Closed-form true posterior N(0.5, 0.5) via complete-the-square. Closed-form marginal N(0, 2), so log p(x) ≈ -1.516. Mismatched variational q = N(0.3, 1). Compute ELBO ≈ -1.709, gap ≈ 0.193. Dual-path: KL(N(0.3, 1) || N(0.5, 0.5)) ≈ 0.193 ✓.
Reparameterization trick: z = μ_φ(x) + σ_φ(x) · ε for Gaussian q. Differentiable sampling.
Two RL applications: latent-state world models (Dreamer, PlaNet, RSSM); MaxEnt RL (SAC), where the entropy bonus equals the negative KL to a uniform action prior, up to an additive constant.
Connection forward to L12: the full RL problem is variational inference in a graphical model.
Common pitfalls: the ELBO is a lower bound on log p(x), not the likelihood itself; the KL term is real regularization; reparameterization ≠ rejection sampling; use the closed-form KL when q and p share a parametric family that admits one (e.g. Gaussian to Gaussian).
“Why this matters” anchors the VI bridge across generative modeling, model-based RL, MaxEnt RL.
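A compact sketch of the derivation and gap identity that the lesson-body items above outline, written in the same notation the brief uses (q(z|x) for the variational posterior, p(z) for the prior). It is meant as a reference for checking the lesson text, not as the lesson's exact presentation:

```latex
\begin{align*}
\log p(x)
  &= \log \int p(x, z)\, dz
   = \log \mathbb{E}_{q(z \mid x)}\!\left[\frac{p(x, z)}{q(z \mid x)}\right] \\
  &\ge \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
   \quad \text{(Jensen, since $\log$ is concave)} \\
  &= \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction}}
   \;-\; \underbrace{\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{regularizer}}
   \;=\; \mathrm{ELBO}(q), \\[4pt]
\log p(x) - \mathrm{ELBO}(q)
  &= \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right]
   = \mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big) \;\ge\; 0 .
\end{align*}
```

The last line follows by writing p(x, z) = p(z|x) p(x) inside the expectation, so the log p(x) terms cancel and only the posterior KL remains.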

Practice (practice.mdx)

Two exercises:

ELBO computation with dual-path verification on a fresh model. p(z) = N(0, 4), p(x|z) = N(z, 1), observe x = 2. Reader derives true posterior N(1.6, 0.8) by complete-the-square; marginal N(0, 5), so log p(x) ≈ -2.124; mismatched q = N(1.0, 1.5). Computes E_q[log p(x|z)] ≈ -2.169, KL(q || p(z)) ≈ 0.303, ELBO ≈ -2.472. Gap ≈ 0.348. Dual-path: KL(N(1.0, 1.5) || N(1.6, 0.8)) ≈ 0.348 ✓. Identity holds to three decimals.
Identify the variational ingredients in SAC and KL-regularized PPO. For each algorithm, name the latent variable z, prior p(z), variational posterior q(z|x), and “likelihood” term. SAC: z = a, p(z) = uniform, q = π(a|s), likelihood = exp(r/α). KL-PPO: z = response, p(z) = π_pretrained, q = π_θ, likelihood = exp(R/β). Synthesis: both fit the variational template E_q[reward] - τ · KL(q || prior), with temperature τ = α for SAC and τ = β for KL-PPO; choice of prior is the design knob.
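The second exercise's template can be sanity-checked on a toy discrete problem. The sketch below is illustrative only: the three-action reward vector, the two priors, and the helper names (`tilted_posterior`, `objective`, `log_partition`) are assumptions made for the example, not part of SAC, PPO, or the lesson. It checks that the entropy equals log K minus the KL to a uniform prior, and that q ∝ prior · exp(reward / temperature) attains the maximum of E_q[reward] - temperature · KL(q || prior).

```python
import math

def kl(q, p):
    """KL(q || p) for two discrete distributions over the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def entropy(q):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0.0)

def tilted_posterior(rewards, prior, temperature):
    """q(z) proportional to prior(z) * exp(reward(z) / temperature)."""
    logits = [math.log(p) + r / temperature for r, p in zip(rewards, prior)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def objective(q, rewards, prior, temperature):
    """The variational template: E_q[reward] - temperature * KL(q || prior)."""
    expected_reward = sum(qi * r for qi, r in zip(q, rewards))
    return expected_reward - temperature * kl(q, prior)

def log_partition(rewards, prior, temperature):
    """temperature * log E_prior[exp(reward / temperature)], the optimal objective value."""
    return temperature * math.log(
        sum(p * math.exp(r / temperature) for r, p in zip(rewards, prior))
    )

# Hypothetical toy problem: three actions / responses.
rewards = [1.0, 0.0, -1.0]
uniform_prior = [1.0 / 3.0] * 3        # SAC-style prior
pretrained_prior = [0.7, 0.2, 0.1]     # stand-in for a pretrained-model prior
alpha, beta = 0.5, 0.5                 # temperatures (assumed values)

q_sac = tilted_posterior(rewards, uniform_prior, alpha)
q_rlhf = tilted_posterior(rewards, pretrained_prior, beta)

# Entropy bonus vs KL to uniform: H(q) = log K - KL(q || uniform).
entropy_gap = entropy(q_sac) - (math.log(3) - kl(q_sac, uniform_prior))
```

Swapping `uniform_prior` for `pretrained_prior` is the only change between the two cases, which is the "choice of prior is the design knob" point in miniature.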

5 flashcards: ELBO derivation; Gaussian-to-Gaussian KL formula; reparameterization trick; entropy-bonus-as-variational connection; closed-form vs MC KL estimation.

Cheatsheet (cheatsheet.mdx)

One-page reference. Central identity. ELBO formula with two-term breakdown. Gaussian-to-Gaussian KL. Reparameterization trick formula. Worked example numerics reproduced. Where-VI-shows-up-in-RL table (latent-state models, SAC, KL-RLHF, MaxEnt IRL). MaxEnt-RL-as-variational derivation. Common pitfalls.
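For the reparameterization entry, a minimal sketch may help when checking the cheatsheet formula. It uses the lesson's worked model (p(x|z) = N(z, 1), x = 1, q = N(0.3, 1)) and estimates the gradient of the reconstruction term E_q[log p(x|z)] only, by pushing the derivative through z = μ + σ · ε. The function name, sample count, and seed are assumptions for illustration; the analytic gradients for this model are x - μ with respect to μ and -σ with respect to σ.

```python
import random

def reparam_gradients(mu, sigma, x, num_samples, seed=0):
    """Pathwise (reparameterized) Monte Carlo gradients of E_q[log p(x|z)].

    Assumes q(z) = N(mu, sigma^2) and p(x|z) = N(z, 1), so that
    d/dz log p(x|z) = x - z. Sampling is z = mu + sigma * eps, eps ~ N(0, 1).
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)      # noise drawn independently of (mu, sigma)
        z = mu + sigma * eps           # differentiable function of (mu, sigma)
        dlogp_dz = x - z               # derivative of log N(x; z, 1) w.r.t. z
        grad_mu += dlogp_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += dlogp_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_gradients(mu=0.3, sigma=1.0, x=1.0, num_samples=200_000)

# Analytic values for comparison: d/dmu = x - mu = 0.7, d/dsigma = -sigma = -1.0
analytic_mu, analytic_sigma = 1.0 - 0.3, -1.0
```

The randomness lives entirely in `eps`, so the sample `z` is a deterministic, differentiable function of μ and σ. In an autodiff framework the two chain-rule lines would be handled automatically.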

Summary (summary.mdx)

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph (VI bridge across generative modeling, model-based RL, MaxEnt RL). Worked-check memory anchor with the dual-path 0.193 numerics. Where this fits in the track arc (L12 generalizes, L13 RLHF as special case).

References (references.mdx)

Reparameterization: Kingma & Welling (2014), Rezende et al. (2014) co-foundational. VI fundamentals: Blei, Kucukelbir & McAuliffe (2017) modern review; Bishop (2006) ch 10; Murphy (2012, 2023). Latent-state world models: PlaNet (Hafner 2019), DreamerV1/V2/V3. MaxEnt RL: SAC (Haarnoja 2018), Soft Q-learning (Haarnoja 2017), MaxEnt IRL (Ziebart 2008). Control as inference (L12 source): Levine (2018), Toussaint & Storkey (2006). RLHF: Ouyang et al. (2022), Bai et al. (2022). Recent extensions: latent diffusion (Rombach 2022). Course: CS285 L18.

Editorial discipline

Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: ELBO, KL, VI, VAE, RL, RLHF, MDP, MLE, MC, IRL, MaxEnt, RSSM, PlaNet, Dreamer, DreamerV1, DreamerV2, DreamerV3, SAC, PPO, DQN, DDPG, NeurIPS, ICML, ICLR, JMLR, AAAI, MIT, CVPR.
No vendor naming triggers (paper authors, course instructors, algorithm names only). The Levine 2018 review paper is named as L12’s source material, which is a course-source citation, not vendor framing.
§6 status: standard pipeline, no triggers. L12 forward reference is the natural continuation. L13 RLHF reference is properly deferred.

Word counts

Lesson 2800
Cheatsheet 660
Practice 2010
Summary 695
Brief 870
References 580

Total ≈ 7615 words across 6 artifacts. Math-heavy band; appropriate for the conceptual-density lesson.

Notes for promotion

Component placeholders (�J0�, �J1�) live as MDX comments.
Practice imports real �J0� + �J1� components.
Numerics: the linear-Gaussian dual-path verification (gap ≈ 0.193 by both paths) agrees to three decimals and should pass independent verification. The practice’s fresh model with prior variance σ_p² = 4 verifies the same way at 0.348.
Continues phase-boundary cadence; Phase 2 boundary check after L12 (next lesson closes Phase 2).
The “VI subsumes the algorithmic families” framing is the load-bearing pedagogical move; should be preserved through any future edits. The MuZero loss-function-insight callout (the L10 → L11 cross-reference) connects to dev-03’s T24L7 representation-learning framing per the advisor’s “cross-track learning loop” observation; preserve that linkage.
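One way to run the independent verification mentioned in the numerics note is a short script that computes the gap by both paths for the lesson and practice models. This is a sketch under the brief's stated setup (scalar model, p(z) = N(0, prior variance), p(x|z) = N(z, 1), all Gaussians parameterized by variance); the function names are hypothetical.

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for scalar Gaussians (variances, not std devs)."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def dual_path(prior_var, x, mu_q, var_q):
    """Return (elbo, gap, posterior_kl) for p(z)=N(0, prior_var), p(x|z)=N(z, 1)."""
    # Path 1: ELBO from its definition, E_q[log p(x|z)] - KL(q || p(z)).
    expected_loglik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mu_q) ** 2 + var_q)
    elbo = expected_loglik - gaussian_kl(mu_q, var_q, 0.0, prior_var)
    # Exact log-marginal: x ~ N(0, prior_var + 1).
    marginal_var = prior_var + 1.0
    log_px = -0.5 * math.log(2.0 * math.pi * marginal_var) - x ** 2 / (2.0 * marginal_var)
    gap = log_px - elbo
    # Path 2: KL from q to the analytic posterior (complete the square).
    post_var = 1.0 / (1.0 / prior_var + 1.0)
    post_mu = post_var * x
    posterior_kl = gaussian_kl(mu_q, var_q, post_mu, post_var)
    return elbo, gap, posterior_kl

# Lesson worked example: prior N(0, 1), x = 1, q = N(0.3, 1).
lesson_elbo, lesson_gap, lesson_kl = dual_path(prior_var=1.0, x=1.0, mu_q=0.3, var_q=1.0)

# Practice exercise: prior N(0, 4), x = 2, q = N(1.0, 1.5).
practice_elbo, practice_gap, practice_kl = dual_path(prior_var=4.0, x=2.0, mu_q=1.0, var_q=1.5)
```

Both models should reproduce the brief's figures to three decimals (ELBO ≈ -1.709 and gap ≈ 0.193 for the lesson; ELBO ≈ -2.472 and gap ≈ 0.348 for the practice), with `gap` and `posterior_kl` agreeing up to floating-point error.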