Brief: Variational inference for RL
Capability gained
Derive the ELBO from Jensen’s inequality applied to log p(x). Compute the ELBO for a small linear-Gaussian latent-variable model by hand and verify via dual-path (definition vs analytic posterior-KL) that log p(x) - ELBO = KL(q(z|x) || p(z|x)). Explain the reparameterization trick. Identify variational inference at work inside SAC (MaxEnt RL via uniform prior) and KL-regularized PPO (RLHF via pretrained-model prior).
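For reference, the two identities named above can be sketched in a few lines. The notation (a joint p(x, z) and a variational distribution q(z|x) whose support covers the posterior) follows this brief; the lesson itself may order the steps differently.

```latex
% Step 1: Jensen's inequality applied to log p(x) (log is concave).
\begin{align*}
\log p(x)
  &= \log \int p(x, z)\, dz
   = \log \mathbb{E}_{q(z \mid x)}\!\left[ \frac{p(x, z)}{q(z \mid x)} \right] \\
  &\ge \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right] \\
  &= \underbrace{\mathbb{E}_{q(z \mid x)}\big[ \log p(x \mid z) \big]}_{\text{reconstruction}}
   - \underbrace{\mathrm{KL}\big( q(z \mid x) \,\|\, p(z) \big)}_{\text{regularizer}}
   = \mathrm{ELBO}(q).
\end{align*}

% Step 2: the gap, obtained by substituting p(x, z) = p(z | x) p(x).
\begin{align*}
\mathrm{ELBO}(q)
  &= \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(z \mid x)\, p(x)}{q(z \mid x)} \right]
   = \log p(x) - \mathrm{KL}\big( q(z \mid x) \,\|\, p(z \mid x) \big).
\end{align*}
```

Since the KL term is non-negative, the second identity recovers the bound without Jensen and shows it is tight exactly when q(z|x) equals the true posterior.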
Why this lesson exists
The L4-L10 dispatch-table tour answered “what does each algorithm estimate?” L11 introduces a different angle: a single unifying probabilistic-inference framework that subsumes much of modern deep RL. L11 covers the variational inference machinery (ELBO, reparameterization, two RL applications); L12 applies it to the full RL problem (control as inference).
L11 is also the prerequisite for L12’s heavier derivation. Without ELBO fluency, L12’s reformulation of RL as inference in a graphical model will not land. L11 builds the language; L12 builds the construction.
Source
Berkeley CS285 lecture on Variational Inference and Generative Models, Sergey Levine, 2023. Primary papers: Kingma & Welling (2014) VAE; Rezende et al. (2014), the parallel reparameterization paper; PlaNet (Hafner 2019); DreamerV1-V3 (Hafner 2020/2021/2023); SAC (Haarnoja 2018); Levine (2018) control-as-inference review (which is L12’s source material).
Phase advance
Phase 2 lesson 6 (phase_order: 6). First of the conceptual-reframing pair (L11 builds the language; L12 applies it). The dispatch-table tour closed at L10; L11/L12 add a different organizing axis (variational inference / probabilistic inference) that subsumes the algorithmic families. Sets up L12 directly; supports L13’s RLHF deep-dive (KL-regularized RLHF is a special case of the L12 framing).
Lesson body (lesson.mdx)
- Why VI matters for RL: two settings (latent-state world models, MaxEnt RL) use the same machinery.
- The latent-variable model and the intractable log-marginal.
- ELBO derivation via Jensen’s inequality applied to log p(x) = log E_q[p(x,z)/q(z|x)]. Two terms (reconstruction + KL regularizer).
- The ELBO gap = KL(q(z|x) || p(z|x)). Because log p(x) = ELBO + gap, maximizing the ELBO over q shrinks the posterior-approximation error, while maximizing it over model parameters pushes up a lower bound on the likelihood.
- Worked example: linear-Gaussian model, written N(mean, variance) throughout. p(z) = N(0, 1), p(x|z) = N(z, 1), observe x = 1. Closed-form true posterior N(0.5, 0.5) via complete-the-square. Closed-form marginal N(0, 2). Mismatched variational q = N(0.3, 1). Compute ELBO ≈ -1.709, gap ≈ 0.193. Dual-path: KL(N(0.3, 1) || N(0.5, 0.5)) = 0.193 ✓.
- Reparameterization trick: z = μ_φ(x) + σ_φ(x) · ε with ε ~ N(0, 1) for Gaussian q. Differentiable sampling.
- Two RL applications: latent-state world models (Dreamer, PlaNet, RSSM); MaxEnt RL (SAC), where the entropy bonus equals the negative KL to a uniform prior, up to a constant.
- Connection forward to L12: the full RL problem is variational inference in a graphical model.
- Common pitfalls: ELBO is a lower bound (not the likelihood); KL is real regularization; reparameterization ≠ rejection sampling; the KL has a closed form when both distributions come from a tractable shared family (e.g. Gaussian-to-Gaussian).
- “Why this matters” anchors the VI bridge across generative modeling, model-based RL, MaxEnt RL.
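The worked-example numerics in the list above can be checked independently with a short script. This is a minimal sketch, not the lesson's own code: the function names `gaussian_kl` and `dual_path_check` are hypothetical, and it assumes the linear-Gaussian model stated above with N(mean, variance) parameterization.

```python
import math


def gaussian_kl(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def dual_path_check(prior_var, x, q_mean, q_var):
    """Model: p(z) = N(0, prior_var), p(x|z) = N(z, 1), q = N(q_mean, q_var)."""
    # Path 1: gap from the definition, log p(x) - ELBO.
    marg_var = prior_var + 1.0  # marginal p(x) = N(0, prior_var + 1)
    log_px = -0.5 * math.log(2 * math.pi * marg_var) - x ** 2 / (2 * marg_var)
    # E_q[(x - z)^2] = (x - q_mean)^2 + q_var
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - q_mean) ** 2 + q_var)
    kl_to_prior = gaussian_kl(q_mean, q_var, 0.0, prior_var)
    elbo = recon - kl_to_prior
    gap_definition = log_px - elbo

    # Path 2: analytic posterior (complete the square), then KL(q || posterior).
    post_var = 1.0 / (1.0 / prior_var + 1.0)
    post_mean = post_var * x
    gap_kl = gaussian_kl(q_mean, q_var, post_mean, post_var)

    return {
        "recon": recon,
        "kl_to_prior": kl_to_prior,
        "elbo": elbo,
        "gap_definition": gap_definition,
        "gap_kl": gap_kl,
    }


# Lesson numbers: p(z) = N(0, 1), x = 1, q = N(0.3, 1).
result = dual_path_check(prior_var=1.0, x=1.0, q_mean=0.3, q_var=1.0)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Both gap entries should agree and round to 0.193, with the ELBO rounding to -1.709. Calling the same function with `prior_var=4.0, x=2.0, q_mean=1.0, q_var=1.5` covers the practice model.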
Practice (practice.mdx)
Two exercises:
- ELBO computation with dual-path verification on a fresh model. p(z) = N(0, 4), p(x|z) = N(z, 1), observe x = 2. Reader derives true posterior N(1.6, 0.8) by complete-the-square; marginal N(0, 5); mismatched q = N(1.0, 1.5). Computes E_q[log p(x|z)] ≈ -2.169, KL(q || p(z)) = 0.303, ELBO ≈ -2.472. Gap ≈ 0.348. Dual-path: KL(N(1.0, 1.5) || N(1.6, 0.8)) = 0.348 ✓. Identity holds to three decimals.
- Identify the variational ingredients in SAC and KL-regularized PPO. For each algorithm, name the latent variable z, prior p(z), variational posterior q(z|x), and “likelihood” term. SAC: z = a, p(z) = uniform, q = π(a|s), likelihood = exp(r/α). KL-PPO: z = response, p(z) = π_pretrained, q = π_θ, likelihood = exp(R/β). Synthesis: both fit the variational template E[reward] - KL(policy || prior); choice of prior is the design knob.
5 flashcards: ELBO derivation; Gaussian-to-Gaussian KL formula; reparameterization trick; entropy-bonus-as-variational connection; closed-form vs MC KL estimation.
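The shared template in exercise 2 can be written out explicitly. This is a sketch under this brief's identifications (temperature α for SAC, β for KL-PPO); the generic reward symbol R(z), the action-set volume |A|, and the single-step form are notational assumptions, not taken from the lesson text.

```latex
% KL-regularized objective over a policy q with prior p and temperature \beta.
\begin{align*}
\max_{q}\;\; \mathbb{E}_{z \sim q}\big[ R(z) \big]
  - \beta\, \mathrm{KL}\big( q \,\|\, p \big)
\quad\Longrightarrow\quad
q^{*}(z) \;\propto\; p(z)\, \exp\!\big( R(z) / \beta \big).
\end{align*}

% SAC case: p uniform on a bounded action set \mathcal{A} of volume |\mathcal{A}|.
\begin{align*}
\mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \mathcal{U}(\mathcal{A}) \big)
  &= -\mathcal{H}\big( \pi(\cdot \mid s) \big) + \log |\mathcal{A}|, \\
\mathbb{E}_{a \sim \pi}\big[ r(s, a) \big]
  - \alpha\, \mathrm{KL}\big( \pi \,\|\, \mathcal{U} \big)
  &= \mathbb{E}_{a \sim \pi}\big[ r(s, a) \big]
  + \alpha\, \mathcal{H}(\pi) - \alpha \log |\mathcal{A}|.
\end{align*}
```

The constant α log|A| does not affect the optimizer, which is why the entropy bonus and the KL-to-uniform penalty give the same policy. Swapping the uniform prior for π_pretrained yields the KL-PPO form.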
Cheatsheet (cheatsheet.mdx)
One-page reference. Central identity. ELBO formula with two-term breakdown. Gaussian-to-Gaussian KL. Reparameterization trick formula. Worked example numerics reproduced. Where-VI-shows-up-in-RL table (latent-state models, SAC, KL-RLHF, MaxEnt IRL). MaxEnt-RL-as-variational derivation. Common pitfalls.
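As an illustration of the reparameterization formula listed above, the following sketch estimates a pathwise gradient on the lesson's linear-Gaussian model using only the standard library. The function name `reparam_grad_mu` is hypothetical, and the derivative of log p(x, z) is written by hand here, where a real implementation would rely on an autodiff framework.

```python
import random


def reparam_grad_mu(mu, sigma, x, n_samples=100_000, seed=0):
    """Pathwise Monte Carlo estimate of d/dmu E_q[log p(x, z)], q = N(mu, sigma^2).

    Assumed model (from the worked example): p(z) = N(0, 1), p(x|z) = N(z, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
        z = mu + sigma * eps       # reparameterized sample, so dz/dmu = 1
        # Hand-derived d/dz log p(x, z) = (x - z) - z for this model.
        total += (x - z) - z
    return total / n_samples


mu, sigma, x = 0.3, 1.0, 1.0
estimate = reparam_grad_mu(mu, sigma, x)
# The entropy of q does not depend on mu, so this is also d(ELBO)/dmu.
analytic = x - 2 * mu
print(f"pathwise estimate: {estimate:.3f}, analytic: {analytic:.3f}")
```

Because the randomness sits in ε rather than in z, the gradient passes through the sample itself. The estimate should land close to the analytic value x - 2μ, with Monte Carlo error shrinking as `n_samples` grows.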
Summary (summary.mdx)
5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph (VI bridge across generative modeling, model-based RL, MaxEnt RL). Worked-check memory anchor with the dual-path 0.193 numerics. Where this fits in the track arc (L12 generalizes, L13 RLHF as special case).
References (references.mdx)
Reparameterization: Kingma & Welling (2014), Rezende et al. (2014) co-foundational. VI fundamentals: Blei, Kucukelbir & McAuliffe (2017) modern review; Bishop (2006) ch 10; Murphy (2012, 2023). Latent-state world models: PlaNet (Hafner 2019), DreamerV1/V2/V3. MaxEnt RL: SAC (Haarnoja 2018), Soft Q-learning (Haarnoja 2017), MaxEnt IRL (Ziebart 2008). Control as inference (L12 source): Levine (2018), Toussaint & Storkey (2006). RLHF: Ouyang et al. (2022), Bai et al. (2022). Recent extensions: latent diffusion (Rombach 2022). Course: CS285 L18.
Editorial discipline
- Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: ELBO, KL, VI, VAE, RL, RLHF, MDP, MLE, MC, IRL, MaxEnt, RSSM, PlaNet, Dreamer, DreamerV1, DreamerV2, DreamerV3, SAC, PPO, DQN, DDPG, NeurIPS, ICML, ICLR, JMLR, AAAI, MIT, CVPR.
- No vendor naming triggers (paper authors, course instructors, algorithm names only). The Levine 2018 review paper is named as L12’s source material, which is a course-source citation, not vendor framing.
- §6 status: standard pipeline, no triggers. L12 forward reference is the natural continuation. L13 RLHF reference is properly deferred.
Word counts
- Lesson 2800
- Cheatsheet 660
- Practice 2010
- Summary 695
- Brief 870
- References 580
Total ≈ 7615 words across 6 artifacts. Math-heavy band; appropriate for the conceptual-density lesson.
Notes for promotion
- Component placeholders (�J0�, �J1�) live as MDX comments.
- Practice imports real �J0� + �J1� components.
- Numerics: the linear-Gaussian dual-path verification (gap = 0.193 = KL by two paths) is exact arithmetic to three decimals; should pass independent verification. The practice’s fresh model with σ_p² = 4 similarly verifies dual-path at 0.348.
- Continues phase-boundary cadence; Phase 2 boundary check after L12 (next lesson closes Phase 2).
- The “VI subsumes the algorithmic families” framing is the load-bearing pedagogical move; should be preserved through any future edits. The MuZero loss-function-insight callout (the L10 → L11 cross-reference) connects to dev-03’s T24L7 representation-learning framing per the advisor’s “cross-track learning loop” observation; preserve that linkage.