Brief: Control as inference (closes Phase 2)
Capability gained
Construct the graphical model with optimality variables O_t that turns RL into variational inference. Derive the soft Bellman backup V_soft(s) = α · log Σ_a exp(Q_soft(s, a) / α). Compute soft Q-values and the resulting policy by hand on a small example. Verify both limits α → 0 (hard Bellman) and α → ∞ (uniform policy). Recognize SAC as the practical algorithm implementing this backup; recognize RLHF (and DPO) as the same framework with the pretrained model as the prior. Closes Phase 2 of Track 18.
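The two limits can be read off a standard log-sum-exp sandwich bound. The sketch below assumes a finite action set, written |A| (notation introduced here, not in the brief), and bounded Q_soft:

```latex
\begin{gather*}
V_{\text{soft}}(s) = \alpha \log \sum_{a} \exp\!\big(Q_{\text{soft}}(s,a)/\alpha\big), \\
\max_{a} Q_{\text{soft}}(s,a) \;\le\; V_{\text{soft}}(s) \;\le\; \max_{a} Q_{\text{soft}}(s,a) + \alpha \log |\mathcal{A}|, \\
\pi_{\text{soft}}(a \mid s) = \exp\!\big((Q_{\text{soft}}(s,a) - V_{\text{soft}}(s))/\alpha\big).
\end{gather*}
```

As α → 0 the gap α log |A| vanishes, so V_soft collapses to max_a Q_soft (hard Bellman). As α → ∞ every ratio Q_soft / α goes to 0, so all actions get equal weight and π_soft tends to 1 / |A| (uniform policy).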
Why this lesson exists
L11 introduced the variational language (ELBO, reparameterization, two RL applications). L12 redeems L11’s promise: the entire RL problem is variational inference. The Levine (2018) tutorial review is the canonical reference; this lesson follows its construction with the fleet’s worked-example + dual-path discipline applied throughout.
L12 is the conceptual capstone of Phase 2. The L4-L10 dispatch-table tour answered “what does each algorithm estimate?”; L11-L12 answer “is there a single principled objective from which all these algorithms fall out?” The answer is yes, and this lesson makes the connection concrete with worked examples (soft Bellman by hand, both limits verified) and explicit unification (SAC, RLHF, DPO as three samplers from the same variational posterior).
This lesson also carries forward the fleet-pattern observation the advisor flagged: “the loss function determines what the model learns.” MuZero (L10), variational inference (L11), control-as-inference (L12), and JEPA-style representation learning (Track 24, dev-03’s T24L7) are all incarnations of the same insight. The cross-track coherence linkage is preserved through the lesson’s “Fleet pattern” section.
Source
Berkeley CS285 lecture on Reframing Control as an Inference Problem, Sergey Levine, 2023. Primary canonical paper: Levine (2018) “Reinforcement Learning and Control as Probabilistic Inference” (arXiv:1805.00909). Pre-deep-learning predecessors: Toussaint & Storkey (2006), Kappen (2005), Todorov (2009). Practical algorithms: Soft Q-learning (Haarnoja 2017), SAC (Haarnoja 2018), MaxEnt IRL (Ziebart 2008), InstructGPT (Ouyang 2022), DPO (Rafailov 2023).
Phase advance
Phase 2 lesson 7 (phase_order: 7). Final lesson of Phase 2. After L12 comes the Phase 2 → Phase 3 boundary checkpoint, covering L6-L12 as a unit (DQN, PPO, model-based pair, variational inference + control-as-inference). Phase 3 opens at L13 with RLHF as the production killer application.
Lesson body (lesson.mdx)
- Recap of L11; this lesson redeems L11’s promise.
- Why this reformulation: the algorithm zoo (DQN squared TD, PPO clipped surrogate, SAC entropy bonus, RLHF KL penalty) all derive from one variational principle.
- The graphical model: introduce optimality variables O_t ∈ {0, 1} with p(O_t = 1 | s, a) ∝ exp(r / α). Joint distribution; RL = inference over actions given O_{1:T} = 1.
- Backward message passing: derive the soft Bellman backup formally. β(s, a) = p(O_{t:T} = 1 | s, a) satisfies the backward recursion; taking logs and defining Q_soft = α · log β gives Q_soft(s, a) = r(s, a) + α · log E_{s'}[exp(V_soft(s') / α)] (with the stochastic-dynamics correction to this log-E-exp term noted). V_soft(s) = α · log Σ_a exp(Q_soft(s, a) / α). The soft policy π_soft(a | s) = exp((Q_soft(s, a) - V_soft(s)) / α).
- Two limits sanity-check: α → 0 recovers hard Bellman (max replaces log-sum-exp); α → ∞ recovers uniform policy. Real systems pick α ∈ [0.01, 1.0].
- Worked example: 2-action terminal MDP with r = (1, 0), α = 1. V_soft = log(e + 1) ≈ 1.3133; π_soft ≈ (0.731, 0.269). Sum check (probability normalizes); limit checks (at α = 0.01: greedy; at α = 100: uniform).
- SAC implements this backup: soft Q-critic + reparameterized stochastic actor; the KL-to-Boltzmann-posterior gradient is the actor update.
- RLHF is the same framework with pretrained prior. Full RLHF objective L = E[R] - β · KL(π_θ || π_pretrained) is the variational ELBO; PPO is the practical optimizer for the surrogate.
- The fleet pattern: training objective determines what the model learns. Cross-references MuZero (L10), JEPA (Track 24).
- Common pitfalls (α vs γ confusion, stochastic-dynamics correction, prior choice).
- “Why this matters” anchors the L13-onward production applications.
- “What you should remember” closes the lesson and Phase 2.
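The soft backup, the worked example, and the two limit checks listed above can be reproduced in a few lines. This is a minimal sketch for a single state with a finite action set; the function name `soft_backup` is hypothetical, and the max-subtraction trick is a standard numerical-stability choice rather than something the brief prescribes.

```python
import math

def soft_backup(q_values, alpha):
    """Return (V_soft, pi_soft) for one state.

    V_soft     = alpha * log(sum_a exp(Q(a) / alpha))
    pi_soft(a) = exp((Q(a) - V_soft) / alpha)
    """
    # Subtract the max before exponentiating so small alpha does not overflow.
    q_max = max(q_values)
    shifted = [math.exp((q - q_max) / alpha) for q in q_values]
    v_soft = q_max + alpha * math.log(sum(shifted))
    pi_soft = [math.exp((q - v_soft) / alpha) for q in q_values]
    return v_soft, pi_soft

# Terminal MDP: no successor state, so Q_soft(s, a) = r(s, a).
rewards = [1.0, 0.0]

v, pi = soft_backup(rewards, alpha=1.0)
print(f"V_soft = {v:.4f}")                    # V_soft = 1.3133
print("pi_soft =", [round(p, 4) for p in pi])  # pi_soft = [0.7311, 0.2689]

# Limit checks: small alpha is near-greedy, large alpha is near-uniform.
for alpha in (0.01, 100.0):
    v_lim, pi_lim = soft_backup(rewards, alpha)
    print(alpha, round(v_lim, 4), [round(p, 4) for p in pi_lim])
```

At α = 0.01 the value sits at max_a Q = 1 and the policy is effectively greedy; at α = 100 the two probabilities differ from 0.5 by less than 0.01.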
Practice (practice.mdx)
Two exercises:
- Soft Bellman by hand at multiple temperatures. 3-action terminal MDP with r = (2, 1, 0). At α = 1: V_soft ≈ 2.408, π_soft ≈ (0.665, 0.245, 0.090). At α = 0.5: more concentrated on best action (π ≈ (0.867, 0.117, 0.016)); at α = 0.01: greedy (π → (1, 0, 0)). At α = 5: spread (π ≈ (0.402, 0.329, 0.269)); at α = 100: near-uniform (π → (0.333, 0.333, 0.333)). Limits verified dual-path (numerically at α = 0.01 and α = 100 and conceptually via the framework).
- Variational ingredients in SAC, KL-PPO/RLHF, DPO. For each, identify latent, prior, temperature, and conditioning evidence. SAC: action, uniform, α, per-step optimality. KL-PPO/RLHF: response, pretrained, β, sequence-level optimality (via reward model). DPO: response, pretrained, β, sequence-level optimality (via preference pairs). Same variational construction; different priors/samplers.
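Both exercises can be checked with one small routine: the posterior is the prior tilted by exp(reward / temperature), and only the prior changes between the SAC row and the RLHF/DPO rows. The sketch below is illustrative only; the function names and the `pretrained` probabilities are hypothetical, and a three-outcome discrete distribution stands in for a response distribution.

```python
import math

def tilted_posterior(prior, rewards, temperature):
    """pi(x) proportional to prior(x) * exp(reward(x) / temperature)."""
    logits = [math.log(p) + r / temperature for p, r in zip(prior, rewards)]
    top = max(logits)  # shift for numerical stability at small temperatures
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_regularized_objective(pi, prior, rewards, temperature):
    """E_pi[R] - temperature * KL(pi || prior)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, prior) if p > 0)
    return expected_reward - temperature * kl

rewards = [2.0, 1.0, 0.0]
uniform = [1 / 3] * 3

# Exercise 1: uniform prior (the SAC row), swept over temperature.
for alpha in (0.01, 0.5, 1.0, 5.0, 100.0):
    pi = tilted_posterior(uniform, rewards, alpha)
    print(f"alpha={alpha}: {[round(p, 3) for p in pi]}")

# Exercise 2: non-uniform prior (the RLHF / DPO rows). Hypothetical numbers.
pretrained = [0.1, 0.3, 0.6]
beta = 1.0
pi_star = tilted_posterior(pretrained, rewards, beta)
best = kl_regularized_objective(pi_star, pretrained, rewards, beta)

# The tilted posterior should score at least as well as any other candidate.
candidates = {"prior": pretrained, "greedy": [1.0, 0.0, 0.0], "uniform": uniform}
for name, candidate in candidates.items():
    value = kl_regularized_objective(candidate, pretrained, rewards, beta)
    print(f"{name}: {value:.4f} <= {best:.4f}")
```

With the uniform prior the sweep reproduces the exercise 1 policies. With the non-uniform prior the same formula gives the maximizer of E[R] - β · KL(π || prior), whose optimal value equals β · log Σ_x prior(x) · exp(R(x) / β).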
5 flashcards: soft vs hard Bellman; the optimality-variable construction; compute V_soft and π_soft for r=(1,0) at α=1; how the three algorithms unify; α vs γ.
Cheatsheet (cheatsheet.mdx)
One-page reference. The graphical model construction. The soft Bellman backup formulas. Two-limits table. Worked example numerics reproduced. SAC component map. Prior-choice-as-algorithm-choice table. Stochastic-dynamics correction note. RLHF as special case. Common pitfalls. Fleet pattern. Forward reference to L13.
Summary (summary.mdx)
5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph (the modern teaching-paradigm shift). Worked-check memory anchor with the 2-action numerics. Where this fits (closes Phase 2; L13 opens Phase 3). Fleet-pattern recap.
References (references.mdx)
Canonical: Levine (2018) tutorial. Pre-deep-learning predecessors: Toussaint & Storkey (2006), Kappen (2005), Todorov (2009). MaxEnt IRL: Ziebart (2008), Finn et al. (2016). Soft Q + SAC: Haarnoja (2017, 2018a, 2018b). RLHF: Ouyang (2022), Bai (2022). DPO + IPO: Rafailov (2023), Azar (2024). Bridge papers: Schulman/Chen/Abbeel (2018), Nachum et al. (2017). Course source: CS285 L19.
Editorial discipline
- Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: ELBO, VI, RL, RLHF, IRL, KL, MDP, MLE, SAC, DPO, IPO, KTO, PPO, DQN, DDPG, MaxEnt, JEPA, MuZero, MIT, ICML, ICLR, NeurIPS, AAAI, AISTATS, PNAS.
- No vendor naming triggers; paper authors + algorithm names + course instructors only.
- §6 status: standard pipeline, no triggers. L13 RLHF deep-dive properly deferred.
Word counts
- Lesson 2980
- Cheatsheet 715
- Practice 2240
- Summary 720
- Brief 940
- References 605
Total ≈ 8200 words across 6 artifacts. Math-heavy band; the heaviest lesson in Phase 2 due to the graphical-model construction + 5 limit derivations + 3-algorithm unification.
Notes for promotion
- Component placeholders live as MDX comments.
- Practice imports the real components.
- Numerics: all softmax / log-sum-exp arithmetic is hand-checkable to 4 decimals. The α = 1, r = (1, 0) worked example produces V_soft ≈ 1.3133 and π_soft ≈ (0.7311, 0.2689); verified by e/(e+1) ≈ 0.7311 and 1/(e+1) ≈ 0.2689.
- L12 closes Phase 2. The Phase 2 → Phase 3 boundary checkpoint comes between L12 and L13; that’s the next governance event. After L12 Stage 1 GO + the boundary checkpoint, L13 (RLHF) opens Phase 3 (production applications).
- The “fleet pattern” recurring across MuZero (L10), VI (L11), control-as-inference (L12), and JEPA (T24) should be preserved; the advisor flagged it as the cross-fleet pattern co-evolution loop at the meta level.