Summary: Control as inference (closes Phase 2)

The one paragraph version

The control-as-inference framing reformulates the entire RL problem as variational inference in a graphical model. Introduce binary optimality variables O_t ∈ {0, 1} at each timestep with unnormalized likelihood p(O_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t) / α); the RL problem becomes inferring p(a_t | s_t, O_{t:T} = 1). Variational message passing through this graphical model gives the soft Bellman backup: Q_soft(s, a) = r(s, a) + γ · E[V_soft(s')] and V_soft(s) = α · log Σ_a exp(Q_soft(s, a) / α). The soft policy π_soft(a|s) = exp((Q_soft(s, a) - V_soft(s)) / α) is a Boltzmann distribution over actions. As α → 0, the α-scaled log-sum-exp converges to max and the soft Bellman backup reduces to the hard Bellman optimality equation (Lesson 6); as α → ∞, the policy becomes uniform. Real systems pick α in between to balance reward maximization against policy entropy. SAC implements this backup with a soft Q-critic and a reparameterized stochastic actor. RLHF is the same framework with the pretrained language model replacing the uniform action prior; the full RLHF objective L = E[R] - β · KL(π_θ || π_pretrained) is the variational ELBO up to scaling, with β playing the role of α. DPO is the same framework with a direct max-likelihood sampler that skips the explicit reward model. Three algorithms, one variational construction with different (prior, sampler) choices. This lesson closes Phase 2 of Track 18: Lessons 6 through 12 covered the algorithmic core (DQN, PPO, the model-based pair, and the variational pair of variational inference plus control-as-inference). Phase 3 opens at Lesson 13 with RLHF as the killer production application.

Five things to remember

Optimality variables: introduce O_t ∈ {0, 1} with p(O_t = 1 | s, a) ∝ exp(r/α). The RL problem is inferring p(a | s, O_{t:T} = 1). This is the variational construction that turns control into inference.
Soft Bellman backup: Q_soft(s, a) = r + γ · E[V_soft(s')] and V_soft(s) = α · log Σ_a exp(Q_soft(s, a) / α). The log-sum-exp is the soft max.
Two limits: α → 0 recovers the hard Bellman equation (Lesson 6); α → ∞ gives a uniform policy. Real systems live in between.
SAC implements this backup. The soft Q-critic regresses to the soft Bellman target; the reparameterized stochastic actor matches the Boltzmann posterior. The “soft” qualifier is not stylistic; it refers exactly to the log-sum-exp value function.
Prior choice = algorithm choice: uniform → SAC; pretrained model → RLHF; demonstration policy → imitation-bootstrap. DPO keeps the RLHF prior and changes the sampler instead: the reward model stays implicit and the policy is fit directly. Same variational framework, different design knobs.
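The soft backup and its two limits can be checked on a small tabular problem. The following is a minimal sketch, not a reference implementation: the function names, the 2-state, 2-action reward table R, and the transition tensor P are all made up for illustration.

```python
import numpy as np

def soft_value(q, alpha):
    """V_soft(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    z = q / alpha
    m = z.max(axis=-1, keepdims=True)  # subtract the max before exponentiating
    return alpha * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def soft_value_iteration(R, P, gamma, alpha, iters=500):
    """Iterate the soft Bellman backup. R: (S, A) rewards, P: (S, A, S') transitions."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)        # Q_soft = r + gamma * E[V_soft(s')]
        V = soft_value(Q, alpha)       # V_soft = alpha * log-sum-exp(Q_soft / alpha)
    pi = np.exp((Q - V[:, None]) / alpha)  # Boltzmann policy exp((Q - V) / alpha)
    return Q, V, pi

def hard_value_iteration(R, P, gamma, iters=500):
    """Standard Bellman optimality backup, for comparison with the alpha -> 0 limit."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return V

# Hypothetical toy MDP (illustrative numbers only).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
gamma = 0.9

for alpha in (1e-3, 1.0, 1e3):
    Q, V, pi = soft_value_iteration(R, P, gamma, alpha)
    print(f"alpha={alpha:g}  V={V}  pi(.|s=0)={pi[0]}")
print("hard V:", hard_value_iteration(R, P, gamma))
```

At the small temperature the soft values should sit just above the hard values (the gap is bounded by α · log|A| / (1 - γ)), and at the large temperature each row of the policy should be close to uniform.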

Why this matters

The control-as-inference framing is the conceptual capstone for Phase 2. The L4-L10 dispatch-table tour answered “what does each algorithm estimate?” L11-L12 answer the deeper question: “is there a single principled objective from which all these algorithms fall out?” Yes, given an appropriate choice of (prior, temperature, evidence). This is what makes RL feel less like a zoo of unrelated tricks and more like a coherent mathematical framework.

The framework is also actionable: every time you need a new RL algorithm for a new problem class, you have a recipe. Pick the prior that encodes your inductive bias. Pick the temperature for your entropy / reward trade-off. Optimize the variational objective with your favorite gradient machinery. The resulting algorithm is a principled starting point for your problem, with every design choice traceable to one of these three knobs.
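For a single state, the "pick a prior, pick a temperature" recipe has a closed form: the policy maximizing E_π[Q] - α · KL(π || π_0) is π(a) ∝ π_0(a) · exp(Q(a) / α). The sketch below illustrates this under assumed inputs; the function names and the example Q-values and prior are hypothetical, and the prior must be strictly positive.

```python
import numpy as np

def kl_regularized_policy(q, prior, alpha):
    """pi(a) proportional to prior(a) * exp(q(a) / alpha), computed in log space."""
    logits = np.log(prior) + q / alpha
    logits -= logits.max()             # stabilize before exponentiating
    p = np.exp(logits)
    return p / p.sum()

def objective(pi, q, prior, alpha):
    """E_pi[q] - alpha * KL(pi || prior): the objective the policy above maximizes."""
    return float(pi @ q - alpha * np.sum(pi * np.log(pi / prior)))

# Hypothetical single-state example (illustrative numbers only).
q = np.array([1.0, 0.0, -1.0])
uniform = np.ones(3) / 3
skewed = np.array([0.1, 0.1, 0.8])     # stands in for a pretrained or demonstration prior
alpha = 0.5

pi_uniform = kl_regularized_policy(q, uniform, alpha)  # plain Boltzmann policy
pi_skewed = kl_regularized_policy(q, skewed, alpha)    # pulled toward the prior
print(pi_uniform, pi_skewed)
print(objective(pi_skewed, q, skewed, alpha))
```

With the uniform prior this reduces to the ordinary Boltzmann policy; swapping in a non-uniform prior changes only the `prior` argument, which is the sense in which prior choice selects the algorithm.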

By 2024 to 2025, much of modern RL teaching has converged toward this view. Deep-RL textbooks increasingly begin with variational inference; RLHF is presented as a special case of MaxEnt RL; DPO is “the variational shortcut that skips the explicit reward model.” The framing started as a theoretical curiosity (Toussaint & Storkey 2006, Levine 2018) and has become the modern teaching paradigm.

Worked check (memory anchor)

Single-state, 2-action MDP, terminal after 1 step, α = 1, r = (1, 0). Compute:

Q_soft(s, a_1) = 1, Q_soft(s, a_2) = 0
V_soft(s) = log(e + 1) = log(3.7183) ≈ 1.3133
π_soft(a_1) = exp(1 - 1.3133) ≈ 0.7311
π_soft(a_2) = exp(0 - 1.3133) ≈ 0.2689

Limit check: at α = 0.01, π_soft ≈ (1.0, 0.0) (near-greedy); at α = 100, π_soft ≈ (0.502, 0.498) (near-uniform). The framework matches both limits. The single temperature parameter α interpolates continuously between deterministic greedy (Phase 1 standard RL) and uniform random (no information from rewards).
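The worked check can be reproduced in a few lines. This is a small sketch for the one-step case only, where Q_soft equals the reward; the helper name soft_policy is made up for illustration.

```python
import numpy as np

def soft_policy(r, alpha):
    """One-step case: Q_soft = r, so V_soft and pi_soft follow directly from r."""
    q = np.asarray(r, dtype=float)
    z = q / alpha
    m = z.max()
    v = alpha * (m + np.log(np.exp(z - m).sum()))   # alpha * log-sum-exp(q / alpha)
    pi = np.exp((q - v) / alpha)                     # Boltzmann policy
    return v, pi

# Rewards from the worked check; the three temperatures are the ones used above.
for alpha in (0.01, 1.0, 100.0):
    v, pi = soft_policy([1.0, 0.0], alpha)
    print(f"alpha={alpha:g}  V_soft={v:.4f}  pi_soft={np.round(pi, 4)}")
```

The max-subtraction inside the log-sum-exp matters at α = 0.01, where exp(r / α) would otherwise be evaluated at exp(100).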

Where this fits

Previous (Lesson 11): Variational inference machinery (ELBO, reparameterization, two RL applications).
This lesson: Control as inference. Apply the variational machinery to the full RL problem. Closes Phase 2.
Next (Lesson 13): RLHF deep-dive. The L11/L12 framework is the theoretical basis for the production RLHF pipeline. The Phase 2 → Phase 3 boundary checkpoint comes between L12 (this lesson) and L13.
Later (Lessons 14+): Production applications: agentic systems, real-world robotics, safety alignment.

The fleet pattern

The Phase 2 unification is one instance of a broader pattern: the loss function determines what the model learns. Other instances:

MuZero (Lesson 10): train the model for planning quality, not raw-observation reconstruction.
JEPA-style representation learning (Track 24, contemporary): predict latent representations, not pixels.
DPO: skip the explicit reward model; fit the policy directly to preference pairs by maximum likelihood.
SAC’s entropy bonus (this lesson, Lesson 11): the entropy is a KL regularizer in disguise, since for a finite action set A, H(π) = log|A| - KL(π || uniform).

All four are different incarnations of “pick the loss for what you want the model to do.” Variational inference makes this principle explicit; the rest are different domains illustrating the same insight.
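The entropy item in the list above rests on an identity that is easy to verify numerically for a finite action set: H(π) = log|A| - KL(π || uniform), so maximizing entropy is the same as minimizing KL to a uniform prior. A minimal check, with an arbitrary example distribution chosen for illustration:

```python
import numpy as np

def entropy(pi):
    """Shannon entropy H(pi) in nats (pi must be strictly positive)."""
    return float(-np.sum(pi * np.log(pi)))

def kl(p, q):
    """KL(p || q) in nats for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

pi = np.array([0.7, 0.2, 0.1])                 # arbitrary example policy
uniform = np.full_like(pi, 1.0 / pi.size)

lhs = entropy(pi)
rhs = float(np.log(pi.size)) - kl(pi, uniform)
print(lhs, rhs)
```

The two printed values agree up to floating-point error; the log|A| term is a constant, so it does not affect the optimizer.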

What you should remember

Phase 2 is now complete. Lessons 6 through 12 covered the algorithmic core of deep RL (DQN, PPO, the model-based pair, variational inference plus control-as-inference) as a coherent mathematical unit. Phase 3 opens at Lesson 13 with RLHF as the killer production application. The Phase 2 → Phase 3 boundary checkpoint after this lesson reviews L6 through L12 as a batch before the production-applications phase opens.