Cheatsheet: Control as inference
The graphical model
Introduce binary optimality variables O_t ∈ {0, 1}:

p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t) / α)

(Un-normalized “likelihood”; absorbed by partition function later.)
Joint:
p(τ, O_{1:T}) = p(s_0) · Π_t [ p(a_t | s_t) · p(s_{t+1} | s_t, a_t) · p(O_t | s_t, a_t) ]

RL problem: infer p(a_t | s_t, O_{t:T} = 1).
The soft Bellman backup (derivation outcome)
Q_soft(s, a) = r(s, a) + γ · E_{s' ~ P(·|s, a)} [V_soft(s')]

V_soft(s) = α · log Σ_a exp(Q_soft(s, a) / α)

π_soft(a|s) = exp((Q_soft(s, a) - V_soft(s)) / α)

V_soft is the log-partition function over actions. α is the temperature.
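The three equations above can be iterated to a fixed point on a small tabular problem. The following is a minimal sketch, not a reference implementation: the 2-state, 2-action MDP (`REWARDS`, `TRANSITIONS`), the function names, and the choice of 500 sweeps are all assumptions made for illustration.

```python
import math

def soft_value(q_values, alpha):
    """V_soft = alpha * log sum_a exp(Q / alpha), shifted by max(Q) for stability."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def soft_policy(q_values, alpha):
    """pi_soft(a|s) = exp((Q - V_soft) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def soft_value_iteration(rewards, transitions, alpha, gamma, sweeps=500):
    """rewards[s][a] is r(s, a); transitions[s][a] maps s' -> P(s' | s, a)."""
    n_states = len(rewards)
    v = [0.0] * n_states
    q = [[0.0] * len(rewards[s]) for s in range(n_states)]
    for _ in range(sweeps):
        # Q_soft(s, a) = r(s, a) + gamma * E_{s'}[V_soft(s')]
        q = [[rewards[s][a]
              + gamma * sum(p * v[s2] for s2, p in transitions[s][a].items())
              for a in range(len(rewards[s]))]
             for s in range(n_states)]
        # V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha)
        v = [soft_value(q[s], alpha) for s in range(n_states)]
    return q, v

# Hypothetical MDP, chosen only to exercise the backup.
REWARDS = [[1.0, 0.0], [0.0, 2.0]]
TRANSITIONS = [
    [{0: 0.9, 1: 0.1}, {1: 1.0}],
    [{0: 1.0}, {0: 0.2, 1: 0.8}],
]

Q, V = soft_value_iteration(REWARDS, TRANSITIONS, alpha=0.5, gamma=0.9)
PI = [soft_policy(Q[s], 0.5) for s in range(len(Q))]
```

With γ < 1 the soft backup is a contraction, so the sweep count only needs to be large enough for the residual to fall below the tolerance you care about.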
The two limits
| Limit | Behavior | Recovers |
|---|---|---|
| α → 0 | log-sum-exp → max | Hard Bellman (Lesson 6); deterministic greedy policy |
| α → ∞ | π_soft → uniform | Uniform random policy; no use of rewards |
| Real systems | α ∈ [0.01, 1.0] typically | Soft policy with controlled entropy |
α is the information rate: how sharply rewards drive the policy.
Worked example (lesson body)
Single state, two actions, terminal after 1 step, α = 1, r = (1, 0).
| Action | Q_soft | exp(Q_soft / α) |
|---|---|---|
| a_1 | 1 | e = 2.7183 |
| a_2 | 0 | 1.0000 |
V_soft = log(e + 1) = log(3.7183) ≈ 1.3133

π_soft(a_1) = exp(1 - 1.3133) ≈ 0.7311

π_soft(a_2) = exp(0 - 1.3133) ≈ 0.2689

Sum: 1.0000 ✓. At α → 0: π → (1, 0). At α → ∞: π → (0.5, 0.5). Limits verified.
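The worked example can be checked numerically. This is a small sketch under the same setup (one state, Q_soft = r = (1, 0)); the helper name `soft_value_and_policy` and the list of temperatures in the sweep are illustrative assumptions.

```python
import math

def soft_value_and_policy(q_values, alpha):
    """Return (V_soft, pi_soft) for one state, using a max-shift for stability."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    v_soft = m + alpha * math.log(z)
    return v_soft, [w / z for w in weights]

Q = [1.0, 0.0]  # terminal after one step, so Q_soft equals r

v, pi = soft_value_and_policy(Q, alpha=1.0)
print(round(v, 4), [round(p, 4) for p in pi])  # 1.3133 [0.7311, 0.2689]

# Sweep the temperature to watch the policy move between the two limits.
for alpha in (0.01, 0.1, 1.0, 10.0, 1000.0):
    _, pi_alpha = soft_value_and_policy(Q, alpha)
    print(alpha, [round(p, 4) for p in pi_alpha])
```

Small α pushes the printed policy toward (1, 0); large α pushes it toward (0.5, 0.5).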
SAC = practical implementation of soft Bellman
| Component | What it does |
|---|---|
| Soft Q-critic Q_φ(s, a) | Regress to r + γ · E[V_soft(s')] |
| Stochastic actor `π_θ(a \| s)` | Fit π_θ toward π_soft ∝ exp(Q_φ / α) |
| Reparameterization a = μ_θ(s) + σ_θ(s) · ε | Differentiable sampling |
| Auto temperature α | Tune to target entropy (Haarnoja 2018b) |
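The components in the table can be written as per-sample quantities. The sketch below is an assumption-laden illustration, not SAC as published: it uses a 1-D Gaussian actor, omits tanh squashing, twin critics, and target networks, and every function name and numeric value is hypothetical. In particular, `q_next=3.0` stands in for a critic evaluation Q_φ(s', a').

```python
import math
import random

def sample_action(mu, sigma, rng):
    """Reparameterized sample a = mu + sigma * eps with eps ~ N(0, 1), plus log pi(a|s)."""
    eps = rng.gauss(0.0, 1.0)
    action = mu + sigma * eps
    log_prob = -0.5 * eps * eps - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return action, log_prob

def critic_target(reward, gamma, alpha, q_next, log_prob_next, done):
    """One-sample estimate of r + gamma * V_soft(s'), using V_soft(s') ~ Q(s', a') - alpha * log pi(a'|s')."""
    v_next = q_next - alpha * log_prob_next
    return reward + (0.0 if done else gamma * v_next)

def actor_loss(alpha, log_prob, q_value):
    """Per-sample alpha * log pi(a|s) - Q(s, a); its expectation under pi is minimized by exp(Q / alpha) / Z."""
    return alpha * log_prob - q_value

def temperature_loss(alpha, log_prob, target_entropy):
    """Per-sample -alpha * (log pi + target); gradient descent raises alpha when entropy is below target."""
    return -alpha * (log_prob + target_entropy)

rng = random.Random(0)
a_next, logp_next = sample_action(mu=0.2, sigma=0.5, rng=rng)
target = critic_target(reward=1.0, gamma=0.99, alpha=0.2,
                       q_next=3.0, log_prob_next=logp_next, done=False)
```

In a real training loop these scalars would be batched tensors and the losses would be differentiated with respect to θ, φ, and α; the arithmetic is the same.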
Prior choice = algorithm choice
| Prior `p(a \| s)` | Algorithm | KL term in objective |
|---|---|---|
| Uniform | SAC, MaxEnt RL | `KL(π_θ \|\| uniform)` = -H(π_θ) up to constant |
| π_pretrained | KL-regularized PPO (RLHF) | `KL(π_θ \|\| π_pretrained)` |
| Demonstration policy | Imitation-bootstrapped RL | `KL(π_θ \|\| π_demo)` |
| Reward model implicit | DPO (skip explicit RL) | Same KL, different sampler |
Same variational framework. Different prior = different algorithm.
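To see the prior acting as the design knob, the sketch below computes the KL-regularized optimum π*(a) ∝ p(a) · exp(Q(a) / α) for two priors over the same Q-values, and checks the identity KL(π || uniform) = log|A| - H(π) from the first table row. The Q-values, the `PRETRAINED` prior, and all function names are made-up illustrations.

```python
import math

def kl_regularized_policy(q_values, prior, alpha):
    """pi*(a) proportional to prior(a) * exp(Q(a) / alpha)."""
    m = max(q_values)
    weights = [p * math.exp((q - m) / alpha) for q, p in zip(q_values, prior)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

Q = [1.0, 0.0, -1.0]
UNIFORM = [1.0 / 3.0] * 3
PRETRAINED = [0.1, 0.1, 0.8]  # hypothetical prior that favors the last action

pi_maxent = kl_regularized_policy(Q, UNIFORM, alpha=1.0)   # plain softmax of Q
pi_prior = kl_regularized_policy(Q, PRETRAINED, alpha=1.0)  # pulled toward the prior

# KL(pi || uniform) = log|A| - H(pi): the "up to constant" in the table is log|A|.
identity_gap = kl(pi_maxent, UNIFORM) - (math.log(3) - entropy(pi_maxent))
```

Same Q-values, different prior, different policy: `pi_prior` keeps substantial mass on the low-reward action because the prior put it there.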
Exact-vs-variational axis (NOT deterministic-vs-stochastic)
Naive exact-inference message passing under stochastic transitions yields:

Q_soft(s, a) = r(s, a) + α · log E_{s'} [exp(V_soft(s') / α)]   [naive exact]

This log-sum-exp over next states is risk-seeking / optimistic under uncertainty (the “optimism problem”). The variational correction restores the plain expectation:

Q_soft(s, a) = r(s, a) + γ · E_{s'} [V_soft(s')]   [variational; what SAC uses]

Levine 2018: exact inference is appropriate for deterministic dynamics; variational inference for stochastic. SAC uses the variational form regardless of dynamics. The log-sum-exp / soft-max stays over actions (in V_soft), never over next states in the actual backup.
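A two-action toy case makes the optimism problem concrete. In this sketch (hypothetical values and names throughout), one action is a 50/50 gamble between next-state values +10 and -10 and the other leads to a sure value of 1. γ is set to 1 in the variational form so the two backups are compared like for like.

```python
import math

def naive_exact_backup(reward, next_values, next_probs, alpha):
    """r + alpha * log E_{s'}[exp(V(s') / alpha)], shifted by max(V) for stability."""
    m = max(next_values)
    mean_exp = sum(p * math.exp((v - m) / alpha)
                   for v, p in zip(next_values, next_probs))
    return reward + m + alpha * math.log(mean_exp)

def variational_backup(reward, next_values, next_probs, gamma):
    """r + gamma * E_{s'}[V(s')]."""
    return reward + gamma * sum(p * v for v, p in zip(next_values, next_probs))

gamble_v, gamble_p = [10.0, -10.0], [0.5, 0.5]
safe_v, safe_p = [1.0], [1.0]

naive_gamble = naive_exact_backup(0.0, gamble_v, gamble_p, alpha=1.0)
naive_safe = naive_exact_backup(0.0, safe_v, safe_p, alpha=1.0)
var_gamble = variational_backup(0.0, gamble_v, gamble_p, gamma=1.0)
var_safe = variational_backup(0.0, safe_v, safe_p, gamma=1.0)

print("naive exact prefers gamble:", naive_gamble > naive_safe)
print("variational prefers gamble:", var_gamble > var_safe)
```

The naive backup scores the gamble near 10 + log(0.5) because the exponential lets the lucky outcome dominate, which amounts to assuming the agent can steer the dynamics. The plain expectation scores it at 0 and prefers the safe action.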
RLHF as special case
Full RLHF objective:

L = E_{π_θ} [R(prompt, response)] - β · KL(π_θ || π_pretrained)

= variational ELBO (scaled by β) for the optimality-conditioned graphical model with:
- Latent: response
- Prior: pretrained model
- Likelihood: exp(R / β)
- Temperature: β
L8 derived the practical surrogate L^CLIP - β · KL. Variational framework derives the OBJECTIVE; PPO is the optimizer.
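The ELBO reading can be checked on a finite set of responses: L is bounded above by β · log Σ_y π_pretrained(y) · exp(R(y) / β), and the bound is tight at the posterior π*(y) ∝ π_pretrained(y) · exp(R(y) / β). The three-response example, its numbers, and the function names below are assumptions for illustration only.

```python
import math

def rlhf_objective(policy, prior, rewards, beta):
    """E_pi[R] - beta * KL(pi || prior) over a finite set of responses."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, prior) if p > 0.0)
    return expected_reward - beta * kl

def posterior_policy(prior, rewards, beta):
    """pi*(y) proportional to prior(y) * exp(R(y) / beta)."""
    m = max(rewards)
    weights = [q * math.exp((r - m) / beta) for q, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

PRIOR = [0.7, 0.2, 0.1]    # hypothetical pretrained probabilities of 3 responses
REWARDS = [0.0, 1.0, 2.0]  # hypothetical reward-model scores
BETA = 0.5

# beta * log-evidence of optimality under the prior: the upper bound on L.
bound = BETA * math.log(sum(q * math.exp(r / BETA) for q, r in zip(PRIOR, REWARDS)))

at_posterior = rlhf_objective(posterior_policy(PRIOR, REWARDS, BETA), PRIOR, REWARDS, BETA)
at_prior = rlhf_objective(PRIOR, PRIOR, REWARDS, BETA)
at_greedy = rlhf_objective([0.0, 0.0, 1.0], PRIOR, REWARDS, BETA)
```

Neither staying at the prior nor collapsing onto the highest-reward response reaches the bound; only the posterior does, and an optimizer such as PPO approximates that posterior.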
Common pitfalls
- Conflating temperature α and discount γ (different roles)
- Confusing the exact-vs-variational axis with deterministic-vs-stochastic (SAC uses the variational form r + γ · E[V_soft] regardless of dynamics; naive exact-inference message passing is the risk-seeking version)
- Treating the framework as algorithm-specific (it’s structural)
- Skipping the prior choice (the design knob)
- Treating control-as-inference as just notational
Fleet pattern
The variational unification is one instance of a broader pattern: the loss function determines what the model learns. Other instances:
- MuZero (L10): train for planning quality, not raw-observation reconstruction
- JEPA (T24): predict latent representations, not pixels
- DPO: skip the explicit reward model, fit the policy directly on preference pairs
Same insight: pick the loss for what you want; the algorithm follows.
What you should remember
- Soft Bellman backup: V_soft = α · log Σ_a exp(Q/α); π_soft = exp(Q/α) / Z.
- α → 0: hard Bellman; α → ∞: uniform policy.
- SAC implements this. RLHF is the same framework with pretrained prior.
- Phase 2 closes here. Phase 3 opens with RLHF in L13.