
Cheatsheet: Control as inference

Introduce binary optimality variables O_t ∈ {0, 1}:

p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t) / α)

(Un-normalized “likelihood”; absorbed by partition function later.)

Joint:

p(τ, O_{1:T}) = p(s_1) · Π_{t=1}^{T} [ p(a_t | s_t) · p(s_{t+1} | s_t, a_t) · p(O_t | s_t, a_t) ]

RL problem: infer p(a_t | s_t, O_{t:T} = 1).

The soft Bellman backup (derivation outcome)

Q_soft(s, a) = r(s, a) + γ · E_{s' ~ P(·|s, a)} [V_soft(s')]
V_soft(s) = α · log Σ_a exp(Q_soft(s, a) / α)
π_soft(a|s) = exp((Q_soft(s, a) - V_soft(s)) / α)

V_soft is the log-partition function over actions. α is the temperature.
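The three equations above can be sketched for a discrete action set as follows. This is a minimal illustration, not a reference implementation: the function names (`soft_value`, `soft_policy`, `soft_q_backup`) and the example numbers are hypothetical, and the max is subtracted before exponentiating only for numerical stability.

```python
import math

def soft_value(q_values, alpha):
    """V_soft = alpha * log sum_a exp(Q(a) / alpha), computed stably."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    return alpha * (m + math.log(sum(math.exp(x - m) for x in scaled)))

def soft_policy(q_values, alpha):
    """pi_soft(a) = exp((Q(a) - V_soft) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def soft_q_backup(reward, gamma, next_state_probs, next_state_values):
    """Q_soft = r + gamma * E_{s'}[V_soft(s')] for a single (s, a) pair."""
    expected_v = sum(p * v for p, v in zip(next_state_probs, next_state_values))
    return reward + gamma * expected_v

# Hypothetical Q-values for one state with three actions.
q = [1.0, 0.0, -0.5]
pi = soft_policy(q, alpha=0.5)
print(soft_value(q, alpha=0.5), pi, sum(pi))
```

Because V_soft is the (scaled) log-partition function, the policy normalizes without a separate division step.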

| Limit | Behavior | Recovers |
|-------|----------|----------|
| α → 0 | log-sum-exp → max | Hard Bellman (Lesson 6); deterministic greedy policy |
| α → ∞ | log-sum-exp → uniform | Uniform random policy; no use of rewards |
| Real systems | α ∈ [0.01, 1.0] typically | Soft policy with controlled entropy |

α sets how sharply rewards drive the policy: a small α concentrates probability on high-Q actions, while a large α flattens the policy toward uniform.
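A quick way to see the effect of the temperature is to sweep α and watch the policy entropy. The sketch below assumes a hypothetical two-action state with Q = (1, 0); the α grid is arbitrary and chosen only to span the limits discussed above.

```python
import math

def soft_policy(q_values, alpha):
    """Softmax over Q / alpha, with the max subtracted for stability."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

q = [1.0, 0.0]                            # hypothetical Q-values
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]    # arbitrary sweep
entropies = [entropy(soft_policy(q, a)) for a in alphas]

for a, h in zip(alphas, entropies):
    print(f"alpha={a:>6}: pi={soft_policy(q, a)}, H={h:.4f}")
```

In this example the entropy rises from nearly 0 toward its maximum log 2 as α grows, which is the "controlled entropy" knob in the table.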

Single state, two actions, terminal after 1 step, α = 1, r = (1, 0).

| Action | Q_soft | exp(Q_soft / α) |
|--------|--------|-----------------|
| a_1 | 1 | e = 2.7183 |
| a_2 | 0 | 1.0000 |

V_soft = log(e + 1) = log(3.7183) ≈ 1.3133
π_soft(a_1) = exp(1 - 1.3133) ≈ 0.7311
π_soft(a_2) = exp(0 - 1.3133) ≈ 0.2689

Sum: 1.0000 ✓. At α → 0: π → (1, 0). At α → ∞: π → (0.5, 0.5). Limits verified.
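The worked example can be reproduced in a few lines. This is a sketch under the same assumptions as the text (one state, two actions, terminal after one step, so Q_soft = r); the helper `soft_policy` and the two extreme temperatures used to probe the limits are illustrative choices.

```python
import math

alpha = 1.0
q_soft = [1.0, 0.0]  # terminal after one step, so Q_soft equals the reward

v_soft = alpha * math.log(sum(math.exp(q / alpha) for q in q_soft))
pi_soft = [math.exp((q - v_soft) / alpha) for q in q_soft]

print(round(v_soft, 4))                  # 1.3133
print([round(p, 4) for p in pi_soft])    # [0.7311, 0.2689]

def soft_policy(q_values, alpha):
    """Stable softmax over Q / alpha, used to probe the two limits."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

near_greedy = soft_policy(q_soft, 0.01)      # approaches (1, 0)
near_uniform = soft_policy(q_soft, 1000.0)   # approaches (0.5, 0.5)
print(near_greedy, near_uniform)
```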

SAC = practical implementation of soft Bellman

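| Component | What it does |
|-----------|--------------|
| Soft Q-critic Q_φ(s, a) | Regress to r + γ · E[V_soft(s')] |
| Stochastic actor `π_θ(a\|s)` | Trained toward π_soft(a\|s) ∝ exp(Q_φ(s, a) / α) |
| Reparameterization a = μ_θ(s) + σ_θ(s) · ε | Differentiable sampling |
| Auto temperature α | Tune to target entropy (Haarnoja 2018b) |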
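The critic's regression target can be sketched for a discrete action set, where V_soft(s') is written as an expectation under the current policy: E_{a'~π}[Q(s', a') - α · log π(a'|s')]. The function names and numbers below are hypothetical, and the sketch leaves out the sampled-action estimate, target networks, and twin critics that practical SAC implementations typically use.

```python
import math

def soft_value_under_policy(q_values, policy, alpha):
    """E_{a ~ pi}[Q(s, a) - alpha * log pi(a | s)] over a discrete action set."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, policy) if p > 0.0)

def critic_target(reward, gamma, q_next, policy_next, alpha, done=False):
    """Target r + gamma * V_soft(s'); no bootstrapping at terminal states."""
    if done:
        return reward
    return reward + gamma * soft_value_under_policy(q_next, policy_next, alpha)

# Hypothetical next-state Q-values and temperature.
q_next = [0.5, -0.2, 1.0]
alpha = 0.2

# Closed-form soft value and soft policy from the backup equations.
v_closed = alpha * math.log(sum(math.exp(q / alpha) for q in q_next))
pi_soft = [math.exp((q - v_closed) / alpha) for q in q_next]

v_expect = soft_value_under_policy(q_next, pi_soft, alpha)
v_uniform = soft_value_under_policy(q_next, [1 / 3] * 3, alpha)
print(v_closed, v_expect, v_uniform)
print(critic_target(reward=1.0, gamma=0.99, q_next=q_next,
                    policy_next=pi_soft, alpha=alpha))
```

When the policy equals π_soft, the expectation form matches α · log Σ_a exp(Q/α) exactly; any other policy (here, uniform) gives a smaller value.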

| Prior p(a \| s) | Algorithm | KL term in objective |
|------------------|-----------|---------------------|
| Uniform | SAC, MaxEnt RL | KL(π_θ \|\| uniform) = -H(π_θ) up to constant |
| π_pretrained | KL-regularized PPO (RLHF) | KL(π_θ \|\| π_pretrained) |
| Demonstration policy | Imitation-bootstrapped RL | KL(π_θ \|\| π_demo) |
| Reward model implicit | DPO (skip explicit RL) | Same KL, different sampler |

Same variational framework. Different prior = different algorithm.
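The first table row says the KL to a uniform prior is the negative entropy up to a constant. For a discrete action set that constant is log |A|, which a short check makes concrete. The policy values below are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

policy = [0.7, 0.2, 0.1]                      # hypothetical pi_theta(. | s)
uniform = [1.0 / len(policy)] * len(policy)   # uniform prior p(a | s)

kl = kl_divergence(policy, uniform)
identity = math.log(len(policy)) - entropy(policy)  # log|A| - H(pi)
print(kl, identity)
```

Since log |A| does not depend on the policy, minimizing this KL is the same as maximizing entropy, which is why the uniform prior yields the MaxEnt objective.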

Exact-vs-variational axis (NOT deterministic-vs-stochastic)


Naive exact-inference message passing under stochastic transitions yields:

Q_soft(s, a) = r(s, a) + α · log E_{s'} [exp(V_soft(s') / α)] [naive exact]

This log-sum-exp over next states is risk-seeking / optimistic under uncertainty (the “optimism problem”). The variational correction restores the plain expectation:

Q_soft(s, a) = r(s, a) + γ · E_{s'} [V_soft(s')] [variational; what SAC uses]

Levine 2018: exact inference is appropriate for deterministic dynamics; variational inference for stochastic. SAC uses the variational form regardless of dynamics. The log-sum-exp / soft-max stays over actions (in V_soft), never over next states in the actual backup.
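The optimism problem is easy to see numerically. The sketch below compares the two backups on a made-up choice between a risky action (10% chance of V = +10, 90% chance of V = -2) and a safe action (V = 0 with certainty), with α = 1, γ = 1, and zero immediate reward. All numbers and function names are illustrative assumptions.

```python
import math

def naive_exact_backup(reward, alpha, probs, next_values):
    """r + alpha * log E_{s'}[exp(V(s') / alpha)], computed stably."""
    scaled = [v / alpha for v in next_values]
    m = max(scaled)
    log_mean_exp = m + math.log(
        sum(p * math.exp(x - m) for p, x in zip(probs, scaled)))
    return reward + alpha * log_mean_exp

def variational_backup(reward, gamma, probs, next_values):
    """r + gamma * E_{s'}[V(s')], the form SAC uses."""
    return reward + gamma * sum(p * v for p, v in zip(probs, next_values))

risky = ([0.1, 0.9], [10.0, -2.0])   # expected next value: -0.8
safe = ([1.0], [0.0])                # expected next value:  0.0

risky_naive = naive_exact_backup(0.0, 1.0, *risky)
safe_naive = naive_exact_backup(0.0, 1.0, *safe)
risky_var = variational_backup(0.0, 1.0, *risky)
safe_var = variational_backup(0.0, 1.0, *safe)

print(f"naive exact: risky={risky_naive:.3f}, safe={safe_naive:.3f}")
print(f"variational: risky={risky_var:.3f}, safe={safe_var:.3f}")
```

The naive backup ranks the risky action far above the safe one because the exponential is dominated by the rare good outcome, as if the agent could choose which transition occurs. The variational backup ranks the safe action higher, matching the plain expected value.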

Full RLHF objective:

L = E_{response ~ π_θ} [R(prompt, response)] - β · KL(π_θ || π_pretrained)

= variational ELBO for the optimality-conditioned graphical model with:

  • Latent: response
  • Prior: pretrained model
  • Likelihood: exp(R / β)
  • Temperature: β
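For a small discrete set of responses, the objective above has a closed-form maximizer: π*(response) ∝ π_pretrained(response) · exp(R / β), with optimal value β · log Z. The sketch below checks this on a toy problem; the three responses, their rewards, the prior probabilities, and the function names are all hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def rlhf_objective(policy, prior, rewards, beta):
    """E_policy[R] - beta * KL(policy || prior)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, prior)

def optimal_policy(prior, rewards, beta):
    """pi*(y) = prior(y) * exp(R(y) / beta) / Z, computed in log space."""
    logits = [math.log(p) + r / beta for p, r in zip(prior, rewards)]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]

prior = [0.6, 0.3, 0.1]     # hypothetical pretrained probabilities
rewards = [0.0, 1.0, 2.0]   # hypothetical reward-model scores
beta = 0.5

pi_star = optimal_policy(prior, rewards, beta)
log_partition = beta * math.log(
    sum(p * math.exp(r / beta) for p, r in zip(prior, rewards)))

print(pi_star)
print(rlhf_objective(pi_star, prior, rewards, beta), log_partition)
print(rlhf_objective(prior, prior, rewards, beta))  # staying at the prior
```

The optimum reweights the pretrained model toward high-reward responses, and β controls how far it moves: a large β keeps π* close to the prior, a small β lets the reward dominate.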

L8 derived the practical surrogate L^CLIP - β · KL. Variational framework derives the OBJECTIVE; PPO is the optimizer.

Common pitfalls:

  • Conflating temperature α and discount γ (α scales the entropy term; γ discounts future reward)
  • Confusing the exact-vs-variational axis with deterministic-vs-stochastic (SAC uses the variational form r + γ · E[V_soft] regardless of dynamics; naive exact-inference message passing is the risk-seeking version)
  • Treating the framework as algorithm-specific (it’s structural)
  • Skipping the prior choice (the design knob)
  • Treating control-as-inference as just notational

The variational unification is one instance of a broader pattern: the loss function determines what the model learns. Other instances:

  • MuZero (L10): train for planning quality, not raw-observation reconstruction
  • JEPA (T24): predict latent representations, not pixels
  • DPO: skip the explicit reward model; optimize the policy directly on preference data

Same insight: pick the loss for what you want; the algorithm follows.

  • Soft Bellman backup: V_soft = α · log Σ_a exp(Q/α); π_soft = exp(Q/α) / Z.
  • α → 0: hard Bellman; α → ∞: uniform policy.
  • SAC implements this. RLHF is the same framework with pretrained prior.
  • Phase 2 closes here. Phase 3 opens with RLHF in L13.