Summary: Planning with a learned model

Once you have a learned dynamics model (Lesson 9), you can plan with it. The simplest planner is random shooting: sample N action sequences uniformly, score each by rolling the model forward, pick the best. The standard upgrade is the cross-entropy method (CEM): iteratively refit a Gaussian over action sequences to the elite samples, concentrating the distribution around high-reward regions over 5 to 10 iterations. Model Predictive Control (MPC) wraps any planner in a receding-horizon loop: plan H steps, execute one, observe, re-plan. The receding horizon limits how far the model needs to be accurate; the compounding-error analysis from Lesson 9 dictates that this H stays small (5 to 30 steps for most continuous-control problems). The most ambitious learned-model algorithm to date is MuZero (Schrittwieser et al., 2020), which trains a dynamics network jointly with policy and value networks inside an MCTS planning loop. MuZero’s key innovation: the model lives entirely in hidden-state space and is trained for planning quality (policy + value + reward predictions), not for raw-observation reconstruction. This produces a better model for control than training for one-step accuracy. MuZero mastered Go, chess, shogi, and Atari without being told the rules. This lesson closes the P-branch of the Lesson 3 dispatch table and completes the Phase 2 tour of all five algorithmic families.
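The random-shooting planner described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's reference implementation: the names `rollout_returns`, `random_shooting`, `toy_model`, and `toy_reward` are hypothetical, the learned model is assumed to be a batched function `model(states, actions) -> next_states`, and the toy dynamics s' = s + a with reward -(s' - 1)² stand in for a real learned model.

```python
import numpy as np

def rollout_returns(model, reward_fn, state, actions):
    """Score action sequences of shape (N, H, action_dim) by rolling the model forward."""
    n_samples, horizon, _ = actions.shape
    states = np.repeat(np.asarray(state, dtype=float)[None, :], n_samples, axis=0)
    returns = np.zeros(n_samples)
    for t in range(horizon):
        states = model(states, actions[:, t])   # batched one-step prediction
        returns += reward_fn(states)            # reward of the predicted next state
    return returns

def random_shooting(model, reward_fn, state, horizon, n_samples, low, high, rng):
    """Sample N sequences uniformly, score each with the model, return the best one."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    actions = rng.uniform(low, high, size=(n_samples, horizon, low.shape[0]))
    returns = rollout_returns(model, reward_fn, state, actions)
    best = int(np.argmax(returns))
    return actions[best], returns[best]

# Toy stand-ins for a learned model (assumptions for illustration only).
toy_model = lambda s, a: s + a
toy_reward = lambda s: -np.sum((s - 1.0) ** 2, axis=1)

rng = np.random.default_rng(0)
best_actions, best_return = random_shooting(
    toy_model, toy_reward, np.array([0.0]),
    horizon=3, n_samples=1000, low=[-1.0], high=[1.0], rng=rng,
)
print(best_actions.ravel(), best_return)
```

Every candidate is scored independently, which is why the method parallelizes so easily. The cost is that the number of samples needed to land near a good sequence grows quickly with horizon × action dimension.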

  1. Random shooting is the baseline planner: sample, score, pick. Cheap and embarrassingly parallel, but wasteful in high-dimensional action spaces, where uniform samples rarely land near a good sequence.
  2. CEM is the workhorse: iteratively refit a Gaussian to elite samples. Each iteration concentrates around high-reward regions. Converges in 5 to 10 iterations on most problems.
  3. MPC wraps any planner in a receding-horizon loop. Plan H steps, execute one, re-plan from the new real-world state. The model needs to be accurate only over the H-step horizon.
  4. MuZero learns a dynamics model end-to-end inside MCTS. The model lives in hidden-state space, never predicts raw observations, and is trained for planning quality. It mastered Go, chess, shogi, and Atari without being given the rules.
  5. Family decision: MuZero/AlphaZero for discrete board games and Atari with deep tree search; MPC + CEM for continuous control with short horizons; model-free PPO for language and everything else.
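Takeaways 2 and 3 can be combined into one short sketch: a CEM planner wrapped in a receding-horizon MPC loop. This is an illustrative sketch under stated assumptions, not a production controller. The function names (`cem_plan`, `mpc_loop`, `toy_model`, `true_step`) and the hyperparameters (128 samples, 16 elites, 5 iterations) are hypothetical choices. The "real" environment is deliberately made to differ from the model (s' = s + 0.9a versus the model's s' = s + a) to show re-planning absorbing model bias.

```python
import numpy as np

def cem_plan(model, reward_fn, state, horizon, action_dim, rng,
             n_samples=128, n_elites=16, n_iters=5, low=-1.0, high=1.0):
    """Refit a diagonal Gaussian over action sequences to the elite samples."""
    state = np.asarray(state, dtype=float)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        actions = np.clip(mu + sigma * noise, low, high)   # respect action bounds
        states = np.repeat(state[None, :], n_samples, axis=0)
        returns = np.zeros(n_samples)
        for t in range(horizon):
            states = model(states, actions[:, t])
            returns += reward_fn(states)
        elites = actions[np.argsort(returns)[-n_elites:]]   # top-K by model return
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6                    # small floor avoids sigma = 0
    return mu

def mpc_loop(env_step, model, reward_fn, state, n_steps, horizon, action_dim, rng):
    """Plan H steps, execute only the first action, observe, re-plan."""
    state = np.asarray(state, dtype=float)
    trajectory = [state]
    for _ in range(n_steps):
        plan = cem_plan(model, reward_fn, state, horizon, action_dim, rng)
        state = env_step(state, plan[0])    # real transition, not the model's
        trajectory.append(state)
    return np.array(trajectory)

# Toy stand-ins (assumptions): the model is slightly wrong about the real dynamics.
toy_model = lambda s, a: s + a
true_step = lambda s, a: s + 0.9 * a
toy_reward = lambda s: -np.sum((s - 1.0) ** 2, axis=1)

rng = np.random.default_rng(0)
trajectory = mpc_loop(true_step, toy_model, toy_reward, np.array([0.0]),
                      n_steps=5, horizon=3, action_dim=1, rng=rng)
print(trajectory.ravel())
```

Because each plan starts from the newly observed state, the 10% model error in this toy setup does not accumulate across steps. It is corrected at every re-plan.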

This lesson closes the P-branch of the Lesson 3 dispatch table. Phase 2 has now toured all five families:

  • π (policy): L4 REINFORCE, L5 actor-critic, L8 PPO
  • V (state value): L5 as critic in actor-critic
  • Q (action value): L6 value-based RL, L7 DQN
  • A (advantage): L5 as advantage in actor-critic, L8 in PPO
  • P (dynamics model): L9 learning the model, L10 (this) planning with it

The reader leaves with a map of “what each algorithm estimates” across modern deep RL, plus the engineering tricks each family needs to be stable (DQN’s three patches for off-policy reuse, PPO’s clipped surrogate for on-policy stability, MPC’s receding horizon for model bias).

Lessons 11 and 12 zoom out to a different angle: control as inference. The entire RL problem can be reformulated as a probabilistic-inference problem in a graphical model, where actions are latent variables and “high reward” is the evidence we condition on. Different mathematical scaffolding, same underlying control problem; useful for connecting RL to variational inference, normalizing flows, and the Bayesian-RL literature.

Consider one CEM iteration on a 1D target problem: a ∈ [-1, 1], dynamics s' = s + a, reward -(s' - 1)² evaluated at the next state, s_0 = 0, H = 1. The initial distribution is q_0 = N(0, 1). Samples {-0.5, 0.0, 0.5, 0.8} score {-2.25, -1.00, -0.25, -0.04}. With top-K = 2, the elites are {0.5, 0.8}. Refit:

μ_1 = (0.5 + 0.8) / 2 = 0.65
σ_1 = √(((0.5 - 0.65)² + (0.8 - 0.65)²) / 2) = √0.0225 = 0.15

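q_1 = N(0.65, 0.15²). The distance from the mean to the optimum a* = 1.0 dropped from 1.0 (initial mean 0) to 0.35 after one iteration, and the width shrank from σ = 1.0 to σ = 0.15. In a typical run, a couple more iterations bring the mean close to the optimum, though the exact count depends on the samples drawn and on σ not collapsing first.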
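The arithmetic of this worked example can be checked with a short NumPy script. It is a sketch that hard-codes the four samples from the example rather than drawing them from q_0; the variable names are illustrative.

```python
import numpy as np

samples = np.array([-0.5, 0.0, 0.5, 0.8])   # the four actions from the example
s0 = 0.0

next_states = s0 + samples                  # dynamics s' = s + a, with H = 1
scores = -(next_states - 1.0) ** 2          # reward -(s' - 1)^2

elite_idx = np.argsort(scores)[-2:]         # indices of the top-K = 2 scores
elites = np.sort(samples[elite_idx])

mu_1 = elites.mean()
sigma_1 = elites.std()                      # population std (ddof=0): divides by K = 2

print(scores)                               # approximately [-2.25, -1.0, -0.25, -0.04]
print(elites, round(mu_1, 4), round(sigma_1, 4))
```

Note that the refit divides by K, not K - 1. With only two elites, the sample standard deviation (`ddof=1`) would give roughly 0.21 instead of 0.15, so the choice of estimator visibly changes how fast the distribution narrows.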

  • Previous (Lesson 9): Learning the dynamics. The model itself.
  • This lesson: Using the model. Random shooting, CEM, MPC, MuZero.
  • Next (Lesson 11): Variational inference for RL. A different angle: reformulate the RL problem as probabilistic inference in a graphical model.
  • Later (Lesson 13): RLHF. The L8 PPO algorithm applied to language model fine-tuning.

Random shooting is the baseline, CEM is the workhorse, MPC is the wrapper, and MuZero is the contemporary breakthrough. The dispatch table from Lesson 3 is now fully toured: every algorithm family covered in Phase 2 maps to one of π, V, Q, A, or P (or a hybrid). The Phase 2 → Phase 3 boundary checkpoint after Lesson 12 will mark the end of the algorithmic core; Phase 3 then covers the production applications (RLHF, agentic systems, real-world robotics).