Summary: Planning with a learned model
The one paragraph version
Once you have a learned dynamics model (Lesson 9), you can plan with it. The simplest planner is random shooting: sample N action sequences uniformly, score each by rolling the model forward, and pick the best. The standard upgrade is the cross-entropy method (CEM): iteratively refit a Gaussian over action sequences to the elite (top-scoring) samples, concentrating the distribution around high-reward regions over 5 to 10 iterations. Model Predictive Control (MPC) wraps any planner in a receding-horizon loop: plan H steps, execute one, observe, re-plan. The receding horizon limits how far ahead the model needs to be accurate; the compounding-error analysis from Lesson 9 is the reason H stays small (5 to 30 steps for most continuous-control problems). One of the most ambitious learned-model algorithms to date is MuZero (Schrittwieser et al., 2020), which trains a dynamics network jointly with policy and value networks inside an MCTS planning loop. MuZero’s key innovation: the model lives entirely in hidden-state space and is trained for planning quality (policy + value + reward predictions), not for raw-observation reconstruction. The result is a model better suited to control than one trained only for one-step prediction accuracy. MuZero mastered Go, chess, shogi, and Atari without being told the rules. This lesson closes the P-branch of the Lesson 3 dispatch table and completes the Phase 2 tour of all five algorithmic families.
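A minimal random-shooting sketch, assuming a toy stand-in for the learned model: the 1D dynamics s' = s + a with the reward evaluated on the next state, as in the worked check later in this summary. The names `toy_model` and `random_shooting` are illustrative, not taken from Lesson 9, and the sample count and horizon are arbitrary choices.

```python
import numpy as np

def toy_model(state, action):
    """Hypothetical stand-in for a learned model: s' = s + a, r = -(s' - 1)^2."""
    next_state = state + action
    reward = -(next_state - 1.0) ** 2
    return next_state, reward

def random_shooting(model, state, horizon, n_samples, rng, low=-1.0, high=1.0):
    """Sample action sequences uniformly, score each by rollout, return the best."""
    # One row per candidate action sequence, one column per time step.
    actions = rng.uniform(low, high, size=(n_samples, horizon))
    states = np.full(n_samples, state, dtype=float)
    returns = np.zeros(n_samples)
    for t in range(horizon):
        # Roll every candidate forward one step through the model in parallel.
        states, rewards = model(states, actions[:, t])
        returns += rewards
    best = int(np.argmax(returns))
    return actions[best], returns[best]

rng = np.random.default_rng(0)
best_seq, best_ret = random_shooting(toy_model, 0.0, horizon=3, n_samples=1000, rng=rng)
print("best action sequence:", best_seq, "predicted return:", best_ret)
```

Every candidate is scored independently, which is why the method parallelizes so easily. The same independence is why it wastes samples as the action dimension or horizon grows.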
Five things to remember
- Random shooting is the baseline planner: sample, score, pick. Cheap and embarrassingly parallel, but wasteful in high-dimensional action spaces.
- CEM is the workhorse: iteratively refit a Gaussian to elite samples. Each iteration concentrates around high-reward regions. Converges in 5 to 10 iterations on most problems.
- MPC wraps any planner in a receding-horizon loop. Plan H steps, execute one, re-plan from the new real-world state. The model needs to be accurate only over the H-step horizon.
- MuZero learns a dynamics model end-to-end inside MCTS. The model lives in hidden-state space, never predicts raw observations, and is trained for planning quality. It mastered Go, chess, shogi, and Atari without being given the rules.
- Family decision: MuZero/AlphaZero for discrete board / Atari games with deep tree search; MPC + CEM for continuous control with short horizons; model-free PPO for language and everything else.
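The CEM and MPC bullets above can be sketched together as follows. This is a rough illustration under stated assumptions, not a reference implementation: `toy_model` is a hypothetical stand-in for a learned model (s' = s + a, reward -(s' - 1)²), `true_env_step` is a hypothetical real environment that happens to match the model exactly, and the sample counts, elite size, and iteration count are arbitrary choices.

```python
import numpy as np

def toy_model(state, action):
    """Hypothetical learned model: s' = s + a, r = -(s' - 1)^2."""
    next_state = state + action
    return next_state, -(next_state - 1.0) ** 2

def cem_plan(model, state, horizon, rng, n_samples=64, n_elites=8, n_iters=5):
    """Refit a diagonal Gaussian over action sequences to the elite samples."""
    mean = np.zeros(horizon)
    std = np.ones(horizon)
    for _ in range(n_iters):
        # Sample candidate sequences and clip them to the action bounds [-1, 1].
        actions = np.clip(rng.normal(mean, std, size=(n_samples, horizon)), -1.0, 1.0)
        states = np.full(n_samples, state, dtype=float)
        returns = np.zeros(n_samples)
        for t in range(horizon):
            states, rewards = model(states, actions[:, t])
            returns += rewards
        # Keep the top-scoring sequences and refit the Gaussian to them.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids a zero-width Gaussian
    return mean

def true_env_step(state, action):
    """Hypothetical real environment (assumed to match the model here)."""
    return state + float(np.clip(action, -1.0, 1.0))

def mpc_loop(model, env_step, state, horizon, n_steps, rng):
    """Receding horizon: plan H steps, execute only the first, then re-plan."""
    trajectory = [state]
    for _ in range(n_steps):
        plan = cem_plan(model, state, horizon, rng)
        state = env_step(state, plan[0])  # only the first planned action is executed
        trajectory.append(state)
    return trajectory

rng = np.random.default_rng(0)
trajectory = mpc_loop(toy_model, true_env_step, 0.0, horizon=5, n_steps=10, rng=rng)
print("final state:", trajectory[-1])
```

Because the loop re-plans from the observed state at every step, an error in the model's later predictions affects only plans that are discarded after one action.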
Why this matters
This lesson closes the P-branch of the Lesson 3 dispatch table. Phase 2 has now toured all five families:
- π (policy): L4 REINFORCE, L5 actor-critic, L8 PPO
- V (state value): L5 as critic in actor-critic
- Q (action value): L6 value-based RL, L7 DQN
- A (advantage): L5 as advantage in actor-critic, L8 in PPO
- P (dynamics model): L9 learning the model, L10 (this) planning with it
The reader leaves with a map of “what each algorithm estimates” across modern deep RL, plus the engineering tricks each family needs to be stable (DQN’s three patches for off-policy reuse, PPO’s clipped surrogate for on-policy stability, MPC’s receding horizon for model bias).
Lessons 11 and 12 zoom out to a different angle: control as inference. The entire RL problem can be reformulated as a probabilistic-inference problem in a graphical model, where actions are latent variables and “high reward” is the evidence we condition on. Different mathematical scaffolding, same underlying control problem; useful for connecting RL to variational inference, normalizing flows, and the Bayesian-RL literature.
Worked check (memory anchor)
One CEM iteration on a 1D target problem (a ∈ [-1, 1], dynamics s' = s + a, reward -(s' - 1)² evaluated on the next state, s_0 = 0, H = 1). Initial q_0 = N(0, 1). Samples {-0.5, 0.0, 0.5, 0.8} score {-2.25, -1.00, -0.25, -0.04}. Top-K = 2 elites: {0.5, 0.8}. Refit:
μ_1 = (0.5 + 0.8) / 2 = 0.65
σ_1 = √(((0.5 - 0.65)² + (0.8 - 0.65)²) / 2) = √0.0225 = 0.15
q_1 = N(0.65, 0.15²)
The distance from the mean to the optimum a* = 1.0 dropped from 1.0 (initial mean 0) to 0.35 after one iteration. The width shrank from σ = 1.0 to σ = 0.15. Two more iterations of the same kind can bring the distribution to within about 0.01 of the optimum, depending on the samples drawn.
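The arithmetic of this worked check can be reproduced in a few lines. This is a sketch using the four samples given above; the variable names are illustrative, and the refit uses the population standard deviation (divide by K = 2), which is what the hand calculation does.

```python
import numpy as np

samples = np.array([-0.5, 0.0, 0.5, 0.8])  # the four sampled actions from the text
s0 = 0.0

# With H = 1 the next state is s0 + a, and the reward is -(s' - 1)^2.
scores = -((s0 + samples) - 1.0) ** 2

# Keep the top-K = 2 samples by score and refit the Gaussian to them.
elites = samples[np.argsort(scores)[-2:]]
mu_1 = elites.mean()
sigma_1 = elites.std()  # ddof=0 by default, matching the division by 2 in the text

distance_to_optimum = 1.0 - mu_1
print(scores, mu_1, sigma_1, distance_to_optimum)
```

Up to floating-point rounding, the printed values match the scores, mean 0.65, width 0.15, and remaining distance 0.35 computed by hand.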
Where this fits
- Previous (Lesson 9): Learning the dynamics. The model itself.
- This lesson: Using the model. Random shooting, CEM, MPC, MuZero.
- Next (Lesson 11): Variational inference for RL. A different angle: reformulate the RL problem as probabilistic inference in a graphical model.
- Later (Lesson 13): RLHF. The L8 PPO algorithm applied to language model fine-tuning.
What you should remember
Random shooting is the baseline, CEM is the workhorse, MPC is the wrapper, and MuZero is the contemporary breakthrough. The dispatch table from Lesson 3 is now fully toured: every algorithm family covered in Phase 2 maps to one of π, V, Q, A, or P (or a hybrid). The Phase 2 → Phase 3 boundary checkpoint after Lesson 12 will mark the end of the algorithmic core; Phase 3 then covers the production applications (RLHF, agentic systems, real-world robotics).