Cheatsheet: Planning with a learned model
The planning problem
Given a learned model P̂(s' | s, a) and R̂(s, a), and the current state s_0 (planning time is re-indexed from 0 at every call), find:

(a_0, ..., a_{H-1})* = argmax_{a_0, ..., a_{H-1}} E [ Σ_{t=0}^{H-1} γ^t · R̂(s_t, a_t) + γ^H · V̂(s_H) ]

The expectation is over model transitions s_{t+1} ~ P̂(· | s_t, a_t). The terminal value V̂(s_H) (often from a learned critic) patches the truncation past horizon H. Set H to where the model’s validation error stays bounded.
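The objective above can be sketched as a scoring function for one candidate action sequence. This is a minimal illustration, assuming a deterministic 1D model so the expectation collapses to a single rollout; `score_sequence`, `step`, `reward`, and `value` are hypothetical names, not part of any library.

```python
from typing import Callable, Sequence


def score_sequence(
    s0: float,
    actions: Sequence[float],
    step: Callable[[float, float], float],    # hypothetical stand-in for the learned dynamics
    reward: Callable[[float, float], float],  # hypothetical stand-in for R̂(s, a)
    value: Callable[[float], float],          # hypothetical stand-in for V̂(s)
    gamma: float = 0.99,
) -> float:
    """Discounted model return over H = len(actions) steps, plus the terminal value."""
    s, total = s0, 0.0
    for t, a in enumerate(actions):
        total += (gamma ** t) * reward(s, a)  # γ^t · R̂(s_t, a_t)
        s = step(s, a)                        # s_{t+1} predicted by the model
    return total + (gamma ** len(actions)) * value(s)  # + γ^H · V̂(s_H)
```

With a stochastic model, one would average this score over several sampled rollouts per action sequence.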
Three planners, from simple to sophisticated
| Planner | Family | Best for | Compute per step |
|---|---|---|---|
| Random shooting | Sample-and-score | Low-dim actions, baseline | Embarrassingly parallel; N rollouts |
| CEM (cross-entropy method) | Iterative Gaussian fit | Continuous control, default | J·N rollouts (typical J = 5-10, N = 100-1000) |
| CMA-ES | CEM + rank-µ cov updates | Higher dimensions | Same order as CEM, better convergence |
| MPPI | Softmax-weighted CEM | Smooth dynamics, robotics | Same order as CEM |
| MCTS (with learned model) | Tree search | Discrete actions, games | Tree size × simulation cost |
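To make the "softmax-weighted" entry for MPPI concrete, here is a rough sketch of the weighting step only: instead of a hard top-K elite cut, every candidate contributes to the new mean in proportion to exp(score / temperature). The function name `mppi_weighted_mean` and the `temperature` parameter are illustrative assumptions, and the noise-sampling and control-cost terms of full MPPI are omitted.

```python
import math
from typing import List, Sequence


def mppi_weighted_mean(
    candidates: Sequence[Sequence[float]],  # N action sequences, each of length H
    scores: Sequence[float],                # one model-rollout score per candidate
    temperature: float = 1.0,               # assumed > 0; lower values act more like a hard argmax
) -> List[float]:
    """Softmax-weighted average of candidate action sequences."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((sc - top) / temperature) for sc in scores]
    z = sum(weights)
    horizon = len(candidates[0])
    return [
        sum(w * cand[t] for w, cand in zip(weights, candidates)) / z
        for t in range(horizon)
    ]
```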
Random shooting (the baseline)
1. Sample N action sequences uniformly
2. For each, roll the model forward H steps; compute score
3. Pick the best; execute its first action

Embarrassingly parallel. Wasteful but bulletproof. Use when N is large and dim_a · H < 10.
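The three steps above can be sketched as follows. This is a minimal version, assuming scalar actions in a box [low, high] and a caller-supplied `score` function that performs the model rollout; all names here are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence


def random_shooting(
    score: Callable[[Sequence[float]], float],  # hypothetical: model-rollout score of a sequence
    horizon: int,
    n_samples: int = 1000,
    low: float = -1.0,
    high: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Sample action sequences uniformly and return the best-scoring one."""
    rng = rng or random.Random()
    best_seq: List[float] = []
    best_score = float("-inf")
    for _ in range(n_samples):
        seq = [rng.uniform(low, high) for _ in range(horizon)]
        sc = score(seq)
        if sc > best_score:
            best_seq, best_score = seq, sc
    return best_seq  # the caller executes best_seq[0]
```

The loop is written sequentially for clarity; since candidates are independent, it could be batched or parallelized.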
CEM iteration (the workhorse)
Initialize q_0 = N(μ_0, Σ_0) over action sequences
For j = 1, ..., J:
  1. Sample N candidates from q_{j-1}
  2. Score by model rollouts
  3. Pick top K elites (typical K = 0.1 · N)
  4. Refit: μ_j = mean(elites), Σ_j = cov(elites)
Return μ_J as the planned action sequence

Typical: N = 200, K = 20, J = 5. Adjust per-problem.
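A compact sketch of the CEM loop, under two simplifying assumptions that are not in the pseudocode above: the Gaussian is diagonal (one standard deviation per timestep rather than a full covariance), and samples are clipped to an action box [low, high]. The name `cem_plan` and its parameters are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Optional, Sequence


def cem_plan(
    score: Callable[[Sequence[float]], float],  # hypothetical: model-rollout score of a sequence
    horizon: int,
    n_samples: int = 200,
    n_elites: int = 20,
    n_iters: int = 5,
    init_std: float = 1.0,
    low: float = -1.0,
    high: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Cross-entropy method over scalar-action sequences with a diagonal Gaussian."""
    rng = rng or random.Random()
    mu = [0.0] * horizon
    sigma = [init_std] * horizon
    for _ in range(n_iters):
        # 1. Sample N candidates from the current Gaussian (clipped to the action box).
        candidates = [
            [min(high, max(low, rng.gauss(m, s))) for m, s in zip(mu, sigma)]
            for _ in range(n_samples)
        ]
        # 2-3. Score by model rollouts and keep the top-K elites.
        elites = sorted(candidates, key=score, reverse=True)[:n_elites]
        # 4. Refit the mean and per-timestep standard deviation to the elites.
        columns = list(zip(*elites))
        mu = [statistics.fmean(col) for col in columns]
        sigma = [statistics.pstdev(col) + 1e-6 for col in columns]  # small floor avoids collapse to 0
    return mu
```

A full-covariance refit, as written in the pseudocode, would capture correlations between timesteps at the cost of more elites needed per iteration.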
Worked example, one iteration
1D, a ∈ [-1, 1], dynamics s' = s + a, reward -(s' - 1)² evaluated on the next state, s_0 = 0, H = 1, q_0 = N(0, 1). Samples {-0.5, 0.0, 0.5, 0.8}.

| a | s_1 | reward |
|---|---|---|
| -0.5 | -0.5 | -2.25 |
| 0.0 | 0.0 | -1.00 |
| 0.5 | 0.5 | -0.25 |
| 0.8 | 0.8 | -0.04 |
Top 2 elites: {0.5, 0.8} → μ_1 = 0.65, σ_1 = 0.15. Iteration 2 samples from N(0.65, 0.15²), much tighter. Converges to a* = 1 in ~5 iterations.
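The arithmetic in this iteration can be checked with a few lines of standard-library Python. The variable names are illustrative; note that σ_1 = 0.15 corresponds to the population standard deviation (`statistics.pstdev`), whereas the sample standard deviation of the two elites would be about 0.21.

```python
import statistics

s0 = 0.0
samples = [-0.5, 0.0, 0.5, 0.8]


def reward(a: float) -> float:
    """Reward of the next state s_1 = s_0 + a, as in the table above."""
    s1 = s0 + a
    return -(s1 - 1.0) ** 2


rewards = {a: reward(a) for a in samples}

# Keep the top-2 actions by reward, then refit the Gaussian to them.
elites = sorted(samples, key=reward, reverse=True)[:2]
mu_1 = statistics.fmean(elites)
sigma_1 = statistics.pstdev(elites)  # population std, matching the 0.15 above
```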
Model Predictive Control (MPC)
For each timestep t:
  1. Plan an H-step sequence (a_t, ..., a_{t+H-1}) with CEM
  2. Execute only a_t
  3. Observe s_{t+1}; discard the rest; re-plan from s_{t+1}

“Receding horizon.” The model only needs accuracy over the H-step horizon, not the full episode.
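The receding-horizon loop can be sketched independently of the planner. In this minimal version, `plan` is any function from a state to an action sequence (CEM, random shooting, or something else) and `env_step` stands in for the real environment; both names, and the scalar state, are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def run_mpc(
    s0: float,
    env_step: Callable[[float, float], float],  # hypothetical real environment: (s, a) -> s'
    plan: Callable[[float], Sequence[float]],   # hypothetical planner: state -> H-step action sequence
    n_steps: int,
) -> List[float]:
    """Re-plan at every step and execute only the first planned action."""
    s = s0
    states = [s]
    for _ in range(n_steps):
        actions = plan(s)            # 1. plan an H-step sequence from the current state
        s = env_step(s, actions[0])  # 2-3. execute only the first action, observe the next state
        states.append(s)             # the remaining planned actions are discarded
    return states
```

Because the plan is rebuilt from the observed state each step, errors in the later part of any single plan never get executed.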
Choosing H
| Too small | Too large |
|---|---|
| Greedy; misses multi-step opportunities | Compounds model error |
| Fast | Slow |
| Robust to model bias | Sensitive to model bias |
Typical: H = 5 to 30 for continuous control. Pin it to where model validation error stays bounded.
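One possible way to operationalize "where model validation error stays bounded" is sketched below. It assumes you have already measured the model's k-step open-loop prediction error on held-out data for k = 1, 2, ..., and that you pick a tolerance by hand; `pick_horizon`, the error list, and the tolerance are all hypothetical.

```python
from typing import Sequence


def pick_horizon(k_step_errors: Sequence[float], tolerance: float) -> int:
    """Largest H such that the k-step validation error is <= tolerance for every k <= H."""
    horizon = 0
    for k, err in enumerate(k_step_errors, start=1):
        if err > tolerance:
            break  # error is no longer bounded from this step on
        horizon = k
    return horizon
```

For instance, with errors `[0.01, 0.02, 0.05, 0.2, 0.1]` and a tolerance of `0.1`, the rule stops at the fourth step and returns a horizon of 3.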
MuZero (end-to-end learned model + MCTS)
Three jointly trained networks:

| Network | Job |
|---|---|
| Representation h_θ | observation → hidden state |
| Dynamics g_θ | (hidden state, action) → (next hidden state, reward) |
| Prediction f_θ | hidden state → (policy, value) |
Key insight: the model lives in hidden-state space, never predicts raw observations. Trained for planning quality (policy + value + reward losses), not for one-step observation accuracy.
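How the three functions compose can be shown with a small sketch. It evaluates one fixed action sequence entirely in hidden-state space: the observation is encoded once, and nothing is ever decoded back. This is only the model-unrolling piece, not MuZero's MCTS; the callables `h`, `g`, `f` are toy stand-ins for the networks, and the function name and default discount are assumptions.

```python
from typing import Callable, Sequence, Tuple

Hidden = Tuple[float, ...]


def latent_rollout(
    observation: float,
    actions: Sequence[int],
    h: Callable[[float], Hidden],                        # representation: observation -> hidden state
    g: Callable[[Hidden, int], Tuple[Hidden, float]],    # dynamics: (hidden, action) -> (next hidden, reward)
    f: Callable[[Hidden], Tuple[Sequence[float], float]],  # prediction: hidden -> (policy, value)
    gamma: float = 0.997,
) -> float:
    """Discounted predicted rewards along `actions`, bootstrapped with the predicted value."""
    hidden = h(observation)  # the only place the raw observation is used
    total, discount = 0.0, 1.0
    for a in actions:
        hidden, r = g(hidden, a)
        total += discount * r
        discount *= gamma
    _policy, value = f(hidden)  # the policy head would guide tree search; unused here
    return total + discount * value
```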
| Result | Where |
|---|---|
| Master Go, chess, shogi without knowing rules | (Schrittwieser 2020) |
| Match AlphaZero strength on board games | (Schrittwieser 2020) |
| Set a new state of the art on the 57-game Atari benchmark at time of publication | (Schrittwieser 2020) |
Variants: EfficientZero (median human-normalized score above 100% on Atari 100k, a data budget of about 2 hours of real-time game experience), Sampled MuZero (continuous actions), MuZero Unplugged (offline RL).
Family decision rubric
| Problem | Use |
|---|---|
| Continuous control + learned model + short horizon | MPC + CEM (PETS-style) |
| Discrete board / Atari games + lots of training compute | MuZero, AlphaZero |
| Continuous control + Dyna-style imagined rollouts | Dreamer, MBPO (L9) |
| Language model agents | Model-free PPO (L8); planning at prompt level |
| Most everything else | Model-free PPO or actor-critic |
Common pitfalls
- Planning past the model’s reliable horizon
- Using random shooting where CEM would obviously win (dim_a · H > 10)
- Forgetting the terminal V̂(s_H) value (truncating loses the tail)
- Treating MPC horizon as fixed instead of receding
- Trying to train MuZero from scratch on a single GPU (use EfficientZero for accessibility)
- Conflating MCTS with random rollouts (MCTS uses learned priors via the PUCT formula)
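For the last pitfall, a sketch of a PUCT-style selection score shows where the learned prior enters. This uses the simplified AlphaZero-style form Q + c · P · sqrt(N_parent) / (1 + N_child); MuZero's published rule lets the constant grow slowly with the visit count, which is omitted here. The function name and the default `c_puct` are illustrative assumptions.

```python
import math


def puct_score(
    q_value: float,       # mean value estimate of the child
    prior: float,         # policy-network prior P(s, a) for the child
    parent_visits: int,   # total visits through the parent node
    child_visits: int,    # visits of this child
    c_puct: float = 1.25,  # exploration constant (illustrative default)
) -> float:
    """Simplified PUCT: exploitation term plus a prior-weighted exploration bonus."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration
```

With equal Q-values and no visits yet, the child with the larger prior scores higher, which is what separates this search from uniform random rollouts.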
What you should remember
- Random shooting is the baseline; CEM is the workhorse; MuZero is the end-to-end learned-model MCTS.
- MPC wraps any planner in a receding-horizon loop, which limits how far ahead the model needs to be reliable.
- The CEM iteration is sample → score → top-K elites → refit Gaussian; converges in 5 to 10 iterations.
- MuZero’s hidden-state-only model is the key innovation: never predict raw observations, only what supports good planning.
This closes the P-branch and the Phase 2 dispatch-table tour.