Cheatsheet: Planning with a learned model

Given a learned model P̂(s' | s, a) and R̂(s, a), at the current state (indexed as s_0 within the plan) find:

(a_0, ..., a_{H-1})* = argmax E [ Σ_{t=0}^{H-1} γ^t · R̂(s_t, a_t) + γ^H · V̂(s_H) ]

The terminal value V̂(s_H) (often from a learned critic) compensates for truncating the return at horizon H. Choose H no longer than the range over which the model’s multi-step validation error stays bounded.
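As a minimal sketch of this objective for a single candidate sequence, assume a deterministic, scalar-state model so that the expectation reduces to one rollout. The names `plan_score`, `step`, `reward`, and `value` are hypothetical stand-ins for P̂, R̂, and V̂, not part of any library.

```python
from typing import Callable, Sequence


def plan_score(
    s0: float,
    actions: Sequence[float],
    step: Callable[[float, float], float],    # hypothetical learned dynamics: (s, a) -> s'
    reward: Callable[[float, float], float],  # hypothetical learned reward: (s, a) -> r
    value: Callable[[float], float],          # hypothetical learned critic: s -> V(s)
    gamma: float = 0.99,
) -> float:
    """Discounted model return of one action sequence plus the bootstrapped tail."""
    s, total = s0, 0.0
    for t, a in enumerate(actions):
        total += (gamma ** t) * reward(s, a)
        s = step(s, a)
    # gamma^H * V(s_H) stands in for everything past the horizon H = len(actions).
    return total + (gamma ** len(actions)) * value(s)
```

With a stochastic model, one would average this score over several sampled rollouts per action sequence.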

Three planners, from simple to sophisticated

| Planner | Family | Best for | Compute per step |
| --- | --- | --- | --- |
| Random shooting | Sample-and-score | Low-dim actions, baseline | Embarrassingly parallel; N rollouts |
| CEM (cross-entropy method) | Iterative Gaussian fit | Continuous control, default | J·N rollouts (typical J = 5-10, N = 100-1000) |
| CMA-ES | CEM + rank-µ cov updates | Higher dimensions | Same order as CEM, better convergence |
| MPPI | Softmax-weighted CEM | Smooth dynamics, robotics | Same order as CEM |
| MCTS (with learned model) | Tree search | Discrete actions, games | Tree size × simulation cost |

Random shooting:

1. Sample N action sequences uniformly
2. For each, roll the model forward H steps; compute score
3. Pick the best; execute its first action

Embarrassingly parallel. Wasteful but bulletproof. Use when N is large and dim_a · H < 10.
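The three random-shooting steps can be sketched in plain Python. This is an illustration under stated assumptions: scalar actions bounded in [-1, 1], and a hypothetical `score` callable that returns the model return of an action sequence from a given state. The function name `random_shooting` is also made up for this sketch.

```python
import random
from typing import Callable, List, Optional, Sequence


def random_shooting(
    s0: float,
    score: Callable[[float, Sequence[float]], float],  # hypothetical: (state, actions) -> model return
    horizon: int,
    n_samples: int = 1000,
    low: float = -1.0,   # assumed action bounds
    high: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the best of n_samples uniformly sampled action sequences."""
    rng = rng or random.Random()
    best_seq: List[float] = []
    best_score = float("-inf")
    for _ in range(n_samples):
        # 1. Sample one action sequence uniformly within the bounds.
        seq = [rng.uniform(low, high) for _ in range(horizon)]
        # 2. Score it with the model.
        sc = score(s0, seq)
        # 3. Keep the best sequence seen so far.
        if sc > best_score:
            best_seq, best_score = seq, sc
    return best_seq
```

Only the first element of the returned sequence would be executed; the candidates are independent, so the loop body parallelizes trivially.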

CEM (cross-entropy method):

Initialize q_0 = N(μ_0, Σ_0) over action sequences
For j = 1, ..., J:
    1. Sample N candidates from q_{j-1}
    2. Score by model rollouts
    3. Pick top K elites (typical K = 0.1 · N)
    4. Refit: μ_j = mean(elites), Σ_j = cov(elites)
Return μ_J as the planned action sequence

Typical: N = 200, K = 20, J = 5. Adjust per-problem.
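The CEM loop above might look like the following in plain Python. Note the simplifications, which are assumptions of this sketch rather than part of the recipe: it fits a diagonal Gaussian (one standard deviation per timestep) instead of the full cov(elites), clips samples to assumed action bounds [-1, 1], and floors the standard deviation at `min_std`. The `score` callable and the name `cem_plan` are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Optional, Sequence


def cem_plan(
    s0: float,
    score: Callable[[float, Sequence[float]], float],  # hypothetical: (state, actions) -> model return
    horizon: int,
    n_samples: int = 200,
    n_elites: int = 20,
    n_iters: int = 5,
    init_std: float = 1.0,
    low: float = -1.0,     # assumed action bounds
    high: float = 1.0,
    min_std: float = 1e-3,  # floor so the sampler never collapses to zero width
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Diagonal-Gaussian CEM over scalar-action sequences; returns the final mean."""
    rng = rng or random.Random()
    mu = [0.0] * horizon
    sigma = [init_std] * horizon
    for _ in range(n_iters):
        # 1. Sample N candidates from the current Gaussian, clipped to bounds.
        candidates = [
            [min(high, max(low, rng.gauss(m, s))) for m, s in zip(mu, sigma)]
            for _ in range(n_samples)
        ]
        # 2-3. Score by model rollouts and keep the top-K elites.
        elites = sorted(candidates, key=lambda seq: score(s0, seq), reverse=True)[:n_elites]
        # 4. Refit mean and (population) standard deviation per timestep.
        columns = list(zip(*elites))
        mu = [statistics.fmean(col) for col in columns]
        sigma = [max(statistics.pstdev(col), min_std) for col in columns]
    return mu
```

The defaults mirror the typical values N = 200, K = 20, J = 5; a full-covariance refit would capture correlations between timesteps at higher cost.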

Worked example (one CEM iteration): 1D, a ∈ [-1, 1], dynamics s' = s + a, reward -(s' - 1)² evaluated at the next state, s_0 = 0, H = 1, q_0 = N(0, 1). Samples: {-0.5, 0.0, 0.5, 0.8}.

| a | s_1 | reward |
| --- | --- | --- |
| -0.5 | -0.5 | -2.25 |
| 0.0 | 0.0 | -1.00 |
| 0.5 | 0.5 | -0.25 |
| 0.8 | 0.8 | -0.04 |

Top 2 elites: {0.5, 0.8}, so μ_1 = 0.65 and σ_1 = 0.15 (population standard deviation). Iteration 2 samples from N(0.65, 0.15²), which is much tighter. Converges toward a* = 1 in ~5 iterations.
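The arithmetic of this iteration can be checked with a few lines of standard-library Python. The variable names are chosen for this sketch; the numbers are the ones from the example.

```python
import statistics

s0 = 0.0
samples = [-0.5, 0.0, 0.5, 0.8]


def reward_after(a: float) -> float:
    """Reward at the next state s_1 = s_0 + a for the toy problem."""
    s1 = s0 + a
    return -(s1 - 1.0) ** 2


rewards = [reward_after(a) for a in samples]

# Keep the K = 2 highest-reward samples as elites.
elites = sorted(samples, key=reward_after, reverse=True)[:2]

# Refit the Gaussian: mean and population standard deviation of the elites.
mu_1 = statistics.fmean(elites)
sigma_1 = statistics.pstdev(elites)

print(elites)                            # [0.8, 0.5]
print(round(mu_1, 2), round(sigma_1, 2))  # 0.65 0.15
```

Using the sample standard deviation (`statistics.stdev`) instead would give about 0.21 for two elites, so the choice of estimator matters when K is small.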

MPC (model predictive control):

For each timestep t:
    1. Plan an H-step sequence (a_t, ..., a_{t+H-1}) with CEM
    2. Execute only a_t
    3. Observe s_{t+1}; discard the rest; re-plan from s_{t+1}

“Receding horizon.” The model only needs accuracy over the H-step horizon, not the full episode.
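A minimal sketch of the receding-horizon loop, assuming scalar states and actions. Both callables are hypothetical: `plan` is any planner that returns an H-step action sequence from a state (for example CEM on the learned model), and `env_step` stands in for the real environment.

```python
from typing import Callable, List, Sequence


def run_mpc(
    s0: float,
    plan: Callable[[float], Sequence[float]],   # hypothetical planner: state -> H-step action sequence
    env_step: Callable[[float, float], float],  # hypothetical real environment: (s, a) -> s'
    n_steps: int,
) -> List[float]:
    """Receding-horizon control: re-plan at every step, execute only the first action."""
    s = s0
    trajectory = [s]
    for _ in range(n_steps):
        actions = plan(s)            # 1. plan a full H-step sequence from the current state
        s = env_step(s, actions[0])  # 2. execute only the first action
        trajectory.append(s)         # 3. observe the new state; the rest of the plan is discarded
    return trajectory
```

Because the plan is recomputed from each observed state, errors between the model and the real environment get corrected at every step instead of accumulating over the episode.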

| Horizon H too small | Horizon H too large |
| --- | --- |
| Greedy; misses multi-step opportunities | Compounds model error |
| Fast | Slow |
| Robust to model bias | Sensitive to model bias |

Typical: H = 5 to 30 for continuous control. Cap it at the longest horizon over which model validation error stays bounded.

MuZero: three jointly trained networks:

| Network | Job |
| --- | --- |
| Representation h_θ | observation → hidden state |
| Dynamics g_θ | (hidden state, action) → (next hidden state, reward) |
| Prediction f_θ | hidden state → (policy, value) |

Key insight: the model lives in hidden-state space and never predicts raw observations. It is trained for planning quality (policy + value + reward losses), not for one-step observation accuracy.
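To make the division of labor concrete, here is a hedged sketch of how the three functions compose when evaluating one action sequence in latent space. The callables `represent`, `dynamics`, and `predict` are hypothetical stand-ins for h_θ, g_θ, and f_θ; in MuZero itself they are neural networks called inside MCTS, which this sketch does not implement.

```python
from typing import Callable, Sequence, Tuple

Hidden = Tuple[float, ...]  # toy hidden-state type for this sketch


def latent_rollout(
    obs: float,
    actions: Sequence[int],
    represent: Callable[[float], Hidden],                    # stand-in for h_theta
    dynamics: Callable[[Hidden, int], Tuple[Hidden, float]],  # stand-in for g_theta
    predict: Callable[[Hidden], Tuple[Sequence[float], float]],  # stand-in for f_theta
    gamma: float = 0.99,  # arbitrary discount chosen for illustration
) -> float:
    """Discounted predicted rewards along a latent rollout plus the final value."""
    hidden = represent(obs)  # the only place an observation is used
    total = 0.0
    for k, a in enumerate(actions):
        hidden, r = dynamics(hidden, a)  # next hidden state and predicted reward
        total += (gamma ** k) * r
    _policy, value = predict(hidden)  # value bootstraps beyond the rollout
    return total + (gamma ** len(actions)) * value
```

After the initial encoding, every step operates on hidden states only, which is why no observation decoder is needed.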

| Result | Where |
| --- | --- |
| Master Go, chess, shogi without being given the rules | (Schrittwieser 2020) |
| Match AlphaZero strength on board games | (Schrittwieser 2020) |
| Set a new state of the art on the 57-game Atari benchmark at time of publication | (Schrittwieser 2020) |

Variants: EfficientZero (above-human median performance on Atari from about 2 hours of real-time game experience, i.e. the Atari 100k data budget), Sampled MuZero (continuous actions), MuZero Unplugged (offline RL).

| Problem | Use |
| --- | --- |
| Continuous control + learned model + short horizon | MPC + CEM (PETS-style) |
| Discrete board / Atari games + lots of training compute | MuZero, AlphaZero |
| Continuous control + Dyna-style imagined rollouts | Dreamer, MBPO (L9) |
| Language model agents | Model-free PPO (L8); planning at prompt level |
| Most everything else | Model-free PPO or actor-critic |

Common pitfalls:

  • Planning past the model’s reliable horizon
  • Using random shooting where CEM would obviously win (dim_a · H > 10)
  • Forgetting the terminal V̂(s_H) value (truncating loses the tail)
  • Treating MPC horizon as fixed instead of receding
  • Trying to train MuZero from scratch on a single GPU (use EfficientZero for accessibility)
  • Conflating MuZero-style MCTS with random rollouts (it guides search with learned policy priors and value estimates via the PUCT formula)

Key takeaways:

  • Random shooting is the baseline; CEM is the workhorse; MuZero is the end-to-end learned-model MCTS.
  • MPC wraps any planner in a receding-horizon loop, which limits how far ahead the model needs to be reliable.
  • The CEM iteration is sample → score → top-K elites → refit Gaussian; it typically converges in 5 to 10 iterations.
  • MuZero’s hidden-state-only model is the key innovation: never predict raw observations, only what supports good planning.

This closes the P-branch and the Phase 2 dispatch-table tour.