Brief: Planning with a learned model

Trace one full iteration of the cross-entropy method (CEM) by hand on a small problem: sample, score, refit. Pick an MPC planning horizon given a model-bias profile. Sketch MuZero’s end-to-end learned-model MCTS loop and explain why training the model for planning quality (not observation reconstruction) is the load-bearing design choice.

L9 ended with two ways to use a learned model: plan with it (this lesson) or imagine with it (Dyna, covered in L9). L10 covers the plan branch: black-box optimizers over action sequences (random shooting, CEM, MPPI, CMA-ES); the MPC receding-horizon wrapper; and MuZero as the contemporary state of the art for end-to-end learned-model planning.

This lesson closes the P-branch of the L3 dispatch table and completes the Phase 2 tour of all five algorithmic families (π/V/Q/A/P all covered or motivated). It is the pedagogical capstone for the Phase 2 algorithm-zoo coverage; L11/L12 then shift to the control-as-inference angle, and L13 onward covers production applications.

Berkeley CS285 lecture on Model-Based Reinforcement Learning with Function Approximation, Sergey Levine, 2023. Primary papers: MuZero (Schrittwieser et al., 2020), AlphaZero (Silver et al., 2018), MPPI (Williams et al., 2017), classical MPC (Garcia, Prett, Morari, 1989), de Boer et al. (2005) CEM tutorial.

Phase 2 lesson 5 (phase_order: 5). Closes the P-branch and completes the dispatch-table tour started in L3. Sets up L11 (variational inference for RL) and L12 (control as inference) as a complementary reformulation of the same control problem. Phase 2 → Phase 3 boundary checkpoint after L12.

  • Recap of L9’s plan-vs-imagine split; this lesson is the plan branch.
  • The planning problem made concrete: argmax over action sequences with truncated horizon + terminal value.
  • Random shooting (the baseline): sample-and-score, embarrassingly parallel, wasteful.
  • Cross-entropy method (CEM): iterative random shooting with Gaussian refitting. Full pseudocode + one fully-worked iteration (1D target problem, 4 samples, top-2 elites). Two main knobs (N and K/N elite fraction) discussed; variants (CMA-ES, MPPI) named as forward references.
  • Model Predictive Control (MPC): receding-horizon wrapper. The “model only needs to be accurate over H steps” insight. Horizon trade-off table.
  • MuZero: end-to-end learned model + MCTS. Three networks (representation, dynamics, prediction). The key innovation: hidden-state-space dynamics, trained for planning quality (policy + value + reward losses), not raw-observation reconstruction. Results: Go/chess/shogi/Atari without the rules.
  • MuZero variants: EfficientZero, Sampled MuZero, MuZero Unplugged.
  • Common pitfalls: planning past model horizon, random shooting at high dim, missing terminal value, fixed-horizon MPC, MuZero compute requirements, conflating MCTS with random rollouts.
  • “Why this matters” anchors AlphaGo / AlphaZero / MuZero / diffusion policies / LLM-as-planner. Family decision rubric.
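
The sample, score, refit loop and the receding-horizon wrapper from the outline above can be sketched together. This is a minimal illustration and not the lesson's pseudocode. The function names (`rollout_return`, `cem_plan`, `mpc_control`), the hyperparameter defaults, and the optional `terminal_value` hook are all hypothetical. The toy dynamics and reward are borrowed from exercise 1 of this brief, and the real environment is assumed to match the model exactly.

```python
import random
import statistics


def rollout_return(model, reward, s0, actions, terminal_value=None):
    """Score one action sequence: sum of rewards under the model, plus an optional terminal value."""
    s, total = s0, 0.0
    for a in actions:
        s = model(s, a)
        total += reward(s)
    if terminal_value is not None:
        total += terminal_value(s)  # stands in for reward beyond the truncated horizon
    return total


def cem_plan(model, reward, s0, horizon, rng, n_samples=64, n_elites=8,
             n_iters=5, a_low=-2.0, a_high=2.0, terminal_value=None):
    """Cross-entropy method over action sequences with a diagonal Gaussian."""
    mu = [0.0] * horizon
    sigma = [(a_high - a_low) / 2.0] * horizon
    for _ in range(n_iters):
        # Sample: draw N sequences from the current Gaussian, clipped to the action bounds.
        samples = [
            [min(a_high, max(a_low, rng.gauss(m, sd))) for m, sd in zip(mu, sigma)]
            for _ in range(n_samples)
        ]
        # Score: rank sequences by model-predicted return (ascending).
        samples.sort(key=lambda seq: rollout_return(model, reward, s0, seq, terminal_value))
        elites = samples[-n_elites:]
        # Refit: per-timestep mean and standard deviation of the elites.
        mu = [statistics.fmean(col) for col in zip(*elites)]
        sigma = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mu


def mpc_control(model, env_step, reward, s0, n_steps, horizon, rng):
    """Receding horizon: plan H steps, execute only the first action, then replan."""
    s, states = s0, [s0]
    for _ in range(n_steps):
        plan = cem_plan(model, reward, s, horizon, rng)
        s = env_step(s, plan[0])
        states.append(s)
    return states


def toy_model(s, a):
    return 0.8 * s + a


def toy_reward(s):
    return -(s - 2.0) ** 2


# Assumption: the true environment equals the model, so env_step is toy_model.
states = mpc_control(toy_model, toy_model, toy_reward, s0=0.0,
                     n_steps=5, horizon=3, rng=random.Random(0))
```

Random shooting corresponds to a single iteration of `cem_plan` that returns the best sample instead of refitting. The two main knobs named in the outline appear here as `n_samples` (N) and `n_elites` (K).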

Two exercises:

  1. Trace one CEM iteration by hand. 1D state, a ∈ [-2, 2], dynamics s' = 0.8s + a, reward -(s-2)² scored at the resulting state, s_0 = 0, H = 1. Five samples (-1.5, 0.5, 1.0, 1.8, 2.0). Reader scores each (-12.25, -2.25, -1.00, -0.04, 0), picks top-2 elites, refits Gaussian. Answer: elites {1.8, 2.0}, μ_1 = 1.9, σ_1 = 0.1 (population standard deviation). Part D speculates on iteration 2 dynamics. Part E dual-path verifies: closed-form optimum a* = 2 ↔ CEM convergence to μ_J ≈ 2.

  2. Pick an MPC horizon from a validation-error table. Table shows 2% / 12% / 35% / 75% / 110% error at horizons 1 / 5 / 10 / 20 / 30. Task is 50 steps long. Reader picks H = 5 (12%, strictly under the rule of thumb “below 30% error”) or H = 10 (35%, slightly over, for readers with more risk tolerance). Part B explains the receding-horizon discipline that makes long tasks tractable with short-horizon planning. Part C explores what happens with flat 5% per-step error (1.05^50 ≈ 11.5, so the error compounds to ~1050% over 50 steps).
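
The numbers in both exercises can be checked with a short script. This is a sketch for verification only. The variable names are hypothetical, the refit uses the population standard deviation (the choice that reproduces σ_1 = 0.1), and the per-step error in exercise 2 is assumed to compound multiplicatively.

```python
import statistics

# Exercise 1: one CEM iteration on the 1D problem.
def step(s, a):
    return 0.8 * s + a


def reward(s):
    return -(s - 2.0) ** 2


s0 = 0.0
samples = [-1.5, 0.5, 1.0, 1.8, 2.0]

# Score: H = 1, so each sample is scored by the reward at the resulting state.
scores = [reward(step(s0, a)) for a in samples]

# Select: keep the two highest-scoring samples as elites.
ranked = sorted(zip(scores, samples), reverse=True)
elites = [a for _, a in ranked[:2]]

# Refit: mean and population standard deviation of the elites.
mu_1 = statistics.fmean(elites)
sigma_1 = statistics.pstdev(elites)  # the sample std (divide by K - 1) would be about 0.141

# Exercise 2: horizon choice from the validation-error table (percent error by horizon).
validation_error_pct = {1: 2, 5: 12, 10: 35, 20: 75, 30: 110}
threshold_pct = 30  # the "below 30% error" rule of thumb

# Strict reading of the rule: the longest horizon whose error is under the threshold.
strict_pick = max(h for h, err in validation_error_pct.items() if err < threshold_pct)

# Part C: flat 5% per-step error, assumed to compound multiplicatively over 50 steps.
growth_factor = 1.05 ** 50
cumulative_error_pct = (growth_factor - 1.0) * 100.0

print(elites, round(mu_1, 3), round(sigma_1, 3), strict_pick, round(cumulative_error_pct))
```

Because s_0 = 0, the next state equals the action, so the closed-form optimum a* = 2 falls out directly and the elite mean of 1.9 already sits close to it.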

5 flashcards: random shooting vs CEM; one-iteration CEM trace; MPC + compounding error; MuZero’s key innovation (hidden-state-space); family decision rubric.

One-page reference. The planning problem statement. Planner comparison table (random shooting / CEM / CMA-ES / MPPI / MCTS). CEM iteration skeleton + worked example. MPC pseudocode + horizon trade-off table. MuZero three-networks table + key innovation. Family decision rubric. Common pitfalls.
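
The three-network structure can be made concrete with an interface-level sketch. It shows only how representation (h), dynamics (g), and prediction (f), using the MuZero paper's letters, chain together when unrolled in hidden-state space. The weights are random and untrained, and the layer sizes, helper names, and single-layer maps are hypothetical stand-ins for real networks. MCTS and the training losses are not implemented here.

```python
import math
import random

OBS_DIM, HIDDEN, N_ACTIONS = 5, 4, 3  # hypothetical sizes
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Random, untrained weights standing in for the three learned networks.
W_h = rand_matrix(HIDDEN, OBS_DIM)
W_g = rand_matrix(HIDDEN, HIDDEN + N_ACTIONS)
W_r = rand_matrix(1, HIDDEN + N_ACTIONS)
W_p = rand_matrix(N_ACTIONS, HIDDEN)
W_v = rand_matrix(1, HIDDEN)


def representation(obs):
    """h: observation -> hidden state."""
    return [math.tanh(v) for v in matvec(W_h, obs)]


def dynamics(state, action):
    """g: (hidden state, action) -> (predicted reward, next hidden state)."""
    onehot = [1.0 if i == action else 0.0 for i in range(N_ACTIONS)]
    x = state + onehot
    return matvec(W_r, x)[0], [math.tanh(v) for v in matvec(W_g, x)]


def prediction(state):
    """f: hidden state -> (policy over actions, value estimate)."""
    return softmax(matvec(W_p, state)), matvec(W_v, state)[0]


def unroll(obs, actions):
    """Encode the observation once, then step purely in hidden-state space."""
    state = representation(obs)
    policy, value = prediction(state)
    outputs = [{"reward": None, "policy": policy, "value": value}]
    for a in actions:
        reward, state = dynamics(state, a)
        policy, value = prediction(state)
        outputs.append({"reward": reward, "policy": policy, "value": value})
    return outputs


outputs = unroll(obs=[0.1, -0.2, 0.3, 0.0, 0.5], actions=[0, 2, 1])
```

Nothing in `unroll` decodes a hidden state back into an observation. The only quantities available to supervise are reward, policy, and value at each step, which is the "trained for planning quality, not reconstruction" point in code form.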

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph closing the Phase 2 dispatch-table tour. Worked-check memory anchor with the same CEM iteration numbers. Where this fits in the track arc (L11/L12 control-as-inference, L13 RLHF).

Primary: MuZero (Schrittwieser 2020), AlphaZero (Silver 2018), AlphaGo (Silver 2016). MuZero variants: EfficientZero, Sampled MuZero, MuZero Unplugged. CEM tutorial: de Boer et al. (2005). CMA-ES tutorial: Hansen (2016). MPPI: Williams et al. (2017, 2018). Classical MPC: Garcia/Prett/Morari (1989), Mayne et al. (2000). Modern model-based RL: PETS (Chua 2018), DreamerV3 (Hafner 2023). Course source: CS285 L16. Sutton & Barto chapter 8; MCTS survey (Browne et al. 2012).

  • Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: RL, MDP, MPC, CEM, CMA, ES, MPPI, MCTS, PUCT, UCT, LQR, iLQR, DQN, PPO, SAC, MBPO, PETS, DDPG, MuZero, AlphaGo, AlphaZero, AlphaStar, EfficientZero, NeurIPS, ICML, ICLR, AAAI, ICRA, JAIR, MIT, OpenAI, MuJoCo, TPU, GPU.
  • No vendor naming triggers (paper authors, course instructors, algorithm names only). No security claims.
  • §6 status: standard pipeline, no triggers. Forward references (L11/L12 control-as-inference, L13 RLHF) properly deferred.
  • Lesson: 2735 words
  • Cheatsheet: 615 words
  • Practice: 1665 words
  • Summary: 660 words
  • Brief: 855 words
  • References: 595 words

Total ≈ 7125 words across 6 artifacts. Math-heavy band; in line with L5-L9 calibration.

  • Component placeholders (J0, J1) live as MDX comments; Lead wires at promotion.
  • Practice imports real J0 + J1 components.
  • Numerics: the CEM iteration arithmetic is hand-checkable. The horizon-trade-off table values are illustrative (not from a specific paper); chosen to make the “rule of thumb 30%” cleanly point at H = 5 to 10.
  • Continues phase-boundary cadence; Phase 2 boundary check after L12.
  • Closes the L3 dispatch-table tour: at this point readers have seen instantiations of every entry (π/V/Q/A/P). The pedagogical capstone for Phase 2’s algorithm-zoo coverage. L11/L12 then start a different angle (control as inference); the dispatch-table-as-organizing-principle wraps up here.