Planning with a learned model: brief

Capability gained

Trace one full iteration of the cross-entropy method (CEM) by hand on a small problem: sample, score, refit. Pick an MPC planning horizon given a model-bias profile. Sketch MuZero’s end-to-end learned-model MCTS loop and explain why training the model for planning quality (not observation reconstruction) is the load-bearing design choice.

Why this lesson exists

L9 ended with two ways to use a learned model: plan with it (this lesson) or imagine with it (Dyna, covered in L9). L10 covers the plan branch: black-box optimizers over action sequences (random shooting, CEM, MPPI, CMA-ES); the MPC receding-horizon wrapper; and MuZero as the contemporary state of the art for end-to-end learned-model planning.

This lesson closes the P-branch of the L3 dispatch table and completes the Phase 2 tour of all five algorithmic families (π/V/Q/A/P all covered or motivated). It is the pedagogical capstone for the Phase 2 algorithm-zoo coverage; L11/L12 then shift to the control-as-inference angle, and L13 onward covers production applications.

Source

Berkeley CS285 lecture (Model-Based Reinforcement Learning with Function Approximation), Sergey Levine, 2023. Primary papers: MuZero (Schrittwieser et al., 2020), AlphaZero (Silver et al., 2018), MPPI (Williams et al., 2017), classical MPC (Garcia, Prett, Morari, 1989), CEM tutorial (de Boer et al., 2005).

Phase advance

Phase 2 lesson 5 (phase_order: 5). Closes the P-branch and completes the dispatch-table tour started in L3. Sets up L11 (variational inference for RL) and L12 (control as inference) as a complementary reformulation of the same control problem. Phase 2 → Phase 3 boundary checkpoint after L12.

Lesson body (lesson.mdx)

Recap of L9’s plan-vs-imagine split; this lesson is the plan branch.
The planning problem made concrete: argmax over action sequences with truncated horizon + terminal value.
Random shooting (the baseline): sample-and-score, embarrassingly parallel, wasteful.
Cross-entropy method (CEM): iterative random shooting with Gaussian refitting. Full pseudocode + one fully-worked iteration (1D target problem, 4 samples, top-2 elites). Two main knobs (N and K/N elite fraction) discussed; variants (CMA-ES, MPPI) named as forward references.
Model Predictive Control (MPC): receding-horizon wrapper. The “model only needs to be accurate over H steps” insight. Horizon trade-off table.
MuZero: end-to-end learned model + MCTS. Three networks (representation, dynamics, prediction). The key innovation: hidden-state-space dynamics, trained for planning quality (policy + value + reward losses), not raw-observation reconstruction. Results: Go/chess/shogi/Atari without the rules.
MuZero variants: EfficientZero, Sampled MuZero, MuZero Unplugged.
Common pitfalls: planning past model horizon, random shooting at high dim, missing terminal value, fixed-horizon MPC, MuZero compute requirements, conflating MCTS with random rollouts.
“Why this matters” anchors AlphaGo / AlphaZero / MuZero / diffusion policies / LLM-as-planner. Family decision rubric.
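The CEM and MPC items in the outline above can be sketched together as follows. This is a minimal illustration, not the lesson's own pseudocode. It assumes NumPy, reuses the toy dynamics and reward from the practice exercise as a stand-in for a learned model, and scores reward at the next state. The function names (`rollout_return`, `cem_plan`, `mpc_episode`) and the default sample counts are hypothetical. The terminal value defaults to zero here; a learned value estimate would be passed in its place.

```python
import numpy as np

def dynamics(s, a):
    # Stand-in for a learned model: the toy dynamics from the practice exercise.
    return 0.8 * s + a

def reward(s):
    # Assumption: reward is scored on the state reached after acting.
    return -(s - 2.0) ** 2

def rollout_return(s0, actions, terminal_value=np.zeros_like):
    """Score a batch of action sequences of shape (N, H) under the model."""
    s = np.full(actions.shape[0], s0, dtype=float)
    total = np.zeros(actions.shape[0])
    for t in range(actions.shape[1]):
        s = dynamics(s, actions[:, t])
        total += reward(s)
    # Truncated horizon plus terminal value (zero by default in this toy).
    return total + terminal_value(s)

def cem_plan(s0, horizon, n_samples=128, n_elites=16, n_iters=8, rng=None):
    """Sample, score, refit: returns the final Gaussian mean, shape (H,)."""
    rng = np.random.default_rng() if rng is None else rng
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        actions = rng.normal(mu, sigma, size=(n_samples, horizon))
        actions = np.clip(actions, -2.0, 2.0)        # action bounds a in [-2, 2]
        returns = rollout_return(s0, actions)
        elites = actions[np.argsort(returns)[-n_elites:]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6            # keep the scale strictly positive
    return mu

def mpc_episode(s0, n_steps, horizon, seed=0):
    """Receding horizon: replan every step, execute only the first action."""
    rng = np.random.default_rng(seed)
    s, states = s0, [s0]
    for _ in range(n_steps):
        plan = cem_plan(s, horizon, rng=rng)
        s = dynamics(s, plan[0])  # in this toy the "real" system equals the model
        states.append(s)
    return states
```

Calling `mpc_episode(0.0, n_steps=10, horizon=3)` should drive the state toward the target s = 2. Only `plan[0]` is ever executed, so the model only has to be trustworthy over the H steps it is asked to predict.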

Practice (practice.mdx)

Two exercises:

Trace one CEM iteration by hand. 1D state, a ∈ [-2, 2], dynamics s' = 0.8s + a, reward -(s-2)² evaluated at the next state s', s_0 = 0, H = 1. Five samples (-1.5, 0.5, 1.0, 1.8, 2.0). Reader scores each, picks top-2 elites, refits Gaussian. Answer: elites {1.8, 2.0}, μ_1 = 1.9, σ_1 = 0.1 (population standard deviation of the elites). Part D speculates on iteration 2 dynamics. Part E dual-path verifies: closed-form optimum a* = 2 ↔ CEM convergence to μ_J ≈ 2.
Pick an MPC horizon from a validation-error table. Table shows 2% / 12% / 35% / 75% / 110% error at horizons 1 / 5 / 10 / 20 / 30. Task is 50 steps long. Reader picks H = 5 or H = 10 depending on risk tolerance: under the rule of thumb “below 30% error”, H = 5 (12%) passes strictly and H = 10 (35%) sits just past the threshold. Part B explains the receding-horizon discipline that makes long tasks tractable with short-horizon planning. Part C explores what happens with flat 5% per-step error (1.05^50 ≈ 11.5, i.e. roughly 1050% accumulated error over 50 steps).
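The numbers in both exercises can be re-derived with a short script. This is a sketch for checking the arithmetic, not part of the practice file, and it rests on three assumptions. The reward is scored at the next state s'. The Gaussian refit uses the population standard deviation. "Compounds" means (1.05)^50 - 1. All variable names are illustrative.

```python
import statistics

# --- Exercise 1: one CEM iteration on the 1D problem ---
def step(s, a):
    return 0.8 * s + a

def score(a, s0=0.0):
    # Assumption: reward -(s - 2)^2 is evaluated at the next state.
    return -(step(s0, a) - 2.0) ** 2

samples = [-1.5, 0.5, 1.0, 1.8, 2.0]
scores = {a: score(a) for a in samples}

# Top-2 samples by score become the elites.
elites = sorted(samples, key=score, reverse=True)[:2]

# Refit the Gaussian. Assumption: population std (divide by K), which is
# what reproduces sigma_1 = 0.1; the sample std would give about 0.141.
mu_1 = statistics.fmean(elites)
sigma_1 = statistics.pstdev(elites)

# --- Exercise 2: horizon choice and compounding error ---
validation_error_pct = {1: 2, 5: 12, 10: 35, 20: 75, 30: 110}

# Strict reading of the rule of thumb: longest horizon with error below 30%.
strict_h = max(h for h, err in validation_error_pct.items() if err < 30)

# Flat 5% per-step error compounded over the 50-step task.
growth_factor = 1.05 ** 50
compounded_pct = (growth_factor - 1.0) * 100.0

print("scores:", scores)
print("elites:", sorted(elites), "mu_1:", mu_1, "sigma_1:", sigma_1)
print("strict horizon:", strict_h, "compounded error %:", round(compounded_pct))
```

The strict reading selects H = 5. Choosing H = 10 means accepting the 35% entry, which is the risk-tolerance judgment the exercise asks the reader to make.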

5 flashcards: random shooting vs CEM; one-iteration CEM trace; MPC + compounding error; MuZero’s key innovation (hidden-state-space); family decision rubric.

Cheatsheet (cheatsheet.mdx)

One-page reference. The planning problem statement. Planner comparison table (random shooting / CEM / CMA-ES / MPPI / MCTS). CEM iteration skeleton + worked example. MPC pseudocode + horizon trade-off table. MuZero three-networks table + key innovation. Family decision rubric. Common pitfalls.
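For orientation, one common way to write the planning problem statement is shown below. The notation is an assumption rather than a copy of the cheatsheet: \hat f, \hat r and \hat V stand for the learned dynamics, reward and terminal value, and the discount γ can be set to 1 if the lesson omits it.

```latex
a^{*}_{t:t+H-1}
  = \arg\max_{a_t,\dots,a_{t+H-1}}
    \sum_{k=0}^{H-1} \gamma^{k}\, \hat r(\hat s_{t+k}, a_{t+k})
    + \gamma^{H}\, \hat V(\hat s_{t+H}),
\qquad
\hat s_{t} = s_t,\quad
\hat s_{t+k+1} = \hat f(\hat s_{t+k}, a_{t+k}).
```

Under MPC only the first action a^{*}_t is executed before the problem is solved again from the next observed state.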

Summary (summary.mdx)

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph closing the Phase 2 dispatch-table tour. Worked-check memory anchor with the same CEM iteration numbers. Where this fits in the track arc (L11/L12 control-as-inference, L13 RLHF).

References (references.mdx)

Primary: MuZero (Schrittwieser 2020), AlphaZero (Silver 2018), AlphaGo (Silver 2016). MuZero variants: EfficientZero, Sampled MuZero, MuZero Unplugged. CEM tutorial: de Boer et al. (2005). CMA-ES tutorial: Hansen (2016). MPPI: Williams et al. (2017, 2018). Classical MPC: Garcia/Prett/Morari (1989), Mayne et al. (2000). Modern model-based RL: PETS (Chua 2018), DreamerV3 (Hafner 2023). Course source: CS285 L16. Sutton & Barto chapter 8; MCTS survey (Browne et al. 2012).

Editorial discipline

Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: RL, MDP, MPC, CEM, CMA, ES, MPPI, MCTS, PUCT, UCT, LQR, iLQR, DQN, PPO, SAC, MBPO, PETS, DDPG, MuZero, AlphaGo, AlphaZero, AlphaStar, EfficientZero, NeurIPS, ICML, ICLR, AAAI, ICRA, JAIR, MIT, OpenAI, MuJoCo, TPU, GPU.
No vendor naming triggers (paper authors, course instructors, algorithm names only). No security claims.
§6 status: standard pipeline, no triggers. Forward references (L11/L12 control-as-inference, L13 RLHF) properly deferred.

Word counts

Lesson 2735
Cheatsheet 615
Practice 1665
Summary 660
Brief 855
References 595

Total ≈ 7125 words across 6 artifacts. Math-heavy band; in line with L5-L9 calibration.

Notes for promotion

Component placeholders (J0, J1) live as MDX comments; Lead wires at promotion.
Practice imports real J0 + J1 components.
Numerics: the CEM iteration arithmetic is hand-checkable. The horizon-trade-off table values are illustrative (not from a specific paper); chosen to make the “rule of thumb 30%” cleanly point at H = 5 to 10.
Continues phase-boundary cadence; Phase 2 boundary check after L12.
Closes the L3 dispatch-table tour: at this point readers have seen instantiations of every entry (π/V/Q/A/P). This is the pedagogical capstone for Phase 2’s algorithm-zoo coverage. L11/L12 then start a different angle (control as inference); the dispatch-table-as-organizing-principle wraps up here.