Brief: Planning with a learned model
Capability gained
Trace one full iteration of the cross-entropy method (CEM) by hand on a small problem: sample, score, refit. Pick an MPC planning horizon given a model-bias profile. Sketch MuZero’s end-to-end learned-model MCTS loop and explain why training the model for planning quality (not observation reconstruction) is the load-bearing design choice.
Why this lesson exists
L9 ended with two ways to use a learned model: plan with it (this lesson) or imagine with it (Dyna, covered in L9). L10 covers the plan branch: black-box optimizers over action sequences (random shooting, CEM, MPPI, CMA-ES); the MPC receding-horizon wrapper; and MuZero as the contemporary state of the art for end-to-end learned-model planning.
This lesson closes the P-branch of the L3 dispatch table and completes the Phase 2 tour of all five algorithmic families (π/V/Q/A/P all covered or motivated). It is the pedagogical capstone for the Phase 2 algorithm-zoo coverage; L11/L12 then shift to the control-as-inference angle, and L13 onward covers production applications.
Source
Berkeley CS285 lecture on Model-Based Reinforcement Learning with Function Approximation, Sergey Levine, 2023. Primary papers: MuZero (Schrittwieser et al., 2020), AlphaZero (Silver et al., 2018), MPPI (Williams et al., 2017), classical MPC (Garcia, Prett, Morari, 1989), CEM tutorial (de Boer et al., 2005).
Phase advance
Phase 2 lesson 5 (phase_order: 5). Closes the P-branch and completes the dispatch-table tour started in L3. Sets up L11 (variational inference for RL) and L12 (control as inference) as a complementary reformulation of the same control problem. Phase 2 → Phase 3 boundary checkpoint after L12.
Lesson body (lesson.mdx)
- Recap of L9’s plan-vs-imagine split; this lesson is the plan branch.
- The planning problem made concrete: argmax over action sequences with truncated horizon + terminal value.
- Random shooting (the baseline): sample-and-score, embarrassingly parallel, wasteful.
- Cross-entropy method (CEM): iterative random shooting with Gaussian refitting. Full pseudocode + one fully-worked iteration (1D target problem, 4 samples, top-2 elites). Two main knobs (N and K/N elite fraction) discussed; variants (CMA-ES, MPPI) named as forward references.
- Model Predictive Control (MPC): receding-horizon wrapper. The “model only needs to be accurate over H steps” insight. Horizon trade-off table.
- MuZero: end-to-end learned model + MCTS. Three networks (representation, dynamics, prediction). The key innovation: hidden-state-space dynamics, trained for planning quality (policy + value + reward losses), not raw-observation reconstruction. Results: Go/chess/shogi/Atari without the rules.
- MuZero variants: EfficientZero, Sampled MuZero, MuZero Unplugged.
- Common pitfalls: planning past model horizon, random shooting at high dim, missing terminal value, fixed-horizon MPC, MuZero compute requirements, conflating MCTS with random rollouts.
- “Why this matters” anchors AlphaGo / AlphaZero / MuZero / diffusion policies / LLM-as-planner. Family decision rubric.
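The CEM and MPC bullets above can be sketched together in a few lines. This is a minimal illustration, not the lesson's own pseudocode: the toy dynamics and reward are borrowed from the practice exercise, while the function names (`model`, `reward`, `score`, `cem_plan`, `mpc_episode`) and all hyperparameter values are assumptions made for the example. The sketch omits the terminal value term and uses the model itself as the environment, so it shows the control flow only and has no model bias.

```python
import random
import statistics

A_LOW, A_HIGH = -2.0, 2.0  # action bounds borrowed from the practice exercise


def model(s, a):
    """Stand-in for a learned dynamics model (assumed toy: s' = 0.8s + a)."""
    return 0.8 * s + a


def reward(s_next):
    """Toy reward, assumed to be evaluated on the state reached after acting."""
    return -(s_next - 2.0) ** 2


def score(s0, actions):
    """Return of one action sequence under the model (no terminal value here)."""
    s, total = s0, 0.0
    for a in actions:
        s = model(s, a)
        total += reward(s)
    return total


def cem_plan(s0, horizon, rng, n_samples=64, n_elites=8, n_iters=5):
    """Sample, score, refit; returns the mean action sequence after n_iters."""
    mu = [0.0] * horizon
    sigma = [1.0] * horizon
    for _ in range(n_iters):
        # Sample N action sequences from the current Gaussian, clipped to bounds.
        samples = [
            [min(A_HIGH, max(A_LOW, rng.gauss(m, sd))) for m, sd in zip(mu, sigma)]
            for _ in range(n_samples)
        ]
        # Keep the K highest-scoring sequences as elites.
        elites = sorted(samples, key=lambda seq: score(s0, seq))[-n_elites:]
        # Refit an independent Gaussian per time step to the elites.
        mu = [statistics.fmean(col) for col in zip(*elites)]
        sigma = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mu


def mpc_episode(s0, steps, horizon, seed=0):
    """Receding horizon: replan every step, execute only the first action."""
    rng = random.Random(seed)
    s, states = s0, [s0]
    for _ in range(steps):
        plan = cem_plan(s, horizon, rng)
        s = model(s, plan[0])  # the "real" system equals the model in this toy
        states.append(s)
    return states


states = mpc_episode(s0=0.0, steps=10, horizon=3)
print([round(s, 2) for s in states])
```

Replanning from the current state at every step is what lets a short horizon cover a long task: only the first action of each plan is ever executed, so the model only has to be trusted over `horizon` steps at a time.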
Practice (practice.mdx)
Two exercises:
- Trace one CEM iteration by hand. 1D state, a ∈ [-2, 2], dynamics s' = 0.8s + a, reward -(s-2)², s_0 = 0, H = 1. Five samples (-1.5, 0.5, 1.0, 1.8, 2.0). Reader scores each, picks top-2 elites, refits Gaussian. Answer: elites {1.8, 2.0}, μ_1 = 1.9, σ_1 = 0.1. Part D speculates on iteration 2 dynamics. Part E dual-path verifies: closed-form optimum a* = 2 ↔ CEM convergence to μ_J ≈ 2.
- Pick an MPC horizon from a validation-error table. Table shows 2% / 12% / 35% / 75% / 110% error at horizons 1 / 5 / 10 / 20 / 30. Task is 50 steps long. Reader picks H = 5 or H = 10 (depending on risk tolerance, with rule of thumb “below 30% error”). Part B explains the receding-horizon discipline that makes long tasks tractable with short-horizon planning. Part C explores what happens with flat 5% per-step error (compounds to ~1050% over 50 steps).
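The numbers quoted in both exercises can be checked with a short script. This sketch rests on two assumptions that the brief does not spell out: the reward is evaluated on the state reached after the action (otherwise every sample would score the same from s_0 = 0), and the "flat 5% per-step error" multiplies by 1.05 at each step. The variable names are illustrative.

```python
import statistics

s0 = 0.0
samples = [-1.5, 0.5, 1.0, 1.8, 2.0]


def step(s, a):
    """Exercise dynamics: s' = 0.8s + a."""
    return 0.8 * s + a


def reward(s_next):
    """Exercise reward, assumed to apply to the next state."""
    return -(s_next - 2.0) ** 2


# Score: with H = 1 each sample is a single action.
scores = {a: reward(step(s0, a)) for a in samples}

# Select: top-2 elites by score.
elites = sorted(samples, key=scores.get)[-2:]

# Refit: population statistics (divide by K, not K - 1).
mu_1 = statistics.fmean(elites)
sigma_1 = statistics.pstdev(elites)

# Exercise 2, Part C: 5% error compounding multiplicatively over 50 steps.
growth = 1.05 ** 50 - 1.0

print(scores, elites, mu_1, sigma_1, growth)
```

The stated σ_1 = 0.1 corresponds to the population standard deviation of the two elites; the sample standard deviation (dividing by K - 1) would give roughly 0.141 instead, so the answer key depends on which convention the lesson adopts.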
5 flashcards: random shooting vs CEM; one-iteration CEM trace; MPC + compounding error; MuZero’s key innovation (hidden-state-space); family decision rubric.
Cheatsheet (cheatsheet.mdx)
One-page reference. The planning problem statement. Planner comparison table (random shooting / CEM / CMA-ES / MPPI / MCTS). CEM iteration skeleton + worked example. MPC pseudocode + horizon trade-off table. MuZero three-networks table + key innovation. Family decision rubric. Common pitfalls.
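The three-network split named for the cheatsheet can be sketched as three functions and an unroll loop. This is a toy under loud assumptions: the weights are random and untrained, single linear layers stand in for deep networks, the dimensions are arbitrary, and the MCTS search is omitted entirely. All names (`representation`, `dynamics`, `prediction`, `unroll`) are illustrative. What it is meant to show is the data flow: after the first encoding, planning runs in hidden-state space, and nothing decodes back to observations.

```python
import math
import random

OBS_DIM, HIDDEN, N_ACTIONS = 2, 4, 3  # arbitrary toy sizes
rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random, untrained weights; a real system would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


W_h = rand_matrix(HIDDEN, OBS_DIM)
W_g = rand_matrix(HIDDEN + 1, HIDDEN + N_ACTIONS)
W_f = rand_matrix(N_ACTIONS + 1, HIDDEN)


def representation(obs):
    """h: observation -> initial hidden state."""
    return [math.tanh(x) for x in matvec(W_h, obs)]


def dynamics(state, action):
    """g: (hidden state, action) -> (predicted reward, next hidden state)."""
    one_hot = [1.0 if i == action else 0.0 for i in range(N_ACTIONS)]
    out = matvec(W_g, state + one_hot)
    return out[0], [math.tanh(x) for x in out[1:]]


def prediction(state):
    """f: hidden state -> (policy over actions, value estimate)."""
    out = matvec(W_f, state)
    logits, value = out[:N_ACTIONS], out[N_ACTIONS]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps], value


def unroll(obs, actions):
    """Encode once, then step in hidden-state space only."""
    state = representation(obs)
    outputs = []
    for a in actions:
        policy, value = prediction(state)
        r, state = dynamics(state, a)
        outputs.append((policy, value, r))
    return outputs


outputs = unroll(obs=[0.3, -0.1], actions=[0, 2, 1])
```

In training, the policy, value, and reward outputs collected along such an unroll are the quantities the losses act on; there is no observation-reconstruction term, which is the design choice the brief calls load-bearing.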
Summary (summary.mdx)
5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph closing the Phase 2 dispatch-table tour. Worked-check memory anchor with the same CEM iteration numbers. Where this fits in the track arc (L11/L12 control-as-inference, L13 RLHF).
References (references.mdx)
Primary: MuZero (Schrittwieser 2020), AlphaZero (Silver 2018), AlphaGo (Silver 2016). MuZero variants: EfficientZero, Sampled MuZero, MuZero Unplugged. CEM tutorial: de Boer et al. (2005). CMA-ES tutorial: Hansen (2016). MPPI: Williams et al. (2017, 2018). Classical MPC: Garcia/Prett/Morari (1989), Mayne et al. (2000). Modern model-based RL: PETS (Chua 2018), DreamerV3 (Hafner 2023). Course source: CS285 L16. Sutton & Barto chapter 8; MCTS survey (Browne et al. 2012).
Editorial discipline
- Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: RL, MDP, MPC, CEM, CMA, ES, MPPI, MCTS, PUCT, UCT, LQR, iLQR, DQN, PPO, SAC, MBPO, PETS, DDPG, MuZero, AlphaGo, AlphaZero, AlphaStar, EfficientZero, NeurIPS, ICML, ICLR, AAAI, ICRA, JAIR, MIT, OpenAI, MuJoCo, TPU, GPU.
- No vendor naming triggers (paper authors, course instructors, algorithm names only). No security claims.
- §6 status: standard pipeline, no triggers. Forward references (L11/L12 control-as-inference, L13 RLHF) properly deferred.
Word counts
- Lesson 2735
- Cheatsheet 615
- Practice 1665
- Summary 660
- Brief 855
- References 595
Total ≈ 7125 words across 6 artifacts. Math-heavy band; in line with L5-L9 calibration.
Notes for promotion
- Component placeholders (�J0�, �J1�) live as MDX comments; Lead wires at promotion.
- Practice imports real �J0� + �J1� components.
- Numerics: the CEM iteration arithmetic is hand-checkable. The horizon-trade-off table values are illustrative (not from a specific paper); chosen to make the “rule of thumb 30%” cleanly point at H = 5 to 10.
- Continues phase-boundary cadence; Phase 2 boundary check after L12.
- Closes the L3 dispatch-table tour: at this point readers have seen instantiations of every entry (π/V/Q/A/P). This is the pedagogical capstone for Phase 2’s algorithm-zoo coverage. L11/L12 then start a different angle (control as inference); the dispatch-table-as-organizing-principle wraps up here.