Brief: Model-based RL, learning the dynamics
Capability gained
Fit a linear-Gaussian dynamics model P(s' | s, a) = N(As + Ba + c, Σ) from collected transitions by closed-form least squares. Quantify the sample-efficiency gain of model-based over model-free on continuous-control benchmarks (10× to 100× per PETS / MBPO). Trace how a 5% one-step model error compounds into a 59% error over 10 steps under expansive dynamics. Pick the right model class and the right algorithmic family for a given problem.
Why this lesson exists
The L3 dispatch table named five algorithmic families: π, V, Q, A, P. The lessons so far have covered π (L4 REINFORCE, L5 actor-critic, L8 PPO) and Q (L6 value-based RL, L7 DQN); V and A appeared as critic components inside actor-critic and PPO. L9 opens the P-branch: learn the dynamics directly. This is the family with the highest theoretical ceiling (perfect model = optimal planning) and the most structural fragility (model bias compounds geometrically over rollouts).
L9 is the first half of the P-branch coverage. L9 covers learning the dynamics; L10 covers using the learned model (MPC, cross-entropy method for action-sequence optimization, MuZero-style learned-model MCTS). The pairing mirrors the L6/L7 (Bellman foundations + DQN engineering) and L8 (PPO as alternative resolution) pattern: foundations first, then a tour of the practical algorithms that solve the engineering problems.
Source
Berkeley CS285 lecture on Model-Based Reinforcement Learning, Sergey Levine, 2023. Primary papers: PETS (Chua et al., 2018), MBPO (Janner et al., 2019), Dyna (Sutton, 1991), DreamerV3 (Hafner et al., 2023), MuZero (Schrittwieser et al., 2020). Sutton & Barto chapter 8 for the canonical treatment.
Phase advance
Phase 2 lesson 4 (phase_order: 4). Completes the dispatch-table tour (π/V/Q/A/P all covered or motivated) and opens the P-branch with the foundational learning step. L10 will complete the P-branch with planning. The dispatch-table-as-organizing-principle reaches its natural conclusion here.
Lesson body (lesson.mdx)
- Recap of the dispatch table; lessons so far covered π, Q, V, A; this lesson opens P.
- Why model-based RL: sample efficiency. The 10× to 100× headline number from PETS, explained as a transfer of sample cost from environment to model.
- Two ways to use a model: plan (MPC) or imagine (Dyna).
- Model classes: linear-Gaussian (closed-form), deterministic NN (MSE), probabilistic NN (NLL), ensemble of probabilistic NNs (PETS).
- Worked least-squares fit for linear-Gaussian dynamics: 5 zero-noise samples with `A_true = 0.5, B_true = 1.0`. Compute `X^T X = [[6.25, -2.5], [-2.5, 3.25]]` (det 14.0625) and `X^T Y = [0.625, 2.0]`. Solve to recover `[Â, B̂] = [0.5, 1.0]` exactly.
- Compounding error: small 1-step bias → exponential N-step bias. Worked example: `A_true = 1.1, Â = 1.05`. Relative error 21% at 5 steps; 37% at 10 (closed form `1 - (1.05/1.1)^t`). Mitigation overview: short rollout horizons, ensemble uncertainty, frequent re-planning.
- Dyna architecture pseudocode with the `K` imagined-step parameter.
- Decision rubric (model-based when samples expensive and dynamics learnable; model-free when samples cheap and asymptotic performance matters).
- Common pitfalls (validation error, data distribution coverage, rollout horizon, aleatoric vs epistemic, deterministic vs probabilistic).
- “Why this matters when you use AI” anchors World Models (Ha & Schmidhuber, 2018), DreamerV3 (Hafner, 2023), MuZero (Schrittwieser, 2020), and notes that LLM-based agents almost never use model-based RL (the dynamics are too hard to model).
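The worked least-squares numbers in the list above can be checked independently with a short script. This is a minimal sketch, not the lesson's own code: the brief does not list the five samples, so the `(s, a)` pairs below are a hypothetical dataset chosen only because it reproduces the stated `X^T X` and `X^T Y`. It also assumes the intercept `c` is zero, as the two-parameter worked example implies, and solves the 2×2 normal equations by Cramer's rule.

```python
# Hypothetical 5-sample dataset (an assumption, not taken from the lesson):
# chosen so that X^T X = [[6.25, -2.5], [-2.5, 3.25]].
s = [1.0, 2.0, 0.5, -1.0, 0.0]
a = [-1.0, -1.0, 1.0, 0.0, 0.5]

A_TRUE, B_TRUE = 0.5, 1.0
# Zero-noise targets s' = A s + B a (intercept c assumed to be 0).
s_next = [A_TRUE * si + B_TRUE * ai for si, ai in zip(s, a)]

# Entries of X^T X and X^T Y, where each row of X is (s_i, a_i).
ss = sum(si * si for si in s)
sa = sum(si * ai for si, ai in zip(s, a))
aa = sum(ai * ai for ai in a)
sy = sum(si * yi for si, yi in zip(s, s_next))
ay = sum(ai * yi for ai, yi in zip(a, s_next))

# Solve the normal equations (X^T X) theta = X^T Y by Cramer's rule.
det = ss * aa - sa * sa
A_hat = (sy * aa - sa * ay) / det
B_hat = (ss * ay - sa * sy) / det

print("X^T X =", [[ss, sa], [sa, aa]], "det =", det)
print("X^T Y =", [sy, ay])
print("[A_hat, B_hat] =", [A_hat, B_hat])
```

Every value in this dataset is exactly representable in binary floating point, which is why the recovery involves no rounding; with noisy targets the same code would return estimates that only approximate the true parameters.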
Practice (practice.mdx)
Two exercises:
- Linear-Gaussian fit by hand, dual-path validation. 4-sample dataset where the true parameters are `A_true = 0.4, B_true = 0.8`. Part A: predict by inspection (s=1, a=0 → s'=0.4, etc.). Part B: compute `X^T X = [[3, -1], [-1, 6]]` (det 17) and `X^T Y = [0.4, 4.4]`. Part C: solve to get `[Â, B̂] = [0.4, 0.8]` exactly. Part D: add small noise to the targets, recompute, observe a ~3% deviation in `Â`, and confirm the fit is unbiased in expectation but noisy on a single dataset.
- Compounding error trace, 10 steps. True `A = 1.05`, fit `Â = 1.10`. Start `s_0 = 1`, action 0 every step. Reader computes both true and model rollouts and the resulting error growth. At `t = 1`: 5% error. At `t = 5`: 26%. At `t = 10`: 59%. Growth rate verified two ways: empirically (from the table) and analytically (`ŝ_t / s_t = (1.10/1.05)^t`). Part D discusses the three mitigations (cap horizon, MPC, ensemble) and explains that compounding error is a structural property, not a fit bug.
5 flashcards: why model-based wins on sample efficiency; the least-squares-recovers-true-params identity; how 5% one-step bias becomes 59% ten-step bias; aleatoric vs epistemic uncertainty; decision rubric for model-based vs model-free.
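The compounding-error trace from the second exercise can be reproduced in a few lines. The sketch below assumes the scalar, action-free rollout described there (`s_{t+1} = A s_t`, `s_0 = 1`, action 0 at every step); the function name `rollout` is illustrative and not taken from the practice file.

```python
def rollout(A, s0=1.0, steps=10):
    """Roll the scalar dynamics s_{t+1} = A * s_t forward with zero action."""
    states = [s0]
    for _ in range(steps):
        states.append(A * states[-1])
    return states

A_TRUE, A_FIT = 1.05, 1.10

true_traj = rollout(A_TRUE)
model_traj = rollout(A_FIT)

# Empirical relative error of the model rollout against the true rollout.
rel_err = [m / t - 1.0 for m, t in zip(model_traj, true_traj)]

for t in (1, 5, 10):
    # Analytic check: s_hat_t / s_t = (A_FIT / A_TRUE) ** t
    closed_form = (A_FIT / A_TRUE) ** t - 1.0
    print(f"t={t:2d}  empirical={100 * rel_err[t]:.1f}%  "
          f"closed form={100 * closed_form:.1f}%")
```

The empirical and closed-form columns agree up to floating-point rounding, and rounding them to whole percentages gives the 5%, 26%, and 59% figures quoted in the exercise.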
Cheatsheet (cheatsheet.mdx)
One-page reference. P-branch placement in the dispatch table; the why-model-based sample-efficiency table; model classes with fit method and use cases; the lesson’s worked least-squares numerics reproduced as a memory anchor; the compounding-error table reproduced as a memory anchor; mitigations table mapping each fix to its canonical algorithm; Dyna pseudocode; decision rubric; common pitfalls.
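For orientation, one possible rendering of the Dyna pseudocode is tabular Dyna-Q in the style of Sutton & Barto chapter 8. The sketch below is not the lesson's own pseudocode: the 5-state chain environment, the hyperparameter values, and the function name `dyna_q` are all assumptions made for illustration. Each real step does one direct Q-learning update, records the transition in a lookup-table model, then replays `K` imagined transitions sampled from that model.

```python
import random

def dyna_q(n_states=5, episodes=50, K=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical deterministic chain (illustrative only)."""
    rng = random.Random(seed)
    actions = (0, 1)                  # 0 = move left, 1 = move right
    goal = n_states - 1               # reaching the last state ends the episode
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                        # learned model: (s, a) -> (reward, next state)

    def env_step(s, a):
        s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
        return (1.0 if s2 == goal else 0.0), s2

    def q_update(s, a, r, s2):
        # One-step Q-learning backup; no bootstrap from the terminal state.
        bootstrap = 0.0 if s2 == goal else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action with random tie-breaking.
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s2 = env_step(s, a)    # real environment step
            q_update(s, a, r, s2)     # direct RL update
            model[(s, a)] = (r, s2)   # model learning (deterministic table)
            for _ in range(K):        # K imagined steps from the learned model
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q_update(ps, pa, pr, ps2)
            s = s2
    return Q

Q = dyna_q()
# Greedy action in each non-terminal state (1 means "move right").
greedy = [max((0, 1), key=lambda b: Q[(s, b)]) for s in range(4)]
print("greedy policy:", greedy)
```

Setting `K=0` reduces this to plain Q-learning, which makes the parameter a convenient knob for comparing real-sample cost with and without imagined updates. The lookup-table model here is exact for a deterministic chain, so the sketch does not exhibit the compounding model error that learned continuous dynamics would.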
Summary (summary.mdx)
5-minute distillation. One-paragraph framing of model-based RL as the P-branch with sample-efficiency win and compounding-error binding constraint. Five things to remember. Why-this-matters paragraph anchoring the contemporary stack (DreamerV3, MuZero, PETS, MBPO). Worked-check memory anchor with both the least-squares recovery and the compounding-error numerics. Where this fits (L10 is the planning half).
References (references.mdx)
Foundational: Sutton (1991) Dyna; Sutton & Barto chapter 8. Linear-Gaussian / iLQR: Todorov & Li (2005); Levine & Koltun (2013) Guided Policy Search. PETS: Chua et al. (2018). MBPO: Janner et al. (2019). World models / Dreamer: Ha & Schmidhuber (2018); Hafner DreamerV1/V2/V3. MuZero: Schrittwieser et al. (2020). Uncertainty: Lakshminarayanan et al. (2017) deep ensembles; Kendall & Gal (2017) aleatoric/epistemic split. Linear-algebra references: Boyd & Vandenberghe (free online). Course source: CS285 L15.
Editorial discipline
- Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead `/topics/` links. Acronyms allowed in caps: RL, MDP, MSE, NLL, MPC, MCTS, LQR, iLQR, LQG, MBPO, PETS, SAC, DDPG, PPO, DQN, SGD, GAE, KL, JAIR, NeurIPS, ICLR, ICML, AAAI, MuJoCo, OpenAI, MIT, MuZero, DreamerV3, AlphaZero.
- No vendor naming triggers; paper authors + course instructors + algorithm names only. No security claims.
- §6 status: standard pipeline, no triggers. Forward references (L10 planning, L13 RLHF) properly deferred.
Word counts
- Lesson 2680
- Cheatsheet 605
- Practice 1855
- Summary 692
- Brief 940
- References 558
Total ≈ 7330 words across 6 artifacts. Math-heavy band; in line with L5-L8 calibration.
Notes for promotion
- Component placeholders (`�J0�`, `�J1�`) live as MDX comments; Lead wires at promotion.
- Practice imports real `�J0�` + `�J1�` components.
- Numerics: the least-squares fit recovering `[0.5, 1.0]` is exact arithmetic (no rounding). The compounding-error table is computed to 4 decimals from `1.05^t` and `1.10^t`. Both should pass independent verification.
- Continues phase-boundary cadence; Phase 2 boundary check after L12.
- The “dispatch table reaches its natural conclusion” framing is the load-bearing pedagogical move: by L9, the reader has seen every entry in the dispatch table from L3 instantiated as an algorithm family. L10 completes the P-branch with planning. L11/L12 (variational inference, control as inference) shift to a different angle on the same RL problem.