Summary: Model-based RL, learning the dynamics

Model-based RL opens the P branch of the Lesson 3 dispatch table: instead of estimating a policy π, value V, action value Q, or advantage A directly from rollouts, learn a model P(s' | s, a) of the dynamics and then either plan with it (Model Predictive Control style) or use it to generate imagined rollouts for training a model-free policy (Dyna). The headline benefit is sample efficiency: 10× to 100× fewer real-world interactions for the same asymptotic performance on continuous-control benchmarks (Chua et al., PETS, 2018). This matters when real samples are expensive (robots, surgery, autonomous driving) and is irrelevant when they’re cheap (Atari, language models). The binding constraint is compounding error: under expansive dynamics, a 5% one-step bias becomes a 21% relative error after five steps and 37% after ten (the practice exercise's steeper-bias case reaches 59% at ten steps). Standard mitigations are short rollout horizons (MBPO uses 1 to 5 steps), ensemble uncertainty (PETS), and frequent re-planning (MPC). For language and other domains where the “dynamics” are intractable to model, model-free PPO wins. For robotics, the modern stack (Dreamer, MuZero) increasingly leans on learned models.

  1. The P-branch is the model-based family: learn P(s' | s, a), then plan or imagine. Dyna integrates real and imagined experience.
  2. Sample efficiency is the win: 10× to 100× fewer real-world steps for the same performance. The constraint that decides the trade-off is sample cost.
  3. Least squares for linear-Gaussian dynamics is closed-form: [Â, B̂] = (X^T X)^{-1} X^T Y. Zero-noise data recovers true parameters exactly (the lesson worked through this on five samples with A_true = 0.5, B_true = 1.0).
  4. Compounding error is the dominant failure mode: 5% one-step bias → 21% five-step → 37% ten-step relative error in the lesson’s example (A_true=1.1, Â=1.05); the practice’s steeper-bias case (A_true=1.05, Â=1.10) reaches 59% at ten steps. The fix is how you use the model (short rollouts, re-planning, uncertainty rejection), not whether the model fits.
  5. Pick model-based when: samples are expensive, dynamics are smooth and learnable, planning horizons are short. Pick model-free when: samples are cheap, dynamics are hard to model, asymptotic performance matters most.
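
The "real plus imagined experience" idea in point 1 can be sketched as a minimal tabular Dyna-Q-style loop. Everything concrete here is an assumption made for illustration: the five-state chain environment, the hyperparameters, and the use of a deterministic lookup table as the learned model in place of a fitted P(s' | s, a).

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical chain: states 0..4, reward on reaching 4
ACTIONS = (0, 1)           # 0 = left, 1 = right


def env_step(s, a):
    """True (deterministic) dynamics of the toy chain."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL


def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned model: (s, a) -> (reward, s_next, done)

    def update(s, a, r, s_next, done):
        # One-step Q-learning backup, used for real and imagined transitions alike.
        target = r if done else r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s_next, r, done = env_step(s, a)
            update(s, a, r, s_next, done)        # learn from the real step
            model[(s, a)] = (r, s_next, done)    # record it in the model
            for _ in range(planning_steps):      # learn from imagined steps
                ps, pa = rng.choice(list(model))
                pr, ps_next, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps_next, pdone)
            s = s_next
    return q


q = dyna_q()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

Each real transition is reused many times through the model, which is where the sample-efficiency gain comes from. Because this toy model is exact, the sketch shows none of the compounding-error problems from point 4.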

The L3 dispatch table named five things to estimate (π, V, Q, A, P); the L4-L8 lessons covered π and Q (and V, A as critic components). L9 opens the last branch. The chronology is intentional: model-based RL is the family with the highest theoretical ceiling (perfect knowledge of the dynamics lets you solve the planning problem optimally) but also the most fragile (model bias is a structural failure mode no amount of clever sampling fixes). That makes it the right family for narrow, sample-expensive domains and the wrong one for broad, dynamics-rich domains like language.

The contemporary stack: DreamerV3 (Hafner et al., 2023) and MuZero (Schrittwieser et al., 2020) are the canonical model-based deep RL algorithms; PETS (Chua et al., 2018) and MBPO (Janner et al., 2019) are the canonical robotics-focused versions. Each ships a different mitigation for compounding error: DreamerV3 uses recurrent latent-space models with multi-horizon training; MuZero learns the model end-to-end inside MCTS; PETS uses probabilistic ensembles with trajectory sampling; MBPO caps imagined rollouts at 1 to 5 steps.
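
Two of these mitigations, a capped rollout horizon and ensemble disagreement, can be illustrated on a toy scalar system. This is a sketch under stated assumptions, not the PETS or MBPO procedure: the ensemble is a hypothetical list of scalar estimates Â, the dynamics are s' = Â·s with no action, and the spread threshold is arbitrary.

```python
def trusted_horizon(ensemble_a, s0=1.0, max_horizon=5, max_spread=0.10):
    """Number of imagined steps before the ensemble members disagree too much.

    Illustrative only: rolls each member forward under s' = a * s and stops
    when (max - min) / |mean| exceeds max_spread or the horizon cap is hit.
    """
    states = [s0 for _ in ensemble_a]
    for step in range(1, max_horizon + 1):
        states = [a * s for a, s in zip(ensemble_a, states)]
        mean = sum(states) / len(states)
        spread = (max(states) - min(states)) / max(abs(mean), 1e-12)
        if spread > max_spread:
            return step - 1   # last step at which the members still agreed
    return max_horizon        # horizon cap reached first


loose = trusted_horizon([1.03, 1.05, 1.07])     # members disagree early
tight = trusted_horizon([1.049, 1.05, 1.051])   # members agree; cap applies
print(loose, tight)
```

A loosely agreeing ensemble cuts the rollout short on its own, while a tightly agreeing one runs to the horizon cap.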

Five samples (s, a, s') from s' = 0.5·s + 1.0·a with zero noise: (0, 1, 1.0), (1, 0, 0.5), (0.5, -1, -0.75), (-1, 1, 0.5), (2, -0.5, 0.5). With the rows of X given by (s, a) and Y the column of s' values, compute X^T X = [[6.25, -2.5], [-2.5, 3.25]], det = 14.0625, X^T Y = [0.625, 2.0]. Solve: [Â, B̂] = (1/14.0625) · [[3.25, 2.5], [2.5, 6.25]] · [0.625, 2.0] = (1/14.0625) · [7.03125, 14.0625] = [0.5, 1.0]. The fit recovers the true parameters exactly.
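
The same arithmetic can be checked in a few lines of plain Python. This is a minimal sketch that solves the 2×2 normal equations by hand, reading each sample as (s, a, s'); the function name `fit_linear_dynamics` is introduced here for illustration.

```python
# Samples (s, a, s_next) from s' = 0.5*s + 1.0*a with zero noise.
samples = [(0.0, 1.0, 1.0), (1.0, 0.0, 0.5), (0.5, -1.0, -0.75),
           (-1.0, 1.0, 0.5), (2.0, -0.5, 0.5)]


def fit_linear_dynamics(data):
    """Solve (X^T X) theta = X^T Y for theta = [A, B], with rows of X = (s, a)."""
    ss = sum(s * s for s, a, y in data)   # (X^T X)[0][0]
    sa = sum(s * a for s, a, y in data)   # (X^T X)[0][1] == (X^T X)[1][0]
    aa = sum(a * a for s, a, y in data)   # (X^T X)[1][1]
    sy = sum(s * y for s, a, y in data)   # (X^T Y)[0]
    ay = sum(a * y for s, a, y in data)   # (X^T Y)[1]
    det = ss * aa - sa * sa
    if det == 0:
        raise ValueError("X^T X is singular; the samples do not identify A and B")
    # Apply the explicit 2x2 inverse: (1/det) * [[aa, -sa], [-sa, ss]].
    a_hat = (aa * sy - sa * ay) / det
    b_hat = (ss * ay - sa * sy) / det
    return a_hat, b_hat, det


a_hat, b_hat, det = fit_linear_dynamics(samples)
print(a_hat, b_hat, det)   # A and B should come out at 0.5 and 1.0
```

For more than two parameters, a library least-squares routine would normally replace the hand-written inverse.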

For compounding error: true A = 1.05, model Â = 1.10, s_0 = 1, rollout for 10 steps. Ratio ŝ_t / s_t = (1.10/1.05)^t = 1.0476^t. At t = 10, ratio ≈ 1.592 → 59% relative error. Small 1-step bias, large 10-step bias.
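
The rollout can also be simulated step by step instead of using the closed-form ratio. The sketch below assumes the same action-free scalar system s' = A·s as the example; `rollout_relative_error` is a name chosen here.

```python
def rollout_relative_error(a_true, a_model, steps, s0=1.0):
    """Roll s' = A * s forward under the true and learned A.

    Returns the relative error |s_hat_t - s_t| / |s_t| for t = 1..steps.
    """
    s_true, s_model = s0, s0
    errors = []
    for _ in range(steps):
        s_true *= a_true
        s_model *= a_model
        errors.append(abs(s_model - s_true) / abs(s_true))
    return errors


practice = rollout_relative_error(1.05, 1.10, 10)   # practice case
lesson = rollout_relative_error(1.10, 1.05, 10)     # lesson case
print([round(100 * e) for e in practice])
print([round(100 * e) for e in lesson])
```

The last entry of the first list is the 59% figure above; the second list passes through about 21% at step five and 37% at step ten, matching the lesson's example.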

  • Previous (Lesson 8): PPO. Last of the model-free family for this phase.
  • This lesson: Model-based RL, learning the model. Opens the P-branch.
  • Next (Lesson 10): Planning with the model. MPC, the cross-entropy method for action-sequence optimization, MuZero-style learned-model MCTS.
  • Later (Lesson 11 onward): Phase 2 continues with variational inference (L11) and control as inference (L12). Phase 3 covers RLHF (L13).

Model-based RL buys sample efficiency at the cost of sensitivity to model bias. Pick it when samples are precious and the dynamics are learnable; pick model-free when samples are cheap and you want asymptotic performance. The compounding-error analysis is structural and dictates the engineering: short rollouts, frequent re-planning, ensemble uncertainty. Lesson 10 covers how to use the model you learned in this lesson.