Cheatsheet: Model-based RL, learning the dynamics

| Branch | Estimate | Lessons |
| --- | --- | --- |
| π | Policy directly | L4 REINFORCE, L5 actor-critic, L8 PPO |
| V | State value | L5 (critic in actor-critic) |
| Q | Action value | L6, L7 DQN |
| A | Advantage | L5, L8 (advantage in PPO) |
| P | **Dynamics model `P(s' \| s, a)`** | |

| Metric | Model-based | Model-free |
| --- | --- | --- |
| Sample efficiency | 10× to 100× better | Baseline |
| Asymptotic performance | Model bias caps it | Eventually optimal |
| Per-step compute | Higher (re-plan) | Lower |
| Robustness to bad model | Fragile | N/A |

When samples are expensive (real robots), model-based wins. When samples are cheap (Atari), model-free wins.

| Class | Form | Fit | When to use |
| --- | --- | --- | --- |
| Linear-Gaussian | `P(s' \| s, a) = N(As + Ba + c, Σ)` | Closed-form least-squares | |
| Deterministic NN | ŝ' = f_θ(s, a) | MSE | Deterministic or low-noise dynamics |
| Probabilistic NN | N(μ_θ(s,a), Σ_θ(s,a)) | NLL | Aleatoric (irreducible) noise |
| Ensemble of probabilistic NNs | K independent networks | NLL per network | PETS; captures both aleatoric and epistemic |

[Â, B̂]^T = (X^T X)^{-1} X^T Y

where X is N × (dim_s + dim_a) with rows [s_i, a_i] and Y is N × dim_s with rows s_i'. To also fit the offset c, append a constant 1 to each row of X.

Verification: with A_true = 0.5 and B_true = 1.0, the five zero-noise samples from the lesson recover [Â, B̂] = [0.5, 1.0] exactly. The closed-form computation gives:

  • X^T X = [[6.25, -2.5], [-2.5, 3.25]], det = 14.0625
  • X^T Y = [0.625, 2.0]
  • (X^T X)^{-1} · X^T Y = [0.5, 1.0]
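
The arithmetic above can be reproduced in a few lines of plain Python. This is a minimal sketch under stated assumptions: the five lesson samples are not listed in this cheatsheet, so the `states` and `actions` below are hypothetical values chosen only because they reproduce the X^T X quoted above, and `fit_scalar_linear_model` is an illustrative helper name, not something from the lesson.

```python
# Hypothetical samples: NOT the lesson's data, only values that reproduce
# the X^T X = [[6.25, -2.5], [-2.5, 3.25]] quoted in the cheatsheet.
A_TRUE, B_TRUE = 0.5, 1.0
states = [1.0, 2.0, 0.5, 1.0, 0.0]
actions = [0.0, -1.5, 1.0, 0.0, 0.0]
# Zero-noise transitions from the scalar system s' = A*s + B*a.
next_states = [A_TRUE * s + B_TRUE * a for s, a in zip(states, actions)]


def fit_scalar_linear_model(states, actions, next_states):
    """Solve the 2x2 normal equations (X^T X) w = X^T Y for w = [A_hat, B_hat]."""
    # Entries of X^T X, where each row of X is [s_i, a_i].
    ss = sum(s * s for s in states)
    sa = sum(s * a for s, a in zip(states, actions))
    aa = sum(a * a for a in actions)
    # Entries of X^T Y, where Y holds the next states.
    sy = sum(s * y for s, y in zip(states, next_states))
    ay = sum(a * y for a, y in zip(actions, next_states))
    det = ss * aa - sa * sa
    if det == 0:
        raise ValueError("X^T X is singular; the data do not identify A and B")
    # Apply the inverse of the symmetric 2x2 matrix to X^T Y.
    a_hat = (aa * sy - sa * ay) / det
    b_hat = (ss * ay - sa * sy) / det
    return a_hat, b_hat, (ss, sa, aa, det), (sy, ay)


a_hat, b_hat, gram, xty = fit_scalar_linear_model(states, actions, next_states)
print("X^T X entries (ss, sa, aa) and det:", gram)
print("X^T Y:", xty)
print("A_hat, B_hat:", a_hat, b_hat)
```

Solving the 2 × 2 system by hand keeps the sketch dependency-free. For higher-dimensional states, a library least-squares routine is the more sensible choice.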

Compounding error (the dominant failure mode)

With A_true = 1.1 and Â = 1.05 (an absolute bias of 0.05, about 4.5% relative), rolling out from s_0 = 1 with action a = 0:

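| t | True | Model | Error |
| --- | --- | --- | --- |
| 0 | 1.0000 | 1.0000 | 0.0000 |
| 1 | 1.1000 | 1.0500 | 0.0500 |
| 2 | 1.2100 | 1.1025 | 0.1075 |
| 3 | 1.3310 | 1.1576 | 0.1734 |
| 4 | 1.4641 | 1.2155 | 0.2486 |
| 5 | 1.6105 | 1.2763 | 0.3342 |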

A one-step error of 0.05 (4.5% relative) → 21% relative error after five steps → 37% after ten, computed as 1 - (Â/A_true)^t. The error compounds geometrically, so even a seemingly small bias makes the model unreliable for planning past a horizon of about 5.
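
A short script makes the blowup easy to check. The sketch below rolls the true system and the biased model forward from s_0 = 1 with a = 0, using the values from the table, and prints the absolute and relative error at each step. The helper name `rollout` is illustrative.

```python
A_TRUE = 1.1   # true dynamics: s' = A_TRUE * s (action a = 0)
A_HAT = 1.05   # learned model with a small bias


def rollout(a_coef, s0, horizon):
    """Roll s_{t+1} = a_coef * s_t forward and return [s_0, ..., s_horizon]."""
    states = [s0]
    for _ in range(horizon):
        states.append(a_coef * states[-1])
    return states


true_traj = rollout(A_TRUE, 1.0, 10)
model_traj = rollout(A_HAT, 1.0, 10)

print(" t    true   model   error  rel.err")
for t, (s_true, s_model) in enumerate(zip(true_traj, model_traj)):
    abs_err = s_true - s_model
    # Relative error equals 1 - (A_HAT / A_TRUE) ** t for this system.
    rel_err = abs_err / s_true
    print(f"{t:2d}  {s_true:.4f}  {s_model:.4f}  {abs_err:.4f}  {rel_err:6.1%}")
```

Extending the horizon to 10 shows the relative error still climbing after the table ends, which is the motivation for short rollouts.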

| Mitigation | What it does | Algorithm |
| --- | --- | --- |
| Short rollout horizon (1-5 steps) | Cap compounding | MBPO |
| Ensemble disagreement | Detect epistemic uncertainty | PETS |
| Probabilistic rollouts | Propagate uncertainty visibly | PETS, Dreamer |
| Re-plan frequently | Only need accuracy over H-step planning horizon | MPC |

For each step:
  1. Act in real env; observe transition; add to D
  2. Update model on D
  3. For K imagined steps:
     • Sample (s, a) from D or roll out via model
     • Use model to predict s', r
     • Update π on imagined transition

K = 0 ⇒ model-free RL. Large K ⇒ heavy reliance on the model. MBPO keeps each imagined rollout short and branches it from real states in D, so model error has little room to compound.
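
The loop above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: a deterministic scalar environment s' = 0.9 s + 0.5 a, a known reward -(s')², a linear policy a = -gain · s, a least-squares dynamics model, and a policy update that takes one gradient step on the imagined cost. None of the names (`env_step`, `fit_model`, `dyna`) come from the lesson, and a real agent would use a proper RL update in step 3.

```python
import random

A_ENV, B_ENV = 0.9, 0.5   # hypothetical true dynamics: s' = A_ENV*s + B_ENV*a


def env_step(s, a):
    """Toy real environment. Reward is -(s')^2, so the goal is to stay near 0."""
    s_next = A_ENV * s + B_ENV * a
    return s_next, -s_next ** 2


def fit_model(data):
    """Least-squares fit of s' ~ A*s + B*a via the 2x2 normal equations."""
    ss = sum(s * s for s, a, y in data)
    sa = sum(s * a for s, a, y in data)
    aa = sum(a * a for s, a, y in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = ss * aa - sa * sa
    if abs(det) < 1e-9:
        return None   # data do not yet identify A and B
    return (aa * sy - sa * ay) / det, (ss * ay - sa * sy) / det


def dyna(real_steps=200, k_imagined=20, lr=0.5, seed=0):
    rng = random.Random(seed)
    data = []            # buffer D of real transitions (s, a, s')
    model = (0.0, 0.0)   # (A_hat, B_hat), updated once the fit is defined
    gain = 0.0           # policy parameter: a = -gain * s
    s = 1.0
    for _ in range(real_steps):
        # 1. Act in the real env (with exploration noise) and add to D.
        a = -gain * s + rng.gauss(0.0, 0.3)
        s_next, _reward = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
        # 2. Update the model on D.
        fitted = fit_model(data)
        if fitted is not None:
            model = fitted
        a_hat, b_hat = model
        # 3. K imagined steps: start from a real state in D, predict s' with
        #    the model, and take a gradient step on the imagined cost (s')^2.
        for _ in range(k_imagined):
            s_img = rng.choice(data)[0]
            s_pred = a_hat * s_img + b_hat * (-gain * s_img)
            # d(s_pred)/d(gain) = -b_hat * s_img, so by the chain rule:
            grad = 2.0 * s_pred * (-b_hat * s_img)
            gain -= lr * grad
    return model, gain


model, gain = dyna()
print("learned model (A_hat, B_hat):", model)
print("learned policy gain:", gain)
```

The gain should settle near A_ENV / B_ENV = 1.8, the value that drives the predicted next state to zero. In this sketch the policy learns only from imagined transitions, so `dyna(k_imagined=0)` leaves it untrained; a complete agent would also update π from the real transitions.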

| Use model-based | Use model-free |
| --- | --- |
| Real robots | Atari games |
| Continuous-control physics | Pixel-input control |
| Limited sample budget | Unlimited sample budget |
| Short planning horizons (MPC) | One-shot policy |
| Smooth, learnable dynamics | Hard-to-model dynamics (language, vision) |

  • Trusting the fit when validation error is high
  • Rolling past where the model is reliable (compounding error)
  • Confusing aleatoric (per-network Σ) with epistemic (ensemble disagreement) uncertainty
  • Using a deterministic model for stochastic dynamics
  • Forgetting that the data distribution must cover the (s, a) regions the policy will visit

  • Least-squares fit is closed-form for linear-Gaussian; the same loss with SGD trains neural net dynamics.
  • Compounding error is the binding constraint; “small 1-step error” is not enough.
  • Short rollouts + ensemble uncertainty + frequent re-planning are the standard fixes.
  • PETS, MBPO, Dreamer, MuZero are the canonical modern recipes.
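
The difference between aleatoric uncertainty (per-network Σ) and epistemic uncertainty (ensemble disagreement) can be shown with a small numerical sketch. As a loudly stated simplification, bootstrapped linear-Gaussian models stand in for the probabilistic neural networks that PETS uses, and the data, ensemble size, and function names are all hypothetical.

```python
import random
import statistics


def fit_gaussian_linear(pairs):
    """Fit s' ~ N(w*s + b, sigma2) by least squares; sigma2 is residual variance."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in pairs)
    cov = sum((s - mean_s) * (y - mean_y) for s, y in pairs)
    w = cov / var_s
    b = mean_y - w * mean_s
    sigma2 = sum((y - (w * s + b)) ** 2 for s, y in pairs) / n
    return w, b, sigma2


def predictive_uncertainty(ensemble, s):
    """Split predictive variance at state s into (aleatoric, epistemic)."""
    means = [w * s + b for w, b, _ in ensemble]
    # Aleatoric: average noise variance the members themselves predict.
    aleatoric = statistics.fmean(sigma2 for _, _, sigma2 in ensemble)
    # Epistemic: how much the members' mean predictions disagree.
    epistemic = statistics.pvariance(means)
    return aleatoric, epistemic


rng = random.Random(0)
# Hypothetical noisy transitions, observed only for s in [-1, 1].
data = []
for _ in range(40):
    s = rng.uniform(-1.0, 1.0)
    data.append((s, 0.8 * s + rng.gauss(0.0, 0.1)))

# Each ensemble member is trained on its own bootstrap resample of the data.
ensemble = [fit_gaussian_linear(rng.choices(data, k=len(data))) for _ in range(10)]

for query in (0.0, 10.0):
    aleatoric, epistemic = predictive_uncertainty(ensemble, query)
    print(f"s = {query:5.1f}  aleatoric = {aleatoric:.4f}  epistemic = {epistemic:.6f}")
```

The aleatoric term is the same at both query points because each member predicts a constant noise level, while the disagreement term grows far from the training range. That growth is the signal used to stop or down-weight rollouts that leave the region the data covers.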