Cheatsheet: Model-based RL, learning the dynamics
The P-branch of the dispatch table
| Branch | Estimate | Lessons |
|---|---|---|
| π | Policy directly | L4 REINFORCE, L5 actor-critic, L8 PPO |
| V | State value | L5 (critic in actor-critic) |
| Q | Action value | L6, L7 DQN |
| A | Advantage | L5, L8 (advantage in PPO) |
| P | **Dynamics model `P(s' \| s, a)`** | This lesson |
Why model-based?
| Metric | Model-based | Model-free |
|---|---|---|
| Sample efficiency | Often 10× to 100× fewer samples | Baseline |
| Asymptotic performance | Model bias caps it | Eventually optimal |
| Per-step compute | Higher (re-plan) | Lower |
| Robustness to bad model | Fragile | N/A |
When samples are expensive (real robots), model-based usually wins. When samples are cheap (Atari), model-free usually wins.
Model classes
| Class | Form | Fit | When to use |
|---|---|---|---|
| Linear-Gaussian | `P(s' \| s,a) = N(As + Ba + c, Σ)` | Closed-form least squares | Linear or near-linear dynamics |
| Deterministic NN | ŝ' = f_θ(s, a) | MSE | Deterministic or low-noise dynamics |
| Probabilistic NN | N(μ_θ(s,a), Σ_θ(s,a)) | NLL | Aleatoric (irreducible) noise |
| Ensemble of probabilistic NNs | K independent networks | NLL per network | PETS; captures both aleatoric and epistemic |
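
The NLL entry in the Fit column can be made concrete. Below is a minimal NumPy sketch of the diagonal-Gaussian negative log-likelihood, assuming the network outputs a mean and a log-variance per state dimension. The log-variance parameterization and the name `gaussian_nll` are illustrative assumptions, not taken from the lesson.

```python
import numpy as np

def gaussian_nll(mu, log_var, target):
    """Mean NLL of `target` under N(mu, diag(exp(log_var))).

    All inputs have shape (batch, dim_s). Hypothetical helper for illustration.
    """
    mu, log_var, target = (np.asarray(x, dtype=float) for x in (mu, log_var, target))
    per_dim = 0.5 * (np.log(2.0 * np.pi) + log_var
                     + (target - mu) ** 2 / np.exp(log_var))
    # Sum over state dimensions, average over the batch.
    return per_dim.sum(axis=-1).mean()

# A perfect mean prediction with unit variance costs only the normalizer.
perfect = gaussian_nll([[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]])
# The same prediction with a unit error in one dimension costs 0.5 more.
off_by_one = gaussian_nll([[0.0, 0.0]], [[0.0, 0.0]], [[1.0, 0.0]])
```

With the variance fixed, minimizing this loss reduces to minimizing squared error, which is why the deterministic row uses MSE.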
Linear-Gaussian fit by least squares
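`[Â, B̂]^T = (X^T X)^{-1} X^T Y`

where `X` is N × (dim_s + dim_a) with rows `[s_i, a_i]` and `Y` is N × dim_s with rows `s_i'`. The transpose makes the shapes agree in the matrix case; for scalar state and action it changes nothing. This form omits the offset `c`; appending a column of ones to `X` fits it as well.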
Verification: with A_true = 0.5 and B_true = 1.0, the five zero-noise samples from the lesson recover [Â, B̂] = [0.5, 1.0] exactly. The closed-form arithmetic:

- `X^T X = [[6.25, -2.5], [-2.5, 3.25]]`, `det = 14.0625`
- `X^T Y = [0.625, 2.0]`
- `(X^T X)^{-1} · X^T Y = [0.5, 1.0]` ✓
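
The arithmetic above can be checked numerically. The sketch below uses five hypothetical samples chosen so that `X^T X` and `X^T Y` match the quoted values; they are not necessarily the lesson's own data.

```python
import numpy as np

# Hypothetical samples (an assumption, not the lesson's data), picked so that
# X^T X and X^T Y reproduce the values quoted in the verification.
s = np.array([0.0, 1.0, 2.0, -1.0, 0.5])   # states s_i
a = np.array([0.5, 0.0, -1.0, 1.0, 1.0])   # actions a_i
A_true, B_true = 0.5, 1.0
y = A_true * s + B_true * a                 # zero-noise next states s_i'

X = np.column_stack([s, a])                 # N x (dim_s + dim_a)
XtX = X.T @ X                               # [[6.25, -2.5], [-2.5, 3.25]]
XtY = X.T @ y                               # [0.625, 2.0]

# Solve the normal equations instead of forming the inverse explicitly.
A_hat, B_hat = np.linalg.solve(XtX, XtY)
print(A_hat, B_hat)
```

On noisy or nearly collinear data, `np.linalg.lstsq(X, y, rcond=None)` is usually the safer call, since it avoids squaring the condition number of `X`.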
Compounding error (the dominant failure mode)
With A_true = 1.1 and Â = 1.05 (an absolute error of 0.05, about 4.5% relative), rolling out from s_0 = 1 with action a = 0:
| t | True | Model | Error |
|---|---|---|---|
| 0 | 1.0000 | 1.0000 | 0.0000 |
| 1 | 1.1000 | 1.0500 | 0.0500 |
| 2 | 1.2100 | 1.1025 | 0.1075 |
| 3 | 1.3310 | 1.1576 | 0.1734 |
| 4 | 1.4641 | 1.2155 | 0.2486 |
| 5 | 1.6105 | 1.2763 | 0.3342 |
A 0.05 (about 4.5%) one-step error grows to a 21% relative error after five steps and 37% after ten, computed as 1 - (Â/A_true)^t. The error compounds geometrically, so even a seemingly small bias makes the model too inaccurate for planning past horizon ~5.
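
The rollout table and the relative-error figures can be reproduced with a few lines of plain Python. This is a sketch of the scalar example only; the variable names are illustrative.

```python
A_true, A_hat = 1.1, 1.05
s_true = s_model = 1.0          # s_0 = 1, action a = 0 throughout

rows = []
for t in range(11):
    abs_err = s_true - s_model
    rel_err = 1.0 - s_model / s_true    # equals 1 - (A_hat / A_true) ** t
    rows.append((t, s_true, s_model, abs_err, rel_err))
    s_true *= A_true            # true dynamics
    s_model *= A_hat            # biased model, fed its own predictions

for t, st, sm, ae, rel in rows:
    print(f"{t:2d}  {st:.4f}  {sm:.4f}  {ae:.4f}  {rel:6.1%}")
```

The key detail is that the model is fed its own previous prediction rather than the true state, which is what lets a fixed one-step bias accumulate.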
Mitigations
| Mitigation | What it does | Algorithm |
|---|---|---|
| Short rollout horizon (1-5 steps) | Cap compounding | MBPO |
| Ensemble disagreement | Detect epistemic uncertainty | PETS |
| Probabilistic rollouts | Propagate uncertainty visibly | PETS, Dreamer |
| Re-plan frequently | Only need accuracy over H-step planning horizon | MPC |
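
Ensemble disagreement can be illustrated without neural networks. The sketch below substitutes bootstrapped least-squares models for the probabilistic networks PETS uses, which is a deliberate simplification; the data, noise level, and ensemble size are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 40, 5                                  # samples, ensemble members

# Synthetic data from s' = 0.5 s + 1.0 a + noise, with (s, a) in [-1, 1]^2.
s = rng.uniform(-1.0, 1.0, N)
a = rng.uniform(-1.0, 1.0, N)
y = 0.5 * s + 1.0 * a + rng.normal(0.0, 0.1, N)
X = np.column_stack([s, a])

# Each member is fit on its own bootstrap resample of the data.
thetas = []
for k in range(K):
    idx = rng.integers(0, N, N)
    thetas.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0])
thetas = np.array(thetas)                     # shape (K, 2)

def disagreement(query):
    """Std of the K members' predictions at one (s, a) query."""
    return (thetas @ np.asarray(query, dtype=float)).std()

near = disagreement([0.5, 0.5])               # inside the training range
far = disagreement([5.0, 5.0])                # far outside it
print(near, far)
```

For this intercept-free linear model the disagreement scales linearly with the query's distance from the origin, so the out-of-range query is exactly ten times less certain. A planner can use that signal to penalize or truncate rollouts that leave the data.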
Dyna architecture (Sutton 1991)
For each step:

1. Act in the real environment, observe the transition, and add it to D.
2. Update the model on D.
3. For K imagined steps:
   - Sample (s, a) from D, or roll out via the model.
   - Use the model to predict s', r.
   - Update π on the imagined transition.

K = 0 ⇒ model-free RL (π then learns only from real transitions). Large K ⇒ heavy reliance on the model. MBPO controls model error with short rollouts branched from real states in D.
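
The loop above can be sketched as tabular Dyna-Q. This is one common instantiation rather than the only one: the policy update is a Q-learning step, the model is a lookup table, and the chain environment and all hyperparameters are hypothetical choices made for illustration.

```python
import random

N_STATES, GOAL = 6, 5            # chain of states 0..5, reward on reaching 5
ACTIONS = (0, 1)                 # 0 = left, 1 = right

def env_step(s, a):
    """Hypothetical toy environment: a deterministic chain."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    done = s2 == GOAL
    return s2, (1.0 if done else 0.0), done

def dyna_q(episodes=50, K=20, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}                   # learned model: (s, a) -> (s2, r, done)

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(episodes):
        s = 0
        for _ in range(200):     # step cap per episode
            # Epsilon-greedy with random tie-breaking.
            if rng.random() < eps or Q[s][0] == Q[s][1]:
                a = rng.choice(ACTIONS)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2, r, done = env_step(s, a)        # 1. act in the real env
            update(s, a, r, s2, done)           #    learn from the real step
            model[(s, a)] = (s2, r, done)       # 2. update the model
            for _ in range(K):                  # 3. K imagined steps
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            if done:
                break
            s = s2
    return Q

Q = dyna_q()
greedy = [0 if q[0] > q[1] else 1 for q in Q[:GOAL]]
print(greedy)
```

Calling `dyna_q(K=0)` skips the planning loop entirely and leaves plain Q-learning, which matches the K = 0 case described above.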
Decision rubric
| Use model-based | Use model-free |
|---|---|
| Real robots | Atari games |
| Continuous-control physics | Pixel-input control |
| Limited sample budget | Unlimited sample budget |
| Short planning horizons (MPC) | One-shot policy |
| Smooth, learnable dynamics | Hard-to-model dynamics (language, vision) |
Common pitfalls
- Trusting the fit when validation error is high
- Rolling past where the model is reliable (compounding error)
- Confusing aleatoric (per-network Σ) with epistemic (ensemble disagreement) uncertainty
- Using a deterministic model for stochastic dynamics
- Forgetting that data distribution must cover the (s, a) regions the policy will visit
What you should remember
- Least-squares fit is closed-form for linear-Gaussian models; the same squared-error loss, minimized with SGD, trains neural-network dynamics.
- Compounding error is the binding constraint; “small 1-step error” is not enough.
- Short rollouts + ensemble uncertainty + frequent re-planning are the standard fixes.
- PETS, MBPO, Dreamer, MuZero are the canonical modern recipes.