Skip to content

Summary: Monte Carlo prediction

Monte Carlo prediction is the simplest model-free way to evaluate a policy: play episodes, average the returns, let the law of large numbers do the rest. Phase 2 assumed P and R were known; Phase 3 starts the real-world case where they are not. MC is the unbiased extreme of the bias-variance spectrum that TD (next lesson) sits at the other end of. This summary is the scan-in-five-minutes version of the full lesson.

  • Prediction vs control. Prediction estimates V^pi (or Q^pi) for a fixed pi. Control finds a better policy. Most algorithms use a prediction step inside a control loop; this lesson is prediction in the model-free setting.
  • Monte Carlo prediction. Run an episode under pi to termination, compute the return G_t from each visited state, average across episodes. As episodes accumulate, the sample average converges to V^pi(s) by the law of large numbers.
  • First-visit vs every-visit MC. First-visit counts only the first occurrence of s in each episode; every-visit counts every occurrence. Both converge to V^pi; the choice is mostly cosmetic.
  • Incremental update. count(s) <- count(s) + 1; V(s) <- V(s) + (G - V(s)) / count(s). Maintains the running average without storing all past returns. Replacing 1/count with a step size alpha gives an EMA (and is the form TD uses).
  • Unbiased but high-variance. The MC estimator’s expectation equals V^pi exactly; in any single run of N episodes, the noisy returns can leave the estimate far from the truth. The 4-episode S/A/T example hits V = (3, 2) exactly when the 50-50 outcome realizes evenly; a biased realization (3 of 4 high) overshoots by 1, and only converges back with more episodes.
  • Needs terminating episodes. MC computes the full return from t to T; on a non-terminating task you would never finish. TD learning (next lesson) fixes this by bootstrapping from one-step rewards plus an estimated next-state value, which works for episodic and continuing tasks alike.

You have the first algorithm in the track that runs in the setting where RL is actually deployed: no model, just experience. The mental model is mostly more useful than the algorithm: Phase 2’s policy evaluation has a sample-based twin (average observed returns), and you can swap one for the other in any GPI loop. The bias-variance framing is the more durable takeaway, because the rest of the track is points on the same spectrum. MC is the unbiased / high-variance end; TD (next lesson) is the biased / low-variance end; n-step returns and TD(lambda) live in between. When you read a deep-RL paper choosing “n-step targets” or “GAE”, they are picking a point on this exact spectrum.