Summary: Monte Carlo prediction
Monte Carlo prediction is the simplest model-free way to evaluate a policy: play episodes, average the returns, let the law of large numbers do the rest. Phase 2 assumed P and R were known; Phase 3 starts the real-world case where they are not. MC is the unbiased extreme of the bias-variance spectrum that TD (next lesson) sits at the other end of. This summary is the scan-in-five-minutes version of the full lesson.
Core ideas
Section titled “Core ideas”- Prediction vs control. Prediction estimates V^pi (or Q^pi) for a fixed pi. Control finds a better policy. Most algorithms use a prediction step inside a control loop; this lesson is prediction in the model-free setting.
- Monte Carlo prediction. Run an episode under pi to termination, compute the return G_t from each visited state, average across episodes. As episodes accumulate, the sample average converges to V^pi(s) by the law of large numbers.
- First-visit vs every-visit MC. First-visit counts only the first occurrence of s in each episode; every-visit counts every occurrence. Both converge to V^pi; the choice is mostly cosmetic.
- Incremental update. count(s)
<-count(s) + 1; V(s)<-V(s) + (G - V(s)) / count(s). Maintains the running average without storing all past returns. Replacing 1/count with a step size alpha gives an EMA (and is the form TD uses). - Unbiased but high-variance. The MC estimator’s expectation equals V^pi exactly; in any single run of N episodes, the noisy returns can leave the estimate far from the truth. The 4-episode S/A/T example hits V = (3, 2) exactly when the 50-50 outcome realizes evenly; a biased realization (3 of 4 high) overshoots by 1, and only converges back with more episodes.
- Needs terminating episodes. MC computes the full return from t to T; on a non-terminating task you would never finish. TD learning (next lesson) fixes this by bootstrapping from one-step rewards plus an estimated next-state value, which works for episodic and continuing tasks alike.
What changes for you
Section titled “What changes for you”You have the first algorithm in the track that runs in the setting where RL is actually deployed: no model, just experience. The mental model is mostly more useful than the algorithm: Phase 2’s policy evaluation has a sample-based twin (average observed returns), and you can swap one for the other in any GPI loop. The bias-variance framing is the more durable takeaway, because the rest of the track is points on the same spectrum. MC is the unbiased / high-variance end; TD (next lesson) is the biased / low-variance end; n-step returns and TD(lambda) live in between. When you read a deep-RL paper choosing “n-step targets” or “GAE”, they are picking a point on this exact spectrum.