Skip to content

Monte Carlo prediction

This is lesson 6 of Track 17 (Reinforcement Learning Foundations) and the opener of Phase 3 (Model-free learning). Phase 2 assumed P and R were known; this phase moves to the much more common setting where they are not, and the agent must learn from samples alone. Monte Carlo prediction is the simplest model-free algorithm: to estimate the value of a fixed policy, play episodes under it, look at the actual total reward each episode gave from each visited state, and average. The law of large numbers does the rest. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson clarifies the prediction-vs-control split, walks the Monte Carlo algorithm in its first-visit and every-visit variants, gives the incremental running-average update, runs four episodes of a 3-state worked example through to the exact V^pi, and is honest about MC’s two-sided nature: unbiased in expectation but high in variance, which is what sets up TD learning in the next lesson as the bias-variance trade. It closes on MC’s role in modern RL (off-policy evaluation, MCTS rollouts) and on its limit, the requirement of terminating episodes.

This is lesson 6 of 10, the first lesson of Phase 3. It uses the MDP and value-function vocabulary from Phase 1 and treats the sample-based analogue of Phase 2’s policy evaluation. The next lesson, Temporal-difference learning, sits at the other end of the bias-variance axis: bootstrap from a one-step return plus an estimated next-state value instead of waiting for a full episode return. The lesson after that, Q-learning, wraps a TD-style update around a max-over-actions to give the canonical model-free control algorithm. So this lesson is where the rest of Phase 3 builds from.

Prerequisites: lesson 3 (Value functions and the Bellman equations) for V^pi’s definition as an expected return, and lesson 5 (Value iteration) for the planning-vs-learning split that this lesson opens up. Comfort with computing an expectation by averaging samples (Track 9’s basic-probability material) helps.

The arithmetic is hand-sized: at each episode, compute the return from each state visited (one or two additions), then update the running average. The worked example uses gamma = 1 (episodic) to keep the discount out of the way; the formulas generalize unchanged for gamma < 1. No new derivations beyond writing down the sample average.

  • Explain why model-free prediction is needed when P and R are unknown, and distinguish prediction from control
  • Apply Monte Carlo prediction to estimate V^pi by averaging observed returns over episodes
  • Distinguish first-visit MC from every-visit MC and recognize that both converge to V^pi
  • Explain why Monte Carlo estimates are unbiased but high-variance, and what that implies in practice
  • Recognize Monte Carlo’s requirement of terminating episodes and why TD learning extends to the continuing case
  • Read time: about 13 minutes
  • Practice time: about 16 minutes (a self-check, a 5-episode MC simulation on a fresh 3-state MDP that shows both convergence and variance, a which-method-fits-this-setting drill, and flashcards)
  • Difficulty: standard (small arithmetic; the conceptual challenge is internalizing the bias-variance trade-off that sets up the next lesson)