References: Monte Carlo prediction

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 4:
Model-Free Prediction (Monte Carlo methods)
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the Monte Carlo portion of Silver's
Lecture 4 and restates it in Clawdemy's voice. It uses an original 3-state
S/A/T worked example whose 4-episode run is designed to show both the
exact-match case (a 50-50 outcome that realizes evenly) and the variance case
(a lopsided realization of the same true distribution). Three pieces are
Clawdemy framing: the explicit prediction-vs-control split; the bias-variance
axis that pre-figures TD (next lesson), with n-step methods and TD(lambda) in
the spectrum's interior; and the named "MC needs termination" limit that
motivates TD.
Exact per-lecture URLs are verified at promotion.

A short, durable list. Both are free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 5 (Monte Carlo Methods). The textbook treatment, with policy evaluation, on-policy control, and off-policy methods (including the importance-sampling variant referenced in this lesson).
  • David Silver, UCL RL course, Lecture 5: Model-Free Control (within the course above). Where MC and TD prediction get wrapped in generalized policy iteration (GPI) loops to become control algorithms (Monte Carlo control, SARSA, Q-learning). Track 17 lesson 8 develops Q-learning from this material.

Where this sits inside this track.

  • Value iteration. The previous lesson. VI’s update form pre-figures TD’s: VI takes a max over actions and an expectation over next states using the transition model P; TD replaces that expectation with a single sampled step. Both are bootstrapped one-step targets, computed in different ways.
  • Temporal-difference learning. The next lesson. The other end of the bias-variance axis: bootstrap from one observed reward plus the discounted estimate of the next state’s value instead of waiting for a full episode return.
  • Q-learning: model-free control. Lesson 8. Combines TD-style sample bootstrapping with the max-over-actions of value iteration, giving the canonical model-free control algorithm.
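
The four targets contrasted in the list above can be put side by side in a few lines. This is a minimal sketch, not code from the lessons: the function names, the `(prob, reward, next_state)` layout for the model, and the state labels in the usage note are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model layout: for one state, map each action to a list of
# (probability, reward, next_state) outcomes. This stands in for "P".
Outcomes = List[Tuple[float, float, str]]


def mc_return(rewards: List[float], gamma: float) -> float:
    """Monte Carlo target: the full discounted return to the end of the episode.

    `rewards` holds R_{t+1}, R_{t+2}, ... so the episode must terminate.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward: float, gamma: float, v_next: float) -> float:
    """TD(0) target: one sampled reward plus the discounted value estimate."""
    return reward + gamma * v_next


def vi_backup(actions: Dict[str, Outcomes], V: Dict[str, float], gamma: float) -> float:
    """Value-iteration target: max over actions of the model-based expectation."""
    return max(
        sum(p * (r + gamma * V[s_next]) for p, r, s_next in outcomes)
        for outcomes in actions.values()
    )


def q_learning_target(reward: float, gamma: float, q_next: Dict[str, float]) -> float:
    """Q-learning target: sampled reward plus discounted max over next actions.

    An empty `q_next` is treated as a terminal next state (assumption).
    """
    return reward + gamma * max(q_next.values()) if q_next else reward
```

For example, with `gamma = 0.5`, `mc_return([1.0, 0.0, 2.0], 0.5)` sums three observed rewards, while `td_target(1.0, 0.5, 4.0)` uses only the first reward and leans on an estimate for the rest. `vi_backup` needs the model; `q_learning_target` takes the same max but from a single sampled transition.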