Skip to content

Cheatsheet: Monte Carlo prediction

Monte Carlo prediction estimates V^pi by averaging returns observed over complete episodes. Unbiased, high-variance, requires termination.

Prediction = estimate V^pi (or Q^pi) for a GIVEN pi.
Control = find a BETTER policy (ideally pi^*).
This lesson is prediction in the MODEL-FREE setting (no P, no R; only samples).
1. Run an episode under pi to termination: s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T.
2. For each state s visited at time t (first-visit or every-visit):
G_t = r_(t+1) + gamma * r_(t+2) + ... + gamma^(T-t-1) * r_T
count(s) <- count(s) + 1
V(s) <- V(s) + (G_t - V(s)) / count(s)
3. Repeat for many episodes. V(s) -> V^pi(s) by the law of large numbers.
Variants:
first-visit MC : count only the first visit to s in each episode.
every-visit MC : count every visit. Both converge.
Step-size form: V(s) <- V(s) + alpha * (G_t - V(s)) (EMA; non-stationary friendly)
MC : ZERO bias, HIGH variance. Returns are sums of many random rewards.
TD (next) : SOME bias, LOWER variance. Bootstraps from one-step + estimated value.
n-step / TD(lambda): interpolate between MC and TD on the same axis.

Worked example (lesson, gamma = 1, episodic)

Section titled “Worked example (lesson, gamma = 1, episodic)”
S -> A (R=1) -> T with R(A->T) = +4 w.p. 0.5, 0 w.p. 0.5.
True values: V^pi(A) = 2, V^pi(S) = 3.
4 episodes with rewards (4,0,4,0):
G(S) = (5, 1, 5, 1) -> avg 3 ✓ G(A) = (4, 0, 4, 0) -> avg 2 ✓
4 episodes with rewards (4,4,4,0):
G(S) avg = 4 (off by 1) G(A) avg = 3 (off by 1)
-- unbiased in expectation, noisy in any single run.
SettingMC fits?
Episodic task, model unknown, on-policy evaluationYES
Continuing task (no termination)NO — need TD
Off-policy evaluation from logged dataMC + importance weights (variant)
Self-play tree search (AlphaGo MCTS)YES (rollouts)
  • Treating MC as a control algorithm by itself (it’s prediction; wrap in GPI for control).
  • Applying MC to continuing tasks without truncation (return never finishes).
  • Confusing “unbiased” with “accurate from few samples” (variance can be large).
  • Forgetting the first-visit vs every-visit choice changes which samples enter the average.
  • Stopping after one good-looking episode (run many before reading the average).
  • Model-free prediction: estimate V^pi (or Q^pi) from samples, without P or R.
  • Return G_t: discounted total reward from time t to episode end.
  • First-visit MC / every-visit MC: counting rule for averaging within an episode.
  • Unbiased: E[estimator] = V^pi.
  • High-variance: large sample-to-sample fluctuation in the estimator.