Monte Carlo prediction: cheatsheet

The one idea

Monte Carlo prediction estimates V^pi by averaging returns observed over complete episodes. Unbiased, high-variance, requires termination.

Prediction vs control

Prediction = estimate V^pi (or Q^pi) for a GIVEN pi.
Control    = find a BETTER policy (ideally pi^*).
This lesson is prediction in the MODEL-FREE setting (no P, no R; only samples).

The Monte Carlo algorithm

1. Run an episode under pi to termination: s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T.
2. For each state s visited at time t (first-visit or every-visit):
     G_t = r_(t+1) + gamma * r_(t+2) + ... + gamma^(T-t-1) * r_T
     count(s) <- count(s) + 1
     V(s)     <- V(s) + (G_t - V(s)) / count(s)
3. Repeat for many episodes. V(s) -> V^pi(s) by the law of large numbers.

Variants:
  first-visit MC : count only the first visit to s in each episode.
  every-visit MC : count every visit. Both converge.

Step-size form: V(s) <- V(s) + alpha * (G_t - V(s))   (EMA; non-stationary friendly)

Bias-variance position

MC        : ZERO bias,  HIGH variance.  Returns are sums of many random rewards.
TD (next) : SOME bias,  LOWER variance. Bootstraps from one-step + estimated value.
n-step / TD(lambda): interpolate between MC and TD on the same axis.

Worked example (lesson, gamma = 1, episodic)

S -> A (R=1) -> T with R(A->T) = +4 w.p. 0.5, 0 w.p. 0.5.
True values: V^pi(A) = 2, V^pi(S) = 3.

4 episodes with rewards (4,0,4,0):
  G(S) = (5, 1, 5, 1) -> avg 3 ✓        G(A) = (4, 0, 4, 0) -> avg 2 ✓

4 episodes with rewards (4,4,4,0):
  G(S) avg = 4 (off by 1)               G(A) avg = 3 (off by 1)
  -- unbiased in expectation, noisy in any single run.

When MC fits (and when it doesn’t)

Setting	MC fits?
Episodic task, model unknown, on-policy evaluation	YES
Continuing task (no termination)	NO — need TD
Off-policy evaluation from logged data	MC + importance weights (variant)
Self-play tree search (AlphaGo MCTS)	YES (rollouts)

Pitfalls to dodge

Treating MC as a control algorithm by itself (it’s prediction; wrap in GPI for control).
Applying MC to continuing tasks without truncation (return never finishes).
Confusing “unbiased” with “accurate from few samples” (variance can be large).
Forgetting the first-visit vs every-visit choice changes which samples enter the average.
Stopping after one good-looking episode (run many before reading the average).

Words to use precisely

Model-free prediction: estimate V^pi (or Q^pi) from samples, without P or R.
Return G_t: discounted total reward from time t to episode end.
First-visit MC / every-visit MC: counting rule for averaging within an episode.
Unbiased: E[estimator] = V^pi.
High-variance: large sample-to-sample fluctuation in the estimator.