Cheatsheet: Monte Carlo prediction
The one idea
Section titled “The one idea”Monte Carlo prediction estimates V^pi by averaging returns observed over complete episodes. Unbiased, high-variance, requires termination.
Prediction vs control
Section titled “Prediction vs control”Prediction = estimate V^pi (or Q^pi) for a GIVEN pi.Control = find a BETTER policy (ideally pi^*).This lesson is prediction in the MODEL-FREE setting (no P, no R; only samples).The Monte Carlo algorithm
Section titled “The Monte Carlo algorithm”1. Run an episode under pi to termination: s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T.2. For each state s visited at time t (first-visit or every-visit): G_t = r_(t+1) + gamma * r_(t+2) + ... + gamma^(T-t-1) * r_T count(s) <- count(s) + 1 V(s) <- V(s) + (G_t - V(s)) / count(s)3. Repeat for many episodes. V(s) -> V^pi(s) by the law of large numbers.
Variants: first-visit MC : count only the first visit to s in each episode. every-visit MC : count every visit. Both converge.
Step-size form: V(s) <- V(s) + alpha * (G_t - V(s)) (EMA; non-stationary friendly)Bias-variance position
Section titled “Bias-variance position”MC : ZERO bias, HIGH variance. Returns are sums of many random rewards.TD (next) : SOME bias, LOWER variance. Bootstraps from one-step + estimated value.n-step / TD(lambda): interpolate between MC and TD on the same axis.Worked example (lesson, gamma = 1, episodic)
Section titled “Worked example (lesson, gamma = 1, episodic)”S -> A (R=1) -> T with R(A->T) = +4 w.p. 0.5, 0 w.p. 0.5.True values: V^pi(A) = 2, V^pi(S) = 3.
4 episodes with rewards (4,0,4,0): G(S) = (5, 1, 5, 1) -> avg 3 ✓ G(A) = (4, 0, 4, 0) -> avg 2 ✓
4 episodes with rewards (4,4,4,0): G(S) avg = 4 (off by 1) G(A) avg = 3 (off by 1) -- unbiased in expectation, noisy in any single run.When MC fits (and when it doesn’t)
Section titled “When MC fits (and when it doesn’t)”| Setting | MC fits? |
|---|---|
| Episodic task, model unknown, on-policy evaluation | YES |
| Continuing task (no termination) | NO — need TD |
| Off-policy evaluation from logged data | MC + importance weights (variant) |
| Self-play tree search (AlphaGo MCTS) | YES (rollouts) |
Pitfalls to dodge
Section titled “Pitfalls to dodge”- Treating MC as a control algorithm by itself (it’s prediction; wrap in GPI for control).
- Applying MC to continuing tasks without truncation (return never finishes).
- Confusing “unbiased” with “accurate from few samples” (variance can be large).
- Forgetting the first-visit vs every-visit choice changes which samples enter the average.
- Stopping after one good-looking episode (run many before reading the average).
Words to use precisely
Section titled “Words to use precisely”- Model-free prediction: estimate V^pi (or Q^pi) from samples, without P or R.
- Return G_t: discounted total reward from time t to episode end.
- First-visit MC / every-visit MC: counting rule for averaging within an episode.
- Unbiased: E[estimator] = V^pi.
- High-variance: large sample-to-sample fluctuation in the estimator.