Practice: Temporal-difference learning
The skill is running TD(0) by hand: compute the TD error at each transition (target minus current V), nudge V by alpha times the TD error. The fresh 4-state chain in this practice is one state longer than the lesson’s, which makes the “value propagates backward one bootstrap per episode” pattern even more visible. Keep a scratchpad.
Self-check
Section titled “Self-check”Six short questions. Answer each in your head before opening the collapsible.
1. Write the TD(0) update for V.
Show answer
V(s_t) <- V(s_t) + alpha * [ r_(t+1) + gamma * V(s_(t+1)) - V(s_t) ]. The bracketed quantity is the TD error delta_t. The target r_(t+1) + gamma * V(s_(t+1)) is the Bellman expectation equation estimated from one sampled transition.
2. What is the TD error, in words?
Show answer
The difference between the TD target (one observed reward plus the discounted current value of the next state) and the current value estimate of the state we are updating: delta_t = r_(t+1) + gamma * V(s_(t+1)) - V(s_t). It measures how much one new transition’s evidence disagrees with V(s_t).
3. Why is “bootstrapping” the right word for TD?
Show answer
Because TD updates V(s_t) using V(s_(t+1)), an estimate updates another estimate. The reward r_(t+1) is real data, but the future-value part of the target is itself a current estimate that is also being learned. Bootstrap = pulling yourself up by your own estimates.
4. On the bias-variance axis, where do MC and TD sit?
Show answer
MC is the unbiased, high-variance extreme (target = full return, no bootstrap). TD(0) is the biased, low-variance extreme (target = one reward + bootstrapped value). Both converge to V^pi under standard conditions. n-step returns and TD(lambda) interpolate.
5. Name two settings where TD works but MC does not.
Show answer
(1) Continuing tasks (no termination): MC needs a full return, TD only needs one step. (2) Online learning: TD updates as each transition arrives, while MC has to wait for the episode to end before any update. Both are central to real deployed RL systems.
6. What is the “deadly triad” and why does it matter for deep RL?
Show answer
The combination of bootstrapping (TD) + off-policy (e.g. Q-learning) + function approximation (e.g. a neural network). The three together can diverge in naive implementations. DQN’s experience replay and target networks are engineering fixes that tame the triad and make deep value-learning stable. Lesson 9 returns to this.
Try it yourself: TD(0) on a 4-state chain
Section titled “Try it yourself: TD(0) on a 4-state chain”Four states W -> X -> Y -> Z (Z terminal), one action everywhere, deterministic transitions, reward 1 on each step, gamma = 1, alpha = 0.5. Initialize V = (0, 0, 0, 0).
True values: V^pi(W) = 3, V^pi(X) = 2, V^pi(Y) = 1, V^pi(Z) = 0Run three episodes of TD(0). At each step compute the TD error and update V. After each episode, write down V = (V(W), V(X), V(Y)).
Show answer
Episode 1. V_0 = (0, 0, 0, 0).
W -> X (r=1): delta = 1 + V(X) - V(W) = 1 + 0 - 0 = 1. V(W) <- 0 + 0.5*1 = 0.5X -> Y (r=1): delta = 1 + V(Y) - V(X) = 1 + 0 - 0 = 1. V(X) <- 0 + 0.5*1 = 0.5Y -> Z (r=1): delta = 1 + V(Z) - V(Y) = 1 + 0 - 0 = 1. V(Y) <- 0 + 0.5*1 = 0.5
After ep 1: V = (0.5, 0.5, 0.5)Episode 2.
W -> X (r=1): delta = 1 + 0.5 - 0.5 = 1. V(W) <- 0.5 + 0.5*1 = 1.0X -> Y (r=1): delta = 1 + 0.5 - 0.5 = 1. V(X) <- 0.5 + 0.5*1 = 1.0Y -> Z (r=1): delta = 1 + 0 - 0.5 = 0.5. V(Y) <- 0.5 + 0.5*0.5 = 0.75
After ep 2: V = (1.0, 1.0, 0.75)Episode 3.
W -> X (r=1): delta = 1 + 1.0 - 1.0 = 1.0. V(W) <- 1.0 + 0.5*1.0 = 1.5X -> Y (r=1): delta = 1 + 0.75 - 1.0 = 0.75. V(X) <- 1.0 + 0.5*0.75 = 1.375Y -> Z (r=1): delta = 1 + 0 - 0.75 = 0.25. V(Y) <- 0.75 + 0.5*0.25 = 0.875
After ep 3: V = (1.5, 1.375, 0.875)The pattern is identical to the lesson’s chain, just one state longer:
- Y converges fastest (its target uses V(Z) = 0 = truth from the start).
- X is one bootstrap step behind Y (its target uses an estimate that is still wrong).
- W is two bootstrap steps behind (information about the terminal has to propagate two states upstream).
So after three episodes you have (1.5, 1.375, 0.875) creeping toward true (3, 2, 1). The bootstrap propagates backward; the deeper the state, the slower the convergence.
Try it yourself: MC vs TD on the same setting
Section titled “Try it yourself: MC vs TD on the same setting”Take the same 4-state chain (W -> X -> Y -> Z, R = 1 each step, gamma = 1, deterministic). Answer:
1. After ONE episode, what do MC and TD(0) (alpha = 0.5) each give for V(W)? (For MC, recall: V(W) <- average of observed returns from W.)2. The chain is deterministic. How many MC episodes does it take to converge exactly to V^pi = (3, 2, 1)? How many TD episodes (qualitatively)?3. If the chain were STOCHASTIC instead (rewards random with the same mean), which method would have lower per-step variance, and why?Show answer
1. After one episode:
MC: G(W) = 1 + 1 + 1 = 3 -> V(W) = 3 (the exact true value, no error)TD: see worked example above; V(W) = 0.5 after one episode (far below truth).The deterministic chain is MC’s best case: one return is the truth, so MC reads off V^pi(W) = 3 in a single episode. TD’s one-step targets give a partial update (0.5) and need several more episodes to propagate the terminal information backward.
2. MC: one episode (deterministic chain -> every return is exactly V^pi -> one sample is the truth). TD: many episodes (information propagates backward one bootstrap per episode; you need at least as many episodes as the chain’s length to get a nonzero estimate at the deepest state, and many more to converge numerically).
3. TD would have lower per-step variance. TD’s per-step target is one reward plus one current-value estimate, with variance dominated by the one reward. MC’s return on a long stochastic chain is a sum of many random rewards, with variance growing with chain length. On a deterministic chain MC’s variance is zero (it wins this comparison), but the moment any reward or transition becomes stochastic, MC’s variance explodes while TD’s stays bounded. This is the actual reason TD dominates MC in practice on most real problems.
Flashcards
Section titled “Flashcards”Eight cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page for offline review.
Q. Write the TD(0) update.
V(s_t) <- V(s_t) + alpha * [ r_(t+1) + gamma * V(s_(t+1)) - V(s_t) ]. The bracketed expression is the TD error delta_t.
Q. What is the TD error?
delta_t = r_(t+1) + gamma * V(s_(t+1)) - V(s_t). The difference between the one-step bootstrapped target and the current value estimate of s_t; measures how much one transition’s evidence disagrees with V(s_t).
Q. What does 'bootstrap' mean in TD learning?
The target uses V(s_(t+1)), an estimate, to update V(s_t), another estimate. Reward r_(t+1) is real data, but the future-value part of the target is bootstrapped from the current V being learned.
Q. MC vs TD on the bias-variance axis?
MC: unbiased, high variance (target = full return, no bootstrap). TD(0): biased, low variance (target = one reward + bootstrapped value). Both converge to V^pi; n-step / TD(lambda) interpolate.
Q. Two settings where TD works but MC does not.
(1) Continuing tasks with no termination (MC needs a full return; TD needs one step). (2) Online learning where updates must happen as transitions arrive (MC has to wait until the episode ends).
Q. In the deterministic A->B->C chain at alpha=0.5, why does V(A) climb 0 -> 0.5 -> 1.0 -> 1.375 across episodes?
Because the bootstrap propagates the terminal reward backward one state per episode: B’s value catches up first (its target uses V(C)=0=true); A’s target then uses B’s estimate, which lags. Each pass closes some of the gap, with diminishing per-step changes as estimates approach the truth.
Q. What is the 'deadly triad' in deep RL?
Bootstrapping (TD) + off-policy learning (e.g. Q-learning) + function approximation (e.g. a neural network). The three together can diverge. DQN’s experience replay and target networks are the engineering fixes that tame the triad in practice.
Q. How is TD the foundation of Q-learning and SARSA?
Both are TD applied to action-values Q. SARSA’s target uses the actual next action (on-policy): r + gamma * Q(s’, a’). Q-learning’s target uses the max over actions (off-policy): r + gamma * max_a’ Q(s’, a’). Same TD-error structure on Q instead of V.