Q-learning, in brief

What you’ll learn

This is lesson 8 of Track 17 (Reinforcement Learning Foundations) and the close of Phase 3 (Model-free learning). The previous two lessons (Monte Carlo and temporal-difference learning) estimated V^pi for a fixed policy from samples; this lesson moves to control, that is, finding pi^* itself. Q-learning is the canonical model-free control algorithm: TD’s sample bootstrap applied to Q, with a max over actions in the target. That single change combines the Bellman optimality recursion from Phase 2 with the sample-based bootstrap of TD. It also forms the foundation of DQN and underlies most value-based deep RL in use today. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.
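For orientation, the update described above can be written out in the standard textbook form. The step size alpha and the discount factor gamma are conventional symbols assumed here; this overview does not name them, and the lesson body may use different notation.

```latex
% Bellman optimality equation for Q (the Phase 2 recursion):
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]

% Q-learning: the same max, applied to one sampled transition (s, a, r, s'):
Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```

The bracketed term in the second line is the TD error; the max inside it is what separates Q-learning from the policy-evaluation updates of the previous two lessons.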

The lesson writes the Q-learning update and distinguishes its off-policy max-target from SARSA’s on-policy target, which uses the action actually taken next. It then walks five Q-learning steps on the A/B MDP from lesson 4, where the greedy policy is already pi^* after five updates even though Q is still far from Q^*. It explains why exploration via epsilon-greedy is required for convergence, and it previews the DQN bridge with the deadly-triad caveat: TD bootstrapping, off-policy learning, and function approximation can diverge when combined naively, a problem DQN addresses with experience replay and target networks.
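The difference between the two targets fits in a few lines. The sketch below is illustrative only: the Q-table, state and action names, reward, and gamma = 0.5 are hypothetical values chosen for clean arithmetic, not the A/B MDP from lesson 4.

```python
GAMMA = 0.5  # assumed discount factor, chosen for easy arithmetic


def q_learning_target(q, reward, next_state, gamma=GAMMA):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(q[next_state].values())


def sarsa_target(q, reward, next_state, next_action, gamma=GAMMA):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * q[next_state][next_action]


# Hypothetical Q-values for a single next state with two actions.
q = {"s1": {"left": 1.0, "right": 3.0}}

# Same transition, but suppose the behavior policy explored and took "left".
ql = q_learning_target(q, reward=1.0, next_state="s1")
sa = sarsa_target(q, reward=1.0, next_state="s1", next_action="left")
print(ql, sa)  # 2.5 1.5
```

The two targets agree whenever the next action happens to be the greedy one; they differ only on exploratory steps, which is where the on-policy versus off-policy distinction shows up.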

Where this fits

This is lesson 8 of 10 and the final lesson of Phase 3. It uses TD’s update mechanism (lesson 7), the Bellman optimality equation (lesson 3), and the value iteration analogy (lesson 5). The next lesson, Function approximation and deep RL, replaces the tabular Q with a neural network, which is exactly the move from Q-learning here to DQN, with the deadly-triad fixes that make the combination stable. The lesson after that closes the track with policy gradient and the bridge to RLHF.

Before you start

Prerequisites: the previous lesson (Temporal-difference learning) for the TD bootstrap; lesson 3 (Value functions and the Bellman equations) for V vs Q and the Bellman optimality equation; lesson 4 (Policy iteration) for the A/B MDP used in the worked example, so that the comparison with the planning case is concrete. Comfort with taking a max over actions and with a TD-style update is the only computational background required.

About the math

The arithmetic is hand-sized: each Q-learning step computes one max over two values, one TD error, and one update. The 5-step worked example shows every target; the practice section repeats the pattern on the L/R MDP from lesson 4’s practice. There are no proofs: convergence to Q^* is stated together with its requirement that every state-action pair be visited infinitely often.
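As a rough picture of what one such step involves, here is a minimal sketch of a single tabular update. The states "x" and "y", the actions "a" and "b", the Q-values, and alpha = gamma = 0.5 are all hypothetical; the function name q_learning_step is likewise just an illustrative choice, and terminal states are not handled.

```python
def q_learning_step(q, state, action, reward, next_state, alpha=0.5, gamma=0.5):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(q[next_state].values())                    # one max over two values
    td_error = reward + gamma * best_next - q[state][action]   # one TD error
    q[state][action] += alpha * td_error                       # one update
    return td_error


# Hypothetical two-state, two-action Q-table (not the lesson's MDP).
q = {
    "x": {"a": 0.0, "b": 0.0},
    "y": {"a": 2.0, "b": 4.0},
}

td = q_learning_step(q, "x", "a", reward=1.0, next_state="y")
print(td, q["x"]["a"])  # 3.0 1.5

# Reading off the greedy policy: the highest-valued action in each state.
greedy = {s: max(actions, key=actions.get) for s, actions in q.items()}
print(greedy)  # {'x': 'a', 'y': 'b'}
```

Here the target is 1.0 + 0.5 * 4.0 = 3.0, the TD error is 3.0 - 0.0 = 3.0, and the entry moves halfway toward the target, to 1.5.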

By the end, you’ll be able to

Write the Q-learning update and explain how it combines TD’s bootstrap with the Bellman optimality max
Distinguish on-policy (SARSA) from off-policy (Q-learning) by which action enters the target
Run Q-learning through several sample transitions on a small MDP and read off the greedy policy
Explain why exploration (e.g. epsilon-greedy) is required for Q-learning to converge to Q^*
Recognize Q-learning as the foundation of DQN and modern value-based deep RL, including the deadly-triad caveat
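The exploration objective above refers to epsilon-greedy action selection. A minimal sketch follows, assuming a Q-table row stored as a dict from action to value; the function name, the action names, and epsilon = 0.1 are illustrative choices, not values taken from the lesson.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    return max(q_row, key=q_row.get)


# Hypothetical Q-values for one state.
q_row = {"left": 0.2, "right": 0.7}

rng = random.Random(0)  # seeded only so the run is repeatable
picks = [epsilon_greedy(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("left"), picks.count("right"))
```

With epsilon = 0 the lower-valued action would never be chosen, so its Q-value would never be updated; a small positive epsilon keeps every action in play, which is what the infinite-visit condition for convergence needs.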

Time and difficulty

Read time: about 13 minutes
Practice time: about 16 minutes (a self-check, a 5-step Q-learning trace on a fresh small MDP, a Q-learning-vs-SARSA target-computation drill, and flashcards)
Difficulty: standard (small arithmetic per step; the conceptual challenge is the off-policy property and why exploration is required even though the target is greedy)