References: Q-learning: model-free control
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 5: Model-Free Control (Q-learning, SARSA)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0

Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the Q-learning portion of Silver's Lecture 5 and restates it in Clawdemy's voice with a 5-step worked trace on the same A/B MDP used by lesson 4 (so the relationship between planning and learning is concrete), plus an off-policy-vs-on-policy SARSA contrast in practice. The "Q-learning IS value iteration's update with the expectation replaced by a sample" framing, the explicit early-stabilization of the greedy policy after just 5 sample transitions, and the deadly-triad + DQN bridge as the close are Clawdemy framing. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lecture 5: Model-Free Control. The lecture this lesson mirrors, with SARSA and Q-learning developed together so the on-policy vs off-policy contrast is direct. CC BY-NC 4.0, freely available.
Going deeper
A short, durable list. Both are free.
- Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 6 (Temporal-Difference Learning): Sections 6.4-6.5 cover SARSA and Q-learning, with a worked cliff-walking example that makes the on-policy / off-policy contrast vivid; Chapter 11 covers the deadly triad, and Section 16.5 discusses DQN as a case study.
- Mnih et al., “Human-level control through deep reinforcement learning” (Nature, 2015): the DQN paper. Q-learning + a deep neural network + experience replay + a target network = the algorithm that learned to play Atari. The applied bridge from this lesson to lesson 9.
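The on-policy / off-policy contrast that these readings develop comes down to which next-state value the update bootstraps from. The following is a minimal Python sketch of the two targets. The two-state table, the action names, and the discount factor are hypothetical values chosen for illustration; they are not taken from either reference or from the lesson's A/B MDP.

```python
GAMMA = 0.9  # discount factor (assumed value, for illustration only)


def sarsa_target(Q, r, s_next, a_next):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + GAMMA * Q[s_next][a_next]


def q_learning_target(Q, r, s_next):
    """Off-policy target: bootstrap from the greedy action in s_next,
    regardless of which action the behaviour policy takes there."""
    return r + GAMMA * max(Q[s_next].values())


# Hypothetical Q-table: two states, two actions each.
Q = {
    "A": {"left": 0.0, "right": 1.0},
    "B": {"left": 2.0, "right": 5.0},
}

r, s_next = 1.0, "B"
a_next = "left"  # an exploratory, non-greedy next action

sarsa = sarsa_target(Q, r, s_next, a_next)  # uses Q["B"]["left"]
qlearn = q_learning_target(Q, r, s_next)    # uses max over Q["B"]
```

When the next action is exploratory, the two targets differ: SARSA's target reflects the exploratory choice, while the Q-learning target ignores it. When the next action happens to be greedy, the two targets coincide.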
Adjacent topics
Where this leads inside this track.
- Temporal-difference learning. The previous lesson. Q-learning is TD on Q with a max-over-actions in the target — this lesson reuses TD’s mechanism with one extra ingredient.
- Value iteration. Lesson 5. Q-learning is VI’s update form with the expectation over P replaced by a single sampled transition. Planning becomes learning by swapping the model for a sample.
- Function approximation and deep RL. The next lesson. Replace the tabular Q with a neural network; minimize the squared Bellman residual. The deadly triad named here becomes the engineering problem DQN solves.
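The first two framings in the list above (Q-learning as TD with a max in the target, and as value iteration's backup with the expectation replaced by a sample) can be sketched side by side. This is a rough illustration under stated assumptions, not the lesson's worked trace: the two-state model `P`, the action names, the step size, and the uniformly random behaviour policy are all hypothetical.

```python
import random

GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.1   # step size (assumed)
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hypothetical model: P[(s, a)] is a list of (probability, next_state, reward).
P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}


def new_q():
    return {(s, a): 0.0 for s in STATES for a in ACTIONS}


def max_q(Q, s):
    return max(Q[(s, a)] for a in ACTIONS)


def value_iteration_backup(Q, s, a):
    """Planning: full expectation over the known model P."""
    return sum(p * (r + GAMMA * max_q(Q, s2)) for p, s2, r in P[(s, a)])


def q_learning_update(Q, s, a, r, s2):
    """Learning: the same target, built from one sampled (r, s2); P is never read."""
    target = r + GAMMA * max_q(Q, s2)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])


def sample_step(rng, s, a):
    """Environment simulator; the learner only sees the returned sample."""
    outcomes = P[(s, a)]
    weights = [p for p, _, _ in outcomes]
    _, s2, r = rng.choices(outcomes, weights=weights, k=1)[0]
    return r, s2


def greedy(Q):
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}


# Planning: repeated synchronous sweeps of the expected backup.
Q_plan = new_q()
for _ in range(200):
    Q_plan = {(s, a): value_iteration_backup(Q_plan, s, a)
              for s in STATES for a in ACTIONS}

# Learning: sampled updates while acting uniformly at random (off-policy).
rng = random.Random(0)
Q_learn = new_q()
s = "A"
for _ in range(20000):
    a = rng.choice(ACTIONS)
    r, s2 = sample_step(rng, s, a)
    q_learning_update(Q_learn, s, a, r, s2)
    s = s2

print(greedy(Q_plan), greedy(Q_learn))
```

The only difference between the two update functions is where the next state and reward come from: a sum over `P` in one, a single draw in the other. With a constant step size the learned values keep fluctuating around the planned ones rather than settling exactly, but in this small hypothetical model the greedy policies agree.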