References: Q-learning: model-free control

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 5:
Model-Free Control (Q-learning, SARSA)
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the Q-learning portion of Silver's
Lecture 5 and restates it in Clawdemy's voice, with a 5-step worked trace on
the same A/B MDP used by lesson 4 (so the relationship between planning and
learning is concrete) and a practical contrast between off-policy Q-learning
and on-policy SARSA. Three elements are Clawdemy framing: the "Q-learning IS
value iteration's update with the expectation replaced by a sample" view, the
explicit early stabilization of the greedy policy after just 5 sample
transitions, and the closing bridge from the deadly triad to DQN. Exact
per-lecture URLs are verified at promotion.

A short, durable list. Both are free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 6 (Temporal-Difference Learning). Sections 6.4-6.5 cover SARSA and Q-learning, with a worked cliff-walking example that makes the on-policy / off-policy contrast vivid. Chapter 11 (Off-policy Methods with Approximation) covers the deadly triad, and Section 16.5 discusses DQN and the fixes it applies.
  • Mnih et al., “Human-level control through deep reinforcement learning” (Nature, 2015) — the DQN paper. Q-learning + a deep neural network + experience replay + target network = the algorithm that learned to play Atari. The applied bridge from this lesson to lesson 9.

Where this leads inside this track.

  • Temporal-difference learning. The previous lesson. Q-learning is TD on Q with a max-over-actions in the target — this lesson reuses TD’s mechanism with one extra ingredient.
  • Value iteration. Lesson 5. Q-learning is VI’s update form with the expectation over P replaced by a single sampled transition. Planning becomes learning by swapping the model for a sample.
  • Function approximation and deep RL. The next lesson. Replace the tabular Q with a neural network; minimize the squared Bellman residual. The deadly triad named here becomes the engineering problem DQN solves.
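
The value-iteration bullet above can be made concrete with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the two-state table, the function names `vi_backup` and `q_learning_update`, the `model` layout, and the numbers. It is not the A/B MDP from the lessons. The planning backup averages over every outcome the model lists, while the Q-learning update uses one observed transition and moves the estimate toward that sampled target by a step size `alpha`.

```python
from typing import Dict, List, Tuple

QTable = Dict[str, Dict[str, float]]
# Hypothetical model layout: (state, action) -> list of (prob, reward, next_state)
Model = Dict[Tuple[str, str], List[Tuple[float, float, str]]]


def vi_backup(Q: QTable, model: Model, s: str, a: str, gamma: float) -> float:
    """Planning: expectation over all next states listed in the model."""
    return sum(
        p * (r + gamma * max(Q[s2].values()))
        for p, r, s2 in model[(s, a)]
    )


def q_learning_update(Q: QTable, s: str, a: str, r: float, s2: str,
                      alpha: float, gamma: float) -> float:
    """Learning: the same target, built from one sampled transition."""
    target = r + gamma * max(Q[s2].values())   # max over actions in s2
    Q[s][a] += alpha * (target - Q[s][a])      # step toward the sampled target
    return Q[s][a]


# Illustrative two-state table (assumed values, not from the lesson).
Q: QTable = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"stay": 1.0, "go": 0.0},
}
# A deterministic transition: taking "go" in A yields reward 2.0 and lands in B.
model: Model = {("A", "go"): [(1.0, 2.0, "B")]}

planned = vi_backup(Q, model, "A", "go", gamma=0.9)          # 2.0 + 0.9 * 1.0
learned = q_learning_update(Q, "A", "go", r=2.0, s2="B",
                            alpha=1.0, gamma=0.9)
```

With a deterministic transition and `alpha=1.0`, the sampled update lands on the same value as the planning backup. With stochastic transitions a single sample is noisy, which is why a smaller `alpha` is typically used so that the estimate averages over many samples.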