Summary: Q-learning: model-free control

Q-learning is the model-free control algorithm: estimate Q^* from samples by combining TD’s bootstrap with the Bellman optimality max, then act greedily. It is value iteration’s update form with the expectation over P replaced by a single sampled transition, and it is the foundation of DQN and modern value-based deep RL. This summary is the scan-in-five-minutes version of the full lesson.

Core ideas

The Q-learning update. Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [ r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') - Q(s_t, a_t) ]. This is a TD bootstrap on Q with a max over next-state actions. It has the same shape as TD(0) for V, with the max over next actions replacing the expectation under pi.
Off-policy. The target reasons about the best next action regardless of what the agent actually takes next. SARSA, the on-policy alternative, uses Q at the actual next action a_(t+1) instead of the max. Q-learning is the workhorse; SARSA is the on-policy sibling.
Off-policy = data efficiency. Q-learning can learn from any transition (s, a, r, s') regardless of which behavior policy produced it. That is why DQN’s experience replay buffer, a store of past transitions collected under older behavior policies, is so effective.
Worked: 5 Q-learning steps on the A/B MDP (alpha = 0.5, gamma = 0.9, Q_0 = 0). After five updates Q = (0.975, 0, 0.05125, 1.225); the greedy policy is pi^* = (A: stay, B: switch) even though Q is nowhere near Q^* = (10, 9.9, 8.9, 11). The policy stabilizes long before Q does: the same early stabilization seen in value iteration, now in the sample setting.
Exploration is required, not optional. Convergence to Q^* requires every (s, a) to be visited infinitely often. Fully greedy behavior can leave some actions untried forever, so Q-learning needs an exploration scheme (typically epsilon-greedy: a random action with probability epsilon, argmax_a Q(s, a) otherwise).
Foundation of DQN. Replace the table with a neural network; minimize the squared TD error against a Q-learning target. Combined naively, the deadly triad (TD bootstrap + off-policy + function approximation) can diverge; DQN’s experience replay and target network tame it. Lesson 9 develops this.
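The core ideas above can be sketched in a few lines of tabular code. This is a minimal illustration, not the lesson's reference implementation: the function names (epsilon_greedy, q_learning_target, sarsa_target, td_update), the state and action labels, and the numbers in the demo are all assumptions made for the example, and the demo transition is not taken from the A/B MDP.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Random action with probability epsilon, else argmax_a Q(state, a)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_learning_target(Q, reward, next_state, actions, gamma):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * Q[(next_state, next_action)]


def td_update(Q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]


if __name__ == "__main__":
    # Hypothetical two-action problem; the values are made up for illustration.
    actions = ["left", "right"]
    Q = defaultdict(float)          # Q_0 = 0 everywhere
    Q[("s1", "right")] = 2.0        # pretend an earlier update set this entry

    # One sampled transition (s, a, r, s') = ("s0", "left", 1.0, "s1").
    q_target = q_learning_target(Q, 1.0, "s1", actions, gamma=0.9)
    s_target = sarsa_target(Q, 1.0, "s1", "left", gamma=0.9)
    print("Q-learning target:", q_target)
    print("SARSA target if the agent takes 'left' next:", s_target)

    td_update(Q, "s0", "left", q_target, alpha=0.5)
    print("Updated Q(s0, left):", Q[("s0", "left")])
    print("Greedy action in s1:", epsilon_greedy(Q, "s1", actions, epsilon=0.0))
```

The only difference between the two targets is which next-state value they bootstrap from, which is the on-policy versus off-policy distinction in code form. Setting epsilon to 0 in epsilon_greedy gives the purely greedy readout; any epsilon above 0 keeps every action reachable.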

What changes for you

You have the canonical model-free control algorithm and an architectural mental model: planning’s Bellman optimality recursion (max over actions), TD’s sample bootstrap, and a greedy-from-Q^* readout, all in one update. The off-policy property is the practical superpower: it lets you learn what is optimal even while behaving exploratively, and learn from any source of transitions. When you read about DQN, Double DQN, Dueling DQN, distributional Q-learning, or Rainbow, the base recipe is this lesson’s one line of math; the rest is engineering on top. The next lesson takes that base into the function-approximation setting, where the deadly triad becomes the engineering problem to solve and Q-learning + neural network = deep RL.