Summary: Value-based RL (Q-learning, off-policy, the deadly triad)

The one-paragraph version

Value-based RL skips the policy parameterization and learns the optimal action-value function Q*(s, a) directly. The optimal policy falls out as π*(s) = argmax_a Q*(s, a). The defining equation is the Bellman optimality equation, Q*(s, a) = R(s, a) + γ · E_{s'}[max_{a'} Q*(s', a')]. The max makes it non-linear in Q*, so in general there’s no closed-form solution, only iterative ones. Tabular Q-iteration applies the Bellman operator repeatedly until convergence; the error shrinks geometrically, by a factor of γ per iteration. The model-free, sample-based version is Q-learning: Q(s, a) ← Q(s, a) + α · (r + γ · max_{a'} Q(s', a') − Q(s, a)). Because the target uses max_{a'} regardless of which policy generated the data, Q-learning is off-policy by construction. Deep Q-learning replaces the table with a neural network and runs into the deadly triad: function approximation + bootstrapping + off-policy data can diverge when all three are combined, even though any two of them can be combined without instability. Lesson 7 covers the engineering (replay buffer, target network, double Q-learning) that tames each leg.
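As a minimal sketch of the Q-learning update in the paragraph above: the environment function chain_step, the 3-state chain it describes, the fixed start state, and all hyperparameter values are assumptions made for illustration, not part of the lesson. The point to notice is that the behaviour policy is epsilon-greedy while the target always takes the max over next actions.

```python
import random

def q_learning(step, n_states, n_actions, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=50, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `step(s, a)` is a hypothetical environment function returning
    (reward, next_state, done).
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # assumed fixed start state
        for _ in range(max_steps):
            # Behaviour policy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s][b])
            r, s_next, done = step(s, a)
            # The target uses the max over next actions, regardless of
            # what the behaviour policy will actually do next.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            if done:
                break
            s = s_next
    return Q

def chain_step(s, a):
    """Assumed toy chain 0 -> 1 -> 2: action 1 moves right, action 0 stays.
    Entering state 2 ends the episode with reward 1; everything else pays 0."""
    if a == 1:
        s_next = s + 1
        if s_next == 2:
            return 1.0, s_next, True
        return 0.0, s_next, False
    return 0.0, s, False

Q = q_learning(chain_step, n_states=3, n_actions=2)
# Read the policy off the table by argmax, for the two non-terminal states.
greedy = [max(range(2), key=lambda b: Q[s][b]) for s in range(2)]
print(greedy, [[round(q, 3) for q in row] for row in Q])
```

In this toy the learned greedy policy should move right in both non-terminal states, even though the data was collected with random exploration mixed in.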

Five things to remember

Q* encodes both values and the optimal action. π*(s) = argmax_a Q*(s, a). No separate policy network needed.
The Bellman optimality equation is non-linear because of max. You solve it iteratively. The Bellman operator is a γ-contraction in sup-norm, which gives geometric convergence.
Q-learning is off-policy because the target uses the greedy policy (max_{a'}), not the data-collecting policy. This is the source of its sample-efficiency advantage, and it is also one leg of the deadly triad.
The deadly triad has three legs: function approximation, bootstrapping, off-policy data. Any two can be combined without instability. All three together can blow up.
Pick the Q branch when actions are discrete and few, and when you want to reuse data via a replay buffer. Pick the π branch when actions are continuous or the optimal policy needs to be stochastic.
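The contraction claim and the argmax read-out in the list above can be checked numerically. The sketch below assumes a small random MDP (5 states, 3 actions, Gaussian rewards, random transition probabilities); none of these choices come from the lesson, and the helper names are illustrative.

```python
import random

def bellman_optimality(Q, R, P, gamma):
    """(TQ)(s, a) = R[s][a] + gamma * sum_t P[s][a][t] * max_b Q[t][b]."""
    v = [max(row) for row in Q]  # greedy value of every possible next state
    return [[R[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))
             for a in range(len(R[s]))]
            for s in range(len(R))]

def sup_dist(Q1, Q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))

rng = random.Random(0)
S, A, gamma = 5, 3, 0.9  # assumed sizes and discount

def random_table():
    return [[rng.gauss(0.0, 1.0) for _ in range(A)] for _ in range(S)]

def random_distribution():
    w = [rng.random() + 1e-3 for _ in range(S)]
    total = sum(w)
    return [x / total for x in w]

R = random_table()
P = [[random_distribution() for _ in range(A)] for _ in range(S)]

# Contraction: one application of T shrinks the sup-norm gap by at least gamma.
Q1, Q2 = random_table(), random_table()
before = sup_dist(Q1, Q2)
after = sup_dist(bellman_optimality(Q1, R, P, gamma),
                 bellman_optimality(Q2, R, P, gamma))
contracts = after <= gamma * before + 1e-12

# Q-iteration: apply T repeatedly, then read off the greedy policy.
Q = [[0.0] * A for _ in range(S)]
for _ in range(500):
    Q = bellman_optimality(Q, R, P, gamma)
residual = sup_dist(Q, bellman_optimality(Q, R, P, gamma))
policy = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
print(contracts, residual, policy)
```

A single random pair of tables is only a spot check, not a proof, but it is a cheap test to run against your own Bellman backup.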

Why this matters

Deep Q-networks were the breakthrough result that put deep RL on the map (Mnih et al., Nature 2015: a single architecture trained from raw pixels on 49 Atari games, reaching human-level scores on many of them). Every subsequent value-based deep-RL algorithm (Double DQN, Dueling DQN, Rainbow, IQN) is a refinement of the recipe you’ll meet in Lesson 7. AlphaGo’s value network was a state-value (V-style) estimate of board win probability. The dispatch table from Lesson 3 predicts when this branch is the right tool: discrete actions, deterministic optimal policy, replay buffer or demonstrations available.

Deep-RL courses typically teach tabular Q-learning before DQN because the convergence proof is clean in the tabular case, and that clarity tells you exactly what each DQN engineering trick is buying you. Without that foundation, DQN’s replay-buffer / target-network / double-Q stack looks like unmotivated heuristics.

Worked check (memory anchor)

On the Lesson 3 two-state loop (s0 → s1 reward 1, s1 → s0 reward 0, single action, γ = 0.9), the analytic answer is V*(s0) = 1/(1 − γ²) ≈ 5.263. With a single action per state, Q* and V* coincide, so this is also Q*(s0). Synchronous Q-iteration from Q_0 = 0 gives Q_8(s0) ≈ 2.998, leaves Q_{50}(s0) still about 0.027 short of 5.263, and needs about 100 iterations to bring || Q_k − Q* ||_∞ below 10⁻⁴. The error shrinks by γ² = 0.81 per pair of iterations (it’s a 2-step loop). Dual-path validation: once converged, the iterative algorithm and the closed-form geometric series agree to the digit. If your implementation does not match these numbers, you have a bug.
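The check above is short enough to script. This is a sketch under the stated setup (synchronous backups, Q_0 = 0, one action per state so Q is indexed by state only); the dictionary layout and function names are illustrative choices.

```python
gamma = 0.9
reward = {"s0": 1.0, "s1": 0.0}          # reward for leaving each state
next_state = {"s0": "s1", "s1": "s0"}    # deterministic two-state loop

# Closed form: geometric series with ratio gamma**2.
q_star = {"s0": 1.0 / (1.0 - gamma ** 2)}
q_star["s1"] = gamma * q_star["s0"]

def q_iteration(k):
    """Apply k synchronous Bellman backups starting from Q_0 = 0."""
    Q = {"s0": 0.0, "s1": 0.0}
    for _ in range(k):
        # With a single action the max over actions is trivial.
        Q = {s: reward[s] + gamma * Q[next_state[s]] for s in Q}
    return Q

def sup_error(k):
    Q = q_iteration(k)
    return max(abs(Q[s] - q_star[s]) for s in Q)

q8 = q_iteration(8)["s0"]
gap50 = q_star["s0"] - q_iteration(50)["s0"]
first_k = next(k for k in range(1, 1000) if sup_error(k) < 1e-4)
print(round(q_star["s0"], 3), round(q8, 3), round(gap50, 3), first_k)
```

If the printed values disagree with the numbers quoted above, the most likely culprits are in-place (asynchronous) updates or an off-by-one in the iteration count.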

Where this fits

Previous (Phase 1): Policy-gradient branch (REINFORCE, actor-critic). On-policy methods that nudge π_θ by gradient.
This lesson: Value branch. Learn Q* directly, read off the policy by argmax.
Next (Lesson 7): DQN. Tame the deadly triad with replay buffer, target network, double Q. Atari benchmark, the proof that deep value-based RL works.
Later (Lesson 8): PPO. Best-of-both: policy gradient with limited off-policy reuse, the workhorse of modern RLHF.

What you should remember

Q-learning replaces the policy parameterization with a value function. The Bellman optimality equation is the contract, the max is what makes both the math and the engineering harder, and the deadly triad is the failure mode you need to recognize. Everything in Lesson 7 is engineering to make this idea work at scale.
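To help recognize the failure mode, here is a minimal sketch of the three legs interacting. It is modelled on a classic two-state textbook counterexample and is not taken from this lesson: it uses state values with one shared weight rather than Q-values, and the features, step size and step count are assumptions. State A has feature 1 and state B has feature 2, so V(A) = w and V(B) = 2w; all rewards are 0, so the correct weight is 0.

```python
def run(update_second_state, steps=100, alpha=0.1, gamma=0.9, w=1.0):
    """Semi-gradient TD(0) with a single shared weight w.

    Transitions (all rewards 0): A -> B, then B -> terminal.
    Function approximation: V(A) = 1 * w, V(B) = 2 * w.
    """
    for _ in range(steps):
        # Bootstrapped update on A -> B: the target gamma * V(B) depends on w.
        w += alpha * (gamma * 2 * w - 1 * w) * 1
        if update_second_state:
            # Update on B -> terminal: target is 0, feature is 2.
            w += alpha * (0.0 - 2 * w) * 2
    return w

# Off-policy data distribution: only the A -> B transition is ever updated.
w_off = run(update_second_state=False)
# On-policy data distribution: both transitions are updated as experienced.
w_on = run(update_second_state=True)
print(w_off, w_on)
```

With only the A → B transition in the data, each update multiplies w by 1 + α(2γ − 1), which exceeds 1 whenever γ > 0.5, so w grows without bound. Restoring the second transition removes the off-policy leg and the same update rule drives w toward 0.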