Value-based RL, in brief

Capability gained

Trace Q-iteration on a small MDP, verify against a closed-form analytic solution, and explain why naively swapping the Q-table for a neural network breaks the convergence guarantee (the deadly triad: function approximation + bootstrapping + off-policy data).

Why this lesson exists

Phase 1 of Track 18 built the policy-gradient branch (REINFORCE, actor-critic). Lesson 3 promised a second branch: estimate Q* directly, read off the policy by argmax. This lesson redeems that promise. It is the natural opener for Phase 2 (core deep-RL algorithms) because every value-based method in the phase (DQN, double DQN, dueling, prioritized replay) is engineering on top of the tabular Q-learning skeleton derived here.

The lesson stays tabular and proves the convergence guarantee, then explicitly notes which assumption breaks in the deep setting. Lesson 7 then walks through DQN’s three engineering tricks (replay buffer, target network, double Q-learning) as patches for the specific failures named here.

Source

Berkeley CS285 lecture on Value Function Methods), Sergey Levine, 2023. Cross-references: Sutton & Barto (2018) Chapters 4 (DP), 6 (TD), 11 (deadly triad); Watkins & Dayan (1992) for the original Q-learning convergence; Mnih et al. (2015) for the DQN forward reference.

Phase advance

Phase 2 opener (phase_order: 1). The lesson’s load-bearing transition: from “we have a policy and improve its parameters” (Phase 1, on-policy, π-branch) to “we estimate Q* directly and read the policy off” (Phase 2, off-policy, Q-branch). This lesson formalizes the second branch and exposes the deadly triad, setting up Lesson 7 (DQN) as “engineering patches for the three triad legs.”

Lesson body (lesson.mdx)

Recap of the dispatch table from Lesson 3; this lesson covers the Q branch.
Bellman optimality equation derived from Q^π’s Bellman equation by swapping the policy-expectation for a max. Note the non-linearity.
Tabular Q-iteration with the γ-contraction convergence argument (informal; Banach fixed-point reference).
Worked example: the L3 two-state single-action MDP. Q-iteration converges to V*(s0) = 5.263 from Lesson 3. Dual-path validation (algorithm ↔ closed-form geometric series).
Where the max does work: add a dominated stay action at s0; argmax picks the right action.
Q-learning as the sample-based, model-free version of Q-iteration. Off-policy explanation: target uses max_{a'} regardless of data-generating policy.
Deep Q-learning loss and the deadly triad (function approximation + bootstrapping + off-policy). Each leg explained; any two safe; all three together can diverge. Forward reference to Lesson 7’s three patches.
When to use the Q branch vs the π branch (discrete actions, deterministic optimal policy, off-policy data reuse).
Common pitfalls (Bellman vs Bellman-optimality confusion; on-policy/off-policy mislabeling; max overestimation bias).
“Why this matters when you use AI” section anchors deep Q-learning’s role in AlphaGo, Atari, board-game systems.

Practice (practice.mdx)

Two exercises:

Q-iteration by hand on a fresh MDP with both arithmetic and analytic dual-path validation. Two-state, single-action MDP with rewards (2, 4), γ = 0.5. Closed form: V*(s0) = (2 + 4γ)/(1 − γ²) = 4/0.75 ≈ 5.333, V*(s1) ≈ 6.667. Q-iteration table for 6 iterations from Q_0 = 0. The dual-path check verifies the convergence rate (γ² = 0.25 per pair of iterations) matches the geometric contraction prediction.
Deadly-triad classifier, 5 scenarios. Reader marks FA / BS / OP for each and decides at-risk Y/N. Scenarios designed to make each “any two are safe” case concrete: tabular Q-learning (no FA), MC with FA + OP (no BS), on-policy A2C (no OP), naive deep Q with no fixes (all three, diverges), PPO with limited off-policy reuse (deliberately weakens OP via clipping).

5 flashcards covering: why max makes Bellman optimality non-linear; what makes Q-learning off-policy; the deadly triad legs; the max overestimation bias; the Q-vs-π decision rubric.

Cheatsheet (cheatsheet.mdx)

One-page reference. Bellman optimality equation; algorithm comparison table (Q-iteration vs Q-learning); on-policy vs off-policy property table; deadly triad with the engineering fixes table; key biases (overestimation, bootstrap instability); the L3 worked example reproduced as a memory anchor; decision rubric for picking the Q branch.

Summary (summary.mdx)

5-minute distillation. One-paragraph framing, five things to remember, worked check (the L3 MDP convergence to 5.263; about 100 iterations to bring || Q_k − Q* ||_∞ < 10⁻⁴), where the lesson fits in the track arc.

References (references.mdx)

Primary: Watkins (1989) dissertation + Watkins & Dayan (1992) Q-learning convergence; Sutton & Barto (2018) chapters 3, 4, 6, 11; Bertsekas (2019) chapter 2 for contraction-mapping convergence. Deadly triad sources: Sutton & Barto chapter 11, Tsitsiklis & Van Roy (1997), Baird’s counter-example (1995). Forward references: Mnih et al. (2015) DQN paper, van Hasselt et al. (2016) double DQN. Course source: Berkeley CS285 L7.

Editorial discipline

Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: MDP, RL, BS, FA, OP, MC, TD, GAE, SGD, PPO, SAC, DQN, A2C, A3C, MSE, IMPALA, AlphaGo, IQN.
No vendor naming triggers (CS285 is the course; Sutton, Watkins, Bellman, Mnih are paper authors, not “vendor” context). No security claims.
§6 status: standard pipeline, no triggers. PPO/RLHF forward references properly deferred to Lessons 8 + 13.

Word counts

Lesson 2273
Cheatsheet 711
Practice 1928
Summary 511
Brief 766
References 408

Total ≈ 6597 words across 6 artifacts. Math-heavy band; in line with L3-L5 calibration.

Notes for promotion

Component placeholders (�J0�, �J1�) live as MDX comments in the brief; Lead wires real components at promotion.
Practice imports real �J0� + �J1� components (children API, not front/back props).
Phase 2 boundary check after L12; per cadence-release ruling, no per-lesson holds within Phase 2.