Brief: Offline RL algorithms (BCQ, CQL, IQL)
What you will learn
You will name the three offline-RL algorithm families that fix the L14 failure (BCQ, CQL, IQL), describe the mechanism each uses (action-set constraint via a VAE; conservative Q penalty; expectile regression sidestepping the max), and walk through a decision rubric for picking the right algorithm given a dataset’s structure (single-modal behavior policy versus heterogeneous mixture, discrete versus continuous actions, regulated versus exploratory deployment context). You will trace how each algorithm prevents the L14 divergence on the same two-state MDP, recognize when behavioral cloning is the right baseline to benchmark against, and leave with a working understanding of when to reach for which algorithm in a production offline-RL pipeline.
Where this fits
This is lesson 15 of Track 18 (Deep Reinforcement Learning), lesson 3 of Phase 3 (rl-frontiers). It completes the offline-RL pair (L14 problem definition; L15 algorithms). It builds on L7 DQN (the off-policy Q-learning machinery), L14 (the failure mode the L15 algorithms address), L2 imitation learning (BC as universal baseline), and L13 RLHF (KL regularization as a structural parallel to BCQ’s action constraint).
Source
Berkeley CS285 (Sergey Levine, Fall 2023), lecture on Offline RL: Algorithms. Canonical URL: http://rail.eecs.berkeley.edu/deeprlcourse/. Primary algorithm papers: Fujimoto, Meger, and Precup (2019) BCQ; Kumar, Zhou, Tucker, and Levine (2020) CQL; Kostrikov, Nair, and Levine (2021) IQL. Survey: Levine, Kumar, Tucker, and Fu (2020).
Phase advance
Phase 3 lesson 3 (phase_order: 3). L15 is followed by L16 (Exploration), L17 (Multi-task and meta-RL), and L18 (Open problems; closes Phase 3 and Track 18).
Lesson body (lesson.mdx)
- Hook: L14 named the failure; three families fix it by different mechanisms.
- Shared design principle: prevent the Bellman update from querying Q at OOD actions.
- BCQ: VAE plus perturbation plus Q; action constraint; deployment loop (sample, perturb, max). When BCQ works (single-modal behavior policy); when it degrades (heterogeneous datasets).
- CQL: conservative penalty; loss decomposition; the trained Q as provable lower bound; alpha as the tuning lever. When CQL is the right choice (heterogeneous datasets, regulated settings wanting an explicit bound).
- IQL: expectile regression on dataset actions only; V(s) as max-over-in-distribution-actions surrogate; Bellman target without max; advantage-weighted imitation policy; cleanest tuning surface.
- Worked decision example on the L14 two-state MDP: all three recover the optimal policy; the differences emerge on heterogeneous benchmarks.
- Algorithmic decision rubric table.
- Why this matters operationally: healthcare, recommender, robotics, language-model RLHF parallel.
- Common pitfalls (5): treating them as interchangeable; ignoring BCQ’s single-modal assumption; under-tuning alpha in CQL; over-trusting IQL defaults; skipping BC sanity check.
- “What you should remember” (5 bullets).
- L16 setup: exploration as the offline-RL opposite.
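The IQL bullet above leans on expectile regression, which is easy to misread without seeing the asymmetric loss. The following is a minimal sketch, not the lesson's or the paper's code: the function names, the three Q-values, and the learning rate are illustrative assumptions. It fits a single scalar V to the Q-values of actions that appear in the dataset, so no out-of-distribution action is ever queried.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss: weight tau when diff >= 0, else (1 - tau)."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(q_values, tau, lr=0.1, steps=2000):
    """Find the scalar v that minimizes the mean expectile loss of (q - v)."""
    v = sum(q_values) / len(q_values)  # start from the plain mean
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            diff = q - v
            weight = tau if diff >= 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of weight * (q - v)^2
        v -= lr * grad / len(q_values)
    return v


# Hypothetical Q-values of the actions the dataset contains at one state.
q_in_data = [0.0, 1.0, 2.0]
v_mean = fit_expectile(q_in_data, tau=0.5)  # symmetric loss recovers the mean
v_high = fit_expectile(q_in_data, tau=0.9)  # leans toward the best in-data action
print(v_mean, v_high)
```

With tau at 0.5 the fit is the ordinary mean; raising tau toward 1 pushes V toward the largest Q among dataset actions without ever exceeding it, which is the sense in which V(s) stands in for a max over in-distribution actions.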
Practice (practice.mdx)
Two exercises plus five flashcards.
- Pick the algorithm (4 datasets): single-modal AV operator, heterogeneous medical claims, continuous robotic-arm, deadline-pressured D4RL first attempt. Pick BCQ, CQL, or IQL with 2-sentence justification.
- Walk-through on the two-state MDP: for each of the three algorithms, describe in 3-5 sentences how it prevents the L14 divergence at state s2.
Five flashcards: shared design principle; BCQ training + deployment; CQL loss + bound; IQL three-loss training; when to pick BCQ.
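For the walk-through exercise, it can help to see the shared design principle in isolation before applying it per algorithm. The sketch below is a hypothetical one-parameter stand-in for the backup at s2, not the L14 MDP itself (this brief does not reproduce it). The feature values in `PHI`, the reward, and the step size are assumptions chosen so that a linear Q extrapolates higher at an action the dataset never contains.

```python
GAMMA = 0.9
REWARD = 1.0  # assumed reward for the dataset action, which loops back to s2

# Hypothetical linear Q at s2: Q(s2, a) = w * PHI[a].
# "ood" never appears in the dataset; its feature makes Q extrapolate higher.
PHI = {"data": 1.0, "ood": 1.2}


def q(w, action):
    return w * PHI[action]


def run_backups(w, allowed_actions, steps=200, lr=0.5):
    """Regress Q(s2, data) toward r + gamma * max over the allowed actions."""
    for _ in range(steps):
        target = REWARD + GAMMA * max(q(w, a) for a in allowed_actions)
        # Semi-gradient step on 0.5 * (Q(s2, data) - target)^2.
        w += lr * (target - q(w, "data")) * PHI["data"]
    return w


w_naive = run_backups(1.0, ["data", "ood"])  # max may pick the unseen action
w_in_sample = run_backups(1.0, ["data"])     # max restricted to dataset actions
print(w_naive, w_in_sample)
```

In this toy setting the true value of the dataset action is 1 / (1 - 0.9) = 10. The in-sample backup settles near that value, while the unrestricted max keeps chasing its own extrapolated estimate and grows without bound. BCQ, CQL, and IQL each block that feedback path by a different mechanism.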
Cheatsheet (cheatsheet.mdx)
Tables. Three algorithms side by side (paper, mechanism, networks). What each prevents (max-OOD, extrapolation, Bellman amplification). Decision rubric. CQL loss decomposition. IQL three-loss training. BCQ deployment loop. Common pitfalls. Remember-bullets.
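The CQL loss decomposition listed above can be illustrated for a single state with discrete actions. This is a rough sketch under stated assumptions: it follows the log-sum-exp form of the conservative penalty from Kumar et al. (2020), while the function names, the example Q-values, and the fixed TD target are made up for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(q_row, data_action, td_target, alpha):
    """Return (total, penalty, bellman) for one state with discrete actions.

    penalty = logsumexp_a Q(s, a) - Q(s, a_data): pushes Q down everywhere,
              and back up at the action the dataset actually took.
    bellman = 0.5 * (Q(s, a_data) - td_target)^2: the usual TD regression term.
    """
    penalty = logsumexp(q_row) - q_row[data_action]
    bellman = 0.5 * (q_row[data_action] - td_target) ** 2
    return alpha * penalty + bellman, penalty, bellman


# Hypothetical Q rows: action 0 is in the dataset, action 1 is not.
_, penalty_flat, _ = cql_loss([1.0, 1.0], data_action=0, td_target=1.0, alpha=1.0)
_, penalty_ood, _ = cql_loss([1.0, 5.0], data_action=0, td_target=1.0, alpha=1.0)
print(penalty_flat, penalty_ood)
```

The penalty is never negative and grows when an unseen action is overestimated, so minimizing it drags that estimate down. Alpha scales the penalty against the Bellman term, which is why it is the tuning lever.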
References (references.mdx)
Primary source: CS285. Primary algorithm papers: Fujimoto et al. 2019 BCQ, Kumar et al. 2020 CQL, Kostrikov et al. 2021 IQL. Comparison studies: Fu et al. 2020 D4RL, Brandfonbrener et al. 2021, Kumar et al. 2022 BC-vs-offline-RL. Precursors: Kumar et al. 2019 BEAR, Wu et al. 2019 BRAC, Peng et al. 2019 AWR. Survey: Levine et al. 2020 tutorial review. Production deployments: Komorowski et al. 2018 sepsis, Chen et al. 2019 YouTube recommender. RLHF connection: Ouyang et al. 2022 InstructGPT.
Editorial discipline
- Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0 after fix), Greek letters in prose spelled out (alpha, beta, gamma, tau; symbols only in fenced display blocks), placeholder comments present on brief.
- §6 watch-zone: technical algorithm content; no policy or vendor-advocacy framing. The InstructGPT citation in references is for the structural KL-regularization parallel, not for any vendor positioning.
- Vendor naming: Anthropic, DeepMind, OpenAI, Google named only as paper-author affiliations (positive citations); MIT-affiliated authors on the sepsis paper; no anonymization triggers.
- A1 verbatim discipline: no vendor quotations.
Word counts
- Lesson 2274
- Practice 1465
- Summary 615
- Cheatsheet 775
- References 615
- Brief 935
Total ≈ 6679 words across 6 artifacts.
Notes for promotion
- Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285 “Offline RL Algorithms”.
- Practice uses real �J0� + �J1� component imports.
- L14 prereq path: lessons/deep-reinforcement-learning/offline-rl-problem. L7 DQN: lessons/deep-reinforcement-learning/dqn. L2 BC: lessons/deep-reinforcement-learning/imitation-learning. L13 RLHF: lessons/deep-reinforcement-learning/rlhf.
- Lesson body uses fenced display blocks for the CQL loss, IQL three-loss training, BCQ deployment loop. Greek symbols in those fenced blocks; prose spells alpha / beta / gamma / tau.
- L15 closes the offline-RL pair. L16 pivots to exploration (the agent CAN act but reward is sparse).