Offline RL algorithms: brief

What you will learn

You will name the three offline-RL algorithm families that fix the L14 failure (BCQ, CQL, IQL) and describe the mechanism each uses: an action-set constraint via a VAE (BCQ), a conservative Q penalty (CQL), and expectile regression that sidesteps the max (IQL). You will walk through a decision rubric for picking an algorithm given a dataset’s structure: single-modal behavior policy versus heterogeneous mixture, discrete versus continuous actions, regulated versus exploratory deployment context. You will trace how each algorithm prevents the L14 divergence on the same two-state MDP, recognize when behavioral cloning is the right baseline to benchmark against, and leave with a working understanding of which algorithm to reach for in a production offline-RL pipeline.

Where this fits

This is lesson 15 of Track 18 (Deep Reinforcement Learning), lesson 3 of Phase 3 (rl-frontiers). It completes the offline-RL pair (L14 problem definition; L15 algorithms). It builds on L7 DQN (the off-policy Q-learning machinery), L14 (the failure mode the L15 algorithms address), L2 imitation learning (BC as universal baseline), and L13 RLHF (KL regularization as a structural parallel to BCQ’s action constraint).

Source

Berkeley CS285 (Sergey Levine, Fall 2023), lecture on Offline RL: Algorithms. Canonical URL http://rail.eecs.berkeley.edu/deeprlcourse/. Primary algorithm papers: Fujimoto, Meger, and Precup (2019) BCQ; Kumar, Zhou, Tucker, and Levine (2020) CQL; Kostrikov, Nair, and Levine (2021) IQL. Survey: Levine, Kumar, Tucker, and Fu (2020).

Phase advance

Phase 3 lesson 3 (phase_order: 3). After L15 come L16 (Exploration), L17 (Multi-task and meta-RL), and L18 (Open problems, which closes Phase 3 and Track 18).

Lesson body (lesson.mdx)

Hook: L14 named the failure; three families fix it by different mechanisms.
Shared design principle: prevent the Bellman update from querying Q at OOD actions.
BCQ: VAE generative model plus perturbation network plus Q-network; action constraint; deployment loop (sample, perturb, max). When BCQ works (single-modal behavior policy); when it degrades (heterogeneous datasets).
CQL: conservative penalty; loss decomposition; the trained Q as a provable lower bound on the true policy value; alpha as the tuning lever. When CQL is the right choice (heterogeneous datasets, regulated settings wanting an explicit bound).
IQL: expectile regression on dataset actions only; V(s) as max-over-in-distribution-actions surrogate; Bellman target without max; advantage-weighted imitation policy; cleanest tuning surface.
Worked decision example on the L14 two-state MDP: all three recover the optimal policy; the differences emerge on heterogeneous benchmarks.
Algorithmic decision rubric table.
Why this matters operationally: healthcare, recommender, robotics, language-model RLHF parallel.
Common pitfalls (5): treating them as interchangeable; ignoring BCQ’s single-modal assumption; under-tuning alpha in CQL; over-trusting IQL defaults; skipping BC sanity check.
“What you should remember” (5 bullets).
L16 setup: exploration as the offline-RL opposite.
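
For reference while drafting the fenced display blocks, the CQL loss decomposition and the IQL three-loss training listed above can be written roughly as follows. This is a sketch in the notation of the original papers as best recalled, not the final lesson text: the symbol names (theta, psi, phi for the Q, V, and policy parameters; a bar for target-network parameters; D for the dataset) are assumptions, and the CQL target is shown in its Q-learning form.

```latex
% CQL: conservative penalty (weighted by \alpha) plus the standard Bellman error.
% Q-learning form of the target; actor-critic variants back up under the learned policy.
\mathcal{L}_{\mathrm{CQL}}(\theta)
  = \alpha \, \mathbb{E}_{s \sim \mathcal{D}} \Big[ \log \sum_{a} \exp Q_\theta(s,a)
      - \mathbb{E}_{a \sim \mathcal{D}(\cdot \mid s)} \big[ Q_\theta(s,a) \big] \Big]
  + \tfrac{1}{2} \, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}
      \Big[ \big( Q_\theta(s,a) - r - \gamma \max_{a'} Q_{\bar\theta}(s',a') \big)^2 \Big]

% IQL: three losses, none of which evaluates Q at an action outside the dataset.
L_2^{\tau}(u) = \big| \tau - \mathbf{1}(u < 0) \big| \, u^2
\mathcal{L}_V(\psi)   = \mathbb{E}_{(s,a) \sim \mathcal{D}}
      \big[ L_2^{\tau}\big( Q_{\bar\theta}(s,a) - V_\psi(s) \big) \big]
\mathcal{L}_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}
      \big[ \big( r + \gamma V_\psi(s') - Q_\theta(s,a) \big)^2 \big]
\mathcal{L}_\pi(\phi) = - \mathbb{E}_{(s,a) \sim \mathcal{D}}
      \big[ \exp\big( \beta \, ( Q_{\bar\theta}(s,a) - V_\psi(s) ) \big) \log \pi_\phi(a \mid s) \big]
```

The first CQL term pushes Q down on all actions and back up on dataset actions, which is where the conservatism comes from. In IQL the value loss replaces the max over next actions with an upper expectile over dataset actions, so the Q target needs only V at the next state.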

Practice (practice.mdx)

Two exercises plus five flashcards.

Pick the algorithm (4 datasets): single-modal AV operator, heterogeneous medical claims, continuous robotic-arm, deadline-pressured D4RL first attempt. For each, pick BCQ, CQL, or IQL with a two-sentence justification.
Walk-through on the two-state MDP: for each of the three algorithms, describe in 3-5 sentences how it prevents the L14 divergence at state s2.
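
A small tabular sketch may help when drafting the walk-through solution. The L14 two-state MDP is not reproduced in this brief, so the dataset below is a hypothetical stand-in: three logged transitions, with action a1 at state s2 never observed. A deliberately wrong initial value at that unseen pair plays the role of function-approximation error. The names DATASET, GAMMA, and fitted_q are illustrative only.

```python
import numpy as np

GAMMA = 0.9

# Hypothetical logged transitions: (state, action, reward, next_state, done).
# States: 0 = s1, 1 = s2. Action 1 at s2 never appears in the data.
DATASET = [
    (0, 0, 0.0, 1, False),   # s1, a0 -> s2
    (0, 1, 0.5, -1, True),   # s1, a1 -> terminal (next_state unused)
    (1, 0, 1.0, -1, True),   # s2, a0 -> terminal (next_state unused)
]


def fitted_q(dataset, in_sample_only, n_iters=50):
    """Tabular Q iteration over a fixed dataset.

    in_sample_only=False: Bellman target maxes over all actions (naive).
    in_sample_only=True:  Bellman target maxes only over actions that the
                          dataset contains at the next state.
    """
    q = np.zeros((2, 2))
    q[1, 1] = 10.0  # spurious estimate at the unseen (s2, a1) pair
    seen = {s: sorted({t[1] for t in dataset if t[0] == s}) for s in range(2)}
    for _ in range(n_iters):
        new_q = q.copy()
        for s, a, r, s_next, done in dataset:
            if done:
                target = r
            else:
                candidates = seen[s_next] if in_sample_only else [0, 1]
                target = r + GAMMA * max(q[s_next, b] for b in candidates)
            new_q[s, a] = target
        q = new_q
    return q


q_naive = fitted_q(DATASET, in_sample_only=False)
q_safe = fitted_q(DATASET, in_sample_only=True)
print("naive     Q(s1, a0):", q_naive[0, 0])  # inflated by the unseen action
print("in-sample Q(s1, a0):", q_safe[0, 0])   # backed up from observed a0 only
```

In this toy setup the naive backup copies the spurious value at s2 into s1, while the in-sample backup never reads it. That is the shared design principle in miniature; BCQ, CQL, and IQL each reach the same effect by a different mechanism, and none of them is implemented faithfully here.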

Five flashcards: shared design principle; BCQ training + deployment; CQL loss + bound; IQL three-loss training; when to pick BCQ.
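
For the CQL and IQL flashcards, a numeric sketch of the two distinctive loss terms can serve as a checking aid. It assumes discrete actions for the CQL penalty and uses NumPy only; the function names expectile_loss and cql_penalty are illustrative and do not come from either paper's code.

```python
import numpy as np


def expectile_loss(diff, tau=0.7):
    """IQL value loss on diff = Q(s, a) - V(s), averaged over the batch.

    Positive differences are weighted by tau and negative ones by 1 - tau,
    so tau > 0.5 pulls V(s) toward the upper range of in-dataset Q values.
    """
    diff = np.asarray(diff, dtype=float)
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))


def cql_penalty(q_values, dataset_actions):
    """Discrete-action conservative term: logsumexp_a Q(s, a) - Q(s, a_data).

    q_values has shape (batch, n_actions); dataset_actions has shape (batch,).
    The result is non-negative and grows when non-dataset actions score high.
    """
    q_values = np.asarray(q_values, dtype=float)
    dataset_actions = np.asarray(dataset_actions)
    m = q_values.max(axis=1, keepdims=True)  # subtract the max for stability
    lse = (m + np.log(np.exp(q_values - m).sum(axis=1, keepdims=True)))[:, 0]
    q_data = q_values[np.arange(len(dataset_actions)), dataset_actions]
    return float(np.mean(lse - q_data))


# Same magnitude of error, different sign: the positive one costs more at tau = 0.9.
print(expectile_loss([2.0, -2.0], tau=0.9))
# A high Q on the action the dataset did not take produces a large penalty.
print(cql_penalty([[1.0, 5.0]], [0]))
```

In a full CQL loss the penalty would be scaled by alpha and added to the Bellman error; in IQL the expectile loss trains V, with separate losses for Q and the policy.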

Cheatsheet (cheatsheet.mdx)

Tables. Three algorithms side by side (paper, mechanism, networks). What each prevents (max-OOD, extrapolation, Bellman amplification). Decision rubric. CQL loss decomposition. IQL three-loss training. BCQ deployment loop. Common pitfalls. Remember-bullets.
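
The BCQ deployment loop (sample, perturb, max) named in the cheatsheet can be sketched as below. The three callables stand in for the trained VAE decoder, perturbation network, and Q-network; their signatures, the toy implementations, and the default values are assumptions made for illustration, and the clip stands in for the perturbation network's bounded output.

```python
import numpy as np


def bcq_select_action(state, sample_actions, perturb, q_value,
                      n_candidates=10, max_perturb=0.05):
    """Pick an action the BCQ way: sample, perturb, then take the max.

    sample_actions(state, n) -> array (n, action_dim), the generative model
    perturb(state, actions)  -> array (n, action_dim), small adjustments
    q_value(state, action)   -> float, the learned Q estimate
    """
    # 1. Sample candidate actions that resemble the dataset's behavior.
    candidates = np.asarray(sample_actions(state, n_candidates), dtype=float)
    # 2. Perturb each candidate, keeping the adjustment within a small range.
    deltas = np.clip(perturb(state, candidates), -max_perturb, max_perturb)
    perturbed = candidates + deltas
    # 3. Score only these candidates and return the best one.
    scores = np.array([q_value(state, a) for a in perturbed])
    return perturbed[int(np.argmax(scores))]


# Toy stand-ins (hypothetical): a 1-D action space where Q peaks at 0.7,
# but the behavior data only covers actions between 0.1 and 0.6.
def toy_sampler(state, n):
    return np.linspace(0.1, 0.6, n).reshape(n, 1)


def toy_perturb(state, actions):
    return np.ones_like(actions)  # asks for +1.0, gets clipped to +0.05


def toy_q(state, action):
    return -float((action[0] - 0.7) ** 2)


action = bcq_select_action(None, toy_sampler, toy_perturb, toy_q, n_candidates=3)
print(action)  # stays near the sampled range rather than jumping to 0.7
```

The point of the design is that the max is taken over a small, data-like candidate set, so Q is never queried far from the behavior policy's support. With a heterogeneous dataset the sampler has to cover several modes at once, which is where the brief notes BCQ degrades.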

References (references.mdx)

Primary source: CS285. Primary algorithm papers: Fujimoto et al. 2019 BCQ, Kumar et al. 2020 CQL, Kostrikov et al. 2021 IQL. Comparison studies: Fu et al. 2020 D4RL, Brandfonbrener et al. 2021, Kumar et al. 2022 BC-vs-offline-RL. Precursors: Kumar et al. 2019 BEAR, Wu et al. 2019 BRAC, Peng et al. 2019 AWR. Survey: Levine et al. 2020 tutorial review. Production deployments: Komorowski et al. 2018 sepsis, Chen et al. 2019 YouTube recommender. RLHF connection: Ouyang et al. 2022 InstructGPT.

Editorial discipline

Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0 after fix), Greek letters in prose spelled out (alpha, beta, gamma, tau; symbols only in fenced display blocks), placeholder comments present on brief.
§6 watch-zone: technical algorithm content; no policy or vendor-advocacy framing. The InstructGPT citation in references is for the structural KL-regularization parallel, not for any vendor positioning.
Vendor naming: Anthropic, DeepMind, OpenAI, Google named only as paper-author affiliations (positive citations); MIT-affiliated authors on the sepsis paper; no anonymization triggers.
A1 verbatim discipline: no vendor quotations.

Word counts

Lesson 2274
Practice 1465
Summary 615
Cheatsheet 775
References 615
Brief 935

Total: 6679 words across 6 artifacts.

Notes for promotion

Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285 “Offline RL Algorithms”.
Practice uses real �J0� + �J1� component imports.
L14 prereq path: lessons/deep-reinforcement-learning/offline-rl-problem. L7 DQN: lessons/deep-reinforcement-learning/dqn. L2 BC: lessons/deep-reinforcement-learning/imitation-learning. L13 RLHF: lessons/deep-reinforcement-learning/rlhf.
Lesson body uses fenced display blocks for the CQL loss, IQL three-loss training, BCQ deployment loop. Greek symbols in those fenced blocks; prose spells alpha / beta / gamma / tau.
L15 closes the offline-RL pair. L16 pivots to exploration (the agent CAN act but reward is sparse).