References: Offline RL algorithms (BCQ, CQL, IQL)

Primary source

Levine, S. (2023). Berkeley CS285, Deep Reinforcement Learning, lecture on Offline RL: Algorithms. http://rail.eecs.berkeley.edu/deeprlcourse/. Lecture video at https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps.

The three algorithm families

BCQ

Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. ICML 2019. https://arxiv.org/abs/1812.02900 The original BCQ paper. Names extrapolation error, defines the VAE-plus-perturbation-plus-Q architecture, and demonstrates that the constrained Q-learning approach avoids divergence on standard benchmarks where naive offline Q-learning fails.

CQL

Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS 2020. https://arxiv.org/abs/2006.04779 The original CQL paper. Defines the conservative penalty, proves the lower-bound property, and demonstrates strong performance across D4RL.
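As a rough illustration of the conservative penalty named above, the sketch below computes the per-state regularizer for a discrete action space: the log-sum-exp of the Q-values minus the Q-value of the action seen in the dataset. The function name `cql_penalty` and its arguments are hypothetical, and the full objective in the paper also includes the usual Bellman error and a weighting coefficient, both omitted here.

```python
import math


def cql_penalty(q_values, data_action):
    """Per-state conservative penalty: logsumexp_a Q(s, a) - Q(s, a_data).

    q_values:    list of Q(s, a) for every discrete action a (illustrative input)
    data_action: index of the action actually taken in the dataset
    """
    # Subtract the max before exponentiating for numerical stability.
    q_max = max(q_values)
    log_sum_exp = q_max + math.log(sum(math.exp(q - q_max) for q in q_values))
    # Pushes down Q on all actions (soft maximum) and up on the dataset action.
    return log_sum_exp - q_values[data_action]
```

The penalty is never negative, and it shrinks toward zero as the dataset action's Q-value dominates the others, which is the sense in which it discourages overestimating actions outside the data.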

IQL

Kostrikov, I., Nair, A., & Levine, S. (2021). Offline Reinforcement Learning with Implicit Q-Learning. ICLR 2022. https://arxiv.org/abs/2110.06169 The original IQL paper. Introduces the expectile-regression formulation and the advantage-weighted policy update, and shows that the method matches or exceeds prior offline-RL methods on D4RL with simpler tuning.
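The expectile-regression idea can be sketched in a few lines. The loss below is the asymmetric squared error |tau - 1(u < 0)| * u^2 applied to u = Q(s, a) - V(s); the function name `expectile_loss`, the plain-list input, and the default `tau=0.7` are illustrative assumptions, not the paper's code.

```python
def expectile_loss(diffs, tau=0.7):
    """Mean asymmetric squared loss over u = Q(s, a) - V(s).

    With tau > 0.5, positive u (V below Q) is penalized more than negative u,
    so V is pulled toward an upper expectile of Q over dataset actions.
    """
    total = 0.0
    for u in diffs:
        # Weight is tau when u >= 0 and (1 - tau) when u < 0.
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(diffs)
```

Setting `tau=0.5` recovers ordinary mean squared error up to a factor of one half; pushing `tau` toward 1 makes V approximate the maximum of Q over actions that actually appear in the data, without ever querying Q on unseen actions.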

Comparison studies and benchmarks

Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219. https://arxiv.org/abs/2004.07219 The standard offline-RL benchmark. Covers locomotion, manipulation (including Adroit), and navigation tasks. CQL and IQL report results on it directly; BCQ predates it but appears among its baselines.
Brandfonbrener, D., Whitney, W. F., Ranganath, R., & Bruna, J. (2021). Offline RL Without Off-Policy Evaluation. NeurIPS 2021. https://arxiv.org/abs/2106.08909 Cross-algorithm comparison study, including IQL-precursor one-step RL approaches. Useful for understanding which algorithm wins which dataset family.
Kumar, A., Hong, J., Singh, A., & Levine, S. (2022). When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning? ICLR 2022. https://arxiv.org/abs/2204.05618 Directly addresses the L14/L15 question: when does offline RL exceed BC? Settings where offline RL wins versus settings where BC is sufficient.

Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019. https://arxiv.org/abs/1906.00949 BEAR, a BCQ-adjacent action-constraint algorithm that uses a maximum mean discrepancy (MMD) support constraint against the behavior policy instead of VAE-based generation. Cited as a sibling to BCQ.
Wu, Y., Tucker, G., & Nachum, O. (2019). Behavior Regularized Offline Reinforcement Learning. arXiv:1911.11361. https://arxiv.org/abs/1911.11361 Behavior-regularized actor-critic family. Adds an explicit regularization term keeping the policy close to the behavior policy. Sibling to BCQ and CQL.
Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., et al. (2020). Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning. ICLR 2020. https://arxiv.org/abs/2002.08396 Learns an advantage-weighted behavior model and uses it as a prior for the policy. A precursor to IQL’s advantage-weighted policy update.

Survey and tutorial

Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. https://arxiv.org/abs/2005.01643 The canonical survey. Section 4 covers value-based methods such as BCQ and CQL among many others. IQL was published after the survey and is not covered. Reference for any deeper dive into the family.

Production deployments

Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., & Faisal, A. A. (2018). The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24, 1716-1720. https://www.nature.com/articles/s41591-018-0213-5 Healthcare offline-RL application, trained and evaluated retrospectively on ICU records. Predates BCQ/CQL/IQL but illustrates the deployment context.
Chen, M., Beutel, A., Covington, P., et al. (2019). Top-K Off-Policy Correction for a REINFORCE Recommender System. WSDM 2019. https://arxiv.org/abs/1812.02353 Recommender-system offline RL at production scale, deployed on YouTube. Uses an off-policy correction related to the L15 family’s design principles.

Connection to L13 RLHF

Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 InstructGPT. The KL regularization to the SFT reference policy in the RLHF objective plays the structural role BCQ’s action constraint plays: keep the trained policy near the data distribution where the reward model is trustworthy. Re-read alongside the L15 algorithms to see the parallel.
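For orientation, the KL-regularized part of that objective can be written roughly as below. The notation is a paraphrase (the paper's full objective also mixes in a pretraining-likelihood term, omitted here): r_theta is the learned reward model, beta is the KL coefficient, and pi^SFT is the supervised reference policy.

```latex
\max_{\phi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\phi}^{\mathrm{RL}}(\cdot \mid x)}
\left[
  r_{\theta}(x, y)
  \;-\; \beta \,
  \log \frac{\pi_{\phi}^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
\right]
```

The second term is a per-sample estimate of the KL divergence from the reference policy, so a larger beta keeps the trained policy closer to the data the reward model was fit on, much as a tighter action constraint does in BCQ.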

Note on the source mix

This is the second of two offline-RL lessons. It draws on three primary algorithm papers (BCQ, CQL, IQL), the D4RL benchmark that made cross-algorithm comparisons meaningful, the survey for context, and selected real-world deployments. The lesson does not editorialize on which algorithm is “best”; it gives a decision rubric based on dataset structure and deployment constraints.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.