References: Offline RL algorithms (BCQ, CQL, IQL)

  • Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. ICML 2019. https://arxiv.org/abs/1812.02900 The original BCQ paper. Names extrapolation error, defines the VAE-plus-perturbation-plus-Q architecture, and demonstrates that the constrained Q-learning approach avoids divergence on standard benchmarks where naive offline Q-learning fails.
  • Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS 2020. https://arxiv.org/abs/2006.04779 The original CQL paper. Defines the conservative penalty, proves the lower-bound property, and demonstrates strong performance across D4RL.
  • Kostrikov, I., Nair, A., & Levine, S. (2021). Offline Reinforcement Learning with Implicit Q-Learning. ICLR 2022. https://arxiv.org/abs/2110.06169 The original IQL paper. Introduces the expectile-regression formulation and the advantage-weighted policy update, and shows that IQL matches or exceeds prior offline-RL methods on D4RL with simpler tuning.
  • Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219. https://arxiv.org/abs/2004.07219 The standard offline-RL benchmark. Covers locomotion, navigation, and manipulation tasks (including Adroit). CQL and IQL report results on it directly; BCQ predates it but is included among its baselines.
  • Brandfonbrener, D., Whitney, W. F., Ranganath, R., & Bruna, J. (2021). Offline RL Without Off-Policy Evaluation. NeurIPS 2021. https://arxiv.org/abs/2106.08909 Studies one-step RL: a single policy-improvement step on the behavior policy's value estimate, compared against iterative offline methods on D4RL. A precursor to IQL. Useful for understanding which algorithm wins which dataset family.
  • Kumar, A., Hong, J., Singh, A., & Levine, S. (2022). When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning? ICLR 2022. https://arxiv.org/abs/2204.05618 Directly addresses the L14/L15 question: when does offline RL exceed BC? Characterizes the settings where offline RL wins versus the settings where BC is sufficient.
  • Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019. https://arxiv.org/abs/1906.00949 BEAR, a BCQ-adjacent action-constraint algorithm that keeps the learned policy within the support of the behavior policy using a maximum mean discrepancy (MMD) constraint instead of VAE-based generation. Cited as a sibling to BCQ.
  • Wu, Y., Tucker, G., & Nachum, O. (2019). Behavior Regularized Offline Reinforcement Learning. arXiv:1911.11361. https://arxiv.org/abs/1911.11361 Behavior-regularized actor-critic family. Adds an explicit regularization term keeping the policy close to the behavior policy. Sibling to BCQ and CQL.
  • Siegel, N. Y., Springenberg, J. T., Berkenkamp, F., et al. (2020). Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning. ICLR 2020. https://arxiv.org/abs/2002.08396 Learns an advantage-weighted behavior model and uses it as a prior for the policy. A precursor to the advantage-weighted policy update in IQL.
  • Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. https://arxiv.org/abs/2005.01643 The canonical survey. Section 4 covers the value-based family that BCQ and CQL belong to, among many others; the survey predates IQL. Reference for any deeper dive into the family.
  • Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., & Faisal, A. A. (2018). The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24, 1716-1720. https://www.nature.com/articles/s41591-018-0213-5 Healthcare offline-RL application. Predates BCQ/CQL/IQL but illustrates the target setting: treatment policies learned from logged ICU records, where online exploration is not an option.
  • Chen, M., Beutel, A., Covington, P., et al. (2019). Top-K Off-Policy Correction for a REINFORCE Recommender System. WSDM 2019. https://arxiv.org/abs/1812.02353 Recommender-system offline RL at production scale (deployed on YouTube). Uses an off-policy correction related to the L15 family’s design principles.
  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 InstructGPT. The KL regularization to the SFT reference policy in the RLHF objective plays the structural role BCQ’s action constraint plays: keep the trained policy near the data distribution where the reward model is trustworthy. Re-read alongside the L15 algorithms to see the parallel.
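
As a quick refresher on two of the objectives named in the annotations above, here is a minimal NumPy sketch of the IQL expectile loss and a CQL-style conservative penalty for a discrete action space. The function names, the array shapes, and the default tau=0.7 are illustrative assumptions, not code from either paper; consult the papers for the full training objectives.

```python
import numpy as np


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, averaged.

    In IQL, diff would be Q(s, a) - V(s) on dataset pairs; tau > 0.5
    weights positive differences more, pulling V toward an upper
    expectile of Q. (tau=0.7 is an illustrative default.)
    """
    diff = np.asarray(diff, dtype=float)
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))


def cql_penalty(q_values, data_actions):
    """CQL-style penalty for discrete actions (hypothetical helper).

    q_values: array of shape (batch, num_actions) holding Q(s, a).
    data_actions: integer array of shape (batch,) with dataset actions.
    Returns mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data),
    which pushes down Q on all actions and pushes it up on dataset actions.
    """
    q_values = np.asarray(q_values, dtype=float)
    data_actions = np.asarray(data_actions, dtype=int)
    # Numerically stable log-sum-exp over the action axis.
    q_max = q_values.max(axis=1, keepdims=True)
    logsumexp = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    q_data = q_values[np.arange(q_values.shape[0]), data_actions]
    return float(np.mean(logsumexp - q_data))
```

In a full algorithm the penalty would be scaled by a coefficient and added to the usual Bellman error, and the expectile loss would train a separate value network; both steps are omitted here.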

This is the second of two offline-RL lessons. It draws on three primary algorithm papers (BCQ, CQL, IQL), the D4RL benchmark that made cross-algorithm comparisons meaningful, the survey for context, and selected real-world deployments. The lesson does not editorialize on which algorithm is “best”; it gives a decision rubric based on dataset structure and deployment constraints.

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.