
Cheatsheet: Offline RL algorithms (BCQ, CQL, IQL)

| Algorithm | Paper | Mechanism | Networks trained |
| --- | --- | --- | --- |
| BCQ | Fujimoto, Meger, Precup 2019 | Constrain the policy to actions a VAE generates from the dataset | VAE p(a\|s), perturbation, Q (or double-Q) |
| CQL | Kumar, Zhou, Tucker, Levine 2020 | Penalize Q at OOD actions; push up Q at in-distribution actions | Q (or double-Q) with conservative loss term |
| IQL | Kostrikov, Nair, Levine 2021 | Replace max with expectile regression on in-distribution actions only; policy is advantage-weighted imitation | V (state value), Q, policy |

| Failure source | BCQ | CQL | IQL |
| --- | --- | --- | --- |
| Max selects OOD action | Action set restricted to VAE samples | OOD Q-values penalized below in-distribution Q | Max not used at all |
| Function approximator extrapolates | Constrained at the action level | Penalized at the value level | Avoided by expectile design |
| Bellman propagation amplifies | Inflation cannot enter (action set is constrained) | Inflation is bounded (lower-bound Q) | No path for inflation (max is replaced) |

| Setting | Recommended | Why |
| --- | --- | --- |
| Single, well-defined behavior policy | BCQ | Single-modal action distribution maps cleanly onto a VAE |
| Heterogeneous mixture of behavior policies | CQL | Per-state Q penalty handles multi-modal data |
| Continuous high-dimensional actions | IQL | Avoids costly max-over-actions evaluation |
| First try on standard benchmark | IQL | Simplest tuning, strongest defaults on D4RL |
| Theoretical lower bound on Q is required | CQL | Provable conservative bound |
| Tight integration with BC baseline | IQL | Advantage-weighted regression is closest to BC + advantage filtering |
| Regulated setting requiring explicit action constraint | BCQ | The action constraint is the safety story |

The CQL total objective:

L = E over (s,a,r,s') of (Q(s,a) - target)²
+ alpha · E over (s,a) of (log-sum-exp over a' of Q(s,a') - Q(s,a))
  • First term: standard Bellman error.
  • Second term: the conservative penalty.
    • log-sum-exp is a soft maximum over actions, so the highest-Q actions (typically OOD) dominate it; minimizing it pushes those values down.
    • Subtracting Q at the in-distribution (dataset) action pushes that value up.
  • alpha controls penalty strength. Tune it against the BC baseline.
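The two terms can be made concrete with a small sketch. It assumes a discrete action set, so the log-sum-exp can be taken over an explicit list of Q-values, and it assumes the Bellman targets are already computed. The names `cql_loss`, `q_rows`, `actions`, and `targets` are illustrative and not taken from any library.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_loss(q_rows, actions, targets, alpha):
    """Toy CQL-style loss for a batch with discrete actions.

    q_rows[i]  : list of Q(s_i, a') for every action a'
    actions[i] : index of the dataset action a_i
    targets[i] : precomputed Bellman target for (s_i, a_i)
    Returns (total, bellman_term, penalty_term).
    """
    n = len(q_rows)
    # First term: squared Bellman error at the dataset action.
    bellman = sum((q[a] - t) ** 2 for q, a, t in zip(q_rows, actions, targets)) / n
    # Second term: soft maximum over all actions minus Q at the dataset action.
    penalty = sum(logsumexp(q) - q[a] for q, a in zip(q_rows, actions)) / n
    return bellman + alpha * penalty, bellman, penalty

# State 0 has a high Q at a non-dataset action; state 1 is flat.
total, bellman, penalty = cql_loss(
    q_rows=[[1.0, 5.0], [2.0, 2.0]], actions=[0, 1], targets=[1.0, 2.0], alpha=1.0
)
```

Because log-sum-exp is never smaller than the largest Q-value in the row, the penalty is non-negative, and it is largest where some action scores far above the dataset action. In this toy batch the first state contributes most of the penalty.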
The IQL losses:

L_V = E over (s,a) of expectile_tau(Q(s,a) - V(s)) [expectile regression on dataset Q]
L_Q = E over (s,a,r,s') of (Q(s,a) - (r + gamma · V(s')))² [Bellman with V(s'), no max]
L_policy = -E over (s,a) of exp(beta · (Q(s,a) - V(s))) · log policy(a|s) [advantage-weighted imitation; the minus sign makes it a loss to minimize]
  • tau approaching 1 makes V(s) act as a max-over-in-distribution-actions surrogate.
  • beta controls how strongly to upweight high-advantage dataset actions.
  • No max anywhere.
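A small sketch shows how tau moves V(s) toward the largest in-distribution Q-value. It assumes the common asymmetric squared form of the expectile loss (weight tau on positive residuals, 1 - tau otherwise) and fits a single scalar V by plain gradient descent. The helper names and the clip on the advantage weight are assumptions made for illustration.

```python
import math

def expectile_loss(diff, tau):
    """Asymmetric squared loss: weight tau if diff > 0, else (1 - tau)."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff

def fit_expectile(q_values, tau, steps=2000, lr=0.05):
    """Find the scalar V that minimizes mean expectile_loss(Q - V, tau)."""
    v = sum(q_values) / len(q_values)
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            diff = q - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dV of weight * (q - V)^2
        v -= lr * grad / len(q_values)
    return v

def awr_weight(q, v, beta, max_weight=100.0):
    """exp(beta * advantage); the clip at max_weight is an assumed stabilizer."""
    return min(math.exp(beta * (q - v)), max_weight)

# Q-values of the dataset actions observed at one state.
dataset_q = [0.0, 1.0, 2.0, 10.0]
v_mean = fit_expectile(dataset_q, tau=0.5)   # the ordinary mean
v_high = fit_expectile(dataset_q, tau=0.9)   # pulled toward the best dataset action
v_max = fit_expectile(dataset_q, tau=0.99)   # close to, but below, max(dataset_q)
```

With tau = 0.5 the fit is the plain mean. Raising tau moves V toward the largest Q seen in the data without ever evaluating Q at an action outside the dataset. `awr_weight` then upweights dataset actions whose Q exceeds that V.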

BCQ action selection at state s:

  1. Sample N candidate actions from VAE p(a | s).
  2. Perturb each via the perturbation network.
  3. Evaluate Q at each (s, perturbed action).
  4. Return the action with the highest Q.

The Q-network is only ever evaluated at near-in-distribution actions.
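The four steps can be sketched as one function. The three callables stand in for the trained VAE decoder, perturbation network, and Q-network. Their names and signatures are hypothetical, and the toy versions below exist only to make the example runnable.

```python
import math
import random

def bcq_select_action(state, sample_action, perturb, q_value, n_candidates=10):
    """Pick the highest-Q action among perturbed generator samples.

    sample_action(state)   -> action drawn from the generative model p(a | s)
    perturb(state, action) -> slightly adjusted action
    q_value(state, action) -> scalar Q estimate
    All three callables are hypothetical stand-ins for trained networks.
    """
    # Step 1: sample N candidates from the generative model.
    candidates = [sample_action(state) for _ in range(n_candidates)]
    # Step 2: perturb each candidate.
    perturbed = [perturb(state, a) for a in candidates]
    # Steps 3-4: evaluate Q at each perturbed action and return the best one.
    return max(perturbed, key=lambda a: q_value(state, a))

# Toy 1-D stand-ins: dataset actions lie in [-0.5, 0.5], the perturbation is
# bounded by 0.05, and the toy Q grows without bound as the action grows.
rng = random.Random(0)

def toy_sample(state):
    return rng.uniform(-0.5, 0.5)

def toy_perturb(state, action):
    return action + 0.05 * math.tanh(state - action)

def toy_q(state, action):
    return action

action = bcq_select_action(0.0, toy_sample, toy_perturb, toy_q, n_candidates=10)
```

Although the toy Q would prefer arbitrarily large actions, the selected action stays within 0.05 of the sampled range, because Q is only consulted at the perturbed candidates.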

Common mistakes:

  • Treating BCQ, CQL, and IQL as interchangeable (their mechanisms and best uses differ)
  • Ignoring the single-modal-behavior-policy assumption in BCQ
  • Under-tuning alpha in CQL (the original-paper default is not always right)
  • Over-trusting IQL's defaults (tau and beta still matter on production data)
  • Skipping the BC sanity check (every offline-RL deployment must beat BC on the same dataset)

Key takeaways:

  • One design principle, three mechanisms: prevent the Bellman update from querying Q at OOD actions.
  • BCQ: action constraint via VAE. CQL: conservative Q penalty. IQL: expectile regression sidesteps the max.
  • The decision rubric matters: BCQ for single-modal data, CQL for heterogeneous data, IQL for continuous actions or first-try defaults.
  • Always benchmark against BC; offline RL must exceed it to justify the engineering cost.
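The BC comparison can be sketched as a simple check on evaluation returns. This assumes each policy has been rolled out for several evaluation episodes and that a margin of two combined standard errors is an acceptable bar. The threshold, the function names, and the return values below are illustrative, not measured results.

```python
import math
import statistics

def mean_and_sem(returns):
    """Mean and standard error of the mean for a list of episode returns."""
    mean = statistics.fmean(returns)
    sem = statistics.stdev(returns) / math.sqrt(len(returns))
    return mean, sem

def beats_bc(policy_returns, bc_returns, z=2.0):
    """True if the policy's mean return exceeds BC's by more than z combined standard errors."""
    p_mean, p_sem = mean_and_sem(policy_returns)
    b_mean, b_sem = mean_and_sem(bc_returns)
    return p_mean - b_mean > z * math.sqrt(p_sem ** 2 + b_sem ** 2)

# Made-up returns, used only to exercise the check.
clear_win = beats_bc([10, 11, 12, 11, 10], [5, 6, 5, 6, 5])
no_gain = beats_bc([5, 6, 5, 6, 5], [5, 6, 5, 6, 5])
```

Requiring a margin rather than a bare difference in means keeps evaluation noise from being mistaken for an improvement over BC.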