Cheatsheet: Multi-task RL and meta-RL

| Property | Multi-task RL | Meta-RL |
| --- | --- | --- |
| Training tasks | Known set of N tasks | Sampled from a task distribution |
| Test tasks | Same as training set | New tasks from the same distribution |
| Goal | One policy that handles any training task | Agent that adapts rapidly to new tasks |
| Adaptation at test? | No (policy is already trained) | Yes (few samples, then adapted policy) |
| Failure mode | Negative transfer between dissimilar tasks | Test task outside training distribution |

| Family | Algorithm | Test-time adaptation | Best for |
| --- | --- | --- | --- |
| Gradient | MAML (Finn et al. 2017) | Few explicit gradient steps from meta-trained initialization | Test-time gradient budget available; explicit adaptation |
| Recurrent | RL² (Duan et al. 2016; Wang et al. 2016) | RNN hidden state updates implicitly with each new transition | No test-time training; real-time adaptation |
| Bayesian | PEARL (Rakelly et al. 2019) | Posterior over task latent updates with new transitions | Explicit task-identity uncertainty needed |

meta-train:
    for each meta-iteration:
        meta-loss = 0
        for each task t sampled from the task distribution:
            sample task-train and task-test batches
            theta_t = theta
            repeat K times: theta_t = theta_t - alpha · gradient of task-train loss at theta_t
            meta-loss += task-test loss at theta_t
        theta = theta - eta · gradient of meta-loss with respect to theta

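In MAML, the meta-gradient is the gradient of the post-adaptation (task-test) loss with respect to the initial parameters theta, taken through the K inner gradient steps. Differentiating through those steps is why it involves second-order derivatives.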
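
To make the gradient through the inner steps concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration: each task is a scalar target, the loss is 0.5 * (theta - target)**2, gradients are exact (so the task-train and task-test losses coincide), and the names `maml_meta_gradient` and `meta_train` are hypothetical. Because the toy loss has a constant second derivative of 1, the chain rule through each inner step reduces to multiplying by (1 - alpha).

```python
import random

def task_loss_grad(theta, target):
    """Gradient of the toy task loss 0.5 * (theta - target)**2."""
    return theta - target

def maml_meta_gradient(theta, target, alpha, k):
    """Return (adapted theta, meta-gradient) for a single task.

    The meta-gradient is d(loss at theta_t) / d(theta), obtained by
    tracking d(theta_t)/d(theta) through every inner gradient step.
    """
    theta_t = theta
    dtheta_t_dtheta = 1.0
    for _ in range(k):
        # Inner step: theta_t <- theta_t - alpha * grad(theta_t).
        theta_t = theta_t - alpha * task_loss_grad(theta_t, target)
        # Chain rule: the toy loss has second derivative 1.
        dtheta_t_dtheta *= 1.0 - alpha * 1.0
    return theta_t, task_loss_grad(theta_t, target) * dtheta_t_dtheta

def meta_train(targets, alpha=0.1, eta=0.5, k=3, iterations=200,
               tasks_per_iter=4, seed=0):
    """Outer loop: average meta-gradients over sampled tasks, then update theta."""
    rng = random.Random(seed)
    theta = 5.0  # arbitrary initialization
    for _ in range(iterations):
        meta_grad = 0.0
        for _ in range(tasks_per_iter):
            target = rng.choice(targets)  # sample a task
            _, grad = maml_meta_gradient(theta, target, alpha, k)
            meta_grad += grad / tasks_per_iter
        theta -= eta * meta_grad
    return theta
```

With neural-network policies the same quantity is normally obtained by automatic differentiation rather than a hand-written chain rule; the cost of backpropagating through K inner steps is what makes the meta-gradient expensive.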

hidden_t = RNN(state_t, action_(t-1), reward_(t-1), hidden_(t-1))
action_t = policy(hidden_t)

In RL², the RNN must learn the adaptation procedure implicitly: its weights stay fixed at test time, and the hidden state encodes “what task am I on, given recent experience.”
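
The recurrence above can be sketched as a tiny policy class. This is only an interface illustration under stated assumptions: the weights are random and untrained (in RL² they would be learned by an outer RL algorithm), the cell is a plain tanh RNN rather than any particular published architecture, and `TinyRecurrentPolicy` is a hypothetical name. The point it shows is that the previous action and reward are fed back as inputs and that the hidden state is reset per task, not per episode.

```python
import math
import random

class TinyRecurrentPolicy:
    """Untrained tanh-RNN policy illustrating an RL^2-style interface."""

    def __init__(self, state_dim, n_actions, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        # Input = state, one-hot previous action, previous reward.
        in_dim = state_dim + n_actions + 1

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = mat(hidden_dim, in_dim)
        self.w_rec = mat(hidden_dim, hidden_dim)
        self.w_out = mat(n_actions, hidden_dim)
        self.n_actions = n_actions
        self.hidden_dim = hidden_dim
        self.reset()

    def reset(self):
        """Clear the hidden state; call once per new task, not per episode."""
        self.hidden = [0.0] * self.hidden_dim

    def step(self, state, prev_action, prev_reward):
        """Update the hidden state from one transition and return a greedy action."""
        # prev_action=None (first step) yields an all-zero one-hot vector.
        one_hot = [1.0 if a == prev_action else 0.0
                   for a in range(self.n_actions)]
        x = list(state) + one_hot + [float(prev_reward)]
        self.hidden = [
            math.tanh(sum(w * v for w, v in zip(row_in, x))
                      + sum(w * h for w, h in zip(row_rec, self.hidden)))
            for row_in, row_rec in zip(self.w_in, self.w_rec)
        ]
        logits = [sum(w * h for w, h in zip(row, self.hidden))
                  for row in self.w_out]
        return max(range(self.n_actions), key=lambda a: logits[a])
```

A typical loop would call `reset()` when a new task starts and then `step(state, prev_action, prev_reward)` at every transition, carrying the hidden state across episode boundaries within that task.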

context (s, a, r, s') -> probabilistic encoder -> q(z | context)
sample z ~ q
action = policy(state, z)

In PEARL, the latent z encodes the task; the encoder refines the posterior q(z | context) as transitions arrive, and the policy conditions on a sample from it.
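
One way to see how the posterior tightens as context grows is to combine per-transition Gaussian factors into a single Gaussian, which is the kind of product-of-Gaussians posterior PEARL uses. The sketch below is a simplification under stated assumptions: z is one-dimensional, each factor is given directly as a (mean, variance) pair rather than produced by a learned encoder, the prior is a standard normal used only when the context is empty, and the function names are hypothetical.

```python
import random

def infer_posterior(factors, prior=(0.0, 1.0)):
    """Combine 1-D Gaussian factors (mean, variance) into one Gaussian.

    With no context, fall back to the prior. Otherwise precisions add,
    and the mean is the precision-weighted average of the factor means.
    """
    if not factors:
        return prior
    precision = sum(1.0 / var for _, var in factors)
    mean = sum(mu / var for mu, var in factors) / precision
    return mean, 1.0 / precision

def sample_latent(posterior, rng):
    """Draw z ~ q(z | context) for the policy to condition on."""
    mean, var = posterior
    return rng.gauss(mean, var ** 0.5)

# Usage: the variance shrinks as more transitions contribute factors.
rng = random.Random(0)
context_factors = [(1.0, 1.0), (3.0, 1.0)]
posterior = infer_posterior(context_factors)
z = sample_latent(posterior, rng)
```

Sampling z rather than using the posterior mean is what lets the agent act under its current uncertainty about the task.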

| Setting | Recommended |
| --- | --- |
| Known training task set, test on same | Multi-task RL |
| Test on new tasks; gradient budget at adaptation time | MAML |
| Test on new tasks; no gradient steps at adaptation time | RL² |
| Test tasks have explicit identity uncertainty | PEARL |
| Test distribution shifted from training | None reliably; extend training distribution |

| Academic meta-RL | Foundation-model parallel |
| --- | --- |
| Train on task distribution | Pretrain on internet-scale text |
| Test-task adaptation budget | In-context examples + few-shot fine-tuning |
| RL² (recurrent adaptation) | In-context learning (no gradient updates) |
| MAML (gradient adaptation) | Few-shot fine-tuning |
| Test-task-out-of-distribution failure | Prompt-distribution-shift failure |

| Multi-task RL concern | Mitigation |
| --- | --- |
| Task imbalance | Per-task weighting, gradient normalization across tasks |
| Gradient interference | Gradient surgery, separate task-specific heads on shared backbone |
| Capacity limits | Task-conditional architectures (mixture of experts), scaling network |
| Catastrophic interference | Continual-learning techniques, experience replay across tasks |

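
As an illustration of the gradient-surgery mitigation, the sketch below follows the PCGrad-style projection rule: when two task gradients have a negative dot product, the conflicting component is projected out before the gradients are summed. It is a minimal sketch under assumptions: gradients are plain Python lists, tasks are visited in a fixed order (the published method shuffles the order), and `project_conflicts` is a hypothetical name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_conflicts(grads):
    """Return the summed gradient after removing pairwise conflicts.

    For each task gradient g, whenever g points against another task's
    gradient (negative dot product), subtract the projection of g onto
    that other gradient.
    """
    projected = []
    for i, g in enumerate(grads):
        g_pc = list(g)
        for j, other in enumerate(grads):
            if i == j:
                continue
            overlap = dot(g_pc, other)
            if overlap < 0.0:  # conflict: remove the opposing component
                scale = overlap / dot(other, other)
                g_pc = [a - scale * b for a, b in zip(g_pc, other)]
        projected.append(g_pc)
    # Sum the per-task projected gradients coordinate-wise.
    return [sum(column) for column in zip(*projected)]

# Usage: two conflicting task gradients on a shared two-parameter backbone.
combined = project_conflicts([[1.0, 0.0], [-1.0, 1.0]])
```

Gradients that do not conflict pass through unchanged, so the rule only intervenes where tasks actively pull the shared parameters in opposing directions.
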
  • Conflating multi-task with meta-RL (different settings, different algorithms)
  • Expecting positive transfer automatically (negative transfer is real)
  • Underestimating task-distribution shift (meta-RL adapts reliably only to tasks near the training distribution)
  • Treating MAML as the default (the meta-gradient is unstable and costly)
  • Believing in-context learning replaces fine-tuning (for distribution-shifted tasks it does not)
  • Multi-task RL: known tasks, shared policy. The goal is positive transfer; the failure mode is negative transfer.
  • Meta-RL: new tasks at test time. Three families: MAML (gradient), RL² (recurrent), PEARL (Bayesian).
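  • Foundation models can be viewed as meta-learning at scale: pretraining plays the role of meta-training, and in-context learning or few-shot fine-tuning plays the role of test-time adaptation.
  • The test-task distribution must overlap with the training-task distribution; otherwise no method adapts reliably.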