Summary: Multi-task RL and meta-RL

The one-paragraph version

Single-task RL specializes a policy to one MDP. When the agent faces many related tasks, the structure they share can be exploited. Multi-task RL trains one policy on many tasks simultaneously with shared parameters; the goal is positive transfer (training on task A helps task B). Meta-RL trains the agent to adapt rapidly to NEW tasks at test time, where the new tasks are drawn from the same distribution as the training tasks. There are three meta-RL families: gradient-based (MAML) learns an initialization that is a few gradient steps away from a good solution on any task in the distribution; recurrent (RL²) uses an RNN whose hidden state implicitly encodes the current task; Bayesian (PEARL) maintains a posterior over a latent task variable. The decision rubric: multi-task RL when the test tasks are the known training tasks, gradient-based meta-RL when there is a gradient budget at test time, recurrent meta-RL when adaptation must happen without test-time training, Bayesian meta-RL when task identity is uncertain. Foundation models are meta-learning at scale: in-context learning is implicit meta-RL, and few-shot fine-tuning is MAML-flavored gradient adaptation (though not formal MAML training). The structure that academic meta-RL isolates in clean settings is recognizable in production AI.
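The gradient-based idea (an initialization a few gradient steps from a good solution on any task) can be sketched on a toy problem. This is a minimal illustration, not MAML as used in RL: each "task" is a one-dimensional quadratic loss standing in for negative return, and the names task_loss, inner_adapt, meta_gradient, and meta_train, along with the step sizes alpha and beta, are hypothetical choices made for this example.

```python
def task_loss(theta, center):
    """Toy task loss: 0.5 * (theta - center)^2 (a stand-in for negative return)."""
    return 0.5 * (theta - center) ** 2


def inner_adapt(theta, center, alpha):
    """Inner loop: one gradient step on a single task's loss."""
    return theta - alpha * (theta - center)


def meta_gradient(theta, center, alpha):
    """Exact gradient of the POST-adaptation loss with respect to theta.

    After one inner step, theta' - center = (1 - alpha) * (theta - center),
    so the post-adaptation loss is 0.5 * ((1 - alpha) * (theta - center))^2.
    """
    return (1.0 - alpha) ** 2 * (theta - center)


def meta_train(centers, alpha=0.5, beta=0.1, steps=2000):
    """Outer loop: move the initialization to reduce the post-adaptation loss."""
    theta = 0.0
    for _ in range(steps):
        grads = [meta_gradient(theta, c, alpha) for c in centers]
        theta -= beta * sum(grads) / len(grads)
    return theta


train_centers = [1.0, 2.0, 3.0]      # training task distribution (assumed)
theta0 = meta_train(train_centers)   # meta-learned initialization

# A new task drawn from the same range: adapt with one gradient step.
adapted = inner_adapt(theta0, center=2.5, alpha=0.5)
print(theta0, adapted, task_loss(theta0, 2.5), task_loss(adapted, 2.5))
```

In this symmetric toy the meta-learned initialization ends up at the mean of the task centers (about 2.0), which is also what plain multi-task training would find. The point of the sketch is the two-level structure: the outer loop differentiates through the inner adaptation step rather than through the raw task loss.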

Five things to remember

Multi-task RL: train on many tasks, test on the same tasks. One policy with shared parameters across tasks; positive transfer is the goal, negative transfer is the failure mode.
Meta-RL: train to adapt to NEW tasks. The agent learns the adaptation process itself, not any individual task.
Three meta-RL families. MAML (gradient-based, explicit test-time adaptation), RL² (recurrent, implicit hidden-state adaptation), PEARL (Bayesian, explicit task-posterior update).
Foundation models are meta-RL at scale. In-context learning works because the pretraining task distribution is huge; few-shot fine-tuning is MAML-flavored gradient adaptation rather than formal MAML training, but the structural parallel is recognizable.
All meta-RL assumes test tasks come from the training task distribution. A task outside that distribution falls outside what the meta-trained agent has learned to adapt to.
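The Bayesian family's "posterior over a task-latent variable" can be illustrated with exact Bayes over a small, discrete set of tasks. This is a simplified stand-in, not PEARL itself (PEARL learns an approximate posterior with a neural encoder). The two-armed bandit tasks, the uniform prior, and the function name update_posterior are assumptions made for this sketch.

```python
def update_posterior(posterior, tasks, arm, reward):
    """Bayes update of the task posterior after one (arm, reward) observation.

    tasks[i][arm] is the success probability of `arm` under task i (assumed
    to lie strictly between 0 and 1, so the normalizer is never zero).
    """
    likelihoods = [p[arm] if reward == 1 else 1.0 - p[arm] for p in tasks]
    unnormalized = [w * l for w, l in zip(posterior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two hypothetical training tasks: arm 0 is good in task 0, arm 1 in task 1.
tasks = [(0.9, 0.1), (0.1, 0.9)]
posterior = [0.5, 0.5]  # prior = training task distribution (uniform here)

# A few observations at test time: (arm pulled, reward received).
for arm, reward in [(0, 1), (0, 1), (1, 0)]:
    posterior = update_posterior(posterior, tasks, arm, reward)

print(posterior)
```

After three observations nearly all posterior mass sits on task 0. The sketch also makes the distribution assumption concrete: a test task that is not in the tasks list has zero prior mass, so no amount of evidence can move the posterior onto it.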

Why this matters

Modern foundation models exhibit the rapid-adaptation behaviors that academic meta-RL set out to engineer in clean settings. Understanding the multi-task and meta-RL framings lets you read claims about “few-shot learning” or “in-context adaptation” with calibrated skepticism. Two conditions must hold: the test-task distribution has to overlap with the training task distribution, and the adaptation budget must match what meta-training optimized for. When they fail, the failure mode (a test task too far from training) is the same in academic meta-RL and in production foundation models. The framings also clarify why some “transfer learning” pipelines work and others do not: positive transfer requires shared structure across tasks, and identifying when that structure exists is the multi-task RL problem.

Worked check (memory anchor)

Scenario	Setting
Warehouse robot, 50 known product shapes	Multi-task RL (known training tasks = test tasks)
Robotic arm, customers demonstrate new tasks	Meta-RL (new test tasks)
Language model on 200 known customer-support categories	Multi-task RL
Few-shot at inference from 3 in-prompt examples	Meta-RL (test task new, adaptation via in-context examples)
Driving policy on 10 cities deployed in those cities	Multi-task RL

Picking the family for the meta-RL cases: the robotic arm, given a gradient budget at adaptation time, fits MAML; few-shot at inference, with no gradient updates, fits RL²-style implicit adaptation, which is the closest analogue to what large language models do in context.
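The decision rubric can be written down as a small function and checked against the scenarios in the table. This is only a sketch: the function name choose_approach, its boolean parameters, and the order in which the conditions are tested are assumptions made for illustration, since the rubric itself does not specify a precedence.

```python
def choose_approach(new_test_tasks,
                    test_time_gradients=False,
                    task_identity_uncertain=False):
    """Map a deployment scenario to a setting, following the rubric.

    The precedence (uncertainty before gradient budget) is an assumption
    of this sketch, not part of the rubric.
    """
    if not new_test_tasks:
        # Test tasks are the known training tasks.
        return "multi-task RL"
    if task_identity_uncertain:
        return "Bayesian meta-RL (PEARL)"
    if test_time_gradients:
        return "gradient-based meta-RL (MAML)"
    # New tasks, but no test-time training is available.
    return "recurrent meta-RL (RL^2)"


# Scenarios from the worked check.
warehouse = choose_approach(new_test_tasks=False)
robot_arm = choose_approach(new_test_tasks=True, test_time_gradients=True)
few_shot = choose_approach(new_test_tasks=True, test_time_gradients=False)

print(warehouse, robot_arm, few_shot, sep="\n")
```

The first question is always whether the test tasks are new; only then does the choice among meta-RL families come up.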

Where this fits in the broader curriculum

L7 DQN through L13 RLHF trained single-task or single-distribution policies. L17 generalizes the framing to many tasks.
L13 RLHF builds on an implicit meta-learner: the pretrained language model that RLHF fine-tunes already adapts in context, so meta-learning enters at the pretraining scale rather than through RLHF itself.
L16 exploration is orthogonal: in multi-task and meta-RL the exploration question still arises, both within each task and across the task distribution. Hard-exploration tasks in a multi-task setting compound the exploration challenge.
L18, next, closes Phase 3 and Track 18 with the field’s open problems: sample efficiency, safety, generalization, and real-world deployment. Each connects back to algorithms covered across the track.