Summary: Multi-task RL and meta-RL
The one-paragraph version
Single-task RL specializes a policy to one MDP. When the agent faces many related tasks, the structure shared across those tasks can be exploited. Multi-task RL trains one policy on many tasks simultaneously with shared parameters; the goal is positive transfer (training on task A helps task B). Meta-RL trains the agent to adapt rapidly to NEW tasks at test time, where the new tasks are drawn from the same distribution as the training tasks. There are three meta-RL families: gradient-based (MAML) trains an initialization that is a few gradient steps away from a good solution on any task; recurrent (RL²) uses an RNN whose hidden state implicitly encodes the current task; Bayesian (PEARL) maintains a posterior over a latent task variable. The decision rubric: multi-task RL when the test tasks are the known training tasks, gradient-based meta-RL when a gradient budget is available at test time, recurrent meta-RL when adaptation must happen without test-time training, and Bayesian meta-RL when task identity is uncertain. Foundation models are meta-learning at scale: in-context learning is implicit meta-RL, and few-shot fine-tuning is MAML-like gradient adaptation, though not formal MAML training. The structure that academic meta-RL isolated in clean settings is recognizable in production AI.
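To make the MAML idea in the paragraph above concrete, here is a minimal sketch on a toy problem. It is an illustration under stated assumptions, not an RL implementation: two 1-D quadratic losses stand in for per-task RL objectives, and `TASKS`, `ALPHA`, and every function name are hypothetical. The sketch contrasts an initialization trained on the post-adaptation loss (the MAML objective) with one trained on the plain average loss (the multi-task objective).

```python
# Toy stand-in for MAML: loss_i(theta) = 0.5 * a_i * (theta - c_i)**2.
# TASKS, ALPHA, and all names below are illustrative assumptions.
TASKS = [(1.0, 0.0), (4.0, 1.0)]  # (curvature a_i, optimum c_i)
ALPHA = 0.2                        # inner-loop (adaptation) step size


def loss(theta, task):
    a, c = task
    return 0.5 * a * (theta - c) ** 2


def grad(theta, task):
    a, c = task
    return a * (theta - c)


def adapt(theta, task, alpha=ALPHA):
    """One inner-loop gradient step on a single task."""
    return theta - alpha * grad(theta, task)


def post_adaptation_loss(theta, tasks=TASKS):
    """Mean loss after each task takes one adaptation step from theta."""
    return sum(loss(adapt(theta, t), t) for t in tasks) / len(tasks)


def meta_grad(theta, tasks=TASKS, alpha=ALPHA):
    """Exact gradient of post_adaptation_loss with respect to theta.

    Chain rule: d(adapt)/d(theta) = 1 - alpha * a for these quadratics.
    The alpha * a part is the second-derivative term that a first-order
    approximation would ignore.
    """
    total = 0.0
    for a, c in tasks:
        adapted = adapt(theta, (a, c), alpha)
        total += grad(adapted, (a, c)) * (1.0 - alpha * a)
    return total / len(tasks)


def multitask_grad(theta, tasks=TASKS):
    """Gradient of the plain average loss (no adaptation step)."""
    return sum(grad(theta, t) for t in tasks) / len(tasks)


def train(grad_fn, theta=0.0, lr=0.5, steps=500):
    """Plain gradient descent on whichever objective grad_fn differentiates."""
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta


theta_maml = train(meta_grad)            # optimizes loss AFTER one adaptation step
theta_multitask = train(multitask_grad)  # optimizes loss BEFORE any adaptation
```

With these particular toy values the two objectives settle on different initializations (about 0.2 for the meta-objective and about 0.8 for the multi-task objective), and the meta-trained one has the lower post-adaptation loss. That gap is the point of the sketch: the meta-objective scores an initialization by where it ends up after adapting, not by where it starts.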
Five things to remember
- Multi-task RL: train on many tasks, test on the same tasks. One policy with shared parameters across tasks; positive transfer is the goal, negative transfer is the failure mode.
- Meta-RL: train to adapt to NEW tasks. The agent learns the adaptation process itself, not any individual task.
- Three meta-RL families. MAML (gradient-based, explicit test-time adaptation), RL² (recurrent, implicit hidden-state adaptation), PEARL (Bayesian, explicit task-posterior update).
- Foundation models are meta-RL at scale. In-context learning works because the pretraining task distribution is huge; few-shot fine-tuning is MAML-flavored gradient adaptation rather than formal MAML training, but the structural parallel is recognizable.
- All meta-RL assumes test tasks come from the training task distribution. Tasks outside the training distribution are out-of-scope for what the meta-trained agent has learned.
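The "explicit task-posterior update" in the PEARL bullet can be pictured with a small conjugate-Gaussian sketch. This is a deliberate simplification and an assumption on my part: PEARL learns its inference network from data, whereas here each observation is simply treated as a Gaussian factor with a fixed, hypothetical noise variance, and the prior is folded in as one more factor. The names `gaussian_product`, `task_posterior`, `PRIOR`, and `noise_var` are illustrative only.

```python
# Simplified picture of a posterior over a scalar task latent z.
# NOT PEARL's learned inference network; all names and values are assumptions.

PRIOR = (0.0, 1.0)  # (mean, variance) of the belief before any context


def gaussian_product(factors):
    """Combine independent Gaussian factors (mean, variance) over one latent.

    Precisions (1 / variance) add; the mean is precision-weighted.
    """
    precision = sum(1.0 / var for _, var in factors)
    mean = sum(mu / var for mu, var in factors) / precision
    return mean, 1.0 / precision


def task_posterior(observations, noise_var=0.5, prior=PRIOR):
    """Belief over z after seeing noisy observations of it in the new task."""
    factors = [prior] + [(y, noise_var) for y in observations]
    return gaussian_product(factors)


# The belief sharpens as context from the new task accumulates.
belief_0 = task_posterior([])          # no context: equals the prior
belief_2 = task_posterior([1.0, 1.0])  # two observations near z = 1
```

The design choice worth noticing is that the update is order-independent: the factors are multiplied, so the same set of observations gives the same belief however it is permuted. Uncertainty about task identity shows up directly as the posterior variance, which shrinks with each added observation.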
Why this matters
Modern foundation models exhibit the rapid-adaptation behaviors that academic meta-RL set out to engineer in clean settings. Understanding the multi-task and meta-RL framings lets you read claims about “few-shot learning” or “in-context adaptation” with calibrated skepticism: the test-task distribution has to overlap with the training task distribution; the adaptation budget must match what the meta-training optimized for; the failure mode (test task too far from training) is the same across academic meta-RL and production foundation models. The framings also clarify why some “transfer learning” pipelines work and others do not: positive transfer requires shared structure across tasks, and identifying when that structure exists is the multi-task RL problem.
Worked check (memory anchor)
| Scenario | Setting |
|---|---|
| Warehouse robot, 50 known product shapes | Multi-task RL (known training tasks = test tasks) |
| Robotic arm, customers demonstrate new tasks | Meta-RL (new test tasks) |
| Language model on 200 known customer-support categories | Multi-task RL |
| Few-shot at inference from 3 in-prompt examples | Meta-RL (test task new, adaptation via in-context examples) |
| Driving policy on 10 cities deployed in those cities | Multi-task RL |
For the meta-RL cases, the choice of family follows from the adaptation budget. A robotic arm with a gradient budget at adaptation time points to MAML. Few-shot adaptation at inference with no gradient updates points to RL²-style implicit adaptation, which is the closest analogue of in-context learning in large language models.
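The rubric behind the table and the family choice can be written down as a small function. This is a sketch of one reading of the rubric, not a definitive procedure: the `Scenario` fields, the order in which the conditions are checked, and the returned labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of a deployment; field names are assumptions."""
    name: str
    test_tasks_seen_in_training: bool
    gradient_steps_at_test: bool = False
    task_identity_uncertain: bool = False


def choose_approach(s: Scenario) -> str:
    """Map a scenario to a setting, checking the rubric's conditions in order."""
    if s.test_tasks_seen_in_training:
        return "multi-task RL"
    if s.task_identity_uncertain:
        return "Bayesian meta-RL (PEARL-style)"
    if s.gradient_steps_at_test:
        return "gradient-based meta-RL (MAML-style)"
    # New tasks, no test-time training: adaptation must be implicit.
    return "recurrent meta-RL (RL²-style)"


SCENARIOS = [
    Scenario("warehouse robot, 50 known shapes", True),
    Scenario("robotic arm, new demonstrated tasks", False, gradient_steps_at_test=True),
    Scenario("LM on 200 known support categories", True),
    Scenario("few-shot from 3 in-prompt examples", False),
    Scenario("driving policy in its 10 training cities", True),
]

for scenario in SCENARIOS:
    print(f"{scenario.name}: {choose_approach(scenario)}")
```

The first check carries most of the weight: whether the test tasks are the training tasks decides multi-task versus meta-RL, and only then does the adaptation budget matter.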
Where this fits in the broader curriculum
- L7 DQN through L13 RLHF trained single-task or single-distribution policies. L17 generalizes the framing to many tasks.
- L13 RLHF builds on an implicit meta-learner: the pretrained language model that gets RLHF-fine-tuned already adapts to new tasks through its in-context capabilities.
- L16 exploration is orthogonal: in multi-task and meta-RL the exploration question still arises, both within each task and across the task distribution. Hard-exploration tasks in a multi-task setting compound the exploration challenge.
- L18 next closes Phase 3 and Track 18 with the field’s open problems: sample efficiency, safety, generalization, real-world deployment. Each connects back to algorithms covered across the track.