Brief: Multi-task RL and meta-RL
What you will learn
You will define multi-task RL and meta-RL, distinguish the two structurally (multi-task: same training and test tasks; meta-RL: new tasks at test time), and name the three meta-RL families (gradient-based MAML, recurrent RL², Bayesian PEARL). You will apply a decision rubric for picking the right approach per setting and recognize that foundation models exhibit meta-learning behaviors at scale (in-context learning as implicit meta-RL, few-shot fine-tuning as explicit gradient meta-RL). You will leave with the framings that let you read claims about “few-shot adaptation” or “transfer learning” in modern systems with calibrated skepticism, knowing the test-distribution-overlap assumption that all meta-RL approaches share.
Where this fits
This is lesson 17 of Track 18 (Deep Reinforcement Learning), lesson 5 of Phase 3 (rl-frontiers). Penultimate lesson of the track. Builds on every previous algorithm in T18 (which all trained single-task or single-distribution policies) and on L13 RLHF (which contextualizes foundation models as the parallel to academic meta-RL at scale).
Source
Berkeley CS285 (Sergey Levine, Fall 2023), lecture on Multi-task and meta-RL. Canonical URL http://rail.eecs.berkeley.edu/deeprlcourse/. Primary algorithm papers: Finn, Abbeel, and Levine (2017) MAML; Duan et al. (2016) RL² and Wang et al. (2016) Learning to Reinforcement Learn; Rakelly et al. (2019) PEARL. Foundation-model parallel: Brown et al. (2020) GPT-3.
Phase advance
Phase 3 lesson 5 (phase_order: 5). After L17 follows L18 (Open problems, closes Phase 3 + Track 18).
Lesson body (lesson.mdx)
- Hook: every algorithm in T18 has assumed a single task; many real settings involve many related tasks.
- Multi-task RL: train one policy on many tasks simultaneously; task identity as input; positive vs negative transfer; three practical concerns (imbalance, interference, capacity).
- Meta-RL setup: test tasks are new but drawn from the same distribution as the training tasks; the agent is trained to adapt rapidly.
- Three meta-RL families:
  - MAML (gradient-based): meta-train an initialization that sits a few gradient steps from a good solution on each task; test-time adaptation is K gradient steps.
  - RL² (recurrent): the meta-policy is an RNN; the hidden state encodes the task; test-time adaptation is an implicit hidden-state update.
  - PEARL (Bayesian): posterior over a task latent variable; policy conditioned on the posterior embedding.
- Decision rubric: which family for which setting.
- Concrete examples: robotic manipulation (multi-task pretraining, meta-adaptation to new objects); language models (in-context learning as implicit meta-RL); video games (AlphaStar); recommender systems (multi-task by design).
- Why this matters: foundation models exhibit meta-RL behaviors at scale; understanding the framings clarifies claims about few-shot learning.
- Common pitfalls (5).
- 5 remember-bullets.
- L18 setup.
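The MAML bullet above (an initialization a few gradient steps from a good solution, with K gradient steps at test time) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the lesson's pseudocode: each "task" is a 1-D quadratic loss with a different center rather than an RL environment, and every name (`task_loss`, `adapt`, `meta_gradient`, `meta_train`, `train_centers`) is hypothetical. The quadratic is chosen because the derivative of the adapted parameter with respect to the initialization has a closed form, so the meta-gradient is exact without an autodiff library.

```python
# Toy sketch of MAML's inner/outer loop on 1-D quadratic tasks.
# Assumption: task loss is 0.5 * (theta - center)**2; not an RL environment.

def task_loss(theta, center):
    return 0.5 * (theta - center) ** 2

def task_grad(theta, center):
    # d/dtheta of the toy task loss.
    return theta - center

def adapt(theta, center, inner_lr, k_steps):
    """Inner loop / test-time adaptation: K plain gradient steps on one task."""
    for _ in range(k_steps):
        theta = theta - inner_lr * task_grad(theta, center)
    return theta

def meta_gradient(theta, center, inner_lr, k_steps):
    """Gradient of the post-adaptation loss with respect to the initialization.

    Each inner step scales (theta - center) by (1 - inner_lr), so
    d(adapted)/d(theta) = (1 - inner_lr) ** k_steps for this loss.
    """
    adapted = adapt(theta, center, inner_lr, k_steps)
    d_adapted_d_theta = (1.0 - inner_lr) ** k_steps
    return task_grad(adapted, center) * d_adapted_d_theta

def meta_train(train_centers, inner_lr=0.1, outer_lr=0.5, k_steps=1, iterations=200):
    """Outer loop: average the meta-gradient over all training tasks."""
    theta = 0.0
    for _ in range(iterations):
        grads = [meta_gradient(theta, c, inner_lr, k_steps) for c in train_centers]
        theta -= outer_lr * sum(grads) / len(grads)
    return theta

train_centers = [-2.0, 0.0, 1.0, 5.0]   # hypothetical training tasks
theta_init = meta_train(train_centers)

# A held-out task that still lies inside the range of the training tasks.
held_out_center = 3.0
theta_adapted = adapt(theta_init, held_out_center, inner_lr=0.1, k_steps=5)
loss_before = task_loss(theta_init, held_out_center)
loss_after = task_loss(theta_adapted, held_out_center)
```

In this toy the learned initialization settles at the mean of the training centers, and a few inner steps reduce the loss on the held-out task. In an RL setting the inner gradient would be a policy-gradient estimate from sampled trajectories, and the held-out task only adapts well because it overlaps with the training distribution.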
Practice (practice.mdx)
Two exercises plus five flashcards.
- Multi-task or meta-RL (5 scenarios): warehouse robot 50 shapes, robotic arm customer demos, language model 200 categories, few-shot at inference, driving policy 10 cities. Classify each.
- Which meta-RL family (4 scenarios): robotic arm with gradient budget, real-time trading adaptation, medical diagnosis with uncertainty, grid-world navigation. Pick MAML / RL² / PEARL with justification.
Five flashcards: multi-task vs meta-RL distinction; positive vs negative transfer; MAML adaptation; RL² adaptation; foundation models as meta-learners at scale.
Cheatsheet (cheatsheet.mdx)
Tables. Multi-task vs meta-RL. Three families side by side (algorithm, adaptation, best for). MAML training loop. RL² architecture. PEARL inference. Decision rubric. Foundation-model parallel. Multi-task practical concerns. Pitfalls.
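For the PEARL inference entry, a small sketch may help show how a posterior over the task latent can tighten as context accumulates. PEARL as described by Rakelly et al. (2019) combines per-transition Gaussian factors into a single Gaussian posterior; the sketch below shows only that combination step in one dimension. The factor means and variances are made-up numbers standing in for what a learned encoder would emit, and the function names are hypothetical.

```python
import math
import random

def product_of_gaussians(means, variances):
    """Combine independent 1-D Gaussian factors into one Gaussian.

    Precisions (1 / variance) add; the mean is the precision-weighted average.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision

def sample_latent(mean, variance, rng):
    """Draw one task latent z from the current posterior (posterior sampling)."""
    return rng.gauss(mean, math.sqrt(variance))

# Hypothetical factors, one per observed context transition.
factor_means = [0.8, 1.3, 1.1, 0.9]
factor_vars = [1.0, 1.0, 1.0, 1.0]

# Posterior after 1, 2, 3, and 4 context transitions.
posteriors = [
    product_of_gaussians(factor_means[:n], factor_vars[:n])
    for n in range(1, len(factor_means) + 1)
]

rng = random.Random(0)
z = sample_latent(*posteriors[-1], rng)  # the policy would be conditioned on z
```

The posterior variance shrinks with every added factor, which is the sense in which more context means less task uncertainty; a policy conditioned on a sampled `z` would act under one plausible task hypothesis at a time.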
References (references.mdx)
CS285 primary. Multi-task: Caruana 1997, Chen 2018 GradNorm, Yu 2020 gradient surgery. MAML family: Finn et al. 2017 + Nichol Reptile. RL² family: Duan 2016 + Wang 2016. PEARL family: Rakelly 2019 + Zintgraf variBAD. Benchmarks: Yu 2019 Meta-World + Wang survey. Foundation-model connection: Brown 2020 GPT-3, Wei 2022 emergent abilities, Vinyals 2019 AlphaStar. Robotics: Kalashnikov QT-Opt.
Editorial discipline
- Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0), Greek letters in prose spelled (theta, eta as relevant; symbols only in fenced display blocks), placeholder comments present on brief.
- §6 watch-zone: technical algorithm content; the foundation-model parallel is presented as a structural observation, not as endorsement or critique of specific vendor products. Brown et al. (GPT-3) cited for the in-context-learning result, not for OpenAI advocacy.
- Vendor naming: OpenAI (GPT-3, ICM), DeepMind (AlphaStar), Google Robotics (QT-Opt) named only as paper-author affiliations; positive citations; A1 verbatim n/a.
Word counts
- Lesson 2095
- Practice 1265
- Summary 565
- Cheatsheet 575
- References 660
- Brief 805
Total ≈ 5965 words across 6 artifacts.
Notes for promotion
- Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285 “Multi-task and meta-RL”.
- Practice uses real �J0� + �J1� component imports.
- L7 DQN through L13 RLHF prereq paths: standard lessons/deep-reinforcement-learning/�J0� form. L16 exploration: lessons/deep-reinforcement-learning/exploration.
- Lesson body uses fenced display blocks for the multi-task objective, the MAML pseudocode, the RL² recurrent architecture, the PEARL inference flow. Greek symbols in fenced blocks; prose spells theta / eta.
- L17 is the second-to-last lesson of Phase 3 and Track 18. L18 closes both.