Brief: Multi-task RL and meta-RL

You will define multi-task RL and meta-RL, distinguish the two structurally (multi-task: same training and test tasks; meta-RL: new tasks at test time), and name the three meta-RL families (gradient-based MAML, recurrent RL², Bayesian PEARL). You will apply a decision rubric for picking the right approach per setting and recognize that foundation models exhibit meta-learning behaviors at scale (in-context learning as implicit meta-RL, few-shot fine-tuning as explicit gradient meta-RL). You will leave with the framings that let you read claims about “few-shot adaptation” or “transfer learning” in modern systems with calibrated skepticism, knowing the test-distribution-overlap assumption that all meta-RL approaches share.
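
The structural distinction above can be made concrete with a small sketch. It assumes a hypothetical task distribution (a goal position drawn uniformly from the unit interval); `sample_task` and the set sizes are illustrative names and numbers, not part of the lesson. The only point is where the evaluation tasks come from.

```python
import random

def sample_task(rng):
    """Draw one hypothetical task: a goal position on the unit interval."""
    return rng.random()

rng = random.Random(0)
train_tasks = [sample_task(rng) for _ in range(50)]

# Multi-task RL: evaluate on the same tasks the policy was trained on.
multi_task_eval = list(train_tasks)

# Meta-RL: evaluate on fresh draws from the same distribution, so the
# agent has not seen these particular tasks during training.
meta_rl_eval = [sample_task(rng) for _ in range(10)]
```

Both evaluation sets come from the same distribution; that shared distribution is the overlap assumption the brief refers to.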

This is lesson 17 of Track 18 (Deep Reinforcement Learning), lesson 5 of Phase 3 (rl-frontiers). Penultimate lesson of the track. Builds on every previous algorithm in T18 (which all trained single-task or single-distribution policies) and on L13 RLHF (which contextualizes foundation models as the parallel to academic meta-RL at scale).

Berkeley CS285 (Sergey Levine, Fall 2023), lecture on Multi-task and meta-RL. Canonical URL http://rail.eecs.berkeley.edu/deeprlcourse/. Primary algorithm papers: Finn, Abbeel, and Levine (2017) MAML; Duan et al. (2016) RL² and Wang et al. (2016) Learning to Reinforcement Learn; Rakelly et al. (2019) PEARL. Foundation-model parallel: Brown et al. (2020) GPT-3.

Phase 3 lesson 5 (phase_order: 5). L18 (Open problems) follows L17 and closes both Phase 3 and Track 18.

  • Hook: every algorithm in T18 has assumed a single task; many real settings involve many related tasks.
  • Multi-task RL: train one policy on many tasks simultaneously; task identity as input; positive vs negative transfer; three practical concerns (imbalance, interference, capacity).
  • Meta-RL setup: test tasks are new but from the training task distribution; agent trained to adapt rapidly.
  • Three meta-RL families:
    • MAML (gradient-based): meta-train an initialization that sits a few gradient steps from a good solution for each task; test-time adaptation is K gradient steps.
    • RL² (recurrent): meta-policy is an RNN; hidden state encodes the task; test-time adaptation is an implicit hidden-state update.
    • PEARL (Bayesian): posterior over a task latent variable; policy conditioned on the posterior embedding.
  • Decision rubric: which family for which setting.
  • Concrete examples: robotic manipulation (multi-task pretraining, meta-adaptation to new objects); language models (in-context learning as implicit meta-RL); video games (AlphaStar); recommender systems (multi-task by design).
  • Why this matters: foundation models exhibit meta-RL behaviors at scale; understanding the framings clarifies claims about few-shot learning.
  • Common pitfalls (5).
  • 5 remember-bullets.
  • L18 setup.
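
The MAML bullet in the outline above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the lesson's pseudocode: a one-dimensional quadratic loss stands in for the negative expected return, so the meta-gradient through the K inner steps is available in closed form. The step sizes `ALPHA` and `BETA`, the value of `K`, and the task distribution in `sample_tasks` are all hypothetical choices.

```python
import numpy as np

ALPHA = 0.1   # inner-loop (adaptation) step size, assumed
BETA = 0.05   # outer-loop (meta) step size, assumed
K = 3         # number of inner gradient steps at adaptation time

def loss(theta, c):
    """Toy per-task loss standing in for negative expected return."""
    return 0.5 * (theta - c) ** 2

def adapt(theta, c, k=K):
    """Inner loop: k gradient steps on one task, starting from theta."""
    for _ in range(k):
        theta = theta - ALPHA * (theta - c)   # d loss / d theta = theta - c
    return theta

def meta_gradient(theta, c, k=K):
    """Exact gradient of the post-adaptation loss w.r.t. the initialization.

    Each inner step is theta <- (1 - ALPHA) * theta + ALPHA * c, so
    d theta_k / d theta_0 = (1 - ALPHA) ** k for this quadratic loss.
    """
    theta_k = adapt(theta, c, k)
    return (theta_k - c) * (1 - ALPHA) ** k

def sample_tasks(rng, n):
    """Hypothetical task distribution: targets drawn around 2.0."""
    return rng.normal(loc=2.0, scale=0.5, size=n)

def meta_train(task_sampler, steps=2000, batch=8, seed=0):
    """Outer loop: average the meta-gradient over a batch of tasks."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        tasks = task_sampler(rng, batch)
        grad = np.mean([meta_gradient(theta, c) for c in tasks])
        theta -= BETA * grad
    return theta

theta_star = meta_train(sample_tasks)
new_task = 2.7                                # unseen task from the same family
adapted = adapt(theta_star, new_task)         # test-time adaptation: K steps
```

In this toy setting the learned initialization ends up near the center of the task distribution, and K gradient steps on a new task reduce its loss. In an RL setting the inner and outer gradients would be policy-gradient estimates from sampled trajectories rather than exact derivatives.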

Two exercises plus five flashcards.

  1. Multi-task or meta-RL (5 scenarios): warehouse robot 50 shapes, robotic arm customer demos, language model 200 categories, few-shot at inference, driving policy 10 cities. Classify each.
  2. Which meta-RL family (4 scenarios): robotic arm with gradient budget, real-time trading adaptation, medical diagnosis with uncertainty, grid-world navigation. Pick MAML / RL² / PEARL with justification.

Five flashcards: multi-task vs meta-RL distinction; positive vs negative transfer; MAML adaptation; RL² adaptation; foundation models as meta-learners at scale.
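
For the RL² adaptation flashcard, a rough sketch of the interface may help. It assumes NumPy, a plain tanh RNN with random (untrained) weights, and a hypothetical two-armed bandit task; the class name `RL2Policy` and all dimensions are illustrative. Meta-training the weights with an outer RL algorithm is omitted. What the sketch shows is that the previous action, previous reward, and an episode-boundary flag are fed as inputs, and that the hidden state is reset once per trial rather than once per episode.

```python
import numpy as np

class RL2Policy:
    """Untrained recurrent policy illustrating the RL2 interface only."""

    def __init__(self, obs_dim, n_actions, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Input: observation, previous action (one-hot), previous reward, done flag.
        in_dim = obs_dim + n_actions + 2
        self.n_actions = n_actions
        self.hidden_dim = hidden_dim
        self.W_in = rng.normal(0, 0.3, (hidden_dim, in_dim))
        self.W_h = rng.normal(0, 0.3, (hidden_dim, hidden_dim))
        self.W_out = rng.normal(0, 0.3, (n_actions, hidden_dim))

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, h, obs, prev_action, prev_reward, done):
        a = np.zeros(self.n_actions)
        a[prev_action] = 1.0
        x = np.concatenate([obs, a, [prev_reward, float(done)]])
        h_new = np.tanh(self.W_in @ x + self.W_h @ h)   # implicit adaptation
        logits = self.W_out @ h_new
        probs = np.exp(logits - logits.max())
        return h_new, probs / probs.sum()

policy = RL2Policy(obs_dim=1, n_actions=2)
rng = np.random.default_rng(1)
arm_means = [0.2, 0.8]              # hypothetical bandit task
h = policy.initial_state()          # reset once per trial (task), not per episode
prev_a, prev_r = 0, 0.0
for episode in range(2):
    for t in range(5):
        done = (t == 0 and episode > 0)   # marks an episode boundary
        h, probs = policy.step(h, np.zeros(1), prev_a, prev_r, done)
        prev_a = int(rng.choice(2, p=probs))
        prev_r = float(rng.random() < arm_means[prev_a])
```

No gradient step happens at test time here: everything the policy learns about the task is carried in `h`.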

Tables. Multi-task vs meta-RL. Three families side by side (algorithm, adaptation, best for). MAML training loop. RL² architecture. PEARL inference. Decision rubric. Foundation-model parallel. Multi-task practical concerns. Pitfalls.

CS285 primary. Multi-task: Caruana 1997, Chen 2018 GradNorm, Yu 2020 gradient surgery. MAML family: Finn et al. 2017 + Nichol Reptile. RL² family: Duan 2016 + Wang 2016. PEARL family: Rakelly 2019 + Zintgraf variBAD. Benchmarks: Yu 2019 Meta-World + Wang survey. Foundation-model connection: Brown 2020 GPT-3, Wei 2022 emergent abilities, Vinyals 2019 AlphaStar. Robotics: Kalashnikov QT-Opt.

  • Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0), Greek letters in prose spelled out (theta, eta as relevant; symbols only in fenced display blocks), placeholder comments present on brief.
  • §6 watch-zone: technical algorithm content; the foundation-model parallel is presented as a structural observation, not as endorsement or critique of specific vendor products. Brown et al. (GPT-3) cited for the in-context-learning result, not for OpenAI advocacy.
  • Vendor naming: OpenAI (GPT-3, ICM), DeepMind (AlphaStar), Google Robotics (QT-Opt) named only as paper-author affiliations; positive citations; A1 verbatim n/a.
  • Lesson 2095
  • Practice 1265
  • Summary 565
  • Cheatsheet 575
  • References 660
  • Brief 805

Total ≈ 5965 words across 6 artifacts.

  • Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285 “Multi-task and meta-RL”.
  • Practice uses real �J0� + �J1� component imports.
  • L7 DQN through L13 RLHF prereq paths: standard lessons/deep-reinforcement-learning/�J0� form. L16 exploration: lessons/deep-reinforcement-learning/exploration.
  • Lesson body uses fenced display blocks for the multi-task objective, the MAML pseudocode, the RL² recurrent architecture, the PEARL inference flow. Greek symbols in fenced blocks; prose spells theta / eta.
  • L17 is the second-to-last lesson of Phase 3 and Track 18. L18 closes both.
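
The PEARL inference flow listed among the fenced display blocks can be sketched as follows. This assumes NumPy and diagonal Gaussian factors, one per context transition, combined by a product of Gaussians. The factor means and variances are hypothetical stand-ins for encoder outputs, and treating a standard-normal prior as one extra factor is a simplifying assumption of this sketch.

```python
import numpy as np

def product_of_gaussians(mus, sigmas_sq):
    """Combine independent diagonal Gaussian factors into one Gaussian.

    mus, sigmas_sq: arrays of shape (n_factors, latent_dim).
    """
    precisions = 1.0 / sigmas_sq
    var = 1.0 / precisions.sum(axis=0)
    mu = var * (precisions * mus).sum(axis=0)
    return mu, var

def task_posterior(factor_mus, factor_vars, latent_dim):
    """Posterior over the task latent z given per-transition factors.

    A standard-normal prior is included as one extra factor (assumption).
    """
    prior_mu = np.zeros((1, latent_dim))
    prior_var = np.ones((1, latent_dim))
    mus = np.concatenate([prior_mu, factor_mus])
    variances = np.concatenate([prior_var, factor_vars])
    return product_of_gaussians(mus, variances)

def policy_input(obs, z):
    """The policy sees the observation concatenated with a sampled latent."""
    return np.concatenate([obs, z])

rng = np.random.default_rng(0)
latent_dim = 2
# Hypothetical encoder outputs for 1 and for 20 context transitions.
few_mu, few_var = task_posterior(
    np.full((1, latent_dim), 0.5), np.full((1, latent_dim), 0.4), latent_dim)
many_mu, many_var = task_posterior(
    np.full((20, latent_dim), 0.5), np.full((20, latent_dim), 0.4), latent_dim)

z = many_mu + np.sqrt(many_var) * rng.standard_normal(latent_dim)  # posterior sample
x = policy_input(np.zeros(3), z)
```

More context transitions shrink the posterior variance, so sampled latents become more consistent as the agent gathers experience on the new task.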