Practice: Multi-task RL and meta-RL

Exercise 1: Multi-task or meta-RL

For each scenario, decide whether the right framing is multi-task RL (one policy on a known set of tasks, test on the same tasks) or meta-RL (train so the agent rapidly adapts to NEW tasks at test time). Justify in one sentence.

A warehouse robot will handle a known catalog of 50 product shapes. The robot must pick up any of them at deployment.
A robotic arm is shipped to customers who will demonstrate new tasks the agent has never seen.
A language model is fine-tuned to answer customer-support questions for a known set of 200 product categories.
At inference time, a few-shot learning system must perform a task described by 3 examples in the prompt.
A driving policy is trained on 10 cities and deployed in those 10 cities.

Answers

Multi-task RL. Known training task set, test on the same set. The 50 shapes are the multi-task distribution.
Meta-RL. Customers will demonstrate NEW tasks at test time. The agent must adapt rapidly from the customer’s few demonstrations.
Multi-task RL. Known training task set, test on the same set. The 200 categories are the multi-task distribution; no adaptation to new categories at test time.
Meta-RL. The test task is new and described by a few examples; the trained system must adapt to it without further training.
Multi-task RL. Same 10 cities at training and test; the policy is shared across cities with city identity as input.
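
The phrase "city identity as input" in the last answer can be made concrete. The sketch below shows one common way to condition a shared policy on the task: append a one-hot task identifier to the observation. The function names, the observation values, and the choice of a one-hot encoding are illustrative assumptions, not something the exercise prescribes.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("task index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def policy_input(observation, task_id, num_tasks):
    """Concatenate the observation with a one-hot task identifier."""
    return list(observation) + one_hot(task_id, num_tasks)


# Hypothetical example: 10 cities, a 3-number observation, city index 2.
features = policy_input([0.5, -1.0, 2.0], task_id=2, num_tasks=10)
print(len(features))  # 3 observation entries + 10 task entries
```

A fixed one-hot encoding like this only works when the task set is closed, which is exactly the multi-task setting; it has no slot for a task outside the known set.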

Exercise 2: Which meta-RL family

For each scenario, pick MAML (gradient-based), RL² (recurrent), or PEARL (Bayesian) and justify.

A robotic-arm system must adapt to a new object in 10 trial trajectories with explicit gradient updates allowed at adaptation time.
A trading-strategy system must adapt to a new market regime in real time with no opportunity to retrain on the fly.
A medical-diagnosis system needs to maintain explicit uncertainty over which condition a patient might have, updating belief as new symptoms arrive.
A grid-world navigation agent needs to adapt to new maze layouts at test time with computation budget for a few gradient steps.

Answers

MAML. Explicit gradient adaptation in 10 trials matches MAML’s design (few-step gradient updates from the meta-trained initialization).
RL² (recurrent meta-RL). Real-time adaptation with no gradient updates available; the recurrent network’s hidden-state update is the adaptation mechanism, which is fast at test time.
PEARL (Bayesian meta-RL). Explicit uncertainty over the task latent variable matches the medical-diagnosis posterior-over-conditions structure.
MAML. Gradient steps available at test time; the maze structure varies across tasks but is consistent enough that few gradient steps adapt the navigation policy.
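
For the PEARL answer, it may help to see what "explicit uncertainty over the task latent variable" looks like numerically. The sketch below assumes a one-dimensional latent and combines independent Gaussian factors, one per observed transition, into a single Gaussian belief. In PEARL the factors would come from a learned encoder; here the means and variances are made-up numbers, and the function name is hypothetical.

```python
def product_of_gaussians(means, variances):
    """Combine independent 1-D Gaussian factors into one Gaussian.

    Precisions (1 / variance) add, and the combined mean is the
    precision-weighted average of the factor means.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need equally many means and variances, at least one")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision


# Made-up factors, one per observed transition (a learned encoder would
# produce these in practice; it is not modeled here).
factor_means = [0.8, 1.2, 1.0, 0.9]
factor_vars = [1.0, 1.0, 0.5, 0.5]

for n in range(1, len(factor_means) + 1):
    mean, var = product_of_gaussians(factor_means[:n], factor_vars[:n])
    print(f"after {n} transitions: mean={mean:.3f}, variance={var:.3f}")
```

The variance shrinks as more evidence arrives, which is the belief-updating behavior the medical-diagnosis scenario asks for.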

Flashcards

Q. What is the structural difference between multi-task RL and meta-RL?

Multi-task RL: train one policy on a known set of tasks, sharing parameters across tasks. The trained policy is tested on the same set of tasks; the goal is positive transfer, where training on one task helps performance on another. Meta-RL: train so the agent can rapidly adapt to NEW tasks at test time, where the new tasks come from the same distribution as the training tasks but are not the same tasks. The agent learns the adaptation process, not the tasks themselves.

Q. What is positive transfer and what is negative transfer in multi-task RL?

Positive transfer: training on task A improves performance on task B because the two tasks share structure that the shared parameters capture. This is the goal of multi-task training. Negative transfer: training on the wider task distribution makes the policy worse on an individual task than dedicated single-task training would. This happens when the tasks are too dissimilar and gradient updates from one task interfere with another. Common mitigations: per-task loss weighting, gradient surgery, and task-specific heads on a shared backbone.
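
As a rough illustration of gradient surgery, the sketch below follows the projection idea used by PCGrad: when two task gradients point in conflicting directions (negative dot product), each one has its component along the other removed before they are summed. The two-dimensional gradients are made-up values, and the helper names are hypothetical; a real implementation would operate on the flattened gradients of the shared parameters.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def project_conflict(g_i, g_j):
    """Remove from g_i its component along g_j, but only if they conflict."""
    overlap = dot(g_i, g_j)
    norm_sq = dot(g_j, g_j)
    if overlap >= 0.0 or norm_sq == 0.0:
        return list(g_i)  # no conflict: leave the gradient unchanged
    scale = overlap / norm_sq
    return [a - scale * b for a, b in zip(g_i, g_j)]


# Made-up gradients for two tasks on the same shared parameters.
grad_a = [1.0, 0.0]
grad_b = [-1.0, 1.0]  # dot(grad_a, grad_b) = -1, so the tasks conflict

proj_a = project_conflict(grad_a, grad_b)
proj_b = project_conflict(grad_b, grad_a)
combined = [a + b for a, b in zip(proj_a, proj_b)]
print(proj_a, proj_b, combined)
```

After projection, each task's gradient is orthogonal to the other task's original gradient, so the combined update no longer directly undoes either task's progress along that direction.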

Q. How does MAML adapt to a new task at test time?

MAML meta-trains an initial parameter setting theta that is a few gradient steps away from a good solution on any task drawn from the task distribution. The meta-training loop samples a task, performs K gradient updates (K is small) on a small sample of that task’s data to get task-adapted parameters, evaluates the adapted parameters on held-out data from the same task, and backpropagates through the inner gradient steps to update theta. At test time, given a new task with a few samples, the agent takes the same K gradient steps from theta, and the adapted policy should be competent. Adaptation consists of explicit gradient updates at test time, hence “gradient-based meta-RL.”
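
The inner/outer loop structure can be sketched on a deliberately tiny problem. The example below is NOT reinforcement learning: it assumes a supervised toy task family (regress y = slope * x with a single scalar weight, one slope per task) so that the derivative through the inner step can be written by hand with no autodiff library. The task family, step sizes, and K = 1 are all illustrative assumptions.

```python
import random


def mse(w, xs, ys):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return 2.0 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def sample_task(rng, n=10):
    """Hypothetical task: regress y = slope * x, with slope drawn per task."""
    slope = rng.uniform(1.0, 3.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        return xs, [slope * x for x in xs]

    return draw(), draw()  # (support set, held-out query set)


def inner_step(theta, support, alpha):
    """One gradient step on the task's support data (K = 1)."""
    xs, ys = support
    return theta - alpha * mse_grad(theta, xs, ys)


def meta_gradient(theta, support, query, alpha):
    """Differentiate the query loss of the adapted weight back to theta."""
    xs_s, _ = support
    adapted = inner_step(theta, support, alpha)
    # d(adapted)/d(theta) for this quadratic loss: 1 - alpha * 2 * mean(x^2).
    d_adapted = 1.0 - alpha * 2.0 * sum(x * x for x in xs_s) / len(xs_s)
    return mse_grad(adapted, *query) * d_adapted


rng = random.Random(0)
theta, alpha, beta = 0.0, 0.5, 0.05  # init, inner step size, outer step size
for _ in range(2000):
    support, query = sample_task(rng)
    theta -= beta * meta_gradient(theta, support, query, alpha)

# Test time: a new task, one gradient step from the meta-trained theta.
support, query = sample_task(rng)
adapted = inner_step(theta, support, alpha)
print("query loss before adaptation:", mse(theta, *query))
print("query loss after adaptation: ", mse(adapted, *query))
```

The `d_adapted` factor is the "backpropagate through the inner gradient steps" part. In an actual meta-RL setup the losses would be policy-gradient objectives estimated from trajectories, and the derivative would normally come from an autodiff framework.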

Q. How does RL² adapt to a new task at test time?

RL² treats meta-RL as a partially observed meta-MDP in which the task identity is the hidden part of the state. The meta-policy is a recurrent neural network whose hidden state encodes “what task am I currently on.” At each meta-step the network sees the current state, the previous action, and the previous reward, and updates its hidden state; the action is therefore conditioned on the entire history. There is no explicit gradient step at test time; adaptation is implicit in the recurrent network’s hidden-state update. Given a few transitions from a new task, the RNN’s hidden state updates to encode the task, and the policy adapts via the hidden state.
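
The interface described above can be sketched as follows. This is an untrained, Elman-style recurrent policy with random placeholder weights, written only to show what goes into the network at each step and where the "adaptation" lives; the class name, layer sizes, and the two-armed bandit task are all hypothetical, and no meta-training is performed.

```python
import math
import random


def make_weights(rng, rows, cols):
    """Random placeholder weight matrix (a trained model would learn these)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class RecurrentPolicy:
    """Untrained RNN policy fed (observation, previous action, previous reward)."""

    def __init__(self, obs_dim, num_actions, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.num_actions = num_actions
        input_dim = obs_dim + num_actions + 1  # obs + one-hot action + reward
        self.w_in = make_weights(rng, hidden_dim, input_dim)
        self.w_rec = make_weights(rng, hidden_dim, hidden_dim)
        self.w_out = make_weights(rng, num_actions, hidden_dim)
        self.hidden = [0.0] * hidden_dim

    def reset(self):
        """Clear the task belief; call when a NEW task starts."""
        self.hidden = [0.0] * len(self.hidden)

    def act(self, obs, prev_action, prev_reward):
        # prev_action=None (first step) yields an all-zero one-hot vector.
        one_hot = [1.0 if a == prev_action else 0.0
                   for a in range(self.num_actions)]
        inputs = list(obs) + one_hot + [prev_reward]
        pre = [a + b for a, b in zip(matvec(self.w_in, inputs),
                                     matvec(self.w_rec, self.hidden))]
        self.hidden = [math.tanh(p) for p in pre]  # the only "adaptation"
        logits = matvec(self.w_out, self.hidden)
        return max(range(self.num_actions), key=lambda a: logits[a])


policy = RecurrentPolicy(obs_dim=1, num_actions=2, hidden_dim=4)
rng = random.Random(1)
arm_probs = [0.2, 0.8]  # hypothetical hidden task: reward probability per arm
action, reward = None, 0.0
for step in range(5):
    action = policy.act([0.0], action, reward)
    reward = 1.0 if rng.random() < arm_probs[action] else 0.0
    print(step, action, reward)
```

No parameters change inside the loop; only `self.hidden` does. With meta-trained weights, that hidden-state trajectory is what would carry the information about which arm pays off.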

Q. Why are foundation models often called meta-learners at scale?

Large language models trained on a massive distribution of internet text exhibit in-context learning: given a few examples in the prompt, the model adapts its outputs to the task structure without any gradient update. This is implicit meta-learning at scale. The “meta-training” is the pretraining itself, on a distribution of text containing many task-like patterns. The adaptation is implicit in the forward pass: the transformer’s attention, conditioned on the prompt, computes a response that reflects the examples. The structure isolated in academic meta-RL (train so the agent adapts to new tasks) appears at production scale in foundation models, with two differences: the task distribution is huge, and the model’s parameter count gives it the capacity to encode many task-conditional behaviors implicitly.