Summary: Challenges and open problems (closes Track 18)

Track 18 closes with the field’s four open frontiers:

  • Sample efficiency: deep RL needs orders of magnitude more interactions than biological learners. Model-based RL (L9-L10), demonstrations (L2), meta-RL (L17), exploration (L16), and offline RL (L14-L15) all attack the gap, but none has closed it.
  • Safety and alignment: training a policy whose behavior robustly matches what humans intended, under distribution shift and in high-stakes deployment. Addressed by reward modeling (L13 RLHF), KL regularization, conservative training, scalable oversight, and mechanistic interpretability.
  • Generalization: trained policies fail on small variations of the training setting. Addressed by domain randomization, self-supervised pretraining, causal representations, and test-time adaptation (which connects to L17).
  • Real-world deployment: the gap between simulator and real world. Addressed by sim-to-real transfer, demonstration-bootstrapped pipelines, staged deployment, and online learning under shift.

The four frontiers are partly in tension with each other (sample efficiency vs safety; generalization vs specialization). As of 2026, the state of the art offers engineering practice for each but a complete solution to none. The T18 algorithms are the working vocabulary for reading the frontier literature; the open problems are where the field is moving.

  1. Sample efficiency, safety, generalization, and real-world deployment are deep RL’s open frontiers. None is solved.
  2. The T18 algorithms are the vocabulary for reading the frontier literature. Most papers at NeurIPS, ICML, ICLR, and CoRL build on the methods covered.
  3. The frontiers are in tension. Greater sample efficiency relies on stronger priors; more general policies are harder to verify; more capable systems have higher-stakes failure modes; more realistic deployment requires more bridging engineering.
  4. Modern foundation models and agentic AI systems are the T18 algorithms at scale. RLHF, model-based world models, exploration, and multi-task pretraining all appear in production AI; the safety stack on top is the active engineering frontier.
  5. RLHF did not solve alignment. It is one engineering practice that aligns LLM outputs to a preference distribution within the training distribution. The harder problems (OOD behavior, deceptive alignment, reward hacking at scale) remain open.
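
The KL regularization mentioned under safety and alignment can be made concrete with a small sketch. It assumes the commonly used form in which the reward model's score is reduced by a penalty proportional to how far the trained policy's log-probabilities drift from a frozen reference policy. The function name, the `beta` value, and the toy numbers are illustrative assumptions, not part of the track's material.

```python
import math

def kl_penalized_reward(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    logp_policy / logp_reference hold per-token log-probabilities of the
    same sampled response under the trained policy and the frozen reference.
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) on this response:
    # the summed log-ratio over its tokens.
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_reference))
    # Drifting away from the reference costs reward, scaled by beta.
    return reward_model_score - beta * kl_estimate, kl_estimate

# Toy example: the policy doubled the probability of the first token.
logp_policy = [math.log(0.5), math.log(0.4)]
logp_reference = [math.log(0.25), math.log(0.4)]
shaped, kl = kl_penalized_reward(1.0, logp_policy, logp_reference, beta=0.1)
print(shaped, kl)
```

The penalty keeps the policy near the reference on the prompts it was trained on. It gives no guarantee about behavior outside that distribution, which is consistent with takeaway 5.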

The open problems explain why production AI systems sometimes fail unexpectedly, why some research programs invest heavily in particular sub-areas, and why the trade-offs across frontiers are real and unavoidable. When you read claims about AI system capabilities, the T18 framings let you ask the right questions: what the training signal was, what the data distribution was, how exploration was handled, whether the test setting is close to training, and what the deployment architecture is. Each maps onto an algorithm or failure mode covered in the track. The structural literacy this track gives you is what lets you read through the marketing layer.

| Open problem | T18 connections |
| --- | --- |
| Sample efficiency | Model-based RL (L9, L10); demonstrations (L2); meta-RL (L17); exploration (L16); offline RL (L14, L15) |
| Safety and alignment | RLHF (L13); reward hacking framings; KL regularization parallel to BCQ (L15); scalable oversight |
| Generalization | Multi-task (L17); domain randomization; causal representations; test-time adaptation |
| Real-world deployment | Sim-to-real; demonstration-bootstrap (L2 connection); offline-then-online pipelines (L14, L15); staged deployment |
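
Domain randomization, listed above under generalization and closely tied to sim-to-real transfer, is simple to sketch: resample the simulator's physical parameters at the start of every episode so the policy cannot overfit to one configuration. The parameter names and ranges below are hypothetical, and the simulator itself is left out; this is a minimal illustration rather than a recommended setup.

```python
import random

# Hypothetical simulator parameters and ranges, chosen only for illustration.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_kg": (0.8, 1.2),
    "sensor_latency_s": (0.0, 0.05),
}

def sample_sim_params(rng, ranges=PARAM_RANGES):
    """Draw one simulator configuration uniformly from each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

rng = random.Random(0)  # seeded so runs are repeatable
for episode in range(3):
    params = sample_sim_params(rng)
    # A real pipeline would reset the simulator with `params` here
    # and then collect one training episode under those dynamics.
    print(episode, params)
```

How wide to make the ranges is a design choice: wider ranges cover more of the real-world variation but make the training problem harder, which is one instance of the tension between generalization and specialization.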

The deep-RL track ends here. How the track connects to the rest of the curriculum:

  • T11, T12, T13 built neural networks; T4, T8 built the math; T17 built classical RL. T18 stands on all of them.
  • T20 AI Agents and Tool Use picks up the production-LLM-as-agent thread from L13 RLHF and L17 multi-task. The agentic systems literature is RL by another name in many cases.
  • T19 Diffusion Models is a parallel track; the cross-track coherence at the META pattern level (training objective determines what the model learns) is the most portable T18 contribution.
  • T23 AI Safety picks up the safety-frontier framings introduced here and develops them across mechanistic interpretability, scalable oversight, and adversarial robustness.

You have the working vocabulary of deep RL, the algorithmic decision rubric for production deployment, and the framings to read the frontier literature with calibrated skepticism. The curriculum continues across other tracks; the algorithms here are the foundation. Track 18 closes.