Cheatsheet: Challenges and open problems (closes Track 18)

| Frontier | What it addresses | T18 connections |
| --- | --- | --- |
| Sample efficiency | Deep RL needs 10⁴-10⁸× more samples than biological learners | Model-based RL (L9-L10); demonstrations (L2); meta-RL (L17); exploration (L16); offline RL (L14-L15) |
| Safety and alignment | Train a policy whose behavior aligns with intent, robustly under distribution shift | RLHF (L13); reward modeling brittleness; KL regularization; scalable oversight |
| Generalization | Trained policies fail on small variations of the training environment | Multi-task (L17); domain randomization; causal representations; test-time adaptation |
| Real-world deployment | Simulator-to-real-world gap | Sim-to-real transfer; demonstration-bootstrap; staged deployment; online learning under shift |
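
The safety row lists KL regularization without showing what it does. A minimal sketch, assuming a single-step problem with four discrete actions: the policy is scored by expected reward minus `beta` times its KL divergence from a reference policy. The reward values, `beta`, and all function names are illustrative assumptions, not taken from any lesson.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_regularized_objective(policy, reference, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

def optimal_policy(reference, rewards, beta):
    """Closed-form maximizer of the objective: pi(a) proportional to ref(a) * exp(r(a) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]

reference = [0.25, 0.25, 0.25, 0.25]   # hypothetical pretrained policy
rewards = [0.0, 0.0, 0.0, 1.0]         # hypothetical reward-model scores
beta = 1.0                             # assumed KL penalty strength

greedy = [0.0, 0.0, 0.0, 1.0]          # puts all mass on the top-scoring action
optimal = optimal_policy(reference, rewards, beta)

for name, policy in [("reference", reference), ("greedy", greedy), ("optimal", optimal)]:
    print(name, round(kl_regularized_objective(policy, reference, rewards, beta), 4))
```

In this toy setting the greedy policy scores worse than the reference once the penalty is applied, which is the intended effect: the policy is discouraged from drifting far from the reference just to chase reward-model scores.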

| Method | Mechanism | T18 lesson |
| --- | --- | --- |
| Model-based RL | Learn transition model; plan against it | L9, L10 |
| Demonstrations | Initialize policy near competent solution via imitation | L2 |
| Multi-task and meta-RL | Share structure across tasks | L17 |
| Exploration improvements | Make existing samples more informative | L16 |
| Offline RL | Extract value from logged data | L14, L15 |
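
The model-based row ("learn transition model; plan against it") can be illustrated with a toy sketch. It assumes a deterministic 1-D chain environment, a tabular model fit from random transitions, and breadth-first search as the planner. The environment, names, and sample count are all hypothetical; the methods covered in L9 and L10 may differ.

```python
import random
from collections import deque

N_STATES = 6
ACTIONS = (-1, +1)
GOAL = N_STATES - 1

def true_step(state, action):
    """Hypothetical environment: a 1-D chain with walls at both ends."""
    return min(max(state + action, 0), N_STATES - 1)

def learn_model(n_samples, rng):
    """Fit a tabular transition model from randomly sampled transitions."""
    model = {}
    for _ in range(n_samples):
        s = rng.randrange(N_STATES)
        a = rng.choice(ACTIONS)
        model[(s, a)] = true_step(s, a)  # deterministic, so one sample per pair suffices
    return model

def plan(model, start, goal):
    """Breadth-first search inside the learned model; returns an action list or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for a in ACTIONS:
            nxt = model.get((state, a))  # unseen (state, action) pairs are not expanded
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, actions + [a]))
    return None

rng = random.Random(0)
model = learn_model(200, rng)
plan_actions = plan(model, start=0, goal=GOAL)
print(plan_actions)
```

Planning queries only the learned table, never `true_step`, so no further environment samples are spent once the model is fit. That is the sample-efficiency argument in miniature.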

| Sub-problem | Description | T18 connection |
| --- | --- | --- |
| Reward hacking | Agent exploits the proxy reward in ways that were not intended | L13 reward modeling brittleness |
| Distributional shift | Trained on one distribution, deployed on a different one | L2 imitation distribution shift; L14 OOD-action |
| Sequence-level safety | Agentic systems in which real-world actions compound | L13 + future agentic systems |
| Deceptive alignment | Model behaves well in training, differently in deployment | Open; addressed by interpretability research |
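
Reward hacking is easiest to see in a small numeric example. The sketch below assumes a one-dimensional action `x`, a true reward that peaks at `x = 1`, and a proxy that agrees with it only for `x <= 2` and extrapolates upward outside that range. Both reward functions are invented for illustration.

```python
def true_reward(x):
    """Hypothetical intended objective: best at x = 1."""
    return -(x - 1.0) ** 2

def proxy_reward(x):
    """Hypothetical learned proxy: exact for x <= 2, wrongly increasing beyond it."""
    if x <= 2.0:
        return true_reward(x)
    return x - 3.0  # continuous with true_reward at x = 2, but keeps rising

candidates = [i * 0.5 for i in range(11)]      # actions 0.0, 0.5, ..., 5.0

intended = max(candidates, key=true_reward)    # what the designer wanted
hacked = max(candidates, key=proxy_reward)     # what proxy optimization finds

print("intended action:", intended, "true reward:", true_reward(intended))
print("hacked action:  ", hacked, "true reward:", true_reward(hacked))
```

The optimizer lands in the region where the proxy is wrong. The proxy score goes up while the true reward collapses, which is the brittleness the reward hacking row points to.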

| Trade-off | What it costs |
| --- | --- |
| Sample efficiency vs safety | Sample-efficient methods rely on more priors; safer methods need broader exploration |
| Generalization vs verification | More general policies are harder to verify on specific behaviors |
| Capability vs failure-stakes | More capable systems have higher-stakes failure modes |
| Deployment realism vs simulator-training | Bridging requires expensive engineering or extensive randomization |
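
The "extensive randomization" in the last row refers to domain randomization. A toy sketch, assuming a scalar system `x <- x - (gain / mass) * x` whose mass is uncertain: one gain is tuned on a single nominal simulator, another on 1000 randomly sampled masses, and both are then run on a lighter "real" system. The dynamics, mass range, and gain grid are all hypothetical.

```python
import random

def rollout_cost(gain, mass, steps=20):
    """Sum of squared states for x <- x - (gain / mass) * x, starting from x = 1."""
    x, cost = 1.0, 0.0
    for _ in range(steps):
        cost += x * x
        x -= (gain / mass) * x
    return cost

def best_gain(gains, masses):
    """Pick the gain with the lowest average cost over the given simulator masses."""
    return min(gains, key=lambda g: sum(rollout_cost(g, m) for m in masses) / len(masses))

gains = [i * 0.1 for i in range(1, 21)]             # candidate gains 0.1 ... 2.0
rng = random.Random(0)
randomized_masses = [rng.uniform(0.4, 1.6) for _ in range(1000)]

nominal_gain = best_gain(gains, [1.0])              # tuned on one nominal simulator
robust_gain = best_gain(gains, randomized_masses)   # tuned across randomized simulators

real_mass = 0.4                                     # assumed "real" system, lighter than nominal
nominal_real_cost = rollout_cost(nominal_gain, real_mass)
robust_real_cost = rollout_cost(robust_gain, real_mass)

print("nominal gain:", nominal_gain, "cost on real system:", nominal_real_cost)
print("robust gain: ", robust_gain, "cost on real system:", robust_real_cost)
```

The nominally tuned gain is optimal in its own simulator but unstable on the lighter system. The randomized gain gives up some nominal performance for stability across the sampled range, which is the cost the trade-off row describes.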

| Phase | Lessons | Coverage |
| --- | --- | --- |
| Phase 1: RL foundations | L1-L5 | Intro, MDP formalism, REINFORCE, actor-critic |
| Phase 2: core deep-RL algorithms | L6-L12 | DQN, TRPO/PPO, model-based RL, variational inference, control-as-inference |
| Phase 3: frontiers | L13-L18 | RLHF, offline RL (problem + algorithms), exploration, multi-task/meta-RL, open problems |

| Track | Connection |
| --- | --- |
| T11/T12/T13 | Neural-network prerequisites |
| T4/T8 | Math prerequisites (linear algebra, calculus) |
| T17 | Classical RL; direct prerequisite |
| T20 | Builds on L13 RLHF and L17 multi-task for production agentic systems |
| T19 | Parallel track; cross-track coherence at the "training objective determines learning" level |
| T23 | Builds on safety frontier framings for dedicated safety treatment |

Common pitfalls
  • Treating any single algorithm as the answer (frontiers are largely independent)
  • Underestimating sim-to-real gap (where many real failures originate)
  • Conflating engineering progress (higher benchmark scores) with structural progress (closing an open problem)
  • Treating safety as orthogonal to capability (they are coupled)
  • Believing “RLHF solved alignment” (it did not; it is one piece of the alignment stack)

Key takeaways
  • Four open frontiers: sample efficiency, safety, generalization, real-world deployment.
  • T18 algorithms are the vocabulary; open problems are where the field is moving.
  • Frontiers are in tension; trade-offs are unavoidable.
  • Modern foundation models are T18 algorithms at scale; the safety stack on top is active engineering.
  • Track 18 closes; the curriculum continues across other tracks.