Cheatsheet: Challenges and open problems (closes Track 18)
Four open frontiers
| Frontier | What it addresses | T18 connections |
|---|---|---|
| Sample efficiency | Deep RL needs roughly 10⁴ to 10⁸ times more samples than biological learners | Model-based RL (L9-L10); demonstrations (L2); meta-RL (L17); exploration (L16); offline RL (L14-L15) |
| Safety and alignment | Train a policy whose behavior matches intent and stays robust under distribution shift | RLHF (L13); reward modeling brittleness; KL regularization; scalable oversight |
| Generalization | Trained policies fail on small variations of the training environment | Multi-task (L17); domain randomization; causal representations; test-time adaptation |
| Real-world deployment | Simulator-to-real-world gap | Sim-to-real transfer; demonstration-bootstrap; staged deployment; online learning under shift |
Sample efficiency methods
| Method | Mechanism | T18 lesson |
|---|---|---|
| Model-based RL | Learn transition model; plan against it | L9, L10 |
| Demonstrations | Initialize policy near competent solution via imitation | L2 |
| Multi-task and meta-RL | Share structure across tasks | L17 |
| Exploration improvements | Make existing samples more informative | L16 |
| Offline RL | Extract value from logged data | L14, L15 |
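
The model-based row ("learn transition model; plan against it") can be sketched in a few lines. This is a minimal tabular illustration, not the method from L9-L10: the three-state chain, the function names `fit_model` and `plan`, and the use of value iteration as the planner are all assumptions made for the example.

```python
from collections import defaultdict

def fit_model(transitions):
    """Estimate P(s'|s,a) and mean reward R(s,a) from logged (s, a, r, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    P, R = {}, {}
    for sa, nexts in counts.items():
        total = sum(nexts.values())
        P[sa] = {s2: c / total for s2, c in nexts.items()}
        R[sa] = reward_sum[sa] / total
    return P, R

def plan(P, R, n_states, n_actions, gamma=0.9, sweeps=200):
    """Value iteration against the learned model; unseen (s, a) pairs are ignored."""
    V = [0.0] * n_states

    def q(s, a):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    def seen_actions(s):
        return [a for a in range(n_actions) if (s, a) in P]

    for _ in range(sweeps):
        for s in range(n_states):
            actions = seen_actions(s)
            if actions:
                V[s] = max(q(s, a) for a in actions)
    policy = {s: max(seen_actions(s), key=lambda a: q(s, a))
              for s in range(n_states) if seen_actions(s)}
    return V, policy

# Hypothetical 3-state chain: action 1 moves right, action 0 moves left.
# Entering state 2 pays reward 1; state 2 is absorbing.
transitions = [
    (0, 0, 0.0, 0), (0, 1, 0.0, 1),
    (1, 0, 0.0, 0), (1, 1, 1.0, 2),
    (2, 0, 0.0, 2), (2, 1, 0.0, 2),
]
P, R = fit_model(transitions)
V, policy = plan(P, R, n_states=3, n_actions=2)
```

All planning happens against the fitted `P` and `R`, so no further environment samples are consumed once the model is learned. That is the source of the sample-efficiency gain, and also of the risk: errors in the model flow straight into the plan.
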
Safety sub-problems
| Sub-problem | Description | T18 connection |
|---|---|---|
| Reward hacking | Agent exploits the proxy reward in ways the designer did not intend | L13 reward modeling brittleness |
| Distributional shift | Trained on one distribution, deployed on a different one | L2 imitation distribution shift; L14 OOD-action |
| Sequence-level safety | Agentic systems whose real-world actions compound over a sequence | L13 + future agentic systems |
| Deceptive alignment | Model behaves well during training, differently in deployment | Open; addressed by interpretability research |
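
The reward-hacking row ties back to the KL regularization listed under the safety frontier. One common form subtracts a penalty proportional to the log-ratio between the policy and a frozen reference model from the proxy reward. The sketch below assumes that form; the function name `kl_shaped_reward`, the value `beta=0.1`, and all numbers are illustrative, not taken from L13.

```python
def kl_shaped_reward(proxy_reward, policy_logps, ref_logps, beta=0.1):
    """Proxy reward minus beta times the summed per-token log-ratio.

    The summed log-ratio is a single-sample estimate of KL(policy || reference)
    for one sampled response.
    """
    log_ratio = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return proxy_reward - beta * log_ratio

# Illustrative numbers only: per-token log-probabilities for a two-token response.
# The first response stays on the reference distribution; the second scores higher
# on the proxy reward but is very unlikely under the reference model.
on_dist = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_shaped_reward(1.5, [-0.5, -0.5], [-5.0, -6.0])
```

With these numbers the drifted response ends up with the lower shaped reward, despite its higher proxy score. The penalty limits how far the policy can move toward regions where the reward model is unreliable. It does not remove reward hacking, since exploits close to the reference distribution go unpenalized.
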
Tensions across frontiers
| Trade-off | What it costs |
|---|---|
| Sample efficiency vs safety | Sample-efficient methods rely on more priors; safer methods need broader exploration |
| Generalization vs verification | More general policies are harder to verify on specific behaviors |
| Capability vs failure-stakes | More capable systems have higher-stakes failure modes |
| Deployment realism vs simulator-training | Bridging requires expensive engineering or extensive randomization |
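
The "extensive randomization" in the last row usually means domain randomization: resampling simulator parameters every episode so the policy cannot overfit to one configuration. A minimal sketch follows. The parameter names and ranges in `PARAM_RANGES` are hypothetical, and the simulator itself is left out.

```python
import random

# Hypothetical parameter ranges; in practice they are chosen to bracket
# the measured properties of the real system.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_kg": (0.8, 1.2),
    "sensor_noise_std": (0.0, 0.05),
}

def sample_sim_params(rng, ranges=PARAM_RANGES):
    """Draw one simulator configuration uniformly from each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def randomized_episodes(n_episodes, seed=0):
    """Yield a fresh configuration per training episode (seeded for reproducibility)."""
    rng = random.Random(seed)
    for _ in range(n_episodes):
        yield sample_sim_params(rng)

configs = list(randomized_episodes(5))
```

Each configuration would be passed to the simulator's reset before the episode starts. Wider ranges cover more of the real world but make the training problem harder, which is the cost this row describes.
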
T18 syllabus recap
| Phase | Lessons | Coverage |
|---|---|---|
| Phase 1: RL foundations | L1-L5 | Intro, MDP formalism, REINFORCE, actor-critic |
| Phase 2: core deep-RL algorithms | L6-L12 | DQN, TRPO/PPO, model-based RL, variational inference, control-as-inference |
| Phase 3: frontiers | L13-L18 | RLHF, offline RL (problem + algorithms), exploration, multi-task/meta-RL, open problems |
Where T18 fits in the curriculum
| Track | Connection |
|---|---|
| T11/T12/T13 | Neural-network prerequisites |
| T4/T8 | Math prerequisites (linear algebra, calculus) |
| T17 | Classical-RL direct prerequisite |
| T20 | Builds on L13 RLHF and L17 multi-task for production agentic systems |
| T19 | Parallel track; the two tracks cohere on the principle that the training objective determines what is learned |
| T23 | Builds on safety frontier framings for dedicated safety treatment |
Common pitfalls
- Treating any single algorithm as the answer (the frontiers are largely independent, so no single method closes all of them)
- Underestimating the sim-to-real gap (where many real failures originate)
- Conflating engineering progress (higher benchmark scores) with structural progress (closing an open problem)
- Treating safety as orthogonal to capability (they are coupled)
- Believing “RLHF solved alignment” (it did not; it is one piece of the alignment stack)
What you should remember
- Four open frontiers: sample efficiency, safety, generalization, real-world deployment.
- T18 algorithms are the vocabulary; open problems are where the field is moving.
- Frontiers are in tension; trade-offs are unavoidable.
- Modern foundation models are fine-tuned with T18 algorithms such as RLHF (L13) at scale; the safety stack on top is still active engineering.
- Track 18 closes; the curriculum continues across other tracks.